
Dynamic Programming


Introduction
The study of dynamic programming, as we know it today, started around 1940, and much of the initial work was done by Bellman and Wald. Dynamic programming (DP) can be viewed as recursive programming, since common to all dynamic programming optimization problems is a functional optimality equation which leads to a recursive solution method. In numerous papers Bellman identified optimality conditions of optimization problems, and through his work on functional equations, dynamic programming and the principle of optimality became well known. Stochastic sequential decision problems are closely linked to stochastic DP; their modern study began with Wald's work on sequential statistical problems during the Second World War. Today the theory of DP is vast, especially for stochastic DP, which is closely related to Markov decision processes.

Many problems in management can be formulated as DP problems, in particular problems involving uncertainty and allocation problems. Applications of stochastic dynamic programming arise in many areas, such as aerospace dynamics, financial economics, resource management, inventory management, robotics and power generation. Few people, it seems, actually know and understand DP, which explains why DP is not used in industry and commerce for many suitable problems. DP is not a black-box method: it usually requires a great deal of mathematical skill to formulate a problem in DP form, even though DP does not require much mathematical knowledge. A mathematically gifted high school student would be able to understand the DP principles and models.

In what follows a mathematical introduction to the usual DP models is given. If the general description does not make sense on a first reading, it is probably a good idea to first read through the two examples given later, which illustrate the use of DP; the first example in particular is straightforward.

A general mathematical description of DP
Common to all DP formulations are states and actions. Let $ S$ be the set of states that a system can occupy and let $ A$ be the set of actions that can be taken. When a system is in state $ s\in S$ the set of actions that can be taken is typically $ A_s\subset A$. Taking action $ a_s \in A_s$ from state $ s$ leads to a transition of the system to a new state $ \hat{s} \in S$ and incurs a reward (or cost) $ r(s,a_s)$. These are the common ingredients of DP problems. However, they are not enough information to define an optimization problem, and different classes of DP problems extend this framework in different ways. For example, in deterministic DP problems $ \hat{s}$ is defined by $ s$ and $ a_s$, i.e. the new state of the system is determined by the old state and by the action that has been taken. In stochastic DP problems $ \hat{s}$ is stochastic, with a probability distribution depending on $ s$ and $ a_s$. For some problems the number of state transitions is finite, for others it is infinite, and for yet others it is finite with probability one. The objective function also varies from problem to problem.
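To make these ingredients concrete, here is a minimal sketch in Python of how a deterministic problem specification might look; all names are illustrative only (they are not standard) and are reused in the sketches further below.

from dataclasses import dataclass
from typing import Callable, Hashable, Iterable

State = Hashable   # an element of S
Action = Hashable  # an element of A

@dataclass
class DPProblem:
    """The common ingredients of a deterministic DP problem.

    actions(s) returns the set A_s of actions available in state s,
    reward(s, a) is r(s, a), and transition(s, a) is the successor
    state s_hat determined by s and a.  (All names are hypothetical.)
    """
    actions: Callable[[State], Iterable[Action]]
    reward: Callable[[State, Action], float]
    transition: Callable[[State, Action], State]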

For many deterministic, finite horizon (i.e. a finite number of state transitions) DP problems the objective can be written as

maximize $\displaystyle \sum_{i=0}^{n-1} r(s_i,a_i) + R(s_n)$     (1)

where $ s_0$ is a given initial state, $ a_i \in A_{s_i}$, $ n$ is the number of transitions, $ s_{i+1}=\hat{s}(s_i,a_i)$ and $ R(s_n)$ is a terminal reward. The maximization in (1) takes place over $ \{a_i\}_{i=0}^{n-1}$. Let's assume that the objects $ S$, $ A_s$, $ r$ and $ R$ are such that the maximum in (1) exists.

For many stochastic, finite horizon DP problems the objective can be formulated as

maximize $\displaystyle \mathbb{E} \left( \sum_{i=0}^{n-1} r(s_i,a_i) + R(s_n) \right)$     (2)

where now $ s_{i+1}$ is a random variable whose distribution depends on $ s_i$ and on $ a_i \in A_{s_i}$, and $ \mathbb{E}$ is the expectation operator. The maximization in (2) takes place over all policies for choosing actions, i.e. over all functions $ \pi$ such that $ a_i=\pi(i,s_i)$; $ \pi$ determines which action is taken when in the $ i$-th decision period the system occupies state $ s_i$.

For many stochastic, infinite horizon DP problems the objective can be written as

maximize $\displaystyle \mathbb{E} \left( \sum_{i=0}^{\infty} \alpha^i r(s_i,a_i) \right)$     (3)

where $ 0<\alpha<1$ is a discount factor. To ensure convergence, assume that the function $ r$ is bounded.
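Indeed, if $ \vert r\vert \le M$ for some constant $ M$, then for any policy

$\displaystyle \left\vert \sum_{i=0}^{\infty} \alpha^i r(s_i,a_i) \right\vert \le \sum_{i=0}^{\infty} \alpha^i M = \frac{M}{1-\alpha} < \infty,$

so the sum in (3) converges absolutely.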

Many problems can be written in one of the forms (1), (2), (3), or in a slight variation of one of these forms. The DP treatment of (1) consists of defining a value function

$\displaystyle F_0(s) := R(s),$

$\displaystyle F_j(s) := \max_{\{a_i\}_{i=n-j}^{n-1}} \left\{ \sum_{i=n-j}^{n-1} r(s_i,a_i) + R(s_n) \;:\; s_{n-j}=s \right\} \quad \mbox{for } j\ge 1$

and then establishing the functional equation
$\displaystyle F_j(s) = \max_{a\in A_s} \left\{ r(s,a) + F_{j-1}(\hat{s}(s,a)) \right\}.$     (4)

Then the optimal objective value of (1) is $ F_n(s_0)$ and can be calculated recursively using (4). Having found $ F_n(s_0)$, the optimal path of states and actions of problem (1) is found as follows: $ s_1$ and $ a_0$ are the state and action for which
$\displaystyle F_n(s_0)= r(s_0,a_0) + F_{n-1}(s_1),$      

where $ a_0 \in A_{s_0}$ and $ s_1=\hat{s}(s_0,a_0)$. Similarly $ s_2$ and $ a_1$ are the state and action for which
$\displaystyle F_{n-1}(s_1)= r(s_1,a_1) + F_{n-2}(s_2),$      

where $ a_1 \in A_{s_1}$ and $ s_2=\hat{s}(s_1,a_1)$. And so on.
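As an illustration of (4) and of the forward recovery of an optimal path, here is a minimal backward-induction sketch in Python. It assumes a finite state set, the hypothetical interface of the DPProblem sketch above, and at least one available action in every state; it is a sketch, not a definitive implementation.

def solve_deterministic_dp(states, actions, reward, transition, terminal, n, s0):
    # F[j][s] holds F_j(s); F_0 = R is the terminal reward.
    F = [{s: terminal(s) for s in states}]
    for j in range(1, n + 1):
        # Functional equation (4): F_j(s) = max_a { r(s,a) + F_{j-1}(s_hat(s,a)) }.
        F.append({s: max(reward(s, a) + F[j - 1][transition(s, a)]
                         for a in actions(s))
                  for s in states})
    # Forward recovery: a_i achieves F_{n-i}(s_i) = r(s_i,a_i) + F_{n-i-1}(s_{i+1}).
    path, acts, s = [s0], [], s0
    for j in range(n, 0, -1):
        a = max(actions(s),
                key=lambda a: reward(s, a) + F[j - 1][transition(s, a)])
        acts.append(a)
        s = transition(s, a)
        path.append(s)
    return F[n][s0], path, acts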

For the objective (2) of a stochastic, finite horizon problem the value function is

$\displaystyle F_0(s) := R(s),$

$\displaystyle F_j(s) := \max_{\pi} \left\{ \mathbb{E}\left( \sum_{i=n-j}^{n-1} r(s_i,a_i) + R(s_n) \;\Biggr\vert\; s_{n-j}=s \right) \right\} \quad \mbox{for } j\ge 1$

where $ \pi$ is a policy for choosing actions, which means that $ \pi$ is a function and $ a_i=\pi(i,s_i)$. The maximization in the above definition takes place over the set of policies. The functional equation is
$\displaystyle F_j(s) = \max_{a\in A_s} \left\{ r(s,a) + \mathbb{E}\, F_{j-1}(\hat{s}(s,a)) \right\}$

where $ \hat{s}(s,a)$ is now a random variable. The argument $ a$ which achieves the maximum on the right-hand side defines $ \tilde{\pi}(j,s)$, where $ \tilde{\pi}$ is an optimal policy.
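A corresponding sketch for this stochastic case, assuming (hypothetically) that the transition law is available as a function trans_dist(s, a) returning the (probability, successor state) pairs of the distribution of $ \hat{s}(s,a)$:

def solve_stochastic_dp(states, actions, reward, trans_dist, terminal, n):
    # F[j][s] holds F_j(s); policy[j - 1][s] holds an optimal action pi~(j, s).
    F = [{s: terminal(s) for s in states}]
    policy = []
    for j in range(1, n + 1):
        Fj, pj = {}, {}
        for s in states:
            # F_j(s) = max_a { r(s,a) + E F_{j-1}(s_hat(s,a)) }.
            values = {a: reward(s, a) + sum(p * F[j - 1][t]
                                            for p, t in trans_dist(s, a))
                      for a in actions(s)}
            pj[s] = max(values, key=values.get)
            Fj[s] = values[pj[s]]
        F.append(Fj)
        policy.append(pj)
    return F, policy

In decision period $ i$ (so with $ j=n-i$ transitions still to go), the action policy[n - i - 1][s] is applied to whatever state $ s$ the system then occupies.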

For the objective (3) of a stochastic DP problem with infinite horizon the value function and functional equation are

$\displaystyle F(s) := \max_{\pi} \left\{ \mathbb{E}\left( \sum_{i=0}^{\infty} \alpha^i r(s_i,a_i) \;\Biggr\vert\; s_0=s \right) \right\},$

$\displaystyle F(s) = \max_{a\in A_s} \left\{ r(s,a) + \alpha\, \mathbb{E}\, F(\hat{s}(s,a)) \right\}.$
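The functional equation can be solved by value iteration: start from an arbitrary $ F$ and repeatedly apply the right-hand side. Since $ 0<\alpha<1$ makes this update a contraction in the sup norm, the iterates converge to the unique fixed point. A minimal sketch, using the same hypothetical interface as above:

def value_iteration(states, actions, reward, trans_dist, alpha, tol=1e-8):
    F = {s: 0.0 for s in states}
    while True:
        # One application of F(s) = max_a { r(s,a) + alpha * E F(s_hat(s,a)) }.
        F_new = {s: max(reward(s, a)
                        + alpha * sum(p * F[t] for p, t in trans_dist(s, a))
                        for a in actions(s))
                 for s in states}
        if max(abs(F_new[s] - F[s]) for s in states) < tol:
            return F_new
        F = F_new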

For many DP problems the state space $ S$ can be partitioned into subsets $ S_0$, $ S_1$, $ S_2$,... such that when $ s\in S_i$ (for some $ i\ge 0$) and any action $ a\in A_s$ is taken, the system next occupies a state $ \hat{s} \in S_{i+1}$. In such a case the problem has decision periods or stages.

For many DP problems the difficulty lies in identifying suitable states, actions and stages. The next section discusses two introductory examples: the first is straightforward; the second has some entertainment value and shows the difficulty of finding states, actions and stages.

Shortest path problem
Consider the shortest path problem of getting from node 7 to node 0 in the acyclic graph of Figure 1 where traversing each arc has a specific cost. A suitable definition of state is

state $ i$:    getting from node $ i$ to node $ 0$.

This means every node is associated with a state. A suitable definition of a value function would be
$\displaystyle F(i):=$ the cost of the cheapest way from node $\displaystyle i$ to $\displaystyle 0$      

A suitable definition of action to be taken from state $ i$ is
$\displaystyle A(i):=$    next node to be visited from node $\displaystyle i$ on a cheapest route.      

The states of this problem can be partitioned into stages. State 0 is the only state of stage 0, states 1,2,3 belong to stage 1, states 4,5,6 belong to stage 2 and state 7 is the only state of stage 3. The DP optimality equation can be written as
$\displaystyle F(i)= \min_{j} \left\{ \bigl(\mbox{cost of going from node } i \mbox{ to node } j\bigr) + F(j) \;:\; j \mbox{ a node of the stage one before that to which state } i \mbox{ belongs} \right\}$

In words: the optimal route from node $ i$ to node 0 always consists of a first arc traversal towards node 0, followed by the cheapest possible route from there to node 0. The DP calculation is straightforward: $ F(0)=0$, and having calculated $ F$ for all states of stage $ k$, the recursion yields $ F$ for all states of stage $ k+1$. It turns out that $ F(7)=8$, which means that the cheapest route has cost 8. The optimal path can be obtained by following the optimal actions from state 7 back to state 0.
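The recursion is easily mechanized. The sketch below uses a hypothetical arc-cost table (the costs of Figure 1 are not reproduced here, so the numbers are stand-ins) and assumes the nodes are numbered, as above, so that every arc leads from a higher-numbered node to a lower-numbered one:

def shortest_path_dp(costs, target=0):
    # costs[i] is a dict {j: cost of the arc i -> j}; every arc satisfies j < i.
    F = {target: 0.0}        # F(0) = 0
    best_next = {}           # the action A(i): next node on a cheapest route
    for i in sorted(costs):  # earlier-stage (lower-numbered) nodes first
        F[i], best_next[i] = min((c + F[j], j) for j, c in costs[i].items())
    return F, best_next

# Hypothetical arc costs, NOT those of Figure 1:
costs = {1: {0: 4}, 2: {0: 2}, 3: {0: 5},
         4: {1: 3, 2: 6}, 5: {1: 1, 2: 2, 3: 4}, 6: {2: 3, 3: 1},
         7: {4: 2, 5: 3, 6: 4}}
F, nxt = shortest_path_dp(costs)
# F[7] is the cheapest total cost; following nxt from node 7 reads off the route.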

Figure 1: A shortest path problem
[figure omitted: shortest_path.eps]

If the problem is of the same shape but larger, with stages $ 0,1,2,...,m,m+1$ and with $ n$ nodes in each of stages $ 1,2,...,m$, and if every node is connected to every node of the neighbouring stage, then there are $ n^m$ different paths from one end of the network to the other end. The work involved in the DP calculation is $ O(m n^2)$, i.e. a lot less than total enumeration when $ m$ and $ n$ are large.
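For instance, with $ m=n=10$ there are $ 10^{10}$ different paths, while the DP calculation needs only on the order of $ 10 \cdot 10^2 = 1000$ additions and comparisons.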

[The classical algorithm for shortest path problems is Dijkstra's Algorithm. However, Dijkstra's Algorithm can be viewed as, and derived from, a special form of Dynamic Programming.]

A spinning wheel problem
Consider the game, illustrated in Figure 2, where a spinning wheel has as possible outcomes the decimal digits 0,1,2,...,9. All possible outcomes have equal probability. The wheel is spun four times and after each spin a player has to put the outcome digit into one of the four boxes. The aim is to have a large four digit number at the end. What is the best strategy for placing the digits? For example, if the first spin returns a 9 the player would (if he or she is sensible) place it into the first box from the left. If at any time a spin returns a 0 the player would place it into the last box which is still free. But what should be done if the first or second spin returns a 6 or 7?

Figure 2: A spinning wheel game
[figure omitted: spinning_wheel.eps]

Two DP approaches to this problem will be discussed. The first approach is easier to understand but less elegant than the second one. A good objective in this game is to achieve a high expectation of the final four digit number. In the first approach let's identify a state as

$\displaystyle (p_1,p_2,p_3,p_4)$      

where each of $ p_1$, $ p_2$, $ p_3$, $ p_4$ corresponds to one of the objects $ \Box$, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. A state $ (p_1,p_2,p_3,p_4)$ is interpreted as the four places being presently filled in with $ p_1$, $ p_2$, $ p_3$, $ p_4$ from left to right. If $ p_i=\Box$ then place $ i$ contains no digit and is free. The states can be grouped into stages, stage $ j$ being the set of those states $ (p_1,p_2,p_3,p_4)$ such that exactly $ j$ of $ p_1$, $ p_2$, $ p_3$, $ p_4$ are equal to the object $ \Box$, i.e. those states which have exactly $ j$ free places left. A suitable value function is
$\displaystyle F(p_1,p_2,p_3,p_4) :=$ the expectation of the four digit number at the end, given that the game starts with the places being $ p_1,p_2,p_3,p_4$ and given that an optimal placing strategy is applied.

Let $ r$ be the random variable of the result of a spin. The DP optimality equation can be written as
$\displaystyle F(p_1,p_2,p_3,p_4)= \mathbb{E} \Biggl( \max \Bigl\{ F(r,p_2,p_3,p_4)\, I[p_1=\Box],\;\; F(p_1,r,p_3,p_4)\, I[p_2=\Box],\;\; F(p_1,p_2,r,p_4)\, I[p_3=\Box],\;\; F(p_1,p_2,p_3,r)\, I[p_4=\Box] \Bigr\} \Biggr)$

where $ I[\cdot]$ is the indicator function taking value 1 if the argument is true and 0 otherwise. For states of stage 0 the value of $ F$ is simply the number $ 1000 p_1 + 100 p_2 + 10 p_3 + p_4$. Knowing $ F$ for all states of stage $ j$, the value of $ F$ can be calculated for all states of stage $ j+1$ using the above recursion. $ F(\Box,\Box,\Box,\Box)$ is the maximal expectation of the four digit number in the spinning wheel game when using an optimal strategy, and by doing this DP calculation an optimal placing strategy can be discovered. There are $ 11^4$ states, and probably nobody would like to do this DP calculation by hand.
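A machine, however, handles the $ 11^4$ states easily. Below is a minimal memoized sketch (the names FREE and F are hypothetical); it should reproduce the expectation 7482.735 found at the end of this section:

from functools import lru_cache

FREE = None  # stands for the empty-box object (a free place)

@lru_cache(maxsize=None)
def F(state):
    # state is a 4-tuple; each entry is FREE or a digit 0..9.
    if FREE not in state:  # stage 0: the four digit number is complete
        p1, p2, p3, p4 = state
        return 1000 * p1 + 100 * p2 + 10 * p3 + p4
    total = 0.0
    for r in range(10):  # each spin outcome has probability 1/10
        # place r in the free position that maximizes the expectation
        total += max(F(state[:i] + (r,) + state[i + 1:])
                     for i in range(4) if state[i] is FREE)
    return total / 10

# F((FREE, FREE, FREE, FREE)) evaluates the whole game.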

Another DP approach, which allows a solution by hand, is the following: First of all observe that each place can be given a value. The first place has value $ 10^3$, the second one has value $ 10^2$, the third one has value $ 10^1$ and the last has value 1. Let a state be identified by

$\displaystyle (i,A_1,...,A_i)$      

where $ i=1,2,3,4$ and $ A_1,...,A_i$ are values of free places with $ A_1,...,A_i$ arranged in decreasing order. For instance, the state $ (3,A_1,A_2,A_3)$ corresponds to there being three free places with values $ A_1,A_2,A_3$. Let the value functions be
$\displaystyle F_i(A_1,\ldots,A_i) :=$ the maximal expectation of the end number when a spinning wheel game is played with $ i$ free places whose values are $ A_1,\ldots,A_i$.

The DP optimality equation is
$\displaystyle F_i(A_1,\ldots,A_i)= \mathbb{E} \Biggl( \max_{j\in \{1,\ldots,i\}} \Bigl\{ rA_j + F_{i-1}(B_1,\ldots,B_{i-1}) \;:\; B_k=A_k \mbox{ for } k<j,\; B_k=A_{k+1} \mbox{ for } k\ge j \Bigr\} \Biggr)$

where $ r$ is the random variable of the result of a spin. Let's outline the DP calculation.
$\displaystyle F_1(A_1)= \mathbb{E}(rA_1) = 4.5 A_1$

because if there is only one free place then there is no choice but to place the spin there, and the expectation of a spin is 4.5. Now consider a state $ (2,A_1,A_2)$. By assumption $ A_1>A_2$. If $ r$ is placed in the first place then the expectation of the end number is
$\displaystyle rA_1+F_1(A_2) \quad (=rA_1 + 4.5A_2).$      

If $ r$ is placed in the second place then the expectation of the end number is
$\displaystyle F_1(A_1)+rA_2 \quad (=4.5A_1 + rA_2).$      

Therefore it is optimal to place $ r$ in the first place if $ r>4.5$ and in the second place if $ r<4.5$. Then
$\displaystyle F_2(A_1,A_2) = \mathbb{E}\left(F_1(A_1)+rA_2 \;\vert\; r\le 4\right) P(r\le 4) + \mathbb{E}\left(rA_1+F_1(A_2) \;\vert\; r\ge 5\right) P(r\ge 5)$

$\displaystyle = (4.5A_1 +2A_2)\frac{1}{2} + (7A_1 + 4.5A_2)\frac{1}{2} = \frac{23}{4}A_1 + \frac{13}{4}A_2$

From the last line one can identify the optimal strategy for placing a spin when three free places are left: if $ r$, the result of a spin, is greater than $ \frac{23}{4}$ (i.e. if $ r=6,7,8,9$) it is placed in the first free place, if $ \frac{23}{4}>r>\frac{13}{4}$ (i.e. if $ r=4,5$) it is placed in the second free place, and otherwise it is placed in the last free place. Calculating $ F_3(A_1,A_2,A_3)$ in the same way as $ F_2(A_1,A_2)$, one finds
$\displaystyle F_3(A_1,A_2,A_3)=\frac{129}{20}A_1 + \frac{9}{2}A_2 + \frac{51}{20}A_3$      

From this, one can read off the optimal strategy for where to place the first spin in the spinning wheel game with four free places: if $ r=7,8,9$ (i.e. $ r>\frac{129}{20}$) it is put in the first, if $ r=5,6$ (i.e. $ \frac{129}{20} > r > \frac{9}{2}$) it is put in the second, if $ r=3,4$ (i.e. $ \frac{9}{2} > r > \frac{51}{20}$) it is put in the third and if $ r=0,1,2$ (i.e. $ \frac{51}{20} > r$) it is put in the last place. One finds
$\displaystyle F_4(A_1,A_2,A_3,A_4)=\frac{1383}{200}A_1 + \frac{1057}{200}A_2 + \frac{743}{200}A_3 + \frac{417}{200}A_4.$

With $ A_1=10^3$, $ A_2=10^2$, $ A_3=10$, $ A_4=1$ the expectation of the four digit number under an optimal strategy in the spinning wheel game turns out to be 7482.735. If the outcomes of the spins were placed at random then the expected value of the four digit number would be 4999.5.
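The whole calculation of the second approach can be checked with a few lines of code; a minimal recursive sketch (the function name is hypothetical):

def F(values):
    # Maximal expectation of the end number, `values` being the values
    # of the free places arranged in decreasing order.
    if not values:
        return 0.0
    # E( max_j { r * A_j + F_{i-1}(values without A_j) } ), r uniform on 0..9.
    return sum(max(r * values[j] + F(values[:j] + values[j + 1:])
                   for j in range(len(values)))
               for r in range(10)) / 10

# F((1,)) gives 4.5 and F((1000, 100, 10, 1)) gives 7482.735
# (up to floating point), matching the values derived above.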

