The study of dynamic programming, as we know it today, started around 1940, and much of the initial work was done by Bellman and Wald. Dynamic programming (DP) can be viewed as recursive programming, since common to all DP optimization problems is a functional optimality equation that leads to a recursive solution method. In numerous papers Bellman identified optimality conditions of optimization problems, and through his work on functional equations, dynamic programming and the principle of optimality became well known. Stochastic sequential decision problems are closely linked to stochastic DP; their modern study began with Wald's work on sequential statistical problems during the Second World War. Today the existing theory on DP is vast, especially on stochastic DP, which is closely related to Markov decision processes.

Many problems in management can be formulated as DP problems, in particular problems involving uncertainty and allocation. Applications of stochastic dynamic programming arise in many areas, such as aerospace dynamics, financial economics, resource management, inventory management, robotics and power generation. Yet few people, it seems, actually know and understand DP, which explains why DP is not used in industry and commerce for many suitable problems. DP is not a black-box method: formulating a problem in DP form usually requires considerable mathematical skill, even though DP does not require a lot of mathematical knowledge. A mathematically gifted high school student would be able to understand the DP principles and models.

In the sequel a mathematical introduction to the usual DP models is given. If the general description does not make sense on a first reading, it is probably a good idea to first read through the two examples given later on, which illustrate the use of DP. The first example in particular is straightforward.

**A general mathematical description of DP**

Common to all DP formulations are states and actions.
Let $S$ be the set of states that a system can occupy and let
$A$ be the set of actions that can be taken. When a system is in state
$s \in S$ then typically the set of actions that can be taken is
$A(s) \subseteq A$.
Taking action $a \in A(s)$ from state $s$ leads to a transition
of the system to a new state $s' \in S$ and incurs a
reward (or cost) $r(s,a)$. This is the common ingredient of
DP problems. However, this is not enough information to define an
optimization problem. Different classes of DP problems extend this
framework in different ways.
For example, in deterministic DP problems $s'$ is defined
by $s$ and $a$, say $s' = t(s,a)$, *i.e.* the new state of the system is determined
by the old state and by the action that has been taken.
In stochastic DP problems $s'$ is stochastic with the
probability distribution depending on $s$ and $a$.
For some problems the number of state transitions is
finite, for others it is infinite, and for yet others the number of
transitions is finite with probability one.
The objective function also varies depending on the problem.

For many deterministic, finite horizon (*i.e.* finite number
of state transitions) DP problems the objective can be written as

$$\max \; \sum_{n=0}^{N-1} r(s_n, a_n) + R(s_N), \qquad s_{n+1} = t(s_n, a_n), \tag{1}$$

where $s_0$ is a given initial state, $N$ is the number of transitions, and $R$ is a terminal reward. The maximization in (1) takes place over $a_0, \dots, a_{N-1}$ with $a_n \in A(s_n)$. Let's assume that the objects $A(s)$, $r$, $t$ and $R$ are such that the maximum in (1) exists.

For many stochastic, finite horizon DP problems the objective
can be formulated as

$$\max_{\pi} \; \mathbb{E}\left[ \sum_{n=0}^{N-1} r(s_n, a_n) + R(s_N) \right], \qquad a_n = \pi_n(s_n), \tag{2}$$

where now $s_{n+1}$ is a random variable with the distribution depending on $s_n$ and on $a_n$, and $\mathbb{E}$ is the expectation operator. The maximization in (2) takes place over all policies $\pi = (\pi_0, \dots, \pi_{N-1})$ of choosing actions, which means that maximization takes place over all functions $\pi_n : S \to A$ such that $\pi_n(s) \in A(s)$. $\pi_n$ determines which action is taken when in the $n$-th decision period the state occupied by the system is $s$.

For many stochastic, infinite horizon DP problems the objective
can be written as

$$\max_{\pi} \; \mathbb{E}\left[ \sum_{n=0}^{\infty} \beta^n \, r(s_n, a_n) \right], \qquad a_n = \pi(s_n), \tag{3}$$

where $\beta \in (0,1)$ is a discount factor. To ensure convergence, assume that the function $r$ is bounded.

A lot of problems can be written in one of the forms
(1), (2), (3), or in one of these forms
when they are slightly changed.
The DP optimality condition for (1)
consists of a value function

$$F_n(s) = \max \left\{ \sum_{m=n}^{N-1} r(s_m, a_m) + R(s_N) \;:\; s_n = s, \; s_{m+1} = t(s_m, a_m), \; a_m \in A(s_m) \right\}$$

for $n = 0, 1, \dots, N$ (so that $F_N(s) = R(s)$),

and then establishing the functional equation

$$F_n(s) = \max_{a \in A(s)} \left[ r(s, a) + F_{n+1}(t(s, a)) \right]. \tag{4}$$

Then the optimal objective value of (1) is $F_0(s_0)$ and can be calculated recursively using (4). Having found $F_0, \dots, F_N$, the optimal path of states and actions of problem (1) is found as follows: $s_1^*$ and $a_0^*$ are the state and action for which

$$F_0(s_0) = r(s_0, a_0^*) + F_1(s_1^*),$$

where $s_1^* = t(s_0, a_0^*)$ and $a_0^* \in A(s_0)$. Similarly $s_2^*$ and $a_1^*$ are the state and action for which

$$F_1(s_1^*) = r(s_1^*, a_1^*) + F_2(s_2^*),$$

where $s_2^* = t(s_1^*, a_1^*)$ and $a_1^* \in A(s_1^*)$. And so on.
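The backward recursion for a deterministic, finite horizon problem can be sketched in a few lines of code. The particular state space, reward, transition function and horizon below are illustrative assumptions chosen only to exercise the recursion; any finite model of the same shape would do.

```python
# Backward dynamic programming for a deterministic, finite-horizon problem.
# The problem data (states, actions, reward, transition) are illustrative
# assumptions, not taken from the text.

N = 3                      # number of transitions
states = range(6)          # S = {0, ..., 5}

def actions(s):            # A(s): actions available in state s
    return [-1, +1] if 0 < s < 5 else ([+1] if s == 0 else [-1])

def t(s, a):               # deterministic transition t(s, a)
    return s + a

def r(s, a):               # one-step reward r(s, a)
    return s * a

def R(s):                  # terminal reward R(s)
    return float(s)

# F[n][s] = best total reward from state s when n transitions are done
F = [dict() for _ in range(N + 1)]
for s in states:
    F[N][s] = R(s)                         # F_N(s) = R(s)
for n in reversed(range(N)):               # n = N-1, ..., 0
    for s in states:
        F[n][s] = max(r(s, a) + F[n + 1][t(s, a)] for a in actions(s))

s0 = 2
print(F[0][s0])            # optimal objective value F_0(s0)
```

The optimal path can then be recovered exactly as described above, by stepping forward from $s_0$ and at each step picking the action that attains the maximum.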

For the objective (2)
of a stochastic, finite horizon problem the value function is

$$F_n(s) = \max_{\pi} \; \mathbb{E}\left[ \sum_{m=n}^{N-1} r(s_m, \pi_m(s_m)) + R(s_N) \;\middle|\; s_n = s \right],$$

where $\pi = (\pi_n, \dots, \pi_{N-1})$ is a policy of choosing actions, which means that each $\pi_m$ is a function $\pi_m : S \to A$ and $\pi_m(s) \in A(s)$. The maximization in the above definition takes place over the set of policies. The functional equation is

$$F_n(s) = \max_{a \in A(s)} \; \mathbb{E}\left[ r(s, a) + F_{n+1}(s') \right],$$

where $s'$ is a random variable now. The argument $a^*$, which achieves the maximum in the RHS, defines $\pi_n^*(s) = a^*$, where $\pi^*$ is an optimal policy.
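The stochastic backward induction can be sketched as follows. The tiny two-state model (its transition probabilities and rewards) is an illustrative assumption used only to show how the expectation enters the recursion.

```python
# Backward induction for a stochastic, finite-horizon problem.
# The two-state model below is an illustrative assumption.

N = 2
states = ["low", "high"]
acts = ["stay", "move"]

def p(s, a):               # distribution of the next state given (s, a)
    if a == "stay":
        return {s: 1.0}
    other = "high" if s == "low" else "low"
    return {other: 0.8, s: 0.2}            # "move" succeeds with prob. 0.8

def r(s, a):               # one-step reward r(s, a)
    return (2.0 if s == "high" else 0.0) - (0.5 if a == "move" else 0.0)

F = [dict() for _ in range(N + 1)]
policy = [dict() for _ in range(N)]        # policy[n][s] = optimal action
for s in states:
    F[N][s] = 0.0                          # no terminal reward
for n in reversed(range(N)):
    for s in states:
        best_a, best_v = None, float("-inf")
        for a in acts:
            v = r(s, a) + sum(q * F[n + 1][s2] for s2, q in p(s, a).items())
            if v > best_v:
                best_a, best_v = a, v
        F[n][s], policy[n][s] = best_v, best_a
```

The maximizing action at each $(n, s)$ is stored, so the loop produces the optimal policy $\pi^*$ alongside the value function.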

For the objective (3)
of a stochastic DP problem with infinite horizon the value function
and functional equation are

$$F(s) = \max_{\pi} \; \mathbb{E}\left[ \sum_{n=0}^{\infty} \beta^n \, r(s_n, a_n) \;\middle|\; s_0 = s \right], \qquad F(s) = \max_{a \in A(s)} \; \mathbb{E}\left[ r(s, a) + \beta \, F(s') \right].$$
For many DP problems the state space $S$ can be partitioned into subsets $S_0$, $S_1$, $S_2$, ... such that when $s \in S_n$ (for some $n \ge 1$) and any action is taken, then the system next occupies a state $s'$ satisfying $s' \in S_{n-1}$. In such a case the problem is said to have decision periods, or stages.

For many DP problems the difficulty lies in identifying suitable states,
actions and stages. The next section discusses two introductory
examples: the first is straightforward, while the second has
some entertainment value and shows the difficulty of
finding states, actions and stages.

**Shortest path problem**

Consider the shortest path problem of getting from node 7 to node 0
in the acyclic graph of Figure 1
where traversing each arc has a specific cost.
A suitable definition of state is

state $n$ = getting from node $n$ to node 0.

This means every node is associated with a state. A suitable definition of a value function would be

$F(n)$ = the cost of the cheapest way from node $n$ to node 0.

A suitable definition of the action $a(n)$ to be taken from state $n$ is

$a(n)$ = next node to be visited from node $n$ on a cheapest route.

The states of this problem can be partitioned into stages. State 0 is the only state of stage 0, states 1, 2, 3 belong to stage 1, states 4, 5, 6 belong to stage 2 and state 7 is the only state of stage 3. The DP optimality equation can be written as

$$F(n) = \min_{m \,:\, \text{arc } (n,m) \text{ exists}} \left[ c(n, m) + F(m) \right], \qquad F(0) = 0,$$

where $c(n,m)$ is the cost of traversing the arc from node $n$ to node $m$. In words: the optimal route from node $n$ to 0 always consists of a first arc traversal towards node 0 and then carrying on from there on the cheapest possible route to 0. The DP calculation is straightforward: having calculated $F(m)$ for all states $m$ of stage $k$, the recursion yields $F(n)$ for all states $n$ of stage $k+1$. It turns out that $F(7) = 8$, which means that the cheapest route has cost 8. The optimal path can be obtained by working the optimal actions backwards starting from state 7.
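The stage-by-stage calculation can be sketched in code. Figure 1's arc costs are not reproduced here, so the cost table below is a made-up example with the same staged shape (node 7 to nodes 4, 5, 6, then to nodes 1, 2, 3, then to node 0); with the actual costs of Figure 1 the same loop would return $F(7) = 8$.

```python
# DP for a staged shortest path problem.  The arc costs are a made-up
# example (Figure 1's costs are not reproduced in the text); the staged
# structure 7 -> {4,5,6} -> {1,2,3} -> 0 matches the example.

cost = {                       # cost[n] = {m: cost of arc n -> m}
    7: {4: 3, 5: 1, 6: 4},
    4: {1: 2, 2: 5, 3: 1},
    5: {1: 4, 2: 2, 3: 6},
    6: {1: 1, 2: 3, 3: 2},
    1: {0: 5}, 2: {0: 2}, 3: {0: 4},
}

F = {0: 0.0}                   # F(0) = 0
action = {}                    # a(n) = next node on a cheapest route
for stage in [[1, 2, 3], [4, 5, 6], [7]]:   # stage 1, then 2, then 3
    for n in stage:
        action[n] = min(cost[n], key=lambda m: cost[n][m] + F[m])
        F[n] = cost[n][action[n]] + F[action[n]]

# Recover the optimal route by following the optimal actions from node 7.
route, n = [7], 7
while n != 0:
    n = action[n]
    route.append(n)
```

For these invented costs the cheapest route is 7, 5, 2, 0 with cost 5.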

If the problem is of the same shape but larger, with stages
$0, 1, \dots, N$, a single node in each of stages $0$ and $N$, and $K$ nodes in each
of stages $1, \dots, N-1$, and if every node is connected to every
node of the neighbouring stage,
then there are $K^{N-1}$ different paths
from one end of the network to the other end.
The work involved in the DP calculation is $O(N K^2)$,
*i.e.* a lot less than total enumeration when $N$ and $K$ are
large.

[The classical algorithm for shortest path problems is Dijkstra's Algorithm.
However, Dijkstra's Algorithm can be viewed as and derived from
a special form of
Dynamic Programming.]

**A spinning wheel problem**

Consider the game, illustrated in Figure 2,
where a spinning wheel has as possible outcomes the decimal
digits 0,1,2,...,9. All possible outcomes have equal probability.
The wheel is spun four times and after each spin a player has to
put the outcome digit into one of the four boxes. The aim is to
have a large four digit number at the end. What is the best
strategy for placing the digits?
For example, if the first spin returns a 9 the player would
(if he or she is sensible) place it into the first box from the left.
If at any time a spin returns a 0 the player would place it into the
last box which is still free.
But what should be done if the first or second spin returns a
6 or 7?

Two DP approaches to this problem will be discussed. The first
approach is easier to understand but less elegant than the second one.
A good objective in this game is to achieve a high expectation
of the final four digit number. In the first approach let's
identify a state as

$$(d_1, d_2, d_3, d_4),$$

where each of $d_1$, $d_2$, $d_3$, $d_4$ corresponds to one of the objects $*$, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. A state is interpreted as the four places being presently filled in with $d_1$, $d_2$, $d_3$, $d_4$ from left to right. If $d_i = *$ then place $i$ contains no digit and is free. The states can be grouped into stages, stage $n$ being the set of those states such that exactly $n$ of $d_1$, $d_2$, $d_3$, $d_4$ are equal to the object $*$. Let the value function be

$F(d_1, d_2, d_3, d_4)$ = expectation of the four digit number at the end, given that the game starts with the places being $d_1$, $d_2$, $d_3$, $d_4$ and given that an optimal placing-strategy is applied.

Let $R$ be the random variable of the result of a spin. The DP optimality equation can be written as

$$F(d_1, d_2, d_3, d_4) = \mathbb{E}\left[ \max_{i \,:\, 1\{d_i = *\} = 1} F(d_1, \dots, d_{i-1}, R, d_{i+1}, \dots, d_4) \right],$$

where $1\{\cdot\}$ is the indicator function taking value 1 if the argument is true and 0 otherwise. For states of stage 0 the value of $F$ is simply the number $d_1 d_2 d_3 d_4$. Knowing $F$ for all states of stage $n$, the value of $F$ can be calculated for all states of stage $n+1$ using the above recursion. $F(*,*,*,*)$ is the maximal expectation of the four digit number in the spinning wheel game when using an optimal strategy. Doing this DP calculation an optimal placing strategy can be discovered. There are $11^4 = 14641$ states and probably nobody would like to do this DP calculation by hand.
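Nobody would do the calculation by hand, but a computer does it in a blink. The sketch below encodes a state as a 4-tuple with `None` standing for the free-place symbol $*$ and applies the recursion above directly.

```python
# First DP approach to the spinning wheel game: a state is a 4-tuple
# (d1, d2, d3, d4), each entry a digit 0-9 or None for a free place (*).
from functools import lru_cache

@lru_cache(maxsize=None)
def F(state):
    free = [i for i, d in enumerate(state) if d is None]
    if not free:                          # stage 0: the number itself
        return 1000 * state[0] + 100 * state[1] + 10 * state[2] + state[3]
    # Spin, then put the outcome into the best free place.
    total = 0.0
    for r in range(10):                   # outcomes 0..9, each prob. 1/10
        total += max(F(state[:i] + (r,) + state[i + 1:]) for i in free)
    return total / 10

best = F((None, None, None, None))        # maximal expectation, all free
```

The value `best` agrees with the figure obtained by hand in the second approach below.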

Another DP approach, which allows a solution by hand, is the following:
First of all observe that each place can be given a value. The first
place has value 1000, the second one has value 100,
the third one has value 10 and the last has value 1.
Let a state be identified by

$$(v_1, \dots, v_n),$$

where $n$ is the number of free places and $v_1 > v_2 > \dots > v_n$ are the values of the free places arranged in decreasing order. For instance, the state $(1000, 10, 1)$ corresponds to there being three free places with values 1000, 10 and 1. Let the value functions be

$F_n(v_1, \dots, v_n)$ = maximal expectation of the end number when a spinning wheel game is played, there are $n$ free places and the values of the places are $v_1, \dots, v_n$.

The DP optimality equation is

$$F_1(v_1) = \mathbb{E}[R] \, v_1 \quad \text{for } n = 1,$$

$$F_n(v_1, \dots, v_n) = \mathbb{E}\left[ \max_{1 \le i \le n} \left( R \, v_i + F_{n-1}(v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_n) \right) \right] \quad \text{for } n \ge 2,$$

where $R$ is the random variable of the result of a spin. Let's outline the DP calculation.

$F_1(v_1) = 4.5 \, v_1$, because if there is one free place only then there is no choice but to place the spin result in this free place, and the expectation of a spin is $\mathbb{E}[R] = 4.5$. Now consider a state $(v_1, v_2)$. By assumption $v_1 > v_2$. If $R$ is placed in the first place then the expectation of the end number is

$$R \, v_1 + 4.5 \, v_2.$$

If $R$ is placed in the second place then the expectation of the end number is

$$4.5 \, v_1 + R \, v_2.$$

The first exceeds the second exactly when $(R - 4.5)(v_1 - v_2) > 0$, *i.e.* when $R > 4.5$. Therefore it is optimal to place $R$ in the first place if $R \ge 5$ and in the second place if $R \le 4$. Then

$$F_2(v_1, v_2) = \sum_{r=5}^{9} P(R = r)\,(r \, v_1 + 4.5 \, v_2) + \sum_{r=0}^{4} P(R = r)\,(4.5 \, v_1 + r \, v_2) = 5.75 \, v_1 + 3.25 \, v_2.$$

By looking at the last line one can identify the optimal strategy of placing a spin when three free places are left: if $R$, the result of a spin, is greater than 5.75 (*i.e.* $R \ge 6$) it should be placed in the most valuable free place; if $3.25 < R < 5.75$ (*i.e.* $R \in \{4, 5\}$) it should be placed in the second most valuable free place; and if $R < 3.25$ (*i.e.* $R \le 3$) it should be placed in the least valuable free place. Carrying out the corresponding calculation gives

$$F_3(v_1, v_2, v_3) = 6.45 \, v_1 + 4.5 \, v_2 + 2.55 \, v_3.$$

From this, one can read off the optimal strategy for where to place the first spin in the spinning wheel game with four free places: if $R \ge 7$ place it in the first place, if $R \in \{5, 6\}$ in the second place, if $R \in \{3, 4\}$ in the third place, and if $R \le 2$ in the last place. One more round of the calculation gives

$$F_4(v_1, v_2, v_3, v_4) = 6.915 \, v_1 + 5.285 \, v_2 + 3.715 \, v_3 + 2.085 \, v_4.$$

With $v_1 = 1000$, $v_2 = 100$, $v_3 = 10$, $v_4 = 1$, the expectation of the four digit number under an optimal strategy in the spinning wheel game turns out to be $F_4(1000, 100, 10, 1) = 7482.735$. If the outcomes of the spins were placed at random then the expected value of the four digit number would be $4.5 \times 1111 = 4999.5$.
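The hand calculation is easy to check by machine: the second approach needs only the sorted tuple of free-place values as state, so the whole recursion fits in a few lines.

```python
# Second DP approach: a state is the tuple of values of the free places,
# in decreasing order.  E[R] = 4.5 for a fair spin over the digits 0..9.
from functools import lru_cache

@lru_cache(maxsize=None)
def F(values):
    if len(values) == 1:
        return 4.5 * values[0]            # F_1(v1) = 4.5 * v1
    total = 0.0
    for r in range(10):                   # spin outcome r, prob. 1/10
        total += max(r * v + F(values[:i] + values[i + 1:])
                     for i, v in enumerate(values))
    return total / 10
```

Evaluating `F((100, 10))` reproduces the coefficients 5.75 and 3.25 derived above, and `F((1000, 100, 10, 1))` gives the overall value 7482.735.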