Dynamic Programming
Introduction
The study of dynamic programming, as we know it today, started
around 1940, and much of the initial work was done by
Bellman and Wald. Dynamic programming (DP) can be viewed as recursive
programming, since common to all dynamic programming optimization
problems is a functional optimality equation which leads to a
recursive solution method.
In numerous papers Bellman identified optimality conditions of
optimization problems and through his work on functional equations,
dynamic programming and the principle of optimality became well known.
Stochastic sequential decision problems are closely linked to
stochastic DP.
The modern study of stochastic sequential decision problems
began with Wald's work on sequential statistical problems during
the Second World War. Today the existing theory on DP
is vast, especially on stochastic DP which is closely related to
Markov decision processes.
Many problems in management can be formulated as DP problems,
in particular problems involving uncertainty and allocation problems.
Applications of stochastic dynamic programming arise in many areas,
such as aerospace dynamics, financial economics, resource
management, inventory management, robotics and power generation.
Very few people, it seems, actually know and understand DP, which
explains why DP is not used in industry and commerce for many
suitable problems. DP is not a black box method. It usually requires
a great deal of mathematical skill to formulate problems in a DP form,
even though DP does not require a lot of mathematical knowledge:
a mathematically gifted high school student would be able to understand
the DP principles and models.
In the sequel a mathematical introduction to the usual DP models is
given. If the general description does not make sense on a first
reading, it is probably a good idea to first read through
the two examples given later on, which illustrate the use of DP.
The first example in particular is straightforward.
A general mathematical description of DP
Common to all DP formulations are states and actions.
Let $S$ be the set of states that a system can occupy and let
$A$ be the set of actions that can be taken. When a system is in state $s$
then typically the set of actions that can be taken is
$A(s) \subseteq A$.
Taking action $a \in A(s)$ from state $s$ leads to a transition
of the system to a new state $s'$ and incurs a
reward (or cost) $r(s,a)$. This is the common ingredient of
DP problems. However, this is not enough information to define an
optimization problem. Different classes of DP problems extend this
framework in different ways.
For example, in deterministic DP problems $s'$ is defined
by $s$ and $a$, say $s' = t(s,a)$, i.e. the new state of the system is determined
by the old state and by the action that has been taken.
In stochastic DP problems $s'$ is stochastic with the
probability distribution depending on $s$ and $a$.
For some problems the number of transitions of states is
finite, for others it is infinite, and for yet others the number of
transitions is finite with probability one.
Also the objective function varies depending on the problem.
For many deterministic, finite horizon (i.e. finite number
of transitions of states) DP problems the objective can be written as

maximize $\sum_{n=0}^{N-1} r(s_n, a_n) + R(s_N)$   (1)

where $s_0$ is a given initial state,
$s_{n+1} = t(s_n, a_n)$, $N$ is the
number of transitions,
and $R$ is a terminal reward. The maximization in
(1) takes place over
$a_0 \in A(s_0),\, a_1 \in A(s_1),\, \dots,\, a_{N-1} \in A(s_{N-1})$.
Let's assume that the objects $A(s)$, $t$, $r$ and $R$ are such
that the maximum in (1) exists.
For many stochastic, finite horizon DP problems the objective
can be formulated as

maximize $E\left[\sum_{n=0}^{N-1} r(s_n, a_n) + R(s_N)\right]$   (2)

where now $s_{n+1}$ is a random variable with the distribution
depending on $s_n$ and on $a_n$.
$E$ is the expectation operator.
The maximization in
(2) takes place over all policies of
choosing actions, which means that maximization takes place over all
functions $\pi_n : S \to A$, $n = 0, \dots, N-1$, such that
$\pi_n(s) \in A(s)$. $\pi_n$ determines which
action is taken when in the $n$-th decision period the state occupied
by the system is $s$, i.e. $a_n = \pi_n(s_n)$.
For many stochastic, infinite horizon DP problems the objective
can be written as

maximize $E\left[\sum_{n=0}^{\infty} \beta^n r(s_n, a_n)\right]$   (3)

where $0 \le \beta < 1$
is a discount factor. To ensure convergence, assume
that the function $r$ is bounded.
A lot of problems can be written in one of the forms
(1), (2), (3), or in one of these forms
after slight changes.
The DP optimality condition to (1)
consists of a value function

$F_n(s) = \max\left\{\sum_{k=n}^{N-1} r(s_k, a_k) + R(s_N) \;:\; s_n = s,\; a_k \in A(s_k),\; s_{k+1} = t(s_k, a_k)\right\}$

and then establishing the functional equation

$F_n(s) = \max_{a \in A(s)} \left[\, r(s,a) + F_{n+1}(t(s,a)) \,\right], \qquad F_N(s) = R(s).$   (4)

Then the optimal objective value of (1)
is $F_0(s_0)$ and can be calculated recursively using
(4). Having found $F_0, F_1, \dots, F_N$,
the optimal path of states and actions of problem
(1) is found as follows:
$s_0$ and $a_0$ are the state and action for which
$F_0(s_0) = r(s_0, a_0) + F_1(s_1)$, where
$a_0 \in A(s_0)$ and $s_1 = t(s_0, a_0)$.
Similarly $s_1$ and $a_1$ are the state and action for which
$F_1(s_1) = r(s_1, a_1) + F_2(s_2)$, where
$a_1 \in A(s_1)$ and $s_2 = t(s_1, a_1)$. And so on.
For the objective (2)
of a stochastic, finite horizon problem the value function is

$F_n(s) = \max_{\pi} E\left[\sum_{k=n}^{N-1} r(s_k, \pi_k(s_k)) + R(s_N) \;\Big|\; s_n = s\right]$

where $\pi = (\pi_n, \dots, \pi_{N-1})$ is a policy of choosing actions, which means that
each $\pi_k$ is a function $\pi_k : S \to A$ and
$\pi_k(s) \in A(s)$. The maximization
in the above definition takes place over the set of policies.
The functional equation is

$F_n(s) = \max_{a \in A(s)} E\left[\, r(s,a) + F_{n+1}(s') \,\right], \qquad F_N(s) = R(s),$

where $s'$
is a random variable now.
The argument $a$, which achieves the maximum in the RHS, defines
$\pi^*_n(s) = a$, where
$\pi^* = (\pi^*_0, \dots, \pi^*_{N-1})$ is an optimal policy.
For the objective (3)
of a stochastic DP problem with infinite horizon the value function
and functional equation are

$F(s) = \max_{\pi} E\left[\sum_{n=0}^{\infty} \beta^n r(s_n, \pi(s_n)) \;\Big|\; s_0 = s\right], \qquad F(s) = \max_{a \in A(s)} E\left[\, r(s,a) + \beta F(s') \,\right].$
For many DP problems the state space $S$ can be partitioned
into subsets $S_0$, $S_1$, $S_2$,... such that when $s \in S_k$
(for some $k \ge 1$) and any action is taken then the system
next occupies a state $s'$ satisfying $s' \in S_{k-1}$.
In such a case the problem has decision periods or stages.
For many DP problems the difficulty lies in identifying suitable states,
actions and stages. The next section discusses two introductory
examples: the first one is straightforward, the second one has
some entertainment value and shows the difficulty of
finding states, actions and stages.
Shortest path problem
Consider the shortest path problem of getting from node 7 to node 0
in the acyclic graph of Figure 1
where traversing each arc has a specific cost.
A suitable definition of state is

state $i$ = getting from node $i$ to node 0, $\quad i = 0, 1, \dots, 7$.

This means every node is associated with a state.
A suitable definition of a value function would be

$F(i)$ = the cost of the cheapest way from node $i$ to node 0.

A suitable definition of the action $a(i)$ to be taken from state $i$ is

$a(i)$ = the next node to be visited from node $i$ on a cheapest route.
The states of this problem can be partitioned into stages.
State 0 is the only state of stage 0,
states 1,2,3 belong to stage 1, states 4,5,6 belong to stage 2 and
state 7 is the only state of stage 3.
The DP optimality equation can be written as

$F(i) = \min_j \left\{\, c(i,j) + F(j) \,\right\}, \qquad F(0) = 0,$

where the minimum is taken over all nodes $j$ that can be reached from
node $i$ by traversing a single arc and $c(i,j)$ is the cost of that arc.
In words: the optimal route from node $i$ to 0 always consists
of a first arc traversal towards node 0, followed by the cheapest
possible route from there to 0.
The DP calculation is straightforward: having
calculated $F(j)$ for all states $j$ of stage $k$, the recursion yields
$F(i)$ for all states $i$ of stage $k+1$.
It turns out that $F(7) = 8$,
which means that the cheapest route has cost 8.
The optimal path can be obtained by
tracing the recorded optimal actions starting from state 7.
Figure 1:
A shortest path problem
If the problem is of the same shape but larger, with $m$ stages
and with $n$ nodes in each of the intermediate stages,
and if every node is connected to every
node of the neighbouring stage,
then there are $n^{m-2}$ different paths
from one end of the network to the other end.
The work involved in the DP calculation is $O(m n^2)$,
i.e. a lot less than total enumeration when $n$ and $m$ are
large.
[The classical algorithm for shortest path problems is Dijkstra's Algorithm.
However, Dijkstra's Algorithm can be viewed as and derived from
a special form of
Dynamic Programming.]
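As a concrete sketch, the stagewise recursion can be implemented in a few lines of Python. The arc costs of Figure 1 are not reproduced in the text, so the costs below are invented for illustration; only the stage layout (node 0; nodes 1-3; nodes 4-6; node 7) follows the description above.

```python
# Backward recursion F(i) = min_j { c(i,j) + F(j) } on a staged acyclic graph.
# Arc costs are hypothetical, NOT those of Figure 1.
cost = {
    (1, 0): 5, (2, 0): 3, (3, 0): 4,     # stage 1 -> stage 0
    (4, 1): 2, (4, 2): 5,                # stage 2 -> stage 1
    (5, 1): 4, (5, 2): 2, (5, 3): 6,
    (6, 2): 3, (6, 3): 1,
    (7, 4): 3, (7, 5): 1, (7, 6): 4,     # stage 3 -> stage 2
}

F = {0: 0}      # value function: cheapest cost from node i to node 0
best = {}       # optimal action a(i): next node on a cheapest route
for stage in ([1, 2, 3], [4, 5, 6], [7]):     # process stages outwards
    for i in stage:
        succ = [j for (u, j) in cost if u == i]
        best[i] = min(succ, key=lambda j: cost[i, j] + F[j])
        F[i] = cost[i, best[i]] + F[best[i]]

# recover the optimal path by following the recorded actions from node 7
path, node = [7], 7
while node != 0:
    node = best[node]
    path.append(node)
print(F[7], path)
```

With these made-up costs the cheapest route costs 6 and runs 7, 5, 2, 0; run on the actual costs of Figure 1 the same code would reproduce $F(7) = 8$.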
A spinning wheel problem
Consider the game, illustrated in Figure 2,
where a spinning wheel has as possible outcomes the decimal
digits 0,1,2,...,9. All possible outcomes have equal probability.
The wheel is spun four times and after each spin a player has to
put the outcome digit into one of the four boxes. The aim is to
have a large four digit number at the end. What is the best
strategy for placing the digits?
For example, if the first spin returns a 9 the player would
(if he or she is sensible) place it into the first box from the left.
If at any time a spin returns a 0 the player would place it into the
last box which is still free.
But what should be done if the first or second spin returns a
6 or 7?
Figure 2:
A spinning wheel game
Two DP approaches to this problem will be discussed. The first
approach is easier to understand but less elegant than the second one.
A good objective in this game is to achieve a high expectation
of the final four digit number. In the first approach let's
identify a state as
$(p_1, p_2, p_3, p_4)$
where each of $p_1$, $p_2$, $p_3$, $p_4$ corresponds to one of the
objects $*$, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
A state $(p_1, p_2, p_3, p_4)$
is interpreted as the four places
being presently filled in with $p_1$, $p_2$, $p_3$, $p_4$
from left to right. If $p_i = *$ then place $i$ contains no
digit and is free. The states can be grouped into stages,
stage $k$ being the set of those states $(p_1, p_2, p_3, p_4)$
such that exactly $k$ of $p_1$, $p_2$, $p_3$, $p_4$ are equal to
the object $*$, i.e. those states which have exactly $k$
free places left.
A suitable value function is

$F(p_1, p_2, p_3, p_4)$ = the expectation of the four digit number at the end,
given that the game starts with the places being $(p_1, p_2, p_3, p_4)$
and given that an optimal placing-strategy is applied.
Let $D$ be the random variable of the result of a spin.
The DP optimality equation can be written as

$F(p_1, p_2, p_3, p_4) = E\left[\max_{i:\, 1\{p_i = *\} = 1} F(p_1, \dots, p_{i-1}, D, p_{i+1}, \dots, p_4)\right]$

where $1\{\cdot\}$ is the indicator function taking value 1 if the
argument is true and 0 otherwise, so the maximum runs over the free places.
For states of stage 0 the value of $F$ is simply the number,
$F(p_1, p_2, p_3, p_4) = 1000\, p_1 + 100\, p_2 + 10\, p_3 + p_4$.
Knowing $F$ for all states of stage $k$,
the value of $F$ can be calculated for all states of stage $k+1$
using the above recursion.
$F(*, *, *, *)$ is the
maximal expectation of the four digit number in the spinning wheel
game when using an optimal strategy.
Doing this DP calculation an optimal placing strategy can be
discovered. There are $11^4 = 14641$ states and probably nobody would like to
do this DP calculation by hand.
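By computer, however, the first approach takes only a few lines. The sketch below implements the recursion as stated, with Python's `None` standing for the object $*$ and exact rational arithmetic so that the expectation comes out exactly:

```python
from fractions import Fraction
from functools import lru_cache

FREE = None  # stands for the object '*': place not yet filled

@lru_cache(maxsize=None)
def F(state):
    """Expected final four digit number under optimal placing from `state`."""
    free = [i for i, p in enumerate(state) if p is FREE]
    if not free:
        # stage 0: all four places filled, F is simply the number
        return Fraction(1000 * state[0] + 100 * state[1]
                        + 10 * state[2] + state[3])
    total = Fraction(0)
    for d in range(10):  # the 10 equally likely spin outcomes
        # place digit d in the best free place, then continue optimally
        total += max(F(state[:i] + (d,) + state[i + 1:]) for i in free)
    return total / 10

best = F((FREE, FREE, FREE, FREE))
print(float(best))  # 7482.735
```

The call at the bottom evaluates all $11^4$ states in well under a second and returns 7482.735, agreeing with the value obtained by hand with the second approach below.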
Another DP approach, which allows a solution by hand, is the following:
First of all observe that each place can be given a value. The first
place has value 1000, the second one has value 100,
the third one has value 10 and the last has value 1.
Let a state be identified by $(v_1, \dots, v_k)$
where $k$ is the number of free places and $v_1 > v_2 > \dots > v_k$
are the values of the free places,
arranged in decreasing order.
For instance, the state $(1000, 10, 1)$ corresponds to there
being three free places with values 1000, 10 and 1.
Let the value functions be

$F_k(v_1, \dots, v_k)$ = the maximal expectation of the end number when
a spinning wheel game is played, there are $k$ free places
and the values of the places are $v_1 > \dots > v_k$.
The DP optimality equation is

$F_k(v_1, \dots, v_k) = E\left[\max_{1 \le i \le k} \left( D\, v_i + F_{k-1}(v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_k) \right)\right]$

where $D$ is the random variable of
the result of a spin. Let's outline the DP calculation.
$F_1(v_1) = 4.5\, v_1$,
because if there is one free place only then there is no
choice but to place the spin in this free place and the
expectation of a spin is 4.5.
Now consider a state $(v_1, v_2)$. By assumption $v_1 > v_2$.
If $D$ is placed in the first place then the expectation of the end number
is $D\, v_1 + 4.5\, v_2$.
If $D$ is placed in the second place then the expectation of the end number
is $D\, v_2 + 4.5\, v_1$.
Therefore it is optimal to place $D$ in the first place if
$D \ge 5$ and in the second place if $D \le 4$.
Then

$F_2(v_1, v_2) = \frac{1}{10}\left[\sum_{d=5}^{9} (d\, v_1 + 4.5\, v_2) + \sum_{d=0}^{4} (d\, v_2 + 4.5\, v_1)\right] = 5.75\, v_1 + 3.25\, v_2.$
By looking at the last line one can identify the optimal strategy
of placing a spin when three free places are left:
If $D$, the result of a spin, is greater than 5.75
(i.e. if $D \ge 6$)
it is placed in the first free place, if $3.25 < D < 5.75$
(i.e. if $D \in \{4, 5\}$) it is placed in the second free place
and otherwise (i.e. if $D \le 3$) it is placed in the last free place.
Calculating $F_3(v_1, v_2, v_3)$
in a similar way as $F_2(v_1, v_2)$
has been calculated, one finds

$F_3(v_1, v_2, v_3) = \frac{1}{10}\left[\sum_{d=6}^{9} (d\, v_1 + 5.75\, v_2 + 3.25\, v_3) + \sum_{d=4}^{5} (d\, v_2 + 5.75\, v_1 + 3.25\, v_3) + \sum_{d=0}^{3} (d\, v_3 + 5.75\, v_1 + 3.25\, v_2)\right] = 6.45\, v_1 + 4.5\, v_2 + 2.55\, v_3.$

From this, one can read off the optimal strategy for where to place
the first spin in the spinning wheel game with four free places:
if $D > 6.45$ (i.e. $D \ge 7$) it is put in the
first, if $4.5 < D < 6.45$ (i.e. $D \in \{5, 6\}$)
it is put in the second, if $2.55 < D < 4.5$
(i.e. $D \in \{3, 4\}$)
it is put in the third and if $D < 2.55$
(i.e. $D \le 2$)
it is put in the last place.
One finds

$F_4(v_1, v_2, v_3, v_4) = 6.915\, v_1 + 5.285\, v_2 + 3.715\, v_3 + 2.085\, v_4.$

With $v_1 = 1000$, $v_2 = 100$, $v_3 = 10$, $v_4 = 1$, the expectation of the
four digit number under an optimal strategy in the spinning wheel game
turns out to be $F_4(1000, 100, 10, 1) = 7482.735$.
If the outcomes of the spins were placed at random then the
expected value of the four digit number would be 4999.5.
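The hand calculation above is easy to check by computer: the value-based recursion for $F_k$ can be coded directly, again with exact rational arithmetic. The function below is a sketch of that recursion; evaluating it at the actual place values reproduces the hand-derived formulas.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def F(values):
    """Maximal expected end number; `values` = values of the free places."""
    if not values:
        return Fraction(0)
    total = Fraction(0)
    for d in range(10):  # the 10 equally likely spin outcomes
        # best free place i for digit d: earn d*values[i] now, recurse on the rest
        total += max(d * values[i] + F(values[:i] + values[i + 1:])
                     for i in range(len(values)))
    return total / 10

# check the hand-derived formulas F_1, ..., F_4 at the actual place values
print(float(F((1,))))                # 4.5
print(float(F((1000, 100))))         # 5.75*1000 + 3.25*100 = 6075.0
print(float(F((1000, 100, 10))))     # 6.45*1000 + 4.5*100 + 2.55*10 = 6925.5
print(float(F((1000, 100, 10, 1))))  # 7482.735
```

The last call agrees with the value 7482.735 obtained by hand, confirming the coefficients of $F_2$, $F_3$ and $F_4$.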