Solving Large MDPs Quickly with Partitioned Value Iteration
This on-line appendix supports my thesis
"Solving Large MDPs Quickly with Partitioned
Value Iteration". Below are links to movies
which graphically demonstrate the effect different prioritization
metrics have, as well as the source code for the
GPS algorithm and the double-arm pendulum simulator.
All videos run at the same "timescale," with a notable caveat. For
each video, a new frame is generated every time a new partition is
selected. Recall that the GPS algorithm must value iterate over a
partition, which is not shown. Thus, what appears to be one frame in
any of the GPS videos may actually involve multiple iterations over
a given partition. Although sometimes each partition is iterated over
only once, sometimes they are iterated over multiple times; this is
somewhat unfair to value iteration, which always processes each
partition once. However, this does not affect the intuition
that value iteration will take far longer to solve a problem than any
of the other algorithms.
To generate these movies, a slightly modified version of the
traditional mountain car function was employed. Normally, the agent
receives a reward upon exiting the space either to the left or to the
right; the reward on the left is always -1, and the reward on the
right is linear, peaking at a velocity of zero. Instead of this
continuous reward, we opted for a point reward: the agent must exit
the space with a velocity within epsilon of zero. All other rewards
are zero. This does not substantially change the optimal control
policy, but clarifies the way the algorithm functions.
The X axis represents position, the Z axis represents velocity, and
the Y axis represents the value of the states. Red and green colors
show different control policies. A blue color indicates that the state
has never been backed up. A 400x400 state discretization was used.
Partitions were generated by overlaying another 10x10 grid. Epsilon
was set at 0.0001.
The value iteration algorithm running on the Mountain Car
problem. Note how slowly correct information (both value and policy)
backpropagates through the state space, and how much effort is wasted
as value iteration performs useless backups. Variable reordering was
not enabled. Only 752 frames are shown.
The PVI-H1-VRE algorithm running on the Mountain Car problem.
Although value function mass is propagated quickly, it is not
propagated cleanly. Several "steps" are visible, as well as policy
and value imperfections. As a result, the algorithm must "clean up"
the value function by working again and again on some regions. This
run took 406 frames.
The PVI-H2-VRE algorithm running on the Mountain Car problem.
This algorithm tends to ensure that regions are fully converged before
moving on. The policy is always very clean as it propagates
backwards. This run took 234 frames.
The state space for the single-arm pendulum is described by the angle
and the velocity of the link. The agent must learn to balance the
pendulum vertically at a zero degree angle, with zero velocity. There
is a single reward in the system, at the point (0,0). Recall that the
angle is a circular dimension, so the right-hand side and the
left-hand side of these movies are actually connected.
The X axis represents the angle (spanning -pi on the left to +pi on
the right), and the Z axis represents the velocity. The goal state
(0,0) is directly in the middle of the picture. The Y axis represents
the value of the states. Red and green are different control
policies. A blue color indicates that the state has never been backed
up. A 400x400 state discretization was used. Partitions were
generated with another 20x20 grid. Epsilon was set at 0.0001.
The PVI-H1-VRE algorithm running on the Single Arm
Pendulum problem. The "stepping" behavior of the H1 metric is very
clear in this video. Note especially how the "arms" on either side of
the peak must be processed multiple times. This value function is
characterized by many imperfections. The video was terminated after
2230 frames, at which point epsilon was about 0.0009.
The PVI-H2-VRE algorithm running on the Single Arm Pendulum
problem. Again, note how clean the policy and value function are as
they propagate out from the reward. Note also that once the center of
the problem has converged, it is never processed again. This allows
PVI-H2-VRE to converge in less than half the time of PVI-H1-VRE: this
run is 1220 frames long.
The state space for the double-arm pendulum is described by four
variables: two angles and two angular velocities. The agent must
learn to balance the pendulum vertically at a zero degree angle (for
both joints), with zero velocity (for both links). There is a single
reward in the system, at the point (0,0,0,0).
Instead of demonstrating the construction of the value function, this
video demonstrates the double-arm pendulum simulator in action. At
first, the pendulum is allowed is drop straight down. The agent's
control policy is then engaged, and the agent balances the pendulum.
The pendulum is shown being balanced from several random
configurations. This control policy was generated from an MDP with
Download GPS 1.1
The GPS source code is designed to operate on general, discrete MDPs.
Included in the GPS distribution are several companion discretizers
which generate discrete MDPs from continuous state/continuous time
control problems, as well as supporting scripts designed to aid with
Contained in the source code are the models for the Mountain Car, the
Single-Arm Pendulum, and the Double-Arm Pendulum. Some limited
functionality for dumping Geomview meshes is provided.
Download dap-sim 0.1
This is a double-arm pendulum simulator based on the wxWindows cross-platform GUI
library. It allows execution of control policies, or will allow
a user to try and manually balance the pendulum.
The distribution comes with a control policy that successfully
balances the pendulum from any position.