Sample-Based Methods for Continuous Action Markov Decision Processes
Chris Mansley, Ari Weinstein, Michael Littman (Rutgers University)
From Learning to Planning: the Bellman Equation
• Continuous state space: standard machine learning approaches to function approximation have proven successful!
• Continuous action space: very little work addressing how to evaluate the maximum
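For concreteness, the planning problem behind these two bullets is the Bellman optimality equation (standard textbook form, not reproduced from the slides):

$$V^*(s) = \max_{a \in A}\Big[R(s,a) + \gamma \int_{S} T(s' \mid s, a)\, V^*(s')\, ds'\Big]$$

Function approximation handles estimating $V^*$ over a continuous state space; the gap the talk targets is evaluating the max when the action set $A$ is continuous.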
Sparse Sampling [Kearns et al., 1999]
• An ε-optimal planning algorithm for discounted MDPs
• Number of samples independent of state space size!
• Requires too many samples!
[Figure: sparse-sampling lookahead tree alternating state nodes (S0, S1, ...) and action nodes (A1, A2)]
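A minimal sketch of the sparse-sampling recursion, assuming a generative-model interface `sim(state, action) -> (next_state, reward)`; the names and constants are illustrative, not taken from the paper or the slides:

```python
def sparse_sample_value(sim, state, actions, depth, C, gamma):
    """Estimate V*(state) by recursive lookahead, drawing C successor
    samples per action.  Cost is independent of |S| but grows as
    (|A| * C)^depth, which is why the method needs too many samples."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(C):                       # C sampled successors per action
            next_state, reward = sim(state, a)   # generative model call
            total += reward + gamma * sparse_sample_value(
                sim, next_state, actions, depth - 1, C, gamma)
        best = max(best, total / C)
    return best
```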
Can we use ideas from the exploration/exploitation problem to better direct our search?
UCB1 [Auer et al., 2002]
• An algorithm for efficient learning in the bandit domain
• Fixed number of discrete actions with bounded support
• Choose an arm greedily according to the following rule:
$$\hat{\mu}_i + \sqrt{\frac{2\ln n}{n_i}}$$
[Figure: two-armed bandit example with stochastic +1/-1 payoffs]
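A minimal sketch of the UCB1 selection rule above, with arm statistics kept in plain lists (variable names are illustrative):

```python
import math

def ucb1_select(counts, means):
    """Return the index of the arm maximizing  mean_i + sqrt(2 ln n / n_i)."""
    n = sum(counts)
    # Play every arm once before applying the confidence bonus.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    scores = [means[i] + math.sqrt(2.0 * math.log(n) / counts[i])
              for i in range(len(counts))]
    return max(range(len(counts)), key=lambda i: scores[i])
```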
UCT [Kocsis & Szepesvári, 2006]
• Upper Confidence applied to Trees
• Takes the UCB algorithm and extends it to the full MDP domain
• Builds a tree similar to Sparse Sampling, but instead of a breadth-first search performs a depth-first search directed by UCB at each node
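A simplified sketch of one UCT rollout under the description above. It assumes hashable states and a lookup `node_for(state)` that returns the statistics node for a state, plus the same generative `sim` interface as before; real UCT also uses a default rollout policy beyond the tree, which is omitted here:

```python
import math
from collections import defaultdict

class UCTNode:
    """Per-state statistics: visit counts and mean returns for each discrete action."""
    def __init__(self, actions):
        self.actions = actions
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def select(self):
        n = sum(self.counts.values()) or 1
        def score(a):
            if self.counts[a] == 0:
                return float("inf")          # try every action once
            return self.means[a] + math.sqrt(2.0 * math.log(n) / self.counts[a])
        return max(self.actions, key=score)

def uct_rollout(node_for, sim, state, depth, gamma):
    """One depth-first rollout: UCB selection at every node, then back up the return."""
    if depth == 0:
        return 0.0
    node = node_for(state)
    a = node.select()
    next_state, reward = sim(state, a)
    ret = reward + gamma * uct_rollout(node_for, sim, next_state, depth - 1, gamma)
    node.counts[a] += 1
    node.means[a] += (ret - node.means[a]) / node.counts[a]
    return ret
```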
UCT, cont... [Kocsis & Szepesvári, 2006]
[Figure: the UCT tree after Rounds 1-3, growing deeper along the sampled trajectories (S0 → S1/S4 → S9/S12/S14 → ...)]
HOO [Bubeck et al., 2008]
• UCT is still restricted to discrete states and actions
• HOO (hierarchical optimistic optimization) provides similar guarantees to UCB in "well-behaved" continuous bandit problems
• The idea is simple: divide the action space up (similar to a KD-tree), keep track of returns in these volumes, and provide exploration bonuses for both the number of samples and the size of each subdivision
HOO, cont... [Bubeck et al., 2008]
• Choose an arm greedily with respect to the following:
$$\hat{\mu}_i + \sqrt{\frac{2\ln n}{n_i}} + \nu_1 \rho^{h}$$
(the last term bounds diam(i), the diameter of subdivision i at depth h)
• Very similar to UCB except for the spatial term at the end
• The intuition is that arms with large volumes and few samples are poorly known, while arms with small volumes and many samples are well known
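A simplified 1-D sketch of the selection rule above. It descends the partition tree greedily on U-values; the full HOO algorithm descends on backed-up B-values and works with general dissimilarity metrics. All names and the default constants are illustrative:

```python
import math

class HOONode:
    """Node covering the 1-D action interval [lo, hi] at depth h in the HOO tree."""
    def __init__(self, lo, hi, depth, nu1=1.0, rho=0.5):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.nu1, self.rho = nu1, rho
        self.count, self.mean = 0, 0.0
        self.children = None

    def u_value(self, n_total):
        """mean + exploration bonus + spatial term nu1 * rho^depth."""
        if self.count == 0:
            return float("inf")
        bonus = math.sqrt(2.0 * math.log(n_total) / self.count)
        return self.mean + bonus + self.nu1 * self.rho ** self.depth

    def split(self):
        mid = (self.lo + self.hi) / 2.0
        self.children = [HOONode(self.lo, mid, self.depth + 1, self.nu1, self.rho),
                         HOONode(mid, self.hi, self.depth + 1, self.nu1, self.rho)]

def hoo_select(root, n_total):
    """Walk down greedily by U-value, expand the leaf reached, and return
    an action (the leaf's midpoint) plus the path for the later update."""
    node, path = root, [root]
    while node.children:
        node = max(node.children, key=lambda c: c.u_value(n_total))
        path.append(node)
    node.split()
    return (node.lo + node.hi) / 2.0, path

def hoo_update(path, reward):
    """Update counts and running means along the selected path."""
    for node in path:
        node.count += 1
        node.mean += (reward - node.mean) / node.count
```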
HOO, cont... [Bubeck et al., 2008]
[Figure: HOO's hierarchical partitioning of the action space, courtesy of Rémi Munos]
UCB vs HOO
HOOT
• Our idea is to replace UCB in UCT with HOO, so that we can work directly in the continuous action space
• This leads to our algorithm, HOO applied to Trees (HOOT)
• The algorithm is exactly the same as UCT, but instead of using UCB at each internal node, we maintain a HOO tree
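Reusing the hypothetical `HOONode` / `hoo_select` / `hoo_update` sketch above, HOOT can be sketched as a UCT-style rollout that consults a HOO tree at each node; `hoo_for` and `sim` are assumed interfaces, not taken from the paper:

```python
class HOOActionSelector:
    """One HOO tree over a 1-D action range [lo, hi], owned by a single tree node."""
    def __init__(self, lo, hi):
        self.root = HOONode(lo, hi, depth=0)
        self.total_pulls = 0

def hoot_rollout(hoo_for, sim, state, depth, gamma):
    """One HOOT rollout: UCT-style depth-first search, except each node picks
    a continuous action via its HOO tree and feeds the sampled return back."""
    if depth == 0:
        return 0.0
    selector = hoo_for(state, depth)            # assumed lookup: one HOO tree per node
    selector.total_pulls += 1
    action, path = hoo_select(selector.root, selector.total_pulls)
    next_state, reward = sim(state, action)
    ret = reward + gamma * hoot_rollout(hoo_for, sim, next_state, depth - 1, gamma)
    hoo_update(path, ret)
    return ret
```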
Empirical Results
[Figure, three panels: Double Integrator (1-D), total reward vs. samples per planning step (log scale), HOOT vs. UCT; Discretized Double Integrator (1-D), total reward vs. number of discrete actions, HOOT vs. UCT with 5/11/15 actions; Discretized Double Integrator, total reward vs. number of action dimensions, HOOT vs. UCT with 5/10/20 actions per dimension]
Empirical Results
[Figure, two panels for the Bicycle domain (0.02 cm): total reward vs. samples per planning step (log scale), HOOT vs. UCT; total reward vs. number of discretizations per action dimension (3-15), HOOT vs. UCT with 5/10/20 actions]
Future Work
• Use HOO to optimize the n-step sequence of actions as a single n-dimensional action space
• Extend to continuous state spaces via weighted interpolation between representative HOO trees
Summary
• Choosing action discretizations is non-trivial!
• If you have a distance metric and your value function is locally smooth, use HOOT rather than vanilla UCT!
Thanks!