Fast approximate planning in POMDPs Geoff Gordon ggordon@cs.cmu.edu Joelle Pineau, Geoff Gordon, Sebastian Thrun. Point-based value iteration: an anytime algorithm for POMDPs Fast approximate planningin POMDPs – p.1/37
Overview POMDPs are too slow Fast approximate planningin POMDPs – p.2/37
Overview POMDPs are too slow Fast approximate planningin POMDPs – p.3/37
Overview Review of POMDPs Review of POMDP value iteration algorithms Point-based value iteration Theoretical results Actual results Fast approximate planningin POMDPs – p.4/37
POMDP overview Planning in an uncertain world Actions have random effects Don’t observe full world state Fast approximate planningin POMDPs – p.5/37
POMDP definition State x ∈ X , actions a ∈ A , observations z ∈ Z Rewards r a (column vectors), discount γ ∈ [0 , 1) Belief b ∈ P ( X ) (row vectors) Starting belief b 0 1 0.8 0.6 0.4 0.2 0 0 0.2 1 0.4 0.8 0.6 0.6 0.4 0.8 0.2 1 0 Fast approximate planningin POMDPs – p.6/37
POMDP definition cont’d Transitions b → bT a ( T a stochastic) Observation likelihoods w z (row vectors) � w z = 1 z Observation update: b ← w z × b · η where × is pointwise multiplication Fast approximate planningin POMDPs – p.7/37
Value functions Just like MDP value function (but bigger) V ( b ) = expected total discounted future reward starting from b Knowing V means planning is 1-step lookahead If we discretize belief simplex, we are “done” From b get to b z 1 , b z 2 , . . . according to P ( z | b, a ) Fast approximate planningin POMDPs – p.8/37
Value functions Additional structure: convexity Consider beliefs b 1 , b 2 , b 3 = b 1 + b 2 2 b 3 : flip a coin, then start in b 1 if heads, b 2 if tails b 3 is always worse than average of b 1 , b 2 Fast approximate planningin POMDPs – p.9/37
Representation Represent V as the upper surface of a (possibly infinite) set of hyperplanes V is set of hyperplanes 3 2.8 Hyperplanes represented 2.6 2.4 by normals v (column 2.2 vectors) 2 1.8 1.6 V ( b ) = max v ∈V b · v 1.4 1.2 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Fast approximate planningin POMDPs – p.10/37
Value iteration Bellman’s equation: V ( b ) = max Q ( b, a ) a � Q ( b, a ) = r a + γ P ( z | b, a ) V ( b az ) z where b az = η ( bT a ) × w z Fast approximate planningin POMDPs – p.11/37
Convergence Backup operator T : V ← TV T is a contraction on P ( X ) �→ R � b − b ′ � = max x | b ( x ) − b ′ ( x ) | 3 2.8 2.8 2.7 2.6 2.6 2.4 2.5 �→ 2.2 2.4 2 2.3 1.8 2.2 1.6 2.1 1.4 2 1.2 1.9 1 1.8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Fast approximate planningin POMDPs – p.12/37
Sondik’s algorithm (1972) Rearrange Bellman equation to make it linear: η − 1 = P ( z | b, a ) , and V ( ηb ) = ηV ( b ) , so � Q ( b, a ) = r a + γ V (( bT a ) × w z ) z � = r a + γ max v ∈V (( bT a ) × w z ) · v z � = r a + γ max v ∈V b · T a ( w z × v ) z Fast approximate planningin POMDPs – p.13/37
Evaluate from inside out Suppose V t ( b ) = b · v v z = w z × v v az = γT a v z v a = v az 1 + v az 2 + . . . V ′ = { v a 1 , v a 2 , . . . } Now V t +1 ( b ) = max v ∈V ′ b · v Fast approximate planningin POMDPs – p.14/37
More than 1 hyperplane Suppose V t ( b ) = max v ∈V b · v V z = w z × V set ops are elementwise V az = γT a V z V a = r a + V az 1 ⊕ V az 2 ⊕ . . . expensive! V ′ = V a 1 ∪ V a 2 ∪ . . . Now V t +1 ( b ) = max v ∈V ′ b · v above representation due to [Cassandra et al] Fast approximate planningin POMDPs – p.15/37
A note on complexity Or, some very large numbers Set Comment Total size Time/element V z same size as V | Z | |V| O ( | X | ) O ( | X | 2 ) V az still same size | A | | Z | |V| | A | |V| | Z | V a big! O ( | X | ) For example, w/ 5 actions, 5 observations: 1 , 5 , 15625 , 4 . 6566 × 10 21 , 1 . 0948 × 10 109 , . . . Fast approximate planningin POMDPs – p.16/37
Witnesses (Littman 1994) Don’t need all elements of V Just those which are arg max b · v for some b If we have the b (a 3 2.8 witness ), fast to check 2.6 2.4 that v is indeed arg max 2.2 2 1.8 1.6 1.4 1.2 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Fast approximate planningin POMDPs – p.17/37
Witness details Linear feasibility problem (size about |V| × | X | ) b · v ≥ b · v i ∀ i b · 1 = 1 b ≥ 0 Solve one LF per element of V —expensive, but well worth it Can add margin ǫ > 0 for approximate solution • don’t have to have all witnesses Fast approximate planningin POMDPs – p.18/37
Incremental pruning (Cassandra, Littman, Zhang 1997) Prune V z , V az , and V a as they are constructed Another big win in runtime We are now up to 16-state POMDPs Fast approximate planningin POMDPs – p.19/37
Summary so far Solve POMDPs by repeatedly applying backup T Represent V with set of hyperplanes V V grows fast Can prune V using witnesses Fast approximate planningin POMDPs – p.20/37
Plan for rest of talk Better use of witnesses: point backups Better way to find witnesses: exploration PBVI = point backups + exploration for witnesses PBVI examples Fast approximate planningin POMDPs – p.21/37
Backups at a point Computing witnesses is expensive What if we knew a witness b already? Fast to compute both V ( b ) and d db V ( b ) Intuitive, then formal derivation Fast approximate planningin POMDPs – p.22/37
Point backup—intuition V ( b ′ ) depends on P ( z | b, a ) b az for all a, z P ( z | b, a ) b az are linear functions of b V ( P ( z | b, a ) b az ) is scaled/shifted copy of V Adding these copies: hard over P ( X ) , easy at b 2.8 2.7 2.6 2.5 2.4 2.3 2.2 2.1 2 1.9 1.8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Fast approximate planningin POMDPs – p.23/37
Point backup—math When V → V ′ , we want max v ∈V ′ b · v That’s max a max v ∈V a b · v , since V ′ = V a 1 ∪ V a 2 . . . But max v ∈V a b · v is max b · v 1 + max b · v 2 + . . . v 1 ∈V az 1 v 2 ∈V az 2 since any v ∈ V a is v 1 + v 2 + . . . . . . and V az is quick to compute. Fast approximate planningin POMDPs – p.24/37
Advantage of point-based backups Suppose we have a set B of witnesses and V of hyperplanes Pruning V takes time O ( | B | |V| | X | ) (w/ small constant) Without knowing witnesses, solve |V| LFs, each |V| × | X | Higher order, worse constants Fast approximate planningin POMDPs – p.25/37
Where do witnesses come from? Grids (note difference to discretizing belief simplex) Random (Poon 2001) Interleave point-based with incremental pruning (Zhang & Zhang 2000) We are now up to 90-state POMDPs Fast approximate planningin POMDPs – p.26/37
New theorem Bound error of the point-based backup operator Bound depends on how densely we sample reachable beliefs Probably exists an extension to “easily reachable” beliefs Error bound on one step + contraction of value iteration = overall error bound First result of this sort for POMDP VI Fast approximate planningin POMDPs – p.27/37
Definitions Let ∆ be the set of reachable beliefs Let B be a set of witnesses Let ǫ ( B ) be the worst-case density of B in ∆ : b ∈ B � b − b ′ � 1 ǫ ( B ) = max b ′ ∈ ∆ min Fast approximate planningin POMDPs – p.28/37
Theorem A single point-based backup’s error is ǫ ( B )( R max − R min ) 1 − γ That means the error after value iteration is ǫ ( B )( R max − R min ) (1 − γ ) 2 plus a bit for stopping at finite horizon Fast approximate planningin POMDPs – p.29/37
Policy error We therefore have that policy error is: ǫ ( B )( R max − R min ) (1 − γ ) 3 (1 − γ ) 3 , ouch! But it does go to 0 as ǫ ( B ) → 0 Fast approximate planningin POMDPs – p.30/37
Exploration Theorem tells us we want to sample reachable beliefs with high worst-case 1-norm density We can do this by simulating forward from b 0 Generate a set of candidate witnesses Accept those which are farthest (1-norm) from current set Fast approximate planningin POMDPs – p.31/37
Selecting new witnesses . . . . . * . . Fast approximate planningin POMDPs – p.32/37
Summary of algorithm B ← { b 0 } V = { 0 } (or whatever—e.g., use QMDP) Do some point-based backups on V using B • we backup k times, where γ k is small Add more beliefs to B • we double the size of B each time Repeat Fast approximate planningin POMDPs – p.33/37
Tag problem 870 states, 2 × 29 observations, 5 actions fixed opponent policy Fast approximate planningin POMDPs – p.34/37
Results Fast approximate planningin POMDPs – p.35/37
Results Catches opponent 60% of time Don’t know of another value iteration algorithm which could do this well On smaller problems, gets policies as good as other algorithms But uses a small fraction of the compute time Fast approximate planningin POMDPs – p.36/37
Recommend
More recommend