The 24th International Conference on Automated Planning and Scheduling (ICAPS 2014)
Thompson Sampling based Monte-Carlo Planning in POMDPs
Aijun Bai¹, Feng Wu², Zongzhang Zhang³, Xiaoping Chen¹
¹ University of Science & Technology of China   ² University of Southampton   ³ National University of Singapore
June 24, 2014
Table of Contents
◮ Introduction
◮ The approach
◮ Empirical results
◮ Conclusion and future work
Monte-Carlo tree search
◮ An online planning method
◮ Finds near-optimal policies for MDPs and POMDPs
◮ Builds a best-first search tree using Monte-Carlo sampling
◮ Does not require explicit knowledge of the underlying model in advance
MCTS procedure
Figure 1: Outline of Monte-Carlo tree search [Chaslot et al., 2008].
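As an illustration of the selection, expansion, simulation, and backpropagation loop outlined in Figure 1, the following minimal Python sketch runs one generic MCTS iteration against a generative simulator. The `Node` class, the `simulator(state, action) -> (next_state, reward)` interface, and the random rollout policy are assumptions made for this sketch, not the implementation used in the paper.

```python
import random

class Node:
    """A hypothetical search-tree node; not the paper's data structure."""
    def __init__(self):
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # running mean of simulation returns

def mcts_iteration(node, state, simulator, actions, depth=0, max_depth=50, gamma=0.95):
    """One MCTS iteration: selection, expansion, simulation, backpropagation."""
    if depth >= max_depth:
        return 0.0
    if not node.children:                        # expansion: add child actions once,
        node.children = {a: Node() for a in actions}
        return rollout(state, simulator, actions, depth, max_depth, gamma)  # then simulate
    a = select_action(node)                      # selection
    next_state, reward = simulator(state, a)
    ret = reward + gamma * mcts_iteration(node.children[a], next_state, simulator,
                                          actions, depth + 1, max_depth, gamma)
    child = node.children[a]                     # backpropagation: update statistics
    child.visits += 1
    child.value += (ret - child.value) / child.visits
    node.visits += 1
    return ret

def rollout(state, simulator, actions, depth, max_depth, gamma):
    """Default (random) policy used below the tree."""
    ret, discount = 0.0, 1.0
    for _ in range(depth, max_depth):
        state, reward = simulator(state, random.choice(actions))
        ret += discount * reward
        discount *= gamma
    return ret

def select_action(node):
    # placeholder: greedy w.r.t. current value estimates; a real planner would use
    # an exploration rule such as UCB1 (next slides) or Thompson sampling
    return max(node.children, key=lambda a: node.children[a].value)
```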
Resulting asymmetric search tree
Figure 2: An example of the resulting asymmetric search tree [Coquelin and Munos, 2007].
The exploration vs. exploitation dilemma
◮ A fundamental problem in MCTS:
  1. Must not only exploit by selecting the action that currently seems best
  2. Should also keep exploring for possibly higher future outcomes
◮ Can be seen as a multi-armed bandit (MAB) problem:
  1. A set of actions $A$
  2. An unknown stochastic reward function $R(a) := X_a$
◮ Cumulative regret (CR):
  $R_T = \mathbb{E}\left[ \sum_{t=1}^{T} (X_{a^*} - X_{a_t}) \right]$   (1)
◮ Minimize CR by trading off between exploration and exploitation
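To make Eq. (1) concrete, here is a small sketch that estimates the cumulative regret of an arbitrary bandit policy on a simulated Bernoulli bandit. The `select_action(history)` interface and the arm means are illustrative assumptions, not part of the paper.

```python
import random

def cumulative_regret(select_action, means, horizon=1000):
    """Monte-Carlo estimate of Eq. (1): sum over t of the expected gap X_{a*} - X_{a_t}.

    `select_action(history)` is any bandit policy; `means` holds the hidden
    Bernoulli mean of each arm (purely illustrative values).
    """
    best = max(means)
    history = []                   # list of (arm, observed reward) pairs
    regret = 0.0
    for _ in range(horizon):
        a = select_action(history)
        reward = 1.0 if random.random() < means[a] else 0.0
        history.append((a, reward))
        regret += best - means[a]  # expected per-step regret of the chosen arm
    return regret
```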
UCB1 heuristics
◮ POMCP algorithm [Silver and Veness, 2010]:
  $\mathrm{UCB1}(h, a) = \bar{Q}(h, a) + c \sqrt{\dfrac{\log N(h)}{N(h, a)}}$   (2)
◮ $\bar{Q}(h, a)$ is the mean outcome of applying action $a$ in history $h$
◮ $N(h, a)$ is the visitation count of action $a$ following $h$
◮ $N(h) = \sum_{a \in A} N(h, a)$ is the overall count
◮ $c$ is the exploration constant
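A minimal sketch of the UCB1 rule in Eq. (2), assuming a history node that stores the per-action counts $N(h, a)$ and means $\bar{Q}(h, a)$; the field names `node.counts` and `node.q` are illustrative, not POMCP's actual data structure.

```python
import math

def ucb1_action(node, c=1.0):
    """Pick the action maximizing Eq. (2) at a history node."""
    n_h = sum(node.counts.values())              # N(h)
    def score(a):
        if node.counts[a] == 0:                  # try every action at least once
            return float("inf")
        return node.q[a] + c * math.sqrt(math.log(n_h) / node.counts[a])
    return max(node.counts, key=score)
```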
Balancing between CR and SR in MCTS
◮ Simple regret (SR):
  $r_n = \mathbb{E}[X_{a^*} - X_{\bar{a}}]$, where $\bar{a} = \arg\max_{a \in A} \bar{X}_a$   (3)
◮ Makes more sense for pure exploration
◮ A recently growing understanding: balance between CR and SR [Feldman and Domshlak, 2012]
  1. The planner does not collect real rewards while searching the tree
  2. It is still beneficial to grow the tree more accurately by exploiting the current tree
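The simple regret of Eq. (3) can be written down directly; the sketch below assumes dictionaries mapping arms to empirical means and to true means, which are stand-ins for quantities a planner only knows approximately.

```python
def simple_regret(empirical_means, true_means):
    """Eq. (3): regret of recommending the empirically best arm once, after search."""
    recommended = max(empirical_means, key=empirical_means.get)  # a-bar = argmax X-bar_a
    return max(true_means.values()) - true_means[recommended]
```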
Thompson sampling
◮ Select an action according to its posterior probability of being optimal:
  $P(a) = \int \mathbb{1}\left[ a = \arg\max_{a'} \mathbb{E}[X_{a'} \mid \theta_{a'}] \right] \prod_{a'} P_{a'}(\theta_{a'} \mid Z) \, d\theta$   (4)
  1. $\theta_a$ specifies the unknown distribution of $X_a$
  2. $\theta = (\theta_{a_1}, \theta_{a_2}, \dots)$ is the vector of all hidden parameters
◮ Can be implemented efficiently by sampling (see the sketch below):
  1. Sample a set of hidden parameters $\theta_a$ from their posteriors
  2. Select the action with the highest expectation $\mathbb{E}[X_a \mid \theta_a]$
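The sampling view of Eq. (4) fits in a few lines. The `posteriors[a].sample()` and `expected_reward(a, theta)` interfaces are assumed here for illustration; they stand for drawing $\theta_a$ from $P_a(\theta_a \mid Z)$ and evaluating $\mathbb{E}[X_a \mid \theta_a]$.

```python
def thompson_select(posteriors, expected_reward):
    """One Thompson-sampling decision: sample each arm's parameters, pick the best."""
    sampled = {a: post.sample() for a, post in posteriors.items()}  # one theta per arm
    return max(sampled, key=lambda a: expected_reward(a, sampled[a]))
```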
An example of Thompson sampling
◮ 2-armed bandit with arms $a$ and $b$
◮ Bernoulli reward distributions with hidden parameters $p_a$ and $p_b$
◮ Prior distributions: $p_a \sim \mathrm{Uniform}(0, 1)$, $p_b \sim \mathrm{Uniform}(0, 1)$
◮ History: $a$, 1, $b$, 0, $a$, 0
◮ Posterior distributions: $p_a \sim \mathrm{Beta}(2, 2)$, $p_b \sim \mathrm{Beta}(1, 2)$
◮ Sample $p_a$ and $p_b$ from the posteriors
◮ Compare $\mathbb{E}[X_a \mid p_a]$ and $\mathbb{E}[X_b \mid p_b]$
Figure 3: Posterior distributions: (a) $\mathrm{Beta}(2, 2)$; (b) $\mathrm{Beta}(1, 2)$.
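The same two-armed example in code, under the assumption that the Uniform(0, 1) prior is Beta(1, 1) so that the counts on the slide give the posteriors directly; this is only a worked illustration of the slide, not the paper's planner.

```python
import random

# Posteriors after the history a,1  b,0  a,0 with Beta(1, 1) priors:
# arm a saw one success and one failure -> Beta(2, 2); arm b saw one failure -> Beta(1, 2).
posterior = {"a": (2, 2), "b": (1, 2)}

# One Thompson-sampling decision: for Bernoulli arms E[X | p] = p,
# so comparing expectations reduces to comparing the sampled success probabilities.
p = {arm: random.betavariate(*params) for arm, params in posterior.items()}
chosen = max(p, key=p.get)
print(f"sampled p_a={p['a']:.3f}, p_b={p['b']:.3f} -> pull arm {chosen}")
```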
Motivation
◮ Thompson sampling:
  1. Theoretically achieves asymptotic optimality for MABs in terms of CR
  2. Empirically performs competitively with, and often better than, the state of the art in terms of both CR and SR
◮ Seems to be a promising approach to the challenge of balancing CR and SR
Contribution
◮ A complete Bayesian approach for online Monte-Carlo planning in POMDPs:
  1. Maintain the posterior reward distribution of applying an action
  2. Use Thompson sampling to guide the action selection
Bayesian modeling and inference
◮ $X_{b,a}$: the immediate reward of performing action $a$ in belief $b$
◮ A finite set of possible immediate rewards: $\mathcal{I} = \{r_1, r_2, \dots, r_k\}$
◮ $X_{b,a} \sim \mathrm{Multinomial}(p_1, p_2, \dots, p_k)$
  1. $p_i = \sum_{s \in S} \mathbb{1}[R(s, a) = r_i] \, b(s)$
  2. $\sum_{i=1}^{k} p_i = 1$
◮ $(p_1, p_2, \dots, p_k) \sim \mathrm{Dirichlet}(\psi_{b,a})$, where $\psi_{b,a} = (\psi_{b,a,r_1}, \psi_{b,a,r_2}, \dots, \psi_{b,a,r_k})$
◮ Observing $r$:
  $\psi_{b,a,r} \leftarrow \psi_{b,a,r} + 1$   (5)
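A sketch of the book-keeping implied by Eq. (5): Dirichlet pseudo-counts over the possible immediate rewards of a (belief, action) pair, plus sampling of the reward probabilities via normalized Gamma variates. The class and field names are illustrative, not the paper's code.

```python
import random

class RewardPosterior:
    """Dirichlet posterior over the immediate-reward distribution of one (b, a)."""
    def __init__(self, rewards, prior=1.0):
        # one pseudo-count psi_{b,a,r} per possible reward value r in I
        self.psi = {r: prior for r in rewards}

    def observe(self, r):
        self.psi[r] += 1.0                        # Eq. (5)

    def sample_probs(self):
        # draw (p_1, ..., p_k) ~ Dirichlet(psi) by normalizing Gamma(psi_r, 1) draws
        g = {r: random.gammavariate(c, 1.0) for r, c in self.psi.items()}
        total = sum(g.values())
        return {r: x / total for r, x in g.items()}
```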
Bayesian modeling and inference
◮ $X_{s,b,\pi}$: the cumulative reward of following policy $\pi$ in joint state $\langle s, b \rangle$
◮ $X_{s,b,\pi} \sim \mathcal{N}(\mu_{s,b}, 1/\tau_{s,b})$ (according to the CLT on Markov chains)
◮ $(\mu_{s,b}, \tau_{s,b}) \sim \mathrm{NormalGamma}(\mu_{s,b,0}, \lambda_{s,b}, \alpha_{s,b}, \beta_{s,b})$
◮ Observing $v$:
  $\mu_{s,b,0} \leftarrow \dfrac{\lambda_{s,b} \mu_{s,b,0} + v}{\lambda_{s,b} + 1}$   (6)
  $\lambda_{s,b} \leftarrow \lambda_{s,b} + 1$   (7)
  $\alpha_{s,b} \leftarrow \alpha_{s,b} + \dfrac{1}{2}$   (8)
  $\beta_{s,b} \leftarrow \beta_{s,b} + \dfrac{1}{2} \dfrac{\lambda_{s,b} (v - \mu_{s,b,0})^2}{\lambda_{s,b} + 1}$   (9)
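A sketch of the conjugate single-observation update in Eqs. (6)-(9), together with a helper that draws a mean $\mu_{s,b}$ from the posterior (as Thompson sampling will need later). Field names and default hyperparameters are illustrative assumptions.

```python
import random

class NormalGammaPosterior:
    """NormalGamma posterior over (mu, tau) for a return X_{s,b,pi}."""
    def __init__(self, mu0=0.0, lam=1.0, alpha=1.0, beta=1.0):
        self.mu0, self.lam, self.alpha, self.beta = mu0, lam, alpha, beta

    def observe(self, v):
        # Eqs. (6)-(9); the old lam and mu0 appear on the right-hand sides
        new_mu0 = (self.lam * self.mu0 + v) / (self.lam + 1.0)
        self.beta += self.lam * (v - self.mu0) ** 2 / (2.0 * (self.lam + 1.0))
        self.mu0 = new_mu0
        self.lam += 1.0
        self.alpha += 0.5

    def sample_mean(self):
        # tau ~ Gamma(alpha, rate=beta), then mu ~ Normal(mu0, 1 / (lam * tau))
        tau = random.gammavariate(self.alpha, 1.0 / self.beta)
        return random.gauss(self.mu0, (1.0 / (self.lam * tau)) ** 0.5)
```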
Bayesian modeling and inference
◮ $X_{b,\pi}$: the cumulative reward of following policy $\pi$ in belief $b$
◮ $X_{b,\pi}$ follows a mixture of Normal distributions:
  $f_{X_{b,\pi}}(x) = \sum_{s \in S} b(s) \, f_{X_{s,b,\pi}}(x)$   (10)
◮ $X_{b,a,\pi}$: the cumulative reward of applying $a$ in belief $b$ and following policy $\pi$:
  $X_{b,a,\pi} = X_{b,a} + \gamma X_{b',\pi}$   (11)
◮ Expectation of $X_{b,a,\pi}$:
  $\mathbb{E}[X_{b,a,\pi}] = \mathbb{E}[X_{b,a}] + \gamma \sum_{o \in O} \mathbb{1}[b' = \zeta(b, a, o)] \, \Omega(o \mid b, a) \, \mathbb{E}[X_{b',\pi}]$   (12)
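Two small helpers spell out Eqs. (10) and (12) numerically; the input dictionaries (belief weights, per-state means, observation probabilities, successor-belief means) are illustrative stand-ins for quantities the planner estimates.

```python
def belief_return_mean(belief, state_means):
    """Mean of the mixture in Eq. (10): E[X_{b,pi}] = sum_s b(s) * mu_{s,b}."""
    return sum(belief[s] * state_means[s] for s in belief)

def expected_q(reward_mean, obs_probs, next_belief_means, gamma=0.95):
    """Eq. (12): E[X_{b,a,pi}] = E[X_{b,a}] + gamma * sum_o Omega(o|b,a) * E[X_{b',pi}].

    Indexing by the observation o absorbs the indicator 1[b' = zeta(b, a, o)]:
    `next_belief_means[o]` is the mean return from the successor belief of o.
    """
    return reward_mean + gamma * sum(obs_probs[o] * next_belief_means[o]
                                     for o in obs_probs)
```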
Bayesian modeling and inference
◮ $\Omega(\cdot \mid b, a) \sim \mathrm{Dirichlet}(\rho_{b,a})$
◮ $\rho_{b,a} = (\rho_{b,a,o_1}, \rho_{b,a,o_2}, \dots)$
◮ Observing a transition $(b, a) \rightarrow o$:
  $\rho_{b,a,o} \leftarrow \rho_{b,a,o} + 1$   (13)
Thompson sampling based action selection
◮ Decision node with belief $b$
◮ Sample a set of parameters:
  1. $\{w_{b,a,o}\} \sim \mathrm{Dirichlet}(\rho_{b,a})$
  2. $\{w_{b,a,r}\} \sim \mathrm{Dirichlet}(\psi_{b,a})$
  3. $\{\mu_{s',b'}\} \sim \mathrm{NormalGamma}(\mu_{s',b',0}, \lambda_{s',b'}, \alpha_{s',b'}, \beta_{s',b'})$, where $b' = \zeta(b, a, o)$
◮ Select the action with the highest expectation, i.e., the sampled $\tilde{Q}$ value:
  $\tilde{Q}(b, a) = \sum_{r \in \mathcal{I}} w_{b,a,r} \, r + \gamma \sum_{o \in O} \mathbb{1}[b' = \zeta(b, a, o)] \, w_{b,a,o} \sum_{s' \in S} \mu_{s',b'} \, b'(s')$   (14)
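A sketch of this selection step, combining the Dirichlet and NormalGamma posteriors from the previous slides into the sampled $\tilde{Q}$ of Eq. (14). The node layout (`psi`, `rho`, `next_beliefs`, `mu_posteriors`) and the reuse of the `NormalGammaPosterior` class sketched after Eqs. (6)-(9) are assumptions for illustration, not the paper's actual data structures.

```python
import random

def dirichlet_sample(counts):
    """Draw a probability vector from Dirichlet(counts) via normalized Gamma draws."""
    g = {k: random.gammavariate(c, 1.0) for k, c in counts.items()}
    total = sum(g.values())
    return {k: x / total for k, x in g.items()}

def sampled_q(psi, rho, next_beliefs, mu_posteriors, gamma=0.95):
    """Sampled Q-tilde(b, a) of Eq. (14) for a single action a.

    `psi` / `rho` are Dirichlet counts over rewards / observations for (b, a);
    `next_beliefs[o]` is the successor belief b' = zeta(b, a, o) as a dict s' -> b'(s');
    `mu_posteriors[(s', o)]` is a NormalGammaPosterior for mu_{s', b'}.
    """
    w_r = dirichlet_sample(psi)                  # {w_{b,a,r}}
    w_o = dirichlet_sample(rho)                  # {w_{b,a,o}}
    q = sum(w_r[r] * r for r in w_r)             # sampled immediate-reward term
    for o, b_next in next_beliefs.items():       # discounted future term
        q += gamma * w_o[o] * sum(mu_posteriors[(s, o)].sample_mean() * p
                                  for s, p in b_next.items())
    return q

def thompson_action(node, gamma=0.95):
    # hypothetical node layout: per-action psi, rho, successor beliefs, mu posteriors
    return max(node.actions,
               key=lambda a: sampled_q(node.psi[a], node.rho[a],
                                       node.next_beliefs[a], node.mu_posteriors[a], gamma))
```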
Experiments
◮ D²NG-POMCP: Dirichlet-Dirichlet-NormalGamma partially observable Monte-Carlo planning
◮ RockSample and PocMan domains
◮ Evaluation procedure (a sketch of this loop follows below):
  1. Run the algorithms for a number of iterations from the current belief
  2. Apply the best action based on the resulting action-values
  3. Repeat until a terminating condition is reached (goal state or maximal number of steps)
  4. Report the total discounted reward and the average time usage per action
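The evaluation loop described above, written out under assumed `planner` and `env` interfaces (search from the current belief, step the environment, update the belief); these names stand in for whatever implementation is used and are not the paper's code.

```python
def evaluate_episode(planner, env, max_steps=100, gamma=0.95, iterations=10000):
    """One evaluation episode: plan, act, repeat, report the discounted return."""
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        for _ in range(iterations):          # 1. run the search for a fixed budget
            planner.search_iteration()
        action = planner.best_action()       # 2. act greedily w.r.t. the action-values
        obs, reward, done = env.step(action)
        planner.update_belief(action, obs)   # move the search to the new belief
        total += discount * reward
        discount *= gamma
        if done:                             # 3. stop at the goal or the step limit
            break
    return total                             # 4. total discounted reward
```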
Experimental results
Figure 4: Performance of D²NG-POMCP versus POMCP in RockSample and PocMan. Panels (a)-(h) plot the average discounted return against the number of iterations and against the average time per action (in seconds) for RS[7, 8], RS[11, 11], RS[15, 15], and PocMan.