

  1. Online Knowledge Enhancements for Monte Carlo Tree Search in Probabilistic Planning. Bachelor presentation by Marcel Neidinger <m.neidinger@unibas.ch>, Department of Mathematics and Computer Science, University of Basel, 13 February 2017.

  2. What is Probabilistic Planning? Solve planning tasks with probabilistic transitions. Such a task models a Markov Decision Process (MDP) given by M = ⟨V, s0, A, T, R⟩:
  - a set of binary variables V, inducing the states S = 2^V
  - an initial state s0 ∈ S
  - a set of applicable actions A
  - a transition model T : S × A × S → [0, 1]
  - a reward function R(s, a)
  Monte Carlo Tree Search algorithms solve MDPs.
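The MDP tuple from the slide can be sketched as a small Python class. This is an illustrative data structure, not PROST's internal representation; the class and method names are invented for the example.

```python
import random

class MDP:
    """Minimal sketch of M = <V, s0, A, T, R> with explicit transition tables."""
    def __init__(self, s0, actions, transitions, reward):
        self.s0 = s0                    # initial state s0
        self.actions = actions          # applicable actions A
        self.transitions = transitions  # T: (s, a) -> {s': probability}
        self.reward = reward            # R(s, a)

    def sample_successor(self, s, a, rng=random):
        """Draw a successor state according to T(s, a, .)."""
        succ = self.transitions[(s, a)]
        states, probs = zip(*succ.items())
        return rng.choices(states, weights=probs)[0]

# Toy example with one binary variable (so |S| = 2): a noisy light switch.
mdp = MDP(
    s0='off',
    actions=['toggle'],
    transitions={('off', 'toggle'): {'on': 0.9, 'off': 0.1},
                 ('on', 'toggle'): {'off': 1.0}},
    reward=lambda s, a: 1.0 if s == 'on' else 0.0,
)
```

Each transition distribution sums to 1, mirroring T mapping into [0, 1].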

  3. Monte Carlo Tree Search Algorithms. An algorithmic framework to solve MDPs, used especially in computer Go. [Figures: a Go board (source: https://commons.wikimedia.org/wiki/File:Go_board.jpg) and Lee Sedol (source: https://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/)]

  4. Four phases - Two components. The phases are Selection, Expansion, Simulation, and Backpropagation; the tree policy covers selection and expansion, the default policy covers simulation.
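The four phases can be sketched as one MCTS trial over a toy tree of dict nodes. This is a simplified illustration (selection here picks a random child instead of a UCT score, and the rollout reward is a placeholder), with all names invented for the example.

```python
import random

def mcts_trial(root, legal_actions, step, rollout_policy, max_depth=20):
    """One trial: selection, expansion, simulation, backpropagation.
    Nodes are dicts {'state', 'children', 'visits', 'value'}."""
    path = [root]
    node = root
    # 1. Selection: descend through the known part of the tree
    #    (random here for brevity; a real tree policy would use UCT).
    while node['children']:
        node = random.choice(list(node['children'].values()))
        path.append(node)
    # 2. Expansion: add one child per applicable action.
    for a in legal_actions(node['state']):
        node['children'][a] = {'state': step(node['state'], a),
                               'children': {}, 'visits': 0, 'value': 0.0}
    # 3. Simulation: default-policy walk from the leaf's state.
    state, reward = node['state'], 0.0
    for _ in range(max_depth):
        acts = legal_actions(state)
        if not acts:
            break
        state = step(state, rollout_policy(acts))
        reward += 1.0  # toy reward: +1 per simulated step
    # 4. Backpropagation: update statistics along the visited path.
    for n in path:
        n['visits'] += 1
        n['value'] += reward
    return reward
```

Repeated trials grow the tree and refine the value estimates at each node.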

  5. Monte Carlo tree node. The MCTS tree for an MDP M stores the following information in each node i:
  - a state s ∈ S
  - a counter N(i) for the number of visits
  - a counter N(i)(s, a) for each a ∈ A, counting how often a was selected in s
  - a reward estimate Q(i)(s, a) for action a in state s
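The per-node quantities listed on the slide map directly onto a small class; this is an illustrative sketch (names invented), with the backpropagation update written as a running average.

```python
class TreeNode:
    """Bookkeeping for one MCTS tree node."""
    def __init__(self, state, actions):
        self.state = state
        self.visits = 0                                # N(i): node visit count
        self.action_visits = {a: 0 for a in actions}   # N(i)(s, a) per action
        self.q = {a: 0.0 for a in actions}             # Q(i)(s, a) reward estimate

    def update(self, action, reward):
        """Backpropagation step: incremental mean of observed rewards."""
        self.visits += 1
        self.action_visits[action] += 1
        n = self.action_visits[action]
        self.q[action] += (reward - self.q[action]) / n
```

After rewards 10 and 0 for the same action, the estimate is their mean, 5.0.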

  6. Online Knowledge. AlphaGo used neural networks for the two policies → domain-specific knowledge. We want domain-independent enhancements.

  7. Overview
  - Tree-Policy Enhancements: All Moves as First (α-AMAF, Cutoff-AMAF), Rapid Action Value Estimation
  - Default-Policy Enhancements: Move-Average Sampling Technique
  - Conclusion

  8. What is a Tree Policy? Iterate through the known part of the tree, selecting an action at each node. A Q value for a state-action pair is used to estimate an action's reward.

  9. UCT. The MCTS implementation first proposed in 2006. [Figure: example search tree with states s1–s5, moves m, m′, m′′, and a trial reward of 10]

  10. UCT. Reward approximation for parent node v_l and child node v_j:

  UCT(v_l, v_j) = Q^(i)(s_l, a_j) + 2 C_p √(2 ln N^(i)(s_l) / N^(i+1)(s_j))   (1)

  From parent v_l, select the child node v* that maximises the score:

  v* = argmax_{v_j} UCT(v_l, v_j)   (2)

  11. All Moves as First - Idea. The UCT score needs several trials to become reliable. Idea: generalise the information extracted from trials. Implementation: keep an additional, node-independent score that also updates actions that were not selected. [Figure: example tree with a (state, action, reward) table, e.g. entry (s1, m); trial reward 10]

  12. All Moves as First - α-AMAF. Idea: combine the UCT and AMAF scores:

  SCR = α · AMAF + (1 − α) · UCT   (3)

  Choose the action with the highest SCR.
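Equation (3) as code, assuming the two scores have already been computed (the AMAF bookkeeping itself is not shown); the function name is illustrative.

```python
def alpha_amaf(uct_score, amaf_score, alpha):
    """Blend the node-independent AMAF score with the node-local UCT score."""
    assert 0.0 <= alpha <= 1.0
    return alpha * amaf_score + (1 - alpha) * uct_score
```

With α = 0 this degenerates to plain UCT, with α = 1 to pure AMAF, matching the endpoints of the results plot.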

  13. All Moves as First - α-AMAF - Results. [Bar chart: IPPC score (0–1) per domain (wildfire, triangle, academic, elevators, tamarisk, sysadmin, recon, game, traffic, crossing, skill, navigation, total) for AMAF with α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0}]

  14. All Moves as First - α-AMAF - Problems. With more trials, UCT becomes more reliable, while the AMAF score has higher variance. We therefore want to stop using the AMAF score after some time.


  16. All Moves as First - Cutoff-AMAF. Introduce a cutoff parameter K:

  SCR = α · AMAF + (1 − α) · UCT   if i ≤ K
  SCR = UCT                        otherwise   (4)

  i.e. use the AMAF score only in the first K trials.
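The case distinction in equation (4) is a one-line guard around the blended score; a sketch with illustrative names, taking the trial counter i and cutoff K as arguments.

```python
def cutoff_amaf(uct_score, amaf_score, alpha, trial, cutoff_k):
    """Equation (4): blend with AMAF only during the first K trials."""
    if trial <= cutoff_k:
        return alpha * amaf_score + (1 - alpha) * uct_score
    return uct_score
```

After trial K the enhancement switches itself off and the search behaves like plain UCT.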

  17. All Moves as First - Cutoff-AMAF - Results. [Line plot: total IPPC score (roughly 0.5–0.75) as a function of K (0–50) for Cutoff-AMAF with init IDS and MC backups, compared against raw UCT and plain α-AMAF]

  18. All Moves as First - Cutoff-AMAF - Problems. How to choose the parameter K? When is the UCT score reliable enough?

  19. Rapid Action Value Estimation - Idea. First introduced in 2007 for computer Go. Use a soft cutoff:

  α = max{0, (V − v(n)) / V}   (5)

  Use UCT for frequently visited nodes and the AMAF score for less-visited ones.
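Equation (5) can be sketched as follows, with V the equivalence parameter and v(n) the visit count of node n (argument names invented); α decays linearly from 1 to 0 as the node accumulates visits, so the AMAF weight fades out smoothly instead of being cut off at a fixed K.

```python
def rave_alpha(node_visits, equivalence_v):
    """Equation (5): soft-cutoff weight, 1 for fresh nodes, 0 once v(n) >= V."""
    return max(0.0, (equivalence_v - node_visits) / equivalence_v)

def rave_score(uct_score, amaf_score, node_visits, equivalence_v):
    """Blend as in alpha-AMAF, but with the visit-dependent weight."""
    a = rave_alpha(node_visits, equivalence_v)
    return a * amaf_score + (1 - a) * uct_score
```

The results slide's RAVE(5) ... RAVE(50) correspond to different choices of V.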

  20. Rapid Action Value Estimation - Results. [Bar chart: IPPC score (0–1) per domain for UCT and RAVE with V ∈ {5, 15, 25, 50}]

  21. All Moves as First - Conclusion. [Bar chart: IPPC score (0–1) per domain for UCT, AMAF(α = 0.2), and RAVE(25)]

  22. Rapid Action Value Estimation - Problems. PROST uses a problem description with conditional effects and no preconditions, so the PROST description is more general. In PROST an action looks like move_up; in, e.g., computer chess it would be move_a2_to_a3. [Figure: grid with player, goal field, and move path]

  23. Predicate Rapid Action Value Estimation. A state has predicates that give some context. Idea: use predicates to find similar states and use their score:

  Q_PRAVE(s, a) = (1/N) Σ_{p ∈ P} Q_RAVE(p, a)   (6)

  and weight with

  α = max{0, (V − v(n)) / V}   (7)
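Equation (6) averages per-predicate RAVE values over the predicates P true in state s; a sketch assuming the per-predicate scores are kept in a dict (all names illustrative), with N taken as the number of predicates averaged over.

```python
def prave_q(q_rave, predicates, action):
    """Equation (6): mean of Q_RAVE(p, a) over the predicates P of the state.
    q_rave: dict mapping (predicate, action) -> score."""
    return sum(q_rave[(p, action)] for p in predicates) / len(predicates)
```

States sharing predicates thus share value estimates, which is what makes the enhancement applicable to PROST's general action descriptions.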

  24. All Moves as First - Conclusion - Revisited. [Bar chart: IPPC score (0–1) per domain for UCT, RAVE(25), PRAVE, and AMAF(α = 0.2)]

  25. Overview
  - Tree-Policy Enhancements: All Moves as First (α-AMAF, Cutoff-AMAF), Rapid Action Value Estimation
  - Default-Policy Enhancements: Move-Average Sampling Technique
  - Conclusion

  26. What is a Default Policy? It simulates the outcome of a trial. The basic default policy is a random walk.

  27. X-Average Sampling Technique. Use tree knowledge to bias the default policy towards moves that are more goal-oriented.

  28. Move-Average Sampling Technique - Idea. Introduce a global score Q(a): prefer moves that are good on average. Choose an action according to

  P(a) = e^(Q(a)/τ) / Σ_{b ∈ A} e^(Q(b)/τ)   (8)

  [Figure: sample game grid with player, goal field, and move path]
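Equation (8) is a Gibbs (Boltzmann) distribution over the global per-action averages Q(a). A sketch with illustrative names; τ controls how greedy the default policy is (small τ concentrates probability on the best-looking move).

```python
import math
import random

def mast_probabilities(q, tau=1.0):
    """Equation (8): softmax over the global action averages Q(a)."""
    weights = {a: math.exp(q[a] / tau) for a in q}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def mast_sample(q, tau=1.0, rng=random):
    """Draw one action for the default-policy rollout according to P(a)."""
    probs = mast_probabilities(q, tau)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]
```

Unlike the purely random walk, simulations now favour actions with high average reward while still exploring the rest.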

  29. Move-Average Sampling Technique - Idea - Example. [Two panels] Left, actions r, r, u, u, u: Q(r) = 1, N(r) = 2; Q(u) = 6, N(u) = 3. Right, actions r, r, u, l, l: Q(r) = 2, N(r) = 4; Q(u) = 7, N(u) = 4; Q(l) = 3, N(l) = 2.

  30. Move-Average Sampling Technique - Idea - Example (2). [Two panels] Left, actions l, u, u, r, r: Q(r) = 7, N(r) = 6; Q(u) = 8, N(u) = 6; Q(l) = 2, N(l) = 3. Right, actions r, r, r, u, u: Q(r) = 7, N(r) = 9; Q(u) = 9, N(u) = 8; Q(l) = 2, N(l) = 3.
