Monte Carlo Tree Search guided by Symbolic Advice for MDPs


  1. Monte Carlo Tree Search guided by Symbolic Advice for MDPs
     Damien Busatto-Gaston, Debraj Chakraborty and Jean-François Raskin
     Université Libre de Bruxelles, September 16, 2020, HIGHLIGHTS 2020

  2. Markov Decision Process
     [Figure: a small MDP with states s_0, s_1, s_2, actions a_1 to a_4, and probability-weighted transitions carrying rewards]
     Path of length 2: s_0 → s_1 → s_2 (each step picks an action, then follows a probabilistic transition)
     Finite-horizon total reward (horizon H): Val(s_0) = sup_{σ : Paths → A} E[Reward(p)], where p is a random variable over Paths_H(s_0, σ)
     Link with infinite-horizon average reward for H large enough
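
To make the definition concrete, here is a minimal sketch of computing the finite-horizon value by backward induction. The transition table is an illustrative toy, not the MDP from the figure.

```python
# Minimal sketch: finite-horizon total reward by backward induction.
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "s0": {"a1": [(0.5, "s1", 1.0), (0.5, "s2", 0.0)],
           "a2": [(1.0, "s2", 2.0)]},
    "s1": {"a3": [(0.5, "s2", 3.0), (0.5, "s0", 1.0)]},
    "s2": {"a4": [(1.0, "s0", 0.0)]},
}

def value(state, horizon):
    """Val_H(s) = max over actions of the expected reward-to-go."""
    if horizon == 0:
        return 0.0
    return max(
        sum(p * (rew + value(nxt, horizon - 1)) for p, nxt, rew in outcomes)
        for outcomes in transitions[state].values()
    )

print(value("s0", 2))  # optimal expected total reward over 2 steps
```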

  3. Monte Carlo tree search (MCTS)
     [Figure: a sparse tree rooted at s_0, alternating state nodes with action nodes a_i and their value estimates v_i, grown over several iterations]
     Iterative construction of a sparse tree with value estimates: selection of a new node → simulation → update of the estimates
     With UCT (Kocsis & Szepesvári, 2006) as the selection strategy:
     - after a given number of iterations n, MCTS outputs the best action
     - the probability of choosing a suboptimal action converges to zero
     - v_i converges to the real value of a_i at a speed of (log n)/n
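
For illustration, the loop below sketches MCTS with UCT selection on the toy `transitions` table from the previous sketch. The exploration constant `c` and the uniform-random rollout policy are illustrative defaults, not choices made in the talk.

```python
# Minimal MCTS with UCT selection; reuses `transitions` from the sketch above.
import math
import random
from collections import defaultdict

class Node:
    def __init__(self, state):
        self.state = state
        self.visits = 0
        self.action_visits = defaultdict(int)
        self.action_value = defaultdict(float)    # running mean of returns
        self.children = {}                        # (action, next_state) -> Node

def sample_transition(state, action):
    """Draw (next_state, reward) according to the transition probabilities."""
    r, acc = random.random(), 0.0
    for p, nxt, rew in transitions[state][action]:
        acc += p
        if r <= acc:
            return nxt, rew
    return nxt, rew                               # guard against rounding

def uct_action(node, c=1.4):
    """Maximise mean return plus the UCT exploration bonus."""
    for a in transitions[node.state]:
        if node.action_visits[a] == 0:
            return a                              # try untested actions first
    return max(transitions[node.state],
               key=lambda a: node.action_value[a]
               + c * math.sqrt(math.log(node.visits) / node.action_visits[a]))

def rollout(state, horizon):
    """Simulation phase: uniform-random play to the horizon."""
    total = 0.0
    for _ in range(horizon):
        state, rew = sample_transition(state, random.choice(list(transitions[state])))
        total += rew
    return total

def iterate(node, horizon):
    """One selection / expansion / simulation / backpropagation pass."""
    if horizon == 0:
        return 0.0
    a = uct_action(node)
    nxt, rew = sample_transition(node.state, a)
    child = node.children.get((a, nxt))
    if child is None:
        node.children[(a, nxt)] = Node(nxt)
        ret = rew + rollout(nxt, horizon - 1)     # new leaf: simulate
    else:
        ret = rew + iterate(child, horizon - 1)   # known node: descend
    node.visits += 1
    node.action_visits[a] += 1
    node.action_value[a] += (ret - node.action_value[a]) / node.action_visits[a]
    return ret

root = Node("s0")
for _ in range(2000):
    iterate(root, horizon=5)
print(max(transitions["s0"], key=lambda a: root.action_value[a]))
```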

  4. Symbolic advice

  5. Symbolic advice
     [Figure: the MDP unfolded up to horizon H; each leaf is marked ✓ or ✗ depending on whether its path satisfies the advice]
     An advice is a subset of Paths_H(s_0)
     Defined symbolically as a logical formula φ (reachability or safety property, LTL formula over finite traces, regular expression, ...)
     φ defines a pruning of the unfolded MDP
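
Operationally, an advice can be viewed as a predicate over length-H paths, and the pruning keeps exactly the satisfying paths. A small sketch, reusing the `transitions` table above; the safety property and the choice of bad state are made up for illustration.

```python
# Sketch: an advice as a predicate over paths of the unfolding.
# Paths alternate states and actions: (s0, a0, s1, a1, ..., sH).
def safety_advice(path, bad=frozenset({"s2"})):
    """An illustrative safety advice: never visit a bad state."""
    return all(s not in bad for s in path[0::2])  # states sit at even indices

def all_paths(state, horizon):
    """Enumerate the unfolding of `transitions` up to the horizon."""
    if horizon == 0:
        yield (state,)
        return
    for a, outcomes in transitions[state].items():
        for _, nxt, _ in outcomes:
            for tail in all_paths(nxt, horizon - 1):
                yield (state, a) + tail

# The advice defines a pruning of the unfolded MDP:
pruned = [p for p in all_paths("s0", 2) if safety_advice(p)]
print(pruned)
```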

  6. Symbolic advice
     [Figure: the pruned unfolding; each stochastic transition is either kept with all its successors or removed entirely]
     Strongly enforceable advice: an advice that the controller can enforce when the MDP is seen as a game; in particular, it does not partially prune stochastic transitions
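
The "no partial pruning" condition can be tested directly on a set of kept paths: whenever a prefix ending in an action is kept for one stochastic successor, it must be kept for all of them. A sketch of that check, reusing `pruned` from the previous sketch (the plain safety pruning there fails it, so it is not strongly enforceable as-is).

```python
# Sketch: check that a pruning never cuts a stochastic transition partially,
# a necessary condition for a strongly enforceable advice.
def no_partial_pruning(kept_paths):
    prefixes = set()
    for p in kept_paths:
        for i in range(1, len(p), 2):             # cut just after each action
            prefixes.add(p[:i + 1])
    for prefix in prefixes:
        state, action = prefix[-2], prefix[-1]
        successors = {nxt for _, nxt, _ in transitions[state][action]}
        kept = {p[len(prefix)] for p in kept_paths
                if len(p) > len(prefix) and p[:len(prefix)] == prefix}
        if kept != successors:                    # some successor was pruned
            return False
    return True

print(no_partial_pruning(pruned))  # False: the safety pruning is partial here
```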

  7. Boolean Solvers
     The advice φ can be encoded as a Boolean formula ψ
     QBF solver: a first action a_0 is compatible with φ iff ∀s_1 ∃a_1 ∀s_2 ..., s_0 a_0 s_1 a_1 s_2 ... ⊨ ψ
     This gives an inductive way of constructing paths that satisfy the strongly enforceable advice φ
     Weighted sampling: simulation of safe paths according to ψ, using weighted SAT sampling (Chakraborty, Fremont, Meel, Seshia & Vardi, 2014)
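
The quantifier alternation on this slide can be sketched with an explicit-state recursion. A real implementation would encode it as a QBF and call a solver; the sketch below only shows the ∃/∀ structure, again on the toy `transitions` table.

```python
# Sketch of the quantifier alternation: after a first action, FOR ALL
# stochastic successors there must EXIST a next action such that,
# recursively, every completed path satisfies psi.
def enforceable(path, horizon, psi):
    if horizon == 0:
        return psi(path)                          # innermost check on a full path
    state = path[-1]
    return any(                                   # EXISTS an action ...
        all(enforceable(path + (a, nxt), horizon - 1, psi)
            for _, nxt, _ in transitions[state][a])  # ... FOR ALL successors
        for a in transitions[state]
    )

def compatible_first_actions(s0, horizon, psi):
    """First actions a_0 compatible with the advice, as on the slide."""
    return [a for a in transitions[s0]
            if all(enforceable((s0, a, nxt), horizon - 1, psi)
                   for _, nxt, _ in transitions[s0][a])]

# Example: enforce "never visit s1" over horizon 2.
print(compatible_first_actions("s0", 2, lambda p: "s1" not in p[0::2]))
```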

  8. MCTS under advice

  9. MCTS under advice
     [Figure: the search tree pruned by the selection advice, with leaves marked ✓ or ✗ by the simulation advice]
     Select actions in the unfolding pruned by a selection advice φ
     Simulation is restricted according to a simulation advice ψ
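
In code, the two advice touch two places of the MCTS loop from the earlier sketch: selection considers only the actions kept by φ, and simulation only scores rollouts whose paths satisfy ψ. The rejection-sampling version below is a naive stand-in; the talk's approach samples satisfying paths directly with a weighted SAT sampler. `allowed` and `psi` are hypothetical callback names.

```python
# Sketch: where the two advice enter the MCTS loop (reuses math, random,
# `transitions` and `sample_transition` from the MCTS sketch above).

def guided_uct_action(node, allowed, c=1.4):
    """Selection advice: restrict UCT to actions kept by the advice.
    Assumes the advice keeps at least one action in every state."""
    actions = [a for a in transitions[node.state] if allowed(node.state, a)]
    for a in actions:
        if node.action_visits[a] == 0:
            return a                              # explore untried actions first
    return max(actions,
               key=lambda a: node.action_value[a]
               + c * math.sqrt(math.log(node.visits) / node.action_visits[a]))

def advised_rollout(state, horizon, psi, max_tries=100):
    """Simulation advice: keep only rollouts whose path satisfies psi.
    Naive rejection sampling; weighted SAT sampling draws such paths directly."""
    for _ in range(max_tries):
        path, total, s = (state,), 0.0, state
        for _ in range(horizon):
            a = random.choice(list(transitions[s]))
            s, rew = sample_transition(s, a)
            path += (a, s)
            total += rew
        if psi(path):
            return total
    return 0.0                                    # no satisfying rollout found
```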

  10. MCTS under advice: convergence properties
     With UCT (Kocsis & Szepesvári, 2006) as the selection strategy, the probability of choosing a suboptimal action converges to zero, and v_i converges to the real value of a_i at a speed of (log n)/n.
     These convergence properties are maintained:
     - for all simulation advice
     - for all selection advice that are strongly enforceable and satisfy an optimality assumption: the advice does not prune all optimal actions

  11. Experimental results

  12. Experimental results (9 × 21 maze, 4 random ghosts)

     Algorithm                  % win   % loss   % no result¹   % food eaten
     MCTS                         17      59          24             67
     MCTS + selection advice      25      54          21             71
     MCTS + simulation advice     71      29           0             88
     MCTS + both advice           85      15           0             94
     Human                        44      56           0             75

     ¹ after 300 steps

  13. Future work
     - Compiler LTL → symbolic advice
     - Study interactions with reinforcement learning techniques (and neural networks)
     - Weighted advice
     Thank You
