Monte Carlo Tree Search guided by Symbolic Advice for MDPs
Damien Busatto-Gaston, Debraj Chakraborty and Jean-François Raskin
Université Libre de Bruxelles
HIGHLIGHTS 2020, September 16, 2020
Markov Decision Process

[Figure: example MDP over states s0, s1, s2 with actions a1, a2, a3, a4, labelled with transition probabilities and rewards]

Path of length 2: s0 → s1 → s2

Finite-horizon total reward (horizon H):
Val(s0) = sup_{σ : Paths → A} E[Reward(p)], where p is a random variable over Paths_H(s0, σ)

Link with the infinite-horizon average reward for H large enough
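As a minimal illustration of this definition (not part of the slides), the finite-horizon value can be computed by backward induction; the toy MDP below, its rewards and its state/action names are made up for the example.

```python
# Minimal sketch: finite-horizon total-reward value by backward induction.
# The MDP encoding (state -> action -> list of (prob, reward, next_state))
# and all concrete numbers are illustrative, not taken from the slides.
mdp = {
    "s0": {"a1": [(0.5, 1, "s1"), (0.5, 0, "s2")],
           "a2": [(1.0, 2, "s2")]},
    "s1": {"a3": [(1.0, 1, "s0")]},
    "s2": {"a4": [(0.5, 3, "s1"), (0.5, 0, "s2")]},
}

def value(state, horizon):
    """Val_H(state): supremum over strategies of the expected total reward over H steps."""
    if horizon == 0:
        return 0.0
    return max(
        sum(p * (r + value(nxt, horizon - 1)) for p, r, nxt in outcomes)
        for outcomes in mdp[state].values()
    )

print(value("s0", 2))  # optimal expected total reward for paths of length 2 from s0
```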
Monte Carlo tree search (MCTS)

[Figure: sparse search tree rooted at s0, alternating state nodes and action nodes a1–a4 with value estimates v1, v2, v4]

Iterative construction of a sparse tree with value estimates:
- selection of a new node
- simulation from that node
- update of the estimates along the selected branch

With UCT (Kocsis & Szepesvári, 2006) as the selection strategy, after a given number of iterations n, MCTS outputs the best action:
- the probability of choosing a suboptimal action converges to zero
- v_i converges to the real value of a_i at a speed of (log n)/n
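For reference (not from the slides), the UCT selection rule picks, at each tree node, the child maximising an upper confidence bound on its value; a minimal Python sketch follows, where the node bookkeeping (children, visit counts, accumulated reward) and the exploration constant are illustrative assumptions.

```python
import math
import random

def uct_select(node, c=1.4):
    """UCT: pick the child maximising average reward plus an exploration bonus.
    `node` is an assumed bookkeeping object with a dict `children` mapping each
    action to a child carrying `visits` and `total_reward`."""
    untried = [a for a, child in node.children.items() if child.visits == 0]
    if untried:
        return random.choice(untried)  # visit every action at least once
    log_n = math.log(node.visits)
    return max(
        node.children,
        key=lambda a: node.children[a].total_reward / node.children[a].visits
        + c * math.sqrt(log_n / node.children[a].visits),
    )
```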
Symbolic advice
Symbolic advice

[Figure: unfolding of the MDP up to horizon H, with leaves marked ✓ or ✗ depending on whether the corresponding path is allowed by the advice]

An advice is a subset of Paths_H(s0), defined symbolically as a logical formula ϕ (reachability or safety property, LTL formula over finite traces, regular expression, ...)

ϕ defines a pruning of the unfolded MDP
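To make this concrete (an illustration, not the slides' formalism), a safety advice such as "avoid a set of bad states for H steps" can be evaluated on finite paths, and the pruning keeps exactly the prefixes that can still be completed into an allowed path; the path encoding and `bad_states` below are assumptions.

```python
# A path of horizon H is encoded as the list [s0, a0, s1, a1, ..., sH].
# bad_states is an illustrative safety advice: phi = "never visit a bad state".
bad_states = {"s_trap"}

def advice_holds(path):
    """True iff the path satisfies phi (states sit at even positions)."""
    return all(s not in bad_states for s in path[::2])

def prefix_allowed(prefix, completions):
    """Pruning of the unfolding: keep a node iff its prefix has at least one
    completion (taken from the unfolded MDP) that satisfies the advice."""
    return any(advice_holds(prefix + tail) for tail in completions)
```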
Symbolic advice

[Figure: pruned unfolding in which, whenever an action is kept, all of its stochastic successors are kept as well]

Strongly enforceable advice: can be enforced by the controller if the MDP is seen as a game, i.e. it does not partially prune stochastic transitions
Boolean Solvers

The advice ϕ can be encoded as a Boolean formula ψ

QBF solver
- A first action a0 is compatible with ϕ iff ∀s1 ∃a1 ∀s2 ..., the path s0 a0 s1 a1 s2 ... satisfies ψ
- Inductive way of constructing paths that satisfy the strongly enforceable advice ϕ

Weighted sampling
- Simulation of safe paths according to ψ
- Weighted SAT sampling (Chakraborty, Fremont, Meel, Seshia & Vardi, 2014)
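As an illustration of the ∀∃ alternation behind the QBF check (a brute-force stand-in, not an actual QBF encoding or solver call), one can recurse over the unfolding: the controller chooses actions, the environment ranges over all stochastic successors. `actions`, `successors` and `advice_holds` are assumed helpers.

```python
# Brute-force sketch of the compatibility check, mirroring the alternation
#   for all s1, exists a1, for all s2, ...  the resulting path satisfies the advice.
# actions(state), successors(state, action) and advice_holds(path) are assumed helpers.

def enforceable(path, steps_left):
    """True iff, from this prefix, the controller can enforce the advice on every
    completion of the remaining length (exists an action / for all successors)."""
    if steps_left == 0:
        return advice_holds(path)
    state = path[-1]
    return any(
        all(enforceable(path + [a, s], steps_left - 1) for s in successors(state, a))
        for a in actions(state)
    )

def first_action_compatible(s0, a0, horizon):
    """a0 is compatible with the advice iff the controller can keep enforcing it
    after every stochastic successor of (s0, a0)."""
    return all(enforceable([s0, a0, s1], horizon - 1) for s1 in successors(s0, a0))
```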
MCTS under advice
MCTS under advice

[Figure: search tree built inside the pruned unfolding, with simulations restricted to paths allowed by the advice]

Select actions in the unfolding pruned by a selection advice ϕ
Simulation is restricted according to a simulation advice ψ
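A minimal Python sketch (an illustration under assumptions, not the paper's implementation) of where the two advice act: the selection advice ϕ restricts which actions UCT may pick, and the simulation advice ψ restricts which moves the rollout may sample; the paper relies on weighted SAT sampling rather than the naive filtering shown here. `phi_allowed`, `psi_allows`, `available_actions`, `step` and the restricted UCT call are assumed helpers.

```python
import random

# phi_allowed(state, action): action kept by the selection advice phi (assumed helper)
# psi_allows(path_prefix):    prefix allowed by the simulation advice psi (assumed helper)
# available_actions(state), step(state, action) -> (reward, next_state): assumed MDP access

def select_action(tree_node, state):
    """Selection phase: UCT restricted to the actions kept by phi."""
    candidates = [a for a in available_actions(state) if phi_allowed(state, a)]
    return uct_select_among(tree_node, candidates)  # UCB1 over candidates only (assumed variant)

def simulate(state, horizon):
    """Simulation phase: sample a rollout whose moves stay allowed by psi."""
    path, total = [state], 0.0
    for _ in range(horizon):
        allowed = [a for a in available_actions(state) if psi_allows(path + [a])]
        action = random.choice(allowed)
        reward, state = step(state, action)
        total += reward
        path += [action, state]
    return total
```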
MCTS under advice: convergence properties

With UCT (Kocsis & Szepesvári, 2006) as the selection strategy:
- the probability of choosing a suboptimal action converges to zero
- v_i converges to the real value of a_i at a speed of (log n)/n

These convergence properties are maintained:
- for all simulation advice ψ
- for all selection advice ϕ which
  - are strongly enforceable advice
  - satisfy an optimality assumption: ϕ does not prune all optimal actions
Experimental results
Experimental results

Figure: 9 x 21 maze, 4 random ghosts

Algorithm                    % of win   % of loss   % of no result (1)   % of food eaten
MCTS                             17         59             24                  67
MCTS + selection advice          25         54             21                  71
MCTS + simulation advice         71         29              0                  88
MCTS + both advice               85         15              0                  94
Human                            44         56              0                  75

(1) after 300 steps
Future works

- Compiler LTL → symbolic advice
- Study interactions with reinforcement learning techniques (and neural networks)
- Weighted advice

Thank You