Practical Open-Loop Optimistic Planning


  1. Practical Open-Loop Optimistic Planning
     Edouard Leurent 1,2, Odalric-Ambrym Maillard 1
     1 SequeL, Inria Lille – Nord Europe; 2 Renault Group
     ECML PKDD 2019, Würzburg, September 2019

  2.–7. Motivation — Sequential Decision Making
     (diagram: the Agent sends an action to the Environment and receives a state and a reward in return)
     Markov Decision Processes:
     1. Observe state s ∈ S;
     2. Pick a discrete action a ∈ A;
     3. Transition to a next state s′ ∼ P(s′ | s, a);
     4. Receive a bounded reward r ∈ [0, 1] drawn from P(r | s, a).
     Objective: maximise $V = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
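A minimal sketch of this setting in Python (all names and the toy dynamics are illustrative assumptions, not part of the slides): a generative model that can be sampled for transitions and bounded rewards, and the discounted return the agent tries to maximise in expectation.

```python
import numpy as np

class GenerativeModel:
    """Toy generative model of an MDP: can be queried for samples of
    (s', r) ~ P(s', r | s, a).  Purely illustrative dynamics."""

    def __init__(self, n_states=5, n_actions=3, gamma=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.gamma = gamma
        self.n_actions = n_actions
        # Random transition kernel P(s' | s, a) and mean rewards in [0, 1]
        self.P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        self.mean_reward = rng.uniform(size=(n_states, n_actions))
        self.rng = rng

    def step(self, s, a):
        """Sample a next state and a bounded reward for the pair (s, a)."""
        s_next = int(self.rng.choice(self.P.shape[-1], p=self.P[s, a]))
        r = float(self.rng.uniform() < self.mean_reward[s, a])  # Bernoulli reward in {0, 1} ⊂ [0, 1]
        return s_next, r


def discounted_return(rewards, gamma):
    """V = sum_t gamma^t * r_t: the quantity the agent maximises in expectation."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```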

  8. Motivation — Example
     The highway-env environment. We want to handle stochasticity.
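For concreteness, one possible way to interact with highway-env (a sketch: it assumes the highway-env package is installed and exposes the "highway-v0" Gymnasium environment; registration and API details may differ between versions).

```python
import gymnasium as gym
import highway_env  # noqa: F401  (registers the highway environments on import)

env = gym.make("highway-v0")
obs, info = env.reset(seed=0)
terminated = truncated = False
total_reward = 0.0
while not (terminated or truncated):
    action = env.action_space.sample()  # placeholder policy; a planner would choose here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
env.close()
print("episode return:", total_reward)
```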

  9.–14. Motivation — How to solve MDPs? Online Planning
     ◮ we have access to a generative model: yields samples of s′, r ∼ P(s′, r | s, a) when queried
     (diagram: the Environment sends the state to the Agent, which forwards it to a Planner; the Planner queries the generative model, returns a recommendation, and the Agent executes the corresponding action, receiving a new state and reward)

  15. Motivation — How to solve MDPs? Online Planning
     ◮ fixed budget: the model can only be queried n times
     Objective: minimize the simple regret $r_n = \mathbb{E}\left[V^* - V(n)\right]$
     An exploration-exploitation problem.
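As a point of reference, here is a naive fixed-budget planner (an illustrative baseline, not the paper's algorithm): it spreads its n allowed queries uniformly over first actions, estimates each action's return with random rollouts, and recommends the empirically best one. It reuses the GenerativeModel sketch above.

```python
import numpy as np

def uniform_planner(model, state, n_budget, gamma=0.9, horizon=10, seed=0):
    """Naive fixed-budget planner: uniform allocation of the n queries, then recommend."""
    rng = np.random.default_rng(seed)
    n_actions = model.n_actions
    rollouts_per_action = max(1, n_budget // (n_actions * horizon))
    estimates = np.zeros(n_actions)
    for a0 in range(n_actions):
        for _ in range(rollouts_per_action):
            s, ret, discount, a = state, 0.0, 1.0, a0
            for _ in range(horizon):
                s, r = model.step(s, a)              # one query to the generative model
                ret += discount * r
                discount *= gamma
                a = int(rng.integers(n_actions))     # continue with uniformly random actions
            estimates[a0] += ret / rollouts_per_action
    return int(np.argmax(estimates))                 # recommendation once the budget is spent
```

For example, `uniform_planner(GenerativeModel(), state=0, n_budget=300)` returns a recommended first action after 300 model queries.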

  16.–19. Optimistic Planning
     Optimism in the Face of Uncertainty: given a set of options a ∈ A with uncertain outcomes, try the one with the highest possible outcome.
     ◮ Either you performed well;
     ◮ or you learned something.
     Instances
     ◮ Monte-Carlo Tree Search (MCTS) [Coulom 2006]: CrazyStone
     ◮ Reframed in the bandit setting as UCT [Kocsis and Szepesvári 2006], still very popular (e.g. AlphaGo).
     ◮ Proved asymptotically consistent, but with no regret bound.
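To make the UCT rule concrete, here is a minimal sketch of the UCB1-style selection it applies at every node of the search tree (the Node fields, the exploration constant c, and the tie handling are illustrative assumptions, not taken from the slides).

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    visits: int = 0
    mean_value: float = 0.0

def uct_select(children, c=1.414):
    """UCB1-style rule used by UCT at each tree node: pick the child maximising
    empirical mean value + exploration bonus; unvisited children are tried first."""
    total = sum(child.visits for child in children) or 1
    def ucb(child):
        if child.visits == 0:
            return float("inf")
        return child.mean_value + c * math.sqrt(math.log(total) / child.visits)
    return max(children, key=ucb)
```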

  20. Analysis of UCT
     UCT was analysed in [Coquelin and Munos 2007]: its sample complexity is lower-bounded by exp(exp(D)), where D is the depth of the planning tree.

  21. Failing cases of UCT
     Not just a theoretical counter-example.

  22.–23. Can we get better guarantees?
     OPD: Optimistic Planning for Deterministic systems
     ◮ Introduced by [Hren and Munos 2008]
     ◮ Another optimistic algorithm
     ◮ Only for deterministic MDPs
     Theorem (OPD sample complexity)
     $\mathbb{E}\, r_n = O\left(n^{-\frac{\log 1/\gamma}{\log \kappa}}\right)$, if $\kappa > 1$
     OLOP: Open-Loop Optimistic Planning
     ◮ Introduced by [Bubeck and Munos 2010]
     ◮ Extends OPD to the stochastic setting
     ◮ Only considers open-loop policies, i.e. sequences of actions

  24.–26. The idea behind OLOP
     A direct application of Optimism in the Face of Uncertainty:
     1. We want $\max_a V(a)$
     2. Form upper confidence bounds on sequence values: $V(a) \leq U_a$ w.h.p.
     3. Sample the sequence with the highest UCB: $\arg\max_a U_a$
     (a code sketch of this loop follows the illustrations below)

  27.–28. The idea behind OLOP (illustrations)
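A minimal sketch of that loop, under stated assumptions: it reuses the GenerativeModel sketch above, enumerates action sequences of length L, and uses a simplified Hoeffding-style bound on each sequence's value; the exact bounds U_a(m) and B_a(m) used by OLOP are spelled out in slides 29-33. The names, default parameters, and recommendation rule are illustrative.

```python
import itertools
import math

def olop_style_planning(model, state, n_actions, L=3, M=50, gamma=0.8):
    """Skeleton of open-loop optimistic planning over action sequences of length L.
    For each of M episodes: compute an upper confidence bound on every sequence's value,
    sample the most optimistic sequence through the generative model, update statistics."""
    sequences = list(itertools.product(range(n_actions), repeat=L))
    counts, means = {}, {}   # per-prefix visit counts T_a(m) and empirical means mu_hat_a(m)

    def prefix_reward_ucb(prefix):
        t = counts.get(prefix, 0)
        if t == 0:
            return 1.0   # rewards are bounded by 1
        return min(1.0, means[prefix] + math.sqrt(2 * math.log(M) / t))

    def sequence_value_ucb(seq):
        # Discounted sum of per-prefix reward UCBs + gamma^(L+1)/(1-gamma) for the unseen tail
        past = sum(gamma ** t * prefix_reward_ucb(seq[:t]) for t in range(1, L + 1))
        return past + gamma ** (L + 1) / (1 - gamma)

    for _ in range(M):
        best = max(sequences, key=sequence_value_ucb)     # optimism: most promising sequence
        s = state
        for t in range(1, L + 1):
            s, r = model.step(s, best[t - 1])             # one query to the generative model
            prefix = best[:t]
            counts[prefix] = counts.get(prefix, 0) + 1
            means[prefix] = means.get(prefix, 0.0) + (r - means.get(prefix, 0.0)) / counts[prefix]

    # Recommend the most-sampled first action (one common recommendation rule)
    return max(range(n_actions), key=lambda a: counts.get((a,), 0))
```

With the earlier sketch, `olop_style_planning(GenerativeModel(), state=0, n_actions=3)` returns a recommended first action.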

  29.–30. Under the hood
     Upper-bounding the value of sequences:
     $V(a) = \underbrace{\sum_{t=1}^{h} \gamma^t \mu_{a_{1:t}}}_{\text{follow the sequence}} + \underbrace{\sum_{t \geq h+1} \gamma^t \mu_{a^*_{1:t}}}_{\text{act optimally}}$
     where each $\mu_{a_{1:t}} \leq U^\mu$ and each $\mu_{a^*_{1:t}} \leq 1$.

  31. Under the hood
     OLOP main tool: the Chernoff-Hoeffding deviation inequality
     $\underbrace{U^\mu_a(m)}_{\text{Upper bound}} \stackrel{\text{def}}{=} \underbrace{\hat{\mu}_a(m)}_{\text{Empirical mean}} + \underbrace{\sqrt{\frac{2 \log M}{T_a(m)}}}_{\text{Confidence interval}}$
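A direct transcription of that bound (a sketch: the argument names are illustrative; T_a(m) is the number of times the prefix a has been sampled after m episodes, and M is the total number of episodes).

```python
import math

def reward_ucb(empirical_mean, n_samples, n_episodes_M):
    """Chernoff-Hoeffding upper confidence bound on a mean reward in [0, 1]:
    U^mu_a(m) = mu_hat_a(m) + sqrt(2 * log(M) / T_a(m))."""
    if n_samples == 0:
        return 1.0  # no observation yet: fall back to the trivial bound (rewards are in [0, 1])
    return empirical_mean + math.sqrt(2.0 * math.log(n_episodes_M) / n_samples)
```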

  32. Under the hood
     OPD-style step: upper-bound all the future rewards by 1
     $U_a(m) \stackrel{\text{def}}{=} \underbrace{\sum_{t=1}^{h} \gamma^t U^\mu_{a_{1:t}}(m)}_{\text{Past rewards}} + \underbrace{\frac{\gamma^{h+1}}{1-\gamma}}_{\text{Future rewards}}$
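A sketch of that sequence-level bound, assuming `prefix_reward_ucbs[t-1]` holds $U^\mu_{a_{1:t}}(m)$ for t = 1..h (it composes with `reward_ucb` above).

```python
def sequence_ucb(prefix_reward_ucbs, gamma):
    """U_a(m): discounted sum of the reward UCBs along the prefixes of a (length h),
    plus gamma^(h+1) / (1 - gamma), which upper-bounds every reward beyond the horizon by 1."""
    h = len(prefix_reward_ucbs)
    past = sum(gamma ** t * u for t, u in enumerate(prefix_reward_ucbs, start=1))
    future = gamma ** (h + 1) / (1.0 - gamma)
    return past + future
```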

  33. Under the hood
     Bounds sharpening: since $U_{a_{1:t}}(m)$ upper-bounds the value of the full sequence for every prefix length t, keep the tightest of these bounds
     $B_a(m) \stackrel{\text{def}}{=} \inf_{1 \leq t \leq L} U_{a_{1:t}}(m)$
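And the corresponding sharpened bound, reusing `sequence_ucb` from the previous sketch (again an illustrative transcription, not the paper's code).

```python
def sharpened_bound(prefix_reward_ucbs, gamma):
    """B_a(m) = inf over prefix lengths t of U_{a_{1:t}}(m): each prefix's sequence UCB
    already upper-bounds the value of the full sequence, so keep only the smallest."""
    L = len(prefix_reward_ucbs)
    return min(sequence_ucb(prefix_reward_ucbs[:t], gamma) for t in range(1, L + 1))
```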
