Practical Open-Loop Optimistic Planning
Edouard Leurent 1,2, Odalric-Ambrym Maillard 1
1 SequeL, Inria Lille – Nord Europe
2 Renault Group
ECML PKDD 2019, Würzburg, September 2019
Motivation — Sequential Decision Making

[Diagram: agent–environment loop — the agent sends an action, the environment returns a state and a reward.]

Markov Decision Processes
1. Observe state $s \in \mathcal{S}$;
2. Pick a discrete action $a \in \mathcal{A}$;
3. Transition to a next state $s' \sim P(s' \mid s, a)$;
4. Receive a bounded reward $r \in [0, 1]$ drawn from $P(r \mid s, a)$.

Objective: maximise $V = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
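Below is a minimal Python sketch of this interaction loop. The environment interface (`env.reset()`, `env.step(action)`) and the `agent.act(state)` method are assumptions made for the example; the episode's discounted return accumulates rewards exactly as in the objective above.

```python
def run_episode(env, agent, gamma=0.95, max_steps=100):
    """Roll out one episode and return the empirical discounted return.
    Assumed (hypothetical) interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done); agent.act(state) -> action."""
    state = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = agent.act(state)                 # 2. pick a discrete action a in A
        state, reward, done = env.step(action)    # 3./4. s' ~ P(.|s,a), r in [0, 1]
        discounted_return += discount * reward    # accumulate gamma^t * r_t
        discount *= gamma
        if done:
            break
    return discounted_return
```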
Motivation — Example
The highway-env environment. We want to handle stochasticity.
Motivation — How to solve MDPs? Online Planning
◮ we have access to a generative model: yields samples of $s', r \sim P(s', r \mid s, a)$ when queried

[Diagram: the environment sends the state and reward to the agent; the agent forwards the state to a planner, which returns an action recommendation; the agent then acts in the environment.]
Motivation — How to solve MDPs? Online Planning
◮ fixed budget: the model can only be queried $n$ times

Objective: minimise the simple regret $r_n = \mathbb{E}\left[V^* - V(n)\right]$, the gap between the optimal value and the value of the recommendation after $n$ queries.

An exploration-exploitation problem.
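To make the setting concrete, here is a toy fixed-budget planner in Python. It is not the algorithm of this paper, just a uniform baseline: it spends its $n$ queries to a hypothetical `generative_model(state, action) -> (next_state, reward)` function on random action sequences and recommends the first action of the best one found.

```python
import random

def uniform_planner(generative_model, state, actions, n, gamma=0.95, horizon=3):
    """Toy fixed-budget planner (uniform baseline, not the paper's algorithm):
    spend the n model queries on random action sequences, then recommend
    the first action of the sequence with the best sampled return."""
    budget = n
    best_return, best_first_action = float("-inf"), random.choice(actions)
    while budget >= horizon:
        sequence = [random.choice(actions) for _ in range(horizon)]
        s, sampled_return, discount = state, 0.0, 1.0
        for a in sequence:                        # each call consumes one query
            s, r = generative_model(s, a)         # s', r ~ P(s', r | s, a)
            sampled_return += discount * r
            discount *= gamma
            budget -= 1
        if sampled_return > best_return:
            best_return, best_first_action = sampled_return, sequence[0]
    return best_first_action                      # recommendation after <= n queries
```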
Optimistic Planning

Optimism in the Face of Uncertainty
Given a set of options $a \in \mathcal{A}$ with uncertain outcomes, try the one with the highest possible outcome.
◮ Either you performed well;
◮ or you learned something.

Instances
◮ Monte-Carlo Tree Search (MCTS) [Coulom 2006]: CrazyStone
◮ Reframed in the bandit setting as UCT [Kocsis and Szepesvári 2006], still very popular (e.g. AlphaGo).
◮ Proved asymptotically consistent, but no regret bound.
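To illustrate the optimistic principle behind UCT, here is a minimal UCB1-style selection rule in Python. It is only a sketch of the node-selection step, not the full UCT algorithm; the exploration constant `c` and the statistics format are assumptions made for the example.

```python
import math

def ucb_select(children, c=math.sqrt(2)):
    """Pick the action whose upper confidence bound is highest.
    `children` maps actions to (empirical_mean, visit_count) pairs."""
    total_visits = sum(count for _, count in children.values())
    def ucb(stats):
        mean, count = stats
        if count == 0:
            return float("inf")              # unvisited actions are maximally optimistic
        return mean + c * math.sqrt(math.log(total_visits) / count)
    return max(children, key=lambda a: ucb(children[a]))

# Example: one well-explored action, one barely tried.
print(ucb_select({"left": (0.6, 50), "right": (0.5, 2)}))  # optimism favours "right"
```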
Analysis of UCT
UCT was analysed in [Coquelin and Munos 2007]: its sample complexity is lower-bounded by $\Omega(\exp(\exp(D)))$, where $D$ is the depth of the search tree.
Failing cases of UCT
Not just a theoretical counter-example.
Can we get better guarantees?

OPD: Optimistic Planning for Deterministic systems
◮ Introduced by [Hren and Munos 2008]
◮ Another optimistic algorithm
◮ Only for deterministic MDPs

Theorem (OPD sample complexity)
$\mathbb{E}\, r_n = O\left(n^{-\frac{\log 1/\gamma}{\log \kappa}}\right)$, if $\kappa > 1$

OLOP: Open-Loop Optimistic Planning
◮ Introduced by [Bubeck and Munos 2010]
◮ Extends OPD to the stochastic setting
◮ Only considers open-loop policies, i.e. sequences of actions
The idea behind OLOP
A direct application of Optimism in the Face of Uncertainty
1. We want $\max_a V(a)$
2. Form upper confidence bounds of sequence values: $V(a) \leq U_a$ w.h.p.
3. Sample the sequence with the highest UCB: $\arg\max_a U_a$
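A simplified sketch of this optimistic loop follows, with the UCB taken directly on sampled sequence returns; the actual OLOP bound, built from per-depth reward means, is detailed on the next slides, so treat this as an illustration of the principle rather than the algorithm from the paper. Sequences are assumed to be tuples of actions, and `rollout(sequence)` is a hypothetical function returning one sampled discounted return.

```python
import math

def optimistic_sequence_search(sequences, rollout, n_episodes):
    """Simplified optimistic loop over a fixed set of action sequences (tuples):
    repeatedly sample the sequence with the highest upper confidence bound on
    its mean return, then recommend the first action of the most sampled one."""
    stats = {seq: [0.0, 0] for seq in sequences}       # sum of returns, sample count
    for _ in range(n_episodes):
        def ucb(seq):
            total, count = stats[seq]
            if count == 0:
                return float("inf")                    # optimism for untried sequences
            return total / count + math.sqrt(2 * math.log(n_episodes) / count)
        best = max(sequences, key=ucb)                 # step 3: arg max of the UCBs
        stats[best][0] += rollout(best)                # sample one return of that sequence
        stats[best][1] += 1
    most_sampled = max(sequences, key=lambda s: stats[s][1])
    return most_sampled[0]
```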
Under the hood
Upper-bounding the value of sequences

$V(a) = \underbrace{\sum_{t=1}^{h} \gamma^t \mu_{a_{1:t}}}_{\text{follow the sequence}} + \underbrace{\sum_{t \geq h+1} \gamma^t \mu_{a^*_{1:t}}}_{\text{act optimally}}$

where each $\mu_{a_{1:t}}$ in the first sum is upper-bounded by $U^\mu_{a_{1:t}}$, and each $\mu_{a^*_{1:t}}$ in the second sum by $1$.
Under the hood
OLOP main tool: the Chernoff-Hoeffding deviation inequality

$\underbrace{U^\mu_a(m)}_{\text{Upper bound}} \overset{\text{def}}{=} \underbrace{\hat{\mu}_a(m)}_{\text{Empirical mean}} + \underbrace{\sqrt{\frac{2 \log M}{T_a(m)}}}_{\text{Confidence interval}}$

OPD: upper-bound all the future rewards by 1

$U_a(m) \overset{\text{def}}{=} \underbrace{\sum_{t=1}^{h} \gamma^t U^\mu_{a_{1:t}}(m)}_{\text{Past rewards}} + \underbrace{\frac{\gamma^{h+1}}{1-\gamma}}_{\text{Future rewards}}$

Bounds sharpening

$B_a(m) \overset{\text{def}}{=} \inf_{1 \leq t \leq L} U_{a_{1:t}}(m)$
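These quantities can be computed directly from empirical statistics. Below is a sketch under stated assumptions: `prefix_stats` is a hypothetical list of (empirical_mean, sample_count) pairs, one per prefix $a_{1:t}$ of the sequence, `M` is the number of episodes as in the slide, and the sequence's own length plays the role of $L$ in the sharpened bound.

```python
import math

def reward_ucb(empirical_mean, count, M):
    """Chernoff-Hoeffding upper confidence bound on a mean reward in [0, 1]."""
    if count == 0:
        return 1.0                                  # no samples: rewards are at most 1
    return min(1.0, empirical_mean + math.sqrt(2 * math.log(M) / count))

def sequence_ucb(prefix_stats, gamma, M):
    """U_a(m): discounted sum of reward UCBs along the sequence a of length h,
    plus gamma^(h+1)/(1-gamma) for all the rewards beyond its length."""
    h = len(prefix_stats)
    past = sum(gamma ** t * reward_ucb(mean, count, M)
               for t, (mean, count) in enumerate(prefix_stats, start=1))
    return past + gamma ** (h + 1) / (1 - gamma)

def sharpened_bound(prefix_stats, gamma, M):
    """B_a(m): the tightest upper bound U over all prefixes of the sequence."""
    return min(sequence_ucb(prefix_stats[:t], gamma, M)
               for t in range(1, len(prefix_stats) + 1))
```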