Practical Open-Loop Optimistic Planning
Edouard Leurent 1,2, Odalric-Ambrym Maillard 1
1 SequeL, Inria Lille – Nord Europe
2 Renault Group
ECML PKDD 2019, Würzburg, September 2019
Motivation — Sequential Decision Making

[Diagram: agent–environment loop — the agent sends an action, the environment returns a state and a reward.]

Markov Decision Processes
1. Observe state $s \in \mathcal{S}$;
2. Pick a discrete action $a \in \mathcal{A}$;
3. Transition to a next state $s' \sim P(s' \mid s, a)$;
4. Receive a bounded reward $r \in [0, 1]$ drawn from $P(r \mid s, a)$.

Objective: maximise $V = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
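Below is a minimal Python sketch of this interaction loop. The environment interface (`env.reset()`, `env.step(action)`) and the `agent.act(state)` method are assumptions made for the example; the episode's discounted return accumulates rewards exactly as in the objective above.

```python
def run_episode(env, agent, gamma=0.95, max_steps=100):
    """Roll out one episode and return the empirical discounted return.
    Assumed (hypothetical) interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done); agent.act(state) -> action."""
    state = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = agent.act(state)                 # 2. pick a discrete action a in A
        state, reward, done = env.step(action)    # 3./4. s' ~ P(.|s,a), r in [0, 1]
        discounted_return += discount * reward    # accumulate gamma^t * r_t
        discount *= gamma
        if done:
            break
    return discounted_return
```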
Motivation — Example
The highway-env environment. We want to handle stochasticity.
Motivation — How to solve MDPs? Online Planning
◮ we have access to a generative model: yields samples of $s', r \sim P(s', r \mid s, a)$ when queried

[Diagram: the environment sends the state and reward to the agent; the agent forwards the state to a planner, which returns an action recommendation; the agent then acts in the environment.]
Motivation — How to solve MDPs? Online Planning
◮ fixed budget: the model can only be queried $n$ times

Objective: minimise the simple regret $r_n = \mathbb{E}\left[V^* - V(n)\right]$, the gap between the optimal value and the value of the recommendation after $n$ queries.

An exploration-exploitation problem.
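To make the setting concrete, here is a toy fixed-budget planner in Python. It is not the algorithm of this paper, just a uniform baseline: it spends its $n$ queries to a hypothetical `generative_model(state, action) -> (next_state, reward)` function on random action sequences and recommends the first action of the best one found.

```python
import random

def uniform_planner(generative_model, state, actions, n, gamma=0.95, horizon=3):
    """Toy fixed-budget planner (uniform baseline, not the paper's algorithm):
    spend the n model queries on random action sequences, then recommend
    the first action of the sequence with the best sampled return."""
    budget = n
    best_return, best_first_action = float("-inf"), random.choice(actions)
    while budget >= horizon:
        sequence = [random.choice(actions) for _ in range(horizon)]
        s, sampled_return, discount = state, 0.0, 1.0
        for a in sequence:                        # each call consumes one query
            s, r = generative_model(s, a)         # s', r ~ P(s', r | s, a)
            sampled_return += discount * r
            discount *= gamma
            budget -= 1
        if sampled_return > best_return:
            best_return, best_first_action = sampled_return, sequence[0]
    return best_first_action                      # recommendation after <= n queries
```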
Optimistic Planning

Optimism in the Face of Uncertainty
Given a set of options $a \in \mathcal{A}$ with uncertain outcomes, try the one with the highest possible outcome.
◮ Either you performed well;
◮ or you learned something.

Instances
◮ Monte-Carlo Tree Search (MCTS) [Coulom 2006]: CrazyStone
◮ Reframed in the bandit setting as UCT [Kocsis and Szepesvári 2006], still very popular (e.g. AlphaGo).
◮ Proved asymptotically consistent, but no regret bound.
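To illustrate the optimistic principle behind UCT, here is a minimal UCB1-style selection rule in Python. It is only a sketch of the node-selection step, not the full UCT algorithm; the exploration constant `c` and the statistics format are assumptions made for the example.

```python
import math

def ucb_select(children, c=math.sqrt(2)):
    """Pick the action whose upper confidence bound is highest.
    `children` maps actions to (empirical_mean, visit_count) pairs."""
    total_visits = sum(count for _, count in children.values())
    def ucb(stats):
        mean, count = stats
        if count == 0:
            return float("inf")              # unvisited actions are maximally optimistic
        return mean + c * math.sqrt(math.log(total_visits) / count)
    return max(children, key=lambda a: ucb(children[a]))

# Example: one well-explored action, one barely tried.
print(ucb_select({"left": (0.6, 50), "right": (0.5, 2)}))  # optimism favours "right"
```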
Analysis of UCT
UCT was analysed in [Coquelin and Munos 2007]: its sample complexity is lower-bounded by $\Omega(\exp(\exp(D)))$, where $D$ is the depth of the search tree.
Failing cases of UCT
Not just a theoretical counter-example.
Can we get better guarantees?

OPD: Optimistic Planning for Deterministic systems
◮ Introduced by [Hren and Munos 2008]
◮ Another optimistic algorithm
◮ Only for deterministic MDPs

Theorem (OPD sample complexity)
$\mathbb{E}\, r_n = O\left(n^{-\frac{\log 1/\gamma}{\log \kappa}}\right)$, if $\kappa > 1$

OLOP: Open-Loop Optimistic Planning
◮ Introduced by [Bubeck and Munos 2010]
◮ Extends OPD to the stochastic setting
◮ Only considers open-loop policies, i.e. sequences of actions
The idea behind OLOP
A direct application of Optimism in the Face of Uncertainty
1. We want $\max_a V(a)$
2. Form upper confidence bounds of sequence values: $V(a) \leq U_a$ w.h.p.
3. Sample the sequence with the highest UCB: $\arg\max_a U_a$
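A simplified sketch of this optimistic loop follows, with the UCB taken directly on sampled sequence returns; the actual OLOP bound, built from per-depth reward means, is detailed on the next slides, so treat this as an illustration of the principle rather than the algorithm from the paper. Sequences are assumed to be tuples of actions, and `rollout(sequence)` is a hypothetical function returning one sampled discounted return.

```python
import math

def optimistic_sequence_search(sequences, rollout, n_episodes):
    """Simplified optimistic loop over a fixed set of action sequences (tuples):
    repeatedly sample the sequence with the highest upper confidence bound on
    its mean return, then recommend the first action of the most sampled one."""
    stats = {seq: [0.0, 0] for seq in sequences}       # sum of returns, sample count
    for _ in range(n_episodes):
        def ucb(seq):
            total, count = stats[seq]
            if count == 0:
                return float("inf")                    # optimism for untried sequences
            return total / count + math.sqrt(2 * math.log(n_episodes) / count)
        best = max(sequences, key=ucb)                 # step 3: arg max of the UCBs
        stats[best][0] += rollout(best)                # sample one return of that sequence
        stats[best][1] += 1
    most_sampled = max(sequences, key=lambda s: stats[s][1])
    return most_sampled[0]
```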
Under the hood
Upper-bounding the value of sequences

$V(a) = \underbrace{\sum_{t=1}^{h} \gamma^t \mu_{a_{1:t}}}_{\text{follow the sequence}} + \underbrace{\sum_{t \geq h+1} \gamma^t \mu_{a^*_{1:t}}}_{\text{act optimally}}$

where each $\mu_{a_{1:t}}$ in the first sum is upper-bounded by $U^\mu_{a_{1:t}}$, and each $\mu_{a^*_{1:t}}$ in the second sum by $1$.
Under the hood
OLOP main tool: the Chernoff-Hoeffding deviation inequality

$\underbrace{U^\mu_a(m)}_{\text{Upper bound}} \overset{\text{def}}{=} \underbrace{\hat{\mu}_a(m)}_{\text{Empirical mean}} + \underbrace{\sqrt{\frac{2 \log M}{T_a(m)}}}_{\text{Confidence interval}}$

OPD: upper-bound all the future rewards by 1

$U_a(m) \overset{\text{def}}{=} \underbrace{\sum_{t=1}^{h} \gamma^t U^\mu_{a_{1:t}}(m)}_{\text{Past rewards}} + \underbrace{\frac{\gamma^{h+1}}{1-\gamma}}_{\text{Future rewards}}$

Bounds sharpening

$B_a(m) \overset{\text{def}}{=} \inf_{1 \leq t \leq L} U_{a_{1:t}}(m)$
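These quantities can be computed directly from empirical statistics. Below is a sketch under stated assumptions: `prefix_stats` is a hypothetical list of (empirical_mean, sample_count) pairs, one per prefix $a_{1:t}$ of the sequence, `M` is the number of episodes as in the slide, and the sequence's own length plays the role of $L$ in the sharpened bound.

```python
import math

def reward_ucb(empirical_mean, count, M):
    """Chernoff-Hoeffding upper confidence bound on a mean reward in [0, 1]."""
    if count == 0:
        return 1.0                                  # no samples: rewards are at most 1
    return min(1.0, empirical_mean + math.sqrt(2 * math.log(M) / count))

def sequence_ucb(prefix_stats, gamma, M):
    """U_a(m): discounted sum of reward UCBs along the sequence a of length h,
    plus gamma^(h+1)/(1-gamma) for all the rewards beyond its length."""
    h = len(prefix_stats)
    past = sum(gamma ** t * reward_ucb(mean, count, M)
               for t, (mean, count) in enumerate(prefix_stats, start=1))
    return past + gamma ** (h + 1) / (1 - gamma)

def sharpened_bound(prefix_stats, gamma, M):
    """B_a(m): the tightest upper bound U over all prefixes of the sequence."""
    return min(sequence_ucb(prefix_stats[:t], gamma, M)
               for t in range(1, len(prefix_stats) + 1))
```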