Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds
Andrea Zanette* (zanette@stanford.edu), Emma Brunskill (ebrun@cs.stanford.edu)
Institute for Computational and Mathematical Engineering and Department of Computer Science, Stanford University

Exploration in RL = learn quickly how to play near-optimally
Setting: episodic tabular RL
Goal: automatically inherit instance-dependent regret bounds
State of the Art Regret Bounds for Episodic Tabular MDPs
- No intelligent exploration: Õ(T) (naive greedy)
- Efficient exploration: Õ(HS√(AT)) (UCRL2, Jaksch 2010); Õ(H√(SAT)) (Dann 2015); Õ(S√(HAT)) (Dann 2017); Õ(√(HSAT)) (Azar 2017)
- Lower bound: Ω(√(HSAT))
- Problem-dependent analysis: Õ(√(ℚ⋆ SAT)) (our work)
Main Result
[Diagram: from (s, a) at step t, the successor state s⁺ at step t+1 is drawn from p(s, a); the successors carry optimal values V⋆(s⁺₁), V⋆(s⁺₂), V⋆(s⁺₃)]
ℚ⋆ = max_{s,a} Var_{s⁺ ∼ p(s,a)} V⋆(s⁺)
The total return along an episode, r₁ + r₂ + … + r_H, is assumed bounded.
Main Result: an algorithm with a (high-probability) regret bound of
min{ Õ(√(ℚ⋆ SAT)) + [const], Õ(√(H² SAT)) + [const] }
Technique: an exploration bonus that is adaptively adjusted as a function of the problem difficulty.
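As a concrete illustration of the quantity ℚ⋆ (a sketch of mine, not code from the paper: the toy MDP, array layout, and function names are made up), the snippet below runs backward value iteration on a small finite-horizon tabular MDP and takes the largest next-state variance of V⋆ over state-action pairs; since V⋆ is stage-dependent in the finite-horizon setting, the maximum here is also taken over time steps.

```python
import numpy as np

def optimal_values(P, R, H):
    """Backward value iteration for a finite-horizon tabular MDP.

    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
    Returns V with V[t][s] = V*_t(s) for t = 0..H (V[H] = 0).
    """
    S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    for t in range(H - 1, -1, -1):
        Q = R + P @ V[t + 1]          # Q[s, a] = r(s, a) + E_{s+ ~ p(s,a)} V*_{t+1}(s+)
        V[t] = Q.max(axis=1)
    return V

def max_nextstate_variance(P, R, H):
    """Q* = max over (t, s, a) of Var_{s+ ~ p(s,a)} V*_{t+1}(s+)."""
    V = optimal_values(P, R, H)
    worst = 0.0
    for t in range(H):
        mean = P @ V[t + 1]             # E[V*_{t+1}(s+)] for every (s, a)
        second = P @ (V[t + 1] ** 2)    # E[V*_{t+1}(s+)^2] for every (s, a)
        worst = max(worst, float((second - mean ** 2).max()))
    return worst

# Toy example: 3 states, 2 actions, horizon 5, random dynamics and rewards.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # each P[s, a, :] is a distribution
R = rng.uniform(0, 1, size=(3, 2))
print("Q* estimate:", max_nextstate_variance(P, R, H=5))
```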
Long Horizon MDPs
- Standard setting: r ∈ [0, 1]
- Goal MDP setting*: r ≥ 0, ∑_{t=1}^{H} r_t ≤ 1 (*this is a more general setting)
COLT conjecture of Jiang & Agarwal, 2018: any algorithm must suffer ~H dependence in terms of sample complexity and regret in the Goal MDP setting.
Our algorithm yields no horizon dependence in the regret bound for the setting of the COLT conjecture, without being informed of the setting.
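As a side illustration (mine, not from the poster; the (S, A, S) transition tensor and (S, A) reward matrix layout follow the sketch above), the Goal MDP normalization can be checked mechanically for a tabular MDP: rewards must be non-negative, and the largest total reward attainable along any positive-probability trajectory must not exceed 1.

```python
import numpy as np

def satisfies_goal_mdp_normalization(P, R, H, tol=1e-9):
    """Check r >= 0 and sum_t r_t <= 1 along every positive-probability trajectory."""
    if (R < -tol).any():
        return False
    S, A, _ = P.shape
    # best[s] = largest total reward collectable from state s with t steps to go
    best = np.zeros(S)
    for _ in range(H):
        # For each (s, a): reward now plus the best continuation over reachable successors.
        reachable_best = np.where(P > tol, best[None, None, :], -np.inf).max(axis=2)
        best = (R + reachable_best).max(axis=1)
    return bool((best <= 1.0 + tol).all())

# Example: a chain where only the transition out of state 1 pays reward 1.
P = np.zeros((3, 2, 3))
P[0, :, 1] = 1.0; P[1, :, 2] = 1.0; P[2, :, 2] = 1.0
R = np.zeros((3, 2)); R[1, :] = 1.0
print(satisfies_goal_mdp_normalization(P, R, H=5))   # True: at most 1 per episode
```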
Effect of MDP Stochasticity
Stochasticity in the transition dynamics:
- Deterministic MDP: Õ(SAH²)
- Bandit-like structure: Õ(√(SAT) + [...])
- Hard instances of the lower bound: Õ(√(HSAT) + [...])
Our algorithm matches, in its dominant terms, the best performance for each setting.
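These regimes are driven by the variance-adaptive exploration bonus mentioned under the main result: its dominant term tracks the empirical variance of the next-state values, so it nearly vanishes when the dynamics are (close to) deterministic. Below is a simplified empirical-Bernstein-style sketch (my illustration, not the exact bonus used by the algorithm; the function name and constants are invented).

```python
import numpy as np

def variance_adaptive_bonus(next_state_counts, V_next, n_sa, H, log_term=5.0):
    """A simplified empirical-Bernstein-style bonus for one (s, a) pair.

    next_state_counts: observed counts of each successor state (length S).
    V_next: current value estimate at the next step (length S).
    n_sa: number of visits to (s, a).  H: horizon (bounds the value range).
    """
    n = max(n_sa, 1)
    p_hat = next_state_counts / n
    mean = p_hat @ V_next
    var = p_hat @ (V_next ** 2) - mean ** 2       # empirical Var_{s+ ~ p_hat} V(s+)
    # Dominant term scales with the next-state value variance; the 1/n correction
    # accounts for estimating that variance and the mean from n samples.
    return np.sqrt(2.0 * var * log_term / n) + 7.0 * H * log_term / (3.0 * n)

# In a deterministic MDP every (s, a) has a single successor, so var == 0 and only
# the lower-order 1/n term remains -- consistent with the deterministic row above.
counts_det, counts_stoch = np.array([10, 0, 0]), np.array([5, 3, 2])
V = np.array([0.0, 2.0, 4.0])
print(variance_adaptive_bonus(counts_det, V, n_sa=10, H=5))
print(variance_adaptive_bonus(counts_stoch, V, n_sa=10, H=5))
```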
Related Work (infinite horizon)
- In mixing domains: (Talebi et al., 2018), (Ortner, 2018)
- May not improve over worst case: (Maillard et al., 2014)
- With domain knowledge: [REGAL] (Bartlett et al., 2010), [SCAL] (Fruit et al., 2018)

Conclusion
- Episodic tabular MDP instance-dependent bound without knowledge of the environment
- Insights into the hardness of RL; provable improvements in many settings of interest: near-deterministic MDPs, bandit-like structure, long-horizon MDPs, limited range of the optimal value function, limited variability of the value function among successor states