
Breaking the sample size barrier in reinforcement learning via model-based ("plug-in") methods
Yuxin Chen (EE, Princeton University), with Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), and Yuting Wei (CMU)


  1. Breaking the sample size barrier in reinforcement learning via model-based ("plug-in") methods. Yuxin Chen, EE, Princeton University

  2. Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), Yuting Wei (CMU Statistics). "Breaking the sample size barrier in model-based reinforcement learning with a generative model," G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020

  3. Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), Yuting Wei (Berkeley Stat Ph.D.). "Breaking the sample size barrier in model-based reinforcement learning with a generative model," G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020

  4. Reinforcement learning (RL)

  5. RL challenges
     In RL, an agent learns by interacting with an environment
     • unknown or changing environments
     • delayed rewards or feedback
     • enormous state and action space
     • nonconvexity

  6. Sample efficiency
     Collecting data samples might be expensive or time-consuming (e.g., clinical trials, online ads)

  7. Sample efficiency
     Collecting data samples might be expensive or time-consuming (e.g., clinical trials, online ads)
     Calls for design of sample-efficient RL algorithms!

  8. Background: Markov decision processes

  9. Markov decision process (MDP)
     • S: state space
     • A: action space

  10. Markov decision process (MDP)
     • S: state space
     • A: action space
     • r(s, a) ∈ [0, 1]: immediate reward

  11. Markov decision process (MDP)
     • S: state space
     • A: action space
     • r(s, a) ∈ [0, 1]: immediate reward
     • π(·|s): policy (or action selection rule)

  12. Markov decision process (MDP)
     • S: state space
     • A: action space
     • r(s, a) ∈ [0, 1]: immediate reward
     • π(·|s): policy (or action selection rule)
     • P(·|s, a): unknown transition probabilities
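
For concreteness, here is a minimal Python sketch of how the tabular quantities above (S, A, r, P, and a policy π) can be stored as arrays. All sizes and values below are made-up placeholders, not anything from the talk.

    import numpy as np

    # Tabular MDP sketch (illustrative placeholders only):
    # S states, A actions, rewards r[s, a] in [0, 1],
    # transition probabilities P[s, a, s'] = P(s' | s, a).
    S, A = 5, 3
    rng = np.random.default_rng(0)

    r = rng.uniform(0.0, 1.0, size=(S, A))     # immediate rewards r(s, a)
    P = rng.uniform(size=(S, A, S))
    P /= P.sum(axis=2, keepdims=True)          # each row P(. | s, a) sums to 1

    pi = np.full((S, A), 1.0 / A)              # a (randomized) policy pi(a | s)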

  13. Value function
     Value of policy π: long-term discounted reward
     V^π(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ],   ∀ s ∈ S

  14. Value function
     Value of policy π: long-term discounted reward
     V^π(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ],   ∀ s ∈ S
     • (a_0, s_1, a_1, s_2, a_2, ...): generated under policy π

  15. Value function
     Value of policy π: long-term discounted reward
     V^π(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ],   ∀ s ∈ S
     • (a_0, s_1, a_1, s_2, a_2, ...): generated under policy π
     • γ ∈ [0, 1): discount factor
       ◦ take γ → 1 to approximate long-horizon MDPs
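
One way to read the definition above: V^π(s) is the average discounted return of trajectories started at s and rolled out under π. A Monte Carlo sketch of that reading follows; the toy MDP, the uniform policy, and the truncation horizon are all illustrative assumptions.

    import numpy as np

    # Monte Carlo sketch of V^pi(s): average the discounted return
    # sum_t gamma^t r(s_t, a_t) over simulated trajectories.
    rng = np.random.default_rng(1)
    S, A, gamma = 5, 3, 0.9
    r = rng.uniform(size=(S, A))
    P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
    pi = np.full((S, A), 1.0 / A)                # uniform policy as a stand-in

    def estimate_value(s0, n_traj=2000, horizon=200):
        total = 0.0
        for _ in range(n_traj):
            s, ret, disc = s0, 0.0, 1.0
            for _ in range(horizon):             # truncate the infinite sum
                a = rng.choice(A, p=pi[s])
                ret += disc * r[s, a]
                disc *= gamma
                s = rng.choice(S, p=P[s, a])
            total += ret
        return total / n_traj

    print(estimate_value(s0=0))                  # approximates V^pi(0)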

  16. Optimal policy and optimal values
     • Optimal policy π⋆: maximizing the value function

  17. Optimal policy and optimal values
     • Optimal policy π⋆: maximizing the value function
     • Optimal values: V⋆ := V^π⋆

  18. When the model is known ...
     [diagram: MDP specification (truth: P, r) → planning oracle (e.g., policy iteration) → π⋆]
     Planning: computing the optimal policy π⋆ given the MDP specification
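
For reference, a sketch of one such planning oracle, policy iteration, applied to a known tabular model (P, r). The toy model below is an assumption; the evaluate-then-improve loop is the standard algorithm named on the slide.

    import numpy as np

    # Policy iteration sketch for a known tabular MDP (P, r).
    rng = np.random.default_rng(2)
    S, A, gamma = 5, 3, 0.9
    r = rng.uniform(size=(S, A))
    P = rng.dirichlet(np.ones(S), size=(S, A))

    pi = np.zeros(S, dtype=int)                  # arbitrary initial deterministic policy
    for _ in range(100):                         # converges after a handful of passes
        P_pi = P[np.arange(S), pi]               # |S| x |S| transition matrix under pi
        r_pi = r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # exact policy evaluation
        Q = r + gamma * P @ V                    # Q[s, a] = r(s, a) + gamma * E[V(s')]
        new_pi = Q.argmax(axis=1)                # greedy policy improvement
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi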

  19. When the model is unknown ... Need to learn the optimal policy from samples w/o model specification

  20. This talk: RL with a generative model / simulator (Kearns & Singh ’99)
     For each state-action pair (s, a), collect N samples {(s, a, s′_(i))}_{1 ≤ i ≤ N}
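
A sketch of the generative-model sampling protocol: for every pair (s, a), draw N independent next states from P(·|s, a). The simulator here is a toy stand-in for whatever environment is being modeled.

    import numpy as np

    # Generative-model sampling sketch: samples[s, a, i] is the i-th sampled next state
    # drawn from P(. | s, a). The true model is an illustrative assumption.
    rng = np.random.default_rng(3)
    S, A, N = 5, 3, 100
    P_true = rng.dirichlet(np.ones(S), size=(S, A))

    samples = np.empty((S, A, N), dtype=int)
    for s in range(S):
        for a in range(A):
            samples[s, a] = rng.choice(S, size=N, p=P_true[s, a])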

  21. Question: how many samples are sufficient to learn an ε-optimal policy π̂?

  22. Question: how many samples are sufficient to learn an ε-optimal policy π̂?
     ε-optimal: ∀ s: V^π̂(s) ≥ V⋆(s) − ε

  23. An incomplete list of prior art
     • Kearns & Singh ’99
     • Kakade ’03
     • Kearns, Mansour & Ng ’02
     • Azar, Munos & Kappen ’12
     • Azar, Munos, Ghavamzadeh & Kappen ’13
     • Sidford, Wang, Wu, Yang & Ye ’18
     • Sidford, Wang, Wu & Ye ’18
     • Wang ’17
     • Agarwal, Kakade & Yang ’19
     • Wainwright ’19a
     • Wainwright ’19b
     • Pananjady & Wainwright ’20
     • Yang & Wang ’19
     • Khamaru, Pananjady, Ruan, Wainwright & Jordan ’20
     • Mou, Li, Wainwright, Bartlett & Jordan ’20
     • ...

  24. An even shorter list of prior art
     algorithm                                     | sample size range    | sample complexity  | ε-range
     empirical QVI (Azar et al. ’13)               | [|S|²|A|/(1−γ)², ∞)  | |S||A|/((1−γ)³ε²)  | (0, 1/√((1−γ)|S|)]
     sublinear randomized VI (Sidford et al. ’18a) | [|S||A|/(1−γ)², ∞)   | |S||A|/((1−γ)⁴ε²)  | (0, 1/(1−γ)]
     variance-reduced QVI (Sidford et al. ’18b)    | [|S||A|/(1−γ)³, ∞)   | |S||A|/((1−γ)³ε²)  | (0, 1]
     empirical MDP + planning (Agarwal et al. ’19) | [|S||A|/(1−γ)², ∞)   | |S||A|/((1−γ)³ε²)  | (0, 1/√(1−γ)]
     (see also Wainwright ’19, for estimating optimal values)



  27. All prior theory requires sample size > |S||A|/(1−γ)²  (the "sample size barrier")

  28. Is it possible to close the gap?

  29. Two approaches
     Model-based approach ("plug-in")
     1. build an empirical estimate P̂ for P
     2. planning based on the empirical P̂

  30. Two approaches
     Model-based approach ("plug-in")
     1. build an empirical estimate P̂ for P
     2. planning based on the empirical P̂
     Model-free approach (e.g. Q-learning, SARSA): learning w/o estimating the model explicitly (see the Q-learning sketch below)
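
For contrast with the plug-in route, a hedged sketch of a model-free method: synchronous Q-learning that consumes one fresh generative-model sample per (s, a) per iteration and never forms an explicit estimate of P. The step-size schedule and toy model are illustrative assumptions.

    import numpy as np

    # Synchronous Q-learning sketch (model-free): update Q(s, a) from sampled
    # transitions directly, without building P_hat.
    rng = np.random.default_rng(7)
    S, A, gamma, T = 5, 3, 0.9, 5000
    P_true = rng.dirichlet(np.ones(S), size=(S, A))
    r = rng.uniform(size=(S, A))

    Q = np.zeros((S, A))
    for t in range(1, T + 1):
        eta = 1.0 / (1.0 + (1 - gamma) * t)      # illustrative rescaled-linear step size
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P_true[s, a])        # one fresh sample per (s, a)
                target = r[s, a] + gamma * Q[s_next].max()    # sampled Bellman backup
                Q[s, a] = (1 - eta) * Q[s, a] + eta * target
    pi_hat = Q.argmax(axis=1)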

  32. Model estimation
     Sampling: for each (s, a), collect N ind. samples {(s, a, s′_(i))}_{1 ≤ i ≤ N}

  33. Model estimation
     Sampling: for each (s, a), collect N ind. samples {(s, a, s′_(i))}_{1 ≤ i ≤ N}
     Empirical estimates: estimate P(s′|s, a) by the empirical frequency (1/N) Σ_{i=1}^N 1{s′_(i) = s′}
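
The empirical-frequency estimate in code, for concreteness: count how often each next state appears among the N samples drawn for (s, a) and divide by N. The true model used to generate the samples is a toy assumption.

    import numpy as np

    # Empirical estimate of the transition model:
    #   P_hat(s' | s, a) = (1/N) * #{ i : s'_(i) = s' }.
    rng = np.random.default_rng(4)
    S, A, N = 5, 3, 100
    P_true = rng.dirichlet(np.ones(S), size=(S, A))

    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=N, p=P_true[s, a])      # N samples from P(. | s, a)
            P_hat[s, a] = np.bincount(next_states, minlength=S) / N  # empirical frequencies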

  34. Model-based (plug-in) estimator (Azar et al. ’13, Agarwal et al. ’19, Pananjady et al. ’20)
     [diagram: empirical MDP (P̂, r) → planning oracle (e.g., policy iteration) → π̂⋆]
     Planning based on the empirical MDP

  35. Our method: plug-in estimator + perturbation (Li, Wei, Chi, Gu, Chen ’20)
     [diagram: perturb rewards, then empirical MDP (P̂, r_p) → planning oracle (e.g., policy iteration) → π̂⋆_p]
     Run planning algorithms based on the empirical MDP with slightly perturbed rewards
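
A sketch of the perturbed plug-in pipeline described on the slide: estimate P̂, add a small random perturbation to the rewards, then plan on the perturbed empirical MDP. The perturbation scale and the value-iteration planner below are illustrative choices, not the exact ones analyzed in the paper.

    import numpy as np

    # Perturbed plug-in sketch: (1) empirical model, (2) perturbed rewards, (3) plan.
    rng = np.random.default_rng(5)
    S, A, N, gamma, xi = 5, 3, 200, 0.9, 1e-3

    P_true = rng.dirichlet(np.ones(S), size=(S, A))
    r = rng.uniform(size=(S, A))

    # Step 1: empirical model from N samples per (s, a).
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            counts = np.bincount(rng.choice(S, size=N, p=P_true[s, a]), minlength=S)
            P_hat[s, a] = counts / N

    # Step 2: slightly perturbed rewards (perturbation scale xi is an arbitrary illustrative choice).
    r_p = r + rng.uniform(0.0, xi, size=(S, A))

    # Step 3: value iteration on the perturbed empirical MDP.
    Q = np.zeros((S, A))
    for _ in range(1000):
        Q = r_p + gamma * P_hat @ Q.max(axis=1)
    pi_hat_p = Q.argmax(axis=1)                  # policy returned by the perturbed plug-in approach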

  36. Challenges in the sample-starved regime
     truth: P;  empirical estimate: P̂ ∈ R^{|S||A|×|S|}
     • Can’t recover P faithfully if sample size ≪ |S|²|A|!

  37. Challenges in the sample-starved regime
     truth: P;  empirical estimate: P̂ ∈ R^{|S||A|×|S|}
     • Can’t recover P faithfully if sample size ≪ |S|²|A|!
     • Can we trust our policy estimate when reliable model estimation is infeasible?

  38. Main result
     Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the
     perturbed empirical MDP achieves ‖V^π̂⋆_p − V⋆‖_∞ ≤ ε with sample complexity at most
        Õ( |S||A| / ((1−γ)³ ε²) )

  39. Main result
     Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the
     perturbed empirical MDP achieves ‖V^π̂⋆_p − V⋆‖_∞ ≤ ε with sample complexity at most
        Õ( |S||A| / ((1−γ)³ ε²) )
     • π̂⋆_p: obtained by empirical QVI or PI within Õ(1/(1−γ)) iterations

  40. Main result
     Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the
     perturbed empirical MDP achieves ‖V^π̂⋆_p − V⋆‖_∞ ≤ ε with sample complexity at most
        Õ( |S||A| / ((1−γ)³ ε²) )
     • π̂⋆_p: obtained by empirical QVI or PI within Õ(1/(1−γ)) iterations
     • Minimax lower bound: Ω̃( |S||A| / ((1−γ)³ ε²) ) (Azar et al. ’13)
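
A back-of-the-envelope reading of the bound, dropping constants and log factors; the problem sizes plugged in below are arbitrary illustrative numbers, not from the talk.

    # Sample-count plug-in for the bound |S||A| / ((1 - gamma)^3 * eps^2),
    # ignoring constants and logarithmic factors.
    S_size, A_size, gamma, eps = 1000, 10, 0.99, 0.1

    total_samples = S_size * A_size / ((1 - gamma) ** 3 * eps ** 2)
    per_pair = total_samples / (S_size * A_size)   # = 1 / ((1 - gamma)^3 * eps^2)
    print(f"~{total_samples:.1e} samples in total, ~{per_pair:.1e} per (s, a)")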


  42. Analysis

  43. Notation and Bellman equation
     • V^π: true value function under policy π
       ◦ Bellman equation: V^π = (I − γ P^π)^{−1} r^π

  44. Notation and Bellman equation
     • V^π: true value function under policy π
       ◦ Bellman equation: V^π = (I − γ P^π)^{−1} r^π
     • V̂^π: estimate of value function under policy π
       ◦ Bellman equation: V̂^π = (I − γ P̂^π)^{−1} r^π

  45. Notation and Bellman equation
     • V^π: true value function under policy π
       ◦ Bellman equation: V^π = (I − γ P^π)^{−1} r^π
     • V̂^π: estimate of value function under policy π
       ◦ Bellman equation: V̂^π = (I − γ P̂^π)^{−1} r^π
     • π⋆: optimal policy w.r.t. the true value function
     • π̂⋆: optimal policy w.r.t. the empirical value function
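
The two Bellman identities above amount to two linear solves, one with the true P^π and one with the empirical P̂^π. A hedged sketch follows; the toy model, the fixed policy, and the sample size are assumptions for illustration.

    import numpy as np

    # Policy evaluation in matrix form:
    #   V^pi     = (I - gamma * P^pi)^{-1} r^pi
    #   V_hat^pi = (I - gamma * P_hat^pi)^{-1} r^pi
    rng = np.random.default_rng(6)
    S, A, N, gamma = 5, 3, 100, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))
    r = rng.uniform(size=(S, A))
    pi = rng.integers(A, size=S)                 # a fixed deterministic policy

    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            counts = np.bincount(rng.choice(S, size=N, p=P[s, a]), minlength=S)
            P_hat[s, a] = counts / N

    P_pi, P_hat_pi = P[np.arange(S), pi], P_hat[np.arange(S), pi]
    r_pi = r[np.arange(S), pi]
    V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)          # true value of pi
    V_hat_pi = np.linalg.solve(np.eye(S) - gamma * P_hat_pi, r_pi)  # empirical value of pi
    print(np.max(np.abs(V_hat_pi - V_pi)))       # ell_infinity gap between the two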
