Proof ideas

Elementary decomposition:
$V^{\star} - V^{\widehat{\pi}^{\star}} = \big(V^{\pi^{\star}} - \widehat{V}^{\pi^{\star}}\big) + \big(\widehat{V}^{\pi^{\star}} - \widehat{V}^{\widehat{\pi}^{\star}}\big) + \big(\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}\big) \leq \big(V^{\pi^{\star}} - \widehat{V}^{\pi^{\star}}\big) + 0 + \big(\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}\big)$

• Step 1: control $V^{\pi} - \widehat{V}^{\pi}$ for a fixed $\pi$ (Bernstein inequality + high-order decomposition)
• Step 2: extend it to control $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$ (decouple statistical dependence)
Step 1: improved theory for policy evaluation

Theorem 2 (Li, Wei, Chi, Gu, Chen '20)
Fix any policy $\pi$. For $0 < \varepsilon \leq \frac{1}{1-\gamma}$, the plug-in estimator $\widehat{V}^{\pi}$ obeys $\|\widehat{V}^{\pi} - V^{\pi}\|_{\infty} \leq \varepsilon$ with sample complexity at most
$\widetilde{O}\Big(\frac{|\mathcal{S}|}{(1-\gamma)^{3}\varepsilon^{2}}\Big)$

• key idea 1: high-order decomposition of $\widehat{V}^{\pi} - V^{\pi}$
• minimax optimal (Azar et al. '13, Pananjady & Wainwright '19)
• breaks the sample size barrier $\frac{|\mathcal{S}|}{(1-\gamma)^{2}}$ in prior work (Agarwal et al. '19, Pananjady & Wainwright '19, Khamaru et al. '20)
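To make the plug-in estimator concrete, here is a minimal numpy sketch of model-based policy evaluation with a generative model: draw N transitions per state-action pair, form the empirical kernel $\widehat{P}$, and solve the empirical Bellman equation $(I - \gamma \widehat{P}_{\pi})\widehat{V}^{\pi} = r_{\pi}$. The sampler interface and the name N are illustrative assumptions, not details from the paper.

```python
import numpy as np

def plug_in_policy_evaluation(sampler, r, pi, gamma, N, num_states, num_actions):
    """Model-based ("plug-in") evaluation of a fixed policy pi.

    sampler(s, a, N): draws N next states from P(.|s, a)  (generative model, assumed)
    r:  (S, A) reward table;  pi: (S, A) policy whose rows sum to 1
    Returns the plug-in estimate of V^pi (length-S vector).
    """
    S, A = num_states, num_actions
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            next_states = sampler(s, a, N)               # N i.i.d. draws from P(.|s,a)
            counts = np.bincount(next_states, minlength=S)
            P_hat[s, a] = counts / N                     # empirical transition kernel

    # Policy-induced quantities: P_pi[s, s'] and r_pi[s]
    P_pi = np.einsum('sa,sat->st', pi, P_hat)
    r_pi = np.sum(pi * r, axis=1)

    # Solve (I - gamma * P_pi) V = r_pi, i.e. V^pi of the empirical MDP
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```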
Step 2: controlling $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$

key idea 2: a leave-one-out argument to decouple statistical dependency
— inspired by Agarwal et al. '19 but different . . .

Caveat: requires the optimal policy to stand out from other policies
Step 2: controlling $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$

key idea 3: tie-breaking via perturbation
• perturb rewards $r$ by a tiny bit $\Longrightarrow$ $\widehat{\pi}^{\star}_{\mathrm{p}}$
Summary

Model-based RL is minimax optimal and does not suffer from a sample size barrier!

future directions
• finite-horizon episodic MDPs
• Markov games
Story 2: sample complexity of (asynchronous) Q-learning on Markovian samples

Gen Li (Tsinghua EE), Yuantao Gu (Tsinghua EE), Yuting Wei (CMU Stats), Yuejie Chi (CMU ECE)
Model-based vs. model-free RL

Model-based approach ("plug-in")
1. build an empirical estimate $\widehat{P}$ for $P$
2. planning based on the empirical $\widehat{P}$

Model-free approach — learning w/o modeling & estimating the environment explicitly
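A minimal sketch of the planning step of the model-based pipeline: once $\widehat{P}$ has been estimated (e.g. as in the earlier sketch), run value iteration on the empirical MDP and read off a greedy policy. Function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def plan_on_empirical_model(P_hat, r, gamma, num_iters=1000):
    """Planning on the empirical MDP (P_hat, r) via value iteration.
    P_hat: (S, A, S) empirical transitions;  r: (S, A) rewards."""
    S, A, _ = P_hat.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)                 # greedy value of the current Q
        Q = r + gamma * P_hat @ V         # empirical Bellman update
    pi_hat = Q.argmax(axis=1)             # greedy policy of the empirical MDP
    return Q, pi_hat
```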
A classical example: Q-learning on Markovian samples
Markovian samples and behavior policy

Observed: a Markovian trajectory $\{s_t, a_t, r_t\}_{t \geq 0}$ generated by the behavior policy $\pi_{\mathrm{b}}$

Goal: learn the optimal values $V^{\star}$ and $Q^{\star}$ based on the sample trajectory
Markovian samples and behavior policy

Key quantities of the sample trajectory
• minimum state-action occupancy probability $\mu_{\min} := \min_{s,a} \mu_{\pi_{\mathrm{b}}}(s,a)$, where $\mu_{\pi_{\mathrm{b}}}$ is the stationary distribution of the trajectory
• mixing time: $t_{\mathrm{mix}}$
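For a concrete view of $\mu_{\min}$, the sketch below computes the stationary state-action occupancy of the chain induced by $\pi_{\mathrm{b}}$ on a known tabular MDP and takes its minimum entry (estimating $t_{\mathrm{mix}}$ is more delicate and is not shown). It assumes the induced chain is ergodic; names are illustrative.

```python
import numpy as np

def min_occupancy(P, pi_b):
    """Minimum state-action occupancy mu_min under behavior policy pi_b.
    P: (S, A, S) transition kernel;  pi_b: (S, A) behavior policy."""
    S, A, _ = P.shape
    P_b = np.einsum('sa,sat->st', pi_b, P)        # state chain induced by pi_b
    # stationary distribution: left eigenvector of P_b for eigenvalue 1
    evals, evecs = np.linalg.eig(P_b.T)
    mu_s = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    mu_s = mu_s / mu_s.sum()                      # normalize (fixes the sign too)
    mu_sa = mu_s[:, None] * pi_b                  # stationary state-action occupancy
    return mu_sa.min(), mu_sa
```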
Q-learning: a classical model-free algorithm (Watkins & Dayan)

Stochastic approximation (Robbins & Monro '51) for solving the Bellman equation $Q = \mathcal{T}(Q)$
Aside: Bellman optimality principle

Bellman operator
$\mathcal{T}(Q)(s,a) := \underbrace{r(s,a)}_{\text{immediate reward}} + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\underbrace{\max_{a' \in \mathcal{A}} Q(s', a')}_{\text{next state's value}}\Big]$

• one-step look-ahead
• Bellman equation: $Q^{\star}$ is the unique solution to $\mathcal{T}(Q^{\star}) = Q^{\star}$ — Richard Bellman
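In code, the one-step look-ahead is just an expectation over the next state followed by a max over actions. A minimal tabular sketch (variable names are illustrative):

```python
import numpy as np

def bellman_operator(Q, P, r, gamma):
    """Bellman optimality operator T(Q) for a tabular MDP.
    Q: (S, A), P: (S, A, S), r: (S, A)."""
    V = Q.max(axis=1)            # max_{a'} Q(s', a') at every next state s'
    return r + gamma * P @ V     # r(s,a) + gamma * E_{s'~P(.|s,a)}[max_{a'} Q(s',a')]

# Q* is the unique fixed point; iterating T converges by gamma-contraction in the sup norm.
```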
Q-learning: a classical model-free algorithm (Watkins & Dayan)

Stochastic approximation for solving the Bellman equation $Q = \mathcal{T}(Q)$:
$Q_{t+1}(s_t, a_t) = (1 - \eta_t) Q_t(s_t, a_t) + \eta_t \mathcal{T}_t(Q_t)(s_t, a_t), \quad t \geq 0$
(only the $(s_t, a_t)$-th entry is updated)

where
$\mathcal{T}_t(Q)(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
$\mathcal{T}(Q)(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\max_{a'} Q(s', a')\big]$
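A minimal sketch of this update along a Markovian trajectory generated by $\pi_{\mathrm{b}}$; the environment interface (env.step), the initial state, and the constant stepsize are illustrative choices rather than prescriptions from the analysis.

```python
import numpy as np

def async_q_learning(env, pi_b, num_states, num_actions, gamma, eta, T):
    """Asynchronous Q-learning on a single Markovian trajectory.
    env.step(s, a) -> (next_state, reward);  pi_b: (S, A) behavior policy."""
    Q = np.zeros((num_states, num_actions))
    s = 0                                                    # arbitrary initial state
    for t in range(T):
        a = np.random.choice(num_actions, p=pi_b[s])         # act with the behavior policy
        s_next, reward = env.step(s, a)
        target = reward + gamma * Q[s_next].max()            # empirical Bellman target T_t(Q_t)
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target         # update only the (s_t, a_t) entry
        s = s_next
    return Q
```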
Q-learning on Markovian samples

• asynchronous: only a single entry is updated each iteration
  ◦ resembles Markov-chain coordinate descent
• off-policy: target policy $\pi^{\star} \neq$ behavior policy $\pi_{\mathrm{b}}$
A highly incomplete list of prior work

• Watkins, Dayan '92 • Tsitsiklis '94 • Jaakkola, Jordan, Singh '94 • Szepesvári '98 • Kearns, Singh '99 • Borkar, Meyn '00 • Even-Dar, Mansour '03 • Beck, Srikant '12 • Chi, Zhu, Bubeck, Jordan '18 • Shah, Xie '18 • Lee, He '18 • Wainwright '19 • Chen, Zhang, Doan, Maguluri, Clarke '19 • Yang, Wang '19 • Du, Lee, Mahajan, Wang '20 • Chen, Maguluri, Shakkottai, Shanmugam '20 • Qu, Wierman '20 • Devraj, Meyn '20 • Weng, Gupta, He, Ying, Srikant '20 • ...
What is the sample complexity of (async) Q-learning?
Prior art: async Q-learning

Question: how many samples are needed to ensure $\|\widehat{Q} - Q^{\star}\|_{\infty} \leq \varepsilon$?

| paper | sample complexity | learning rate |
| Even-Dar & Mansour '03 | $\frac{(t_{\mathrm{cover}})^{\frac{1}{1-\gamma}}}{(1-\gamma)^{4}\varepsilon^{2}}$ | linear: $\frac{1}{t}$ |
| Even-Dar & Mansour '03 | $\big(\frac{t_{\mathrm{cover}}^{1+3\omega}}{(1-\gamma)^{4}\varepsilon^{2}}\big)^{\frac{1}{\omega}} + \big(\frac{t_{\mathrm{cover}}}{1-\gamma}\big)^{\frac{1}{1-\omega}}$ | poly: $\frac{1}{t^{\omega}}$, $\omega \in (\frac{1}{2}, 1)$ |
| Beck & Srikant '12 | $\frac{t_{\mathrm{cover}}^{3}|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{5}\varepsilon^{2}}$ | constant |
| Qu & Wierman '20 | $\frac{t_{\mathrm{mix}}}{\mu_{\min}^{2}(1-\gamma)^{5}\varepsilon^{2}}$ | rescaled linear |

If we take $\mu_{\min} \asymp \frac{1}{|\mathcal{S}||\mathcal{A}|}$ and $t_{\mathrm{cover}} \asymp \frac{t_{\mathrm{mix}}}{\mu_{\min}}$:
All prior results require a sample size of at least $t_{\mathrm{mix}} |\mathcal{S}|^{2} |\mathcal{A}|^{2}$!
Main result: $\ell_{\infty}$-based sample complexity

Theorem 3 (Li, Wei, Chi, Gu, Chen '20)
For any $0 < \varepsilon \leq \frac{1}{1-\gamma}$, the sample complexity of async Q-learning to yield $\|\widehat{Q} - Q^{\star}\|_{\infty} \leq \varepsilon$ is at most (up to some log factor)
$\frac{1}{\mu_{\min}(1-\gamma)^{5}\varepsilon^{2}} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$

• Improves upon prior art by at least $|\mathcal{S}||\mathcal{A}|$!
— prior art: $\frac{t_{\mathrm{mix}}}{\mu_{\min}^{2}(1-\gamma)^{5}\varepsilon^{2}}$ (Qu & Wierman '20)
Effect of mixing time on sample complexity

$\frac{1}{\mu_{\min}(1-\gamma)^{5}\varepsilon^{2}} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$

• reflects the cost taken to reach the steady state
• one-time expense (almost independent of $\varepsilon$) — it becomes amortized as the algorithm runs
— prior art: $\frac{t_{\mathrm{mix}}}{\mu_{\min}^{2}(1-\gamma)^{5}\varepsilon^{2}}$ (Qu & Wierman '20)
Learning rates

Our choice: constant stepsize $\eta_t \equiv \min\big\{\frac{(1-\gamma)^{4}\varepsilon^{2}}{\gamma^{2}}, \frac{1}{t_{\mathrm{mix}}}\big\}$

• Qu & Wierman '20: rescaled linear $\eta_t = \frac{1}{t + \max\{\frac{1}{\mu_{\min}(1-\gamma)}, \, t_{\mathrm{mix}}\}}$
• Beck & Srikant '12: constant $\eta_t \equiv \frac{(1-\gamma)^{4}\varepsilon^{2}}{|\mathcal{S}||\mathcal{A}| \, t_{\mathrm{cover}}^{2}}$ (too conservative)
• Even-Dar & Mansour '03: polynomial $\eta_t = t^{-\omega}$ ($\omega \in (\frac{1}{2}, 1]$)
Minimax lower bound

minimax lower bound (Azar et al. '13): $\frac{1}{\mu_{\min}(1-\gamma)^{3}\varepsilon^{2}}$
async Q-learning (ignoring dependency on $t_{\mathrm{mix}}$): $\frac{1}{\mu_{\min}(1-\gamma)^{5}\varepsilon^{2}}$

Can we improve the dependency on the discount complexity $\frac{1}{1-\gamma}$?
One strategy: variance reduction
— inspired by Johnson & Zhang '13, Wainwright '19

Variance-reduced Q-learning updates
$Q_t(s_t, a_t) = (1-\eta) Q_{t-1}(s_t, a_t) + \eta \big(\mathcal{T}_t(Q_{t-1}) - \mathcal{T}_t(\overline{Q}) + \widetilde{\mathcal{T}}(\overline{Q})\big)(s_t, a_t)$
(use $\overline{Q}$ to help reduce variability)

• $\overline{Q}$: some reference Q-estimate
• $\widetilde{\mathcal{T}}$: empirical Bellman operator (using a batch of samples)
Variance-reduced Q-learning
— inspired by Johnson & Zhang '13, Sidford et al. '18, Wainwright '19

for each epoch
1. update $\overline{Q}$ and $\widetilde{\mathcal{T}}(\overline{Q})$
2. run variance-reduced Q-learning updates
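A sketch of one such epoch: the batch used to form the reference $\widetilde{\mathcal{T}}(\overline{Q})$, the epoch length, and the stepsize below are placeholder choices rather than the tuned values from the analysis, and the env/batch interfaces are assumptions.

```python
import numpy as np

def empirical_bellman(Q, batch, gamma, num_states, num_actions):
    """Monte-Carlo estimate of T(Q) from a batch of transitions (s, a, r, s')."""
    T_tilde = np.zeros((num_states, num_actions))
    counts = np.zeros((num_states, num_actions))
    for s, a, r, s_next in batch:
        T_tilde[s, a] += r + gamma * Q[s_next].max()
        counts[s, a] += 1
    return T_tilde / np.maximum(counts, 1)

def vr_q_learning_epoch(env, pi_b, Q, Q_bar, batch, gamma, eta, epoch_len, s0=0):
    """One epoch of (async) variance-reduced Q-learning on Markovian samples."""
    S, A = Q.shape
    T_tilde_bar = empirical_bellman(Q_bar, batch, gamma, S, A)   # reference T~(Q_bar)
    s = s0
    for _ in range(epoch_len):
        a = np.random.choice(A, p=pi_b[s])
        s_next, r = env.step(s, a)
        # T_t(Q_{t-1}) - T_t(Q_bar) + T~(Q_bar), evaluated at the visited entry (s, a)
        target = (r + gamma * Q[s_next].max()) \
                 - (r + gamma * Q_bar[s_next].max()) \
                 + T_tilde_bar[s, a]
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        s = s_next
    return Q, s
```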
Main result: $\ell_{\infty}$-based sample complexity

Theorem 4 (Li, Wei, Chi, Gu, Chen '20)
For any $0 < \varepsilon \leq 1$, the sample complexity for (async) variance-reduced Q-learning to yield $\|\widehat{Q} - Q^{\star}\|_{\infty} \leq \varepsilon$ is at most on the order of
$\frac{1}{\mu_{\min}(1-\gamma)^{3}\varepsilon^{2}} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$

• more aggressive learning rates: $\eta_t \equiv \min\big\{\frac{(1-\gamma)^{2}}{\gamma^{2}}, \frac{1}{t_{\mathrm{mix}}}\big\}$ (the $(1-\gamma)^{4}$ factor improves to $(1-\gamma)^{2}$)
• minimax-optimal for $0 < \varepsilon \leq 1$
Summary

Sharpens the finite-sample understanding of Q-learning on Markovian data

future directions
• function approximation
• on-policy algorithms like SARSA
• general Markov-chain-based optimization algorithms
Story 3: fast global convergence of entropy-regularized natural policy gradient (NPG) methods

Shicong Cen (CMU ECE), Chen Cheng (Stanford Stats), Yuejie Chi (CMU ECE), Yuting Wei (CMU Stats)
Policy optimization: a major contributor to these successes
Policy gradient (PG) methods

Given an initial state distribution $s \sim \rho$:
$\text{maximize}_{\pi} \quad V^{\pi}(\rho) := \mathbb{E}_{s \sim \rho}[V^{\pi}(s)]$

softmax parameterization: $\pi_{\theta}(a \mid s) = \frac{\exp(\theta(s,a))}{\sum_{a'} \exp(\theta(s,a'))}$

$\text{maximize}_{\theta} \quad V^{\pi_{\theta}}(\rho) := \mathbb{E}_{s \sim \rho}[V^{\pi_{\theta}}(s)]$

PG method (Sutton et al. '00)
$\theta^{(t+1)} = \theta^{(t)} + \eta \nabla_{\theta} V^{\pi_{\theta^{(t)}}}(\rho), \quad t = 0, 1, \cdots$
• $\eta$: learning rate
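For tabular softmax, the policy gradient has a known closed form in terms of the discounted state visitation and the advantage (see, e.g., Agarwal et al. '19); the sketch below evaluates that formula exactly on a known MDP and takes one PG step. Treat it as an illustration under that formula, with illustrative names, not as the paper's implementation.

```python
import numpy as np

def softmax_policy(theta):
    """theta: (S, A) logits -> pi_theta: (S, A) row-stochastic policy."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def exact_pg_step(theta, P, r, rho, gamma, eta):
    """One PG step under softmax parameterization, using the closed-form gradient
    dV(rho)/dtheta(s,a) = d_rho(s) * pi(a|s) * A(s,a) / (1 - gamma)."""
    S, A_dim = r.shape
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.sum(pi * r, axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)        # V^pi
    Q = r + gamma * P @ V                                      # Q^pi
    Adv = Q - V[:, None]                                       # advantage A^pi
    d_rho = (1 - gamma) * np.linalg.solve(
        (np.eye(S) - gamma * P_pi).T, rho)                     # discounted state visitation
    grad = d_rho[:, None] * pi * Adv / (1 - gamma)             # exact policy gradient
    return theta + eta * grad, V @ rho                         # ascent step, current V(rho)
```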
Booster 1: natural policy gradient (NPG)

precondition gradients to improve search directions ... Natural Gradient

⟹ NPG method (Kakade '02)
$\theta^{(t+1)} = \theta^{(t)} + \eta \big(\mathcal{F}^{\rho}_{\theta^{(t)}}\big)^{\dagger} \nabla_{\theta} V^{\pi_{\theta^{(t)}}}(\rho), \quad t = 0, 1, \cdots$
• $\mathcal{F}^{\rho}_{\theta}$: Fisher information matrix $:= \mathbb{E}\big[\nabla_{\theta} \log \pi_{\theta}(a \mid s) \big(\nabla_{\theta} \log \pi_{\theta}(a \mid s)\big)^{\top}\big]$
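Under softmax parameterization, the (unregularized) NPG step is known to simplify to an exponentiated, multiplicative-weights-style update on the policy (Kakade '02; Agarwal et al. '19), which avoids forming $\mathcal{F}_{\theta}$ explicitly. The sketch below implements that known simplification on a known MDP, not a general Fisher-preconditioned update, and its names are illustrative.

```python
import numpy as np

def npg_softmax_step(pi, P, r, gamma, eta):
    """One NPG step in the softmax case:
    pi_new(a|s)  proportional to  pi(a|s) * exp(eta * Q^pi(s,a) / (1 - gamma))."""
    S, A = r.shape
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.sum(pi * r, axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)    # V^pi
    Q = r + gamma * P @ V                                  # Q^pi
    w = pi * np.exp(eta * Q / (1 - gamma))                 # multiplicative-weights update
    return w / w.sum(axis=1, keepdims=True)                # renormalize row-wise
```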