Proof ideas

Elementary decomposition:
$V^{\star} - V^{\widehat{\pi}^{\star}} = \big(V^{\pi^{\star}} - \widehat{V}^{\pi^{\star}}\big) + \big(\widehat{V}^{\pi^{\star}} - \widehat{V}^{\widehat{\pi}^{\star}}\big) + \big(\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}\big) \leq \big(V^{\pi^{\star}} - \widehat{V}^{\pi^{\star}}\big) + 0 + \big(\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}\big)$

• Step 1: control $V^{\pi} - \widehat{V}^{\pi}$ for a fixed $\pi$ (Bernstein inequality + high-order decomposition)
• Step 2: extend it to control $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$ (decouple statistical dependence)
Step 1: improved theory for policy evaluation

Theorem 2 (Li, Wei, Chi, Gu, Chen '20)
Fix any policy $\pi$. For $0 < \varepsilon \leq \frac{1}{1-\gamma}$, the plug-in estimator $\widehat{V}^{\pi}$ obeys $\|\widehat{V}^{\pi} - V^{\pi}\|_{\infty} \leq \varepsilon$ with sample complexity at most
$\widetilde{O}\Big(\frac{|\mathcal{S}|}{(1-\gamma)^{3}\varepsilon^{2}}\Big)$

• key idea 1: high-order decomposition of $\widehat{V}^{\pi} - V^{\pi}$
• minimax optimal (Azar et al. '13, Pananjady & Wainwright '19)
• breaks the sample size barrier $\frac{|\mathcal{S}|}{(1-\gamma)^{2}}$ in prior work (Agarwal et al. '19, Pananjady & Wainwright '19, Khamaru et al. '20)
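To make the plug-in estimator concrete, here is a minimal numpy sketch of model-based policy evaluation with a generative model: draw N transitions per state-action pair, form the empirical kernel $\widehat{P}$, and solve the empirical Bellman equation $(I - \gamma \widehat{P}_{\pi})\widehat{V}^{\pi} = r_{\pi}$. The sampler interface and the name N are illustrative assumptions, not details from the paper.

```python
import numpy as np

def plug_in_policy_evaluation(sampler, r, pi, gamma, N, num_states, num_actions):
    """Model-based ("plug-in") evaluation of a fixed policy pi.

    sampler(s, a, N): draws N next states from P(.|s, a)  (generative model, assumed)
    r:  (S, A) reward table;  pi: (S, A) policy whose rows sum to 1
    Returns the plug-in estimate of V^pi (length-S vector).
    """
    S, A = num_states, num_actions
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            next_states = sampler(s, a, N)               # N i.i.d. draws from P(.|s,a)
            counts = np.bincount(next_states, minlength=S)
            P_hat[s, a] = counts / N                     # empirical transition kernel

    # Policy-induced quantities: P_pi[s, s'] and r_pi[s]
    P_pi = np.einsum('sa,sat->st', pi, P_hat)
    r_pi = np.sum(pi * r, axis=1)

    # Solve (I - gamma * P_pi) V = r_pi, i.e. V^pi of the empirical MDP
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```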
Step 2: controlling $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$

key idea 2: a leave-one-out argument to decouple statistical dependency
— inspired by Agarwal et al. '19 but different . . .

Caveat: requires the optimal policy to stand out from other policies
Step 2: controlling $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$

key idea 3: tie-breaking via perturbation
• perturb rewards $r$ by a tiny bit $\Longrightarrow$ $\widehat{\pi}^{\star}_{\mathrm{p}}$
Summary

Model-based RL is minimax optimal and does not suffer from a sample size barrier!

future directions
• finite-horizon episodic MDPs
• Markov games
Story 2: sample complexity of (asynchronous) Q-learning on Markovian samples

Gen Li (Tsinghua EE), Yuantao Gu (Tsinghua EE), Yuting Wei (CMU Stats), Yuejie Chi (CMU ECE)
Model-based vs. model-free RL

Model-based approach ("plug-in")
1. build an empirical estimate $\widehat{P}$ for $P$
2. planning based on the empirical $\widehat{P}$

Model-free approach — learning w/o modeling & estimating the environment explicitly
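A minimal sketch of the planning step of the model-based pipeline: once $\widehat{P}$ has been estimated (e.g. as in the earlier sketch), run value iteration on the empirical MDP and read off a greedy policy. Function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def plan_on_empirical_model(P_hat, r, gamma, num_iters=1000):
    """Planning on the empirical MDP (P_hat, r) via value iteration.
    P_hat: (S, A, S) empirical transitions;  r: (S, A) rewards."""
    S, A, _ = P_hat.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)                 # greedy value of the current Q
        Q = r + gamma * P_hat @ V         # empirical Bellman update
    pi_hat = Q.argmax(axis=1)             # greedy policy of the empirical MDP
    return Q, pi_hat
```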
A classical example: Q-learning on Markovian samples
Markovian samples and behavior policy

Observed: a Markovian trajectory $\{s_t, a_t, r_t\}_{t \geq 0}$ generated by the behavior policy $\pi_{\mathrm{b}}$

Goal: learn the optimal values $V^{\star}$ and $Q^{\star}$ based on the sample trajectory
Markovian samples and behavior policy

Key quantities of the sample trajectory
• minimum state-action occupancy probability $\mu_{\min} := \min_{s,a} \mu_{\pi_{\mathrm{b}}}(s,a)$, where $\mu_{\pi_{\mathrm{b}}}$ is the stationary distribution of the trajectory
• mixing time: $t_{\mathrm{mix}}$
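For a concrete view of $\mu_{\min}$, the sketch below computes the stationary state-action occupancy of the chain induced by $\pi_{\mathrm{b}}$ on a known tabular MDP and takes its minimum entry (estimating $t_{\mathrm{mix}}$ is more delicate and is not shown). It assumes the induced chain is ergodic; names are illustrative.

```python
import numpy as np

def min_occupancy(P, pi_b):
    """Minimum state-action occupancy mu_min under behavior policy pi_b.
    P: (S, A, S) transition kernel;  pi_b: (S, A) behavior policy."""
    S, A, _ = P.shape
    P_b = np.einsum('sa,sat->st', pi_b, P)        # state chain induced by pi_b
    # stationary distribution: left eigenvector of P_b for eigenvalue 1
    evals, evecs = np.linalg.eig(P_b.T)
    mu_s = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    mu_s = mu_s / mu_s.sum()                      # normalize (fixes the sign too)
    mu_sa = mu_s[:, None] * pi_b                  # stationary state-action occupancy
    return mu_sa.min(), mu_sa
```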
Q-learning: a classical model-free algorithm (Watkins & Dayan)

Stochastic approximation (Robbins & Monro '51) for solving the Bellman equation $Q = \mathcal{T}(Q)$
Aside: Bellman optimality principle

Bellman operator
$\mathcal{T}(Q)(s,a) := \underbrace{r(s,a)}_{\text{immediate reward}} + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\underbrace{\max_{a' \in \mathcal{A}} Q(s', a')}_{\text{next state's value}}\Big]$

• one-step look-ahead
• Bellman equation: $Q^{\star}$ is the unique solution to $\mathcal{T}(Q^{\star}) = Q^{\star}$ — Richard Bellman
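In code, the one-step look-ahead is just an expectation over the next state followed by a max over actions. A minimal tabular sketch (variable names are illustrative):

```python
import numpy as np

def bellman_operator(Q, P, r, gamma):
    """Bellman optimality operator T(Q) for a tabular MDP.
    Q: (S, A), P: (S, A, S), r: (S, A)."""
    V = Q.max(axis=1)            # max_{a'} Q(s', a') at every next state s'
    return r + gamma * P @ V     # r(s,a) + gamma * E_{s'~P(.|s,a)}[max_{a'} Q(s',a')]

# Q* is the unique fixed point; iterating T converges by gamma-contraction in the sup norm.
```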
Q-learning: a classical model-free algorithm (Watkins & Dayan)

Stochastic approximation for solving the Bellman equation $Q = \mathcal{T}(Q)$:
$Q_{t+1}(s_t, a_t) = (1 - \eta_t) Q_t(s_t, a_t) + \eta_t \mathcal{T}_t(Q_t)(s_t, a_t), \quad t \geq 0$
(only the $(s_t, a_t)$-th entry is updated)

where
$\mathcal{T}_t(Q)(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
$\mathcal{T}(Q)(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\max_{a'} Q(s', a')\big]$
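A minimal sketch of this update along a Markovian trajectory generated by $\pi_{\mathrm{b}}$; the environment interface (env.step), the initial state, and the constant stepsize are illustrative choices rather than prescriptions from the analysis.

```python
import numpy as np

def async_q_learning(env, pi_b, num_states, num_actions, gamma, eta, T):
    """Asynchronous Q-learning on a single Markovian trajectory.
    env.step(s, a) -> (next_state, reward);  pi_b: (S, A) behavior policy."""
    Q = np.zeros((num_states, num_actions))
    s = 0                                                    # arbitrary initial state
    for t in range(T):
        a = np.random.choice(num_actions, p=pi_b[s])         # act with the behavior policy
        s_next, reward = env.step(s, a)
        target = reward + gamma * Q[s_next].max()            # empirical Bellman target T_t(Q_t)
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target         # update only the (s_t, a_t) entry
        s = s_next
    return Q
```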
Q-learning on Markovian samples

• asynchronous: only a single entry is updated each iteration
  ◦ resembles Markov-chain coordinate descent
• off-policy: target policy $\pi^{\star} \neq$ behavior policy $\pi_{\mathrm{b}}$
A highly incomplete list of prior work

• Watkins, Dayan '92 • Tsitsiklis '94 • Jaakkola, Jordan, Singh '94 • Szepesvári '98 • Kearns, Singh '99 • Borkar, Meyn '00 • Even-Dar, Mansour '03 • Beck, Srikant '12 • Chi, Zhu, Bubeck, Jordan '18 • Shah, Xie '18 • Lee, He '18 • Wainwright '19 • Chen, Zhang, Doan, Maguluri, Clarke '19 • Yang, Wang '19 • Du, Lee, Mahajan, Wang '20 • Chen, Maguluri, Shakkottai, Shanmugam '20 • Qu, Wierman '20 • Devraj, Meyn '20 • Weng, Gupta, He, Ying, Srikant '20 • ...
What is the sample complexity of (async) Q-learning?
Prior art: async Q-learning

Question: how many samples are needed to ensure $\|\widehat{Q} - Q^{\star}\|_{\infty} \leq \varepsilon$?

| paper | sample complexity | learning rate |
| Even-Dar & Mansour '03 | $\frac{(t_{\mathrm{cover}})^{\frac{1}{1-\gamma}}}{(1-\gamma)^{4}\varepsilon^{2}}$ | linear: $\frac{1}{t}$ |
| Even-Dar & Mansour '03 | $\big(\frac{t_{\mathrm{cover}}^{1+3\omega}}{(1-\gamma)^{4}\varepsilon^{2}}\big)^{\frac{1}{\omega}} + \big(\frac{t_{\mathrm{cover}}}{1-\gamma}\big)^{\frac{1}{1-\omega}}$ | poly: $\frac{1}{t^{\omega}}$, $\omega \in (\frac{1}{2}, 1)$ |
| Beck & Srikant '12 | $\frac{t_{\mathrm{cover}}^{3}|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{5}\varepsilon^{2}}$ | constant |
| Qu & Wierman '20 | $\frac{t_{\mathrm{mix}}}{\mu_{\min}^{2}(1-\gamma)^{5}\varepsilon^{2}}$ | rescaled linear |

If we take $\mu_{\min} \asymp \frac{1}{|\mathcal{S}||\mathcal{A}|}$ and $t_{\mathrm{cover}} \asymp \frac{t_{\mathrm{mix}}}{\mu_{\min}}$:
All prior results require a sample size of at least $t_{\mathrm{mix}} |\mathcal{S}|^{2} |\mathcal{A}|^{2}$!
Main result: $\ell_{\infty}$-based sample complexity

Theorem 3 (Li, Wei, Chi, Gu, Chen '20)
For any $0 < \varepsilon \leq \frac{1}{1-\gamma}$, the sample complexity of async Q-learning to yield $\|\widehat{Q} - Q^{\star}\|_{\infty} \leq \varepsilon$ is at most (up to some log factor)
$\frac{1}{\mu_{\min}(1-\gamma)^{5}\varepsilon^{2}} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$

• Improves upon prior art by at least $|\mathcal{S}||\mathcal{A}|$!
— prior art: $\frac{t_{\mathrm{mix}}}{\mu_{\min}^{2}(1-\gamma)^{5}\varepsilon^{2}}$ (Qu & Wierman '20)
Effect of mixing time on sample complexity

$\frac{1}{\mu_{\min}(1-\gamma)^{5}\varepsilon^{2}} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$

• reflects the cost taken to reach the steady state
• one-time expense (almost independent of $\varepsilon$) — it becomes amortized as the algorithm runs
— prior art: $\frac{t_{\mathrm{mix}}}{\mu_{\min}^{2}(1-\gamma)^{5}\varepsilon^{2}}$ (Qu & Wierman '20)
Learning rates

Our choice: constant stepsize $\eta_t \equiv \min\big\{\frac{(1-\gamma)^{4}\varepsilon^{2}}{\gamma^{2}}, \frac{1}{t_{\mathrm{mix}}}\big\}$

• Qu & Wierman '20: rescaled linear $\eta_t = \frac{1}{t + \max\{\frac{1}{\mu_{\min}(1-\gamma)}, \, t_{\mathrm{mix}}\}}$
• Beck & Srikant '12: constant $\eta_t \equiv \frac{(1-\gamma)^{4}\varepsilon^{2}}{|\mathcal{S}||\mathcal{A}| \, t_{\mathrm{cover}}^{2}}$ (too conservative)
• Even-Dar & Mansour '03: polynomial $\eta_t = t^{-\omega}$ ($\omega \in (\frac{1}{2}, 1]$)
Minimax lower bound

minimax lower bound (Azar et al. '13): $\frac{1}{\mu_{\min}(1-\gamma)^{3}\varepsilon^{2}}$
async Q-learning (ignoring dependency on $t_{\mathrm{mix}}$): $\frac{1}{\mu_{\min}(1-\gamma)^{5}\varepsilon^{2}}$

Can we improve the dependency on the discount complexity $\frac{1}{1-\gamma}$?
One strategy: variance reduction
— inspired by Johnson & Zhang '13, Wainwright '19

Variance-reduced Q-learning updates
$Q_t(s_t, a_t) = (1-\eta) Q_{t-1}(s_t, a_t) + \eta \big(\mathcal{T}_t(Q_{t-1}) - \mathcal{T}_t(\overline{Q}) + \widetilde{\mathcal{T}}(\overline{Q})\big)(s_t, a_t)$
(use $\overline{Q}$ to help reduce variability)

• $\overline{Q}$: some reference Q-estimate
• $\widetilde{\mathcal{T}}$: empirical Bellman operator (using a batch of samples)
Variance-reduced Q-learning
— inspired by Johnson & Zhang '13, Sidford et al. '18, Wainwright '19

for each epoch
1. update $\overline{Q}$ and $\widetilde{\mathcal{T}}(\overline{Q})$
2. run variance-reduced Q-learning updates
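A sketch of one such epoch: the batch used to form the reference $\widetilde{\mathcal{T}}(\overline{Q})$, the epoch length, and the stepsize below are placeholder choices rather than the tuned values from the analysis, and the env/batch interfaces are assumptions.

```python
import numpy as np

def empirical_bellman(Q, batch, gamma, num_states, num_actions):
    """Monte-Carlo estimate of T(Q) from a batch of transitions (s, a, r, s')."""
    T_tilde = np.zeros((num_states, num_actions))
    counts = np.zeros((num_states, num_actions))
    for s, a, r, s_next in batch:
        T_tilde[s, a] += r + gamma * Q[s_next].max()
        counts[s, a] += 1
    return T_tilde / np.maximum(counts, 1)

def vr_q_learning_epoch(env, pi_b, Q, Q_bar, batch, gamma, eta, epoch_len, s0=0):
    """One epoch of (async) variance-reduced Q-learning on Markovian samples."""
    S, A = Q.shape
    T_tilde_bar = empirical_bellman(Q_bar, batch, gamma, S, A)   # reference T~(Q_bar)
    s = s0
    for _ in range(epoch_len):
        a = np.random.choice(A, p=pi_b[s])
        s_next, r = env.step(s, a)
        # T_t(Q_{t-1}) - T_t(Q_bar) + T~(Q_bar), evaluated at the visited entry (s, a)
        target = (r + gamma * Q[s_next].max()) \
                 - (r + gamma * Q_bar[s_next].max()) \
                 + T_tilde_bar[s, a]
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        s = s_next
    return Q, s
```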
Main result: $\ell_{\infty}$-based sample complexity

Theorem 4 (Li, Wei, Chi, Gu, Chen '20)
For any $0 < \varepsilon \leq 1$, the sample complexity for (async) variance-reduced Q-learning to yield $\|\widehat{Q} - Q^{\star}\|_{\infty} \leq \varepsilon$ is at most on the order of
$\frac{1}{\mu_{\min}(1-\gamma)^{3}\varepsilon^{2}} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$

• more aggressive learning rates: $\eta_t \equiv \min\big\{\frac{(1-\gamma)^{2}}{\gamma^{2}}, \frac{1}{t_{\mathrm{mix}}}\big\}$ (the $(1-\gamma)^{4}$ factor improves to $(1-\gamma)^{2}$)
• minimax-optimal for $0 < \varepsilon \leq 1$
Summary

Sharpens the finite-sample understanding of Q-learning on Markovian data

future directions
• function approximation
• on-policy algorithms like SARSA
• general Markov-chain-based optimization algorithms
Story 3: fast global convergence of entropy-regularized natural policy gradient (NPG) methods

Shicong Cen (CMU ECE), Chen Cheng (Stanford Stats), Yuejie Chi (CMU ECE), Yuting Wei (CMU Stats)
Policy optimization: a major contributor to these successes
Policy gradient (PG) methods

Given an initial state distribution $s \sim \rho$:
$\text{maximize}_{\pi} \quad V^{\pi}(\rho) := \mathbb{E}_{s \sim \rho}[V^{\pi}(s)]$

softmax parameterization: $\pi_{\theta}(a \mid s) = \frac{\exp(\theta(s,a))}{\sum_{a'} \exp(\theta(s,a'))}$

$\text{maximize}_{\theta} \quad V^{\pi_{\theta}}(\rho) := \mathbb{E}_{s \sim \rho}[V^{\pi_{\theta}}(s)]$

PG method (Sutton et al. '00)
$\theta^{(t+1)} = \theta^{(t)} + \eta \nabla_{\theta} V^{\pi_{\theta^{(t)}}}(\rho), \quad t = 0, 1, \cdots$
• $\eta$: learning rate
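For tabular softmax, the policy gradient has a known closed form in terms of the discounted state visitation and the advantage (see, e.g., Agarwal et al. '19); the sketch below evaluates that formula exactly on a known MDP and takes one PG step. Treat it as an illustration under that formula, with illustrative names, not as the paper's implementation.

```python
import numpy as np

def softmax_policy(theta):
    """theta: (S, A) logits -> pi_theta: (S, A) row-stochastic policy."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def exact_pg_step(theta, P, r, rho, gamma, eta):
    """One PG step under softmax parameterization, using the closed-form gradient
    dV(rho)/dtheta(s,a) = d_rho(s) * pi(a|s) * A(s,a) / (1 - gamma)."""
    S, A_dim = r.shape
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.sum(pi * r, axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)        # V^pi
    Q = r + gamma * P @ V                                      # Q^pi
    Adv = Q - V[:, None]                                       # advantage A^pi
    d_rho = (1 - gamma) * np.linalg.solve(
        (np.eye(S) - gamma * P_pi).T, rho)                     # discounted state visitation
    grad = d_rho[:, None] * pi * Adv / (1 - gamma)             # exact policy gradient
    return theta + eta * grad, V @ rho                         # ascent step, current V(rho)
```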
Booster 1: natural policy gradient (NPG)

precondition gradients to improve search directions ... Natural Gradient

⟹ NPG method (Kakade '02)
$\theta^{(t+1)} = \theta^{(t)} + \eta \big(\mathcal{F}^{\rho}_{\theta^{(t)}}\big)^{\dagger} \nabla_{\theta} V^{\pi_{\theta^{(t)}}}(\rho), \quad t = 0, 1, \cdots$
• $\mathcal{F}^{\rho}_{\theta}$: Fisher information matrix $:= \mathbb{E}\big[\nabla_{\theta} \log \pi_{\theta}(a \mid s) \big(\nabla_{\theta} \log \pi_{\theta}(a \mid s)\big)^{\top}\big]$
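Under softmax parameterization, the (unregularized) NPG step is known to simplify to an exponentiated, multiplicative-weights-style update on the policy (Kakade '02; Agarwal et al. '19), which avoids forming $\mathcal{F}_{\theta}$ explicitly. The sketch below implements that known simplification on a known MDP, not a general Fisher-preconditioned update, and its names are illustrative.

```python
import numpy as np

def npg_softmax_step(pi, P, r, gamma, eta):
    """One NPG step in the softmax case:
    pi_new(a|s)  proportional to  pi(a|s) * exp(eta * Q^pi(s,a) / (1 - gamma))."""
    S, A = r.shape
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.sum(pi * r, axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)    # V^pi
    Q = r + gamma * P @ V                                  # Q^pi
    w = pi * np.exp(eta * Q / (1 - gamma))                 # multiplicative-weights update
    return w / w.sum(axis=1, keepdims=True)                # renormalize row-wise
```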