Sample Complexity of Asynchronous Q-Learning: Sharper Non-Asymptotic Analysis and Variance Reduction

Yuxin Chen (EE, Princeton University), Gen Li (Tsinghua EE), Yuting Wei (CMU Statistics), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE)

Based on: "Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction," G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, NeurIPS 2020.
Reinforcement learning (RL)
RL challenges
• Unknown or changing environments
• Delayed rewards
• Enormous state and action spaces
Sample efficiency

Collecting data samples might be expensive or time-consuming (e.g., clinical trials, online ads).

This calls for an in-depth understanding of the sample efficiency of RL algorithms.
This talk: a classical example — Q-learning
Background: Markov decision processes
Markov decision process (MDP)
• S: state space
• A: action space
• r(s, a) ∈ [0, 1]: immediate reward
• π(·|s): policy (or action selection rule)
• P(·|s, a): unknown transition probabilities

Example (a maze):
• state space S: positions in the maze
• action space A: up, down, left, right
• immediate reward r(s, a): cheese, electricity shocks, cats
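To make the tabular setup concrete, here is a minimal sketch (not from the slides) of how such an MDP can be represented with arrays; the sizes, rewards, and transition kernel below are hypothetical placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A = 4, 3                                  # hypothetical sizes of S and A

    P = rng.dirichlet(np.ones(S), size=(S, A))   # P(s'|s,a): shape (S, A, S), each row sums to 1
    r = rng.random((S, A))                       # immediate rewards r(s,a) in [0, 1]
    pi = rng.dirichlet(np.ones(A), size=S)       # a (random) policy pi(a|s)

    # one environment step under policy pi, starting from state s
    s = 0
    a = rng.choice(A, p=pi[s])                   # draw an action from the policy
    s_next = rng.choice(S, p=P[s, a])            # draw the next state from P(.|s,a)
    print(s, a, r[s, a], s_next)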
Value function

Value of policy π: long-term discounted reward

    V^\pi(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s\Big], \qquad \forall s \in \mathcal{S}

• γ ∈ [0, 1): discount factor
• (a_0, s_1, a_1, s_2, a_2, ...): generated under policy π
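As an illustration of this definition (not part of the talk), the following sketch estimates V^π(s) by averaging truncated discounted returns over simulated rollouts; the random MDP, policy, rollout count, and horizon are hypothetical choices.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, gamma = 5, 3, 0.9                      # hypothetical small MDP

    P = rng.dirichlet(np.ones(S), size=(S, A))   # P(s'|s,a), shape (S, A, S)
    r = rng.random((S, A))                       # rewards r(s,a) in [0, 1]
    pi = rng.dirichlet(np.ones(A), size=S)       # policy pi(a|s)

    def value_mc(s0, n_rollouts=2000, horizon=200):
        """Monte Carlo estimate of V^pi(s0) = E[sum_t gamma^t r_t | s_0 = s0]."""
        returns = []
        for _ in range(n_rollouts):
            s, ret, disc = s0, 0.0, 1.0
            for _ in range(horizon):             # truncate the infinite sum; gamma^200 is negligible
                a = rng.choice(A, p=pi[s])
                ret += disc * r[s, a]
                disc *= gamma
                s = rng.choice(S, p=P[s, a])
            returns.append(ret)
        return float(np.mean(returns))

    print(value_mc(s0=0))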
Action-value function (a.k.a. Q-function)

Q-function of policy π:

    Q^\pi(s, a) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s, a_0 = a\Big], \qquad \forall (s, a) \in \mathcal{S} \times \mathcal{A}

• (s_1, a_1, s_2, a_2, ...): generated under policy π (here a_0 is given, not drawn from π)
Optimal policy and optimal value
• optimal policy π⋆: maximizing value
• optimal value / Q-function: V⋆ := V^{π⋆}, Q⋆ := Q^{π⋆}
Need to learn optimal value / policy from data samples
Markovian samples and behavior policy

Observed: a Markovian trajectory {s_t, a_t, r_t}_{t ≥ 0} generated by the behavior policy π_b

Goal: learn the optimal values V⋆ and Q⋆ based on this sample trajectory
Key quantities of the sample trajectory:
• minimum state-action occupancy probability µ_min := min_{(s,a)} µ_{π_b}(s, a), where µ_{π_b} is the stationary distribution of the trajectory
• mixing time t_mix
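To make these quantities concrete, the sketch below (not from the talk) computes the stationary state-action occupancy of a hypothetical behavior policy on a random MDP, reads off µ_min, and uses the second-largest eigenvalue modulus of the induced state chain as a crude proxy for how fast the chain mixes.

    import numpy as np

    rng = np.random.default_rng(1)
    S, A = 5, 3                                   # hypothetical small MDP

    P = rng.dirichlet(np.ones(S), size=(S, A))    # P(s'|s,a), shape (S, A, S)
    pi_b = rng.dirichlet(np.ones(A), size=S)      # behavior policy pi_b(a|s)

    # state-to-state chain induced by the behavior policy
    P_state = np.einsum('sa,sat->st', pi_b, P)

    # stationary distribution over states: left eigenvector of P_state for eigenvalue 1
    evals, evecs = np.linalg.eig(P_state.T)
    mu_s = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    mu_s = mu_s / mu_s.sum()

    # stationary state-action occupancy and its smallest entry
    mu_sa = mu_s[:, None] * pi_b                  # mu(s,a) = mu(s) * pi_b(a|s)
    print("mu_min =", mu_sa.min())

    # crude mixing proxy: second-largest eigenvalue modulus of the state chain
    print("second-largest eigenvalue modulus:", np.sort(np.abs(evals))[-2])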
Asynchronous Q-learning (on Markovian samples)
Model-based vs. model-free RL

Model-based approach ("plug-in"):
1. build an empirical estimate \hat{P} of P
2. planning based on the empirical \hat{P}

Model-free approach: learning without modeling and estimating the environment explicitly
Q-learning: a classical model-free algorithm (Chris Watkins and Peter Dayan)

Stochastic approximation (Robbins & Monro '51) for solving the Bellman equation Q = T(Q)
Aside: Bellman optimality principle

Bellman operator:

    T(Q)(s, a) := \underbrace{r(s, a)}_{\text{immediate reward}} + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\underbrace{\max_{a' \in \mathcal{A}} Q(s', a')}_{\text{next state's value}}\Big]

• one-step look-ahead
• Bellman equation (Richard Bellman): Q⋆ is the unique solution to T(Q⋆) = Q⋆
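For intuition, here is a minimal sketch (assuming the model P and r are known, which Q-learning itself does not require) that applies the Bellman operator to a Q-table and iterates it to the fixed point Q⋆; the random MDP below is a hypothetical placeholder.

    import numpy as np

    rng = np.random.default_rng(2)
    S, A, gamma = 5, 3, 0.9                       # hypothetical small MDP

    P = rng.dirichlet(np.ones(S), size=(S, A))    # P(s'|s,a), shape (S, A, S)
    r = rng.random((S, A))                        # rewards r(s,a) in [0, 1]

    def bellman(Q):
        """T(Q)(s,a) = r(s,a) + gamma * E_{s'~P(.|s,a)}[ max_a' Q(s',a') ]."""
        return r + gamma * P @ Q.max(axis=1)

    # Q* is the unique fixed point: iterate T until convergence (T is a gamma-contraction in sup norm)
    Q = np.zeros((S, A))
    for _ in range(1000):
        Q_new = bellman(Q)
        if np.max(np.abs(Q_new - Q)) < 1e-10:
            break
        Q = Q_new
    print("model-based fixed point Q*:\n", Q)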
Q-learning: a classical model-free algorithm

Stochastic approximation update for solving Q = T(Q): for all t ≥ 0,

    Q_{t+1}(s_t, a_t) = (1 - \eta_t)\, Q_t(s_t, a_t) + \eta_t\, T_t(Q_t)(s_t, a_t)    (only the (s_t, a_t)-th entry is updated)

where the empirical Bellman operator T_t uses the observed transition,

    T_t(Q)(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a'),

in place of the population operator

    T(Q)(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a'} Q(s', a')\Big].
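The following is a minimal sketch of this update run on a single Markovian trajectory (not the authors' code); the MDP, the uniform behavior policy, the constant learning rate, and the iteration count are all hypothetical choices for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    S, A, gamma, eta = 5, 3, 0.9, 0.1             # hypothetical MDP; constant step size

    P = rng.dirichlet(np.ones(S), size=(S, A))    # unknown to the learner; used only to sample transitions
    r = rng.random((S, A))

    Q = np.zeros((S, A))
    s = 0                                          # initial state
    for t in range(100_000):
        a = rng.integers(A)                        # behavior policy pi_b: uniform (off-policy)
        s_next = rng.choice(S, p=P[s, a])          # one Markovian transition
        # empirical Bellman target T_t(Q)(s_t, a_t) = r(s_t,a_t) + gamma * max_a' Q(s_{t+1}, a')
        target = r[s, a] + gamma * Q[s_next].max()
        # asynchronous update: only the (s_t, a_t)-th entry changes
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        s = s_next

    print(Q)
    print("greedy actions w.r.t. the learned Q:", Q.argmax(axis=1))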
Q-learning on Markovian samples
• asynchronous: only a single entry is updated in each iteration
  ◦ resembles Markov-chain coordinate descent
• off-policy: target policy π⋆ ≠ behavior policy π_b
A highly incomplete list of prior work: Watkins, Dayan '92 • Tsitsiklis '94 • Jaakkola, Jordan, Singh '94 • Szepesvári '98 • Kearns, Singh '99 • Borkar, Meyn '00 • Even-Dar, Mansour '03 • Beck, Srikant '12 • Chi, Zhu, Bubeck, Jordan '18 • Shah, Xie '18 • Lee, He '18 • Wainwright '19 • Chen, Zhang, Doan, Maguluri, Clarke '19 • Yang, Wang '19 • Du, Lee, Mahajan, Wang '20 • Chen, Maguluri, Shakkottai, Shanmugam '20 • Qu, Wierman '20 • Devraj, Meyn '20 • Weng, Gupta, He, Ying, Srikant '20 • ...
What is the sample complexity of (async) Q-learning?
Prior art: async Q-learning

Question: how many samples are needed to ensure \|\hat{Q} - Q^\star\|_\infty \le \varepsilon?

paper | sample complexity | learning rate
Even-Dar & Mansour '03 | \frac{(t_{\mathsf{cover}})^{\frac{1}{1-\gamma}}}{(1-\gamma)^4 \varepsilon^2} | linear: \frac{1}{t}
Even-Dar & Mansour '03 | \Big(\frac{t_{\mathsf{cover}}^{1+3\omega}}{(1-\gamma)^4 \varepsilon^2}\Big)^{\frac{1}{\omega}} + \Big(\frac{t_{\mathsf{cover}}}{1-\gamma}\Big)^{\frac{1}{1-\omega}} | polynomial: \frac{1}{t^\omega}, \omega \in (\frac{1}{2}, 1)
Beck & Srikant '12 | \frac{t_{\mathsf{cover}}^3 |S||A|}{(1-\gamma)^5 \varepsilon^2} | constant
Qu & Wierman '20 | \frac{t_{\mathsf{mix}}}{\mu_{\min}^2 (1-\gamma)^5 \varepsilon^2} | rescaled linear
If we take µ_min ≍ 1/(|S||A|) and t_cover ≍ t_mix/µ_min, all prior results require a sample size of at least t_mix |S|^2 |A|^2 !
Main result: ℓ_∞-based sample complexity

Theorem 1 (Li, Wei, Chi, Gu, Chen '20). For any 0 < ε ≤ 1/(1-γ), the sample complexity of asynchronous Q-learning to yield \|\hat{Q} - Q^\star\|_\infty \le \varepsilon is at most (up to some log factor)

    \frac{1}{\mu_{\min}(1-\gamma)^5 \varepsilon^2} + \frac{t_{\mathsf{mix}}}{\mu_{\min}(1-\gamma)}.

• Improves upon prior art by a factor of at least |S||A|!
  — prior art: \frac{t_{\mathsf{mix}}}{\mu_{\min}^2 (1-\gamma)^5 \varepsilon^2} (Qu & Wierman '20)
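As a rough numerical illustration (constants and log factors dropped, and all parameter values below are hypothetical), one can compare the leading terms of the two bounds in the regime µ_min ≍ 1/(|S||A|):

    # Toy comparison of the bounds' leading terms (constants and log factors omitted).
    S_times_A, t_mix, gamma, eps = 1000, 100, 0.9, 0.1
    mu_min = 1.0 / S_times_A                      # near-uniform occupancy regime

    new_bound   = 1 / (mu_min * (1 - gamma)**5 * eps**2) + t_mix / (mu_min * (1 - gamma))
    prior_bound = t_mix / (mu_min**2 * (1 - gamma)**5 * eps**2)   # Qu & Wierman '20

    print(f"new bound   ~ {new_bound:.3e}")
    print(f"prior bound ~ {prior_bound:.3e}")
    print(f"ratio ~ {prior_bound / new_bound:.1f}")   # roughly t_mix * |S||A| when eps is small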
Effect of mixing time on sample complexity

    \frac{1}{\mu_{\min}(1-\gamma)^5 \varepsilon^2} + \frac{t_{\mathsf{mix}}}{\mu_{\min}(1-\gamma)}

• the t_mix term reflects the cost taken to reach the steady state
• it is a one-time expense (almost independent of ε), which becomes amortized as the algorithm runs
• in contrast, prior art couples t_mix with the ε-dependent term: \frac{t_{\mathsf{mix}}}{\mu_{\min}^2 (1-\gamma)^5 \varepsilon^2} (Qu & Wierman '20)