
Sample Complexity of Asynchronous Q-Learning: Sharper non-asymptotic analysis and variance reduction


  1. Sample Complexity of Asynchronous Q-Learning: Sharper non-asymptotic analysis and variance reduction. Yuxin Chen, EE, Princeton University

  2. Gen Li (Tsinghua EE), Yuting Wei (CMU Statistics), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE). “Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, NeurIPS 2020

  3. Reinforcement learning (RL)

  4. RL challenges
      • Unknown or changing environments
      • Delayed rewards
      • Enormous state and action space

  6. Sample efficiency
      Collecting data samples might be expensive or time-consuming (e.g. clinical trials, online ads).
      This calls for in-depth understanding of the sample efficiency of RL algorithms.

  7. This talk: a classical example — Q-learning

  8. Background: Markov decision processes

  10. Markov decision process (MDP)
      • $\mathcal{S}$: state space
      • $\mathcal{A}$: action space
      • $r(s,a) \in [0,1]$: immediate reward

  11. Markov decision process (MDP): maze example
      • state space $\mathcal{S}$: positions in the maze
      • action space $\mathcal{A}$: up, down, left, right
      • immediate reward $r(s,a)$: cheese, electric shocks, cats

  13. Markov decision process (MDP)
      • $\mathcal{S}$: state space
      • $\mathcal{A}$: action space
      • $r(s,a) \in [0,1]$: immediate reward
      • $\pi(\cdot \mid s)$: policy (or action selection rule)
      • $P(\cdot \mid s,a)$: unknown transition probabilities

  15. Value function
      Value of policy $\pi$: long-term discounted reward
      $$\forall s \in \mathcal{S}: \quad V^{\pi}(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s\Big]$$
      • $\gamma \in [0,1)$: discount factor
      • $(a_0, s_1, a_1, s_2, a_2, \cdots)$: generated under policy $\pi$
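For a small tabular MDP with known transitions, the value function above can be computed exactly by solving the linear system $V^\pi = r_\pi + \gamma P_\pi V^\pi$. The sketch below is only an illustration: the function name `policy_value`, the array layout, and the 2-state, 2-action MDP are assumptions made here, not part of the talk.

```python
import numpy as np

def policy_value(P, r, pi, gamma):
    """Exact V^pi for a tabular MDP (illustrative helper, not from the slides).

    P:  (S, A, S) array, P[s, a, s'] = transition probability
    r:  (S, A) array of immediate rewards in [0, 1]
    pi: (S, A) array, pi[s, a] = probability of action a in state s
    gamma: discount factor in [0, 1)
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, r)    # expected one-step reward under pi
    # V = r_pi + gamma * P_pi V  <=>  (I - gamma * P_pi) V = r_pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# Hypothetical 2-state, 2-action MDP used only for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 0.5], [1.0, 0.2]])
pi = np.full((2, 2), 0.5)                  # uniformly random policy
V = policy_value(P, r, pi, gamma=0.9)
```

Because rewards lie in $[0,1]$, every entry of `V` falls between 0 and $1/(1-\gamma)$.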

  17. Action-value function (a.k.a. Q-function)
      Q-function of policy $\pi$
      $$\forall (s,a) \in \mathcal{S} \times \mathcal{A}: \quad Q^{\pi}(s,a) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s,\, a_0 = a\Big]$$
      • $(s_1, a_1, s_2, a_2, \cdots)$: generated under policy $\pi$ (the initial action $a_0$ is fixed to $a$ rather than drawn from $\pi$)
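The two definitions are linked by a standard one-step identity. It is stated here as a well-known fact rather than something shown on the slides, and it assumes $r_t = r(s_t, a_t)$ with next states drawn from the transition kernel $P(\cdot \mid s,a)$:

```latex
Q^{\pi}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[ V^{\pi}(s') \big],
\qquad
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ Q^{\pi}(s,a) \big].
```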

  20. Optimal policy and optimal value
      • optimal policy $\pi^\star$: maximizing value
      • optimal value / Q-function: $V^\star := V^{\pi^\star}$, $Q^\star := Q^{\pi^\star}$

  21. Need to learn optimal value / policy from data samples

  22. Markovian samples and behavior policy
      Observed: Markovian trajectory $\{s_t, a_t, r_t\}_{t \ge 0}$ generated by behavior policy $\pi_b$
      Goal: learn optimal value $V^\star$ and $Q^\star$ based on sample trajectory

  23. Markovian samples and behavior policy
      Key quantities of sample trajectory
      • minimum state-action occupancy probability: $\mu_{\min} := \min_{(s,a)} \mu_{\pi_b}(s,a)$, where $\mu_{\pi_b}$ is the stationary distribution
      • mixing time: $t_{\mathrm{mix}}$
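To make $\mu_{\min}$ concrete, the sketch below computes the stationary distribution of the state-action chain induced by a behavior policy and takes its minimum. It assumes the transition kernel is known, which is not the case in the learning setting of the talk, and the helper name and the toy MDP are hypothetical.

```python
import numpy as np

def stationary_state_action(P, pi_b):
    """Stationary distribution over (s, a) pairs under behavior policy pi_b.

    P: (S, A, S) transition probabilities; pi_b: (S, A) action probabilities.
    Assumes the induced chain has a unique stationary distribution.
    """
    S, A, _ = P.shape
    n = S * A
    # Chain over pairs: (s, a) -> (s', a') with prob P(s'|s, a) * pi_b(a'|s')
    K = np.einsum("sat,tb->satb", P, pi_b).reshape(n, n)
    # Solve mu K = mu together with sum(mu) = 1 as one least-squares system
    M = np.vstack([K.T - np.eye(n), np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    mu, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return mu.reshape(S, A)

# Hypothetical 2-state, 2-action MDP with a uniform behavior policy
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi_b = np.full((2, 2), 0.5)
mu = stationary_state_action(P, pi_b)
mu_min = mu.min()   # never exceeds 1 / (|S||A|), here 0.25
```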

  24. Asynchronous Q-learning (on Markovian samples)

  26. Model-based vs. model-free RL
      Model-based approach (“plug-in”)
      1. build empirical estimate $\widehat{P}$ for $P$
      2. planning based on empirical $\widehat{P}$
      Model-free approach: learning w/o modeling & estimating environment explicitly
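The two plug-in steps can be sketched as follows: estimate $\widehat{P}$ from transition counts, then plan on it. Value iteration is used for the planning step as one possible choice; the slides do not name a planner. The function names, the uniform fallback for unvisited pairs, and the sample transitions are all assumptions of this illustration.

```python
import numpy as np

def empirical_model(transitions, S, A):
    """Step 1: build P_hat from observed (s, a, s_next) triples."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    n = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform guess (assumption of this sketch)
    return np.where(n > 0, counts / np.maximum(n, 1), 1.0 / S)

def plan(P_hat, r, gamma, iters=500):
    """Step 2: value iteration on the estimated model."""
    Q = np.zeros(r.shape)
    for _ in range(iters):
        Q = r + gamma * P_hat @ Q.max(axis=1)
    return Q

# Hypothetical data: four observed transitions in a 2-state, 2-action MDP
transitions = [(0, 0, 1), (0, 0, 1), (0, 0, 0), (1, 1, 0)]
r = np.array([[0.0, 1.0], [0.5, 0.0]])
P_hat = empirical_model(transitions, S=2, A=2)
Q_hat = plan(P_hat, r, gamma=0.9)
```

The model-free alternative discussed next skips `P_hat` entirely and updates a Q-table directly from samples.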

  27. Q-learning: a classical model-free algorithm (Chris Watkins, Peter Dayan)
      Stochastic approximation (Robbins & Monro ’51) for solving Bellman equation $Q = \mathcal{T}(Q)$

  29. Aside: Bellman optimality principle (Richard Bellman)
      Bellman operator
      $$\mathcal{T}(Q)(s,a) := \underbrace{r(s,a)}_{\text{immediate reward}} + \gamma \mathop{\mathbb{E}}_{s' \sim P(\cdot \mid s,a)} \Big[ \underbrace{\max_{a' \in \mathcal{A}} Q(s',a')}_{\text{next state's value}} \Big]$$
      • one-step look-ahead
      Bellman equation: $Q^\star$ is unique solution to $\mathcal{T}(Q^\star) = Q^\star$
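A minimal sketch of the Bellman operator on a tabular MDP, assuming the transition kernel is known. It also illustrates why the fixed point is unique: the operator is a $\gamma$-contraction in the $\ell_\infty$ norm (a standard fact, not stated on the slide), so repeated application converges to $Q^\star$. Function names and the toy MDP are hypothetical.

```python
import numpy as np

def bellman_operator(Q, P, r, gamma):
    """T(Q)(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s, a)} max_a' Q(s', a')."""
    return r + gamma * P @ Q.max(axis=1)

def solve_q_star(P, r, gamma, tol=1e-10):
    """Iterate Q <- T(Q) until successive iterates differ by less than tol."""
    Q = np.zeros_like(r, dtype=float)
    while True:
        Q_next = bellman_operator(Q, P, r, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next

# Hypothetical 2-state, 2-action MDP used only for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 0.5], [1.0, 0.2]])
Q_star = solve_q_star(P, r, gamma=0.9)
```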

  31. Q-learning: a classical model-free algorithm (Chris Watkins, Peter Dayan)
      Stochastic approximation for solving Bellman equation $Q = \mathcal{T}(Q)$
      $$Q_{t+1}(s_t,a_t) = (1-\eta_t)\, Q_t(s_t,a_t) + \eta_t\, \mathcal{T}_t(Q_t)(s_t,a_t), \qquad t \ge 0$$
      (only update the $(s_t,a_t)$-th entry), where the empirical operator and its population counterpart are
      $$\mathcal{T}_t(Q)(s_t,a_t) = r(s_t,a_t) + \gamma \max_{a'} Q(s_{t+1},a')$$
      $$\mathcal{T}(Q)(s,a) = r(s,a) + \gamma \mathop{\mathbb{E}}_{s' \sim P(\cdot \mid s,a)} \Big[ \max_{a'} Q(s',a') \Big]$$
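The update rule can be sketched as a short simulation along a single Markovian trajectory. Everything beyond the update itself is an assumption of this sketch: the constant learning rate `eta`, the uniform behavior policy, the starting state, and the toy MDP used to generate transitions.

```python
import numpy as np

def async_q_learning(P, r, gamma, eta, T, seed=0):
    """Asynchronous Q-learning on one trajectory of length T.

    P is used only to simulate the environment; the learner never reads it.
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    Q = np.zeros((S, A))
    s = 0                                    # arbitrary start state (assumption)
    for _ in range(T):
        a = rng.integers(A)                  # uniform behavior policy (assumption)
        s_next = rng.choice(S, p=P[s, a])    # environment draws the next state
        # Empirical Bellman operator T_t(Q)(s_t, a_t)
        target = r[s, a] + gamma * Q[s_next].max()
        # Only the (s_t, a_t)-th entry is updated
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        s = s_next
    return Q

# Hypothetical 2-state, 2-action MDP used only for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 0.5], [1.0, 0.2]])
Q = async_q_learning(P, r, gamma=0.9, eta=0.05, T=20000)
```

With rewards in $[0,1]$, zero initialization, and `eta` in $[0,1]$, each update is a convex combination that keeps every entry of `Q` within $[0, 1/(1-\gamma)]$.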

  34. Q-learning on Markovian samples
      • asynchronous: only a single entry is updated each iteration
        ◦ resembles Markov-chain coordinate descent
      • off-policy: target policy $\pi^\star \neq$ behavior policy $\pi_b$

  35. A highly incomplete list of prior work • Watkins, Dayan ’92 • Tsitsiklis ’94 • Jaakkola, Jordan, Singh ’94 • Szepesvári ’98 • Kearns, Singh ’99 • Borkar, Meyn ’00 • Even-Dar, Mansour ’03 • Beck, Srikant ’12 • Jin, Allen-Zhu, Bubeck, Jordan ’18 • Shah, Xie ’18 • Lee, He ’18 • Wainwright ’19 • Chen, Zhang, Doan, Maguluri, Clarke ’19 • Yang, Wang ’19 • Du, Lee, Mahajan, Wang ’20 • Chen, Maguluri, Shakkottai, Shanmugam ’20 • Qu, Wierman ’20 • Devraj, Meyn ’20 • Weng, Gupta, He, Ying, Srikant ’20 • ...

  36. What is sample complexity of (async) Q-learning?

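  37. Prior art: async Q-learning
      Question: how many samples are needed to ensure $\|\widehat{Q} - Q^\star\|_\infty \le \varepsilon$?

      | paper | sample complexity | learning rate |
      |---|---|---|
      | Even-Dar & Mansour ’03 | $\frac{(t_{\mathrm{cover}})^{\frac{1}{1-\gamma}}}{(1-\gamma)^4 \varepsilon^2}$ | linear: $\frac{1}{t}$ |
      | Even-Dar & Mansour ’03 | $\Big(\frac{t_{\mathrm{cover}}^{1+3\omega}}{(1-\gamma)^4 \varepsilon^2}\Big)^{\frac{1}{\omega}} + \Big(\frac{t_{\mathrm{cover}}}{1-\gamma}\Big)^{\frac{1}{1-\omega}}$ | poly: $\frac{1}{t^{\omega}}$, $\omega \in (\frac{1}{2}, 1)$ |
      | Beck & Srikant ’12 | $\frac{t_{\mathrm{cover}}^3 \lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^5 \varepsilon^2}$ | constant |
      | Qu & Wierman ’20 | $\frac{t_{\mathrm{mix}}}{\mu_{\min}^2 (1-\gamma)^5 \varepsilon^2}$ | rescaled linear |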

  40. Prior art: async Q-learning
      Question: how many samples are needed to ensure $\|\widehat{Q} - Q^\star\|_\infty \le \varepsilon$?
      If we take $\mu_{\min} \asymp \frac{1}{|\mathcal{S}||\mathcal{A}|}$ and $t_{\mathrm{cover}} \asymp \frac{t_{\mathrm{mix}}}{\mu_{\min}}$:
      all prior results require sample size of at least $t_{\mathrm{mix}} |\mathcal{S}|^2 |\mathcal{A}|^2$!
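As a worked check of that claim, substituting $\mu_{\min} \asymp 1/(|\mathcal{S}||\mathcal{A}|)$ and $t_{\mathrm{cover}} \asymp t_{\mathrm{mix}}/\mu_{\min}$ into two of the prior bounds gives the following. Constants and log factors are ignored, and only the arithmetic is added here:

```latex
\text{Qu \& Wierman '20:}\quad
\frac{t_{\mathrm{mix}}}{\mu_{\min}^{2}(1-\gamma)^{5}\varepsilon^{2}}
\;\asymp\;
\frac{t_{\mathrm{mix}}\,|\mathcal{S}|^{2}|\mathcal{A}|^{2}}{(1-\gamma)^{5}\varepsilon^{2}},
\qquad
\text{Beck \& Srikant '12:}\quad
\frac{t_{\mathrm{cover}}^{3}\,|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{5}\varepsilon^{2}}
\;\asymp\;
\frac{t_{\mathrm{mix}}^{3}\,|\mathcal{S}|^{4}|\mathcal{A}|^{4}}{(1-\gamma)^{5}\varepsilon^{2}}.
```

Both scale at least as $t_{\mathrm{mix}} |\mathcal{S}|^2 |\mathcal{A}|^2$.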

  42. Main result: $\ell_\infty$-based sample complexity
      Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any $0 < \varepsilon \le \frac{1}{1-\gamma}$, the sample complexity of async Q-learning to yield $\|\widehat{Q} - Q^\star\|_\infty \le \varepsilon$ is at most (up to some log factor)
      $$\frac{1}{\mu_{\min} (1-\gamma)^5 \varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min} (1-\gamma)}$$
      • Improves upon prior art by at least $|\mathcal{S}||\mathcal{A}|$! Prior art: $\frac{t_{\mathrm{mix}}}{\mu_{\min}^2 (1-\gamma)^5 \varepsilon^2}$ (Qu & Wierman ’20)

  44. Effect of mixing time on sample complexity
      $$\frac{1}{\mu_{\min} (1-\gamma)^5 \varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min} (1-\gamma)}$$
      The $t_{\mathrm{mix}}$ term:
      • reflects cost taken to reach steady state
      • one-time expense (almost independent of $\varepsilon$); it becomes amortized as algorithm runs
      Prior art: $\frac{t_{\mathrm{mix}}}{\mu_{\min}^2 (1-\gamma)^5 \varepsilon^2}$ (Qu & Wierman ’20)
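To get a feel for the scaling, the sketch below evaluates the two terms of the bound and the Qu & Wierman ’20 rate for one set of parameters. It drops all constants and log factors, so the numbers indicate orders of magnitude only; the function names and parameter values are chosen here for illustration.

```python
def new_bound(mu_min, gamma, eps, t_mix):
    """Theorem 1 rate without constants or log factors: (main term, burn-in term)."""
    main = 1.0 / (mu_min * (1 - gamma) ** 5 * eps ** 2)
    burn_in = t_mix / (mu_min * (1 - gamma))   # one-time cost, no eps dependence
    return main, burn_in

def prior_bound(mu_min, gamma, eps, t_mix):
    """Qu & Wierman '20 rate without constants or log factors."""
    return t_mix / (mu_min ** 2 * (1 - gamma) ** 5 * eps ** 2)

# Hypothetical parameters; eps satisfies 0 < eps <= 1 / (1 - gamma)
mu_min, gamma, eps, t_mix = 0.01, 0.9, 0.1, 10.0
main, burn_in = new_bound(mu_min, gamma, eps, t_mix)
prior = prior_bound(mu_min, gamma, eps, t_mix)
ratio = prior / (main + burn_in)
```

For these values the main term is about $10^9$, the burn-in term about $10^4$, and the prior rate about $10^{12}$. In the prior rate $t_{\mathrm{mix}}$ multiplies the $\varepsilon$-dependent term, while in the new bound it appears only in the burn-in term.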
