
Breaking the sample size barrier in reinforcement learning via model-based ("plug-in") methods
Yuxin Chen (EE, Princeton University), with Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), and Yuting Wei (CMU)


  1. Breaking the sample size barrier in reinforcement learning via model-based ("plug-in") methods. Yuxin Chen, EE, Princeton University

  2. Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), Yuting Wei (CMU Statistics). "Breaking the sample size barrier in model-based reinforcement learning with a generative model," G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020

  3. Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), Yuting Wei (Berkeley Stat Ph.D.). "Breaking the sample size barrier in model-based reinforcement learning with a generative model," G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020

  4. Reinforcement learning (RL)

  5. RL challenges
     In RL, an agent learns by interacting with an environment
     • unknown or changing environments
     • delayed rewards or feedback
     • enormous state and action space
     • nonconvexity

  6. Sample efficiency
     Collecting data samples might be expensive or time-consuming (e.g., clinical trials, online ads)

  7. Sample efficiency
     Collecting data samples might be expensive or time-consuming (e.g., clinical trials, online ads)
     Calls for design of sample-efficient RL algorithms!

  8. Background: Markov decision processes

  9. Markov decision process (MDP)
     • S: state space
     • A: action space

  10. Markov decision process (MDP)
     • S: state space
     • A: action space
     • r(s, a) ∈ [0, 1]: immediate reward

  11. Markov decision process (MDP)
     • S: state space
     • A: action space
     • r(s, a) ∈ [0, 1]: immediate reward
     • π(·|s): policy (or action selection rule)

  12. Markov decision process (MDP)
     • S: state space
     • A: action space
     • r(s, a) ∈ [0, 1]: immediate reward
     • π(·|s): policy (or action selection rule)
     • P(·|s, a): unknown transition probabilities
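
For concreteness, here is a minimal Python sketch of how the tabular quantities above (S, A, r, P, and a policy π) can be stored as arrays. All sizes and values below are made-up placeholders, not anything from the talk.

    import numpy as np

    # Tabular MDP sketch (illustrative placeholders only):
    # S states, A actions, rewards r[s, a] in [0, 1],
    # transition probabilities P[s, a, s'] = P(s' | s, a).
    S, A = 5, 3
    rng = np.random.default_rng(0)

    r = rng.uniform(0.0, 1.0, size=(S, A))     # immediate rewards r(s, a)
    P = rng.uniform(size=(S, A, S))
    P /= P.sum(axis=2, keepdims=True)          # each row P(. | s, a) sums to 1

    pi = np.full((S, A), 1.0 / A)              # a (randomized) policy pi(a | s)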

  13. Value function
     Value of policy π: long-term discounted reward
     V^π(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ],   ∀ s ∈ S

  14. Value function
     Value of policy π: long-term discounted reward
     V^π(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ],   ∀ s ∈ S
     • (a_0, s_1, a_1, s_2, a_2, ...): generated under policy π

  15. Value function
     Value of policy π: long-term discounted reward
     V^π(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ],   ∀ s ∈ S
     • (a_0, s_1, a_1, s_2, a_2, ...): generated under policy π
     • γ ∈ [0, 1): discount factor
       ◦ take γ → 1 to approximate long-horizon MDPs
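
One way to read the definition above: V^π(s) is the average discounted return of trajectories started at s and rolled out under π. A Monte Carlo sketch of that reading follows; the toy MDP, the uniform policy, and the truncation horizon are all illustrative assumptions.

    import numpy as np

    # Monte Carlo sketch of V^pi(s): average the discounted return
    # sum_t gamma^t r(s_t, a_t) over simulated trajectories.
    rng = np.random.default_rng(1)
    S, A, gamma = 5, 3, 0.9
    r = rng.uniform(size=(S, A))
    P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
    pi = np.full((S, A), 1.0 / A)                # uniform policy as a stand-in

    def estimate_value(s0, n_traj=2000, horizon=200):
        total = 0.0
        for _ in range(n_traj):
            s, ret, disc = s0, 0.0, 1.0
            for _ in range(horizon):             # truncate the infinite sum
                a = rng.choice(A, p=pi[s])
                ret += disc * r[s, a]
                disc *= gamma
                s = rng.choice(S, p=P[s, a])
            total += ret
        return total / n_traj

    print(estimate_value(s0=0))                  # approximates V^pi(0)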

  16. Optimal policy and optimal values
     • Optimal policy π⋆: maximizing the value function

  17. Optimal policy and optimal values
     • Optimal policy π⋆: maximizing the value function
     • Optimal values: V⋆ := V^π⋆

  18. When the model is known ...
     [diagram: MDP specification (truth: P, r) → planning oracle (e.g., policy iteration) → π⋆]
     Planning: computing the optimal policy π⋆ given the MDP specification
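
For reference, a sketch of one such planning oracle, policy iteration, applied to a known tabular model (P, r). The toy model below is an assumption; the evaluate-then-improve loop is the standard algorithm named on the slide.

    import numpy as np

    # Policy iteration sketch for a known tabular MDP (P, r).
    rng = np.random.default_rng(2)
    S, A, gamma = 5, 3, 0.9
    r = rng.uniform(size=(S, A))
    P = rng.dirichlet(np.ones(S), size=(S, A))

    pi = np.zeros(S, dtype=int)                  # arbitrary initial deterministic policy
    for _ in range(100):                         # converges after a handful of passes
        P_pi = P[np.arange(S), pi]               # |S| x |S| transition matrix under pi
        r_pi = r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # exact policy evaluation
        Q = r + gamma * P @ V                    # Q[s, a] = r(s, a) + gamma * E[V(s')]
        new_pi = Q.argmax(axis=1)                # greedy policy improvement
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi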

  19. When the model is unknown ... Need to learn the optimal policy from samples w/o model specification

  20. This talk: RL with a generative model / simulator (Kearns & Singh ’99)
     For each state-action pair (s, a), collect N samples {(s, a, s′_(i))}_{1 ≤ i ≤ N}
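
A sketch of the generative-model sampling protocol: for every pair (s, a), draw N independent next states from P(·|s, a). The simulator here is a toy stand-in for whatever environment is being modeled.

    import numpy as np

    # Generative-model sampling sketch: samples[s, a, i] is the i-th sampled next state
    # drawn from P(. | s, a). The true model is an illustrative assumption.
    rng = np.random.default_rng(3)
    S, A, N = 5, 3, 100
    P_true = rng.dirichlet(np.ones(S), size=(S, A))

    samples = np.empty((S, A, N), dtype=int)
    for s in range(S):
        for a in range(A):
            samples[s, a] = rng.choice(S, size=N, p=P_true[s, a])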

  21. Question: how many samples are sufficient to learn an ε-optimal policy π̂?

  22. Question: how many samples are sufficient to learn an ε-optimal policy π̂?
     ε-optimal: ∀ s: V^π̂(s) ≥ V⋆(s) − ε

  23. An incomplete list of prior art
     • Kearns & Singh ’99
     • Kakade ’03
     • Kearns, Mansour & Ng ’02
     • Azar, Munos & Kappen ’12
     • Azar, Munos, Ghavamzadeh & Kappen ’13
     • Sidford, Wang, Wu, Yang & Ye ’18
     • Sidford, Wang, Wu & Ye ’18
     • Wang ’17
     • Agarwal, Kakade & Yang ’19
     • Wainwright ’19a
     • Wainwright ’19b
     • Pananjady & Wainwright ’20
     • Yang & Wang ’19
     • Khamaru, Pananjady, Ruan, Wainwright & Jordan ’20
     • Mou, Li, Wainwright, Bartlett & Jordan ’20
     • ...

  24. An even shorter list of prior art
     algorithm                                     | sample size range    | sample complexity  | ε-range
     empirical QVI (Azar et al. ’13)               | [|S|²|A|/(1−γ)², ∞)  | |S||A|/((1−γ)³ε²)  | (0, 1/√((1−γ)|S|)]
     sublinear randomized VI (Sidford et al. ’18a) | [|S||A|/(1−γ)², ∞)   | |S||A|/((1−γ)⁴ε²)  | (0, 1/(1−γ)]
     variance-reduced QVI (Sidford et al. ’18b)    | [|S||A|/(1−γ)³, ∞)   | |S||A|/((1−γ)³ε²)  | (0, 1]
     empirical MDP + planning (Agarwal et al. ’19) | [|S||A|/(1−γ)², ∞)   | |S||A|/((1−γ)³ε²)  | (0, 1/√(1−γ)]
     (see also Wainwright ’19, for estimating optimal values)



  27. All prior theory requires sample size > |S||A|/(1−γ)²  (the "sample size barrier")

  28. Is it possible to close the gap?

  29. Two approaches
     Model-based approach ("plug-in")
     1. build an empirical estimate P̂ for P
     2. planning based on the empirical P̂

  30. Two approaches
     Model-based approach ("plug-in")
     1. build an empirical estimate P̂ for P
     2. planning based on the empirical P̂
     Model-free approach (e.g. Q-learning, SARSA): learning w/o estimating the model explicitly (see the Q-learning sketch below)
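
For contrast with the plug-in route, a hedged sketch of a model-free method: synchronous Q-learning that consumes one fresh generative-model sample per (s, a) per iteration and never forms an explicit estimate of P. The step-size schedule and toy model are illustrative assumptions.

    import numpy as np

    # Synchronous Q-learning sketch (model-free): update Q(s, a) from sampled
    # transitions directly, without building P_hat.
    rng = np.random.default_rng(7)
    S, A, gamma, T = 5, 3, 0.9, 5000
    P_true = rng.dirichlet(np.ones(S), size=(S, A))
    r = rng.uniform(size=(S, A))

    Q = np.zeros((S, A))
    for t in range(1, T + 1):
        eta = 1.0 / (1.0 + (1 - gamma) * t)      # illustrative rescaled-linear step size
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P_true[s, a])        # one fresh sample per (s, a)
                target = r[s, a] + gamma * Q[s_next].max()    # sampled Bellman backup
                Q[s, a] = (1 - eta) * Q[s, a] + eta * target
    pi_hat = Q.argmax(axis=1)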

  32. Model estimation
     Sampling: for each (s, a), collect N ind. samples {(s, a, s′_(i))}_{1 ≤ i ≤ N}

  33. Model estimation
     Sampling: for each (s, a), collect N ind. samples {(s, a, s′_(i))}_{1 ≤ i ≤ N}
     Empirical estimates: estimate P(s′|s, a) by the empirical frequency (1/N) Σ_{i=1}^N 1{s′_(i) = s′}
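
The empirical-frequency estimate in code, for concreteness: count how often each next state appears among the N samples drawn for (s, a) and divide by N. The true model used to generate the samples is a toy assumption.

    import numpy as np

    # Empirical estimate of the transition model:
    #   P_hat(s' | s, a) = (1/N) * #{ i : s'_(i) = s' }.
    rng = np.random.default_rng(4)
    S, A, N = 5, 3, 100
    P_true = rng.dirichlet(np.ones(S), size=(S, A))

    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=N, p=P_true[s, a])      # N samples from P(. | s, a)
            P_hat[s, a] = np.bincount(next_states, minlength=S) / N  # empirical frequencies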

  34. Model-based (plug-in) estimator (Azar et al. ’13, Agarwal et al. ’19, Pananjady et al. ’20)
     [diagram: empirical MDP (P̂, r) → planning oracle (e.g., policy iteration) → π̂⋆]
     Planning based on the empirical MDP

  35. Our method: plug-in estimator + perturbation (Li, Wei, Chi, Gu, Chen ’20)
     [diagram: perturb rewards, then empirical MDP (P̂, r_p) → planning oracle (e.g., policy iteration) → π̂⋆_p]
     Run planning algorithms based on the empirical MDP with slightly perturbed rewards
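
A sketch of the perturbed plug-in pipeline described on the slide: estimate P̂, add a small random perturbation to the rewards, then plan on the perturbed empirical MDP. The perturbation scale and the value-iteration planner below are illustrative choices, not the exact ones analyzed in the paper.

    import numpy as np

    # Perturbed plug-in sketch: (1) empirical model, (2) perturbed rewards, (3) plan.
    rng = np.random.default_rng(5)
    S, A, N, gamma, xi = 5, 3, 200, 0.9, 1e-3

    P_true = rng.dirichlet(np.ones(S), size=(S, A))
    r = rng.uniform(size=(S, A))

    # Step 1: empirical model from N samples per (s, a).
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            counts = np.bincount(rng.choice(S, size=N, p=P_true[s, a]), minlength=S)
            P_hat[s, a] = counts / N

    # Step 2: slightly perturbed rewards (perturbation scale xi is an arbitrary illustrative choice).
    r_p = r + rng.uniform(0.0, xi, size=(S, A))

    # Step 3: value iteration on the perturbed empirical MDP.
    Q = np.zeros((S, A))
    for _ in range(1000):
        Q = r_p + gamma * P_hat @ Q.max(axis=1)
    pi_hat_p = Q.argmax(axis=1)                  # policy returned by the perturbed plug-in approach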

  36. Challenges in the sample-starved regime
     truth: P;  empirical estimate: P̂ ∈ R^{|S||A|×|S|}
     • Can’t recover P faithfully if sample size ≪ |S|²|A|!

  37. Challenges in the sample-starved regime
     truth: P;  empirical estimate: P̂ ∈ R^{|S||A|×|S|}
     • Can’t recover P faithfully if sample size ≪ |S|²|A|!
     • Can we trust our policy estimate when reliable model estimation is infeasible?

  38. Main result
     Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the
     perturbed empirical MDP achieves ‖V^π̂⋆_p − V⋆‖_∞ ≤ ε with sample complexity at most
        Õ( |S||A| / ((1−γ)³ ε²) )

  39. Main result
     Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the
     perturbed empirical MDP achieves ‖V^π̂⋆_p − V⋆‖_∞ ≤ ε with sample complexity at most
        Õ( |S||A| / ((1−γ)³ ε²) )
     • π̂⋆_p: obtained by empirical QVI or PI within Õ(1/(1−γ)) iterations

  40. Main result
     Theorem 1 (Li, Wei, Chi, Gu, Chen ’20). For any 0 < ε ≤ 1/(1−γ), the optimal policy π̂⋆_p of the
     perturbed empirical MDP achieves ‖V^π̂⋆_p − V⋆‖_∞ ≤ ε with sample complexity at most
        Õ( |S||A| / ((1−γ)³ ε²) )
     • π̂⋆_p: obtained by empirical QVI or PI within Õ(1/(1−γ)) iterations
     • Minimax lower bound: Ω̃( |S||A| / ((1−γ)³ ε²) ) (Azar et al. ’13)
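
A back-of-the-envelope reading of the bound, dropping constants and log factors; the problem sizes plugged in below are arbitrary illustrative numbers, not from the talk.

    # Sample-count plug-in for the bound |S||A| / ((1 - gamma)^3 * eps^2),
    # ignoring constants and logarithmic factors.
    S_size, A_size, gamma, eps = 1000, 10, 0.99, 0.1

    total_samples = S_size * A_size / ((1 - gamma) ** 3 * eps ** 2)
    per_pair = total_samples / (S_size * A_size)   # = 1 / ((1 - gamma)^3 * eps^2)
    print(f"~{total_samples:.1e} samples in total, ~{per_pair:.1e} per (s, a)")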


  42. Analysis

  43. Notation and Bellman equation
     • V^π: true value function under policy π
       ◦ Bellman equation: V^π = (I − γ P^π)^{−1} r^π

  44. Notation and Bellman equation
     • V^π: true value function under policy π
       ◦ Bellman equation: V^π = (I − γ P^π)^{−1} r^π
     • V̂^π: estimate of value function under policy π
       ◦ Bellman equation: V̂^π = (I − γ P̂^π)^{−1} r^π

  45. Notation and Bellman equation
     • V^π: true value function under policy π
       ◦ Bellman equation: V^π = (I − γ P^π)^{−1} r^π
     • V̂^π: estimate of value function under policy π
       ◦ Bellman equation: V̂^π = (I − γ P̂^π)^{−1} r^π
     • π⋆: optimal policy w.r.t. the true value function
     • π̂⋆: optimal policy w.r.t. the empirical value function
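
The two Bellman identities above amount to two linear solves, one with the true P^π and one with the empirical P̂^π. A hedged sketch follows; the toy model, the fixed policy, and the sample size are assumptions for illustration.

    import numpy as np

    # Policy evaluation in matrix form:
    #   V^pi     = (I - gamma * P^pi)^{-1} r^pi
    #   V_hat^pi = (I - gamma * P_hat^pi)^{-1} r^pi
    rng = np.random.default_rng(6)
    S, A, N, gamma = 5, 3, 100, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))
    r = rng.uniform(size=(S, A))
    pi = rng.integers(A, size=S)                 # a fixed deterministic policy

    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            counts = np.bincount(rng.choice(S, size=N, p=P[s, a]), minlength=S)
            P_hat[s, a] = counts / N

    P_pi, P_hat_pi = P[np.arange(S), pi], P_hat[np.arange(S), pi]
    r_pi = r[np.arange(S), pi]
    V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)          # true value of pi
    V_hat_pi = np.linalg.solve(np.eye(S) - gamma * P_hat_pi, r_pi)  # empirical value of pi
    print(np.max(np.abs(V_hat_pi - V_pi)))       # ell_infinity gap between the two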
