Breaking the Sample Size Barrier in Model-Based Reinforcement Learning


  1. Breaking the Sample Size Barrier in Model-Based Reinforcement Learning (Yuting Wei, Carnegie Mellon University, November 2020)

  2. Gen Li (Tsinghua EE), Yuejie Chi (CMU ECE), Yuantao Gu (Tsinghua EE), Yuxin Chen (Princeton EE)

  3. Reinforcement learning (RL)

  4. RL challenges • Unknown or changing environment • Credit assignment problem • Enormous state and action space

  5. Provable efficiency • Collecting samples might be expensive or impossible: sample efficiency • Training deep RL algorithms might take a long time: computational efficiency

  6. This talk Question: can we design sample- and computation-efficient RL algorithms? (inspired by numerous prior works [Kearns and Singh, 1999, Sidford et al., 2018a, Agarwal et al., 2019], among others)

  7. Background: Markov decision processes

  8. Markov decision process (MDP) • $\mathcal{S}$: state space • $\mathcal{A}$: action space

  9. Markov decision process (MDP) • $\mathcal{S}$: state space • $\mathcal{A}$: action space • $r(s,a) \in [0,1]$: immediate reward

  10. Markov decision process (MDP) • $\mathcal{S}$: state space • $\mathcal{A}$: action space • $r(s,a) \in [0,1]$: immediate reward • $\pi(\cdot \mid s)$: policy (or action selection rule)

  11. Markov decision process (MDP) • $\mathcal{S}$: state space • $\mathcal{A}$: action space • $r(s,a) \in [0,1]$: immediate reward • $\pi(\cdot \mid s)$: policy (or action selection rule) • $P(\cdot \mid s,a)$: unknown transition probabilities
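
A concrete way to picture a tabular MDP of this kind is as a set of dense arrays. This is purely an illustration (not from the slides): the sizes, the random kernel, and the names `P`, `r`, `pi`, `gamma` are placeholders.

```python
import numpy as np

# Tabular MDP stored as dense arrays (illustrative sizes).
n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)

P = rng.random((n_states, n_actions, n_states))        # P[s, a, s'] = transition probability
P /= P.sum(axis=2, keepdims=True)                      # each row P[s, a, :] sums to one
r = rng.random((n_states, n_actions))                  # rewards r(s, a) in [0, 1]
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # a uniform stochastic policy pi(a | s)
```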

  12. Help the mouse!

  13. Help the mouse! • state space $\mathcal{S}$: positions in the maze

  14. Help the mouse! • state space $\mathcal{S}$: positions in the maze • action space $\mathcal{A}$: up, down, left, right

  15. Help the mouse! • state space $\mathcal{S}$: positions in the maze • action space $\mathcal{A}$: up, down, left, right • immediate reward $r$: cheese, electric shocks, cats

  16. Help the mouse! • state space $\mathcal{S}$: positions in the maze • action space $\mathcal{A}$: up, down, left, right • immediate reward $r$: cheese, electric shocks, cats • policy $\pi(\cdot \mid s)$: the way to find cheese

  17. Value function Value function of policy $\pi$: long-term discounted reward $V^{\pi}(s) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s\big]$ for all $s \in \mathcal{S}$

  18. Value function Value function of policy $\pi$: long-term discounted reward $V^{\pi}(s) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s\big]$ for all $s \in \mathcal{S}$ • $\gamma \in [0,1)$: discount factor • $(a_0, s_1, a_1, s_2, a_2, \dots)$: generated under policy $\pi$

  19. Action-value function (a.k.a. Q-function) Q-function of policy $\pi$: $Q^{\pi}(s,a) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s, a_{0} = a\big]$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$ • $(a_0, s_1, a_1, s_2, a_2, \dots)$: generated under policy $\pi$

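As an illustration of these definitions (not from the slides; the function name, the finite horizon, and the array shapes are assumptions for a tabular MDP with `P` of shape (S, A, S), `r` of shape (S, A), and a stochastic policy `pi` of shape (S, A)), $V^{\pi}$ and $Q^{\pi}$ can be approximated by averaging truncated discounted returns over Monte Carlo rollouts:

```python
import numpy as np

def rollout_return(P, r, pi, gamma, s0, a0=None, horizon=200, rng=None):
    """Discounted return of one trajectory from s0 (optionally forcing the first action a0) under pi."""
    rng = rng or np.random.default_rng()
    n_states, n_actions, _ = P.shape
    s, ret, disc = s0, 0.0, 1.0
    for t in range(horizon):
        a = a0 if (t == 0 and a0 is not None) else rng.choice(n_actions, p=pi[s])
        ret += disc * r[s, a]
        disc *= gamma
        s = rng.choice(n_states, p=P[s, a])
    return ret

# V^pi(s): average return from s;  Q^pi(s, a): average return from s with first action a, e.g.
# V_est = np.mean([rollout_return(P, r, pi, gamma, s0=0) for _ in range(1000)])
# Q_est = np.mean([rollout_return(P, r, pi, gamma, s0=0, a0=1) for _ in range(1000)])
```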

  21. Optimal policy

  22. Optimal policy • optimal policy $\pi^{\star}$: maximizing value function

  23. Optimal policy • optimal policy $\pi^{\star}$: maximizing value function • optimal value / Q-function: $V^{\star} := V^{\pi^{\star}}$, $Q^{\star} := Q^{\pi^{\star}}$
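
When the transition kernel $P$ is known, $Q^{\star}$ (and hence an optimal policy $\pi^{\star}$) can be computed by standard value iteration on the Bellman optimality operator. A minimal sketch, with array shapes as in the MDP sketch above; the tolerance and iteration cap are arbitrary choices, not from the slides:

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iters=100_000):
    """Iterate Q(s,a) <- r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a') until convergence."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(max_iters):
        Q_new = r + gamma * (P @ Q.max(axis=1))   # P @ v contracts over the last axis of P
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# V*(s) = max_a Q*(s, a); any policy greedy w.r.t. Q* is optimal, e.g.
# Q_star = value_iteration(P, r, gamma); pi_star = Q_star.argmax(axis=1)
```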

  24. Practically, learn the optimal policy from data samples . . .

  25. This talk: sampling from a generative model

  26. This talk: sampling from a generative model For each state-action pair $(s,a)$, collect $N$ samples $\{(s, a, s'^{(i)})\}_{1 \le i \le N}$

  27. This talk: sampling from a generative model For each state-action pair $(s,a)$, collect $N$ samples $\{(s, a, s'^{(i)})\}_{1 \le i \le N}$ How many samples are sufficient to learn an $\varepsilon$-optimal policy?
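
A rough sketch of what this sampling model looks like in code (the function name and shapes are illustrative; the true kernel `P` is unknown to the learner and is used here only to simulate the generative model):

```python
import numpy as np

def sample_generative_model(P, N, rng=None):
    """For every (s, a), draw N i.i.d. next states s' ~ P(. | s, a).

    P has shape (n_states, n_actions, n_states); returns an integer array of
    shape (n_states, n_actions, N), where samples[s, a, i] is the i-th draw.
    """
    rng = rng or np.random.default_rng()
    n_states, n_actions, _ = P.shape
    samples = np.empty((n_states, n_actions, N), dtype=np.int64)
    for s in range(n_states):
        for a in range(n_actions):
            samples[s, a] = rng.choice(n_states, size=N, p=P[s, a])
    return samples
```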

  28. An incomplete list of prior art • [Kearns and Singh, 1999] • [Kakade, 2003] • [Kearns et al., 2002] • [Azar et al., 2012] • [Azar et al., 2013] • [Sidford et al., 2018a] • [Sidford et al., 2018b] • [Wang, 2019] • [Agarwal et al., 2019] • [Wainwright, 2019a] • [Wainwright, 2019b] • [Pananjady and Wainwright, 2019] • [Yang and Wang, 2019] • [Khamaru et al., 2020] • [Mou et al., 2020] • ...

  29. An even shorter list of prior art (bounds stated orderwise)
      • Empirical QVI [Azar et al., 2013]: sample size range $\big[\frac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$, sample complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$, $\varepsilon$-range $\big(0, \frac{1}{\sqrt{(1-\gamma)|\mathcal{S}|}}\big]$
      • Sublinear randomized VI [Sidford et al., 2018b]: sample size range $\big[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$, sample complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$, $\varepsilon$-range $\big(0, \frac{1}{1-\gamma}\big]$
      • Variance-reduced QVI [Sidford et al., 2018a]: sample size range $\big[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3}, \infty\big)$, sample complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$, $\varepsilon$-range $(0, 1]$
      • Randomized primal-dual [Wang, 2019]: sample size range $\big[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$, sample complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$, $\varepsilon$-range $\big(0, \frac{1}{1-\gamma}\big]$
      • Empirical MDP + planning [Agarwal et al., 2019]: sample size range $\big[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}, \infty\big)$, sample complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$, $\varepsilon$-range $\big(0, \frac{1}{\sqrt{1-\gamma}}\big]$
      Important parameters: # states $|\mathcal{S}|$ and # actions $|\mathcal{A}|$ • the discounted complexity $\frac{1}{1-\gamma}$ • the approximation error $\varepsilon \in \big(0, \frac{1}{1-\gamma}\big]$

  32. All prior theory requires sample size $> \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{2}}$ (the sample size barrier)

  33. This talk: break the sample complexity barrier

  34. Two approaches Model-based approach ("plug-in"): 1. build an empirical estimate $\widehat{P}$ of $P$; 2. planning based on the empirical $\widehat{P}$

  35. Two approaches Model-based approach ("plug-in"): 1. build an empirical estimate $\widehat{P}$ of $P$; 2. planning based on the empirical $\widehat{P}$ Model-free approach: learning w/o constructing a model explicitly

  37. Model estimation Sampling: for each $(s,a)$, collect $N$ independent samples $\{(s, a, s'^{(i)})\}_{1 \le i \le N}$

  38. Model estimation Sampling: for each $(s,a)$, collect $N$ independent samples $\{(s, a, s'^{(i)})\}_{1 \le i \le N}$ Empirical estimates: estimate $P(s' \mid s, a)$ by the empirical frequency $\widehat{P}(s' \mid s, a) := \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{s'^{(i)} = s'\}$
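
In code, this empirical-frequency estimate is simply a per-(s, a) histogram of the sampled next states. A sketch that reuses the sample array layout from the generative-model sketch above; the function name is illustrative:

```python
import numpy as np

def empirical_transition(samples, n_states):
    """P_hat(s'|s,a) = (1/N) * #{ i : s'^(i) = s' }, computed for every (s, a) pair.

    `samples` has shape (n_states, n_actions, N), as produced by the sampling sketch above.
    """
    n_s, n_a, N = samples.shape
    P_hat = np.zeros((n_s, n_a, n_states))
    for s in range(n_s):
        for a in range(n_a):
            P_hat[s, a] = np.bincount(samples[s, a], minlength=n_states) / N
    return P_hat
```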

  39. Our method: plug-in estimator + perturbation

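The slides only name the method here, so the sketch below is one plausible reading rather than the authors' exact construction: build $\widehat{P}$, add a tiny random perturbation to the rewards so that the empirical MDP has an unambiguous greedy policy, then plan on the perturbed empirical MDP. The noise scale `xi`, the perturbation form, and the reuse of `empirical_transition` and `value_iteration` from the earlier sketches are all assumptions.

```python
import numpy as np

def perturbed_plug_in_policy(samples, r, gamma, n_states, xi=1e-6, rng=None):
    """Plug-in estimator + reward perturbation (illustrative sketch, not the paper's exact scheme).

    1. form the empirical kernel P_hat from the generative-model samples
    2. add a tiny random perturbation to the rewards (scale `xi` is a placeholder)
    3. plan on the perturbed empirical MDP and return its greedy policy
    """
    rng = rng or np.random.default_rng()
    P_hat = empirical_transition(samples, n_states)   # see the model-estimation sketch
    r_p = r + xi * rng.random(r.shape)                # assumed form of the perturbation
    Q_hat = value_iteration(P_hat, r_p, gamma)        # see the value-iteration sketch
    return Q_hat.argmax(axis=1)                       # greedy policy of the perturbed empirical MDP
```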

  43. Challenges in the sample-starved regime • truth: $P \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|}$ vs. empirical estimate: $\widehat{P}$ • can't recover $P$ faithfully if sample size $\ll |\mathcal{S}|^{2}|\mathcal{A}|$!

  44. Challenges in the sample-starved regime • truth: $P \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|}$ vs. empirical estimate: $\widehat{P}$ • can't recover $P$ faithfully if sample size $\ll |\mathcal{S}|^{2}|\mathcal{A}|$! Can we trust our policy estimate when reliable model estimation is infeasible?

  45. Main result Theorem (Li, Wei, Chi, Gu, Chen '20) For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the policy $\widehat{\pi}^{\star}_{p}$ of the perturbed empirical MDP achieves $\|V^{\widehat{\pi}^{\star}_{p}} - V^{\star}\|_{\infty} \le \varepsilon$ and $\|Q^{\widehat{\pi}^{\star}_{p}} - Q^{\star}\|_{\infty} \le \gamma\varepsilon$ with sample complexity at most $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\varepsilon^{2}}\big)$.

  46. Main result Theorem (Li, Wei, Chi, Gu, Chen '20) For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the policy $\widehat{\pi}^{\star}_{p}$ of the perturbed empirical MDP achieves $\|V^{\widehat{\pi}^{\star}_{p}} - V^{\star}\|_{\infty} \le \varepsilon$ and $\|Q^{\widehat{\pi}^{\star}_{p}} - Q^{\star}\|_{\infty} \le \gamma\varepsilon$ with sample complexity at most $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\varepsilon^{2}}\big)$. • $\widehat{\pi}^{\star}_{p}$: obtained by empirical QVI or PI within $\widetilde{O}\big(\frac{1}{1-\gamma}\big)$ iterations

  47. Main result Theorem (Li, Wei, Chi, Gu, Chen '20) For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the policy $\widehat{\pi}^{\star}_{p}$ of the perturbed empirical MDP achieves $\|V^{\widehat{\pi}^{\star}_{p}} - V^{\star}\|_{\infty} \le \varepsilon$ and $\|Q^{\widehat{\pi}^{\star}_{p}} - Q^{\star}\|_{\infty} \le \gamma\varepsilon$ with sample complexity at most $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\varepsilon^{2}}\big)$. • $\widehat{\pi}^{\star}_{p}$: obtained by empirical QVI or PI within $\widetilde{O}\big(\frac{1}{1-\gamma}\big)$ iterations • minimax lower bound: $\widetilde{\Omega}\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\varepsilon^{2}}\big)$ [Azar et al., 2013]
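
To get a feel for the bound, one can evaluate the orderwise expression directly. The snippet below ignores the constants and log factors hidden by the $\widetilde{O}$, and the concrete numbers are made up purely for illustration:

```python
def sample_complexity_order(n_states, n_actions, gamma, eps):
    """Orderwise total sample count |S||A| / ((1 - gamma)^3 * eps^2).

    Constants and log factors hidden by the O-tilde in the theorem are ignored.
    """
    return n_states * n_actions / ((1.0 - gamma) ** 3 * eps ** 2)

# e.g. |S| = 1000, |A| = 10, gamma = 0.99, eps = 0.1  ->  about 1e12 samples (orderwise)
print(f"{sample_complexity_order(1000, 10, 0.99, 0.1):.3g}")
```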

  49. A sketch of the main proof ingredients

  50. Notation and Bellman equation • $V^{\pi}$: true value function under policy $\pi$, with Bellman equation $V^{\pi} = (I - \gamma P_{\pi})^{-1} r$ [Sutton and Barto, 2018]

  51. Notation and Bellman equation • $V^{\pi}$: true value function under policy $\pi$, with Bellman equation $V^{\pi} = (I - \gamma P_{\pi})^{-1} r$ [Sutton and Barto, 2018] • $\widehat{V}^{\pi}$: estimate of the value function under policy $\pi$, with Bellman equation $\widehat{V}^{\pi} = (I - \gamma \widehat{P}_{\pi})^{-1} r$

  52. Notation and Bellman equation • $V^{\pi}$: true value function under policy $\pi$, with Bellman equation $V^{\pi} = (I - \gamma P_{\pi})^{-1} r$ [Sutton and Barto, 2018] • $\widehat{V}^{\pi}$: estimate of the value function under policy $\pi$, with Bellman equation $\widehat{V}^{\pi} = (I - \gamma \widehat{P}_{\pi})^{-1} r$ • $\pi^{\star}$: optimal policy w.r.t. the true value function • $\widehat{\pi}^{\star}$: optimal policy w.r.t. the empirical value function
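
In the tabular setting this Bellman equation is a small linear system, so exact policy evaluation is one linear solve. A sketch, assuming the array shapes of the earlier MDP sketch; here the policy-averaged kernel $P_{\pi}$ and reward vector are formed explicitly, which is one standard way to realize the matrix equation on the slide:

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma):
    """Solve V^pi = (I - gamma * P_pi)^(-1) r_pi exactly.

    P: (S, A, S) kernel, r: (S, A) rewards, pi: (S, A) policy probabilities.
    P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a) and r_pi[s] = sum_a pi(a|s) r(s, a).
    """
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```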
