Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned (CS 330)


  1. Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned (CS 330) 1

  2. Logistics: Homework 2 due Wednesday. Homework 3 out on Wednesday. Project proposal due next Wednesday. 2

  3. Why Reinforcement Learning? When do you not need sequential decision making? When your system is making a single, isolated decision (e.g. classification, regression) and that decision does not affect future inputs or decisions; this describes most deployed ML systems. Common applications of sequential decision making: robotics, language & dialog, autonomous driving, business operations, finance. It is also a key aspect of intelligence. 3

  4. The Plan: multi-task reinforcement learning problem; policy gradients & their multi-task/meta counterparts; Q-learning (should be review); multi-task Q-learning. 4

  5. Supervised learning (e.g. object classification) vs. sequential decision making (e.g. object manipulation): i.i.d. data vs. actions that affect the next state (how to collect data?); a large labeled, curated dataset vs. what are the labels?; well-defined notions of success vs. what does success mean? 5

  6. Terminology & notation: states s_t, observations o_t, actions a_t, policy π_θ(a_t | o_t) (or π_θ(a_t | s_t) if fully observed). [Figure: an observed scene with candidate actions: 1. run away, 2. ignore, 3. pet.] 6 Slide adapted from Sergey Levine

  7. Imitation Learning: collect training data of (observation, action) pairs from an expert, then fit the policy π_θ(a | o) with supervised learning. [Figure: end-to-end learned driving from camera images.] Images: Bojarski et al. '16, NVIDIA. 7 Slide adapted from Sergey Levine

  8. Reward functions: the reward function r(s, a) specifies which states and actions are better. 8 Slide adapted from Sergey Levine

  9. The goal of reinforcement learning: find θ* = argmax_θ E_{τ∼p_θ(τ)}[Σ_t r(s_t, a_t)] (finite horizon case); in the infinite horizon case, maximize the expected reward under the stationary state-action distribution. 9 Slide adapted from Sergey Levine

  10. What is a reinforcement learning task? Recall supervised learning, where a task is 𝒯_i ≜ {p_i(x), p_i(y|x), ℒ_i} (data-generating distributions and a loss). In reinforcement learning, a task is 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a), r_i(s,a)} (state space, action space, initial state distribution, dynamics, reward): a Markov decision process. This is much more than the semantic meaning of "task"! 10
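
Below is a minimal NumPy sketch, not from the slides, of the task tuple 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a), r_i(s,a)} for a discrete MDP; the class and field names are illustrative assumptions.

```python
# A minimal sketch of a task as an MDP tuple: state space, action space,
# initial state distribution, dynamics, and reward. Names are illustrative.
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class RLTask:
    num_states: int                       # |S_i| (discrete, for simplicity)
    num_actions: int                      # |A_i|
    p_initial: np.ndarray                 # p_i(s_1), shape [S]
    p_transition: np.ndarray              # p_i(s'|s,a), shape [S, A, S]
    reward: Callable[[int, int], float]   # r_i(s, a)

    def reset(self, rng: np.random.Generator) -> int:
        return int(rng.choice(self.num_states, p=self.p_initial))

    def step(self, s: int, a: int, rng: np.random.Generator):
        s_next = int(rng.choice(self.num_states, p=self.p_transition[s, a]))
        return s_next, self.reward(s, a)
```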

  11. Examples of Task Distributions. A task: 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a), r_i(s,a)}. Personalized recommendations: p_i(s'|s,a), r_i(s,a) vary across tasks. Character animation: r_i(s,a) varies across maneuvers; p_i(s_1), p_i(s'|s,a) vary across garments & initial states. Multi-robot RL: 𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a) vary. 11

  12. What is a reinforcement learning task? In reinforcement learning, a task is 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a), r_i(s,a)} (state space, action space, initial state distribution, dynamics, reward). An alternative view: make the task identifier part of the state, s = (s̄, z_i), where s̄ is the original state. Then each task is 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p(s'|s,a), r(s,a)}, and the collection {𝒯_i} = {∪_i 𝒮_i, ∪_i 𝒜_i, (1/N) Σ_i p_i(s_1), p(s'|s,a), r(s,a)} can be cast as a standard Markov decision process! 12
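
A minimal sketch, not from the slides, of the "task identifier is part of the state" view: concatenate a one-hot task ID z_i onto the original state s̄. The function name and the one-hot encoding choice are assumptions.

```python
# Augment the original state s̄ with a one-hot task ID z_i so a single
# policy / Q-function sees s = (s̄, z_i).
import numpy as np

def augment_state(s_bar: np.ndarray, task_id: int, num_tasks: int) -> np.ndarray:
    z = np.zeros(num_tasks, dtype=s_bar.dtype)
    z[task_id] = 1.0
    return np.concatenate([s_bar, z])   # s = (s̄, z_i)

# e.g. a 4-dim state from task 2 of 3 becomes a 7-dim input
s = augment_state(np.array([0.1, -0.3, 0.5, 0.0]), task_id=2, num_tasks=3)
```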

  13. The goal of multi-task reinforcement learning: the same as before, except a task identifier is part of the state, s = (s̄, z_i), where z_i can be e.g. a one-hot task ID, a language description, or a desired goal state z_i = s_g ("goal-conditioned RL"). What is the reward? The same as before; or, for goal-conditioned RL, r(s) = r(s̄, s_g) = -d(s̄, s_g) for a distance function d, e.g. Euclidean ℓ2 or sparse 0/1. If it's still a standard Markov decision process, then why not apply standard RL algorithms? You can! But you can often do better. 13
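
A minimal sketch, not from the slides, of the goal-conditioned reward r(s̄, s_g) = -d(s̄, s_g) with the two distance functions mentioned on the slide; the sparse variant's threshold and 0/-1 convention are assumptions.

```python
# Goal-conditioned rewards r(s) = r(s̄, s_g) = -d(s̄, s_g).
import numpy as np

def euclidean_reward(s_bar: np.ndarray, s_goal: np.ndarray) -> float:
    return -float(np.linalg.norm(s_bar - s_goal))            # -ℓ2 distance

def sparse_reward(s_bar: np.ndarray, s_goal: np.ndarray, eps: float = 1e-2) -> float:
    # sparse 0/1: reward 0 near the goal, -1 elsewhere (a common convention; threshold is assumed)
    return 0.0 if np.linalg.norm(s_bar - s_goal) < eps else -1.0
```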

  14. The Plan: multi-task reinforcement learning problem; policy gradients & their multi-task/meta counterparts; Q-learning; multi-task Q-learning. 14

  15. The anatomy of a reinforcement learning algorithm: generate samples (i.e. run the policy), fit a model to estimate return, improve the policy. This lecture: focus on model-free RL methods (policy gradient, Q-learning). 11/6: focus on model-based RL methods. 15
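
The three boxes on this slide can be written as a generic loop. The sketch below is schematic and assumes caller-supplied callables; the names are placeholders, not a real RL library API.

```python
# A schematic loop over the three stages: generate samples, fit a return
# estimator, improve the policy. All callables are caller-supplied placeholders.
from typing import Any, Callable

def rl_training_loop(policy: Any,
                     generate_samples: Callable,      # run the policy, return trajectories
                     fit_return_estimator: Callable,  # e.g. Monte Carlo returns, Q_ϕ, or a model
                     improve_policy: Callable,        # e.g. a policy-gradient step or argmax w.r.t. Q
                     num_iterations: int) -> Any:
    for _ in range(num_iterations):
        trajectories = generate_samples(policy)
        estimator = fit_return_estimator(trajectories)
        policy = improve_policy(policy, trajectories, estimator)
    return policy
```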

  16. Evaluating the objective: J(θ) = E_{τ∼p_θ(τ)}[Σ_t r(s_t, a_t)] ≈ (1/N) Σ_i Σ_t r(s_{i,t}, a_{i,t}), estimated by sampling N trajectories from the policy. 16 Slide adapted from Sergey Levine

  17. Direct policy differentiation: ∇_θ J(θ) = ∫ ∇_θ p_θ(τ) r(τ) dτ; using the convenient identity p_θ(τ) ∇_θ log p_θ(τ) = ∇_θ p_θ(τ), this becomes ∇_θ J(θ) = E_{τ∼p_θ(τ)}[∇_θ log p_θ(τ) r(τ)]. 17 Slide adapted from Sergey Levine

  18. Direct policy differentiation (continued): log p_θ(τ) = log p(s_1) + Σ_t [log π_θ(a_t|s_t) + log p(s_{t+1}|s_t, a_t)], and the initial-state and dynamics terms do not depend on θ, so ∇_θ J(θ) = E_{τ∼p_θ(τ)}[(Σ_t ∇_θ log π_θ(a_t|s_t)) (Σ_t r(s_t, a_t))]. 18 Slide adapted from Sergey Levine

  19. Evaluating the policy gradient (REINFORCE): generate samples by running the policy; estimate the gradient with the Monte Carlo estimate ∇_θ J(θ) ≈ (1/N) Σ_i (Σ_t ∇_θ log π_θ(a_{i,t}|s_{i,t})) (Σ_t r(s_{i,t}, a_{i,t})); improve the policy with θ ← θ + α ∇_θ J(θ). 19 Slide adapted from Sergey Levine

  20. Comparison to maximum likelihood: the policy gradient (1/N) Σ_i (Σ_t ∇_θ log π_θ(a_{i,t}|s_{i,t})) (Σ_t r(s_{i,t}, a_{i,t})) is the maximum likelihood gradient (1/N) Σ_i Σ_t ∇_θ log π_θ(a_{i,t}|s_{i,t}) with each trajectory weighted by its return, i.e. supervised training on the collected data. Multi-task learning algorithms can readily be applied! 20 Slide adapted from Sergey Levine
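
A minimal PyTorch sketch, not the course's reference code, of this return-weighted maximum-likelihood view for a discrete-action policy; `policy_net` and the single-trajectory input layout are assumptions.

```python
# Policy gradient as return-weighted maximum likelihood for one trajectory.
# states: [T, obs_dim] float tensor, actions: [T] long tensor, rewards: [T] float tensor.
import torch

def reinforce_loss(policy_net, states, actions, rewards):
    logits = policy_net(states)                                                    # [T, num_actions]
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)   # [T]
    trajectory_return = rewards.sum()                                              # Σ_t r(s_t, a_t)
    # maximum likelihood would minimize -log_probs.sum();
    # the policy gradient weights the log-likelihood by the trajectory return
    return -(log_probs.sum() * trajectory_return)

# usage: loss = reinforce_loss(policy_net, states, actions, rewards); loss.backward(); optimizer.step()
```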

  21. What did we just do? Good stuff is made more likely; bad stuff is made less likely. This simply formalizes the notion of "trial and error"! Can we use policy gradients with meta-learning? 21 Slide adapted from Sergey Levine

  22. Example: MAML + policy gradient two tasks: running backward, running forward 22 Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML ‘17

  23. Example: MAML + policy gradient; two tasks: running backward, running forward. There exists a representation under which RL is fast and efficient. 23 Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML ‘17

  24. Example: Black-box meta-learning + policy gradient Experiment : Learning to visually navigate a maze - train on 1000 small mazes - test on held-out small mazes and large mazes 24 Mishra, Rohaninejad, Chen, Abbeel. A Simple Neural Attentive Meta-Learner ICLR ‘18

  25. Policy Gradients. Pros: + simple; + easy to combine with existing multi-task & meta-learning algorithms. Cons: - produces a high-variance gradient (can be mitigated with baselines, used by all algorithms in practice, and trust regions); - requires on-policy data (cannot reuse existing experience to estimate the gradient; importance weights can help, but are also high variance). 25
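
A minimal NumPy sketch of the baseline trick mentioned above: subtracting the average return from each trajectory's return keeps the gradient estimate unbiased while reducing its variance. The per-trajectory-return input format is an assumption.

```python
# Baseline subtraction for variance reduction: weight each trajectory's
# log-likelihood by (R_i - b) instead of R_i, with b the average return.
import numpy as np

def advantages_with_baseline(trajectory_returns: np.ndarray) -> np.ndarray:
    baseline = trajectory_returns.mean()      # b = average return (a simple, valid baseline)
    return trajectory_returns - baseline      # unbiased, lower-variance weights
```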

  26. The Plan: multi-task reinforcement learning problem; policy gradients & their multi-task/meta counterparts; Q-learning; multi-task Q-learning. 26

  27. Value-Based RL: Definitions. Value function: V^π(s_t) = E_π[Σ_{t'=t}^T r(s_{t'}, a_{t'}) | s_t], the total reward starting from s_t and following π ("how good is a state"). Q function: Q^π(s_t, a_t) = E_π[Σ_{t'=t}^T r(s_{t'}, a_{t'}) | s_t, a_t], the total reward starting from s_t, taking a_t, and then following π ("how good is a state-action pair"). They're related: V^π(s_t) = E_{a_t∼π(·|s_t)}[Q^π(s_t, a_t)]. If you know Q^π, you can use it to improve π: set π(a|s) ← 1 for a = argmax_ā Q^π(s, ā). The new policy is at least as good as the old policy. 27
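
A minimal tabular sketch of the policy-improvement step on this slide: given Q^π as a table, the improved policy puts probability 1 on argmax_ā Q^π(s, ā).

```python
# Greedy policy improvement from a tabular Q^π of shape [num_states, num_actions].
import numpy as np

def greedy_policy_from_q(q_table: np.ndarray) -> np.ndarray:
    num_states, num_actions = q_table.shape
    policy = np.zeros((num_states, num_actions))
    policy[np.arange(num_states), q_table.argmax(axis=1)] = 1.0   # π(a|s) ← 1 for a = argmax_ā Q^π(s, ā)
    return policy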

  28. Value-Based RL: Definitions. Value function: V^π(s_t) = E_π[Σ_{t'=t}^T r(s_{t'}, a_{t'}) | s_t], the total reward starting from s_t and following π ("how good is a state"). Q function: Q^π(s_t, a_t) = E_π[Σ_{t'=t}^T r(s_{t'}, a_{t'}) | s_t, a_t], the total reward starting from s_t, taking a_t, and then following π ("how good is a state-action pair"). For the optimal policy π*, the Bellman equation holds: Q*(s_t, a_t) = E_{s'∼p(·|s,a)}[r(s, a) + γ max_{a'} Q*(s', a')]. 28
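
A minimal tabular sketch of using the Bellman equation as an update rule (Q-value iteration) when the MDP is known; the array shapes for the dynamics and reward are assumptions.

```python
# Q-value iteration on a known MDP: repeatedly apply the Bellman backup.
# P is p(s'|s,a) with shape [S, A, S]; R is r(s,a) with shape [S, A].
import numpy as np

def q_value_iteration(P: np.ndarray, R: np.ndarray, gamma: float = 0.99, iters: int = 1000) -> np.ndarray:
    num_states, num_actions, _ = P.shape
    Q = np.zeros((num_states, num_actions))
    for _ in range(iters):
        # Q(s,a) ← r(s,a) + γ E_{s'∼p(·|s,a)}[max_{a'} Q(s',a')]
        Q = R + gamma * P @ Q.max(axis=1)
    return Q
```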

  29. Fitted Q-iteration algorithm (with hyperparameters such as the dataset size, the number of Bellman-backup iterations, and the number of fitting steps): collect a dataset {(s, a, s', r)} using some policy; set targets y ← r(s, a) + γ max_{a'} Q_ϕ(s', a'); fit ϕ ← argmin_ϕ Σ ||Q_ϕ(s, a) - y||²; repeat. Result: get a policy from π(a|s) via argmax_a Q_ϕ(s, a). Important notes: we can reuse data from previous policies (using replay buffers; an off-policy algorithm), and this is not a gradient descent algorithm! Can be readily extended to multi-task / goal-conditioned RL. 29 Slide adapted from Sergey Levine
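
A minimal PyTorch sketch, not the course's reference implementation, of one fitted Q-iteration update from a replay-buffer batch; the batch layout and `q_net` interface are assumptions. The Bellman target is held fixed under `torch.no_grad()`, so we do not differentiate through the max.

```python
# One fitted Q-iteration / Q-learning style update from a replay-buffer batch.
import torch
import torch.nn.functional as F

def fitted_q_update(q_net, optimizer, batch, gamma: float = 0.99):
    s, a, r, s_next, done = batch          # [B, obs_dim], [B], [B], [B, obs_dim], [B] (done as 0/1 floats)
    with torch.no_grad():                  # Bellman targets are held fixed
        y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q_ϕ(s, a)
    loss = F.mse_loss(q_sa, y)             # fit ϕ to the targets
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```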

  30. The Plan: multi-task reinforcement learning problem; policy gradients & their multi-task/meta counterparts; Q-learning; multi-task Q-learning. 30

  31. Multi-Task RL Algorithms. Policy: π_θ(a | s̄) -> π_θ(a | s̄, z_i). Q-function: Q_ϕ(s̄, a) -> Q_ϕ(s̄, a, z_i). Analogous to multi-task supervised learning: stratified sampling, soft/hard weight sharing, etc. What is different about reinforcement learning? The data distribution is controlled by the agent! You may know what aspect(s) of the MDP are changing across tasks; can we leverage this knowledge? Should we share data in addition to sharing weights? 31
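
A minimal PyTorch sketch, not from the slides, of a task-conditioned Q-function Q_ϕ(s̄, a, z_i) that concatenates a task vector to the state; the architecture details are illustrative assumptions.

```python
# A task-conditioned Q-network: input is (s̄, z_i), output is one Q-value per action.
import torch
import torch.nn as nn

class MultiTaskQNetwork(nn.Module):
    def __init__(self, obs_dim: int, num_tasks: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + num_tasks, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),          # Q_ϕ(s̄, ·, z_i)
        )

    def forward(self, s_bar: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s_bar, z], dim=-1))
```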

  32. An example. Task 1: passing; Task 2: shooting goals. What if you accidentally perform a good pass when trying to shoot a goal? Store the experience as normal, *and* relabel the experience with the passing task's ID & reward and store that too: "hindsight relabeling", "hindsight experience replay" (HER). 32

  33. Goal-conditioned RL with hindsight relabeling:
1. Collect data 𝒟_k = {(s_{1:T}, a_{1:T}, s_g, r_{1:T})} using some policy.
2. Store the data in the replay buffer: 𝒟 ← 𝒟 ∪ 𝒟_k.
3. Perform hindsight relabeling:
  a. Relabel the experience in 𝒟_k using the last state as the goal: 𝒟'_k = {(s_{1:T}, a_{1:T}, s_T, r'_{1:T})} where r'_t = -d(s_t, s_T). (Other relabeling strategies: use any state from the trajectory as the goal.)
  b. Store the relabeled data in the replay buffer: 𝒟 ← 𝒟 ∪ 𝒟'_k.
4. Update the policy using the replay buffer 𝒟, increment k, and repeat.
Result: exploration challenges alleviated. 33
Kaelbling. Learning to Achieve Goals. IJCAI ‘93. Andrychowicz et al. Hindsight Experience Replay. NeurIPS ‘17
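
A minimal NumPy sketch of relabeling step 3a above: replace a trajectory's goal with its own final state and recompute rewards as r'_t = -d(s_t, s_T); the array layout and the Euclidean choice of d are assumptions.

```python
# Hindsight relabeling of one trajectory: use the last state as the new goal.
import numpy as np

def hindsight_relabel(states: np.ndarray, actions: np.ndarray):
    """states: [T, state_dim] (s_1..s_T), actions: [T, act_dim]."""
    new_goal = states[-1]                                      # s_T becomes the relabeled goal s_g
    rewards = -np.linalg.norm(states - new_goal, axis=-1)      # r'_t = -d(s_t, s_T), Euclidean d
    return states, actions, new_goal, rewards
```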
