Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned (CS 330)


  1. Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned (CS 330) 1

  2. Logistics: Homework 2 due Wednesday. Homework 3 out on Wednesday. Project proposal due next Wednesday. 2

  3. Why Reinforcement Learning? When do you not need sequential decision making? When your system is making a single, isolated decision (e.g. classification, regression) and that decision does not affect future inputs or decisions; this describes most deployed ML systems. Common applications of sequential decision making: robotics, language & dialog, autonomous driving, business operations, finance. It is also a key aspect of intelligence. 3

  4. The Plan: multi-task reinforcement learning problem; policy gradients & their multi-task/meta counterparts; Q-learning (should be review); multi-task Q-learning. 4

  5. Supervised learning (e.g. object classification) vs. sequential decision making (e.g. object manipulation): i.i.d. data vs. actions that affect the next state (how to collect data?); a large labeled, curated dataset vs. what are the labels?; well-defined notions of success vs. what does success mean? 5

  6. Terminology & notation: states s_t, observations o_t, actions a_t, policy π_θ(a_t | o_t) (or π_θ(a_t | s_t) if fully observed). [Figure: an observed scene with candidate actions: 1. run away, 2. ignore, 3. pet.] 6 Slide adapted from Sergey Levine

  7. Imitation Learning: collect training data of (observation, action) pairs from an expert, then fit the policy π_θ(a | o) with supervised learning. [Figure: end-to-end learned driving from camera images.] Images: Bojarski et al. '16, NVIDIA. 7 Slide adapted from Sergey Levine

  8. Reward functions: the reward function r(s, a) specifies which states and actions are better. 8 Slide adapted from Sergey Levine

  9. The goal of reinforcement learning: find θ* = argmax_θ E_{τ∼p_θ(τ)}[Σ_t r(s_t, a_t)] (finite horizon case); in the infinite horizon case, maximize the expected reward under the stationary state-action distribution. 9 Slide adapted from Sergey Levine

  10. What is a reinforcement learning task? Recall supervised learning, where a task is 𝒯_i ≜ {p_i(x), p_i(y|x), ℒ_i} (data-generating distributions and a loss). In reinforcement learning, a task is 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a), r_i(s,a)} (state space, action space, initial state distribution, dynamics, reward): a Markov decision process. This is much more than the semantic meaning of "task"! 10
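
Below is a minimal NumPy sketch, not from the slides, of the task tuple 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a), r_i(s,a)} for a discrete MDP; the class and field names are illustrative assumptions.

```python
# A minimal sketch of a task as an MDP tuple: state space, action space,
# initial state distribution, dynamics, and reward. Names are illustrative.
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class RLTask:
    num_states: int                       # |S_i| (discrete, for simplicity)
    num_actions: int                      # |A_i|
    p_initial: np.ndarray                 # p_i(s_1), shape [S]
    p_transition: np.ndarray              # p_i(s'|s,a), shape [S, A, S]
    reward: Callable[[int, int], float]   # r_i(s, a)

    def reset(self, rng: np.random.Generator) -> int:
        return int(rng.choice(self.num_states, p=self.p_initial))

    def step(self, s: int, a: int, rng: np.random.Generator):
        s_next = int(rng.choice(self.num_states, p=self.p_transition[s, a]))
        return s_next, self.reward(s, a)
```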

  11. Examples of Task Distributions. A task: 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a), r_i(s,a)}. Personalized recommendations: p_i(s'|s,a), r_i(s,a) vary across tasks. Character animation: r_i(s,a) varies across maneuvers; p_i(s_1), p_i(s'|s,a) vary across garments & initial states. Multi-robot RL: 𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a) vary. 11

  12. What is a reinforcement learning task? In reinforcement learning, a task is 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p_i(s'|s,a), r_i(s,a)} (state space, action space, initial state distribution, dynamics, reward). An alternative view: make the task identifier part of the state, s = (s̄, z_i), where s̄ is the original state. Then each task is 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_1), p(s'|s,a), r(s,a)}, and the collection {𝒯_i} = {∪_i 𝒮_i, ∪_i 𝒜_i, (1/N) Σ_i p_i(s_1), p(s'|s,a), r(s,a)} can be cast as a standard Markov decision process! 12
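
A minimal sketch, not from the slides, of the "task identifier is part of the state" view: concatenate a one-hot task ID z_i onto the original state s̄. The function name and the one-hot encoding choice are assumptions.

```python
# Augment the original state s̄ with a one-hot task ID z_i so a single
# policy / Q-function sees s = (s̄, z_i).
import numpy as np

def augment_state(s_bar: np.ndarray, task_id: int, num_tasks: int) -> np.ndarray:
    z = np.zeros(num_tasks, dtype=s_bar.dtype)
    z[task_id] = 1.0
    return np.concatenate([s_bar, z])   # s = (s̄, z_i)

# e.g. a 4-dim state from task 2 of 3 becomes a 7-dim input
s = augment_state(np.array([0.1, -0.3, 0.5, 0.0]), task_id=2, num_tasks=3)
```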

  13. The goal of multi-task reinforcement learning: the same as before, except a task identifier is part of the state, s = (s̄, z_i), where z_i can be e.g. a one-hot task ID, a language description, or a desired goal state z_i = s_g ("goal-conditioned RL"). What is the reward? The same as before; or, for goal-conditioned RL, r(s) = r(s̄, s_g) = -d(s̄, s_g) for a distance function d, e.g. Euclidean ℓ2 or sparse 0/1. If it's still a standard Markov decision process, then why not apply standard RL algorithms? You can! But you can often do better. 13
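
A minimal sketch, not from the slides, of the goal-conditioned reward r(s̄, s_g) = -d(s̄, s_g) with the two distance functions mentioned on the slide; the sparse variant's threshold and 0/-1 convention are assumptions.

```python
# Goal-conditioned rewards r(s) = r(s̄, s_g) = -d(s̄, s_g).
import numpy as np

def euclidean_reward(s_bar: np.ndarray, s_goal: np.ndarray) -> float:
    return -float(np.linalg.norm(s_bar - s_goal))            # -ℓ2 distance

def sparse_reward(s_bar: np.ndarray, s_goal: np.ndarray, eps: float = 1e-2) -> float:
    # sparse 0/1: reward 0 near the goal, -1 elsewhere (a common convention; threshold is assumed)
    return 0.0 if np.linalg.norm(s_bar - s_goal) < eps else -1.0
```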

  14. The Plan: multi-task reinforcement learning problem; policy gradients & their multi-task/meta counterparts; Q-learning; multi-task Q-learning. 14

  15. The anatomy of a reinforcement learning algorithm: generate samples (i.e. run the policy), fit a model to estimate return, improve the policy. This lecture: focus on model-free RL methods (policy gradient, Q-learning). 11/6: focus on model-based RL methods. 15
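
The three boxes on this slide can be written as a generic loop. The sketch below is schematic and assumes caller-supplied callables; the names are placeholders, not a real RL library API.

```python
# A schematic loop over the three stages: generate samples, fit a return
# estimator, improve the policy. All callables are caller-supplied placeholders.
from typing import Any, Callable

def rl_training_loop(policy: Any,
                     generate_samples: Callable,      # run the policy, return trajectories
                     fit_return_estimator: Callable,  # e.g. Monte Carlo returns, Q_ϕ, or a model
                     improve_policy: Callable,        # e.g. a policy-gradient step or argmax w.r.t. Q
                     num_iterations: int) -> Any:
    for _ in range(num_iterations):
        trajectories = generate_samples(policy)
        estimator = fit_return_estimator(trajectories)
        policy = improve_policy(policy, trajectories, estimator)
    return policy
```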

  16. Evaluating the objective: J(θ) = E_{τ∼p_θ(τ)}[Σ_t r(s_t, a_t)] ≈ (1/N) Σ_i Σ_t r(s_{i,t}, a_{i,t}), estimated by sampling N trajectories from the policy. 16 Slide adapted from Sergey Levine

  17. Direct policy differentiation: ∇_θ J(θ) = ∫ ∇_θ p_θ(τ) r(τ) dτ; using the convenient identity p_θ(τ) ∇_θ log p_θ(τ) = ∇_θ p_θ(τ), this becomes ∇_θ J(θ) = E_{τ∼p_θ(τ)}[∇_θ log p_θ(τ) r(τ)]. 17 Slide adapted from Sergey Levine

  18. Direct policy differentiation (continued): log p_θ(τ) = log p(s_1) + Σ_t [log π_θ(a_t|s_t) + log p(s_{t+1}|s_t, a_t)], and the initial-state and dynamics terms do not depend on θ, so ∇_θ J(θ) = E_{τ∼p_θ(τ)}[(Σ_t ∇_θ log π_θ(a_t|s_t)) (Σ_t r(s_t, a_t))]. 18 Slide adapted from Sergey Levine

  19. Evaluating the policy gradient (REINFORCE): generate samples by running the policy; estimate the gradient with the Monte Carlo estimate ∇_θ J(θ) ≈ (1/N) Σ_i (Σ_t ∇_θ log π_θ(a_{i,t}|s_{i,t})) (Σ_t r(s_{i,t}, a_{i,t})); improve the policy with θ ← θ + α ∇_θ J(θ). 19 Slide adapted from Sergey Levine

  20. Comparison to maximum likelihood: the policy gradient (1/N) Σ_i (Σ_t ∇_θ log π_θ(a_{i,t}|s_{i,t})) (Σ_t r(s_{i,t}, a_{i,t})) is the maximum likelihood gradient (1/N) Σ_i Σ_t ∇_θ log π_θ(a_{i,t}|s_{i,t}) with each trajectory weighted by its return, i.e. supervised training on the collected data. Multi-task learning algorithms can readily be applied! 20 Slide adapted from Sergey Levine
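
A minimal PyTorch sketch, not the course's reference code, of this return-weighted maximum-likelihood view for a discrete-action policy; `policy_net` and the single-trajectory input layout are assumptions.

```python
# Policy gradient as return-weighted maximum likelihood for one trajectory.
# states: [T, obs_dim] float tensor, actions: [T] long tensor, rewards: [T] float tensor.
import torch

def reinforce_loss(policy_net, states, actions, rewards):
    logits = policy_net(states)                                                    # [T, num_actions]
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)   # [T]
    trajectory_return = rewards.sum()                                              # Σ_t r(s_t, a_t)
    # maximum likelihood would minimize -log_probs.sum();
    # the policy gradient weights the log-likelihood by the trajectory return
    return -(log_probs.sum() * trajectory_return)

# usage: loss = reinforce_loss(policy_net, states, actions, rewards); loss.backward(); optimizer.step()
```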

  21. What did we just do? Good stuff is made more likely; bad stuff is made less likely. This simply formalizes the notion of "trial and error"! Can we use policy gradients with meta-learning? 21 Slide adapted from Sergey Levine

  22. Example: MAML + policy gradient two tasks: running backward, running forward 22 Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML ‘17

  23. Example: MAML + policy gradient; two tasks: running backward, running forward. There exists a representation under which RL is fast and efficient. 23 Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML ‘17

  24. Example: Black-box meta-learning + policy gradient Experiment : Learning to visually navigate a maze - train on 1000 small mazes - test on held-out small mazes and large mazes 24 Mishra, Rohaninejad, Chen, Abbeel. A Simple Neural Attentive Meta-Learner ICLR ‘18

  25. Policy Gradients. Pros: + simple; + easy to combine with existing multi-task & meta-learning algorithms. Cons: - produces a high-variance gradient (can be mitigated with baselines, used by all algorithms in practice, and trust regions); - requires on-policy data (cannot reuse existing experience to estimate the gradient; importance weights can help, but are also high variance). 25
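
A minimal NumPy sketch of the baseline trick mentioned above: subtracting the average return from each trajectory's return keeps the gradient estimate unbiased while reducing its variance. The per-trajectory-return input format is an assumption.

```python
# Baseline subtraction for variance reduction: weight each trajectory's
# log-likelihood by (R_i - b) instead of R_i, with b the average return.
import numpy as np

def advantages_with_baseline(trajectory_returns: np.ndarray) -> np.ndarray:
    baseline = trajectory_returns.mean()      # b = average return (a simple, valid baseline)
    return trajectory_returns - baseline      # unbiased, lower-variance weights
```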

  26. The Plan: multi-task reinforcement learning problem; policy gradients & their multi-task/meta counterparts; Q-learning; multi-task Q-learning. 26

  27. Value-Based RL: Definitions. Value function: V^π(s_t) = E_π[Σ_{t'=t}^T r(s_{t'}, a_{t'}) | s_t], the total reward starting from s_t and following π ("how good is a state"). Q function: Q^π(s_t, a_t) = E_π[Σ_{t'=t}^T r(s_{t'}, a_{t'}) | s_t, a_t], the total reward starting from s_t, taking a_t, and then following π ("how good is a state-action pair"). They're related: V^π(s_t) = E_{a_t∼π(·|s_t)}[Q^π(s_t, a_t)]. If you know Q^π, you can use it to improve π: set π(a|s) ← 1 for a = argmax_ā Q^π(s, ā). The new policy is at least as good as the old policy. 27
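
A minimal tabular sketch of the policy-improvement step on this slide: given Q^π as a table, the improved policy puts probability 1 on argmax_ā Q^π(s, ā).

```python
# Greedy policy improvement from a tabular Q^π of shape [num_states, num_actions].
import numpy as np

def greedy_policy_from_q(q_table: np.ndarray) -> np.ndarray:
    num_states, num_actions = q_table.shape
    policy = np.zeros((num_states, num_actions))
    policy[np.arange(num_states), q_table.argmax(axis=1)] = 1.0   # π(a|s) ← 1 for a = argmax_ā Q^π(s, ā)
    return policy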

  28. Value-Based RL: Definitions. Value function: V^π(s_t) = E_π[Σ_{t'=t}^T r(s_{t'}, a_{t'}) | s_t], the total reward starting from s_t and following π ("how good is a state"). Q function: Q^π(s_t, a_t) = E_π[Σ_{t'=t}^T r(s_{t'}, a_{t'}) | s_t, a_t], the total reward starting from s_t, taking a_t, and then following π ("how good is a state-action pair"). For the optimal policy π*, the Bellman equation holds: Q*(s_t, a_t) = E_{s'∼p(·|s,a)}[r(s, a) + γ max_{a'} Q*(s', a')]. 28
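
A minimal tabular sketch of using the Bellman equation as an update rule (Q-value iteration) when the MDP is known; the array shapes for the dynamics and reward are assumptions.

```python
# Q-value iteration on a known MDP: repeatedly apply the Bellman backup.
# P is p(s'|s,a) with shape [S, A, S]; R is r(s,a) with shape [S, A].
import numpy as np

def q_value_iteration(P: np.ndarray, R: np.ndarray, gamma: float = 0.99, iters: int = 1000) -> np.ndarray:
    num_states, num_actions, _ = P.shape
    Q = np.zeros((num_states, num_actions))
    for _ in range(iters):
        # Q(s,a) ← r(s,a) + γ E_{s'∼p(·|s,a)}[max_{a'} Q(s',a')]
        Q = R + gamma * P @ Q.max(axis=1)
    return Q
```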

  29. Fitted Q-iteration algorithm (with hyperparameters such as the dataset size, the number of Bellman-backup iterations, and the number of fitting steps): collect a dataset {(s, a, s', r)} using some policy; set targets y ← r(s, a) + γ max_{a'} Q_ϕ(s', a'); fit ϕ ← argmin_ϕ Σ ||Q_ϕ(s, a) - y||²; repeat. Result: get a policy from π(a|s) via argmax_a Q_ϕ(s, a). Important notes: we can reuse data from previous policies (using replay buffers; an off-policy algorithm), and this is not a gradient descent algorithm! Can be readily extended to multi-task / goal-conditioned RL. 29 Slide adapted from Sergey Levine
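
A minimal PyTorch sketch, not the course's reference implementation, of one fitted Q-iteration update from a replay-buffer batch; the batch layout and `q_net` interface are assumptions. The Bellman target is held fixed under `torch.no_grad()`, so we do not differentiate through the max.

```python
# One fitted Q-iteration / Q-learning style update from a replay-buffer batch.
import torch
import torch.nn.functional as F

def fitted_q_update(q_net, optimizer, batch, gamma: float = 0.99):
    s, a, r, s_next, done = batch          # [B, obs_dim], [B], [B], [B, obs_dim], [B] (done as 0/1 floats)
    with torch.no_grad():                  # Bellman targets are held fixed
        y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q_ϕ(s, a)
    loss = F.mse_loss(q_sa, y)             # fit ϕ to the targets
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```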

  30. The Plan: multi-task reinforcement learning problem; policy gradients & their multi-task/meta counterparts; Q-learning; multi-task Q-learning. 30

  31. Multi-Task RL Algorithms. Policy: π_θ(a | s̄) -> π_θ(a | s̄, z_i). Q-function: Q_ϕ(s̄, a) -> Q_ϕ(s̄, a, z_i). Analogous to multi-task supervised learning: stratified sampling, soft/hard weight sharing, etc. What is different about reinforcement learning? The data distribution is controlled by the agent! You may know what aspect(s) of the MDP are changing across tasks; can we leverage this knowledge? Should we share data in addition to sharing weights? 31
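
A minimal PyTorch sketch, not from the slides, of a task-conditioned Q-function Q_ϕ(s̄, a, z_i) that concatenates a task vector to the state; the architecture details are illustrative assumptions.

```python
# A task-conditioned Q-network: input is (s̄, z_i), output is one Q-value per action.
import torch
import torch.nn as nn

class MultiTaskQNetwork(nn.Module):
    def __init__(self, obs_dim: int, num_tasks: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + num_tasks, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),          # Q_ϕ(s̄, ·, z_i)
        )

    def forward(self, s_bar: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s_bar, z], dim=-1))
```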

  32. An example. Task 1: passing; Task 2: shooting goals. What if you accidentally perform a good pass when trying to shoot a goal? Store the experience as normal, *and* relabel the experience with the passing task's ID & reward and store that too: "hindsight relabeling", "hindsight experience replay" (HER). 32

  33. Goal-conditioned RL with hindsight relabeling:
1. Collect data 𝒟_k = {(s_{1:T}, a_{1:T}, s_g, r_{1:T})} using some policy.
2. Store the data in the replay buffer: 𝒟 ← 𝒟 ∪ 𝒟_k.
3. Perform hindsight relabeling:
  a. Relabel the experience in 𝒟_k using the last state as the goal: 𝒟'_k = {(s_{1:T}, a_{1:T}, s_T, r'_{1:T})} where r'_t = -d(s_t, s_T). (Other relabeling strategies: use any state from the trajectory as the goal.)
  b. Store the relabeled data in the replay buffer: 𝒟 ← 𝒟 ∪ 𝒟'_k.
4. Update the policy using the replay buffer 𝒟, increment k, and repeat.
Result: exploration challenges alleviated. 33
Kaelbling. Learning to Achieve Goals. IJCAI ‘93. Andrychowicz et al. Hindsight Experience Replay. NeurIPS ‘17
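
A minimal NumPy sketch of relabeling step 3a above: replace a trajectory's goal with its own final state and recompute rewards as r'_t = -d(s_t, s_T); the array layout and the Euclidean choice of d are assumptions.

```python
# Hindsight relabeling of one trajectory: use the last state as the new goal.
import numpy as np

def hindsight_relabel(states: np.ndarray, actions: np.ndarray):
    """states: [T, state_dim] (s_1..s_T), actions: [T, act_dim]."""
    new_goal = states[-1]                                      # s_T becomes the relabeled goal s_g
    rewards = -np.linalg.norm(states - new_goal, axis=-1)      # r'_t = -d(s_t, s_T), Euclidean d
    return states, actions, new_goal, rewards
```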
