Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned (CS 330)
Introduction: Karol Hausman
- Some background: not a native English speaker, so please, please let me know if you don't understand something
- I like robots ☺
- Studied classical robotics first
- Got fascinated by deep RL in the middle of my PhD after a talk by Sergey Levine
- Research Scientist at Robotics @ Google
Why Reinforcement Learning?
Isolated action that doesn't affect the future? Supervised learning (most deployed ML systems).
Common RL applications: robotics, language & dialog, autonomous driving, business operations, finance. RL is also a key aspect of intelligence.
The Plan
- Multi-task reinforcement learning problem
- Policy gradients & their multi-task counterparts
- Q-learning (should be review)
- Multi-task Q-learning
Terminology & notation: 1. run away, 2. ignore, 3. pet. Slide adapted from Sergey Levine
Imitation Learning: training data → supervised learning. Imitation Learning vs Reinforcement Learning? Images: Bojarski et al. '16, NVIDIA. Slide adapted from Sergey Levine
Reward functions. Slide adapted from Sergey Levine
The goal of reinforcement learning. Slide adapted from Sergey Levine
Partial observability. Fully observable?
- A simulated robot performing a reaching task, given the goal position and the positions and velocities of all of its joints
- Indiscriminate robotic grasping from a bin, given an overhead image
- A robot sorting trash, given a camera image
The goal of reinforcement learning: finite horizon case and infinite horizon case. Slide adapted from Sergey Levine
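For reference, a standard way to write these objectives (a sketch in the usual notation, where $p_\theta(\tau)$ is the distribution over trajectories $\tau = (\mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_T, \mathbf{a}_T)$ induced by the policy $\pi_\theta$):
Finite horizon: $\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\textstyle\sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t)\big]$, with $p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$.
Infinite horizon: $\theta^\star = \arg\max_\theta \, \mathbb{E}_{(\mathbf{s}, \mathbf{a}) \sim p_\theta(\mathbf{s}, \mathbf{a})}\big[ r(\mathbf{s}, \mathbf{a}) \big]$, where $p_\theta(\mathbf{s}, \mathbf{a})$ is the stationary state-action distribution under $\pi_\theta$ (alternatively, a discounted sum of rewards can be used).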
What is a reinforcement learning task?
Recall supervised learning: a task is $\mathcal{T}_i \triangleq \{p_i(\mathbf{x}), p_i(\mathbf{y} \mid \mathbf{x}), \mathcal{L}_i\}$ (data-generating distributions and a loss).
Reinforcement learning: a task is $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(\mathbf{s}_1), p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a}), r_i(\mathbf{s}, \mathbf{a})\}$: state space, action space, initial state distribution, dynamics, and reward.
A task is a Markov decision process, much more than the semantic meaning of "task"!
Examples: Task Distributions
A task: $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(\mathbf{s}_1), p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a}), r_i(\mathbf{s}, \mathbf{a})\}$
- Character animation: across maneuvers, $r_i(\mathbf{s}, \mathbf{a})$ varies
- Across garments & initial states: $p_i(\mathbf{s}_1)$ and $p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$ vary
- Multi-robot RL: $\mathcal{S}_i$, $\mathcal{A}_i$, $p_i(\mathbf{s}_1)$, and $p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$ vary
What is a reinforcement learning task?
Reinforcement learning: a task is $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(\mathbf{s}_1), p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a}), r_i(\mathbf{s}, \mathbf{a})\}$: state space, action space, initial state distribution, dynamics, and reward.
An alternative view: a task identifier is part of the state, $\bar{\mathbf{s}} = (\mathbf{s}, z_i)$, where $\mathbf{s}$ is the original state and $z_i$ identifies the task. The task collection $\{\mathcal{T}_i\}$ then becomes a single task with state space $\bigcup_i \mathcal{S}_i$, action space $\bigcup_i \mathcal{A}_i$, initial state distribution $\frac{1}{N}\sum_i p_i(\mathbf{s}_1)$, shared dynamics $p(\bar{\mathbf{s}}' \mid \bar{\mathbf{s}}, \mathbf{a})$, and shared reward $r(\bar{\mathbf{s}}, \mathbf{a})$.
It can be cast as a standard Markov decision process!
The goal of multi-task reinforcement learning
Multi-task RL: the same as before, except that a task identifier is part of the state: $\bar{\mathbf{s}} = (\mathbf{s}, z_i)$. The identifier $z_i$ can be, e.g., a one-hot task ID, a language description, or a desired goal state $z_i = \mathbf{s}_g$ ("goal-conditioned RL").
What is the reward? The same as before. Or, for goal-conditioned RL: $r(\bar{\mathbf{s}}) = r(\mathbf{s}, \mathbf{s}_g) = -d(\mathbf{s}, \mathbf{s}_g)$, where the distance function $d$ can be, e.g., the Euclidean $\ell_2$ distance or a sparse 0/1 indicator (see the sketch below).
If it's still a standard Markov decision process, then why not apply standard RL algorithms? You can! And you can often do better.
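To make the goal-conditioned reward concrete, here is a minimal Python sketch covering the two distance choices above; the function names, the success threshold, and the state-augmentation helper are illustrative assumptions rather than anything from the slides:

import numpy as np

def goal_conditioned_reward(s, s_g, kind="euclidean", threshold=0.05):
    """Reward r(s, s_g) = -d(s, s_g) for a goal-conditioned task."""
    s, s_g = np.asarray(s, dtype=float), np.asarray(s_g, dtype=float)
    if kind == "euclidean":
        return -np.linalg.norm(s - s_g)  # negative L2 distance to the goal
    if kind == "sparse":
        # 0/1 indicator: reward only when the goal is (approximately) reached.
        return float(np.linalg.norm(s - s_g) < threshold)
    raise ValueError(f"unknown distance kind: {kind}")

def augment_state(s, z):
    """Multi-task view: append the task identifier z (e.g., the goal) to the state."""
    return np.concatenate([np.asarray(s, dtype=float), np.asarray(z, dtype=float)])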
The Plan
- Multi-task reinforcement learning problem
- Policy gradients & their multi-task counterparts
- Q-learning
- Multi-task Q-learning
The anatomy of a reinforcement learning algorithm
- This lecture: focus on model-free RL methods (policy gradient, Q-learning)
- 10/19: focus on model-based RL methods
On-policy vs Off-policy
On-policy:
- Data comes from the current policy
- Compatible with all RL algorithms
- Can't reuse data from previous policies
Off-policy:
- Data comes from any policy
- Works with specific RL algorithms
- Much more sample efficient; can re-use old data
Evaluating the objective. Slide adapted from Sergey Levine
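For reference, the Monte Carlo estimate of the objective typically used at this step (a sketch, with $N$ sampled trajectories indexed by $i$):
$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\textstyle\sum_t r(\mathbf{s}_t, \mathbf{a}_t)\big] \approx \frac{1}{N}\sum_{i=1}^{N}\sum_t r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t})$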
Direct policy differentiation: a convenient identity. Slide adapted from Sergey Levine
Direct policy differentiation. Slide adapted from Sergey Levine
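The derivation here is the standard log-derivative (likelihood-ratio) trick; as a reference sketch:
The convenient identity is $p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$. Applying it to $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)]$ with $r(\tau) = \sum_t r(\mathbf{s}_t, \mathbf{a}_t)$ gives
$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big] = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\big(\textstyle\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\big)\big(\textstyle\sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t)\big)\Big]$,
where the initial-state and dynamics terms drop out of $\nabla_\theta \log p_\theta(\tau)$ because they do not depend on $\theta$.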
Evaluating the policy gradient: generate samples (i.e. run the policy), fit a model to estimate the return, improve the policy. Slide adapted from Sergey Levine
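A minimal NumPy sketch of the resulting REINFORCE estimator; the function and argument names are illustrative, and in practice the per-step gradients of log pi_theta would come from an autodiff framework:

import numpy as np

def reinforce_gradient(grad_log_probs, rewards):
    """Vanilla policy-gradient (REINFORCE) estimate of grad_theta J(theta).

    grad_log_probs: list over N trajectories; entry i is an array of shape
                    (T, num_params) with grad_theta log pi_theta(a_t | s_t).
    rewards:        list over N trajectories; entry i is an array of shape (T,)
                    with r(s_t, a_t).
    """
    per_traj = [glp.sum(axis=0) * r.sum()      # (sum of grad-log-probs) * (total return)
                for glp, r in zip(grad_log_probs, rewards)]
    return np.mean(per_traj, axis=0)           # average over trajectories

# Gradient ascent on the policy parameters (step size is illustrative):
# theta = theta + 1e-2 * reinforce_gradient(grad_log_probs, rewards)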
Comparison to maximum likelihood (training data → supervised learning). Multi-task learning algorithms can readily be applied! Slide adapted from Sergey Levine
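For reference, the standard comparison made at this point (a sketch):
Policy gradient: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\big(\sum_t \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})\big)\big(\sum_t r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t})\big)$
Maximum likelihood: $\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_t \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})$
The policy gradient is the maximum-likelihood (supervised) gradient with each trajectory reweighted by its total reward, which is why multi-task supervised techniques carry over.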
What did we just do? Good stuff is made more likely, bad stuff is made less likely; this simply formalizes the notion of "trial and error"! Slide adapted from Sergey Levine
Policy Gradients
Pros:
+ Simple
+ Easy to combine with existing multi-task & meta-learning algorithms
Cons:
- Produces a high-variance gradient
  - Can be mitigated with baselines (used by all algorithms in practice; see the sketch below) and trust regions
- Requires on-policy data
  - Cannot reuse existing experience to estimate the gradient!
  - Importance weights can help, but are also high variance
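The baseline trick mentioned above, in its simplest form (a sketch; subtracting any action-independent baseline $b$ leaves the gradient estimate unbiased while reducing its variance):
$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\,\big(r(\tau_i) - b\big), \qquad b = \frac{1}{N}\sum_{i=1}^{N} r(\tau_i)$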
The Plan
- Multi-task reinforcement learning problem
- Policy gradients & their multi-task/meta counterparts
- Q-learning
- Multi-task Q-learning
Value-Based RL: Definitions
Value function ("how good is a state"): $V^\pi(\mathbf{s}_t) = \sum_{t'=t}^{T} \mathbb{E}_\pi\big[ r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) \mid \mathbf{s}_t \big]$, the total reward starting from $\mathbf{s}_t$ and following $\pi$.
Q function ("how good is a state-action pair"): $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \sum_{t'=t}^{T} \mathbb{E}_\pi\big[ r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) \mid \mathbf{s}_t, \mathbf{a}_t \big]$, the total reward starting from $\mathbf{s}_t$, taking $\mathbf{a}_t$, and then following $\pi$.
They're related: $V^\pi(\mathbf{s}_t) = \mathbb{E}_{\mathbf{a}_t \sim \pi(\cdot \mid \mathbf{s}_t)}\big[ Q^\pi(\mathbf{s}_t, \mathbf{a}_t) \big]$.
If you know $Q^\pi$, you can use it to improve $\pi$: set $\pi(\mathbf{a} \mid \mathbf{s}) \leftarrow 1$ for $\mathbf{a} = \arg\max_{\mathbf{a}} Q^\pi(\mathbf{s}, \mathbf{a})$. The new policy is at least as good as the old policy.
Value-Based RL: Definitions (continued)
For the optimal policy $\pi^\star$: $Q^\star(\mathbf{s}, \mathbf{a}) = \mathbb{E}_{\mathbf{s}' \sim p(\cdot \mid \mathbf{s}, \mathbf{a})}\big[ r(\mathbf{s}, \mathbf{a}) + \gamma \max_{\mathbf{a}'} Q^\star(\mathbf{s}', \mathbf{a}') \big]$ (the Bellman equation).
Value-Based RL
Example: from state $\mathbf{s}_t$ there are three actions $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3$, and the current policy has $\pi(\mathbf{a}_1 \mid \mathbf{s}_t) = 1$. Reward = 1 if I can play it in a month, 0 otherwise.
- Value function: $V^\pi(\mathbf{s}_t) = ?$
- Q function: $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
- Q* function: $Q^\star(\mathbf{s}_t, \mathbf{a}_t) = ?$
- Value* function: $V^\star(\mathbf{s}_t) = ?$
Policy improvement: set $\pi(\mathbf{a} \mid \mathbf{s}) \leftarrow 1$ for $\mathbf{a} = \arg\max_{\mathbf{a}} Q^\pi(\mathbf{s}, \mathbf{a})$; the new policy is at least as good as the old policy.
Fitted Q-iteration Algorithm
Result: get a policy $\pi(\mathbf{a} \mid \mathbf{s})$ from $\arg\max_{\mathbf{a}} Q_\phi(\mathbf{s}, \mathbf{a})$.
Important notes:
- This is an off-policy algorithm: we can reuse data from previous policies, using replay buffers!
- This is not a gradient descent algorithm!
- Can be readily extended to multi-task / goal-conditioned RL.
Slide adapted from Sergey Levine
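A minimal Python sketch of the fitted Q-iteration loop, to make the off-policy structure concrete; the buffer format, the fit_regressor callable, the hyperparameters, and the discrete action set used for the max are illustrative assumptions:

import numpy as np

def fitted_q_iteration(buffer, q_init, fit_regressor, actions, gamma=0.99, iters=50):
    """Fitted Q-iteration on a replay buffer of off-policy transitions.

    buffer:        list of (s, a, r, s_next, done) tuples, collected by any policies.
    q_init:        initial Q-function, a callable q(s, a) -> float.
    fit_regressor: supervised learner; fit_regressor(inputs, targets) returns a new
                   callable q(s, a) fit to the (state, action) -> target pairs.
    actions:       discrete set of actions used for the max in the Bellman backup.
    """
    q = q_init
    for _ in range(iters):
        inputs, targets = [], []
        for (s, a, r, s_next, done) in buffer:
            # Bellman backup target: y = r + gamma * max_a' Q(s', a'); no bootstrap at terminals.
            best_next = 0.0 if done else max(q(s_next, a_next) for a_next in actions)
            inputs.append((s, a))
            targets.append(r + gamma * best_next)
        # Supervised regression of Q towards the backed-up targets.
        q = fit_regressor(inputs, np.array(targets))
    return q

def greedy_policy(q, actions):
    """Extract the greedy policy pi(s) = argmax_a Q(s, a)."""
    return lambda s: max(actions, key=lambda a: q(s, a))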
Example: Q-learning Applied to Robotics
Continuous action space? Use a simple optimization algorithm to (approximately) maximize Q over actions: the Cross-Entropy Method (CEM).
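A minimal NumPy sketch of CEM used to select the action that maximizes a given Q-function; the population size, elite count, iteration count, and initial standard deviation are illustrative choices, not values from QT-Opt:

import numpy as np

def cem_maximize_q(q_fn, s, action_dim, iters=3, pop_size=64, n_elite=6, init_std=0.5):
    """Cross-Entropy Method: approximately solve argmax_a Q(s, a) for continuous actions.

    q_fn: callable q_fn(s, a) -> float (the learned Q-function).
    """
    mean = np.zeros(action_dim)
    std = np.full(action_dim, init_std)
    for _ in range(iters):
        # Sample a population of candidate actions from the current Gaussian.
        candidates = mean + std * np.random.randn(pop_size, action_dim)
        scores = np.array([q_fn(s, a) for a in candidates])
        # Refit the Gaussian to the top-scoring ("elite") candidates.
        elites = candidates[np.argsort(scores)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # the final mean is the selected action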
QT-Opt: Q-learning at Scale
Distributed components: stored data from all past experiments, in-memory buffers, Bellman updaters, training jobs, and CEM optimization for action selection.
Slide adapted from D. Kalashnikov. QT-Opt: Kalashnikov et al. '18, Google Brain
QT-Opt: MDP Definition for Grasping
- State: over-the-shoulder RGB camera image, no depth
- Action: 4-DOF pose change in Cartesian space + gripper control
- Reward: binary reward at the end of the episode if the object was lifted; sparse, no shaping (automatic success detection)
Slide adapted from D. Kalashnikov
QT-Opt: Setup and Results
- 7 robots collected 580k grasps
- 96% test success rate on unseen test objects