Class notes: Homework 5 due Tuesday, November 13th, 11:59pm
Real-World Robot Learning: Safety and Flexibility
CS294-112: Deep Reinforcement Learning
Gregory Kahn
Why should you care? Safety and flexibility.
Outline
Topics: Safety, Flexibility
Algorithms: Imitation learning, Model-free, Model-based
2 × 3 = 6 papers we'll cover; by no means the best / only papers on these topics.
Goal: learn a control policy that maps observations to controls (observation → policy → control).
Assumption
• Able to generate good trajectories using an expert policy: either a human expert, or trajectory optimization (requires a cost function, an optimizer, and full state information, available only during training).
Supervised Learning
Gather expert trajectories (via trajectory optimization), then run supervised learning. The learned policy's trajectory drifts away from the training trajectory, so the policy reaches states not in the training set! [Ross et al. 2010]
• Problem: training and test distributions differ.
Dataset Aggregation (DAgger) [Ross et al. 2011]
• Problem: training and test distributions differ.
• Solution: execute the learned policy during training; alternate between gathering trajectories and supervised learning on the aggregated, expert-labeled dataset.
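The DAgger loop above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the 1-D linear dynamics, the linear expert, and the least-squares policy fit are all stand-ins chosen so the example runs end to end.

```python
import numpy as np

# Toy DAgger sketch: roll out the CURRENT policy, but label every visited
# state with the EXPERT's action, then refit on the aggregated dataset.
def env_step(s, a):
    return 0.9 * s + a          # illustrative linear dynamics

def expert_policy(s):
    return -0.5 * s             # illustrative expert: drive the state to zero

def fit_policy(states, actions):
    # least-squares fit of a linear policy a = k * s (stand-in for a network)
    k = np.dot(states, actions) / (np.dot(states, states) + 1e-8)
    return lambda s, k=k: k * s

def dagger(n_iters=5, horizon=20):
    states, actions = [], []
    policy = expert_policy      # iteration 0 rolls out the expert
    for _ in range(n_iters):
        s = 1.0
        for _ in range(horizon):
            states.append(s)
            actions.append(expert_policy(s))   # expert relabels visited states
            s = env_step(s, policy(s))         # learner's action drives the rollout
        policy = fit_policy(np.array(states), np.array(actions))
    return policy

pi = dagger()
```

Because the learner chooses where to go while the expert supplies the labels, the training distribution matches what the policy will see at test time.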
Safety during training
• DAgger mixes the actions (the expert's and the learner's), so unsafe learner actions can be executed.
Policy Learning using Adaptive Trajectory Optimization (PLATO)
• DAgger mixes the actions.
• PLATO mixes the objectives: the supervising controller optimizes the task cost J (→ avoids high-cost, unsafe states) while staying close to the learned policy.
Algorithm comparisons
approach              safe sampling policy   similar training and test distributions
supervised learning   yes                    no
DAgger                no                     yes
PLATO                 yes                    yes
Experiments: final neural network policies in the Canyon and Forest environments.
Experiments: metrics for the Canyon and Forest environments (plots omitted).
Goal: standard RL exploration is NOT SAFE; we want to guarantee safety during learning.
Shielding
• Pre-emptive shielding: the shield restricts the available actions before the agent chooses; like learning in a transformed MDP.
• Post-posed shielding: the shield overrides unsafe actions after the agent chooses; the shield can also be used at test time.
How to shield: linear temporal logic
• Encode the safety specification with temporal logic.
• Assumption: known approximate/conservative transition dynamics.
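A post-posed shield can be sketched very simply. Everything below is an illustrative assumption: the invariant "position stays in [0, 10]", the one-step worst-case dynamics check, and the fallback action all stand in for a real temporal-logic shield synthesized from conservative dynamics.

```python
# Illustrative post-posed shield: pass the agent's action through only if its
# worst-case one-step outcome satisfies the safety invariant.
def conservative_next(s, a):
    # worst-case next state under a bounded disturbance |w| <= 0.1 (assumed)
    return s + a + 0.1 * (1.0 if a > 0 else -1.0 if a < 0 else 0.0)

def is_safe(s):
    return 0.0 <= s <= 10.0     # assumed safety invariant

def shield(s, proposed_action, fallback_action=0.0):
    """Override the proposed action with a known-safe fallback if its
    conservative (worst-case) outcome would violate the invariant."""
    if is_safe(conservative_next(s, proposed_action)):
        return proposed_action
    return fallback_action
```

Because the check uses conservative dynamics, any action the shield passes through is safe even under the worst allowed disturbance.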
Experiments. Safety criterion: don't crash.
Experiments. Safety criteria: don't run out of oxygen; if there is enough oxygen, don't surface without divers.
Goal: in an unknown environment, how to do reinforcement learning without destroying the robot during training, using only onboard images.
Approach: in the unknown environment, learn a collision prediction model, a neural network that maps a raw image and command velocities to a collision prediction.
Collision prediction model
Model-based RL using the collision prediction model (a loop):
1. Gather trajectories using the MPC controller (may experience collisions).
2. Train the uncertainty-aware collision prediction model on the data (a deep neural network with uncertainty estimates from bootstrapping and dropout).
3. Form a speed-dependent, uncertainty-aware collision cost.
Encourage safe, low-speed collisions by reasoning about the model's uncertainty; the robot increases speed as the model becomes more confident.
Collision cost: high speed + predicted collision + large uncertainty → large cost.
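One minimal way to realize that cost structure is sketched below. The exact functional form is an assumption for illustration, not the paper's equation: it just scales the predicted collision probability (inflated by the model's uncertainty) by speed.

```python
# Illustrative speed-dependent, uncertainty-aware collision cost:
# high speed, high predicted collision probability, or large model
# uncertainty all increase the cost; a confident, slow, collision-free
# plan is cheap. (This exact form is an assumption, not from the lecture.)
def collision_cost(speed, p_collision, uncertainty, weight=1.0):
    return weight * speed * (p_collision + uncertainty)
```

Under this form, early in training (large uncertainty) the controller is pushed toward low speeds, so any collisions that do occur are low-speed; as uncertainty shrinks, higher speeds become cheap.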
Estimating neural network output uncertainty: bootstrapping.
Training time: resample the data with replacement into datasets D1, D2, D3 and train models M1, M2, M3.
Test time: feed the input to all of M1, M2, M3 and use the spread of their outputs as the uncertainty.
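The bootstrap procedure can be sketched with tiny stand-in models; linear regressors replace neural networks here purely so the example runs, and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bootstrap-ensemble sketch: train M models on resampled-with-replacement
# copies of the data; the spread of their predictions is the uncertainty.
def train_linear(x, y):
    k = np.dot(x, y) / np.dot(x, x)     # stand-in "model": y = k * x
    return lambda q, k=k: k * q

x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 0.1 * rng.standard_normal(20)   # noisy illustrative data

models = []
for _ in range(5):
    idx = rng.integers(0, len(x), size=len(x))  # resample with replacement
    models.append(train_linear(x[idx], y[idx]))

preds = np.array([m(0.5) for m in models])
mean, std = preds.mean(), preds.std()   # std serves as the uncertainty estimate
```

Inputs far from the training data tend to produce larger disagreement between the bootstrapped models, which is exactly the signal the collision cost consumes.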
Estimating neural network output uncertainty: dropout.
Training time: train a single model with dropout. Test time: keep dropout active and run multiple stochastic forward passes, treating them as an ensemble of models.
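Monte Carlo dropout at test time can be sketched as follows; the tiny fixed-weight network stands in for a trained model, and the weights are random purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# MC-dropout sketch: keep dropout ON at test time and treat repeated
# stochastic forward passes as samples from an implicit ensemble.
W1 = rng.standard_normal((4, 8))   # illustrative "trained" weights
W2 = rng.standard_normal((8, 1))

def forward(x, drop_p=0.5):
    h = np.maximum(x @ W1, 0.0)              # ReLU hidden layer
    mask = rng.random(h.shape) >= drop_p     # dropout mask, active at test time
    h = h * mask / (1.0 - drop_p)            # inverted-dropout scaling
    return (h @ W2).item()

x = np.ones(4)
samples = np.array([forward(x) for _ in range(100)])
mean, std = samples.mean(), samples.std()    # std = uncertainty estimate
```

Compared with bootstrapping, this needs only one trained network, at the price of multiple forward passes per query.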
Preliminary real-world experiments:
• Not accounting for uncertainty: higher-speed collisions.
• Accounting for uncertainty: lower-speed collisions.
• Successful flight past the obstacle.
Safety takeaways
• Tradeoff between safety and exploration
• Safety guarantees require expert oversight, or a known environment + dynamics
• Uncertainty can play a key role
Goal: a policy that follows a user-specified command.
Approach
• Option A: feed the command as an input to the network.
• Option B: branch the network using the command (+ empirically better; − only works for discrete commands).
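Option B (branching) can be sketched as a shared trunk with one output head per discrete command; the tiny fixed-weight linear layers below are illustrative stand-ins for a trained network, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Command-branched policy sketch: shared features, per-command output heads.
N_COMMANDS = 3                                  # e.g. left / straight / right (assumed)
W_shared = rng.standard_normal((16, 8))         # shared trunk weights
W_heads = rng.standard_normal((N_COMMANDS, 8, 2))  # one 2-D action head per command

def policy(observation, command):
    features = np.tanh(observation @ W_shared)  # shared feature extractor
    return features @ W_heads[command]          # the command selects the branch

obs = rng.standard_normal(16)
a_left = policy(obs, 0)
a_right = policy(obs, 2)
```

Because each head specializes to one command, the network never has to blend conflicting behaviors through a single output, which is one plausible reason branching works better empirically.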
Important details
• Data augmentation: contrast, brightness, tone, Gaussian blur, salt-and-pepper noise, region dropout
• Adding noise to the expert (so the dataset contains recovery behavior)
[slides adapted from Tuomas Haarnoja]
Goal: in the space of trajectories, Task 1 (Reach) defines a reaching skill and Task 2 (Avoid) an avoidance skill; their intersection is the reaching-while-avoiding skill.
Policy Composition: in the space of trajectories, composing the reaching skill (Task 1) and the avoidance skill (Task 2) yields the reaching-while-avoiding skill (Task 1+2). The composed policy's suboptimality is related to the divergence between the constituent policies. Reusability!
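For maximum-entropy policies over a discrete action set, a common composition rule is to average the tasks' Q-values, which corresponds to multiplying (and renormalizing) the two Boltzmann policies. The Q-tables below are made-up numbers for illustration.

```python
import numpy as np

# Composing two soft (Boltzmann) policies by averaging their Q-values.
def boltzmann(q, temperature=1.0):
    z = np.exp((q - q.max()) / temperature)   # max-subtraction for stability
    return z / z.sum()

q_reach = np.array([2.0, 1.5, 0.0])   # illustrative: task 1 prefers action 0
q_avoid = np.array([0.0, 1.5, 2.0])   # illustrative: task 2 prefers action 2

pi_composed = boltzmann((q_reach + q_avoid) / 2.0)
# the composed policy concentrates on the action that is good for BOTH tasks
```

Here the composed policy favors the middle action, the compromise acceptable to both tasks; when the two tasks' preferred trajectories diverge strongly, this composition degrades, which matches the divergence-based bound mentioned above.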
Videos: Task 1, Task 2, Task 1 + 2.
Videos: stacking policy, avoidance policy, and combined policy.
Standard Reinforcement Learning
Train: data → policy, repeated separately per task. Data inefficient; expert in the loop; inflexible at test time.
CAPs Approach
Train: data → event cues → detector → CAPs. Data efficient; detector in the loop; flexible at test time.
Detect → Predict → Control
• Detect: a detector extracts event cues from observations.
• Predict: predict future event cues conditioned on actions.
• Control: choose actions whose predicted events best satisfy the task.
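The detect/predict/control pipeline can be sketched as MPC-style scoring of candidate actions by their predicted event cues. Everything here is illustrative: the stand-in predictor, the two event cues (collision probability and speed), and the task weights are all made up.

```python
# Illustrative CAPs-style action selection: score each candidate action by
# its predicted event cues under task-specific weights, execute the best.
def predict_events(steering):
    # stand-in for a learned action-conditioned event predictor:
    # returns (collision probability, predicted speed)
    return abs(steering), 7.0 - 3.0 * abs(steering)

def task_score(events, w_collision=-10.0, w_speed=1.0):
    # the TASK is just a weighting over event cues, so changing the task
    # means changing weights, not retraining the predictor
    p_collision, speed = events
    return w_collision * p_collision + w_speed * speed

candidates = (-0.5, 0.0, 0.5)
best = max(candidates, key=lambda a: task_score(predict_events(a)))
```

Because the learned component predicts task-agnostic events, new tasks only require a new scoring function, which is the source of the flexibility claimed above.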
Tasks: drive at 7 m/s, avoid collisions, drive in either lane, drive in the right lane.
CAPs (video).
Collision avoidance: CAPs vs. DQL.
Tasks: avoid collisions, follow goal heading, move towards doors.
Flexibility takeaways
• Carefully construct how your policy / model deals with goals
• Model-free methods require extra care to reuse
• Model-based methods are flexible by construction