Class notes
Real-World Robot Learning: Safety and Flexibility (CS294-112: Deep Reinforcement Learning, Gregory Kahn)
  1. Class notes: Homework 5 due Tuesday, November 13th, 11:59pm

  2. Real-World Robot Learning: Safety and Flexibility. CS294-112: Deep Reinforcement Learning. Gregory Kahn.

  3. Why should you care? • Safety • Flexibility

  4. Outline. Topics: • Safety • Flexibility. Algorithms: • Imitation learning • Model-free • Model-based. 2 × 3 = 6 papers we'll cover; by no means the best / only papers on these topics.

  5. Safety | Flexibility | Imitation learning | Model-free | Model-based

  6. Goal: learn a control policy that maps observations to controls (observation → policy → control).

  7. Assumption: we are able to generate good trajectories using an expert policy, either a human expert or trajectory optimization (a cost function plus an optimizer, with full state information available only during training).

  8. Supervised learning: gather expert trajectories (via trajectory optimization), then train with supervised learning. The learned policy reaches states not in the training set! Problem: training and test distributions differ [Ross et al. 2010].

  9. Dataset Aggregation (DAgger) [Ross et al. 2011]. Problem: training and test distributions differ. Solution: execute the policy during training, alternating between gathering trajectories and supervised learning.
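The DAgger loop described above (execute the current policy, relabel the visited states with the expert, aggregate, retrain) can be sketched in a few lines. This is a minimal toy instantiation, not the paper's code: `env_step`, `expert`, and `fit` are caller-supplied stand-ins, and the "supervised learning" here is just a nearest-neighbor lookup on a 1-D state.

```python
import random

def run_dagger(env_step, expert, fit, policy, n_iters=5, horizon=20, seed=0):
    """DAgger sketch: each iteration rolls out the *current* policy to pick
    which states are visited, but labels every visited state with the
    expert's action; the policy is then refit on the aggregated dataset."""
    rng = random.Random(seed)
    dataset = []  # aggregated (state, expert_action) pairs
    for _ in range(n_iters):
        s = rng.uniform(-10, 10)            # reset to a random start state
        for _ in range(horizon):
            a = policy(s)                   # act with the learner, not the expert
            dataset.append((s, expert(s)))  # ...but label with the expert
            s = env_step(s, a)
        policy = fit(dataset)               # supervised learning on all data
    return policy, dataset

# Toy instantiation: 1-D state, expert pushes the state toward zero.
expert = lambda s: -1.0 if s > 0 else 1.0
env_step = lambda s, a: s + a

def fit(data):
    # "Supervised learning" here is a nearest-neighbor lookup (illustrative).
    def policy(s):
        return min(data, key=lambda pair: abs(pair[0] - s))[1]
    return policy

init_policy = lambda s: 1.0   # bad initial policy: always move right
learned, data = run_dagger(env_step, expert, fit, init_policy)
```

Because the learner itself chooses the visited states, the aggregated dataset covers exactly the distribution the final policy will see, which is the fix for the train/test mismatch on the previous slide.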

  10. Safety during training: DAgger mixes the actions of the expert and the learned policy.

  11. Policy Learning using Adaptive Trajectory Optimization (PLATO). DAgger mixes the actions; PLATO mixes the objectives: the teacher trades off the task cost J against staying close to the learner, so it avoids high-cost states during training.
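"Mixing the objectives" can be written down concretely. Roughly (a sketch following the PLATO paper; λ is a tradeoff weight and the exact notation may differ), the adapted teacher at state x_t solves:

```latex
\hat{\pi}(\mathbf{u}_t \mid \mathbf{x}_t) \;=\; \arg\min_{\pi'}
  \; \mathbb{E}_{\pi'}\!\Big[\sum_{t'=t}^{T} c(\mathbf{x}_{t'}, \mathbf{u}_{t'})\Big]
  \;+\; \lambda\, D_{\mathrm{KL}}\big(\pi'(\mathbf{u}_t \mid \mathbf{x}_t)\,\big\|\,\pi_\theta(\mathbf{u}_t \mid \mathbf{o}_t)\big)
```

The first term keeps the rollout low-cost (safe); the KL term keeps the visited states close to those the learner π_θ would visit, addressing the same distribution mismatch DAgger targets.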

  12. Algorithm comparisons. (Table comparing supervised learning, DAgger, and PLATO on: sampling policy, safety, and whether the training and test distributions are similar.)

  13. Experiments: final neural network policies (Canyon, Forest).

  14. Experiments: metrics (Canyon, Forest).

  15. Experiments: metrics (plots for Forest and Canyon).

  16. Safety | Flexibility | Imitation learning | Model-free | Model-based

  17. Goal. (Illustration of behavior labeled NOT SAFE.)

  18. Shielding: pre-emptive shielding (like learning in a transformed MDP) and post-posed shielding. The shield can also be used at test time.
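A post-posed shield sits between the agent and the environment: the agent proposes actions in preference order, and the shield vetoes unsafe ones. A minimal sketch, where `is_safe` stands in for the check derived from the safety specification and conservative dynamics (all names here are illustrative, not the paper's API):

```python
def shielded_action(state, ranked_actions, is_safe):
    """Post-posed shielding sketch: take the highest-ranked action that
    the shield certifies as safe; unsafe proposals are overridden."""
    for action in ranked_actions:
        if is_safe(state, action):
            return action
    raise RuntimeError("no safe action available from this state")

# Toy corridor: positions 0..4, safety = never step off either end.
is_safe = lambda s, a: 0 <= s + a <= 4
```

At the right edge (state 4), the agent's preferred action +1 is vetoed and the next-ranked safe action is substituted; this override works identically during training and at test time.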

  19. How to shield: linear temporal logic. • Encode safety with temporal logic. • Assumption: known approximate/conservative transition dynamics.

  20. Experiments. Safety criterion: don't crash.

  21. Experiments. Safety criteria: don't run out of oxygen; if there is enough oxygen, don't surface without divers.

  22. Safety | Flexibility | Imitation learning | Model-free | Model-based

  23. Goal: in an unknown environment, do reinforcement learning without destroying the robot during training, using only onboard images.

  24. Approach: in the unknown environment, learn a collision prediction model: a neural network that maps a raw image and command velocities to a collision prediction.

  25. Collision prediction model.

  26. Model-based RL using the collision prediction model. Loop: (1) gather trajectories using the MPC controller (may experience collisions); (2) train the uncertainty-aware collision prediction model on the data (a deep neural network with uncertainty estimates from bootstrapping and dropout); (3) form a speed-dependent, uncertainty-aware collision cost that encourages safe, low-speed collisions by reasoning about the model's uncertainty. The robot increases speed as the model becomes more confident.

  27. Collision cost: high speed + predicted collision + large uncertainty → large cost.
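The slide's chain (high speed, predicted collision, large uncertainty → large cost) can be sketched as a single function. The multiplicative form and the weight `alpha` are assumptions for illustration, not the paper's exact cost:

```python
def collision_cost(speed, p_collision, uncertainty, alpha=1.0):
    """Sketch: cost scales with speed, and with the predicted collision
    probability inflated by the model's uncertainty (alpha is assumed)."""
    return speed * (p_collision + alpha * uncertainty)
```

Because speed multiplies the (uncertainty-inflated) collision probability, a controller minimizing this cost slows down exactly when the model is unsure or predicts a crash, which is what makes any remaining collisions low-speed ones.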

  28. Estimating neural network output uncertainty: bootstrapping. Training time: resample the data with replacement into datasets D1, D2, D3 and train models M1, M2, M3. Test time: feed the input to M1, M2, M3 and compare their outputs.
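The resample-then-train-then-compare recipe above is a few lines of code. A minimal sketch (the "models" here are trivial constant predictors so the example stays self-contained; `fit` would be your network-training routine):

```python
import random
import statistics

def bootstrap_ensemble(data, fit, n_models=3, seed=0):
    """Training time: resample the dataset with replacement into D1..DM
    and train one model per resample."""
    rng = random.Random(seed)
    return [fit([rng.choice(data) for _ in data]) for _ in range(n_models)]

def predict_with_uncertainty(models, x):
    """Test time: query every model; the mean is the prediction and the
    spread (std) across models serves as an uncertainty estimate."""
    ys = [m(x) for m in models]
    return statistics.mean(ys), statistics.pstdev(ys)

# Toy check: each 'model' is a constant predictor of its resample's mean.
data = [0.0, 1.0, 1.0, 2.0]
fit = lambda d: (lambda x, mu=statistics.mean(d): mu)
models = bootstrap_ensemble(data, fit)
mean, std = predict_with_uncertainty(models, None)
```

The disagreement between the M models is largest on inputs unlike the training data, which is exactly the signal the collision cost on the previous slide consumes.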

  29. Estimating neural network output uncertainty: dropout. Training time: train with dropout. Test time: keep dropout active and run the model multiple times on the same input.
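The dropout variant keeps the random masks on at test time and treats the stochastic forward passes as samples. A toy sketch on a single linear layer (the network, weights, and drop rate are illustrative assumptions):

```python
import random
import statistics

def mc_dropout_predict(weights, x, n_samples=100, p_drop=0.5, seed=0):
    """MC-dropout sketch: keep dropout ACTIVE at test time and run several
    stochastic forward passes; their mean is the prediction and their
    spread estimates the model's uncertainty."""
    rng = random.Random(seed)
    outs = []
    for _ in range(n_samples):
        # Drop each weight with prob p_drop, rescale survivors by 1/(1-p).
        y = sum((0.0 if rng.random() < p_drop else w / (1 - p_drop)) * xi
                for w, xi in zip(weights, x))
        outs.append(y)
    return statistics.mean(outs), statistics.pstdev(outs)

mean, std = mc_dropout_predict([1.0, -2.0, 0.5], [1.0, 1.0, 1.0])
```

Unlike bootstrapping, this needs only one trained network; the two estimates can also be combined, as the previous slide's "bootstrapping and dropout" suggests.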

  30. Preliminary real-world experiments: not accounting for uncertainty (higher-speed collisions).

  31. Preliminary real-world experiments: accounting for uncertainty (lower-speed collisions).

  32. Preliminary real-world experiments: successful flight past the obstacle.

  33. Safety takeaways: • Tradeoff between safety and exploration • Safety guarantees require expert oversight or a known environment + dynamics • Uncertainty can play a key role

  34. Safety | Flexibility | Imitation learning | Model-free | Model-based

  35. Goal: user-specified command.

  36. Approach. Option A: input the command to the network. Option B: branch the network using the command (empirically better, but only works for discrete commands).
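Option B's command-conditional branching amounts to one output head per discrete command, with the command selecting the head. A minimal sketch with toy heads (the commands, head functions, and output fields are illustrative, not any paper's architecture):

```python
def branched_policy(features, command, branches):
    """Option B sketch: the shared trunk's features go to every head, but
    only the head selected by the discrete command produces the controls.
    This only works because the command set is discrete."""
    return branches[command](features)

# Toy heads for two commands (illustrative).
branches = {
    "left":  lambda f: {"steer": -0.5, "throttle": f},
    "right": lambda f: {"steer": +0.5, "throttle": f},
}
out = branched_policy(0.3, "left", branches)
```

Each head specializes to its command, which is one plausible reason branching beats simply concatenating the command to the input.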

  37. Approach, important details: • Data augmentation (contrast, brightness, tone, Gaussian blur, salt-and-pepper noise, region dropout) • Adding noise to the expert
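As one concrete example of the augmentations listed, salt-and-pepper noise replaces random pixels with pure black or white. A sketch for a flat grayscale pixel list (the noise fraction `p` and 8-bit value range are assumptions):

```python
import random

def salt_and_pepper(pixels, p=0.05, seed=0):
    """Salt-and-pepper augmentation sketch: each pixel is independently
    replaced by black (0) or white (255) with total probability p."""
    rng = random.Random(seed)
    out = []
    for px in pixels:
        r = rng.random()
        if r < p / 2:
            out.append(0)        # pepper
        elif r < p:
            out.append(255)      # salt
        else:
            out.append(px)       # unchanged
    return out

noisy = salt_and_pepper([128] * 1000)
```

Applying such corruptions only to the inputs (labels untouched) forces the policy to rely on robust image features rather than incidental pixel statistics.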

  38. [slides adapted from Tuomas Haarnoja]

  39. Goal. (Diagram: space of trajectories. Task 1: Reach → reaching skill; Task 2: Avoid → avoidance skill; combined: reaching-while-avoiding skill.)

  40. Policy composition. (Diagram: space of trajectories. Task 1: Reach → reaching skill; Task 2: Avoid → avoidance skill; Task 1+2: Reach and avoid → reaching-while-avoiding skill.) The composition error is related to the divergence between the component policies. Reusability!
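One simple way to realize this composition in the maximum-entropy setting is to sum the tasks' soft Q-values, which makes the composed policy proportional to the product of the component policies. A discrete-action sketch (a simplification for illustration; the actual method operates on learned continuous-action Q-functions):

```python
import math

def boltzmann(q, alpha=1.0):
    """Soft / maximum-entropy policy over discrete actions: pi ∝ exp(Q/alpha)."""
    z = [math.exp(v / alpha) for v in q]
    s = sum(z)
    return [p / s for p in z]

def compose(q1, q2, alpha=1.0):
    """Composition sketch: add the two tasks' soft Q-values, so the composed
    policy is proportional to the product of the component policies."""
    return boltzmann([a + b for a, b in zip(q1, q2)], alpha)

# Toy: task 1 prefers action 0, task 2 rules action 0 out -> compromise.
q_reach = [2.0, 1.0, 0.0]
q_avoid = [-5.0, 0.0, 0.0]
pi = compose(q_reach, q_avoid)
```

The composed policy concentrates on actions that are acceptable to both tasks, which is why the quality of the composition degrades with the divergence between the component policies.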

  41. Task 1 | Task 2 | Task 1 + 2

  42. Stacking policy | Avoidance policy

  43. Stacking policy | Avoidance policy | Combined policy

  44. Safety | Flexibility | Imitation learning | Model-free | Model-based

  45. Standard reinforcement learning. Train: repeatedly collect data and update the policy (data → policy → data → policy ...), with an expert in the loop. Data inefficient; inflexible at test time.

  46. CAPs approach. Train: event cues → detector → data → CAPs, with the detector in the loop. Data efficient; flexible at test time.

  47. Detect → Predict → Control.

  48. Detect → Predict → Control. Detect: a detector maps cues to events.
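The detect/predict/control pipeline can be sketched end to end: a predictive model maps a state and a candidate action plan to future event probabilities, and a user-specified cost over those predicted events selects the plan. All names here are illustrative stand-ins, not the paper's API:

```python
def caps_control(state, candidate_plans, predict_events, task_cost):
    """Control step sketch: score every candidate plan by the task cost of
    its predicted future events, and execute the cheapest plan."""
    return min(candidate_plans,
               key=lambda plan: task_cost(predict_events(state, plan)))

# Toy: events = {'collision': probability}; task = avoid collisions.
predict_events = lambda s, plan: {"collision": 0.9 if plan == "fast" else 0.1}
task_cost = lambda ev: ev["collision"]
best = caps_control(None, ["fast", "slow"], predict_events, task_cost)
```

Flexibility falls out of the structure: retargeting the robot (new lane rule, new speed, new goal) means swapping `task_cost`, with no retraining of the event predictor.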

  49. Detect → Predict → Control.

  50. Detect → Predict → Control.

  51. Detect → Predict → Control.

  52. (Six videos at 8× speed.)

  53. (Video at 8× speed.)

  54. Tasks: drive at 7 m/s, avoid collisions, drive in either lane, drive in the right lane. (Videos at 6× speed.)

  55. CAPs (video at 6× speed).

  56. Safety | Flexibility | Imitation learning | Model-free | Model-based

  57. Safety | Flexibility | Imitation learning | Model-free | Model-based

  58. Collision avoidance: CAPs vs. DQL.

  59. Tasks: avoid collisions, follow the goal heading, move towards doors. (Plot: heading.)

  60. Flexibility takeaways: • Carefully construct how your policy / model deals with goals • Model-free methods require extra care to reuse • Model-based methods are flexible by construction
