CS 285: Supervised Learning of Behaviors (Instructor: Sergey Levine, UC Berkeley)


  1. Supervised Learning of Behaviors CS 285 Instructor: Sergey Levine UC Berkeley

  2. Terminology & notation 1. run away 2. ignore 3. pet

  3. Terminology & notation 1. run away 2. ignore 3. pet

  4. Aside: notation. управление (Russian for "control"); Lev Pontryagin, Richard Bellman

  5. Imitation Learning supervised training learning data behavioral cloning Images: Bojarski et al. ‘16, NVIDIA
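
The slide frames behavioral cloning as plain supervised learning on observation-action pairs from the demonstrations. As a hedged illustration (not the course's code), a minimal PyTorch sketch, with random placeholder data standing in for the recorded (observation, action) pairs, could look like:

```python
# Minimal behavioral cloning sketch (illustrative only, not the course's code).
# The "expert data" here is random placeholder data standing in for recorded
# (observation, action) pairs from a human demonstrator.
import torch
import torch.nn as nn

obs_dim, act_dim = 10, 2
expert_obs = torch.randn(1000, obs_dim)        # placeholder observations
expert_act = torch.randn(1000, act_dim)        # placeholder expert actions

policy = nn.Sequential(                        # pi_theta(a | o)
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):                        # plain supervised learning on (o, a) pairs
    loss = nn.functional.mse_loss(policy(expert_obs), expert_act)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```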

  6. The original deep imitation learning system. ALVINN: Autonomous Land Vehicle In a Neural Network, 1989

  7. Does it work? No!

  8. Does it work? Yes! Video: Bojarski et al. ‘16, NVIDIA

  9. Why did that work? Bojarski et al. ‘16, NVIDIA

  10. Can we make it work more often? cost stability (more on this later)

  11. Can we make it work more often?

  12. Can we make it work more often? DAgger: Dataset Aggregation. Ross et al. '11
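
DAgger's loop (train, run the policy, ask the expert to label the visited observations, aggregate, repeat) can be summarized in a short sketch; train_policy, run_policy, and query_expert below are hypothetical helpers, not calls from any particular library:

```python
# DAgger (Dataset Aggregation) sketch, after Ross et al. '11.
# train_policy, run_policy, and query_expert are hypothetical helpers.
def dagger(initial_demos, train_policy, run_policy, query_expert, n_iters=10):
    dataset = list(initial_demos)                      # human demonstration data D
    policy = train_policy(dataset)                     # 1. train pi_theta(a_t | o_t) on D
    for _ in range(n_iters):
        observations = run_policy(policy)              # 2. run pi_theta to collect observations D_pi
        labeled = [(o, query_expert(o)) for o in observations]  # 3. ask a human to label D_pi
        dataset += labeled                             # 4. aggregate: D <- D union D_pi
        policy = train_policy(dataset)                 #    retrain and repeat
    return policy
```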

  13. DAgger Example Ross et al. ‘11

  14. What’s the problem? Ross et al. ‘11

  15. Deep imitation learning in practice

  16. Can we make it work without more data? • DAgger addresses the problem of distributional “drift” • What if our model is so good that it doesn’t drift? • Need to mimic expert behavior very accurately • But don’t overfit!

  17. Why might we fail to fit the expert? 1. Non-Markovian behavior: a policy conditioned only on the current observation does the same thing every time it sees the same thing, regardless of what happened before, which is often very unnatural for human demonstrators, whose behavior depends on all past observations. 2. Multimodal behavior

  18. How can we use the whole history? Simply concatenating all past frames gives a variable number of frames and too many weights.

  19. How can we use the whole history? Encode each frame with shared weights and pass an RNN state from step to step. Typically, LSTM cells work better here.
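
A rough sketch of such a history-conditioned policy, assuming PyTorch and made-up dimensions (not the course's architecture):

```python
# History-conditioned policy sketch: shared per-frame encoder + LSTM.
# Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)      # shared weights across time steps
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq):                        # obs_seq: (batch, time, obs_dim)
        feats = torch.relu(self.encoder(obs_seq))      # encode each frame with the same weights
        out, _ = self.lstm(feats)                      # the RNN state carries the history forward
        return self.head(out[:, -1])                   # action from the final hidden state
```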

  20. Aside: why might this work poorly? “causal confusion” see: de Haan et al., “Causal Confusion in Imitation Learning” Question 1: Does including history mitigate causal confusion? Question 2: Can DAgger mitigate causal confusion?

  21. Why might we fail to fit the expert? 1. Non-Markovian behavior 2. Multimodal behavior, which we can handle with: 1. Output mixture of Gaussians 2. Latent variable models 3. Autoregressive discretization

  22. Why might we fail to fit the expert? 1. Output mixture of Gaussians 2. Latent variable models 3. Autoregressive discretization

  23. Why might we fail to fit the expert? 1. Output mixture of Gaussians 2. Latent variable models 3. Autoregressive discretization Look up some of these: • Conditional variational autoencoder • Normalizing flow/realNVP • Stein variational gradient descent
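
As one hedged example of the first option, a mixture-of-Gaussians (mixture density network) output head in PyTorch might look like the sketch below; the layer sizes and names are assumptions, and the policy would be trained by maximizing log_prob on expert (observation, action) pairs:

```python
# Mixture-of-Gaussians output head sketch (illustrative assumptions, not the course's code).
import torch
import torch.nn as nn

class MixtureGaussianHead(nn.Module):
    def __init__(self, feat_dim, act_dim, n_components=5):
        super().__init__()
        self.n, self.act_dim = n_components, act_dim
        self.logits = nn.Linear(feat_dim, n_components)            # mixture weights
        self.means = nn.Linear(feat_dim, n_components * act_dim)   # component means
        self.log_stds = nn.Linear(feat_dim, n_components * act_dim)

    def log_prob(self, feats, action):
        mix = torch.distributions.Categorical(logits=self.logits(feats))
        means = self.means(feats).view(-1, self.n, self.act_dim)
        stds = self.log_stds(feats).view(-1, self.n, self.act_dim).exp()
        comp = torch.distributions.Independent(
            torch.distributions.Normal(means, stds), 1)
        gmm = torch.distributions.MixtureSameFamily(mix, comp)
        return gmm.log_prob(action)        # maximize this over expert (o, a) pairs
```

Sampling an action then means picking a mixture component and sampling from that Gaussian, which lets the policy represent "go left or go right" without averaging the two modes into "go straight".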

  24. Why might we fail to fit the expert? 1. Output mixture of Gaussians 2. Latent variable models 3. Autoregressive discretization [Figure: discretize dimension 1 and sample a discrete value from a distribution over dimension 1 only, then predict dimension 2 conditioned on the sampled value, and so on]
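
A rough sketch of the autoregressive discretization idea, discretizing each action dimension into bins and sampling one dimension at a time conditioned on the previously sampled ones (the network shapes here are illustrative assumptions):

```python
# Autoregressive discretization sketch: one classifier per action dimension,
# each conditioned on the dimensions sampled before it. Illustrative only.
import torch
import torch.nn as nn

class AutoregressiveDiscretePolicy(nn.Module):
    def __init__(self, feat_dim, act_dim, n_bins=51):
        super().__init__()
        self.n_bins = n_bins
        # head i sees the features plus the i previously sampled dimensions
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim + i, n_bins) for i in range(act_dim))

    def sample(self, feats):                       # feats: (batch, feat_dim)
        sampled = []
        for head in self.heads:
            inp = torch.cat([feats] + sampled, dim=-1)
            bins = torch.distributions.Categorical(logits=head(inp)).sample()
            sampled.append(bins.float().unsqueeze(-1) / (self.n_bins - 1))  # bin index -> [0, 1]
        return torch.cat(sampled, dim=-1)          # one value per action dimension
```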

  25. Imitation learning: recap (training data → supervised learning) • Often (but not always) insufficient by itself: distribution mismatch problem • Sometimes works well: hacks (e.g. left/right images), samples from a stable trajectory distribution, adding more on-policy data (e.g. using DAgger), better models that fit more accurately

  26. A case study: trail following from human demonstration data

  27. Case study 1: trail following as classification

  28. Cost functions, reward functions, and a bit of theory

  29. Imitation learning: what’s the problem? • Humans need to provide data, which is typically finite • Deep learning works best when data is plentiful • Humans are not good at providing some kinds of actions • Humans can learn autonomously; can our machines do the same? • Unlimited data from own experience • Continuous self-improvement

  30. Terminology & notation 1. run away 2. ignore 3. pet

  31. Aside: notation Lev Pontryagin Richard Bellman

  32. Cost functions, reward functions, and a bit of theory

  33. A cost function for imitation? (training data → supervised learning) Ross et al. '11

  34. Some analysis

  35. More general analysis For more analysis, see Ross et al., "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning"

  36. More general analysis For more analysis, see Ross et al., "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning"
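
The equations on the analysis slides are not preserved in this transcript; the following is a hedged reconstruction of the standard argument along the lines of Ross et al. '11, with epsilon the per-step probability of deviating from the expert on training-distribution states and T the horizon:

```latex
% Hedged reconstruction of the standard analysis (the slides' own equations are not
% preserved in this transcript); follows the argument in Ross et al. '11.
% Cost: one unit for every time step on which the policy deviates from the expert.
\[
  c(\mathbf{s}_t, \mathbf{a}_t) =
    \begin{cases}
      0 & \text{if } \mathbf{a}_t = \pi^\star(\mathbf{s}_t) \\
      1 & \text{otherwise}
    \end{cases}
\]
% Assume the learned policy errs with probability at most epsilon on training-distribution states:
\[
  \pi_\theta(\mathbf{a}_t \neq \pi^\star(\mathbf{s}_t) \mid \mathbf{s}_t) \le \epsilon
  \quad \text{for } \mathbf{s}_t \sim p_{\mathrm{train}}(\mathbf{s}_t).
\]
% Behavioral cloning: one mistake can put the policy in states it never trained on, where it
% may keep making mistakes, so errors compound over the horizon:
\[
  \mathbb{E}\Big[\sum_{t=1}^{T} c(\mathbf{s}_t, \mathbf{a}_t)\Big] = O(\epsilon T^2).
\]
% DAgger: training states are drawn from p_{\pi_\theta}, so errors no longer compound:
\[
  \mathbb{E}\Big[\sum_{t=1}^{T} c(\mathbf{s}_t, \mathbf{a}_t)\Big] = O(\epsilon T).
\]
```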

  37. Another way to imitate

  38. Another imitation idea

  39. Goal-conditioned behavioral cloning

  40. 1. Collect data 2. Train goal conditioned policy

  41. 3. Reach goals

  42. Going beyond just imitation? ➢ Start with a random policy ➢ Collect data with random goals ➢ Treat this data as “demonstrations” for the goals that were reached ➢ Use this to improve the policy ➢ Repeat
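
A hedged sketch of this loop, in which whatever state a rollout actually reached is treated as the goal it demonstrated (sample_goal, rollout, and train_policy are hypothetical helpers, not library calls):

```python
# Goal relabeling sketch: relabel each rollout with the state it actually reached,
# then train the goal-conditioned policy by behavioral cloning on the relabeled data.
# sample_goal, rollout, and train_policy are hypothetical helpers.
def goal_conditioned_self_imitation(policy, env, sample_goal, rollout, train_policy, n_iters=100):
    dataset = []
    for _ in range(n_iters):
        goal = sample_goal()                          # collect data with a random goal
        trajectory = rollout(policy, env, goal)       # list of (observation, action) pairs
        reached = trajectory[-1][0]                   # the state the rollout actually ended in
        for obs, action in trajectory:                # treat the rollout as a demo for that goal
            dataset.append((obs, reached, action))
        policy = train_policy(dataset)                # supervised learning: (obs, goal) -> action
    return policy
```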
