
  1. Deep Learning for Robotics. Pieter Abbeel

  2. Reinforcement Learning (RL). $\pi_\theta(a \mid s)$: probability of taking action $a$ in state $s$. Diagram: Robot + Environment. Application areas: robotics, dialogue, marketing / advertising, optimizing operations / logistics, queue management, ... Goal: $\max_\theta \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \mid \pi_\theta\right]$

  3. Reinforcement Learning (RL). $\pi_\theta(a \mid s)$: probability of taking action $a$ in state $s$. Robot + Environment. Goal: $\max_\theta \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \mid \pi_\theta\right]$. Additional challenges: stability, credit assignment, exploration.

  4. Reinforcement Learning (RL). $\pi_\theta(a \mid s)$: probability of taking action $a$ in state $s$. Robot + Environment. Goal: $\max_\theta \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \mid \pi_\theta\right]$
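
To make the objective concrete, here is a minimal REINFORCE-style sketch that estimates $\max_\theta \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \mid \pi_\theta\right]$ by sampling rollouts and following the policy gradient. The toy two-state MDP, the linear softmax policy, and all hyperparameters are illustrative assumptions, not anything from the slides.

```python
import numpy as np

# Minimal REINFORCE sketch for the objective  max_theta E[ sum_t R(s_t) | pi_theta ].
# Toy 2-state MDP and linear softmax policy are illustrative assumptions.

N_STATES, N_ACTIONS, HORIZON = 2, 2, 10
rng = np.random.default_rng(0)

def step(s, a):
    # Action 0 keeps the current state, action 1 flips it; state 1 yields reward 1.
    s_next = s if a == 0 else 1 - s
    return s_next, float(s_next == 1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros((N_STATES, N_ACTIONS))   # policy parameters: pi_theta(a|s) = softmax(theta[s])

for it in range(200):
    grad = np.zeros_like(theta)
    returns = []
    for _ in range(20):                    # sample rollouts under pi_theta
        s, traj, R = 0, [], 0.0
        for t in range(HORIZON):
            p = softmax(theta[s])
            a = rng.choice(N_ACTIONS, p=p)
            s_next, r = step(s, a)
            traj.append((s, a, p))
            R += r
            s = s_next
        returns.append(R)
        # REINFORCE: grad log pi_theta(a|s) * (total return)
        for s_t, a_t, p_t in traj:
            g = -p_t
            g[a_t] += 1.0
            grad[s_t] += g * R
    theta += 0.01 * grad / 20

print("average return over the last batch of rollouts:", np.mean(returns))
```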

  5. Deep RL Success Stories. DQN: Mnih et al., NIPS 2013 / Nature 2015; Gu et al., NIPS 2014. TRPO: Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015. A3C: Mnih et al., 2016. Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016. Levine*, Finn*, Darrell, Abbeel, JMLR 2016. Silver et al., Nature 2015.

  6. Speed of Learning. Deep RL (DQN) vs. Human. DQN: score 18.9, experience measured in real time: 40 days ("slow"). Human: score 9.3, experience measured in real time: 2 hours ("fast").

  7. Starting Observations. TRPO, DQN, A3C are fully general RL algorithms, i.e., for any MDP that can be mathematically defined, these algorithms are equally applicable. MDPs encountered in the real world are a tiny, tiny subset of all MDPs that could be defined. Can we design "fast" RL algorithms that take advantage of such knowledge?

  8. Research Questions. How to acquire a good prior for real-world MDPs? Or, for starters, e.g., for real-game MDPs? How to design algorithms that make use of such prior information? Key idea: learn a fast RL algorithm that encodes this prior.

  9. Formulation. Given: a distribution over relevant MDPs. Train the fast RL algorithm to be fast on a training set of MDPs.

  10. Formulation

  11. Learning the Fast RL Algorithm. Representation of the fast RL algorithm: an RNN, a generic computation architecture; different weights in the RNN mean a different RL algorithm; different activations in the RNN mean a different current policy. Training setup:
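
A structural sketch of this setup follows, assuming a bandit-style task distribution: the RNN's hidden state is carried across episodes within a trial, and the tuple (previous action, previous reward, done flag) is fed back in so the activations can implement a "fast" RL algorithm. The tanh RNN cell, the sizes, and the Bernoulli bandit sampler are assumptions; the outer "slow" optimizer over the RNN weights (e.g., TRPO) is not shown.

```python
import numpy as np

# Structural sketch of the RL^2 interaction loop (illustrative, not the paper's code).
rng = np.random.default_rng(0)
N_ARMS, HIDDEN, EPISODES_PER_TRIAL, HORIZON = 5, 32, 2, 10
IN_DIM = N_ARMS + 2                          # one-hot previous action + previous reward + done flag

W_in = rng.normal(0, 0.1, (HIDDEN, IN_DIM))  # RNN weights = "the fast RL algorithm"
W_h  = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W_out = rng.normal(0, 0.1, (N_ARMS, HIDDEN))

def policy_step(h, x):
    h = np.tanh(W_in @ x + W_h @ h)          # activations = "the current policy"
    logits = W_out @ h
    p = np.exp(logits - logits.max()); p /= p.sum()
    return h, p

def run_trial():
    arm_means = rng.uniform(size=N_ARMS)     # sample one task (here: a bandit) per trial
    h = np.zeros(HIDDEN)                     # hidden state persists across episodes of the trial
    total, prev_a, prev_r, done = 0.0, 0, 0.0, 0.0
    for ep in range(EPISODES_PER_TRIAL):
        for t in range(HORIZON):
            x = np.zeros(IN_DIM)
            x[prev_a] = 1.0                  # feed back previous action, reward, and done flag
            x[N_ARMS] = prev_r
            x[N_ARMS + 1] = done
            h, p = policy_step(h, x)
            a = rng.choice(N_ARMS, p=p)
            r = float(rng.random() < arm_means[a])
            prev_a, prev_r, done = a, r, float(t == HORIZON - 1)
            total += r
    return total                             # the outer ("slow") RL maximizes E[total] over W_*

print("trial return with random weights:", run_trial())
```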

  12. Alternative View on RL^2. The RNN is a policy for acting in a POMDP; part of what is not observed in the POMDP is which MDP the agent is in.

  13. Related Work. Wang et al. (2016), Learning to Reinforcement Learn, in submission to ICLR 2017. Chen et al. (2016), Learning to Learn for Global Optimization of Black Box Functions. Andrychowicz et al. (2016), Learning to Learn by Gradient Descent by Gradient Descent. Santoro et al. (2016), One-shot Learning with Memory-Augmented Neural Networks. Larochelle et al. (2008), Zero-data Learning of New Tasks. Younger et al. (2001), Meta Learning with Backpropagation. Schmidhuber et al. (1996), Simple Principles of Metalearning.

  14. RL^2: Fast RL by Slow RL. Key insights: we represent the AI agent as a recurrent neural net (RNN), i.e., the RNN is the "fast" RL algorithm; different weights in the RNN mean a different RL algorithm. To discover good weights for the RNN (i.e., to discover the fast RL algorithm), train with classical ("slow") RL. [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]

  15. Evaluation: Multi-Armed Bandits. Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, ... Figure: 5-armed bandit (source: eBay). [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
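
For reference, here are minimal versions of two of the classical baselines named on this slide, UCB1 and Thompson sampling, on a Bernoulli 5-armed bandit; the Bernoulli arms and the horizon are illustrative assumptions.

```python
import numpy as np

# Minimal UCB1 and Thompson-sampling baselines of the kind RL^2 is compared against.
rng = np.random.default_rng(0)
arm_means = rng.uniform(size=5)              # a "5-armed bandit" with unknown Bernoulli means
T = 500

def ucb1():
    counts, sums, total = np.zeros(5), np.zeros(5), 0.0
    for t in range(T):
        if t < 5:
            a = t                            # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
            a = int(np.argmax(ucb))
        r = float(rng.random() < arm_means[a])
        counts[a] += 1; sums[a] += r; total += r
    return total

def thompson():
    alpha, beta, total = np.ones(5), np.ones(5), 0.0
    for _ in range(T):
        a = int(np.argmax(rng.beta(alpha, beta)))   # sample a mean per arm, act greedily
        r = float(rng.random() < arm_means[a])
        alpha[a] += r; beta[a] += 1 - r; total += r
    return total

print("best possible:", T * arm_means.max())
print("UCB1:", ucb1(), " Thompson:", thompson())
```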

  16. Evaluation: Multi-Armed Bandits. [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]

  17. Evaluation: Multi-Armed Bandits

  18. Evaluation: Tabular MDPs. Provably (asymptotically) optimal algorithms: BEB, PSRL, UCRL2, ...
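
As a reminder of what such a baseline looks like, below is a compact sketch of posterior sampling RL (PSRL): sample an MDP from the posterior at the start of each episode and act greedily with respect to it. The random tabular MDP, the Dirichlet/Beta priors, and the episode length are illustrative assumptions.

```python
import numpy as np

# Sketch of PSRL (posterior sampling RL), one of the tabular baselines named on the slide.
rng = np.random.default_rng(0)
S, A, H, EPISODES = 5, 2, 10, 200

# Ground-truth (unknown) MDP: random transitions, Bernoulli rewards.
P_true = rng.dirichlet(np.ones(S), size=(S, A))          # P_true[s, a] is a distribution over s'
R_true = rng.uniform(size=(S, A))

# Posterior statistics (Dirichlet over transitions, Beta over rewards).
trans_counts = np.ones((S, A, S))
rew_alpha, rew_beta = np.ones((S, A)), np.ones((S, A))

def solve(P, R):
    """Finite-horizon value iteration; returns the greedy action per step and state."""
    Q, V = np.zeros((H, S, A)), np.zeros(S)
    for t in reversed(range(H)):
        Q[t] = R + P @ V
        V = Q[t].max(axis=1)
    return Q.argmax(axis=2)                               # shape (H, S)

total = 0.0
for ep in range(EPISODES):
    # Sample one MDP from the posterior and act greedily w.r.t. it for the whole episode.
    P_s = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)] for s in range(S)])
    R_s = rng.beta(rew_alpha, rew_beta)
    pi = solve(P_s, R_s)
    s = 0
    for t in range(H):
        a = pi[t, s]
        r = float(rng.random() < R_true[s, a])
        s_next = rng.choice(S, p=P_true[s, a])
        trans_counts[s, a, s_next] += 1
        rew_alpha[s, a] += r; rew_beta[s, a] += 1 - r
        total += r
        s = s_next

print("average per-episode return:", total / EPISODES)
```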

  19. Evaluation: Tabular MDPs

  20. Evaluation: Tabular MDPs

  21. Evaluation: Visual Navigation (built on top of ViZDoom). Figures: agent's view; small maze; large maze.

  22. Evaluation: Visual Navigation. Before learning / after learning.

  23. Evaluation: Visual Navigation (built on top of ViZDoom). Occasional "bad" behavior.

  24. Evaluation: Visual Navigation

  25. Evaluation

  26. OpenAI Universe

  27. Outline. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel). Third Person Imitation Learning (Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever). Variational Lossy Autoencoder (Xi Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, P. Abbeel). Deep Reinforcement Learning for Tensegrity Robot Locomotion (X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine).

  28. Third Person Imitation Learning. First-person imitation learning: demonstrate with the robot itself, e.g., drive a car, tele-operate a robot, etc. Third-person imitation learning: the robot watches demonstrations. Challenges: different viewpoint; different incarnation (human vs. robot).

  29. Third Person Imitation Learning. Example problem settings. Figures: third-person view; robot environment.

  30. Basic Ideas. Generative Adversarial Imitation Learning (Ho et al., 2016; Finn et al., 2016): the reward is defined by a learned classifier distinguishing expert from robot behavior -> optimizing such a reward makes the robot perform like the expert -> works well for first-person imitation learning. BUT: in the third-person setting the classifier will simply identify expert vs. robot environment, and the robot can never match the expert. Domain confusion loss (Tzeng et al., 2015): deep-learn a feature representation from which it is not possible to distinguish the environment. BUT: this competes too directly with the first objective. Fix: let the first objective see multiple frames, i.e., see behavior.
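
A rough sketch of this loss structure, assuming flat feature vectors instead of images and arbitrary layer sizes (so not the paper's exact architecture): a shared feature extractor feeds an expert-vs-novice discriminator that sees a stack of frames, plus a domain classifier trained through a gradient-reversal layer so the features become domain-confused.

```python
import torch
import torch.nn as nn

# Illustrative sketch: discriminator (expert vs. novice) over multi-frame features,
# plus a domain head behind a gradient-reversal layer for domain confusion.

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # flip the gradient: maximize domain confusion

OBS_DIM, FEAT_DIM, N_FRAMES, LAM = 64, 32, 4, 0.5   # assumed sizes and weight lambda

feat = nn.Sequential(nn.Linear(OBS_DIM, FEAT_DIM), nn.ReLU())            # shared features
disc = nn.Sequential(nn.Linear(FEAT_DIM * N_FRAMES, 64), nn.ReLU(),
                     nn.Linear(64, 1))                                    # expert vs. novice
dom = nn.Sequential(nn.Linear(FEAT_DIM, 64), nn.ReLU(),
                    nn.Linear(64, 1))                                     # expert domain vs. robot domain
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(list(feat.parameters()) + list(disc.parameters())
                       + list(dom.parameters()), lr=1e-3)

def loss(frames, is_expert, is_expert_domain):
    """frames: (batch, N_FRAMES, OBS_DIM); labels: (batch, 1) floats."""
    f = feat(frames)                                    # (batch, N_FRAMES, FEAT_DIM)
    d_logit = disc(f.flatten(1))                        # discriminator sees all frames (behavior)
    dom_logit = dom(GradReverse.apply(f[:, -1], LAM))   # domain head sees one frame, reversed grads
    return bce(d_logit, is_expert) + bce(dom_logit, is_expert_domain)

# The novice's reward for policy optimization would then be derived from the
# discriminator output, as in generative adversarial imitation learning.
frames = torch.randn(8, N_FRAMES, OBS_DIM)
l = loss(frames, torch.ones(8, 1), torch.ones(8, 1))
opt.zero_grad(); l.backward(); opt.step()
print(float(l))
```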

  31. Architecture Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  32. Learning Curves Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  33. Domain Classification Accuracy Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  34. Does the algorithm we propose benefit from both domain confusion and the multi-time-step input? Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  35. How sensitive is our proposed algorithm to the selection of hyperparameters used in deployment? Domain confusion weight lambda. Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  36. How sensitive is our proposed algorithm to the selection of hyperparameters used in deployment? Number of lookahead frames. Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  37. Results. Third-person view of the expert; imitator.

  38. Outline. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel). Third Person Imitation Learning (Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever). Probabilistically Safe Policy Transfer (David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, and Pieter Abbeel). Deep Reinforcement Learning for Tensegrity Robot Locomotion (X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine).

  39. Risky Robotics Tasks. Autonomous driving; robots interacting around / with people; robots manipulating fragile objects; the robot can damage itself. Question: how do we train a robot for these tasks? Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  40. Previous Approaches. Isolated training environment (e.g., a cage): may not represent the test environment; no human interaction / collaboration in an isolated environment. Train in simulation: may not represent the test environment. Watch the robot carefully and try to press the kill switch in time: requires careful observation and prediction of robot actions by a human, who may not react fast enough. Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  41. Our Approach. Operate in the test environment, initially with low torques: use the lowest torques at which the task can still be completed, safely but slowly. Assumption: low torques are safer but less efficient. Increase the torque limit as the robot demonstrates that it can operate safely. How do we define safety? Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  42. How to Define Safety? Expected damage: D_safe defines our "safety budget", how much expected damage we can afford. Example for an autonomous car: low risk of hitting a pedestrian at low speeds; even lower risk of killing a pedestrian. Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  43. How to Define Safety? Higher torques -> lower time per task -> more benefit. Higher torques -> more damage. Torques are clipped at T_lim (wlog assume α = 1). Assume a binary probability of being unsafe, i.e., the probability of failure to be safe. Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  44. How to Define Safety? Expected damage: a high probability of failure -> low torques; a low probability of failure -> higher torques. Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope
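
Read together, slides 42-44 suggest a simple rule: pick the largest torque limit whose expected damage stays within the budget D_safe. The sketch below assumes damage grows linearly with the torque limit (the slide's "wlog α = 1") and a fixed failure-probability estimate; both are illustrative assumptions rather than the authors' exact formulation.

```python
# Sketch of a torque-limit rule consistent with slides 42-44: keep the expected
# damage under a safety budget D_safe. Linear damage model is an assumption.

D_SAFE = 0.05          # safety budget: expected damage we can afford per attempt (assumed value)
T_MAX = 10.0           # hardware ceiling on the torque limit (assumed value)

def torque_limit(p_fail, alpha=1.0):
    """Largest torque limit whose expected damage p_fail * alpha * T_lim stays within budget."""
    if p_fail <= 0.0:
        return T_MAX
    return min(T_MAX, D_SAFE / (alpha * p_fail))

# High estimated failure probability -> low torques; low probability -> higher torques.
for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p_fail={p:.3f}  ->  T_lim={torque_limit(p):.2f}")
```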

  45. Overall Algorithm Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  46. Predicting Failure Increases. Effect from adjusting the torque limit; effect from updating the policy. Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope

  47. Predicting Failure Increases. Effect from changing the torque limit: $F(-T_{\text{lim}})$ and $1 - F(T_{\text{lim}})$. Effect from updating the policy. Figure: PDFs of the torques under the old policy and the new policy, with their intersection marked (x-axis: torque, y-axis: PDF). Pieter Abbeel -- UC Berkeley / OpenAI / Gradescope
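
The tail quantities on this slide can be computed directly: the probability that a commanded torque exceeds the limit is $F(-T_{\text{lim}}) + 1 - F(T_{\text{lim}})$, where $F$ is the CDF of the torques the policy commands. The Gaussian model of the old and new policies' torques below is an illustrative assumption matching the plotted PDFs.

```python
from math import erf, sqrt

# Tail probability of exceeding the torque limit, F(-T_lim) + 1 - F(T_lim),
# under an assumed Gaussian model of the commanded torques.

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def clip_probability(mu, sigma, t_lim):
    """P(|torque| > t_lim) = F(-t_lim) + 1 - F(t_lim)."""
    F = lambda x: gaussian_cdf(x, mu, sigma)
    return F(-t_lim) + (1.0 - F(t_lim))

T_LIM = 4.0
old = clip_probability(mu=1.0, sigma=1.5, t_lim=T_LIM)   # old policy's torque distribution (assumed)
new = clip_probability(mu=2.0, sigma=1.5, t_lim=T_LIM)   # after a policy update (assumed)
print(f"old: {old:.4f}  new: {new:.4f}  predicted increase: {new - old:.4f}")
```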
