Deep Learning for Robotics
Pieter Abbeel
Reinforcement Learning (RL)
π_θ(a|s): probability of taking action a in state s
Robot + Environment
Goal: max_θ E[ Σ_{t=0}^{H} R(s_t) | π_θ ]
- Application domains: robotics, marketing / advertising, dialogue, optimizing operations / logistics, queue management
- Additional challenges: stability, credit assignment, exploration
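To make the objective above concrete, here is a minimal Monte Carlo sketch of estimating E[ Σ_{t=0}^{H} R(s_t) | π_θ ] by rolling out the stochastic policy and averaging returns; the `env.reset()` / `env.step()` and `policy(state)` interfaces are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def estimate_expected_return(env, policy, horizon, n_rollouts=100,
                             rng=np.random.default_rng(0)):
    """Monte Carlo estimate of E[ sum_{t=0}^{H} R(s_t) | pi_theta ]:
    sample actions from the stochastic policy pi_theta(a | s), step the
    environment, and average the summed rewards over rollouts."""
    returns = []
    for _ in range(n_rollouts):
        state = env.reset()
        total = 0.0
        for _ in range(horizon + 1):          # t = 0, ..., H
            action_probs = policy(state)      # pi_theta(a | s): distribution over actions
            action = rng.choice(len(action_probs), p=action_probs)
            state, reward, done = env.step(action)
            total += reward
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```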
Deep RL Success Stories
- DQN: Mnih et al., NIPS 2013 / Nature 2015
- Gu et al., NIPS 2014
- TRPO: Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015
- A3C: Mnih et al., 2016
- Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016
- Levine*, Finn*, Darrell, Abbeel, JMLR 2016
- Silver et al., Nature 2015
Speed of Learning
- Deep RL (DQN): score 18.9; experience measured in real time: 40 days ("slow")
- Human: score 9.3; experience measured in real time: 2 hours ("fast")
Starting Observations
- TRPO, DQN, A3C are fully general RL algorithms: for any MDP that can be mathematically defined, these algorithms are equally applicable.
- MDPs encountered in the real world are a tiny, tiny subset of all MDPs that could be defined.
- Can we design "fast" RL algorithms that take advantage of such knowledge?
Research Questions
- How to acquire a good prior for real-world MDPs? Or, for starters, for real-game MDPs?
- How to design algorithms that make use of such prior information?
Key idea: learn a fast RL algorithm that encodes this prior.
Formulation
- Given: a distribution over relevant MDPs.
- Train the fast RL algorithm to be fast on a training set of MDPs.
Learning the Fast RL Algorithm
- Representation of the fast RL algorithm:
  - RNN = generic computation architecture
  - Different weights in the RNN mean a different RL algorithm.
  - Different activations in the RNN mean a different current policy.
- Training setup: sketched below.
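A minimal sketch of such an RNN agent, assuming PyTorch and a discrete action space (all class and variable names are illustrative, not from the paper): at every time step it consumes the tuple (observation, previous action, previous reward, done flag), and its hidden state is carried across episodes within the same MDP, so the activations can accumulate what has been learned about that MDP so far.

```python
import torch
import torch.nn as nn


class RNNPolicy(nn.Module):
    """Recurrent policy whose hidden state persists across episodes of one MDP."""

    def __init__(self, obs_dim, n_actions, hidden_size=128):
        super().__init__()
        # Input = observation + one-hot previous action + previous reward + done flag.
        self.gru = nn.GRU(obs_dim + n_actions + 2, hidden_size, batch_first=True)
        self.policy_head = nn.Linear(hidden_size, n_actions)  # action logits
        self.value_head = nn.Linear(hidden_size, 1)            # baseline for the slow RL update
        self.n_actions = n_actions

    def forward(self, obs, prev_action, prev_reward, done, hidden):
        # obs: (batch, 1, obs_dim); prev_action: (batch, 1) long;
        # prev_reward, done: (batch, 1, 1); hidden: GRU state or None.
        prev_a = nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, prev_a, prev_reward, done], dim=-1)
        out, hidden = self.gru(x, hidden)
        return self.policy_head(out), self.value_head(out), hidden
```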
Alternative View on RL²
- The RNN is a policy for acting in a POMDP.
- Part of what is not observed in that POMDP is which MDP the agent is in.
Related Work
- Wang et al. (2016), Learning to Reinforcement Learn, in submission to ICLR 2017
- Chen et al. (2016), Learning to Learn for Global Optimization of Black Box Functions
- Andrychowicz et al. (2016), Learning to Learn by Gradient Descent by Gradient Descent
- Santoro et al. (2016), One-shot Learning with Memory-Augmented Neural Networks
- Larochelle et al. (2008), Zero-data Learning of New Tasks
- Younger et al. (2001), Meta Learning with Backpropagation
- Schmidhuber et al. (1996), Simple Principles of Metalearning
RL²: Fast RL by Slow RL
Key insights:
- We represent the AI agent as a recurrent neural net (RNN), i.e., the RNN is the "fast" RL algorithm; different weights in the RNN mean a different RL algorithm.
- To discover good weights for the RNN (i.e., to discover the fast RL algorithm), train with classical ("slow") RL, as sketched below.
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
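A sketch of the outer "slow RL" loop under stated assumptions: the paper trains the RNN weights with TRPO, but a plain REINFORCE-style update is shown here only to keep the sketch short; `sample_mdp`, the `env` interface, and `policy.step(...)` (a thin wrapper around an RNN agent like the one above, returning action logits, a value estimate, and the next hidden state) are illustrative, not the paper's API.

```python
import torch

def meta_train(policy, sample_mdp, optimizer, n_iterations=1000,
               episodes_per_trial=2, gamma=0.99):
    """Slow (outer-loop) RL: train the RNN weights so that, within a trial on a
    freshly sampled MDP, the hidden-state dynamics act as a fast learner."""
    for _ in range(n_iterations):
        env = sample_mdp()                  # draw an MDP from the training distribution
        hidden = None                       # hidden state persists across episodes in a trial
        log_probs, rewards = [], []
        prev_action, prev_reward, done = 0, 0.0, 1.0
        for _ in range(episodes_per_trial):
            obs, terminal = env.reset(), False
            while not terminal:
                logits, _, hidden = policy.step(obs, prev_action, prev_reward, done, hidden)
                dist = torch.distributions.Categorical(logits=logits)
                action = dist.sample()
                obs, reward, terminal = env.step(action.item())
                log_probs.append(dist.log_prob(action))
                rewards.append(reward)
                prev_action, prev_reward, done = action.item(), reward, float(terminal)
        # One return for the whole trial: good behaviour in later episodes is
        # credited to information-gathering (exploration) in earlier ones.
        trial_return = sum(r * gamma ** t for t, r in enumerate(rewards))
        loss = -torch.stack(log_probs).sum() * trial_return  # high-variance, no baseline
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```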
Evaluation: Multi-Armed Bandits
- Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, ... (a minimal Thompson sampling baseline is sketched below)
- 5-armed bandit (image source: eBay)
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
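For reference, a minimal Thompson sampling baseline for a Bernoulli bandit; the arm probabilities and horizon in the example are made-up values, not those of the paper's benchmark.

```python
import numpy as np

def thompson_sampling(arm_probs, horizon, rng=np.random.default_rng(0)):
    """Thompson sampling for a Bernoulli multi-armed bandit: keep a Beta
    posterior per arm, sample from each posterior, and pull the arm with the
    highest sample. One of the hand-designed baselines RL^2 is compared to."""
    n_arms = len(arm_probs)
    successes = np.ones(n_arms)   # Beta(1, 1) uniform prior
    failures = np.ones(n_arms)
    total_reward = 0.0
    for _ in range(horizon):
        samples = rng.beta(successes, failures)
        arm = int(np.argmax(samples))
        reward = float(rng.random() < arm_probs[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

# Example: a 5-armed bandit, as in the evaluation setting.
print(thompson_sampling([0.1, 0.3, 0.5, 0.7, 0.9], horizon=500))
```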
Evaluation: Tabular MDPs
- Provably (asymptotically) optimal algorithms: BEB, PSRL, UCRL2, ... (a minimal PSRL-style baseline is sketched below)
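For reference, a rough sketch of PSRL's per-episode planning step under simplifying assumptions (Dirichlet posterior over transitions, a crude Gaussian treatment of rewards); the prior and reward-model choices here are illustrative, not the published algorithm's details.

```python
import numpy as np

def psrl_episode_policy(trans_counts, reward_sums, reward_counts,
                        horizon, rng=np.random.default_rng(0)):
    """Posterior Sampling for RL (PSRL), one planning step: sample a tabular
    MDP from the posterior and return its optimal finite-horizon policy via
    backward value iteration. trans_counts has shape (S, A, S)."""
    n_states, n_actions, _ = trans_counts.shape
    # Sample transition probabilities from a Dirichlet posterior (prior pseudo-count 1).
    sampled_P = np.empty_like(trans_counts, dtype=float)
    for s in range(n_states):
        for a in range(n_actions):
            sampled_P[s, a] = rng.dirichlet(trans_counts[s, a] + 1.0)
    # Sample rewards around their empirical means (simplified reward posterior).
    mean_R = reward_sums / np.maximum(reward_counts, 1)
    sampled_R = rng.normal(mean_R, 1.0 / np.sqrt(np.maximum(reward_counts, 1)))
    # Finite-horizon value iteration on the sampled MDP.
    V = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in reversed(range(horizon)):
        Q = sampled_R + sampled_P @ V          # shape: (n_states, n_actions)
        policy[t] = np.argmax(Q, axis=1)
        V = np.max(Q, axis=1)
    return policy
```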
Evaluation: Visual Navigation (built on top of ViZDoom)
Figure panels: agent's view; small maze; large maze.
Evaluation: Visual Navigation
Before learning vs. after learning
Evaluation: Visual Navigation (built on top of ViZDoom)
Occasional "bad" behavior
OpenAI Universe
Outline
- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning (Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel)
- Third Person Imitation Learning (Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever)
- Variational Lossy Autoencoder (Xi Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, P. Abbeel)
- Deep Reinforcement Learning for Tensegrity Robot Locomotion (X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine)
Third Person Imitation Learning
- First-person imitation learning: demonstrate with the robot itself, e.g., drive a car, tele-operate a robot, etc.
- Third-person imitation learning: the robot watches demonstrations.
  - Challenges: different viewpoint, different incarnation (human vs. robot)
Third Person Imitation Learning
Example problem settings (figure panels): third-person view; robot environment.
Basic Ideas
- Generative Adversarial Imitation Learning (Ho et al., 2016; Finn et al., 2016)
  - Reward = defined by a learned classifier distinguishing expert from robot behavior → optimizing such a reward makes the robot perform like the expert → works well for first-person imitation learning.
  - BUT: in the third-person setting the classifier will simply identify expert vs. robot environment, and the robot can never match the expert.
- Domain confusion loss (Tzeng et al., 2015)
  - Deep-learn a feature representation from which it isn't possible to distinguish the environment.
  - BUT: competes too directly with the first objective.
- Fix: give the first objective multiple frames as input, i.e., let it see behavior. (A sketch of the combined loss follows.)
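A minimal sketch of how these two ideas can be combined, assuming PyTorch; the layer sizes, the gradient-reversal mechanism (in the spirit of domain-adversarial training), and all names are illustrative assumptions rather than the paper's exact model.

```python
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on the
    backward pass, so the shared features are trained to *confuse* the
    domain classifier."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class ThirdPersonDiscriminator(nn.Module):
    def __init__(self, feat_dim, n_frames=4, lam=0.5):
        super().__init__()
        self.lam = lam
        # Shared features over a short stack of frames (behavior, not a single image).
        self.features = nn.Sequential(
            nn.Linear(feat_dim * n_frames, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.expert_head = nn.Linear(128, 1)   # expert vs. novice behavior
        self.domain_head = nn.Linear(128, 1)   # expert env vs. robot env

    def forward(self, frame_stack):
        z = self.features(frame_stack.flatten(start_dim=1))
        expert_logit = self.expert_head(z)
        # The domain head trains normally, but through gradient reversal its
        # gradient pushes the shared features toward domain-invariance.
        domain_logit = self.domain_head(GradientReversal.apply(z, self.lam))
        return expert_logit, domain_logit


def discriminator_loss(expert_logit, domain_logit, is_expert, is_expert_domain):
    bce = nn.functional.binary_cross_entropy_with_logits
    return bce(expert_logit, is_expert) + bce(domain_logit, is_expert_domain)
```

The imitation reward fed to the policy optimizer would then be derived from `expert_logit`, while the weight `lam` trades off how strongly the shared features are pushed toward domain confusion, the λ swept in the hyperparameter study later in the talk.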
Architecture
Learning Curves
Domain Classification Accuracy
Does the algorithm we propose benefit from both domain confusion and the multi-time-step input?
How sensitive is our proposed algorithm to the selection of hyperparameters used in deployment? (Domain confusion weight λ)
How sensitive is our proposed algorithm to the selection of hyperparameters used in deployment? (Number of lookahead frames)
Results
Video panels: third-person view of the expert; the imitator.
Outline
- RL²: Fast Reinforcement Learning via Slow Reinforcement Learning (Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel)
- Third Person Imitation Learning (Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever)
- Probabilistically Safe Policy Transfer (David Held, Zoe McCarthy, Michael Zhang, Fred Shentu, Pieter Abbeel)
- Deep Reinforcement Learning for Tensegrity Robot Locomotion (X. Geng*, M. Zhang*, J. Bruce*, K. Caluwaerts, M. Vespignani, V. SunSpiral, P. Abbeel, S. Levine)
Risky Robotics Tasks
- Autonomous driving
- Robots interacting around / with people
- Robots manipulating fragile objects
- The robot can damage itself
Question: how to train a robot for these tasks?
Previous Approaches
- Isolated training environment (e.g., a cage)
  - May not represent the test environment
  - No human interaction / collaboration in an isolated environment
- Train in simulation
  - May not represent the test environment
- Watch the robot carefully and try to press the kill switch in time
  - Requires careful observation and prediction of robot actions by a human; may not react fast enough
Our Approach
- Operate in the test environment initially with low torques: use the lowest torques at which the task can still be completed, safely but slowly.
- Assumption: low torques are safer but less efficient.
- Increase the torque limit as the robot demonstrates that it can operate safely.
- How do we define safety?
How to Define Safety?
- Expected damage: D_safe defines our "safety budget", i.e., how much expected damage we can afford.
- Example for an autonomous car: low risk of hitting a pedestrian at low speeds, and an even lower risk of killing a pedestrian.
How to Define Safety?
- Higher torques -> lower time per task -> more benefit.
- Higher torques -> more damage.
- Torques are clipped at T_lim (w.l.o.g. assume α = 1).
- Assume a binary notion of being unsafe, with an associated probability of failure to be safe.
How to Define Safety?
- Expected damage: high probability of failure -> low torques; low probability of failure -> higher torques. (A sketch of this trade-off follows.)
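A minimal sketch of the intuition, under an assumed (illustrative) damage model in which expected damage scales with both the failure probability and the torque limit; the function and parameter names are hypothetical, not the paper's formulation.

```python
def torque_limit_for_budget(p_fail, d_safe, damage_per_unit_torque=1.0,
                            t_min=0.1, t_max=10.0):
    """Pick the largest torque limit whose expected damage stays within the
    safety budget, assuming (illustratively) that damage on failure grows
    linearly with the torque limit:
        E[damage] = p_fail * damage_per_unit_torque * T_lim <= D_safe
    """
    if p_fail <= 0.0:
        return t_max                         # no estimated risk: allow full torque
    t_lim = d_safe / (p_fail * damage_per_unit_torque)
    return max(t_min, min(t_lim, t_max))     # clip to the feasible actuator range

# Example: a high estimated failure probability forces a conservative limit.
print(torque_limit_for_budget(p_fail=0.5, d_safe=1.0))   # -> 2.0
print(torque_limit_for_budget(p_fail=0.05, d_safe=1.0))  # -> 10.0 (capped at t_max)
```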
Overall Algorithm
Predicting Failure Increases
- Effect from adjusting the torque limit
- Effect from updating the policy
Predicting Failure Increases
- Effect from changing the torque limit: the lower and upper tails of the commanded-torque distribution, F(-T_lim) and 1 - F(T_lim).
- Effect from updating the policy.
(Figure: PDF of commanded torque under the old and new policies and their intersection; a sketch of the tail computation follows.)
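A small sketch of the first term, assuming (for illustration only) a Gaussian torque policy: the chance of hitting the clipping limits is the lower tail F(-T_lim) plus the upper tail 1 - F(T_lim) of the commanded-torque distribution.

```python
import math

def normal_cdf(x, mean, std):
    """Gaussian CDF, playing the role of F in F(-T_lim) and 1 - F(T_lim)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def clip_probability(mean, std, t_lim):
    """Probability that the (pre-clipping) commanded torque falls outside
    [-T_lim, T_lim]: lower tail F(-T_lim) plus upper tail 1 - F(T_lim)."""
    lower_tail = normal_cdf(-t_lim, mean, std)
    upper_tail = 1.0 - normal_cdf(t_lim, mean, std)
    return lower_tail + upper_tail

# Example: comparing an old and a new (shifted) policy at the same limit,
# as in the figure with the two overlapping torque PDFs.
print(clip_probability(mean=1.0, std=1.5, t_lim=4.0))  # old policy
print(clip_probability(mean=2.0, std=1.5, t_lim=4.0))  # new policy: more tail mass
```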