Deep Learning for Robotics
Pieter Abbeel, UC Berkeley / OpenAI / Gradescope
Outline
- Some deep learning successes
- Deep reinforcement learning
- Current directions
Object Detection in Computer Vision
- State-of-the-art object detection until 2012: input image → hand-engineered features (SIFT, HOG, DAISY, ...) → Support Vector Machine (SVM) → "cat", "dog", "car", ...
- Deep supervised learning (Krizhevsky, Sutskever, Hinton 2012; also LeCun, Bengio, Ng, Darrell, ...): input image → 8-layer neural network with 60 million parameters to learn → "cat", "dog", "car", ...
- Trained on ~1.2 million training images from ImageNet [Deng, Dong, Socher, Li, Li, Fei-Fei, 2009]
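For concreteness, a minimal PyTorch sketch of training a network of this kind. It uses torchvision's AlexNet-style model (roughly 60 million parameters, 5 conv + 3 fully connected layers) with a dummy batch standing in for ImageNet data; hyperparameters are illustrative only.

```python
# Minimal sketch: an AlexNet-style image classifier in PyTorch.
# Layer sizes follow torchvision's AlexNet; training details are illustrative only.
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(num_classes=1000)          # ~61M parameters, 5 conv + 3 fc layers
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One illustrative training step on a dummy batch of 224x224 RGB images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))
logits = model(images)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```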
[Figure: ImageNet classification performance by year, with AlexNet marking the 2012 deep learning jump; graph credit Matt Zeiler, Clarifai]
Speech Recognition
[Figure: performance graph; credit Matt Zeiler, Clarifai]
MS COCO Image Captioning Challenge (Karpathy & Fei-Fei, 2015; Donahue et al., 2015; Xu et al., 2015; many more)
Visual QA Challenge (Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh)
Unsupervised Learning
- Variational Autoencoders [Kingma and Welling, 2014]
  - DRAW [Gregor et al., 2015]
  - ...
- Generative Adversarial Networks [Goodfellow et al., 2014]
  - DC-GAN [Radford, Metz, Chintala, 2016]
  - InfoGAN [Chen, Duan, Houthooft, Schulman, Sutskever, Abbeel, 2016]
  - ...
- Pixel RNN [van den Oord et al., 2016]
  - Pixel CNN [van den Oord et al., 2016]
  - ...
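To make the GAN line of work concrete, a minimal sketch of one adversarial training step. The MLP generator and discriminator, their sizes, the optimizers, and the stand-in "real" batch are all illustrative choices, not any particular published model.

```python
# Minimal sketch of GAN training (Goodfellow et al., 2014): alternate discriminator
# and generator updates. Architectures and hyperparameters here are toy choices.
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, img_dim) * 2 - 1            # stand-in for a batch of real images
z = torch.randn(32, latent_dim)
fake = G(z)

# Discriminator step: push real images toward label 1, generated images toward 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator (non-saturating loss).
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```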
Image Generation – DC-GAN
Training (Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, 2016)
Comparison with Real Images (Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, 2016)
InfoGAN [Chen, Duan, Houthooft, Schulman, Sutskever, Abbeel, 2016]
Robotics
- Current state-of-the-art robotics: percepts → hand-engineered state estimation → hand-tuned (or learned) policy class with roughly 10 free parameters → hand-engineered control → motor commands
- Deep reinforcement learning: percepts → many-layer neural network with many parameters to learn → motor commands
Deep Learning for Estimation
- SE3 Nets [Byravan, Fox, 2016]
- Deep Tracking [Ondruska, Posner, 2016]
- Backprop KF [Haarnoja, Ajay, Levine, Abbeel, 2016]
- Structured Variational Autoencoders [Johnson, Duvenaud, Wiltschko, Datta, Adams, 2016]
Deep Estimation for Grasping/Control
- DeepMPC [Lenz, Knepper, Saxena, RSS 2015]
- Deep Learning for Detecting Robotic Grasps [Lenz, Lee, Saxena, RSS 2013]
- Dexnet Grasp Transfer [Mahler, …, Goldberg, 2015]
- Big Data for Grasp Planning [Kappler, Bohg, Schaal, 2015]
Deep Reinforcement Learning (RL)
$\pi_\theta(a \mid s)$: probability of taking action $a$ in state $s$ (robot + environment)
Goal: $\max_\theta \; \mathbb{E}\big[\sum_{t=0}^{H} R(s_t) \mid \pi_\theta\big]$
Additional challenges:
- Stability
- Credit assignment
- Exploration
From Pixels to Actions? Pong, Enduro, Beamrider, Q*bert
Deep Q-Network (DQN): From Pixels to Joystick Commands
- 32 8x8 filters with stride 4 + ReLU
- 64 4x4 filters with stride 2 + ReLU
- 64 3x3 filters with stride 1 + ReLU
- fully connected, 512 units + ReLU
- fully connected output units, one per action
[Source: Mnih et al., Nature 2015 (DeepMind)]
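The layer sizes above can be written down directly; a minimal PyTorch sketch of this convolutional Q-network, assuming the standard DQN preprocessing of 4 stacked 84x84 grayscale frames as input (the action count below is illustrative).

```python
# Sketch of the DQN convolutional network described above (Mnih et al., Nature 2015):
# input is a stack of 4 preprocessed 84x84 grayscale frames; output is one Q-value per action.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 32 8x8 filters, stride 4
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 64 4x4 filters, stride 2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 64 3x3 filters, stride 1
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # fully connected, 512 units
            nn.Linear(512, num_actions),                            # one output per action
        )

    def forward(self, x):
        return self.net(x)

q_values = DQN(num_actions=6)(torch.randn(1, 4, 84, 84))   # e.g. 6 joystick actions
```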
[Source: Mnih et al., Nature 2015 (DeepMind)]
How About Continuous Control, e.g., Locomotion?
- Robot models in physics simulator (MuJoCo, from Emo Todorov)
- Input: joint angles and velocities
- Output: joint torques
- Neural network architecture: input layer (joint angles and kinematics) → fully connected layer (30 units) → mean parameters and standard deviations → sampling layer → control output
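A minimal PyTorch sketch of such a Gaussian policy: a small fully connected network produces the mean torques, learned standard deviations define the action distribution, and a sampling step draws the control. The layer sizes only loosely follow the slide's architecture, and the observation/action dimensions are illustrative.

```python
# Minimal sketch of a Gaussian MLP policy for continuous control: the network maps
# joint angles/velocities to mean torques, with state-independent log standard deviations.
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=30):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))   # mean torques
        self.log_std = nn.Parameter(torch.zeros(act_dim))           # learned std deviations

    def forward(self, obs):
        mean = self.mean_net(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()                                       # sampling layer
        return action, dist.log_prob(action).sum(-1)

policy = GaussianMLPPolicy(obs_dim=20, act_dim=6)   # dimensions are illustrative
action, logp = policy(torch.randn(20))
```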
Challenges with Q-Learning
- How to score every possible action?
- How to ensure monotonic progress?
Policy Optimization
$\max_\theta \; \mathbb{E}\big[\sum_{t=0}^{H} R(s_t) \mid \pi_\theta\big]$
- Often simpler to represent good policies than good value functions
- The true objective, expected return, is optimized directly (vs. a surrogate like Bellman error)
- Existing work: (natural) policy gradients
- Challenges: finding good, large step directions
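As a concrete instance of the likelihood-ratio policy gradient mentioned here, a minimal PyTorch sketch of one vanilla policy-gradient update on a toy rollout; the policy class, dimensions, and data are illustrative stand-ins.

```python
# Minimal sketch of a likelihood-ratio ("vanilla") policy gradient step on one rollout:
# grad of expected return ~= sum_t grad log pi_theta(a_t | s_t) * (return from time t).
import torch

obs_dim, act_dim = 4, 2
policy = torch.nn.Linear(obs_dim, act_dim)          # logits of a categorical policy
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

states = torch.randn(10, obs_dim)                   # stand-in rollout of length 10
rewards = torch.rand(10)
dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()

returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])  # reward-to-go
loss = -(dist.log_prob(actions) * returns).mean()   # negative surrogate objective
optimizer.zero_grad(); loss.backward(); optimizer.step()
```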
Trust Region Policy Optimization [Schulman, Levine, Moritz, Jordan, Abbeel, 2015]
Objective: $\max_\theta \; \mathbb{E}\big[\sum_{t=0}^{H} R(s_t) \mid \pi_\theta\big]$
Step: $\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad \mathrm{KL}\big(P(\tau;\theta) \,\|\, P(\tau;\theta+\delta\theta)\big) \le \varepsilon$
- Trust region
- Surrogate loss
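The constrained step above is commonly approximated by linearizing the objective and taking a quadratic approximation of the KL term, which gives the natural-gradient-style update sketched below. The gradient and Fisher matrix here are random stand-ins; TRPO itself additionally uses conjugate gradients and a backtracking line search rather than an explicit matrix solve.

```python
# Minimal sketch of the trust-region step: maximize g^T dtheta subject to the quadratic
# KL approximation 0.5 * dtheta^T F dtheta <= eps, giving
# dtheta = sqrt(2*eps / (g^T F^{-1} g)) * F^{-1} g.
import numpy as np

def trust_region_step(g, F, eps=0.01):
    Finv_g = np.linalg.solve(F, g)
    step_size = np.sqrt(2.0 * eps / (g @ Finv_g))
    return step_size * Finv_g

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
F = A @ A.T + 1e-3 * np.eye(5)       # symmetric positive definite Fisher stand-in
g = rng.normal(size=5)               # policy gradient estimate stand-in
print(trust_region_step(g, F))
```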
Generalized Advantage Estimation (GAE)
Objective: $\max_\theta \; \mathbb{E}\big[\sum_{t=0}^{H} R(s_t) \mid \pi_\theta\big]$
Gradient: $\mathbb{E}\Big[\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(\sum_{k=t}^{H} R(s_k) - V(s_t)\big)\Big]$, where $\sum_{k=t}^{H} R(s_k) - V(s_t)$ is a single-sample estimate of the advantage
Generalized Advantage Estimation:
- Exponential interpolation between actor-critic and Monte Carlo estimates
- Trust region approach to (high-dimensional) value function estimation
[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]
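A minimal NumPy sketch of the GAE(λ) recursion, which exponentially interpolates between actor-critic (λ=0) and Monte Carlo (λ=1) advantage estimates; the rewards, value estimates, and hyperparameters below are illustrative stand-ins.

```python
# Minimal sketch of generalized advantage estimation: an exponentially weighted
# sum of TD residuals, computed backward over one trajectory.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has one extra entry for the state after the last reward."""
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.3, 0.6, 0.0])   # V(s_0..s_4), stand-in critic outputs
print(gae(rewards, values))
```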
Learning Locomotion [Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]
Atari Games
- Deep Q-Network (DQN) [Mnih et al., 2013/2015]
- DAgger with Monte Carlo Tree Search [Xiao-Xiao et al., 2014]
- Trust Region Policy Optimization (TRPO) [Schulman, Levine, Moritz, Jordan, Abbeel, 2015]
- A3C [Mnih et al., 2016]
Pong, Enduro, Beamrider, Q*bert
Deep RL Benchmarking
- Tasks
- Algorithms
- Experimental setup
Deep RL Benchmarking -- Tasks
1. Basic tasks
2. Locomotion
3. Hierarchical
4. Partially observable: sensing, delayed action, sysID
5. Driving…
Deep RL Benchmarking -- Algorithms
- REINFORCE
- Truncated Natural Policy Gradient
- Reward-Weighted Regression (RWR)
- Relative Entropy Policy Search (REPS)
- Trust Region Policy Optimization (TRPO)
- Cross-Entropy Method (CEM)
- Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
- Deep Deterministic Policy Gradient (DDPG)
- ...
Benchmarking [Duan et al., ICML 2016]
rllab [Duan et al.]
OpenAI Gym
How About Real Robotic Visuo-Motor Skills?
Guided Policy Search: a general-purpose neural network controller
- Complex dynamics, complex policy, policy search (RL): HARD
- Complex dynamics, complex policy, supervised learning: EASY
- Complex dynamics, complex policy, trajectory optimization: EASY
Pipeline: trajectory optimization → supervised learning
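To make the trajectory-optimization-as-supervision idea concrete, a heavily simplified, self-contained Python schematic. The `trajectory_optimization` routine, the toy dynamics, and the linear least-squares policy fit are all illustrative stand-ins, not the actual guided policy search algorithm.

```python
# Schematic of the guided policy search idea: trajectory optimization produces good
# state-action pairs under known dynamics, and the policy is then fit to them with
# supervised learning. All functions here are simplified stand-ins.
import numpy as np

def trajectory_optimization(x0, horizon=20):
    """Stand-in optimizer: returns a sequence of (state, action) pairs."""
    states, actions = [x0], []
    for t in range(horizon):
        a = -0.5 * states[-1]                 # pretend optimal feedback action
        actions.append(a)
        states.append(states[-1] + a)         # pretend dynamics
    return np.array(states[:-1]), np.array(actions)

# Collect supervision from several optimized trajectories...
X, U = [], []
for _ in range(10):
    s, a = trajectory_optimization(np.random.randn(4))
    X.append(s); U.append(a)
X, U = np.concatenate(X), np.concatenate(U)

# ...then fit the policy by supervised regression (here: linear least squares).
W, *_ = np.linalg.lstsq(X, U, rcond=None)
print("policy fit residual:", np.linalg.norm(X @ W - U))
```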
Instrumented Training: training time vs. test time
Deep Spatial Neural Net Architecture: $\pi_\theta$ (92,000 parameters) [Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
Experimental Tasks [Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
Learning [Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
Learned Skills [Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
Experiments: Learned Neural Network Policy [Khan, Zhang, Levine, Abbeel 2016]
Frontiers: DATA | TRANSFER | EXPLORATION | REWARD | HIERARCHY
- Supersizing Self-Supervision: Learning to Grasp from 50K Tries and 700 Robot Hours [Pinto, Gupta, ICRA 2016]
- Learning Hand-Eye Coordination with Deep Learning and Large-Scale Data Collection [Pastor, Krizhevsky, Quillen, Levine, 2016]
- Learning to Poke by Poking: Experiential Learning of Intuitive Physics [Agrawal, Nair, Abbeel, Malik, Levine, 2016]