

  1. Making Robots Learn Pieter Abbeel -- UC Berkeley EECS

  2. Object Detection in Computer Vision. State-of-the-art object detection until 2012: input image → hand-engineered features (SIFT, HOG, DAISY, …) → machine learning (SVM) → “cat”, “dog”, “car”, … Deep supervised learning (Krizhevsky, Sutskever, Hinton 2012; also LeCun, Bengio, Ng, Darrell, …): input image → 8-layer neural network with 60 million parameters to learn → “cat”, “dog”, “car”, … Trained on ~1.2 million training images from ImageNet [Deng, Dong, Socher, Li, Li, Fei-Fei, 2009].
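
For concreteness, here is a minimal sketch of the second pipeline: supervised training of a convolutional network that maps raw images to class labels. The framework (PyTorch), layer sizes, and random data are illustrative placeholders, not the 8-layer, 60-million-parameter network from the slide.

```python
# Toy sketch of the "deep supervised learning" pipeline: raw image in, class label out.
# Layer sizes are illustrative only; the network on the slide (AlexNet) is far larger.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, num_classes=3):            # e.g. "cat", "dog", "car"
        super().__init__()
        self.features = nn.Sequential(             # learned features replace SIFT/HOG/DAISY
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(num_classes),            # logits, one per class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 64, 64)                 # stand-in for an ImageNet minibatch
labels = torch.randint(0, 3, (8,))
loss = loss_fn(model(images), labels)              # supervised objective
optimizer.zero_grad(); loss.backward(); optimizer.step()
```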

  3.–7. Performance [graph, built up over five slides, with AlexNet highlighted on the later ones]. Graph credit: Matt Zeiler, Clarifai.

  8. Speech Recognition [graph]. Graph credit: Matt Zeiler, Clarifai.

  9. History: Is deep learning 3, 30, or 60 years old? Rosenblatt’s Perceptron; (Olshausen, 1996); 2000s sparse, probabilistic, and energy models (Hinton, Bengio, LeCun, Ng). Based on a history by K. Cho.

  10. What’s Changed. Data: 1.2M training examples, ×2048 (different crops), ×90 (PCA re-colorings). Nonlinearity: sigmoid → ReLU. Regularization: drop-out. Compute power: two NVIDIA GTX 580 GPUs, 5-6 days of training time. Exploration of model structure. Optimization know-how.

  11. Object Detection in Computer Vision (shown again, as on slide 2): hand-engineered features (SIFT, HOG, DAISY, …) + SVM until 2012, versus deep supervised learning (Krizhevsky, Sutskever, Hinton 2012): an 8-layer neural network with 60 million parameters trained on ~1.2 million ImageNet images [Deng, Dong, Socher, Li, Li, Fei-Fei, 2009].

  12. Robotics. Current state-of-the-art robotics: percepts → hand-engineered state estimation → hand-tuned (or learned) policy class with 10-ish free parameters → hand-engineered control → motor commands. Deep reinforcement learning: percepts → many-layer neural network with many parameters to learn → motor commands.

  13. Reinforcement Learning. π_θ(a|s): probability of taking action a in state s (robot + environment). Goal: max_θ E[ Σ_{t=0}^{H} R(s_t) | π_θ ].
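
A minimal sketch of what this objective means in practice: estimate E[ Σ_t R(s_t) | π_θ ] by rolling out the stochastic policy and averaging the returns. The tabular softmax policy, toy dynamics, and reward below are placeholders, not anything from the talk.

```python
# Monte Carlo estimate of E[ sum_{t=0}^{H} R(s_t) | pi_theta ] for a toy policy/environment.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 5, 2, 10
theta = np.zeros((n_states, n_actions))           # policy parameters

def pi(state):                                     # pi_theta(a | s): softmax over actions
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def step(state, action):                           # placeholder dynamics and reward
    next_state = (state + action) % n_states
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

def estimate_return(n_rollouts=1000):
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = 0, 0.0
        for t in range(horizon):                   # accumulate sum_{t=0}^{H} R(s_t)
            a = rng.choice(n_actions, p=pi(s))
            s, r = step(s, a)
            ret += r
        total += ret
    return total / n_rollouts                      # sample mean approximates the expectation

print(estimate_return())
```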

  14. From Pixels to Actions? Pong, Enduro, Beamrider, Q*bert.

  15. Deep Q-Network (DQN): From Pixels to Joystick Commands. Architecture: 32 8x8 filters with stride 4 + ReLU; 64 4x4 filters with stride 2 + ReLU; 64 3x3 filters with stride 1 + ReLU; fully connected, 512 units + ReLU; fully connected output units, one per action. [Source: Mnih et al., Nature 2015 (DeepMind)]
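
A sketch of that Q-network, with the layer sizes taken from the slide. PyTorch is a framework choice assumed here, and the 84x84, 4-frame stacked input (which makes the final feature map 7x7) follows the preprocessing in Mnih et al. rather than anything stated above.

```python
# Convolutional Q-network: stacked game frames in, one Q-value per joystick action out.
import torch.nn as nn

def dqn(num_actions):
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 32 8x8 filters, stride 4
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 64 4x4 filters, stride 2
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 64 3x3 filters, stride 1
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # fully connected, 512 units
        nn.Linear(512, num_actions),                             # one output per action
    )
```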

  16. [ Source: Mnih et al., Nature 2015 (DeepMind) ]

  17. Deep Q-Network (DQN). Approach: Q-learning with ε-greedy exploration and a deep network as function approximator. Key idea 1, stabilizing Q-learning: mini-batches of size 32 (vs. single-sample updates); the Q-values used to compute the temporal difference are only updated every 10,000 updates. Key idea 2, lots of data / compute: trained for a total of 50 million frames (= 38 days of game experience), using a replay memory of the one million most recent frames.
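
A compact sketch of the two stabilization ideas: minibatches of 32 sampled from a large replay memory, and a separate "target" network that is only refreshed every 10,000 updates. The tiny fully connected Q-network, random transitions, optimizer, and learning rate are placeholders, not the Atari setup.

```python
# Sketch: replay-memory minibatch updates + periodically refreshed target network.
import random
from collections import deque
import torch
import torch.nn as nn

num_actions, gamma = 4, 0.99
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)   # placeholder hyperparameters
replay = deque(maxlen=1_000_000)                  # one million most recent transitions

def train_step(step_count, batch_size=32):
    batch = random.sample(replay, batch_size)     # minibatch vs. single-sample updates
    s, a, r, s2, done = map(torch.stack, zip(*batch))
    with torch.no_grad():                         # TD target computed with the frozen network
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step_count % 10_000 == 0:                  # refresh the target network periodically
        target_net.load_state_dict(q_net.state_dict())

for _ in range(1000):                             # fill with dummy transitions for illustration
    replay.append((torch.randn(8), torch.tensor(random.randrange(num_actions)),
                   torch.tensor(1.0), torch.randn(8), torch.tensor(0.0)))
train_step(step_count=1)
```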

  18. How About Continuous Control, e.g., Locomotion? Robot models in a physics simulator (MuJoCo, from Emo Todorov). Input: joint angles and velocities (joint angles and kinematics). Output: joint torques (control). Neural network architecture: input layer → fully connected layer, 30 units → mean parameters and standard deviations → sampling layer.
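
A sketch of that Gaussian policy: observations pass through one 30-unit hidden layer to a mean, per-dimension standard deviations are separate learned parameters, and torques are sampled from the resulting Gaussian. The Tanh hidden nonlinearity and the input/output sizes are assumptions; they depend on the MuJoCo model used.

```python
# Gaussian policy for continuous control: obs -> 30-unit hidden layer -> mean; learned std; sample.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=30):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))    # mean parameters
        self.log_std = nn.Parameter(torch.zeros(act_dim))            # standard deviations

    def forward(self, obs):
        mean = self.mean_net(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()                                        # sampling layer
        return action, dist.log_prob(action).sum(-1)                  # torques, log pi(a|s)

policy = GaussianPolicy(obs_dim=10, act_dim=3)
obs = torch.randn(10)                                                 # joint angles + velocities
torques, logp = policy(obs)
```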

  19. Challenges with Q-Learning: How to score every possible action? How to ensure monotonic progress?

  20. Policy Optimization. Objective: max_θ E[ Σ_{t=0}^{H} R(s_t) | π_θ ]. Often simpler to represent good policies than good value functions. The true objective (expected return) is optimized directly, rather than a surrogate like the Bellman error. Existing work: (natural) policy gradients. Challenge: taking good, large steps.
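
For reference, a minimal sketch of a vanilla (non-natural) policy gradient step on this objective: roll out the current policy, then ascend E[ Σ_t ∇_θ log π_θ(a_t|s_t) · return ]. The small categorical policy and the random placeholder dynamics below are illustrative only.

```python
# REINFORCE-style policy gradient step on a toy categorical policy.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def rollout(horizon=20):
    logps, rewards, obs = [], [], torch.randn(4)
    for _ in range(horizon):
        dist = torch.distributions.Categorical(logits=policy(obs))
        a = dist.sample()
        logps.append(dist.log_prob(a))
        obs, r = torch.randn(4), float(a == 0)     # placeholder dynamics and reward
        rewards.append(r)
    return torch.stack(logps), torch.tensor(rewards)

logps, rewards = rollout()
loss = -(logps * rewards.sum()).sum()               # negative policy-gradient surrogate
optimizer.zero_grad(); loss.backward(); optimizer.step()
```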

  21. Trust Region Policy Optimization. Objective: max_θ E[ Σ_{t=0}^{H} R(s_t) | π_θ ]. Step: max_{δθ} ĝ^T δθ subject to KL( P(τ; θ) || P(τ; θ + δθ) ) ≤ ε. Why a trust region: the gradient is evaluated from samples; the gradient is only locally a good approximation; a change in policy changes the state-action visitation frequencies. [Schulman, Levine, Moritz, Jordan, Abbeel, 2015]
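
One standard way to turn the constrained step into a concrete direction (a natural-gradient view from the trust-region literature, not spelled out on the slide): expand the KL divergence to second order in δθ, which yields a quadratic constraint whose maximizer is a scaled natural gradient.

```latex
% Second-order expansion of the KL constraint around \theta, with Fisher matrix F.
\mathrm{KL}\!\left(P(\tau;\theta)\,\|\,P(\tau;\theta+\delta\theta)\right)
  \approx \tfrac{1}{2}\,\delta\theta^{\top} F\, \delta\theta,
\qquad
F = \mathbb{E}\!\left[\nabla_\theta \log P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)^{\top}\right].

% The step that maximizes \hat{g}^{\top}\delta\theta subject to
% \tfrac{1}{2}\,\delta\theta^{\top} F\,\delta\theta \le \varepsilon is a scaled natural gradient:
\delta\theta^{*} \;=\; \sqrt{\tfrac{2\varepsilon}{\hat{g}^{\top} F^{-1} \hat{g}}}\; F^{-1}\hat{g}.
```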

  22. Experiments in Locomotion [Schulman, Levine, Abbeel]

  23. Learning Curves -- Comparison

  24. Learning Curves -- Comparison

  25. Atari Games. Deep Q-Network (DQN) [Mnih et al., 2013/2015]; DAgger with Monte Carlo Tree Search [Xiao-Xiao et al., 2014]; Trust Region Policy Optimization [Schulman, Levine, Moritz, Jordan, Abbeel, 2015]. Pong, Enduro, Beamrider, Q*bert.

  26. Generalized Advantage Estimation (GAE). Objective: max_θ E[ Σ_{t=0}^{H} R(s_t) | π_θ ]. Gradient: E[ Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) ( Σ_{k=t}^{H} R(s_k) − V(s_t) ) ], where Σ_{k=t}^{H} R(s_k) − V(s_t) is a single-sample estimate of the advantage. GAE: exponential interpolation between actor-critic and Monte Carlo estimates; trust-region approach to (high-dimensional) value function estimation. [Schulman, Moritz, Levine, Jordan, Abbeel, 2015]
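
A short sketch of the estimator from the cited GAE paper: one-step TD residuals are exponentially averaged with weight (γ·λ), so λ interpolates between the actor-critic estimate (λ = 0) and the Monte Carlo estimate (λ = 1). The γ and λ values and the random rewards/value predictions below are placeholders.

```python
# Generalized advantage estimation: exponentially weighted sums of TD residuals.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length-H array; values: length-(H+1) array of V(s_t) predictions."""
    deltas = rewards + gamma * values[1:] - values[:-1]    # TD residuals delta_t
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):                # A_t = sum_l (gamma*lam)^l delta_{t+l}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.random.rand(10)
values = np.random.rand(11)
print(gae(rewards, values))
```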

  27. Learning Locomotion [Schulman, Moritz, Levine, Jordan, Abbeel, 2015]

  28. In Contrast: DARPA Robotics Challenge

  29. How About Real Robotic Visuo-Motor Skills?

  30. Guided Policy Search: training a general-purpose neural network controller. Complex dynamics + complex policy via policy search (RL): HARD. Complex policy via supervised learning: EASY. Complex dynamics via trajectory optimization: EASY. Idea: use trajectory optimization to generate supervision, then train the network policy with supervised learning; a sketch of this decomposition follows.
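
A very simplified, hypothetical sketch of the decomposition in the diagram: trajectory optimization supplies state-action training pairs, and a policy is fit to them by supervised learning. The real method (Levine & Abbeel, NIPS 2014) also keeps the optimized trajectories and the policy in agreement; that constraint is omitted here, the "trajectory optimizer" and "supervised fit" are stubs, and a linear map stands in for the neural network.

```python
# Stand-in for guided policy search: trajectory optimization provides labels,
# supervised learning fits a policy to them. Everything below is a placeholder.
import numpy as np

rng = np.random.default_rng(0)

def optimize_trajectory(x0, horizon=20):
    """Stub for trajectory optimization: returns (state, action) pairs from x0."""
    states = x0 + 0.1 * rng.standard_normal((horizon, x0.size))
    actions = -states                                  # pretend the optimizer found a linear law
    return list(zip(states, actions))

def supervised_fit(dataset):
    """Stub for supervised learning: least-squares fit of a linear policy u = K x."""
    X = np.array([x for x, _ in dataset])
    U = np.array([u for _, u in dataset])
    K, *_ = np.linalg.lstsq(X, U, rcond=None)
    return K

dataset = []
for x0 in [rng.standard_normal(4) for _ in range(5)]:  # a few initial conditions
    dataset += optimize_trajectory(x0)                 # trajectory optimization: EASY
K = supervised_fit(dataset)                            # supervised learning: EASY
print(K.shape)
```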

  31. [Levine & Abbeel, NIPS 2014]

  32. Guided Policy Search

  33. Comparison

  34. Block Stacking – Learning the Controller for a Single Instance

  35. Linear-Gaussian Controller Learning Curves

  36. Instrumented Training: training time vs. test time.

  37. Architecture (92,000 parameters) [Levine*, Finn*, Darrell, Abbeel, 2015; TR at rll.berkeley.edu/deeplearningrobotics]

  38. Experimental Tasks

  39. Learning

  40. Learned Skills [Levine*, Finn*, Darrell, Abbeel, 2015; TR at rll.berkeley.edu/deeplearningrobotics]

  41. Comparisons: end-to-end training; pose prediction (trained on pose only); pose features (trained on pose only).

  42. Comparisons (success rates):
      Task               | pose prediction | pose features | end-to-end training
      coat hanger        | 55.6%           | 88.9%         | 100%
      shape sorting cube | 0%              | 70.4%         | 96.3%
      toy claw hammer    | 8.9%            | 62.2%         | 91.1%
      bottle cap         | n/a             | 55.6%         | 88.9%
      (Slide annotations: “2 cm”; Meeussen et al., Willow Garage.)

  43. Visuomotor Learning Directly in Visual Space? Provide an image that defines the goal; train the controller in visual feature space. [Finn, Tan, Duan, Darrell, Levine, Abbeel, 2015]

  44. Visuomotor Learning Directly in Visual Space: 1. Set target end-effector pose. 2. Train exploratory non-vision controller. 3. Learn visual features with the collected images. 4. Provide an image that defines the goal features. 5. Train the final controller in visual feature space. [Finn, Tan, Duan, Darrell, Levine, Abbeel, 2015]

  45. Visuomotor Learning Directly in Visual Space [Finn, Tan, Duan, Darrell, Levine, Abbeel, 2015]

  46. Frontiers: Applications. Vision-based flight; locomotion; manipulation; natural language interaction; dialogue; program analysis.

  47. Frontiers: Foundations. Shared and transfer learning; memory; estimation; temporal hierarchy / goal setting; exploration; tools / experimentation: stochastic computation graphs, Computation Graph Toolkit (CGT).
