Large-Scale Self-supervised Robot Learning with GPU-enabled Video-Prediction Models. Frederik Ebert, Chelsea Finn, Alex Lee, Sergey Levine. NVIDIA GTC 2018.
Typical Bar in 20?? [figure labels: 1969 Stanford Arm; 2015 DARPA Robotics Challenge]
Humans have excellent mental models of physical objects.
How can robots acquire general models and skills using large amounts of autonomously collected data?
Related work on self-supervised learning: Levine et al. 2016; Pinto & Gupta 2015; Gandhi et al. 2017. Predict raw sensory inputs instead of binary events.
Related work on video prediction and Visual Model-Predictive Control: Oh et al. 2015; Mathieu et al. 2016; Finn & Levine 2017; Byravan et al. 2017.
Visual Model-Predictive Control [Finn et al. 2017]. Diagram labels: User Input; Designated Pixel; Goal Point; Planning Module; Cost Function; Video Prediction Model; apply action to Robot.
Random Data Collection: collected 45,000 trajectories, recording camera images and actions.
Action-Conditioned Video Prediction [figure: a recurrent NN unrolled over three time steps, taking Action 0, Action 1 and Action 2 and passing a state between steps; rows labeled Real and Generated]
Skip Connection Neural Advection (SNA) [figure panel: DNA (Finn et al.)]
Action-Conditioned Video Prediction with Temporal Skip Connections [figure: a recurrent NN unrolled over three time steps, taking Action 0, Action 1 and Action 2 and passing a state between steps; rows labeled Real and Generated]
Skip Connection Neural Advection (SNA) [figure panels: DNA (Finn et al.); SNA (Ours)]
Skip Connection Neural Advection (SNA) [architecture figure]. Recoverable labels: image of current time step; convolution and deconvolution layers (3×3, 5×5 and 1×1 kernels, stride 2) with convolutional LSTM (Conv-LSTM) layers; feature maps of 64x64x16, 32x32x32, 16x16x16, 16x16x64 and 8x8x64, joined by skip connections; 10 CDNA kernels (9×9); 11 compositing masks (softmax); image from first time step, fed in through a temporal skip connection; generated image of next time step.
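The compositing step named on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single-channel image, takes the kernels and mask logits as given inputs rather than network outputs, and assumes that the 11 masks weight the 10 kernel-transformed images plus the first-frame image. The function names `apply_kernel` and `composite_next_frame` are hypothetical.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def apply_kernel(image, kernel):
    # Cross-correlate an HxW image with a kxk kernel (k odd), zero padding.
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def composite_next_frame(current, first, kernels, mask_logits):
    # current, first: HxW images; kernels: K x k x k; mask_logits: (K+1) x H x W.
    # Assumption: the last mask weights the first-frame image (temporal skip).
    layers = [apply_kernel(current, kern) for kern in kernels]
    layers.append(first.astype(float))
    masks = softmax(mask_logits, axis=0)  # per-pixel weights that sum to 1
    return (masks * np.stack(layers)).sum(axis=0)

# Toy check: identity kernels and uniform masks give a weighted average.
rng = np.random.default_rng(0)
current = rng.random((8, 8))
first = rng.random((8, 8))
identity = np.zeros((9, 9))
identity[4, 4] = 1.0
kernels = np.stack([identity] * 10)   # 10 kernels of size 9x9
mask_logits = np.zeros((11, 8, 8))    # 11 masks, uniform after softmax
next_frame = composite_next_frame(current, first, kernels, mask_logits)
```

Because the masks form a per-pixel convex combination, a pixel that is occluded in the current frame can still be filled in from the first frame when its mask weight is high, which is the role the slide assigns to the temporal skip connection.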
Prediction of Pixel Positions (Test Time) [figure: a recurrent NN unrolled over three time steps, taking Action 0, Action 1 and Action 2 and passing a state between steps; row labeled Generated]
Effects of using temporal skip connections [figure panels: SNA (Ours); DNA (Finn et al.); Designated Pixel]
Planning with Visual-MPC [figure labels: Designated Pixel; Goal Pixel]
Planning: Expected Distance to Goal Cost [figure labels: predicted distribution for the designated pixel; goal point; distance to goal]
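An expected-distance cost of this kind can be written down directly once the model outputs a probability map over where the designated pixel ends up. The sketch below assumes a single HxW probability map and Euclidean distance in pixel units; the function name `expected_distance_cost` is hypothetical, and how the cost is aggregated over predicted time steps is left out.

```python
import numpy as np

def expected_distance_cost(pixel_probs, goal):
    # pixel_probs: HxW array summing to 1, the predicted distribution over
    # the designated pixel's location. goal: (row, col) of the goal pixel.
    h, w = pixel_probs.shape
    rows, cols = np.mgrid[0:h, 0:w]
    dist = np.sqrt((rows - goal[0]) ** 2 + (cols - goal[1]) ** 2)
    # Expectation of the distance under the predicted distribution.
    return float((pixel_probs * dist).sum())

# Toy example: all probability mass at (0, 0), goal at (3, 4).
probs = np.zeros((8, 8))
probs[0, 0] = 1.0
cost_far = expected_distance_cost(probs, (3, 4))

# Half of the mass already on the goal halves the expected distance.
probs_split = np.zeros((8, 8))
probs_split[0, 0] = 0.5
probs_split[3, 4] = 0.5
cost_split = expected_distance_cost(probs_split, (3, 4))
```

A lower value means the predicted distribution puts more of its mass near the goal, so a planner can rank candidate action sequences by this number.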
Action Selection using Cross-Entropy Method [figure labels: Designated Pixel; Goal Pixel; Iteration 1; Iteration 2; Iteration 3]
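The cross-entropy method itself is a generic sampling-based optimizer, and its loop fits in a few lines. The sketch below is not the planner from the talk: the video-prediction model is replaced by a toy cost (hypothetical `toy_cost`, a point that moves by the sum of its actions), and the sample count, elite count and Gaussian initialization are illustrative assumptions. Only the three iterations mirror the slide.

```python
import numpy as np

def cem_plan(cost_fn, horizon, action_dim, iterations=3,
             samples=200, elites=20, seed=0):
    # Iteratively refit a diagonal Gaussian over action sequences to the
    # lowest-cost samples; return the final mean as the plan.
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        noise = rng.standard_normal((samples, horizon, action_dim))
        actions = mean + std * noise
        costs = np.array([cost_fn(a) for a in actions])
        elite = actions[np.argsort(costs)[:elites]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6  # keep the sampling width positive
    return mean

# Toy stand-in for the learned model: a point starts at the origin, each
# action is a displacement, and the cost is the final distance to the goal.
goal = np.array([2.0, -1.0])

def toy_cost(action_sequence):
    return float(np.linalg.norm(action_sequence.sum(axis=0) - goal))

plan = cem_plan(toy_cost, horizon=5, action_dim=2)
baseline = toy_cost(np.zeros((5, 2)))
planned = toy_cost(plan)
```

In a receding-horizon setup, only the first action of `plan` would be executed before replanning from the new observation.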
Results
Generalization to objects not seen during training
Collision Avoidance Task, involving Occlusion [figure labels: Designated Pixel; Goal Pixel; Static Pixel]
Finn et al.
Ours
Multi-Goal Pushing Benchmark
Takeaways
• Temporal skip connections significantly improve the ability to deal with occlusions.
• Video-prediction models can be reused across many tasks.
• Self-supervised learning on large-scale data enables generalizable skills.
Q&A. Chelsea Finn, Alex X. Lee, Sergey Levine. Code and Data: https://sites.google.com/view/sna-visual-mpc