Learning Latent Dynamics for Planning from Pixels
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson
@danijarh danijar.com/planet
Planning with Learned Models
Watter et al., 2015; Banijamali et al., 2017; Zhang et al., 2017
Agrawal et al., 2016; Finn & Levine, 2016; Ebert et al., 2018
Visual Control Tasks
Task challenges: partially observable, many joints, sparse reward, contacts, balance
Some model-free methods can solve these tasks but need up to 100,000 episodes
We introduce PlaNet
Recipe for scalable model-based reinforcement learning:
1. Efficient planning in latent space with large batch size
2. Reaches top performance using 200X fewer episodes
Latent Dynamics Model
Encode images, predict states, decode images, decode rewards
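To make the four components concrete, here is a minimal sketch of one possible implementation; the PyTorch modules, layer sizes, and dimensions below are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal sketch of the four model components, assuming PyTorch.
from torch import nn

class LatentDynamicsModel(nn.Module):
    def __init__(self, state_dim=30, action_dim=6, embed_dim=256):
        super().__init__()
        # Encode images: observation -> embedding used to infer the state.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(embed_dim))
        # Predict states: previous state and action -> next state.
        self.transition = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim))
        # Decode images: state -> reconstructed 64x64 RGB observation.
        self.image_decoder = nn.Linear(state_dim, 64 * 64 * 3)
        # Decode rewards: state -> scalar reward prediction.
        self.reward_decoder = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
```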
Recurrent State Space Model
Combines a deterministic transition path with a stochastic one
(Diagram: Recurrent Neural Network with purely deterministic states h1..h3, State Space Model with purely stochastic states s1..s3, Recurrent State Space Model with both)
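A minimal sketch of one RSSM transition step, assuming a GRU for the deterministic path and a diagonal Gaussian for the stochastic state; the dimensions are common choices, not necessarily the exact paper values.

```python
# One RSSM step: deterministic memory plus a sampled stochastic state.
import torch
from torch import nn

class RSSMCell(nn.Module):
    def __init__(self, stoch_dim=30, deter_dim=200, action_dim=6):
        super().__init__()
        self.gru = nn.GRUCell(stoch_dim + action_dim, deter_dim)
        self.prior = nn.Linear(deter_dim, 2 * stoch_dim)

    def forward(self, prev_stoch, prev_deter, action):
        # Deterministic path: remembers information reliably over many steps.
        deter = self.gru(torch.cat([prev_stoch, action], -1), prev_deter)
        # Stochastic path: captures multiple possible futures.
        mean, raw_std = self.prior(deter).chunk(2, -1)
        stoch = mean + nn.functional.softplus(raw_std) * torch.randn_like(mean)
        return stoch, deter
```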
Unguided Video Predictions by a Single Agent
5 context frames and 45 predicted frames
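A hedged sketch of what "unguided" means here: infer the state from a few context frames, then roll the model forward open-loop without looking at further images. The `encode`, `transition`, and `decode` callables are stand-ins for the learned networks.

```python
# Open-loop video prediction from a short context.
def open_loop_video(frames, actions, encode, transition, decode, context=5):
    state = None
    for t in range(context):                # use the first frames as context
        state = encode(frames[t], state)
    predictions = []
    for t in range(context, len(actions)):  # predict the rest without images
        state = transition(state, actions[t])
        predictions.append(decode(state))   # e.g. 45 predicted frames
    return predictions
```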
Recovers the True Dynamics
The true simulator state can be predicted from a copy of the model state
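One way to read this claim is as a probing experiment: fit a simple readout from frozen model states to simulator states. The sketch below uses stand-in arrays and least squares, which is an assumption about the probing setup rather than the paper's exact procedure.

```python
# Linear probe from frozen model states to true simulator states.
import numpy as np

model_states = np.random.randn(1000, 230)  # stand-in for collected latent states
sim_states = np.random.randn(1000, 24)     # stand-in for true simulator states
readout, *_ = np.linalg.lstsq(model_states, sim_states, rcond=None)
error = np.mean((model_states @ readout - sim_states) ** 2)
print("mean squared probe error:", error)
```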
Planning in Latent Space
Cross Entropy Planner
1. Initialize factorized Gaussian population distribution over action sequences
2. Sample 1000 candidate action sequences
3. Evaluate candidates in parallel using the model
4. Re-fit the population to the top 100 candidates
5. Repeat for 10 steps
(Diagram: candidate action sequences plotted over the planning horizon)
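A minimal sketch of the planner above; `evaluate` stands in for rolling out the learned model and summing predicted rewards, and the horizon and action dimension are illustrative assumptions.

```python
# Cross entropy method over action sequences, as in the steps above.
import numpy as np

def cem_plan(evaluate, horizon=12, action_dim=6,
             candidates=1000, top_k=100, iterations=10):
    # 1. Factorized Gaussian over action sequences.
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):                       # 5. repeat for 10 steps
        # 2. Sample 1000 candidate action sequences.
        samples = mean + std * np.random.randn(candidates, horizon, action_dim)
        # 3. Evaluate all candidates in parallel under the model.
        returns = evaluate(samples)                   # shape: (candidates,)
        # 4. Re-fit the population to the top 100 candidates.
        elite = samples[np.argsort(returns)[-top_k:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0)
    return mean[0]  # execute only the first action, then replan

# Toy usage with a stand-in objective that prefers small actions:
first_action = cem_plan(lambda seqs: -np.square(seqs).sum(axis=(1, 2)))
```

Returning only the first action and replanning at every step is the usual model predictive control pattern, which matches the receding-horizon use of the planner in the talk.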
Comparison to Model-Free Agents
Training time: 1 day on a single GPU
Comparison of Model Designs
Comparison of Iterative Planning
Some Additional Tasks in Three Dimensions
Minitaur: 400 episodes
Quadruped: 2000 episodes
Conclusions
1. PlaNet solves control tasks from images by efficient planning in the compact latent space of a learned model
2. Pure planning with learned dynamics is feasible for control tasks with image observations, contacts, and sparse rewards
3. Planning with learned models can reach the performance of top model-free algorithms in 200 times fewer episodes and the same training time
Enabling More Model-Based RL Research
With Jimmy Ba, Mohammad Norouzi, Timothy Lillicrap
Explore dynamics without supervision
Distill the planner to save computation
Value function to extend planning horizon
Learning Latent Dynamics for Planning from Pixels
Website with code, videos, blog post, animated paper: danijar.com/planet
Thank you
Multi-Step Consistency in Latent Space
1. A perfect one-step model would give perfect multi-step predictions
2. Under limited capacity, one-step and multi-step solutions may not coincide
3. Encourage consistency between one-step and multi-step predictions in latent space
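As a rough illustration, the sketch below penalizes disagreement between states reached by chaining one-step predictions and the states inferred from observations. The paper's latent overshooting uses KL divergences between state distributions, so the squared error here is a simplified stand-in.

```python
# Simplified multi-step consistency loss over a sequence of latent states.
import torch

def multistep_consistency(transition, posteriors, actions, depth=3):
    # posteriors: (T, B, D) states inferred from observations.
    # actions: (T - 1, B, A); actions[t] leads from step t to step t + 1.
    T = posteriors.shape[0]
    loss = torch.zeros(())
    for d in range(1, depth + 1):
        pred = posteriors[:T - d]            # start from inferred states
        for k in range(d):                   # chain d one-step predictions
            pred = transition(pred, actions[k:k + T - d])
        # Match the states inferred d steps later (targets held fixed).
        loss = loss + (pred - posteriors[d:].detach()).pow(2).mean()
    return loss / depth
```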