Meta-Reinforcement Learning of Structured Exploration Strategies Abhishek Gupta , Russell Mendonca, YuXuan Liu, Pieter Abbeel, Sergey Levine
Human Exploration vs Robot Exploration
Exploration Informed by Prior Experience
Desired:
§ Effective exploration under sparse rewards
§ Quick adaptation to new tasks
Key Insights in MAESN
1. Explore with random but structured behaviors (exploration)
2. Explicitly train for quick learning on new tasks (adaptation)
Example of fast learning: grasp the red object
Using Structured Stochasticity
Per-timestep exploration vs. structured exploration.
Structured exploration: pick an intention, execute it for the entire episode, and explore across different intentions.
Latent-Conditioned Policies
Structured stochasticity is introduced through a latent-conditioned policy: at each timestep the action a_t is conditioned on the state s_t and a latent variable z ∼ q_ω(·), sampled once per episode.
Train the latent space to capture the prior task distribution.
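The latent-conditioned exploration scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimensions, the linear policy standing in for a neural network, and the toy `env_step` are all assumptions made for clarity. The key property it demonstrates is that z is sampled from q_ω(·) once and then held fixed for the whole episode, so exploration is coherent across timesteps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions chosen for illustration.
OBS_DIM, ACT_DIM, LATENT_DIM = 4, 2, 3

# Per-task variational distribution q_omega(z): a diagonal Gaussian
# whose mean and log-std would be learned during meta-training.
mu = np.zeros(LATENT_DIM)
log_std = np.zeros(LATENT_DIM)

# Linear stand-in for the latent-conditioned policy network,
# mapping the concatenated [state; latent] to an action.
W = rng.normal(scale=0.1, size=(ACT_DIM, OBS_DIM + LATENT_DIM))

def sample_latent():
    """Sample one intention z ~ q_omega(.) -- done once per episode."""
    return mu + np.exp(log_std) * rng.normal(size=LATENT_DIM)

def policy(s, z):
    """Action conditioned on both the current state s_t and the episode's z."""
    return W @ np.concatenate([s, z])

def run_episode(env_step, s0, horizon=10):
    """Structured exploration: z is fixed for the entire episode."""
    z = sample_latent()
    s, traj = s0, []
    for _ in range(horizon):
        a = policy(s, z)
        s = env_step(s, a)
        traj.append((s, a))
    return z, traj
```

Contrast with per-timestep exploration, which would re-sample independent action noise at every step; here all variability within an episode comes from the single draw of z.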
Meta-Training Latent Spaces
Beyond capturing the task distribution, the latent space is trained for quick adaptation via meta-learning: meta-train the latent space and policy so that one step of RL adapts to each task (e.g. grasping the red object).
Trained with an algorithm based on Model-Agnostic Meta-Learning [1].
[1] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn et al., ICML 2017
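The MAML-style objective ("one step of RL should adapt to each task") can be sketched on a toy problem. This is a simplified sketch under loud assumptions: only a per-task latent mean is adapted in the inner loop, the "RL step" is plain gradient ascent on a differentiable toy reward, and finite differences stand in for backpropagation through the inner update; the actual method also adapts the variance and meta-trains the policy with policy gradients.

```python
import numpy as np

def inner_step(mu, reward_grad, alpha=0.1):
    """Inner loop: one gradient-ascent step on the task reward w.r.t. mu
    (standing in for '1 step of RL' on the latent parameters)."""
    return mu + alpha * reward_grad(mu)

def meta_update(mu0, tasks, alpha=0.1, beta=0.05, eps=1e-4):
    """Outer loop: move the pre-update mu0 so that ONE inner step does
    well on each task. tasks is a list of (reward, reward_grad) pairs;
    finite differences approximate the meta-gradient."""
    meta_grad = np.zeros_like(mu0)
    for reward, reward_grad in tasks:
        # d/d(mu0) of the POST-update reward R(inner_step(mu0)).
        for j in range(len(mu0)):
            e = np.zeros_like(mu0)
            e[j] = eps
            r_plus = reward(inner_step(mu0 + e, reward_grad, alpha))
            r_minus = reward(inner_step(mu0 - e, reward_grad, alpha))
            meta_grad[j] += (r_plus - r_minus) / (2 * eps)
    return mu0 + beta * meta_grad / len(tasks)
```

On a toy task whose reward is the negative squared distance to a goal, repeated `meta_update` calls drive the pre-update mu0 toward a point from which a single inner step reaches high reward, which is the adaptation objective the slide describes.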
Experiments: Robotic Manipulation
Random exploration vs. MAESN exploration
Experiments: Legged Locomotion
Random exploration vs. MAESN exploration
Quick Learning of New Tasks
§ Learns new tasks very quickly
§ Higher asymptotic reward than prior methods
§ Better exploration
Thank You!
YuXuan Liu, Russell Mendonca, Pieter Abbeel, Sergey Levine
Please come visit our poster at Room 210 and 230, AB #134
Find the code and paper online at https://sites.google.com/view/meta-explore/