Meta-Reinforcement Learning of Structured Exploration Strategies Abhishek Gupta , Russell Mendonca, YuXuan Liu, Pieter Abbeel, Sergey Levine
Human Exploration vs Robot Exploration
Exploration Informed by Prior Experience
Desired:
§ Effective exploration under sparse rewards
§ Quick adaptation to new tasks
Key Insights in MAESN
1. Explore with random but structured behaviors (exploration)
2. Explicitly train for quick learning on new tasks (adaptation)
Example of fast learning: grasp the red object
Using Structured Stochasticity
Per-timestep exploration vs. structured exploration.
Structured exploration: pick an intention, execute it for the entire episode, and explore across different intentions.
Latent-Conditioned Policies
Structured stochasticity is introduced through a latent-conditioned policy: at each timestep the action a_t is conditioned on the state s_t and a latent variable z ∼ q_ω(·), sampled once per episode.
Train the latent space to capture the prior task distribution.
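The latent-conditioned exploration scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimensions, the linear policy standing in for a neural network, and the toy `env_step` are all assumptions made for clarity. The key property it demonstrates is that z is sampled from q_ω(·) once and then held fixed for the whole episode, so exploration is coherent across timesteps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions chosen for illustration.
OBS_DIM, ACT_DIM, LATENT_DIM = 4, 2, 3

# Per-task variational distribution q_omega(z): a diagonal Gaussian
# whose mean and log-std would be learned during meta-training.
mu = np.zeros(LATENT_DIM)
log_std = np.zeros(LATENT_DIM)

# Linear stand-in for the latent-conditioned policy network,
# mapping the concatenated [state; latent] to an action.
W = rng.normal(scale=0.1, size=(ACT_DIM, OBS_DIM + LATENT_DIM))

def sample_latent():
    """Sample one intention z ~ q_omega(.) -- done once per episode."""
    return mu + np.exp(log_std) * rng.normal(size=LATENT_DIM)

def policy(s, z):
    """Action conditioned on both the current state s_t and the episode's z."""
    return W @ np.concatenate([s, z])

def run_episode(env_step, s0, horizon=10):
    """Structured exploration: z is fixed for the entire episode."""
    z = sample_latent()
    s, traj = s0, []
    for _ in range(horizon):
        a = policy(s, z)
        s = env_step(s, a)
        traj.append((s, a))
    return z, traj
```

Contrast with per-timestep exploration, which would re-sample independent action noise at every step; here all variability within an episode comes from the single draw of z.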
Meta-Training Latent Spaces
Beyond capturing the task distribution, the latent space is trained for quick adaptation via meta-learning: meta-train the latent space and policy so that one step of RL adapts to each task (e.g. grasping the red object).
Trained with an algorithm based on Model-Agnostic Meta-Learning [1].
[1] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn et al., ICML 2017
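The MAML-style objective ("one step of RL should adapt to each task") can be sketched on a toy problem. This is a simplified sketch under loud assumptions: only a per-task latent mean is adapted in the inner loop, the "RL step" is plain gradient ascent on a differentiable toy reward, and finite differences stand in for backpropagation through the inner update; the actual method also adapts the variance and meta-trains the policy with policy gradients.

```python
import numpy as np

def inner_step(mu, reward_grad, alpha=0.1):
    """Inner loop: one gradient-ascent step on the task reward w.r.t. mu
    (standing in for '1 step of RL' on the latent parameters)."""
    return mu + alpha * reward_grad(mu)

def meta_update(mu0, tasks, alpha=0.1, beta=0.05, eps=1e-4):
    """Outer loop: move the pre-update mu0 so that ONE inner step does
    well on each task. tasks is a list of (reward, reward_grad) pairs;
    finite differences approximate the meta-gradient."""
    meta_grad = np.zeros_like(mu0)
    for reward, reward_grad in tasks:
        # d/d(mu0) of the POST-update reward R(inner_step(mu0)).
        for j in range(len(mu0)):
            e = np.zeros_like(mu0)
            e[j] = eps
            r_plus = reward(inner_step(mu0 + e, reward_grad, alpha))
            r_minus = reward(inner_step(mu0 - e, reward_grad, alpha))
            meta_grad[j] += (r_plus - r_minus) / (2 * eps)
    return mu0 + beta * meta_grad / len(tasks)
```

On a toy task whose reward is the negative squared distance to a goal, repeated `meta_update` calls drive the pre-update mu0 toward a point from which a single inner step reaches high reward, which is the adaptation objective the slide describes.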
Experiments: Robotic Manipulation
Random exploration vs. MAESN exploration
Experiments: Legged Locomotion
Random exploration vs. MAESN exploration
Quick Learning of New Tasks
§ Learns new tasks very quickly
§ Higher asymptotic reward than prior methods
§ Better exploration
Thank You!
YuXuan Liu, Russell Mendonca, Pieter Abbeel, Sergey Levine
Please come visit our poster at Room 210 and 230, AB #134
Find the code and paper online at https://sites.google.com/view/meta-explore/