Meta-Reinforcement Learning of Structured Exploration Strategies


  1. Meta-Reinforcement Learning of Structured Exploration Strategies Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, Sergey Levine

  2. Human Exploration vs Robot Exploration

  3. Human Exploration vs Robot Exploration

  4. Human Exploration vs Robot Exploration

  5. Exploration Informed by Prior Experience

  6. Exploration Informed by Prior Experience

  7. Exploration Informed by Prior Experience. Desired: • Effective exploration for sparse rewards • Quick adaptation for new tasks

  8. Key Insights in MAESN 1. Explore with random but structured behaviors (exploration)

  9. Key Insights in MAESN 1. Explore with random but structured behaviors (exploration) 2. Explicitly train for quick learning on new tasks (adaptation)

  10. Key Insights in MAESN 1. Explore with random but structured behaviors (exploration) 2. Explicitly train for quick learning on new tasks (adaptation)

  11. Key Insights in MAESN 1. Explore with random but structured behaviors (exploration) 2. Explicitly train for quick learning on new tasks (adaptation) Fast Learning Grasp red object

  12. Key Insights in MAESN 1. Explore with random but structured behaviors (exploration) 2. Explicitly train for quick learning on new tasks (adaptation) Fast Learning Grasp red object

  13. Using Structured Stochasticity: per-timestep exploration vs. structured exploration. Structured exploration: pick an intention and execute it for the entire episode; explore across different intentions.
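To make the contrast concrete, here is a minimal sketch of the two exploration styles (plain NumPy, with a stand-in linear policy and no real environment; both are assumptions, not the slides' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, state_dim, action_dim, latent_dim = 100, 4, 2, 2

def policy(state, noise):
    # Stand-in linear policy; in MAESN this is a neural network conditioned on a latent variable.
    return np.tanh(state[:action_dim] + noise)

state = np.zeros(state_dim)

# Per-timestep exploration: independent noise at every step gives jittery, unstructured behavior.
for t in range(T):
    action = policy(state, rng.normal(scale=0.3, size=action_dim))
    # state, reward = env.step(action)  # environment interaction omitted in this sketch

# Structured exploration: sample one "intention" z, hold it fixed for the whole episode,
# and explore by trying different z across episodes.
z = rng.normal(size=latent_dim)
for t in range(T):
    action = policy(state, z)  # the same z shapes every action, so behavior is temporally coherent
    # state, reward = env.step(action)
```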

  14. Latent Conditioned Policies: structured stochasticity is introduced through a latent-conditioned policy π(a_t | s_t, z), with z ∼ q_ω(·) sampled once per episode. Train the latent space to capture the prior task distribution.

  15. Meta-Training Latent Spaces Beyond capturing task distribution, train for quick adaptation via meta-learning Latent Space

  16. Meta-Training Latent Spaces Beyond capturing task distribution, train for quick adaptation via meta-learning Latent Space 1 step of RL Grasp red object

  17. Meta-Training Latent Spaces Beyond capturing task distribution, train for quick adaptation via meta-learning Latent Space Latent Space 1 step of RL Grasp red object

  18. Meta-Training Latent Spaces Beyond capturing task distribution, train for quick adaptation via meta-learning 1 step of RL 1 step of RL 1 step of RL Meta-train latent space, policy

  19. Meta-Training Latent Spaces: beyond capturing the task distribution, train for quick adaptation via meta-learning. Meta-train the latent space and policy with an algorithm based on Model-Agnostic Meta-Learning [1]. [1] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn et al., ICML 2017
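The sketch below illustrates the MAML-style loop described on this slide, reusing the `policy` module sketched above: one inner step adapts only each task's latent parameters, and the meta-update trains the policy and pre-update latents for fast adaptation. `fake_task_loss`, the task goals, and all hyperparameters are hypothetical stand-ins for the on-policy policy-gradient objective used in the paper, and only the mean of each task's latent distribution is adapted, for brevity:

```python
import torch

def fake_task_loss(policy, z, task_goal):
    # Placeholder differentiable loss in (policy, z); the actual method uses RL returns from rollouts.
    action = policy(torch.zeros(4), z)
    return ((action - task_goal) ** 2).sum()

num_tasks, inner_lr = 4, 0.1
task_goals = [torch.randn(2) for _ in range(num_tasks)]
mu = [torch.zeros(2, requires_grad=True) for _ in range(num_tasks)]  # pre-update latent mean per task
meta_opt = torch.optim.Adam(list(policy.parameters()) + mu, lr=1e-3)

for iteration in range(100):
    meta_loss = 0.0
    for i, goal in enumerate(task_goals):
        # Inner loop: "1 step of RL" on task i, adapting only the latent parameters.
        pre_loss = fake_task_loss(policy, mu[i], goal)
        (grad_mu,) = torch.autograd.grad(pre_loss, mu[i], create_graph=True)
        mu_adapted = mu[i] - inner_lr * grad_mu

        # Outer objective: performance after the single adaptation step.
        meta_loss = meta_loss + fake_task_loss(policy, mu_adapted, goal)

    # Meta-update: change the policy and the pre-update latent space so that
    # one inner step of RL yields a large improvement on every task.
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```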

  20. Experiments: Robotic Manipulation Random Exploration

  21. Experiments: Robotic Manipulation Random Exploration MAESN exploration

  22. Experiments: Robotic Manipulation Random Exploration MAESN exploration

  23. Experiments: Legged Locomotion Random Exploration

  24. Experiments: Legged Locomotion Random Exploration MAESN exploration

  25. Experiments: Legged Locomotion Random Exploration MAESN exploration

  26. Quick Learning of New Tasks: • Learns very quickly • Higher asymptotic reward than prior methods • Better exploration

  27. Quick Learning of New Tasks: • Learns very quickly • Higher asymptotic reward than prior methods • Better exploration

  28. Quick Learning of New Tasks: • Learns very quickly • Higher asymptotic reward than prior methods • Better exploration

  29. Quick Learning of New Tasks: • Learns very quickly • Higher asymptotic reward than prior methods • Better exploration

  30. Thank You! YuXuan Liu, Russell Mendonca, Pieter Abbeel, Sergey Levine. Please come visit our poster at Room 210 and 230, AB #134. Find the code and paper online at https://sites.google.com/view/meta-explore/
