  1. Policy Consolidation for Continual Reinforcement Learning
      Christos Kaplanis¹, Murray Shanahan¹,² and Claudia Clopath¹
      ¹Imperial College London, ²DeepMind. 11th June 2019

  2. Motivation

  10. Motivation
      ◮ Catastrophic Forgetting in Artificial Neural Networks
      ◮ Agents should cope with
        ◮ Both discrete and continuous changes to data distribution
        ◮ No prior knowledge of when/how changes occur
      ◮ Test beds: alternating task, single task and multi-agent RL

  11. Policy Consolidation
      [Diagram: the agent is a cascade of policies π1, π2, π3, ..., πN, each paired with an "old" copy of itself and linked to its neighbours by KL-distillation losses. π1 plays the game and is trained on the agent loss; knowledge is stored along the cascade ("Store Policy") and recalled back towards the behavioural policy ("Recall Policy").]
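      A minimal sketch of the idea, not the authors' implementation: it assumes PyTorch, policies that map a batch of states to torch.distributions objects, and single scalar weights omega and beta (the paper uses per-level coefficients, and its exact KL directions may differ). The behavioural policy π1 would additionally be trained with the usual PPO surrogate, which is omitted here.

          # Illustrative sketch only. Each policy is assumed to be a callable that
          # returns a torch.distributions object for a batch of states; pi[0] is the
          # behavioural policy and pi_old[k] is the previous copy of pi[k].
          import torch
          from torch.distributions import kl_divergence

          def consolidation_loss(pi, pi_old, states, omega=1.0, beta=1.0):
              """KL terms coupling a cascade of policies pi[0..N-1] and their old copies."""
              loss = torch.tensor(0.0)
              for k in range(1, len(pi)):
                  d_prev, d_k = pi[k - 1](states), pi[k](states)
                  d_prev_old, d_k_old = pi_old[k - 1](states), pi_old[k](states)
                  # "store": pull the deeper policy towards its shallower neighbour
                  loss = loss + omega * kl_divergence(d_prev_old, d_k).mean()
                  # "recall": keep the shallower policy close to what was stored deeper
                  loss = loss + omega * kl_divergence(d_k_old, d_prev).mean()
                  # inertia: each policy also stays close to its own previous copy
                  loss = loss + beta * kl_divergence(d_k_old, d_k).mean()
              return loss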

  12. Alternating task experiments
      [Plots: reward vs. training steps (up to 2e7) for three alternating task pairs: [Walker2d-v2, Walker2dBigLeg-v0], [HalfCheetah-v2, HalfCheetahBigLeg-v0] and [HumanoidSmallLeg-v0, HumanoidBigLeg-v0]. Curves compare PC with baselines at β = 1, 5, 10, 20, 50 and clip = 0.2, 0.1, 0.03; the Humanoid pair also includes an adaptive-β baseline.]
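      As an illustration of the alternating-task protocol (not the paper's training code), the sketch below cycles between the two Walker2d variants on a fixed schedule, with no signal to the agent when the task changes. The switch period is an assumption, and Walker2dBigLeg-v0 is a custom environment from the paper that is assumed to be registered with gym.

          # Hypothetical setup: cycle between two task variants every `switch_every`
          # environment steps; the agent is never told when the task switches.
          import gym

          def alternating_tasks(task_ids=("Walker2d-v2", "Walker2dBigLeg-v0"),
                                switch_every=1_000_000, total_steps=20_000_000):
              """Yield (start_step, env) pairs, cycling through the task variants."""
              envs = [gym.make(t) for t in task_ids]  # custom envs must be registered
              for start in range(0, total_steps, switch_every):
                  yield start, envs[(start // switch_every) % len(envs)]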

  13. Single task experiments
      [Plots: reward vs. training steps for Walker2d-v2 and HalfCheetahBigLeg-v0 (up to 2e7) and RoboschoolHumanoid-v1 (up to 5e7). Curves compare PC with baselines at β = 1, 5, 10, 20, 50, clip = 0.2, 0.1, 0.03 and adaptive β.]

  14. Multi-agent self-play experiments
      [Plots, mean score vs. steps (up to 6e8): (a) Final model vs. self history, for PC1, PC2, PC3, Clip=0.2, Clip=0.1, β = 0.5, 1.0, 2.0, 5.0 and adaptive β; (b) PC vs. baselines over training, against Clip=0.2, Clip=0.1, β = 0.5, 1.0, 2.0, 5.0 and adaptive β.]
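      Panel (a) evaluates each final agent against stored snapshots of its own training history. A minimal sketch of that evaluation loop, with play_match as a hypothetical helper that plays one game between two policies and returns the first player's score:

          # Hypothetical evaluation of a final policy against earlier snapshots of itself.
          def score_vs_history(final_policy, snapshots, play_match, n_games=100):
              """Mean score of the final policy against each historical snapshot."""
              return [sum(play_match(final_policy, old) for _ in range(n_games)) / n_games
                      for old in snapshots]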

  15. Future work

  17. Future work
      ◮ Prioritised consolidation
      ◮ Adapt for off-policy learning
