Policy Consolidation for Continual Reinforcement Learning
Christos Kaplanis¹, Murray Shanahan¹,² and Claudia Clopath¹
¹Imperial College London, ²DeepMind
11th June 2019
Motivation

◮ Catastrophic forgetting in artificial neural networks
◮ Agents should cope with:
  ◮ Both discrete and continuous changes to the data distribution
  ◮ No prior knowledge of when/how changes occur
◮ Test beds: alternating-task, single-task and multi-agent RL
Policy Consolidation

[Figure: Policy Consolidation architecture. A cascade of policies ρ1, ρ2, ..., ρN and their stored copies ρ1^old, ..., ρN^old. The visible policy ρ1 plays the game and is trained with the agent loss; KL distillation losses couple neighbouring policies, storing the current policy deeper into the cascade and recalling older behaviour back towards ρ1.]
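The cascade in the figure can be expressed as a single training objective: the visible policy ρ1 receives the RL loss, and KL distillation terms tie each policy to its neighbours and to its stored ("old") copy. Below is a minimal PyTorch-style sketch of such an objective, assuming discrete actions and precomputed logits for a shared batch of states; the function names, the geometric ω schedule and the single β coefficient are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a policy-consolidation-style objective (assumptions:
# discrete actions; logits for all cascade policies computed on one batch;
# omega/beta schedules are illustrative).
import torch
import torch.nn.functional as F

def kl(p_logits, q_logits):
    """KL(p || q) between two categorical policies, averaged over the batch."""
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    return (p_log.exp() * (p_log - q_log)).sum(-1).mean()

def policy_cascade_loss(agent_loss, logits, old_logits, beta=1.0, omega_scale=4.0):
    """
    agent_loss : RL surrogate loss for the visible policy rho_1 (e.g. PPO).
    logits     : list of length N, current logits of rho_1 ... rho_N.
    old_logits : list of length N, logits of the stored copies rho_k^old
                 (detached, so no gradient flows through them).
    """
    n = len(logits)
    loss = agent_loss
    for k in range(n):
        # keep each policy close to its own previous self
        loss = loss + beta * kl(old_logits[k], logits[k])
        if k + 1 < n:
            omega = omega_scale ** k  # deeper policies change more slowly
            # "store": distil the faster policy k into the slower policy k+1
            loss = loss + omega * kl(old_logits[k], logits[k + 1])
            # "recall": the slower policy k+1 pulls the faster policy k back
            loss = loss + omega * kl(old_logits[k + 1], logits[k])
    return loss
```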
Alternating task experiments

[Plots: reward vs. training steps (up to 2e7) while alternating between the task pairs [Walker2d-v2, Walker2dBigLeg-v0], [HalfCheetah-v2, HalfCheetahBigLeg-v0] and [HumanoidSmallLeg-v0, HumanoidBigLeg-v0]. PC is compared against PPO baselines with fixed KL coefficients β ∈ {1, 5, 10, 20, 50}, clipping thresholds {0.2, 0.1, 0.03} and, for the humanoid pair, adaptive β.]
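The alternating-task setting can be reproduced with a simple wrapper that switches between two environment variants during training. The sketch below is a minimal illustration: the wrapper name and switch period are assumptions, and the *BigLeg/SmallLeg variants are custom environments from the paper that are not registered in stock Gym.

```python
# Minimal sketch of an alternating-task schedule (assumptions: both task
# variants are registered Gym environments; switch period is illustrative).
import gym

class AlternatingTask:
    """Alternates between task variants, switching at episode boundaries
    once `switch_steps` environment steps have elapsed in the current phase."""

    def __init__(self, env_ids, switch_steps=1_000_000):
        self.envs = [gym.make(eid) for eid in env_ids]
        self.switch_steps = switch_steps
        self.total_steps = 0
        self.active = 0

    def reset(self):
        # only change the active task at an episode boundary
        self.active = (self.total_steps // self.switch_steps) % len(self.envs)
        return self.envs[self.active].reset()

    def step(self, action):
        self.total_steps += 1
        return self.envs[self.active].step(action)

# e.g. AlternatingTask(["Walker2d-v2", "Walker2dBigLeg-v0"])  # the second ID
# is a custom variant from the paper and would need to be registered first
```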
Single task experiments

[Plots: reward vs. training steps on single tasks Walker2d-v2 and HalfCheetahBigLeg-v0 (2e7 steps) and RoboschoolHumanoid-v1 (5e7 steps). PC is compared against PPO baselines with fixed KL coefficients β ∈ {1, 5, 10, 20, 50}, adaptive β and clipping thresholds {0.2, 0.1, 0.03}.]
Multi-agent self-play experiments

[Plots: (a) mean score of each final model (PC1, PC2, PC3, Clip=0.2, Clip=0.1, β = 0.5, 1.0, 2.0, 5.0, adaptive β) against its own history of past selves; (b) mean score of PC against each baseline over 6e8 training steps.]
Future work

◮ Prioritised consolidation
◮ Adapt for off-policy learning