Differentiable Tree Planning for Deep RL


  1. Differentiable Tree Planning for Deep RL
     Greg Farquhar

  2. In Collaboration With Tim Rocktäschel, Maximilian Igl, & Shimon Whiteson

  3. Overview
     ● Reinforcement learning
     ● Model-based RL and online planning
     ● TreeQN and ATreeC (ICLR 2018)
     ● Results
     ● Future work

  4. Planning and Learning for Control

  5. Reinforcement Learning

  6. Reinforcement Learning
     ● Specify the reward, learn the solution
     ● Very general framework
     ● The problem is hard:
       ○ Rewards are sparse
       ○ Credit assignment
       ○ Exploration and exploitation
       ○ Large state/action spaces
       ○ Approximation and generalisation

  7. RL Key Concepts
     ● State (observation)
     ● Action
     ● Transition
     ● Reward
     ● Policy: states → actions
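
These concepts map directly onto the standard agent-environment loop. A minimal sketch in Python, where the toy environment and the random policy are purely illustrative and not from the talk:

    import random

    # Illustrative toy environment: states, actions, transitions and rewards.
    class ToyEnv:
        def __init__(self):
            self.state = 0                       # state (observation)

        def step(self, action):                  # transition + reward
            self.state = min(self.state + action, 10)
            reward = 1.0 if self.state == 10 else 0.0
            done = self.state == 10
            return self.state, reward, done

    def policy(state):                           # policy: states -> actions
        return random.choice([0, 1])

    env, state, done = ToyEnv(), 0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)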

  8. Model-free RL: Value Functions
     ● Learn without a model of the environment
     ● Value function
     ● Optimal value function
     ● Policy evaluation + improvement
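
For reference, the standard definitions behind these bullets (textbook notation, not copied from the slides): the action-value function of a policy and the optimal action-value function.

    \[
      Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\; a_{0} = a \right],
      \qquad
      Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a).
    \]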

  9. The Bellman Equation
     ● Temporal (Markov) structure
     ● Bellman optimality equation
     ● Q-learning
     ● Backups
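
The equations these bullets refer to, in their standard form: the Bellman optimality equation and the tabular Q-learning update, a "backup" of the sampled target onto the current estimate.

    \[
      Q^{*}(s, a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right],
    \]
    \[
      Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right).
    \]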

  10. Deep RL
      ● Q → deep neural network
      ● Q-learning as regression
      ● Stability is hard:
        ○ Target networks
        ○ Replay memory
        ○ Parallel environment threads
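
A sketch of "Q-learning as regression" with a target network, in PyTorch. The network sizes, batch contents and hyperparameters are illustrative assumptions; the actual agents in this work use convolutional encoders on pixels.

    import torch
    import torch.nn as nn

    # Illustrative Q-network; real agents use convolutional encoders on pixels.
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    target_net.load_state_dict(q_net.state_dict())  # periodically synced copy

    gamma = 0.99
    # A dummy batch of transitions (s, a, r, s', done) standing in for replay/thread data.
    s, a = torch.randn(32, 4), torch.randint(0, 2, (32,))
    r, s2, done = torch.randn(32), torch.randn(32, 4), torch.zeros(32)

    # Regression target r + gamma * max_a' Q_target(s', a'), held fixed w.r.t. gradients.
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values

    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) of the taken actions
    loss = nn.functional.mse_loss(pred, target)           # Q-learning as regression
    loss.backward()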

  11. Deep RL - Encode and Evaluate?

  12. Model-based RL

  13. Online Planning with Tree Search

  14. Environment Models
      ● State transition + reward
      ● Can be hard to learn
        ○ Complex
        ○ Generalise poorly to new parts of the state space
      ● Need very good fidelity for planning
      ● Standard approach: predictive error on observations

  15. Model fidelity in complex visual spaces is too low for effective planning
      (Action-Conditional Video Prediction using Deep Networks in Atari Games, Oh et al. 2015)

  16. Model fidelity in complex visual spaces is too low for effective planning
      (Action-Conditional Video Prediction using Deep Networks in Atari Games, Oh et al. 2015)

  17. Another Way to Learn Models
      ● Optimise the true objective downstream of the model
        ○ Value prediction
        ○ Performance on the real task
      ● Our approach: integrate a differentiable model into a differentiable planner and learn end-to-end

  18. TreeQN: Encode

  19. TreeQN: Tree Expansion

  20. TreeQN: Evaluation

  21. TreeQN: Tree Backup

  22. TreeQN
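
Slides 18 to 22 walk through the stages of the TreeQN forward pass: encode the observation into a latent state, expand a tree of latent states with a learned action-conditional transition function, evaluate predicted rewards and leaf values, and back the values up the tree to produce Q-values. Below is a minimal sketch of that pipeline under simplifying assumptions (fully-connected modules, a hard max backup rather than the soft backup mentioned on the next slide, no batching); module names and shapes are illustrative rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    # Illustrative sizes; the real encoder is convolutional for pixel observations.
    OBS_DIM, LATENT, N_ACTIONS, GAMMA = 16, 32, 4, 0.99

    encoder = nn.Sequential(nn.Linear(OBS_DIM, LATENT), nn.ReLU())        # encode
    transition = nn.ModuleList([nn.Linear(LATENT, LATENT) for _ in range(N_ACTIONS)])
    reward_fn = nn.ModuleList([nn.Linear(LATENT, 1) for _ in range(N_ACTIONS)])
    value_fn = nn.Linear(LATENT, 1)                                       # evaluate

    def backup(z, depth):
        """Return Q(z, a) for every action by recursively expanding the tree."""
        q_per_action = []
        for a in range(N_ACTIONS):
            z_next = torch.relu(transition[a](z))    # tree expansion in latent space
            r = reward_fn[a](z).squeeze(-1)          # predicted immediate reward
            if depth == 1:
                v = value_fn(z_next).squeeze(-1)     # evaluate leaf state
            else:
                v = backup(z_next, depth - 1).max()  # back up values from the subtree
            q_per_action.append(r + GAMMA * v)
        return torch.stack(q_per_action)

    obs = torch.randn(OBS_DIM)
    q_values = backup(encoder(obs), depth=2)         # differentiable end to end

A Q-learning loss on q_values then trains the encoder, transition, reward and value modules jointly, which is the end-to-end property the talk emphasises.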

  23. Architecture Details
      ● Two-step transition function (shared step, then action-conditional step)
      ● Residual connections
      ● State normalisation
      ● Soft backups
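
A sketch of how these details could look in code, continuing the illustrative sketch above: a two-step transition (a shared layer followed by an action-conditional layer) with a residual connection and unit-norm latent states, plus a softmax-weighted backup in place of the hard max. The shapes and the temperature parameter are illustrative assumptions.

    import torch
    import torch.nn as nn

    LATENT, N_ACTIONS = 32, 4

    # Two-step transition: a shared layer, then an action-conditional layer,
    # with a residual connection and unit-norm latent states.
    shared = nn.Linear(LATENT, LATENT)
    per_action = nn.ModuleList([nn.Linear(LATENT, LATENT) for _ in range(N_ACTIONS)])

    def transition(z, a):
        h = torch.relu(shared(z))                   # step 1: shared across actions
        z_next = z + torch.relu(per_action[a](h))   # step 2: action-conditional, residual
        return z_next / z_next.norm()               # state normalisation

    def soft_backup(q, temperature=1.0):
        # Soft backup: a softmax-weighted sum instead of the hard max over actions,
        # so gradients flow into every branch of the tree.
        w = torch.softmax(q / temperature, dim=-1)
        return (w * q).sum(dim=-1)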

  24. Training
      ● Optimise end-to-end with the primary RL objective
      ● Parameter sharing
      ● N-step Q-learning with parallel environment threads
      ● Batch thread data together for the GPU
      ● Increase virtual batch size during tree expansion for efficient computation
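
A sketch of the n-step Q-learning targets computed from each thread's rollout; the function below ignores episode termination for brevity and is an illustration, not the exact training code.

    import torch

    def n_step_targets(rewards, bootstrap_value, gamma=0.99):
        """n-step returns for one thread's rollout of length n.

        rewards: tensor of shape [n]; bootstrap_value: max_a Q(s_n, a) from the
        target network at the state after the last step.
        """
        targets = torch.empty_like(rewards)
        ret = bootstrap_value
        for t in reversed(range(len(rewards))):     # accumulate backwards from the end
            ret = rewards[t] + gamma * ret
            targets[t] = ret
        return targets

    # e.g. a 5-step rollout from one of the parallel threads
    targets = n_step_targets(torch.tensor([0., 0., 1., 0., 0.]), torch.tensor(0.5))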

  25. Grounding the Transition Model
      ● Observations
      ● Latent states
      ● Rewards
        ○ Inside true targets
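
One plausible way to write these grounding options as auxiliary losses (the notation is illustrative, not from the slides): regress predicted next observations, predicted latent states, or predicted rewards against their observed or re-encoded counterparts, and add the chosen terms to the main RL loss with small weights.

    \[
      \mathcal{L}_{\text{obs}} = \lVert \hat{x}_{t+1} - x_{t+1} \rVert^{2}, \qquad
      \mathcal{L}_{\text{latent}} = \lVert \hat{z}_{t+1} - \mathrm{enc}(x_{t+1}) \rVert^{2}, \qquad
      \mathcal{L}_{\text{reward}} = \left( \hat{r}_{t} - r_{t} \right)^{2}.
    \]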

  26. ATreeC
      ● Use the tree architecture for the policy
      ● Linear critic
      ● Train with policy gradient
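
A sketch of the ATreeC idea, reusing encoder, backup and LATENT from the illustrative TreeQN sketch above: the tree's outputs are used as policy logits, a separate linear critic provides the baseline, and both are trained with a standard actor-critic policy-gradient loss. All names and the loss weighting are illustrative assumptions.

    import torch
    import torch.nn as nn

    # ATreeC sketch, reusing `encoder`, `backup` and LATENT from the TreeQN sketch above.
    critic = nn.Linear(LATENT, 1)                  # linear critic on the encoded state

    def actor_critic_loss(obs, action, n_step_return):
        z = encoder(obs)
        logits = backup(z, depth=2)                        # tree output as policy logits
        log_prob = torch.log_softmax(logits, dim=-1)[action]
        value = critic(z).squeeze(-1)
        advantage = n_step_return - value
        policy_loss = -log_prob * advantage.detach()       # policy-gradient term
        value_loss = advantage.pow(2)                      # critic regression term
        return policy_loss + 0.5 * value_loss              # 0.5 weighting is illustrative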

  27. Results: Grounding
      ● Grounding weakly (just the reward function) works best
      ● Maybe joint training of auxiliary objectives is the wrong approach

  28. Results: Box Pushing
      ● TreeQN helps!
      ● Extra depth can help in some situations

  29. Results: Atari
      ● Good performance
      ● Makes use of depth (vs. DQN-Deep)
      ● Main benefit comes from depth 1:
        ○ Reward + value
        ○ Auxiliary loss
        ○ Parameter sharing

  30. Results: ATreeC
      ● Works: easy as a drop-in replacement
      ● Smaller benefits than for TreeQN
      ● Limited by the quality of the critic?

  31. Just for fun

  32. Interpretability
      ● Sometimes (?)
      ● Firmly on the model-free end of the spectrum
      ● Grounding is an open question
        ○ Better auxiliary tasks?
        ○ Pre-training?
        ○ Different environments?

  33. Future Work
      ● Lessons learnt for model-free RL:
        ○ Depth
        ○ Structure
        ○ Auxiliary tasks
      ● Online planning:
        ○ Need more grounded models to use more refined planning algorithms

  34. Summary
      ● Combining online planning with deep RL is a key challenge
      ● We can use a differentiable model inside a differentiable planner and train end-to-end
      ● Tree-structured models can encode a valuable inductive bias
      ● More work is needed to effectively learn and use grounded models

  35. Thank you!
