Model Based Reinforcement Learning - Oriol Vinyals (DeepMind)

  1. Model Based Reinforcement Learning Oriol Vinyals (DeepMind) @OriolVinyalsML May 2018 Stanford University

  2. The Reinforcement Learning Paradigm OBSERVATIONS GOAL Agent Environment ACTIONS

  3. The Reinforcement Learning Paradigm Maximize the return (long-term reward), given state $x_t$, action $a_t$, and reward $r_t$: $R_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'} = r_t + \gamma R_{t+1}$, with discount $\gamma \in [0,1]$. The policy is an action distribution: $\pi = P(a_t \mid x_t, \ldots)$. Measure success with the value function: $V^\pi(x_t) = \mathbb{E}_\pi[R_t]$
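
A minimal sketch of the discounted return defined above; the reward trajectory and discount value are illustrative, not taken from the talk.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1} for every step of a trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a short trajectory where only the final step is rewarded.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```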

  4. A Classic Dilemma: “Old school” AI Researcher vs. Deep Learning Researcher

  5. A Classic Dilemma

  6. A Classic Dilemma: Model Based RL vs. Deep RL

  7. (Deep) Model Based RL = Model Based RL + Deep RL. Two approaches covered in this talk: Imagination Augmented Agents (deep generative model + deep RL) and Learning Model Based Planning from Scratch

  8. Imagination Augmented Agents (NIPS17) Joint work with: Theo Weber*, Sebastien Racaniere*, David Reichert*, Razvan Pascanu*, Yujia Li*, Lars Buesing, Arthur Guez, Danilo Rezende, Adrià Puigdomènech Badia, Peter Battaglia, Nicolas Heess, David Silver, Daan Wierstra

  9. Intro to I2A ● We have good environment models ⇒ can we use them to solve tasks? ● How do we do model-based RL and deal with imperfect simulators? ● In this particular approach, we treat the generative model as an oracle of possible futures. ⇒ How do we interpret those ‘warnings’?
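
A minimal sketch of the core idea of treating a learned model as an oracle of possible futures: roll the model forward under a cheap rollout policy and hand the imagined trajectory to the agent. The names (env_model, rollout_policy) and the toy stand-ins are placeholders, not DeepMind's implementation.

```python
import random

def imagine_rollout(env_model, rollout_policy, state, depth=5):
    """Roll a learned environment model forward under a rollout policy,
    producing an imagined trajectory that a rollout encoder can summarise."""
    trajectory = []
    for _ in range(depth):
        action = rollout_policy(state)
        state, reward = env_model(state, action)   # the model is only a guess
        trajectory.append((state, action, reward))
    return trajectory

# Toy stand-ins so the sketch runs: a random policy and a 1-D "model".
toy_policy = lambda s: random.choice([-1, 0, 1])
toy_model = lambda s, a: (s + a, -abs(s + a))
print(imagine_rollout(toy_model, toy_policy, state=0))
```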

  10. Imagination Augmented Agents (I2A)

  11. Imagination Planning Networks (IPNs)

  13. Sokoban environment ● Procedurally generated ● Irreversible decisions

  14. Sokoban environment

  15. Video: success and failure examples

  16. What happens if our model is bad?

  17. Mental retries with I2A

  19. Mental retries with I2A Solves 95% of levels!

  20. Imagination efficiency Imagination is expensive ⇒ can we limit the number of times we ask the agent to imagine a transition in order to solve a level? In other words, can we guide the search more efficiently than current methods?

  21. One model, many tasks

  22. MetaMiniPacman Five events: ● Do nothing ● Eat a small pill ● Eat a power pill ● Eat a ghost ● Be eaten by a ghost We assign each event a different reward, creating five different games: ● ‘Regular’ ● ‘Rush’ (eat big pills as fast as possible) ● ‘Hunt’ (eat ghosts; pills are fine too) ● ‘Ambush’ (eat ghosts, avoid everything else) ● ‘Avoid’ (everything hurts)
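
A minimal sketch of the multi-task setup: the environment and model are shared, and only the reward assigned to each event changes per task. The numerical values below are illustrative, not the ones used in the talk.

```python
# Events observed by the agent and the reward each task assigns to them.
EVENTS = ["nothing", "small_pill", "power_pill", "eat_ghost", "eaten_by_ghost"]

TASK_REWARDS = {
    "regular": [0, 1, 2, 5, -10],
    "rush":    [0, 0, 10, 0, -10],
    "hunt":    [0, 1, 2, 10, -10],
    "ambush":  [0, -1, -1, 10, -10],
    "avoid":   [0, -1, -1, -5, -10],
}

def reward(task, event):
    """Look up the reward this task assigns to an environment event."""
    return TASK_REWARDS[task][EVENTS.index(event)]

print(reward("hunt", "eat_ghost"))   # 10
```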

  23. Results: ‘Avoid’ and ‘Ambush’ tasks

  24. Learning model-based planning from scratch Joint work with: Razvan Pascanu*, Yujia Li*, Theo Weber*, Sebastien Racaniere*, David Reichert*, Lars Buesing, Arthur Guez, Danilo Rezende, Adrià Puigdomènech Badia, Peter Battaglia, Nicolas Heess, David Silver, Daan Wierstra

  25. Prior work: Spaceship Task v1.0 Hamrick, Ballard, Pascanu, Vinyals, Heess, Battaglia (2017). Metacontrol for Adaptive Imagination-Based Optimization, ICLR 2017. ● Propel the spaceship to its home planet (white) by choosing the thruster force (direction and magnitude) ● Other planets’ (grey) gravitational fields influence the trajectory ● A continuous, contextual bandit problem
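
A minimal sketch of the kind of dynamics involved: a 2-D point-mass ship, given a single thruster impulse (v1.0 is a one-shot, contextual-bandit-style decision), drifts under the planets' inverse-square gravity. The constants and planet layout are illustrative, not the task's actual parameters.

```python
import numpy as np

def simulate(pos, vel, thrust, planets, masses, dt=0.05, steps=100, G=1.0):
    """Integrate a point-mass ship under the planets' gravity after one
    initial thrust, returning the final position."""
    vel = vel + thrust                      # one-off thruster impulse
    for _ in range(steps):
        accel = np.zeros(2)
        for p, m in zip(planets, masses):
            d = p - pos
            accel += G * m * d / (np.linalg.norm(d) ** 3 + 1e-6)
        vel = vel + dt * accel
        pos = pos + dt * vel
    return pos

final = simulate(pos=np.zeros(2), vel=np.zeros(2), thrust=np.array([1.0, 0.5]),
                 planets=[np.array([3.0, 0.0])], masses=[2.0])
print(final)
```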

  26. Prior work: Imagination-based metacontroller ● Restricted to bandit problems

  27. This paper: Imagination-based Planner (IBP)

  28. Spaceship Task v2.0: Multiple actions ● Use the thruster multiple times ● More difficult than Spaceship Task v1.0: 1. Pay for fuel 2. Multiplicative control noise ● Opens up new strategies, such as: 1. Move away from challenging gravity wells 2. Apply the thruster toward the target
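
A minimal sketch of how the v2.0 action model could differ from v1.0: each thruster firing costs fuel and is corrupted by multiplicative control noise before the gravity dynamics above are applied. The function name and constants are illustrative assumptions.

```python
import numpy as np

def apply_thrust(vel, thrust, fuel_cost_per_unit=0.1, noise_scale=0.05,
                 rng=np.random.default_rng(0)):
    """One thruster firing in the v2.0 task: pay for fuel and suffer
    multiplicative control noise (constants are illustrative)."""
    noisy_thrust = thrust * (1.0 + noise_scale * rng.standard_normal(2))
    fuel_cost = fuel_cost_per_unit * np.linalg.norm(thrust)
    return vel + noisy_thrust, -fuel_cost   # new velocity, reward penalty

vel, cost = apply_thrust(np.zeros(2), np.array([1.0, 0.5]))
print(vel, cost)
```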

  33. Imagination-based Planner ● Imagination can be: ○ Current step only: imagine only from the current state

  34. Imagination-based Planner ● Imagination can be: ○ Current step only: imagine only from the current state ○ Chained steps only: imagine a sequence of actions

  35. Imagination-based Planner ● Imagination can be: ○ Current step only: imagine only from the current state ○ Chained steps only: imagine a sequence of actions ○ Imagination tree: manager chooses whether to use the current (root) state, or chain imagined states together
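
A minimal sketch of the three imagination modes described above, under assumed placeholder interfaces (model, policy, manager); it is not the paper's implementation.

```python
def imagine(mode, model, policy, manager, root_state, budget=3):
    """Three ways to spend an imagination budget:
    '1-step' - always imagine from the current (root) state,
    'n-step' - chain imagined states into one sequence,
    'tree'   - let a manager pick which imagined state to expand next."""
    imagined = [root_state]
    for _ in range(budget):
        if mode == "1-step":
            base = root_state
        elif mode == "n-step":
            base = imagined[-1]               # continue the chain
        else:  # "tree"
            base = manager(imagined)          # manager chooses the node to expand
        next_state, _reward = model(base, policy(base))
        imagined.append(next_state)
    return imagined

# Toy stand-ins so the sketch runs.
toy_model = lambda s, a: (s + a, 0.0)
toy_policy = lambda s: 1
toy_manager = lambda states: min(states)      # e.g. expand the most promising node
print(imagine("tree", toy_model, toy_policy, toy_manager, root_state=0))
```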

  36. Imagination-based Planner

  38. Real trials with 3 actions, comparing 0, 1, and 2 imaginations per action. With more imagination the agent forms more complex plans: 1. Moves away from complex gravity 2. Slows its velocity 3. Moves to the target

  39. Different strategies for exploration: 1-step, n-step, and imagination trees

  40. Results

  44. Imagination-based Planner How does it work? (learnable components are bold) 1. On each step, inputs: ○ State, s_t: the planet and ship positions, etc. ○ Imagined state, s'_t: internal state belief ○ History, h_t: summary of the planning steps so far

  45. Imagination-based Planner How does it work? (learnable components are bold) 1. On each step, inputs: ○ State, s_t: the planet and ship positions, etc. ○ Imagined state, s'_t: internal state belief ○ History, h_t: summary of the planning steps so far 2. Controller policy returns an action, a_t 3. Manager routes the action to the world or to imagination, r_t

  46. Imagination-based Planner How does it work? (learnable components are bold) 1. On each step, inputs: ○ State, s_t: the planet and ship positions, etc. ○ Imagined state, s'_t: internal state belief ○ History, h_t: summary of the planning steps so far 2. Controller policy returns an action, a_t 3. Manager routes the action to the world or to imagination, r_t 4. If the route, r_t, indicates: a. “Imagination”: the model predicts an imagined state, s'_{t+1}

  47. Imagination-based Planner How does it work? (learnable components are bold) 1. On each step, inputs: ○ State, s_t: the planet and ship positions, etc. ○ Imagined state, s'_t: internal state belief ○ History, h_t: summary of the planning steps so far 2. Controller policy returns an action, a_t 3. Manager routes the action to the world or to imagination, r_t 4. If the route, r_t, indicates: a. “Imagination”: the model predicts an imagined state, s'_{t+1} b. “World”: executing the action in the world yields the new state, s_{t+1}

  48. Imagination-based Planner How does it work? (learnable components are bold) 1. On each step, inputs: ○ State, s_t: the planet and ship positions, etc. ○ Imagined state, s'_t: internal state belief ○ History, h_t: summary of the planning steps so far 2. Controller policy returns an action, a_t 3. Manager routes the action to the world or to imagination, r_t 4. If the route, r_t, indicates: a. “Imagination”: the model predicts an imagined state, s'_{t+1} b. “World”: executing the action in the world yields the new state, s_{t+1} 5. Memory aggregates the new information into an updated history, h_{t+1}
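
A minimal sketch of one planning/acting step as described above, with placeholder modules (controller, manager, model, memory) and a toy environment; the names and interfaces are assumptions, not the paper's code.

```python
def ibp_step(state, imagined_state, history,
             controller, manager, model, memory, env):
    """One step of the planner loop sketched above."""
    action = controller(state, imagined_state, history)          # 2. propose an action
    route = manager(state, imagined_state, history)               # 3. "imagine" or "world"
    if route == "imagine":
        imagined_state, reward = model(imagined_state, action)    # 4a. imagined next state
        next_state = state                                         # the real world is untouched
    else:
        next_state, reward = env.step(action)                      # 4b. act in the real world
        imagined_state = next_state                                # reset imagination to reality
    history = memory(history, (state, action, route, reward))      # 5. update the history
    return next_state, imagined_state, history

# Toy stand-ins so the sketch runs (a 1-D world where the goal is state 0).
class ToyEnv:
    def __init__(self):
        self.s = 5
    def step(self, a):
        self.s += a
        return self.s, -abs(self.s)

controller = lambda s, si, h: -1 if si > 0 else 1
manager = lambda s, si, h: "imagine" if len(h) % 2 == 0 else "world"
model = lambda si, a: (si + a, -abs(si + a))
memory = lambda h, record: h + [record]

env, s, si, h = ToyEnv(), 5, 5, []
for _ in range(6):
    s, si, h = ibp_step(s, si, h, controller, manager, model, memory, env)
print(s, si, len(h))
```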

  49. Imagination-based Planner How is it trained? Three distinct, concurrent, on-policy training loops

  50. Imagination-based Planner How is it trained? Three distinct, concurrent, on-policy training loops 1. Model/Imagination (interaction network) Supervised: (s_t, a_t) → s_{t+1}

  51. Imagination-based Planner How is it trained? Three distinct, concurrent, on-policy training loops 1. Model/Imagination (interaction network) Supervised: (s_t, a_t) → s_{t+1} 2. Controller/Memory (MLP/LSTM) SVG: the reward, u_t, is assumed to be |s_{t+1} - s*|^2. The model, imagination, memory, and controller are differentiable; the manager's discrete choices, r_t, are treated as constants.

  52. Imagination-based Planner How is it trained? Three distinct, concurrent, on-policy training loops 1. Model/Imagination (interaction network) Supervised: (s_t, a_t) → s_{t+1} 2. Controller/Memory (MLP/LSTM) SVG: the reward, u_t, is assumed to be |s_{t+1} - s*|^2. The model, imagination, memory, and controller are differentiable; the manager's discrete choices, r_t, are treated as constants. 3. Manager: a finite-horizon MDP (MLP Q-net, stochastic) REINFORCE: return = (reward + computation cost) = (u_t + c_t)
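
A schematic of the three concurrent training signals described above, written with placeholder tensors; this is a sketch of the general recipe, not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

def model_loss(pred_next_state, true_next_state):
    # 1. Model/imagination: supervised regression of s_{t+1} from (s_t, a_t).
    return F.mse_loss(pred_next_state, true_next_state)

def controller_loss(imagined_final_state, goal_state):
    # 2. Controller/memory: SVG-style objective |s_{t+1} - s*|^2, differentiated
    #    through the differentiable model/imagination/memory/controller chain,
    #    while the manager's discrete routing choices are held fixed.
    return ((imagined_final_state - goal_state) ** 2).sum()

def manager_loss(route_log_prob, task_reward, compute_cost):
    # 3. Manager: REINFORCE, where the return combines the task reward and the
    #    cost of computation and is treated as a constant w.r.t. the policy.
    return -route_log_prob * (task_reward + compute_cost)

# Toy tensors so the sketch runs end to end.
pred, target = torch.randn(2, requires_grad=True), torch.randn(2)
goal = torch.zeros(2)
loss = model_loss(pred, target) + controller_loss(pred, goal) \
     + manager_loss(torch.tensor(-0.7), task_reward=1.0, compute_cost=-0.1)
loss.backward()
```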

  53. Bonus Paper: MCTSnet Joint work with: Arthur Guez*, Theo Weber*, Ioannis Antonoglou, Karen Simonyan, Daan Wierstra, Remi Munos, David Silver
