
Deep Reinforcement Learning Building Blocks
Arjun Chandra, Research Scientist, Telenor Research / Telenor-NTNU AI Lab
arjun.chandra@telenor.com | @boelger | 8 November 2017
https://join.slack.com/t/deep-rl-tutorial/signup

The Plan. The Problem.


  1. Our toy problem as a lookup table: one row per state (cells 1–9, plus the terminal home cell) and one column per action (N, S, E, W), with every entry initialised to 0.

  2. The same lookup table drawn on the problem itself: cells 1–9 laid out as a 3×3 grid with a terminal home cell next to cell 7; each cell carries its four action values, all still 0.

  3. The reward structure: a move into 7/home gives +10, a move out of bounds gives −5, a move into cell 5 gives −10, and a move into any other cell (except 5 and 7) gives −1.
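One way to make this concrete is the small Python sketch below. It is illustrative only: the function name reward_for and the HOME constant are hypothetical, not from the slides.

    HOME = "home"  # terminal cell next to cell 7 in the toy grid

    def reward_for(landed_in, out_of_bounds):
        # Reward for a single move, following the table on slide 3.
        if out_of_bounds:
            return -5    # bumped into the edge of the grid
        if landed_in in (7, HOME):
            return 10    # reached 7/home
        if landed_in == 5:
            return -10   # the bad cell
        return -1        # any other cell costs a step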

  4. Let's fix α = 0.1 (learning rate) and γ = 0.5 (discount). The Q-table is still all zeros.

  5. The Q-learning update rule: Q(s, a) ← Q(s, a) + α ( r_s^a + γ max_{a'} Q(s', a') − Q(s, a) ), with α = 0.1, γ = 0.5 and, say, an ε-greedy behaviour policy. Episode 1 begins...
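As a minimal sketch (not the presenter's code), the same update plus an ε-greedy action choice might look like this in Python, with the Q-table stored as a dictionary keyed by (state, action). The ε value of 0.1 is an assumption; only α and γ are fixed by the slides.

    import random
    from collections import defaultdict

    ACTIONS = ["N", "S", "E", "W"]
    ALPHA, GAMMA, EPSILON = 0.1, 0.5, 0.1   # alpha, gamma from slide 4; epsilon assumed

    Q = defaultdict(float)                  # Q[(state, action)], zero-initialised like the lookup table

    def q_update(s, a, r, s_next, done):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a));
        # terminal transitions contribute no bootstrap term.
        target = r if done else r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    def epsilon_greedy(s):
        # Behaviour policy: explore with probability epsilon, otherwise act greedily.
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda b: Q[(s, b)])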

  6. Episode 1, first move: the move earns reward −1 and the entry for that state-action pair is about to be updated (shown as "?"). With all values at zero, the update is 0 + 0.1 × (−1 + 0.5 × 0 − 0) = −0.1. (Each of these slides repeats the update rule, α = 0.1, γ = 0.5, and the full Q-table over the grid; only the changed entries are noted here.)

  7. That entry now reads −0.1; everything else is still 0.

  8. (Same snapshot; the next move is about to be taken.)

  9. The next move goes out of bounds, reward −5: the pending update is 0 + 0.1 × (−5 + 0.5 × 0 − 0) = −0.5.

  10. That entry now reads −0.5.

  11. (Same snapshot; the episode continues.)

  12. Another move earns reward −1; the pending update is again 0.1 × (−1) = −0.1.

  13. That entry now reads −0.1.

  14. (Same snapshot.)

  15. A move into cell 5 earns reward −10; the pending update is 0.1 × (−10) = −1.

  16. That entry now reads −1.

  17. (Same snapshot.)

  18. Another move earns reward −1; the pending update is 0.1 × (−1) = −0.1.

  19. That entry now reads −0.1.

  20. (Same snapshot.)

  21. Finally a move reaches 7/home, earning reward +10; the pending update for that entry is 0.1 × 10 = 1.

  22. That entry now reads 1. Episode 1 ends.
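Stepping back, the loop the slides are walking through by hand could be sketched as below, reusing q_update and epsilon_greedy from the earlier sketch and assuming a hypothetical environment object env with reset() and step() methods (the slides do not define one).

    def run_episodes(env, n_episodes=500):
        # Act epsilon-greedily and back the Q-table up after every single step.
        for _ in range(n_episodes):
            s = env.reset()                       # e.g. a random starting cell
            done = False
            while not done:
                a = epsilon_greedy(s)
                s_next, r, done = env.step(s, a)  # rewards as on slide 3
                q_update(s, a, r, s_next, done)
                s = s_next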

  23. Let's work out the next episode, starting at state 4: go WEST and then SOUTH. How does the table change? (The table starts as it was at the end of episode 1.)

  24. Going WEST from 4 leads out of bounds (reward −5), so Q(4, W) becomes 0 + 0.1 × (−5 + 0.5 × 0 − 0) = −0.5; going SOUTH reaches 7/home (reward +10), and Q(4, S) becomes 1. The rest of the table is unchanged.

  25. And the next episode, starting at state 3: go WEST -> SOUTH -> WEST -> SOUTH.

  26. Working through those four moves in turn: Q(3, W) becomes −0.1, Q(2, S) becomes 0.1 × (−10) = −1, Q(5, W) becomes 0.1 × (−1 + 0.5 × 1) = −0.05, and Q(4, S) grows from 1 to 1 + 0.1 × (10 − 1) = 1.9. Over time, the values will converge to the optimal ones!

  27. What we just saw was a few episodes of Q-learning: values update towards the value of the optimal policy, because the target comes from the value of the assumed next best action. This is off-policy learning.

  28. SARSA learning? Values update towards the value of the current policy: the target comes from the value of the action actually taken next. This is on-policy learning.
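For contrast, a minimal SARSA sketch under the same assumptions as the Q-learning sketch above (same hypothetical Q table, constants, and epsilon_greedy helper); only the target changes, and the next action must be chosen before the backup.

    def sarsa_update(s, a, r, s_next, a_next, done):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)),
        # where a' is the action the behaviour policy actually takes next.
        target = r if done else r + GAMMA * Q[(s_next, a_next)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

Because the target uses Q[(s_next, a_next)] rather than a max over actions, the values learned reflect the ε-greedy behaviour itself, which is exactly the on-policy vs. off-policy distinction on slides 27 and 28.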

  29. Q vs. SARSA on a gridworld example (ε = 0.1, γ = 1.0): the data Q-learning learns from is not generated by its target policy, while SARSA's data is generated by its target policy. Image by Andreas Tille (own work) [GFDL (www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (www.creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons. Example credit: Travis DeWolf, https://studywolf.wordpress.com/ and https://git.io/vFBvv

  30. Problem Decomposition: nested sub-problems; the solution to a sub-problem informs the solution to the whole problem.

  31. Bellman Expectation Backup: a system of linear equations whose solution is the value of a given policy. [Backup diagrams: from s with value v(s), the policy picks a with value q(s, a); a reward r and a transition to s' follow, and then a' with q(s', a'). The value of a node is the sum over paths of P(path) × Value(path).]
      v_π(s) = Σ_a π(a|s) ( r_s^a + γ Σ_{s'} P^a_{ss'} v_π(s') )
      q_π(s, a) = r_s^a + γ Σ_{s'} P^a_{ss'} Σ_{a'} π(a'|s') q_π(s', a')
      These are the Bellman expectation equations under a given policy.
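Turning the expectation equation into an iterative update gives policy evaluation. The sketch below assumes a tabular setting with a known model P[s][a] given as a list of (probability, next_state, reward) triples (the expected reward r_s^a folded into the transitions) and a stochastic policy pi[s][a]; these names and the data layout are illustrative, not from the slides.

    def policy_evaluation(states, actions, P, pi, gamma=0.5, tol=1e-6):
        # Sweep the Bellman expectation backup until the values stop changing.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(
                    pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in actions
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                return V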

  32. Bellman Optimality Backup: a system of non-linear equations whose solution is the value of the optimal policy. [Backup diagrams as before, but taking a max over actions instead of averaging under the policy; again the value of a node is Σ P(path) × Value(path).]
      v_*(s) = max_a ( r_s^a + γ Σ_{s'} P^a_{ss'} v_*(s') )
      q_*(s, a) = r_s^a + γ Σ_{s'} P^a_{ss'} max_{a'} q_*(s', a')
      These are the Bellman optimality equations under the optimal policy.

  33. Value Based

  34. Dynamic Programming: using the Bellman equations as iterative updates. [Worked example: a 2×2 grid with cells 1–4 and home at cell 4, actions N/S/E/W, and move rewards mirroring the toy problem (−1 per step, −5 out of bounds, −10 for the bad cell, +10 for reaching home). What's best to do?]

  35. (The same 2×2 example, with the iterative updates applied.)
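As an iterative-update sketch (value iteration, i.e. turning the Bellman optimality equation into an assignment), reusing the hypothetical P[s][a] model format from the policy-evaluation sketch above:

    def value_iteration(states, actions, P, gamma=0.5, tol=1e-6):
        # Repeat V(s) <- max_a sum_{s'} P(s'|s,a) * (r + gamma * V(s')) until convergence.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in actions
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Read off what's best to do: the greedy policy with respect to V.
        best = {
            s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            for s in states
        }
        return V, best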
