
CMP722 ADVANCED COMPUTER VISION Lecture #6: Deep Reinforcement Learning



  1. Image: StarCraft II DeepMind feature layer API CMP722 ADVANCED COMPUTER VISION Lecture #6 – Deep Reinforcement Learning Aykut Erdem // Hacettepe University // Spring 2019

  2. Illustration: William Joel Previously on CMP722 • image captioning • visual question answering • case study: neural module networks

  3. Lecture overview • case studies (and a bit of history) • formalizing reinforcement learning • policy gradient methods • temporal differences, q-learning • Disclaimer: Much of the material and slides for this lecture were borrowed from Katja Hofmann’s Deep Learning Indaba 2018 lecture on "Reinforcement Learning" 3

  4. Decision Making and Learning under Uncertainty [Slide image: a map of local eateries (Buzz, TaMaties, Java Junction, Jeff’s Place, DCM, Nca’Kos, Feathers, Vlambojant, Mirriam’s Kitchen, Otaku, Roman’s Pizza, Hutmakers), an everyday example of decision making under uncertainty] 4

  5. Reinforcement Learning (RL) • the science and engineering of decision making and learning under uncertainty • a type of machine learning that models learning from experience in a wide range of applications 5

  6. Case Studies (and a bit of history) 6

  7. RL can model a vast range of problems • Example problems that motivated RL research: Animal Learning, Games, Optimal Control 7

  8. Photo by Magda Ehlers from Pexels. Lindquist, J. (1962). "Operations of a hydrothermal electric system: A multistage decision process." Transactions of the American Institute of Electrical Engineers. Mario Pereira, Nora Campodónico, & Rafael Kelman (1998). "Long-term hydro scheduling based on stochastic models." EPSOM 98. 8

  9. Long-term consequences in optimal control Figure from: Mario Pereira, Nora Campodónico, & Rafael Kelman. "Long-term hydro scheduling based on stochastic models." EPSOM 98. 9

  10. Photo credit: https://www.flickr.com/photos/shawnzlea/261793051/ Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229. Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. II—Recent progress. IBM Journal of Research and Development, 11(6), 601–617. 10

  11. Samuel’s Checkers Player 11

  12. Photo credit: https://www.flickr.com/photos/scorius/750037290 Schultz, Wolfram, Peter Dayan, and P. Read Montague. "A neural substrate of prediction and reward" Science 275.5306 (1997): 1593-1599. 12

  13. RL as a valuable tool for modelling neurological phenomena Figure from: Schultz, Wolfram, Peter Dayan, and P. Read Montague. "A neural substrate of prediction and reward." Science 1997. 13

  14. Further Reading • White, D. J. (1985). Real applications of Markov decision processes. Interfaces, 15(6). • White, D. J. (1988). Further real applications of Markov decision processes. Interfaces, 18(5). • Maia, Tiago V., and Michael J. Frank. "From reinforcement learning models to psychiatric and neurological disorders." Nature Neuroscience 14.2 (2011). • Sutton, R. S., & Barto, A. G. (2017). Reinforcement learning: An introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html Chapters 1, 14-16 14

  15. Formalizing Reinforcement Learning 15

  16. In RL – agent interacts with an environment environment agent Photo credit: https://www.flickr.com/photos/steveonjava/8170183457 16

  17. In RL – agent interacts with an environment environment agent state s_t ∈ S Photo credit: https://www.flickr.com/photos/steveonjava/8170183457 17

  18. In RL – agent interacts with an environment action a_t ∈ A environment agent state s_t ∈ S Photo credit: https://www.flickr.com/photos/steveonjava/8170183457 18

  19. In RL – agent interacts with an environment action a_t ∈ A environment agent reward r_t ∈ ℝ state s_t ∈ S Photo credit: https://www.flickr.com/photos/steveonjava/8170183457 19

  20. In RL – agent interacts with an environment action a_t ∈ A environment agent reward r_t ∈ ℝ state s_t ∈ S Photo credit: https://www.flickr.com/photos/steveonjava/8170183457 20

  21. In RL – agent interacts with an environment action a_t ∈ A environment agent reward r_t ∈ ℝ state s_t ∈ S Photo credit: https://www.flickr.com/photos/steveonjava/8170183457 21

  22. In RL – agent interacts with an environment action a_t ∈ A environment agent acts with policy π(a|s) reward r_{t+1} ∈ ℝ state s_{t+1} ∈ S Photo credit: https://www.flickr.com/photos/steveonjava/8170183457 22

  23. In RL – agent interacts with an environment action a_t ∈ A environment with transition dynamics p(s_{t+1} | s_t, a_t) and reward function r(r_{t+1} | s_t, a_t) agent acts with policy π(a|s) reward r_{t+1} ∈ ℝ state s_{t+1} ∈ S Photo credit: https://www.flickr.com/photos/steveonjava/8170183457 23
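
Taken together, slides 16-23 describe the standard agent-environment interaction loop. Below is a minimal sketch of that loop in Python; the toy Environment class, its reset/step methods, and the random policy are invented here for illustration and are not taken from the slides:

    import random

    class Environment:
        """Toy environment: the state is an integer counter, an episode lasts 10 steps."""

        def reset(self):
            self.t = 0
            self.state = 0
            return self.state

        def step(self, action):
            # transition dynamics p(s_{t+1} | s_t, a_t): move the counter up or down
            self.state += 1 if action == 1 else -1
            self.t += 1
            # reward r_{t+1} depends on (s_t, a_t): +1 for moving up, 0 otherwise
            reward = 1.0 if action == 1 else 0.0
            done = self.t >= 10
            return self.state, reward, done

    def policy(state):
        # a uniform random policy pi(a | s) over two actions
        return random.choice([0, 1])

    env = Environment()
    state, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(state)                  # agent samples a_t ~ pi(a | s_t)
        state, reward, done = env.step(action)  # environment returns s_{t+1}, r_{t+1}
        total += reward
    print("return of this episode:", total)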

  24. Markov Decision Process (MDP) Defined by M = (S, A, p, r, γ), discount factor γ ∈ (0, 1) • Key assumption: Markov property (dynamics only depend on most recent state and action) 24

  25. Markov Decision Process (MDP) Defined by M = (S, A, p, r, γ), discount factor γ ∈ (0, 1) • Key assumption: Markov property (dynamics only depend on most recent state and action) • Define goal: Take actions that maximize (discounted) cumulative return G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1} 25
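
As a quick sanity check on the return defined above, here is a small illustrative snippet (plain Python, assuming a finite reward sequence r_{t+1}, r_{t+2}, ...) that computes G_t for a given discount factor:

    def discounted_return(rewards, gamma=0.9):
        """G_t = sum_{k>=0} gamma^k * r_{t+k+1}, for a finite reward sequence."""
        g = 0.0
        for k, r in enumerate(rewards):
            g += (gamma ** k) * r
        return g

    # rewards received after time t: r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
    print(discounted_return([1.0, 0.0, 0.0, 1.0], gamma=0.9))  # 1 + 0.9**3 = 1.729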

  26. Examples – States, Actions, Rewards 26

  27. State space • Important modelling choice: how to represent the problem? • Example: hydroelectric power control problem • Consider choices: a) Discrete states “low” and “high” reservoir level b) Coarse discretization: “0-10%” “10-20%” … “90-100%” c) Continuous states – current reservoir level (e.g., 67%) 27
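
The three options (a)-(c) on slide 27 can be made concrete with a short sketch; the reservoir level and helper names below are invented for illustration:

    def state_discrete(level):
        # (a) two discrete states: "low" vs "high" reservoir level
        return "low" if level < 0.5 else "high"

    def state_coarse(level, bins=10):
        # (b) coarse discretization into 10% buckets (index 0..9)
        return min(int(level * bins), bins - 1)

    def state_continuous(level):
        # (c) keep the raw level, e.g. 0.67 for a 67% full reservoir
        return level

    level = 0.67
    print(state_discrete(level), state_coarse(level), state_continuous(level))  # high 6 0.67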

  28. State space • Important modelling choice: how to represent the problem? • Considerations: - Is the Markov property satisfied? - (How) can prior (expert) knowledge be encoded? - Effects on optimal solution? - Effects on data efficiency? 28

  29. Mnih et al. results in Atari – a lesson in generality Screenshots from: Kurin, V., Nowozin, S., Hofmann, K., Beyer, L., & Leibe, B. (2017). The Atari Grand Challenge Dataset. http://atarigrandchallenge.com/ Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540). 29

  30. Case Study: Investigating Human Priors for Playing Video Games • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018. https://rach0012.github.io/humanRL_website/ 30

  31. Case Study: Investigating Human Priors for Playing Video Games • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018. https://rach0012.github.io/humanRL_website/ 31

  32. Case Study: Investigating Human Priors for Playing Video Games • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018. 32

  33. Case Study: Investigating Human Priors for Playing Video Games • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018. 33

  34. Case Study: Investigating Human Priors for Playing Video Games • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018. 34

  35. Case Study: Investigating Human Priors for Playing Video Games • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A, ICML 2018. 35

  36. Action space Again – important modelling choice, common: a) Discrete, e.g., on/off, which button to press (Atari) b) Continuous, e.g., how much force to apply, how quickly to accelerate c) Active research area: large, complex action spaces (e.g., combinatorial, mixed discrete/continuous, natural language) • Trade-offs include: data efficiency, generalization 36
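
A minimal sketch of what these action-space choices look like in code, with all action names and ranges invented for illustration:

    import random

    # (a) discrete actions, e.g. which button to press (Atari)
    DISCRETE_ACTIONS = ["noop", "left", "right", "fire"]
    a_discrete = random.choice(DISCRETE_ACTIONS)

    # (b) continuous action, e.g. how much force to apply, in [-1, 1]
    a_continuous = random.uniform(-1.0, 1.0)

    # (c) structured / mixed action, e.g. a discrete choice plus a magnitude
    a_mixed = (random.choice(["accelerate", "brake"]), random.uniform(0.0, 1.0))

    print(a_discrete, a_continuous, a_mixed)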

  37. A Platform for Research: TextWorld https://www.microsoft.com/en-us/research/project/textworld/ 37

  38. Rewards • Key Question: where do RL agents’ goals come from? • In some settings – natural reward signal may be available (e.g., game score in Atari) • More typically – important modelling choice with strong effects on learned solutions 38
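
As a small illustration of why reward design is a modelling choice, the sketch below contrasts a "natural" reward (change in game score) with a hand-shaped alternative; the numbers and the shaping term are invented, not taken from the slides:

    def natural_reward(prev_score, new_score):
        # "natural" signal: change in game score, as available in Atari
        return new_score - prev_score

    def shaped_reward(prev_score, new_score, distance_to_goal):
        # designer-chosen shaping: also reward progress toward some goal
        return (new_score - prev_score) + 0.1 / (1.0 + distance_to_goal)

    print(natural_reward(100, 120))        # 20
    print(shaped_reward(100, 120, 5.0))    # 20 + 0.1/6 ≈ 20.0167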

  39. Rewards For details and full video: https://blog.openai.com/faulty-reward-functions/ 39

  40. Further Reading • Sutton, R. S., & Barto, A. G. (2017). Reinforcement learning: An introduction. MIT Press, 2nd Edition. http://incompleteideas.net/book/the-book-2nd.html Chapter 3, 9.5 40

  41. RL Approaches 1: Policy Gradient Methods 41

  42. Policy Gradient: Intuition • Focus on learning a good behaviour policy 42

  43. Policy Gradient: Intuition • Example: learning in multi-armed bandit problems Photo credit: https://www.flickr.com/photos/knothing/11264853546/ 43

  44. Policy Gradient: Intuition • Example: learning in multi-armed bandit problems [Figure: four slot-machine arms, each annotated .5] Photo credit: https://www.flickr.com/photos/knothing/11264853546/ 44

  45. Policy Gradient: Intuition • Example: learning in multi-armed bandit problems [Figure: four slot-machine arms, each annotated .5] Photo credit: https://www.flickr.com/photos/knothing/11264853546/ 45

  46. Policy Gradient: Intuition • Example: learning in multi-armed bandit problems [Figure: four slot-machine arms, each annotated .5; the pulled arm loses ("Lose")] Photo credit: https://www.flickr.com/photos/knothing/11264853546/ 46
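
The bandit example on slides 43-46 can be turned into a tiny policy-gradient learner: a softmax policy over arms whose preferences are nudged toward arms that pay off. This is a REINFORCE-style sketch with invented win probabilities, not the exact algorithm from the lecture:

    import math
    import random

    ARM_WIN_PROB = [0.2, 0.5, 0.8, 0.4]   # hidden payoff probabilities (invented)
    prefs = [0.0] * 4                     # one preference per arm; policy = softmax(prefs)
    alpha = 0.1                           # step size

    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    for _ in range(5000):
        probs = softmax(prefs)
        arm = random.choices(range(4), weights=probs)[0]          # sample a ~ pi(a)
        reward = 1.0 if random.random() < ARM_WIN_PROB[arm] else 0.0
        for a in range(4):
            # gradient of log pi(arm) w.r.t. preference a is (1[a == arm] - pi(a))
            grad = (1.0 if a == arm else 0.0) - probs[a]
            prefs[a] += alpha * reward * grad                     # REINFORCE-style update

    print([round(p, 2) for p in softmax(prefs)])  # mass concentrates on the best arm (index 2)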
