CMP722 ADVANCED COMPUTER VISION Lecture #6 – Deep Reinforcement Learning Aykut Erdem // Hacettepe University // Spring 2019 (Image: StarCraft II DeepMind feature layer API)
Previously on CMP722 • image captioning • visual question answering • case study: neural module networks (Illustration: William Joel)
Lecture overview • case studies (and a bit of history) • formalizing reinforcement learning • policy gradient methods • temporal differences, q-learning • Disclaimer: Much of the material and slides for this lecture were borrowed from Katja Hofmann’s Deep Learning Indaba 2018 lecture on "Reinforcement Learning"
Decision Making and Learning under Uncertainty (Illustration: choosing among many restaurants, e.g., Buzz TaMaties, Java Junction, Jeff’s Place, DCM, Nca’Kos, Feathers, Vlambojant, Mirriam’s Kitchen, Otaku, Roman’s Pizza, Hutmakers)
Reinforcement Learning (RL) • the science and engineering of decision making and learning under uncertainty • a type of machine learning that models learning from experience in a wide range of applications
Case Studies (and a bit of history)
RL can model a vast range of problems • Example problems that motivated RL research: Animal Learning, Games, Optimal Control
Photo by Magda Ehlers from Pexels. Lindquist, J. (1962). "Operations of a hydrothermal electric system: A multistage decision process." Transactions of the American Institute of Electrical Engineers. Mario Pereira, Nora Campodónico, & Rafael Kelman (1998). "Long-term hydro scheduling based on stochastic models." EPSOM 98.
Long-term consequences in optimal control Figure from: Mario Pereira, Nora Campodónico, & Rafael Kelman. "Long-term hydro scheduling based on stochastic models." EPSOM 98.
Photo credit: https://www.flickr.com/photos/shawnzlea/261793051/ Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229. Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. II—Recent progress. IBM Journal of Research and Development, 11(6), 601–617.
Samuel’s Checkers Player
Photo credit: https://www.flickr.com/photos/scorius/750037290 Schultz, Wolfram, Peter Dayan, and P. Read Montague. "A neural substrate of prediction and reward." Science 275.5306 (1997): 1593–1599.
RL as a valuable tool for modelling neurological phenomena Figure from: Schultz, Wolfram, Peter Dayan, and P. Read Montague. "A neural substrate of prediction and reward." Science 1997.
Further Reading • White, D. J. (1985). Real applications of Markov decision processes. Interfaces, 15(6). • White, D. J. (1988). Further real applications of Markov decision processes. Interfaces, 18(5). • Maia, Tiago V., and Michael J. Frank. "From reinforcement learning models to psychiatric and neurological disorders." Nature Neuroscience 14.2 (2011). • Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction. MIT Press, 2nd edition. http://incompleteideas.net/book/the-book-2nd.html Chapters 1, 14–16
Formalizing Reinforcement Learning
In RL – agent interacts with an environment
• At each time step t, the agent observes state s_t ∈ S and takes action a_t ∈ A, acting according to its policy π(a|s)
• The environment responds with reward r_{t+1} ∈ ℝ and next state s_{t+1} ∈ S
• The environment is characterized by transition dynamics p(s_{t+1} | s_t, a_t) and reward function r(s_{t+1} | s_t, a_t)
Photo credit: https://www.flickr.com/photos/steveonjava/8170183457
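To make the loop concrete, here is a minimal sketch of the interaction in code, assuming an environment with `reset`/`step` methods in the style of the OpenAI Gym API; `env` and `policy` are hypothetical stand-ins, not part of the lecture material:

```python
def run_episode(env, policy, max_steps=1000):
    """Roll out one episode and collect the rewards received."""
    state = env.reset()                    # observe the initial state s_0
    rewards = []
    for t in range(max_steps):
        action = policy(state)             # sample a_t ~ pi(a | s_t)
        state, reward, done, info = env.step(action)  # receive r_{t+1} and s_{t+1}
        rewards.append(reward)
        if done:                           # stop at a terminal state
            break
    return rewards
```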
Markov Decision Process (MDP)
• Defined by M = (S, A, p, r, γ), with discount factor γ ∈ (0,1)
• Key assumption: Markov property (dynamics only depend on most recent state and action)
• Define goal: take actions that maximize the (discounted) cumulative return
G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}
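As a quick worked example of the return G_t, a sketch that computes the discounted sum of a reward sequence (plain Python; the γ value and the rewards are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum over k of gamma^k * r_{t+k+1}, here for t = 0."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Identical rewards are worth less the further into the future they arrive:
print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```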
Examples – States, Actions, Rewards
State space • Important modelling choice: how to represent the problem? • Example: hydroelectric power control problem • Consider the choices (sketched in code below): a) Discrete states “low” and “high” reservoir level b) Coarse discretization: “0-10%”, “10-20%”, …, “90-100%” c) Continuous states – the current reservoir level (e.g., 67%)
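A small sketch of how the same reservoir reading could be encoded under each of the three choices; the 50% threshold and the ten bins are assumed for illustration, not taken from the lecture:

```python
def discrete_state(level):
    """Choice (a): two discrete states; the 50% threshold is an assumed example."""
    return "low" if level < 0.5 else "high"

def coarse_state(level, n_bins=10):
    """Choice (b): bin index 0..9 for 0-10%, 10-20%, ..., 90-100%."""
    return min(int(level * n_bins), n_bins - 1)

def continuous_state(level):
    """Choice (c): keep the raw continuous reservoir level."""
    return level

level = 0.67  # current reservoir level, 67%
print(discrete_state(level), coarse_state(level), continuous_state(level))
# -> high 6 0.67
```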
State space • Important modelling choice: how to represent the problem? • Considerations: - Is the Markov property satisfied? - (How) can prior (expert) knowledge be encoded? - Effects on optimal solution? - Effects on data efficiency?
Mnih et al. results in Atari – a lesson in generality Screenshots from: Kurin, V., Nowozin, S., Hofmann, K., Beyer, L., & Leibe, B. (2017). The Atari Grand Challenge Dataset. http://atarigrandchallenge.com/ Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540).
Case Study: Investigating Human Priors for Playing Video Games • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A. ICML 2018. https://rach0012.github.io/humanRL_website/
Action space • Again, an important modelling choice. Common choices (sketched below): a) Discrete, e.g., on/off, which button to press (Atari) b) Continuous, e.g., how much force to apply, how quickly to accelerate c) Active research area: large, complex action spaces (e.g., combinatorial, mixed discrete/continuous, natural language) • Trade-offs include data efficiency and generalization
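A hedged sketch of how choices (a) and (b) are commonly declared in code, using the `gym.spaces` convention; the 18 Atari actions and the force bounds are illustrative assumptions:

```python
import numpy as np
from gym import spaces

# (a) Discrete: e.g., which of the 18 Atari joystick/button combinations to press
atari_actions = spaces.Discrete(18)

# (b) Continuous: e.g., how much force to apply along two axes
#     (the [-1, 1] bounds and the 2-d shape are assumed examples)
force = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

print(atari_actions.sample())  # a random action in {0, ..., 17}
print(force.sample())          # a random 2-d vector in [-1, 1]^2
```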
A Platform for Research: TextWorld https://www.microsoft.com/en-us/research/project/textworld/
Rewards • Key Question: where do RL agents’ goals come from? • In some settings – natural reward signal may be available (e.g., game score in Atari) • More typically – important modelling choice with strong effects on learned solutions
Rewards • For details and full video: https://blog.openai.com/faulty-reward-functions/
Further Reading • Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction. MIT Press, 2nd edition. http://incompleteideas.net/book/the-book-2nd.html Chapter 3 and Section 9.5
RL Approaches 1: Policy Gradient Methods
Policy Gradient: Intuition • Focus on learning a good behaviour policy
Policy Gradient: Intuition • Example: Learning in multi-armed bandit problems (Illustration: a row of slot machines, each arm annotated with an initial value of .5, updated as pulls win or lose) Photo credit: https://www.flickr.com/photos/knothing/11264853546/
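A minimal sketch of the policy-gradient idea on this bandit example: a softmax policy over four arms, with preferences nudged towards arms that pay out (a REINFORCE-style update; the payout probabilities and step size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_win_prob = np.array([0.2, 0.5, 0.7, 0.4])  # assumed payout rates, unknown to the agent
prefs = np.zeros(4)   # action preferences; uniform policy at the start
alpha = 0.1           # step size

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(5000):
    pi = softmax(prefs)
    arm = rng.choice(4, p=pi)                          # sample an arm from the policy
    reward = float(rng.random() < true_win_prob[arm])  # 1 on a win, 0 on a loss
    grad_log_pi = -pi                                  # grad of log pi(arm) wrt prefs ...
    grad_log_pi[arm] += 1.0                            # ... is one_hot(arm) - pi for softmax
    prefs += alpha * reward * grad_log_pi              # REINFORCE: step along r * grad log pi

print(softmax(prefs))  # probability mass concentrates on the best arm (index 2)
```

Arms that pay out more often receive positive updates more often, so the policy gradually shifts probability mass towards them without ever modelling the environment explicitly.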