CMP722 ADVANCED COMPUTER VISION Lecture #6 – Deep Reinforcement Learning Aykut Erdem // Hacettepe University // Spring 2019 (Image: StarCraft II DeepMind feature layer API)
Previously on CMP722 • image captioning • visual question answering • case study: neural module networks (Illustration: William Joel)
Lecture overview • case studies (and a bit of history) • formalizing reinforcement learning • policy gradient methods • temporal differences, q-learning • Disclaimer: Much of the material and slides for this lecture were borrowed from Katja Hofmann’s Deep Learning Indaba 2018 lecture on "Reinforcement Learning"
Decision Making and Learning under Uncertainty (Illustration: choosing among many restaurants, e.g., Buzz TaMaties, Java Junction, Jeff’s Place, DCM, Nca’Kos, Feathers, Vlambojant, Mirriam’s Kitchen, Otaku, Roman’s Pizza, Hutmakers)
Reinforcement Learning (RL) • the science and engineering of decision making and learning under uncertainty • a type of machine learning that models learning from experience in a wide range of applications
Case Studies (and a bit of history)
RL can model a vast range of problems • Example problems that motivated RL research: Animal Learning, Games, Optimal Control
Photo by Magda Ehlers from Pexels. Lindquist, J. (1962). "Operations of a hydrothermal electric system: A multistage decision process." Transactions of the American Institute of Electrical Engineers. Mario Pereira, Nora Campodónico, & Rafael Kelman (1998). "Long-term hydro scheduling based on stochastic models." EPSOM 98.
Long-term consequences in optimal control Figure from: Mario Pereira, Nora Campodónico, & Rafael Kelman. "Long-term hydro scheduling based on stochastic models." EPSOM 98.
Photo credit: https://www.flickr.com/photos/shawnzlea/261793051/ Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229. Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. II—Recent progress. IBM Journal of Research and Development, 11(6), 601–617.
Samuel’s Checkers Player
Photo credit: https://www.flickr.com/photos/scorius/750037290 Schultz, Wolfram, Peter Dayan, and P. Read Montague. "A neural substrate of prediction and reward." Science 275.5306 (1997): 1593–1599.
RL as a valuable tool for modelling neurological phenomena Figure from: Schultz, Wolfram, Peter Dayan, and P. Read Montague. "A neural substrate of prediction and reward." Science 1997.
Further Reading • White, D. J. (1985). Real applications of Markov decision processes. Interfaces, 15(6). • White, D. J. (1988). Further real applications of Markov decision processes. Interfaces, 18(5). • Maia, Tiago V., and Michael J. Frank. "From reinforcement learning models to psychiatric and neurological disorders." Nature Neuroscience 14.2 (2011). • Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction. MIT Press, 2nd edition. http://incompleteideas.net/book/the-book-2nd.html Chapters 1, 14–16
Formalizing Reinforcement Learning
In RL – agent interacts with an environment
• At each time step t, the agent observes state s_t ∈ S and takes action a_t ∈ A, acting according to its policy π(a|s)
• The environment responds with reward r_{t+1} ∈ ℝ and next state s_{t+1} ∈ S
• The environment is characterized by transition dynamics p(s_{t+1} | s_t, a_t) and reward function r(s_{t+1} | s_t, a_t)
Photo credit: https://www.flickr.com/photos/steveonjava/8170183457
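To make the loop concrete, here is a minimal sketch of the interaction in code, assuming an environment with `reset`/`step` methods in the style of the OpenAI Gym API; `env` and `policy` are hypothetical stand-ins, not part of the lecture material:

```python
def run_episode(env, policy, max_steps=1000):
    """Roll out one episode and collect the rewards received."""
    state = env.reset()                    # observe the initial state s_0
    rewards = []
    for t in range(max_steps):
        action = policy(state)             # sample a_t ~ pi(a | s_t)
        state, reward, done, info = env.step(action)  # receive r_{t+1} and s_{t+1}
        rewards.append(reward)
        if done:                           # stop at a terminal state
            break
    return rewards
```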
Markov Decision Process (MDP)
• Defined by M = (S, A, p, r, γ), with discount factor γ ∈ (0,1)
• Key assumption: Markov property (dynamics only depend on most recent state and action)
• Define goal: take actions that maximize the (discounted) cumulative return
G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}
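As a quick worked example of the return G_t, a sketch that computes the discounted sum of a reward sequence (plain Python; the γ value and the rewards are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum over k of gamma^k * r_{t+k+1}, here for t = 0."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Identical rewards are worth less the further into the future they arrive:
print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```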
Examples – States, Actions, Rewards
State space • Important modelling choice: how to represent the problem? • Example: hydroelectric power control problem • Consider the choices (sketched in code below): a) Discrete states “low” and “high” reservoir level b) Coarse discretization: “0-10%”, “10-20%”, …, “90-100%” c) Continuous states – the current reservoir level (e.g., 67%)
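A small sketch of how the same reservoir reading could be encoded under each of the three choices; the 50% threshold and the ten bins are assumed for illustration, not taken from the lecture:

```python
def discrete_state(level):
    """Choice (a): two discrete states; the 50% threshold is an assumed example."""
    return "low" if level < 0.5 else "high"

def coarse_state(level, n_bins=10):
    """Choice (b): bin index 0..9 for 0-10%, 10-20%, ..., 90-100%."""
    return min(int(level * n_bins), n_bins - 1)

def continuous_state(level):
    """Choice (c): keep the raw continuous reservoir level."""
    return level

level = 0.67  # current reservoir level, 67%
print(discrete_state(level), coarse_state(level), continuous_state(level))
# -> high 6 0.67
```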
State space • Important modelling choice: how to represent the problem? • Considerations: - Is the Markov property satisfied? - (How) can prior (expert) knowledge be encoded? - Effects on optimal solution? - Effects on data efficiency?
Mnih et al. results in Atari – a lesson in generality Screenshots from: Kurin, V., Nowozin, S., Hofmann, K., Beyer, L., & Leibe, B. (2017). The Atari Grand Challenge Dataset. http://atarigrandchallenge.com/ Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540).
Case Study: Investigating Human Priors for Playing Video Games • Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., & Efros, A. A. ICML 2018. https://rach0012.github.io/humanRL_website/
Action space • Again, an important modelling choice. Common choices (sketched below): a) Discrete, e.g., on/off, which button to press (Atari) b) Continuous, e.g., how much force to apply, how quickly to accelerate c) Active research area: large, complex action spaces (e.g., combinatorial, mixed discrete/continuous, natural language) • Trade-offs include data efficiency and generalization
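A hedged sketch of how choices (a) and (b) are commonly declared in code, using the `gym.spaces` convention; the 18 Atari actions and the force bounds are illustrative assumptions:

```python
import numpy as np
from gym import spaces

# (a) Discrete: e.g., which of the 18 Atari joystick/button combinations to press
atari_actions = spaces.Discrete(18)

# (b) Continuous: e.g., how much force to apply along two axes
#     (the [-1, 1] bounds and the 2-d shape are assumed examples)
force = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

print(atari_actions.sample())  # a random action in {0, ..., 17}
print(force.sample())          # a random 2-d vector in [-1, 1]^2
```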
A Platform for Research: TextWorld https://www.microsoft.com/en-us/research/project/textworld/
Rewards • Key Question: where do RL agents’ goals come from? • In some settings – natural reward signal may be available (e.g., game score in Atari) • More typically – important modelling choice with strong effects on learned solutions
Rewards • For details and full video: https://blog.openai.com/faulty-reward-functions/
Further Reading • Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction. MIT Press, 2nd edition. http://incompleteideas.net/book/the-book-2nd.html Chapter 3 and Section 9.5
RL Approaches 1: Policy Gradient Methods
Policy Gradient: Intuition • Focus on learning a good behaviour policy
Policy Gradient: Intuition • Example: Learning in multi-armed bandit problems (Illustration: a row of slot machines, each arm annotated with an initial value of .5, updated as pulls win or lose) Photo credit: https://www.flickr.com/photos/knothing/11264853546/
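A minimal sketch of the policy-gradient idea on this bandit example: a softmax policy over four arms, with preferences nudged towards arms that pay out (a REINFORCE-style update; the payout probabilities and step size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_win_prob = np.array([0.2, 0.5, 0.7, 0.4])  # assumed payout rates, unknown to the agent
prefs = np.zeros(4)   # action preferences; uniform policy at the start
alpha = 0.1           # step size

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(5000):
    pi = softmax(prefs)
    arm = rng.choice(4, p=pi)                          # sample an arm from the policy
    reward = float(rng.random() < true_win_prob[arm])  # 1 on a win, 0 on a loss
    grad_log_pi = -pi                                  # grad of log pi(arm) wrt prefs ...
    grad_log_pi[arm] += 1.0                            # ... is one_hot(arm) - pi for softmax
    prefs += alpha * reward * grad_log_pi              # REINFORCE: step along r * grad log pi

print(softmax(prefs))  # probability mass concentrates on the best arm (index 2)
```

Arms that pay out more often receive positive updates more often, so the policy gradually shifts probability mass towards them without ever modelling the environment explicitly.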