CMP784 Deep Learning, Lecture #12 – Deep Reinforcement Learning. Aykut Erdem // Hacettepe University // Spring 2018. (Cover image: DeepLoco by X. B. Peng, G. Berseth & M. van de Panne)
Previously on CMP784: • Generative Adversarial Networks (GANs) • How do GANs work • Conditional GAN • Tips and Tricks • Applications (Image: Neural Face by Taehoon Kim) 2
Lecture overview • What is Reinforcement Learning? • Components of a RL problem • Markov Decision Processes • Value-Based Deep RL • Policy-Based Deep RL • Model-Based Deep RL Disclaimer: Much of the material and slides for this lecture were borrowed from — John Schulman’s talk on “Deep Reinforcement Learning: Policy Gradients and Q-Learning” — David Silver’s tutorial on “Deep Reinforcement Learning” — Lex Fridman’s MIT 6.S094 Deep Learning for Self-Driving Cars class 3
What is Reinforcement Learning? 4
What is Reinforcement Learning? [Diagram contrasting Supervised Learning, Unsupervised Learning, and Reinforcement Learning.] Slide credit: Razvan Pascanu 5
What is Reinforcement Learning? • Branch of machine learning concerned with taking sequences of actions • Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward [Diagram: the agent sends an action to the environment; the environment returns an observation and a reward.] 6
Motor Control and Robotics Robotics: • Observations: camera images, joint angles • Actions: joint torques • Rewards: stay balanced, navigate to target locations, serve and protect humans 7
Business Operations Inventory Management: • Observations: current inventory levels • Actions: number of units of each item to purchase • Rewards: profit 8
Image Captioning Hard Attention for Image Captioning: • Observations: current image window • Actions: where to look • Rewards: classification 9
Why is Go hard for computers to play? Game tree complexity = b^d. Brute force search is intractable: 1. The search space is huge 2. “Impossible” for computers to evaluate who is winning. Games are a different kind of optimization problem (min-max), but still considered to be RL: • Go (complete information, deterministic) – AlphaGo • Backgammon (complete information, stochastic) – TD-Gammon • Stratego (incomplete information, deterministic) • Poker (incomplete information, stochastic) References: Matej Moravčík et al. “DeepStack: Expert-level artificial intelligence in heads-up no-limit poker.” Science, 2 March 2017. David Silver, Aja Huang, et al. “Mastering the game of Go with deep neural networks and tree search.” Nature 529.7587 (2016), pp. 484–489. Gerald Tesauro. “Temporal difference learning and TD-Gammon.” Communications of the ACM 38.3 (1995), pp. 58–68. 10
How does RL relate to Supervised Learning? • Supervised learning: — Environment samples an input-output pair $(x_t, y_t) \sim \rho$ — Agent predicts $\hat{y}_t = f(x_t)$ — Agent receives loss $\ell(y_t, \hat{y}_t)$ — Environment asks the agent a question, and then tells her the right answer 11
How does RL relate to Supervised Learning? • Reinforcement learning: − Environment samples input $x_t \sim P(x_t \mid x_{t-1}, y_{t-1})$ § Input depends on your previous actions! − Agent predicts $\hat{y}_t = f(x_t)$ − Agent receives cost $c_t \sim P(c_t \mid x_t, \hat{y}_t)$, where $P$ is a probability distribution unknown to the agent 12
Reinforcement Learning in a nutshell RL is a general-purpose framework for decision-making: • RL is for an agent with the capacity to act • Each action influences the agent’s future state • Success is measured by a scalar reward signal • Goal: select actions to maximize future reward 14
Deep Learning in a nutshell DL is a general-purpose framework for representation learning: • Given an objective • Learn a representation that is required to achieve the objective • Directly from raw inputs • Using minimal domain knowledge 15
Deep Reinforcement Learning: AI = RL + DL We seek a single agent which can solve any human-level task • RL defines the objective • DL gives the mechanism • RL + DL = general intelligence • Examples: − Play games: Atari, poker, Go, ... − Explore worlds: 3D worlds, Labyrinth, ... − Control physical systems: manipulate, walk, swim, ... − Interact with users: recommend, optimize, personalize, ... 16
Agent and Environment • At each step t the agent: − Executes action a_t − Receives observation o_t − Receives scalar reward r_t • The environment: − Receives action a_t − Emits observation o_{t+1} − Emits scalar reward r_{t+1} 17
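To make this interaction loop concrete, here is a minimal Python sketch. The `env` and `agent` objects and their `reset`/`step`/`act` methods are assumed interfaces invented for the example (loosely Gym-style); the slide itself does not prescribe any particular API.

```python
# Minimal sketch of the agent-environment loop described above.
# Assumed interfaces: env.reset() returns an initial observation,
# env.step(action) returns (observation, reward, done), and
# agent.act(observation) returns an action.

def run_episode(env, agent, max_steps=1000):
    observation = env.reset()                          # o_1
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)                # agent executes a_t
        observation, reward, done = env.step(action)   # env emits o_{t+1}, r_{t+1}
        total_reward += reward                         # success = cumulative scalar reward
        if done:
            break
    return total_reward
```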
Example Reinforcement Learning Problem • An agent operates in an environment: Atari Breakout • An agent has the capacity to act • Each action influences the agent’s future state • Success is measured by a reward signal • Goal is to select actions to maximize future reward 18
State • Experience is a sequence of observations, actions, rewards: $o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t$ • The state is a summary of experience: $s_t = f(o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t)$ • In a fully observed environment: $s_t = f(o_t)$ 19
Major Components of an RL Agent • An RL agent may include one or more of these components: − Policy: Agent’s behavior function − Value function: How good is each state and/or action − Model: Agent’s representation of the environment 20
Policy • A policy is the agent’s behavior • It is a map from state to action: − Deterministic policy: $a = \pi(s)$ − Stochastic policy: $\pi(a \mid s) = \mathbb{P}[a \mid s]$ 21
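As a small illustration of the two policy types, the sketch below implements a deterministic policy as a state-to-action lookup and a stochastic policy as sampling from a softmax over per-state preference scores. The states, actions, and scores are made up purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic policy: a = pi(s), here a simple state -> action lookup table.
deterministic_policy = {"low_battery": "recharge", "clear_path": "move_forward"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: pi(a|s) = P[a|s], here a softmax over per-state preference scores.
preferences = {"clear_path": {"move_forward": 2.0, "turn_left": 0.5, "turn_right": 0.5}}

def act_stochastic(state):
    actions = list(preferences[state].keys())
    scores = np.array([preferences[state][a] for a in actions])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # pi(.|s) is a proper distribution over actions
    return rng.choice(actions, p=probs)      # sample a ~ pi(.|s)
```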
Value Function • A value function is a prediction of future reward − “How much reward will I get from action a in state s?” • Q-value function gives expected total reward − from state s and action a − under policy $\pi$ − with discount factor $\gamma$: $$Q^{\pi}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s, a \right]$$ • Value functions decompose into a Bellman equation: $$Q^{\pi}(s, a) = \mathbb{E}_{s', a'}\left[ r + \gamma Q^{\pi}(s', a') \mid s, a \right]$$ 22
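A quick sketch of the quantity inside the expectation: the discounted return of one sampled reward sequence. Averaging this over many rollouts that start from (s, a) and then follow π would give a Monte Carlo estimate of Q^π(s, a); the rewards and discount below are arbitrary.

```python
# Discounted return G = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1.0 + 0.9**3 * 5.0 = 4.645
```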
Optimal Value Functions • An optimal value function is the maximum achievable value: $$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a) = Q^{\pi^*}(s, a)$$ • Once we have $Q^*$ we can act optimally: $$\pi^*(s) = \arg\max_a Q^*(s, a)$$ • Optimal value maximizes over all decisions. Informally: $$Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^2 \max_{a_{t+2}} r_{t+3} + \dots = r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$$ • Formally, optimal values decompose into a Bellman equation: $$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$ 23
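To see the Bellman optimality equation at work, here is a tabular value-iteration sketch on a tiny invented MDP whose transition probabilities and rewards are given explicitly. Real RL problems do not expose these tables, so this is only meant to illustrate the backup and the greedy policy extraction.

```python
import numpy as np

# Tiny made-up MDP: 3 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 1] = 1.0; P[0, 1, 2] = 1.0
P[1, 0, 0] = 1.0; P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0                      # state 2 is absorbing
R = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [0.0, 0.0]])

# Repeatedly apply the Bellman optimality backup:
# Q*(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q*(s',a')
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    Q = R + gamma * P @ Q.max(axis=1)

print(Q)
print("greedy policy pi*(s) = argmax_a Q*(s,a):", Q.argmax(axis=1))
```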
Model [Diagram: from the stream of observations o_t, actions a_t, and rewards r_t, the agent builds an internal model of the environment.] 24
Model • Model is learnt from experience • Acts as proxy for environment • Planner interacts with model • e.g. using lookahead search 25
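As an illustration of planning with a learned model, the sketch below performs one-step lookahead: a hypothetical learned model predicts the next state and reward for each candidate action, and the planner picks the action with the best predicted one-step return. Both `model.predict` and `value` are assumptions made for this example, not a specific published method.

```python
# One-step lookahead planning with a learned model (illustrative sketch).
# Assumed interfaces: model.predict(state, action) returns a predicted
# (next_state, reward) pair learned from experience, and value(state) is
# some estimate of how good a state is (e.g. from a value network).

def plan_one_step(model, value, state, actions, gamma=0.99):
    best_action, best_score = None, float("-inf")
    for action in actions:
        next_state, reward = model.predict(state, action)   # model acts as proxy for env
        score = reward + gamma * value(next_state)           # predicted one-step return
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```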
Approaches To Reinforcement Learning • Value-based RL − Estimate the optimal value function $Q^*(s, a)$ − This is the maximum value achievable under any policy • Policy-based RL − Search directly for the optimal policy $\pi^*$ − This is the policy achieving maximum future reward • Model-based RL − Build a model of the environment − Plan (e.g. by lookahead) using model 26
Deep Reinforcement Learning • Use deep neural networks to represent − Value function − Policy − Model • Optimize loss function by stochastic gradient descent 27
Value-Based Deep RL 28
Q-Networks • Represent the value function by a Q-network with weights w: $Q(s, a, \mathbf{w}) \approx Q^*(s, a)$ [Diagram: two architectures. One network takes (s, a) as input and outputs a single value Q(s, a, w); the other takes only s and outputs Q(s, a_1, w), ..., Q(s, a_m, w), one value per action.] 29
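A minimal PyTorch sketch of the second architecture in the diagram: the network takes a state vector and outputs one Q-value per action, so a single forward pass scores all m actions. The state dimensionality, layer sizes, and action count are placeholders.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action: s -> [Q(s,a_1,w), ..., Q(s,a_m,w)]."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),     # one output head per discrete action
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)                     # a batch with one (made-up) state
q_values = q_net(state)                       # shape (1, 2): Q(s, a, w) for each action
greedy_action = q_values.argmax(dim=1)        # acting greedily w.r.t. the Q-network
```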
Q-Learning • Optimal Q-values should obey the Bellman equation: $$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$ • Treat the right-hand side $r + \gamma \max_{a'} Q(s', a', \mathbf{w})$ as a target • Minimize MSE loss by stochastic gradient descent: $$l = \left( r + \gamma \max_{a'} Q(s', a', \mathbf{w}) - Q(s, a, \mathbf{w}) \right)^2$$ • Converges to $Q^*$ using a table lookup representation • But diverges using neural networks due to: − Correlations between samples − Non-stationary targets 30
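The sketch below turns this loss into one stochastic gradient step on a single transition (s, a, r, s'), using a small stand-in Q-network. The Bellman target is computed without gradients so that only Q(s, a, w) is differentiated; all tensors and sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

gamma = 0.99
# Small stand-in Q-network: 4-dim state, 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One (made-up) transition (s, a, r, s').
s = torch.randn(1, 4)
a = torch.tensor([0])
r = torch.tensor([1.0])
s_next = torch.randn(1, 4)

# Bellman target r + gamma * max_a' Q(s', a', w), treated as a fixed regression target.
with torch.no_grad():
    target = r + gamma * q_net(s_next).max(dim=1).values

prediction = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a, w)
loss = F.mse_loss(prediction, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```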
Deep Q-Networks (DQN): Experience Replay • To remove correlations, build a data-set from the agent’s own experience: store the transitions $s_1, a_1, r_2, s_2$; $s_2, a_2, r_3, s_3$; $s_3, a_3, r_4, s_4$; ...; $s_t, a_t, r_{t+1}, s_{t+1}$ as tuples $(s, a, r, s')$ • Sample experiences from the data-set and apply the update: $$l = \left( r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-) - Q(s, a, \mathbf{w}) \right)^2$$ • To deal with non-stationarity, the target parameters $\mathbf{w}^-$ are held fixed 31
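A sketch of the two ingredients on this slide: a replay buffer of (s, a, r, s') tuples sampled uniformly to break correlations, and a separate target network whose parameters w^- are frozen between periodic syncs. Network sizes and hyperparameters are placeholders, not the values used in the DQN paper.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

gamma, batch_size, sync_every = 0.99, 32, 1000

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)             # parameters w^- held fixed between syncs
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)         # stores (s, a, r, s') tuples

def store(s, a, r, s_next):
    """s and s_next are 1-D state tensors, a is an int action, r a float reward."""
    replay_buffer.append((s, a, r, s_next))

def train_step(step):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)    # decorrelated minibatch
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch])
    s_next = torch.stack([b[3] for b in batch])

    with torch.no_grad():                               # target uses frozen w^-
        target = r + gamma * target_net(s_next).max(dim=1).values

    prediction = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(prediction, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % sync_every == 0:                          # periodically refresh w^-
        target_net.load_state_dict(q_net.state_dict())
```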
Deep Reinforcement Learning in Atari [Diagram: the agent observes the game screen as state s_t, takes joystick action a_t, and receives the change in game score as reward r_t.] Human-level control through deep reinforcement learning, V. Mnih et al. Nature 518:529–533, 2015. 32
DQN in Atari • End-to-end learning of values Q(s, a) from pixels s • Input state s is a stack of raw pixels from the last 4 frames • Output is Q(s, a) for 18 joystick/button positions • Reward is the change in score for that step • Network architecture and hyperparameters fixed across all games 33
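Below is a sketch of a DQN-style convolutional Q-network for this setting, assuming the usual preprocessing to 4 stacked 84x84 grayscale frames. The layer shapes follow the architecture reported in the Nature paper, but treat this as an approximate illustration rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    """Q-network over a stack of the last 4 preprocessed 84x84 frames."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                              # Q(s,a) per joystick/button position
        )

    def forward(self, frames):      # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(frames))

q_net = AtariDQN(n_actions=18)
frames = torch.rand(1, 4, 84, 84)   # a dummy stack of 4 preprocessed frames
print(q_net(frames).shape)          # torch.Size([1, 18])
```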
DQN Results in Atari 34