Model-Free Control (Reinforcement Learning) and Deep Learning
Marc G. Bellemare, Google Brain (Montréal)
[Timeline slide: 1956, 1992, 2015, 2016]
THE ARCADE LEARNING ENVIRONMENT (BELLEMARE ET AL., 2013)
• Screen: 160 × 210 pixels, 60 frames/second
• 18 actions
• Reward: change in game score
• 33,600 (discrete) dimensions
• Up to 108,000 decisions per episode (30 minutes)
• 60+ games: heterogeneous dynamical systems
DEEP LEARNING: AN AI SUCCESS STORY
M := ⟨X, A, R, P, γ⟩
Q*(x, a) = r(x, a) + γ E_P max_{a'∈A} Q*(x', a')
Q̂ = Π T^π Q̂
‖Q̂ − Q^π‖_D ≤ (1 / (1 − γ)) ‖Π Q^π − Q^π‖_D
Theory vs. practice
WHERE HAS MODEL-FREE CONTROL BEEN SO SUCCESSFUL?
• Complex dynamical systems
• Black-box simulators
• High-dimensional state spaces
• Long time horizons
• Opponent / adversarial element
PRACTICAL CONSIDERATIONS
• Are simulations reasonably cheap? → model-free
• Is the notion of “state” complex? → model-free
• Is there partial observability? → maybe model-free
• Can the state space be enumerated? → value iteration
• Is there an explicit model available? → model-based
OUTLINE OF TALK: ideal case → practical case
WHAT'S REINFORCEMENT LEARNING, ANYWAY?
“All goals and purposes … can be thought of as the maximization of some value function” – Sutton & Barto (2017, in press)
• At each step t, the agent:
  • observes a state x_t
  • takes an action a_t
  • receives a reward r_t
THREE LEARNING PROBLEMS IN ONE
• Stochastic approximation
• Policy evaluation / optimal control
• Function approximation
BACKGROUND
• Formalized as a Markov Decision Process: M := ⟨X, A, R, P, γ⟩
• R, P: reward and transition functions
• γ: discount factor
• A trajectory is a sequence of interactions with the environment: x_1, a_1, r_1, x_2, a_2, …
• Policy π: a probability distribution over actions, a_t ∼ π(·|x_t); if deterministic, a_t = π(x_t)
• Transition function: x_{t+1} ∼ P(·|x_t, a_t)
• Value function Q^π(x, a): total discounted reward
    Q^π(x, a) = E_{P,π}[ Σ_{t=0}^∞ γ^t r(x_t, a_t) | x_0 = x, a_0 = a ]
• As a vector in the space of value functions: Q^π ∈ Q
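As a concrete illustration, here is a minimal sketch (plain Python; the reward sequence is a hypothetical example) of the discounted return whose expectation Q^π(x, a) represents:

```python
# Sketch: discounted return of a single sampled trajectory.
# Q^pi(x, a) is the *expectation* of this quantity over trajectories
# that start in state x with action a and then follow pi.

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one trajectory."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

print(discounted_return([0.0, 0.0, 1.0, 0.0, 5.0], gamma=0.9))
```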
• “Maximize the value function”: find Q*(x, a) := max_π Q^π(x, a)
• Bellman's equation:
    Q^π(x, a) = E_{P,π}[ Σ_{t=0}^∞ γ^t r(x_t, a_t) | x_0 = x, a_0 = a ]
              = r(x, a) + γ E_{P,π} Q^π(x', a')
• Optimality equation:
    Q*(x, a) = r(x, a) + γ E_P max_{a'∈A} Q*(x', a')
BELLMAN OPERATOR
    T^π Q(x, a) := r(x, a) + γ E_{x'∼P, a'∼π} Q(x', a')
• The Bellman operator is a γ-contraction: ‖T^π Q − Q^π‖_∞ ≤ γ ‖Q − Q^π‖_∞
• Fixed point: Q^π = T^π Q^π, so the iterates Q_{k+1} := T^π Q_k converge to Q^π
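A minimal sketch of policy evaluation by repeated application of T^π on a small tabular MDP. The arrays P, R and the uniform policy below are placeholder data, not from the talk; by the contraction property the iterates converge to Q^π:

```python
import numpy as np

# Sketch: Q_{k+1} <- T^pi Q_k on a small tabular MDP with placeholder P, R, pi.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, x']
R = rng.normal(size=(n_states, n_actions))                        # r(x, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy

Q = np.zeros((n_states, n_actions))
for k in range(200):
    # (T^pi Q)(x, a) = r(x, a) + gamma * E_{x'~P, a'~pi} Q(x', a')
    V = (pi * Q).sum(axis=1)     # E_{a'~pi(.|x')} Q(x', a') for each x'
    Q = R + gamma * P @ V        # contraction step toward Q^pi

print(Q)
```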
BELLMAN OPTIMALITY OPERATOR
    T Q(x, a) := r(x, a) + γ E_{x'∼P} max_{a'∈A} Q(x', a')
• Also a γ-contraction (beware! different proof): ‖T Q − Q*‖_∞ ≤ γ ‖Q − Q*‖_∞
• Fixed point is the optimal value function: Q* = T Q* ≥ Q^π, and Q_{k+1} := T Q_k converges to Q*
MODEL-BASED ALGORITHMS
1. Value iteration: Q_{k+1}(x, a) ← T Q_k(x, a) = r(x, a) + γ E_P max_{a'∈A} Q_k(x', a')
2. Policy iteration:
   a. π_k = arg max_π T^π Q_k(x, a)
   b. Q_{k+1}(x, a) ← Q^{π_k}(x, a)
3. Optimistic policy iteration: Q_{k+1}(x, a) ← (T^{π_k})^m Q_k(x, a) = T^{π_k} ⋯ T^{π_k} Q_k(x, a)  (m times)
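A minimal value-iteration sketch in the same tabular setting; P and R are again placeholder arrays rather than anything from the talk:

```python
import numpy as np

# Sketch of value iteration: Q_{k+1} <- T Q_k on a small tabular MDP.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, x']
R = rng.normal(size=(n_states, n_actions))                        # r(x, a)

Q = np.zeros((n_states, n_actions))
for k in range(500):
    # (T Q)(x, a) = r(x, a) + gamma * E_P max_{a'} Q(x', a')
    Q = R + gamma * P @ Q.max(axis=1)

greedy_policy = Q.argmax(axis=1)  # greedy w.r.t. Q ~= Q* gives a near-optimal policy
print(Q, greedy_policy)
```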
POLICY ITERATION
a. π_k = arg max_π T^π Q_k(x, a)   (optimal control)
b. Q_{k+1}(x, a) ← Q^{π_k}(x, a)   (policy evaluation)
MODEL-FREE REINFORCEMENT LEARNING
• Typically no access to P, R
• Two options:
  • Learn a model (not in this talk)
  • Model-free: learn Q^π or Q* directly from samples
• Model-based: Q_{k+1} := T^π Q_k requires the expectation E_P over next states.
• Model-free: the agent only observes sampled interactions: it sees state x_t, takes action a_t, receives reward r_t, and the environment transitions via x_{t+1} ∼ P(·|x_t, a_t).
MODEL-FREE RL: SYNCHRONOUS UPDATES
• For all x, a, sample x' ∼ P(·|x, a), a' ∼ π(·|x')
• The SARSA algorithm:
    Q_{t+1}(x, a) ← (1 − α_t) Q_t(x, a) + α_t T̂^π Q_t(x, a)
                  = (1 − α_t) Q_t(x, a) + α_t ( r(x, a) + γ Q_t(x', a') )
                  = Q_t(x, a) + α_t ( r(x, a) + γ Q_t(x', a') − Q_t(x, a) )   ← TD-error δ
• α_t ∈ [0, 1) is a step-size (sequence)
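A minimal sketch of this update for a single (x, a) pair, assuming x' and a' have already been sampled and Q is stored as a NumPy table:

```python
# Sketch: one sampled SARSA update, given x' ~ P(.|x, a) and a' ~ pi(.|x').
# Q is assumed to be a 2-D NumPy array indexed as Q[state, action].
def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma=0.99):
    td_error = r + gamma * Q[x_next, a_next] - Q[x, a]   # delta
    Q[x, a] += alpha * td_error   # equals (1 - alpha) Q(x,a) + alpha (r + gamma Q(x',a'))
    return Q
```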
MODEL-FREE RL: Q-LEARNING
• The Q-Learning algorithm takes a max at each iteration:
    Q_{t+1}(x, a) ← (1 − α_t) Q_t(x, a) + α_t ( r(x, a) + γ max_{a'∈A} Q_t(x', a') )
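The corresponding sketch for Q-Learning differs only in bootstrapping from the greedy next action rather than a' ∼ π (same assumptions as above):

```python
# Sketch: one sampled Q-Learning update; Q is a 2-D NumPy array Q[state, action].
def q_learning_update(Q, x, a, r, x_next, alpha, gamma=0.99):
    td_error = r + gamma * Q[x_next].max() - Q[x, a]   # max over next actions
    Q[x, a] += alpha * td_error
    return Q
```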
• Both converge under the Robbins-Monro conditions
• Not trivial! Q-Learning interleaves learning problems: policy evaluation, optimal control, and stochastic approximation
ASYNCHRONOUS UPDATES
• The asynchronous case: learn from trajectories x_1, a_1, r_1, x_2, a_2, … ∼ π, P
• Apply an update at each step:
    Q(x_t, a_t) ← Q(x_t, a_t) + α_t ( r_t + γ Q(x_{t+1}, a_{t+1}) − Q(x_t, a_t) )
• This is the setting we usually deal with
• Convergence is even more delicate
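A sketch of this asynchronous setting: applying the SARSA update online along one trajectory. The `env` object with `reset`/`step` methods and the ε-greedy behaviour policy are illustrative assumptions:

```python
import numpy as np

def run_sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Asynchronous SARSA: update Q at every step of one trajectory.
    `env` is assumed to expose reset() -> x and step(a) -> (x_next, r, done)."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]

    def eps_greedy(x):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(Q[x].argmax())

    x = env.reset()
    a = eps_greedy(x)
    done = False
    while not done:
        x_next, r, done = env.step(a)
        a_next = eps_greedy(x_next)
        target = r if done else r + gamma * Q[x_next, a_next]
        Q[x, a] += alpha * (target - Q[x, a])   # TD update along the trajectory
        x, a = x_next, a_next
    return Q
```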
OPEN QUESTIONS / AREAS OF ACTIVE RESEARCH
• Rates of convergence [1]
• Variance reduction [2]
• Convergence guarantees for multi-step methods [3, 4]
• Off-policy learning: control from a fixed behaviour policy [3, 4]

[1] Konda and Tsitsiklis (2004)
[2] Azar et al., Speedy Q-Learning (2011)
[3] Harutyunyan, Bellemare, Stepleton, Munos (2016)
[4] Munos, Stepleton, Harutyunyan, Bellemare (2016)
[Diagram recap: stochastic approximation, policy evaluation, optimal control, function approximation]
(VALUE) FUNCTION APPROXIMATION
• Parametrize the value function: Q^π(x, a) ≈ Q(x, a, θ)
• Learning now involves a projection step Π:
    θ_{k+1} ← arg min_θ ‖ T^π Q(x, a, θ_k) − Q(x, a, θ) ‖_D
• This leads to additional, compounding error
• Can cause divergence
SOME CLASSIC RESULTS [1]
• Linear approximation: Q^π(x, a) ≈ θ^⊤ φ(x, a)
• SARSA converges to a Q̂ satisfying Q̂ = Π T^π Q̂, with
    ‖ Q̂ − Q^π ‖_D ≤ (1 / (1 − γ)) ‖ Π Q^π − Q^π ‖_D
• Q-Learning may diverge!

[1] Tsitsiklis and Van Roy (1997)
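A sketch of the linear case: semi-gradient SARSA with Q(x, a) ≈ θ^⊤ φ(x, a). The feature map below (state vector copied into a per-action block) is an arbitrary placeholder, not a construction from the talk:

```python
import numpy as np

N_FEATURES, N_ACTIONS = 8, 2  # placeholder sizes

def phi(x, a):
    """Hypothetical featurizer: the state vector x (length N_FEATURES)
    placed in the block corresponding to action a."""
    feats = np.zeros(N_FEATURES * N_ACTIONS)
    feats[a * N_FEATURES:(a + 1) * N_FEATURES] = x
    return feats

def linear_sarsa_step(theta, x, a, r, x_next, a_next, alpha=0.01, gamma=0.99):
    # Semi-gradient SARSA: grad_theta Q(x, a, theta) = phi(x, a) in the linear case.
    q = theta @ phi(x, a)
    q_next = theta @ phi(x_next, a_next)
    td_error = r + gamma * q_next - q
    theta += alpha * td_error * phi(x, a)
    return theta
```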
OPEN QUESTIONS / AREAS OF ACTIVE RESEARCH
• Convergent, linear-time optimal control [1]
• Exploration under function approximation [2]
• Convergence of multi-step extensions [3]

[1] Maei et al. (2009)
[2] Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos (2016)
[3] Touati et al. (2017)
Outline recap: ideal case → practical case
[Timeline recap: 1956, 1992, 2015, 2016; linear approximation Q^π(x, a) ≈ θ^⊤ φ(x, a)]
DEEP LEARNING (slide adapted from Ali Eslami)
DEEP LEARNING
[Diagram: network computing features φ_θ(x, a), loss L(θ), and gradient ∇_θ L(θ). Graphic by Volodymyr Mnih]
[Figure: Mnih et al., 2015]
DEEP REINFORCEMENT LEARNING
• Represent the value function with a Q-network: Q(x, a, θ)
• Objective function: mean squared error
    L(θ) := E[ ( r + γ max_{a'∈A} Q(x', a', θ) − Q(x, a, θ) )² ]   (the term r + γ max_{a'∈A} Q(x', a', θ) is the target)
• Q-Learning gradient:
    ∇_θ L(θ) = E[ ( r + γ max_{a'∈A} Q(x', a', θ) − Q(x, a, θ) ) ∇_θ Q(x, a, θ) ]

Based on material by David Silver
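A minimal sketch of this loss and gradient step, assuming PyTorch; the network size, optimizer, and batch layout are illustrative choices, not the DQN architecture of Mnih et al.:

```python
import torch
import torch.nn as nn

# Sketch: small Q-network and the squared-TD-error loss (placeholder sizes).
N_STATE_DIMS, N_ACTIONS = 4, 2
q_net = nn.Sequential(nn.Linear(N_STATE_DIMS, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_learning_step(batch, gamma=0.99):
    # batch: float states x, x_next; long actions a; float rewards r and done flags (0/1).
    x, a, r, x_next, done = batch
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():   # the target term is not differentiated (semi-gradient)
        target = r + gamma * (1 - done) * q_net(x_next).max(dim=1).values
    loss = ((target - q_xa) ** 2).mean()   # L(theta)
    optimizer.zero_grad()
    loss.backward()    # backprop yields the Q-Learning gradient from the slide
    optimizer.step()
    return loss.item()
```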
STABILITY ISSUES
• Naive Q-Learning oscillates or diverges:
  1. Data is sequential: successive samples are non-i.i.d.
  2. Policy changes rapidly with the Q-values: may oscillate; extreme data distributions
  3. Scale of rewards and Q-values is unknown: naive gradients can be large; unstable backpropagation

Based on material by David Silver
DEEP Q-NETWORKS
1. Use experience replay: break correlations, learn from past policies
2. Use a target network to keep target values fixed: avoid oscillations
3. Clip rewards: provide robust gradients

Based on material by David Silver
EXPERIENCE REPLAY (equivalent to planning with an empirical model)
• Build a dataset from the agent's experience:
  • Take actions according to an ε-greedy policy
  • Store (x, a, r, x', a') in replay memory D
• Sample transitions from D and perform an asynchronous update:
    L(θ) = E_{x,a,r,x',a' ∼ D}[ ( r + γ max_{a'∈A} Q(x', a', θ) − Q(x, a, θ) )² ]
• Effectively avoids correlations within trajectories

Based on material by David Silver
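A sketch of such a replay memory with uniform sampling, in plain Python; the tuple layout follows the slide and the capacity is an arbitrary assumption:

```python
import random
from collections import deque

# Sketch: a uniform replay memory D. Capacity is an arbitrary choice.
class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def store(self, x, a, r, x_next, a_next):
        self.buffer.append((x, a, r, x_next, a_next))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlations within trajectories.
        return random.sample(self.buffer, batch_size)
```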
TARGET Q-NETWORK
• To avoid oscillations, fix the parameters of the target in the loss function
• Compute targets with respect to old parameters θ⁻:
    r + γ max_{a'∈A} Q(x', a', θ⁻)
• As before, minimize the squared loss:
    L(θ) = E_D[ ( r + γ max_{a'∈A} Q(x', a', θ⁻) − Q(x, a, θ) )² ]
• Periodically update the target network: θ⁻ ← θ   (similar to policy iteration!)

Based on material by David Silver
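A sketch of the target-network mechanism, again assuming PyTorch; the network shape and the update period are illustrative choices:

```python
import copy
import torch
import torch.nn as nn

# Sketch: a frozen copy theta^- of the online parameters theta (placeholder sizes).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)   # theta^- <- theta (initially)

def td_target(r, x_next, done, gamma=0.99):
    # Targets use the frozen parameters theta^-, so they stay fixed between syncs.
    with torch.no_grad():
        return r + gamma * (1 - done) * target_net(x_next).max(dim=1).values

def maybe_sync_target(step, period=10_000):
    if step % period == 0:   # periodically: theta^- <- theta
        target_net.load_state_dict(q_net.state_dict())
```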