Model-Free Control (Reinforcement Learning) and Deep Learning
Marc G. Bellemare, Google Brain (Montréal)
[Timeline slide: 1956, 1992, 2015, 2016]
THE ARCADE LEARNING ENVIRONMENT (BELLEMARE ET AL., 2013)
• Screen: 160 × 210 pixels, 60 frames/second
• 18 actions
• Reward: change in game score
• 33,600 (discrete) dimensions
• Up to 108,000 decisions per episode (30 minutes)
• 60+ games: heterogeneous dynamical systems
DEEP LEARNING: AN AI SUCCESS STORY
M := ⟨X, A, R, P, γ⟩
Q*(x, a) = r(x, a) + γ E_P max_{a'∈A} Q*(x', a')
Q̂ = Π T^π Q̂
‖Q̂ − Q^π‖_D ≤ (1 / (1 − γ)) ‖Π Q^π − Q^π‖_D
Theory vs. practice
WHERE HAS MODEL-FREE CONTROL BEEN SO SUCCESSFUL?
• Complex dynamical systems
• Black-box simulators
• High-dimensional state spaces
• Long time horizons
• Opponent / adversarial element
PRACTICAL CONSIDERATIONS
• Are simulations reasonably cheap? → model-free
• Is the notion of “state” complex? → model-free
• Is there partial observability? → maybe model-free
• Can the state space be enumerated? → value iteration
• Is there an explicit model available? → model-based
OUTLINE OF TALK: ideal case → practical case
WHAT'S REINFORCEMENT LEARNING, ANYWAY?
“All goals and purposes … can be thought of as the maximization of some value function” – Sutton & Barto (2017, in press)
• At each step t, the agent:
  • observes a state x_t
  • takes an action a_t
  • receives a reward r_t
THREE LEARNING PROBLEMS IN ONE
• Stochastic approximation
• Policy evaluation / optimal control
• Function approximation
BACKGROUND
• Formalized as a Markov Decision Process: M := ⟨X, A, R, P, γ⟩
• R, P: reward and transition functions
• γ: discount factor
• A trajectory is a sequence of interactions with the environment: x_1, a_1, r_1, x_2, a_2, …
• Policy π: a probability distribution over actions, a_t ∼ π(·|x_t); if deterministic, a_t = π(x_t)
• Transition function: x_{t+1} ∼ P(·|x_t, a_t)
• Value function Q^π(x, a): total discounted reward
    Q^π(x, a) = E_{P,π}[ Σ_{t=0}^∞ γ^t r(x_t, a_t) | x_0 = x, a_0 = a ]
• As a vector in the space of value functions: Q^π ∈ Q
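As a concrete illustration, here is a minimal sketch (plain Python; the reward sequence is a hypothetical example) of the discounted return whose expectation Q^π(x, a) represents:

```python
# Sketch: discounted return of a single sampled trajectory.
# Q^pi(x, a) is the *expectation* of this quantity over trajectories
# that start in state x with action a and then follow pi.

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one trajectory."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

print(discounted_return([0.0, 0.0, 1.0, 0.0, 5.0], gamma=0.9))
```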
• “Maximize the value function”: find Q*(x, a) := max_π Q^π(x, a)
• Bellman's equation:
    Q^π(x, a) = E_{P,π}[ Σ_{t=0}^∞ γ^t r(x_t, a_t) | x_0 = x, a_0 = a ]
              = r(x, a) + γ E_{P,π} Q^π(x', a')
• Optimality equation:
    Q*(x, a) = r(x, a) + γ E_P max_{a'∈A} Q*(x', a')
BELLMAN OPERATOR
    T^π Q(x, a) := r(x, a) + γ E_{x'∼P, a'∼π} Q(x', a')
• The Bellman operator is a γ-contraction: ‖T^π Q − Q^π‖_∞ ≤ γ ‖Q − Q^π‖_∞
• Fixed point: Q^π = T^π Q^π, so the iterates Q_{k+1} := T^π Q_k converge to Q^π
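A minimal sketch of policy evaluation by repeated application of T^π on a small tabular MDP. The arrays P, R and the uniform policy below are placeholder data, not from the talk; by the contraction property the iterates converge to Q^π:

```python
import numpy as np

# Sketch: Q_{k+1} <- T^pi Q_k on a small tabular MDP with placeholder P, R, pi.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, x']
R = rng.normal(size=(n_states, n_actions))                        # r(x, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy

Q = np.zeros((n_states, n_actions))
for k in range(200):
    # (T^pi Q)(x, a) = r(x, a) + gamma * E_{x'~P, a'~pi} Q(x', a')
    V = (pi * Q).sum(axis=1)     # E_{a'~pi(.|x')} Q(x', a') for each x'
    Q = R + gamma * P @ V        # contraction step toward Q^pi

print(Q)
```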
BELLMAN OPTIMALITY OPERATOR
    T Q(x, a) := r(x, a) + γ E_{x'∼P} max_{a'∈A} Q(x', a')
• Also a γ-contraction (beware! different proof): ‖T Q − Q*‖_∞ ≤ γ ‖Q − Q*‖_∞
• Fixed point is the optimal value function: Q* = T Q* ≥ Q^π, and Q_{k+1} := T Q_k converges to Q*
MODEL-BASED ALGORITHMS
1. Value iteration: Q_{k+1}(x, a) ← T Q_k(x, a) = r(x, a) + γ E_P max_{a'∈A} Q_k(x', a')
2. Policy iteration:
   a. π_k = arg max_π T^π Q_k(x, a)
   b. Q_{k+1}(x, a) ← Q^{π_k}(x, a)
3. Optimistic policy iteration: Q_{k+1}(x, a) ← (T^{π_k})^m Q_k(x, a) = T^{π_k} ⋯ T^{π_k} Q_k(x, a)  (m times)
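A minimal value-iteration sketch in the same tabular setting; P and R are again placeholder arrays rather than anything from the talk:

```python
import numpy as np

# Sketch of value iteration: Q_{k+1} <- T Q_k on a small tabular MDP.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, x']
R = rng.normal(size=(n_states, n_actions))                        # r(x, a)

Q = np.zeros((n_states, n_actions))
for k in range(500):
    # (T Q)(x, a) = r(x, a) + gamma * E_P max_{a'} Q(x', a')
    Q = R + gamma * P @ Q.max(axis=1)

greedy_policy = Q.argmax(axis=1)  # greedy w.r.t. Q ~= Q* gives a near-optimal policy
print(Q, greedy_policy)
```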
POLICY ITERATION
a. π_k = arg max_π T^π Q_k(x, a)   (optimal control)
b. Q_{k+1}(x, a) ← Q^{π_k}(x, a)   (policy evaluation)
MODEL-FREE REINFORCEMENT LEARNING
• Typically no access to P, R
• Two options:
  • Learn a model (not in this talk)
  • Model-free: learn Q^π or Q* directly from samples
• Model-based: Q_{k+1} := T^π Q_k requires the expectation E_P over next states.
• Model-free: the agent only observes sampled interactions: it sees state x_t, takes action a_t, receives reward r_t, and the environment transitions via x_{t+1} ∼ P(·|x_t, a_t).
MODEL-FREE RL: SYNCHRONOUS UPDATES
• For all x, a, sample x' ∼ P(·|x, a), a' ∼ π(·|x')
• The SARSA algorithm:
    Q_{t+1}(x, a) ← (1 − α_t) Q_t(x, a) + α_t T̂^π Q_t(x, a)
                  = (1 − α_t) Q_t(x, a) + α_t ( r(x, a) + γ Q_t(x', a') )
                  = Q_t(x, a) + α_t ( r(x, a) + γ Q_t(x', a') − Q_t(x, a) )   ← TD-error δ
• α_t ∈ [0, 1) is a step-size (sequence)
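A minimal sketch of this update for a single (x, a) pair, assuming x' and a' have already been sampled and Q is stored as a NumPy table:

```python
# Sketch: one sampled SARSA update, given x' ~ P(.|x, a) and a' ~ pi(.|x').
# Q is assumed to be a 2-D NumPy array indexed as Q[state, action].
def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma=0.99):
    td_error = r + gamma * Q[x_next, a_next] - Q[x, a]   # delta
    Q[x, a] += alpha * td_error   # equals (1 - alpha) Q(x,a) + alpha (r + gamma Q(x',a'))
    return Q
```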
MODEL-FREE RL: Q-LEARNING
• The Q-Learning algorithm takes a max at each iteration:
    Q_{t+1}(x, a) ← (1 − α_t) Q_t(x, a) + α_t ( r(x, a) + γ max_{a'∈A} Q_t(x', a') )
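The corresponding sketch for Q-Learning differs only in bootstrapping from the greedy next action rather than a' ∼ π (same assumptions as above):

```python
# Sketch: one sampled Q-Learning update; Q is a 2-D NumPy array Q[state, action].
def q_learning_update(Q, x, a, r, x_next, alpha, gamma=0.99):
    td_error = r + gamma * Q[x_next].max() - Q[x, a]   # max over next actions
    Q[x, a] += alpha * td_error
    return Q
```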
• Both converge under the Robbins-Monro conditions
• Not trivial! Q-Learning interleaves learning problems: policy evaluation, optimal control, and stochastic approximation
ASYNCHRONOUS UPDATES
• The asynchronous case: learn from trajectories x_1, a_1, r_1, x_2, a_2, … ∼ π, P
• Apply an update at each step:
    Q(x_t, a_t) ← Q(x_t, a_t) + α_t ( r_t + γ Q(x_{t+1}, a_{t+1}) − Q(x_t, a_t) )
• This is the setting we usually deal with
• Convergence is even more delicate
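A sketch of this asynchronous setting: applying the SARSA update online along one trajectory. The `env` object with `reset`/`step` methods and the ε-greedy behaviour policy are illustrative assumptions:

```python
import numpy as np

def run_sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Asynchronous SARSA: update Q at every step of one trajectory.
    `env` is assumed to expose reset() -> x and step(a) -> (x_next, r, done)."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]

    def eps_greedy(x):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(Q[x].argmax())

    x = env.reset()
    a = eps_greedy(x)
    done = False
    while not done:
        x_next, r, done = env.step(a)
        a_next = eps_greedy(x_next)
        target = r if done else r + gamma * Q[x_next, a_next]
        Q[x, a] += alpha * (target - Q[x, a])   # TD update along the trajectory
        x, a = x_next, a_next
    return Q
```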
OPEN QUESTIONS / AREAS OF ACTIVE RESEARCH
• Rates of convergence [1]
• Variance reduction [2]
• Convergence guarantees for multi-step methods [3, 4]
• Off-policy learning: control from a fixed behaviour policy [3, 4]

[1] Konda and Tsitsiklis (2004)
[2] Azar et al., Speedy Q-Learning (2011)
[3] Harutyunyan, Bellemare, Stepleton, Munos (2016)
[4] Munos, Stepleton, Harutyunyan, Bellemare (2016)
[Diagram recap: stochastic approximation, policy evaluation, optimal control, function approximation]
(VALUE) FUNCTION APPROXIMATION
• Parametrize the value function: Q^π(x, a) ≈ Q(x, a, θ)
• Learning now involves a projection step Π:
    θ_{k+1} ← arg min_θ ‖ T^π Q(x, a, θ_k) − Q(x, a, θ) ‖_D
• This leads to additional, compounding error
• Can cause divergence
SOME CLASSIC RESULTS [1]
• Linear approximation: Q^π(x, a) ≈ θ^⊤ φ(x, a)
• SARSA converges to a Q̂ satisfying Q̂ = Π T^π Q̂, with
    ‖ Q̂ − Q^π ‖_D ≤ (1 / (1 − γ)) ‖ Π Q^π − Q^π ‖_D
• Q-Learning may diverge!

[1] Tsitsiklis and Van Roy (1997)
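A sketch of the linear case: semi-gradient SARSA with Q(x, a) ≈ θ^⊤ φ(x, a). The feature map below (state vector copied into a per-action block) is an arbitrary placeholder, not a construction from the talk:

```python
import numpy as np

N_FEATURES, N_ACTIONS = 8, 2  # placeholder sizes

def phi(x, a):
    """Hypothetical featurizer: the state vector x (length N_FEATURES)
    placed in the block corresponding to action a."""
    feats = np.zeros(N_FEATURES * N_ACTIONS)
    feats[a * N_FEATURES:(a + 1) * N_FEATURES] = x
    return feats

def linear_sarsa_step(theta, x, a, r, x_next, a_next, alpha=0.01, gamma=0.99):
    # Semi-gradient SARSA: grad_theta Q(x, a, theta) = phi(x, a) in the linear case.
    q = theta @ phi(x, a)
    q_next = theta @ phi(x_next, a_next)
    td_error = r + gamma * q_next - q
    theta += alpha * td_error * phi(x, a)
    return theta
```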
OPEN QUESTIONS / AREAS OF ACTIVE RESEARCH
• Convergent, linear-time optimal control [1]
• Exploration under function approximation [2]
• Convergence of multi-step extensions [3]

[1] Maei et al. (2009)
[2] Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos (2016)
[3] Touati et al. (2017)
Outline recap: ideal case → practical case
[Timeline recap: 1956, 1992, 2015, 2016; linear approximation Q^π(x, a) ≈ θ^⊤ φ(x, a)]
DEEP LEARNING (slide adapted from Ali Eslami)
DEEP LEARNING
[Diagram: network computing features φ_θ(x, a), loss L(θ), and gradient ∇_θ L(θ). Graphic by Volodymyr Mnih]
[Figure: Mnih et al., 2015]
DEEP REINFORCEMENT LEARNING
• Represent the value function with a Q-network: Q(x, a, θ)
• Objective function: mean squared error
    L(θ) := E[ ( r + γ max_{a'∈A} Q(x', a', θ) − Q(x, a, θ) )² ]   (the term r + γ max_{a'∈A} Q(x', a', θ) is the target)
• Q-Learning gradient:
    ∇_θ L(θ) = E[ ( r + γ max_{a'∈A} Q(x', a', θ) − Q(x, a, θ) ) ∇_θ Q(x, a, θ) ]

Based on material by David Silver
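A minimal sketch of this loss and gradient step, assuming PyTorch; the network size, optimizer, and batch layout are illustrative choices, not the DQN architecture of Mnih et al.:

```python
import torch
import torch.nn as nn

# Sketch: small Q-network and the squared-TD-error loss (placeholder sizes).
N_STATE_DIMS, N_ACTIONS = 4, 2
q_net = nn.Sequential(nn.Linear(N_STATE_DIMS, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_learning_step(batch, gamma=0.99):
    # batch: float states x, x_next; long actions a; float rewards r and done flags (0/1).
    x, a, r, x_next, done = batch
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():   # the target term is not differentiated (semi-gradient)
        target = r + gamma * (1 - done) * q_net(x_next).max(dim=1).values
    loss = ((target - q_xa) ** 2).mean()   # L(theta)
    optimizer.zero_grad()
    loss.backward()    # backprop yields the Q-Learning gradient from the slide
    optimizer.step()
    return loss.item()
```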
STABILITY ISSUES
• Naive Q-Learning oscillates or diverges:
  1. Data is sequential: successive samples are non-i.i.d.
  2. Policy changes rapidly with the Q-values: may oscillate; extreme data distributions
  3. Scale of rewards and Q-values is unknown: naive gradients can be large; unstable backpropagation

Based on material by David Silver
DEEP Q-NETWORKS
1. Use experience replay: break correlations, learn from past policies
2. Use a target network to keep target values fixed: avoid oscillations
3. Clip rewards: provide robust gradients

Based on material by David Silver
EXPERIENCE REPLAY (equivalent to planning with an empirical model)
• Build a dataset from the agent's experience:
  • Take actions according to an ε-greedy policy
  • Store (x, a, r, x', a') in replay memory D
• Sample transitions from D and perform an asynchronous update:
    L(θ) = E_{x,a,r,x',a' ∼ D}[ ( r + γ max_{a'∈A} Q(x', a', θ) − Q(x, a, θ) )² ]
• Effectively avoids correlations within trajectories

Based on material by David Silver
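A sketch of such a replay memory with uniform sampling, in plain Python; the tuple layout follows the slide and the capacity is an arbitrary assumption:

```python
import random
from collections import deque

# Sketch: a uniform replay memory D. Capacity is an arbitrary choice.
class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def store(self, x, a, r, x_next, a_next):
        self.buffer.append((x, a, r, x_next, a_next))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlations within trajectories.
        return random.sample(self.buffer, batch_size)
```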
TARGET Q-NETWORK
• To avoid oscillations, fix the parameters of the target in the loss function
• Compute targets with respect to old parameters θ⁻:
    r + γ max_{a'∈A} Q(x', a', θ⁻)
• As before, minimize the squared loss:
    L(θ) = E_D[ ( r + γ max_{a'∈A} Q(x', a', θ⁻) − Q(x, a, θ) )² ]
• Periodically update the target network: θ⁻ ← θ   (similar to policy iteration!)

Based on material by David Silver
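A sketch of the target-network mechanism, again assuming PyTorch; the network shape and the update period are illustrative choices:

```python
import copy
import torch
import torch.nn as nn

# Sketch: a frozen copy theta^- of the online parameters theta (placeholder sizes).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)   # theta^- <- theta (initially)

def td_target(r, x_next, done, gamma=0.99):
    # Targets use the frozen parameters theta^-, so they stay fixed between syncs.
    with torch.no_grad():
        return r + gamma * (1 - done) * target_net(x_next).max(dim=1).values

def maybe_sync_target(step, period=10_000):
    if step % period == 0:   # periodically: theta^- <- theta
        target_net.load_state_dict(q_net.state_dict())
```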