  1. Model-Free Control (Reinforcement Learning) and Deep Learning. Marc G. Bellemare, Google Brain (Montréal)

  2. Timeline: 1956 · 1992 · 2015 · 2016

  3. The Arcade Learning Environment (Bellemare et al., 2013)

  4. Screen: 160 × 210 pixels, 60 frames/second, 18 actions. Reward: change in score.

  5. • 33,600 (discrete) dimensions • Up to 108,000 decisions/episode (30 minutes) • 60+ games: heterogeneous dynamical systems

  6. Deep Learning: An AI Success Story

  7. Theory vs. practice:
$\mathcal{M} := \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$
$Q^*(x,a) = r(x,a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q^*(x',a')$
$\hat{Q} = \Pi T^\pi \hat{Q}, \qquad \| \hat{Q} - Q^\pi \|_D \le \frac{1}{1-\gamma} \| \Pi Q^\pi - Q^\pi \|_D$

  8. Where has model-free control been so successful?
• Complex dynamical systems
• Black-box simulators
• High-dimensional state spaces
• Long time horizons
• Opponent / adversarial element

  9. Practical Considerations
• Are simulations reasonably cheap? → model-free
• Is the notion of “state” complex? → model-free
• Is there partial observability? → maybe model-free
• Can the state space be enumerated? → value iteration
• Is there an explicit model available? → model-based

  10. Outline of Talk: ideal case vs. practical case

  11. What's Reinforcement Learning, Anyway?
“All goals and purposes … can be thought of as the maximization of some value function” – Sutton & Barto (2017, in press)

  12. At each step t, the agent:
• observes a state $x_t$
• takes an action $a_t$
• receives a reward $r_t$
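To make this interaction protocol concrete, here is a minimal sketch of the agent-environment loop; the `env` and `agent` interfaces (`reset`, `step`, `act`, `observe`) are hypothetical stand-ins, not part of the talk.

```python
def run_episode(env, agent):
    """One episode of agent-environment interaction (hypothetical interfaces)."""
    x = env.reset()                      # observe the initial state x_0
    done = False
    while not done:
        a = agent.act(x)                 # take an action a_t
        x_next, r, done = env.step(a)    # receive a reward r_t, observe x_{t+1}
        agent.observe(x, a, r, x_next)   # let the agent learn from the transition
        x = x_next
```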

  13. Three Learning Problems in One: stochastic approximation · policy evaluation · optimal control · function approximation

  14. Background
• Formalized as a Markov Decision Process: $\mathcal{M} := \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$
• $R$, $P$: reward and transition functions
• $\gamma$: discount factor
• A trajectory is a sequence of interactions with the environment: $x_1, a_1, r_1, x_2, a_2, \ldots$

  15.
• Policy $\pi$: a probability distribution over actions, $a_t \sim \pi(\cdot \mid x_t)$; if deterministic, $a_t = \pi(x_t)$
• Transition function: $x_{t+1} \sim P(\cdot \mid x_t, a_t)$
• Value function $Q^\pi(x,a)$: total discounted reward,
  $Q^\pi(x,a) = \mathbb{E}_{P,\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t r(x_t,a_t) \;\middle|\; x_0 = x,\, a_0 = a \right]$
• As a vector in the space of value functions: $Q^\pi \in \mathcal{Q}$

  16.
• “Maximize value function”: find $Q^*(x,a) := \max_\pi Q^\pi(x,a)$
• Bellman's equation:
  $Q^\pi(x,a) = \mathbb{E}_{P,\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t r(x_t,a_t) \;\middle|\; x_0 = x,\, a_0 = a \right] = r(x,a) + \gamma\, \mathbb{E}_{P,\pi} Q^\pi(x',a')$
• Optimality equation:
  $Q^*(x,a) = r(x,a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q^*(x',a')$

  17. Bellman Operator
$T^\pi Q(x,a) := r(x,a) + \gamma\, \mathbb{E}_{x' \sim P,\, a' \sim \pi} Q(x',a')$
• The Bellman operator is a $\gamma$-contraction: $\| T^\pi Q - Q^\pi \|_\infty \le \gamma \| Q - Q^\pi \|_\infty$
• Fixed point $Q^\pi = T^\pi Q^\pi$: iterating $Q_{k+1} := T^\pi Q_k$ converges to $Q^\pi$
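As a concrete view of the fixed-point iteration $Q_{k+1} := T^\pi Q_k$, here is a minimal tabular policy-evaluation sketch that repeatedly applies the Bellman operator when $R$, $P$, and $\pi$ are known; the array layout is an illustrative assumption.

```python
import numpy as np

def policy_evaluation(R, P, pi, gamma, n_iters=1000):
    """R: [S, A] rewards, P: [S, A, S] transition probs, pi: [S, A] action probs (assumed layout)."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = (pi * Q).sum(axis=1)     # E_{a' ~ pi} Q(x', a') for every next state x'
        Q = R + gamma * P @ V        # apply T^pi: r(x,a) + gamma * E_{x' ~ P} V(x')
    return Q                         # converges to Q^pi by the gamma-contraction property
```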

  18. Bellman Optimality Operator
$T Q(x,a) := r(x,a) + \gamma\, \mathbb{E}_{x' \sim P} \max_{a' \in \mathcal{A}} Q(x',a')$
• Also a $\gamma$-contraction (beware: different proof): $\| T Q - Q^* \|_\infty \le \gamma \| Q - Q^* \|_\infty$
• Fixed point is the optimal value function: $Q^* = T Q^* \ge Q^\pi$; iterate $Q_{k+1} := T Q_k$

  19. Model-Based Algorithms
1. Value iteration (see the sketch below): $Q_{k+1}(x,a) \leftarrow T Q_k(x,a) = r(x,a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q_k(x',a')$
2. Policy iteration:
  a. $\pi_k = \arg\max_\pi T^\pi Q_k(x,a)$
  b. $Q_{k+1}(x,a) \leftarrow Q^{\pi_k}(x,a)$
3. Optimistic policy iteration: $Q_{k+1}(x,a) \leftarrow (T^{\pi_k})^m Q_k(x,a) = \underbrace{T^{\pi_k} \cdots T^{\pi_k}}_{m \text{ times}}\, Q_k(x,a)$
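A minimal tabular value-iteration sketch, assuming the same hypothetical array layout as the policy-evaluation snippet above ($R$: [S, A], $P$: [S, A, S]).

```python
import numpy as np

def value_iteration(R, P, gamma, n_iters=1000):
    """Iterate Q_{k+1} <- T Q_k until (approximate) convergence to Q*."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)            # max_{a'} Q_k(x', a') for every next state x'
        Q = R + gamma * P @ V        # apply T: r(x,a) + gamma * E_{x' ~ P} max_{a'} Q_k(x', a')
    return Q
```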

  20. Policy Iteration
a. $\pi_k = \arg\max_\pi T^\pi Q_k(x,a)$ (optimal control step)
b. $Q_{k+1}(x,a) \leftarrow Q^{\pi_k}(x,a)$ (policy evaluation step)

  21. Model-Free Reinforcement Learning
• Typically no access to $P$, $R$
• Two options:
  • Learn a model (not in this talk)
  • Model-free: learn $Q^\pi$ or $Q^*$ directly from samples

  22. Model-based: compute $Q_{k+1} := T^\pi Q_k$ using expectations under $P$. Model-free: interact with the environment, observing state $x_t$, action $a_t$, and reward $r_t$, with $x_{t+1} \sim P(\cdot \mid x_t, a_t)$.

  23. Model-Free RL: Synchronous Updates
• For all $x, a$: sample $x' \sim P(\cdot \mid x, a)$, $a' \sim \pi(\cdot \mid x')$
• The SARSA algorithm (sketch below):
  $Q_{t+1}(x,a) \leftarrow (1 - \alpha_t)\, Q_t(x,a) + \alpha_t\, \hat{T}^\pi Q_t(x,a)$
  $= (1 - \alpha_t)\, Q_t(x,a) + \alpha_t \left( r(x,a) + \gamma Q_t(x',a') \right)$
  $= Q_t(x,a) + \alpha_t \underbrace{\left( r(x,a) + \gamma Q_t(x',a') - Q_t(x,a) \right)}_{\text{TD-error } \delta}$
• $\alpha_t \in [0, 1)$ is a step-size (sequence)
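A minimal sketch of one tabular SARSA update for a sampled transition $(x, a, r, x', a')$; `Q` is assumed to be a NumPy array indexed by (state, action).

```python
def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma):
    """Q_{t+1}(x,a) = Q_t(x,a) + alpha * (r + gamma * Q_t(x',a') - Q_t(x,a))."""
    td_error = r + gamma * Q[x_next, a_next] - Q[x, a]   # TD-error delta
    Q[x, a] += alpha * td_error
    return Q
```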

  24. Model-Free RL: Q-Learning
• The Q-Learning algorithm: take a max at each iteration,
  $Q_{t+1}(x,a) \leftarrow (1 - \alpha_t)\, Q_t(x,a) + \alpha_t \left( r(x,a) + \gamma \max_{a' \in \mathcal{A}} Q_t(x',a') \right)$
  $= Q_t(x,a) + \alpha_t \left( r(x,a) + \gamma \max_{a' \in \mathcal{A}} Q_t(x',a') - Q_t(x,a) \right)$
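The corresponding tabular Q-Learning update differs from SARSA only in the bootstrap target, which maximizes over next actions instead of following $\pi$; again a sketch with an assumed (state, action)-indexed array.

```python
import numpy as np

def q_learning_update(Q, x, a, r, x_next, alpha, gamma):
    """Q_{t+1}(x,a) = Q_t(x,a) + alpha * (r + gamma * max_{a'} Q_t(x',a') - Q_t(x,a))."""
    td_error = r + gamma * np.max(Q[x_next]) - Q[x, a]
    Q[x, a] += alpha * td_error
    return Q
```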

  25.
• Both converge under the Robbins-Monro conditions
• Not trivial! Interleaved learning problems: stochastic approximation, policy evaluation, and optimal control all meet in Q-Learning

  26. Asynchronous Updates
• The asynchronous case: learn from trajectories $x_1, a_1, r_1, x_2, a_2, \cdots \sim \pi, P$
• Apply an update at each step:
  $Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha_t \left( r_t + \gamma Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t) \right)$
• This is the setting we usually deal with
• Convergence is even more delicate
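A sketch of this asynchronous setting: updates are applied online along a single trajectory generated by an ε-greedy policy. The `env.reset()`/`env.step()` interface and the terminal-state handling are illustrative assumptions, not part of the talk.

```python
import numpy as np

def epsilon_greedy(Q, x, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])     # explore: random action
    return int(np.argmax(Q[x]))                  # exploit: greedy action

def run_online_sarsa(env, Q, alpha, gamma, epsilon=0.1):
    """Apply the SARSA update at every step of a trajectory x_1, a_1, r_1, x_2, ..."""
    x = env.reset()
    a = epsilon_greedy(Q, x, epsilon)
    done = False
    while not done:
        x_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, x_next, epsilon)
        target = r + gamma * Q[x_next, a_next] * (not done)   # no bootstrap at terminal states
        Q[x, a] += alpha * (target - Q[x, a])
        x, a = x_next, a_next
    return Q
```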

  27. Open Questions / Areas of Active Research
• Rates of convergence [1]
• Variance reduction [2]
• Convergence guarantees for multi-step methods [3, 4]
• Off-policy learning: control from a fixed behaviour policy [3, 4]
[1] Konda and Tsitsiklis (2004). [2] Azar et al., Speedy Q-Learning (2011). [3] Harutyunyan, Bellemare, Stepleton, Munos (2016). [4] Munos, Stepleton, Harutyunyan, Bellemare (2016).

  28. Stochastic approximation · policy evaluation · optimal control · function approximation

  29. Stochastic approximation · policy evaluation · optimal control · function approximation

  30. (Value) Function Approximation
• Parametrize the value function: $Q^\pi(x,a) \approx Q(x,a,\theta)$
• Learning now involves a projection step $\Pi$:
  $\Pi T^\pi Q(x,a,\theta_k):\quad \theta_{k+1} \leftarrow \arg\min_\theta \| T^\pi Q(x,a,\theta_k) - Q(x,a,\theta) \|_D$
• This leads to additional, compounding error
• Can cause divergence
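One way to picture the projection step: fit $\theta$ by least squares to sampled Bellman targets computed with the previous parameters. The linear parametrization, the batch format, and the idea that sampling under $D$ supplies the weighting are illustrative assumptions for this sketch.

```python
import numpy as np

def projected_bellman_step(theta, phi, batch, gamma):
    """theta_{k+1} <- argmin_theta || T^pi Q(., theta_k) - Q(., theta) ||_D (batch approximation)."""
    features, targets = [], []
    for x, a, r, x_next, a_next in batch:                         # transitions sampled under D
        features.append(phi(x, a))
        targets.append(r + gamma * theta @ phi(x_next, a_next))   # sampled (T^pi Q_theta_k)(x, a)
    X, y = np.array(features), np.array(targets)
    theta_next, *_ = np.linalg.lstsq(X, y, rcond=None)            # the projection step Pi
    return theta_next
```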


  31. Some Classic Results [1]
• Linear approximation: $Q^\pi(x,a) \approx \theta^\top \phi(x,a)$
• SARSA converges to $\hat{Q}$ satisfying $\hat{Q} = \Pi T^\pi \hat{Q}$, with
  $\| \hat{Q} - Q^\pi \|_D \le \frac{1}{1-\gamma} \| \Pi Q^\pi - Q^\pi \|_D$
• Q-Learning may diverge!
[1] Tsitsiklis and Van Roy (1997)
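The algorithm in question is SARSA with linear function approximation; here is a minimal sketch of its semi-gradient update (the feature map `phi` is assumed given).

```python
import numpy as np

def linear_sarsa_update(theta, phi, x, a, r, x_next, a_next, alpha, gamma):
    """Semi-gradient SARSA with Q(x,a) ~ theta^T phi(x,a)."""
    td_error = r + gamma * theta @ phi(x_next, a_next) - theta @ phi(x, a)
    return theta + alpha * td_error * phi(x, a)   # gradient of theta^T phi(x,a) w.r.t. theta is phi(x,a)
```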

  32. Open Questions / Areas of Active Research
• Convergent, linear-time optimal control [1]
• Exploration under function approximation [2]
• Convergence of multi-step extensions [3]
[1] Maei et al. (2009). [2] Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos (2016). [3] Touati et al. (2017).

  33. Ideal case vs. practical case

  34. Ideal case vs. practical case

  35. Timeline: 1956 · 1992 · 2015 · 2016

  36. Timeline: 1956 · 1992 ($Q^\pi(x,a) \approx \theta^\top \phi(x,a)$) · 2015 · 2016

  37. Timeline: 1956 · 1992 · 2015 · 2016

  38. Deep Learning (slide adapted from Ali Eslami)

  39. Deep Learning: features $\phi_\theta(x,a)$, loss $L(\theta)$, gradient $\nabla_\theta L(\theta)$ (graphic by Volodymyr Mnih)

  40. Mnih et al., 2015

  41. Deep Reinforcement Learning
• Represent the value function as a Q-network: $Q(x,a,\theta)$
• Objective function: mean squared error,
  $L(\theta) := \mathbb{E}\!\left[ \Big( \underbrace{r + \gamma \max_{a' \in \mathcal{A}} Q(x',a',\theta)}_{\text{target}} - Q(x,a,\theta) \Big)^{\!2} \right]$
• Q-Learning gradient:
  $\nabla_\theta L(\theta) = \mathbb{E}\!\left[ \Big( r + \gamma \max_{a' \in \mathcal{A}} Q(x',a',\theta) - Q(x,a,\theta) \Big) \nabla_\theta Q(x,a,\theta) \right]$
Based on material by David Silver
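A minimal PyTorch-style sketch of this loss; the network, tensor shapes, and the terminal-state mask are illustrative assumptions (the mask is standard practice but not shown on the slide).

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, batch, gamma=0.99):
    """Mean squared error between Q(x,a,theta) and the bootstrapped target."""
    x, a, r, x_next, done = batch                     # minibatch tensors; a: long indices, done: 0/1 floats
    q = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)             # Q(x, a, theta)
    with torch.no_grad():                                         # semi-gradient: no grad through the target
        target = r + gamma * (1.0 - done) * q_net(x_next).max(1).values
    return F.mse_loss(q, target)                                  # backprop yields the Q-Learning gradient above
```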

  42. Stability Issues
• Naive Q-Learning oscillates or diverges:
1. Data is sequential: successive samples are non-i.i.d.
2. Policy changes rapidly with the Q-values: may oscillate; extreme data distributions
3. Scale of rewards and Q-values is unknown: naive gradients can be large; unstable backpropagation
Based on material by David Silver

  43. Deep Q-Networks
1. Use experience replay: break correlations, learn from past policies
2. Use a target network to keep target values fixed: avoid oscillations
3. Clip rewards: provide robust gradients
Based on material by David Silver

  44. Experience Replay (equivalent to planning with an empirical model)
• Build a dataset from the agent's experience:
  • Take an action according to an ε-greedy policy
  • Store (x, a, r, x', a') in replay memory D
• Sample transitions from D and perform an asynchronous update:
  $L(\theta) = \mathbb{E}_{x,a,r,x',a' \sim D}\!\left[ \Big( r + \gamma \max_{a' \in \mathcal{A}} Q(x',a',\theta) - Q(x,a,\theta) \Big)^{\!2} \right]$
• Effectively avoids correlations within trajectories
Based on material by David Silver
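A minimal replay-memory sketch: a bounded buffer of transitions sampled uniformly at random for updates. The capacity and transition-tuple format are illustrative choices.

```python
import random
from collections import deque

class ReplayMemory:
    """Store transitions (x, a, r, x', a'); sample them i.i.d. to break trajectory correlations."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are evicted first

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```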

  45. Target Q-Network
• To avoid oscillations, fix the parameters of the target in the loss function
• Compute targets w.r.t. old parameters $\theta^-$:
  $r + \gamma \max_{a' \in \mathcal{A}} Q(x',a',\theta^-)$
• As before, minimize the squared loss:
  $L(\theta) = \mathbb{E}_D\!\left[ \Big( r + \gamma \max_{a' \in \mathcal{A}} Q(x',a',\theta^-) - Q(x,a,\theta) \Big)^{\!2} \right]$
• Periodically update the target network: $\theta^- \leftarrow \theta$ (similar to policy iteration!)
Based on material by David Silver
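A sketch of the target-network mechanics in the same PyTorch-style setting as the loss sketch above: keep a frozen copy $\theta^-$ of the online parameters $\theta$, compute bootstrap targets with it, and refresh it periodically. The update period and variable names (including `q_net` from the earlier sketch) are assumptions.

```python
import copy
import torch

# theta^- <- theta at initialization (q_net is the online Q-network from the earlier sketch)
target_net = copy.deepcopy(q_net)

def compute_target(r, x_next, done, gamma=0.99):
    """Bootstrap target r + gamma * max_{a'} Q(x', a', theta^-), held fixed w.r.t. the loss gradient."""
    with torch.no_grad():
        return r + gamma * (1.0 - done) * target_net(x_next).max(1).values

def maybe_update_target(step, period=10_000):
    """Periodically copy theta into theta^- (the policy-iteration-like step)."""
    if step % period == 0:
        target_net.load_state_dict(q_net.state_dict())
```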
