
6. (3 pts) What three things form the “deadly triad”? - PowerPoint PPT Presentation



  1. 6. (3 pts) What three things form the “deadly triad” – the three things that cannot be combined in the same learning situation without risking divergence? (circle three) (a) eligibility traces (b) bootstrapping (c) sample backups (d) ε-greedy action selection (e) linear function approximation (f) off-line updating (g) off-policy learning (h) exploration bonuses

  2. The Deadly Triad, the three things that together result in instability: 1. Function approximation; 2. Bootstrapping; 3. Off-policy training (e.g., Q-learning, DP). Instability can arise even if: • the problem is prediction (policies fixed and given) • the approximation is linear with binary features • the updates are expected updates over i.i.d. states (as in asynchronous DP)

  3. 7. True or False: For any stationary MDP, assuming a step-size (α) sequence satisfying the standard stochastic approximation criteria, and a fixed policy, convergence in the prediction problem is guaranteed for:
 T F (2 pts) online, off-policy TD(1) with linear function approximation
 T F (2 pts) online, on-policy TD(0) with linear function approximation
 T F (2 pts) offline, off-policy TD(0) with linear function approximation
 T F (2 pts) dynamic programming with linear function approximation
 T F (2 pts) dynamic programming with nonlinear function approximation
 T F (2 pts) gradient-descent Monte Carlo with linear function approximation
 T F (2 pts) gradient-descent Monte Carlo with nonlinear function approximation
 8. True or False: (3 pts) TD(0) with linear function approximation converges to a local minimum in the MSE between the approximate value function and the true value function V^π.

  4. The Deadly Triad, the three things that together result in instability:
 1. Function approximation • linear or more, with proportional complexity • state aggregation ok; ok if “nearly Markov”
 2. Bootstrapping • λ = 1 ok; ok if λ is big enough (problem dependent)
 3. Off-policy training • may be ok if “nearly on-policy” • if the policies are very different, the variance may be too high anyway

  5. Off-policy learning • Learning about a policy different than the policy being used to generate actions • Most often used to learn optimal behaviour from a given data set, or from more exploratory behaviour • Key to ambitious theories of knowledge and perception as continual prediction about the outcomes of many options

  6. Baird’s counter-example • P and d are not linked • d is all states with equal probability • P is according to the Markov chain in the figure, with r = 0 on all transitions • approximate values: V_k(s_i) = θ_k(7) + 2 θ_k(i) for i = 1, …, 5, and V_k(s_6) = 2 θ_k(7) + θ_k(6) • [Figure: states 1–5 each transition (100%) to state 6; state 6 returns to itself with probability 99% and goes to a terminal state with probability 1%]

  7. TD can diverge: Baird’s counter-example • [Figure: parameter values θ_k(i) over iterations k = 0 to 5000, on a log scale broken at ±1; the components θ_k(1)–θ_k(5), θ_k(6), and θ_k(7) all diverge] • deterministic (expected) updates, θ_0 = (1, 1, 1, 1, 1, 10, 1), γ = 0.99, α = 0.01
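A minimal sketch (mine, not from the slides) of the expected off-policy TD(0) updates on this counter-example, using the slide's θ_0, γ, and α. The feature layout and the transition probabilities are my reading of the figure description above and should be treated as assumptions.

```python
# Expected (deterministic) off-policy TD(0) on the 6-state Baird counter-example
# sketched above. Feature layout and transitions are assumptions based on the figure:
# V(s_i) = 2*theta[i] + theta[6] for i = 0..4, V(s_5) = theta[5] + 2*theta[6];
# states 0..4 go to state 5 with prob. 1; state 5 stays with prob. 0.99, terminates with 0.01.
import numpy as np

gamma, alpha = 0.99, 0.01
theta = np.array([1., 1., 1., 1., 1., 10., 1.])    # theta_0 from the slide

Phi = np.zeros((6, 7))                             # feature matrix, one row per state
for i in range(5):
    Phi[i, i], Phi[i, 6] = 2.0, 1.0
Phi[5, 5], Phi[5, 6] = 1.0, 2.0

d = np.full(6, 1.0 / 6.0)                          # update distribution: all states equally often

for k in range(5000):
    v = Phi @ theta                                # current value estimates
    v_next = np.array([v[5]] * 5 + [0.99 * v[5]])  # expected next-state value (terminal has value 0)
    delta = gamma * v_next - v                     # expected TD errors (r = 0 on all transitions)
    theta += alpha * Phi.T @ (d * delta)           # expected TD(0) update under d

print(theta)   # the components grow without bound, as in the divergence plot above
```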

  8. TD(0) can diverge: A simple example • [Figure: a single transition between two states whose approximate values are θ and 2θ] • TD error: δ = r + γ θ⊤φ′ − θ⊤φ = 0 + 2θ − θ = θ • TD update: Δθ = α δ φ = α θ • Diverges! • Yet the TD fixpoint is θ* = 0
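The divergence in this example can be checked in a few lines. The sketch below assumes features φ = 1 and φ′ = 2, reward 0, γ = 1, and that the same transition is presented over and over; these details are my reading of the example rather than anything stated explicitly on the slide.

```python
# Repeatedly applying the TD(0) update to a transition from a state with value
# theta (phi = 1) to a state with value 2*theta (phi' = 2), reward 0, gamma = 1.
alpha, gamma = 0.1, 1.0
phi, phi_next, r = 1.0, 2.0, 0.0
theta = 1.0
for _ in range(50):
    delta = r + gamma * theta * phi_next - theta * phi   # = theta
    theta += alpha * delta * phi                         # theta <- (1 + alpha) * theta
print(theta)   # grows like (1 + alpha)^50, even though the TD fixpoint is theta* = 0
```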

  9. Previous attempts to solve the off-policy problem • Importance sampling • With recognizers • Least-squares methods, LSTD, LSPI, iLSTD • Averagers • Residual gradient methods

  10. Desiderata: We want a TD algorithm that • bootstraps (genuine TD) • works with linear function approximation (stable, reliably convergent) • is simple, like linear TD: O(n) • learns fast, like linear TD • can learn off-policy (arbitrary P and d) • learns from online causal trajectories (no repeat sampling from the same state)

  11. A little more theory • The expected TD(0) update:
 Δθ ∝ δφ = (r + γ θ⊤φ′ − θ⊤φ) φ = φ θ⊤(γφ′ − φ) + r φ = φ (γφ′ − φ)⊤ θ + r φ
 E[Δθ] ∝ −E[φ (φ − γφ′)⊤] θ + E[r φ] = −A θ + b
 convergent if A is positive definite • therefore, at the TD fixpoint: A θ* = b, i.e., θ* = A⁻¹ b (LSTD computes this directly) • ½ ∇_θ MSPBE = A⊤ C⁻¹ (A θ − b), where C = E[φ φ⊤] is the covariance matrix, always pos. def.
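To make the fixpoint statement concrete, the sketch below builds A, b, and C from made-up random features (purely illustrative numbers, not from the slides) and checks that A⊤C⁻¹(Aθ − b) vanishes exactly at θ* = A⁻¹b.

```python
# Numeric check: the MSPBE gradient is zero at the TD fixpoint theta* = A^{-1} b.
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 20, 0.9
Phi = rng.normal(size=(m, n))        # made-up feature vectors phi (one row per sample)
Phi_next = rng.normal(size=(m, n))   # made-up successor features phi'
r = rng.normal(size=m)               # made-up rewards

A = Phi.T @ (Phi - gamma * Phi_next) / m   # E[phi (phi - gamma phi')^T]
b = Phi.T @ r / m                          # E[r phi]
C = Phi.T @ Phi / m                        # E[phi phi^T], the covariance matrix

theta_star = np.linalg.solve(A, b)                    # A theta* = b
grad = A.T @ np.linalg.solve(C, A @ theta_star - b)   # proportional to grad of MSPBE
print(np.allclose(grad, 0.0))                         # True
```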

  12. TD(0) Solution and Stability
 θ_{t+1} = θ_t + α ( R_{t+1} φ(S_t) − φ(S_t)(φ(S_t) − γ φ(S_{t+1}))⊤ θ_t )
         = θ_t + α (b_t − A_t θ_t)
         = (I − α A_t) θ_t + α b_t,
 where b_t = R_{t+1} φ(S_t) ∈ ℝ^n and A_t = φ(S_t)(φ(S_t) − γ φ(S_{t+1}))⊤ ∈ ℝ^{n×n}.
 In expectation: θ̄_{t+1} = θ̄_t + α (b − A θ̄_t), with fixed point θ* = A⁻¹ b.
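A single step of this update, written in the b_t / A_t form above, might look like the following sketch (variable names such as phi_t and r_tp1 are mine):

```python
import numpy as np

def td0_step(theta, phi_t, r_tp1, phi_tp1, alpha, gamma):
    """One linear TD(0) step: theta_{t+1} = (I - alpha A_t) theta_t + alpha b_t."""
    A_t = np.outer(phi_t, phi_t - gamma * phi_tp1)   # phi(S_t) (phi(S_t) - gamma phi(S_{t+1}))^T
    b_t = r_tp1 * phi_t                              # R_{t+1} phi(S_t)
    return theta + alpha * (b_t - A_t @ theta)
```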

  13. LSTD(0)
 Ideal: A = lim_{t→∞} E_π[ φ_t (φ_t − γ φ_{t+1})⊤ ], b = lim_{t→∞} E_π[ R_{t+1} φ_t ], θ* = A⁻¹ b
 Algorithm: A_t = Σ_k ρ_k φ_k (φ_k − γ φ_{k+1})⊤, b_t = Σ_k ρ_k R_{k+1} φ_k, θ_t = A_t⁻¹ b_t
 Then lim_{t→∞} A_t = A and lim_{t→∞} b_t = b, so lim_{t→∞} θ_t = θ*.
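A sketch of the batch computation, assuming the data arrive as (ρ_k, φ_k, R_{k+1}, φ_{k+1}) tuples; the small ridge term for invertibility is my addition, not part of the slide:

```python
import numpy as np

def lstd0(transitions, gamma, n, ridge=1e-6):
    """LSTD(0): accumulate importance-weighted A_t, b_t and solve A_t theta = b_t."""
    A = np.zeros((n, n))
    b = np.zeros(n)
    for rho, phi, r, phi_next in transitions:
        A += rho * np.outer(phi, phi - gamma * phi_next)   # rho_k phi_k (phi_k - gamma phi_{k+1})^T
        b += rho * r * phi                                 # rho_k R_{k+1} phi_k
    return np.linalg.solve(A + ridge * np.eye(n), b)       # theta_t = A_t^{-1} b_t
```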

  14. LSTD(λ)
 Ideal: A = lim_{t→∞} E_π[ e_t (φ_t − γ φ_{t+1})⊤ ], b = lim_{t→∞} E_π[ R_{t+1} e_t ], θ* = A⁻¹ b
 Algorithm: A_t = Σ_k ρ_k e_k (φ_k − γ φ_{k+1})⊤, b_t = Σ_k ρ_k R_{k+1} e_k, θ_t = A_t⁻¹ b_t
 Then lim_{t→∞} A_t = A and lim_{t→∞} b_t = b, so lim_{t→∞} θ_t = θ*.
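The λ > 0 version differs only in that an eligibility trace e_k replaces φ_k in A_t and b_t. The accumulating-trace recursion below is an assumption about details the slide does not spell out:

```python
import numpy as np

def lstd_lambda(transitions, gamma, lam, n, ridge=1e-6):
    """LSTD(lambda) with an accumulating eligibility trace (transitions in time order)."""
    A = np.zeros((n, n))
    b = np.zeros(n)
    e = np.zeros(n)
    for rho, phi, r, phi_next in transitions:
        e = gamma * lam * e + phi                        # eligibility trace e_k
        A += rho * np.outer(e, phi - gamma * phi_next)   # rho_k e_k (phi_k - gamma phi_{k+1})^T
        b += rho * r * e                                 # rho_k R_{k+1} e_k
    return np.linalg.solve(A + ridge * np.eye(n), b)
```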
