
An Emphatic Approach to the Problem of Off-policy TD Learning (PowerPoint PPT Presentation)



  1. An Emphatic Approach to the Problem of Off-policy TD Learning
     Rich Sutton, Rupam Mahmood, Martha White
     Reinforcement Learning and Artificial Intelligence Laboratory (RL & AI)
     Department of Computing Science, University of Alberta, Canada

  2. Temporal-Difference Learning with Linear Function Approximation

  States, actions, rewards: $S_t \in \mathcal{S}$, $A_t \in \mathcal{A}$, $R_{t+1} \in \mathbb{R}$.
  Policy: $\pi(a|s) \doteq P\{A_t = a \mid S_t = s\}$.
  Transition probability matrix: $[P_\pi]_{ij} \doteq \sum_a \pi(a|i)\, p(j|i,a)$, where $p(j|i,a) \doteq P\{S_{t+1} = j \mid S_t = i, A_t = a\}$.
  Ergodic stationary distribution: $[d_\pi]_s \doteq d_\pi(s) \doteq \lim_{t\to\infty} P\{S_t = s\} > 0$; its defining property is $P_\pi^\top d_\pi = d_\pi$.
  Return: $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, with $0 \le \gamma < 1$.
  Feature vectors $x(s) \in \mathbb{R}^n$, $\forall s \in \mathcal{S}$, and a weight vector $w_t \in \mathbb{R}^n$, $n \ll |\mathcal{S}|$, approximate the value function: $w^\top x(s) \approx v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$.

  Linear TD(0):
    $w_{t+1} \doteq w_t + \alpha \big( R_{t+1} + \gamma w_t^\top x(S_{t+1}) - w_t^\top x(S_t) \big)\, x(S_t)$
    $\quad\;\, = w_t + \alpha \big( \underbrace{R_{t+1}\, x(S_t)}_{b_t \,\in\, \mathbb{R}^n} - \underbrace{x(S_t)\big(x(S_t) - \gamma x(S_{t+1})\big)^\top}_{A_t \,\in\, \mathbb{R}^{n \times n}} w_t \big)$
    $\quad\;\, = w_t + \alpha (b_t - A_t w_t) = (I - \alpha A_t)\, w_t + \alpha b_t$.

  Deterministic 'expected' update: $\bar w_{t+1} \doteq (I - \alpha A)\, \bar w_t + \alpha b$, where
    $A \doteq \lim_{t\to\infty} \mathbb{E}[A_t] = \lim_{t\to\infty} \mathbb{E}_\pi\big[ x(S_t)\big(x(S_t) - \gamma x(S_{t+1})\big)^\top \big]$.
  Stable if $A$ is positive definite, i.e., if $y^\top A y > 0, \ \forall y \ne 0$; then it converges to $\lim_{t\to\infty} \bar w_t = A^{-1} b$.

    $A = \sum_s d_\pi(s)\, x(s) \Big( x(s) - \gamma \sum_{s'} [P_\pi]_{ss'}\, x(s') \Big)^{\!\top} = X^\top \underbrace{D_\pi (I - \gamma P_\pi)}_{\text{key matrix}} X$,
  where $X$ is the matrix with rows $x(s)^\top$ and $D_\pi \doteq \mathrm{diag}(d_\pi)$.

  If this "key matrix" is positive definite, then $A$ is positive definite and everything is stable. I showed in 1988 that the key matrix is positive definite if its column sums are > 0.
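  A minimal sketch of the linear TD(0) update above, in Python with NumPy; the environment interface (`env_step`, `features`, `policy`) and all parameter values are hypothetical placeholders, not from the slides:

```python
import numpy as np

def linear_td0(env_step, features, policy, s0, n, alpha=0.01, gamma=0.9, steps=10000):
    """On-policy linear TD(0): w_{t+1} = w_t + alpha * delta_t * x(S_t)."""
    w = np.zeros(n)
    s = s0
    for _ in range(steps):
        a = policy(s)                                # A_t ~ pi(.|S_t)
        s_next, r = env_step(s, a)                   # sample S_{t+1}, R_{t+1}
        x, x_next = features(s), features(s_next)
        delta = r + gamma * (w @ x_next) - (w @ x)   # TD error
        w += alpha * delta * x                       # equivalently w += alpha * (b_t - A_t @ w)
        s = s_next
    return w
```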

  3. For the j-th column of the key matrix $D_\pi (I - \gamma P_\pi)$, the sum is
    $\sum_i [D_\pi (I - \gamma P_\pi)]_{ij} = \sum_i \sum_k [D_\pi]_{ik}\, [I - \gamma P_\pi]_{kj}$
    $\quad = \sum_i [D_\pi]_{ii}\, [I - \gamma P_\pi]_{ij}$    ($D_\pi$ is diagonal)
    $\quad = \sum_i d_\pi(i)\, [I - \gamma P_\pi]_{ij}$
    $\quad = \big[ d_\pi^\top (I - \gamma P_\pi) \big]_j$
    $\quad = \big[ d_\pi^\top - \gamma\, d_\pi^\top P_\pi \big]_j$
    $\quad = \big[ d_\pi^\top - \gamma\, d_\pi^\top \big]_j$    (by stationarity, $d_\pi^\top P_\pi = d_\pi^\top$)
    $\quad = (1 - \gamma)\, d_\pi(j)$
    $\quad > 0$.

  So the key matrix's column sums are all positive, the key matrix is positive definite, $A = X^\top D_\pi (I - \gamma P_\pi) X$ is positive definite, and everything is stable.
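  A quick numerical check of this column-sum identity on an arbitrary ergodic chain (a sketch for illustration only; the random chain and all names here are assumptions, not from the slides):

```python
import numpy as np

# Check that the column sums of D_pi (I - gamma P_pi) equal (1 - gamma) d_pi
# for a randomly generated ergodic transition matrix.
rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9

P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)          # row-stochastic transition matrix P_pi

# stationary distribution: left eigenvector of P with eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d /= d.sum()

key = np.diag(d) @ (np.eye(n_states) - gamma * P)
print(np.allclose(key.sum(axis=0), (1 - gamma) * d))   # True: column sums = (1 - gamma) d_pi
```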

  4. 2 off-policy learning problems
  1. Correcting for the distribution of future returns.
     Solution: importance sampling (Sutton & Barto 1998; improved by Precup, Sutton & Singh, 2000), now used in GTD(λ) and GQ(λ).
  2. Correcting for the state-update distribution.
     Solution: none known, other than more importance sampling (Precup, Sutton & Dasgupta, 2001), which as proposed was of very high variance. The ideas of that work are strikingly similar to those of emphasis…

  5. Off-policy Temporal-Difference Learning with Linear Function Approximation

  The target policy $\pi(a|s)$ is no longer used to select actions; a behavior policy $\mu(a|s)$ is used to select actions instead.
  Assume coverage: $\forall s, a:\ \pi(a|s) > 0 \implies \mu(a|s) > 0$.
  New ergodic stationary distribution: $[d_\mu]_s \doteq d_\mu(s) \doteq \lim_{t\to\infty} P\{S_t = s\} > 0, \ \forall s \in \mathcal{S}$.
  Old value function: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] \approx w^\top x(s)$.

  Importance sampling ratio: $\rho_t \doteq \dfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$.
    $\mathbb{E}_\mu[\rho_t \mid S_t = s] = \sum_a \mu(a|s)\, \dfrac{\pi(a|s)}{\mu(a|s)} = \sum_a \pi(a|s) = 1$.
  For any random variable $Z_{t+1}$:
    $\mathbb{E}_\mu[\rho_t Z_{t+1} \mid S_t = s] = \sum_a \mu(a|s)\, \dfrac{\pi(a|s)}{\mu(a|s)}\, Z_{t+1} = \sum_a \pi(a|s)\, Z_{t+1} = \mathbb{E}_\pi[Z_{t+1} \mid S_t = s]$.

  Linear off-policy TD(0), with $x_t \doteq x(S_t)$:
    $w_{t+1} \doteq w_t + \rho_t\, \alpha \big( R_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t \big)\, x_t$
    $\quad\;\, = w_t + \alpha \big( \underbrace{\rho_t R_{t+1}\, x_t}_{b_t} - \underbrace{\rho_t\, x_t (x_t - \gamma x_{t+1})^\top}_{A_t} w_t \big)$,

  and its A matrix:
    $A = \lim_{t\to\infty} \mathbb{E}[A_t] = \lim_{t\to\infty} \mathbb{E}_\mu\big[ \rho_t\, x_t (x_t - \gamma x_{t+1})^\top \big]$
    $\quad = \sum_s d_\mu(s)\, \mathbb{E}_\mu\big[ \rho_t\, x_t (x_t - \gamma x_{t+1})^\top \,\big|\, S_t = s \big]$
    $\quad = \sum_s d_\mu(s)\, \mathbb{E}_\pi\big[ x_t (x_t - \gamma x_{t+1})^\top \,\big|\, S_t = s \big]$
    $\quad = \sum_s d_\mu(s)\, x(s) \Big( x(s) - \gamma \sum_{s'} [P_\pi]_{ss'}\, x(s') \Big)^{\!\top}$
    $\quad = X^\top D_\mu (I - \gamma P_\pi) X$.

  The key matrix now has mismatched D and P matrices; it is not stable.
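  A minimal sketch of the off-policy linear TD(0) update above (same hypothetical interface as before, plus `behavior`, `pi_prob`, and `mu_prob` for sampling from μ and evaluating the two policies' action probabilities; all names are assumptions):

```python
import numpy as np

def off_policy_linear_td0(env_step, features, behavior, pi_prob, mu_prob,
                          s0, n, alpha=0.01, gamma=0.9, steps=10000):
    """Off-policy linear TD(0): actions from mu, updates reweighted by rho_t = pi/mu."""
    w = np.zeros(n)
    s = s0
    for _ in range(steps):
        a = behavior(s)                          # action selected by the behavior policy mu
        rho = pi_prob(s, a) / mu_prob(s, a)      # importance sampling ratio rho_t
        s_next, r = env_step(s, a)
        x, x_next = features(s), features(s_next)
        delta = r + gamma * (w @ x_next) - (w @ x)
        w += alpha * rho * delta * x             # corrects the return distribution only
        s = s_next
    return w                                     # may diverge: states are still updated under d_mu
```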

  6. Off-policy TD(0)'s A matrix: $A = X^\top D_\mu (I - \gamma P_\pi) X$. The key matrix now has mismatched D and P matrices; it is not stable.

  Counterexample (two states with features $x(1) = 1$ and $x(2) = 2$, so the approximate values are $w$ and $2w$):
    $\lambda = 0$, $\gamma = 0.9$, $\mu(\text{right} \mid \cdot) = 0.5$, $\pi(\text{right} \mid \cdot) = 1$,
    $X = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, transition probability matrix $P_\pi = \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix}$.

  Key matrix:
    $D_\mu (I - \gamma P_\pi) = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \begin{bmatrix} 1 & -0.9 \\ 0 & 0.1 \end{bmatrix} = \begin{bmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{bmatrix}$, whose second column sums to $-0.4 < 0$!

  Positive-definiteness test:
    $X^\top D_\mu (I - \gamma P_\pi) X = \begin{bmatrix} 1 & 2 \end{bmatrix} \begin{bmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 & 2 \end{bmatrix} \begin{bmatrix} -0.4 \\ 0.1 \end{bmatrix} = -0.2$.

  A is not positive definite! Stability is not assured.
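  The counterexample's arithmetic can be checked in a few lines of NumPy (a sketch for verification only):

```python
import numpy as np

# Reproduce the 2-state counterexample numerically.
gamma = 0.9
X = np.array([[1.0], [2.0]])                  # features: x(1) = 1, x(2) = 2
P_pi = np.array([[0.0, 1.0], [0.0, 1.0]])     # target policy always moves to state 2
D_mu = np.diag([0.5, 0.5])                    # behavior policy's stationary distribution

key = D_mu @ (np.eye(2) - gamma * P_pi)
A = X.T @ key @ X
print(key.sum(axis=0))   # column sums: [ 0.5 -0.4]  -> a negative column sum
print(A)                 # [[-0.2]]              -> A is not positive definite
```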

  7. 2 off-policy learning problems
  1. Correcting for the distribution of future returns.
     Solution: importance sampling (Sutton & Barto 1998; improved by Precup, Sutton & Singh, 2000), now used in GTD(λ) and GQ(λ).
  2. Correcting for the state-update distribution.
     Solution: none known, other than more importance sampling (Precup, Sutton & Dasgupta, 2001), which as proposed was of very high variance. The ideas of that work are strikingly similar to those of emphasis…

  8. Geometric Insight [figure, after Ben Van Roy 2009]

  9. Other Distribution [figure, after Ben Van Roy 2009]

  10. Problem 2 of off-policy learning: Correcting for the state-update distribution
  • The distribution of updated states does not 'match' the target policy
  • Only a problem with function approximation, but that's a show stopper
  • Precup, Sutton & Dasgupta (2001) treated the episodic case, used importance sampling to warp the state distribution from the behavior policy's distribution to the target policy's distribution, then did a future-reweighted update at each state
    • equivalent to emphasis = product of all i.s. ratios since the beginning of time (see the sketch after this list)
    • ok algorithm, but severe variance problems in both theory and practice
  • Performance assessed on whole episodes following the target policy
  • This 'alternate life' view of off-policy learning was then abandoned
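  The following tiny sketch (an illustration, not the 2001 algorithm itself) shows why weighting each update by the product of all importance sampling ratios so far is high-variance: with a deterministic target policy and a uniform two-action behavior policy, each per-step ratio is 0 or 2, so the product either collapses to zero or grows geometrically.

```python
import numpy as np

# Weight on the update at time t under the 'alternate life' scheme:
# the product of ALL importance sampling ratios since the beginning of time.
rng = np.random.default_rng(1)
rhos = rng.choice([0.0, 2.0], size=50)   # rho_t for pi deterministic, mu uniform over 2 actions
weights = np.cumprod(rhos)               # emphasis-like weight at each step
print(weights[:10])                      # either hits 0 or blows up as 2^t
```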

  11. The excursion view of off-policy learning
  • In which we are following a (possibly changing) behavior policy forever, and are in its stationary distribution
  • We want to predict the consequences of deviating from it for a limited time with various target policies (e.g., options)
  • Error is assessed on these 'excursions' starting from states in the behavior distribution
  • Much more practical setting than 'alternate life'
  • This setting was the basis for all the work with gradient-TD and MSPBE

  12. Emphasis warping
  • The idea is that emphasis warps the distribution of updated states from the behavior policy's stationary distribution to something like the 'followon distribution' of the target policy started in the behavior policy's stationary distribution
  • From which future-reweighted updates will be stable in expectation; this follows from old results (Dayan 1992, Sutton 1988) on convergence of TD(λ) in episodic MDPs
  • A new algorithm: Emphatic TD(λ) (see the sketch below)
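  Below is a minimal sketch of the Emphatic TD(λ) update with constant discount, constant bootstrapping parameter, and unit interest, as I understand the algorithm; the interface names (`env_step`, `features`, `behavior`, `pi_prob`, `mu_prob`) are hypothetical, and this is an illustration rather than the authors' reference implementation:

```python
import numpy as np

def emphatic_td_lambda(env_step, features, behavior, pi_prob, mu_prob, s0, n,
                       alpha=0.001, gamma=0.9, lam=0.0, steps=100000):
    """Emphatic TD(lambda) sketch: constant gamma and lambda, interest i_t = 1."""
    w = np.zeros(n)
    e = np.zeros(n)          # eligibility trace
    F = 0.0                  # followon trace
    rho_prev = 1.0
    s = s0
    x = features(s)
    for _ in range(steps):
        a = behavior(s)
        rho = pi_prob(s, a) / mu_prob(s, a)          # importance sampling ratio
        s_next, r = env_step(s, a)
        x_next = features(s_next)
        delta = r + gamma * (w @ x_next) - (w @ x)   # TD error
        F = rho_prev * gamma * F + 1.0               # followon trace (interest = 1)
        M = lam + (1.0 - lam) * F                    # emphasis
        e = rho * (gamma * lam * e + M * x)          # emphatically weighted trace
        w += alpha * delta * e
        rho_prev, s, x = rho, s_next, x_next
    return w
```

  With λ = 0 this reduces to weighting each off-policy TD(0) update by the emphasis M_t = F_t, which is the warping of the state-update distribution described in the bullets above.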
