  1. Eligibility Traces (Chapter 12)

  2. Eligibility traces are:
     - Another way of interpolating between MC and TD methods
     - A way of implementing compound λ-return targets
     - A basic mechanistic idea: a short-term, fading memory
     - A new style of algorithm development/analysis: the forward-view ⇔ backward-view transformation
       - Forward view: conceptually simple, good for theory and intuition
       - Backward view: computationally congenial implementation of the forward view

  3. Unified View
     [Figure: the space of backups, organized by width of backup (sample backups of temporal-difference learning vs. full backups of dynamic programming) and depth of backup (one-step bootstrapping vs. Monte Carlo / exhaustive search), with multi-step bootstrapping methods in between.]

  4. Recall n-step targets
     For example, in the episodic case with linear function approximation:
     2-step target:  G_t^(2) = R_{t+1} + γ R_{t+2} + γ^2 θ_{t+1}^⊤ φ_{t+2}
     n-step target:  G_t^(n) = R_{t+1} + ··· + γ^{n−1} R_{t+n} + γ^n θ_{t+n−1}^⊤ φ_{t+n}
     (with values after termination taken as zero, and G_t^(n) = G_t if t + n ≥ T)
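
A minimal sketch of this n-step target in Python, assuming hypothetical lists `rewards` = [R_1, ..., R_T], `features` = [φ_0, ..., φ_{T−1}], and `thetas` holding the weight vectors to bootstrap from (none of these names come from the slides):

```python
def n_step_target(t, n, rewards, features, thetas, gamma, T):
    """n-step return G_t^(n) with linear bootstrapping, truncated at termination.
    rewards[k-1] holds R_k, features[k] holds phi_k, thetas[k] holds theta_k."""
    end = min(t + n, T)
    g = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, end + 1))
    if t + n < T:                                   # bootstrap only before termination
        g += gamma ** n * thetas[t + n - 1] @ features[t + n]
    return g                                        # equals the full return G_t when t + n >= T
```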

  5. Any set of update targets can be averaged to produce a new compound update target.
     For example, half a 2-step plus half a 4-step:  U_t = ½ G_t^(2) + ½ G_t^(4)
     This is called a compound backup: draw each component backup and label it with the weight for that component.

  6. The λ-return is a compound update target
     The λ-return is a target that averages all n-step targets, each weighted by λ^{n−1}:
     G_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_t^(n)
     [Figure: the TD(λ) backup diagram, with component weights 1−λ, (1−λ)λ, (1−λ)λ^2, ..., and weight λ^{T−t−1} on the final (Monte Carlo) backup; the weights sum to 1.]

  7. λ-return weighting function
     G_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} G_t^(n) + λ^{T−t−1} G_t
     The first term covers the n-step returns until termination; the second is the weight given to the actual, final return.
     [Figure: the weight (1 − λ)λ^{n−1} given to each n-step return decays by λ per step; e.g., the 3-step return gets weight (1 − λ)λ^2. The actual, final return gets all the remaining weight, λ^{T−t−1}. The total area (sum of the weights) is 1.]
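
As a sketch, the episodic λ-return can be assembled from the hypothetical `n_step_target` helper above; the last term is the full return G_t, since the (T−t)-step return never bootstraps:

```python
def lambda_return(t, rewards, features, thetas, gamma, lam, T):
    """G_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * G_t^(n) + lam^(T-t-1) * G_t."""
    g = sum((1 - lam) * lam ** (n - 1) *
            n_step_target(t, n, rewards, features, thetas, gamma, T)
            for n in range(1, T - t))               # n = 1, ..., T-t-1 (until termination)
    g += lam ** (T - t - 1) * n_step_target(t, T - t, rewards, features, thetas, gamma, T)
    return g
```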

  8. Relation to TD(0) and MC
     The λ-return can be rewritten as:
     G_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} G_t^(n) + λ^{T−t−1} G_t
     If λ = 1, all the weight falls on the final return and you get the MC target:  G_t^λ = G_t
     If λ = 0, all the weight falls on the one-step return and you get the TD(0) target:  G_t^λ = G_t^(1)

  9. The offline λ-return "algorithm"
     Wait until the end of the episode (offline), then go back over the time steps, updating:
     θ_{t+1} = θ_t + α [G_t^λ − v̂(S_t, θ_t)] ∇v̂(S_t, θ_t),   t = 0, ..., T − 1
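
A sketch of one offline sweep, under the simplifying assumption that all bootstrap values use the weights held fixed during the episode; `episode_features` = [φ_0, ..., φ_{T−1}] and `rewards` = [R_1, ..., R_T] are hypothetical episode records:

```python
def offline_lambda_return_sweep(theta, episode_features, rewards, gamma, lam, alpha):
    """After the episode ends, update toward G_t^lambda for t = 0, ..., T-1."""
    T = len(rewards)
    frozen = [theta] * T                    # bootstrap from the pre-episode weights (simplification)
    for t in range(T):
        G = lambda_return(t, rewards, episode_features, frozen, gamma, lam, T)
        phi_t = episode_features[t]
        # semi-gradient update; with linear v_hat the gradient is just phi_t
        theta = theta + alpha * (G - theta @ phi_t) * phi_t
    return theta
```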

  10. The λ-return algorithm performs similarly to n-step algorithms on the 19-state random walk
      [Figure: RMS error at the end of the episode, averaged over the first 10 episodes, as a function of α; tabular n-step TD methods from Chapter 7 (n = 1, 2, 4, ..., 512) on the left, the offline λ-return algorithm (λ = 0, .4, .8, .9, .95, .975, .99, 1) on the right.]
      An intermediate λ is best (just as an intermediate n is best); the λ-return is slightly better than n-step.

  11. The forward view
      The forward view looks forward from the state being updated to future states and rewards.
      [Figure: from state S_t, the update looks ahead along the time line to S_{t+1}, R_{t+1}, S_{t+2}, R_{t+2}, S_{t+3}, R_{t+3}, ..., up to R_T.]

  12. The backward view
      The backward view looks back to the recently visited states (marked by eligibility traces).
      Shout the TD error backwards; the traces fade with temporal distance by γλ.
      [Figure: the TD error δ_t is broadcast back to S_{t−1}, S_{t−2}, S_{t−3}, ..., each updated in proportion to its trace e_t.]

  13. Demo
      Here we are marking state-action pairs with a replacing eligibility trace.

  14. Eligibility traces (mechanism)
      The forward view was for theory; the backward view is for mechanism.
      New memory vector called the eligibility trace:  e_t ∈ ℝ^n, e_t ≥ 0, the same shape as θ.
      On each step, decay every component by γλ and increment the trace for the current state by 1 (in general, by the gradient ∇v̂(S_t, θ_t)).
      Accumulating trace:  e_0 = 0,   e_t = ∇v̂(S_t, θ_t) + γλ e_{t−1}
      [Figure: an accumulating trace rising at the times of visits to a state and fading in between.]

  15. The semi-gradient TD(λ) algorithm
      θ_{t+1} = θ_t + α δ_t e_t
      δ_t = R_{t+1} + γ v̂(S_{t+1}, θ_t) − v̂(S_t, θ_t)
      e_0 = 0,   e_t = ∇v̂(S_t, θ_t) + γλ e_{t−1}
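
A minimal sketch of semi-gradient TD(λ) with linear function approximation and an accumulating trace. The environment interface (`env.reset()`, `env.step()` returning (next state, reward, done) under a fixed policy) and the feature map `phi` are assumptions for illustration, not part of the slides:

```python
import numpy as np

def semi_gradient_td_lambda(env, phi, n_weights, alpha, gamma, lam, num_episodes):
    """Prediction with linear v_hat(s, theta) = theta @ phi(s)."""
    theta = np.zeros(n_weights)
    for _ in range(num_episodes):
        s = env.reset()
        e = np.zeros(n_weights)                      # e_0 = 0
        done = False
        while not done:
            s_next, r, done = env.step()             # policy is assumed fixed inside env
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v           # TD error delta_t
            e = gamma * lam * e + phi(s)             # decay by gamma*lam, add gradient (= features)
            theta = theta + alpha * delta * e        # update every traced component at once
            s = s_next
    return theta
```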

  16. TD(λ) performs similarly to the offline λ-return algorithm
      But slightly worse, particularly at high α.
      [Figure: tabular 19-state random walk task; RMS error at the end of the episode over the first 10 episodes vs. α, for the offline λ-return algorithm (from the previous section) and TD(λ), each at λ = 0, .4, .8, .9, .95, .975, .99, 1.]
      Can we do better? Can we update online?

  17. The online λ-return algorithm performs best of all
      [Figure 12.7: tabular 19-state random walk task; RMS error over the first 10 episodes vs. α, for the online λ-return algorithm (= true online TD(λ)) and the offline λ-return algorithm, each at λ = 0, .4, .8, .9, .95, .975, .99, 1.]

  18. The online λ-return algorithm uses a truncated λ-return as its target
      G_t^{λ|h} = (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^(n) + λ^{h−t−1} G_t^(h−t),   0 ≤ t < h ≤ T
      [Figure: from state S_t, the truncated return looks ahead only as far as the horizon, e.g. h = t + 3.]
      θ^h_{t+1} = θ^h_t + α [G_t^{λ|h} − v̂(S_t, θ^h_t)] ∇v̂(S_t, θ^h_t)
      There is a separate θ sequence for each horizon h!
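
A sketch of the truncated λ-return, again reusing the hypothetical `n_step_target` helper; unlike the full λ-return, the final n = h−t term bootstraps at the horizon (unless h = T):

```python
def truncated_lambda_return(t, h, rewards, features, thetas, gamma, lam, T):
    """G_t^{lambda|h} for 0 <= t < h <= T, bootstrapping from whatever weights `thetas` supplies."""
    g = sum((1 - lam) * lam ** (n - 1) *
            n_step_target(t, n, rewards, features, thetas, gamma, T)
            for n in range(1, h - t))               # n = 1, ..., h-t-1
    g += lam ** (h - t - 1) * n_step_target(t, h - t, rewards, features, thetas, gamma, T)
    return g
```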

  19. The online λ-return algorithm
      There is a separate θ sequence for each horizon h:
      h = 1:   θ^1_1 = θ^1_0 + α [G_0^{λ|1} − v̂(S_0, θ^1_0)] ∇v̂(S_0, θ^1_0)
      h = 2:   θ^2_1 = θ^2_0 + α [G_0^{λ|2} − v̂(S_0, θ^2_0)] ∇v̂(S_0, θ^2_0)
               θ^2_2 = θ^2_1 + α [G_1^{λ|2} − v̂(S_1, θ^2_1)] ∇v̂(S_1, θ^2_1)
      h = 3:   θ^3_1 = θ^3_0 + α [G_0^{λ|3} − v̂(S_0, θ^3_0)] ∇v̂(S_0, θ^3_0)
               θ^3_2 = θ^3_1 + α [G_1^{λ|3} − v̂(S_1, θ^3_1)] ∇v̂(S_1, θ^3_1)
               θ^3_3 = θ^3_2 + α [G_2^{λ|3} − v̂(S_2, θ^3_2)] ∇v̂(S_2, θ^3_2)
      ...and so on up to h = T.
      [Figure: the triangle of weight vectors θ^h_t, with one row per horizon h = 0, ..., T and entries t = 0, ..., h.]
      True online TD(λ) computes just the diagonal (the θ^t_t), cheaply (for linear FA).

  20. True online TD(λ)
      θ_{t+1} = θ_t + α δ_t e_t + α (θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t)(e_t − φ_t)
      e_t = γλ e_{t−1} + (1 − αγλ e_{t−1}^⊤ φ_t) φ_t        (the "dutch trace")
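
A sketch of true online TD(λ) for linear function approximation, written in the usual incremental form where `v_old` carries θ_{t−1}^⊤ φ_t across steps; `env` and `phi` are the same hypothetical helpers as in the earlier TD(λ) sketch:

```python
import numpy as np

def true_online_td_lambda(env, phi, n_weights, alpha, gamma, lam, num_episodes):
    """Linear-FA prediction with the dutch trace (equals the online lambda-return algorithm)."""
    theta = np.zeros(n_weights)
    for _ in range(num_episodes):
        x = phi(env.reset())
        e = np.zeros(n_weights)
        v_old = 0.0                                  # theta_{t-1}^T phi_t from the previous step
        done = False
        while not done:
            s_next, r, done = env.step()             # policy is assumed fixed inside env
            x_next = np.zeros(n_weights) if done else phi(s_next)
            v = theta @ x
            v_next = theta @ x_next
            delta = r + gamma * v_next - v
            e = gamma * lam * e + (1 - alpha * gamma * lam * (e @ x)) * x   # dutch trace
            theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * x
            v_old = v_next
            x = x_next
    return theta
```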

  21. Accumulating, dutch, and replacing traces
      All traces fade the same way (by γλ per step), but they increment differently on a visit.
      [Figure: for the same times of state visits, the accumulating trace, the dutch trace (α = 0.5), and the replacing trace.]
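
As an illustrative sketch (the function name and one-step interface are invented here), the three increments can be contrasted for a one-hot, tabular feature vector `x` of the visited state:

```python
import numpy as np

def step_traces(e_acc, e_dutch, e_rep, x, gamma, lam, alpha):
    """All three traces decay by gamma*lam, then increment differently at the visited state."""
    e_acc = gamma * lam * e_acc + x                                                  # accumulating: add 1
    e_dutch = gamma * lam * e_dutch + (1 - alpha * gamma * lam * (e_dutch @ x)) * x  # dutch
    e_rep = np.where(x > 0, 1.0, gamma * lam * e_rep)                                # replacing: reset to 1
    return e_acc, e_dutch, e_rep
```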

  22. The simplest example of deriving a backward view from a forward view
      Monte Carlo learning of a final target. We will derive dutch traces, showing that dutch traces are really not about TD: they are about efficiently implementing online algorithms.

  23. The problem: predict a final target Z with linear function approximation
      [Figure: over an episode of T steps the data are φ_0, φ_1, ..., φ_{T−1}, followed by Z; the weights stay at θ_0 throughout the episode, and the predictions θ_0^⊤ φ_0, θ_0^⊤ φ_1, ..., θ_0^⊤ φ_{T−1} should all approximate Z.]
      MC:  θ_{t+1} = θ_t + α_t (Z − θ_t^⊤ φ_t) φ_t,   t = 0, ..., T − 1   (α_t is the step size; all updates are done at time T)

  24. Computational goals
      Computation per step (including memory) must be:
      1. Constant (non-increasing with the number of episodes)
      2. Proportionate (proportional to the number of weights, i.e., O(n))
      3. Independent of span (not increasing with episode length)
      In general, the predictive span is the number of steps between making a prediction and observing the outcome.
      MC:  θ_{t+1} = θ_t + α_t (Z − θ_t^⊤ φ_t) φ_t,   t = 0, ..., T − 1   (all done at time T)
      What is the span? T. Is MC independent of span? No.

  25. Computational goals (continued)
      For MC, the computation and memory needed at step T increase with T, so MC is not independent of span.

  26. Final result
      Given:  θ_0;  φ_0, φ_1, φ_2, ..., φ_{T−1};  Z
      MC algorithm:  θ_{t+1} = θ_t + α_t (Z − θ_t^⊤ φ_t) φ_t,   t = 0, ..., T − 1
      Equivalent independent-of-span algorithm (with auxiliary vectors a_t ∈ ℝ^n, e_t ∈ ℝ^n):
      θ_T = a_{T−1} + Z e_{T−1}
      a_0 = θ_0,       a_t = a_{t−1} − α_t φ_t φ_t^⊤ a_{t−1},               t = 1, ..., T − 1
      e_0 = α_0 φ_0,   e_t = e_{t−1} − α_t φ_t φ_t^⊤ e_{t−1} + α_t φ_t,     t = 1, ..., T − 1
      Proved: both compute the same θ_T.
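
A sketch of the independent-of-span computation, assuming for simplicity a constant step size α; each loop iteration touches only the current φ_t and does O(n) work, and the `features` list appears here only to keep the example self-contained:

```python
import numpy as np

def span_independent_mc(theta0, features, Z, alpha):
    """Return theta_T = a_{T-1} + Z * e_{T-1}, equal to the batch MC result."""
    a = np.asarray(theta0, dtype=float)                  # a_0 = theta_0
    e = alpha * np.asarray(features[0], dtype=float)     # e_0 = alpha * phi_0
    for phi in features[1:]:                             # t = 1, ..., T-1
        phi = np.asarray(phi, dtype=float)
        a = a - alpha * (phi @ a) * phi                  # a_t = a_{t-1} - alpha * phi_t phi_t^T a_{t-1}
        e = e - alpha * (phi @ e) * phi + alpha * phi    # e_t
    return a + Z * e                                     # theta_T, formed as soon as Z is observed
```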
