Eligibility Traces: Unifying Monte Carlo and TD
Key algorithms: TD(λ), Sarsa(λ), Q(λ)
Unified View
[Figure: the space of backups, with width of backup on one axis and height (depth) of backup on the other; the corners are temporal-difference learning, dynamic programming, exhaustive search, and Monte Carlo.]
N-step TD Prediction
Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps).
[Figure: backup diagrams for 1-step TD, 2-step, 3-step, …, n-step, and Monte Carlo.]
Mathematics of N-step TD Prediction
Monte Carlo: $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$
TD (1-step return): $G_t^{(1)} \doteq R_{t+1} + \gamma V_t(S_{t+1})$ (use $V_t$ to estimate the remaining return)
2-step return: $G_t^{(2)} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_t(S_{t+2})$
n-step return: $G_t^{(n)} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_t(S_{t+n})$
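To make the definition concrete, here is a minimal Python sketch of the n-step return; the array layout (rewards[k] holding R_{k+1}, values[k] holding V(S_k)) and the function name are assumptions for illustration, not from the slides.

def n_step_return(rewards, values, t, n, gamma):
    """G_t^{(n)}: the first n discounted rewards from time t, plus
    gamma^n * V(S_{t+n}) if the episode has not terminated by then.
    Assumed layout: rewards[k] = R_{k+1}, values[k] = V(S_k), T = len(rewards)."""
    T = len(rewards)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:                      # bootstrap only from a non-terminal state
        G += gamma ** n * values[t + n]
    return G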
Forward View of TD(λ)
Look forward from each state to determine the update from future states and rewards:
[Figure: from state $S_t$, look forward in time through $S_{t+1}, S_{t+2}, S_{t+3}, \ldots$ and rewards $R_{t+1}, R_{t+2}, R_{t+3}, \ldots, R_T$.]
Learning with n-step Backups
The backup computes an increment:
$\Delta_t(S_t) \doteq \alpha \left[ G_t^{(n)} - V_t(S_t) \right]$, with $\Delta_t(s) = 0$ for all $s \neq S_t$.
Then:
On-line updating: $V_{t+1}(s) = V_t(s) + \Delta_t(s)$, for all $s \in \mathcal{S}$.
Off-line updating: $V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta_t(s)$, for all $s \in \mathcal{S}$.
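A sketch, under the same assumed episode layout, of how on-line and off-line updating differ in bookkeeping. In a fully incremental on-line implementation the increment for time t would be applied at time t + n, once R_{t+n} and S_{t+n} have been observed; this simplified version just replays a recorded episode.

def n_step_td_prediction(states, rewards, V, n, gamma, alpha, online=True):
    """n-step TD prediction on one recorded episode (a sketch).
    Assumed layout: states[t] = S_t for t = 0..T (states[T] terminal),
    rewards[t] = R_{t+1}, and V maps non-terminal states to estimates."""
    T = len(rewards)
    offline_delta = {s: 0.0 for s in V}
    for t in range(T):
        # n-step return G_t^{(n)}: up to n rewards, then bootstrap if non-terminal
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
        if t + n < T:
            G += gamma ** n * V[states[t + n]]
        increment = alpha * (G - V[states[t]])    # Δ_t(S_t) = α [G_t^{(n)} − V(S_t)]
        if online:
            V[states[t]] += increment             # apply immediately
        else:
            offline_delta[states[t]] += increment  # accumulate; apply after the episode
    if not online:
        for s, d in offline_delta.items():
            V[s] += d                              # V(s) ← V(s) + Σ_t Δ_t(s)
    return V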
Error-reduction Property
The n-step return satisfies
$\max_s \left| \mathbb{E}_\pi\!\left[ G_t^{(n)} \mid S_t = s \right] - v_\pi(s) \right| \le \gamma^n \max_s \left| V_t(s) - v_\pi(s) \right|$
(the maximum error using the n-step return is at most $\gamma^n$ times the maximum error using $V$).
Using this, you can show that n-step methods converge.
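The bound follows in one step from the definition of the n-step return; a sketch of the standard argument (not spelled out on the slide): the first n reward terms of $G_t^{(n)}$ cancel against the corresponding terms of $v_\pi(s)$ written via its n-step Bellman equation, leaving only the discounted bootstrapping error:

\begin{align*}
\bigl|\mathbb{E}_\pi[G_t^{(n)} \mid S_t = s] - v_\pi(s)\bigr|
  &= \gamma^n \bigl|\mathbb{E}_\pi[V_t(S_{t+n}) - v_\pi(S_{t+n}) \mid S_t = s]\bigr| \\
  &\le \gamma^n \max_{s'} \bigl|V_t(s') - v_\pi(s')\bigr|.
\end{align*}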
Random Walk Examples
[Figure: 5-state random walk A, B, C, D, E with the start state in the middle; all rewards are 0 except a reward of 1 on the right terminal transition.]
How does 2-step TD work here? How about 3-step TD?
A Larger Example: 19-state Random Walk
[Figure: RMS error over the first 10 episodes as a function of α, for on-line and off-line n-step TD methods with n = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512.]
On-line is better than off-line. An intermediate n is best.
Do you think there is an optimal n for every task?
Averaging N-step Returns
n-step methods were introduced to help with understanding TD(λ).
Idea: back up an average of several returns, e.g. half of the 2-step return and half of the 4-step return: $\frac{1}{2} G_t^{(2)} + \frac{1}{2} G_t^{(4)}$.
Any such average is a valid target as long as the weights are non-negative and sum to 1.
This is called a complex backup: draw each component and label it with the weight for that component.
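A tiny sketch of a complex-backup target (function and variable names are illustrative only):

def averaged_return(component_returns, weights):
    """Weighted average of several n-step returns; valid as a backup target
    as long as the weights are non-negative and sum to 1."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    return sum(w * G for w, G in zip(weights, component_returns))

# e.g. half of the 2-step and half of the 4-step return:
# target = averaged_return([G2, G4], [0.5, 0.5])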
Forward View of TD(λ): the λ-return
TD(λ) is a method for averaging all n-step backups, weighting the n-step return by $\lambda^{n-1}$ (decaying with time since visitation).
λ-return: $G_t^\lambda \doteq (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
Backup using the λ-return: $\Delta_t(S_t) \doteq \alpha \left[ G_t^\lambda - V_t(S_t) \right]$
[Figure: backup diagram with weights $1-\lambda$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, …, and $\lambda^{T-t-1}$ on the final (Monte Carlo) return.]
λ-return Weighting Function
[Figure: the weight given to each n-step return decays by λ per step; the weight given to the actual, final return is $\lambda^{T-t-1}$; the total area is 1.]
$G_t^\lambda \doteq (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$
(the first term covers the n-step returns until termination, the second the remaining weight after termination).
Relation to TD(0) and MC
The λ-return can be rewritten as:
$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$
(until termination / after termination).
If λ = 1, you get MC: $G_t^\lambda = (1 - 1) \sum_{n=1}^{T-t-1} 1^{n-1} G_t^{(n)} + 1^{T-t-1} G_t = G_t$
If λ = 0, you get TD(0): $G_t^\lambda = (1 - 0) \sum_{n=1}^{T-t-1} 0^{n-1} G_t^{(n)} + 0^{T-t-1} G_t = G_t^{(1)}$
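A direct transcription of this episodic form of the λ-return, under the same assumed layout as earlier (rewards[k] = R_{k+1}, values[k] = V(S_k)); setting lam to 0 or 1 recovers the two special cases above.

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^λ = (1-λ) Σ_{n=1}^{T-t-1} λ^{n-1} G_t^{(n)} + λ^{T-t-1} G_t (a sketch)."""
    T = len(rewards)

    def n_step(n):                     # G_t^{(n)} as defined earlier
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
        return G + (gamma ** n * values[t + n] if t + n < T else 0.0)

    G_lam = (1 - lam) * sum(lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    return G_lam + lam ** (T - t - 1) * n_step(T - t)   # final term is the full return G_t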
Forward View of TD(λ)
Look forward from each state to determine the update from future states and rewards:
[Figure: the same forward-view diagram as before, looking ahead from $S_t$ over future states and rewards up to $R_T$.]
λ-return on the Random Walk
[Figure: RMS error over the first 10 episodes vs. α for the off-line λ-return algorithm (≡ off-line TD(λ) with accumulating traces) and the on-line λ-return algorithm, for λ = 0, .4, .8, .9, .95, .975, .99, 1.]
On-line >> off-line. Intermediate values of λ are best. The λ-return does better than the n-step return.
Backward View
$\delta_t \doteq R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$
$\Delta V_t(s) \doteq \alpha \, \delta_t \, E_t(s)$
[Figure: the TD error $\delta_t$ is shouted backwards over time to previously visited states $S_{t-1}, S_{t-2}, S_{t-3}, \ldots$.]
Shout $\delta_t$ backwards over time; the strength of your voice decreases with temporal distance by γλ.
Backward View of TD(λ)
The forward view was for theory; the backward view is for mechanism.
A new variable called the eligibility trace: $E_t(s) \in \mathbb{R}^+$.
On each step, decay all traces by γλ and increment the trace for the current state by 1:
$E_t(s) = \begin{cases} \gamma \lambda E_{t-1}(s) & \text{if } s \neq S_t; \\ \gamma \lambda E_{t-1}(s) + 1 & \text{if } s = S_t \end{cases}$ (accumulating trace)
The accumulating trace builds up at the times of visits to a state and decays in between.
On-line Tabular TD(λ)
Initialize V(s) arbitrarily (but set to 0 if s is terminal)
Repeat (for each episode):
    Initialize E(s) = 0, for all s ∈ S
    Initialize S
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A, observe reward R and next state S′
        δ ← R + γV(S′) − V(S)
        E(S) ← E(S) + 1            (accumulating traces)
          or E(S) ← (1 − α)E(S) + 1  (dutch traces)
          or E(S) ← 1                (replacing traces)
        For all s ∈ S:
            V(s) ← V(s) + αδE(s)
            E(s) ← γλE(s)
        S ← S′
    until S is terminal
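A minimal Python transcription of this pseudocode; the environment and policy interface (env.reset(), env.step(a) returning (reward, next_state, done), policy(s)) is an assumption for illustration.

from collections import defaultdict

def online_tabular_td_lambda(env, policy, num_episodes, alpha, gamma, lam,
                             trace="accumulating"):
    """On-line tabular TD(λ) with a choice of trace type (a sketch)."""
    V = defaultdict(float)                 # V(terminal) is never updated, stays 0
    for _ in range(num_episodes):
        E = defaultdict(float)             # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env.step(a)
            delta = r + gamma * V[s_next] - V[s]
            if trace == "accumulating":
                E[s] += 1.0
            elif trace == "dutch":
                E[s] = (1 - alpha) * E[s] + 1.0
            else:                          # replacing
                E[s] = 1.0
            for state in list(E):          # update every state with a nonzero trace
                V[state] += alpha * delta * E[state]
                E[state] *= gamma * lam
            s = s_next
    return V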
Relation of the Backward View to MC & TD(0)
Using the update rule $\Delta V_t(s) \doteq \alpha \, \delta_t \, E_t(s)$:
As before, if you set λ to 0, you get TD(0).
If you set λ to 1, you get MC, but in a better way:
you can apply TD(1) to continuing tasks, and
it works incrementally and on-line (instead of waiting until the end of the episode).
Forward View = Backward View
The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating:
$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(S_t) \, \mathbb{I}_{sS_t}$
(sum of backward updates = sum of forward updates), and by algebra both sides equal
$\alpha \sum_{t=0}^{T-1} \mathbb{I}_{sS_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$.
On-line updating with small α is similar.
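The identity can be checked numerically on any recorded episode, holding V fixed (off-line updating); below is a sketch, with the episode layout (states[t] = S_t including the terminal state, rewards[t] = R_{t+1}, V defined on the non-terminal states) assumed for illustration.

def offline_backward_updates(states, rewards, V, alpha, gamma, lam):
    """Total backward-view increment to each state over one episode,
    with accumulating traces and V held fixed."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * V.get(states[t + 1], 0.0) - V[states[t]]
              for t in range(T)]
    total = {s: 0.0 for s in V}
    E = {s: 0.0 for s in V}
    for t in range(T):
        E[states[t]] += 1.0
        for s in V:
            total[s] += alpha * deltas[t] * E[s]
            E[s] *= gamma * lam
    return total

def offline_forward_updates(states, rewards, V, alpha, gamma, lam):
    """Total forward-view increment α[G_t^λ − V(S_t)] to each state,
    computed directly from the λ-return definition with V held fixed."""
    T = len(rewards)

    def n_step(t, n):
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
        return G + (gamma ** n * V[states[t + n]] if t + n < T else 0.0)

    total = {s: 0.0 for s in V}
    for t in range(T):
        G_lam = (1 - lam) * sum(lam ** (n - 1) * n_step(t, n)
                                for n in range(1, T - t))
        G_lam += lam ** (T - t - 1) * n_step(t, T - t)
        total[states[t]] += alpha * (G_lam - V[states[t]])
    return total

# For any episode the two totals agree (up to floating-point rounding).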
On-line versus Off-line on the Random Walk
[Figure: RMS error over the first 10 episodes vs. α on the same 19-state random walk, for off-line TD(λ) with accumulating traces (≡ the off-line λ-return algorithm) and on-line TD(λ) with accumulating traces, for λ = 0, .4, .8, .9, .95, .975, .99, 1.]
On-line performs better over a broader range of parameters.
Replacing and Dutch Traces
All traces fade in the same way:
$E_t(s) \doteq \gamma \lambda E_{t-1}(s)$, for all $s \in \mathcal{S}$, $s \neq S_t$,
but they increment differently at the visited state:
accumulating traces: $E_t(S_t) \doteq \gamma \lambda E_{t-1}(S_t) + 1$
dutch traces: $E_t(S_t) \doteq (1 - \alpha) \gamma \lambda E_{t-1}(S_t) + 1$ (shown for α = 0.5)
replacing traces: $E_t(S_t) \doteq 1$
Replacing and Dutch on the Random Walk
[Figure: RMS error over the first 10 episodes vs. α for on-line TD(λ) with replacing traces and with dutch traces, for λ = 0, .4, .8, .9, .95, .975, .99, 1.]
All λ Results on the Random Walk
[Figure: six panels of RMS error over the first 10 episodes vs. α on the 19-state random walk: the off-line λ-return algorithm (≡ off-line TD(λ), accumulating traces), the on-line λ-return algorithm, on-line TD(λ) with accumulating traces, on-line TD(λ) with dutch traces, on-line TD(λ) with replacing traces, and true on-line TD(λ) (≡ the real-time λ-return algorithm), each for λ = 0, .4, .8, .9, .95, .975, .99, 1.]
Control: Sarsa(λ)
Everything changes from states to state-action pairs:
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \, \delta_t \, E_t(s,a)$, for all $s, a$,
where
$\delta_t = R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t)$
and
$E_t(s,a) = \begin{cases} \gamma \lambda E_{t-1}(s,a) + 1 & \text{if } s = S_t \text{ and } a = A_t; \\ \gamma \lambda E_{t-1}(s,a) & \text{otherwise,} \end{cases}$ for all $s, a$.
[Figure: Sarsa(λ) backup diagram with weights $1-\lambda$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, …, $\lambda^{T-t-1}$; the weights sum to 1.]
Demo
Sarsa(λ) Algorithm
Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
Repeat (for each episode):
    E(s, a) = 0, for all s ∈ S, a ∈ A(s)
    Initialize S, A
    Repeat (for each step of episode):
        Take action A, observe R, S′
        Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
        δ ← R + γQ(S′, A′) − Q(S, A)
        E(S, A) ← E(S, A) + 1
        For all s ∈ S, a ∈ A(s):
            Q(s, a) ← Q(s, a) + αδE(s, a)
            E(s, a) ← γλE(s, a)
        S ← S′; A ← A′
    until S is terminal
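A minimal Python transcription of this pseudocode with accumulating traces; the environment interface (env.reset(), env.step(a) returning (reward, next_state, done), env.actions(s)) is an assumption for illustration.

import random
from collections import defaultdict

def sarsa_lambda(env, num_episodes, alpha, gamma, lam, epsilon):
    """Tabular Sarsa(λ) with accumulating traces (a sketch)."""
    Q = defaultdict(float)                   # Q[(s, a)]

    def epsilon_greedy(s):
        acts = env.actions(s)
        if random.random() < epsilon:
            return random.choice(acts)
        return max(acts, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        E = defaultdict(float)               # E[(s, a)] eligibility traces
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            r, s_next, done = env.step(a)
            a_next = None if done else epsilon_greedy(s_next)
            delta = r + (0.0 if done else gamma * Q[(s_next, a_next)]) - Q[(s, a)]
            E[(s, a)] += 1.0                 # accumulating trace
            for key in list(E):              # update every pair with a nonzero trace
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam
            s, a = s_next, a_next
    return Q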
Sarsa(λ) Gridworld Example
[Figure: the path taken on one trial, and the action values increased by one-step Sarsa vs. by Sarsa(λ) with λ = 0.9.]
With one trial, the agent has much more information about how to get to the goal (though not necessarily the best way).
Eligibility traces can considerably accelerate learning.