Eligibility Traces: Unifying Monte Carlo and TD


  1. Eligibility Traces: Unifying Monte Carlo and TD. Key algorithms: TD(λ), Sarsa(λ), Q(λ).

  2. Unified View. [Diagram: methods arranged by the width of backup and the height (depth) of backup, with temporal-difference learning, dynamic programming, Monte Carlo, and exhaustive search at the corners.]

  3. N-step TD Prediction. Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps). [Backup diagrams: TD (1-step), 2-step, 3-step, …, n-step, Monte Carlo.]

  4. Mathematics of N-step TD Prediction.
      Monte Carlo: $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$
      TD (1-step return), using $V_t$ to estimate the remaining return: $G_t^{(1)} \doteq R_{t+1} + \gamma V_t(S_{t+1})$
      2-step return: $G_t^{(2)} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_t(S_{t+2})$
      n-step return: $G_t^{(n)} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_t(S_{t+n})$
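
A minimal Python sketch of the n-step return above, assuming a recorded trajectory with rewards[k] holding R_{k+1} and values[k] holding V_t(S_k); the function name and argument layout are illustrative, not from the slides:

    def n_step_return(rewards, values, t, n, gamma):
        """n-step return G_t^(n): n discounted rewards, then bootstrap from the
        value estimate of the state reached after n steps."""
        T = len(rewards)               # the episode terminates at time T
        n = min(n, T - t)              # past the end, the n-step return is the full return
        g = sum(gamma**k * rewards[t + k] for k in range(n))
        if t + n < T:                  # no bootstrap from the terminal state (its value is 0)
            g += gamma**n * values[t + n]
        return g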

  5. Forward View of TD(λ). Look forward from each state to determine the update from future states and rewards. [Figure: from state $S_t$, the view runs forward in time through $S_{t+1}$, $S_{t+2}$, $S_{t+3}$, … and rewards $R_{t+1}$, $R_{t+2}$, $R_{t+3}$, …, $R_T$.]

  6. Learning with n-step Backups.
      The backup computes an increment: $\Delta_t(S_t) \doteq \alpha \big[ G_t^{(n)} - V_t(S_t) \big]$, with $\Delta_t(s) = 0$ for all $s \neq S_t$.
      On-line updating: $V_{t+1}(s) = V_t(s) + \Delta_t(s)$, for all $s \in \mathcal{S}$.
      Off-line updating: $V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta_t(s)$, for all $s \in \mathcal{S}$.
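
A sketch of both updating schemes over a stored episode, reusing the hypothetical n_step_return helper above; V is assumed to be a dict-like mapping from states to value estimates (e.g. a collections.defaultdict(float)):

    def n_step_td_episode(states, rewards, V, n, alpha, gamma, online=True):
        """One episode of n-step TD prediction on a recorded trajectory.
        states[t] = S_t (terminal state not needed), rewards[t] = R_{t+1}.
        On-line: apply each increment immediately, so later returns bootstrap
        from the updated values. Off-line: sum the increments and apply them
        only at the end of the episode."""
        T = len(rewards)
        offline_delta = {}                                 # accumulated increments per state
        for t in range(T):
            g = n_step_return(rewards, [V[s] for s in states], t, n, gamma)
            increment = alpha * (g - V[states[t]])
            if online:
                V[states[t]] += increment                  # V_{t+1}(S_t) = V_t(S_t) + Δ_t(S_t)
            else:
                offline_delta[states[t]] = offline_delta.get(states[t], 0.0) + increment
        if not online:
            for s, d in offline_delta.items():             # apply the summed increments
                V[s] += d
        return V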

  7. Error-reduction Property of n-step Returns.
      $\max_s \big| \mathbb{E}_\pi[ G_t^{(n)} \mid S_t = s ] - v_\pi(s) \big| \le \gamma^n \max_s \big| V_t(s) - v_\pi(s) \big|$
      The maximum error of the n-step return is at most $\gamma^n$ times the maximum error of $V_t$. Using this, you can show that n-step methods converge.

  8. Random Walk Examples. [Figure: a five-state random walk A–B–C–D–E starting in the centre, with reward 0 everywhere except a reward of 1 on the transition off the right end.] How does 2-step TD work here? How about 3-step TD?
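
A sketch of this environment as an episode generator, which composes with the n_step_td_episode sketch above; the state names and start position follow the figure:

    import random

    def random_walk_episode():
        """Five-state random walk A–E: start in the centre, move left or right
        with equal probability, terminate off either end; the only nonzero
        reward is 1 for stepping off the right end."""
        states, rewards = [], []
        s = 2                                        # index of the centre state (A=0 … E=4)
        while 0 <= s <= 4:
            states.append("ABCDE"[s])
            s += random.choice([-1, +1])
            rewards.append(1.0 if s == 5 else 0.0)   # reward 1 only on the right exit
        return states, rewards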

  9. A Larger Example: the 19-state Random Walk. [Plots: RMS error over the first 10 episodes as a function of α, for on-line and off-line n-step TD methods with n = 1, 2, 4, …, 512.] On-line is better than off-line, and an intermediate n is best. Do you think there is an optimal n for every task?

  10. Averaging N-step Returns. n-step methods were introduced to help with understanding TD(λ). Idea: back up an average of several returns, e.g. half of the 2-step return and half of the 4-step return: $\frac{1}{2} G_t^{(2)} + \frac{1}{2} G_t^{(4)}$. This is called a complex backup; any set of returns can be averaged this way as long as the weights sum to 1. In the backup diagram, draw each component and label it with the weight for that component.
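
A small sketch of such an averaged (complex-backup) target, again using the hypothetical n_step_return helper; the weights are assumed to sum to 1:

    def averaged_return(rewards, values, t, gamma, weighted_ns):
        """Weighted average of several n-step returns, e.g.
        weighted_ns = {2: 0.5, 4: 0.5} for half 2-step, half 4-step."""
        return sum(w * n_step_return(rewards, values, t, n, gamma)
                   for n, w in weighted_ns.items())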

  11. Forward View of TD(λ): the λ-return. TD(λ) is a method for averaging all n-step backups, weighting the n-step return by $\lambda^{n-1}$ (where n is the time since visitation).
      λ-return: $G_t^{\lambda} \doteq (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
      Backup using the λ-return: $\Delta_t(S_t) \doteq \alpha \big[ G_t^{\lambda} - V_t(S_t) \big]$
      [Backup diagram: component weights $1-\lambda$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, …, and $\lambda^{T-t-1}$ on the final return; Σ = 1.]
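
A sketch of the λ-return for an episodic trajectory, using the finite form from the next slide (n-step returns weighted until termination, with the remaining weight on the full return); it reuses the hypothetical n_step_return helper:

    def lambda_return(rewards, values, t, gamma, lam):
        """G_t^λ = (1-λ) Σ_{n=1}^{T-t-1} λ^(n-1) G_t^(n)  +  λ^(T-t-1) G_t."""
        T = len(rewards)
        g = sum((1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
                for n in range(1, T - t))
        g += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)  # full return
        return g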

  12. λ-return Weighting Function. [Figure: the weight $(1-\lambda)\lambda^{n-1}$ given to each n-step return decays by λ per step; the weight $\lambda^{T-t-1}$ goes to the actual, final return; the total area is 1.]
      $G_t^{\lambda} \doteq (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$
      (the sum covers the returns until termination; the last term is the weight on the final return after termination)
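
A quick check (not on the slide) that these weights form a proper average, via the finite geometric series:

    (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1} + \lambda^{T-t-1}
        = (1-\lambda)\,\frac{1-\lambda^{T-t-1}}{1-\lambda} + \lambda^{T-t-1}
        = 1 .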

  13. Relation to TD(0) and MC. The λ-return can be rewritten as:
      $G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$
      If λ = 1, you get MC: $G_t^{\lambda} = (1 - 1) \sum_{n=1}^{T-t-1} 1^{n-1} G_t^{(n)} + 1^{T-t-1} G_t = G_t$
      If λ = 0, you get TD(0): $G_t^{\lambda} = (1 - 0) \sum_{n=1}^{T-t-1} 0^{n-1} G_t^{(n)} + 0^{T-t-1} G_t = G_t^{(1)}$

  14. Forward View of TD(λ) (figure repeated from slide 5). Look forward from each state to determine the update from future states and rewards.

  15. λ-return on the Random Walk. [Plots: RMS error over the first 10 episodes as a function of α for the off-line λ-return algorithm (≡ off-line TD(λ) with accumulating traces) and the on-line λ-return algorithm, for λ = 0, .4, .8, .9, .95, .975, .99, 1.] On-line is much better than off-line; intermediate values of λ are best; the λ-return does better than the n-step return.

  16. Backward View.
      TD error: $\delta_t \doteq R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$
      Update: $\Delta V_t(s) \doteq \alpha \delta_t E_t(s)$
      [Figure: eligibility traces $E_t(s)$ for the recently visited states $S_{t-3}, S_{t-2}, S_{t-1}, S_t$.] Shout $\delta_t$ backwards over time; the strength of your voice decreases with temporal distance by γλ.

  17. Backward View of TD(λ). The forward view was for theory; the backward view is for mechanism. A new variable called the eligibility trace, $E_t(s) \in \mathbb{R}^+$. On each step, decay all traces by γλ and increment the trace for the current state by 1 (the accumulating trace reflects the times of visits to a state):
      $E_t(s) = \begin{cases} \gamma\lambda E_{t-1}(s) & \text{if } s \neq S_t; \\ \gamma\lambda E_{t-1}(s) + 1 & \text{if } s = S_t. \end{cases}$

  18. On-line Tabular TD(λ)
      Initialize V(s) arbitrarily (but set to 0 if s is terminal)
      Repeat (for each episode):
          Initialize E(s) = 0, for all s ∈ S
          Initialize S
          Repeat (for each step of episode):
              A ← action given by π for S
              Take action A, observe reward R and next state S′
              δ ← R + γV(S′) − V(S)
              E(S) ← E(S) + 1            (accumulating traces)
              or E(S) ← (1 − α)E(S) + 1  (dutch traces)
              or E(S) ← 1                (replacing traces)
              For all s ∈ S:
                  V(s) ← V(s) + αδE(s)
                  E(s) ← γλE(s)
              S ← S′
          until S is terminal
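
A runnable sketch of this algorithm in Python, assuming an environment object with reset() -> state and step(action) -> (reward, next_state, done) and a policy function; these names and this interface are illustrative, not part of the slides:

    from collections import defaultdict

    def td_lambda(env, policy, alpha, gamma, lam, episodes, trace="accumulating"):
        """On-line tabular TD(λ) prediction with a choice of trace update."""
        V = defaultdict(float)                       # value estimates; terminal values stay 0
        for _ in range(episodes):
            E = defaultdict(float)                   # eligibility traces, reset each episode
            s, done = env.reset(), False
            while not done:
                r, s_next, done = env.step(policy(s))
                target = 0.0 if done else V[s_next]  # V(terminal) = 0
                delta = r + gamma * target - V[s]
                if trace == "accumulating":
                    E[s] += 1.0
                elif trace == "dutch":
                    E[s] = (1 - alpha) * E[s] + 1.0
                else:                                # replacing traces
                    E[s] = 1.0
                for state in list(E):                # update every state with a nonzero trace
                    V[state] += alpha * delta * E[state]
                    E[state] *= gamma * lam
                s = s_next
        return V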

  19. Relation of the Backward View to MC and TD(0). Using the update rule $\Delta V_t(s) \doteq \alpha \delta_t E_t(s)$: as before, if you set λ to 0, you get TD(0); if you set λ to 1, you get MC, but in a better way. TD(1) can be applied to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode).

  20. Forward View = Backward View. The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating:
      $\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(S_t)\, \mathbb{I}_{sS_t}$   (backward updates on the left, forward updates on the right, where $\mathbb{I}_{sS_t} = 1$ if $s = S_t$ and 0 otherwise)
      By algebra, both sides equal $\sum_{t=0}^{T-1} \alpha\, \mathbb{I}_{sS_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$.
      On-line updating with small α is similar.

  21. On-line versus Off-line on the Random Walk. [Plots: RMS error over the first 10 episodes as a function of α on the same 19-state random walk, for off-line TD(λ) with accumulating traces (≡ the off-line λ-return algorithm) and on-line TD(λ) with accumulating traces, for λ = 0, .4, .8, .9, .95, .975, .99, 1.] On-line performs better over a broader range of parameters.

  22. Replacing and Dutch Traces. All traces fade the same way:
      $E_t(s) \doteq \gamma\lambda E_{t-1}(s)$, for all $s \in \mathcal{S}$, $s \neq S_t$,
      but they increment differently at the visited state:
      Accumulating traces: $E_t(S_t) \doteq \gamma\lambda E_{t-1}(S_t) + 1$
      Dutch traces: $E_t(S_t) \doteq (1 - \alpha)\gamma\lambda E_{t-1}(S_t) + 1$   (illustrated with α = 0.5)
      Replacing traces: $E_t(S_t) \doteq 1$
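
A small sketch (not from the slides) comparing how the trace of a single state evolves under each rule, given a boolean list saying when that state is visited:

    def trace_sequence(rule, visits, gamma, lam, alpha=0.5):
        """Trace values of one state over time under the accumulating, dutch,
        or replacing update; visits[t] is True when the state is visited at step t."""
        e, out = 0.0, []
        for visited in visits:
            e *= gamma * lam                         # every trace decays by γλ each step
            if visited:
                if rule == "accumulating":
                    e += 1.0
                elif rule == "dutch":
                    e = (1 - alpha) * e + 1.0
                else:                                # replacing
                    e = 1.0
            out.append(e)
        return out

For a state visited on several consecutive steps, the accumulating trace grows above 1, the dutch trace grows more slowly, and the replacing trace is capped at 1.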

  23. Replacing and Dutch Traces on the Random Walk. [Plots: RMS error over the first 10 episodes as a function of α for on-line TD(λ) with replacing traces and on-line TD(λ) with dutch traces, for λ = 0, .4, .8, .9, .95, .975, .99, 1.]

  24. All λ results on the 19-state random walk. [Six plots of RMS error over the first 10 episodes versus α, each for λ = 0, .4, .8, .9, .95, .975, .99, 1: the off-line λ-return algorithm (= off-line TD(λ) with accumulating traces), the on-line λ-return algorithm, on-line TD(λ) with accumulating traces, on-line TD(λ) with dutch traces, on-line TD(λ) with replacing traces, and true on-line TD(λ) (= the real-time λ-return algorithm).]

  25. Control: Sarsa(λ). Everything changes from states to state-action pairs:
      $Q_{t+1}(s, a) = Q_t(s, a) + \alpha \delta_t E_t(s, a)$, for all $s, a$,
      where $\delta_t = R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t)$
      and $E_t(s, a) = \begin{cases} \gamma\lambda E_{t-1}(s, a) + 1 & \text{if } s = S_t \text{ and } a = A_t; \\ \gamma\lambda E_{t-1}(s, a) & \text{otherwise,} \end{cases}$ for all $s, a$.
      [Backup diagram: n-step state-action backups weighted $1-\lambda$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, …, $\lambda^{T-t-1}$; Σ = 1.]

  26. Demo

  27. Sarsa(λ) Algorithm
      Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
      Repeat (for each episode):
          E(s, a) = 0, for all s ∈ S, a ∈ A(s)
          Initialize S, A
          Repeat (for each step of episode):
              Take action A, observe R, S′
              Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
              δ ← R + γQ(S′, A′) − Q(S, A)
              E(S, A) ← E(S, A) + 1
              For all s ∈ S, a ∈ A(s):
                  Q(s, a) ← Q(s, a) + αδE(s, a)
                  E(s, a) ← γλE(s, a)
              S ← S′; A ← A′
          until S is terminal
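
A runnable sketch of Sarsa(λ) with accumulating traces and an ε-greedy policy, assuming the same illustrative env interface as the TD(λ) sketch above (reset() -> state, step(action) -> (reward, next_state, done)) and a finite list of actions:

    import random
    from collections import defaultdict

    def sarsa_lambda(env, actions, alpha, gamma, lam, epsilon, episodes):
        """Tabular Sarsa(λ) control with accumulating eligibility traces."""
        Q = defaultdict(float)                         # Q[(state, action)]

        def eps_greedy(s):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            E = defaultdict(float)                     # traces, reset each episode
            s = env.reset()
            a = eps_greedy(s)
            done = False
            while not done:
                r, s_next, done = env.step(a)
                a_next = eps_greedy(s_next)
                target = 0.0 if done else Q[(s_next, a_next)]
                delta = r + gamma * target - Q[(s, a)]
                E[(s, a)] += 1.0                       # accumulating trace
                for sa in list(E):                     # update every pair with a nonzero trace
                    Q[sa] += alpha * delta * E[sa]
                    E[sa] *= gamma * lam
                s, a = s_next, a_next
        return Q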

  28. Sarsa(λ) Gridworld Example. [Figure: the path taken by the agent; the action values increased by one-step Sarsa; the action values increased by Sarsa(λ) with λ = 0.9.] With one trial, the agent has much more information about how to get to the goal (though not necessarily the best way). This can considerably accelerate learning.
