6. (3 pts) What three things form the “deadly triad” – the three things that cannot be combined in the same learning situation without risking divergence? (circle three)
(a) eligibility traces
(b) bootstrapping
(c) sample backups
(d) ε-greedy action selection
(e) linear function approximation
(f) off-line updating
(g) off-policy learning
(h) exploration bonuses
The Deadly Triad
The three things that together result in instability:
1. Function approximation
2. Bootstrapping
3. Off-policy training (e.g., Q-learning, DP)
The danger remains even if:
• prediction (fixed, given policies)
• linear with binary features
• expected updates (as in asynchronous DP, i.i.d.)
7. True or False: For any stationary MDP, assuming a step-size (α) sequence satisfying the standard stochastic approximation criteria, and a fixed policy, convergence in the prediction problem is guaranteed for
T F (2 pts) online, off-policy TD(1) with linear function approximation
T F (2 pts) online, on-policy TD(0) with linear function approximation
T F (2 pts) offline, off-policy TD(0) with linear function approximation
T F (2 pts) dynamic programming with linear function approximation
T F (2 pts) dynamic programming with nonlinear function approximation
T F (2 pts) gradient-descent Monte Carlo with linear function approximation
T F (2 pts) gradient-descent Monte Carlo with nonlinear function approximation

8. True or False: (3 pts) TD(0) with linear function approximation converges to a local minimum in the MSE between the approximate value function and the true value function V^π.
The Deadly Triad
The three things that together result in instability:
1. Function approximation
• linear or more, with proportional complexity
• state aggregation ok; ok if “nearly Markov”
2. Bootstrapping
• λ = 1 ok; ok if λ big enough (problem dependent)
3. Off-policy training
• may be ok if “nearly on-policy”
• if the policies are very different, the variance may be too high anyway
Off-policy learning
• Learning about a policy different from the policy being used to generate actions
• Most often used to learn optimal behaviour from a given data set, or from more exploratory behaviour
• Key to ambitious theories of knowledge and perception as continual prediction about the outcomes of many options
Baird’s counter-example
• P and d are not linked
• d is all states with equal probability
• P is according to the Markov chain in the figure: five upper states with approximate values V_k(s_i) = θ(7) + 2θ(i) for i = 1,…,5, and one lower state with value V_k(s_6) = 2θ(7) + θ(6); each upper state moves to the lower state (100%), the lower state stays with probability 99% and terminates with probability 1%; r = 0 on all transitions
TD can diverge: Baird’s counter-example
[Figure: the parameter values θ_k(i) over iterations k = 0 to 5000, on a log scale broken at ±1, reaching magnitudes of order 10^10; separate curves for θ_k(1)–θ_k(5), θ_k(6), and θ_k(7)]
Deterministic updates, θ_0 = (1, 1, 1, 1, 1, 10, 1), γ = 0.99, α = 0.01
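To make the divergence concrete, here is a small NumPy sketch of my own reconstruction of the setup above (not code from the slides): it runs the deterministic expected TD(0) updates with the figure’s parameters, sweeping over all six states each iteration, with the 1/6 weight of the uniform d absorbed into the step size. The feature layout follows the value expressions listed above.

```python
import numpy as np

# Reconstruction (mine, from the description above) of the star problem: six states,
# seven parameters.  The five upper states have value 2*theta[i] + theta[6] (i = 0..4),
# the lower state has value theta[5] + 2*theta[6].  Every upper state moves to the
# lower state; the lower state stays put with prob. 0.99 and terminates with prob. 0.01.
# All rewards are 0, so theta = 0 represents the true values exactly.
Phi = np.zeros((6, 7))
for i in range(5):
    Phi[i, i], Phi[i, 6] = 2.0, 1.0            # upper states
Phi[5, 5], Phi[5, 6] = 1.0, 2.0                # lower state

gamma, alpha = 0.99, 0.01
theta = np.array([1., 1., 1., 1., 1., 10., 1.])    # theta_0 from the figure

# expected next feature vector for each state (terminal features are all zero)
next_phi = np.vstack([np.tile(Phi[5], (5, 1)),     # each upper state -> lower state
                      0.99 * Phi[5]])              # lower -> lower 99%, terminal 1%

for k in range(5000):
    delta = gamma * (next_phi @ theta) - Phi @ theta   # expected TD errors (r = 0)
    theta = theta + alpha * delta @ Phi                # one synchronous sweep over states

print(theta)   # theta[0:5] and theta[6] grow without bound instead of shrinking to 0
```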
TD(0) can diverge: A simple example
[Figure: a single transition between two states with feature values 1 and 2, so their approximate values are θ and 2θ; the reward on the transition is 0]

$$\delta = r + \gamma\theta^\top\phi' - \theta^\top\phi = 0 + 2\theta - \theta = \theta \quad (\gamma \approx 1)$$

TD update: $\Delta\theta = \alpha\delta\phi = \alpha\theta$
Diverges!
TD fixpoint: $\theta^* = 0$
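A few lines of Python (a minimal sketch of the example, assuming feature values 1 and 2 and γ treated as 1, matching the δ computation above) show the growth directly:

```python
# Repeating the TD(0) update on this one transition alone makes theta grow.
alpha, gamma = 0.1, 1.0
theta = 1.0                                    # any positive starting value
for t in range(50):
    phi, phi_next, r = 1.0, 2.0, 0.0
    delta = r + gamma * theta * phi_next - theta * phi   # = theta
    theta += alpha * delta * phi                         # theta <- theta * (1 + alpha)
print(theta)   # grows geometrically; the TD fixpoint theta* = 0 is never approached
```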
Previous attempts to solve the off-policy problem
• Importance sampling
• With recognizers
• Least-squares methods: LSTD, LSPI, iLSTD
• Averagers
• Residual-gradient methods
Desiderata: we want a TD algorithm that
• Bootstraps (genuine TD)
• Works with linear function approximation (stable, reliably convergent)
• Is simple, like linear TD (O(n))
• Learns fast, like linear TD
• Can learn off-policy (arbitrary P and d)
• Learns from online causal trajectories (no repeat sampling from the same state)
A little more theory

$$\Delta\theta \;\propto\; \delta\phi \;=\; \big(r + \gamma\theta^\top\phi' - \theta^\top\phi\big)\phi \;=\; \theta^\top\!\big(\gamma\phi' - \phi\big)\,\phi + r\phi \;=\; \phi\big(\gamma\phi' - \phi\big)^\top\theta + r\phi$$

$$\mathbb{E}[\Delta\theta] \;\propto\; -\,\mathbb{E}\!\left[\phi(\phi - \gamma\phi')^\top\right]\theta + \mathbb{E}[r\phi] \;=\; -A\theta + b$$

This is convergent if A is positive definite; therefore, at the TD fixpoint,

$$A\theta^* = b, \qquad \theta^* = A^{-1}b$$

LSTD computes this directly. With the feature covariance matrix $C = \mathbb{E}\!\left[\phi\phi^\top\right]$, which is always positive definite,

$$\tfrac{1}{2}\nabla_\theta\,\mathrm{MSPBE} = A^\top C^{-1}\big(A\theta - b\big)$$
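The quantities A, b, and C are expectations under the state distribution, so for a small known MDP they can be formed exactly. The sketch below uses a made-up three-state ring chain (my own choice, not from the slides; uniform stationary distribution, two features per state) to check the claims above: A has eigenvalues with positive real part, θ* = A⁻¹b is the TD fixpoint that LSTD computes directly, and the MSPBE gradient vanishes there.

```python
import numpy as np

# Hypothetical three-state ring MDP: state s moves deterministically to (s + 1) % 3,
# with reward 1 on the 2 -> 0 transition and 0 elsewhere.
gamma = 0.9
Phi = np.array([[1., 0.], [0., 1.], [1., 1.]])     # one feature row per state
nxt = [1, 2, 0]                                    # deterministic successor state
r   = [0., 0., 1.]                                 # reward on leaving each state
d   = np.full(3, 1.0 / 3.0)                        # stationary (on-policy) distribution

A = sum(d[s] * np.outer(Phi[s], Phi[s] - gamma * Phi[nxt[s]]) for s in range(3))
b = sum(d[s] * r[s] * Phi[s] for s in range(3))
C = sum(d[s] * np.outer(Phi[s], Phi[s]) for s in range(3))

print(np.linalg.eigvals(A))                        # real parts > 0: expected update stable
theta_star = np.linalg.solve(A, b)                 # the TD fixpoint, computed directly
grad = A.T @ np.linalg.inv(C) @ (A @ theta_star - b)
print(theta_star, grad)                            # half the MSPBE gradient; ~0 at theta*
```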
TD(0) Solution and Stability

$$\theta_{t+1} \;=\; \theta_t + \alpha\Big(\underbrace{R_{t+1}\phi(S_t)}_{b_t \,\in\, \mathbb{R}^n} \;-\; \underbrace{\phi(S_t)\big(\phi(S_t) - \gamma\phi(S_{t+1})\big)^\top}_{A_t \,\in\, \mathbb{R}^{n\times n}}\theta_t\Big) \;=\; \theta_t + \alpha\big(b_t - A_t\theta_t\big) \;=\; (I - \alpha A_t)\,\theta_t + \alpha b_t$$

In expectation,

$$\bar\theta_{t+1} \;\doteq\; \bar\theta_t + \alpha\big(b - A\bar\theta_t\big), \qquad \theta^* = A^{-1}b$$
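Iterating the expected update with the A and b of the hypothetical ring example above (values hard-coded here so the sketch stands alone) shows θ̄_t approaching A⁻¹b, as the recursion predicts when I − αA is stable:

```python
import numpy as np

# A and b as computed for the hypothetical ring chain above.  Because A's eigenvalues
# have positive real part, I - alpha*A is stable for small alpha and theta_bar
# converges to the TD fixpoint A^{-1} b.
A = np.array([[1.1, 0.1], [-0.8, 1.1]]) / 3.0
b = np.array([1.0, 1.0]) / 3.0
alpha = 0.1

theta_bar = np.zeros(2)
for t in range(2000):
    theta_bar = theta_bar + alpha * (b - A @ theta_bar)   # = (I - alpha*A) theta + alpha*b

print(theta_bar, np.linalg.solve(A, b))                   # the two agree at convergence
```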
LSTD(0)

Ideal:
$$A = \lim_{t\to\infty}\mathbb{E}_\pi\!\left[\phi_t\big(\phi_t - \gamma\phi_{t+1}\big)^\top\right], \qquad b = \lim_{t\to\infty}\mathbb{E}_\pi\!\left[R_{t+1}\phi_t\right], \qquad \theta^* = A^{-1}b$$

Algorithm:
$$A_t = \sum_k \rho_k\,\phi_k\big(\phi_k - \gamma\phi_{k+1}\big)^\top, \qquad b_t = \sum_k \rho_k R_{k+1}\phi_k$$
$$\lim_{t\to\infty} A_t = A, \qquad \lim_{t\to\infty} b_t = b, \qquad \theta_t = A_t^{-1}b_t, \qquad \lim_{t\to\infty}\theta_t = \theta^*$$
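A minimal LSTD(0) sketch on the same hypothetical ring chain (on-policy here, so every importance-sampling ratio ρ_k is 1) accumulates A_t and b_t over one long trajectory and then solves for θ_t:

```python
import numpy as np

# LSTD(0) on the hypothetical ring chain used above; rho_k = 1 since behaviour = target.
gamma = 0.9
Phi = np.array([[1., 0.], [0., 1.], [1., 1.]])
nxt = [1, 2, 0]

s = 0
A_t = np.zeros((2, 2))
b_t = np.zeros(2)
for k in range(30000):
    s_next = nxt[s]
    reward = 1.0 if s == 2 else 0.0
    rho = 1.0                                  # importance-sampling ratio
    A_t += rho * np.outer(Phi[s], Phi[s] - gamma * Phi[s_next])
    b_t += rho * reward * Phi[s]
    s = s_next

print(np.linalg.solve(A_t, b_t))               # approaches theta* = A^{-1} b
```

In practice LSTD usually maintains the inverse of A_t incrementally (e.g., with a Sherman–Morrison update), giving O(n²) computation per step rather than a single solve at the end.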
LSTD(λ)

Ideal:
$$A = \lim_{t\to\infty}\mathbb{E}_\pi\!\left[e_t\big(\phi_t - \gamma\phi_{t+1}\big)^\top\right], \qquad b = \lim_{t\to\infty}\mathbb{E}_\pi\!\left[R_{t+1}e_t\right], \qquad \theta^* = A^{-1}b$$

Algorithm:
$$A_t = \sum_k \rho_k\,e_k\big(\phi_k - \gamma\phi_{k+1}\big)^\top, \qquad b_t = \sum_k \rho_k R_{k+1}e_k$$
$$\lim_{t\to\infty} A_t = A, \qquad \lim_{t\to\infty} b_t = b, \qquad \theta_t = A_t^{-1}b_t, \qquad \lim_{t\to\infty}\theta_t = \theta^*$$
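The LSTD(λ) version differs from the LSTD(0) sketch only in replacing φ_k by an eligibility trace. The sketch below assumes accumulating traces e ← γλe + φ (one common convention; the slide leaves the trace update unspecified) and again sets ρ_k = 1 since the toy example is on-policy:

```python
import numpy as np

# LSTD(lambda) on the same hypothetical ring chain, with accumulating traces.
gamma, lam = 0.9, 0.8
Phi = np.array([[1., 0.], [0., 1.], [1., 1.]])
nxt = [1, 2, 0]

s = 0
e = np.zeros(2)                                # eligibility trace
A_t = np.zeros((2, 2))
b_t = np.zeros(2)
for k in range(30000):
    s_next = nxt[s]
    reward = 1.0 if s == 2 else 0.0
    rho = 1.0                                  # on-policy: importance ratio is 1
    e = gamma * lam * e + Phi[s]               # accumulate the trace
    A_t += rho * np.outer(e, Phi[s] - gamma * Phi[s_next])
    b_t += rho * reward * e
    s = s_next

print(np.linalg.solve(A_t, b_t))               # the LSTD(lambda) solution A_t^{-1} b_t
```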