

  1. New Temporal-Difference Methods Based on Gradient Descent Rich Sutton, Hamid Maei, Doina Precup (McGill), Shalabh Bhatnagar (IISc Bangalore), Csaba Szepesvári, Eric Wiewiora, David Silver

  2. Outline • The promise and problems of TD learning • Value-function approximation • Gradient-descent methods - LMS example • Objective functions for TD • GD derivation of new algorithms • Proofs of convergence • Empirical results • Conclusions

  3. What is temporal-difference learning? • The most important and distinctive idea in reinforcement learning • A way of learning to predict, 
 from changes in your predictions, 
 without waiting for the final outcome • A way of taking advantage of state 
 in multi-step prediction problems • Learning a guess from a guess

  4. Examples of TD learning opportunities • Learning to evaluate backgammon positions from changes in evaluation within a game • Learning where your tennis opponent will hit the ball from his approach • Learning what features of a market indicate that it will have a major decline • Learning to recognize your friend’s face

  5. Function approximation • TD learning is sometimes done in a table-lookup context, where every state is distinct and treated totally separately • But really, to be powerful, we must generalize between states • The same state never occurs twice. For example, in Computer Go, we use 10^6 parameters to learn about 10^170 positions

  6. Advantages of TD methods for prediction 1. Data efficient. 
 Learn much faster on Markov problems 2. Cheap to implement. 
 Require less memory and peak computation 3. Able to learn from incomplete sequences. 
 In particular, able to learn off-policy

  7. Off-policy learning • Learning about a policy different than the one being used to generate actions • Most often used to learn optimal behavior from a given data set, or from more exploratory behavior • Key to ambitious theories of knowledge and perception as continual prediction about the outcomes of options

  8. Outline • The promise and problems of TD learning • Value-function approximation • Gradient-descent methods - LMS example • Objective functions for TD • GD derivation of new algorithms • Proofs of convergence • Empirical results • Conclusions

  9. Value-function approximation from sample trajectories • True values: V(s) = E[outcome | s] • Estimated values: V_θ(s) ≈ V(s), θ ∈ ℝⁿ • Linear approximation: V_θ(s) = θᵀφ_s, φ_s ∈ ℝⁿ, where θ is a modifiable parameter vector and φ_s is the feature vector for state s [Figure: sample trajectories of states, each ending in an outcome]

  10. Value-function approximation from sample trajectories [Figure: a worked example pairing a binary feature vector φ_s with a parameter vector θ; the active features pick out the parameters to be summed, giving V_θ(s) = −2 + 0 + 5 = 3] • True values: V(s) = E[outcome | s] • Estimated values: V_θ(s) ≈ V(s), θ ∈ ℝⁿ • Linear approximation: V_θ(s) = θᵀφ_s, φ_s ∈ ℝⁿ, where θ is a modifiable parameter vector and φ_s is the feature vector for state s
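
To make the linear form concrete, here is a minimal Python sketch of the dot-product computation V_θ(s) = θᵀφ_s; the particular vectors are invented to mirror the kind of example shown on the slide.

```python
import numpy as np

theta = np.array([0.1, -2.0, 0.0, 0.5, 0.0, 5.0, -0.4])  # modifiable parameter vector (illustrative)
phi_s = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])    # binary feature vector for state s (illustrative)

# With binary features, the estimate is just the sum of the parameters
# for the active features: -2.0 + 0.0 + 5.0 = 3.0
v_s = theta @ phi_s
print(v_s)   # 3.0
```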

  11. From terminal outcomes to per-step rewards • True values: V(s) = E[ Σ_{t=0}^{∞} γᵗ r_t | s₀ = s ] • Target values (returns) = sum of future rewards until the end of the episode, or until the discounting horizon • Discount rate: 0 ≤ γ ≤ 1 [Figure: a state trajectory with per-step rewards]
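
As a small aside, the return that serves as the target value can be computed with the backward recursion G_t = r_t + γG_{t+1}; here is a brief Python sketch with a made-up reward sequence.

```python
def discounted_return(rewards, gamma):
    """Sum of future rewards, discounted by gamma per step."""
    g = 0.0
    for r in reversed(rewards):   # work backward: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1, 0, 2, 0, 3], gamma=0.9))
```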

  12. TD methods operate on individual transitions • From trajectories to transitions: d_s = distribution of the first state s, r_s = expected reward given s, P_ss′ = probability of next state s′ given s • The training set is now a bag of transitions; select from them i.i.d. (independently, identically distributed) • P and d are linked • Sample transition: (s, r, s′) or (φ, r, φ′) • TD(0) algorithm: θ ← θ + αδφ, δ = r + γθᵀφ′ − θᵀφ
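
Here is a minimal Python sketch of the TD(0) update on one sampled transition (φ, r, φ′), transcribing the formulas above; the step size, discount, and vectors are placeholder values.

```python
import numpy as np

def td0_update(theta, phi, r, phi_next, alpha=0.1, gamma=0.99):
    """One linear TD(0) step: theta <- theta + alpha * delta * phi."""
    delta = r + gamma * theta @ phi_next - theta @ phi   # TD error
    return theta + alpha * delta * phi

theta = np.zeros(4)
phi, r, phi_next = np.array([1., 0., 0., 1.]), 2.0, np.array([0., 1., 0., 1.])
theta = td0_update(theta, phi, r, phi_next)
print(theta)
```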

  13. Off-policy training • Same setup as before (d_s, r_s, P_ss′), but P and d are no longer linked • TD(0) may diverge!

  14. Baird’s counter-example • P and d are not linked • d is all states with equal probability • P is according to the Markov chain in the figure: transitions funnel into the lower state, which continues 99% of the time and reaches the terminal state 1% of the time; all rewards are r = 0 • Approximate values: V_k(s_i) = θ_k(7) + 2θ_k(i) for the upper states i = 1,…,5, and V_k(s_6) = 2θ_k(7) + θ_k(6) for the lower state • α = 0.01, γ = 0.99, θ₀ = (1, 1, 1, 1, 1, 10, 1)ᵀ
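
For concreteness, here is a short Python sketch of just the feature representation read off this slide (the counter-example's transition dynamics are not reproduced); the initial values it prints follow from θ₀.

```python
import numpy as np

# Feature vectors for Baird's counter-example, read off the value expressions
# on the slide: the five upper states have V(s_i) = theta(7) + 2*theta(i),
# the lower (6th) state has V(s_6) = 2*theta(7) + theta(6).
Phi = np.zeros((6, 7))
for i in range(5):          # upper states s_1 .. s_5
    Phi[i, i] = 2.0
    Phi[i, 6] = 1.0
Phi[5, 5] = 1.0             # lower state s_6
Phi[5, 6] = 2.0

theta0 = np.array([1., 1., 1., 1., 1., 10., 1.])
print(Phi @ theta0)         # initial approximate values: [3, 3, 3, 3, 3, 12]
```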

  15. TD can diverge: Baird’s counter-example [Figure: parameter values θ_k(i) plotted against iterations k from 0 to 5000, on a log scale broken at ±1; θ_k(1)–θ_k(5), θ_k(6), and θ_k(7) all diverge] • Deterministic updates • θ₀ = (1, 1, 1, 1, 1, 10, 1)ᵀ, γ = 0.99, α = 0.01

  16. TD(0) can diverge: A simple example [Figure: a single transition from a state with estimated value θ to a state with estimated value 2θ] • δ = r + γθᵀφ′ − θᵀφ = 0 + 2θ − θ = θ • TD update: Δθ = αδφ = αθ • Diverges! • TD fixpoint: θ* = 0
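
The divergence in this two-state example is easy to check numerically. Below is a Python sketch that repeatedly applies TD(0) to the single transition from the θ-valued state (feature 1) to the 2θ-valued state (feature 2), assuming r = 0 and γ = 0.99 for illustration.

```python
alpha, gamma = 0.1, 0.99
theta = 1.0
for k in range(100):
    # The only sampled transition: feature 1 -> feature 2, reward 0.
    delta = 0.0 + gamma * (2 * theta) - (1 * theta)   # = (2*gamma - 1) * theta
    theta += alpha * delta * 1                        # TD(0): theta <- theta + alpha*delta*phi
print(theta)   # grows without bound; the TD fixpoint theta* = 0 is never approached
```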

  17. Previous attempts to solve the off-policy problem • Importance sampling • With recognizers • Least-squares methods, LSTD, LSPI, iLSTD • Averagers • Residual gradient methods

  18. Desiderata: We want a TD algorithm that • Bootstraps (genuine TD) • Works with linear function approximation 
 (stable, reliably convergent) • Is simple, like linear TD — O(n) • Learns fast, like linear TD • Can learn off-policy (arbitrary P and d ) • Learns from online causal trajectories 
 (no repeat sampling from the same state)

  19. Outline • The promise and problems of TD learning • Value-function approximation • Gradient-descent methods - LMS example • Objective functions for TD • GD derivation of new algorithms • Proofs of convergence • Empirical results • Conclusions

  20. Gradient-descent learning methods - the recipe 1. Pick an objective function J(θ), a parameterized function to be minimized 2. Use calculus to analytically compute the gradient ∇_θ J(θ) 3. Find a “sample gradient” ∇_θ J_t(θ) that you can sample on every time step and whose expected value equals the gradient 4. Take small steps in θ proportional to the sample gradient: θ ← θ − α∇_θ J_t(θ)
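
The “LMS example” from the outline follows this recipe directly: take J(θ) = E[(y − θᵀφ)²], whose per-step sample gradient is −2(y_t − θᵀφ_t)φ_t, and step against it. The Python sketch below uses an invented data generator purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])   # hypothetical target weights generating the data
theta = np.zeros(3)
alpha = 0.05

for t in range(2000):
    phi = rng.normal(size=3)                  # sample an input feature vector
    y = true_w @ phi + 0.1 * rng.normal()     # noisy target
    # Sample gradient of J_t(theta) = (y - theta^T phi)^2 is -2*(y - theta^T phi)*phi,
    # so stepping against it (constant absorbed into alpha) gives the LMS rule:
    theta += alpha * (y - theta @ phi) * phi
print(theta)   # approaches true_w
```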

  21. Conventional TD is not the gradient of anything • TD(0) algorithm: Δθ = αδφ, δ = r + γθᵀφ′ − θᵀφ • Assume there is a J such that ∂J/∂θ_i = δφ_i • Then look at the second derivatives: ∂²J/∂θ_j∂θ_i = ∂(δφ_i)/∂θ_j = (γφ′_j − φ_j)φ_i, whereas ∂²J/∂θ_i∂θ_j = ∂(δφ_j)/∂θ_i = (γφ′_i − φ_i)φ_j • These are not equal in general. Contradiction! Real 2nd derivatives must be symmetric
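
A quick numerical illustration of the asymmetry (my own check, not from the slides): pick any φ and φ′ and compare the two claimed cross-partials.

```python
import numpy as np

gamma = 0.9
phi = np.array([1.0, 0.0])        # features of the current state (arbitrary choice)
phi_next = np.array([0.0, 1.0])   # features of the next state (arbitrary choice)

i, j = 0, 1
d2J_ji = (gamma * phi_next[j] - phi[j]) * phi[i]   # d/dtheta_j of delta*phi_i
d2J_ij = (gamma * phi_next[i] - phi[i]) * phi[j]   # d/dtheta_i of delta*phi_j
print(d2J_ji, d2J_ij)   # 0.9 vs 0.0: not symmetric, so no such J can exist
```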

  22. Outline • The promise and problems of TD learning • Value-function approximation • Gradient-descent methods - LMS example • Objective functions for TD • GD derivation of new algorithms • Proofs of convergence • Empirical results • Conclusions

  23. Gradient descent for TD: What should the objective function be? • Close to the true values? Mean-Square Value Error: MSE(θ) = Σ_s d_s (V_θ(s) − V(s))² = ‖V_θ − V‖²_D, where V is the true value function • Or close to satisfying the Bellman equation? Mean-Square Bellman Error: MSBE(θ) = ‖V_θ − TV_θ‖²_D, where T is the Bellman operator defined by TV = r + γPV
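
To pin the two objectives down, here is a small Python sketch computing both as d-weighted squared errors on a randomly generated Markov chain; all quantities (P, r, d, Φ, θ) are invented for the illustration, and the true values come from solving V = r + γPV.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, gamma = 5, 2, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # transition matrix
r = rng.random(n)                                           # expected rewards
d = rng.random(n); d /= d.sum()                             # state weighting
Phi = rng.random((n, k)); theta = rng.random(k)             # features and parameters

V_theta = Phi @ theta
V_true = np.linalg.solve(np.eye(n) - gamma * P, r)          # solves V = r + gamma P V
TV_theta = r + gamma * P @ V_theta                          # Bellman operator applied to V_theta

mse  = d @ (V_theta - V_true) ** 2    # Mean-Square Value Error, ||V_theta - V||^2_D
msbe = d @ (V_theta - TV_theta) ** 2  # Mean-Square Bellman Error, ||V_theta - T V_theta||^2_D
print(mse, msbe)
```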

  24. Value function geometry [Figure: the space spanned by the feature vectors Φ, weighted by the state visitation distribution D = diag(d); V_θ lies in this space, the Bellman operator T takes you outside it (to TV_θ), and the projection Π takes you back into it (to ΠTV_θ); the distances marked are RMSBE and RMSPBE] • Previous work on gradient methods for TD minimized the Bellman-error objective, RMSBE (Baird 1995, 1999) • Better objective fn? The Mean Square Projected Bellman Error (MSPBE) • V_θ = ΠTV_θ is the TD fix-point

  25. A-split example (Dayan 1992) [Figure: from A, 50% of episodes go to B and 50% terminate with outcome 0; from B, 100% terminate with outcome 1] • Clearly, the true values are V(A) = 0.5, V(B) = 1 • But if you minimize the naive objective fn, J(θ) = E[δ²], then you get the solution V(A) = 1/3, V(B) = 2/3 • Even in the tabular case (no FA)
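
Here is a small Python check (mine, not from the slides) that gradient descent on the naive objective E[δ²], with the three transitions in the bag weighted equally, does land on V(A) = 1/3, V(B) = 2/3 rather than the true values.

```python
# Transitions in the bag (gamma = 1, terminal value 0):
#   (A, r=0, B), (A, r=0, end), (B, r=1, end), each occurring equally often.
def grad_naive(vA, vB):
    d1 = vB - vA          # delta on A -> B
    d2 = 0.0 - vA         # delta on A -> end
    d3 = 1.0 - vB         # delta on B -> end
    gA = (2 * d1 * (-1.0) + 2 * d2 * (-1.0)) / 3.0   # d/dvA of mean delta^2
    gB = (2 * d1 * (+1.0) + 2 * d3 * (-1.0)) / 3.0   # d/dvB of mean delta^2
    return gA, gB

vA, vB = 0.0, 0.0
for _ in range(5000):
    gA, gB = grad_naive(vA, vB)
    vA -= 0.1 * gA
    vB -= 0.1 * gB
print(vA, vB)   # ~0.333 and ~0.667, not the true values 0.5 and 1.0
```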

  26. Split-A example [Figure: A1 goes 100% to B, which terminates with outcome 1; A2 terminates 100% with outcome 0] • The two ‘A’ states look the same: they share a single feature and must be given the same approximate value • The example then appears just like the previous one, and the minimum-MSBE solution is V(A) = 1/3, V(B) = 2/3

  27. Outline • The promise and problems of TD learning • Value-function approximation • Gradient-descent methods - LMS example • Objective functions for TD • GD derivation of new algorithms • Proofs of convergence • Empirical results • Conclusions

  28. Three new algorithms • GTD, the original gradient-TD algorithm (Sutton, Szepesvári & Maei, 2008) • GTD-2, a second-generation GTD • TDC, TD with gradient correction • GTD(λ), GQ(λ)
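
These slides do not spell out the update rules, but as a point of reference, the TDC update as I recall it from the published paper (Sutton et al., 2009) maintains a second weight vector w and looks roughly like the Python sketch below; treat the exact form as an assumption to be checked against the paper rather than a transcription of the talk.

```python
import numpy as np

def tdc_update(theta, w, phi, r, phi_next, alpha=0.01, beta=0.05, gamma=0.99):
    """One linear TDC step on a sampled transition (phi, r, phi_next).
    theta: value-function weights; w: auxiliary weights (form recalled from the 2009 paper)."""
    delta = r + gamma * theta @ phi_next - theta @ phi                    # TD error
    theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))  # corrected TD step
    w = w + beta * (delta - phi @ w) * phi                                # LMS step for w
    return theta, w
```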

  29. First relate the geometry to the iid statistics Using Φᵀ D (TV_θ − V_θ) = E[δφ] and Φᵀ D Φ = E[φφᵀ]:
 MSPBE(θ) = ‖V_θ − ΠTV_θ‖²_D
 = ‖Π(V_θ − TV_θ)‖²_D
 = (Π(V_θ − TV_θ))ᵀ D (Π(V_θ − TV_θ))
 = (V_θ − TV_θ)ᵀ Πᵀ D Π (V_θ − TV_θ)
 = (V_θ − TV_θ)ᵀ D Φ (Φᵀ D Φ)⁻¹ Φᵀ D (V_θ − TV_θ)
 = (Φᵀ D (TV_θ − V_θ))ᵀ (Φᵀ D Φ)⁻¹ Φᵀ D (TV_θ − V_θ)
 = E[δφ]ᵀ E[φφᵀ]⁻¹ E[δφ]
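
This final identity can be checked numerically. The Python sketch below builds a small random Markov chain (P, r, d, Φ, θ are all invented for the check) and compares the geometric form ‖V_θ − ΠTV_θ‖²_D with E[δφ]ᵀ E[φφᵀ]⁻¹ E[δφ].

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_feats, gamma = 6, 3, 0.9

P = rng.random((n_states, n_states)); P /= P.sum(axis=1, keepdims=True)  # transition matrix
r = rng.random(n_states)                 # expected rewards
d = rng.random(n_states); d /= d.sum()   # state distribution (need not match P's stationary dist.)
Phi = rng.random((n_states, n_feats))    # feature matrix
theta = rng.random(n_feats)

D = np.diag(d)
V = Phi @ theta                          # V_theta
TV = r + gamma * P @ V                   # Bellman operator applied to V_theta
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D   # projection onto span(Phi) under D

diff = V - Pi @ TV
mspbe_geometric = diff @ D @ diff                        # ||V_theta - Pi T V_theta||^2_D

E_delta_phi = Phi.T @ D @ (TV - V)                       # E[delta phi]
E_phi_phi = Phi.T @ D @ Phi                              # E[phi phi^T]
mspbe_expectation = E_delta_phi @ np.linalg.inv(E_phi_phi) @ E_delta_phi

print(mspbe_geometric, mspbe_expectation)                # agree up to floating-point error
```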
