"It is hard to predict, especially about the future." Niels Bohr
"You are what you pretend to be, so be careful what you pretend to be." Kurt Vonnegut
Prashanth L A Convergence rate of TD(0) March 27, 2015 1 / 84
Convergence rate of TD(0) with


  1. Concentration bounds: Non-averaged case. Why are these bounds problematic?
     Obtaining the optimal rate $O(1/\sqrt{n})$ with a step-size $\gamma_n = c/(c+n)$:
     - In expectation: $c$ must be chosen such that $(1-\beta)^2\mu c \in (1/2, \infty)$.
     - In high probability: $c$ should satisfy $(\mu(1-\beta)/2 + 3B(s_0))\,c > 1$.
     The optimal rate requires knowledge of the mixing bound $B(s_0)$. Even for finite state space settings, $B(s_0)$ is a constant, albeit one that depends on the transition dynamics!
     Solution: iterate averaging.
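The two conditions on $c$ can be checked numerically before a run. The following is a minimal sketch, assuming the conditions read $(1-\beta)^2\mu c > 1/2$ and $(\mu(1-\beta)/2 + 3B(s_0))c > 1$; the function names and the example values of $\beta$, $\mu$ and $B(s_0)$ are hypothetical, not from the slides.

```python
def step_size(n: int, c: float) -> float:
    """Step-size gamma_n = c / (c + n)."""
    return c / (c + n)


def c_ok_in_expectation(c: float, beta: float, mu: float) -> bool:
    """Condition for the bound in expectation: (1 - beta)^2 * mu * c > 1/2."""
    return (1.0 - beta) ** 2 * mu * c > 0.5


def c_ok_in_high_probability(c: float, beta: float, mu: float,
                             mixing_bound: float) -> bool:
    """Condition for the high-probability bound; mixing_bound plays the role of B(s_0)."""
    return (mu * (1.0 - beta) / 2.0 + 3.0 * mixing_bound) * c > 1.0


if __name__ == "__main__":
    beta, mu = 0.9, 0.5          # illustrative values only
    for c in (50.0, 150.0):
        print(c, c_ok_in_expectation(c, beta, mu))
```

Both checks need $\mu$, and the second also needs $B(s_0)$, which is exactly the practical difficulty the slide points out.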

  3. Concentration bounds: Non-averaged case. Proof Outline
     Let $z_n = \theta_n - \theta^*$. We first bound the deviation of this error from its mean:
     $$P(\|z_n\|_2 - E\|z_n\|_2 \ge \epsilon) \le \exp\left(-\frac{\epsilon^2}{2\sum_{i=1}^{n} L_i^2}\right), \quad \forall \epsilon > 0,$$
     and then bound the size of the mean itself:
     $$E\|z_n\|_2 \le \underbrace{2\exp(-(1-\beta)\mu\Gamma_n)\,\|z_0\|_2}_{\text{initial error}} + \underbrace{\left(\sum_{k=1}^{n-1}(3+6H)^2 B(s_0)^2\,\gamma_{k+1}^2\exp(-2(1-\beta)\mu(\Gamma_n - \Gamma_{k+1}))\right)^{1/2}}_{\text{sampling and mixing error}}.$$
     Note that $L_i := \gamma_i\left[\prod_{j=i+1}^{n}\left(1 - 2\gamma_j\left(\mu\left(1-\beta-\frac{\gamma_j}{2}\right) + [1+\beta(3-\beta)]B(s_0)\right)\right)\right]^{1/2}$.

  5. Concentration bounds: Non-averaged case. Proof Outline: Bound in Expectation
     Let $f_{X_n}(\theta) := [r(s_n, \pi(s_n)) + \beta\theta^T\phi(s_{n+1}) - \theta^T\phi(s_n)]\,\phi(s_n)$. Then the TD update is equivalent to
     $$\theta_{n+1} = \theta_n + \gamma_n\left[E_\Psi(f_{X_n}(\theta_n)) + \epsilon_n + \Delta M_n\right] \quad (1)$$
     - Mixing error: $\epsilon_n := E(f_{X_n}(\theta_n) \mid s_0) - E_\Psi(f_{X_n}(\theta_n))$
     - Martingale sequence: $\Delta M_n := f_{X_n}(\theta_n) - E(f_{X_n}(\theta_n) \mid s_0)$
     Unrolling (1), we obtain:
     $$z_{n+1} = (I - \gamma_n A)z_n + \gamma_n(\epsilon_n + \Delta M_n) = \Pi_n z_0 + \sum_{k=1}^{n}\gamma_k\Pi_n\Pi_k^{-1}(\epsilon_k + \Delta M_k).$$
     Here $A := \Phi^T\Psi(I - \beta P)\Phi$ and $\Pi_n := \prod_{k=1}^{n}(I - \gamma_k A)$.
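The quantity $f_{X_n}(\theta_n)$ is the TD(0) increment with linear function approximation. A minimal sketch of one update step is given below; the function name `td0_update` and the list-based feature vectors are illustrative choices, not part of the slides.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def td0_update(theta: Sequence[float], phi_s: Sequence[float],
               phi_next: Sequence[float], reward: float,
               beta: float, gamma: float) -> List[float]:
    """One TD(0) step: theta + gamma * (r + beta*theta^T phi(s') - theta^T phi(s)) * phi(s)."""
    td_error = reward + beta * dot(theta, phi_next) - dot(theta, phi_s)
    return [t + gamma * td_error * p for t, p in zip(theta, phi_s)]


if __name__ == "__main__":
    theta = [0.0, 0.0]
    # A single transition from a state with feature (1, 0) to one with feature (0, 1).
    theta = td0_update(theta, [1.0, 0.0], [0.0, 1.0], reward=1.0, beta=0.9, gamma=0.5)
    print(theta)
```

Each step costs $O(d)$ in the feature dimension, since it only needs two inner products and one scaled vector addition.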

  7. Concentration bounds: Non-averaged case. Proof Outline: Bound in Expectation
     $$z_{n+1} = (I - \gamma_n A)z_n + \gamma_n(\epsilon_n + \Delta M_n) = \Pi_n z_0 + \sum_{k=1}^{n}\gamma_k\Pi_n\Pi_k^{-1}(\epsilon_k + \Delta M_k)$$
     By Jensen's inequality, we obtain
     $$E(\|z_n\|_2 \mid s_0) \le \left(E(\langle z_n, z_n\rangle \mid s_0)\right)^{1/2} \le \left(2\|\Pi_n z_0\|_2^2 + 3\sum_{k=1}^{n}\gamma_k^2\|\Pi_n\Pi_k^{-1}\|_2^2\,E(\|\epsilon_k\|_2^2 \mid s_0) + 2\sum_{k=1}^{n}\gamma_k^2\|\Pi_n\Pi_k^{-1}\|_2^2\,E(\|\Delta M_k\|_2^2 \mid s_0)\right)^{1/2}.$$
     The rest of the proof amounts to bounding each of the terms on the RHS above.

  8. Concentration bounds: Non-averaged case. Proof Outline: High Probability Bound
     Recall $z_n = \theta_n - \theta^*$.
     Step 1 (Error decomposition):
     $$\|z_n\|_2 - E\|z_n\|_2 = \sum_{i=1}^{n}\left(g_i - E[g_i \mid \mathcal{F}_{i-1}]\right) = \sum_{i=1}^{n} D_i,$$
     where $D_i := g_i - E[g_i \mid \mathcal{F}_{i-1}]$, $g_i := E[\|z_n\|_2 \mid \theta_i]$, and $\mathcal{F}_i = \sigma(\theta_1, \ldots, \theta_i)$.
     Step 2 (Lipschitz continuity): the functions $g_i$ are Lipschitz continuous with Lipschitz constants $L_i$.
     Step 3 (Concentration inequality):
     $$P(\|z_n\|_2 - E\|z_n\|_2 \ge \epsilon) = P\left(\sum_{i=1}^{n} D_i \ge \epsilon\right) \le \exp(-\lambda\epsilon)\exp\left(\frac{\alpha\lambda^2}{2}\sum_{i=1}^{n}L_i^2\right).$$
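Step 3 still contains the free parameter $\lambda > 0$. A worked step, assuming the bound is then optimized over $\lambda$ for a fixed constant $\alpha$ (the slide does not show this step), gives a Gaussian-type tail:

```latex
\begin{align*}
\frac{d}{d\lambda}\left(-\lambda\epsilon + \frac{\alpha\lambda^2}{2}\sum_{i=1}^{n} L_i^2\right) = 0
\quad &\Longrightarrow \quad
\lambda^\star = \frac{\epsilon}{\alpha\sum_{i=1}^{n} L_i^2}, \\
-\lambda^\star\epsilon + \frac{\alpha(\lambda^\star)^2}{2}\sum_{i=1}^{n} L_i^2
&= -\frac{\epsilon^2}{2\alpha\sum_{i=1}^{n} L_i^2}, \\
P\left(\|z_n\|_2 - E\|z_n\|_2 \ge \epsilon\right)
&\le \exp\left(-\frac{\epsilon^2}{2\alpha\sum_{i=1}^{n} L_i^2}\right).
\end{align*}
```

With $\alpha = 1$ this has the form $\exp(-\epsilon^2 / (2\sum_i L_i^2))$ used in the proof outline for the non-averaged case.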

  11. Concentration bounds: Iterate Averaging
      Concentration Bounds: Iterate Averaged TD(0)

  12. Concentration bounds: Iterate Averaging. Polyak-Ruppert averaging: Bound in expectation
      Bigger step-size + averaging:
      $$\gamma_n := \frac{1-\beta}{2}\left(\frac{c}{c+n}\right)^{\alpha}, \qquad \bar\theta_{n+1} := (\theta_1 + \ldots + \theta_n)/n,$$
      with $\alpha \in (1/2, 1)$ and $c > 0$.
      Bound in expectation:
      $$E\|\bar\theta_n - \hat\theta_T\|_2 \le \frac{K^{IA}_1(n)}{(n+c)^{\alpha/2}},$$
      where
      $$K^{IA}_1(n) := \frac{(1+9B(s_0)^2)\,\|\theta_0 - \theta^*\|_2}{(n+c)^{(1-\alpha)/2}} + \frac{2\beta(1-\beta)c^{\alpha}HB(s_0)}{\left(\mu c^{\alpha}(1-\beta)^2\right)^{\alpha\frac{1+2\alpha}{2(1-\alpha)}}}.$$
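The averaging scheme needs only the step-size schedule and a running mean of the iterates. The sketch below assumes the schedule $\gamma_n = \frac{1-\beta}{2}(c/(c+n))^{\alpha}$; the function names are hypothetical, and the running mean is kept incrementally so that no past iterates have to be stored.

```python
from typing import List, Sequence


def averaged_step_size(n: int, c: float, alpha: float, beta: float) -> float:
    """gamma_n = (1 - beta) / 2 * (c / (c + n)) ** alpha, with alpha in (1/2, 1)."""
    return (1.0 - beta) / 2.0 * (c / (c + n)) ** alpha


def update_average(avg: Sequence[float], theta: Sequence[float], count: int) -> List[float]:
    """Fold the count-th iterate into the mean of the previous count - 1 iterates."""
    return [a + (t - a) / count for a, t in zip(avg, theta)]


if __name__ == "__main__":
    iterates = [[1.0, 3.0], [3.0, 5.0], [5.0, 7.0]]
    avg = [0.0, 0.0]
    for count, theta in enumerate(iterates, start=1):
        avg = update_average(avg, theta, count)
    print(avg)  # mean of the three iterates
```

Because $\alpha < 1$, the step-size decays more slowly than $c/(c+n)$; the averaging is what compensates for the extra noise in the individual iterates.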

  14. Concentration bounds: Iterate Averaging. Iterate averaging: High probability bound
      Bigger step-size + averaging:
      $$\gamma_n := \frac{1-\beta}{2}\left(\frac{c}{c+n}\right)^{\alpha}, \qquad \bar\theta_{n+1} := (\theta_1 + \ldots + \theta_n)/n.$$
      High-probability bound:
      $$P\left(\|\bar\theta_n - \hat\theta_T\|_2 \le \frac{K^{IA}_2(n)}{(n+c)^{\alpha/2}}\right) \ge 1 - \delta,$$
      where $K^{IA}_2(n) := (\ldots)\,\frac{1}{n^{(1-\alpha)/2}} + K^{IA}_1(n)$, with a leading factor that depends on $\alpha$, $c$, $\mu$, $\beta$ and $B(s_0)$.

  16. Concentration bounds: Iterate Averaging. Iterate averaging: High probability bound
      Bigger step-size + averaging:
      $$\gamma_n := \frac{1-\beta}{2}\left(\frac{c}{c+n}\right)^{\alpha}, \qquad \bar\theta_{n+1} := (\theta_1 + \ldots + \theta_n)/n.$$
      High-probability bound:
      $$P\left(\|\bar\theta_n - \hat\theta_T\|_2 \le \frac{K^{IA}_2(n)}{(n+c)^{\alpha/2}}\right) \ge 1 - \delta.$$
      $\alpha$ can be chosen arbitrarily close to 1, resulting in a rate $O(1/\sqrt{n})$.

  17. Concentration bounds: Iterate Averaging. Proof Outline
      Let $\bar\theta_{n+1} := (\theta_1 + \ldots + \theta_n)/n$ and $z_n = \bar\theta_{n+1} - \theta^*$. Then,
      $$P(\|z_n\|_2 - E\|z_n\|_2 \ge \epsilon) \le \exp\left(-\frac{\epsilon^2}{2\sum_{i=1}^{n}L_i^2}\right), \quad \forall \epsilon > 0,$$
      where $L_i := \frac{\gamma_i}{n}\left(1 + \sum_{l=i+1}^{n-1}\prod_{j=i}^{l}\left(1 - 2\gamma_j\left(\mu\left(1-\beta-\frac{\gamma_j}{2}\right) + [1+\beta(3-\beta)]B(s_0)\right)\right)\right)$.
      With $\gamma_n = (1-\beta)(c/(c+n))^{\alpha}$, we obtain a bound of the form $\sum_{i=1}^{n}L_i^2 \le (\ldots) \times \frac{1}{n}$, with a constant that depends on $\alpha$, $c$, $\mu$, $\beta$ and $B(s_0)$.

  19. Concentration bounds: Iterate Averaging. Proof outline: Bound in expectation
      To bound the expected error we directly average the errors of the non-averaged iterates:
      $$E\|\bar\theta_{n+1} - \theta^*\|_2 \le \frac{1}{n}\sum_{k=1}^{n}E\|\theta_k - \theta^*\|_2,$$
      and then specialise to the choice of step-size $\gamma_n = (1-\beta)(c/(c+n))^{\alpha}$:
      $$E\|\bar\theta_{n+1} - \theta^*\|_2 \le \frac{1+9B(s_0)}{n}\sum_{k=1}^{\infty}\exp(-\mu c(k+c)^{1-\alpha})\,\|\theta_0 - \theta^*\|_2 + 2\beta Hc^{\alpha}(1-\beta)\left(\mu c^{\alpha}(1-\beta)^2\right)^{-\alpha\frac{1+2\alpha}{2(1-\alpha)}}(n+c)^{-\alpha/2}.$$

  20. Centered TD(0)
      Centered TD (CTD)

  21. Centered TD(0). The Variance Problem
      Why does iterate averaging work?
      - In TD(0), each iterate introduces a high variance, which must be controlled by the step-size choice.
      - Averaging the iterates reduces the variance of the final estimator.
      - Reduced variance allows for more exploration within the iterates through larger step sizes.

  22. Centered TD(0). A Control Variate Solution
      Centering: another approach to variance reduction.
      - Instead of averaging iterates, one can use an average to guide the iterates.
      - Now all iterates are informed by their history.
      - Constructing this average in epochs allows a constant step-size choice.

  23. Centered TD(0). Centering: The Idea
      Recall that for TD(0),
      $$\theta_{n+1} = \theta_n + \gamma_n\,\underbrace{\left(r(s_n, \pi(s_n)) + \beta\theta_n^T\phi(s_{n+1}) - \theta_n^T\phi(s_n)\right)\phi(s_n)}_{= f_n(\theta_n)}$$
      and that $\theta_n \to \theta^*$, the solution of $F(\theta) := \Pi T^{\pi}(\Phi\theta) - \Phi\theta = 0$.
      Centering each iterate:
      $$\theta_{n+1} = \theta_n + \gamma\,\underbrace{\left(f_n(\theta_n) - f_n(\bar\theta_n) + F(\bar\theta_n)\right)}_{(*)}$$

  24. Centered TD(0). Centering: The Idea
      $$\theta_{n+1} = \theta_n + \gamma\,\underbrace{\left(f_n(\theta_n) - f_n(\bar\theta_n) + F(\bar\theta_n)\right)}_{(*)}$$
      Why does centering help?
      - No updates after hitting $\theta^*$.
      - An average guides the updates, resulting in low variance of the term (*).
      - Allows using a (large) constant step-size.
      - $O(d)$ update, the same as TD(0).
      - Working with epochs ⇒ need to store only the averaged iterate $\bar\theta_n$ and an estimate $\hat F(\bar\theta_n)$ of $F(\bar\theta_n)$.

  25. Centered TD(0). Centering: The Idea
      Centered update:
      $$\theta_{n+1} = \theta_n + \gamma\left(f_n(\theta_n) - f_n(\bar\theta_n) + F(\bar\theta_n)\right)$$
      Challenges compared to gradient descent with an accessible cost function:
      - $F$ is unknown and inaccessible in our setting.
      - To prove convergence bounds one has to cope with the error due to incomplete mixing.

  27. Centered TD(0)
      [Figure: one epoch run of CTD. Centering ($\bar\theta^{(m)}, \hat F^{(m)}(\bar\theta^{(m)})$) → Simulation (take action $\pi(s_n)$) → Fixed point iteration (update $\theta_n$ to $\theta_{n+1}$ using (2)) → Centering ($\bar\theta^{(m+1)}, \hat F^{(m+1)}(\bar\theta^{(m+1)})$).]
      At the beginning of each epoch, an iterate $\bar\theta^{(m)}$ is chosen uniformly at random from the previous epoch.
      Epoch run: set $\theta_{mM} := \bar\theta^{(m)}$, and, for $n = mM, \ldots, (m+1)M - 1$,
      $$\theta_{n+1} = \theta_n + \gamma\left(f_{X_{i_n}}(\theta_n) - f_{X_{i_n}}(\bar\theta^{(m)}) + \hat F^{(m)}(\bar\theta^{(m)})\right), \quad (2)$$
      where $\hat F^{(m)}(\theta) := \frac{1}{M}\sum_{i=(m-1)M}^{mM} f_{X_i}(\theta)$.
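The epoch structure can be sketched on a toy problem. Everything specific below is an assumption made for illustration: a two-state chain whose next state is uniform on {0, 1}, tabular (one-hot) features, rewards (1, 0), a warm-up batch to form the first $\hat F$, and the use of consecutive trajectory samples inside each epoch. It is a sketch of the centering idea, not the authors' implementation.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_term(theta, s, r, s_next, beta, features):
    """f(theta) = (r + beta * theta^T phi(s') - theta^T phi(s)) * phi(s)."""
    phi, phi_next = features[s], features[s_next]
    delta = r + beta * dot(theta, phi_next) - dot(theta, phi)
    return [delta * p for p in phi]


def simulate(state, steps, rng, rewards):
    """Hypothetical chain: the next state is drawn uniformly from {0, 1}."""
    transitions = []
    for _ in range(steps):
        nxt = rng.randrange(2)
        transitions.append((state, rewards[state], nxt))
        state = nxt
    return transitions, state


def centered_td(epochs=20, M=500, gamma=0.1, beta=0.5, seed=0):
    rng = random.Random(seed)
    features = [[1.0, 0.0], [0.0, 1.0]]   # tabular features (assumption)
    rewards = [1.0, 0.0]
    theta_bar = [0.0, 0.0]
    prev, state = simulate(0, M, rng, rewards)  # warm-up batch for the first F_hat
    for _ in range(epochs):
        # F_hat(theta_bar): average TD term over the previous epoch's samples.
        f_hat = [0.0, 0.0]
        for s, r, s2 in prev:
            f = td_term(theta_bar, s, r, s2, beta, features)
            f_hat = [a + b / len(prev) for a, b in zip(f_hat, f)]
        theta, iterates = list(theta_bar), []
        cur, state = simulate(state, M, rng, rewards)
        for s, r, s2 in cur:
            f_n = td_term(theta, s, r, s2, beta, features)
            f_bar = td_term(theta_bar, s, r, s2, beta, features)
            # Centered update with a constant step-size gamma.
            theta = [t + gamma * (a - b + c)
                     for t, a, b, c in zip(theta, f_n, f_bar, f_hat)]
            iterates.append(theta)
        theta_bar = rng.choice(iterates)  # uniform pick from this epoch's iterates
        prev = cur
    return theta_bar


if __name__ == "__main__":
    print(centered_td())  # the value function of this toy chain is (1.5, 0.5)
```

For this toy chain the fixed point can be computed by hand: with $\beta = 0.5$ the value function is $(1.5, 0.5)$, which gives a reference point for the returned $\bar\theta$.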

  29. Centered TD(0). Centering: Results
      Epoch length and step size choice: choose $M$ and $\gamma$ such that $C_1 < 1$, where
      $$C_1 := \frac{1}{2\mu\gamma M((1-\beta) - d^2\gamma)} + \frac{\gamma d^2}{2((1-\beta) - d^2\gamma)}.$$
      Error bound:
      $$\|\Phi(\bar\theta^{(m)} - \theta^*)\|_\Psi^2 \le C_1^m\,\|\Phi(\bar\theta^{(0)} - \theta^*)\|_\Psi^2 + C_2H(5\gamma+4)\sum_{k=1}^{m-1}C_1^{(m-2)-k}B_{(k-1)M}^{kM}(s_0),$$
      where $C_2 = \gamma/(2M((1-\beta) - d^2\gamma))$ and $B_{(k-1)M}^{kM}$ is an upper bound on the partial sums $\sum_{i=(k-1)M}^{kM}\left(E(\phi(s_i) \mid s_0) - E_\Psi(\phi(s_i))\right)$ and $\sum_{i=(k-1)M}^{kM}\left(E(\phi(s_i)\phi(s_{i+l})^T \mid s_0) - E_\Psi(\phi(s_i)\phi(s_{i+l})^T)\right)$, for $l = 0, 1$.
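Choosing $M$ and $\gamma$ amounts to evaluating $C_1$ and checking that it is below 1. A minimal sketch, assuming the expressions for $C_1$ and $C_2$ as written above; the helper name `ctd_constants` and the example values are hypothetical.

```python
def ctd_constants(mu: float, beta: float, d: float, gamma: float, M: int):
    """Return (C_1, C_2) for epoch length M and constant step-size gamma."""
    slack = (1.0 - beta) - d ** 2 * gamma
    if slack <= 0.0:
        raise ValueError("need gamma < (1 - beta) / d^2")
    c1 = 1.0 / (2.0 * mu * gamma * M * slack) + gamma * d ** 2 / (2.0 * slack)
    c2 = gamma / (2.0 * M * slack)
    return c1, c2


if __name__ == "__main__":
    for M in (5, 100):
        c1, c2 = ctd_constants(mu=0.5, beta=0.5, d=1.0, gamma=0.1, M=M)
        print(M, c1, c2, "contracting" if c1 < 1.0 else "not contracting")
```

The first term of $C_1$ shrinks with the epoch length $M$, while the second depends only on $\gamma$, so a small enough $\gamma$ followed by a large enough $M$ always gives $C_1 < 1$.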

  31. Centered TD(0). Centering: Results cont.
      The effect of mixing error: if the Markov chain underlying policy $\pi$ satisfies $|P(s_t = s \mid s_0) - \psi(s)| \le C\rho^{t/M}$, then
      $$\|\Phi(\bar\theta^{(m)} - \theta^*)\|_\Psi^2 \le C_1^m\,\|\Phi(\bar\theta^{(0)} - \theta^*)\|_\Psi^2 + CMC_2H(5\gamma+4)\max\{C_1, \rho^M\}^{(m-1)}.$$
      - When the MDP mixes exponentially fast (e.g. finite state-space MDPs) we get the exponential convergence rate (only in the first term).
      - Otherwise the decay of the error is dominated by the mixing rate.

  34. Centered TD(0). Proof Outline
      Let $\bar f_{X_{i_n}}(\theta_n) := f_{X_{i_n}}(\theta_n) - f_{X_{i_n}}(\bar\theta^{(m)}) + E_\Psi(f_{X_{i_n}}(\bar\theta^{(m)}))$.
      Step 1 (Rewriting the CTD update):
      $$\theta_{n+1} = \theta_n + \gamma\left(\bar f_{X_{i_n}}(\theta_n) + \epsilon_n\right), \quad \text{where } \epsilon_n := E(f_{X_{i_n}}(\bar\theta^{(m)}) \mid \mathcal{F}_{mM}) - E_\Psi(f_{X_{i_n}}(\bar\theta^{(m)})).$$
      Step 2 (Bounding the variance of centered updates):
      $$E_\Psi\left(\|\bar f_{X_{i_n}}(\theta_n)\|_2^2\right) \le d^2\left(\|\Phi(\theta_n - \theta^*)\|_\Psi^2 + \|\Phi(\bar\theta^{(m)} - \theta^*)\|_\Psi^2\right).$$

  36. Centered TD(0). Proof Outline
      Step 3 (Analysis for a particular epoch):
      $$E_{\theta_n}\|\theta_{n+1} - \theta^*\|_2^2 \le \|\theta_n - \theta^*\|_2^2 + \gamma^2E_{\theta_n}\|\epsilon_n\|_2^2 + 2\gamma(\theta_n - \theta^*)^TE_{\theta_n}\left(\bar f_{X_{i_n}}(\theta_n)\right) + \gamma^2E_{\theta_n}\|\bar f_{X_{i_n}}(\theta_n)\|_2^2$$
      $$\le \|\theta_n - \theta^*\|_2^2 - 2\gamma((1-\beta) - d^2\gamma)\,\|\Phi(\theta_n - \theta^*)\|_\Psi^2 + \gamma^2d^2\,\|\Phi(\bar\theta^{(m)} - \theta^*)\|_\Psi^2 + \gamma^2E_{\theta_n}\|\epsilon_n\|_2^2.$$
      Summing the above inequality over an epoch and noting that $E_{\Psi,\theta_n}\|\theta_{n+1} - \theta^*\|_2^2 \ge 0$ and $(\bar\theta^{(m)} - \theta^*)^TI(\bar\theta^{(m)} - \theta^*) \le \frac{1}{\mu}(\bar\theta^{(m)} - \theta^*)^T\Phi^T\Psi\Phi(\bar\theta^{(m)} - \theta^*)$, we obtain the following by setting $\theta_0 = \bar\theta^{(m)}$:
      $$2\gamma M((1-\beta) - d^2\gamma)\,\|\Phi(\bar\theta^{(m+1)} - \theta^*)\|_\Psi^2 \le \left(\frac{1}{\mu} + \gamma^2Md^2\right)\|\Phi(\bar\theta^{(m)} - \theta^*)\|_\Psi^2 + \gamma^2\sum_{i=(m-1)M}^{mM}E_{\theta_i}\|\epsilon_i\|_2^2.$$
      The final step is to unroll (across epochs) the final recursion above to obtain the rate for CTD.

  37. Centered TD(0)
      TD(0) on a batch

  38. Centered TD(0)
      Dilbert’s boss on big data!

  39. fast LSTD. LSTD - A Batch Algorithm
      Given a dataset $D := \{(s_i, r_i, s'_i),\ i = 1, \ldots, T\}$, LSTD approximates the TD fixed point by
      $$\hat\theta_T = \bar A_T^{-1}\bar b_T \qquad \text{(complexity } O(d^2T)\text{)},$$
      where
      $$\bar A_T = \frac{1}{T}\sum_{i=1}^{T}\phi(s_i)(\phi(s_i) - \beta\phi(s'_i))^T, \qquad \bar b_T = \frac{1}{T}\sum_{i=1}^{T}r_i\phi(s_i).$$
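The batch estimate can be sketched directly from the definitions of $\bar A_T$ and $\bar b_T$. The code below is an illustration under stated assumptions: samples are given as (feature, reward, next-feature) triples, and the linear system is solved by plain Gaussian elimination rather than by the incremental $O(d^2T)$ scheme, so it is only meant for small $d$.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lstd(dataset, beta):
    """dataset: list of (phi(s_i), r_i, phi(s'_i)); returns A_bar^{-1} b_bar."""
    T, d = len(dataset), len(dataset[0][0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for phi, r, phi_next in dataset:
        for i in range(d):
            b[i] += r * phi[i] / T
            for j in range(d):
                A[i][j] += phi[i] * (phi[j] - beta * phi_next[j]) / T
    return solve(A, b)


if __name__ == "__main__":
    # Two states with one-hot features: s0 -> s1 (reward 1), s1 -> s0 (reward 0).
    data = [([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 0.0, [1.0, 0.0])]
    print(lstd(data, beta=0.5))
```

For the two-sample example the fixed point can be checked by hand: $V(s_0) = 1 + 0.5V(s_1)$ and $V(s_1) = 0.5V(s_0)$ give $(4/3, 2/3)$.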

  41. fast LSTD. Complexity of LSTD [1]
      [Figure: LSPI - a batch-mode RL algorithm for control. Policy Evaluation maps the policy $\pi$ to the Q-value $Q^{\pi}$; Policy Improvement maps $Q^{\pi}$ back to a policy.]
      LSTD complexity:
      - $O(d^2T)$ using the Sherman-Morrison lemma, or
      - $O(d^{2.807})$ using the Strassen algorithm, or
      - $O(d^{2.375})$ using the Coppersmith-Winograd algorithm.
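The $O(d^2T)$ figure comes from updating an inverse with one rank-one correction per sample. A minimal sketch of the Sherman-Morrison step is shown below, using $(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}$; the function name is hypothetical, and for LSTD one would take $u = \phi(s_i)$ and $v = \phi(s_i) - \beta\phi(s'_i)$.

```python
def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, using O(d^2) operations."""
    d = len(u)
    Au = [sum(A_inv[i][k] * u[k] for k in range(d)) for i in range(d)]   # A^{-1} u
    vA = [sum(v[k] * A_inv[k][j] for k in range(d)) for j in range(d)]   # v^T A^{-1}
    denom = 1.0 + sum(v[i] * Au[i] for i in range(d))
    if abs(denom) < 1e-12:
        raise ValueError("rank-one update makes the matrix singular")
    return [[A_inv[i][j] - Au[i] * vA[j] / denom for j in range(d)]
            for i in range(d)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # phi(s) = (1, 0), phi(s') = (0, 1), beta = 0.5, so v = phi(s) - beta * phi(s').
    print(sherman_morrison_update(identity, [1.0, 0.0], [1.0, -0.5]))
```

Applying this once per sample costs $O(d^2)$ each time, which gives $O(d^2T)$ over the whole dataset.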

  43. fast LSTD. Complexity of LSTD [2]
      Problem: practical applications involve high-dimensional features (e.g. Computer-Go: $d \sim 10^6$) ⇒ solving LSTD is computationally intensive. Related works: GTD¹, GTD2², iLSTD³.
      Solution: use stochastic approximation (SA). Complexity $O(dT)$ ⇒ $O(d)$ reduction in complexity.
      Theory: the SA variant of LSTD does not impact the overall rate of convergence.
      Experiments: on a traffic control application, the performance of SA-based LSTD is comparable to LSTD, while gaining in runtime!
      ¹ Sutton et al. (2009) A convergent O(n) algorithm for off-policy temporal difference learning. In: NIPS
      ² Sutton et al. (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: ICML
      ³ Geramifard A et al. (2007) iLSTD: Eligibility traces and convergence analysis. In: NIPS

  45. fast LSTD. Fast LSTD using Stochastic Approximation
      [Figure: Random Sampling (pick $i_n$ uniformly in $\{1, \ldots, T\}$) → SA Update (update $\theta_n$ to $\theta_{n+1}$ using $(s_{i_n}, r_{i_n}, s'_{i_n})$).]
      Update rule:
      $$\theta_n = \theta_{n-1} + \gamma_n\left(r_{i_n} + \beta\theta_{n-1}^T\phi(s'_{i_n}) - \theta_{n-1}^T\phi(s_{i_n})\right)\phi(s_{i_n}),$$
      where $\gamma_n$ are the step-sizes and the bracketed term is the fixed-point iteration.
      Complexity: $O(d)$ per iteration.
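The update rule can be sketched as a loop over uniformly sampled indices. The code below is an illustration, assuming samples stored as (feature, reward, next-feature) triples and the step-size $\gamma_n = (1-\beta)c / (2(c+n))$; the function name `flstd_sa`, the toy dataset and the value $c = 12$ are hypothetical choices.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def flstd_sa(dataset, beta, c, n_iters, seed=0):
    """Stochastic-approximation variant of LSTD on a fixed batch of samples."""
    rng = random.Random(seed)
    theta = [0.0] * len(dataset[0][0])
    for n in range(1, n_iters + 1):
        # Random sampling: pick one stored transition uniformly.
        phi, r, phi_next = dataset[rng.randrange(len(dataset))]
        gamma_n = (1.0 - beta) * c / (2.0 * (c + n))
        td_error = r + beta * dot(theta, phi_next) - dot(theta, phi)
        # SA update: O(d) work per iteration.
        theta = [t + gamma_n * td_error * p for t, p in zip(theta, phi)]
    return theta


if __name__ == "__main__":
    # Two one-hot states: s0 -> s1 (reward 1), s1 -> s0 (reward 0).
    data = [([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 0.0, [1.0, 0.0])]
    # Here mu = 0.5 and beta = 0.5, so c = 12 gives (1 - beta)^2 * mu * c = 1.5.
    print(flstd_sa(data, beta=0.5, c=12.0, n_iters=100000))
```

On this toy dataset the LSTD solution is $(4/3, 2/3)$, so the printed iterate gives a quick sanity check of the update.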

  47. fast LSTD. Assumptions
      Setting: given a dataset $D := \{(s_i, r_i, s'_i),\ i = 1, \ldots, T\}$.
      - (A1) Bounded features: $\|\phi(s_i)\|_2 \le 1$.
      - (A2) Bounded rewards: $|r_i| \le R_{\max} < \infty$.
      - (A3) The co-variance matrix has a min-eigenvalue: $\lambda_{\min}\left(\frac{1}{T}\sum_{i=1}^{T}\phi(s_i)\phi(s_i)^T\right) \ge \mu$.

  51. fast LSTD. Convergence Rate
      Step-size choice: $\gamma_n = \frac{(1-\beta)c}{2(c+n)}$, with $(1-\beta)^2\mu c \in (1.33, 2)$.
      Bound in expectation:
      $$E\|\theta_n - \hat\theta_T\|_2 \le \frac{K_1}{\sqrt{n+c}}$$
      High-probability bound:
      $$P\left(\|\theta_n - \hat\theta_T\|_2 \le \frac{K_2}{\sqrt{n+c}}\right) \ge 1 - \delta$$
      By iterate averaging, the dependency of $c$ on $\mu$ can be removed.

  55. fast LSTD. The constants
      $$K_1(n) = \frac{\sqrt{c}\,\|\theta_0 - \hat\theta_T\|_2}{n^{((1-\beta)^2\mu c - 1)/2}} + \frac{(1-\beta)c\,h^2(n)}{2},$$
      $$K_2(n) = \frac{(1-\beta)c\sqrt{\log\delta^{-1}}}{2\sqrt{\frac{3}{4}(1-\beta)^2\mu c - 1}} + K_1(n),$$
      where
      $$h(k) := (1 + R_{\max} + \beta)^2\max\left(\left(\|\theta_0 - \hat\theta_T\|_2 + \ln n + \|\hat\theta_T\|_2\right)^4, 1\right).$$
      Both $K_1(n)$ and $K_2(n)$ are $O(1)$.

  56. fast LSTD. Iterate Averaging
      Bigger step-size + averaging:
      $$\gamma_n := \frac{1-\beta}{2}\left(\frac{c}{c+n}\right)^{\alpha}, \qquad \bar\theta_{n+1} := (\theta_1 + \ldots + \theta_n)/n.$$
      Bound in expectation:
      $$E\|\bar\theta_n - \hat\theta_T\|_2 \le \frac{K^{IA}_1(n)}{(n+c)^{\alpha/2}}$$
      High-probability bound:
      $$P\left(\|\bar\theta_n - \hat\theta_T\|_2 \le \frac{K^{IA}_2(n)}{(n+c)^{\alpha/2}}\right) \ge 1 - \delta$$
      The dependency of $c$ on $\mu$ is removed, at the cost of $(1-\alpha)/2$ in the rate.

  60. fast LSTD. The constants
      $$K^{IA}_1(n) := \frac{C\,\|\theta_0 - \hat\theta_T\|_2}{(n+c)^{(1-\alpha)/2}} + \frac{h(n)\,c^{\alpha}(1-\beta)}{\left(\mu c^{\alpha}(1-\beta)^2\right)^{\alpha\frac{1+2\alpha}{2(1-\alpha)}}}, \quad \text{and}$$
      $$K^{IA}_2(n) := \frac{\sqrt{\log\delta^{-1}}}{\mu(1-\beta)}\left[3^{\alpha} + \left[\frac{2\alpha}{\mu c^{\alpha}(1-\beta)^2} + \frac{2^{\alpha}}{\alpha}\right]^2\right]\frac{1}{(n+c)^{(1-\alpha)/2}} + K^{IA}_1(n).$$
      As before, both $K^{IA}_1(n)$ and $K^{IA}_2(n)$ are $O(1)$.

  61. fast LSTD. Performance bounds
      Approximate value function $\tilde v_n := \Phi\theta_n$; true value function $v$.
      $$\|v - \tilde v_n\|_T \le \underbrace{\frac{\|v - \Pi v\|_T}{\sqrt{1-\beta^2}}}_{\text{approximation error}} + \underbrace{O\left(\sqrt{\frac{d}{(1-\beta)^2\mu T}}\right)}_{\text{estimation error}} + \underbrace{O\left(\sqrt{\frac{1}{(1-\beta)^2\mu^2 n}\ln\frac{1}{\delta}}\right)}_{\text{computational error}}$$
      Here $\|f\|_T^2 := T^{-1}\sum_{i=1}^{T}f(s_i)^2$, for any function $f$.
      ² Lazaric, A., Ghavamzadeh, M., Munos, R. (2012) Finite-sample analysis of least-squares policy iteration. In: JMLR

  62. fast LSTD. Performance bounds
      $$\|v - \tilde v_n\|_T \le \underbrace{\frac{\|v - \Pi v\|_T}{\sqrt{1-\beta^2}}}_{\text{approximation error}} + \underbrace{O\left(\sqrt{\frac{d}{(1-\beta)^2\mu T}}\right)}_{\text{estimation error}} + \underbrace{O\left(\sqrt{\frac{1}{(1-\beta)^2\mu^2 n}\ln\frac{1}{\delta}}\right)}_{\text{computational error}}$$
      - The approximation and estimation errors are artifacts of function approximation and least squares methods.
      - The computational error is the consequence of using SA for LSTD.
      Setting $n = \ln(1/\delta)\,T/(d\mu)$, the convergence rate is unaffected!
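The claim about the choice of $n$ can be checked by substitution. The following worked step assumes the computational error term has the form $\sqrt{\ln(1/\delta) / ((1-\beta)^2\mu^2 n)}$, as read from this slide:

```latex
\begin{align*}
n = \frac{\ln(1/\delta)\,T}{d\mu}
\quad \Longrightarrow \quad
\sqrt{\frac{\ln(1/\delta)}{(1-\beta)^2\mu^2 n}}
= \sqrt{\frac{\ln(1/\delta)\, d\mu}{(1-\beta)^2\mu^2\,\ln(1/\delta)\,T}}
= \sqrt{\frac{d}{(1-\beta)^2\mu T}}.
\end{align*}
```

Under that assumption the computational error has the same order as the estimation error, so running the SA iteration for this many steps does not change the overall rate.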

  65. Fast LSPI using SA. LSPI - A Quick Recap
      [Figure: the LSPI loop. Policy Evaluation maps the policy $\pi$ to the Q-value $Q^{\pi}$; Policy Improvement maps $Q^{\pi}$ back to a policy.]
      $$Q^{\pi}(s, a) = E\left[\sum_{t=0}^{\infty}\beta^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s, a_0 = a\right]$$
      $$\pi'(s) = \arg\max_{a \in \mathcal{A}}\theta^T\phi(s, a)$$

  67. Fast LSPI using SA: Policy Evaluation: LSTDQ and its SA variant
      Given a set of samples $D := \{(s_i, a_i, r_i, s'_i),\ i = 1, \dots, T\}$, LSTDQ approximates $Q^\pi$ by $\hat\theta_T = \bar A_T^{-1} \bar b_T$, where
      $$\bar A_T = \frac{1}{T} \sum_{i=1}^{T} \phi(s_i, a_i) \big(\phi(s_i, a_i) - \beta \phi(s'_i, \pi(s'_i))\big)^\top, \quad \bar b_T = T^{-1} \sum_{i=1}^{T} r_i \phi(s_i, a_i).$$
      Fast LSTDQ using SA:
      $$\theta_k = \theta_{k-1} + \gamma_k \big(r_{i_k} + \beta \theta_{k-1}^\top \phi(s'_{i_k}, \pi(s'_{i_k})) - \theta_{k-1}^\top \phi(s_{i_k}, a_{i_k})\big) \phi(s_{i_k}, a_{i_k})$$
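
The two estimators on this slide can be sketched side by side. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the features $\phi(s_i, a_i)$ and $\phi(s'_i, \pi(s'_i))$ are assumed to be precomputed as rows of `phi_sa` and `phi_next`, the iterate starts at zero, and the step size $\gamma_k = c / (c + k)$ is borrowed from the earlier non-averaged analysis.

```python
import numpy as np


def lstdq(phi_sa, phi_next, rewards, beta):
    """Closed-form LSTDQ: theta_hat = A_bar^{-1} b_bar."""
    T = len(rewards)
    A_bar = phi_sa.T @ (phi_sa - beta * phi_next) / T
    b_bar = phi_sa.T @ rewards / T
    # Solve the d x d linear system instead of forming the inverse explicitly.
    return np.linalg.solve(A_bar, b_bar)


def flstdq_sa(phi_sa, phi_next, rewards, beta, num_iters, c=1.0, seed=0):
    """SA variant: one uniformly sampled transition per step, O(d) work each."""
    rng = np.random.default_rng(seed)
    T, d = phi_sa.shape
    theta = np.zeros(d)  # assumed starting point
    for k in range(1, num_iters + 1):
        i = rng.integers(T)      # i_k ~ U({0, ..., T-1}) (0-based index)
        gamma_k = c / (c + k)    # assumed step-size schedule
        td_error = rewards[i] + beta * (theta @ phi_next[i]) - theta @ phi_sa[i]
        theta = theta + gamma_k * td_error * phi_sa[i]
    return theta
```

The closed form solves a $d \times d$ system, whereas each SA step only touches two feature vectors, which is where the per-iteration saving comes from.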

  69. Fast LSPI using SA: Fast LSPI using SA (fLSPI-SA)
      Input: Sample set $D := \{s_i, a_i, r_i, s'_i\}_{i=1}^{T}$
      repeat
          Policy Evaluation
              For $k = 1$ to $\tau$
                  - Get random sample index: $i_k \sim U(\{1, \dots, T\})$
                  - Update fLSTD-SA iterate $\theta_k$
              $\theta' \leftarrow \theta_\tau$, $\Delta = \|\theta - \theta'\|_2$
          Policy Improvement
              Obtain a greedy policy $\pi'(s) = \arg\max_{a \in \mathcal{A}} \theta'^\top \phi(s, a)$
              $\theta \leftarrow \theta'$, $\pi \leftarrow \pi'$
      until $\Delta < \epsilon$
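
The loop above can be sketched in a few lines. Several details are not fixed by the slide and are assumptions here: the initial policy is taken to be greedy with respect to $\theta = 0$, each evaluation phase is warm-started from the current $\theta$, the step size is $\gamma_k = c / (c + k)$, and `max_rounds` is a safety cap added for this example. The function name and the `phi(state, action)` calling convention are hypothetical.

```python
import numpy as np


def flspi_sa(samples, phi, actions, d, beta, tau, eps,
             c=1.0, max_rounds=50, seed=0):
    """Sketch of fLSPI-SA on a fixed batch of (s, a, r, s_next) samples."""
    rng = np.random.default_rng(seed)
    T = len(samples)
    theta = np.zeros(d)  # assumed initial parameter (policy is greedy w.r.t. it)

    def greedy(th, s):
        # Policy improvement: argmax_a th^T phi(s, a); ties go to the first action.
        return max(actions, key=lambda a: float(th @ phi(s, a)))

    for _ in range(max_rounds):
        # Policy evaluation: tau SA steps for the policy that is greedy w.r.t. theta.
        theta_k = theta.copy()  # warm start (assumption)
        for k in range(1, tau + 1):
            s, a, r, s_next = samples[rng.integers(T)]
            gamma_k = c / (c + k)
            f = phi(s, a)
            f_next = phi(s_next, greedy(theta, s_next))
            td_error = r + beta * (theta_k @ f_next) - theta_k @ f
            theta_k = theta_k + gamma_k * td_error * f
        delta = np.linalg.norm(theta - theta_k)
        theta = theta_k  # the next round's policy is greedy w.r.t. the new theta
        if delta < eps:
            break
    return theta
```

Keeping `theta` fixed inside the inner loop matters: the policy being evaluated must not change while `theta_k` is being updated, otherwise the inner loop is no longer evaluating a single policy.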

  71. Experiments - Traffic Signal Control: The traffic control problem

  72. Experiments - Traffic Signal Control: Simulation Results on 7x9-grid network
      [Figure, two panels. Tracking error: $\|\theta_k - \hat\theta_T\|_2$ against step $k$ of fLSTD-SA. Throughput (TAR): TAR against time steps, for LSPI and fLSPI-SA.]

  73. Experiments - Traffic Signal Control: Runtime Performance on three road networks
      [Bar chart: runtime (ms), LSPI vs fLSPI-SA]
      7x9-Grid (d = 504): LSPI 4,917; fLSPI-SA 66
      14x9-Grid (d = 1008): LSPI 30,144; fLSPI-SA 159
      14x18-Grid (d = 2016): LSPI 1.91 · 10^5; fLSPI-SA 287

  74. Experiments - Traffic Signal Control: SGD in Linear Bandits

  75. Experiments - Traffic Signal Control: Complacs News Recommendation Platform
      NOAM database: 17 million articles from 2010
      Task: Find the best among 2000 news feeds
      Reward: Relevancy score of the article
      Feature dimension: 80000 (approx)
      1 In collaboration with Nello Cristianini and Tom Welfare at University of Bristol
