Stochastic approximation for speeding up LSTD/LSPI (and least squares regression/LinUCB)

Prashanth L A† (joint work with Nathaniel Korda♯ and Rémi Munos†)
† INRIA Lille - Team SequeL    ♯ MLRG - Oxford University
November 24, 2014
Outline

1. Fast LSTD using SA
2. Fast LSPI using SA
3. Experiments - Traffic Signal Control
4. Extension to Least Squares Regression
5. Experiments - News Recommendation
6. Proof outline
Background: MDP

Set of states $S$, set of actions $A$, rewards $r(s, a)$.

Value function: $V^\pi(s) := \mathbb{E}\Big[ \sum_{t=0}^{\infty} \beta^t r(s_t, \pi(s_t)) \;\Big|\; s_0 = s \Big]$

Bellman operator: $T^\pi(V)(s) := r(s, \pi(s)) + \beta \sum_{s'} p(s, \pi(s), s') V(s')$
TD with Function Approximation

Linear function approximation: $V^\pi(s) \approx \theta^T \phi(s)$, with parameter $\theta \in \mathbb{R}^d$ and feature $\phi(s) \in \mathbb{R}^d$.

TD fixed point: $\Phi \theta = \Pi T^\pi(\Phi \theta)$, where $\Phi$ is the feature matrix with rows $\phi(s)^T$, $\forall s \in S$, and $\Pi$ is the orthogonal projection onto $B = \{\Phi \theta \mid \theta \in \mathbb{R}^d\}$.
LSTD - A Batch Algorithm

Given dataset $D := \{(s_i, r_i, s'_i),\ i = 1, \ldots, T\}$, LSTD approximates the TD fixed point by

$\hat{\theta}_T = \bar{A}_T^{-1} \bar{b}_T$, where $\bar{A}_T = \frac{1}{T} \sum_{i=1}^{T} \phi(s_i)\big(\phi(s_i) - \beta \phi(s'_i)\big)^T$ and $\bar{b}_T = \frac{1}{T} \sum_{i=1}^{T} r_i \phi(s_i)$.

Complexity: $O(d^2 T)$.
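The batch LSTD solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function name and array layout (one row per transition) are assumptions.

```python
import numpy as np

def lstd(phi, phi_next, r, beta):
    """Batch LSTD: solve A_bar theta = b_bar for the TD fixed point.

    phi      : (T, d) array, features phi(s_i)
    phi_next : (T, d) array, features phi(s'_i)
    r        : (T,) array, rewards r_i
    beta     : discount factor in [0, 1)
    """
    T = phi.shape[0]
    A_bar = phi.T @ (phi - beta * phi_next) / T   # (d, d), O(d^2 T) to form
    b_bar = phi.T @ r / T                         # (d,)
    return np.linalg.solve(A_bar, b_bar)          # theta_hat_T
```

Forming `A_bar` already costs $O(d^2 T)$, which is the bottleneck the SA variant below avoids.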
Complexity of LSTD [1]

Figure: LSPI - a batch-mode RL algorithm for control (policy evaluation maps policy $\pi$ to Q-value $Q^\pi$; policy improvement closes the loop).

LSTD complexity: $O(d^2 T)$ using the Sherman-Morrison lemma, or $O(d^{2.807})$ using the Strassen algorithm, or $O(d^{2.375})$ using the Coppersmith-Winograd algorithm.
Complexity of LSTD [2]

Problem: practical applications involve high-dimensional features (e.g. Computer-Go: $d \sim 10^6$) ⇒ solving LSTD is computationally intensive.

Related works: GTD¹, GTD2², iLSTD³.

Solution: use stochastic approximation (SA).
Complexity: $O(dT)$ ⇒ $O(d)$ reduction in complexity.
Theory: the SA variant of LSTD does not impact the overall rate of convergence.
Experiments: on a traffic control application, the performance of SA-based LSTD is comparable to LSTD, while gaining in runtime!

¹ Sutton et al. (2009) A convergent O(n) algorithm for off-policy temporal-difference learning. In: NIPS
² Sutton et al. (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: ICML
³ Geramifard et al. (2007) iLSTD: Eligibility traces and convergence analysis. In: NIPS
Fast LSTD using Stochastic Approximation

Pick $i_n$ uniformly in $\{1, \ldots, T\}$ (random sampling), then update $\theta_{n-1} \to \theta_n$ using the transition $(s_{i_n}, r_{i_n}, s'_{i_n})$ (SA update).

Update rule (a fixed-point iteration with step-sizes $\gamma_n$):

$\theta_n = \theta_{n-1} + \gamma_n \left( r_{i_n} + \beta\, \theta_{n-1}^T \phi(s'_{i_n}) - \theta_{n-1}^T \phi(s_{i_n}) \right) \phi(s_{i_n})$

Complexity: $O(d)$ per iteration.
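The sampling-plus-update loop above can be sketched as follows. This is a sketch under assumed conventions (function name, a user-supplied `step(n)` schedule, zero initialization); each iteration touches only one transition, so the cost is $O(d)$ per step rather than $O(d^2)$.

```python
import numpy as np

def fast_lstd_sa(phi, phi_next, r, beta, n_iters, step, seed=0):
    """SA variant of LSTD: O(d) work per iteration.

    Each iteration samples one transition uniformly from the dataset
    D = {(s_i, r_i, s'_i)} and takes a TD-style fixed-point step.
    `step(n)` returns the step-size gamma_n.
    """
    rng = np.random.default_rng(seed)
    T, d = phi.shape
    theta = np.zeros(d)
    for n in range(1, n_iters + 1):
        i = rng.integers(T)  # pick i_n uniformly in {1, ..., T}
        td_err = r[i] + beta * (theta @ phi_next[i]) - theta @ phi[i]
        theta = theta + step(n) * td_err * phi[i]  # O(d) update
    return theta
```

On a well-conditioned dataset the iterates approach the batch LSTD solution $\hat{\theta}_T$ at the rate given on the next slide.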
Assumptions

Setting: given dataset $D := \{(s_i, r_i, s'_i),\ i = 1, \ldots, T\}$.

(A1) Bounded features: $\|\phi(s_i)\|_2 \le 1$
(A2) Bounded rewards: $|r_i| \le R_{\max} < \infty$
(A3) Covariance matrix has a minimum eigenvalue: $\lambda_{\min}\left( \frac{1}{T} \sum_{i=1}^{T} \phi(s_i) \phi(s_i)^T \right) \ge \mu$
Convergence Rate

Step-size choice: $\gamma_n = \dfrac{(1 - \beta)\, c}{2\, (c + n)}$, with $c$ such that $(1 - \beta)^2 \mu\, c \in (1.33,\ 2)$.

Bound in expectation:
$\mathbb{E} \left\| \theta_n - \hat{\theta}_T \right\|_2 \le \dfrac{K_1}{\sqrt{n + c}}$

High-probability bound:
$\mathbb{P} \left( \left\| \theta_n - \hat{\theta}_T \right\|_2 \le \dfrac{K_2}{\sqrt{n + c}} \right) \ge 1 - \delta$

By iterate-averaging, the dependency of $c$ on $\mu$ can be removed.
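The step-size choice and iterate-averaging idea above can be illustrated concretely. The numeric values of $\beta$ and $\mu$ here are assumptions for illustration only, and `averaged` is a hypothetical helper implementing Polyak-Ruppert averaging, $\bar{\theta}_n = \frac{1}{n} \sum_{k \le n} \theta_k$.

```python
import numpy as np

beta, mu = 0.9, 0.1                 # assumed values for illustration
c = 1.5 / ((1 - beta) ** 2 * mu)    # makes (1 - beta)^2 * mu * c = 1.5, inside (1.33, 2)

def gamma(n):
    """Step size gamma_n = (1 - beta) c / (2 (c + n))."""
    return (1 - beta) * c / (2 * (c + n))

def averaged(iterates):
    """Polyak-Ruppert averaging: row n is the mean of iterates[0..n]."""
    counts = np.arange(1, len(iterates) + 1)[:, None]
    return np.cumsum(iterates, axis=0) / counts
```

Averaging the iterates (and running the raw iteration with a larger step) is what lets one drop the knowledge of $\mu$ required to set $c$.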