Q-learning Algorithms for Optimal Stopping Based on Least Squares

H. Yu¹   D. P. Bertsekas²

¹ Department of Computer Science, University of Helsinki
² Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology

European Control Conference, Kos, Greece, 2007
Outline

Introduction
  Optimal Stopping Problems
  Preliminaries
Least Squares Q-Learning
  Algorithm
  Convergence
  Convergence Rate
Variants with Reduced Computation
  Motivation
  First Variant
  Second Variant
Basic Problem and Bellman Equation

• An irreducible Markov chain with n states and transition matrix P
  Action: stop or continue
  Cost at state i: c(i) if we stop; g(i) if we continue
  Minimize the expected discounted total cost until stopping
• Bellman equations in vector notation:¹

    J* = min{c, g + αPJ*},    Q* = g + αP min{c, Q*}

  Optimal policy: stop as soon as the state hits the set D = {i | c(i) ≤ Q*(i)}
• Applications: search, sequential hypothesis testing, finance
• Focus of this paper: Q-learning with linear function approximation²

¹ α: discount factor; J*: optimal cost; Q*: Q-factor of the continuation action (the cost of continuing for the first stage and using an optimal stopping policy in the remaining stages).
² Q-learning aims to find the Q-factor of each state-action pair, i.e., the vector Q* (the Q-factor vector of the stop action is c).
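To make the Bellman equation and the stopping set D concrete, here is a minimal NumPy sketch; the 4-state chain, costs, and discount factor are made-up placeholders, not from the paper. It computes Q* by iterating the contraction Q ← g + αP min{c, Q} and reads off D.

```python
import numpy as np

# Made-up toy instance (placeholders, not from the paper).
alpha = 0.95
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4],
              [0.1, 0.0, 0.4, 0.5]])
c = np.array([5.0, 4.0, 3.0, 2.0])   # stopping costs c(i)
g = np.array([1.0, 2.0, 1.5, 0.5])   # continuation costs g(i)

# Value iteration on the Q-factor of the continuation action:
# Q <- g + alpha * P * min(c, Q) is a contraction, so it converges to Q*.
Q = np.zeros_like(c)
for _ in range(2000):
    Q = g + alpha * P @ np.minimum(c, Q)

J = np.minimum(c, Q)                              # J* = min{c, Q*}
D = [i for i in range(len(c)) if c[i] <= Q[i]]    # optimal stopping set
print("Q* =", Q, " J* =", J, " stop in states", D)
```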
Q-Learning with Function Approximation (Tsitsiklis and Van Roy 1999)

Subspace approximation:³

    Q = Φr,  where [Φ]_{n×s} is the matrix with rows φ(i)',  i.e.,  Q(i, r) = φ(i)'r

Weighted Euclidean projection:

    ΠQ = arg min_{Φr, r∈ℜˢ} ‖Q − Φr‖_π,    π = (π(1), …, π(n)): invariant distribution of P

Key fact: the DP mapping F, defined by FQ = g + αP min{c, Q}, is a ‖·‖_π-contraction, and so is ΠF.

Temporal difference (TD) learning solves the projected Bellman equation:

    Φr* = ΠF(Φr*)

Suboptimal policy µ: stop as soon as the state hits the set {i | c(i) ≤ φ(i)'r*}.⁴ Then

    Σ_{i=1}^n (J_µ(i) − J*(i)) π(i)  ≤  2 / ((1−α)√(1−α²)) · ‖ΠQ* − Q*‖_π

³ Assume that Φ has linearly independent columns.
⁴ Denote by J_µ the cost of this policy.
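As an editorial illustration of the weighted Euclidean projection (not from the slides; the chain and the 2-dimensional feature matrix are assumptions), Π can be formed explicitly as Φ(Φ'DΦ)⁻¹Φ'D with D = diag(π):

```python
import numpy as np

# Made-up example: 4-state chain and two basis functions (assumptions).
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4],
              [0.1, 0.0, 0.4, 0.5]])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])          # rows are phi(i)'

# Invariant distribution pi of P: normalized left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

# Projection onto span(Phi) in the pi-weighted Euclidean norm:
# Pi = Phi (Phi' D Phi)^{-1} Phi' D,  D = diag(pi).
D = np.diag(pi)
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

# Sanity checks: Pi is a projection and leaves the subspace fixed.
assert np.allclose(Pi @ Pi, Pi)
assert np.allclose(Pi @ Phi, Phi)
```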
Basis of Least Squares Methods I: Projected Value Iteration

Simulation: (x_0, x_1, …) is the unstopped state process; it implicitly approximates ΠF with increasing accuracy.

Projected value iteration and LSPE (Bertsekas and Ioffe 1996):⁵

    Projected value iteration:   Φr_{t+1} = ΠF(Φr_t)
    LSPE:                        Φr_{t+1} = Π̂_t F̂_t(Φr_t) = ΠF(Φr_t) + ε_t

[Figure: two diagrams, "Projected Value Iteration" and "Least Squares Policy Evaluation (LSPE)": the value iterate F(Φr_t) is projected on S, the subspace spanned by the basis functions, to give Φr_{t+1}; in the LSPE diagram the projection is perturbed by a simulation error.]

⁵ Roughly speaking, Π̂_t F̂_t → ΠF and ε_t → 0 as t → ∞.
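For intuition, a hedged sketch (not from the paper) of exact projected value iteration Φr_{t+1} = ΠF(Φr_t) for the optimal stopping mapping F; LSPE and the least squares Q-learning algorithm of the next slides replace Π and F by simulation-based estimates. The toy problem data are assumptions.

```python
import numpy as np

def projected_value_iteration(P, g, c, alpha, Phi, n_iter=500):
    """Exact iteration Phi r_{t+1} = Pi F(Phi r_t), with F(Q) = g + alpha*P*min(c, Q)."""
    # Invariant distribution and pi-weighted projection coefficients.
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()
    D = np.diag(pi)
    M = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)   # maps Q to coefficients of Pi Q

    r = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        FQ = g + alpha * P @ np.minimum(c, Phi @ r)   # DP mapping F
        r = M @ FQ                                    # projection step
    return r

# Made-up toy instance (assumption, not from the paper).
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
r_star = projected_value_iteration(P, g, c, 0.95, Phi)
print("approximate Q-factors:", Phi @ r_star)
```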
Basis of Least Squares Methods II: Solving an Approximate Projected Bellman Equation

LSTD (Bradtke and Barto 1996, Boyan 1999): find r_{t+1} solving an approximate projected Bellman equation

    Φr_{t+1} = Π̂_t F̂_t(Φr_{t+1})

Not viable for optimal stopping, because F is nonlinear.⁶

Comparison with the temporal difference learning algorithm (Tsitsiklis and Van Roy 1999):⁷

    r_{t+1} = r_t + γ_t φ(x_t) ( g(x_t, x_{t+1}) + α min{c(x_{t+1}), φ(x_{t+1})'r_t} − φ(x_t)'r_t )

• TD: uses each sample state only once; by averaging over a long time interval, it approximately performs the mapping ΠF
• Least squares (LS) methods: use the past information effectively; no need to store the past (in the policy evaluation context)

⁶ In the case of policy evaluation, this is a linear equation and can be solved efficiently.
⁷ Abusing notation, we denote by g(i, j) the one-stage cost of transitioning from state i to j under the continuation action.
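For comparison, a minimal simulation sketch of the Tsitsiklis-Van Roy TD update above; only the update rule comes from the slide, while the chain, costs, features, and the stepsize sequence γ_t are assumptions (and g(i, j) is simplified to a state-dependent cost g(i)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up problem data (assumptions, not from the paper).
n, s, alpha = 4, 2, 0.95
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5])   # continuation costs, used in place of g(i, j)
c = np.array([5.0, 4.0, 3.0, 2.0])   # stopping costs
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

r = np.zeros(s)
x = 0
for t in range(200000):
    x_next = int(rng.choice(n, p=P[x]))
    gamma_t = 10.0 / (t + 100.0)      # a diminishing stepsize (one common choice)
    # Temporal difference for the optimal stopping problem.
    td = g[x] + alpha * min(c[x_next], Phi[x_next] @ r) - Phi[x] @ r
    r = r + gamma_t * Phi[x] * td     # TD update: uses the current sample only
    x = x_next
print("TD estimate r =", r)
```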
Least Squares Q-Learning: The Algorithm

(x_0, x_1, …) unstopped state process, γ ∈ (0, 2/(1+α)) constant stepsize

    r_{t+1} = r_t + γ (r̂_{t+1} − r_t)                                                      (1)

where r̂_{t+1} is the LS solution:

    r̂_{t+1} = arg min_{r∈ℜˢ} Σ_{k=0}^t ( φ(x_k)'r − g(x_k, x_{k+1}) − α min{c(x_{k+1}), φ(x_{k+1})'r_t} )²      (2)

r̂_{t+1} can be computed almost recursively:

    r̂_{t+1} = ( Σ_{k=0}^t φ(x_k)φ(x_k)' )⁻¹ Σ_{k=0}^t φ(x_k) ( g(x_k, x_{k+1}) + α min{c(x_{k+1}), φ(x_{k+1})'r_t} )

except that computing the terms min{c(x_{k+1}), φ(x_{k+1})'r_t}, k ≤ t, requires repartitioning the past states into stopping and continuation sets (a remedy will be discussed later)
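A hedged NumPy sketch of iterations (1)-(2); the problem data are the same made-up placeholders as in the earlier sketches, and g(i, j) is again simplified to g(i). The past samples are stored so that the min terms can be re-evaluated (repartitioned) at every iteration, which is exactly the overhead the variants below try to avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up placeholders (assumptions, not from the paper).
n, s, alpha, gamma = 4, 2, 0.95, 1.0            # unit stepsize is in the convergence range
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

r = np.zeros(s)
A = 1e-3 * np.eye(s)        # running sum of phi(x_k) phi(x_k)' (small regularization)
b_g = np.zeros(s)           # running sum of phi(x_k) g(x_k, x_{k+1})
xs, xs_next = [], []        # past samples, kept only to repartition each iteration

x = 0
for t in range(5000):
    x_next = int(rng.choice(n, p=P[x]))
    A += np.outer(Phi[x], Phi[x])
    b_g += Phi[x] * g[x]
    xs.append(x); xs_next.append(x_next)

    # Repartition: evaluate min{c(x_{k+1}), phi(x_{k+1})' r_t} for all k <= t.
    nxt = np.array(xs_next)
    targets = np.minimum(c[nxt], Phi[nxt] @ r)
    r_hat = np.linalg.solve(A, b_g + alpha * Phi[np.array(xs)].T @ targets)   # LS solution (2)
    r = r + gamma * (r_hat - r)                                               # damped step (1)
    x = x_next
print("LS Q-learning estimate r =", r)
```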
Convergence Analysis

Express the LS solution in matrix notation as⁸

    Φr̂_{t+1} = Π̂_t F̂_t(Φr_t) = Π̂_t ( ĝ_t + α P̂_t min{c, Φr_t} )                           (3)

With probability 1 (w.p.1), for all t sufficiently large,
• Π̂_t F̂_t is a ‖·‖_π-contraction with modulus α̂ ∈ (α, 1)
• (1 − γ)I + γ Π̂_t F̂_t is a ‖·‖_π-contraction for γ ∈ (0, 2/(1+α))

Proposition. For all γ ∈ (0, 2/(1+α)),  r_t → r*  as t → ∞, w.p.1.

Note: the unit stepsize is in the convergence range.

⁸ Here Π̂_t, ĝ_t, and P̂_t are increasingly accurate simulation-based approximations of Π, g, and P, respectively.
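An editorial worked derivation of where the range γ ∈ (0, 2/(1+α)) comes from; it assumes, consistently with the slide, that the contraction modulus α̂ of Π̂_t F̂_t can be taken arbitrarily close to α once t is large enough.

```latex
\[
\bigl\|\bigl[(1-\gamma)I+\gamma\hat\Pi_t\hat F_t\bigr]Q-\bigl[(1-\gamma)I+\gamma\hat\Pi_t\hat F_t\bigr]\bar Q\bigr\|_\pi
\;\le\; |1-\gamma|\,\|Q-\bar Q\|_\pi+\gamma\hat\alpha\,\|Q-\bar Q\|_\pi
\;=\;\bigl(|1-\gamma|+\gamma\hat\alpha\bigr)\|Q-\bar Q\|_\pi .
\]
```

For γ ∈ (0, 1] the modulus is 1 − γ(1 − α̂) < 1; for γ ∈ (1, 2/(1+α̂)) it is γ(1+α̂) − 1 < 1. Hence the damped mapping is a ‖·‖_π-contraction for all γ ∈ (0, 2/(1+α̂)), and since α̂ approaches α for large t, every γ ∈ (0, 2/(1+α)) is eventually covered.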
Comparison to an LSTD Analogue

    LS Q-learning:     Φr_{t+1} = (1 − γ) Φr_t + γ Π̂_t F̂_t(Φr_t)                           (4)
    LSTD analogue:     Φr̃_{t+1} = Π̂_t F̂_t(Φr̃_{t+1})                                        (5)

Eq. (4) is a single fixed point iteration for solving Eq. (5). Yet the LS Q-learning algorithm and the idealized LSTD algorithm have the same convergence rate [two-time-scale argument, similar to a comparison analysis of LSPE/LSTD (Yu and Bertsekas 2006)]:⁹

Proposition. For all γ ∈ (0, 2/(1+α)),  lim sup_{t→∞} t ‖Φr_t − Φr̃_t‖ < ∞,  w.p.1.

Implications, for all stepsizes γ in the convergence range:
• empirical phenomenon: r_t "tracks" r̃_t
• more precisely: r_t − r̃_t → 0 at the rate of O(1/t), faster than the rate O(1/√t) at which r_t and r̃_t converge to r*

⁹ A coarse explanation: r̃_{t+1} changes slowly, at the rate of O(1/t), and can be viewed as if "frozen" for iteration (4), which, being a contraction mapping, converges at a geometric rate to the vicinity of the "fixed point" r̃_{t+1}.
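An editorial sketch (not from the paper) that runs iteration (4) next to an inner-loop solution of the nonlinear fixed point equation (5) on the same made-up chain as before, so one can watch t‖Φr_t − Φr̃_t‖ remain bounded as the proposition asserts; the 50 inner iterations and all problem data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up placeholders (assumptions, not from the paper).
n, s, alpha, gamma = 4, 2, 0.95, 1.0
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

def sample_map(A, b_g, xs, xs_next):
    """The sample-based mapping r -> coefficients of Pi_t F_t(Phi r)."""
    X0, X1, cn = Phi[xs], Phi[xs_next], c[xs_next]
    return lambda r: np.linalg.solve(A, b_g + alpha * X0.T @ np.minimum(cn, X1 @ r))

A = 1e-3 * np.eye(s); b_g = np.zeros(s)
xs, xs_next = [], []
r = np.zeros(s); r_tilde = np.zeros(s)
x = 0
for t in range(1, 2001):
    x_next = int(rng.choice(n, p=P[x]))
    xs.append(x); xs_next.append(x_next)
    A += np.outer(Phi[x], Phi[x]); b_g += Phi[x] * g[x]
    T = sample_map(A, b_g, np.array(xs), np.array(xs_next))
    r = (1 - gamma) * r + gamma * T(r)      # LS Q-learning: one damped step, Eq. (4)
    for _ in range(50):                     # LSTD analogue: solve fixed point Eq. (5)
        r_tilde = T(r_tilde)
    x = x_next
print("t * ||Phi r_t - Phi r_tilde_t|| =", t * np.linalg.norm(Phi @ (r - r_tilde)))
```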
Variants with Reduced Computation: Motivation

The LS solution

    r̂_{t+1} = ( Σ_{k=0}^t φ(x_k)φ(x_k)' )⁻¹ Σ_{k=0}^t φ(x_k) ( g(x_k, x_{k+1}) + α min{c(x_{k+1}), φ(x_{k+1})'r_t} )

requires extra overhead per iteration: repartitioning the terms

    min{c(x_{k+1}), φ(x_{k+1})'r_t},    k ≤ t

We introduce algorithms with limited repartitioning, at the expense of a likely worse asymptotic convergence rate.
First Variant: Forgo Repartitioning (With an Optimistic Policy Iteration Flavor)

Set of past stopping decisions for the state samples:

    K = { k | c(x_{k+1}) ≤ φ(x_{k+1})'r_k }

Replace the terms min{c(x_{k+1}), φ(x_{k+1})'r_t}, k ≤ t, by

    q̃(x_{k+1}, r_t) = c(x_{k+1})        if k ∈ K,
                      φ(x_{k+1})'r_t    if k ∉ K

Algorithm:

    r_{t+1} = ( Σ_{k=0}^t φ(x_k)φ(x_k)' )⁻¹ ( Σ_{k=0}^t φ(x_k) g(x_k, x_{k+1})
               + α Σ_{k≤t, k∈K} φ(x_k) c(x_{k+1}) + α Σ_{k≤t, k∉K} φ(x_k) φ(x_{k+1})' r_t )

Can be computed recursively; the LSTD approach is also applicable.¹⁰
But we have no proof of convergence at present.¹¹

¹⁰ This is because the right-hand side above is linear in r_t.
¹¹ Note that if the algorithm converges, it converges to the correct solution r*.
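A hedged sketch of the first variant on the same made-up toy problem, with g(i, j) simplified to g(i). Because the right-hand side is linear in r_t, the sums can be accumulated recursively, so no past samples are stored and no repartitioning is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up placeholders (assumptions, not from the paper).
n, s, alpha = 4, 2, 0.95
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

A = 1e-3 * np.eye(s)     # sum of phi(x_k) phi(x_k)'
b = np.zeros(s)          # sum of phi(x_k) g(x_k, x_{k+1}) plus alpha*phi(x_k)c(x_{k+1}) over k in K
C = np.zeros((s, s))     # sum of phi(x_k) phi(x_{k+1})' over k not in K
r = np.zeros(s)
x = 0
for t in range(5000):
    x_next = int(rng.choice(n, p=P[x]))
    A += np.outer(Phi[x], Phi[x])
    b += Phi[x] * g[x]
    # The stopping decision for sample k is frozen using the current r_k: k joins K (or not) once.
    if c[x_next] <= Phi[x_next] @ r:            # k in K: contribute the fixed term c(x_{k+1})
        b += alpha * Phi[x] * c[x_next]
    else:                                       # k not in K: contribute the term linear in r_t
        C += np.outer(Phi[x], Phi[x_next])
    r = np.linalg.solve(A, b + alpha * C @ r)   # r_{t+1} = A^{-1} (b + alpha * C r_t)
    x = x_next
print("first-variant estimate r =", r)
```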
Second Variant: Repartition Within a Finite Window

Repartition at most m times per state sample, m ≥ 1: the window size.

Replace the terms min{c(x_{k+1}), φ(x_{k+1})'r_t}, k ≤ t, by

    min{c(x_{k+1}), φ(x_{k+1})'r_{l_{k,t}}},    l_{k,t} = min{k + m − 1, t}

Algorithm:

    r_{t+1} = arg min_{r∈ℜˢ} Σ_{k=0}^t ( φ(x_k)'r − g(x_k, x_{k+1}) − α min{c(x_{k+1}), φ(x_{k+1})'r_{l_{k,t}}} )²      (6)

Special cases:
• m → ∞: the LS Q-learning algorithm
• m = 1: the fixed point Kalman filter (TD with scaling) (Choi and Van Roy 2006):

    r_{t+1} = r_t + (1/(t+1)) B_t⁻¹ φ(x_t) ( g(x_t, x_{t+1}) + α min{c(x_{t+1}), φ(x_{t+1})'r_t} − φ(x_t)'r_t )
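A hedged sketch of the m = 1 special case (the fixed point Kalman filter) on the same made-up toy problem; it assumes B_t is the running average of φ(x_k)φ(x_k)' over k ≤ t, which the slide does not spell out, and again simplifies g(i, j) to g(i).

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up placeholders (assumptions, not from the paper).
n, s, alpha = 4, 2, 0.95
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

r = np.zeros(s)
S = 1e-3 * np.eye(s)                  # running sum of phi(x_k) phi(x_k)' (regularized)
x = 0
for t in range(50000):
    x_next = int(rng.choice(n, p=P[x]))
    S += np.outer(Phi[x], Phi[x])
    B = S / (t + 1)                   # assumed definition of B_t
    td = g[x] + alpha * min(c[x_next], Phi[x_next] @ r) - Phi[x] @ r
    # m = 1: a scaled TD step; past samples are never repartitioned.
    r = r + (1.0 / (t + 1)) * np.linalg.solve(B, Phi[x] * td)
    x = x_next
print("fixed point Kalman filter estimate r =", r)
```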