Q-learning Algorithms for Optimal Stopping Based on Least Squares

H. Yu¹   D. P. Bertsekas²

¹ Department of Computer Science, University of Helsinki
² Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology

European Control Conference, Kos, Greece, 2007
Outline

Introduction
  Optimal Stopping Problems
  Preliminaries
Least Squares Q-Learning
  Algorithm
  Convergence
  Convergence Rate
Variants with Reduced Computation
  Motivation
  First Variant
  Second Variant
Basic Problem and Bellman Equation

• An irreducible Markov chain with n states and transition matrix P
  Action: stop or continue
  Cost at state i: c(i) if we stop; g(i) if we continue
  Minimize the expected discounted total cost until stopping
• Bellman equations in vector notation:¹

    J* = min{c, g + αPJ*},    Q* = g + αP min{c, Q*}

  Optimal policy: stop as soon as the state hits the set D = {i | c(i) ≤ Q*(i)}
• Applications: search, sequential hypothesis testing, finance
• Focus of this paper: Q-learning with linear function approximation²

¹ α: discount factor; J*: optimal cost; Q*: Q-factor of the continuation action (the cost of continuing for the first stage and using an optimal stopping policy in the remaining stages).
² Q-learning aims to find the Q-factor of each state-action pair, i.e., the vector Q* (the Q-factor vector of the stop action is c).
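To make the Bellman equation and the stopping set D concrete, here is a minimal NumPy sketch; the 4-state chain, costs, and discount factor are made-up placeholders, not from the paper. It computes Q* by iterating the contraction Q ← g + αP min{c, Q} and reads off D.

```python
import numpy as np

# Made-up toy instance (placeholders, not from the paper).
alpha = 0.95
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4],
              [0.1, 0.0, 0.4, 0.5]])
c = np.array([5.0, 4.0, 3.0, 2.0])   # stopping costs c(i)
g = np.array([1.0, 2.0, 1.5, 0.5])   # continuation costs g(i)

# Value iteration on the Q-factor of the continuation action:
# Q <- g + alpha * P * min(c, Q) is a contraction, so it converges to Q*.
Q = np.zeros_like(c)
for _ in range(2000):
    Q = g + alpha * P @ np.minimum(c, Q)

J = np.minimum(c, Q)                              # J* = min{c, Q*}
D = [i for i in range(len(c)) if c[i] <= Q[i]]    # optimal stopping set
print("Q* =", Q, " J* =", J, " stop in states", D)
```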
Q-Learning with Function Approximation (Tsitsiklis and Van Roy 1999)

Subspace approximation:³

    Q = Φr,  where [Φ]_{n×s} is the matrix with rows φ(i)',  i.e.,  Q(i, r) = φ(i)'r

Weighted Euclidean projection:

    ΠQ = arg min_{Φr, r∈ℜˢ} ‖Q − Φr‖_π,    π = (π(1), …, π(n)): invariant distribution of P

Key fact: the DP mapping F, defined by FQ = g + αP min{c, Q}, is a ‖·‖_π-contraction, and so is ΠF.

Temporal difference (TD) learning solves the projected Bellman equation:

    Φr* = ΠF(Φr*)

Suboptimal policy µ: stop as soon as the state hits the set {i | c(i) ≤ φ(i)'r*}.⁴ Then

    Σ_{i=1}^n (J_µ(i) − J*(i)) π(i)  ≤  2 / ((1−α)√(1−α²)) · ‖ΠQ* − Q*‖_π

³ Assume that Φ has linearly independent columns.
⁴ Denote by J_µ the cost of this policy.
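As an editorial illustration of the weighted Euclidean projection (not from the slides; the chain and the 2-dimensional feature matrix are assumptions), Π can be formed explicitly as Φ(Φ'DΦ)⁻¹Φ'D with D = diag(π):

```python
import numpy as np

# Made-up example: 4-state chain and two basis functions (assumptions).
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4],
              [0.1, 0.0, 0.4, 0.5]])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])          # rows are phi(i)'

# Invariant distribution pi of P: normalized left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

# Projection onto span(Phi) in the pi-weighted Euclidean norm:
# Pi = Phi (Phi' D Phi)^{-1} Phi' D,  D = diag(pi).
D = np.diag(pi)
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

# Sanity checks: Pi is a projection and leaves the subspace fixed.
assert np.allclose(Pi @ Pi, Pi)
assert np.allclose(Pi @ Phi, Phi)
```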
Basis of Least Squares Methods I: Projected Value Iteration

Simulation: (x_0, x_1, …) is the unstopped state process; it implicitly approximates ΠF with increasing accuracy.

Projected value iteration and LSPE (Bertsekas and Ioffe 1996):⁵

    Projected value iteration:   Φr_{t+1} = ΠF(Φr_t)
    LSPE:                        Φr_{t+1} = Π̂_t F̂_t(Φr_t) = ΠF(Φr_t) + ε_t

[Figure: two diagrams, "Projected Value Iteration" and "Least Squares Policy Evaluation (LSPE)": the value iterate F(Φr_t) is projected on S, the subspace spanned by the basis functions, to give Φr_{t+1}; in the LSPE diagram the projection is perturbed by a simulation error.]

⁵ Roughly speaking, Π̂_t F̂_t → ΠF and ε_t → 0 as t → ∞.
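For intuition, a hedged sketch (not from the paper) of exact projected value iteration Φr_{t+1} = ΠF(Φr_t) for the optimal stopping mapping F; LSPE and the least squares Q-learning algorithm of the next slides replace Π and F by simulation-based estimates. The toy problem data are assumptions.

```python
import numpy as np

def projected_value_iteration(P, g, c, alpha, Phi, n_iter=500):
    """Exact iteration Phi r_{t+1} = Pi F(Phi r_t), with F(Q) = g + alpha*P*min(c, Q)."""
    # Invariant distribution and pi-weighted projection coefficients.
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()
    D = np.diag(pi)
    M = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)   # maps Q to coefficients of Pi Q

    r = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        FQ = g + alpha * P @ np.minimum(c, Phi @ r)   # DP mapping F
        r = M @ FQ                                    # projection step
    return r

# Made-up toy instance (assumption, not from the paper).
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
r_star = projected_value_iteration(P, g, c, 0.95, Phi)
print("approximate Q-factors:", Phi @ r_star)
```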
Basis of Least Squares Methods II: Solving an Approximate Projected Bellman Equation

LSTD (Bradtke and Barto 1996, Boyan 1999): find r_{t+1} solving an approximate projected Bellman equation

    Φr_{t+1} = Π̂_t F̂_t(Φr_{t+1})

Not viable for optimal stopping, because F is nonlinear.⁶

Comparison with the temporal difference learning algorithm (Tsitsiklis and Van Roy 1999):⁷

    r_{t+1} = r_t + γ_t φ(x_t) ( g(x_t, x_{t+1}) + α min{c(x_{t+1}), φ(x_{t+1})'r_t} − φ(x_t)'r_t )

• TD: uses each sample state only once; by averaging over a long time interval, it approximately performs the mapping ΠF
• Least squares (LS) methods: use the past information effectively; no need to store the past (in the policy evaluation context)

⁶ In the case of policy evaluation, this is a linear equation and can be solved efficiently.
⁷ Abusing notation, we denote by g(i, j) the one-stage cost of transitioning from state i to j under the continuation action.
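For comparison, a minimal simulation sketch of the Tsitsiklis-Van Roy TD update above; only the update rule comes from the slide, while the chain, costs, features, and the stepsize sequence γ_t are assumptions (and g(i, j) is simplified to a state-dependent cost g(i)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up problem data (assumptions, not from the paper).
n, s, alpha = 4, 2, 0.95
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5])   # continuation costs, used in place of g(i, j)
c = np.array([5.0, 4.0, 3.0, 2.0])   # stopping costs
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

r = np.zeros(s)
x = 0
for t in range(200000):
    x_next = int(rng.choice(n, p=P[x]))
    gamma_t = 10.0 / (t + 100.0)      # a diminishing stepsize (one common choice)
    # Temporal difference for the optimal stopping problem.
    td = g[x] + alpha * min(c[x_next], Phi[x_next] @ r) - Phi[x] @ r
    r = r + gamma_t * Phi[x] * td     # TD update: uses the current sample only
    x = x_next
print("TD estimate r =", r)
```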
Least Squares Q-Learning: The Algorithm

(x_0, x_1, …) unstopped state process, γ ∈ (0, 2/(1+α)) constant stepsize

    r_{t+1} = r_t + γ (r̂_{t+1} − r_t)                                                      (1)

where r̂_{t+1} is the LS solution:

    r̂_{t+1} = arg min_{r∈ℜˢ} Σ_{k=0}^t ( φ(x_k)'r − g(x_k, x_{k+1}) − α min{c(x_{k+1}), φ(x_{k+1})'r_t} )²      (2)

r̂_{t+1} can be computed almost recursively:

    r̂_{t+1} = ( Σ_{k=0}^t φ(x_k)φ(x_k)' )⁻¹ Σ_{k=0}^t φ(x_k) ( g(x_k, x_{k+1}) + α min{c(x_{k+1}), φ(x_{k+1})'r_t} )

except that computing the terms min{c(x_{k+1}), φ(x_{k+1})'r_t}, k ≤ t, requires repartitioning the past states into stopping and continuation sets (a remedy will be discussed later)
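A hedged NumPy sketch of iterations (1)-(2); the problem data are the same made-up placeholders as in the earlier sketches, and g(i, j) is again simplified to g(i). The past samples are stored so that the min terms can be re-evaluated (repartitioned) at every iteration, which is exactly the overhead the variants below try to avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up placeholders (assumptions, not from the paper).
n, s, alpha, gamma = 4, 2, 0.95, 1.0            # unit stepsize is in the convergence range
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

r = np.zeros(s)
A = 1e-3 * np.eye(s)        # running sum of phi(x_k) phi(x_k)' (small regularization)
b_g = np.zeros(s)           # running sum of phi(x_k) g(x_k, x_{k+1})
xs, xs_next = [], []        # past samples, kept only to repartition each iteration

x = 0
for t in range(5000):
    x_next = int(rng.choice(n, p=P[x]))
    A += np.outer(Phi[x], Phi[x])
    b_g += Phi[x] * g[x]
    xs.append(x); xs_next.append(x_next)

    # Repartition: evaluate min{c(x_{k+1}), phi(x_{k+1})' r_t} for all k <= t.
    nxt = np.array(xs_next)
    targets = np.minimum(c[nxt], Phi[nxt] @ r)
    r_hat = np.linalg.solve(A, b_g + alpha * Phi[np.array(xs)].T @ targets)   # LS solution (2)
    r = r + gamma * (r_hat - r)                                               # damped step (1)
    x = x_next
print("LS Q-learning estimate r =", r)
```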
Convergence Analysis

Express the LS solution in matrix notation as⁸

    Φr̂_{t+1} = Π̂_t F̂_t(Φr_t) = Π̂_t ( ĝ_t + α P̂_t min{c, Φr_t} )                           (3)

With probability 1 (w.p.1), for all t sufficiently large,
• Π̂_t F̂_t is a ‖·‖_π-contraction with modulus α̂ ∈ (α, 1)
• (1 − γ)I + γ Π̂_t F̂_t is a ‖·‖_π-contraction for γ ∈ (0, 2/(1+α))

Proposition. For all γ ∈ (0, 2/(1+α)),  r_t → r*  as t → ∞, w.p.1.

Note: the unit stepsize is in the convergence range.

⁸ Here Π̂_t, ĝ_t, and P̂_t are increasingly accurate simulation-based approximations of Π, g, and P, respectively.
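An editorial worked derivation of where the range γ ∈ (0, 2/(1+α)) comes from; it assumes, consistently with the slide, that the contraction modulus α̂ of Π̂_t F̂_t can be taken arbitrarily close to α once t is large enough.

```latex
\[
\bigl\|\bigl[(1-\gamma)I+\gamma\hat\Pi_t\hat F_t\bigr]Q-\bigl[(1-\gamma)I+\gamma\hat\Pi_t\hat F_t\bigr]\bar Q\bigr\|_\pi
\;\le\; |1-\gamma|\,\|Q-\bar Q\|_\pi+\gamma\hat\alpha\,\|Q-\bar Q\|_\pi
\;=\;\bigl(|1-\gamma|+\gamma\hat\alpha\bigr)\|Q-\bar Q\|_\pi .
\]
```

For γ ∈ (0, 1] the modulus is 1 − γ(1 − α̂) < 1; for γ ∈ (1, 2/(1+α̂)) it is γ(1+α̂) − 1 < 1. Hence the damped mapping is a ‖·‖_π-contraction for all γ ∈ (0, 2/(1+α̂)), and since α̂ approaches α for large t, every γ ∈ (0, 2/(1+α)) is eventually covered.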
Comparison to an LSTD Analogue

    LS Q-learning:     Φr_{t+1} = (1 − γ) Φr_t + γ Π̂_t F̂_t(Φr_t)                           (4)
    LSTD analogue:     Φr̃_{t+1} = Π̂_t F̂_t(Φr̃_{t+1})                                        (5)

Eq. (4) is a single fixed point iteration for solving Eq. (5). Yet the LS Q-learning algorithm and the idealized LSTD algorithm have the same convergence rate [two-time-scale argument, similar to a comparison analysis of LSPE/LSTD (Yu and Bertsekas 2006)]:⁹

Proposition. For all γ ∈ (0, 2/(1+α)),  lim sup_{t→∞} t ‖Φr_t − Φr̃_t‖ < ∞,  w.p.1.

Implications, for all stepsizes γ in the convergence range:
• empirical phenomenon: r_t "tracks" r̃_t
• more precisely: r_t − r̃_t → 0 at the rate of O(1/t), faster than the rate O(1/√t) at which r_t and r̃_t converge to r*

⁹ A coarse explanation: r̃_{t+1} changes slowly, at the rate of O(1/t), and can be viewed as if "frozen" for iteration (4), which, being a contraction mapping, converges at a geometric rate to the vicinity of the "fixed point" r̃_{t+1}.
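An editorial sketch (not from the paper) that runs iteration (4) next to an inner-loop solution of the nonlinear fixed point equation (5) on the same made-up chain as before, so one can watch t‖Φr_t − Φr̃_t‖ remain bounded as the proposition asserts; the 50 inner iterations and all problem data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up placeholders (assumptions, not from the paper).
n, s, alpha, gamma = 4, 2, 0.95, 1.0
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

def sample_map(A, b_g, xs, xs_next):
    """The sample-based mapping r -> coefficients of Pi_t F_t(Phi r)."""
    X0, X1, cn = Phi[xs], Phi[xs_next], c[xs_next]
    return lambda r: np.linalg.solve(A, b_g + alpha * X0.T @ np.minimum(cn, X1 @ r))

A = 1e-3 * np.eye(s); b_g = np.zeros(s)
xs, xs_next = [], []
r = np.zeros(s); r_tilde = np.zeros(s)
x = 0
for t in range(1, 2001):
    x_next = int(rng.choice(n, p=P[x]))
    xs.append(x); xs_next.append(x_next)
    A += np.outer(Phi[x], Phi[x]); b_g += Phi[x] * g[x]
    T = sample_map(A, b_g, np.array(xs), np.array(xs_next))
    r = (1 - gamma) * r + gamma * T(r)      # LS Q-learning: one damped step, Eq. (4)
    for _ in range(50):                     # LSTD analogue: solve fixed point Eq. (5)
        r_tilde = T(r_tilde)
    x = x_next
print("t * ||Phi r_t - Phi r_tilde_t|| =", t * np.linalg.norm(Phi @ (r - r_tilde)))
```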
Variants with Reduced Computation: Motivation

The LS solution

    r̂_{t+1} = ( Σ_{k=0}^t φ(x_k)φ(x_k)' )⁻¹ Σ_{k=0}^t φ(x_k) ( g(x_k, x_{k+1}) + α min{c(x_{k+1}), φ(x_{k+1})'r_t} )

requires extra overhead per iteration: repartitioning the terms

    min{c(x_{k+1}), φ(x_{k+1})'r_t},    k ≤ t

We introduce algorithms with limited repartitioning, at the expense of a likely worse asymptotic convergence rate.
First Variant: Forgo Repartitioning (With an Optimistic Policy Iteration Flavor)

Set of past stopping decisions for the state samples:

    K = { k | c(x_{k+1}) ≤ φ(x_{k+1})'r_k }

Replace the terms min{c(x_{k+1}), φ(x_{k+1})'r_t}, k ≤ t, by

    q̃(x_{k+1}, r_t) = c(x_{k+1})        if k ∈ K,
                      φ(x_{k+1})'r_t    if k ∉ K

Algorithm:

    r_{t+1} = ( Σ_{k=0}^t φ(x_k)φ(x_k)' )⁻¹ ( Σ_{k=0}^t φ(x_k) g(x_k, x_{k+1})
               + α Σ_{k≤t, k∈K} φ(x_k) c(x_{k+1}) + α Σ_{k≤t, k∉K} φ(x_k) φ(x_{k+1})' r_t )

Can be computed recursively; the LSTD approach is also applicable.¹⁰
But we have no proof of convergence at present.¹¹

¹⁰ This is because the right-hand side above is linear in r_t.
¹¹ Note that if the algorithm converges, it converges to the correct solution r*.
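A hedged sketch of the first variant on the same made-up toy problem, with g(i, j) simplified to g(i). Because the right-hand side is linear in r_t, the sums can be accumulated recursively, so no past samples are stored and no repartitioning is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up placeholders (assumptions, not from the paper).
n, s, alpha = 4, 2, 0.95
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

A = 1e-3 * np.eye(s)     # sum of phi(x_k) phi(x_k)'
b = np.zeros(s)          # sum of phi(x_k) g(x_k, x_{k+1}) plus alpha*phi(x_k)c(x_{k+1}) over k in K
C = np.zeros((s, s))     # sum of phi(x_k) phi(x_{k+1})' over k not in K
r = np.zeros(s)
x = 0
for t in range(5000):
    x_next = int(rng.choice(n, p=P[x]))
    A += np.outer(Phi[x], Phi[x])
    b += Phi[x] * g[x]
    # The stopping decision for sample k is frozen using the current r_k: k joins K (or not) once.
    if c[x_next] <= Phi[x_next] @ r:            # k in K: contribute the fixed term c(x_{k+1})
        b += alpha * Phi[x] * c[x_next]
    else:                                       # k not in K: contribute the term linear in r_t
        C += np.outer(Phi[x], Phi[x_next])
    r = np.linalg.solve(A, b + alpha * C @ r)   # r_{t+1} = A^{-1} (b + alpha * C r_t)
    x = x_next
print("first-variant estimate r =", r)
```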
Second Variant: Repartition Within a Finite Window

Repartition at most m times per state sample, m ≥ 1: the window size.

Replace the terms min{c(x_{k+1}), φ(x_{k+1})'r_t}, k ≤ t, by

    min{c(x_{k+1}), φ(x_{k+1})'r_{l_{k,t}}},    l_{k,t} = min{k + m − 1, t}

Algorithm:

    r_{t+1} = arg min_{r∈ℜˢ} Σ_{k=0}^t ( φ(x_k)'r − g(x_k, x_{k+1}) − α min{c(x_{k+1}), φ(x_{k+1})'r_{l_{k,t}}} )²      (6)

Special cases:
• m → ∞: the LS Q-learning algorithm
• m = 1: the fixed point Kalman filter (TD with scaling) (Choi and Van Roy 2006):

    r_{t+1} = r_t + (1/(t+1)) B_t⁻¹ φ(x_t) ( g(x_t, x_{t+1}) + α min{c(x_{t+1}), φ(x_{t+1})'r_t} − φ(x_t)'r_t )
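A hedged sketch of the m = 1 special case (the fixed point Kalman filter) on the same made-up toy problem; it assumes B_t is the running average of φ(x_k)φ(x_k)' over k ≤ t, which the slide does not spell out, and again simplifies g(i, j) to g(i).

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up placeholders (assumptions, not from the paper).
n, s, alpha = 4, 2, 0.95
P = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0],
              [0.0, 0.3, 0.3, 0.4], [0.1, 0.0, 0.4, 0.5]])
g = np.array([1.0, 2.0, 1.5, 0.5]); c = np.array([5.0, 4.0, 3.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

r = np.zeros(s)
S = 1e-3 * np.eye(s)                  # running sum of phi(x_k) phi(x_k)' (regularized)
x = 0
for t in range(50000):
    x_next = int(rng.choice(n, p=P[x]))
    S += np.outer(Phi[x], Phi[x])
    B = S / (t + 1)                   # assumed definition of B_t
    td = g[x] + alpha * min(c[x_next], Phi[x_next] @ r) - Phi[x] @ r
    # m = 1: a scaled TD step; past samples are never repartitioned.
    r = r + (1.0 / (t + 1)) * np.linalg.solve(B, Phi[x] * td)
    x = x_next
print("fixed point Kalman filter estimate r =", r)
```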