  1. Q-LEARNING WITHOUT STOCHASTIC APPROXIMATION
Vivek S. Borkar, IIT Bombay
Mar. 23, 2015, IIT, Chennai
Joint work with Dileep Kalathil (University of California, Berkeley) and Rahul Jain (University of Southern California)
Work supported in part by the Department of Science and Technology

  2. OUTLINE
1. Markov Decision Processes (discounted cost)
2. Value/Q-value iteration algorithms
3. Classical Q-learning
4. Main results

  3. $\{X_n,\ n \ge 0\}$ is a controlled Markov chain with:
• a finite state space $S = \{1, 2, \cdots, s\}$,
• a finite action space $A = \{a_1, \cdots, a_d\}$,
• an $A$-valued control process $\{Z_n,\ n \ge 0\}$,

  4. • a controlled transition probability function $p(j \mid i, u)$, $i, j \in S$, $u \in A$, such that
$$P(X_{n+1} = i \mid X_m, Z_m,\ m \le n) = p(i \mid X_n, Z_n) \quad \forall\, n,$$
i.e., the probability of going from $X_n = j$ (say) to $i$ under action $Z_n = u$ (say) is $p(i \mid j, u)$.

  5. Say that $\{Z_n\}$ is:
• admissible if the above holds,
• randomized stationary Markov if $P(Z_n = u \mid \mathcal{F}_{n-1}, X_n = x) = (\varphi(x))(u)\ \forall\, n$ for some $\varphi : S \mapsto \mathcal{P}(A)$,
• stationary Markov if $Z_n = v(X_n)\ \forall\, n$ for some $v : S \mapsto A$.

  6. With an abuse of terminology, the last two are identified with $\varphi$, $v$ resp.
Objective: Minimize the discounted cost
$$J_i(\{Z_n\}) := E\left[\sum_{m=0}^{\infty} \beta^m c(X_m, Z_m) \,\Big|\, X_0 = i\right],$$
where
• $c : S \times A \mapsto \mathbb{R}$ is a prescribed 'running cost' function,
• $\beta \in (0, 1)$ is the discount factor.

  7. Dynamic Programming
Define the 'value function' $V : S \mapsto \mathbb{R}$ by $V(i) = \inf_{\{Z_n\}} J_i(\{Z_n\})$. Then by the 'dynamic programming principle',
$$V(i) = \min_{u} \left[ c(i, u) + \beta \sum_{j} p(j \mid i, u) V(j) \right], \quad i \in S.$$
This is the associated dynamic programming equation. Furthermore, if the minimum on the right is attained at $u = v^*(i)$, then the stationary Markov policy $v^*(\cdot)$ is optimal. The converse also holds.

  8. The DP equation is a fixed point equation: $V = F(V)$, where $F(x) = [F_1(x), \cdots, F_s(x)]^T$ with
$$F_i(x) := \min_{u} \Big[ c(i, u) + \beta \sum_{j} p(j \mid i, u)\, x_j \Big].$$
Then $\|F(x) - F(y)\|_\infty \le \beta \|x - y\|_\infty$, i.e., $F$ is an $\|\cdot\|_\infty$-contraction $\Longrightarrow$ $V$ is the unique solution to the DP equation and the 'value iteration scheme'
$$V_{n+1}(i) = \min_{u} \Big[ c(i, u) + \beta \sum_{j} p(j \mid i, u) V_n(j) \Big], \quad n \ge 0,$$
converges exponentially to $V$.
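As an illustration (not part of the slides), here is a minimal value iteration sketch in Python/NumPy on a synthetic MDP; the sizes, the randomly generated $p$ and $c$, and all variable names are assumptions made for the example.

```python
# Minimal value iteration sketch on a synthetic MDP (illustrative sizes/names).
import numpy as np

rng = np.random.default_rng(0)
s, d, beta = 5, 3, 0.9                  # states, actions, discount factor
p = rng.random((d, s, s))               # p[u, i, j] = p(j | i, u)
p /= p.sum(axis=2, keepdims=True)       # normalise rows into probabilities
c = rng.random((s, d))                  # running cost c(i, u)

def F(V):
    """Bellman operator: F_i(V) = min_u [ c(i,u) + beta * sum_j p(j|i,u) V(j) ]."""
    return (c + beta * np.einsum('uij,j->iu', p, V)).min(axis=1)

V = np.zeros(s)
for _ in range(500):                    # contraction => geometric convergence
    V_new = F(V)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
```

Since $F$ is a $\beta$-contraction in the sup norm, the sup-norm stopping rule above is justified.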

  9. Other schemes: policy iteration, linear programming (primal/dual).
Problematic if:
• (i) $p(\cdot \mid \cdot, \cdot)$ is unknown, or,
• (ii) $p(\cdot \mid \cdot, \cdot)$ is known, but too complex (e.g., an extremely large state space).

  10. Sometimes simulation of the system is 'easy', e.g., when the system is composed of a large number of interconnected simple components whose individual transitions are easy to simulate (e.g., queuing networks, robots). This has motivated simulation-based schemes for approximate dynamic programming, based on stochastic approximation versions of classical iterative schemes ('reinforcement learning', 'approximate dynamic programming', 'neurodynamic programming').

  11. Q-learning: a simulation-based scheme for approximate dynamic programming due to C. J. C. H. Watkins (1992).
Define Q-values
$$Q(i, u) := c(i, u) + \beta \sum_{j} p(j \mid i, u) V(j), \quad i \in S,\ u \in A.$$
Then
$$V(i) = \min_{u} Q(i, u), \qquad Q(i, u) = c(i, u) + \beta \sum_{j} p(j \mid i, u) \min_{a} Q(j, a).$$
This is the 'DP equation' for Q-values.

  12. Again, the last equation is of the form $Q = G(Q)$, where
$$\|G(x) - G(y)\|_\infty \le \beta \|x - y\|_\infty.$$
Thus we have the 'Q-value iteration'
$$Q_{n+1}(i, u) = c(i, u) + \beta \sum_{j} p(j \mid i, u) \min_{a} Q_n(j, a), \quad n \ge 0.$$
Then $Q_n \to$ the unique solution to the Q-DP equation. Furthermore, $v^*(i) \in \operatorname{Argmin} Q(i, \cdot)$, $i \in S$, yields an optimal stationary Markov policy $v^*$.
Note $V_n \in \mathbb{R}^s$, $Q_n \in \mathbb{R}^{s \times d}$ $\Longrightarrow$ no motivation to do Q-value iteration.
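A corresponding Q-value iteration sketch, again on an illustrative synthetic MDP (all names and parameters are assumptions):

```python
# Minimal Q-value iteration sketch on a synthetic MDP (illustrative sizes/names).
import numpy as np

rng = np.random.default_rng(0)
s, d, beta = 5, 3, 0.9
p = rng.random((d, s, s)); p /= p.sum(axis=2, keepdims=True)   # p[u, i, j] = p(j|i,u)
c = rng.random((s, d))                                         # c(i, u)

def G(Q):
    """G(Q)(i,u) = c(i,u) + beta * sum_j p(j|i,u) * min_a Q(j,a)."""
    return c + beta * np.einsum('uij,j->iu', p, Q.min(axis=1))

Q = np.zeros((s, d))
for _ in range(500):
    Q = G(Q)

v_star = Q.argmin(axis=1)     # a greedy (cost-minimising) stationary Markov policy
V = Q.min(axis=1)             # recover V(i) = min_u Q(i, u)
```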

  13. However, one big change from value iteration: the nonlinearity (minimization over $A$) is now inside the averaging $\Longrightarrow$ one can use an incremental method based on stochastic approximation.
Advantage: can be based upon simulation; low computation per iterate.
Disadvantage: slow convergence.

  14. Stochastic Approximation
Robbins-Monro scheme:
$$x(n+1) = x(n) + a(n)[h(x(n)) + M(n+1)].$$
Here, for $\mathcal{F}_n := \sigma(x(0), M(k),\ k \le n)$ (i.e., the 'history till time $n$'),
• $a(n) > 0$ with $\sum_n a(n) = \infty$, $\sum_n a(n)^2 < \infty$, and,
• $\{M(n)\}$ is a martingale difference sequence: $E[M(n+1) \mid \mathcal{F}_n] = 0\ \forall\, n$.

  15. Need: $h$ Lipschitz and $E[\|M(n+1)\|^2 \mid \mathcal{F}_n] \le K(1 + \|x(n)\|^2)$.
Typically,
$$x(n+1) = x(n) + a(n) f(x(n), \xi(n+1)),$$
with $\{\xi(n)\}$ IID. Then set
$$h(x) = E[f(x, \xi_n)], \qquad M(n+1) = f(x(n), \xi(n+1)) - h(x(n)).$$
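A minimal Robbins-Monro sketch, assuming the simplest setting $f(x, \xi) = \xi - x$, so that $h(x) = \theta - x$ with root $\theta$ (the constants below are purely illustrative):

```python
# Robbins-Monro sketch: find the root of h(x) = theta - x from noisy samples.
import numpy as np

rng = np.random.default_rng(0)
theta = 2.5                        # unknown mean to be recovered
x = 0.0
for n in range(1, 10_001):
    a_n = 1.0 / n                  # a(n) > 0, sum a(n) = inf, sum a(n)^2 < inf
    xi = theta + rng.normal()      # IID observation xi(n+1)
    x += a_n * (xi - x)            # x(n+1) = x(n) + a(n) f(x(n), xi(n+1))
# here M(n+1) = xi - theta is a martingale difference; x converges to theta a.s.
```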

  16. 'ODE' approach (Derevitskii-Fradkov, Ljung): Treat the iteration as a noisy discretization of the ODE
$$\dot{x}(t) = h(x(t)).$$
If this has $x^*$ as its unique asymptotically stable equilibrium, then
$$\sup_n \|x(n)\| < \infty \Longrightarrow x(n) \to x^* \text{ a.s.}$$
(The LHS needs separate 'stability' tests.)

  17. Idea of proof: Treat the iteration as a noisy discretization of the ODE. Specifically,
• define $\bar{x}(t)$, $t \ge 0$, by $\bar{x}\big(\sum_{m=0}^{n-1} a(m)\big) := x(n)$, with linear interpolation,
• compare $\bar{x}(s)$, $t \le s \le t + T$, with the ODE trajectory on the same time interval with the same initial condition,

  18. • The Gronwall inequality yields a bound in terms of the discretization error and the error due to noise,
• verify that these errors go to zero asymptotically (the latter follows by martingale arguments, using the square-summability of $\{a(n)\}$),
• use either a Liapunov function argument (when available) or a characterization of the limit set (Benaim) to conclude.

  19. Synchronous Q-learning:
1. Replace the conditional average $\sum_j p(j \mid i, u) \min_a Q_n(j, a)$ by evaluation at an actual simulated sample:
$$\min_a Q_n(\xi_{i,u}(n+1), a),$$
where $\xi_{i,u}(n+1) \approx p(\cdot \mid i, u)$.
2. Replace the 'full move' by an incremental move, i.e., a convex combination of the previous iterate and the correction term due to the new observation.

  20. The algorithm is:
$$Q_{n+1}(i, u) = (1 - a(n)) Q_n(i, u) + a(n)\big[c(i, u) + \beta \min_{u'} Q_n(\xi_{i,u}(n+1), u')\big]$$
$$= Q_n(i, u) + a(n)\big[c(i, u) + \beta \min_{u'} Q_n(\xi_{i,u}(n+1), u') - Q_n(i, u)\big].$$
The limiting ODE
$$\dot{x}(t) = G(x(t)) - x(t)$$
has the desired $Q$ as its globally asymptotically stable equilibrium ($\|x - Q\|_\infty$ works as a Liapunov function) $\Longrightarrow$ a.s. convergence to $Q$ (stability is proved separately).
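A sketch of the synchronous scheme on a toy MDP in which the 'simulator' simply samples from a known kernel; the step sizes $a(n) = 1/n$ and all names are assumptions for the example:

```python
# Synchronous Q-learning sketch: every (i, u) pair updated at every step using
# one simulated next state per pair (toy MDP, illustrative names).
import numpy as np

rng = np.random.default_rng(0)
s, d, beta = 5, 3, 0.9
p = rng.random((d, s, s)); p /= p.sum(axis=2, keepdims=True)   # simulator's p(j|i,u)
c = rng.random((s, d))

Q = np.zeros((s, d))
for n in range(1, 5001):
    a_n = 1.0 / n                                    # Robbins-Monro step size
    for i in range(s):
        for u in range(d):
            j = rng.choice(s, p=p[u, i])             # xi_{i,u}(n+1) ~ p(.|i,u)
            target = c[i, u] + beta * Q[j].min()     # c(i,u) + beta * min_{u'} Q_n(j, u')
            Q[i, u] += a_n * (target - Q[i, u])      # incremental (convex-combination) move
```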

  21. Asynchronous version (single simulation case):
$$Q_{n+1}(i, u) = Q_n(i, u) + a(n)\, I\{X_n = i, Z_n = u\} \times \big[c(i, u) + \beta \min_{u'} Q_n(X_{n+1}, u') - Q_n(i, u)\big].$$
Limiting ODE: $\dot{x}(t) = \Lambda(t)(G(x(t)) - x(t))$, with $\Lambda(\cdot)$ diagonal and non-negative ('relative frequencies').
Convergence to $Q$ if the diagonal elements of $\Lambda(\cdot)$ are bounded away from zero $\Longleftrightarrow$ all pairs $(i, u)$ are sampled comparably often. ('Infinitely often' suffices (Yu-Bertsekas).)
Problem: slow!
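A sketch of the asynchronous, single-trajectory version. The uniformly random exploration policy and the step-size choice below are assumptions added so that all pairs $(i, u)$ keep getting sampled; the slide itself does not prescribe them:

```python
# Asynchronous Q-learning sketch along a single simulated trajectory
# (toy MDP, illustrative names; exploration policy is an added assumption).
import numpy as np

rng = np.random.default_rng(0)
s, d, beta = 5, 3, 0.9
p = rng.random((d, s, s)); p /= p.sum(axis=2, keepdims=True)
c = rng.random((s, d))

Q = np.zeros((s, d))
x = 0                                                # X_0
for n in range(1, 200_001):
    a_n = 1.0 / n                                    # step size a(n)
    u = rng.integers(d)                              # Z_n: uniformly random exploratory action
    x_next = rng.choice(s, p=p[u, x])                # X_{n+1} ~ p(.|x, u)
    # only the visited pair (X_n, Z_n) is updated, per the indicator I{X_n = i, Z_n = u}
    Q[x, u] += a_n * (c[x, u] + beta * Q[x_next].min() - Q[x, u])
    x = x_next
```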

  22. Non-incremental Q-learning
Fix $N :=$ the number of samples per stage. The algorithm is:
$$Q_{n+1}(i, u) = c(i, u) + \beta \frac{1}{N} \sum_{m=1}^{N} \min_a Q_n\big(\xi^m_{i,u}(n+1), a\big),$$
where:
• $\{\xi^m_{i,u}(n)\}$ are IID $\approx p(\cdot \mid i, u)$ for each $(i, u)$, and,
• $\{\xi^m_{i,u}(n)\}_{i,u,m,n}$ are independent.
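A sketch of the non-incremental ('full move') update, drawing $N$ fresh simulated next states per $(i, u)$ pair at each stage (toy MDP; all names and sizes are illustrative):

```python
# Non-incremental Q-learning sketch: full update from N simulated samples per pair.
import numpy as np

rng = np.random.default_rng(0)
s, d, beta, N = 5, 3, 0.9, 10
p = rng.random((d, s, s)); p /= p.sum(axis=2, keepdims=True)
c = rng.random((s, d))

Q = np.zeros((s, d))
for n in range(200):
    Q_new = np.empty_like(Q)
    for i in range(s):
        for u in range(d):
            xi = rng.choice(s, size=N, p=p[u, i])            # N IID samples ~ p(.|i,u)
            Q_new[i, u] = c[i, u] + beta * Q[xi].min(axis=1).mean()
    Q = Q_new                                                # full ('non-incremental') move
```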

  23. This is equivalent to
$$Q_{n+1}(i, u) = c(i, u) + \beta \sum_{j} \tilde{p}^{(n)}(j \mid i, u) \min_a Q_n(j, a),$$
where $\tilde{p}^{(n)}(\cdot \mid i, u)$ are the empirical transition probabilities given by
$$\tilde{p}^{(n)}(j \mid i, u) := \frac{1}{N} \sum_{m=1}^{N} I\{\xi^m_{i,u}(n+1) = j\}.$$
For a fixed sample run, we can view this as 'quenched' randomness, leading to a time-dependent sequence of transition matrices.
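A quick numerical check of this equivalence for a single $(i, u)$ pair and one stage ($Q$ and the samples below are arbitrary placeholders):

```python
# Averaging min_a Q(xi^m, a) over N samples equals using the empirical
# transition probabilities built from the same samples.
import numpy as np

rng = np.random.default_rng(1)
s, d, N = 5, 3, 10
Q = rng.random((s, d))
xi = rng.integers(s, size=N)                     # xi^m_{i,u}(n+1), m = 1, ..., N

sample_avg = Q[xi].min(axis=1).mean()            # (1/N) sum_m min_a Q(xi^m, a)
p_emp = np.bincount(xi, minlength=s) / N         # empirical p~(n)(j | i, u)
via_empirical = p_emp @ Q.min(axis=1)            # sum_j p~(j|i,u) min_a Q(j, a)
assert np.isclose(sample_avg, via_empirical)
```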

  24. Claim: $Q_n \to Q$ a.s.!
Empirical observation: convergence is extremely fast initially, to a 'ballpark' estimate, and then very slow $\Longrightarrow$ one can consider hybrid schemes where one switches to stochastic approximation after the initial period.

  25. Idea of proof
Consider a controlled Markov chain $\{X_n\}$ governed by the time-inhomogeneous transition probabilities $\tilde{p}^{(n)}(j \mid i, u)$, $n \ge 0$.
$V_n$ in value iteration (always) has the interpretation of being the optimal finite-horizon cost with 'terminal cost' $V_0$, i.e.,
$$V_n(i) = \min_{\{Z_n\}} E\left[\sum_{m=0}^{n-1} \beta^m c(X_m, Z_m) + \beta^n V_0(X_n) \,\Big|\, X_0 = i\right].$$

  26. Thus
$$V_n(i) = E\left[\sum_{m=0}^{n-1} \beta^m c(X^*_m, v^*(m, X^*_m)) + \beta^n V_0(X^*_n) \,\Big|\, X^*_0 = i\right],$$
where $(X^*_n, v^*(n, X^*_n))$ is the optimal state-control process, defined consistently because the function $v^*(n, \cdot)$ depends on the remaining time horizon. Similarly,
$$Q_n(i, u) = E\left[\sum_{m=0}^{n-1} \beta^m c(X^*_m, Z^*_m) + \beta^n \min_a Q_0(X^*_n, a) \,\Big|\, X^*_0 = i\right],$$
where $Z^*_0 = u$ and $Z^*_n = v^*(n, X^*_n)$ thereafter.
