Multigrid methods for two player zero-sum stochastic games
Sylvie Detournay
INRIA Saclay and CMAP, École Polytechnique
PhD thesis defense, September 25, 2012

Sections: Discounted Stochastic Games; Stochastic Games with mean payoff
Sylvie Detournay (INRIA and CMAP), Zero-sum two player stochastic games, September 25, 2012

Outline

Zero-sum two player stochastic game with discounted payoff
    Dynamic Programming equations
    Policy iteration and multigrids: AMGπ
    Numerical results
Zero-sum two player stochastic game with mean payoff
    Unichain case
        Dynamic Programming equations
        Policy iteration and multigrids: AMGπ
        Numerical results
    Multichain case
        Dynamic Programming equations
        Policy iteration for multichain
        Numerical results
Conclusions

Dynamic programming equation of zero-sum two-player stochastic games

$$ v(x) = \max_{a \in \mathcal{A}(x)} \min_{b \in \mathcal{B}(x,a)} \Big[ \sum_{y \in \mathcal{X}} \gamma P(y \mid x,a,b)\, v(y) + r(x,a,b) \Big], \qquad \forall x \in \mathcal{X} \tag{DP} $$

$\mathcal{X}$: state space
$v(x)$: the value of the game starting at $x \in \mathcal{X}$
$a$, $b$: actions of the 1st and 2nd players, MAX and MIN
$r(x,a,b)$: reward paid by MIN to MAX
$P(y \mid x,a,b)$: transition probability from $x$ to $y$ given the actions $a$, $b$
$\gamma < 1$: discount factor

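As an illustration of (DP), the following minimal sketch applies the right-hand side of (DP) as an operator and iterates it to a fixed point. It is plain fixed-point (value) iteration, not the policy iteration or multigrid methods of the talk. The array layout, the function names, and the assumption that every state has the same number of actions for each player are choices made here for the example only.

```python
import numpy as np

def shapley_operator(v, P, r, gamma):
    """Right-hand side of (DP): max over a, min over b.

    Assumed layout (not from the talk):
      P[x, a, b, y] = P(y | x, a, b), shape (n, n_a, n_b, n)
      r[x, a, b]    = r(x, a, b),     shape (n, n_a, n_b)
    """
    q = gamma * (P @ v) + r            # q[x, a, b]
    return q.min(axis=2).max(axis=1)   # MIN replies to a, then MAX chooses a

def value_iteration(P, r, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v <- F(v) until successive iterates differ by less than tol."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_new = shapley_operator(v, P, r, gamma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Toy one-state game: max_a min_b r = max(min(1, 3), min(2, 0)) = 1,
# so the fixed point is 1 / (1 - 0.5) = 2.
P_demo = np.ones((1, 2, 2, 1))
r_demo = np.array([[[1.0, 3.0], [2.0, 0.0]]])
v_demo = value_iteration(P_demo, r_demo, gamma=0.5)
```

Because $\gamma < 1$ the operator is a sup-norm contraction, so the iteration converges, but only linearly at rate $\gamma$.
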
Value of the game starting in x

$$ v(x) = \max_{(a_k)_{k \ge 0}} \min_{(b_k)_{k \ge 0}} \mathbb{E}\Big[ \sum_{k=0}^{\infty} \gamma^k r(X_k, a_k, b_k) \Big] $$

where

$$ a_k = a_k(X_k, b_{k-1}, a_{k-1}, X_{k-1}, \dots), \qquad b_k = b_k(X_k, a_k, \dots) $$

are strategies and the state process $X_k$ satisfies

$$ P(X_{k+1} = y \mid X_k = x, a_k = a, b_k = b) = P(y \mid x,a,b) $$

Deterministic zero-sum two-player game

Circles: MAX plays
Squares: MIN plays
Weight on the edges: payment made by MIN to MAX

[Figure: weighted directed graph of the game; the drawing is not reproduced here.]

If MAX initially moves to 2′, he eventually loses 5 per turn.

But if MAX initially moves to 1′, he eventually loses only (1 + 0 + 2 + 3) / 2 = 3 per turn.

Feedback strategies or policies

$$ v(x) = \max_{(a_k)_{k \ge 0}} \min_{(b_k)_{k \ge 0}} \mathbb{E}\Big[ \sum_{k=0}^{\infty} \gamma^k r(X_k, a_k, b_k) \Big] $$

For $\alpha : x \mapsto \alpha(x) \in \mathcal{A}(x)$ and $\beta : (x,a) \mapsto \beta(x,a) \in \mathcal{B}(x,a)$, the strategies

$$ a_k = \alpha(X_k), \qquad b_k = \beta(X_k, a_k) $$

are such that $X_k$ is a Markov chain with transition matrix $P^{\alpha,\beta}$, where

$$ P^{\alpha,\beta}_{xy} := P(y \mid x, \alpha(x), \beta(x, \alpha(x))), \qquad x, y \in \mathcal{X}. $$

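Once both feedback policies are fixed, the discounted value is the solution of the linear system $(I - \gamma P^{\alpha,\beta}) v = r^{\alpha,\beta}$. The sketch below builds $P^{\alpha,\beta}$ and solves that system with a dense solver. The array layout and the name `evaluate_policies` are assumptions for illustration, and the dense solve is only a stand-in for whatever linear solver is used in practice.

```python
import numpy as np

def evaluate_policies(P, r, gamma, alpha, beta):
    """Discounted value of the fixed feedback policies alpha (MAX) and beta (MIN).

    Assumed layout (not from the talk):
      P[x, a, b, y] = P(y | x, a, b), r[x, a, b] = r(x, a, b)
      alpha[x]   = action of MAX in state x
      beta[x, a] = reply of MIN to (x, a)
    """
    n = P.shape[0]
    xs = np.arange(n)
    b = beta[xs, alpha]             # action MIN actually plays in each state
    P_ab = P[xs, alpha, b, :]       # (n, n) transition matrix of the Markov chain
    r_ab = r[xs, alpha, b]          # one-step reward under (alpha, beta)
    return np.linalg.solve(np.eye(n) - gamma * P_ab, r_ab)

# Two states that alternate deterministically; reward 1 is paid only in state 0.
P_demo = np.zeros((2, 1, 1, 2))
P_demo[0, 0, 0, 1] = 1.0
P_demo[1, 0, 0, 0] = 1.0
r_demo = np.array([[[1.0]], [[0.0]]])
alpha_demo = np.array([0, 0])
beta_demo = np.zeros((2, 1), dtype=int)
v_demo = evaluate_policies(P_demo, r_demo, 0.5, alpha_demo, beta_demo)
# v(0) = 1 + 0.5 v(1) and v(1) = 0.5 v(0), hence v = (4/3, 2/3)
```
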
Dynamic programming operator and optimal policy

$$ v(x) = \max_{a \in \mathcal{A}(x)} \min_{b \in \mathcal{B}(x,a)} \underbrace{\Big[ \sum_{y \in \mathcal{X}} \gamma P(y \mid x,a,b)\, v(y) + r(x,a,b) \Big]}_{F(v;\,(x,a),\,b)} =: F(v; x) $$

$\alpha$: policy maximizing (DP) for MAX
$\beta$: policy minimizing $F(v; (x,a), b)$ for MIN

The dynamic programming operator $F$ is monotone and additively sub-homogeneous ($F(\lambda + v) \le \lambda + F(v)$, $\lambda \ge 0$).

Method to solve (DP): policy iteration algorithm [Howard, 60 (1-player game)], [Denardo, 67 (2-player game)]

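For completeness, both properties can be checked directly from the definition of $F$. The short derivation below is a sketch that assumes $0 \le \gamma < 1$ and uses only the facts that $P(\cdot \mid x,a,b)$ is a probability vector and that max and min preserve order.

```latex
% Monotonicity: if v \le w componentwise, then, since P(y | x,a,b) \ge 0,
\sum_{y} \gamma P(y \mid x,a,b)\, v(y) + r(x,a,b)
  \;\le\; \sum_{y} \gamma P(y \mid x,a,b)\, w(y) + r(x,a,b),
% and taking min over b, then max over a, gives F(v; x) \le F(w; x).

% Additive sub-homogeneity: for a constant \lambda \ge 0,
% using \sum_{y} P(y \mid x,a,b) = 1,
F(\lambda + v; x)
  = \max_{a} \min_{b} \Big[ \gamma\lambda + \sum_{y} \gamma P(y \mid x,a,b)\, v(y) + r(x,a,b) \Big]
  = \gamma\lambda + F(v; x)
  \;\le\; \lambda + F(v; x).
```
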
Dynamic programming equation of zero-sum two-player stochastic differential games

PDE of Isaacs (or Hamilton-Jacobi-Bellman for one player):

$$ -\lambda v(x) + H\Big(x, \frac{\partial v}{\partial x_i}, \frac{\partial^2 v}{\partial x_i \partial x_j}\Big) = 0, \qquad x \in \mathcal{X} \tag{I} $$

where

$$ H(x,p,K) = \min_{a \in \mathcal{A}(x)} \max_{b \in \mathcal{B}(x,a)} \Big[ p \cdot f(x,a,b) + \tfrac{1}{2} \operatorname{tr}\big(\sigma(x,a,b)\, \sigma^T(x,a,b)\, K\big) + r(x,a,b) \Big] $$

Discretization of (I) with monotone schemes yields (DP).

Motivation

Solve dynamic programming equations arising from the discretization of Isaacs equations or of other DP equations of diffusions (e.g. variational inequalities).
    Applications: pursuit-evasion games, finance, ...
Solve large-scale zero-sum stochastic games (with discrete state space).
    Examples: problems arising from the web, problems in the verification of programs in computer science, ...

→ Use the policy iteration algorithm, where the linear systems involved are solved using AMG.

Policy Iteration (PI) Algorithm for games

$$ v(x) = \max_{a \in \mathcal{A}(x)} \underbrace{\min_{b \in \mathcal{B}(x,a)} \Big[ \sum_{y \in \mathcal{X}} \gamma P(y \mid x,a,b)\, v(y) + r(x,a,b) \Big]}_{F(v;\,x,\,a)} $$

Start with $\alpha_0 : x \mapsto \alpha_0(x) \in \mathcal{A}(x)$, and apply successively:

1. The value $v_{k+1}$ of policy $\alpha_k$ is the solution of
   $$ v_{k+1}(x) = F(v_{k+1}; x, \alpha_k(x)) \qquad \forall x \in \mathcal{X}. $$
2. Improve the policy: select $\alpha_{k+1}$ optimal for $v_{k+1}$:
   $$ \alpha_{k+1}(x) \in \operatorname*{argmax}_{a \in \mathcal{A}(x)} F(v_{k+1}; x, a) \qquad \forall x \in \mathcal{X}. $$

Until $\alpha_{k+1}(x) = \alpha_k(x)$ $\forall x \in \mathcal{X}$.

Step 1 is itself solved by PI.

Policy Iteration (PI) for 1-player games (Howard, 60)

Start with $\beta_{k,0}$, and apply successively:

1. The value $v_{k,s+1}$ of policy $\beta_{k,s}$ is the solution of
   $$ v_{k,s+1} = \gamma P^{\alpha_k, \beta_{k,s}} v_{k,s+1} + r^{\alpha_k, \beta_{k,s}} $$
   where $P^{\alpha,\beta}_{xy} := P(y \mid x, \alpha(x), \beta(x, \alpha(x)))$.
2. Improve the policy: find $\beta_{k,s+1}$ optimal for $v_{k,s+1}$.

Until $\beta_{k,s+1} = \beta_{k,s}$.

[Diagram: the external loop (PI ext) runs over $\alpha_0, \dots, \alpha_k$; for each external iterate, the internal loop (PI int) runs over $\beta_{0,0}, \dots, \beta_{0,s}$, and so on.]

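The two nested loops can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the talk: the array layout, the function names, the tolerance-based strict-improvement rule (used to avoid cycling on ties), and the warm start of the internal loop are all choices made here. The linear systems are solved with a dense solver rather than AMG.

```python
import numpy as np

def solve_fixed(P, r, gamma, alpha, b):
    """Solve v = gamma * P^{alpha,beta} v + r^{alpha,beta}; b[x] is MIN's action in x."""
    n = P.shape[0]
    xs = np.arange(n)
    A = np.eye(n) - gamma * P[xs, alpha, b, :]
    return np.linalg.solve(A, r[xs, alpha, b])

def inner_pi(P, r, gamma, alpha, b, tol=1e-12):
    """Internal loop: Howard's policy iteration for MIN against a fixed alpha."""
    xs = np.arange(P.shape[0])
    while True:
        v = solve_fixed(P, r, gamma, alpha, b)
        q = gamma * (P[xs, alpha] @ v) + r[xs, alpha]    # q[x, b']
        improved = q.min(axis=1) < q[xs, b] - tol        # strictly better reply exists
        if not improved.any():
            return v, b
        b = np.where(improved, q.argmin(axis=1), b)

def policy_iteration(P, r, gamma, tol=1e-12):
    """External loop: improve MAX's policy alpha using values from inner_pi."""
    n = P.shape[0]
    xs = np.arange(n)
    alpha = np.zeros(n, dtype=int)
    b = np.zeros(n, dtype=int)
    while True:
        v, b = inner_pi(P, r, gamma, alpha, b, tol)      # step 1: value of alpha
        q = (gamma * (P @ v) + r).min(axis=2)            # q[x, a] = F(v; x, a)
        improved = q.max(axis=1) > q[xs, alpha] + tol
        if not improved.any():
            return v, alpha
        alpha = np.where(improved, q.argmax(axis=1), alpha)   # step 2: improve alpha

# Toy one-state game: a = 0 guarantees 0 per turn, a = 1 guarantees 1 per turn.
P_demo = np.ones((1, 2, 2, 1))
r_demo = np.array([[[0.0, 2.0], [3.0, 1.0]]])
v_demo, alpha_demo = policy_iteration(P_demo, r_demo, gamma=0.5)
# MAX ends on a = 1 and the value is 1 / (1 - 0.5) = 2
```

Each pass of either loop costs one or more linear solves, which is why the choice of linear solver dominates the running time on large state spaces.
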
$(v_k)_{k \ge 1}$ is nondecreasing (MAX player).
$(v_{k,s})_{s \ge 1}$ is nonincreasing (MIN player).

PI stops after finitely many iterations when the sets of actions are finite.

Internal loop (1-player game): PI ≈ Newton algorithm in which differentials are replaced by superdifferentials of the (DP) operator.
External loop (2-player game): PI ≈ Newton algorithm in which the (DP) operator is approximated from below by piecewise affine and concave maps.

→ Expect superlinear convergence in good cases.