Multigrid methods for zero-sum two-player stochastic games with mean reward



  1. Multigrid methods for zero-sum two-player stochastic games with mean reward. Sylvie Detournay and Marianne Akian, INRIA Saclay and CMAP, École Polytechnique (France). 15th Copper Mountain Conference on Multigrid Methods, 27 March to 1 April 2011.

  2. DP for zero-sum stochastic games with mean reward. Dynamic programming equation of zero-sum two-player stochastic games with mean reward:

$$\rho + v(x) = \max_{\alpha \in \mathcal{A}(x)} \min_{\beta \in \mathcal{B}(x,\alpha)} \Big[ \sum_{y \in X} P(y \mid x,\alpha,\beta)\, v(y) + r(x,\alpha,\beta) \Big], \quad \forall x \in X \qquad \text{(DP)}$$

where $X$ is the state space; $\rho$ is the mean reward of the game, a nonlinear eigenvalue; $v(x)$ is the bias or relative value of the game starting at $x \in X$; $\alpha$, $\beta$ are the actions of the first player (MAX) and the second player (MIN); $r(x,\alpha,\beta)$ is the reward paid by MIN to MAX; and $P(y \mid x,\alpha,\beta)$ is the transition probability from $x$ to $y$ given the actions $\alpha$, $\beta$.
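Concretely, (DP) says that $\rho$ and $v$ form a nonlinear eigenpair of the operator $F$ defined by the right-hand side: $\rho + v = F(v)$. Below is a minimal sketch of one evaluation of $F$, under the simplifying assumption (ours, not the slides') that every state shares the same finite action sets, so $P$ and $r$ can be stored as dense arrays:

```python
import numpy as np

def dp_operator(v, P, r):
    """One evaluation of the dynamic programming operator
        F(v; x) = max_a min_b [ sum_y P(y|x,a,b) v(y) + r(x,a,b) ].
    Illustrative dense encoding (not the paper's data structures):
        P: (n_states, n_a, n_b, n_states) transition probabilities,
        r: (n_states, n_a, n_b) stage rewards paid by MIN to MAX,
        v: (n_states,) current bias vector.
    """
    q = P @ v + r                      # expected value + reward, shape (n_states, n_a, n_b)
    return q.min(axis=2).max(axis=1)   # MIN picks b first, then MAX picks a

# (DP) then asks for a pair (rho, v) with rho + v == dp_operator(v, P, r).
```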

  3. DP for zero-sum stochastic games with mean reward. Value of the game with mean reward starting at $x \in X$:

$$\rho(x) = \sup_{(\alpha_k)_{k \ge 0}} \inf_{(\beta_k)_{k \ge 0}} \limsup_{N \to \infty} \frac{1}{N}\, \mathbb{E}\Big[ \sum_{k=0}^{N-1} r(X_k, \alpha_k, \beta_k) \Big]$$

where $\alpha_k = \alpha_k(X_k, \alpha_{k-1}, \beta_{k-1}, \dots)$ and $\beta_k = \beta_k(X_k, \alpha_k, \alpha_{k-1}, \beta_{k-1}, \dots)$ are strategies, and the state process $X_k$ satisfies $P(X_{k+1} = y \mid X_k = x, \alpha_k = \alpha, \beta_k = \beta) = P(y \mid x, \alpha, \beta)$.
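For a *fixed* pair of feedback strategies the sup/inf disappears, and $\rho(x)$ is just a long-run average along the induced Markov chain, which can be estimated by simulation. A minimal sketch, assuming the strategies have already been substituted to give an induced transition matrix `P` and reward vector `r` (an encoding we introduce for illustration):

```python
import numpy as np

def estimate_mean_reward(P, r, x0, n_steps=100_000, seed=0):
    """Monte Carlo estimate of the mean reward starting from state x0,
    for the Markov chain induced by fixed feedback strategies:
    P[x, y] = transition probability, r[x] = stage reward at x."""
    rng = np.random.default_rng(seed)
    x, total = x0, 0.0
    for _ in range(n_steps):
        total += r[x]
        x = rng.choice(len(P), p=P[x])   # draw the next state
    return total / n_steps
```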

  4. A deterministic zero-sum game. Deterministic zero-sum two-player game: the circles (resp. squares) represent the nodes at which Max (resp. Min) can play. [Figure: a weighted directed graph over Max nodes (circles) and Min nodes (squares, labeled 1′, 2′, 3′, 4′); the edge weights did not survive the extraction.] Values in the (DP) equation: $X = \{\text{Max nodes}\}$; $\mathcal{A}(x) = \{\text{Min nodes accessible from } x\}$; $\mathcal{B}(x,\alpha) = \{\text{Max nodes accessible from } \alpha\}$; $r(x,\alpha,\beta) = \text{weight}(x,\alpha) + \text{weight}(\alpha,\beta)$; $y = \beta$.

  5-8. A deterministic zero-sum game. If Max initially moves to 2′, he eventually loses 5 per turn. [Figure: the same graph, highlighting the play after Max moves to 2′.]

  9-12. A deterministic zero-sum game. But if Max initially moves to 1′, he eventually loses only $(1 + 0 + 2 + 3)/2 = 3$ per turn. [Figure: the same graph, highlighting the cycle reached after Max moves to 1′.]
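The per-turn figures above are cycle means: one turn is a Max move followed by a Min move, so the four-edge cycle reached via 1′ lasts two turns and costs $(1 + 0 + 2 + 3)/2 = 3$ per turn. A trivial sketch of that arithmetic (the edge weights are read off the slide's computation; the exact cycle structure did not survive the extraction):

```python
def cycle_mean_per_turn(edge_weights):
    """Mean reward per turn along a periodic play: two edges per turn
    (one Max move, then one Min move)."""
    return sum(edge_weights) / (len(edge_weights) / 2)

print(cycle_mean_per_turn([1, 0, 2, 3]))  # 3.0, the loss per turn via 1'
```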

  13. DP for zero-sum stochastic games. Optimal strategies and dynamic programming:

$$\rho(x) = \sup_{(\alpha_k)_{k \ge 0}} \inf_{(\beta_k)_{k \ge 0}} \limsup_{N \to \infty} \frac{1}{N}\, \mathbb{E}\Big[ \sum_{k=0}^{N-1} r(X_k, \alpha_k, \beta_k) \Big], \quad x \in X.$$

For feedback strategies $\alpha_k = \bar\alpha(X_k)$, $\beta_k = \bar\beta(X_k, \bar\alpha(X_k))$, define the matrix $P^{\bar\alpha,\bar\beta}_{xy} := P(y \mid x, \bar\alpha(x), \bar\beta(x, \bar\alpha(x)))$. If $P^{\bar\alpha,\bar\beta}$ is irreducible for all $\bar\alpha$ and $\bar\beta$, then $\rho(x) \equiv \rho$ is the unique solution of (DP), and the $\bar\alpha$, $\bar\beta$ attaining the max and min in (DP) are optimal feedback strategies for both players.

  14. DP for zero-sum stochastic games. Dynamic programming equation of zero-sum two-player stochastic differential games: the Isaacs PDE (diffusion problems)

$$-\rho + H\Big(x, \frac{\partial v}{\partial x}, \frac{\partial^2 v}{\partial x_i \partial x_j}\Big) = 0, \quad x \in X \qquad \text{(I)}$$

where

$$H(x, p, K) = \max_{\alpha \in \mathcal{A}(x)} \min_{\beta \in \mathcal{B}(x,\alpha)} \Big[ p \cdot f(x,\alpha,\beta) + \tfrac{1}{2} \operatorname{tr}\big(\sigma(x,\alpha,\beta)\,\sigma^T(x,\alpha,\beta)\, K\big) + r(x,\alpha,\beta) \Big].$$

Discretization of (I) with monotone schemes yields (DP).
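As an illustration of why monotone schemes produce a (DP) equation: in one dimension, upwinding the drift and central-differencing the diffusion give nonnegative stencil weights that sum to one, i.e. genuine transition probabilities (the standard Kushner-Dupuis construction; a sketch under that assumption, not necessarily the scheme used in the talk). For each fixed $(\alpha, \beta)$:

```python
def monotone_stencil(f, sigma, h):
    """Monotone discretization of f v' + (sigma^2/2) v'' on a 1-D grid of
    step h: upwind the first derivative, central-difference the second.
    Returns (p_plus, p_minus, dt): probabilities of jumping to x+h / x-h
    and the local time step. Upwinding keeps both probabilities >= 0,
    which is exactly the monotonicity yielding a (DP)-type equation
        rho*dt + v(x) = p_plus*v(x+h) + p_minus*v(x-h) + r*dt.
    """
    f_plus, f_minus = max(f, 0.0), max(-f, 0.0)
    denom = sigma**2 + h * (f_plus + f_minus)
    dt = h**2 / denom
    p_plus = (sigma**2 / 2 + h * f_plus) / denom
    p_minus = (sigma**2 / 2 + h * f_minus) / denom
    return p_plus, p_minus, dt   # p_plus + p_minus == 1
```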

  15. DP for zero-sum stochastic games. Motivation: (i) solve dynamic programming equations arising from the discretization of Isaacs equations, for example long-term diffusion problems, risk-sensitive problems (finance), singular perturbations of Isaacs equations, etc.; (ii) solve large-scale zero-sum stochastic games (with discrete state space), for example problems arising from the web, problems in the verification of programs in computer science, etc.; (iii) extend the equation to the general case, that is, without the irreducibility assumption. → Use the policy iteration algorithm combined with multigrid methods to solve the dynamic programming equation.

  16. DP for zero-sum stochastic games. Dynamic programming for multichain games: in general, the value of the game is a solution of the dynamic programming equation

$$\rho(x)\,(t+1) + v(x) = F(\rho\, t + v;\, x), \quad x \in X, \ t \text{ large enough},$$

where $F$ is the dynamic programming operator

$$F(v; x) := \max_{\alpha \in \mathcal{A}(x)} \min_{\beta \in \mathcal{B}(x,\alpha)} \Big[ \sum_{y \in X} P(y \mid x,\alpha,\beta)\, v(y) + r(x,\alpha,\beta) \Big].$$

($\{\rho\, t + v,\ t \text{ large}\}$ is an invariant half-line.)

  17. DP for zero-sum stochastic games. This is equivalent to solving the system, for $x \in X$:

$$\rho(x) = \max_{\alpha \in \mathcal{A}(x)} \min_{\beta \in \mathcal{B}(x,\alpha)} \sum_{y \in X} P(y \mid x,\alpha,\beta)\, \rho(y)$$
$$\rho(x) + v(x) = \max_{\alpha \in \mathcal{A}_\rho(x)} \min_{\beta \in \mathcal{B}_\rho(x,\alpha)} \Big[ \sum_{y \in X} P(y \mid x,\alpha,\beta)\, v(y) + r(x,\alpha,\beta) \Big]$$

with $\mathcal{A}_\rho(x) := \operatorname{argmax}_{\alpha \in \mathcal{A}(x)} \big( \min_{\beta \in \mathcal{B}(x,\alpha)} \sum_{y \in X} P(y \mid x,\alpha,\beta)\, \rho(y) \big)$ and $\mathcal{B}_\rho(x,\alpha) := \operatorname{argmin}_{\beta \in \mathcal{B}(x,\alpha)} \sum_{y \in X} P(y \mid x,\alpha,\beta)\, \rho(y)$.

For a one-player game:

$$\rho(x) = \min_{\beta \in \mathcal{B}(x)} \sum_{y \in X} P(y \mid x,\beta)\, \rho(y)$$
$$\rho(x) + v(x) = \min_{\beta \in \mathcal{B}_\rho(x)} \Big[ \sum_{y \in X} P(y \mid x,\beta)\, v(y) + r(x,\beta) \Big]$$

with $\mathcal{B}_\rho(x) = \operatorname{argmin}_{\beta \in \mathcal{B}(x)} \sum_{y \in X} P(y \mid x,\beta)\, \rho(y)$.

  18. Policy iteration (PI) algorithm. Multichain policy iteration algorithm for one player (Denardo and Fox, 1967):

1. Start with $\bar\beta_0 : x \mapsto \bar\beta_0(x)$.
2. Calculate the value and bias $(\rho^{k+1}, v^{k+1})$ for the policy $\bar\beta_k$, solution of
$$\rho^{k+1} = P^{\bar\beta_k} \rho^{k+1}, \qquad \rho^{k+1} + v^{k+1} = P^{\bar\beta_k} v^{k+1} + r^{\bar\beta_k}.$$
3. Improve the policy: find $\bar\beta_{k+1}$ optimal for $(\rho^{k+1}, v^{k+1})$:
$$\bar\beta_{k+1}(x) \in \operatorname{argmin}_{\beta \in \mathcal{B}_{\rho^{k+1}}(x)} \Big[ \sum_{y \in X} P(y \mid x,\beta)\, v^{k+1}(y) + r(x,\beta) \Big], \quad x \in X,$$
with $\mathcal{B}_\rho(x) = \operatorname{argmin}_{\beta \in \mathcal{B}(x)} \sum_{y \in X} P(y \mid x,\beta)\, \rho(y)$.
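A minimal sketch of this one-player algorithm, simplified to the unichain case where $\rho$ is a single scalar and one normalization $v[0] = 0$ suffices. (The multichain algorithm of Denardo and Fox needs one normalization per ergodic class, and its first improvement stage over $\mathcal{B}_\rho$ is vacuous here, since every row of a stochastic matrix averages a constant $\rho$ to itself.) All names are illustrative:

```python
import numpy as np

def evaluate_policy(P, r):
    """Solve rho*1 + v = P v + r with v[0] = 0 (unichain case).
    Unknowns: v (n entries) and the scalar rho."""
    n = len(r)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = np.eye(n) - P
    A[:n, n] = 1.0            # column multiplying rho
    A[n, 0] = 1.0             # normalization row: v[0] = 0
    b = np.concatenate([r, [0.0]])
    sol = np.linalg.lstsq(A, b, rcond=None)[0]
    return sol[n], sol[:n]    # rho, v

def policy_iteration(P_all, r_all, policy, tol=1e-12):
    """P_all[b] is the (n, n) transition matrix of action b, r_all[b] its
    reward vector; policy[x] is MIN's current action (an int) at state x."""
    policy = np.asarray(policy)
    n = len(policy)
    while True:
        P = np.array([P_all[policy[x]][x] for x in range(n)])
        r = np.array([r_all[policy[x]][x] for x in range(n)])
        rho, v = evaluate_policy(P, r)
        q = np.stack([P_all[b] @ v + r_all[b] for b in range(len(P_all))])
        new_policy = q.argmin(axis=0)
        # Conservative improvement: keep the old action when it is optimal.
        keep = q[policy, np.arange(n)] <= q.min(axis=0) + tol
        new_policy[keep] = policy[keep]
        if np.array_equal(new_policy, policy):
            return rho, v, policy
        policy = new_policy
```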

  19. Policy iteration (PI) algorithm. It is easy to show that $\rho^{k+1} \le \rho^k$. If $\rho^{k+1} = \rho^k$, the iteration is degenerate: $v^{k+1}$ is only defined up to $\operatorname{Ker}(I - P^{\bar\beta_k})$, whose dimension equals the number of ergodic classes of $P^{\bar\beta_k}$, which is $\ge 1$. → PI may cycle when there are multiple ergodic classes. To avoid this: optimal strategies are improved in a conservative way ($\bar\beta_{k+1}(x) = \bar\beta_k(x)$ if it is still optimal), and $v^{k+1}$ is fixed at one point of each ergodic class of $P^{\bar\beta_k}$. ⇒ When $\rho^{k+1} = \rho^k$, $v^{k+1}(x) = v^k(x)$ on each ergodic class of $P^{\bar\beta_k}$. ⇒ $(\rho^k, v^k)_{k \ge 1}$ is nonincreasing in the lexicographic order: $\rho^{k+1} \le \rho^k$, and if $\rho^{k+1} = \rho^k$ then $v^{k+1} \le v^k$. ⇒ PI stops after a finite time when the sets of actions are finite. Remark: PI ≈ Newton's algorithm in the case with a unique solution $v$.
