Nash Q-Learning for General-Sum Stochastic Games (Hu & Wellman)

  1. Nash Q-Learning for General-Sum Stochastic Games
     Hu & Wellman
     March 6th, 2006, CS286r
     Presented by Ilan Lobel

  2. Outline
     • Stochastic Games and Markov Perfect Equilibria
     • Bellman’s Operator as a Contraction Mapping
     • Stochastic Approximation of a Contraction Mapping
     • Application to Zero-Sum Markov Games
     • Minimax-Q Learning
     • Theory of Nash-Q Learning
     • Empirical Testing of Nash-Q Learning

  3. How do we model games that evolve over time?
     • Stochastic Games!
     • Current Game = State
     • Ingredients:
       - Agents (N)
       - States (S)
       - Payoffs (R)
       - Transition Probabilities (P)
       - Discount Factor (δ)

  4. Example of a Stochastic Game (δ = 0.9)

     State 1:
              C      D
        A    1,2    3,4
        B    5,6    7,8
     Move with 50% probability when (A,C) or (A,D).

     State 2:
              C       D       E
        A   -1,2    -3,4     0,0
        B   -5,6    -7,8   -10,10
     Move with 30% probability when (B,D).
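As a rough illustration, the example game can be written down as plain data. This is only a sketch: the names `GAME`, `MOVE`, `DELTA`, and `move_probability` are made up here, the labels "state1" and "state2" are not on the slide, and it assumes that "move" means switching to the other state, which the slide does not spell out.

```python
# Hypothetical encoding of the two-state example; all names are illustrative.
DELTA = 0.9  # discount factor shown on the slide

# GAME[state][(row_action, col_action)] = (row player's payoff, column player's payoff)
GAME = {
    "state1": {
        ("A", "C"): (1, 2), ("A", "D"): (3, 4),
        ("B", "C"): (5, 6), ("B", "D"): (7, 8),
    },
    "state2": {
        ("A", "C"): (-1, 2), ("A", "D"): (-3, 4), ("A", "E"): (0, 0),
        ("B", "C"): (-5, 6), ("B", "D"): (-7, 8), ("B", "E"): (-10, 10),
    },
}

# Probability of leaving the current state after a joint action.
# Assumption: "move" means switching to the other state.
MOVE = {
    "state1": {("A", "C"): 0.5, ("A", "D"): 0.5},
    "state2": {("B", "D"): 0.3},
}


def move_probability(state, joint_action):
    """Return the chance of switching states; unlisted joint actions stay put."""
    return MOVE[state].get(joint_action, 0.0)
```

Under this reading, the row player can only trigger a move out of state 1 by playing A, while leaving state 2 requires the specific joint action (B,D).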

  5. Markov Game is a Generalization of…
     • Repeated Games → Markov Games (Add States)

  6. Markov Game is a Generalization of…
     • Repeated Games → Markov Games (Add States)
     • MDP → Markov Games (Add Agents)

  7. Markov Perfect Equilibrium (MPE)
     • Strategy maps states into randomized actions
       - π_i : S → Δ(A)
     • No agent has an incentive to unilaterally change her policy.

  8. Cons & Pros of MPEs
     • Cons:
       - Can’t implement everything described by the Folk Theorems (i.e., no trigger strategies)
     • Pros:
       - MPEs always exist in finite Markov Games (Fink, 1964)
       - Easier to “search for”

  9. Learning in Stochastic Games
     • Learning is especially important in Markov Games because MPEs are hard to compute.
     • Do we know:
       - Our own payoffs?
       - Others’ rewards?
       - Transition probabilities?
       - Others’ strategies?

  10. Learning in Stochastic Games
      • Adapted from Reinforcement Learning:
        - Minimax-Q Learning (zero-sum games)
        - Nash-Q Learning
        - CE-Q Learning

  11. Zero-Sum Stochastic Games
      • Nice properties:
        - All equilibria have the same value.
        - Any equilibrium strategy of player 1 against any equilibrium strategy of player 2 produces an MPE.
        - It has a Bellman-type equation.

  12. Bellman’s Equation in DP
      • Bellman Operator: T
      • Bellman’s Equation Rewritten:
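The formulas on this slide did not survive in the transcript. A standard statement, assuming a single-agent MDP with reward r(s,a), transition probabilities P(s'|s,a), and the deck's discount factor δ (the symbols r, V, and V* are notation chosen here, not taken from the slide), would be:

```latex
(TV)(s) = \max_{a \in A} \Big[ r(s,a) + \delta \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big],
\qquad
V^{*} = T V^{*}.
```

The first expression defines the operator T acting on a value function V; the second is Bellman's equation rewritten as a fixed-point condition.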

  13. Contraction Mapping
      • The Bellman Operator has the contraction property:
      • Bellman’s Equation is a direct consequence of the contraction.
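The inequality itself is missing from the transcript. In its usual form, with T the Bellman Operator, V and V' any two value functions, and the sup norm (notation assumed here, not taken from the slide), the contraction property reads:

```latex
\| TV - TV' \|_{\infty} \;\le\; \delta \, \| V - V' \|_{\infty}
\quad \text{for all } V, V', \qquad 0 \le \delta < 1.
```

By the Banach fixed-point theorem, a contraction has exactly one fixed point V* with TV* = V*, and repeated application of T to any starting V converges to it. That fixed-point condition is Bellman's equation.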

  14. The Shapley Operator for Zero-Sum Stochastic Games
      • The Shapley Operator is a contraction mapping. (Shapley, 1953)
      • Hence, it also has a fixed point, which is an MPE:
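The operator's formula is not preserved in the transcript. A standard way to write it, assuming r(s, a_1, a_2) is player 1's payoff (player 2 receives its negative) and val[·] denotes the minimax value of a matrix game, would be:

```latex
(TV)(s) = \operatorname{val}\Big[\, r(s, a_1, a_2)
          + \delta \sum_{s' \in S} P(s' \mid s, a_1, a_2)\, V(s') \Big]_{a_1 \in A_1,\, a_2 \in A_2},
\qquad
\operatorname{val}[M] = \max_{x \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} x(a_1)\, M(a_1, a_2).
```

Compared with the single-agent Bellman operator, the max over actions is replaced by the value of a zero-sum stage game whose entries are current payoff plus discounted continuation value.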

  15. Value Iteration for Zero-Sum Stochastic Games
      • Direct consequence of contraction.
      • Converges to fixed point of operator.
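A minimal sketch of this iteration, under a simplifying assumption: every state has exactly two actions per player, so the stage-game value has a closed form and no linear-programming solver is needed. The function names and the `reward`/`trans` layout are hypothetical choices made for this example.

```python
def value_2x2(m):
    """Value of a 2x2 zero-sum matrix game for the row (maximizing) player."""
    (a, b), (c, d) = m
    maximin = max(min(a, b), min(c, d))   # best payoff the row player can guarantee with a pure row
    minimax = min(max(a, c), max(b, d))   # lowest cap the column player can enforce with a pure column
    if maximin == minimax:                # a pure saddle point exists
        return maximin
    # No pure saddle point: standard mixed-strategy value of a 2x2 game.
    return (a * d - b * c) / (a + d - b - c)


def shapley_iteration(reward, trans, delta, sweeps=200):
    """Repeatedly apply the Shapley operator, starting from V = 0.

    reward[s][i][j]   : payoff to player 1 in state s under joint action (i, j)
    trans[s][i][j][t] : probability of moving from s to t under (i, j)
    """
    n = len(reward)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [
            value_2x2([
                [reward[s][i][j] + delta * sum(trans[s][i][j][t] * v[t] for t in range(n))
                 for j in range(2)]
                for i in range(2)
            ])
            for s in range(n)
        ]
    return v
```

For games with more than two actions per player, `value_2x2` would have to be replaced by a general matrix-game solver, for example one based on linear programming.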

  16. Q-Learning
      • Another consequence of a contraction mapping:
        - Q-Learning converges!
      • Q-Learning can be described as an approximation of value iteration:
        - Value iteration with noise.

  17. Q-Learning Convergence
      • Q-Learning is called a Stochastic Iterative Approximation of Bellman’s operator:
        - Learning Rate of 1/t.
        - Noise is zero-mean and has bounded variance.
      • It converges if all state-action pairs are visited infinitely often. (Bertsekas & Tsitsiklis, Neuro-Dynamic Programming)
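A single tabular Q-learning update can be sketched as follows. The function name and the dictionary-based tables are assumptions made for illustration, and the 1/t learning rate is counted here per state-action pair, which is one common reading of the condition on the slide.

```python
from collections import defaultdict


def q_learning_step(q, visits, s, a, reward, s_next, actions, delta=0.9):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]           # learning rate 1/t for this state-action pair
    best_next = max(q[(s_next, b)] for b in actions)
    target = reward + delta * best_next    # one noisy sample of the Bellman operator
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
    return q[(s, a)]


q = defaultdict(float)       # Q-factors start at zero
visits = defaultdict(int)    # visit counts per (state, action)
```

Each call moves Q(s, a) toward the sampled target by a step of size 1/t, which is the "value iteration with noise" view: the expectation over next states is replaced by the single observed transition.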

  18. Minimax-Q Learning Algorithm For Zero-Sum Stochastic Games
      • Initialize your Q_0(s,a1,a2) for all states, actions.
      • Update rule:
      • Player 1 then chooses action u1 in the next stage s_{k+1}.
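The update rule itself is missing from the transcript. The usual minimax-Q update, written with the deck's discount factor δ and an assumed learning rate α_k, has this shape:

```latex
Q_{k+1}(s_k, a_1, a_2) = (1 - \alpha_k)\, Q_k(s_k, a_1, a_2)
  + \alpha_k \Big[ r(s_k, a_1, a_2) + \delta \, \operatorname{val}\big[ Q_k(s_{k+1}, \cdot, \cdot) \big] \Big],
\qquad
\operatorname{val}[M] = \max_{x \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} x(a_1)\, M(a_1, a_2).
```

It is the Q-learning update with the max over next actions replaced by the minimax value of the next state's Q-matrix.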

  19. Minimax-Q Learning
      • It’s a Stochastic Iterative Approximation of the Shapley Operator.
      • It converges to a Nash Equilibrium if all state-action-action triplets are visited infinitely often. (Littman, 1996)

  20. Can we extend it to General-Sum Stochastic Games?
      • Yes & No.
      • Nash-Q Learning is such an extension.
      • However, it has much worse computational and theoretical properties.

  21. Nash-Q Learning Algorithm
      • Initialize Q_0^j(s,a1,a2) for all states, actions and for every agent.
        - You must simulate everyone’s Q-factors.
      • Update rule:
      • Choose the randomized action generated by the Nash operator.
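The update rule is not preserved in the transcript. Following the form given by Hu & Wellman, with the deck's δ in place of the paper's discount symbol and written for two players, it would read:

```latex
Q^{j}_{t+1}(s, a_1, a_2) = (1 - \alpha_t)\, Q^{j}_{t}(s, a_1, a_2)
  + \alpha_t \Big[ r^{j}_{t} + \delta \, \mathrm{Nash}Q^{j}_{t}(s') \Big],
\qquad
\mathrm{Nash}Q^{j}_{t}(s') = \sum_{a_1, a_2} \pi^{1}(s')(a_1)\, \pi^{2}(s')(a_2)\, Q^{j}_{t}(s', a_1, a_2).
```

Here (π^1(s'), π^2(s')) is a Nash equilibrium of the stage game whose payoff matrices are the current Q-factors (Q^1_t(s'), Q^2_t(s')). Because that equilibrium depends on every agent's Q-factors, each learner has to maintain an estimate of all of them.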

  22. The Nash Operator and The Principle of Optimality
      • Nash Operator finds the Nash of a stage game.
      • Find Nash of stage game with Q-factors as your payoffs.
      • Q-factor = Current Reward + Payoffs for Rest of the Markov Game

  23. The Nash Operator
      • Unknown complexity even for 2 players.
      • In comparison, the minimax operator can be solved in polynomial time (there’s a linear programming formulation).
      • For convergence, all players must break ties in favor of the same Nash Equilibrium.
      • Why not go model-based if computation is so expensive?
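To make the tie-breaking issue concrete, here is a small sketch that enumerates only the pure-strategy Nash equilibria of a two-player stage game. It is not the full Nash operator, which must also handle mixed equilibria; the function name and the list-of-lists payoff layout are assumptions made for this example.

```python
def pure_nash_equilibria(payoff_1, payoff_2):
    """List (row, col) pairs where neither player gains by deviating alone.

    payoff_1[i][j] and payoff_2[i][j] are the payoffs to the row and column
    player when the row player picks i and the column player picks j.
    """
    rows, cols = len(payoff_1), len(payoff_1[0])
    equilibria = []
    for i in range(rows):
        for j in range(cols):
            # Row player cannot do better by switching rows in column j.
            row_best = all(payoff_1[i][j] >= payoff_1[k][j] for k in range(rows))
            # Column player cannot do better by switching columns in row i.
            col_best = all(payoff_2[i][j] >= payoff_2[i][l] for l in range(cols))
            if row_best and col_best:
                equilibria.append((i, j))
    return equilibria


# A coordination game with two pure equilibria: the learners must agree on one.
coordination = [[2, 0], [0, 1]]
both = pure_nash_equilibria(coordination, coordination)
```

In the coordination game both (0, 0) and (1, 1) are equilibria with different payoffs, so two learners that select different ones would back up inconsistent values into their Q-factors.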

  24. Convergence Results
      • If every stage game encountered during learning has a global optimum, Nash-Q converges.
      • If every stage game encountered during learning has a saddle point, Nash-Q converges.
      • Both of these are VERY strong assumptions.
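For reference, the two conditions can be stated roughly as in Hu & Wellman's paper. The notation is assumed here: M^k is agent k's stage-game payoff, σ a joint strategy, and σ^{-k} the strategies of all agents other than k.

```latex
\begin{aligned}
\text{Global optimum:}\quad & \sigma M^{k} \ge \hat{\sigma} M^{k}
  && \text{for every agent } k \text{ and every joint strategy } \hat{\sigma}. \\
\text{Saddle point:}\quad & \sigma \text{ is a Nash equilibrium and }
  \sigma^{k} \sigma^{-k} M^{k} \le \sigma^{k} \hat{\sigma}^{-k} M^{k}
  && \text{for every agent } k \text{ and every } \hat{\sigma}^{-k}.
\end{aligned}
```

In words: at a global optimum every agent receives its highest possible payoff at the same joint strategy, while at a saddle point each agent does at least as well whenever the others deviate.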

  25. Convergence Result Analysis
      • The global optimum assumption implies full cooperation between agents.
      • The saddle point assumption implies no cooperation between agents.
      • Are these equivalent to DP Q-Learning and minimax-Q Learning, respectively?

  26. Empirical Testing: The Grid-world
      [Figure: World 1, some Nash Equilibria]

  27. Empirical Testing: Nash Equilibria
      [Figure: World 2, all Nash Equilibria, labeled (3%), (3%), (97%)]

  28. Empirical Performance
      • In very small and simple games, Nash-Q learning often converged even though theory did not predict it.
      • In particular, if all Nash Equilibria have the same value, Nash-Q did better than expected.

  29. Conclusions
      • Nash-Q is a nice step forward:
        - It can be used for any Markov Game.
        - It uses the Principle of Optimality in a smart way.
      • But there is still a long way to go:
        - Convergence results are weak.
        - There are no computational complexity results.
