Nash Q-Learning for General-Sum Stochastic Games
Hu & Wellman
CS286r, March 6th, 2006
Presented by Ilan Lobel
Outline
– Stochastic Games and Markov Perfect Equilibria
– Bellman's Operator as a Contraction Mapping
– Stochastic Approximation of a Contraction Mapping
– Application to Zero-Sum Markov Games: Minimax-Q Learning
– Theory of Nash-Q Learning
– Empirical Testing of Nash-Q Learning
How do we model games that evolve over time? Stochastic Games!
Current Game = State
Ingredients:
– Agents (N)
– States (S)
– Payoffs (R)
– Transition Probabilities (P)
– Discount Factor (δ)
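A minimal sketch of one way to hold these ingredients in code; the container and field names below are my own illustration, not part of the slides or the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class StochasticGame:
    """One possible container for a finite stochastic game; field names are illustrative."""
    n_agents: int                               # N
    states: List[str]                           # S
    actions: Dict[str, List[Tuple]]             # joint actions available in each state
    rewards: Dict[Tuple, Tuple[float, ...]]     # (state, joint action) -> one payoff per agent
    transitions: Dict[Tuple, Dict[str, float]]  # (state, joint action) -> {next state: probability}
    delta: float                                # discount factor
```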
Example of a Stochastic Game (δ = 0.9)
State 1:
      C     D
A    1,2   3,4
B    5,6   7,8
– Move with 50% probability when (A,C) or (A,D).
State 2:
      C     D      E
A   -1,2  -3,4    0,0
B   -5,6  -7,8  -10,10
– Move with 30% probability when (B,D).
Markov Game is a Generalization of…
– Repeated Games (add states)
– MDPs (add agents)
Markov Perfect Equilibrium (MPE)
– Strategy maps states into randomized actions: πi : S → Δ(A)
– No agent has an incentive to unilaterally change her policy.
Cons & Pros of MPEs
Cons:
– Can't implement everything described by the Folk Theorems (i.e., no trigger strategies).
Pros:
– MPEs always exist in finite Markov Games (Fink, 64).
– Easier to "search for".
Learning in Stochastic Games
Learning is especially important in Markov Games because MPEs are hard to compute.
Do we know:
– Our own payoffs?
– Others' rewards?
– Transition probabilities?
– Others' strategies?
Learning in Stochastic Games
Adapted from Reinforcement Learning:
– Minimax-Q Learning (zero-sum games)
– Nash-Q Learning
– CE-Q Learning
Zero-Sum Stochastic Games
Nice properties:
– All equilibria have the same value.
– Any equilibrium strategy of player 1 against any equilibrium strategy of player 2 produces an MPE.
– There is a Bellman-type equation.
Bellman's Equation in DP
Bellman Operator:
– (TV)(s) = max_a [ R(s,a) + δ Σ_s' P(s'|s,a) V(s') ]
Bellman's Equation, rewritten:
– V* = TV*
Contraction Mapping
– The Bellman Operator has the contraction property: ||TV − TV'||∞ ≤ δ ||V − V'||∞
– Bellman's Equation (a unique fixed point V* = TV*) is a direct consequence of the contraction.
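A small numerical sketch of the Bellman operator and its contraction property on a made-up random MDP; the instance and the function name are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, delta = 4, 3, 0.9

# A small random MDP: rewards R[s, a] and transition probabilities P[s, a, s'].
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
P = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

def bellman(V):
    # (TV)(s) = max_a [ R(s, a) + delta * sum_s' P(s'|s, a) V(s') ]
    return np.max(R + delta * (P @ V), axis=1)

# Contraction check in the sup norm: ||TV - TV'|| <= delta * ||V - V'||
V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
assert np.max(np.abs(bellman(V1) - bellman(V2))) <= delta * np.max(np.abs(V1 - V2)) + 1e-12
```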
The Shapley Operator for Zero-Sum Stochastic Games
– (TV)(s) = val[ R(s,a1,a2) + δ Σ_s' P(s'|s,a1,a2) V(s') ], where val[·] is the minimax value of the matrix game.
– The Shapley Operator is a contraction mapping. (Shapley, 53)
– Hence, it also has a fixed point, which is an MPE: V* = TV*
Value Iteration for Zero-Sum Stochastic Games
– Iterate Vk+1 = TVk.
– Direct consequence of contraction.
– Converges to the fixed point of the operator.
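A sketch of value iteration with the Shapley operator for a small zero-sum stochastic game given as arrays R[s,a1,a2] and P[s,a1,a2,s']; the matrix-game value at each state is computed with a standard linear program. The setup and names are mine, offered as an illustration of the idea rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value of the zero-sum matrix game A (row player maximizes)."""
    m, n = A.shape
    # Variables: row player's mixed strategy x (m entries) and the value v.
    # Maximize v subject to sum_i x_i * A[i, j] >= v for every column j.
    c = np.zeros(m + 1)
    c[-1] = -1.0                                            # linprog minimizes, so minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])               # v - sum_i x_i A[i, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]               # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def shapley_operator(V, R, P, delta):
    """(TV)(s) = value of the matrix game R(s,.,.) + delta * E[V(s') | s, a1, a2]."""
    return np.array([matrix_game_value(R[s] + delta * (P[s] @ V)) for s in range(R.shape[0])])

def value_iteration(R, P, delta, tol=1e-6):
    """Iterate V <- TV until the sup-norm change is below tol (the contraction guarantees this)."""
    V = np.zeros(R.shape[0])
    while True:
        V_next = shapley_operator(V, R, P, delta)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
```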
Q-Learning
Another consequence of a contraction mapping:
– Q-Learning converges!
Q-Learning can be described as an approximation of value iteration:
– Value iteration with noise.
Q-Learning Convergence
Q-Learning is called a Stochastic Iterative Approximation of Bellman's operator:
– Learning rate of 1/t.
– Noise is zero-mean and has bounded variance.
It converges if all state-action pairs are visited infinitely often. (Neuro-Dynamic Programming – Bertsekas, Tsitsiklis)
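A sketch of one tabular Q-learning update with a 1/t learning rate, matching the stochastic-approximation description above; the table layout and names are assumptions for illustration.

```python
import numpy as np

def q_learning_step(Q, visits, s, a, r, s_next, delta):
    """One tabular Q-learning update with a 1/t learning rate per (s, a) pair."""
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]                  # learning rate 1/t for this state-action pair
    target = r + delta * np.max(Q[s_next])      # noisy sample of the Bellman backup
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q
```

Here Q and visits would be (n_states, n_actions) arrays; convergence requires every (s, a) pair to keep being visited, e.g. through some exploration scheme.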
Minimax-Q Learning Algorithm for Zero-Sum Stochastic Games
– Initialize your Q0(s,a1,a2) for all states and actions.
– Update rule: Qk+1(sk,a1k,a2k) = (1 − αk) Qk(sk,a1k,a2k) + αk [ rk + δ max_π1 min_a2 Σ_a1 π1(a1) Qk(sk+1,a1,a2) ]
– Player 1 then chooses action u1 in the next state sk+1, according to the maximizing mixed strategy π1.
Minimax-Q Learning
– It's a Stochastic Iterative Approximation of the Shapley Operator.
– It converges to a Nash Equilibrium if all state-action-action triplets are visited infinitely often. (Littman, 96)
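A sketch of one minimax-Q update for player 1, where the next-state value is the minimax value of the Q-factor matrix at sk+1; the matrix-game solver is passed in (e.g. the LP helper from the value-iteration sketch above), and all names are illustrative.

```python
def minimax_q_step(Q, visits, s, a1, a2, r, s_next, delta, matrix_game_value):
    """One minimax-Q update for player 1, with Q of shape (S, A1, A2).

    matrix_game_value solves a zero-sum matrix game for the row player's value,
    e.g. the LP-based helper sketched after the value-iteration slide.
    """
    visits[s, a1, a2] += 1
    alpha = 1.0 / visits[s, a1, a2]
    # Next-state value: max over player 1's mixed strategies of the worst-case
    # expected Q-factor, i.e. the minimax value of the matrix Q[s_next].
    v_next = matrix_game_value(Q[s_next])
    Q[s, a1, a2] = (1.0 - alpha) * Q[s, a1, a2] + alpha * (r + delta * v_next)
    return Q
```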
Can we extend it to General-Sum Stochastic Games?
– Yes & No.
– Nash-Q Learning is such an extension.
– However, it has much worse computational and theoretical properties.
Nash-Q Learning Algorithm
– Initialize Q0j(s,a1,a2) for all states, actions and for every agent j.
– You must simulate everyone's Q-factors.
– Update rule: Qk+1j(sk,a1k,a2k) = (1 − αk) Qkj(sk,a1k,a2k) + αk [ rkj + δ Nashj(Qk1(sk+1), Qk2(sk+1)) ], where Nashj(·) is agent j's payoff in a selected Nash Equilibrium of the stage game defined by the Q-factors at sk+1.
– Choose the randomized action generated by the Nash operator.
The Nash Operator and the Principle of Optimality
– The Nash Operator finds the Nash of a stage game.
– Find the Nash of the stage game with the Q-factors as your payoffs.
– The Q-factor combines the current reward with the payoffs for the rest of the Markov Game.
The Nash Operator
– Unknown complexity, even for 2 players.
– In comparison, the minimax operator can be solved in polynomial time (there's a linear programming formulation).
– For convergence, all players must break ties in favor of the same Nash Equilibrium.
– Why not go model-based if computation is so expensive?
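A sketch of the Nash operator and one Nash-Q update for two players, using the third-party nashpy package as a stand-in stage-game solver; picking the first equilibrium returned is an arbitrary tie-break, which is exactly the coordination issue noted above. Names and structure are mine, not Hu & Wellman's.

```python
import nashpy as nash  # third-party bimatrix-game solver, used here as a stand-in

def nash_operator(Q1_s, Q2_s):
    """Return a stage-game Nash equilibrium (pi1, pi2) of the bimatrix game
    (Q1_s, Q2_s) and each player's expected payoff under it."""
    game = nash.Game(Q1_s, Q2_s)
    # support_enumeration() yields all equilibria; taking the first is an arbitrary
    # tie-break, and convergence needs every agent to select the same equilibrium.
    pi1, pi2 = next(game.support_enumeration())
    return pi1, pi2, pi1 @ Q1_s @ pi2, pi1 @ Q2_s @ pi2

def nash_q_step(Q1, Q2, visits, s, a1, a2, r1, r2, s_next, delta):
    """One Nash-Q update; each agent keeps Q-factors for both players."""
    visits[s, a1, a2] += 1
    alpha = 1.0 / visits[s, a1, a2]
    _, _, v1, v2 = nash_operator(Q1[s_next], Q2[s_next])
    Q1[s, a1, a2] = (1.0 - alpha) * Q1[s, a1, a2] + alpha * (r1 + delta * v1)
    Q2[s, a1, a2] = (1.0 - alpha) * Q2[s, a1, a2] + alpha * (r2 + delta * v2)
    return Q1, Q2
```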
Convergence Results
– If every stage game encountered during learning has a global optimum, Nash-Q converges.
– If every stage game encountered during learning has a saddle point, Nash-Q converges.
– Both of these are VERY strong assumptions.
Convergence Result Analysis
– The global optimum assumption implies full cooperation between agents.
– The saddle point assumption implies no cooperation between agents.
– Are these equivalent to DP Q-Learning and minimax-Q Learning, respectively?
Empirical Testing: The Grid-world
[Figure: Grid World 1, with some of its Nash Equilibria shown]
Empirical Testing: Nash Equilibria
[Figure: all Nash Equilibria of World 2, annotated with percentages (3%, 3%, 97%)]
Empirical Performance
– In very small and simple games, Nash-Q Learning often converged even though theory did not predict so.
– In particular, when all Nash Equilibria have the same value, Nash-Q did better than expected.
Conclusions
Nash-Q is a nice step forward:
– It can be used for any Markov Game.
– It uses the Principle of Optimality in a smart way.
But there is still a long way to go:
– Convergence results are weak.
– There are no computational complexity results.