Efficient Learning Equilibrium
R. Brafman and M. Tennenholtz
Presented by Neal Gupta
CS 286r, March 20, 2006
Efficient Learning Equilibrium

• Infinitely repeated stage games M with single-stage matrix G
• Individual Rationality
• Efficiency
  – Unilateral deviation is irrational after a polynomial number of stages
  – Without deviation, after a polynomial number of steps the expected payoff is within ε of a Nash equilibrium (hence also within ε of the minimax payoffs)
• A set of algorithms that meets the above conditions with respect to a specific class of games is considered an ELE
Motivation

• Get objective convergence rather than convergence in beliefs
  – Works with (relatively) patient agents
• Exploit the richness of results from Folk Theorems
Assumptions about Agents

• Agents care about average reward, as well as how quickly it is reached
• No discounting
• Agents have NO PRIOR about the payoffs of the game
• Agents may or may not be able to observe payoffs
  – Perfect monitoring (main results proved)
  – Weak imperfect monitoring
    ∗ observe the other player’s actions, but not payoffs
  – Strict imperfect monitoring
Formal definition of learning equilibrium

• Define U_i(M, σ_1, σ_2, T), given repeated game M, to be the expected average reward of player i after T periods when the players follow the strategy (policy) profile σ = (σ_1, σ_2) (a small simulation sketch follows below)
• Then let U_i(M, σ_1, σ_2) denote the average reward in the limit:
    U_i(M, σ_1, σ_2) = liminf_{T→∞} U_i(M, σ_1, σ_2, T)
• A strategy (policy) profile σ = (σ_1, σ_2) is a LE if, for all repeated games M, neither player can benefit from a unilateral deviation:
    ∀ σ'_1: U_1(M, σ'_1, σ_2) ≤ U_1(M, σ_1, σ_2), and
    ∀ σ'_2: U_2(M, σ_1, σ'_2) ≤ U_2(M, σ_1, σ_2)
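Since U_i(M, σ_1, σ_2, T) is just a finite-horizon average, it can be estimated by simulating the repeated game. A minimal sketch, assuming stage payoffs are stored as a k_1 × k_2 × 2 array and policies are functions of the joint history; the function name and the example matrix are illustrative, not from the paper.

```python
import numpy as np

def average_reward(payoffs, policy1, policy2, T, rng=None):
    """Simulate T periods of the repeated game with stage game `payoffs`
    (a k1 x k2 x 2 array) and return each player's realized average reward.
    Averaging over many runs estimates U_i(M, sigma_1, sigma_2, T) when the
    policies randomize."""
    if rng is None:
        rng = np.random.default_rng(0)
    history, totals = [], np.zeros(2)
    for _ in range(T):
        a1 = policy1(history, rng)   # a policy maps the joint history to an action
        a2 = policy2(history, rng)
        totals += payoffs[a1, a2]
        history.append((a1, a2))
    return totals / T

# Example: both players repeatedly play their first action in a 2x2 stage game.
G = np.array([[[6, 0],    [0, 100]],
              [[5, -100], [1, 500]]], dtype=float)
always_first = lambda history, rng: 0
print(average_reward(G, always_first, always_first, T=1000))  # -> [6. 0.]
```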
Efficient Learning Equilibrium

• Adds requirements on the speed of convergence to the basic definition of learning equilibrium
• ∀ ε > 0 and 0 < δ < 1, there exists some T = poly(1/ε, 1/δ, k) such that ∀ t ≥ T: if player 1 switches from strategy σ_1 to σ'_1 at iteration l, then
    U_1(M, σ'_1, σ_2, l + t) ≤ U_1(M, σ_1, σ_2, l + t) + ε,
  with probability of failure bounded above by δ
Efficient Learning Equilibrium (cont’d)

• To provide an IR constraint, the authors require that the utilities be at least those of some Nash equilibrium (which must be at least the minimax payoffs)
• Note that the limit is dropped: if there exists a policy that is beneficial within polynomial time but disastrous in the long term, we do not have an ELE
  – Toy example: the opponent plays a trigger strategy whose punishment starts only after an exponential delay
Theorem 1 (BT, 7)

• There exists an ELE for any perfect monitoring setting
• Three stages (if no deviation occurs; sketched in code below):
  – Exploration
  – Offline computation of an equilibrium
  – Play of the equilibrium
• The proof demonstrates that deviation can be punished during the exploration stage
  – No subgame perfection is demonstrated
  – As with folk theorems, more general results can be obtained (Pareto-ELE)
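A minimal sketch of player 1’s side of this protocol, under simplifying assumptions: payoffs observed so far are kept in a NaN-masked k_1 × k_2 × 2 array, punishment uses a pure-strategy minimax, and the offline equilibrium computation is replaced by a brute-force search for a pure Nash equilibrium. The function names (ele_player1, minimax_action, equilibrium_action) are illustrative, not from the paper.

```python
import itertools
import numpy as np

def ele_player1(G_obs, k1, k2, opponent_history):
    """Player 1's next action in a sketch of the perfect-monitoring ELE.

    Phase 1 (exploration): both players step through all k1*k2 joint actions
    in an agreed lexicographic order, revealing every payoff entry.
    Phases 2-3: once the matrix is known, compute an equilibrium offline and
    play it forever.  If the opponent ever leaves the agreed exploration
    schedule, switch permanently to a punishing (minimax) action.
    G_obs is a k1 x k2 x 2 array of observed payoffs, NaN where unseen."""
    schedule = list(itertools.product(range(k1), range(k2)))
    t = len(opponent_history)

    # Punish any deviation from the agreed exploration schedule.
    for seen_a2, (_, expected_a2) in zip(opponent_history, schedule):
        if seen_a2 != expected_a2:
            return minimax_action(G_obs)

    if t < len(schedule):                  # still exploring
        return schedule[t][0]
    return equilibrium_action(G_obs, k1)   # matrix fully revealed: play equilibrium

def minimax_action(G_obs):
    # Pure-strategy minimax against player 2, using the entries seen so far
    # (unknown entries treated as +inf for the opponent, i.e. pessimistically).
    opp = np.where(np.isnan(G_obs[..., 1]), np.inf, G_obs[..., 1])
    return int(np.argmin(opp.max(axis=1)))

def equilibrium_action(G_obs, k1):
    # Brute-force search for a pure Nash equilibrium of the learned game
    # (the paper would allow any standard equilibrium computation here).
    for a1 in range(k1):
        for a2 in range(G_obs.shape[1]):
            if (G_obs[a1, a2, 0] >= G_obs[:, a2, 0].max()
                    and G_obs[a1, a2, 1] >= G_obs[a1, :, 1].max()):
                return a1
    return 0  # fallback if no pure equilibrium exists
```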
Theorem 1 (BT, 7) cont’d

• If player 2 deviates during exploration, player 1 minimaxes player 2 using the payoffs learned so far (player 1 may continue exploring to learn the payoffs needed to minimax player 2)
  – Note that if player 2 starts exploring again, he/she might do better, but can never do better than the maximin value of the game
• Lemma 1: Chernoff bounds handle the case where the minimax strategy requires randomizing
• Lemma 2: Let R_max be the maximal payoff; then there exists some z polynomial in R_max, k, 1/ε and 1/δ such that if player 1 punishes player 2 as prescribed for z steps, then either player 1 learns a new entry or player 2 is held to the desired minimax payoff with high probability
Theorem 1 (BT, 7) cont’d

• Given that k is an upper bound on the number of actions for each player, player 1 can learn at most k² − 1 new entries after a deviation
• Thus the probabilistic minimax can be reached in a polynomial number of moves
• With a second Chernoff bound, the authors conclude that the actual payoff is within ε of the expected minimax value with probability 1 − δ, with only a polynomial (linear) increase in the number of trials as 1/δ and 1/ε grow (a sample-size calculation is sketched below)
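To make the sample-complexity claim concrete, here is a Hoeffding-style bound, used here in place of the paper’s Chernoff argument as an illustrative assumption, on the number of punishment rounds needed for the empirical average payoff to be within ε of its expectation.

```python
import math

def punishment_steps(r_max, eps, delta):
    """Hoeffding-style sample bound: number of i.i.d. punishment rounds with
    rewards in [0, r_max] needed so the empirical average is within eps of its
    expectation with probability at least 1 - delta."""
    return math.ceil((r_max ** 2) / (2 * eps ** 2) * math.log(2 / delta))

# The bound grows quadratically in 1/eps and logarithmically in 1/delta,
# i.e. polynomially in 1/eps and 1/delta, as the theorem requires.
print(punishment_steps(r_max=1.0, eps=0.1, delta=0.05))   # 185
print(punishment_steps(r_max=1.0, eps=0.05, delta=0.05))  # 738
```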
Weaknesses

• Proofs are restricted to the 2-player setting
• The trigger strategies used are very far from subgame perfection
• Agents that care about average payoffs are a significant departure from discounting agents
• How to choose among multiple equilibria?
• The exhibited learning algorithms seem naïve
  – Explore the entire state space, then simply compute an equilibrium
Theorem 2 (BT, 7)

• An ELE does not always exist in the imperfect monitoring setting (entries below are player 1’s payoff, player 2’s payoff)

  M_1:   6, 0     0, 100
         5, -100  1, 500

  M'_1:  0, 1     6, 9
         5, 11    1, 10

• In both M_1 and M'_1, player 1 has the same payoffs. Both games have a unique Nash equilibrium, and the equilibrium of M_1 must be played for an ELE
• Player 2 benefits if he plays as if he were in M'_1
Theorem 2 (BT, 7) cont’d

• This contradicts the definition of ELE: player 2 immediately and permanently benefits from a unilateral deviation
  – In the example it seems that player 2 must know player 1’s payoffs, but not vice versa

  M_1:   6, 0     0, 100
         5, -100  1, 500

  M'_1:  0, 1     6, 9
         5, 11    1, 10

• Player 2 could just pretend that playing Right is a dominant strategy (a best-response check on M_1 is sketched below)
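The equilibrium structure behind such counterexamples can be checked mechanically. A minimal sketch restricted to pure strategies, applied to M_1 as it appears on the slide; the function name pure_nash_equilibria is illustrative.

```python
import itertools
import numpy as np

def pure_nash_equilibria(game):
    """Return all pure-strategy Nash equilibria of a bimatrix game given as a
    k1 x k2 x 2 array of (player-1, player-2) payoffs."""
    k1, k2, _ = game.shape
    eqs = []
    for a1, a2 in itertools.product(range(k1), range(k2)):
        best_for_1 = game[a1, a2, 0] >= game[:, a2, 0].max()
        best_for_2 = game[a1, a2, 1] >= game[a1, :, 1].max()
        if best_for_1 and best_for_2:
            eqs.append((a1, a2))
    return eqs

# M_1 as on the slide: rows are player 1's actions, columns player 2's.
M1 = np.array([[[6, 0],    [0, 100]],
               [[5, -100], [1, 500]]], dtype=float)
print(pure_nash_equilibria(M1))  # [(1, 1)]: the unique equilibrium of M_1
```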
Theorem 3 (BT, 9)

• There exists an ELE for the class of common-interest games under strict imperfect monitoring
  – Agents know their own action and payoffs, but neither the action nor the payoff of the opponent
• Proof outline: proceed as in Thm. 1, but explore by independently randomizing over actions until both agents are confident that all actions have been seen
Theorem 3 (BT, 9) cont’d

• If both agents play the action that led to the highest reward they have seen, they are guaranteed to coordinate (see the sketch below)
• Concerns
  – Common interest implies players DO know the opponent’s payoffs
  – If players don’t know the opponent’s number of actions, how do they decide when to stop exploring?
  – If players do know the opponent’s number of actions, why not use the direct result from Thm. 1?
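A minimal sketch of this explore-then-commit idea for a two-player common-interest game, assuming the maximal payoff is attained by a unique joint action so the commitment step coordinates; the function name and example payoff matrix are illustrative.

```python
import numpy as np

def explore_and_coordinate(common_payoff, n_explore, rng=None):
    """Theorem-3-style strategy for a 2-player common-interest game under
    strict imperfect monitoring: both agents randomize independently for
    n_explore rounds, each observing only its own action and the (common)
    reward, then commit to the own-action component of the best reward seen."""
    if rng is None:
        rng = np.random.default_rng(0)
    k1, k2 = common_payoff.shape
    best = [(-np.inf, None), (-np.inf, None)]   # per agent: (best reward seen, own action)
    for _ in range(n_explore):
        a1, a2 = rng.integers(k1), rng.integers(k2)
        r = common_payoff[a1, a2]               # both agents receive the same reward
        if r > best[0][0]:
            best[0] = (r, int(a1))
        if r > best[1][0]:
            best[1] = (r, int(a2))
    return best[0][1], best[1][1]               # the joint action both now repeat

# Example: the unique optimum is the joint action (2, 0) with payoff 10.
G = np.array([[1., 3., 0.],
              [4., 2., 5.],
              [10., 6., 7.]])
print(explore_and_coordinate(G, n_explore=200))  # -> (2, 0)
```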
Pareto ELE

• Exploits repeated-game strategies to allow a wider range of payoffs
  – Differs in that it allows side payments
  – Now within ε of an economically efficient outcome rather than of a NE
• Given an efficient joint action (P_1(G), P_2(G)), with value PV_i(M) for agent i, we now require
    U_1(M, σ_1, σ_2, t) + U_2(M, σ_1, σ_2, t) ≥ PV_1(M) + PV_2(M) − ε
• Same condition as before: with probability 1 − δ, deviation gains less than ε after polynomial time
Theorem 4

• There exists a Pareto ELE for any perfect monitoring setting
• Proof outline: exploration proceeds as in the regular ELE
• Pay a player if she receives less than her probabilistic maximin value
• By the definition of Pareto optimality, both players then exceed their maximin values
• Use the same punishment approach as before (a side-payment sketch follows)
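A minimal sketch of the target outcome with side payments: play the joint action that maximizes total payoff, then transfer just enough so that no player falls below her pure-strategy maximin value. The transfer rule and function name are illustrative assumptions, not the paper’s exact scheme; M_1 from the Theorem 2 slide is reused only as a concrete matrix.

```python
import numpy as np

def pareto_play_with_side_payments(game):
    """Pick the joint action maximizing the sum of payoffs in a k1 x k2 x 2
    bimatrix game, then apply a simple side payment so that neither player
    ends up below her pure-strategy maximin value."""
    totals = game.sum(axis=2)
    a1, a2 = np.unravel_index(np.argmax(totals), totals.shape)
    u = game[a1, a2].astype(float)

    maximin = np.array([game[..., 0].min(axis=1).max(),   # player 1's security level
                        game[..., 1].min(axis=0).max()])  # player 2's security level

    for i, j in [(0, 1), (1, 0)]:
        shortfall = maximin[i] - u[i]
        if shortfall > 0:             # player j compensates player i
            u[i] += shortfall
            u[j] -= shortfall
    return (int(a1), int(a2)), u

G = np.array([[[6, 0],    [0, 100]],
              [[5, -100], [1, 500]]], dtype=float)
print(pareto_play_with_side_payments(G))  # ((1, 1), array([  1., 500.]))
```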
Stochastic games

• Players observe payoffs and new states, and must build a model of the probabilistic transitions (a model-estimation sketch follows below)
• Nash equilibrium results for average payoffs are hard to prove, so the authors work in the Pareto-ELE setting with side payments
• Ergodicity assumption: every state is reachable from every other state
  – Combined with a finite number of states, this implies we can expect to explore the entire game matrix in finite time
• Results are polynomial in 1/δ, 1/ε, and T_mix
  – T_mix denotes the ε-return mixing time
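The model-building step can be as simple as maximum-likelihood estimation of transition probabilities from observed counts. A minimal sketch; the tuple format and function name are illustrative, and the paper’s algorithm additionally tracks which entries are “known” (visited often enough), in the spirit of E³.

```python
from collections import Counter, defaultdict

def estimate_transition_model(transitions):
    """Maximum-likelihood estimate of P(s' | s, a1, a2) from a list of observed
    (s, a1, a2, s') tuples -- the model-building step players must perform in
    the stochastic-game setting."""
    counts = defaultdict(Counter)
    for s, a1, a2, s_next in transitions:
        counts[(s, a1, a2)][s_next] += 1
    model = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        model[key] = {s_next: n / total for s_next, n in ctr.items()}
    return model

# Example: three observed transitions from state 0 under joint action (1, 0).
obs = [(0, 1, 0, 2), (0, 1, 0, 2), (0, 1, 0, 1)]
print(estimate_transition_model(obs))  # {(0, 1, 0): {2: 0.67, 1: 0.33}} (approx.)
```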
ε-Mixing Time

• Informally, the time it takes for the expected average reward to approach the infinite-horizon average reward from every state s
• T_mix is the minimum t such that ∀ s ∈ S:
    U(s, σ_1, σ_2, t) > U(s, σ_1, σ_2) − ε
  (a computation sketch follows below)
• How long is this for a "reasonable" transition function?
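For a fixed pair of policies, the ε-return mixing time can be computed directly from the induced Markov reward process. A minimal sketch assuming an ergodic chain with known transition matrix P and expected reward vector r; the function name is illustrative.

```python
import numpy as np

def epsilon_return_mixing_time(P, r, eps, t_max=10_000):
    """Smallest t such that, from every state, the expected t-step average
    reward is within eps of the long-run average reward.  P is the n x n
    transition matrix induced by the fixed policies, r the n-vector of
    expected one-step rewards.  Assumes an ergodic chain."""
    n = len(r)
    # Long-run average reward: stationary distribution dotted with rewards.
    pi = np.ones(n) / n
    for _ in range(100_000):
        new_pi = pi @ P
        if np.allclose(new_pi, pi, atol=1e-12):
            break
        pi = new_pi
    u_inf = pi @ r

    cumulative = np.zeros(n)      # E[sum of rewards over the first t steps | start state]
    expected_r = r.copy()         # E[reward at the current step | start state]
    for t in range(1, t_max + 1):
        cumulative += expected_r
        if np.all(cumulative / t > u_inf - eps):
            return t
        expected_r = P @ expected_r
    return None                   # did not mix within t_max steps

# A 2-state example chain with rewards 1 and 0.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])
print(epsilon_return_mixing_time(P, r, eps=0.1))  # -> 14
```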
Theorem 6 (BT, 13)

• A Pareto-ELE exists in a stochastic game if (1) the agents have perfect monitoring and (2) T_mix is known
• The proof is similar to the previous ones; it requires the E³ approach and results from Learning to coordinate efficiently (BT, [5])
Extensions

• Move towards credible threats where the strategies are not SPE
  – Automated agents can implement unrealistic threats
• More results in the case of imperfect monitoring
  – May require probabilistic reasoning or conditional priors rather than just learning the entire game matrix
• A model of Pareto-ELE based on cycling through outcomes rather than side payments?
Conclusions

• Pros:
  – We get objective convergence, not convergence in beliefs
  – Punishment is relatively quick, if not discounted
• Cons:
  – NO priors!
  – Discounting seems a more realistic model of behavior
  – A hard time horizon for punishment may be required, otherwise agents will try to delay costly punishment forever
  – Trigger strategies are far from SPE