  1. Beliefs and Learning in Repeated Games
     Florin Constantin and Ivo Parashkevov
     March 15, 2006

  2. Context
     • 2-player discounted repeated games [can be extended to n-player]
     • want to provably learn equilibrium play, as quickly as possible and with as little info as possible

  3. Rational (Bayesian) Learning
     • use beliefs about opponents’ strategies to guide prediction of future play
     • play Best Response to beliefs
     • update beliefs based on actual play
     • learning = recurrently update beliefs until convergence to equilibrium

  4. Belief Learning vs. Bayesian Learning
     • Behavior Strategy: history → distribution over opponent’s play in next period.
       Example: h_opp = (C, C, D) → Pr_{t=4}(C) = 2/3, Pr_{t=4}(D) = 1/3
     • Belief Learning - prediction rule as behavior strategy: associate probabilities with future play of opponents based on play history. Best Respond to prediction rule
     • Bayesian Learning - Best Respond to beliefs
     • Belief Learning as Bayesian Learning: Best Respond to belief that puts probability 1 on the behavior strategy predicted by the prediction rule
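The prediction rule in the example above is just the empirical frequency of the opponent’s past actions. A minimal Python sketch (not part of the original slides; the function name and the C/D encoding are illustrative):

```python
from collections import Counter

def empirical_prediction(opponent_history, actions=("C", "D")):
    """Prediction rule as a behavior strategy: map the opponent's observed
    history to a distribution over their next-period action, using empirical
    frequencies (uniform over actions if the history is empty)."""
    if not opponent_history:
        return {a: 1.0 / len(actions) for a in actions}
    counts = Counter(opponent_history)
    n = len(opponent_history)
    return {a: counts[a] / n for a in actions}

# Example from the slide: h_opp = (C, C, D)  ->  Pr(C) = 2/3, Pr(D) = 1/3
print(empirical_prediction(("C", "C", "D")))
```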

  5. Belief Learning vs. Bayesian Learning II
     • Bayesian Learning as Belief Learning: For any belief B of player 1 over player 2’s behavior strategies, there exists an equivalent belief assigning probability 1 to a particular behavior strategy (called the reduced form of B).
       Prediction rule: predict the reduced form

  6. Fictitious Play
     • P(opponent plays s at time t) = t/(t+k) · (freq of s up to time t) + k/(t+k) · prior(s)
     • Assumptions
       – myopia
     • if it converges, it converges to NE
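A sketch of the fictitious-play update, assuming the reconstruction of the formula above (empirical frequency weighted by t/(t+k), prior weighted by k/(t+k)); the function name and the prior weight k = 1 are illustrative:

```python
def fictitious_play_belief(opponent_history, prior, k, actions=("C", "D")):
    """Fictitious play: mix the empirical frequency of the opponent's past
    actions (weight t/(t+k)) with a prior distribution (weight k/(t+k))."""
    t = len(opponent_history)
    freq = {a: (opponent_history.count(a) / t if t else 0.0) for a in actions}
    return {a: (t / (t + k)) * freq[a] + (k / (t + k)) * prior[a] for a in actions}

# With history (C, C, D), a uniform prior and k = 1:
# C gets (3/4)*(2/3) + (1/4)*(1/2) = 0.625, D gets 0.375.
print(fictitious_play_belief(("C", "C", "D"), {"C": 0.5, "D": 0.5}, k=1))
```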

  7. Calibrated Learning
     • use forecasts; if
       – every player plays Best Response to forecasts
       – forecasts are calibrated
       then learning converges to Correlated Equilibrium
     • history is correlating device (umpire)
     • Assumptions
       – stationary tie-breaking rule
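Calibration of forecasts can be checked empirically: among the periods in which the forecaster announced probability close to p for an action, that action should have occurred with frequency close to p. A hedged sketch of such a check (the binning scheme and names are illustrative, not from the slides):

```python
from collections import defaultdict

def calibration_gaps(forecasts, outcomes, bins=10):
    """Group periods by the announced Pr(opponent plays C), rounded into bins,
    and compare the average forecast in each bin with the realized frequency
    of C there.  A calibrated forecaster has all gaps close to zero."""
    groups = defaultdict(list)
    for p, outcome in zip(forecasts, outcomes):
        groups[round(p * bins) / bins].append((p, 1.0 if outcome == "C" else 0.0))
    return {b: abs(sum(p for p, _ in pairs) / len(pairs) -
                   sum(o for _, o in pairs) / len(pairs))
            for b, pairs in groups.items()}

# forecasts[t] = announced Pr(opponent plays C at t); outcomes[t] = realized action
print(calibration_gaps([0.7, 0.7, 0.3, 0.7], ["C", "D", "D", "C"]))
```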

  8. Problematic assumptions in papers so far
     • myopia
       – ignores strategic considerations about the future - cannot experiment for long-run benefit
       – can only implement those NE of the repeated game that consist of stage-game NE (e.g. no trigger strategies)
     • observable rewards
     • common prior

  9. Kalai & Lehrer - Rational Learning • Setting: – n -player infinitely repeated discounted games – subjective rationality - best responding to beliefs – learning is through Bayesian updating of individual prior – encode beliefs as behavior strategies • Main result: if individual beliefs are com- patible with actual play then best response to beliefs leads to accurate prediction of future play. Play converges to Nash equi- librium play . 8

  10. Assumptions
      • Perfect monitoring - observe actions of other players
      • Independence of other players’ actions and beliefs
      • No longer assume common prior or myopia
      • Opponents not assumed to be rational
      • Knowledge of own payoff matrix

  11. Some Notation
      • n finite sets Σ_1, Σ_2, ..., Σ_n of actions
      • H_t - set of histories of length t. H = ∪_t H_t is the set of all finite histories.
      • behavior strategy of player i is a function f_i : H → Δ(Σ_i), where Δ(Σ_i) denotes the set of probability distributions over Σ_i
      • μ_f is the probability distribution over the set of infinite play paths induced by the strategy vector f

  12. Absolute Continuity and Grain of Truth Assumptions
      What does it mean to have "beliefs compatible with actual play"?
      • Do not assign zero probability to events that can occur in the play of the game.
      • "Grain of Truth" - beliefs about opponent’s play assign a (small) positive probability to the strategy actually chosen.
        – Sufficient, but stronger than needed.
      • Absolute Continuity - measure μ_f is absolutely continuous w.r.t. μ_g (μ_f ≪ μ_g) if μ_f(A) > 0 ⇒ μ_g(A) > 0 for all sets A ⊆ Σ^∞.
      • Main result requires: actual μ ≪ belief μ̃_i

  13. Prisoner’s Dilemma Example

                 D        C
          D     0,0     2,-1
          C    -1,2      1,1

      • Consider strategies
        – g_∞: grim trigger
        – g_t: use grim trigger until time t, then defect forever
      • P1 assigns probs (β_0, β_1, ..., β_∞) to P2 playing (g_0, g_1, ..., g_∞), β_t > 0. P2 assigns probs (α_0, α_1, ..., α_∞) to P1 playing (g_0, g_1, ..., g_∞), α_t > 0.
      • According to own learning parameters, P1 chooses g_{t_1} and P2 chooses g_{t_2}.
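A minimal simulation of the strategies above (assumed encoding: g_t plays grim trigger until period t and then defects forever; g_∞ is plain grim trigger). The function names and the period indexing are illustrative:

```python
def grim_until(t_star=None):
    """g_t from the slide: cooperate as long as the opponent has always
    cooperated, but defect forever from period t_star on; t_star=None gives
    g_infinity (plain grim trigger)."""
    def strategy(period, opponent_history):
        if t_star is not None and period >= t_star:
            return "D"
        return "D" if "D" in opponent_history else "C"
    return strategy

def play(s1, s2, periods=8):
    h1, h2 = [], []  # actions taken so far by P1 and P2
    for t in range(periods):
        a1, a2 = s1(t, tuple(h2)), s2(t, tuple(h1))
        h1.append(a1); h2.append(a2)
    return list(zip(h1, h2))

# P1 plays g_3, P2 plays g_5: joint cooperation until min(t_1, t_2) = 3, where
# P1 starts defecting; P2's grim trigger then defects from period 4 on.
print(play(grim_until(3), grim_until(5)))
```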

  14. Prisoner’s Dilemma Example
      • all events with positive probability in the game (C until time t < min(t_1, t_2), D after min(t_1, t_2), etc.) are assigned positive probability by players’ beliefs: beliefs are compatible with actual play.
      • learning must occur - if t_1 < t_2 then P2 will assign prob 1 to P1 playing g_{t_1} from time t_1 + 1 on. So P2 knows that P1 will defect forever.
      • P1 will not know P2’s strategy - will only know that t_2 > t_1, but will be able to predict that P2 will defect forever as well - future play is learned only on the play path.

  15. Prisoner’s Dilemma Example
      What if t_1 = t_2 = ∞?
      • after time t, P1 knows P2 did not play g_0, ..., g_t and assigns probabilities
        (β_{t+1}, ..., β_∞) / Σ_{r=t+1}^∞ β_r
        to (g_{t+1}, ..., g_∞). Since β_∞ > 0, β_∞ / Σ_{r=t+1}^∞ β_r → 1 as t → ∞.
      • P1 becomes more and more confident that P2 is playing g_∞, but never knows for sure.
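A numerical illustration of the posterior computation above: after t observed periods of cooperation, the weights on the already-refuted strategies g_0, ..., g_t are dropped and the rest renormalized, so the weight on g_∞ rises toward 1. The prior below (β_r proportional to 2^-(r+1), β_∞ = 0.5) is purely illustrative:

```python
def posterior_on_g_infinity(beta, beta_inf, t, horizon=10_000):
    """Posterior weight on g_infinity after t periods of observed cooperation:
    beta_inf divided by beta_inf plus the surviving finite-horizon mass
    beta_{t+1} + beta_{t+2} + ... (truncated at `horizon` for this sketch)."""
    surviving_mass = sum(beta(r) for r in range(t + 1, horizon))
    return beta_inf / (beta_inf + surviving_mass)

beta = lambda r: 0.5 * 2 ** -(r + 1)   # illustrative prior over g_0, g_1, ...
for t in (0, 5, 10, 20):
    print(t, round(posterior_on_g_infinity(beta, 0.5, t), 6))
# The weight on g_infinity grows toward 1 but never reaches it at any finite t.
```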

  16. Definitions
      • Let ε > 0 and μ, μ̃ be two probability measures. μ is ε-close to μ̃ if ∃ a set Q such that
        – μ(Q) > 1 − ε and μ̃(Q) > 1 − ε
        – ∀ A ⊆ Q, (1 − ε) μ̃(A) ≤ μ(A) ≤ (1 + ε) μ̃(A)
      • f plays ε-like g if μ_f is ε-close to μ_g.
      • Let f be a strategy, t ≥ 0 and h a history up to time t. The induced strategy f_h(·) is defined by f_h(h′) = f(concat(h, h′)) for every finite history h′.

  17. Theorem 1
      Let f be the strategy vector that is actually played and f̃_i be the beliefs of player i. Assume f is absolutely continuous with respect to f̃_i. Then ∀ ε > 0, for almost every play path z, ∃ a time T(z, ε) such that ∀ t ≥ T(z, ε), the induced strategy f_{z(t)} plays ε-like (f̃_i)_{z(t)}.
      If players maximize payoff then they will eventually be playing a subjective ε-equilibrium:
      • each player plays a Best Response to own beliefs
      • these beliefs are never going to be contradicted by actual play (up to ε)
      Interpretation?

  18. Theorem 2
      Let f be the strategy vector that is actually played and f̃_1, ..., f̃_n be the beliefs of players 1, ..., n. Assume
      • f is absolutely continuous with respect to each f̃_i
      • each player plays a Best Response to its beliefs.
      Then ∀ ε > 0, for almost every play path z, ∃ a time T(z, ε) such that for all t ≥ T(z, ε) there exists an ε-Nash Equilibrium f̄ of the repeated game such that f_{z(t)} plays ε-like f̄.

  19. Comments
      • Theorem 1 does not assume anything about players’ strategies.
      • Convergence of beliefs with reality occurs only on the actual play path. Players do not learn what their opponents will do in response to actions that will not be taken.
      • If players are best responding (Theorem 2), then convergence is to NE play in the repeated game, not to repeated play of a single stage NE.
      • Convergence is to an equilibrium play, not to an equilibrium. We are not learning Nash strategies, but we can learn to play as if we knew them.

  20. So what?
      If
      • Assumptions are met
      • All other players play Best Response to their beliefs
      can you do better?

  21. Beliefs in Repeated Games - Nachbar 2005
      Main Result: For a large class of repeated games, beliefs cannot simultaneously satisfy:
      • learnability
      • consistency
      • CSP (a diversity-of-belief condition)

  22. Learnability - informally
      Player 1 learns to predict the path of play generated by σ_2 if her one-period-ahead forecasts along the path of play eventually become almost as accurate as if she knew σ_2.

  23. Learnability - formally
      Fix a belief β_2 of player 1 about player 2’s strategy.
      Player 1 learns to predict the path of play generated by behavior strategy σ = (σ_1, σ_2) iff
      • ∀ finite history h, μ_{(σ_1, σ_2)}(h*) > 0 ⇒ μ_{(σ_1, β_2)}(h*) > 0, where h* = the set of all paths of play starting with h
      • ∀ ε > 0 and for almost all paths of play z, ∃ T(ε, z) such that for any time t > T(ε, z) and any action a_2 of player 2,
        |(the prob that σ_2(h) assigns to a_2) − (the prob that σ_{β_2}(h) assigns to a_2)| < ε
        where h = the first t stages of z and σ_{β_2} = the reduced form of β_2.
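A sketch of the second condition, checked along a single play path: compare the one-period-ahead probabilities that the true strategy σ_2 and the reduced form of the belief assign to player 2’s actions, given the history so far. Everything here (names, dict-valued strategies) is an assumed encoding, not the paper’s notation:

```python
def worst_prediction_error_after(T, path, true_sigma2, reduced_belief2):
    """Largest one-period-ahead forecast error after period T along the path:
    max over t >= T and over player 2's actions of the gap between the
    probabilities assigned by sigma_2 and by the reduced form of the belief.
    Learnability requires this to fall below every epsilon for T large enough."""
    worst = 0.0
    for t in range(T, len(path)):
        history = path[:t]                # the first t stages of the play path z
        p_true = true_sigma2(history)     # dict: player 2's action -> probability
        p_belief = reduced_belief2(history)
        for a2 in p_true:
            worst = max(worst, abs(p_true[a2] - p_belief.get(a2, 0.0)))
    return worst
```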

  24. CSP
      Two conditions: CS and P, both addressing the richness of Σ̂. All restrictions are only on the path of play!
      • CS - (Weak) Caution and Symmetry
        – s_1 is a simple variant of s_2 if s_1 can be generated from s_2 by a uniform relabeling of actions
        – Weak Caution means: if I believe you could play the pure strategy s_1, then I also believe you could play all simple variants of s_1. Strong Caution would mean Ŝ_i = S_i
        – Symmetry means: if I believe that you can play s_1, then you believe I can play all simple variants of s_1.

  25. CSP (continued)
        – Symmetry is motivated by the need for equally powerful strategy-generating machines.
      • P
        – if a behavior strategy σ_2 is in Σ̂_2, then at least one pure strategy that coarsely approximates σ_2 is in Σ̂_2 as well.
