Generalised weakened fictitious play and random belief learning
David S. Leslie
12 April 2010
Collaborators: Sean Collins, Claudio Mezzetti, Archie Chapman
Overview
• Learning in games
• Stochastic approximation
• Generalised weakened fictitious play
  – Random belief learning
  – Oblivious learners
Normal form games
• Players i = 1, . . . , N
• Action sets A^i
• Reward functions r^i : A^1 × · · · × A^N → ℝ
Mixed strategies
• Mixed strategies π^i ∈ Δ^i
• Joint mixed strategy π = (π^1, . . . , π^N)
• Reward function extended so that r^i(π) = E_π[ r^i(a) ]
Best responses
Assume the other players use mixed strategy π^{-i}. Player i should choose a mixed strategy in the best response set
    b^i(π^{-i}) = argmax_{π̃^i ∈ Δ^i} r^i(π̃^i, π^{-i})
A Nash equilibrium is a fixed point of the best response map: π^i ∈ b^i(π^{-i}) for all i
A problem with Nash
Consider the game
    [ (2, 0)  (0, 1) ]
    [ (0, 2)  (1, 0) ]
with unique Nash equilibrium π^1 = (2/3, 1/3), π^2 = (1/3, 2/3)
• r^i(a^i, π^{-i}) = 2/3 for each i and each a^i
• How does Player 1 know to use π^1 = (2/3, 1/3)?
• How does Player 2 know to use π^2 = (1/3, 2/3)?
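As a quick check of the claim that every action earns 2/3 at this equilibrium: against π^2 = (1/3, 2/3), Player 1 gets
    r^1(top, π^2) = 2 · 1/3 + 0 · 2/3 = 2/3   and   r^1(bottom, π^2) = 0 · 1/3 + 1 · 2/3 = 2/3,
and against π^1 = (2/3, 1/3), Player 2 gets
    r^2(left, π^1) = 0 · 2/3 + 2 · 1/3 = 2/3   and   r^2(right, π^1) = 1 · 2/3 + 0 · 1/3 = 2/3,
so each player is indifferent between its own actions and has no direct incentive to play its equilibrium mixture.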
Learning in games
• Attempts to justify equilibrium play as the end point of a learning process
• Generally assumes pretty stupid players!
• Related to evolutionary game theory
Multi-armed bandits
At time n, choose action a_n and receive reward R_n
Multi-armed bandits
The estimate after time n of the expected reward for action a ∈ A is
    Q_n(a) = ( Σ_{m ≤ n : a_m = a} R_m ) / κ_n(a),   where κ_n(a) = Σ_{m=1}^n I{a_m = a}
Multi-armed bandits
If a_n ≠ a, then κ_n(a) = κ_{n-1}(a) and
    Q_n(a) = ( Σ_{m=1}^{n-1} I{a_m = a} R_m + 0 ) / κ_{n-1}(a) = Q_{n-1}(a)
Multi-armed bandits
If a_n = a, then
    Q_n(a) = ( Σ_{m=1}^{n-1} I{a_m = a} R_m + R_n ) / κ_n(a)
           = ( 1 − 1/κ_n(a) ) Q_{n-1}(a) + ( 1/κ_n(a) ) R_n
Multi-armed bandits
Update the estimates using
    Q_n(a) = Q_{n-1}(a) + ( 1/κ_n(a) ) { R_n − Q_{n-1}(a) }   if a_n = a
    Q_n(a) = Q_{n-1}(a)                                       if a_n ≠ a
At time n + 1, use Q_n to choose an action a_{n+1}
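A minimal sketch of this incremental update in Python. The estimator itself follows the slide; the ε-greedy rule for choosing a_{n+1} is an assumption added for illustration, since the slides only say "use Q_n to choose an action".

```python
import random

class BanditEstimator:
    """Incremental sample-average estimates Q_n(a) of the expected reward of each action."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.Q = {a: 0.0 for a in self.actions}     # Q_n(a)
        self.kappa = {a: 0 for a in self.actions}   # kappa_n(a): times a has been played

    def update(self, a, reward):
        # Q_n(a) = Q_{n-1}(a) + (1/kappa_n(a)) * (R_n - Q_{n-1}(a)) when a_n = a;
        # estimates for all other actions are left unchanged.
        self.kappa[a] += 1
        self.Q[a] += (reward - self.Q[a]) / self.kappa[a]

    def choose(self, explore=0.1):
        # Assumed epsilon-greedy selection of a_{n+1} from Q_n (not specified on the slide).
        if random.random() < explore:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[a])
```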
Fictitious play
At iteration n + 1, player i:
• forms beliefs σ^{-i}_n ∈ Δ^{-i} about the other players' strategies
• chooses an action in b^i(σ^{-i}_n)
Belief formation
The beliefs about player j are simply the MLE:
    σ^j_n(a^j) = κ^j_n(a^j) / n,   where κ^j_n(a^j) = Σ_{m=1}^n I{a^j_m = a^j}
Recursive update:
    σ^j_{n+1}(a^j) = κ^j_{n+1}(a^j) / (n+1)
                   = ( κ^j_n(a^j) + I{a^j_{n+1} = a^j} ) / (n+1)
                   = ( n/(n+1) ) σ^j_n(a^j) + ( 1/(n+1) ) I{a^j_{n+1} = a^j}
so that
    σ^j_{n+1} = ( 1 − 1/(n+1) ) σ^j_n + ( 1/(n+1) ) e_{a^j_{n+1}}
In terms of best responses:
    σ^j_{n+1} ∈ ( 1 − 1/(n+1) ) σ^j_n + ( 1/(n+1) ) b^j(σ^{-j}_n)
or, jointly,
    σ_{n+1} ∈ ( 1 − 1/(n+1) ) σ_n + ( 1/(n+1) ) b(σ_n)
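A sketch of two-player fictitious play built from this recursive belief update. The payoff matrices, the uniform initial beliefs, and tie-breaking by `argmax` are assumptions made for illustration.

```python
import numpy as np

def fictitious_play(A, B, n_iter=1000):
    """Two-player fictitious play.
    A[i, j]: Player 1's reward and B[i, j]: Player 2's reward when actions (i, j) are played.
    Returns the empirical beliefs (sigma1, sigma2) after n_iter plays."""
    n1, n2 = A.shape
    sigma1 = np.ones(n1) / n1   # belief about Player 1 (assumed uniform initialisation)
    sigma2 = np.ones(n2) / n2   # belief about Player 2
    for n in range(1, n_iter + 1):
        # Each player best-responds to the current belief about the opponent.
        a1 = int(np.argmax(A @ sigma2))    # Player 1's best response to sigma2
        a2 = int(np.argmax(sigma1 @ B))    # Player 2's best response to sigma1
        # Recursive MLE update: sigma_{n+1} = (1 - 1/(n+1)) sigma_n + (1/(n+1)) e_{a_{n+1}}
        step = 1.0 / (n + 1)
        e1 = np.zeros(n1); e1[a1] = 1.0
        e2 = np.zeros(n2); e2[a2] = 1.0
        sigma1 = (1 - step) * sigma1 + step * e1
        sigma2 = (1 - step) * sigma2 + step * e2
    return sigma1, sigma2
```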
Stochastic approximation
Stochastic approximation
    θ_{n+1} ∈ θ_n + α_{n+1} { F(θ_n) + M_{n+1} }
Stochastic approximation
    θ_{n+1} ∈ θ_n + α_{n+1} { F(θ_n) + M_{n+1} }
• F : Θ → Θ is a (bounded, u.s.c.) set-valued map
• α_n → 0, Σ_n α_n = ∞
• For any T > 0,
    lim_{n→∞} sup_{ k > n : Σ_{i=n}^{k−1} α_{i+1} ≤ T } ‖ Σ_{i=n}^{k−1} α_{i+1} M_{i+1} ‖ = 0
The last condition is implied by: Σ_n (α_n)^2 < ∞, E[M_{n+1} | θ_n] → 0, and Var[M_{n+1} | θ_n] < C almost surely.
Stochastic approximation
    θ_{n+1} ∈ θ_n + α_{n+1} { F(θ_n) + M_{n+1} }
    ( θ_{n+1} − θ_n ) / α_{n+1} ∈ F(θ_n) + M_{n+1}
        ↓
    (d/dt) θ ∈ F(θ),   a differential inclusion
(Benaïm, Hofbauer and Sorin, 2005)
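A minimal single-valued illustration of this framework (not from the talk; the sampling distribution and step sizes are assumptions): estimating a mean by stochastic approximation, with F(θ) = μ − θ and noise M_{n+1} = X_{n+1} − μ.

```python
import random

def sa_mean_estimate(sample, n_iter=10000, theta0=0.0):
    """Robbins-Monro style iteration theta_{n+1} = theta_n + alpha_{n+1} (X_{n+1} - theta_n).
    Here F(theta) = mu - theta and M_{n+1} = X_{n+1} - mu is zero-mean noise, so theta_n
    tracks the ODE d/dt theta = mu - theta and settles at the mean mu."""
    theta = theta0
    for n in range(1, n_iter + 1):
        alpha = 1.0 / (n + 1)          # alpha_n -> 0 and sum_n alpha_n = infinity
        theta += alpha * (sample() - theta)
    return theta

# Example: noisy samples with mean 0.7
print(sa_mean_estimate(lambda: random.gauss(0.7, 1.0)))
```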
Stochastic approximation
In fictitious play:
    σ_{n+1} ∈ σ_n + ( 1/(n+1) ) { b(σ_n) − σ_n }
        ↓
    (d/dt) σ ∈ b(σ) − σ,   the best response differential inclusion
Hence σ_n converges to the set of Nash equilibria in zero-sum games, potential games, and generic 2 × m games.
Generalised weakened fictitious play
Weakened fictitious play
• Van der Genugten (2000) showed that the convergence rate of fictitious play can be improved if players use ε_n-best responses (for two-player zero-sum games, and a very specific choice of ε_n)
• π ∈ b^{ε_n}(σ_n) ⇒ π ∈ b(σ_n) + M_{n+1}, where M_n → 0 as ε_n → 0 (by continuity properties of b and boundedness of r)
• For general games and general ε_n → 0 this fits into the stochastic approximation framework
Generalised weakened fictitious play
Theorem: Any process such that
    σ_{n+1} ∈ σ_n + α_{n+1} { b^{ε_n}(σ_n) − σ_n + M_{n+1} }
where
• ε_n → 0 as n → ∞
• α_n → 0 as n → ∞
• lim_{n→∞} sup_{ k > n : Σ_{i=n}^{k−1} α_{i+1} ≤ T } ‖ Σ_{i=n}^{k−1} α_{i+1} M_{i+1} ‖ = 0 for any T > 0
converges to the set of Nash equilibria for zero-sum games, potential games and generic 2 × m games.
Recency
• For classical fictitious play, α_n = 1/n, ε_n ≡ 0 and M_n ≡ 0
• For any α_n → 0 the conditions are met (since M_n ≡ 0)
• How about α_n = 1/√n, or even α_n = 1/log n ?
Recency Belief that Player 1 plays Heads over 200 plays of the two-player matching pennies game under clas- sical fictitious play (top), under a modified ficti- 1 tious play with α n = √ n (middle), and with α n = 1 log n (bottom)
Stochastic fictitious play
In fictitious play, players always choose pure actions ⇒ strategies never converge to mixed strategies (beliefs do, but played strategies do not)
Stochastic fictitious play
Instead consider smooth best responses:
    β^i_τ(σ^{-i}) = argmax_{π^i ∈ Δ^i} { r^i(π^i, σ^{-i}) + τ v(π^i) }
For example,
    β^i_τ(σ^{-i})(a^i) = exp{ r^i(a^i, σ^{-i}) / τ } / Σ_{a ∈ A^i} exp{ r^i(a, σ^{-i}) / τ }
Strategies evolve according to
    σ_{n+1} = σ_n + ( 1/(n+1) ) { β_τ(σ_n) + M_{n+1} − σ_n },   where E[M_{n+1} | σ_n] = 0
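A short sketch of the logit example above in Python (the numerical-stability shift and the example rewards are incidental choices, not from the slides).

```python
import numpy as np

def logit_best_response(rewards, tau):
    """Smooth best response beta_tau: probability of a proportional to exp(r(a)/tau).
    `rewards[a]` is r^i(a, sigma^{-i}); tau > 0 is the temperature."""
    z = np.asarray(rewards, dtype=float) / tau
    z -= z.max()                      # shift for numerical stability; leaves the ratio unchanged
    w = np.exp(z)
    return w / w.sum()

# Example: expected rewards (2/3, 1/3) at temperature 0.1
print(logit_best_response([2 / 3, 1 / 3], tau=0.1))
```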
Convergence
    σ_{n+1} = σ_n + ( 1/(n+1) ) { β_τ(σ_n) − σ_n + M_{n+1} }
            ∈ σ_n + ( 1/(n+1) ) { b^ε(σ_n) − σ_n + M_{n+1} }
But we can now consider the effect of using smooth best responses β_{τ_n} with τ_n → 0 . . .
. . . it means that ε_n → 0, resulting in a GWFP!
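A sketch of this annealed scheme (illustrative only; the temperature schedule τ_n, the step size, and the payoff matrices are assumptions): each player samples an action from the smooth best response, so the sampled play supplies the noise M_{n+1}, and letting τ_n → 0 turns the process into a GWFP.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_br(rewards, tau):
    """Logit smooth best response: probabilities proportional to exp(r(a)/tau)."""
    z = np.asarray(rewards, dtype=float) / tau
    z -= z.max()
    w = np.exp(z)
    return w / w.sum()

def annealed_stochastic_fp(A, B, n_iter=5000):
    """Stochastic fictitious play with a slowly decreasing temperature tau_n.
    Sampling actions from beta_{tau_n} gives
        sigma_{n+1} = sigma_n + alpha_{n+1} { beta_{tau_n}(sigma_n) - sigma_n + M_{n+1} },
    an epsilon_n-best-response (GWFP) process since tau_n -> 0."""
    n1, n2 = A.shape
    sigma1 = np.ones(n1) / n1
    sigma2 = np.ones(n2) / n2
    for n in range(1, n_iter + 1):
        tau = 1.0 / np.log(n + 2)           # assumed annealing schedule with tau_n -> 0
        p1 = smooth_br(A @ sigma2, tau)
        p2 = smooth_br(sigma1 @ B, tau)
        a1 = rng.choice(n1, p=p1)           # sampled play: contributes the noise M_{n+1}
        a2 = rng.choice(n2, p=p2)
        step = 1.0 / (n + 1)
        sigma1 = (1 - step) * sigma1 + step * np.eye(n1)[a1]
        sigma2 = (1 - step) * sigma2 + step * np.eye(n2)[a2]
    return sigma1, sigma2
```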
Random belief learning
Random beliefs (Friedman and Mezzetti 2005)
Best response ‘assumes’ complete confidence in:
• knowledge of the reward functions
• beliefs σ about opponent strategy
Uncertainty in the beliefs σ_n ↔ a distribution on belief space
Belief distributions
• The belief about player j is that π^j ∼ μ^j
• E_{μ^j}[π^j] = σ^j, the focus of μ^j
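One concrete way to realise such a belief distribution, assumed here purely for illustration and not necessarily the parametrisation used by Friedman and Mezzetti: draw π^j from a Dirichlet distribution whose mean is the focus σ^j, with a concentration parameter controlling how confident the player is, and best-respond to the sampled belief.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_belief(sigma_j, concentration=10.0):
    """Draw a random belief pi^j ~ mu^j with focus E[pi^j] = sigma_j (assumed Dirichlet
    construction). Larger `concentration` means more confidence in sigma_j."""
    alpha = concentration * np.asarray(sigma_j, dtype=float)
    return rng.dirichlet(np.maximum(alpha, 1e-6))   # guard against zero Dirichlet parameters

def random_belief_response(A, sigma2, concentration=10.0):
    """Player 1 best-responds to a sampled belief about Player 2, rather than to sigma2 itself."""
    pi2 = sample_belief(sigma2, concentration)
    return int(np.argmax(A @ pi2))
```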