Multi-agent learning: Repeated games

Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.

Last modified on February 9th, 2012 at 17:15.
Repeated games: motivation

1. Much interaction in multi-agent systems can be modelled through games.
2. Much learning in multi-agent systems can therefore be modelled through learning in games.
3. Learning in games usually takes place through the (gradual) adaptation of strategies (hence, of behaviour) in a repeated game.
4. In most repeated games, one game (a.k.a. the stage game) is played repeatedly. Possibilities:
   • A finite number of times.
   • An indefinite (same: indeterminate) number of times.
   • An infinite number of times.
5. Therefore, familiarity with the basic concepts and results from the theory of repeated games is essential to understand multi-agent learning.
Plan for today

• NE in normal form games that are repeated a finite number of times.
  – Principle of backward induction.
• NE in normal form games that are repeated an indefinite number of times.
  – Discount factor. Models the probability of continuation.
  – Folk theorem. (Actually many folk theorems.) Such repeated games generally do have infinitely many Nash equilibria.
  – Trigger strategy, on-path vs. off-path play, the threat to "minmax" an opponent.

This presentation draws heavily on (Peters, 2008).

H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN 978-3-540-69290-4. Ch. 8: Repeated games.
Example 1: Nash equilibria in playing the PD twice

Prisoners' Dilemma      Other: Cooperate    Other: Defect
You: Cooperate              (3, 3)              (0, 5)
You: Defect                 (5, 0)              (1, 1)

• Even if mixed strategies are allowed, the PD possesses one Nash equilibrium, viz. (D, D), with payoffs (1, 1).
• This equilibrium is Pareto sub-optimal. (Because (3, 3) makes both players better off.)
• Does the situation change if two parties get to play the Prisoners' Dilemma two times in succession?
• The following diagram (hopefully) shows that playing the PD two times in succession does not yield an essentially new NE. (A quick check of the one-shot claim is sketched below.)
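The claim about the one-shot game can be checked mechanically for pure strategies; ruling out further mixed equilibria rests on D strictly dominating C. Below is a minimal Python sketch of the pure-strategy check (the payoff dictionary and helper names are ours, not from the slides).

```python
from itertools import product

# Stage-game payoffs of the Prisoners' Dilemma, (row, column), as on the slide.
PD = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
      ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}
ACTIONS = ('C', 'D')

def is_pure_nash(profile):
    """True if neither player gains by a unilateral pure deviation."""
    a, b = profile
    row_ok = all(PD[(a2, b)][0] <= PD[(a, b)][0] for a2 in ACTIONS)
    col_ok = all(PD[(a, b2)][1] <= PD[(a, b)][1] for b2 in ACTIONS)
    return row_ok and col_ok

print([p for p in product(ACTIONS, repeat=2) if is_pure_nash(p)])
# [('D', 'D')]
```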
Example 1: Nash equilibria in playing the PD twice (2)

[Game tree of the twice-repeated PD: starting from (0, 0), the first round branches into CC, CD, DC, DD with payoffs (3, 3), (0, 5), (5, 0), (1, 1); each of these branches again into CC, CD, DC, DD, giving the sixteen cumulative payoff pairs shown in normal form on the next slide.]
Example 1: Nash equilibria in playing the PD twice (3)

In normal form:

           Other: CC    CD        DC        DD
You: CC    (6, 6)       (3, 8)    (3, 8)    (0, 10)
You: CD    (8, 3)       (4, 4)    (5, 5)    (1, 6)
You: DC    (8, 3)       (5, 5)    (4, 4)    (1, 6)
You: DD    (10, 0)      (6, 1)    (6, 1)    (2, 2)

• The action profile (DD, DD) is the only Nash equilibrium.
• With 3 successive games, we obtain a 2^3 × 2^3 matrix, where the action profile (DDD, DDD) still would be the only Nash equilibrium.
• Generalise to N repetitions: (D^N, D^N) still is the only Nash equilibrium in a repeated game where the PD is played N times in succession. (A small script verifying this for N = 2 and N = 3 is sketched below.)
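The check from the previous slide extends to the repeated matrix. The sketch below follows the slide's simplification of treating a strategy as a fixed action sequence (history-dependent strategies are ignored) and confirms that only the all-defect profile survives for N = 2 and N = 3; the function names are ours.

```python
from itertools import product

# Stage-game payoffs, (row, column):
STAGE = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
         ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def repeated_payoff(seq_row, seq_col):
    """Total payoffs when both players play fixed action sequences."""
    totals = [0, 0]
    for a, b in zip(seq_row, seq_col):
        u = STAGE[(a, b)]
        totals[0] += u[0]
        totals[1] += u[1]
    return tuple(totals)

def pure_nash_equilibria(n):
    """All pure NE of the n-fold PD, with strategies restricted to fixed
    action sequences (the simplification used in the matrix above)."""
    strategies = [''.join(s) for s in product('CD', repeat=n)]
    equilibria = []
    for r in strategies:
        for c in strategies:
            u_r, u_c = repeated_payoff(r, c)
            if (all(repeated_payoff(r2, c)[0] <= u_r for r2 in strategies) and
                    all(repeated_payoff(r, c2)[1] <= u_c for c2 in strategies)):
                equilibria.append((r, c))
    return equilibria

print(pure_nash_equilibria(2))   # [('DD', 'DD')]
print(pure_nash_equilibria(3))   # [('DDD', 'DDD')]
```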
Backward induction (version for repeated games)

• Suppose G is a game in normal form for p players, where all players possess the same arsenal of possible actions A = {a_1, ..., a_m}.
• The game G^n arises by playing the stage game G a number of n times in succession.
• A history h of length k is an element of (A^p)^k. E.g., for p = 3 and k = 10,

  a_7 a_5 a_3   a_6 a_1 a_9   a_2 a_7 a_7   a_3 a_6 a_9   a_2 a_4 a_2   a_9 a_9 a_1   a_1 a_4 a_1   a_2 a_7 a_9   a_6 a_1 a_1   a_8 a_2 a_4

  is a history of length ten in a game with three players. The set of all possible histories is denoted by H. (Hence, |H_k| = m^(kp), where H_k is the set of histories of length k.)
• A (possibly mixed) strategy for one player is a function H → Pr(A). (A small sketch of such a strategy follows below.)
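To make the last definition concrete, here is a small Python sketch of a strategy as a map from histories to probability distributions over actions. The two-action setting and the tit-for-tat example are illustrations of ours, not part of the slides.

```python
import random

ACTIONS = ['C', 'D']

def tit_for_tat(history):
    """A strategy for player 0: maps a history (a list of joint-action tuples,
    one tuple per round) to a probability distribution over ACTIONS."""
    if not history:
        return {'C': 1.0, 'D': 0.0}          # cooperate in the first round
    opponents_last = history[-1][1]          # player 1's action in the last round
    return {a: (1.0 if a == opponents_last else 0.0) for a in ACTIONS}

def sample(distribution):
    """Draw one action from a distribution given as {action: probability}."""
    actions = list(distribution)
    return random.choices(actions, weights=[distribution[a] for a in actions])[0]

history = [('C', 'C'), ('C', 'D')]
print(sample(tit_for_tat(history)))          # 'D', since the opponent defected last
```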
Backward induction (version for repeated games)

• For some repeated games of length n, the dominating (read: "clearly best") strategy for all players in round n (the last round) does not depend on the history of play. E.g., for the Prisoners' Dilemma in the last round: "No matter what happened in rounds 1 ... n − 1, I am better off playing D."
• Fixed strategies (D, D) in round n determine play after round n − 1.
• Independence of history, plus a determined future, leads to the following justification for playing D in round n − 1: "No matter what happened in rounds 1 ... n − 2 (the past), and given that I will receive a payoff of 1 in round n (the future), I am better off playing D now."
• Per induction, in round k, where k ≥ 1: "No matter what happened in rounds 1 ... k − 1, and given that I will receive a payoff of (n − k) · 1 in rounds (k + 1) ... n, I am better off playing D in round k." (A minimal sketch of this inductive step follows below.)
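A minimal sketch of the inductive step, under the assumptions the slide makes: play in rounds k+1 ... n is already fixed at (D, D), so the continuation value (n − k) · 1 does not depend on the current round, and D is compared with C against either possible action of the opponent in round k. The names below are ours.

```python
# Row player's stage payoff in the Prisoners' Dilemma:
PD = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def defect_is_better_in_round(k, n):
    """Inductive step: with rounds k+1..n fixed at (D, D), playing D in round k
    is strictly better than C against either action of the opponent."""
    continuation = (n - k) * PD[('D', 'D')]       # (n - k) * 1
    return all(PD[('D', opp)] + continuation > PD[('C', opp)] + continuation
               for opp in ('C', 'D'))

n = 5
print(all(defect_is_better_in_round(k, n) for k in range(1, n + 1)))   # True
```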
Indefinite number of repetitions

• A Pareto-suboptimal outcome can be avoided in case the following three conditions are met:
  1. The Prisoners' Dilemma is repeated an indefinite number of times (rounds).
  2. A so-called discount factor δ ∈ [0, 1] determines the probability of continuing the game after each round.
  3. The probability to continue, δ, must be large enough.
• Under these conditions, suddenly infinitely many Nash equilibria exist. This is sometimes called an embarrassment of riches (Peters, 2008).
• Various folk theorems state the existence of multiple equilibria in infinitely repeated games. (Folk theorems are named as such because their exact origin cannot be traced.)
• We now informally discuss one version of "the" Folk Theorem.
Example 2: Prisoners' Dilemma repeated indefinitely

• Consider the game G*(δ) where the PD is played a number of times in succession. We write G*(δ): G_0, G_1, G_2, ....
• The number of times the stage game is played is determined by a parameter 0 ≤ δ ≤ 1. The probability that the next stage (and the stages thereafter) will be played is δ. Thus, the probability that stage game G_t will be played is δ^t. (What if t = 0? Then δ^0 = 1: the first stage is always played.) (See the simulation sketch after this slide.)
• The PD (of which every G_t is an incarnation) is called the stage game, as opposed to the overall game G*(δ).
• A history h of length t of a repeated game is a sequence of action profiles of length t.
• A realisation h is a countably infinite sequence of action profiles.
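The role of δ can be illustrated with a small simulation (our own sketch): after every stage the game continues with probability δ, so stage t is reached with probability δ^t and the expected number of stages played is 1/(1 − δ).

```python
import random

def realised_length(delta):
    """Number of stages played in one realisation of G*(delta):
    stage 0 is always played; each further stage is reached with probability delta."""
    t = 1
    while random.random() < delta:
        t += 1
    return t

delta = 0.9
lengths = [realised_length(delta) for _ in range(100_000)]
print(sum(lengths) / len(lengths))   # close to 1 / (1 - delta) = 10
```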
Example 2: Prisoners' Dilemma repeated indefinitely (2)

• Example of a history of length t = 10:

  Round:          0  1  2  3  4  5  6  7  8  9
  Row player:     C  D  D  D  C  C  D  D  D  D
  Column player:  C  D  D  D  D  D  D  C  D  D

• The set of all possible histories (of any length) is denoted by H.
• A (mixed) strategy for Player i is a function s_i : H → Pr({C, D}) such that
  Pr(Player i plays C in round |h| + 1 | h) = s_i(h)(C).
• A strategy profile s is a combination of strategies, one for each player.
• The expected payoff for player i given s can be computed. It is
  Expected payoff_i(s) = Σ_{t=0}^{∞} δ^t · Expected payoff_{i,t}(s).
  (A sketch computing this sum for two concrete strategies follows below.)
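The discounted sum above can be approximated by truncating the series once δ^t is negligible. The sketch below does this for deterministic strategies (a special case of the mixed strategies just defined); the strategy and function names are ours.

```python
# Stage-game payoffs, (row, column):
STAGE = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
         ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def discounted_payoffs(strategy_row, strategy_col, delta, horizon=500):
    """Approximate the discounted sum for two deterministic strategies
    (functions from the history of joint actions to 'C' or 'D') by
    truncating the series at `horizon`."""
    history, totals = [], [0.0, 0.0]
    for t in range(horizon):
        a, b = strategy_row(history), strategy_col(history)
        u = STAGE[(a, b)]
        totals[0] += delta**t * u[0]
        totals[1] += delta**t * u[1]
        history.append((a, b))
    return tuple(totals)

def always_cooperate(history):
    return 'C'

def always_defect(history):
    return 'D'

delta = 0.9
print(discounted_payoffs(always_cooperate, always_cooperate, delta))  # ≈ (30, 30): 3/(1 - delta) each
print(discounted_payoffs(always_defect, always_defect, delta))        # ≈ (10, 10): 1/(1 - delta) each
```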
Example: The expected payoff of a stage game

Prisoners' Dilemma      Other: Cooperate    Other: Defect
You: Cooperate              (3, 3)              (0, 5)
You: Defect                 (5, 0)              (1, 1)

• Suppose the following strategy profile for one game:
  – Row player (you) plays the mixed strategy with probability 0.8 on C (hence, 0.2 on D).
  – Column player (other) plays the mixed strategy with probability 0.7 on C.
• Your expected payoff is
  0.8 (0.7 · 3 + 0.3 · 0) + 0.2 (0.7 · 5 + 0.3 · 1) = 2.44.
• General formula (cf., e.g., Leyton-Brown et al., 2008), for n players:
  Expected payoff_{i,t}(s) = Σ_{(i_1, ..., i_n) ∈ A^n} ( Π_{k=1}^{n} s_{k, i_k} ) · payoff_i(i_1, ..., i_n),
  where s_{k, i_k} is the probability that player k assigns to action i_k.
  (A sketch of this computation follows below.)
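The general formula can be written out directly: sum over all pure action profiles, weight by the product of each player's probability on their own action, and multiply by player i's payoff. A minimal sketch (our own helper names) that reproduces the 2.44 above:

```python
from itertools import product

# Prisoners' Dilemma stage payoffs, (row, column):
PD = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
      ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def expected_stage_payoff(i, mixed_strategies, payoff):
    """Expected payoff of player i.
    mixed_strategies: one dict {action: probability} per player.
    payoff: maps a pure action profile (tuple) to the payoff vector."""
    total = 0.0
    for profile in product(*(s.keys() for s in mixed_strategies)):
        prob = 1.0
        for k, action in enumerate(profile):
            prob *= mixed_strategies[k][action]   # product over players
        total += prob * payoff(profile)[i]
    return total

row = {'C': 0.8, 'D': 0.2}
col = {'C': 0.7, 'D': 0.3}
print(expected_stage_payoff(0, [row, col], lambda p: PD[p]))   # ≈ 2.44
```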
Expected payoffs for P1 and P2 in the stage PD with mixed strategies

[Figure: surface plot of the expected payoffs as a function of both mixed strategies. Player 1 may only move "back – front"; Player 2 may only move "left – right".]