Announcements
Ø HW 1 deadline is postponed to next Tuesday before class, i.e., 3:30 pm
CS6501: Topics in Learning and Game Theory (Fall 2019)
Swap Regret and Convergence to CE
Instructor: Haifeng Xu
Outline
Ø (External) Regret vs Swap Regret
Ø Convergence to Correlated Equilibrium
Ø Converting Regret Bounds to Swap Regret Bounds
Recap: Online Learning
At each time step $t = 1, \cdots, T$, the following occurs in order:
1. Learner picks a distribution $p^t$ over actions $[n]$
2. Adversary picks cost vector $c^t \in [0,1]^n$
3. Action $i^t \sim p^t$ is chosen and learner incurs cost $c^t(i^t)$
4. Learner observes $c^t$ (for use in future time steps)
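To make the protocol concrete, here is a minimal Python sketch of one run of the interaction loop above. The multiplicative-weights update used for the learner and the random adversary are placeholder assumptions for illustration only; any online learning algorithm fits the same interface.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, eta = 4, 1000, 0.1                       # actions, rounds, assumed learning rate

weights = np.ones(n)                           # placeholder learner state
total_cost = 0.0

for t in range(T):
    p_t = weights / weights.sum()              # 1. learner picks distribution p^t over [n]
    c_t = rng.uniform(0.0, 1.0, size=n)        # 2. adversary picks cost vector c^t in [0,1]^n
    i_t = rng.choice(n, p=p_t)                 # 3. action i^t ~ p^t; learner incurs cost c^t(i^t)
    total_cost += c_t[i_t]
    weights *= np.exp(-eta * c_t)              # 4. learner observes c^t and updates (placeholder rule)

print(total_cost / T)                          # average per-round cost
```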
Recap: (External) Regret
Ø External regret: $R_T = \mathbb{E}_{i^t \sim p^t}\big[\sum_{t\in[T]} c^t(i^t)\big] - \min_{k\in[n]} \sum_{t\in[T]} c^t(k)$
Ø The benchmark $\min_{k\in[n]} \sum_{t\in[T]} c^t(k)$ is the learner's cost had he known $c^1, \cdots, c^T$ and been allowed to take the best single action across all rounds
Ø Describes how much the learner regrets, had he known the cost vectors $c^1, \cdots, c^T$ in hindsight
Recap: (External) Regret
Ø A closer look at external regret:
$R_T = \mathbb{E}_{i^t \sim p^t}\big[\sum_{t\in[T]} c^t(i^t)\big] - \min_{k\in[n]} \sum_{t\in[T]} c^t(k)$
$\;\;\;= \sum_{t\in[T]} \sum_{i\in[n]} c^t(i)\, p^t(i) - \min_{k\in[n]} \sum_{t\in[T]} c^t(k)$
$\;\;\;= \max_{k\in[n]} \big[\sum_{t\in[T]} \sum_{i\in[n]} c^t(i)\, p^t(i) - \sum_{t\in[T]} c^t(k)\big]$
$\;\;\;= \max_{k\in[n]} \sum_{t\in[T]} \sum_{i\in[n]} [c^t(i) - c^t(k)]\, p^t(i)$   (using $\sum_{i\in[n]} p^t(i) = 1$)
Ø The benchmark performs a many-to-one action swap: in external regret, the learner is allowed to swap every action to a single action $k$, and can choose the best $k$ in hindsight
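As a sanity check on the rewriting above, this sketch computes external regret on synthetic cost vectors and distributions both from the definition and from the max-over-$k$ form; the two values coincide. All data here is made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 3, 500
costs = rng.uniform(0.0, 1.0, size=(T, n))          # c^t(i), made-up cost vectors
probs = rng.dirichlet(np.ones(n), size=T)           # p^t(i), made-up learner distributions

expected_cost = (costs * probs).sum()                # E[ sum_t c^t(i^t) ]
best_fixed = costs.sum(axis=0).min()                 # min_k sum_t c^t(k)
regret_def = expected_cost - best_fixed              # definition of R_T

# rewritten form: max_k sum_t sum_i [c^t(i) - c^t(k)] p^t(i)
regret_max = max(((costs - costs[:, [k]]) * probs).sum() for k in range(n))

assert np.isclose(regret_def, regret_max)
print(regret_def)
```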
Swap Regret
Ø Recall the rewritten form of external regret: $R_T = \max_{k\in[n]} \sum_{t\in[T]} \sum_{i\in[n]} [c^t(i) - c^t(k)]\, p^t(i)$
Ø Swap regret allows a many-to-many action swap: replace $c^t(k)$ above by $c^t(s(i))$ for a swap function $s: [n] \to [n]$
  • E.g., $s(1) = 2$, $s(2) = 1$, $s(3) = 4$, $s(4) = 4$
Ø Formally, $\mathit{swR}_T = \max_{s} \sum_{t\in[T]} \sum_{i\in[n]} [c^t(i) - c^t(s(i))]\, p^t(i)$, where the max is over all possible swap functions $s$
Ø There are $n^n$ swap functions: each action $i$ has $n$ choices to swap to
Ø Quiz: how many many-to-one swaps are there?
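For small $n$, the definition can be evaluated directly by enumerating all $n^n$ swap functions; a minimal sketch on synthetic data (the cost vectors and distributions are made up):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, T = 3, 200
costs = rng.uniform(0.0, 1.0, size=(T, n))       # c^t(i), synthetic
probs = rng.dirichlet(np.ones(n), size=T)         # p^t(i), synthetic

def swap_regret_brute_force(costs, probs):
    """max over all n^n swap functions s of sum_t sum_i [c^t(i) - c^t(s(i))] p^t(i)."""
    n = costs.shape[1]
    best = -np.inf
    for s in itertools.product(range(n), repeat=n):   # s[i] is the action that i swaps to
        value = sum(((costs[:, i] - costs[:, s[i]]) * probs[:, i]).sum() for i in range(n))
        best = max(best, value)
    return best

print(swap_regret_brute_force(costs, probs))
```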
Some Facts about Swap Regret
Fact 1. For any algorithm: $\mathit{swR}_T \ge R_T$.
Fact 2. For any algorithm execution $p^1, \cdots, p^T$, the optimal swap function $s^*$ satisfies, for any $i$,
$s^*(i) = \arg\max_{k\in[n]} \sum_{t\in[T]} [c^t(i) - c^t(k)]\, p^t(i)$

Proof: Recall the swap regret $\mathit{swR}_T = \max_{s} \sum_{t\in[T]} \sum_{i\in[n]} [c^t(i) - c^t(s(i))]\, p^t(i)$. The choice of $s(i)$ only affects the term $\sum_{t\in[T]} [c^t(i) - c^t(s(i))]\, p^t(i)$, so it should be picked to maximize this term.

Remarks:
Ø The optimal swap can be decided "independently" for each action $i$
Ø The benchmark of swap regret depends on the algorithm execution $p^1, \cdots, p^T$, whereas the benchmark of external regret does not
Ø This raises a subtle issue: an algorithm that minimizes swap regret does not necessarily minimize the total loss
  • An algorithm may intentionally play only a few actions so that the benchmark does not have many opportunities to swap
Ø The quantity $\max_{i\in[n]} \max_{k\in[n]} \sum_{t\in[T]} [c^t(i) - c^t(k)]\, p^t(i)$, i.e., swapping only the single worst action $i$, is also called the internal regret
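A short sketch, again on made-up data, of how Fact 2 turns the max over $n^n$ swap functions into $n$ independent argmaxes, together with the internal regret from the last remark:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 3, 200
costs = rng.uniform(0.0, 1.0, size=(T, n))       # c^t(i), synthetic
probs = rng.dirichlet(np.ones(n), size=T)         # p^t(i), synthetic

def swap_regret_via_argmax(costs, probs):
    """Fact 2: pick s*(i) independently for each i by an argmax over k."""
    n = costs.shape[1]
    total = 0.0
    for i in range(n):
        # vector over k of sum_t [c^t(i) - c^t(k)] p^t(i)
        gains = ((costs[:, [i]] - costs) * probs[:, [i]]).sum(axis=0)
        total += gains.max()                      # contribution of s*(i) = argmax_k
    return total

def internal_regret(costs, probs):
    """Swap only the single worst action i to its best alternative k."""
    n = costs.shape[1]
    return max(((costs[:, [i]] - costs) * probs[:, [i]]).sum(axis=0).max()
               for i in range(n))

print(swap_regret_via_argmax(costs, probs), internal_regret(costs, probs))
```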
Outline
Ø (External) Regret vs Swap Regret
Ø Convergence to Correlated Equilibrium
Ø Converting Regret Bounds to Swap Regret Bounds
Recap: Normal-Form Games and CE
Ø $n$ players, denoted by the set $[n] = \{1, \cdots, n\}$
Ø Player $i$ takes action $a_i \in A_i$
Ø Player utilities depend on the outcome of the game, i.e., an action profile $a = (a_1, \cdots, a_n)$
  • Player $i$ receives payoff $u_i(a)$ for any outcome $a \in \prod_{i=1}^{n} A_i$
Ø A correlated equilibrium is an action recommendation policy:
  A recommendation policy $\pi$ is a correlated equilibrium (CE) if
  $\sum_{a_{-i}} u_i(a_i, a_{-i}) \cdot \pi(a_i, a_{-i}) \;\ge\; \sum_{a_{-i}} u_i(a'_i, a_{-i}) \cdot \pi(a_i, a_{-i}), \quad \forall\, a_i, a'_i \in A_i,\ \forall\, i \in [n]$.
Ø That is, for any recommended action $a_i$, player $i$ does not want to "swap" to another action $a'_i$
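A minimal sketch of checking these CE inequalities numerically. The two-player game and candidate policy below (the game of Chicken with its classic 1/3-1/3-1/3 correlated equilibrium) are illustrative examples, not taken from the lecture.

```python
import itertools
import numpy as np

def is_correlated_eq(payoffs, pi, tol=1e-9):
    """payoffs[i][a] = u_i(a); pi[a] = probability of recommending profile a."""
    n_players = len(payoffs)
    action_sets = [range(m) for m in payoffs[0].shape]
    for i in range(n_players):
        for a_i in action_sets[i]:
            for a_dev in action_sets[i]:
                # sum over a_{-i} of [u_i(a'_i, a_-i) - u_i(a_i, a_-i)] * pi(a_i, a_-i)
                gain = 0.0
                for a in itertools.product(*action_sets):
                    if a[i] != a_i:
                        continue
                    a_swapped = a[:i] + (a_dev,) + a[i + 1:]
                    gain += (payoffs[i][a_swapped] - payoffs[i][a]) * pi[a]
                if gain > tol:               # a profitable swap exists
                    return False
    return True

# Made-up example: the game of Chicken with a well-known correlated equilibrium.
u1 = np.array([[6.0, 2.0], [7.0, 0.0]])
u2 = u1.T
pi = np.array([[1/3, 1/3], [1/3, 0.0]])
print(is_correlated_eq([u1, u2], pi))        # expected: True
```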
Repeated Games with No-Swap-Regret Players
Ø The game is played repeatedly for $T$ rounds
Ø Each player uses an online learning algorithm to select a mixed strategy at each round $t$
Ø From any player $i$'s perspective, the following occurs in order at round $t$:
  • Player $i$ picks a mixed strategy $x_i^t \in \Delta^{|A_i|}$ over actions in $A_i$
  • Every other player $j \ne i$ picks a mixed strategy $x_j^t \in \Delta^{|A_j|}$; denote these jointly by $x_{-i}^t$
  • Player $i$ receives expected utility $u_i(x_i^t, x_{-i}^t) = \mathbb{E}_{a \sim (x_i^t, x_{-i}^t)}\, u_i(a)$
  • Player $i$ learns $x_{-i}^t$ (for use in future rounds)
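A small sketch of the expected utility a player receives in one such round, i.e., the expectation of $u_i(a)$ when every player samples independently from their mixed strategy; the payoff matrix and strategies here are placeholders.

```python
import itertools
import numpy as np

def expected_utility(u_i, strategies):
    """u_i(x_1, ..., x_n) = E_{a ~ product of the x_j} [ u_i(a) ]."""
    total = 0.0
    for a in itertools.product(*[range(len(x)) for x in strategies]):
        prob = np.prod([strategies[j][a_j] for j, a_j in enumerate(a)])
        total += prob * u_i[a]
    return total

# Placeholder round-t data for a 2-player game with 2 actions each.
u_1 = np.array([[6.0, 2.0], [7.0, 0.0]])      # player 1's payoffs u_1(a_1, a_2)
x_1_t = np.array([0.5, 0.5])                  # player 1's mixed strategy x_1^t
x_2_t = np.array([0.25, 0.75])                # player 2's mixed strategy x_2^t
print(expected_utility(u_1, [x_1_t, x_2_t]))  # u_1(x_1^t, x_2^t)
```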
From No Swap Regret to Correlated Equilibrium

Theorem. Suppose all players use no-swap-regret learning algorithms, producing the strategy sequence $\{x_i^t\}_{t\in[T]}$ for each player $i$. Then the following recommendation policy $\pi^T$ converges to a CE:
$\pi^T(a) = \frac{1}{T} \sum_{t} \prod_{j\in[n]} x_j^t(a_j), \quad \forall\, a \in A$.

Remarks:
Ø In the mixed strategy profile $(x_1^t, x_2^t, \cdots, x_n^t)$, the probability of outcome $a$ is $\prod_{j\in[n]} x_j^t(a_j)$
Ø So $\pi^T(a)$ is simply the average of $\prod_{j\in[n]} x_j^t(a_j)$ over the $T$ rounds

Proof:
Ø Derive player $i$'s expected utility from $\pi^T$:
$\sum_{a\in A} \frac{1}{T} \sum_{t} \prod_{j\in[n]} x_j^t(a_j) \cdot u_i(a)$
$\;\;\;= \frac{1}{T} \sum_{t} \sum_{a\in A} \prod_{j\in[n]} x_j^t(a_j) \cdot u_i(a)$
$\;\;\;= \frac{1}{T} \sum_{t} u_i(x_i^t, x_{-i}^t)$
$\;\;\;= \frac{1}{T} \sum_{a_i \in A_i} \sum_{t=1}^{T} u_i(a_i, x_{-i}^t) \cdot x_i^t(a_i)$
Ø Player $i$'s expected utility conditioned on being recommended $a_i$ is $\frac{1}{T} \sum_{t=1}^{T} u_i(a_i, x_{-i}^t) \cdot x_i^t(a_i)$ (normalization factor omitted)
Ø The CE condition therefore requires, for every player $i$ and every $a_i \in A_i$,
$\frac{1}{T} \sum_{t=1}^{T} u_i(a_i, x_{-i}^t) \cdot x_i^t(a_i) \;\ge\; \frac{1}{T} \sum_{t=1}^{T} u_i(s(a_i), x_{-i}^t) \cdot x_i^t(a_i), \quad \forall\, s(a_i) \in A_i$
Ø Summed over all $a_i \in A_i$, the largest possible violation of these inequalities is exactly player $i$'s swap regret (in utility terms) divided by $T$, which vanishes as $T \to \infty$ for a no-swap-regret algorithm; hence $\pi^T$ converges to a CE
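A sketch of the construction in the theorem: given each player's sequence of mixed strategies, build the empirical recommendation policy $\pi^T$ and measure its worst-case CE violation. The random strategy sequences here are placeholders standing in for the output of no-swap-regret algorithms; with real no-swap-regret dynamics, the reported violation would shrink as $T$ grows.

```python
import itertools
import numpy as np

def empirical_policy(strategy_seqs):
    """pi^T(a) = (1/T) * sum_t prod_j x_j^t(a_j), given strategy_seqs[j][t] = x_j^t."""
    T = len(strategy_seqs[0])
    sizes = [len(seq[0]) for seq in strategy_seqs]
    pi = np.zeros(sizes)
    for t in range(T):
        outer = np.array([1.0])
        for seq in strategy_seqs:
            outer = np.multiply.outer(outer, np.asarray(seq[t]))
        pi += outer[0] / T                    # drop the leading dummy axis
    return pi

def max_ce_violation(payoffs, pi):
    """Largest amount by which any player's CE inequality is violated under pi."""
    worst = 0.0
    action_sets = [range(m) for m in pi.shape]
    for i, u_i in enumerate(payoffs):
        for a_i in action_sets[i]:
            for a_dev in action_sets[i]:
                gain = 0.0
                for a in itertools.product(*action_sets):
                    if a[i] != a_i:
                        continue
                    a_swapped = a[:i] + (a_dev,) + a[i + 1:]
                    gain += (u_i[a_swapped] - u_i[a]) * pi[a]
                worst = max(worst, gain)
    return worst

# Placeholder strategy sequences (random, NOT produced by a no-swap-regret algorithm).
rng = np.random.default_rng(4)
T = 50
xs = [[rng.dirichlet(np.ones(2)) for _ in range(T)] for _ in range(2)]
u1 = np.array([[6.0, 2.0], [7.0, 0.0]])
pi_T = empirical_policy(xs)
print(max_ce_violation([u1, u1.T], pi_T))
```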