Announcements
Ø HW 1 draft is slightly updated; see the website for more info
Ø Minbiao's office hour has moved to Thursday 1-2 pm starting this week, at Rice Hall 442
CS6501: Topics in Learning and Game Theory (Fall 2019)
MW Updates and Implications
Instructor: Haifeng Xu
Outline
Ø Regret Proof of MW Update
Ø Convergence to Minimax Equilibrium
Ø Convergence to Coarse Correlated Equilibrium
Recap: the Model of Online Learning
At each time step $t = 1, \cdots, T$, the following occurs in order:
1. Learner picks a distribution $p_t$ over actions $[n]$
2. Adversary picks a cost vector $c_t \in [0,1]^n$
3. Action $i_t \sim p_t$ is drawn and the learner incurs cost $c_t(i_t)$
4. Learner observes $c_t$ (for use in future time steps)
Ø Learner's goal: pick the distribution sequence $p_1, \cdots, p_T$ to minimize the expected cost $\mathbb{E}\left[\sum_{t \in [T]} c_t(i_t)\right]$
• Expectation is over the randomness of the actions
Measure Algorithms via Regret
Ø Regret – how much the learner regrets, had he known the cost vectors $c_1, \cdots, c_T$ in hindsight
Ø Formally, $R_T = \mathbb{E}_{i_t \sim p_t}\left[\sum_{t \in [T]} c_t(i_t)\right] - \min_{i \in [n]} \sum_{t \in [T]} c_t(i)$
Ø The benchmark $\min_{i \in [n]} \sum_t c_t(i)$ is the cost the learner would incur had he known $c_1, \cdots, c_T$ and were allowed to take the best single action across all rounds
• This benchmark is the one mostly used; other benchmarks are also possible
Ø An algorithm has no regret if $R_T / T \to 0$ as $T \to \infty$, i.e., $R_T = o(T)$
Ø Regret is an appropriate performance measure of online algorithms
• It measures exactly the loss due to not knowing the data in advance
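A minimal sketch of how this regret is computed from a recorded run, assuming NumPy; the function name and array layout are my own illustrative choices, not from the lecture.

```python
import numpy as np

def expected_regret(costs, dists):
    """Expected regret R_T of a learner.

    costs: T x n array, costs[t][i] = c_t(i) in [0, 1]
    dists: T x n array, dists[t] = p_t, the learner's distribution at round t
    Returns E[sum_t c_t(i_t)] - min_i sum_t c_t(i).
    """
    costs = np.asarray(costs, dtype=float)
    dists = np.asarray(dists, dtype=float)
    learner_cost = np.sum(costs * dists)      # sum_t <p_t, c_t>
    best_fixed = np.min(costs.sum(axis=0))    # min_i sum_t c_t(i)
    return learner_cost - best_fixed
```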
The Multiplicative Weight Update Algorithm
Parameter: $\epsilon$
Initialize weights $w_1(i) = 1, \forall i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i \in [n]} w_t(i)$; pick action $i$ with probability $w_t(i)/W_t$
2. Observe the cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot (1 - \epsilon \cdot c_t(i))$

Theorem. The MW Update algorithm achieves regret at most $O(\sqrt{T \ln n})$ for the previously described online learning problem.
Ø Last lecture: both the $\sqrt{T}$ and the $\sqrt{\ln n}$ dependence are necessary
Ø Next, we prove the theorem
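A minimal Python sketch of the algorithm above, run against an arbitrary (here random) sequence of cost vectors and scored with the expected_regret helper sketched earlier; the class name and the driver are illustrative assumptions, not lecture code.

```python
import numpy as np

class MWLearner:
    """Multiplicative Weights: w_{t+1}(i) = w_t(i) * (1 - eps * c_t(i))."""

    def __init__(self, n, eps):
        self.eps = eps
        self.w = np.ones(n)              # w_1(i) = 1 for all i

    def distribution(self):
        return self.w / self.w.sum()     # p_t(i) = w_t(i) / W_t

    def update(self, cost):              # cost = c_t in [0, 1]^n
        self.w *= (1.0 - self.eps * cost)

# Driver: T rounds against an adversary (random costs, purely for illustration).
T, n = 10_000, 5
rng = np.random.default_rng(0)
learner = MWLearner(n, eps=np.sqrt(np.log(n) / T))

costs, dists = [], []
for t in range(T):
    p = learner.distribution()
    c = rng.random(n)                    # adversary's cost vector c_t
    learner.update(c)
    costs.append(c)
    dists.append(p)

print("regret:", expected_regret(costs, dists))  # should be O(sqrt(T ln n))
```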
Intuition of the Proof
Parameter: $\epsilon$
Initialize weights $w_1(i) = 1, \forall i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i \in [n]} w_t(i)$; pick action $i$ with probability $w_t(i)/W_t$
2. Observe the cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot (1 - \epsilon \cdot c_t(i))$

Ø The decrease of the weights relates to the expected cost at each round
• Expected cost at round $t$ is $\bar{c}_t = \sum_{i \in [n]} p_t(i) \cdot c_t(i) = \frac{\sum_{i \in [n]} w_t(i) \cdot c_t(i)}{W_t}$
• This is proportional to the decrease of the total weight at round $t$, which is $\sum_{i \in [n]} \epsilon \cdot w_t(i) c_t(i) = \epsilon W_t \cdot \bar{c}_t$
Ø Proof idea: bound how fast the total weight decreases
Proof Step 1: How Fast Does the Total Weight Decrease?
Lemma 1. $W_{t+1} \le W_t \cdot e^{-\epsilon \bar{c}_t}$, where $W_t = \sum_{i \in [n]} w_t(i)$ is the total weight at time $t$ and $\bar{c}_t = \sum_{i \in [n]} p_t(i) c_t(i) = \frac{\sum_{i \in [n]} w_t(i) c_t(i)}{W_t}$ is the expected cost at time $t$.
Proof
Ø Almost immediate from the update rule $w_{t+1}(i) = w_t(i) \cdot (1 - \epsilon \cdot c_t(i))$:
$W_{t+1} = \sum_{i \in [n]} w_{t+1}(i) = \sum_{i \in [n]} w_t(i) \cdot (1 - \epsilon \cdot c_t(i)) = W_t - \epsilon \sum_{i \in [n]} w_t(i) c_t(i) = W_t - \epsilon W_t \bar{c}_t = W_t (1 - \epsilon \bar{c}_t) \le W_t \cdot e^{-\epsilon \bar{c}_t}$, since $1 - x \le e^{-x}, \forall x \ge 0$
Proof Step 1: How Fast Does the Total Weight Decrease?
Lemma 1. $W_{t+1} \le W_t \cdot e^{-\epsilon \bar{c}_t}$, where $W_t = \sum_{i \in [n]} w_t(i)$ is the total weight at time $t$ and $\bar{c}_t = \sum_{i \in [n]} p_t(i) c_t(i) = \frac{\sum_{i \in [n]} w_t(i) c_t(i)}{W_t}$ is the expected cost at time $t$.
Corollary 1. $W_{T+1} \le n \cdot e^{-\epsilon \sum_{t=1}^{T} \bar{c}_t}$.
Proof: apply Lemma 1 repeatedly:
$W_{T+1} \le W_T \cdot e^{-\epsilon \bar{c}_T} \le \left[ W_{T-1} \cdot e^{-\epsilon \bar{c}_{T-1}} \right] \cdot e^{-\epsilon \bar{c}_T} = W_{T-1} \cdot e^{-\epsilon (\bar{c}_T + \bar{c}_{T-1})} \le \cdots \le W_1 \cdot e^{-\epsilon \sum_{t=1}^{T} \bar{c}_t} = n \cdot e^{-\epsilon \sum_{t=1}^{T} \bar{c}_t}$
Proof Step 2: Lower Bounding $W_{T+1}$
Lemma 2. $W_{T+1} \ge e^{-T\epsilon^2} \cdot e^{-\epsilon \sum_{t=1}^{T} c_t(i)}$ for any action $i$.
Proof
$W_{T+1} \ge w_{T+1}(i)$ (weights are nonnegative)
$= w_1(i) \left(1 - \epsilon c_1(i)\right)\left(1 - \epsilon c_2(i)\right) \cdots \left(1 - \epsilon c_T(i)\right)$ (by the MW update rule)
$\ge \prod_{t=1}^{T} e^{-\epsilon c_t(i) - \epsilon^2 [c_t(i)]^2}$ (by the fact $1 - x \ge e^{-x - x^2}$ for $x \in [0, 1/2]$)
$\ge e^{-T\epsilon^2} \cdot e^{-\epsilon \sum_{t=1}^{T} c_t(i)}$ (relaxing each $[c_t(i)]^2$ to 1)
Proof Step 3: Combining the Two Lemmas
Corollary 1. $W_{T+1} \le n \cdot e^{-\epsilon \sum_{t=1}^{T} \bar{c}_t}$.
Lemma 2. $W_{T+1} \ge e^{-T\epsilon^2} \cdot e^{-\epsilon \sum_{t=1}^{T} c_t(i)}$ for any action $i$.
Ø Therefore, for any $i$ we have
$e^{-T\epsilon^2} \cdot e^{-\epsilon \sum_{t=1}^{T} c_t(i)} \le n \cdot e^{-\epsilon \sum_{t=1}^{T} \bar{c}_t}$
$\Leftrightarrow -T\epsilon^2 - \epsilon \sum_{t=1}^{T} c_t(i) \le \ln n - \epsilon \sum_{t=1}^{T} \bar{c}_t$ (take "ln" on both sides)
$\Leftrightarrow \sum_{t=1}^{T} \bar{c}_t - \sum_{t=1}^{T} c_t(i) \le \frac{\ln n}{\epsilon} + T\epsilon$ (rearrange terms)
Ø Taking $\epsilon = \sqrt{\ln n / T}$, both terms on the right equal $\sqrt{T \ln n}$, so
$\sum_{t=1}^{T} \bar{c}_t - \min_{i} \sum_{t=1}^{T} c_t(i) \le 2\sqrt{T \ln n}$
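This choice of $\epsilon$ is exactly the one that balances the two terms of the bound; a short supporting calculation (standard, not from the slides):
$$\frac{d}{d\epsilon}\left(\frac{\ln n}{\epsilon} + T\epsilon\right) = -\frac{\ln n}{\epsilon^2} + T = 0 \;\Longleftrightarrow\; \epsilon^* = \sqrt{\frac{\ln n}{T}}, \qquad \frac{\ln n}{\epsilon^*} + T\epsilon^* = \sqrt{T \ln n} + \sqrt{T \ln n} = 2\sqrt{T \ln n}.$$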
Remarks
Ø Some descriptions of MW use the update $w_{t+1}(i) = w_t(i) \cdot e^{-\epsilon \cdot c_t(i)}$. The analysis is similar, due to the fact that $e^{-\epsilon} \approx 1 - \epsilon$ for small $\epsilon \in [0,1]$
Ø The same algorithm also works for $c_t \in [-\rho, \rho]^n$ (still using the update rule $w_{t+1}(i) = w_t(i) \cdot (1 - \epsilon \cdot c_t(i))$). The analysis is essentially the same
Ø MW update is a very powerful technique – it can also be used to solve, e.g., LPs, semidefinite programs, Set Cover, Boosting, etc.
• Because it works for arbitrary cost vectors
• Next, we show how it can be used to compute equilibria of games, where the "cost vector" will be generated by the other players
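The exponential-update variant mentioned in the first remark differs from the earlier sketch only in the update step; a minimal illustration (my sketch under the same assumptions as before, not lecture code):

```python
import numpy as np

def hedge_update(w, cost, eps):
    """Exponential-weights variant: w_{t+1}(i) = w_t(i) * exp(-eps * c_t(i)).

    For small eps, exp(-eps * c) is approximately 1 - eps * c, so this rule
    behaves like the (1 - eps * c) rule analyzed above.
    """
    return w * np.exp(-eps * cost)
```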
Outline
Ø Regret Proof of MW Update
Ø Convergence to Minimax Equilibrium
Ø Convergence to Coarse Correlated Equilibrium
Online learning – A natural way to play repeated games
Repeated game: the same game played for many rounds
Ø Think about how you play rock-paper-scissors repeatedly
Ø In reality, we play like online learning
• You try to analyze the past patterns, then decide which action to play in response, possibly with some randomness
• This is basically online learning!
Repeated Zero-Sum Games with No-Regret Players
Basic Setup:
Ø A zero-sum game with payoff matrix $A \in \mathbb{R}^{m \times n}$
Ø The row player maximizes utility and has actions $[m] = \{1, \cdots, m\}$
• The column player thus minimizes utility
Ø The game is played repeatedly for $T$ rounds
Ø Each player uses an online learning algorithm to pick a mixed strategy at each round
Repeated Zero-Sum Games with No-Regret Players
Ø From the row player's perspective, the following occurs in order at round $t$:
• She picks a mixed strategy $x_t \in \Delta_m$ over actions in $[m]$
• Her opponent, the column player, picks a mixed strategy $y_t \in \Delta_n$
• Action $i_t \sim x_t$ is drawn and the row player receives utility $A(i_t, y_t) = \sum_{j \in [n]} y_t(j) \cdot A(i_t, j)$
• The row player learns $y_t$ (for future use)
Ø The column player has a symmetric perspective, but thinks of $A(i, j)$ as his cost
Difference from online learning: the utility/cost vector is determined by the opponent, instead of being arbitrarily chosen
Repeated Zero-Sum Games with No-Regret Players
Ø Expected total utility of the row player: $\sum_{t=1}^{T} A(x_t, y_t)$
• Note: $A(x_t, y_t) = \sum_{i,j} A(i,j) \, x_t(i) \, y_t(j) = x_t^{\top} A \, y_t$
Ø Regret of the row player is
$\max_{i \in [m]} \sum_{t=1}^{T} A(i, y_t) - \sum_{t=1}^{T} A(x_t, y_t)$
Ø Regret of the column player is
$\sum_{t=1}^{T} A(x_t, y_t) - \min_{j \in [n]} \sum_{t=1}^{T} A(x_t, j)$
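To make this concrete, here is a minimal simulation (my illustration, not lecture code, assuming NumPy) in which both players of rock-paper-scissors run the MW update against each other. The row player maximizes $A(i, j)$, so her cost vector at round $t$ is taken to be $1 - A y_t$ (entrywise, after rescaling payoffs to $[0,1]$), while the column player's cost vector is $A^{\top} x_t$. The time-averaged strategies approach the minimax equilibrium and the average play approaches the game value.

```python
import numpy as np

# Rock-paper-scissors payoff for the row player, rescaled to [0, 1];
# the value of this rescaled game is 0.5.
A = (np.array([[ 0, -1,  1],
               [ 1,  0, -1],
               [-1,  1,  0]]) + 1) / 2.0

m, n = A.shape
T = 20_000
eps = np.sqrt(np.log(max(m, n)) / T)

wx, wy = np.ones(m), np.ones(n)          # row / column weights
xs, ys = [], []

for t in range(T):
    x = wx / wx.sum()                    # row player's mixed strategy x_t
    y = wy / wy.sum()                    # column player's mixed strategy y_t
    xs.append(x)
    ys.append(y)

    row_cost = 1.0 - A @ y               # row maximizes A(i, y): cost 1 - A(i, y)
    col_cost = A.T @ x                   # column minimizes A(x, j): cost A(x, j)

    wx *= (1.0 - eps * row_cost)         # MW update for the row player
    wy *= (1.0 - eps * col_cost)         # MW update for the column player

x_bar, y_bar = np.mean(xs, axis=0), np.mean(ys, axis=0)
print("average strategies:", x_bar, y_bar)          # both approach (1/3, 1/3, 1/3)
print("value of average play:", x_bar @ A @ y_bar)  # approaches the game value 0.5
```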
From No Regret to Minimax Theorem
Next, we give another proof of the minimax theorem, using the fact that no-regret algorithms exist (e.g., MW update)
From No Regret to Minimax Theorem
Ø Assume both players use no-regret learning algorithms
Ø For the row player, we have
$R_T^{row} = \max_{i \in [m]} \sum_{t=1}^{T} A(i, y_t) - \sum_{t=1}^{T} A(x_t, y_t)$
$\Leftrightarrow \frac{1}{T} \sum_{t=1}^{T} A(x_t, y_t) + \frac{R_T^{row}}{T} = \frac{1}{T} \max_{i \in [m]} \sum_{t=1}^{T} A(i, y_t) = \max_{i \in [m]} A\!\left(i, \frac{\sum_t y_t}{T}\right) \ge \min_{y \in \Delta_n} \max_{i \in [m]} A(i, y)$
From No Regret to Minimax Theorem
Ø Assume both players use no-regret learning algorithms
Ø For the row player, we have
$\frac{1}{T} \sum_{t=1}^{T} A(x_t, y_t) + \frac{R_T^{row}}{T} \ge \min_{y \in \Delta_n} \max_{i \in [m]} A(i, y)$
Ø Similarly, for the column player, $R_T^{col} = \sum_{t=1}^{T} A(x_t, y_t) - \min_{j \in [n]} \sum_{t=1}^{T} A(x_t, j)$ implies
$\frac{1}{T} \sum_{t=1}^{T} A(x_t, y_t) - \frac{R_T^{col}}{T} \le \max_{x \in \Delta_m} \min_{j \in [n]} A(x, j)$
Ø Let $T \to \infty$; no regret implies $\frac{R_T^{row}}{T}$ and $\frac{R_T^{col}}{T}$ tend to $0$. We have
$\min_{y \in \Delta_n} \max_{i \in [m]} A(i, y) \le \max_{x \in \Delta_m} \min_{j \in [n]} A(x, j)$
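This is the nontrivial direction of the minimax theorem; the reverse inequality $\max_{x} \min_{j} A(x, j) \le \min_{y} \max_{i} A(i, y)$ always holds (for any fixed pair $x, y$ we have $\min_j A(x, j) \le A(x, y) \le \max_i A(i, y)$), so combining the two (a standard step, stated here for completeness):
$$\max_{x \in \Delta_m} \min_{j \in [n]} A(x, j) \;=\; \min_{y \in \Delta_n} \max_{i \in [m]} A(i, y).$$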