Announcements
Ø HW 1 is due now
CS6501: Topics in Learning and Game Theory (Fall 2019)
Adversarial Multi-Armed Bandits
Instructor: Haifeng Xu
Outline
Ø The Adversarial Multi-armed Bandit Problem
Ø A Basic Algorithm: Exp3
Ø Regret Analysis of Exp3
Recap: Online Learning So Far
Setup: $T$ rounds; the following occurs in order at round $t$:
1. Learner picks a distribution $p_t$ over actions $[n]$
2. Adversary picks cost vector $c_t \in [0,1]^n$
3. Action $i_t \sim p_t$ is chosen and learner incurs cost $c_t(i_t)$
4. Learner observes $c_t$ (for use in future time steps)
Performance is typically measured by regret (a small code sketch of this computation follows below):
$R_T = \sum_{t \in [T]} \sum_{i \in [n]} p_t(i)\, c_t(i) - \min_{k \in [n]} \sum_{t \in [T]} c_t(k)$
The multiplicative weight update algorithm has regret $O(\sqrt{T \ln n})$.
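To make the regret formula concrete, here is a minimal Python sketch (not from the lecture; NumPy-based, and the array shapes are my assumption) that computes $R_T$ from the learner's distributions and the adversary's cost vectors:

```python
import numpy as np

def regret(dists, costs):
    """R_T for distributions p_1..p_T (rows of `dists`) against cost vectors c_1..c_T (rows of `costs`)."""
    learner_cost = float(np.sum(dists * costs))    # sum_t sum_i p_t(i) * c_t(i)
    best_fixed = float(costs.sum(axis=0).min())    # min_k sum_t c_t(k)
    return learner_cost - best_fixed
```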
Recap: Online Learning So Far
Convergence to equilibrium
Ø In repeated zero-sum games, if both players use a no-regret learning algorithm, their average strategies converge to a Nash equilibrium (NE)
Ø In general games, the average strategies converge to a coarse correlated equilibrium (CCE)
Swap regret – a "stronger" regret concept with better convergence
Ø Def: each action $i$ is allowed to deviate to another action $s(i)$
Ø In repeated general games, if both players use a no-swap-regret learning algorithm, their average strategies converge to a correlated equilibrium (CE)
There is a general reduction converting any learning algorithm with regret $R$ into one with swap regret $nR$.
This Lecture: Address Partial Feedback
Ø In online learning, the whole cost vector $c_t$ is observed by the learner, even though she takes only a single action $i_t$
• Realistic in some applications, e.g., stock investment
Ø In many cases, we only see the reward of the action we take
• For example: slot machines, a.k.a. multi-armed bandits
Other Applications with Partial Feedback
Ø Online advertisement placement or web ranking
• Action: ad placement or ranking of web pages
• Cannot see the feedback for untaken actions
Ø Recommendation systems
• Action = recommended option (e.g., a restaurant)
• Do not know other options' feedback
Ø Clinical trials
• Action = a treatment
• Don't know what would happen for treatments not chosen
Ø Playing strategic games
• Cannot observe opponents' strategies; only know the payoff of the taken action
• E.g., poker, competition in markets
Adversarial Multi-Armed Bandits (MAB)
Ø Very much like online learning, except with partial feedback
• The name "bandit" is inspired by slot machines
Ø Model: at each time step $t = 1, \cdots, T$, the following occurs in order:
1. Learner picks a distribution $p_t$ over arms $[n]$
2. Adversary picks cost vector $c_t \in [0,1]^n$
3. Arm $i_t \sim p_t$ is chosen and learner incurs cost $c_t(i_t)$
4. Learner only observes $c_t(i_t)$ (for use in future time steps)
Ø Though the learner cannot observe the full $c_t$, the adversary still picks $c_t$ before $i_t$ is sampled
Q: Since the learner does not observe $c_t(i)$ for $i \neq i_t$, can the adversary arbitrarily modify these $c_t(i)$'s after $i_t$ has been selected?
A: No, because that would make $c_t$ depend on the sampled $i_t$, which is not allowed.
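As an illustration of this ordering, the following sketch simulates the bandit protocol against an oblivious adversary whose cost vectors are fixed up front, so $c_t$ cannot depend on the sampled $i_t$. The `learner` interface (`distribution`, `update`) is hypothetical, introduced only for this sketch:

```python
import numpy as np

def run_bandit(learner, adversary_costs, T, n, seed=0):
    """One run of the adversarial bandit protocol (illustrative sketch).

    `learner` is an assumed object with distribution() -> p_t and
    update(i_t, observed_cost); `adversary_costs` is a (T, n) array fixed in
    advance (oblivious adversary), so c_t never depends on the sampled i_t.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(T):
        p_t = learner.distribution()      # 1. learner commits to p_t
        c_t = adversary_costs[t]          # 2. adversary's c_t is already fixed
        i_t = rng.choice(n, p=p_t)        # 3. arm i_t ~ p_t is pulled
        total += c_t[i_t]
        learner.update(i_t, c_t[i_t])     # 4. only c_t(i_t) is revealed
    return total
```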
Outline
Ø The Adversarial Multi-armed Bandit Problem
Ø A Basic Algorithm: Exp3
Ø Regret Analysis of Exp3
Recall the algorithm for the full-information setting:

Parameter: $\epsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i \in [n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\epsilon \cdot c_t(i)}$

Ø In this lecture we will use this exponential-weight variant: the multiplicative update $w_{t+1}(i) = w_t(i)(1 - \epsilon c_t(i))$ is replaced by $w_{t+1}(i) = w_t(i)\, e^{-\epsilon c_t(i)}$; recall $1 - \epsilon \approx e^{-\epsilon}$ for small $\epsilon$. We will prove its regret bound en route
Ø Also called Exponential Weight Update (EWU); a minimal code sketch follows below
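A minimal sketch of this full-information exponential-weight update, assuming NumPy and a (T, n) cost matrix (the names and shapes are my choice, not the lecture's):

```python
import numpy as np

def exponential_weights(costs, eps):
    """Full-information exponential weights (EWU) sketch.

    costs: (T, n) array of cost vectors c_t in [0,1]^n; eps: step-size parameter.
    Returns the (T, n) array of distributions p_1..p_T that the learner plays.
    """
    T, n = costs.shape
    w = np.ones(n)                        # w_1(i) = 1 for all i
    dists = np.zeros((T, n))
    for t in range(T):
        dists[t] = w / w.sum()            # p_t(i) = w_t(i) / W_t
        w = w * np.exp(-eps * costs[t])   # w_{t+1}(i) = w_t(i) * exp(-eps * c_t(i))
    return dists
```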
Basic idea of Exp3
Ø We want to use EWU, but do not know the full vector $c_t$ → try to estimate $c_t$!
Ø Well, we really only have $c_t(i_t)$; what can we do?
• Estimate $\hat{c}_t = (0, \cdots, 0, c_t(i_t), 0, \cdots, 0)^\top$ (nonzero only at coordinate $i_t$)? Too optimistic about the unobserved arms
• Instead, estimate $\hat{c}_t = (0, \cdots, 0, c_t(i_t)/p_t(i_t), 0, \cdots, 0)^\top$
Exp3: a Basic Algorithm for Adversarial MAB

Parameter: $\epsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i \in [n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe only the cost $c_t(i_t) \in [0,1]$ of the pulled arm
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\epsilon \cdot \hat{c}_t(i)}$, where $\hat{c}_t = (0, \cdots, 0, c_t(i_t)/p_t(i_t), 0, \cdots, 0)^\top$

Ø That is, the weight is updated only for the pulled arm (see the sketch below)
• Because we really don't know how good the other arms are at time $t$
• But $i_t$ is penalized more heavily now
• Attention: $c_t(i_t)/p_t(i_t)$ may be extremely large if $p_t(i_t)$ is small
Ø Called Exp3: Exponential-weight algorithm for Exploration and Exploitation
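A minimal sketch of Exp3 under the same assumptions as the EWU sketch above (oblivious adversary with a fixed (T, n) cost matrix, NumPy; names are my choice). Note that the update touches only the pulled arm's weight, via the importance-weighted estimate:

```python
import numpy as np

def exp3(costs, eps, seed=0):
    """Exp3 sketch: exponential weights fed with the importance-weighted estimator.

    costs: (T, n) array of the adversary's cost vectors, fixed in advance;
    only costs[t, i_t] is ever used in the update, matching bandit feedback.
    Returns the (T, n) array of played distributions p_1..p_T.
    """
    T, n = costs.shape
    rng = np.random.default_rng(seed)
    w = np.ones(n)                       # w_1(i) = 1
    dists = np.zeros((T, n))
    for t in range(T):
        p = w / w.sum()                  # p_t
        dists[t] = p
        i = rng.choice(n, p=p)           # pull arm i_t ~ p_t
        c_hat = np.zeros(n)
        c_hat[i] = costs[t, i] / p[i]    # hat{c}_t: nonzero only at i_t; large if p_t(i_t) is small
        w = w * np.exp(-eps * c_hat)     # only the pulled arm's weight changes
    return dists
```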
A Closer Look at the Estimator $\hat{c}_t$
Ø $\hat{c}_t$ is random – it depends on the randomly sampled $i_t \sim p_t$
Ø $\hat{c}_t$ is an unbiased estimator of $c_t$, i.e., $\mathbb{E}_{i_t \sim p_t}[\hat{c}_t] = c_t$
• Because given $p_t$, for any $i$ we have
$\mathbb{E}_{i_t \sim p_t}[\hat{c}_t(i)] = \mathbb{P}[i_t = i] \cdot \frac{c_t(i)}{p_t(i)} + \mathbb{P}[i_t \neq i] \cdot 0 = p_t(i) \cdot \frac{c_t(i)}{p_t(i)} = c_t(i)$
Ø This is exactly the reason for our choice of $\hat{c}_t$
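A quick Monte Carlo sanity check of this unbiasedness claim, with $p_t$ and $c_t$ made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
p_t = np.array([0.1, 0.2, 0.3, 0.4])   # an arbitrary distribution p_t (made up)
c_t = np.array([0.5, 0.9, 0.2, 0.7])   # an arbitrary cost vector c_t (made up)

num_samples = 200_000
estimates = np.zeros((num_samples, n))
for s in range(num_samples):
    i = rng.choice(n, p=p_t)            # i_t ~ p_t
    estimates[s, i] = c_t[i] / p_t[i]   # the realized hat{c}_t for this draw

print(estimates.mean(axis=0))           # close to c_t = [0.5, 0.9, 0.2, 0.7]
```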
Regret
$R_T = \sum_{t \in [T]} \sum_{i \in [n]} p_t(i)\, c_t(i) - \min_{k \in [n]} \sum_{t \in [T]} c_t(k)$
Some key differences from online learning
Ø $R_T$ is random (even though it already averages over $i_t \sim p_t$ within each round)
• Because the distribution $p_t$ itself is random: it depends on the sampled arms $i_1, \cdots, i_{t-1}$
• That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary! (A small simulation of this follows below.)
[Figure: starting from $w_1(i) = 1$ for all $i$ in round 1, the round-2 weights depend on which arm was pulled: pulling arm 1 gives $w_2(1) < 1$ and $w_2(i) = 1$ for all $i \neq 1$, while pulling arm 2 gives $w_2(2) < 1$ and $w_2(i) = 1$ for all $i \neq 2$.]
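A small simulation of this point, reusing the `exp3` sketch above: the cost sequence is identical across runs, yet the realized $R_T$ differs because the sampled arms (and hence the distributions $p_t$) differ:

```python
import numpy as np

# Same fixed cost sequence for every run; only the algorithm's internal
# randomness (which arms get sampled) changes across seeds.
T, n, eps = 1000, 5, 0.05
costs = np.random.default_rng(2).uniform(size=(T, n))
best_fixed = costs.sum(axis=0).min()

for seed in (10, 11, 12):
    dists = exp3(costs, eps, seed=seed)              # exp3 as sketched earlier
    R_T = float((dists * costs).sum() - best_fixed)  # regret of this particular run
    print(f"seed {seed}: R_T = {R_T:.2f}")           # typically different values
```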