CS246: Mining Massive Datasets Caroline Lo, Stanford University http://cs246.stanford.edu
Web advertising: We’ve learned how to match advertisers to queries in real-time. But how to estimate the CTR (Click-Through Rate)? Recommendation engines: We’ve learned how to build recommender systems. But how to solve the cold-start problem?
What do CTR and cold start have in common? Getting the answer requires experimentation. With every ad we show / product we recommend, we gather more data about the ad/product. Theme: Learning through experimentation.
Google’s goal: Maximize revenue. The old way: Pay by impression (CPM). Best strategy: Go with the highest bidder. But this ignores the “effectiveness” of an ad. The new way: Pay per click (CPC). Best strategy: Go with expected revenue. What’s the expected revenue of ad a for query q? E[revenue_{a,q}] = P(click_a | q) · amount_{a,q}, where amount_{a,q} is the bid amount for ad a on query q (known) and P(click_a | q) is the probability that the user clicks on ad a given that she issues query q (unknown! need to gather information).
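To make the expected-revenue formula concrete, here is a minimal Python sketch; the ad ids, bids, and CTR estimates are all made up for illustration, and in practice the CTR estimates are exactly what we still need to learn:

```python
# Hypothetical candidate ads for one query: ids, bids, and CTR estimates are made up.
ads = [
    {"id": "a1", "bid": 0.50, "est_ctr": 0.040},  # est_ctr plays the role of P(click_a | q)
    {"id": "a2", "bid": 1.20, "est_ctr": 0.010},
    {"id": "a3", "bid": 0.30, "est_ctr": 0.080},
]

def expected_revenue(ad):
    # E[revenue_{a,q}] = P(click_a | q) * amount_{a,q}
    return ad["est_ctr"] * ad["bid"]

best = max(ads, key=expected_revenue)
print(best["id"], expected_revenue(best))  # a3: 0.08 * 0.30 = 0.024
```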
Clinical trials: Investigate the effects of different treatments while minimizing patient losses. Adaptive routing: Minimize delay in the network by investigating different routes. Asset pricing: Figure out product prices while trying to make the most money.
Each arm a wins (reward = 1) with fixed (unknown) probability μ_a and loses (reward = 0) with fixed (unknown) probability 1 − μ_a. All draws are independent given μ_1 … μ_k. How to pull arms to maximize total reward?
How does this map to our advertising example? Each query is a bandit. Each ad is an arm. We want to estimate the arm’s probability of winning μ_a (i.e., the ad’s CTR μ_a). Every time we pull an arm we do an ‘experiment’.
The setting: Set of k choices (arms). Each choice a is tied to a probability distribution P_a with average reward/payoff μ_a (in [0, 1]). We play the game for T rounds. In each round t: (1) we pick some arm j; (2) we win reward X_t drawn from P_j (the reward is independent of previous draws). Our goal is to maximize Σ_{t=1}^{T} X_t. We don’t know μ_a! But every time we pull some arm a we get to learn a bit about μ_a.
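For reference in later examples, here is a minimal Python sketch of this environment with Bernoulli rewards (the win/lose arms described above); the arm probabilities are invented for illustration and would be hidden from the player:

```python
import random

class BernoulliBandit:
    """k-armed bandit with Bernoulli rewards, as in the setting above."""

    def __init__(self, mus):
        self.mus = mus          # true mean payoffs mu_a (unknown to the player)
        self.k = len(mus)

    def pull(self, a):
        # Reward X_t is 1 with probability mu_a and 0 otherwise, independently of past draws.
        return 1 if random.random() < self.mus[a] else 0

bandit = BernoulliBandit([0.05, 0.03, 0.08])  # e.g. three ads with different (hidden) CTRs
print(bandit.pull(2))                         # 0 or 1
```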
Online optimization with limited feedback: at each time step X_1, X_2, X_3, … we must choose one of the arms a_1 … a_k, and the table of rewards is revealed only for the arm we chose (every other entry stays hidden). Like in online algorithms, we have to make a choice each time, but we only receive information about the chosen action.
Policy: a strategy/rule that in each iteration tells us which arm to pull, ideally depending on the history of rewards. How to quantify the performance of the algorithm? Regret!
μ_a is the mean of P_a. Payoff/reward of the best arm: μ* = max_a μ_a. Let a_1, a_2, … a_T be the sequence of arms pulled. Instantaneous regret at time t: r_t = μ* − μ_{a_t}. Total regret: R_T = Σ_{t=1}^{T} r_t. Typical goal: we want a policy (arm allocation strategy) that guarantees R_T / T → 0 as T → ∞. Note: Ensuring R_T / T → 0 is stronger than maximizing payoffs (minimizing regret), as it means that in the limit we discover the true best arm.
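A small sketch of how total regret would be computed in simulation, where the evaluator (but never the policy) knows the true means; the numbers are made up:

```python
def total_regret(mus, pulled_arms):
    """Total regret R_T = sum_t (mu_star - mu_{a_t}) of a sequence of pulled arms."""
    mu_star = max(mus)
    return sum(mu_star - mus[a] for a in pulled_arms)

# Pulling arms 0, 2, 2, 1 when the true means are [0.05, 0.03, 0.08]:
print(total_regret([0.05, 0.03, 0.08], [0, 2, 2, 1]))  # 0.03 + 0 + 0 + 0.05 = 0.08
```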
If we knew the payoffs, which arm would we pull? Pick arg max_a μ_a: we’d always pull the arm with the highest average reward. But we don’t know which arm that is without exploring/experimenting with the arms first. (X_{a,j} … payoff received when pulling arm a for the j-th time.)
Minimizing regret illustrates a classic problem in decision making: we need to trade off exploration (gathering data about arm payoffs) and exploitation (making decisions based on data already gathered). Exploration: pull an arm we never pulled before. Exploitation: pull the arm a for which we currently have the highest estimate of μ_a.
Algorithm: Epsilon-Greedy
For t = 1:T
  Set ε_t = O(1/t)
  With probability ε_t: explore by picking an arm chosen uniformly at random
  With probability 1 − ε_t: exploit by picking the arm with the highest empirical mean payoff
Theorem [Auer et al. ’02]: For a suitable choice of ε_t it holds that R_T = O(k log T), hence R_T / T = O(k log T / T) → 0.   (k … number of arms)
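A minimal sketch of Epsilon-Greedy in Python, run against the BernoulliBandit environment sketched earlier; the schedule ε_t = min(1, c/t) and the constant c = 5 are illustrative choices, not values prescribed by the slide:

```python
import random

def epsilon_greedy(bandit, T, c=5.0):
    counts = [0] * bandit.k        # number of pulls of each arm
    means = [0.0] * bandit.k       # empirical mean payoff of each arm
    pulled = []
    for t in range(1, T + 1):
        eps = min(1.0, c / t)      # epsilon_t = O(1/t)
        if random.random() < eps:
            a = random.randrange(bandit.k)                    # explore: uniform random arm
        else:
            a = max(range(bandit.k), key=lambda i: means[i])  # exploit: best empirical mean
        reward = bandit.pull(a)
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]           # incremental mean update
        pulled.append(a)
    return pulled, means
```

Calling epsilon_greedy(bandit, T=10000) returns the sequence of pulled arms, which can be fed to the total_regret helper above to evaluate the policy in simulation.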
What are some issues with Epsilon-Greedy? “Not elegant”: the algorithm explicitly distinguishes between exploration and exploitation. More importantly, exploration makes suboptimal choices (since it picks any arm with equal likelihood). Idea: when exploring/exploiting we need to compare arms.
Suppose we have done experiments: Arm 1: 1 0 0 1 1 0 0 1 0 1. Arm 2: 1. Arm 3: 1 1 0 1 1 1 0 1 1 1. Mean arm values: Arm 1: 5/10, Arm 2: 1, Arm 3: 8/10. Which arm would you pick next? Idea: Don’t just look at the mean (that is, the expected payoff) but also at the confidence!
A confidence interval is a range of values within which we are sure the mean lies with a certain probability. For example, we could believe μ_a is within [0.2, 0.5] with probability 0.95. If we have tried an action less often, our estimated reward is less accurate, so the confidence interval is larger. The interval shrinks as we get more information (i.e., try the action more often).
Assuming we know the confidence intervals, then instead of trying the action with the highest mean we can try the action with the highest upper bound on its confidence interval. This is called an optimistic policy: we believe an action is as good as possible given the available evidence.
[Figure: 99.99% confidence interval around μ_a for arm a; after more exploration of arm a the interval shrinks.]
Suppose we fix arm a. Let Y_{a,1} … Y_{a,m} be the payoffs of arm a in the first m trials; Y_{a,1} … Y_{a,m} are i.i.d. random variables with values in [0, 1]. Expected mean payoff of arm a: μ_a = E[Y_{a,ℓ}]. Our estimate: μ̂_{a,m} = (1/m) Σ_{ℓ=1}^{m} Y_{a,ℓ}. We want to find a confidence bound b such that with high probability |μ_a − μ̂_{a,m}| ≤ b. We also want b to be as small as possible (why?). Goal: bound P(|μ_a − μ̂_{a,m}| ≤ b).
Hoeffding’s inequality bounds P(|μ_a − μ̂_{a,m}| ≥ b): Let Y_1 … Y_m be i.i.d. random variables with values in [0, 1], let μ = E[Y], and let μ̂_m = (1/m) Σ_{ℓ=1}^{m} Y_ℓ. Then: P(|μ − μ̂_m| ≥ b) ≤ exp(−2 b² m) = δ. To find the confidence interval b for a given confidence level δ we solve: exp(−2 b² m) ≤ δ, i.e. −2 b² m ≤ ln δ. So: b ≥ √(ln(1/δ) / (2m)).
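A small sketch that evaluates this bound b = √(ln(1/δ) / (2m)) for a given confidence parameter δ and sample count m; the example values of m and δ are arbitrary:

```python
import math

def hoeffding_bound(m, delta):
    """Half-width b of the confidence interval: b = sqrt(ln(1/delta) / (2*m))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))

print(hoeffding_bound(10, 0.05))    # ~0.39 after 10 pulls
print(hoeffding_bound(1000, 0.05))  # ~0.039 after 1000 pulls
```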
UCB1 (Upper Confidence Bound) algorithm [Auer et al. ’02]
Set: μ̂_1 = … = μ̂_k = 0 and m_1 = … = m_k = 0
(μ̂_a is our estimate of the payoff of arm a; m_a is the number of pulls of arm a so far; the upper confidence interval comes from Hoeffding’s inequality)
For t = 1:T
  For each arm a calculate: UCB_a = μ̂_a + √(2 ln t / m_a)
  Pick arm j = arg max_a UCB_a
  Pull arm j and observe y_t
  Set: m_j ← m_j + 1 and μ̂_j ← (1/m_j)(y_t + (m_j − 1) μ̂_j)
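A minimal Python sketch of UCB1 against the BernoulliBandit environment from earlier. One practical detail not spelled out above: with m_a = 0 the confidence term is undefined, so this sketch first pulls each arm once:

```python
import math

def ucb1(bandit, T):
    # Pull each arm once so that every m_a > 0 before UCB scores are computed.
    counts = [0] * bandit.k
    means = [0.0] * bandit.k
    for a in range(bandit.k):
        r = bandit.pull(a)
        counts[a] = 1
        means[a] = float(r)
    for t in range(bandit.k + 1, T + 1):
        # UCB_a = empirical mean + sqrt(2 ln t / m_a)
        scores = [means[a] + math.sqrt(2.0 * math.log(t) / counts[a]) for a in range(bandit.k)]
        j = max(range(bandit.k), key=lambda a: scores[a])
        r = bandit.pull(j)
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]
    return counts, means
```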
b ≥ √(ln(1/δ) / (2m))   vs.   UCB_a = μ̂_a + √(2 ln t / m_a): the confidence level δ is tied to the total number of actions t taken so far. The confidence interval grows with the total number of actions t we have taken, but shrinks with the number of times m_a we have tried arm a. This ensures each arm is tried infinitely often but still balances exploration and exploitation. “Optimism in the face of uncertainty”: the algorithm believes that it can obtain extra rewards by reaching the unexplored parts of the state space.
Theorem [Auer et al. 2002]: Suppose the optimal mean payoff is μ* = max_a μ_a, and for each arm let Δ_a = μ* − μ_a. Then it holds that
E[R_T] ≤ [8 Σ_{a: μ_a < μ*} (ln T) / Δ_a] + (1 + π²/3) Σ_{a=1}^{k} Δ_a,
where the first term is O(k ln T) and the second is O(k). So: R_T / T ≤ O(k ln T / T).
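As a rough sanity check, this sketch plugs a hypothetical 3-arm instance (made-up means) into the right-hand side of the bound above:

```python
import math

def ucb1_regret_bound(mus, T):
    """Evaluate the right-hand side of the UCB1 regret bound for given true means."""
    mu_star = max(mus)
    gaps = [mu_star - mu for mu in mus]
    log_term = 8.0 * sum(math.log(T) / d for d in gaps if d > 0)  # O(k ln T) part
    const_term = (1.0 + math.pi ** 2 / 3.0) * sum(gaps)           # O(k) part
    return log_term + const_term

print(ucb1_regret_bound([0.05, 0.03, 0.08], T=10_000))  # ~3930
```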
The k-armed bandit problem is a formalization of the exploration-exploitation tradeoff. Simple algorithms are able to achieve no regret (in the limit as T → ∞): Epsilon-Greedy and UCB1 (Upper Confidence Bound).