

  1. CS246: Mining Massive Datasets
     Caroline Lo, Stanford University
     http://cs246.stanford.edu

  2. Web advertising
     - We've learned how to match advertisers to queries in real-time
     - But how to estimate the CTR (Click-Through Rate)?
     Recommendation engines
     - We've learned how to build recommender systems
     - But how to solve the cold-start problem?

  3. What do CTR and cold start have in common?
     - Getting the answer requires experimentation
     - With every ad we show / product we recommend, we gather more data about the ad/product
     - Theme: Learning through experimentation

  4. Google's goal: Maximize revenue
     - The old way: Pay by impression (CPM)

  5. Google's goal: Maximize revenue
     - The old way: Pay by impression (CPM)
       - Best strategy: Go with the highest bidder
       - But this ignores the "effectiveness" of an ad
     - The new way: Pay per click! (CPC)
       - Best strategy: Go with expected revenue
     - What's the expected revenue of ad a for query q?
       E[revenue_a,q] = P(click_a | q) * amount_a,q
       - amount_a,q: bid amount for ad a on query q (known)
       - P(click_a | q): probability the user will click on ad a given that she issues query q
         (unknown! we need to gather information)
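As a quick illustration of the CPC ranking rule, here is a minimal Python sketch that ranks ads by expected revenue. The ad names, CTR estimates, and bid amounts are made-up placeholders; estimating the CTRs is exactly the problem the rest of the lecture addresses.

```python
# Rank ads for a query by expected revenue: E[revenue_a,q] = P(click_a | q) * amount_a,q
# The CTR values below are hypothetical estimates, not real data.
ads = {
    "ad_A": {"ctr": 0.02, "bid": 1.50},   # estimated CTR, bid in dollars
    "ad_B": {"ctr": 0.05, "bid": 0.80},
    "ad_C": {"ctr": 0.01, "bid": 3.00},
}

expected_revenue = {name: a["ctr"] * a["bid"] for name, a in ads.items()}
best_ad = max(expected_revenue, key=expected_revenue.get)
print(expected_revenue)   # {'ad_A': 0.03, 'ad_B': 0.04, 'ad_C': 0.03}
print(best_ad)            # 'ad_B' wins despite not having the highest bid
```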

  6. Clinical trials: Investigate the effects of different treatments while minimizing patient losses
     Adaptive routing: Minimize delay in the network by investigating different routes
     Asset pricing: Figure out product prices while trying to make the most money

  9. Each arm a:
     - Wins (reward = 1) with fixed (unknown) prob. μ_a
     - Loses (reward = 0) with fixed (unknown) prob. 1 - μ_a
     All draws are independent given μ_1 ... μ_k
     How to pull arms to maximize total reward?

  10. How does this map to our advertising example?
      - Each query is a bandit
      - Each ad is an arm
      - We want to estimate the arm's probability of winning μ_a (i.e., the ad's CTR)
      - Every time we pull an arm we do an "experiment"

  11. The setting:
      - Set of k choices (arms)
      - Each choice a is tied to a probability distribution P_a with average reward/payoff μ_a (in [0, 1])
      - We play the game for T rounds
      - For each round t:
        (1) We pick some arm j
        (2) We win reward X_t drawn from P_j (the reward is independent of previous draws)
      - Our goal is to maximize Σ_{t=1}^{T} X_t
      - We don't know μ_a! But every time we pull some arm a we get to learn a bit about μ_a
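To make the setting concrete, here is a minimal sketch of a Bernoulli k-armed bandit environment in Python. The class name and the example means are my own illustration, not part of the slides.

```python
import random

class BernoulliBandit:
    """k-armed bandit: arm a pays reward 1 with unknown probability mu_a, else 0."""

    def __init__(self, means):
        self.means = means            # true (hidden) success probabilities mu_1 ... mu_k
        self.k = len(means)

    def pull(self, arm):
        # Each draw is independent of previous draws, given the means.
        return 1 if random.random() < self.means[arm] else 0

# Example: three ads with hidden CTRs 0.5, 0.1, 0.8
bandit = BernoulliBandit([0.5, 0.1, 0.8])
reward = bandit.pull(2)   # 1 with probability 0.8, else 0
```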

  12. Online optimization with limited feedback
      [table: choices a_1, a_2, ..., a_k (rows) against time steps 1, 2, ..., 6, ... (columns);
       each row shows rewards (0 or 1) only in the columns where that arm was actually pulled]
      Like in online algorithms:
      - We have to make a choice each time
      - But we only receive information about the chosen action

  13. Policy: a strategy/rule that in each iteration tells me which arm to pull
      - Ideally the policy depends on the history of rewards
      - How do we quantify the performance of the algorithm? Regret!

  14. μ_a is the mean of P_a
      - Payoff/reward of the best arm: μ* = max_a μ_a
      - Let a_1, a_2, ..., a_T be the sequence of arms pulled
      - Instantaneous regret at time t: r_t = μ* - μ_{a_t}
      - Total regret: R_T = Σ_{t=1}^{T} r_t
      - Typical goal: Want a policy (arm allocation strategy) that guarantees R_T / T → 0 as T → ∞
      - Note: Ensuring R_T / T → 0 is stronger than maximizing payoff (minimizing regret), as it
        means that in the limit we discover the true best arm
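A small sketch of how total regret could be computed for evaluation, assuming we are allowed to peek at the true means; the function and variable names are hypothetical, not from the slides.

```python
def total_regret(true_means, pulled_arms):
    """R_T = sum over t of (mu* - mu_{a_t}), given the sequence of arms pulled."""
    mu_star = max(true_means)
    return sum(mu_star - true_means[a] for a in pulled_arms)

# Example: mu = (0.5, 0.1, 0.8); pulling arms 0, 2, 2, 1 gives regret 0.3 + 0 + 0 + 0.7
print(total_regret([0.5, 0.1, 0.8], [0, 2, 2, 1]))   # ~1.0 (up to floating-point rounding)
```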

  15. If we knew the payoffs, which arm would we pull?
      - Pick arg max_a μ_a
      - We'd always pull the arm with the highest average reward
      - But we don't know which arm that is without exploring/experimenting with the arms first
      (X_{a,j} ... payoff received when pulling arm a for the j-th time)

  16. Minimizing regret illustrates a classic problem in decision making:
      - We need to trade off exploration (gathering data about arm payoffs) and exploitation
        (making decisions based on the data already gathered)
      - Exploration: Pull an arm we have never pulled before
      - Exploitation: Pull an arm a for which we currently have the highest estimate of μ_a

  17. Algorithm: Epsilon-Greedy
      For t = 1:T
      - Set ε_t = O(1/t)
      - With prob. ε_t: Explore by picking an arm chosen uniformly at random
      - With prob. 1 - ε_t: Exploit by picking the arm with the highest empirical mean payoff
      Theorem [Auer et al. '02]: For a suitable choice of ε_t it holds that
      R_T = O(k log T), and so R_T / T = O(k log T / T) → 0
      (k ... number of arms)
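A minimal sketch of epsilon-greedy in Python, usable with the BernoulliBandit sketch from slide 11. The concrete schedule ε_t = min(1, k/t) is one common instantiation of O(1/t), not necessarily the constant required by the theorem.

```python
import random

def epsilon_greedy(pull, k, T):
    """pull(arm) -> reward in {0, 1}; k arms; T rounds."""
    counts = [0] * k          # m_a: number of pulls of arm a
    means = [0.0] * k         # empirical mean payoff of arm a
    pulled = []               # sequence of chosen arms (useful for regret bookkeeping)
    for t in range(1, T + 1):
        eps = min(1.0, k / t)                              # epsilon_t = O(1/t)
        if random.random() < eps:
            arm = random.randrange(k)                      # explore: uniform random arm
        else:
            arm = max(range(k), key=lambda a: means[a])    # exploit: highest empirical mean
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]       # incremental mean update
        pulled.append(arm)
    return means, pulled

# Example usage (with the BernoulliBandit sketch above):
# means, pulled = epsilon_greedy(BernoulliBandit([0.5, 0.1, 0.8]).pull, k=3, T=10_000)
```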

  18. What are some issues with Epsilon-Greedy?
      - "Not elegant": the algorithm explicitly distinguishes between exploration and exploitation
      - More importantly: exploration makes suboptimal choices (since it picks any arm with equal
        likelihood)
      - Idea: When exploring/exploiting we need to compare arms

  19. Suppose we have done some experiments:
      - Arm 1: 1 0 0 1 1 0 0 1 0 1
      - Arm 2: 1
      - Arm 3: 1 1 0 1 1 1 0 1 1 1
      Mean arm values:
      - Arm 1: 5/10, Arm 2: 1, Arm 3: 8/10
      Which arm would you pick next?
      Idea: Don't just look at the mean (that is, the expected payoff) but also the confidence!

  20. A confidence interval is a range of values within which we are sure the mean lies with a
      certain probability
      - For example, we could believe μ_a is within [0.2, 0.5] with probability 0.95
      - If we have tried an action less often, our estimated reward is less accurate, so the
        confidence interval is larger
      - The interval shrinks as we get more information (i.e., try the action more often)

  21. Assuming we know the confidence intervals:
      - Instead of trying the action with the highest mean, we can try the action with the highest
        upper bound on its confidence interval
      - This is called an optimistic policy
      - We believe an action is as good as possible given the available evidence

  22. [figure: the 99.99% confidence interval around μ_a for arm a, shown before and after more
      exploration; after more exploration the interval is narrower]

  23. Suppose we fix arm a:
      - Let X_{a,1} ... X_{a,m} be the payoffs of arm a in the first m trials
      - X_{a,1} ... X_{a,m} are i.i.d. random variables with values in [0, 1]
      - Expected mean payoff of arm a: μ_a = E[X_{a,ℓ}]
      - Our estimate: μ̂_{a,m} = (1/m) Σ_{ℓ=1}^{m} X_{a,ℓ}
      - Want to find a confidence bound b such that with high probability |μ_a - μ̂_{a,m}| ≤ b
        - Also want b to be as small as possible (why?)
      - Goal: Want to bound P(|μ_a - μ̂_{a,m}| ≤ b)

  24. Hoeffding's inequality bounds P(|μ_a - μ̂_{a,m}| ≤ b):
      - Let X_1 ... X_m be i.i.d. random variables with values in [0, 1]
      - Let μ = E[X] and μ̂_m = (1/m) Σ_{ℓ=1}^{m} X_ℓ
      - Then: P(|μ - μ̂_m| ≥ b) ≤ exp(-2 b² m) = δ
      To find the confidence interval b for a given confidence level δ we solve:
      exp(-2 b² m) ≤ δ, so -2 b² m ≤ ln(δ), and therefore b ≥ sqrt(ln(1/δ) / (2m))
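A quick numeric check of the derived bound b = sqrt(ln(1/δ) / (2m)); the particular δ and m values below are arbitrary illustrations.

```python
import math

def hoeffding_bound(m, delta):
    """Width b such that P(|mu - mu_hat_m| >= b) <= delta for m i.i.d. samples in [0, 1]."""
    return math.sqrt(math.log(1.0 / delta) / (2 * m))

# With confidence level delta = 0.05, the interval shrinks as we pull the arm more often:
print(hoeffding_bound(m=10, delta=0.05))    # ~0.387
print(hoeffding_bound(m=1000, delta=0.05))  # ~0.039
```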

  25. UCB1 (Upper confidence sampling) algorithm [Auer et al. '02]
      - Set μ̂_1 = ... = μ̂_k = 0 and m_1 = ... = m_k = 0
        - μ̂_a is our estimate of the payoff of arm a
        - m_a is the number of pulls of arm a so far
      - For t = 1:T
        - For each arm a calculate the upper confidence bound (from Hoeffding's inequality):
          UCB_a = μ̂_a + sqrt(2 ln t / m_a)
        - Pick arm j = arg max_a UCB_a
        - Pull arm j and observe y_t
        - Set m_j ← m_j + 1 and μ̂_j ← (1/m_j) (y_t + (m_j - 1) μ̂_j)
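A minimal Python sketch of the UCB1 loop described on the slide. Untried arms are given an infinite upper bound so that every arm is pulled at least once; the function and variable names are my own.

```python
import math

def ucb1(pull, k, T):
    """pull(arm) -> reward in [0, 1]; plays T rounds with the UCB1 rule."""
    counts = [0] * k          # m_a: pulls of arm a so far
    means = [0.0] * k         # empirical mean payoff estimate of arm a
    pulled = []
    for t in range(1, T + 1):
        # UCB_a = mu_hat_a + sqrt(2 ln t / m_a); untried arms get priority.
        ucb = [
            means[a] + math.sqrt(2 * math.log(t) / counts[a]) if counts[a] > 0 else float("inf")
            for a in range(k)
        ]
        arm = max(range(k), key=lambda a: ucb[a])
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        pulled.append(arm)
    return means, pulled

# Example usage (with the BernoulliBandit sketch above):
# means, pulled = ucb1(BernoulliBandit([0.5, 0.1, 0.8]).pull, k=3, T=10_000)
```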

  26. Recall: b ≥ sqrt(ln(1/δ) / (2m)) and UCB_a = μ̂_a + sqrt(2 ln t / m_a)
      - t impacts the value of δ: t = f(1/δ)
      - The confidence interval grows with the total number of actions t we have taken
      - But it shrinks with the number of times m_a we have tried arm a
      - This ensures each arm is tried infinitely often but still balances exploration and
        exploitation
      "Optimism in the face of uncertainty": the algorithm believes that it can obtain extra
      rewards by reaching the unexplored parts of the state space

  27. Theorem [Auer et al. 2002]
      Suppose the optimal mean payoff is μ* = max_a μ_a, and for each arm let Δ_a = μ* - μ_a.
      Then it holds that
        E[R_T] ≤ 8 Σ_{a: μ_a < μ*} (ln T / Δ_a) + (1 + π²/3) Σ_a Δ_a
      where the first term is O(k ln T) and the second is O(k).
      So: R_T / T ≤ O(k ln T / T)
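To spell out the last step (with Δ_min = min_{a: Δ_a > 0} Δ_a denoting the smallest positive gap, a symbol introduced here rather than on the slide): since Δ_a ≥ Δ_min for every suboptimal arm and Δ_a ≤ 1 (payoffs lie in [0, 1]), the bound gives E[R_T] ≤ 8 (k - 1) ln T / Δ_min + (1 + π²/3) k. Dividing by T yields E[R_T] / T = O(k ln T / T) → 0 as T → ∞.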

  28. The k-armed bandit problem is a formalization of the exploration-exploitation tradeoff
      Simple algorithms are able to achieve no regret (in the limit as T → ∞):
      - Epsilon-greedy
      - UCB (Upper Confidence Sampling)
