Showing Relevant Ads via Context Multi-Armed Bandits Dávid Pál December 17, 2008 A&C Seminar joint work with Tyler Lu and Martin Pál
The Problem • we’re running a popular website • users visit our website • we want to show each user an ad that is relevant to him/her • relevant = likely to be clicked on • for each user there is some side information (search query, geographic location, cookies, etc.)
Multi-Armed Bandits • pulling an arm = showing an ad • reward = click on the ad
Previous Work Context-Free Multi-Armed Bandits • historical papers by Robbins in early 1950’s • stochastic version: Lai & Robbins 1985, Auer et al. 2002 • non-stochastic version: Auer et al. 1995 • Lipschitz version: R. Kleinberg 2005, Auer et al. 2007, R. Kleinberg et al. 2008
Overview • Our model with context and Lipschitz condition • Regret and No-Regret learning • Statement of our results: • upper and lower bound on the regret • Our algorithm • Idea of the analysis of the algorithm
Lipschitz Context Multi-Armed Bandits • information $x$ about the user (context) • suppose we show ad $y$ • with probability $\mu(x, y)$ the user clicks on the ad • assume $\mu : X \times Y \to [0, 1]$ is Lipschitz: $|\mu(x, y) - \mu(x', y')| \le L_X(x, x') + L_Y(y, y')$, where $L_X$ and $L_Y$ are metrics on $X$ and $Y$
The Game • adversary chooses $\mu : X \times Y \to [0, 1]$ and a sequence $x_1, x_2, \dots, x_T$ • algorithm chooses $y_1, y_2, \dots, y_T$ online: • in round $t = 1, 2, \dots, T$ the algorithm has access to • $x_1, x_2, \dots, x_{t-1}$ • $y_1, y_2, \dots, y_{t-1}$ • $\hat\mu_1, \hat\mu_2, \dots, \hat\mu_{t-1} \in \{0, 1\}$ (the observed clicks) • adversary reveals $x_t$ • based on this the algorithm outputs $y_t$
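The protocol on this slide can be sketched as a simulation loop. This is a minimal illustration, not code from the talk: `play_game`, the `choose` callback, `example_mu` and `copy_context` are hypothetical names, contexts and ads are assumed to be numbers in [0, 1], and each click is assumed to be an independent Bernoulli draw with mean $\mu(x_t, y_t)$.

```python
import random
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[float, float, int]]  # (x_s, y_s, click_s) for past rounds


def play_game(
    mu: Callable[[float, float], float],
    contexts: Sequence[float],
    choose: Callable[[History, float], float],
    rng: random.Random,
) -> History:
    """Run the T-round game and return the full history."""
    history: History = []
    for x_t in contexts:                       # adversary reveals x_t
        y_t = choose(history, x_t)             # algorithm sees only the past and x_t
        click = 1 if rng.random() < mu(x_t, y_t) else 0  # Bernoulli(mu(x_t, y_t))
        history.append((x_t, y_t, click))
    return history


# Hypothetical example: this mu satisfies the Lipschitz condition with
# L_X(x, x') = |x - x'| and L_Y(y, y') = |y - y'|.
def example_mu(x: float, y: float) -> float:
    return 1.0 - abs(x - y)


def copy_context(history: History, x_t: float) -> float:
    """A naive policy that ignores the history and shows the ad y = x_t."""
    return x_t


history = play_game(example_mu, [0.2, 0.5, 0.9], copy_context, random.Random(0))
```

Passing the history explicitly makes it visible that the algorithm never sees $\mu$ itself, only past contexts, past ads and past clicks.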
Regret • optimal strategy: in round $t = 1, 2, \dots, T$ show $y^*_t = \operatorname{argmax}_{y \in Y} \mu(x_t, y)$ • the algorithm shows instead $y_1, y_2, \dots, y_T$ • difference between expected payoffs: $\mathrm{Regret}(T) = \sum_{t=1}^{T} \mu(x_t, y^*_t) - \mathbf{E}\left[\sum_{t=1}^{T} \mu(x_t, y_t)\right]$
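For a single run, the quantity inside the expectation can be computed directly when the ad set is finite. The sketch below assumes a finite candidate list `ads` so that the argmax can be found by enumeration; `realized_regret` is a hypothetical helper name. Averaging its value over many independent runs would give an estimate of $\mathrm{Regret}(T)$.

```python
from typing import Callable, Sequence


def realized_regret(
    mu: Callable[[float, float], float],
    contexts: Sequence[float],
    shown: Sequence[float],
    ads: Sequence[float],
) -> float:
    """Sum over rounds of mu(x_t, y*_t) - mu(x_t, y_t) for one run."""
    total = 0.0
    for x_t, y_t in zip(contexts, shown):
        best = max(mu(x_t, y) for y in ads)    # mu(x_t, y*_t) over the finite ad set
        total += best - mu(x_t, y_t)
    return total
```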
No Regret Learning • per-round regret vanishes: $\lim_{T \to \infty} \mathrm{Regret}(T)/T = 0$ • how fast is the convergence? typical result: $\mathrm{Regret}(T) = O(T^{\gamma})$ where $0 < \gamma < 1$.
Our Results (Oversimplifying and lying somewhat.) Theorem If $X$ has “dimension” $a$ and $Y$ has “dimension” $b$, then • there exists an algorithm with $\mathrm{Regret}(T) = \tilde{O}\left(T^{\frac{a+b+1}{a+b+2}}\right)$ • for any algorithm $\mathrm{Regret}(T) = \Omega\left(T^{\frac{a+b+1}{a+b+2}}\right)$
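To get a feel for the exponent in the theorem, it can be tabulated for a few dimensions. The helper `regret_exponent` is a hypothetical name, and integer dimensions are assumed only to keep the arithmetic exact.

```python
from fractions import Fraction


def regret_exponent(a: int, b: int) -> Fraction:
    """Exponent (a + b + 1) / (a + b + 2) from the theorem."""
    return Fraction(a + b + 1, a + b + 2)


for a, b in [(0, 0), (1, 1), (2, 3)]:
    print(f"a={a}, b={b}: Regret(T) grows like T^({regret_exponent(a, b)})")
```

With $a = b = 0$ the exponent is $1/2$, and with $a = b = 1$ it is $3/4$; the exponent approaches 1 as the dimensions grow, so the guarantee weakens in high dimension.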
Covering Dimension • let $(Z, L_Z)$ be a metric space • cover the space with $\epsilon$-balls • How many balls do we need? • roughly $(1/\epsilon)^d$ • define $d$ to be the dimension
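As a concrete instance, take the cube $[0, 1]^d$ with the sup-norm metric (an assumption made here for illustration; the slide speaks of general metric spaces). An $\epsilon$-ball is then a small cube of side $2\epsilon$, and a grid cover needs $\lceil 1/(2\epsilon) \rceil^d$ balls. The function names below are hypothetical.

```python
import math


def covering_number(eps: float, d: int) -> int:
    """Number of sup-norm eps-balls in a grid cover of the cube [0, 1]^d."""
    # Each axis is split into intervals of length at most 2 * eps; the small
    # slack guards against floating-point round-off in the division.
    per_axis = math.ceil(1.0 / (2.0 * eps) - 1e-12)
    return max(per_axis, 1) ** d


def dimension_estimate(eps: float, d: int) -> float:
    """log N(eps) / log(1 / eps), which approaches d as eps -> 0."""
    return math.log(covering_number(eps, d)) / math.log(1.0 / eps)
```

The ratio in `dimension_estimate` tends to $d$ only slowly, because the constant factor $2^{-d}$ in the ball count matters at moderate $\epsilon$; this is one reason the slide says "roughly".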
Optimal Algorithm • suppose that $T$ is known to the algorithm • $X$, $Y$ have dimensions $a$, $b$ respectively • discretize $X$ and $Y$: $\epsilon = T^{-\frac{1}{a+b+2}}$ • $X_0$ are centers of $\epsilon$-balls covering $X$ • $Y_0$ are centers of $\epsilon$-balls covering $Y$ • round $x_t$ to nearest element of $X_0$ • display only ads from $Y_0$
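One heuristic way to see where this choice of $\epsilon$ comes from (a back-of-the-envelope sketch with constants and logarithmic factors dropped, not the proof from the talk): by the Lipschitz condition, rounding contexts and ads to the nets costs about $\epsilon$ per round, while learning the roughly $(1/\epsilon)^{a+b}$ pairs in $X_0 \times Y_0$ costs about the square root of (number of pairs times $T$), as for a finite-armed bandit.

```latex
\[
\mathrm{Regret}(T) \;\lesssim\;
  \underbrace{\epsilon T}_{\text{discretization}}
  \;+\;
  \underbrace{\sqrt{\epsilon^{-(a+b)}\, T}}_{\text{learning } |X_0| \cdot |Y_0| \text{ pairs}}
\]
% Balancing the two terms:
\[
\epsilon T = \sqrt{\epsilon^{-(a+b)}\, T}
\;\Longleftrightarrow\;
\epsilon^{a+b+2} = T^{-1}
\;\Longleftrightarrow\;
\epsilon = T^{-\frac{1}{a+b+2}},
\qquad\text{so that}\qquad
\epsilon T = T^{\frac{a+b+1}{a+b+2}} .
\]
```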
Optimal Algorithm, continued • for each $x_0 \in X_0$ and $y_0 \in Y_0$ maintain: • number of times $y_0$ was displayed for $x_0$: $n(x_0, y_0)$ • corresponding number of clicks: $m(x_0, y_0)$ • estimate of the click-through rate: $\bar\mu(x_0, y_0) = m(x_0, y_0) / n(x_0, y_0)$
Optimal Algorithm, continued • when $x_t$ arrives “round” it to $x_0 \in X_0$ • show ad $y_0 \in Y_0$ that maximizes $\bar\mu(x_0, y_0) + \sqrt{\frac{\log T}{1 + n(x_0, y_0)}}$, where $\bar\mu$ is the estimated click-through rate (exploration vs. exploitation trade-off)
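The algorithm on these slides can be sketched for the special case $X = Y = [0, 1]$ with the absolute-value metric (so $a = b = 1$). This is an illustrative sketch under those assumptions, not the authors' implementation: the class and method names are hypothetical, the estimate is taken to be 0 for a pair that has never been displayed (the slides do not say what to do when $n = 0$), and ties are broken in favor of the first ad in $Y_0$.

```python
import math
from typing import Dict, List, Tuple


def net(eps: float) -> List[float]:
    """Centers of intervals of half-width at most eps covering [0, 1]."""
    k = max(1, math.ceil(1.0 / (2.0 * eps) - 1e-12))
    return [(2 * i + 1) / (2 * k) for i in range(k)]


class LipschitzContextBandit:
    """Index policy on a discretized X = Y = [0, 1]; T must be at least 2."""

    def __init__(self, T: int, a: int = 1, b: int = 1) -> None:
        self.T = T
        self.eps = T ** (-1.0 / (a + b + 2))          # eps = T^(-1/(a+b+2))
        self.X0 = net(self.eps)
        self.Y0 = net(self.eps)
        self.n: Dict[Tuple[float, float], int] = {}   # displays per (x0, y0)
        self.m: Dict[Tuple[float, float], int] = {}   # clicks per (x0, y0)

    def round_context(self, x: float) -> float:
        """Nearest element of X0 to the context x."""
        return min(self.X0, key=lambda x0: abs(x0 - x))

    def index(self, x0: float, y0: float) -> float:
        """Estimated click-through rate plus the exploration bonus."""
        n = self.n.get((x0, y0), 0)
        mean = self.m.get((x0, y0), 0) / n if n > 0 else 0.0  # assumed convention for n = 0
        return mean + math.sqrt(math.log(self.T) / (1 + n))

    def choose(self, x: float) -> float:
        """Ad in Y0 with the largest index for the rounded context."""
        x0 = self.round_context(x)
        return max(self.Y0, key=lambda y0: self.index(x0, y0))

    def update(self, x: float, y0: float, click: int) -> None:
        """Record one display of y0 for context x and whether it was clicked."""
        key = (self.round_context(x), y0)
        self.n[key] = self.n.get(key, 0) + 1
        self.m[key] = self.m.get(key, 0) + click
```

In each round one would call `choose(x_t)`, display the returned ad, and pass the observed click (0 or 1) to `update`. Keeping the counts in dictionaries keyed by `(x0, y0)` means that memory is used only for pairs that were actually displayed.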
Idea of Analysis • let $R_t(x_0, y_0) = \sqrt{\frac{\log T}{1 + n(x_0, y_0)}}$ and $I_t(x_0, y_0) = \bar\mu(x_0, y_0) + R_t(x_0, y_0)$, where $\bar\mu$ is the estimated click-through rate and $\mu$ the true one • By the Chernoff-Hoeffding bound, with high probability $I_t(x_0, y_0) \in [\mu(x_0, y_0) - \epsilon,\ \mu(x_0, y_0) + 2 R_t(x_0, y_0) + \epsilon]$ for all $x_0 \in X_0$, $y_0 \in Y_0$ and all $t = 1, 2, \dots, T$ simultaneously.
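A sketch of why the interval has this shape, conditional on the concentration event. The symbol $\tilde\mu(x_0, y_0)$ is introduced here for illustration and denotes the average of the true rates $\mu(x_s, y_0)$ over the rounds $s$ in which $y_0$ was displayed for a context rounded to $x_0$; it is assumed that every such context lies within distance $\epsilon$ of $x_0$.

```latex
% \bar\mu: empirical click-through rate; \tilde\mu: average true rate over the same rounds
\[
|\bar\mu(x_0, y_0) - \tilde\mu(x_0, y_0)| \le R_t(x_0, y_0)
\;\Longrightarrow\;
\tilde\mu(x_0, y_0) \;\le\; \underbrace{\bar\mu(x_0, y_0) + R_t(x_0, y_0)}_{I_t(x_0, y_0)} \;\le\; \tilde\mu(x_0, y_0) + 2 R_t(x_0, y_0)
\]
% Lipschitz condition: L_X(x_s, x_0) \le \epsilon for every such round s, hence
\[
|\tilde\mu(x_0, y_0) - \mu(x_0, y_0)| \le \epsilon
\;\Longrightarrow\;
\mu(x_0, y_0) - \epsilon \;\le\; I_t(x_0, y_0) \;\le\; \mu(x_0, y_0) + 2 R_t(x_0, y_0) + \epsilon .
\]
```

The first display is where the concentration bound enters; the second uses only the Lipschitz assumption on $\mu$.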
Idea of Analysis Fix $x_0 \in X_0$. [Figure: the click-through rates $\mu(x_0, y_1), \dots, \mu(x_0, y_4)$ of ads $y_1, \dots, y_4 \in Y_0$, plotted as the function $\mu(x_0, \cdot)$.]
Idea of Analysis The confidence intervals: [Figure: for each ad, the interval from $\mu(x_0, \cdot) - \epsilon$ to $\mu(x_0, \cdot) + 2 R_t(x_0, \cdot) + \epsilon$.]
Idea of Analysis • The algorithm displays the ad maximizing $I_t(x_0, \cdot)$. • Each $I_t(x_0, y_0)$ lies w.h.p. in its confidence interval. [Figure: the values $I_t(x_0, \cdot)$ inside the confidence intervals.]
Idea of Analysis $\mathrm{Regret}(T) = \sum_{t=1}^{T} \mu(x_t, y^*_t) - \mathbf{E}\left[\sum_{t=1}^{T} \mu(x_t, y_t)\right]$ • contribution to the regret of displaying a suboptimal ad $y$ instead of the optimal ad $y^*$: $\mu(x_0, y^*) - \mu(x_0, y)$
Idea of Analysis If $\mu(x_0, y) + 2 R_t(x_0, y) + \epsilon < \mu(x_0, y^*) - \epsilon$, that is, if the upper end of the confidence interval of the suboptimal ad $y$ lies below the lower end of the interval of the optimal ad $y^*$, the algorithm stops displaying $y$.
Idea of Analysis $R_t(x_0, y) = \sqrt{\frac{\log T}{1 + n(x_0, y)}}$ • The confidence interval for $y$ shrinks as $n_t(x_0, y)$, the value of $n(x_0, y)$ at round $t$, increases. • Thus we can upper bound $n_t(x_0, y)$ in terms of the difference $\mu(x_0, y^*) - \mu(x_0, y)$ • The rest is just a long calculation.
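The step from the confidence intervals to a bound on the number of displays can be made explicit. Write $\Delta = \mu(x_0, y^*) - \mu(x_0, y)$ (a symbol introduced here), use the upper confidence limit $\mu(x_0, y) + 2 R_t(x_0, y) + \epsilon$ and the lower limit $\mu(x_0, y^*) - \epsilon$ from the earlier slide, and assume $\Delta > 2\epsilon$. The suboptimal ad $y$ is no longer displayed once

```latex
\[
2 R_t(x_0, y) + 2\epsilon < \Delta
\;\Longleftrightarrow\;
\sqrt{\frac{\log T}{1 + n(x_0, y)}} < \frac{\Delta - 2\epsilon}{2}
\;\Longleftrightarrow\;
n(x_0, y) > \frac{4 \log T}{(\Delta - 2\epsilon)^2} - 1 .
\]
```

So, on the high-probability event, $y$ is displayed for $x_0$ at most about $4 \log T / (\Delta - 2\epsilon)^2$ times. Ads with $\Delta \le 2\epsilon$ are not covered by this bound, but each of their displays costs at most $2\epsilon$.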
Conclusion • formulation of Context Multi-Armed Bandits • roughly matching upper and lower bounds: $T^{\frac{a+b+1}{a+b+2}}$ • www.cs.uwaterloo.ca/~dpal/papers/ • possible future work: non-stochastic clicks Thanks!