Showing Relevant Ads via Context Multi-Armed Bandits - Dávid Pál

Showing Relevant Ads via Context Multi-Armed Bandits Dávid Pál December 17, 2008 A&C Seminar joint work with Tyler Lu and Martin Pál

The Problem • we run a popular website • users visit our website • we want to show each user an ad that is relevant to him/her • relevant = likely to be clicked on • for each user there is some side information • (search query, geographic location, cookies, etc.)

Multi-Armed Bandits • pulling an arm = showing an ad • reward = click on the ad

Previous Work Context-Free Multi-Armed Bandits • historical papers by Robbins in the early 1950s • stochastic version: Lai & Robbins 1985, Auer et al. 2002 • non-stochastic version: Auer et al. 1995 • Lipschitz version: R. Kleinberg 2005, Auer et al. 2007, R. Kleinberg et al. 2008

Overview • Our model with context and Lipschitz condition • Regret and No-Regret learning • Statement of our results: • upper and lower bound on the regret • Our algorithm • Idea of the analysis of the algorithm

Lipschitz Context Multi-Armed Bandits • information $x$ about the user (context) • suppose we show ad $y$ • with probability $\mu(x, y)$ the user clicks on the ad • assume $\mu : X \times Y \to [0, 1]$ is Lipschitz: $|\mu(x, y) - \mu(x', y')| \le L_X(x, x') + L_Y(y, y')$, where $L_X$ and $L_Y$ are metrics on $X$ and $Y$

The Game • adversary chooses $\mu : X \times Y \to [0, 1]$ and a sequence $x_1, x_2, \dots, x_T$ • algorithm chooses $y_1, y_2, \dots, y_T$ online: • in round $t = 1, 2, \dots, T$ the algorithm has access to • $x_1, x_2, \dots, x_{t-1}$ • $y_1, y_2, \dots, y_{t-1}$ • $\hat{\mu}_1, \hat{\mu}_2, \dots, \hat{\mu}_{t-1} \in \{0, 1\}$ • adversary reveals $x_t$ • based on this the algorithm outputs $y_t$
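The protocol above can be sketched as a short simulation. This is only an illustration under stated assumptions: the click probability `mu`, the spaces X = Y = [0, 1], and the names `play` and `choose_ad` are hypothetical and do not come from the talk.

```python
import random


def mu(x: float, y: float) -> float:
    """Hypothetical click probability on X = Y = [0, 1] (an assumption).

    It is 1-Lipschitz in each argument and takes values in [0, 1].
    """
    return 1.0 - abs(x - y)


def play(choose_ad, contexts, rng):
    """Run the game: reveal x_t, let the algorithm pick y_t, sample a click."""
    history = []  # triples (x_s, y_s, click_s) from earlier rounds s < t
    for x in contexts:
        y = choose_ad(history, x)  # the algorithm sees the past and x_t only
        click = 1 if rng.random() < mu(x, y) else 0  # observed reward in {0, 1}
        history.append((x, y, click))
    return history
```

A policy is any function of the history and the current context; for example, `play(lambda history, x: 0.5, contexts, random.Random(0))` always shows the ad 0.5.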

Regret • optimal strategy: in round $t = 1, 2, \dots, T$ show $y^*_t = \operatorname{argmax}_{y \in Y} \mu(x_t, y)$ • the algorithm shows instead $y_1, y_2, \dots, y_T$ • difference between expected payoffs: $\mathrm{Regret}(T) = \sum_{t=1}^{T} \mu(x_t, y^*_t) - \mathbf{E}\left[ \sum_{t=1}^{T} \mu(x_t, y_t) \right]$
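As a small worked example of the regret definition, the sketch below sums the per-round gaps for one fixed sequence of shown ads. The function `mu` and the finite candidate set are assumptions made for illustration, and for a randomized algorithm the result would still have to be averaged over its randomness.

```python
def mu(x: float, y: float) -> float:
    """Hypothetical click probability on X = Y = [0, 1] (an assumption)."""
    return 1.0 - abs(x - y)


def regret_of_run(contexts, shown_ads, candidate_ads):
    """Sum over rounds of mu(x_t, y*_t) - mu(x_t, y_t) for one realized run.

    y*_t is taken as the best ad within candidate_ads, which stands in for Y.
    """
    total = 0.0
    for x, y in zip(contexts, shown_ads):
        best = max(mu(x, c) for c in candidate_ads)
        total += best - mu(x, y)
    return total
```

With contexts `[0.25, 0.75]`, shown ads `[0.25, 0.25]` and candidates `[0.0, 0.25, 0.5, 0.75, 1.0]`, the first round contributes nothing and the second contributes 1 - 0.5, so the total is 0.5.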

No Regret Learning • per-round regret vanishes: $\lim_{T \to \infty} \mathrm{Regret}(T) / T = 0$ • how fast is the convergence? typical result: $\mathrm{Regret}(T) = O(T^{\gamma})$ where $0 < \gamma < 1$.

Our Results (Oversimplifying and lying somewhat.) Theorem If $X$ has “dimension” $a$ and $Y$ has “dimension” $b$, then • there exists an algorithm with $\mathrm{Regret}(T) = \widetilde{O}\left( T^{\frac{a+b+1}{a+b+2}} \right)$ • for any algorithm $\mathrm{Regret}(T) = \Omega\left( T^{\frac{a+b+1}{a+b+2}} \right)$
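To get a feel for the exponent in the theorem, here is a small calculation; the helper name `regret_exponent` is hypothetical. For a = b = 0 it gives 1/2, the familiar square-root rate of finite-armed bandits, and for a = b = 1 it gives 3/4.

```python
from fractions import Fraction


def regret_exponent(a: int, b: int) -> Fraction:
    """Exponent (a + b + 1) / (a + b + 2) of T in the stated bounds."""
    return Fraction(a + b + 1, a + b + 2)


for a, b in [(0, 0), (0, 1), (1, 1), (2, 2)]:
    print(f"a={a}, b={b}: Regret(T) grows roughly like T^({regret_exponent(a, b)})")
```

The exponent approaches 1 as a + b grows, so higher-dimensional context and ad spaces give slower convergence of the per-round regret.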

Covering Dimension • let $(Z, L_Z)$ be a metric space • cover the space with $\epsilon$-balls • How many balls do we need? • roughly $(1/\epsilon)^d$ • define $d$ to be the dimension
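As an illustration of the counting argument, the sketch below counts the balls in a simple grid cover of the unit cube $[0,1]^d$ under the sup-norm metric. The choice of space, metric, and grid cover are assumptions made for the example; the count is an upper bound on the covering number and grows like $(1/\epsilon)^d$ up to a constant factor.

```python
import math


def grid_cover_size(eps: float, d: int) -> int:
    """Number of sup-norm eps-balls in a grid cover of the unit cube [0, 1]^d.

    Each ball is a cube of side 2 * eps, so ceil(1 / (2 * eps)) balls are
    enough along each axis.
    """
    per_axis = math.ceil(1.0 / (2.0 * eps))
    return per_axis ** d


for eps in (0.25, 0.125, 0.0625):
    print(eps, grid_cover_size(eps, d=2))
```

Halving `eps` multiplies the count by about $2^d$, which is how the exponent $d$ can be read off.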

Optimal Algorithm • suppose that $T$ is known to the algorithm • $X$, $Y$ have dimensions $a$, $b$ respectively • discretize $X$ and $Y$ with $\epsilon = T^{-\frac{1}{a+b+2}}$ • $X_0$ are centers of $\epsilon$-balls covering $X$ • $Y_0$ are centers of $\epsilon$-balls covering $Y$ • round $x_t$ to the nearest element of $X_0$ • display only ads from $Y_0$
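The discretization step can be sketched for the simplest case X = Y = [0, 1] with the usual distance, so that a = b = 1. The interval, the metric, and the helper names `epsilon_for`, `centers`, and `round_to_center` are assumptions made for illustration.

```python
import math


def epsilon_for(T: int, a: int, b: int) -> float:
    """Discretization scale eps = T^(-1 / (a + b + 2))."""
    return T ** (-1.0 / (a + b + 2))


def centers(eps: float) -> list:
    """Centers of intervals of radius eps that together cover [0, 1]."""
    k = math.ceil(1.0 / (2.0 * eps))
    # The i-th ball covers [2*i*eps, (2*i + 2)*eps]; the last center is
    # clamped to 1.0 so that every center stays inside the space.
    return [min(1.0, (2 * i + 1) * eps) for i in range(k)]


def round_to_center(x: float, grid: list) -> float:
    """Nearest element of the grid to x."""
    return min(grid, key=lambda c: abs(x - c))
```

For example, `centers(0.25)` returns `[0.25, 0.75]`, and `round_to_center(0.6, [0.25, 0.75])` returns `0.75`.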

Optimal Algorithm, continued • for each $x_0 \in X_0$ and $y_0 \in Y_0$ maintain: • number of times $y_0$ was displayed for $x_0$: $n(x_0, y_0)$ • corresponding number of clicks: $m(x_0, y_0)$ • empirical estimate of the click-through rate: $\bar{\mu}(x_0, y_0) = m(x_0, y_0) / n(x_0, y_0)$

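Optimal Algorithm, continued • when $x_t$ arrives, “round” it to the nearest $x_0 \in X_0$, which lies within distance $\epsilon$ of $x_t$ • show the ad $y_0 \in Y_0$ that maximizes $\bar{\mu}(x_0, y_0) + \sqrt{\frac{\log T}{1 + n(x_0, y_0)}}$, where $\bar{\mu}(x_0, y_0)$ is the empirical click-through rate (exploration vs. exploitation trade-off)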
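Putting the pieces together, a minimal sketch of the algorithm on X = Y = [0, 1] might look as follows. Several things here are assumptions rather than part of the talk: the click probability `mu`, contexts drawn uniformly at random instead of by an adversary, treating the empirical rate as 0 before an ad has been shown, and all helper names.

```python
import math
import random
from collections import defaultdict


def mu(x: float, y: float) -> float:
    """Hypothetical true click probability, unknown to the algorithm."""
    return 1.0 - abs(x - y)


def centers(eps: float) -> list:
    """Centers of intervals of radius eps covering [0, 1]."""
    k = math.ceil(1.0 / (2.0 * eps))
    return [min(1.0, (2 * i + 1) * eps) for i in range(k)]


def run(T: int, a: int = 1, b: int = 1, seed: int = 0):
    rng = random.Random(seed)
    eps = T ** (-1.0 / (a + b + 2))
    X0, Y0 = centers(eps), centers(eps)
    n = defaultdict(int)  # n[(x0, y0)]: times y0 was displayed for x0
    m = defaultdict(int)  # m[(x0, y0)]: clicks among those displays
    regret = 0.0
    for _ in range(T):
        x = rng.random()  # assumption: random rather than adversarial contexts
        x0 = min(X0, key=lambda c: abs(x - c))  # round x_t to X0

        def index(y0):
            shown = n[(x0, y0)]
            # Assumption: use 0 as the estimate before the first display.
            mean = m[(x0, y0)] / shown if shown else 0.0
            return mean + math.sqrt(math.log(T) / (1 + shown))

        y = max(Y0, key=index)  # estimate plus exploration bonus
        click = 1 if rng.random() < mu(x, y) else 0
        n[(x0, y)] += 1
        m[(x0, y)] += click
        regret += 1.0 - mu(x, y)  # the best ad for x is y = x, with mu = 1
    return regret, n


if __name__ == "__main__":
    total, _ = run(2000)
    print(f"regret over 2000 rounds: {total:.1f}")
```

The bonus term shrinks as a pair is displayed more often, so rarely shown ads keep getting tried until their estimates are reliable enough to rule them out.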

Idea of Analysis • let $R_t(x_0, y_0) = \sqrt{\frac{\log T}{1 + n_t(x_0, y_0)}}$ and $I_t(x_0, y_0) = \bar{\mu}_t(x_0, y_0) + R_t(x_0, y_0)$, where $\bar{\mu}_t$ is the empirical click-through rate and $n_t$ the display count at round $t$ • By the Chernoff-Hoeffding bound, with high probability $I_t(x_0, y_0) \in [\mu(x_0, y_0) - \epsilon, \ \mu(x_0, y_0) + 2 R_t(x_0, y_0) + \epsilon]$ for all $x_0 \in X_0$, $y_0 \in Y_0$ and all $t = 1, 2, \dots, T$ simultaneously.

Idea of Analysis Fix $x_0 \in X_0$. [Figure: the values $\mu(x_0, y_1), \mu(x_0, y_2), \mu(x_0, y_3), \mu(x_0, y_4)$ of $\mu(x_0, \cdot)$ for ads $y_1, \dots, y_4 \in Y_0$]

Idea of Analysis The confidence intervals $[\mu(x_0, \cdot) - \epsilon, \ \mu(x_0, \cdot) + 2 R_t(x_0, \cdot) + \epsilon]$ [Figure: one interval drawn above each ad in $Y_0$]

Idea of Analysis • The algorithm displays the ad maximizing $I_t(x_0, \cdot)$. • Each $I_t(x_0, y_0)$ lies with high probability in its confidence interval. [Figure: the values $I_t(x_0, \cdot)$ drawn inside the intervals]

Idea of Analysis $\mathrm{Regret}(T) = \sum_{t=1}^{T} \mu(x_t, y^*_t) - \mathbf{E}\left[ \sum_{t=1}^{T} \mu(x_t, y_t) \right]$ • contribution to the regret of showing a suboptimal ad $y$ instead of the optimal ad $y^*$: $\mu(x_0, y^*) - \mu(x_0, y)$

Idea of Analysis If $\mu(x_0, y) + 2 R_t(x_0, y) + \epsilon < \mu(x_0, y^*) - \epsilon$, that is, if the upper end of the confidence interval of $y$ falls below the lower end of the interval of $y^*$, the algorithm stops displaying the suboptimal ad $y$.

Idea of Analysis $R_t(x_0, y) = \sqrt{\frac{\log T}{1 + n_t(x_0, y)}}$ • Confidence interval for $y$ shrinks as $n_t(x_0, y)$ increases. • Thus we can upper bound $n_t(x_0, y)$ in terms of the difference $\mu(x_0, y^*) - \mu(x_0, y)$ • Rest is just a long calculation.
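The first step of that calculation can be sketched as follows, using the simplified constants of the slides. The symbol $\Delta$ for the gap is introduced here for illustration, and the argument assumes that all confidence intervals hold and that $\Delta > 2\epsilon$.

```latex
% Gap of the suboptimal ad y at the rounded context x_0 (notation assumed here):
\Delta = \mu(x_0, y^*) - \mu(x_0, y) > 2\epsilon .

% With high probability I_t(x_0, y) \le \mu(x_0, y) + 2 R_t(x_0, y) + \epsilon
% and I_t(x_0, y^*) \ge \mu(x_0, y^*) - \epsilon, so y can be displayed only while
\mu(x_0, y) + 2 R_t(x_0, y) + \epsilon \ \ge\ \mu(x_0, y^*) - \epsilon
\quad\Longleftrightarrow\quad
2 \sqrt{\frac{\log T}{1 + n_t(x_0, y)}} \ \ge\ \Delta - 2\epsilon .

% Squaring and rearranging bounds the count at any round in which y is displayed:
1 + n_t(x_0, y) \ \le\ \frac{4 \log T}{(\Delta - 2\epsilon)^2} .
```

Each display of $y$ costs $\Delta$ in regret, so an ad with a large gap is ruled out after few displays, while an ad with a small gap costs little per display.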

Conclusion • formulation of Context Multi-Armed Bandits • roughly matching upper and lower bounds: $T^{\frac{a+b+1}{a+b+2}}$ • www.cs.uwaterloo.ca/~dpal/papers/ • possible future work: non-stochastic clicks Thanks!
