Bandit optimization with large strategy sets, Alexandre Proutière

Bandit optimization with large strategy sets
Alexandre Proutière
Joint work with Richard Combes
Docent lecture, October 11, 2013

Outline
1. Motivation
2. Bandit optimization: background
3. Graphically unimodal bandits
4. Applications

1. Motivation

Rate adaptation in 802.11
Adapting the modulation/coding scheme to the radio environment
- 802.11 a/b/g
- Rates: $r_1, r_2, \dots, r_N$ (e.g., 6, 9, 12, 18, 24, 36, 48, 54 Mbit/s)
- Success probabilities: $\theta_1, \theta_2, \dots, \theta_N$
- Throughputs: $\mu_i = r_i \theta_i$, giving $\mu_1, \mu_2, \dots, \mu_N$
- Yes/No feedback: each transmission either succeeds or fails
- Optimal sequential rate selection?
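As a minimal illustration of $\mu_i = r_i \theta_i$, the sketch below computes the throughput of each 802.11a/g rate and picks the best one. The success probabilities `THETA` are made-up values for illustration only (in practice they are unknown, which is what makes rate selection a learning problem), and `throughputs` and `best_rate_index` are hypothetical helper names.

```python
# Nominal 802.11a/g rates in Mbit/s, as listed on the slide.
RATES = [6, 9, 12, 18, 24, 36, 48, 54]

# Hypothetical success probabilities theta_i (made up): higher rates fail more often.
THETA = [0.99, 0.98, 0.95, 0.90, 0.80, 0.60, 0.30, 0.10]


def throughputs(rates, success_probs):
    """Return mu_i = r_i * theta_i for every rate."""
    if len(rates) != len(success_probs):
        raise ValueError("need exactly one success probability per rate")
    return [r * theta for r, theta in zip(rates, success_probs)]


def best_rate_index(rates, success_probs):
    """Index of the throughput-maximizing rate, i.e. what an oracle knowing theta would pick."""
    mu = throughputs(rates, success_probs)
    return max(range(len(mu)), key=mu.__getitem__)


if __name__ == "__main__":
    for r, m in zip(RATES, throughputs(RATES, THETA)):
        print(f"{r:>2} Mbit/s -> throughput {m:.2f} Mbit/s")
    print("best rate:", RATES[best_rate_index(RATES, THETA)], "Mbit/s")
```

With these made-up numbers the best rate is 36 Mbit/s (throughput 21.6 Mbit/s) rather than the highest rate, 54 Mbit/s; a rate selector therefore has to learn the $\theta_i$ from Yes/No feedback instead of always transmitting at the fastest rate.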

Rate adaptation in 802.11
- 802.11 n/ac MIMO: choose a rate + a MIMO mode
- Example: two modes, single-stream (SS) or double-stream (DS)
  - SS rates: 13.5, 27, 40.5, 54, 81, 108, 121.5, 135 Mbit/s
  - DS rates: 27, 54, 81, 108, 162, 216, 243, 270 Mbit/s
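With MIMO, an action is a (mode, rate) pair, so the strategy set doubles. The short sketch below enumerates it from the rates listed above and shows that some nominal rates are reachable in both modes (typically with different success probabilities, which are not modelled here). `strategy_set` and `rates_in_both_modes` are hypothetical helper names.

```python
# 802.11n rates (Mbit/s) as listed on the slide, one list per MIMO mode.
SS_RATES = [13.5, 27, 40.5, 54, 81, 108, 121.5, 135]  # single-stream
DS_RATES = [27, 54, 81, 108, 162, 216, 243, 270]      # double-stream


def strategy_set():
    """Every (mode, rate) pair is a separate action of the rate-adaptation bandit."""
    return [("SS", r) for r in SS_RATES] + [("DS", r) for r in DS_RATES]


def rates_in_both_modes():
    """Nominal rates that can be obtained with either mode."""
    return sorted(set(SS_RATES) & set(DS_RATES))


if __name__ == "__main__":
    actions = strategy_set()
    print(len(actions), "actions")  # 8 rates x 2 modes
    print("same nominal rate in both modes:", rates_in_both_modes())
    # Ordering the actions by nominal rate interleaves the two modes.
    print(sorted(actions, key=lambda a: a[1]))
```

Because the same nominal rate appears in both modes, the rate alone no longer identifies an action, and ordering the 16 actions by rate mixes SS and DS entries.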

CTR estimation in ad auctions
- Current practice:
  - Bayesian estimator (graphical models + EP, expectation propagation)
  - Minimize the mean square error on CTRs
  - Underlying assumption: a static system
- But the system is dynamic (changing ads, changing CTRs, …)
- … and, most importantly, our goal is to maximize profit, not to minimize the CTR estimation error!

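CTR bandits
- CTR matrix: $\mu_{ij}$ is the click-through rate of ad $i$ (among $m > 10^6$ ads) on query $j$ (among $n > 10^9$ queries)
- Profit after $T$ queries, where $j(t)$ is the $t$-th query and $i(t)$ the ad shown for it:
$$\mathrm{profit}(T) = \sum_{t=1}^{T} b_{i(t)j(t)}\, X_{i(t)j(t)}, \qquad X_{ij} \sim \mathrm{Ber}(\mu_{ij})$$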
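A minimal simulation sketch of $\mathrm{profit}(T)$ on a toy CTR matrix. The slide does not define $b_{ij}$; here it is assumed to be a known value earned per click of ad $i$ on query $j$ (e.g., a bid). `simulate_profit` and the `choose_ad(j, history)` policy interface are hypothetical, chosen only to show that the profit depends on which ad is shown and that only the shown ad's click is observed.

```python
import random


def simulate_profit(mu, b, queries, choose_ad, seed=0):
    """
    Simulate profit(T) = sum_t b[i(t)][j(t)] * X_{i(t) j(t)} with X_ij ~ Ber(mu[i][j]).

    mu[i][j]  : click-through rate of ad i on query j
    b[i][j]   : assumed known value of one click of ad i on query j (e.g. a bid)
    queries   : the sequence j(1), ..., j(T)
    choose_ad : hypothetical policy, called as choose_ad(j, history) -> ad index i
    """
    rng = random.Random(seed)
    profit = 0.0
    history = []  # (query, ad, click) triples: only the shown ad's outcome is recorded
    for j in queries:
        i = choose_ad(j, history)
        x = 1 if rng.random() < mu[i][j] else 0  # Bernoulli click indicator
        profit += b[i][j] * x
        history.append((j, i, x))
    return profit


if __name__ == "__main__":
    mu = [[0.10, 0.02], [0.03, 0.08]]  # toy 2 ads x 2 queries CTR matrix (made up)
    b = [[1.0, 1.0], [2.0, 0.5]]       # toy click values (made up)
    queries = [random.Random(1).randrange(2) for _ in range(1000)]
    always_ad0 = lambda j, history: 0   # naive placeholder policy
    print(simulate_profit(mu, b, queries, always_ad0))
```

In the real problem the matrix has more than $10^6 \times 10^9$ unknown entries, so a policy cannot afford to explore every (ad, query) pair; this is the large-strategy-set issue the lecture addresses.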

Online decision problem
[Figure: the queries (> 10^9) × ads (> 10^6) matrix, filled in step by step with observed outcomes: 1 = click, 0 = no click.]

2. Bandit optimization

Bandit optimization
A sequential decision problem (Thompson 1933)
[Figure: patients 1, 2, 3, 4, 5, … arriving one after another, with two rows of outcomes (D/L): D D D L D … and D L L L D …]
- A set of possible actions at each step
- Unknown sequence of rewards for each action
- Bandit feedback: only the rewards of chosen actions are observed
- Goal: maximize the cumulative reward (up to step $T$), i.e., strike the optimal exploration-exploitation trade-off
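The interaction protocol in the bullets above can be sketched as a loop in which the environment reveals only the reward of the chosen action. This is a minimal sketch assuming Bernoulli rewards; `BernoulliBandit`, `run` and `round_robin_policy` are hypothetical names, and the round-robin policy is a pure-exploration placeholder, not one of the lecture's algorithms.

```python
import random


class BernoulliBandit:
    """K actions with unknown success probabilities; pull() reveals only the chosen action's reward."""

    def __init__(self, means, seed=0):
        self._means = list(means)  # hidden from the policy
        self._rng = random.Random(seed)

    @property
    def n_arms(self):
        return len(self._means)

    def pull(self, arm):
        return 1 if self._rng.random() < self._means[arm] else 0


def run(bandit, policy, horizon):
    """Play `horizon` steps; the policy only sees its own past (action, reward) pairs."""
    history = []
    total = 0
    for _ in range(horizon):
        arm = policy(history, bandit.n_arms)
        reward = bandit.pull(arm)  # bandit feedback: rewards of other actions stay hidden
        history.append((arm, reward))
        total += reward
    return total, history


def round_robin_policy(history, n_arms):
    """Placeholder policy that only explores: cycles through the actions in order."""
    return len(history) % n_arms


if __name__ == "__main__":
    total, _ = run(BernoulliBandit([0.2, 0.5, 0.7]), round_robin_policy, 300)
    print("cumulative reward:", total)
```

A good policy would use `history` to shift plays toward the best action (exploitation) while still sampling the others enough (exploration); the round-robin placeholder ignores it and therefore keeps losing reward on the worse actions.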

Regret
[Figure: instantaneous reward over time of the unknown best action and of your algorithm; the gap between the two curves is the instantaneous regret.]
Objective: identify the best action with minimum exploration, i.e., minimize regret (maximize the "convergence rate").
Particularly relevant when the best action evolves over time, as in tracking problems.

Stochastic bandits (Robbins 1952)
- $K$ arms / decisions / actions
- Unknown i.i.d. rewards: $X_{i,t} \sim \mathrm{Ber}(\mu_i)$, with $\mu^\star = \max_i \mu_i = \mu_{i^\star}$
- Lack of structure: $\mu_i \in [0, 1]$ for all $i \in \{1, \dots, K\}$, with no relation assumed between the arms
- Under an online algorithm $\pi$, the arm selected at time $t$ is $I^\pi_t$, a function of the history $(I^\pi_1, X_{I^\pi_1,1}, \dots, I^\pi_{t-1}, X_{I^\pi_{t-1},t-1})$
- Regret up to time $T$:
$$R^\pi(T) = \max_{i=1,\dots,K} \mathbb{E}\Big[\sum_{t=1}^{T} X_{i,t}\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} X_{I^\pi_t,t}\Big]$$
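Because $X_{I^\pi_t,t}$ has conditional mean $\mu_{I^\pi_t}$ given the history, the regret can be rewritten as $R^\pi(T) = \sum_{i} (\mu^\star - \mu_i)\,\mathbb{E}[t_i(T)]$, where $t_i(T)$ is the number of times arm $i$ is selected up to $T$. The sketch below computes this form, plus the realized sum of instantaneous regrets $\sum_t (\mu^\star - \mu_{I_t})$ for one sequence of selected arms. `regret_from_counts` and `pseudo_regret` are hypothetical helper names.

```python
def regret_from_counts(means, counts):
    """R(T) = sum_i (mu* - mu_i) * E[t_i(T)], with counts[i] standing for E[t_i(T)]."""
    mu_star = max(means)
    return sum((mu_star - m) * c for m, c in zip(means, counts))


def pseudo_regret(means, arms_played):
    """Sum over t of the instantaneous regret mu* - mu_{I_t} for a realized arm sequence."""
    mu_star = max(means)
    return sum(mu_star - means[a] for a in arms_played)


if __name__ == "__main__":
    means = [0.9, 0.5, 0.2]  # hypothetical arm means
    counts = [80, 15, 5]     # hypothetical expected play counts over T = 100 steps
    print(regret_from_counts(means, counts))  # 0.4 * 15 + 0.7 * 5
```

The decomposition shows that the regret is driven entirely by how often each suboptimal arm is played, which is the quantity the lower bound controls.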

Stochastic bandits
- Asymptotic regret lower bound (no uniformly good algorithm can beat this performance)
- Uniformly good algorithm: $\mathbb{E}[t_i(T)] = o(T^\alpha)$ for all $\alpha > 0$, all $\mu$, and all $i \neq i^\star$, where $t_i(T)$ is the number of times arm $i$ is selected up to time $T$

Theorem (Lai-Robbins 1985). For any uniformly good policy $\pi$,
$$\liminf_{T\to\infty} \frac{R^\pi(T)}{\log T} \ge \sum_{i \neq i^\star} \frac{\mu^\star - \mu_i}{\mathrm{KL}(\mu_i, \mu^\star)},$$
where the KL divergence number is
$$\mathrm{KL}(p, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}.$$

Regret is linear in the number of arms, and each suboptimal arm contributes a term roughly proportional to $1/(\mu^\star - \mu_i)$.
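To make the constant in the Lai-Robbins bound concrete, here is a minimal sketch that evaluates $\sum_{i \neq i^\star} (\mu^\star - \mu_i)/\mathrm{KL}(\mu_i, \mu^\star)$ for Bernoulli arms. It assumes $0 < \mu^\star < 1$ (otherwise the KL term is undefined); `kl_bernoulli` and `lai_robbins_constant` are hypothetical helper names.

```python
import math


def kl_bernoulli(p, q):
    """KL(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)), using the convention 0 log 0 = 0."""
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def lai_robbins_constant(means):
    """Sum over suboptimal arms of (mu* - mu_i) / KL(mu_i, mu*)."""
    mu_star = max(means)
    return sum((mu_star - m) / kl_bernoulli(m, mu_star) for m in means if m < mu_star)


if __name__ == "__main__":
    means = [0.9, 0.8, 0.5]  # hypothetical arm means
    c = lai_robbins_constant(means)
    for T in (10**3, 10**5):
        # Leading term of the asymptotic lower bound, not an exact finite-T value.
        print(T, c * math.log(T))
```

The arm with mean 0.8, close to $\mu^\star = 0.9$, dominates the constant: arms that are hard to distinguish from the best one are the expensive ones, in line with the $1/(\mu^\star - \mu_i)$ scaling.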

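The change-of-measure argument
[Figure: average rewards $\mu_i$ of arms 1 to 5, with the best arm $i^\star$ and a sub-optimal arm $k$ marked.]
To identify the minimum number of times the sub-optimal arm $k$ must be played, find the most confusing parameters, i.e., parameters close to $\mu$ under which arm $k$ becomes the best arm.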
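A compact sketch of the argument in formulas, following the standard proof of the Lai-Robbins bound (the high-probability step is omitted, and $\mu^\star < 1$ is assumed so that the perturbed mean is a valid Bernoulli parameter). The most confusing parameter changes only arm $k$:

$$\lambda_j = \mu_j \ (j \neq k), \qquad \lambda_k = \mu^\star + \varepsilon, \qquad 0 < \varepsilon \le 1 - \mu^\star.$$

Under $\lambda$, arm $k$ is the unique best arm, so a uniformly good policy must play it almost always; under $\mu$, it must play it only $o(T^\alpha)$ times. Since $\mu$ and $\lambda$ differ only in arm $k$, telling them apart requires enough samples of arm $k$:

$$\liminf_{T\to\infty} \frac{\mathbb{E}_\mu[t_k(T)]}{\log T} \ge \frac{1}{\mathrm{KL}(\mu_k, \mu^\star + \varepsilon)} \xrightarrow[\varepsilon \to 0]{} \frac{1}{\mathrm{KL}(\mu_k, \mu^\star)}.$$

Each play of arm $k$ costs $\mu^\star - \mu_k$ in expected reward, so summing over the suboptimal arms yields the Lai-Robbins lower bound.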
