New Perspectives for Multi-Armed Bandits and Their Applications
Vianney Perchet (CMLA, ENS Paris-Saclay)
Workshop Learning & Statistics, IHES, January 19, 2017
Motivations & Objectives
Classical Examples of Bandit Problems
– Size of data: n patients, each with some probability of being cured
– Choose one of two treatments to prescribe
– Patients are cured or die
1) Inference: find the best treatment between the red and the blue
2) Cumulative: save as many patients as possible
Classical Examples of Bandit Problems
– Size of data: n banners, each with some probability of being clicked
– Choose one of two ads to display
– Banner clicked or ignored
1) Inference: find the best ad between the red and the blue
2) Cumulative: get as many clicks as possible
Classical Examples of Bandit Problems
– Size of data: n auctions, each with some expected revenue
– Choose one of two strategies (bid / opt out) to follow
– Auction won or lost
1) Inference: find the best strategy between the red and the blue
2) Cumulative: win as many profitable auctions as possible
Classical Examples of Bandit Problems
– Size of data: n mails, each with some probability of being spam
– Choose one of two labels: spam or ham
– Mail correctly or incorrectly classified
1) Inference: find the best strategy between the red and the blue
2) Cumulative: make as few errors as possible
Two-Armed Bandit
– Patients arrive and are treated sequentially.
– Save as many as possible.
A bit of theory
Stochastic Multi-Armed Bandit
K-Armed Stochastic Bandit Problems
– K actions i ∈ {1, ..., K}; outcome X^i_t ∈ R bounded or (sub-)Gaussian, e.g. X^i_1, X^i_2, ... i.i.d. ∼ N(µ_i, 1)
– Non-anticipative policy: π_t ∈ {1, ..., K}, with π_t = π_t(X^{π_1}_1, X^{π_2}_2, ..., X^{π_{t-1}}_{t-1})
– Goal: maximize the expected reward E ∑_{t=1}^T X^{π_t}_t = ∑_{t=1}^T µ_{π_t}
– Performance: cumulative regret
  R_T = max_{i ∈ {1,...,K}} ∑_{t=1}^T µ_i − ∑_{t=1}^T µ_{π_t} = ∑_{t=1}^T ∑_{i ≠ ⋆} ∆_i 1{π_t = i},
  with ∆_i = µ_⋆ − µ_i, the "gap" or cost of error i.
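As a concrete reading of this protocol, here is a minimal simulation sketch; the class name, the unit variance, and the pseudo-regret bookkeeping are illustrative assumptions, not part of the talk.

```python
import numpy as np


class GaussianBandit:
    """K-armed stochastic bandit with rewards X_t^i ~ N(mu_i, 1), as on the slide."""

    def __init__(self, mu, seed=None):
        self.mu = np.asarray(mu, dtype=float)      # mean rewards (mu_1, ..., mu_K)
        self.rng = np.random.default_rng(seed)
        self.pseudo_regret = 0.0                   # running sum of the gaps Delta_{pi_t}

    def pull(self, i):
        """Play arm i: accumulate the gap mu_star - mu_i and return a noisy reward."""
        self.pseudo_regret += self.mu.max() - self.mu[i]
        return self.rng.normal(self.mu[i], 1.0)
```

A non-anticipative policy is then any rule mapping past observed rewards to the next arm; after T pulls, pseudo_regret equals ∑_t ∆_{π_t}, whose expectation is the regret R_T above.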
Most Famous Algorithm [Auer, Cesa-Bianchi, Fischer, '02]
• UCB, "Upper Confidence Bound":
  π_{t+1} = arg max_i { X̄^i_t + √(2 log(t) / T_i(t)) },
  where T_i(t) = ∑_{s=1}^t 1{π_s = i} and X̄^i_t = (1/T_i(t)) ∑_{s ≤ t : π_s = i} X^i_s
• Regret: E R_T ≲ ∑_k (log(T)/∆_k) ∧ (T ∆_k)
• Worst case: E R_T ≲ sup_∆ K ((log(T)/∆) ∧ (T ∆)) ≂ √(K T log(T))
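A short sketch of the UCB index above, written against the hypothetical GaussianBandit class from the previous block (a simplified illustration, not the authors' code):

```python
import numpy as np


def ucb(bandit, T):
    """Upper Confidence Bound: pull each arm once, then play
    argmax_i  X_bar_t^i + sqrt(2 log(t) / T_i(t))."""
    K = len(bandit.mu)
    counts = np.zeros(K)           # T_i(t): number of pulls of arm i so far
    means = np.zeros(K)            # X_bar_t^i: empirical mean reward of arm i
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1              # initialisation round: each arm once
        else:
            i = int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))
        x = bandit.pull(i)
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]   # online mean update
    return bandit.pseudo_regret
```

For instance, ucb(GaussianBandit([0.0, 0.5]), T=10_000) returns a pseudo-regret whose typical order is log(T)/∆ for the gap ∆ = 0.5, in line with the bound above.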
Ideas of proof
  π_{t+1} = arg max_i { X̄^i_t + √(2 log(t) / T_i(t)) }
• 2-line proof:
  π_{t+1} = i ≠ ⋆  ⟺  X̄^⋆_t + √(2 log(t)/T_⋆(t)) ≤ X̄^i_t + √(2 log(t)/T_i(t))
  "⟹"  ∆_i ≤ √(2 log(t)/T_i(t))  ⟹  T_i(t) ≲ log(t)/∆_i²
• "⟹" actually happens with overwhelming probability
• Number of mistakes on arm i grows as log(t)/∆_i²; each mistake costs ∆_i:
  regret at stage T ≲ ∑_i (log(T)/∆_i²) × ∆_i ≂ ∑_i log(T)/∆_i
• "Optimal": no algorithm can always have a regret smaller than ∑_i log(T)/∆_i
Other Algos
• Other algo, MOSS [Audibert, Bubeck], variants of UCB:
  R_T ≲ ∑_k log(T ∆_k)/∆_k,  worst case R_T ≤ √(T K)
• Other algo, ETC [Perchet, Rigollet], pulls arms in round robin, then eliminates:
  R_T ≲ K log(T ∆_min / K)/∆_min,  worst case R_T ≤ √(T K log(K))
• Infinite number of actions x ∈ [0,1]^d with ∆(x) 1-Lipschitz:
  discretize + UCB gives R_T ≲ T ε + √(T/ε) ≲ T^{2/3}
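A rough sketch of the round-robin-then-eliminate idea behind ETC, reusing the hypothetical GaussianBandit class; the confidence width and elimination threshold below are generic choices for illustration, not the exact constants of [Perchet, Rigollet].

```python
import numpy as np


def successive_elimination(bandit, T):
    """Pull every surviving arm in round robin; drop an arm as soon as its empirical
    mean falls below the leader's by more than twice the confidence width."""
    K = len(bandit.mu)
    active = list(range(K))
    counts = np.zeros(K)
    means = np.zeros(K)
    t = 0
    while t < T:
        for i in list(active):
            if t >= T:
                break
            x = bandit.pull(i)
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]
            t += 1
        n = counts[active].min()                        # pulls per surviving arm
        width = np.sqrt(2.0 * np.log(T) / max(n, 1.0))  # generic confidence width
        best = max(means[i] for i in active)
        active = [i for i in active if means[i] >= best - 2.0 * width]
    return bandit.pseudo_regret
```

Once a single arm survives, the loop simply keeps pulling it, which is exactly the "explore then commit" behaviour.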
Very interesting... but is it useful? Not really. Here is a list of reasons why.
On the basic assumptions
1. Stochastic: data are not i.i.d., patients are different (ill-posedness, feature selection / model selection)
2. Different timing: several actions for one reward (POMDPs, learning, bias/variance trade-off)
3. Delays: rewards are not received instantaneously (grouping, evaluations)
4. Combinatorial: several decisions at each stage (combinatorial optimization, cascading)
5. Non-linearity: concave gains, diminishing returns, etc.
Investigating them (past, present, and future)
Patients are different
• We assumed (implicitly?) that all patients/users are identical
• Treatment efficiency (probability of a click) depends on age, gender, ...
• These covariates or contexts are observed/known before taking the decision of blue/red pill
• The decision (and the regret...) should ultimately depend on them
General Model of Contextual Bandits
• Covariates: ω_t ∈ Ω = [0,1]^d, i.i.d., law µ (equivalent to λ)
  The cookies of a user, the medical history, etc.
• Decisions: π_t ∈ {1, ..., K}
  The decision can (and should) depend on the context ω_t
• Rewards: X^k_t ∈ [0,1] ∼ ν_k(ω_t), with E[X^k | ω] = µ_k(ω)
  The expected reward of action k depends on the context ω
• Objectives: find the best decision given the request;
  minimize the regret R_T := ∑_{t=1}^T µ_{π⋆(ω_t)}(ω_t) − µ_{π_t}(ω_t)
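A minimal sketch of this contextual protocol; the uniform law for ω_t and the Bernoulli rewards are illustrative assumptions, and the reward functions µ_k are supplied by the caller.

```python
import numpy as np


class ContextualBandit:
    """Contextual bandit: a covariate omega_t in [0,1]^d is drawn and revealed to the
    policy, then the reward of arm k is Bernoulli with mean mu_k(omega_t)."""

    def __init__(self, mu_fns, d=1, seed=None):
        self.mu_fns = mu_fns                 # list of functions omega -> mean reward in [0,1]
        self.d = d
        self.rng = np.random.default_rng(seed)
        self.regret = 0.0

    def next_context(self):
        self.omega = self.rng.uniform(size=self.d)
        return self.omega

    def pull(self, k):
        values = np.array([f(self.omega) for f in self.mu_fns])
        self.regret += values.max() - values[k]    # benchmark is the best arm *for this context*
        return float(self.rng.random() < values[k])
```

The key difference with the previous setting is that the benchmark π⋆(ω_t) changes with every context, so a good policy has to estimate the whole functions µ_k, not just K numbers.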
Regularity assumptions
1. Smoothness of the problem: every µ_k is β-Hölder, with β ∈ (0, 1]:
   ∃ L > 0, ∀ ω, ω′ ∈ Ω, |µ_k(ω) − µ_k(ω′)| ≤ L ∥ω − ω′∥^β
2. Complexity of the problem (α-margin condition): ∃ C_0 > 0,
   P[ 0 < µ⋆(ω) − µ♯(ω) < δ ] ≤ C_0 δ^α,
   where µ⋆(ω) = max_k µ_k(ω) is the maximal µ_k and
   µ♯(ω) = max{ µ_k(ω) s.t. µ_k(ω) < µ⋆(ω) } is the second max.
With K > 2: µ⋆ is β-Hölder but µ♯ is not continuous.
Regularity: an easy example (α big)
[Figure: curves µ_1(ω), µ_2(ω), µ_3(ω) and the induced µ⋆(ω) and µ♯(ω).]
Regularity: a hard example (α small)
[Figure: curves µ_1(ω), µ_2(ω), µ_3(ω) and the induced µ⋆(ω) and µ♯(ω).]
Binned policy
[Figure: the curves µ_1(ω), µ_2(ω), µ_3(ω) with the covariate space partitioned into bins.]
Binned Successive Elimination (BSE)
Theorem [P. and Rigollet ('13)]
If α < 1, then E[R_T(BSE)] ≲ T ( K log(K) / T )^{β(1+α)/(2β+d)}, with bin side ( K log(K) / T )^{1/(2β+d)}.
• For K = 2, this matches the lower bound: minimax optimal w.r.t. T.
• Same bound as with full monitoring [Audibert and Tsybakov, '07]
• No log(T): the difficulty of nonparametric estimation washes away the effects of exploration/exploitation.
• α < 1: cannot attain fast rates for easy problems.
• Adaptive partitioning!
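A rough sketch of the binned idea for d = 1, run against the hypothetical ContextualBandit class above; it uses fixed regular bins and a generic elimination rule, and does not reproduce the theorem's exact bin side or constants.

```python
import numpy as np


def binned_successive_elimination(cbandit, T, K, n_bins):
    """Run an independent successive-elimination routine inside each bin of [0,1]:
    each incoming context is routed to its bin, which keeps its own statistics."""
    counts = np.zeros((n_bins, K))
    means = np.zeros((n_bins, K))
    active = [list(range(K)) for _ in range(n_bins)]
    for _ in range(T):
        omega = cbandit.next_context()
        b = min(int(omega[0] * n_bins), n_bins - 1)     # index of the bin containing omega
        arms = active[b]
        k = min(arms, key=lambda a: counts[b, a])       # round robin within the bin
        x = cbandit.pull(k)
        counts[b, k] += 1
        means[b, k] += (x - means[b, k]) / counts[b, k]
        n = counts[b, arms].min()
        if len(arms) > 1 and n > 0:
            width = np.sqrt(2.0 * np.log(T) / n)        # generic confidence width
            best = max(means[b, a] for a in arms)
            active[b] = [a for a in arms if means[b, a] >= best - 2.0 * width]
    return cbandit.regret
```

The theorem dictates how the partition should scale: bin side (K log(K)/T)^{1/(2β+d)}, hence n_bins of order (T/(K log K))^{1/(2β+d)} when d = 1; adaptive partitioning then refines only the bins where µ⋆ − µ♯ is small.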