Nonparametric Bandits with Covariates
Philippe Rigollet, Princeton University
with A. Zeevi (Columbia University)
Support from NSF (DMS-0906424)
Example: Real-time web page optimization
Which ad will generate the most $/clicks?
Characteristics of the problem
• A choice must be made for each customer.
• The outcome of the alternative choice cannot be observed.
• Try to maximize the rewards.
Exploration vs. exploitation dilemma:
• Exploration: which arm is the best?
• Exploitation: display the best arm as much as possible.
Two-armed bandit problem: setup
• Two arms (e.g., actions, ads): $i \in \{1, 2\}$.
• At time $t$, a random reward $Y_t^{(i)}$ is observed when arm $i$ is pulled.
• A policy $\pi$ is a sequence $\pi_1, \pi_2, \ldots \in \{1, 2\}$ indicating which arm to pull at each time $t$.
• Performance: expected cumulative reward at time $n$,
$$\mathbb{E} \sum_{t=1}^{n} Y_t^{(\pi_t)}.$$
• Goal: maximize this reward.
Two-armed bandit problem: regret
• Oracle policy $\pi^\star = (\pi^\star_1, \pi^\star_2, \ldots)$ pulls at each time $t$ the best arm (in expectation):
$$\pi^\star_t = \mathop{\mathrm{argmax}}_{i=1,2} \mathbb{E}\big[Y_t^{(i)}\big].$$
• We measure our performance by the regret
$$R_n(\pi) = \mathbb{E} \sum_{t=1}^{n} \Big( Y_t^{(\pi^\star_t)} - Y_t^{(\pi_t)} \Big).$$
Static environment
• The problem is not new: Robbins ('52), Lai & Robbins ('85).
• Key assumption: static environment, i.e., the (unknown) expected rewards $\mu_i = \mathbb{E}[Y_t^{(i)}]$ are constant.
• One way to solve the problem is to use the Upper Confidence Bounds (UCB) policy.
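Not on the original slides: a minimal Python sketch of a UCB-type policy for this static two-armed problem, using the standard UCB1 index of Auer, Cesa-Bianchi & Fischer; the Bernoulli reward simulator at the end is a hypothetical example, not part of the talk.

import math
import random

def ucb_two_armed(n, pull):
    # Run a UCB1-style policy for n rounds on a static two-armed bandit.
    # `pull(i)` returns a random reward in [0, 1] for arm i in {0, 1}.
    counts = [0, 0]    # number of times each arm was pulled
    sums = [0.0, 0.0]  # cumulative reward collected from each arm
    total = 0.0
    for t in range(1, n + 1):
        if t <= 2:
            arm = t - 1  # pull each arm once to initialize
        else:
            # UCB index: empirical mean + sqrt(2 log t / N_i(t))
            ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                   for i in range(2)]
            arm = 0 if ucb[0] >= ucb[1] else 1
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

# Hypothetical example: two Bernoulli arms with (unknown) means 0.5 and 0.6.
print(ucb_two_armed(10_000, lambda i: float(random.random() < (0.5, 0.6)[i])))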
Side information
Side information and covariates
• At time $t$, the reward of each arm $i \in \{1, 2\}$ depends on a covariate $X_t \in \mathcal{X} \ (\subset \mathbb{R}^d)$:
$$Y_t^{(i)} = f^{(i)}(X_t) + \varepsilon_t, \qquad t = 1, 2, \ldots, \quad i = 1, 2,$$
with standard regression assumptions on $\{\varepsilon_t\}$.
• A policy is now a sequence of functions $\pi_t : \mathcal{X} \to \{1, 2\}$.
• Oracle policy:
$$\pi^\star(x) = \mathop{\mathrm{argmax}}_{i=1,2} \mathbb{E}\big[Y_t^{(i)} \mid X_t = x\big] = \mathop{\mathrm{argmax}}_{i=1,2} f^{(i)}(x).$$
Assumptions on the expected rewards
Assume now that $\mathcal{X} = [0, 1]$.
1. Constant: static model studied by Lai & Robbins:
$$f^{(i)}(x) = \mu_i, \quad i = 1, 2, \qquad \mu_i \text{ unknown.}$$
2. Linear: one-armed bandit problem, studied by Goldenshluger & Zeevi (2008):
$$f^{(1)}(x) = x - \theta, \qquad \theta \text{ unknown,}$$
and $f^{(2)}(x) = 0$ is constant and known.
3. Smooth: we assume that the functions are Hölder smooth with parameter $\beta \le 1$:
$$|f^{(i)}(x) - f^{(i)}(x')| \le L |x - x'|^\beta.$$
(Consistency studied by Yang & Zhu, 2002.)
Constant rewards [Figure: constant reward functions $f^{(1)}$, $f^{(2)}$ on $[0, 1]$.]
One-armed linear reward [Figure: linear $f^{(1)}$ and $f^{(2)} \equiv 0$ on $[0, 1]$.]
Smooth rewards [Figure: smooth reward functions $f^{(1)}$, $f^{(2)}$ on $[0, 1]$.]
Nonparametric bandit with covariates
Two-armed bandit problem with uniform covariates
• Covariates: $\{X_t\}$ i.i.d. uniform on $[0, 1]$.
• Rewards: $Y_t^{(i)} \in [0, 1]$ with
$$\mathbb{E}\big[Y_t^{(i)} \mid X_t\big] = f^{(i)}(X_t), \qquad t = 1, 2, \ldots, \quad i = 1, 2,$$
where $|f^{(i)}(x) - f^{(i)}(x')| \le L |x - x'|^\beta$, $\beta \le 1$, $i = 1, 2$.
• Oracle policy pulls at time $t$:
$$\pi^\star(X_t) = \mathop{\mathrm{argmax}}_{i=1,2} f^{(i)}(X_t).$$
• Regret:
$$R_n(\pi) = \mathbb{E} \sum_{t=1}^{n} \Big[ f^{(\pi^\star(X_t))}(X_t) - f^{(\pi_t(X_t))}(X_t) \Big].$$
Margin condition
$$\mathbb{P}\Big( 0 < |f^{(1)}(X) - f^{(2)}(X)| \le \delta \Big) \le C \delta^\alpha.$$
• First used by Goldenshluger and Zeevi (2008) in the one-armed bandit setting.
• In the one-armed setup, it is an assumption on the distribution of $X$ only.
• Here: fixed marginal (e.g., uniform), so it measures how close the functions are.
Proposition (conflict $\alpha$ vs. $\beta$):
$$\alpha\beta > 1 \implies \pi^\star \text{ is a.s. constant on } [0, 1].$$
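Not on the original slides: a small worked example of this tension, assuming the uniform marginal. Take a gap function $\Delta(x) = f^{(1)}(x) - f^{(2)}(x) = L(x - 1/2)$, which is Hölder with $\beta = 1$ and crosses zero at $x = 1/2$. Then
$$\mathbb{P}\big( 0 < |\Delta(X)| \le \delta \big) = \mathbb{P}\big( 0 < |X - \tfrac{1}{2}| \le \tfrac{\delta}{L} \big) = \min\Big( \tfrac{2\delta}{L}, 1 \Big) \le \tfrac{2}{L}\,\delta,$$
so the margin condition holds with $\alpha = 1$ (hence $\alpha\beta = 1$) but fails for any $\alpha > 1$ as $\delta \to 0$. Heuristically, whenever the oracle switches arms at a well-behaved crossing point, $\beta$-smoothness keeps $|\Delta|$ below $\delta$ on a set of measure of order $\delta^{1/\beta}$ around it, which caps $\alpha$ at $1/\beta$; this is the content of the proposition.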
Illustration of the margin condition [Figure: reward functions $f^{(1)}$, $f^{(2)}$ on $[0, 1]$ for different values of $\alpha$ and $\beta$.]
Binning (exploiting smoothness)
• Fix $M > 1$. Consider the bins $B_j = [j/M, (j+1)/M)$.
• Consider the average reward on each bin,
$$\bar{f}^{(i)}_j = \frac{1}{p_j} \int_{B_j} f^{(i)}(x) \,\mathrm{d}x,$$
and the discretized covariate $Z_t = j$ iff $X_t \in B_j$.
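Not on the original slides: a small Python sketch of the bin-averaged rewards $\bar{f}^{(i)}_j$, computed by a crude Riemann sum; the two reward functions are hypothetical placeholders (with the uniform marginal, $p_j = 1/M$, so the bin average is $M \int_{B_j} f^{(i)}(x)\,\mathrm{d}x$).

M = 10                         # number of bins (placeholder value)
f1 = lambda x: 0.5 + 0.2 * x   # hypothetical smooth reward functions
f2 = lambda x: 0.7 - 0.2 * x

def bin_average(f, j, M, grid=1000):
    # Approximate (1/p_j) * integral of f over B_j = [j/M, (j+1)/M)
    # with p_j = 1/M (uniform covariates), i.e. the mean of f over the bin.
    lo, width = j / M, 1.0 / M
    xs = [lo + width * (k + 0.5) / grid for k in range(grid)]
    return sum(f(x) for x in xs) / grid

fbar1 = [bin_average(f1, j, M) for j in range(M)]
fbar2 = [bin_average(f2, j, M) for j in range(M)]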
Binned UCB
• For uniformly distributed $X_t$, we have $p_j = \mathbb{P}(Z_t = j) = \mathbb{P}(X_t \in B_j) = 1/M$.
• The rewards satisfy
$$\mathbb{E}\big[Y_t^{(i)} \mid Z_t = j\big] = \bar{f}^{(i)}_j, \qquad t = 1, 2, \ldots, \quad i = 1, 2.$$
Play UCB on $(Z_t, Y_t)$, $t = 1, \ldots, n$.
Binned problem [Figure: reward functions $f^{(1)}$, $f^{(2)}$ on $[0, 1]$ and their bin-averaged versions $\bar{f}^{(1)}$, $\bar{f}^{(2)}$.]
Two-armed bandit problem with discrete covariates
• Covariates: $\{Z_t\}$ i.i.d. in $\{1, \ldots, M\}$ with $\mathbb{P}(Z_t = j) = p_j$, $t = 1, 2, \ldots$
• Rewards: $Y_t^{(i)} \in [0, 1]$ with
$$\mathbb{E}\big[Y_t^{(i)} \mid Z_t = j\big] = \bar{f}^{(i)}_j, \qquad t = 1, 2, \ldots, \quad i = 1, 2.$$
• Oracle policy pulls at time $t$:
$$\pi^\star(Z_t) = \mathop{\mathrm{argmax}}_{i=1,2} \bar{f}^{(i)}_{Z_t}.$$
Regret
• Regret given by
$$R_n(\pi) = \mathbb{E} \sum_{j=1}^{M} \sum_{t=1}^{n} \Big( \bar{f}^{(\pi^\star(j))}_j - \bar{f}^{(\pi_t(j))}_j \Big) \mathbb{1}(Z_t = j).$$
Idea: play independently for each $j = 1, \ldots, M$.
UCB policy for discrete covariates
• Based on upper confidence bounds given by concentration inequalities (Hoeffding or Bernstein):
$$B_t(s) := \sqrt{\frac{2 \log t}{s}}.$$
• Define the number of times $\hat{\pi}$ prescribed to pull arm $i$ while $Z_t = j$, before time $t$:
$$N^{(i)}_j(t) = \sum_{s=1}^{t} \mathbb{1}\big(Z_s = j, \hat{\pi}_s(Z_s) = i\big),$$
• and the average reward collected at those times:
$$\bar{Y}^{(i)}_j(t) = \frac{1}{N^{(i)}_j(t)} \sum_{s=1}^{t} Y_s^{(i)} \mathbb{1}\big(Z_s = j, \hat{\pi}_s(Z_s) = i\big).$$
A first bound on the regret
Binned UCB policy: conditionally on $Z_t = j$,
$$\hat{\pi}_t(j) = \mathop{\mathrm{argmax}}_{i=1,2} \Big\{ \bar{Y}^{(i)}_j(t) + B_t\big(N^{(i)}_j(t)\big) \Big\}.$$
Theorem 1 (a first bound on the regret). Denote $\Delta_j = |\bar{f}^{(1)}_j - \bar{f}^{(2)}_j|$. Then
$$R_n(\hat{\pi}) \le C \sum_{j=1}^{M} \Big( \Delta_j + \frac{\log n}{\Delta_j} \Big).$$
Direct consequence of Auer, Cesa-Bianchi & Fischer (2002).
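Not on the original slides: a minimal Python sketch of the binned UCB policy just described, with the index $B_t(s) = \sqrt{2 \log t / s}$; the reward-generating functions at the end are hypothetical placeholders, for illustration only.

import math
import random

def binned_ucb(n, M, draw_reward):
    # Binned UCB for the two-armed bandit with covariates X_t ~ Unif[0, 1].
    # `draw_reward(i, x)` returns a random reward in [0, 1] for arm i at covariate x.
    counts = [[0, 0] for _ in range(M)]    # N_j^(i)(t): pulls of arm i in bin j
    sums = [[0.0, 0.0] for _ in range(M)]  # cumulative reward per (bin, arm)
    total = 0.0
    for t in range(1, n + 1):
        x = random.random()           # covariate X_t
        j = min(int(x * M), M - 1)    # discretized covariate Z_t
        def index(i):
            # UCB index: bin-average reward + B_t(N_j^(i)(t));
            # an unvisited arm gets infinite index so each arm is tried once per bin.
            if counts[j][i] == 0:
                return float("inf")
            return sums[j][i] / counts[j][i] + math.sqrt(2 * math.log(t) / counts[j][i])
        arm = 0 if index(0) >= index(1) else 1
        r = draw_reward(arm, x)
        counts[j][arm] += 1
        sums[j][arm] += r
        total += r
    return total

# Hypothetical smooth reward functions with Bernoulli rewards.
f = (lambda x: 0.5 + 0.2 * x, lambda x: 0.7 - 0.2 * x)
print(binned_ucb(10_000, M=10, draw_reward=lambda i, x: float(random.random() < f[i](x))))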
Margin condition
$$\sum_{j=1}^{M} \Big( \Delta_j + \frac{\log n}{\Delta_j} \Big)$$
• The previous bound can become arbitrarily large if one of the $\Delta_j$, $j = 1, \ldots, M$, becomes too small.
• Using the margin condition we can make local conclusions on the gaps $\Delta_j$: few $j$'s are such that $\Delta_j$ is small.
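Not on the original slide: a rough sketch (my own) of how these two observations combine. The regret incurred in bin $j$ is at most $\Delta_j$ per visit, so over roughly $n/M$ expected visits it is bounded both by $(n/M)\Delta_j$ and, via the UCB analysis behind Theorem 1, by about $\log n / \Delta_j$. Splitting the bins at a threshold $\delta$ gives
$$R_n(\hat{\pi}) \;\lesssim\; \frac{n}{M}\,\delta \cdot \#\{ j : 0 < \Delta_j \le \delta \} \;+\; \frac{M \log n}{\delta},$$
and the margin condition is precisely what keeps the count $\#\{ j : 0 < \Delta_j \le \delta \}$ small, leading to Theorem 2 on the next slide.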
Upper bound
Theorem 2 (a bound on the regret for the binned UCB policy). Fix $\alpha > 0$ and $0 < \beta \le 1$, and choose $M \sim (n/\log n)^{\frac{1}{2\beta+1}}$. Then
$$R_n(\hat{\pi}) \le \begin{cases} C n \Big( \dfrac{n}{\log n} \Big)^{-\frac{\beta(1+\alpha)}{2\beta+1}} & \text{if } \alpha < 1, \\[2mm] C n \Big( \dfrac{n}{\log n} \Big)^{-\frac{2\beta}{2\beta+1}} & \text{if } \alpha > 1. \end{cases}$$
Suboptimality for $\alpha > 1$
• If $\alpha > 1$, the bound becomes
$$R_n(\hat{\pi}) \le C \Big( n M^{-\beta(1+\alpha)} + M \log n \Big).$$
• The minimum is attained for
$$M \sim \Big( \frac{n}{\log n} \Big)^{\frac{1}{\beta(1+\alpha)+1}},$$
• which yields
$$R_n(\hat{\pi}) \le C n \Big( \frac{n}{\log n} \Big)^{-\frac{\beta(1+\alpha)}{\beta(1+\alpha)+1}}.$$
• Problem: too many bins. Solution: online/adaptive construction of the bins.
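Not on the original slide: the balancing computation behind this choice of $M$ (a quick check using the bound just above). The two terms match, up to constants, when
$$n M^{-\beta(1+\alpha)} \asymp M \log n \;\Longleftrightarrow\; M^{\beta(1+\alpha)+1} \asymp \frac{n}{\log n} \;\Longleftrightarrow\; M \asymp \Big( \frac{n}{\log n} \Big)^{\frac{1}{\beta(1+\alpha)+1}},$$
and plugging this $M$ back into $M \log n$ gives
$$M \log n \asymp \Big( \frac{n}{\log n} \Big)^{\frac{1}{\beta(1+\alpha)+1}} \log n = n \Big( \frac{n}{\log n} \Big)^{-\frac{\beta(1+\alpha)}{\beta(1+\alpha)+1}},$$
which is the rate displayed above.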
Conditional distributions
• The distribution of $Y^{(i)} \mid X$ belongs to $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$, where $\theta$ is the mean parameter:
$$\theta = \int x \,\mathrm{d}P_\theta(x).$$
• Assume that the family $\mathcal{P}$ is such that, for any $\theta, \theta' \in \Theta \subset \mathbb{R}$,
$$K(P_\theta, P_{\theta'}) \le \frac{(\theta - \theta')^2}{\kappa}, \qquad \kappa > 0.$$
• Satisfied in particular by the Gaussian (location) and Bernoulli families.
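Not on the original slides: a quick check of this condition for the two families mentioned (standard computations, added for concreteness). For the Gaussian location family with fixed variance $\sigma^2$,
$$K\big( \mathcal{N}(\theta, \sigma^2), \mathcal{N}(\theta', \sigma^2) \big) = \frac{(\theta - \theta')^2}{2\sigma^2},$$
so the condition holds with $\kappa = 2\sigma^2$. For Bernoulli rewards with means bounded away from $0$ and $1$, the KL divergence is bounded by the chi-square divergence, $K\big(\mathrm{Ber}(\theta), \mathrm{Ber}(\theta')\big) \le \frac{(\theta - \theta')^2}{\theta'(1 - \theta')}$, so the condition again holds for a suitable $\kappa > 0$.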
Minimax lower bound
Theorem 3. Let $\alpha\beta \le 1$ and let the covariates $\{X_t\}$ be uniformly distributed on $[0, 1]$. Assume also that $\{P^{(i)}_\theta, \theta \in \mathrm{Im}\, f^{(i)}(X)\}$ satisfies the Kullback-Leibler condition above for each $i = 1, 2$. Then, for any policy $\pi$,
$$\sup_{f^{(1)}, f^{(2)} \in \Sigma(\beta, L)} R_n(\pi) \ge C\, n \cdot n^{-\frac{\beta(1+\alpha)}{2\beta+1}},$$
for some positive constant $C$.
Comments
• Same bound as in the full-information case (see Audibert & Tsybakov, '07).
• There is a gap (of logarithmic size) between the upper and lower bounds.
Extensions
• Higher dimension $d \ge 2$: choose $\|\cdot\|_\infty$ (hypercube bins), giving
$$R_n(\hat{\pi}) \le C(d)\, n \Big( \frac{n}{\log n} \Big)^{-\frac{\beta(1+\alpha)}{2\beta+d}}.$$
• The corresponding lower bound also holds.
• Unknown horizon $n$: doubling trick.
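Not on the original slide: a minimal Python sketch of the binning step in dimension $d$, where the bins become hypercubes of side $1/M$ (the function name is hypothetical).

def bin_index_d(x, M):
    # Map a covariate x in [0, 1)^d to the index of its hypercube bin
    # on a regular grid with M bins per coordinate (M**d bins in total).
    idx = 0
    for coord in x:
        idx = idx * M + min(int(coord * M), M - 1)
    return idx

# Example: d = 2, M = 4 bins per axis -> 16 hypercube bins; (0.3, 0.9) falls in bin (1, 3).
print(bin_index_d([0.3, 0.9], 4))  # prints 7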
K-armed bandit
• $K$-armed bandit problem: margin condition
$$\mathbb{P}\Big( 0 < \min_{i \ne i^\star(X)} |f^{(i)}(X) - f^{(i^\star(X))}(X)| \le \delta \Big) \le C \delta^\alpha,$$
where $i^\star(x) = \mathop{\mathrm{argmax}}_{1 \le i \le K} f^{(i)}(x)$.
• Regret bound:
$$R_n(\hat{\pi}) \le C K n \Big( \frac{n}{\log n} \Big)^{-\frac{\beta(1+\alpha)}{2\beta+1}}.$$