Linear (and contextual) Bandits: Rich decision sets (and side information)
Sham M. Kakade
Machine Learning for Big Data, CSE547/STAT548
University of Washington
Announcements...
Poster session: June 1, 9-11:30a.
Request: CSE grad students, could you please help others with poster printing? Aravind: ask by 2p on Weds for help printing.
Prepare, at most, a 2-minute verbal summary. Come early to set up.
Submit your poster on Canvas.
Due dates: please be on time.
Today: review of linear bandits; then contextual bandits and game trees?
Review
Bandits in practice: two major issues
The decision space is very large.
- Drug cocktails
- Ad design
We often have "side information" when making a decision.
- History of a user
More real motivations...
Linear bandits
An additive effects model. Suppose each round we take a decision x ∈ D ⊂ R^d.
- x is a path on a graph.
- x is a feature vector of properties of an ad.
- x indicates which drugs are being taken.
Upon taking action x, we get reward r, with expectation:
E[r | x] = \mu^\top x
Only d unknown parameters (and "effectively" 2^d actions).
We desire an algorithm \mathcal{A} (mapping histories to decisions) which has low regret:
T \mu^\top x^* - \sum_{t=1}^T E[\mu^\top x_t \mid \mathcal{A}] \le ??
(where x^* is the best decision)
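As a concrete (hypothetical) instance of this model, a small simulator might look like the sketch below; the dimensions, noise level, and finite decision set are placeholder choices, not from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    mu = rng.normal(size=d)                # the d unknown parameters
    D = rng.normal(size=(100, d))          # a finite decision set D, a subset of R^d

    def pull(x, noise_std=0.1):
        """Observe a reward whose expectation is mu^T x (Gaussian noise is an assumption)."""
        return float(mu @ x + noise_std * rng.normal())

    x_star = D[np.argmax(D @ mu)]          # x*, the best decision in hindsight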
Example: Shortest paths...
Algorithm idea
Again, let's think of optimism in the face of uncertainty.
We have observed rewards r_1, ..., r_{t-1} and have taken decisions x_1, ..., x_{t-1}.
Questions:
- What is an estimate of the expected reward E[r | x], and what is our uncertainty?
- What is an estimate of \mu, and what is our uncertainty?
Regression!
Define:
A_t := \sum_{\tau < t} x_\tau x_\tau^\top + \lambda I, \qquad b_t := \sum_{\tau < t} x_\tau r_\tau
Our estimate of \mu:
\hat{\mu}_t = A_t^{-1} b_t
Confidence of our estimate:
\|\mu - \hat{\mu}_t\|_{A_t}^2 \le O(d \log t)
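In code, the ridge-regression estimate is a few lines; this is only a sketch, and the function name, the regularizer lambda, and the array layout are assumptions of mine.

    import numpy as np

    def ridge_estimate(xs, rs, lam=1.0):
        """Estimate mu from past decisions xs (shape (t-1, d)) and rewards rs (length t-1)."""
        d = xs.shape[1]
        A = lam * np.eye(d) + xs.T @ xs    # A_t = sum_{tau<t} x_tau x_tau^T + lambda I
        b = xs.T @ rs                      # b_t = sum_{tau<t} x_tau r_tau
        mu_hat = np.linalg.solve(A, b)     # mu_hat_t = A_t^{-1} b_t
        return mu_hat, A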
LinUCB
Again, optimism in the face of uncertainty. Define:
B_t := \{ \nu : \|\nu - \hat{\mu}_t\|_{A_t}^2 \le O(d \log t) \}
(LinUCB) Take action:
x_t = \arg\max_{x \in D} \max_{\nu \in B_t} \nu^\top x
then update A_t, b_t, B_t, and \hat{\mu}_t.
Equivalently, take action:
x_t = \arg\max_{x \in D} \hat{\mu}_t^\top x + \sqrt{(d \log t)\, x^\top A_t^{-1} x}
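The "equivalent" closed-form choice is easy to compute when D is a finite set of feature vectors. A sketch follows; beta stands in for the O(d log t) confidence radius and is an assumed tuning parameter.

    import numpy as np

    def linucb_action(D, mu_hat, A, beta):
        """Pick argmax over rows x of D of  mu_hat^T x + sqrt(beta * x^T A^{-1} x)."""
        A_inv = np.linalg.inv(A)
        bonus = np.sqrt(beta * np.einsum('ij,jk,ik->i', D, A_inv, D))  # per-row x^T A^{-1} x
        return D[np.argmax(D @ mu_hat + bonus)]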
LinUCB: Geometry
LinUCB: Confidence intervals
Today
LinUCB
Regret bound of LinUCB:
T \mu^\top x^* - \sum_{t=1}^T E[\mu^\top x_t] \le O^*(d \sqrt{T})
(this is the best possible, up to log factors).
- Compare to O(\sqrt{KT}).
- Independent of the number of actions.
- The k-arm case is a special case.
Thompson sampling: this is a good algorithm in practice.
Proof idea...
Stats: need to show that B_t is a valid confidence region.
Geometric lemma: the regret is upper bounded by log(volume of posterior cov / volume of prior cov).
Then just bound the worst-case log volume change.
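For reference, one standard way the geometric step is formalized is the "elliptic potential" lemma; the form below is a sketch from memory, not a verbatim statement from the lecture:

    \sum_{t=1}^{T} \min\!\left(1,\ \|x_t\|_{A_{t-1}^{-1}}^{2}\right) \;\le\; 2 \log \frac{\det A_T}{\det A_0}

The log-determinant ratio measures the change in volume of the confidence ellipsoid, and in the worst case it grows only on the order of d log T.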
What about context?
The Contextual Bandit Game
Game: for t = 1, 2, ...
- At each time t, we obtain a context (e.g. side information, user information) c_t.
- Our feasible action set is A_t.
- We choose arm a_t ∈ A_t and receive reward r_{t, a_t}.
(What assumptions on the reward process?)
Goal: algorithm \mathcal{A} should have low regret:
\sum_t E[(r_{t, a_t^*} - r_t) \mid \mathcal{A}] \le ??
where E[r_{t, a_t^*}] is the optimal expected reward at time t.
How should we model outcomes?
Example: ad (or movie, song, etc.) prediction. What is the probability that a user u clicks on an ad a? How should we model the click probability of a for user u?
Featurizations: suppose we have \phi_{ad}(a) ∈ R^{d_{ad}} and \phi_{user}(u) ∈ R^{d_{user}}. We could make an "outer product" feature vector x as:
x(a, u) = \mathrm{Vector}(\phi_{ad}(a) \phi_{user}(u)^\top) ∈ R^{d_{ad} d_{user}}
We could model the probabilities as:
E[\mathrm{click} = 1 \mid a, u] = \mu^\top x(a, u)   (or log-linear)
How do we estimate \mu?
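The outer-product featurization is essentially one line of code; the sketch below uses placeholder feature vectors of my own choosing.

    import numpy as np

    def joint_features(phi_ad, phi_user):
        """x(a, u) = vec(phi_ad(a) phi_user(u)^T), of dimension d_ad * d_user."""
        return np.outer(phi_ad, phi_user).ravel()

    # e.g. phi_ad in R^3 and phi_user in R^4 give x(a, u) in R^12
    x = joint_features(np.array([1.0, 0.0, 2.0]), np.array([0.5, 1.0, 0.0, 1.0]))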
Contextual linear bandits
Suppose each round t, we take a decision x ∈ D_t ⊂ R^d (D_t may be time varying).
- Map each ad/user pair (a, u) to x(a, u).
- D_t = \{ x(a, u_t) : a is a feasible ad at time t \}
Our decision is a feature vector x ∈ D_t. Upon taking action x_t ∈ D_t, we get reward r_t, with expectation:
E[r_t \mid x_t] = \mu^\top x_t
(here \mu is assumed constant over time). Our regret:
E[\sum_t (\mu^\top x_{t, a_t^*} - \mu^\top x_t) \mid \mathcal{A}] \le ??
(where x_{t, a_t^*} is the best decision at time t)
Algorithm
Let's just run LinUCB (or Thompson sampling). Nothing really changes:
- A_t and b_t have the same updating rules.
- Now our decision is:
x_t = \arg\max_{x \in D_t} \max_{\nu \in B_t} \nu^\top x
i.e.
x_t = \arg\max_{x \in D_t} \hat{\mu}_t^\top x + \sqrt{(d \log t)\, x^\top A_t^{-1} x}
The regret bound is still O(d \sqrt{T}).
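Putting the pieces together, the contextual variant only changes which set we maximize over each round. The sketch below reuses the hypothetical ridge_estimate and linucb_action helpers from earlier; the environment interface and the beta schedule are assumptions, not the lecture's prescription.

    import numpy as np

    def contextual_linucb(get_decision_set, pull, T, d, lam=1.0):
        """Run LinUCB when the feasible set D_t changes every round."""
        xs, rs = np.zeros((0, d)), np.zeros(0)
        for t in range(1, T + 1):
            D_t = get_decision_set(t)                 # feasible feature vectors this round
            mu_hat, A = ridge_estimate(xs, rs, lam)   # same A_t, b_t updates as before
            beta = d * np.log(t + 1)                  # assumed radius on the order of d log t
            x_t = linucb_action(D_t, mu_hat, A, beta)
            r_t = pull(x_t)                           # observed reward, mean mu^T x_t
            xs, rs = np.vstack([xs, x_t]), np.append(rs, r_t)
        return xs, rs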
Acknowledgements
http://gdrro.lip6.fr/sites/default/files/JourneeCOSdec2015-Kaufman.pdf
https://sites.google.com/site/banditstutorial/
http://www.yisongyue.com/courses/cs159/lectures/LinUCB.pdf