Bandits and Exploration: How do we (optimally) gather information?
Sham M. Kakade
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Announcements...
HW 4 posted soon (short).
Poster session: June 1, 9-11:30a; ask TA/CSE students for help printing.
Projects: the term is approaching the end...
Today:
Quick overview: parallelization and deep learning
Bandits:
1. Vanilla k-arm setting
2. Linear bandits and ad-placement
3. Game trees?
The problem
In unsupervised learning, we just have data.
In supervised learning, we have inputs X and labels Y (often we spend resources to get these labels).
In reinforcement learning (very general), we act in the world, there is "state," and we observe rewards.
Bandit settings: we have K decisions each round, and we only receive feedback for the chosen decision.
Gambling in a casino...
Multi-Armed Bandit Game
K independent arms: $a \in \{1, \ldots, K\}$.
Each arm a returns a random reward $R_a$ if pulled; (simpler case) assume $R_a$ is not time varying.
Game: you choose arm $a_t$ at time $t$. You then observe $X_t = R_{a_t}$, where $R_{a_t}$ is sampled from the underlying distribution of that arm.
The distribution of $R_a$ is not known.
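To make the setup concrete, here is a minimal Python sketch (not from the slides) of the game with Bernoulli arms; the arm means in `mus` are hypothetical and hidden from the player.

```python
# Minimal sketch of the K-armed bandit game with Bernoulli rewards (assumption:
# rewards in {0, 1}); the player only ever sees X_t = R_{a_t}, never the means.
import numpy as np

rng = np.random.default_rng(0)

def make_bandit(mus):
    """Return a pull(a) function for arms a = 0, ..., K-1 with fixed means mus."""
    mus = np.asarray(mus, dtype=float)
    def pull(a):
        # R_a ~ Bernoulli(mu_a); the distribution is unknown to the player.
        return float(rng.random() < mus[a])
    return pull

pull = make_bandit([0.3, 0.5, 0.7])  # hypothetical arm means
x = pull(2)                          # observe X_t = R_{a_t} for the chosen arm
```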
More real motivations...
Ad placement...
The Goal
We would like to maximize our long-term future reward.
Our (possibly randomized) sequential strategy/algorithm $\mathcal{A}$ is:
$$a_t = \mathcal{A}(a_1, X_1, a_2, X_2, \ldots, a_{t-1}, X_{t-1})$$
In T rounds, our reward is:
$$\sum_{t=1}^{T} E[X_t \mid \mathcal{A}]$$
where the expectation is with respect to the reward process and our algorithm.
Objective: what is a strategy which maximizes our long-term reward?
Our Regret
Suppose $\mu_a = E[R_a]$. Assume $0 \le \mu_a \le 1$. Let $\mu^* = \max_a \mu_a$.
In expectation, the best we can do is obtain $\mu^* T$ reward in T steps.
In T rounds, our regret is:
$$\mu^* T - E\left[\sum_{t=1}^{T} X_t \,\Big|\, \mathcal{A}\right] \le \;??$$
Objective: what is a strategy which makes our regret small?
A Naive Strategy
For the first $\tau$ rounds, sample each arm $\tau/K$ times.
For the remainder of the rounds, choose the arm with the best observed empirical reward.
How good is this strategy? How do we set $\tau$? Let's look at confidence intervals.
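A minimal sketch of this explore-then-exploit strategy, assuming a `pull(a)` reward function like the simulator above and that `tau` is a positive multiple of K.

```python
# Naive strategy: explore each arm tau/K times, then commit to the empirical best.
import numpy as np

def explore_then_exploit(pull, K, T, tau):
    total = np.zeros(K)    # summed observed reward per arm
    counts = np.zeros(K)   # number of pulls per arm
    rewards = []
    # Exploration rounds: cycle through the arms for the first tau rounds.
    for t in range(tau):
        a = t % K
        x = pull(a)
        total[a] += x
        counts[a] += 1
        rewards.append(x)
    # Exploitation rounds: commit to the arm with the best empirical mean.
    best = int(np.argmax(total / counts))
    for t in range(tau, T):
        rewards.append(pull(best))
    return rewards
```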
Hoeffding's bound
If we pull arm a $N_a$ times, our empirical estimate for arm a is:
$$\hat{\mu}_a = \frac{1}{N_a} \sum_{t:\, a_t = a} X_t$$
By Hoeffding's bound, with probability greater than $1 - \delta$,
$$|\hat{\mu}_a - \mu_a| \le O\left(\sqrt{\frac{\log(1/\delta)}{N_a}}\right)$$
By the union bound, with probability greater than $1 - \delta$,
$$\forall a, \quad |\hat{\mu}_a - \mu_a| \le O\left(\sqrt{\frac{\log(K/\delta)}{N_a}}\right)$$
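The union-bound step, spelled out (a standard argument, not shown on the slide; constants as in Hoeffding's inequality):

```latex
\Pr\Big(\exists\, a:\ |\hat{\mu}_a - \mu_a| > \epsilon_a\Big)
\;\le\; \sum_{a=1}^{K} \Pr\big(|\hat{\mu}_a - \mu_a| > \epsilon_a\big)
\;\le\; \sum_{a=1}^{K} 2\exp\!\big(-2 N_a \epsilon_a^2\big)
```

Setting each term to $\delta/K$ gives $\epsilon_a = \sqrt{\log(2K/\delta)/(2 N_a)}$, i.e. the $O\big(\sqrt{\log(K/\delta)/N_a}\big)$ width above.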
Our regret
(Exploration rounds) What is our regret for the first $\tau$ rounds?
(Exploitation rounds) What is our regret for the remaining $T - \tau$ rounds?
Our total regret is:
$$\mu^* T - \sum_{t=1}^{T} X_t \;\le\; \tau + O\left((T - \tau)\sqrt{\frac{\log(K/\delta)}{\tau/K}}\right)$$
How do we choose $\tau$?
The Naive Strategy's Regret
Choose $\tau = K^{1/3} T^{2/3}$ and $\delta = 1/T$.
Theorem: Our total (expected) regret is:
$$\mu^* T - E\left[\sum_{t=1}^{T} X_t \,\Big|\, \mathcal{A}\right] \le O\left(K^{1/3} T^{2/3} \big(\log(KT)\big)^{1/3}\right)$$
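Where this choice of $\tau$ comes from: balance the exploration cost against the exploitation cost in the bound on the previous slide (a sketch; constants suppressed, $T - \tau \approx T$, and the log factor carried through):

```latex
\tau \;\approx\; (T - \tau)\sqrt{\frac{K \log(K/\delta)}{\tau}}
\;\;\Longrightarrow\;\;
\tau^{3/2} \;\approx\; T \sqrt{K \log(K/\delta)}
\;\;\Longrightarrow\;\;
\tau \;\approx\; K^{1/3} T^{2/3} \big(\log(K/\delta)\big)^{1/3}
```

With $\delta = 1/T$, this gives the $O\big(K^{1/3} T^{2/3} (\log(KT))^{1/3}\big)$ regret stated above.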
Can we be more adaptive?
Are we still pulling arms that we know are sub-optimal? How do we know this?
Let $N_{a,t}$ be the number of times we pulled arm a up to time t.
Confidence interval at time t: with probability greater than $1 - \delta$,
$$|\hat{\mu}_{a,t} - \mu_a| \le O\left(\sqrt{\frac{\log(1/\delta)}{N_{a,t}}}\right)$$
With $\delta \to \delta/(TK)$, the above bound will hold for all arms $a \in [K]$ and all timesteps $t \le T$.
Example (sequence of figure slides)
Confidence Bounds...
UCB: a reasonable state of our uncertainty...
Upper Confidence Bound (UCB) Algorithm
At each time t, pull arm:
$$a_t = \operatorname{argmax}_a \left\{ \hat{\mu}_{a,t} + c\sqrt{\frac{\log(KT/\delta)}{N_{a,t}}} \right\} =: \operatorname{argmax}_a \left\{ \hat{\mu}_{a,t} + \text{ConfBound}_{a,t} \right\}$$
(where $c \le 10$ is a constant).
Observe reward $X_t$. Update $\hat{\mu}_{a,t}$, $N_{a,t}$, and $\text{ConfBound}_{a,t}$.
How well does this do?
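A minimal Python sketch of the UCB rule above, assuming a `pull(a)` reward function like the earlier simulator; `c` and `delta` are tuning parameters, and each arm is pulled once up front so the counts are nonzero.

```python
# UCB: always pull the arm with the highest empirical mean plus confidence bound.
import numpy as np

def ucb(pull, K, T, delta=0.1, c=2.0):
    means = np.zeros(K)    # empirical means  mu_hat_{a,t}
    counts = np.zeros(K)   # pull counts      N_{a,t}
    rewards = []
    for t in range(T):
        if t < K:
            a = t          # initialization: pull each arm once
        else:
            conf = c * np.sqrt(np.log(K * T / delta) / counts)
            a = int(np.argmax(means + conf))    # optimism in the face of uncertainty
        x = pull(a)
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]  # incremental mean update
        rewards.append(x)
    return rewards
```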
Instantaneous Regret
With probability greater than $1 - \delta$, all the confidence bounds will hold.
Question: if $\hat{\mu}_{a,t} + \text{ConfBound}_{a,t} \le \mu^*$, could UCB pull arm a at time t?
Question: if we pull arm a at time t, how much regret do we pay? I.e., $\mu^* - \mu_{a_t} \le \;??$
Total Regret
Theorem: the total (expected) regret of UCB is:
$$\mu^* T - E\left[\sum_{t=1}^{T} X_t \,\Big|\, \mathcal{A}\right] \le \sqrt{KT \log(KT)}$$
This is better than the naive strategy. Up to log factors, it is optimal.
Practical algorithm?
Proof Idea: for K = 2
Suppose arm a = 2 is not optimal.
Claim 1: all confidence intervals will be valid (with probability $\ge 1 - \delta$).
Claim 2: if we pull arm a = 1, then we incur no regret.
Claim 3: if we pull arm a = 2, then we pay at most $2 C_{a,t}$ regret. To see this:
$\hat{\mu}_{a,t} + C_{a,t} \ge \hat{\mu}_{1,t} + C_{1,t} \ge \mu^*$ (Why?)
$\mu_a \ge \hat{\mu}_{a,t} - C_{a,t}$ (Why?)
The total regret is:
$$\sum_t C_{a,t} \;\lesssim\; \sum_t \frac{1}{\sqrt{N_{a,t}}}$$
(up to the log factor inside $C_{a,t}$). Note that $N_{a,t} \le t$ (and increasing).
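Summing the per-round cost $2C_{a_t,t}$ over the horizon gives the $\sqrt{KT}$ rate; a sketch of that last step (logs and constants suppressed), grouping the rounds by which arm was pulled:

```latex
\sum_{t=1}^{T} C_{a_t, t}
\;\lesssim\; \sum_{a=1}^{K} \sum_{n=1}^{N_{a,T}} \frac{1}{\sqrt{n}}
\;\lesssim\; \sum_{a=1}^{K} \sqrt{N_{a,T}}
\;\le\; \sqrt{K \sum_{a=1}^{K} N_{a,T}}
\;=\; \sqrt{KT}
```

The last inequality is Cauchy-Schwarz together with $\sum_a N_{a,T} = T$; reinstating the $\log(KT/\delta)$ factor inside each $C_{a,t}$ gives the $\sqrt{KT\log(KT)}$ bound on the previous slide.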
Acknowledgements
http://gdrro.lip6.fr/sites/default/files/JourneeCOSdec2015-Kaufman.pdf
https://sites.google.com/site/banditstutorial/