Econ 2148, Fall 2019
Multi-armed bandits
Maximilian Kasy
Department of Economics, Harvard University
Agenda
◮ Thus far: “supervised machine learning” – data are given. Next: “active learning” – experimentation.
◮ Setup: the multi-armed bandit problem, an adaptive experiment with an exploration / exploitation trade-off.
◮ Two popular approximate algorithms:
  1. Thompson sampling
  2. Upper Confidence Bound (UCB) algorithm
◮ Characterizing regret.
◮ Characterizing an exact solution: the Gittins index.
◮ Extension to settings with covariates (contextual bandits).
Takeaways for this part of the class
◮ When experimental units arrive over time and we can adapt our treatment choices, we can learn the optimal treatment quickly.
◮ Treatment choice: a trade-off between
  1. choosing good treatments now (exploitation), and
  2. learning for future treatment choices (exploration).
◮ Optimal solutions are hard, but good heuristics are available.
◮ We will derive a bound on the regret of one heuristic,
  ◮ bounding the number of times a sub-optimal treatment is chosen,
  ◮ using large-deviations bounds (cf. testing!).
◮ We will also derive a characterization of the optimal solution in the infinite-horizon case. This relies on a separate index for each arm.
The multi-armed bandit
Setup
◮ Treatments $D_t \in \{1, \dots, k\}$.
◮ Experimental units come in sequentially over time, one unit per time period $t = 1, 2, \dots$
◮ Potential outcomes, i.i.d. over time:
  $Y_t = Y_t^{D_t}$, $\quad Y_t^d \sim F^d$, $\quad E[Y_t^d] = \theta^d$.
◮ Treatment assignment can depend on past treatments and outcomes,
  $D_{t+1} = d_t(D_1, \dots, D_t, Y_1, \dots, Y_t)$.
The multi-armed bandit
Setup continued
◮ Optimal treatment:
  $d^* = \operatorname{argmax}_d \theta^d$, $\quad \theta^* = \max_d \theta^d = \theta^{d^*}$.
◮ Expected regret for treatment $d$:
  $\Delta^d = E\big[Y^{d^*} - Y^d\big] = \theta^{d^*} - \theta^d$.
◮ Finite-horizon objective: average outcome,
  $U_T = \frac{1}{T} \sum_{1 \le t \le T} Y_t$.
◮ Infinite-horizon objective: discounted average outcome,
  $U_\infty = \sum_{t \ge 1} \beta^t Y_t$.
The multi-armed bandit
Expectations of objectives
◮ Expected finite-horizon objective:
  $E[U_T] = E\Big[\frac{1}{T} \sum_{1 \le t \le T} \theta^{D_t}\Big]$.
◮ Expected infinite-horizon objective:
  $E[U_\infty] = E\Big[\sum_{t \ge 1} \beta^t \theta^{D_t}\Big]$.
◮ Expected finite-horizon regret: compare to always assigning the optimal treatment $d^*$,
  $R_T = E\Big[\frac{1}{T} \sum_{1 \le t \le T} \big(Y_t^{d^*} - Y_t\big)\Big] = E\Big[\frac{1}{T} \sum_{1 \le t \le T} \Delta^{D_t}\Big]$.
The multi-armed bandit
Practice problem
◮ Show that these equalities hold.
◮ Interpret these objectives.
◮ Relate them to our decision theory terminology.
Two popular algorithms
Upper Confidence Bound (UCB) algorithm
◮ Define
  $\bar Y^d_t = \frac{1}{T^d_t} \sum_{1 \le s \le t} \mathbf{1}(D_s = d) \cdot Y_s$, $\quad T^d_t = \sum_{1 \le s \le t} \mathbf{1}(D_s = d)$, $\quad B^d_t = B(T^d_t)$.
◮ $B(\cdot)$ is a decreasing function, giving the width of the “confidence interval.” We will specify this function later.
◮ At time $t+1$, choose
  $D_{t+1} = \operatorname{argmax}_d \; \bar Y^d_t + B^d_t$.
◮ “Optimism in the face of uncertainty.”
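As a concrete illustration, here is a minimal simulation sketch of the UCB rule for outcomes bounded in $[0, 1]$. It anticipates the bonus $B^d_t = \sqrt{\alpha \log(t) / (2 T^d_t)}$ derived later for the bounded case; the `pull` interface, the function names, and the choice $\alpha = 3$ are our illustrative assumptions, not part of the slides.

```python
import numpy as np

def ucb_bandit(pull, k, T, alpha=3.0, rng=None):
    """UCB sketch for rewards in [0, 1]; pull(d, rng) returns a reward for arm d."""
    rng = rng or np.random.default_rng(0)
    counts = np.zeros(k)   # T^d_t: number of times each arm was pulled
    sums = np.zeros(k)     # running sum of rewards per arm
    history = []
    for t in range(1, T + 1):
        if t <= k:
            d = t - 1                                          # pull each arm once to initialize
        else:
            means = sums / counts                              # \bar Y^d_t
            bonus = np.sqrt(alpha * np.log(t) / (2 * counts))  # B^d_t, bounded-outcome case
            d = int(np.argmax(means + bonus))                  # optimism in the face of uncertainty
        y = pull(d, rng)
        counts[d] += 1
        sums[d] += y
        history.append((d, y))
    return counts, sums, history

# Example: Bernoulli arms with means (0.3, 0.5, 0.7); arm 2 is optimal.
theta = np.array([0.3, 0.5, 0.7])
counts, sums, _ = ucb_bandit(lambda d, rng: rng.binomial(1, theta[d]), k=3, T=5000)
print(counts)  # most pulls should go to arm 2
```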
Two popular algorithms
Thompson sampling
◮ Start with a Bayesian prior for $\theta$.
◮ Assign each treatment with probability equal to the posterior probability that it is optimal.
◮ Put differently, obtain one draw $\hat\theta_{t+1}$ from the posterior given $(D_1, \dots, D_t, Y_1, \dots, Y_t)$, and choose
  $D_{t+1} = \operatorname{argmax}_d \; \hat\theta^d_{t+1}$.
◮ Easily extendable to more complicated dynamic decision problems, complicated priors, etc.!
Two popular algorithms
Thompson sampling – the binomial case
◮ Assume that $Y \in \{0, 1\}$, $Y^d_t \sim \mathrm{Ber}(\theta^d)$.
◮ Start with a uniform prior for $\theta$ on $[0, 1]^k$.
◮ Then the posterior for $\theta^d$ at time $t+1$ is a Beta distribution with parameters
  $\alpha^d_t = 1 + T^d_t \cdot \bar Y^d_t$, $\quad \beta^d_t = 1 + T^d_t \cdot (1 - \bar Y^d_t)$.
◮ Thus
  $D_{t+1} = \operatorname{argmax}_d \; \hat\theta^d_t$,
  where $\hat\theta^d_t \sim \mathrm{Beta}(\alpha^d_t, \beta^d_t)$ is a random draw from the posterior.
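A minimal sketch of this binomial case. The Beta(1 + successes, 1 + failures) posterior draw is exactly the update above; the `pull` interface and the example arm means are our illustrative assumptions.

```python
import numpy as np

def thompson_bernoulli(pull, k, T, rng=None):
    """Thompson sampling for Bernoulli outcomes with a uniform (Beta(1,1)) prior."""
    rng = rng or np.random.default_rng(0)
    successes = np.zeros(k)  # T^d_t * \bar Y^d_t
    failures = np.zeros(k)   # T^d_t * (1 - \bar Y^d_t)
    for t in range(T):
        # One posterior draw per arm: Beta(1 + successes, 1 + failures)
        draws = rng.beta(1 + successes, 1 + failures)
        d = int(np.argmax(draws))   # assign the arm that is best in this draw
        y = pull(d, rng)
        successes[d] += y
        failures[d] += 1 - y
    return successes, failures

theta = np.array([0.3, 0.5, 0.7])
s, f = thompson_bernoulli(lambda d, rng: rng.binomial(1, theta[d]), k=3, T=5000)
print(s + f)  # pull counts per arm; arm 2 should dominate
```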
Regret bounds
◮ Back to the general case.
◮ Recall expected finite-horizon regret,
  $R_T = E\Big[\frac{1}{T} \sum_{1 \le t \le T} \big(Y_t^{d^*} - Y_t\big)\Big] = E\Big[\frac{1}{T} \sum_{1 \le t \le T} \Delta^{D_t}\Big]$.
◮ Thus,
  $T \cdot R_T = \sum_d E[T^d_T] \cdot \Delta^d$.
◮ Good algorithms will have $E[T^d_T]$ small when $\Delta^d > 0$.
◮ We will next derive upper bounds on $E[T^d_T]$ for the UCB algorithm.
◮ We will then state that for large $T$ similar upper bounds hold for Thompson sampling.
◮ There is also a lower bound on regret across all possible algorithms which is the same, up to a constant.
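A quick numerical check of the decomposition $T \cdot R_T = \sum_d T^d_T \cdot \Delta^d$ on a single simulated assignment path (here simply uniform-at-random assignment; one realized path stands in for the expectation, and the arm means are our illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, 0.5, 0.7])
gaps = theta.max() - theta                      # Delta^d
T, k = 5000, len(theta)

assignments = rng.integers(0, k, size=T)        # D_t for t = 1, ..., T
counts = np.bincount(assignments, minlength=k)  # T^d_T

regret_from_path = gaps[assignments].mean()     # (1/T) sum_t Delta^{D_t}
regret_from_counts = (counts * gaps).sum() / T  # (1/T) sum_d T^d_T * Delta^d
print(regret_from_path, regret_from_counts)     # equal by construction
```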
Regret bounds
Probability theory preliminary: large deviations
◮ Suppose that
  $E[\exp(\lambda \cdot (Y - E[Y]))] \le \exp(\psi(\lambda))$.
◮ Let $\bar Y_T = \frac{1}{T} \sum_{1 \le t \le T} Y_t$ for i.i.d. $Y_t$. Then, by Markov’s inequality and independence across $t$,
  $P(\bar Y_T - E[Y] > \varepsilon) \le \frac{E[\exp(\lambda \cdot (\bar Y_T - E[Y]))]}{\exp(\lambda \cdot \varepsilon)} = \frac{\prod_{1 \le t \le T} E[\exp((\lambda / T) \cdot (Y_t - E[Y]))]}{\exp(\lambda \cdot \varepsilon)} \le \exp(T \psi(\lambda / T) - \lambda \cdot \varepsilon)$.
Regret bounds
Large deviations continued
◮ Define the Legendre transform of $\psi$ as
  $\psi^*(\varepsilon) = \sup_{\lambda \ge 0} [\lambda \cdot \varepsilon - \psi(\lambda)]$.
◮ Taking the inf over $\lambda$ in the previous slide implies
  $P(\bar Y_T - E[Y] > \varepsilon) \le \exp(-T \cdot \psi^*(\varepsilon))$.
◮ For distributions bounded by $[0, 1]$: $\psi(\lambda) = \lambda^2 / 8$ and $\psi^*(\varepsilon) = 2 \varepsilon^2$.
◮ For normal distributions: $\psi(\lambda) = \lambda^2 \sigma^2 / 2$ and $\psi^*(\varepsilon) = \varepsilon^2 / (2 \sigma^2)$.
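Two steps that this slide compresses, spelled out (our algebra, under the same assumptions): how the infimum over $\lambda$ produces $\psi^*$, and where the bounded-outcome special case comes from. Substituting $\mu = \lambda / T$ in the exponent of the previous slide,
$$\inf_{\lambda \ge 0} \exp\big(T \psi(\lambda / T) - \lambda \varepsilon\big) = \exp\Big(-T \sup_{\mu \ge 0} \big[\mu \varepsilon - \psi(\mu)\big]\Big) = \exp\big(-T \cdot \psi^*(\varepsilon)\big).$$
For $\psi(\lambda) = \lambda^2 / 8$, the supremum of $\lambda \varepsilon - \lambda^2 / 8$ is attained at $\lambda = 4 \varepsilon$, so
$$\psi^*(\varepsilon) = 4 \varepsilon^2 - \frac{(4 \varepsilon)^2}{8} = 2 \varepsilon^2.$$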
Regret bounds
Applied to the bandit setting
◮ Suppose that for all $d$
  $E[\exp(\lambda \cdot (Y^d - \theta^d))] \le \exp(\psi(\lambda))$, $\quad E[\exp(-\lambda \cdot (Y^d - \theta^d))] \le \exp(\psi(\lambda))$.
◮ Recall / define
  $\bar Y^d_t = \frac{1}{T^d_t} \sum_{1 \le s \le t} \mathbf{1}(D_s = d) \cdot Y_s$, $\quad B^d_t = (\psi^*)^{-1}\Big(\frac{\alpha \log(t)}{T^d_t}\Big)$.
◮ Then we get
  $P(\bar Y^d_t - \theta^d > B^d_t) \le \exp(-T^d_t \cdot \psi^*(B^d_t)) = \exp(-\alpha \log(t)) = t^{-\alpha}$,
  $P(\bar Y^d_t - \theta^d < -B^d_t) \le t^{-\alpha}$.
Regret bounds
Why this choice of $B(\cdot)$?
◮ A smaller $B(\cdot)$ is better for exploitation.
◮ A larger $B(\cdot)$ is better for exploration.
◮ Special cases:
  ◮ Distributions bounded by $[0, 1]$: $B^d_t = \sqrt{\frac{\alpha \log(t)}{2 T^d_t}}$.
  ◮ Normal distributions: $B^d_t = \sqrt{\frac{2 \sigma^2 \alpha \log(t)}{T^d_t}}$.
◮ The $\alpha \log(t)$ term ensures that coverage goes to 1, but slowly enough not to waste too much in terms of exploitation.
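These two expressions follow from inverting the special-case $\psi^*$ in the definition of $B^d_t$ from the previous slide (a one-line check, our algebra). For bounded outcomes, $\psi^*(\varepsilon) = 2 \varepsilon^2$ gives $(\psi^*)^{-1}(x) = \sqrt{x / 2}$, so
$$B^d_t = (\psi^*)^{-1}\Big(\frac{\alpha \log(t)}{T^d_t}\Big) = \sqrt{\frac{\alpha \log(t)}{2 T^d_t}};$$
for normal outcomes, $\psi^*(\varepsilon) = \varepsilon^2 / (2 \sigma^2)$ gives $(\psi^*)^{-1}(x) = \sqrt{2 \sigma^2 x}$, so
$$B^d_t = \sqrt{\frac{2 \sigma^2 \alpha \log(t)}{T^d_t}}.$$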
Regret bounds
When $d$ is chosen by the UCB algorithm
◮ By the definition of UCB, at least one of these three events has to hold when $d$ is chosen at time $t+1$:
  $\bar Y^{d^*}_t + B^{d^*}_t \le \theta^* \quad (1)$
  $\bar Y^d_t - B^d_t > \theta^d \quad (2)$
  $2 B^d_t > \Delta^d \quad (3)$
◮ (1) and (2) have low probability. By the previous slide,
  $P\big(\bar Y^{d^*}_t + B^{d^*}_t \le \theta^*\big) \le t^{-\alpha}$, $\quad P\big(\bar Y^d_t - B^d_t > \theta^d\big) \le t^{-\alpha}$.
◮ (3) only happens when $T^d_t$ is small. By the definition of $B^d_t$, (3) happens iff
  $T^d_t < \frac{\alpha \log(t)}{\psi^*(\Delta^d / 2)}$.
Regret bounds
Practice problem
Show that at least one of the statements (1), (2), or (3) has to be true whenever $D_{t+1} = d$, for the UCB algorithm.
Regret bounds
Bounding $E[T^d_T]$
◮ Let
  $\tilde T^d_T = \Big\lceil \frac{\alpha \log(T)}{\psi^*(\Delta^d / 2)} \Big\rceil$.
◮ Forcing the algorithm to pick $d$ the first $\tilde T^d_T$ periods can only increase $T^d_T$.
◮ We can collect our results to get
  $E[T^d_T] = \sum_{1 \le t \le T} E[\mathbf{1}(D_t = d)] \le \tilde T^d_T + \sum_{\tilde T^d_T < t \le T} E[\mathbf{1}(D_t = d)]$
  $\le \tilde T^d_T + \sum_{\tilde T^d_T < t \le T} E[\mathbf{1}(\text{(1) or (2) is true at } t)]$
  $\le \tilde T^d_T + \sum_{\tilde T^d_T < t \le T} \big( E[\mathbf{1}(\text{(1) is true at } t)] + E[\mathbf{1}(\text{(2) is true at } t)] \big)$
  $\le \tilde T^d_T + \sum_{\tilde T^d_T < t \le T} 2\, t^{-\alpha + 1} \le \tilde T^d_T + \frac{\alpha}{\alpha - 2}$.
Regret bounds
Upper bound on expected regret for UCB
◮ We thus get:
  $E[T^d_T] \le \frac{\alpha \log(T)}{\psi^*(\Delta^d / 2)} + \frac{\alpha}{\alpha - 2}$,
  $R_T \le \frac{1}{T} \sum_d \Big( \frac{\alpha \log(T)}{\psi^*(\Delta^d / 2)} + \frac{\alpha}{\alpha - 2} \Big) \cdot \Delta^d$.
◮ Expected regret (the difference to the optimal policy) goes to 0 at a rate of $O(\log(T) / T)$ – pretty fast!
◮ While the cost of “getting treatment wrong” is $\Delta^d$, the difficulty of figuring out the right treatment is of order $1 / \psi^*(\Delta^d / 2)$. Typically, this is of order $(1 / \Delta^d)^2$.
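To get a feel for the magnitudes, a short numerical evaluation of this bound for outcomes bounded in $[0, 1]$ (where $\psi^*(\varepsilon) = 2 \varepsilon^2$); the arm means and $\alpha = 3$ are our illustrative choices:

```python
import numpy as np

def ucb_regret_bound(theta, T, alpha=3.0):
    """Evaluate R_T <= (1/T) * sum_d [alpha*log(T)/psi*(Delta^d/2) + alpha/(alpha-2)] * Delta^d."""
    gaps = theta.max() - theta
    gaps = gaps[gaps > 0]                  # the optimal arm contributes zero regret
    psi_star = 2 * (gaps / 2) ** 2         # psi*(Delta^d / 2) for bounded outcomes
    per_arm = alpha * np.log(T) / psi_star + alpha / (alpha - 2)
    return (per_arm * gaps).sum() / T

theta = np.array([0.3, 0.5, 0.7])
for T in (10**3, 10**4, 10**5):
    print(T, ucb_regret_bound(theta, T))   # decays roughly like log(T)/T
```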
Regret bounds
Related bounds – rate optimality
◮ Lower bound: Consider the bandit problem with binary outcomes and any algorithm such that $E[T^d_t] = o(t^a)$ for all $a > 0$. Then
  $\liminf_{T \to \infty} \frac{T \cdot R_T}{\log(T)} \ge \sum_d \frac{\Delta^d}{kl(\theta^d, \theta^*)}$,
  where $kl(p, q) = p \cdot \log(p / q) + (1 - p) \cdot \log((1 - p) / (1 - q))$.
◮ Upper bound for Thompson sampling: In the binary outcome setting, Thompson sampling achieves this bound, i.e.,
  $\liminf_{T \to \infty} \frac{T \cdot R_T}{\log(T)} = \sum_d \frac{\Delta^d}{kl(\theta^d, \theta^*)}$.
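For intuition on the size of this constant, a small computation of $\sum_d \Delta^d / kl(\theta^d, \theta^*)$ over the sub-optimal arms (the arm means are our running example, not from the slides):

```python
import numpy as np

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = np.array([0.3, 0.5, 0.7])
theta_star = theta.max()
constant = sum((theta_star - th) / kl_bernoulli(th, theta_star)
               for th in theta if th < theta_star)
print(constant)  # asymptotically, cumulative regret T * R_T ~ constant * log(T)
```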