

  1. Bandits. Econ 2148, Fall 2019: Multi-armed bandits. Maximilian Kasy, Department of Economics, Harvard University.

  2. Agenda
  ◮ Thus far: “Supervised machine learning” – data are given. Next: “Active learning” – experimentation.
  ◮ Setup: The multi-armed bandit problem. An adaptive experiment with an exploration / exploitation trade-off.
  ◮ Two popular approximate algorithms:
    1. Thompson sampling
    2. Upper Confidence Bound (UCB) algorithm
  ◮ Characterizing regret.
  ◮ Characterizing an exact solution: the Gittins index.
  ◮ Extension to settings with covariates (contextual bandits).

  3. Takeaways for this part of class
  ◮ When experimental units arrive over time and we can adapt our treatment choices, we can learn the optimal treatment quickly.
  ◮ Treatment choice involves a trade-off between
    1. choosing good treatments now (exploitation), and
    2. learning for future treatment choices (exploration).
  ◮ Optimal solutions are hard, but good heuristics are available.
  ◮ We will derive a bound on the regret of one heuristic,
    ◮ bounding the number of times a sub-optimal treatment is chosen,
    ◮ using large deviations bounds (cf. testing!).
  ◮ We will also derive a characterization of the optimal solution in the infinite-horizon case. This relies on a separate index for each arm.

  4. The multi-armed bandit: Setup
  ◮ Treatments D_t ∈ {1, ..., k}.
  ◮ Experimental units come in sequentially over time, one unit per time period t = 1, 2, ...
  ◮ Potential outcomes, i.i.d. over time:
      Y_t = Y^{D_t}_t,   Y^d_t ∼ F^d,   E[Y^d_t] = θ^d.
  ◮ Treatment assignment can depend on past treatments and outcomes,
      D_{t+1} = d_t(D_1, ..., D_t, Y_1, ..., Y_t).
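
  For concreteness, here is a minimal simulation sketch of this environment, assuming Bernoulli outcome distributions F^d; the class name, arm means, and seed are illustrative choices, not part of the slides.

```python
import numpy as np

class BernoulliBandit:
    """Simulated k-armed bandit: arm d yields Y ~ Bernoulli(theta[d])."""

    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta)      # mean outcomes theta^d
        self.rng = np.random.default_rng(seed)

    def pull(self, d):
        # Draw the potential outcome Y_t^d for the chosen arm d
        return self.rng.binomial(1, self.theta[d])

# Example: three treatments with means 0.3, 0.5, 0.7, so arm 2 is optimal (d* = 2)
env = BernoulliBandit([0.3, 0.5, 0.7])
outcome = env.pull(2)
```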

  5. The multi-armed bandit: Setup continued
  ◮ Optimal treatment:
      d* = argmax_d θ^d,   θ* = max_d θ^d = θ^{d*}.
  ◮ Expected regret for treatment d:
      ∆^d = E[Y^{d*} − Y^d] = θ^{d*} − θ^d.
  ◮ Finite horizon objective: average outcome,
      U_T = (1/T) ∑_{1 ≤ t ≤ T} Y_t.
  ◮ Infinite horizon objective: discounted average outcome,
      U_∞ = ∑_{t ≥ 1} β^t Y_t.

  6. The multi-armed bandit: Expectations of objectives
  ◮ Expected finite horizon objective:
      E[U_T] = E[ (1/T) ∑_{1 ≤ t ≤ T} θ^{D_t} ].
  ◮ Expected infinite horizon objective:
      E[U_∞] = E[ ∑_{t ≥ 1} β^t θ^{D_t} ].
  ◮ Expected finite horizon regret: compare to always assigning the optimal treatment d*,
      R_T = E[ (1/T) ∑_{1 ≤ t ≤ T} (Y^{d*}_t − Y_t) ] = E[ (1/T) ∑_{1 ≤ t ≤ T} ∆^{D_t} ].

  7. The multi-armed bandit: Practice problem
  ◮ Show that these equalities hold.
  ◮ Interpret these objectives.
  ◮ Relate them to our decision theory terminology.

  8. Two popular algorithms: Upper Confidence Bound (UCB) algorithm
  ◮ Define
      \bar{Y}^d_t = (1 / T^d_t) ∑_{1 ≤ s ≤ t} 1(D_s = d) · Y_s,
      T^d_t = ∑_{1 ≤ s ≤ t} 1(D_s = d),
      B^d_t = B(T^d_t).
  ◮ B(·) is a decreasing function, giving the width of the “confidence interval.” We will specify this function later.
  ◮ At time t + 1, choose
      D_{t+1} = argmax_d  \bar{Y}^d_t + B^d_t.
  ◮ “Optimism in the face of uncertainty.”
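
  A minimal Python sketch of this rule, assuming outcomes bounded in [0, 1] and using the bonus B^d_t = sqrt(α log(t) / (2 T^d_t)) that the slides specify for that case a few slides below; pulling each arm once to initialize, the default α, and the function name are implementation choices, not from the slides.

```python
import numpy as np

def ucb(env, k, T, alpha=3.0):
    """Run the UCB algorithm for T periods on a k-armed bandit.

    env.pull(d) must return the outcome of arm d (e.g. the BernoulliBandit
    above); outcomes are assumed to lie in [0, 1], so the bonus is
    B^d_t = sqrt(alpha * log(t) / (2 * T^d_t)).
    """
    counts = np.zeros(k)   # T^d_t: number of times each arm was pulled
    sums = np.zeros(k)     # running sum of outcomes per arm
    history = []

    for t in range(1, T + 1):
        if t <= k:
            d = t - 1                      # pull every arm once to initialize
        else:
            means = sums / counts                                # \bar{Y}^d_t
            bonus = np.sqrt(alpha * np.log(t) / (2 * counts))    # B^d_t
            d = int(np.argmax(means + bonus))
        y = env.pull(d)
        counts[d] += 1
        sums[d] += y
        history.append((d, y))
    return history
```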

  9. Two popular algorithms: Thompson sampling
  ◮ Start with a Bayesian prior for θ.
  ◮ Assign each treatment with probability equal to the posterior probability that it is optimal.
  ◮ Put differently, obtain one draw \hat{θ}_{t+1} from the posterior given (D_1, ..., D_t, Y_1, ..., Y_t), and choose
      D_{t+1} = argmax_d  \hat{θ}^d_{t+1}.
  ◮ Easily extendable to more complicated dynamic decision problems, complicated priors, etc.!

  10. Two popular algorithms: Thompson sampling - the binomial case
  ◮ Assume that Y ∈ {0, 1}, Y^d_t ∼ Ber(θ^d).
  ◮ Start with a uniform prior for θ on [0, 1]^k.
  ◮ Then the posterior for θ^d at time t + 1 is a Beta distribution with parameters
      α^d_t = 1 + T^d_t · \bar{Y}^d_t,
      β^d_t = 1 + T^d_t · (1 − \bar{Y}^d_t).
  ◮ Thus
      D_{t+1} = argmax_d  \hat{θ}^d_{t+1},
    where \hat{θ}^d_{t+1} ∼ Beta(α^d_t, β^d_t) is a random draw from the posterior.
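
  A minimal sketch of Thompson sampling for this binomial case, following the Beta posterior update above; the function name, seed, and the env interface (as in the earlier environment sketch) are illustrative assumptions.

```python
import numpy as np

def thompson_binomial(env, k, T, seed=1):
    """Thompson sampling for Bernoulli outcomes with a uniform prior on [0, 1]^k.

    The posterior for arm d is Beta(1 + successes_d, 1 + failures_d),
    i.e. Beta(alpha^d_t, beta^d_t) from the slide.
    """
    rng = np.random.default_rng(seed)
    successes = np.zeros(k)   # T^d_t * \bar{Y}^d_t
    failures = np.zeros(k)    # T^d_t * (1 - \bar{Y}^d_t)
    history = []

    for t in range(T):
        # One posterior draw per arm; choose the arm whose draw is largest
        theta_hat = rng.beta(1 + successes, 1 + failures)
        d = int(np.argmax(theta_hat))
        y = env.pull(d)
        successes[d] += y
        failures[d] += 1 - y
        history.append((d, y))
    return history
```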

  11. Regret bounds
  ◮ Back to the general case.
  ◮ Recall expected finite horizon regret,
      R_T = E[ (1/T) ∑_{1 ≤ t ≤ T} (Y^{d*}_t − Y_t) ] = E[ (1/T) ∑_{1 ≤ t ≤ T} ∆^{D_t} ].
  ◮ Thus,
      T · R_T = ∑_d E[T^d_T] · ∆^d.
  ◮ Good algorithms will have E[T^d_T] small when ∆^d > 0.
  ◮ We will next derive upper bounds on E[T^d_T] for the UCB algorithm.
  ◮ We will then state that for large T similar upper bounds hold for Thompson sampling.
  ◮ There is also a lower bound on regret across all possible algorithms which is the same, up to a constant.
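
  Spelling out the step behind “Thus” (a short derivation consistent with the definitions above):

```latex
\begin{align*}
T \cdot R_T
  &= \mathrm{E}\Big[\sum_{1 \le t \le T} \Delta^{D_t}\Big]
   = \mathrm{E}\Big[\sum_{1 \le t \le T} \sum_d \mathbf{1}(D_t = d)\, \Delta^d\Big] \\
  &= \sum_d \Delta^d \, \mathrm{E}\Big[\sum_{1 \le t \le T} \mathbf{1}(D_t = d)\Big]
   = \sum_d \mathrm{E}\big[T^d_T\big] \cdot \Delta^d .
\end{align*}
```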

  12. Regret bounds: Probability theory preliminary (large deviations)
  ◮ Suppose that
      E[exp(λ · (Y − E[Y]))] ≤ exp(ψ(λ)).
  ◮ Let \bar{Y}_T = (1/T) ∑_{1 ≤ t ≤ T} Y_t for i.i.d. Y_t. Then, by Markov’s inequality and independence across t,
      P(\bar{Y}_T − E[Y] > ε)
        ≤ E[exp(λ · (\bar{Y}_T − E[Y]))] / exp(λ · ε)
        = ∏_{1 ≤ t ≤ T} E[exp((λ/T) · (Y_t − E[Y]))] / exp(λ · ε)
        ≤ exp(T · ψ(λ/T) − λ · ε).

  13. Regret bounds: Large deviations continued
  ◮ Define the Legendre transform of ψ as
      ψ*(ε) = sup_{λ ≥ 0} [λ · ε − ψ(λ)].
  ◮ Taking the inf over λ on the previous slide implies
      P(\bar{Y}_T − E[Y] > ε) ≤ exp(−T · ψ*(ε)).
  ◮ For distributions bounded by [0, 1]: ψ(λ) = λ²/8 and ψ*(ε) = 2ε².
  ◮ For normal distributions: ψ(λ) = λ²σ²/2 and ψ*(ε) = ε²/(2σ²).
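
  As a quick check of the normal case, the supremum can be computed explicitly (a worked step, not on the slide):

```latex
% Normal case: psi(lambda) = lambda^2 sigma^2 / 2. The objective
% lambda * eps - lambda^2 sigma^2 / 2 is concave in lambda, with first-order
% condition eps - lambda sigma^2 = 0, i.e. lambda = eps / sigma^2 (>= 0 for eps >= 0).
\[
\psi^*(\varepsilon)
  = \sup_{\lambda \ge 0} \Big[ \lambda \varepsilon - \frac{\lambda^2 \sigma^2}{2} \Big]
  = \frac{\varepsilon^2}{\sigma^2} - \frac{\varepsilon^2}{2 \sigma^2}
  = \frac{\varepsilon^2}{2 \sigma^2}.
\]
```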

  14. Regret bounds: Applied to the bandit setting
  ◮ Suppose that for all d,
      E[exp(λ · (Y^d − θ^d))] ≤ exp(ψ(λ)),
      E[exp(−λ · (Y^d − θ^d))] ≤ exp(ψ(λ)).
  ◮ Recall / define
      \bar{Y}^d_t = (1 / T^d_t) ∑_{1 ≤ s ≤ t} 1(D_s = d) · Y_s,
      B^d_t = (ψ*)^{-1}( α log(t) / T^d_t ).
  ◮ Then we get
      P(\bar{Y}^d_t − θ^d > B^d_t) ≤ exp(−T^d_t · ψ*(B^d_t)) = exp(−α log(t)) = t^{−α},
      P(\bar{Y}^d_t − θ^d < −B^d_t) ≤ t^{−α}.

  15. Regret bounds: Why this choice of B(·)?
  ◮ A smaller B(·) is better for exploitation.
  ◮ A larger B(·) is better for exploration.
  ◮ Special cases:
    ◮ Distributions bounded by [0, 1]:
        B^d_t = sqrt( α log(t) / (2 T^d_t) ).
    ◮ Normal distributions:
        B^d_t = sqrt( 2 σ² α log(t) / T^d_t ).
  ◮ The α log(t) term ensures that coverage goes to 1, but slowly enough not to waste too much in terms of exploitation.
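
  Where the bounded-outcome formula comes from, as a one-line inversion of ψ* (a worked step consistent with the previous slides):

```latex
% Bounded case: psi^*(eps) = 2 eps^2, so (psi^*)^{-1}(x) = sqrt(x / 2). Hence
\[
B^d_t
  = (\psi^*)^{-1}\!\Big( \frac{\alpha \log(t)}{T^d_t} \Big)
  = \sqrt{ \frac{\alpha \log(t)}{2\, T^d_t} },
\]
% and analogously (psi^*)^{-1}(x) = sqrt(2 sigma^2 x) gives the normal-case bonus.
```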

  16. Regret bounds: When d is chosen by the UCB algorithm
  ◮ By definition of UCB, at least one of these three events has to hold when d is chosen at time t + 1:
      (1)  \bar{Y}^{d*}_t + B^{d*}_t ≤ θ*,
      (2)  \bar{Y}^d_t − B^d_t > θ^d,
      (3)  2 B^d_t > ∆^d.
  ◮ Events (1) and (2) have low probability. By the previous slide,
      P( \bar{Y}^{d*}_t + B^{d*}_t ≤ θ* ) ≤ t^{−α},
      P( \bar{Y}^d_t − B^d_t > θ^d ) ≤ t^{−α}.
  ◮ Event (3) only happens when T^d_t is small. By definition of B^d_t, (3) happens iff
      T^d_t < α log(t) / ψ*(∆^d / 2).

  17. Regret bounds: Practice problem
  Show that at least one of the statements (1), (2), or (3) has to be true whenever D_{t+1} = d, for the UCB algorithm.

  18. Regret bounds: Bounding E[T^d_T]
  ◮ Let
      \tilde{T}^d_T = ⌈ α log(T) / ψ*(∆^d / 2) ⌉.
  ◮ Forcing the algorithm to pick d for the first \tilde{T}^d_T periods can only increase T^d_T.
  ◮ We can collect our results to get
      E[T^d_T] = ∑_{1 ≤ t ≤ T} E[1(D_t = d)]
               ≤ \tilde{T}^d_T + ∑_{\tilde{T}^d_T < t ≤ T} E[1(D_t = d)]
               ≤ \tilde{T}^d_T + ∑_{\tilde{T}^d_T < t ≤ T} E[1( (1) or (2) is true at t )]
               ≤ \tilde{T}^d_T + ∑_{\tilde{T}^d_T < t ≤ T} ( E[1( (1) is true at t )] + E[1( (2) is true at t )] )
               ≤ \tilde{T}^d_T + ∑_{\tilde{T}^d_T < t ≤ T} 2 t^{−α+1}
               ≤ \tilde{T}^d_T + α / (α − 2).

  19. Regret bounds: Upper bound on expected regret for UCB
  ◮ We thus get:
      E[T^d_T] ≤ α log(T) / ψ*(∆^d / 2) + α / (α − 2),
      R_T ≤ (1/T) ∑_d ( α log(T) / ψ*(∆^d / 2) + α / (α − 2) ) · ∆^d.
  ◮ Expected regret (the difference to the optimal policy) goes to 0 at a rate of O(log(T) / T) – pretty fast!
  ◮ While the cost of “getting treatment wrong” is ∆^d, the difficulty of figuring out the right treatment is of order 1 / ψ*(∆^d / 2). Typically, this is of order (1 / ∆^d)².
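
  A small self-contained simulation sketch combining the Bernoulli environment and the UCB rule sketched earlier, to check that per-period regret shrinks roughly like log(T)/T; all parameter values (means, α, number of replications, seed) are illustrative.

```python
import numpy as np

def simulate_ucb_regret(theta, T, alpha=3.0, n_reps=200, seed=0):
    """Average per-period regret R_T of UCB on a Bernoulli bandit with means theta."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta)
    k = len(theta)
    delta = theta.max() - theta           # regrets Delta^d
    total = 0.0
    for _ in range(n_reps):
        counts = np.zeros(k)
        sums = np.zeros(k)
        regret = 0.0
        for t in range(1, T + 1):
            if t <= k:
                d = t - 1                 # initialize: pull each arm once
            else:
                bonus = np.sqrt(alpha * np.log(t) / (2 * counts))
                d = int(np.argmax(sums / counts + bonus))
            y = rng.binomial(1, theta[d])
            counts[d] += 1
            sums[d] += y
            regret += delta[d]            # expected regret of the chosen arm
        total += regret / T               # per-period (average) regret
    return total / n_reps

# Per-period regret should shrink roughly like log(T) / T as T grows
for T in (100, 1000, 10000):
    print(T, simulate_ucb_regret([0.3, 0.5, 0.7], T))
```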

  20. Regret bounds: Related bounds - rate optimality
  ◮ Lower bound: Consider the bandit problem with binary outcomes and any algorithm such that E[T^d_t] = o(t^a) for all a > 0. Then
      liminf_{T → ∞}  T · R_T / log(T)  ≥  ∑_d  ∆^d / kl(θ^d, θ*),
    where
      kl(p, q) = p · log(p / q) + (1 − p) · log((1 − p) / (1 − q)).
  ◮ Upper bound for Thompson sampling: In the binary outcome setting, Thompson sampling achieves this bound, i.e.,
      liminf_{T → ∞}  T · R_T / log(T)  =  ∑_d  ∆^d / kl(θ^d, θ*).
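
  To get a feel for the size of the lower-bound constant, a small worked example with illustrative numbers (not from the slides):

```latex
% Two arms with theta^1 = 0.4 and theta^* = theta^2 = 0.5, so Delta^1 = 0.1:
\begin{align*}
\mathrm{kl}(0.4, 0.5)
  &= 0.4 \log\!\frac{0.4}{0.5} + 0.6 \log\!\frac{0.6}{0.5} \approx 0.0201, \\
\frac{\Delta^1}{\mathrm{kl}(\theta^1, \theta^*)}
  &\approx \frac{0.1}{0.0201} \approx 5.0,
\end{align*}
% so cumulative regret T * R_T must grow at least like 5.0 * log(T) on this instance.
```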
