Best Arm Identification in Multi-Armed Bandits

Jean-Yves Audibert (1, 2), Sébastien Bubeck (3), Rémi Munos (3)
(1) Univ. Paris Est, Imagine
(2) CNRS/ENS/INRIA, Willow project
(3) INRIA Lille, SequeL team

Outline: Framework, Lower Bound, Algorithms, Experiments, Conclusion

Best arm identification task

Parameters available to the forecaster: the number of rounds $n$ and the number of arms $K$.
Parameters unknown to the forecaster: the reward distributions $\nu_1, \dots, \nu_K$ of the arms (over $[0, 1]$). We assume that there is a unique arm $i^*$ with maximal mean.

For each round $t = 1, 2, \dots, n$:
(1) The forecaster chooses an arm $I_t \in \{1, \dots, K\}$.
(2) The environment draws the reward $Y_t$ from $\nu_{I_t}$ (independently from the past given $I_t$).

At the end of the $n$ rounds, the forecaster outputs a recommendation $J_n \in \{1, \dots, K\}$.

Goal: find the best arm, i.e., the arm with maximal mean.
Regret: $e_n = P(J_n \neq i^*)$.
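As a rough illustration of the protocol above, the following Python sketch simulates the interaction loop for Bernoulli arms. The names `play`, `choose`, and `recommend` are hypothetical and introduced only for this example, and the round-robin forecaster is a simple placeholder rather than a recommended strategy.

```python
import random

def play(means, n, choose, recommend, seed=0):
    """Run n rounds against Bernoulli arms and return the recommendation J_n.

    means     -- arm means, unknown to the forecaster (used only to draw rewards)
    choose    -- hypothetical callback: (t, history) -> arm index I_t
    recommend -- hypothetical callback: history -> arm index J_n
    """
    rng = random.Random(seed)
    history = []  # list of (I_t, Y_t) pairs observed so far
    for t in range(1, n + 1):
        arm = choose(t, history)
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Y_t drawn from nu_{I_t}
        history.append((arm, reward))
    return recommend(history)

def round_robin(K):
    """Placeholder forecaster: cycle through the K arms."""
    return lambda t, history: (t - 1) % K

def best_empirical_mean(K):
    """Recommend the arm with the highest empirical mean."""
    def recommend(history):
        sums, counts = [0.0] * K, [0] * K
        for arm, reward in history:
            sums[arm] += reward
            counts[arm] += 1
        return max(range(K), key=lambda i: sums[i] / max(counts[i], 1))
    return recommend

means = [0.9, 0.5, 0.1]  # made-up example means
J_n = play(means, n=300, choose=round_robin(3), recommend=best_empirical_mean(3))
```

Arm indices are 0-based in the code, whereas the slides number arms from 1.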

Motivating examples

Clinical trials for cosmetic products. During the test phase, several formulæ for a cream are sequentially tested, and after a finite time one is chosen for commercialization.

Channel allocation for mobile phone communications. Cellphones can explore the set of channels to find the best one to operate. Each evaluation of a channel is noisy, and there is a limited number of evaluations before the communication starts on the chosen channel.

Summary of the talk

Let $\mu_i$ be the mean of $\nu_i$, and $\Delta_i = \mu_{i^*} - \mu_i$ the suboptimality of arm $i$.

Main theoretical result: it requires of the order of $H = \sum_{i \neq i^*} 1/\Delta_i^2$ rounds to find the best arm. Note that this result is well known for $K = 2$.

We present two new forecasters, Successive Rejects (SR) and Adaptive UCB-E (Upper Confidence Bound Exploration).
SR is parameter-free and has optimal guarantees (up to a logarithmic factor).
Adaptive UCB-E has no theoretical guarantees, but it experimentally outperforms SR.
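To make the definition of $H$ concrete, here is a small worked example with three arms. The means are made up for illustration and do not come from the talk.

```latex
\mu = (0.5,\ 0.4,\ 0.3), \qquad i^* = 1, \qquad \Delta_2 = 0.1, \quad \Delta_3 = 0.2,
\\
H = \sum_{i \neq i^*} \frac{1}{\Delta_i^2}
  = \frac{1}{0.1^2} + \frac{1}{0.2^2}
  = 100 + 25 = 125.
```

The arm closest to the best one dominates: it contributes 100 of the 125.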

Lower Bound

Theorem. Let $\nu_1, \dots, \nu_K$ be Bernoulli distributions with parameters in $[1/3, 2/3]$. There exists a numerical constant $c > 0$ such that for any forecaster, up to a permutation of the arms,
$$e_n \geq \exp\left( -c \, (1 + o(1)) \, \frac{n \log(K)}{H} \right).$$

Informally, any algorithm requires at least (of the order of) $H / \log(K)$ rounds to find the best arm.
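The informal statement follows by inverting the bound. A short sketch, treating the $o(1)$ term as negligible and writing $\delta$ (a symbol introduced here, not in the slides) for a target error level:

```latex
e_n \le \delta
\;\Longrightarrow\;
\exp\!\left( -c \, \frac{n \log(K)}{H} \right) \lesssim \delta
\;\Longrightarrow\;
n \gtrsim \frac{H}{c \, \log(K)} \, \log\frac{1}{\delta}.
```

For a fixed error level, the required number of rounds therefore scales like $H / \log(K)$.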

Lower Bound

Theorem. Let $\nu_1, \dots, \nu_K$ be Bernoulli distributions with parameters in $[1/3, 2/3]$. There exists a numerical constant $c > 0$ such that for any forecaster, up to a permutation of the arms,
$$e_n \geq \exp\left( -c \left( 1 + \frac{K \log(K)}{\sqrt{n}} \right) \frac{n \log(K)}{H} \right).$$

Informally, any algorithm requires at least (of the order of) $H / \log(K)$ rounds to find the best arm.

Uniform strategy

For each $i \in \{1, \dots, K\}$, select arm $i$ during $\lfloor n/K \rfloor$ rounds.
Let $J_n \in \operatorname{argmax}_{i \in \{1, \dots, K\}} \widehat{X}_{i, \lfloor n/K \rfloor}$, where $\widehat{X}_{i,s}$ is the empirical mean of arm $i$ after $s$ pulls.

Theorem. The uniform strategy satisfies
$$e_n \leq 2K \exp\left( -\frac{n \min_i \Delta_i^2}{2K} \right).$$
For any $(\delta_1, \dots, \delta_K)$ with $\min_i \delta_i \leq 1/2$, there exist distributions such that $\Delta_1 = \delta_1, \dots, \Delta_K = \delta_K$ and
$$e_n \geq \frac{1}{2} \exp\left( -\frac{8 n \min_i \Delta_i^2}{K} \right).$$

Informally, the uniform strategy finds the best arm with (of the order of) $K / \min_i \Delta_i^2$ rounds. For large $K$, this can be significantly larger than $H = \sum_{i \neq i^*} 1/\Delta_i^2$.
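The uniform strategy is simple enough to simulate directly. The sketch below assumes Bernoulli arms and $n \geq K$; the function names and the Monte Carlo error estimate are illustrative additions, not part of the talk.

```python
import random

def pull(mean, rng):
    """Draw one Bernoulli reward with the given mean."""
    return 1.0 if rng.random() < mean else 0.0

def uniform_strategy(means, n, rng):
    """Pull every arm floor(n / K) times, then recommend the best empirical mean.

    Assumes n >= K so that each arm is pulled at least once.
    """
    K = len(means)
    pulls_per_arm = n // K
    empirical = []
    for mu in means:
        total = sum(pull(mu, rng) for _ in range(pulls_per_arm))
        empirical.append(total / pulls_per_arm)
    # argmax over arms (ties broken by lowest index)
    return max(range(K), key=lambda i: empirical[i])

def estimate_error(means, n, runs=2000, seed=0):
    """Monte Carlo estimate of e_n = P(J_n != i*)."""
    rng = random.Random(seed)
    best = max(range(len(means)), key=lambda i: means[i])
    errors = sum(uniform_strategy(means, n, rng) != best for _ in range(runs))
    return errors / runs
```

Calling `estimate_error` with a fixed set of means and increasing `n` gives an empirical view of how the error probability decays with the budget.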

UCB-E

Draw each arm once.
For each round $t = K+1, K+2, \dots, n$: draw
$$I_t \in \operatorname{argmax}_{i \in \{1, \dots, K\}} \left( \widehat{X}_{i, T_i(t-1)} + \sqrt{\frac{n/H}{2 \, T_i(t-1)}} \right),$$
where $T_i(t-1)$ is the number of times arm $i$ was pulled up to time $t-1$.
Let $J_n \in \operatorname{argmax}_{i \in \{1, \dots, K\}} \widehat{X}_{i, T_i(n)}$.

Theorem. UCB-E satisfies
$$e_n \leq n \exp\left( -\frac{n}{50 H} \right).$$

UCB-E finds the best arm with (of the order of) $H$ rounds, but it requires the knowledge of $H = \sum_{i \neq i^*} 1/\Delta_i^2$.
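A minimal Python sketch of this forecaster, assuming Bernoulli arms and that $H$ is supplied by the caller (which is exactly the knowledge the slide says UCB-E needs). The function name `ucb_e` and the simulation details are illustrative assumptions.

```python
import math
import random

def ucb_e(means, n, H, rng):
    """Simulate UCB-E on Bernoulli arms; returns (J_n, pull counts).

    Assumes n >= K. The exploration bonus is sqrt((n / H) / (2 * T_i)),
    as on the slide.
    """
    K = len(means)
    counts = [0] * K    # T_i: number of pulls of arm i so far
    sums = [0.0] * K    # cumulative reward of arm i

    def draw(i):
        sums[i] += 1.0 if rng.random() < means[i] else 0.0
        counts[i] += 1

    for i in range(K):          # rounds 1..K: draw each arm once
        draw(i)
    for _ in range(K, n):       # rounds K+1..n
        index = [sums[i] / counts[i] + math.sqrt((n / H) / (2 * counts[i]))
                 for i in range(K)]
        draw(max(range(K), key=lambda i: index[i]))

    J = max(range(K), key=lambda i: sums[i] / counts[i])
    return J, counts

means = [0.9, 0.5, 0.1]                               # made-up example means
H = sum(1.0 / (0.9 - m) ** 2 for m in means[1:])      # H for these means
J, counts = ucb_e(means, 300, H, random.Random(0))
```

The returned `counts` show how the budget was split across arms, which is the main behavioural difference from the uniform strategy.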

Successive Rejects (SR)

Let $\overline{\log}(K) = \frac{1}{2} + \sum_{i=2}^{K} \frac{1}{i}$, $A_1 = \{1, \dots, K\}$, $n_0 = 0$, and
$$n_k = \left\lceil \frac{1}{\overline{\log}(K)} \, \frac{n - K}{K + 1 - k} \right\rceil \quad \text{for } k \in \{1, \dots, K-1\}.$$

For each phase $k = 1, 2, \dots, K-1$:
(1) For each $i \in A_k$, select arm $i$ during $n_k - n_{k-1}$ rounds.
(2) Let $A_{k+1} = A_k \setminus \operatorname{argmin}_{i \in A_k} \widehat{X}_{i, n_k}$, where $\widehat{X}_{i,s}$ represents the empirical mean of arm $i$ after $s$ pulls.
Let $J_n$ be the unique element of $A_K$.

Motivation for choosing $n_k$
Consider $\mu_1 > \mu_2 = \dots = \mu_M \gg \mu_{M+1} = \dots = \mu_K$.
Target: draw the $M$ best arms $n/M$ times each.
SR: the $M$ best arms are drawn more than $n_{K-M+1} \approx \frac{1}{\overline{\log}(K)} \frac{n}{M}$ times.
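The phase lengths can be computed directly from the definitions above. The sketch below also totals the pulls used across all phases, to check that the schedule fits within the budget $n$; the function names are hypothetical.

```python
import math

def log_bar(K):
    """The quantity 1/2 + sum_{i=2}^{K} 1/i from the SR definition."""
    return 0.5 + sum(1.0 / i for i in range(2, K + 1))

def sr_schedule(n, K):
    """Return [n_0, n_1, ..., n_{K-1}] for budget n and K arms."""
    lb = log_bar(K)
    return [0] + [math.ceil((n - K) / (lb * (K + 1 - k))) for k in range(1, K)]

def total_pulls(n, K):
    """Total number of pulls used by SR over its K - 1 phases."""
    ns = sr_schedule(n, K)
    # In phase k there are K + 1 - k active arms, each pulled n_k - n_{k-1} times.
    return sum((K + 1 - k) * (ns[k] - ns[k - 1]) for k in range(1, K))

schedule = sr_schedule(100, 4)   # [0, 16, 21, 31]
used = total_pulls(100, 4)       # 99, within the budget of 100
```

Later phases are longer per arm, so the surviving arms, which are the hardest to tell apart, receive the most pulls.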

Successive Rejects (SR)

Theorem. SR, with $\overline{\log}(K)$, $n_k$, and the phases defined as above, satisfies
$$e_n \leq K \exp\left( -\frac{n}{4 H \log K} \right).$$
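For reference, a compact simulation of SR on Bernoulli arms might look like the sketch below. It assumes $n > K$ so that every arm is pulled at least once in the first phase; the function name and the tie-breaking rule (lowest index is rejected first) are choices made for this example.

```python
import math
import random

def successive_rejects(means, n, rng):
    """Simulate SR on Bernoulli arms; returns (J_n, pull counts). Assumes n > K."""
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    n_k = [0] + [math.ceil((n - K) / (log_bar * (K + 1 - k))) for k in range(1, K)]

    active = list(range(K))   # A_k: arms not yet rejected
    counts = [0] * K
    sums = [0.0] * K
    for k in range(1, K):
        # (1) pull every active arm n_k - n_{k-1} more times
        for i in active:
            for _ in range(n_k[k] - n_k[k - 1]):
                sums[i] += 1.0 if rng.random() < means[i] else 0.0
                counts[i] += 1
        # (2) reject the active arm with the lowest empirical mean
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)
    return active[0], counts

J, counts = successive_rejects([0.9, 0.5, 0.1], 300, random.Random(1))
```

No knowledge of $H$ or of the gaps enters the code, which reflects the parameter-free nature of SR.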

UCB-E

Parameter: exploration constant $c > 0$.
Draw each arm once.
For each round $t = K+1, K+2, \dots, n$: draw
$$I_t \in \operatorname{argmax}_{i \in \{1, \dots, K\}} \left( \widehat{X}_{i, T_i(t-1)} + \sqrt{\frac{c \, n / H}{T_i(t-1)}} \right),$$
where $T_i(t-1)$ is the number of times arm $i$ was pulled up to time $t-1$.
Let $J_n \in \operatorname{argmax}_{i \in \{1, \dots, K\}} \widehat{X}_{i, T_i(n)}$.

Adaptive UCB-E

Parameter: exploration constant $c > 0$.
For each round $t = 1, 2, \dots, n$:
(1) Compute an (under)estimate $\widehat{H}_t$ of $H$.
(2) Draw $I_t \in \operatorname{argmax}_{i \in \{1, \dots, K\}} \left( \widehat{X}_{i, T_i(t-1)} + \sqrt{\frac{c \, n / \widehat{H}_t}{T_i(t-1)}} \right)$.
Let $J_n \in \operatorname{argmax}_{i \in \{1, \dots, K\}} \widehat{X}_{i, T_i(n)}$.

Overestimating $H$ ⇒ low exploration of the arms ⇒ potential missing of the optimal arm ⇒ all $\Delta_i$ badly estimated.
Underestimating $H$ ⇒ higher exploration ⇒ not focusing enough on the arms ⇒ bad estimation of $H = \sum_{i \neq i^*} 1/\Delta_i^2$.
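The slide does not say how $\widehat{H}_t$ is computed, so the sketch below substitutes a simple plug-in estimate built from the current empirical gaps, floored to keep it finite. This estimator (`plug_in_H`, `gap_floor`) is an assumption made purely for illustration: it is not the authors' estimator and is not guaranteed to underestimate $H$. Pulling unpulled arms first is likewise an assumed convention for the case $T_i(t-1) = 0$.

```python
import math
import random

def plug_in_H(emp_means, gap_floor=0.05):
    """Hypothetical estimate of H from empirical means (not the authors' estimator).

    Empirical gaps are floored at gap_floor so the estimate stays finite.
    """
    best = max(range(len(emp_means)), key=lambda i: emp_means[i])
    return sum(1.0 / max(emp_means[best] - m, gap_floor) ** 2
               for i, m in enumerate(emp_means) if i != best)

def adaptive_ucb_e(means, n, c, rng):
    """Simulate the adaptive scheme on Bernoulli arms. Assumes K >= 2 and n >= K."""
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, n + 1):
        unpulled = [i for i in range(K) if counts[i] == 0]
        if unpulled:
            arm = unpulled[0]   # assumed convention: unpulled arms go first
        else:
            emp = [sums[i] / counts[i] for i in range(K)]
            H_hat = plug_in_H(emp)                                  # step (1)
            arm = max(range(K), key=lambda i:                       # step (2)
                      emp[i] + math.sqrt(c * n / H_hat / counts[i]))
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
    J = max(range(K), key=lambda i: sums[i] / counts[i])
    return J, counts

J, counts = adaptive_ucb_e([0.9, 0.5, 0.1], 300, 1.0, random.Random(0))
```

Swapping in a different `plug_in_H` is the natural place to experiment with the over- and underestimation effects described above.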

Experiments with Bernoulli distributions

Experiment 5: arithmetic progression, $K = 15$, $\mu_i = 0.5 - 0.025\,i$, $i \in \{1, \dots, 15\}$.
Experiment 7: three groups of bad arms, $K = 30$, $\mu_1 = 0.5$, $\mu_{2:6} = 0.45$, $\mu_{7:20} = 0.43$, $\mu_{21:30} = 0.38$.

[Figure: two panels, "Experiment 5, n=4000" and "Experiment 7, n=6000", each plotting the probability of error for forecaster configurations 1 to 14. Legend: 1: Unif; 2-4: HR; 5: SR; 6-9: UCB-E; 10-14: Ad UCB-E.]
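To get a feel for how hard these two problems are, the following sketch builds the mean vectors as specified and computes $H$ alongside the uniform strategy's scale $K / \min_i \Delta_i^2$. The helper names are hypothetical, and the minimum is taken over suboptimal arms only.

```python
def complexity_H(means):
    """H = sum over suboptimal arms of 1 / gap^2 (assumes a unique best arm)."""
    best = max(means)
    gaps = sorted(best - m for m in means)[1:]   # drop the zero gap of the best arm
    return sum(1.0 / g ** 2 for g in gaps)

def uniform_scale(means):
    """K / (smallest positive gap)^2, the order of rounds for the uniform strategy."""
    best = max(means)
    min_gap = min(best - m for m in means if m < best)
    return len(means) / min_gap ** 2

# Experiment 5: arithmetic progression, K = 15
exp5 = [0.5 - 0.025 * i for i in range(1, 16)]
# Experiment 7: three groups of bad arms, K = 30
exp7 = [0.5] + [0.45] * 5 + [0.43] * 14 + [0.38] * 10

for name, means in [("Experiment 5", exp5), ("Experiment 7", exp7)]:
    print(name, round(complexity_H(means)), round(uniform_scale(means)))
```

For Experiment 5, $H$ is roughly 2,500 against about 24,000 for the uniform scale; for Experiment 7 the figures are roughly 5,550 and 12,000.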

Conclusion

It requires at least $H / \log(K)$ rounds to find the best arm, with $H = \sum_{i \neq i^*} 1/\Delta_i^2$.
UCB-E requires only $H \log n$ rounds, but also the knowledge of $H$ to tune its parameter.
SR is a parameter-free algorithm that requires less than $H \log^2 K$ rounds to find the best arm.
Adaptive UCB-E does not have theoretical guarantees, but it experimentally outperforms SR.
