Upper confidence bound algorithms

Christos Dimitrakakis
EPFL

November 6, 2013
Outline

1 Introduction
2 Bandit problems
    UCB
3 Structured bandit problems
4 Reinforcement learning problems
    Optimality Criteria
    UCRL
Bandit problems

The stochastic bandit problem

A set of $K$ bandits, with actions $A = \{1, \ldots, K\}$.
Expected reward of the $i$-th bandit: $\mu_i \triangleq E(r_t \mid a_t = i)$.
Maximise
\[
\sum_{t=1}^{T} r_t, \qquad (2.1)
\]
where $T$ is arbitrary. What is a good heuristic strategy?

Definition (Regret)
The (total) regret of a policy $\pi$ relative to the optimal policy is
\[
L_T(\pi) \triangleq \sum_{t=1}^{T} \left( r_t^* - r_t^\pi \right). \qquad (2.2)
\]
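To make the setting concrete, here is a minimal Python sketch (not from the slides: the Bernoulli reward model and all names are illustrative) of a $K$-armed bandit together with the regret of an action sequence in the sense of (2.2):

```python
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandit:
    """K arms; pulling arm i returns a Bernoulli(mu_i) reward."""
    def __init__(self, means):
        self.means = np.asarray(means)          # mu_i = E(r_t | a_t = i)

    def pull(self, i):
        return float(rng.random() < self.means[i])

def total_regret(bandit, actions):
    """Expected regret of a played action sequence: each pull of arm i
    loses Delta_i = mu_star - mu_i relative to the optimal policy."""
    mu_star = bandit.means.max()
    return sum(mu_star - bandit.means[a] for a in actions)
```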
Bandit problems

Empirical average
\[
\hat{\mu}_{t,i} \triangleq \frac{1}{n_{t,i}} \sum_{k=1}^{t} r_{k,i} \, \mathbb{I}\{a_k = i\},
\qquad
n_{t,i} \triangleq \sum_{k=1}^{t} \mathbb{I}\{a_k = i\}.
\]

Algorithm 1 Optimistic initial values
  Input $A$, $R$
  $r_{\max} \triangleq \max R$
  for $t = 1, \ldots$ do
    $u_{t,i} = \dfrac{n_{t-1,i}\,\hat{\mu}_{t-1,i} + r_{\max}}{n_{t-1,i} + 1}$
    $a_t = \arg\max_{i \in A} u_{t,i}$
  end for
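A possible implementation of Algorithm 1, reusing the BernoulliBandit sketch above. The index $u_{t,i}$ is just the empirical mean with one fictitious pull of reward $r_{\max}$ mixed in; the vectorised form below is my phrasing, not the slides':

```python
def optimistic_initial_values(bandit, T, r_max=1.0):
    """Act greedily w.r.t. optimistically initialised means:
    u_{t,i} = (n_{t-1,i} * mu_hat_{t-1,i} + r_max) / (n_{t-1,i} + 1)."""
    K = len(bandit.means)
    n = np.zeros(K)                    # n_{t,i}: pulls of arm i so far
    s = np.zeros(K)                    # reward sums, so mu_hat = s / n
    actions = []
    for t in range(1, T + 1):
        u = (s + r_max) / (n + 1)      # optimistic index (s = n * mu_hat)
        a = int(np.argmax(u))
        r = bandit.pull(a)
        n[a] += 1; s[a] += r
        actions.append(a)
    return actions

bandit = BernoulliBandit([0.9, 0.5, 0.2])
print(total_regret(bandit, optimistic_initial_values(bandit, T=1000)))
```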
Bandit problems

A simple analysis in the deterministic case

Consider the case where $r_{t,i} = \mu_i$ for all bandits. Then $u_{t,i} \ge \mu_i$ for all $t, i$.
At time $t$, we play $i$ if $u_{t,i} \ge u_{t,j}$ for all $j$. But $u_{t,j} \ge \mu_j$.
If $\mu^* \triangleq \max_j \mu_j$, we play $i$ at most
\[
n_{t,i} \le \frac{r_{\max} - \mu^*}{\Delta_i}
\]
times, where $\Delta_i = \mu^* - \mu_i$.
Since every time we play $i$ we lose $\Delta_i$, the regret is
\[
L_T \le \sum_{i \ne i^*} \Delta_i \, \frac{r_{\max} - \mu^*}{\Delta_i} = (K - 1)(r_{\max} - \mu^*),
\]
where $i^*$ is an optimal arm.
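A quick numerical check of this analysis, in the same illustrative setup as above. The exact pull counts can exceed the displayed bound by one because of each arm's first, purely optimistic pull; the point is that they do not grow with $T$:

```python
means = np.array([0.9, 0.5, 0.2])
r_max, mu_star = 1.0, means.max()

n = np.zeros(len(means)); s = np.zeros(len(means))
for t in range(10_000):
    u = (s + r_max) / (n + 1)          # optimistic index
    a = int(np.argmax(u))
    n[a] += 1; s[a] += means[a]        # deterministic reward r_{t,i} = mu_i

for i, mu in enumerate(means):
    if mu < mu_star:
        bound = (r_max - mu_star) / (mu_star - mu)
        print(f"arm {i}: {int(n[i])} pulls, bound + 1 = {bound + 1:.2f}")
```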
Bandit problems: UCB

Algorithm 2 UCB1
  Input $A$, $R$
  $\hat{\mu}_{0,i} = r_{\max}$ for all $i$
  for $t = 1, \ldots$ do
    $u_{t,i} = \hat{\mu}_{t-1,i} + \sqrt{\dfrac{2 \ln t}{n_{t-1,i}}}$
    $a_t = \arg\max_{i \in A} u_{t,i}$
  end for
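A sketch of UCB1 in the same setup. One assumption on my part: rather than the slide's optimistic initialisation $\hat{\mu}_{0,i} = r_{\max}$ (which leaves $n_{0,i} = 0$ in the denominator), this version follows the common convention of playing each arm once first:

```python
import math

def ucb1(bandit, T):
    """UCB1: after one pull per arm, play a_t = argmax_i of
    mu_hat_i + sqrt(2 ln t / n_i)."""
    K = len(bandit.means)
    n = np.zeros(K); s = np.zeros(K)
    actions = []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                   # initialisation round
        else:
            u = s / n + np.sqrt(2 * math.log(t) / n)    # index u_{t,i}
            a = int(np.argmax(u))
        r = bandit.pull(a)
        n[a] += 1; s[a] += r
        actions.append(a)
    return actions
```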
Bandit problems: UCB

Theorem (Auer et al. [?])
The expected regret of UCB1 after $T$ rounds is at most
\[
c_1 \sum_{i : \mu_i < \mu^*} \frac{\ln T}{\Delta_i} + c_2 \sum_{j=1}^{K} \Delta_j.
\]

Proof.
First we prove that
\[
E\, n_{T,i} \le O\!\left( \frac{\ln T}{\Delta_i^2} \right).
\]
Then we note that, due to Wald's identity, the expected regret can be written as
\[
\sum_{i : \mu_i < \mu^*} \Delta_i \, E\, n_{T,i}.
\]
Bandit problems: UCB

Let $B_{t,s} = \sqrt{(2 \ln t)/s}$. Then we can prove, for any integer $c \ge 1$:
\begin{align*}
n_{T,i} &= 1 + \sum_{t=K+1}^{T} \mathbb{I}\{a_t = i\} \\
&\le c + \sum_{t=K+1}^{T} \mathbb{I}\{a_t = i \wedge n_{t-1,i} \ge c\} \\
&\le c + \sum_{t=K+1}^{T} \mathbb{I}\left\{ \hat{\mu}^*_{n^*(t-1)} + B_{t-1, n^*(t-1)} \le \hat{\mu}_{n_i(t-1), i} + B_{t-1, n_i(t-1)} \right\} \\
&\le c + \sum_{t=K+1}^{T} \mathbb{I}\left\{ \min_{0 < s < t} \left( \hat{\mu}^*_s + B_{t-1,s} \right) \le \max_{c \le s_i < t} \left( \hat{\mu}_{s_i, i} + B_{t-1, s_i} \right) \right\} \\
&\le c + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = c}^{t-1} \mathbb{I}\left\{ \hat{\mu}^*_s + B_{t-1,s} \le \hat{\mu}_{s_i, i} + B_{t-1, s_i} \right\}
\end{align*}
Bandit problems: UCB

When the indicator function is true, one of the following holds:
\begin{align*}
\hat{\mu}^*_s &\le \mu^* - B_{t,s} & (2.3) \\
\hat{\mu}_{s_i, i} &\ge \mu_i + B_{t, s_i} & (2.4) \\
\mu^* &< \mu_i + 2 B_{t, s_i} & (2.5)
\end{align*}

Proof idea
Bound the probability of the first two events. Choose $c$ to bound the last term.
Bandit problems: UCB

From the Hoeffding bound:
\begin{align*}
P(\hat{\mu}^*_s \le \mu^* - B_{t,s}) &\le e^{-4 \ln t} = t^{-4} & (2.6) \\
P(\hat{\mu}_{s_i, i} \ge \mu_i + B_{t, s_i}) &\le e^{-4 \ln t} = t^{-4} & (2.7)
\end{align*}
Setting $c = \lceil (8 \ln T)/\Delta_i^2 \rceil$ makes the last event false, since $s_i \ge c$ gives
\[
\mu^* - \mu_i - 2 B_{t, s_i} = \mu^* - \mu_i - 2\sqrt{(2 \ln t)/s_i} \ge \mu^* - \mu_i - \Delta_i = 0.
\]
Summing up all the terms completes the proof:
\[
E\, n_{T,i} \le \frac{8 \ln T}{\Delta_i^2} + 1 + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = c}^{t-1} 2 t^{-4}
\le \frac{8 \ln T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}.
\]
Structured bandit problems

Bandits and optimisation
- Continuous stochastic functions [?, ?, ?]
- Constrained deterministic distributed functions [?]
Structured bandit problems

First idea [?]
Solve a sequence of discrete bandit problems (see the sketch below):
- At epoch $i$, we have some interval $A_i$.
- Split the interval $A_i$ into $k$ regions $A_{i,j}$.
- Run UCB on the resulting $k$-armed bandit problem.
- When a region is sub-optimal with high probability, remove it!
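A minimal sketch of this epoch-based scheme, under strong simplifications of my own: a fixed epoch length, and recursing on the empirically best region rather than eliminating provably sub-optimal ones as the slide describes. It reuses numpy, math and rng from the earlier sketches:

```python
def interval_bandit(f_sample, T, k=4, epoch_len=100):
    """Optimise a noisy function on [0, 1] by repeatedly treating k
    sub-intervals of the current interval as arms of a UCB problem."""
    lo, hi = 0.0, 1.0
    t = 0
    while t < T:
        edges = np.linspace(lo, hi, k + 1)
        n = np.zeros(k); s = np.zeros(k)
        for _ in range(min(epoch_len, T - t)):         # one epoch of UCB plays
            t += 1
            safe_n = np.maximum(n, 1)
            u = np.where(n > 0,
                         s / safe_n + np.sqrt(2 * math.log(t) / safe_n),
                         np.inf)                       # unexplored regions first
            j = int(np.argmax(u))
            x = rng.uniform(edges[j], edges[j + 1])    # play a point in region j
            n[j] += 1; s[j] += f_sample(x)
        j = int(np.argmax(np.where(n > 0, s / np.maximum(n, 1), -np.inf)))
        lo, hi = edges[j], edges[j + 1]                # recurse on the best region
    return (lo + hi) / 2
```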
Structured bandit problems

Tree bandits [?]
Create a tree of coverings, with $(h, i)$ being the $i$-th node at depth $h$. $D(h,i)$ are the descendants and $C(h,i)$ the children of a node. At time $t$ we pick node $(H_t, I_t)$. Each node is picked at most once.
\begin{align*}
n_{h,i}(T) &\triangleq \sum_{t=1}^{T} \mathbb{I}\{(H_t, I_t) \in D(h,i)\} && \text{(visits of $(h,i)$)} \\
\hat{\mu}_{h,i}(T) &\triangleq \frac{1}{n_{h,i}(T)} \sum_{t=1}^{T} r_t \, \mathbb{I}\{(H_t, I_t) \in C(h,i)\} && \text{(reward from $(h,i)$)} \\
C_{h,i}(T) &\triangleq \hat{\mu}_{h,i}(T) + \sqrt{\frac{2 \ln T}{n_{h,i}(T)}} + \nu_1 \rho^h && \text{(confidence bound)} \\
B_{h,i}(T) &\triangleq \min\left\{ C_{h,i}(T),\; \max_{(h+1,j) \in C(h,i)} B_{h+1,j}(T) \right\} && \text{(child bound)}
\end{align*}
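The child bound $B_{h,i}$ is a min/max backup over the tree: a node is never more promising than its own confidence bound, nor than its best child. A minimal sketch of that backup (the dictionary node representation, the $\nu_1, \rho$ defaults, and the treatment of unvisited nodes are my assumptions, not the slides'):

```python
import math

def b_value(node, t, nu1=1.0, rho=0.5):
    """B_{h,i}(t) = min{ C_{h,i}(t), max over children of B_{h+1,j}(t) }."""
    if node["n"] == 0:
        return math.inf                       # unvisited: maximally optimistic
    c = (node["mean"] + math.sqrt(2 * math.log(t) / node["n"])
         + nu1 * rho ** node["h"])            # confidence bound C_{h,i}(t)
    if not node["children"]:
        return c
    return min(c, max(b_value(ch, t, nu1, rho) for ch in node["children"]))

leaf = {"h": 1, "n": 3, "mean": 0.6, "children": []}
root = {"h": 0, "n": 10, "mean": 0.5, "children": [leaf]}
print(b_value(root, t=10))
```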
Reinforcement learning problems: Optimality Criteria

Infinite horizon, discounted
Discount factor $\gamma$ such that
\[
U_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
\quad \Rightarrow \quad
E\, U_t = \sum_{k=0}^{\infty} \gamma^k E\, r_{t+k}. \qquad (4.1)
\]

Geometric horizon, undiscounted
At each step $t$, the process terminates with probability $1 - \gamma$:
\[
U_t^T = \sum_{k=0}^{T-t} r_{t+k}, \quad T \sim \mathrm{Geom}(1 - \gamma)
\quad \Rightarrow \quad
E\, U_t = \sum_{k=0}^{\infty} \gamma^k E\, r_{t+k}. \qquad (4.2)
\]

In both cases the value function is $V_\gamma^\pi(s) \triangleq E(U_t \mid s_t = s)$.
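The fact that (4.1) and (4.2) share the same expectation is easy to check numerically. A self-contained sketch (the i.i.d. normal reward stream and $\gamma = 0.9$ are made-up illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

def rewards(size):
    return rng.normal(1.0, 1.0, size)      # i.i.d. stream with E r_t = 1

# Discounted return: E U_t = sum_k gamma^k * E r_{t+k} = 1 / (1 - gamma) = 10
disc = np.mean([np.sum(gamma ** np.arange(200) * rewards(200))
                for _ in range(10_000)])

# Geometric horizon: undiscounted sum of T rewards, T ~ Geom(1 - gamma)
geom = np.mean([np.sum(rewards(rng.geometric(1 - gamma)))
                for _ in range(10_000)])

print(disc, geom)                          # both approximately 10
```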
Reinforcement learning problems: Optimality Criteria

The expected total reward criterion
\[
V^\pi \triangleq \lim_{T \to \infty} V^{\pi,T},
\qquad
V_t^{\pi,T} \triangleq E^\pi U_t^T. \qquad (4.3)
\]

Dealing with the limit
Consider $\mu$ such that the limit exists for all $\pi$.
\[
V_+^\pi(s) \triangleq E^\pi\!\left( \sum_{t=1}^{\infty} r_t^+ \,\middle|\, s_t = s \right),
\qquad
V_-^\pi(s) \triangleq E^\pi\!\left( \sum_{t=1}^{\infty} r_t^- \,\middle|\, s_t = s \right), \qquad (4.4)
\]
\[
r_t^+ \triangleq \max\{r_t, 0\}, \qquad r_t^- \triangleq \max\{-r_t, 0\}. \qquad (4.5)
\]