Upper confidence bound algorithms

Christos Dimitrakakis
EPFL

November 6, 2013
Outline

1 Introduction
2 Bandit problems
    UCB
3 Structured bandit problems
4 Reinforcement learning problems
    Optimality Criteria
    UCRL
Bandit problems

The stochastic bandit problem

A set of $K$ bandits, with actions $A = \{1, \ldots, K\}$.
Expected reward of the $i$-th bandit: $\mu_i \triangleq E(r_t \mid a_t = i)$.
Maximise
\[
\sum_{t=1}^{T} r_t, \qquad (2.1)
\]
where $T$ is arbitrary. What is a good heuristic strategy?

Definition (Regret)
The (total) regret of a policy $\pi$ relative to the optimal policy is
\[
L_T(\pi) \triangleq \sum_{t=1}^{T} \left( r_t^* - r_t^\pi \right). \qquad (2.2)
\]
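To make the setting concrete, here is a minimal Python sketch (not from the slides: the Bernoulli reward model and all names are illustrative) of a $K$-armed bandit together with the regret of an action sequence in the sense of (2.2):

```python
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandit:
    """K arms; pulling arm i returns a Bernoulli(mu_i) reward."""
    def __init__(self, means):
        self.means = np.asarray(means)          # mu_i = E(r_t | a_t = i)

    def pull(self, i):
        return float(rng.random() < self.means[i])

def total_regret(bandit, actions):
    """Expected regret of a played action sequence: each pull of arm i
    loses Delta_i = mu_star - mu_i relative to the optimal policy."""
    mu_star = bandit.means.max()
    return sum(mu_star - bandit.means[a] for a in actions)
```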
Bandit problems

Empirical average
\[
\hat{\mu}_{t,i} \triangleq \frac{1}{n_{t,i}} \sum_{k=1}^{t} r_{k,i} \, \mathbb{I}\{a_k = i\},
\qquad
n_{t,i} \triangleq \sum_{k=1}^{t} \mathbb{I}\{a_k = i\}.
\]

Algorithm 1 Optimistic initial values
  Input $A$, $R$
  $r_{\max} \triangleq \max R$
  for $t = 1, \ldots$ do
    $u_{t,i} = \dfrac{n_{t-1,i}\,\hat{\mu}_{t-1,i} + r_{\max}}{n_{t-1,i} + 1}$
    $a_t = \arg\max_{i \in A} u_{t,i}$
  end for
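A possible implementation of Algorithm 1, reusing the BernoulliBandit sketch above. The index $u_{t,i}$ is just the empirical mean with one fictitious pull of reward $r_{\max}$ mixed in; the vectorised form below is my phrasing, not the slides':

```python
def optimistic_initial_values(bandit, T, r_max=1.0):
    """Act greedily w.r.t. optimistically initialised means:
    u_{t,i} = (n_{t-1,i} * mu_hat_{t-1,i} + r_max) / (n_{t-1,i} + 1)."""
    K = len(bandit.means)
    n = np.zeros(K)                    # n_{t,i}: pulls of arm i so far
    s = np.zeros(K)                    # reward sums, so mu_hat = s / n
    actions = []
    for t in range(1, T + 1):
        u = (s + r_max) / (n + 1)      # optimistic index (s = n * mu_hat)
        a = int(np.argmax(u))
        r = bandit.pull(a)
        n[a] += 1; s[a] += r
        actions.append(a)
    return actions

bandit = BernoulliBandit([0.9, 0.5, 0.2])
print(total_regret(bandit, optimistic_initial_values(bandit, T=1000)))
```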
Bandit problems

A simple analysis in the deterministic case

Consider the case where $r_{t,i} = \mu_i$ for all bandits. Then $u_{t,i} \ge \mu_i$ for all $t, i$.
At time $t$, we play $i$ if $u_{t,i} \ge u_{t,j}$ for all $j$. But $u_{t,j} \ge \mu_j$.
If $\mu^* \triangleq \max_j \mu_j$, we play $i$ at most
\[
n_{t,i} \le \frac{r_{\max} - \mu^*}{\Delta_i}
\]
times, where $\Delta_i = \mu^* - \mu_i$.
Since every time we play $i$ we lose $\Delta_i$, the regret is
\[
L_T \le \sum_{i \ne i^*} \Delta_i \, \frac{r_{\max} - \mu^*}{\Delta_i} = (K - 1)(r_{\max} - \mu^*),
\]
where $i^*$ is an optimal arm.
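A quick numerical check of this analysis, in the same illustrative setup as above. The exact pull counts can exceed the displayed bound by one because of each arm's first, purely optimistic pull; the point is that they do not grow with $T$:

```python
means = np.array([0.9, 0.5, 0.2])
r_max, mu_star = 1.0, means.max()

n = np.zeros(len(means)); s = np.zeros(len(means))
for t in range(10_000):
    u = (s + r_max) / (n + 1)          # optimistic index
    a = int(np.argmax(u))
    n[a] += 1; s[a] += means[a]        # deterministic reward r_{t,i} = mu_i

for i, mu in enumerate(means):
    if mu < mu_star:
        bound = (r_max - mu_star) / (mu_star - mu)
        print(f"arm {i}: {int(n[i])} pulls, bound + 1 = {bound + 1:.2f}")
```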
Bandit problems: UCB

Algorithm 2 UCB1
  Input $A$, $R$
  $\hat{\mu}_{0,i} = r_{\max}$ for all $i$
  for $t = 1, \ldots$ do
    $u_{t,i} = \hat{\mu}_{t-1,i} + \sqrt{\dfrac{2 \ln t}{n_{t-1,i}}}$
    $a_t = \arg\max_{i \in A} u_{t,i}$
  end for
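A sketch of UCB1 in the same setup. One assumption on my part: rather than the slide's optimistic initialisation $\hat{\mu}_{0,i} = r_{\max}$ (which leaves $n_{0,i} = 0$ in the denominator), this version follows the common convention of playing each arm once first:

```python
import math

def ucb1(bandit, T):
    """UCB1: after one pull per arm, play a_t = argmax_i of
    mu_hat_i + sqrt(2 ln t / n_i)."""
    K = len(bandit.means)
    n = np.zeros(K); s = np.zeros(K)
    actions = []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                   # initialisation round
        else:
            u = s / n + np.sqrt(2 * math.log(t) / n)    # index u_{t,i}
            a = int(np.argmax(u))
        r = bandit.pull(a)
        n[a] += 1; s[a] += r
        actions.append(a)
    return actions
```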
Bandit problems: UCB

Theorem (Auer et al. [?])
The expected regret of UCB1 after $T$ rounds is at most
\[
c_1 \sum_{i : \mu_i < \mu^*} \frac{\ln T}{\Delta_i} + c_2 \sum_{j=1}^{K} \Delta_j.
\]

Proof.
First we prove that
\[
E\, n_{T,i} \le O\!\left( \frac{\ln T}{\Delta_i^2} \right).
\]
Then we note that, due to Wald's identity, the expected regret can be written as
\[
\sum_{i : \mu_i < \mu^*} \Delta_i \, E\, n_{T,i}.
\]
Bandit problems: UCB

Let $B_{t,s} = \sqrt{(2 \ln t)/s}$. Then we can prove, for any integer $c \ge 1$:
\begin{align*}
n_{T,i} &= 1 + \sum_{t=K+1}^{T} \mathbb{I}\{a_t = i\} \\
&\le c + \sum_{t=K+1}^{T} \mathbb{I}\{a_t = i \wedge n_{t-1,i} \ge c\} \\
&\le c + \sum_{t=K+1}^{T} \mathbb{I}\left\{ \hat{\mu}^*_{n^*(t-1)} + B_{t-1, n^*(t-1)} \le \hat{\mu}_{n_i(t-1), i} + B_{t-1, n_i(t-1)} \right\} \\
&\le c + \sum_{t=K+1}^{T} \mathbb{I}\left\{ \min_{0 < s < t} \left( \hat{\mu}^*_s + B_{t-1,s} \right) \le \max_{c \le s_i < t} \left( \hat{\mu}_{s_i, i} + B_{t-1, s_i} \right) \right\} \\
&\le c + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = c}^{t-1} \mathbb{I}\left\{ \hat{\mu}^*_s + B_{t-1,s} \le \hat{\mu}_{s_i, i} + B_{t-1, s_i} \right\}
\end{align*}
Bandit problems: UCB

When the indicator function is true, one of the following holds:
\begin{align*}
\hat{\mu}^*_s &\le \mu^* - B_{t,s} & (2.3) \\
\hat{\mu}_{s_i, i} &\ge \mu_i + B_{t, s_i} & (2.4) \\
\mu^* &< \mu_i + 2 B_{t, s_i} & (2.5)
\end{align*}

Proof idea
Bound the probability of the first two events. Choose $c$ to bound the last term.
Bandit problems: UCB

From the Hoeffding bound:
\begin{align*}
P(\hat{\mu}^*_s \le \mu^* - B_{t,s}) &\le e^{-4 \ln t} = t^{-4} & (2.6) \\
P(\hat{\mu}_{s_i, i} \ge \mu_i + B_{t, s_i}) &\le e^{-4 \ln t} = t^{-4} & (2.7)
\end{align*}
Setting $c = \lceil (8 \ln T)/\Delta_i^2 \rceil$ makes the last event false, since $s_i \ge c$ gives
\[
\mu^* - \mu_i - 2 B_{t, s_i} = \mu^* - \mu_i - 2\sqrt{(2 \ln t)/s_i} \ge \mu^* - \mu_i - \Delta_i = 0.
\]
Summing up all the terms completes the proof:
\[
E\, n_{T,i} \le \frac{8 \ln T}{\Delta_i^2} + 1 + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = c}^{t-1} 2 t^{-4}
\le \frac{8 \ln T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}.
\]
Structured bandit problems

Bandits and optimisation
- Continuous stochastic functions [?, ?, ?]
- Constrained deterministic distributed functions [?]
Structured bandit problems

First idea [?]
Solve a sequence of discrete bandit problems (see the sketch below):
- At epoch $i$, we have some interval $A_i$.
- Split the interval $A_i$ into $k$ regions $A_{i,j}$.
- Run UCB on the resulting $k$-armed bandit problem.
- When a region is sub-optimal with high probability, remove it!
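A minimal sketch of this epoch-based scheme, under strong simplifications of my own: a fixed epoch length, and recursing on the empirically best region rather than eliminating provably sub-optimal ones as the slide describes. It reuses numpy, math and rng from the earlier sketches:

```python
def interval_bandit(f_sample, T, k=4, epoch_len=100):
    """Optimise a noisy function on [0, 1] by repeatedly treating k
    sub-intervals of the current interval as arms of a UCB problem."""
    lo, hi = 0.0, 1.0
    t = 0
    while t < T:
        edges = np.linspace(lo, hi, k + 1)
        n = np.zeros(k); s = np.zeros(k)
        for _ in range(min(epoch_len, T - t)):         # one epoch of UCB plays
            t += 1
            safe_n = np.maximum(n, 1)
            u = np.where(n > 0,
                         s / safe_n + np.sqrt(2 * math.log(t) / safe_n),
                         np.inf)                       # unexplored regions first
            j = int(np.argmax(u))
            x = rng.uniform(edges[j], edges[j + 1])    # play a point in region j
            n[j] += 1; s[j] += f_sample(x)
        j = int(np.argmax(np.where(n > 0, s / np.maximum(n, 1), -np.inf)))
        lo, hi = edges[j], edges[j + 1]                # recurse on the best region
    return (lo + hi) / 2
```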
Structured bandit problems

Tree bandits [?]
Create a tree of coverings, with $(h, i)$ being the $i$-th node at depth $h$. $D(h,i)$ are the descendants and $C(h,i)$ the children of a node. At time $t$ we pick node $(H_t, I_t)$. Each node is picked at most once.
\begin{align*}
n_{h,i}(T) &\triangleq \sum_{t=1}^{T} \mathbb{I}\{(H_t, I_t) \in D(h,i)\} && \text{(visits of $(h,i)$)} \\
\hat{\mu}_{h,i}(T) &\triangleq \frac{1}{n_{h,i}(T)} \sum_{t=1}^{T} r_t \, \mathbb{I}\{(H_t, I_t) \in C(h,i)\} && \text{(reward from $(h,i)$)} \\
C_{h,i}(T) &\triangleq \hat{\mu}_{h,i}(T) + \sqrt{\frac{2 \ln T}{n_{h,i}(T)}} + \nu_1 \rho^h && \text{(confidence bound)} \\
B_{h,i}(T) &\triangleq \min\left\{ C_{h,i}(T),\; \max_{(h+1,j) \in C(h,i)} B_{h+1,j}(T) \right\} && \text{(child bound)}
\end{align*}
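The child bound $B_{h,i}$ is a min/max backup over the tree: a node is never more promising than its own confidence bound, nor than its best child. A minimal sketch of that backup (the dictionary node representation, the $\nu_1, \rho$ defaults, and the treatment of unvisited nodes are my assumptions, not the slides'):

```python
import math

def b_value(node, t, nu1=1.0, rho=0.5):
    """B_{h,i}(t) = min{ C_{h,i}(t), max over children of B_{h+1,j}(t) }."""
    if node["n"] == 0:
        return math.inf                       # unvisited: maximally optimistic
    c = (node["mean"] + math.sqrt(2 * math.log(t) / node["n"])
         + nu1 * rho ** node["h"])            # confidence bound C_{h,i}(t)
    if not node["children"]:
        return c
    return min(c, max(b_value(ch, t, nu1, rho) for ch in node["children"]))

leaf = {"h": 1, "n": 3, "mean": 0.6, "children": []}
root = {"h": 0, "n": 10, "mean": 0.5, "children": [leaf]}
print(b_value(root, t=10))
```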
Reinforcement learning problems: Optimality Criteria

Infinite horizon, discounted
Discount factor $\gamma$ such that
\[
U_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
\quad \Rightarrow \quad
E\, U_t = \sum_{k=0}^{\infty} \gamma^k E\, r_{t+k}. \qquad (4.1)
\]

Geometric horizon, undiscounted
At each step $t$, the process terminates with probability $1 - \gamma$:
\[
U_t^T = \sum_{k=0}^{T-t} r_{t+k}, \quad T \sim \mathrm{Geom}(1 - \gamma)
\quad \Rightarrow \quad
E\, U_t = \sum_{k=0}^{\infty} \gamma^k E\, r_{t+k}. \qquad (4.2)
\]

In both cases the value function is $V_\gamma^\pi(s) \triangleq E(U_t \mid s_t = s)$.
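The fact that (4.1) and (4.2) share the same expectation is easy to check numerically. A self-contained sketch (the i.i.d. normal reward stream and $\gamma = 0.9$ are made-up illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

def rewards(size):
    return rng.normal(1.0, 1.0, size)      # i.i.d. stream with E r_t = 1

# Discounted return: E U_t = sum_k gamma^k * E r_{t+k} = 1 / (1 - gamma) = 10
disc = np.mean([np.sum(gamma ** np.arange(200) * rewards(200))
                for _ in range(10_000)])

# Geometric horizon: undiscounted sum of T rewards, T ~ Geom(1 - gamma)
geom = np.mean([np.sum(rewards(rng.geometric(1 - gamma)))
                for _ in range(10_000)])

print(disc, geom)                          # both approximately 10
```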
Reinforcement learning problems: Optimality Criteria

The expected total reward criterion
\[
V^\pi \triangleq \lim_{T \to \infty} V^{\pi,T},
\qquad
V_t^{\pi,T} \triangleq E^\pi U_t^T. \qquad (4.3)
\]

Dealing with the limit
Consider $\mu$ such that the limit exists for all $\pi$.
\[
V_+^\pi(s) \triangleq E^\pi\!\left( \sum_{t=1}^{\infty} r_t^+ \,\middle|\, s_t = s \right),
\qquad
V_-^\pi(s) \triangleq E^\pi\!\left( \sum_{t=1}^{\infty} r_t^- \,\middle|\, s_t = s \right), \qquad (4.4)
\]
\[
r_t^+ \triangleq \max\{r_t, 0\}, \qquad r_t^- \triangleq \max\{-r_t, 0\}. \qquad (4.5)
\]