CS 101.2: Notes for Lecture 2 (Bandit Problems)
Andreas Krause
January 9, 2009

In these notes we prove logarithmic regret for the UCB1 algorithm (based on Auer et al., 2002).

1 Notation

• $j$: Index of slot machine arm ($1$ to $k$).
• $n$: Total number of plays we will make (known and specified in advance).
• $t$: Number of plays made so far.
• $X_{j,t}$: Random variable for the reward of arm $j$ at time $t$. All $X_{j,t}$ are possibly continuous, but supported in the interval $[0, 1]$ (i.e., they do not take any values outside $[0, 1]$). All $X_{j,t}$ are independent.
• $T_j(t)$: Number of times arm $j$ is pulled during the first $t$ plays. Note that $T_j(t)$ is a random quantity.
• $\mu_j = E[X_{j,t}]$, and $\mu^* = \max_j \mu_j$.
• $\Delta_j = \mu^* - \mu_j$, and $\Delta = \min_{j : \Delta_j > 0} \Delta_j$ (the smallest gap among the suboptimal arms; taking the minimum over all arms would give $\Delta = 0$).
• Expected regret after $t$ plays:

$$R_t = t\mu^* - E\Big[\sum_j T_j(t)\,\mu_j\Big] = \sum_j E[T_j(t)]\,\Delta_j .$$

• $\bar{X}_j(t)$ is the sample average of all rewards obtained from arm $j$ during the first $t$ plays (i.e., if we have observed rewards $x_1, \dots, x_m$ where $m = T_j(t)$, then $\bar{X}_j(t) = \frac{1}{m}(x_1 + \dots + x_m)$).
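The two forms of the regret in the notation above agree because the pull counts sum to $t$. The following minimal sketch checks this for one fixed outcome of the pull counts (the notes take an expectation over $T_j(t)$; here the counts are simply given). The arm means and counts are hypothetical values chosen for illustration.

```python
def regret_two_ways(means, pulls):
    """Regret for fixed pull counts, computed in the two equivalent forms.

    means[j] plays the role of mu_j and pulls[j] the role of T_j(t).
    """
    t = sum(pulls)                      # total number of plays
    mu_star = max(means)                # mean of the best arm
    # Form 1: t * mu_star - sum_j T_j * mu_j
    direct = t * mu_star - sum(T * mu for T, mu in zip(pulls, means))
    # Form 2: sum_j T_j * Delta_j, with Delta_j = mu_star - mu_j
    via_gaps = sum(T * (mu_star - mu) for T, mu in zip(pulls, means))
    return direct, via_gaps

means = [0.9, 0.5, 0.4]   # hypothetical arm means
pulls = [80, 15, 5]       # hypothetical pull counts, t = 100
direct, via_gaps = regret_two_ways(means, pulls)
print(direct, via_gaps)   # both equal 8.5 up to floating-point rounding
```

Only the suboptimal arms contribute to the second form, which is why the analysis only needs to bound $E[T_j(n)]$ for arms with $\Delta_j > 0$.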

2 The Upper Confidence Band algorithm (UCB1)

• Initially, play each arm once (hence $T_j(t) \ge 1$ for all $t \ge k$).
• Loop (for $t = k + 1$ to $n$):
  – For each arm $j$ compute the "index" $v_j = \bar{X}_j(t) + c_j(t)$, where $c_j(t) = \sqrt{\dfrac{\log n}{T_j(t)}}$.
  – Play the arm $j^* = \operatorname{argmax}_j v_j$.

3 Analysis

Theorem 1. If UCB1 is run with input $n$, then its expected regret $R_n$ is $O\!\left(\dfrac{k \log n}{\Delta}\right)$.

Proof. To prove Theorem 1, we will bound $E[T_j(n)]$ for all suboptimal arms $j$. Suppose that, at some time $t$, UCB1 pulls a suboptimal arm $j$. Writing $\bar{X}^*(t)$ and $c^*(t)$ for the sample average and confidence radius of an optimal arm, this means that

$$\bar{X}_j(t) + c_j(t) \ge \bar{X}^*(t) + c^*(t).$$

Hence, in this case,

$$\bar{X}_j(t) + 2c_j(t) - c_j(t) + (\mu_j - \mu_j) \ge \bar{X}^*(t) + c^*(t) + (\mu^* - \mu^*)$$

$$\Leftrightarrow\quad \underbrace{\bar{X}_j(t) - (\mu_j + c_j(t))}_{A} + \underbrace{(\mu_j - \mu^* + 2c_j(t))}_{B} \ge \underbrace{\bar{X}^*(t) - (\mu^* - c^*(t))}_{-C}$$

That is, $A + B + C \ge 0$, so at least one of $A$, $B$ or $C$ must be nonnegative. Moreover, if both $A < 0$ and $C < 0$, then $B > 0$. Therefore at least one of the following inequalities must hold:

$$\bar{X}_j(t) \ge \mu_j + c_j(t) \qquad (1)$$
$$\bar{X}^*(t) \le \mu^* - c^*(t) \qquad (2)$$
$$\mu^* < \mu_j + 2c_j(t) \qquad (3)$$

In order to bound the probability of (1) and (2), we use the Chernoff-Hoeffding inequality:

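Fact 1 (Chernoff-Hoeffding inequality). Let $X_1, \dots, X_n$ be independent random variables supported on $[0, 1]$, with $E[X_i] = \mu$. Then, for every $a > 0$,

$$P\Big(\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu + a\Big) \le e^{-2a^2 n}$$

and

$$P\Big(\frac{1}{n}\sum_{i=1}^{n} X_i \le \mu - a\Big) \le e^{-2a^2 n}.$$

Hence, we can bound the probability of (1) as

$$P\big(\bar{X}_j(t) \ge \mu_j + c_j(t)\big) \le e^{-2 c_j(t)^2\, T_j(t)} = e^{-2 \frac{\log n}{T_j(t)} T_j(t)} = e^{-2\log n} = n^{-2}.$$

Similarly,

$$P\big(\bar{X}^*(t) \le \mu^* - c^*(t)\big) \le n^{-2}.$$

Hence, (1) and (2) are very unlikely events.

Now, note that whenever $T_j(t) \ge \ell = \lceil (4\log n)/\Delta_j^2 \rceil$, inequality (3) must be false, since

$$\mu_j + 2c_j(t) = \mu_j + 2\sqrt{\frac{\log n}{T_j(t)}} \le \mu_j + 2\sqrt{\frac{\Delta_j^2 \log n}{4\log n}} = \mu_j + \Delta_j = \mu^*.$$

Hence, if arm $j$ has been played at least $\ell = O(\log n / \Delta_j^2)$ times, then inequality (3) must be false. In any such round, arm $j$ can only be pulled if (1) or (2) holds, which happens with probability at most $2n^{-2} = O(n^{-2})$.

Now we bound $E[T_j(n)]$. By using conditional expectations, we have (writing $T_j$ instead of $T_j(n)$ for short)

$$E[T_j] = \underbrace{P(T_j \le \ell)}_{\le 1}\,\underbrace{E[T_j \mid T_j \le \ell]}_{\le \ell} + \underbrace{P(T_j > \ell)}_{\le 2n^{-2}}\,\underbrace{E[T_j \mid T_j > \ell]}_{\le n} \le \ell + 2n^{-1},$$

since we have $P(T_j > \ell) \le P(\text{inequality (1) or (2) holds}) \le 2n^{-2}$.

Finally, summing over the suboptimal arms and using $\ell \le 4\log n/\Delta_j^2 + 1$ and $\Delta \le \Delta_j \le 1$,

$$R_n = \sum_{j : \Delta_j > 0} E[T_j(n)]\,\Delta_j \le \sum_{j : \Delta_j > 0} \Big(\frac{4\log n}{\Delta_j} + \Delta_j + \frac{2\Delta_j}{n}\Big) = O\!\left(\frac{k\log n}{\Delta}\right),$$

which proves Theorem 1.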
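The algorithm of Section 2 and the threshold $\ell$ from the proof can be explored with a small simulation. The sketch below is one possible implementation, not the authors' code: it assumes Bernoulli rewards (one convenient distribution supported on $[0, 1]$), and the arm means, horizon and seed are hypothetical choices.

```python
import math
import random

def ucb1(means, n, seed=0):
    """Run fixed-horizon UCB1 on Bernoulli arms; return the pull counts T_j(n)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # T_j(t): number of pulls of each arm so far
    sums = [0.0] * k      # total reward collected from each arm

    def pull(j):
        # Assumption: Bernoulli reward with success probability means[j].
        reward = 1.0 if rng.random() < means[j] else 0.0
        counts[j] += 1
        sums[j] += reward

    for j in range(k):    # initialization: play each arm once
        pull(j)
    for _ in range(k, n): # remaining n - k plays
        # Index v_j = sample average + confidence radius sqrt(log n / T_j(t)).
        index = [sums[j] / counts[j] + math.sqrt(math.log(n) / counts[j])
                 for j in range(k)]
        pull(max(range(k), key=index.__getitem__))
    return counts

means = [0.9, 0.5]        # hypothetical arm means; arm 0 is optimal
n = 10_000
counts = ucb1(means, n)
gap = max(means) - means[1]
ell = math.ceil(4 * math.log(n) / gap ** 2)   # threshold from the proof
print("pull counts:", counts)
print("ell for the suboptimal arm:", ell)
```

A single run only shows one realization of $T_j(n)$, whereas the proof bounds its expectation; averaging the counts over several seeds gives a fairer comparison with $\ell + 2n^{-1}$.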

4 Some additional remarks

Note that, as stated in Section 2, the total number of plays $n$ needs to be specified in advance. By setting

$$c_j(t) = \sqrt{\frac{2\log t}{T_j(t)}},$$

we can avoid this issue. A slightly more complex analysis (Auer et al., 2002) shows that in this case, after any number $t$ of plays, it holds that

$$R_t = O\!\left(\frac{k\log t}{\Delta}\right).$$
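As a rough illustration of the horizon-free variant, the sketch below selects arms using the radius $\sqrt{2\log t / T_j(t)}$, so that no total number of plays has to be supplied. It is a sketch under assumptions: rewards are Bernoulli, $t$ is taken to be the index of the current play, and the function name `select_arm`, the arm means and the seed are hypothetical.

```python
import math
import random

def select_arm(counts, sums, t):
    """Choose the arm for play number t using the radius sqrt(2 log t / T_j).

    Arms that have never been played are chosen first, which reproduces the
    initial round in which every arm is played once.
    """
    for j, c in enumerate(counts):
        if c == 0:
            return j
    index = [s / c + math.sqrt(2 * math.log(t) / c)
             for s, c in zip(sums, counts)]
    return max(range(len(counts)), key=index.__getitem__)

rng = random.Random(1)
means = [0.9, 0.5, 0.3]          # hypothetical Bernoulli means; arm 0 is optimal
counts = [0] * len(means)
sums = [0.0] * len(means)

# The loop can be stopped at any time; 5000 plays is an arbitrary choice.
for t in range(1, 5001):
    j = select_arm(counts, sums, t)
    reward = 1.0 if rng.random() < means[j] else 0.0
    counts[j] += 1
    sums[j] += reward

print("pull counts:", counts)
```

Because the radius depends only on the current play index, the same selection rule can be called from an open-ended loop, which is the practical benefit of this variant.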
