An Asymptotically Optimal Bandit Algorithm for Bounded Support Models Junya Honda and Akimichi Takemura The University of Tokyo COLT 2010

Outline
• Introduction
• DMED policy
  – Proof of the optimality
  – Efficient computation
• Simulation results
• Conclusion

Multiarmed bandit problem
• Model of a gambler playing a slot machine with multiple arms
• Example of a dilemma between exploration and exploitation
• $K$-armed stochastic bandit problem
  – Burnetas and Katehakis derived an asymptotic bound on the regret
• Model of reward distributions with support in [0,1]
  – UCB policies by Auer et al. are widely used in practice
  – Policies achieving the bound have not been known
  – We propose the DMED policy, which achieves the bound

Notation
• $\mathcal{A}$: family of distributions with support in [0,1]
• $F_i \in \mathcal{A}$: probability distribution of arm $i = 1, \dots, K$
• $\mu_i = \mathrm{E}(F_i)$: expectation of arm $i$ ($\mathrm{E}(F)$: expectation of distribution $F$)
• $\mu^* = \max_i \mu_i$: maximum expectation of arms
• $T_i(n)$: number of times that arm $i$ has been pulled through the first $n$ rounds
Goal: minimize the regret
$$\sum_{i:\,\mu_i < \mu^*} (\mu^* - \mu_i)\, T_i(n)$$
by reducing each $T_i(n)$ for suboptimal arm $i$
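As a small illustration of the regret defined above, the following sketch evaluates $\sum_{i:\mu_i<\mu^*}(\mu^*-\mu_i)\,T_i(n)$ from a list of arm expectations and pull counts. The function name `regret` and the example numbers are made up for illustration and do not come from the slides.

```python
from typing import Sequence


def regret(means: Sequence[float], pulls: Sequence[int]) -> float:
    """Sum of (mu* - mu_i) * T_i(n) over the suboptimal arms."""
    best = max(means)  # mu*: maximum expectation among the arms
    return sum((best - mu) * t for mu, t in zip(means, pulls) if mu < best)


# Hypothetical example: three arms, 100 rounds in total.
example_means = [0.9, 0.5, 0.7]
example_pulls = [80, 5, 15]
print(regret(example_means, example_pulls))  # roughly 5.0, up to rounding
```

Only the pulls of suboptimal arms contribute, which is why the analysis concentrates on bounding each $T_i(n)$ separately.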

Asymptotic bound
Burnetas and Katehakis (1996)
• Under any policy satisfying a mild condition (consistency), for all $F = (F_1, \dots, F_K) \in \mathcal{A}^K$ and suboptimal $i$,
$$\mathrm{E}_F[T_i(n)] \ge \left( \frac{1}{D_{\min}(F_i, \mu^*)} - o(1) \right) \log n$$
where
$$D_{\min}(F, \mu) = \min_{H \in \mathcal{A}:\, \mathrm{E}(H) \ge \mu} D(F \| H)$$
and $D(F \| H) = \mathrm{E}_F\!\left[ \log \frac{\mathrm{d}F}{\mathrm{d}H} \right]$ is the Kullback-Leibler divergence
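The quantity $D_{\min}(F, \mu)$ can be evaluated numerically for a distribution with finite support. The sketch below is an illustration under a stated assumption: it uses the dual form $D_{\min}(F,\mu) = \max_{0 \le \nu \le 1/(1-\mu)} \mathrm{E}_F[\log(1 - (X-\mu)\nu)]$, which is recalled from the accompanying paper and is not shown in this part of the slides. The function name `d_min` and the ternary search over $\nu$ are choices made here, not the authors' procedure.

```python
import math
from typing import Sequence


def d_min(xs: Sequence[float], ps: Sequence[float], mu: float, iters: int = 200) -> float:
    """D_min(F, mu) for F with atoms xs in [0,1] and probabilities ps.

    Assumes the dual form max over 0 <= nu <= 1/(1-mu) of E_F[log(1-(X-mu)nu)].
    """
    mean = sum(x * p for x, p in zip(xs, ps))
    if mean >= mu:
        return 0.0  # F itself satisfies E(F) >= mu
    if mu >= 1.0:
        return math.inf  # only the point mass at 1 has expectation 1

    def objective(nu: float) -> float:
        total = 0.0
        for x, p in zip(xs, ps):
            if p == 0.0:
                continue
            arg = 1.0 - (x - mu) * nu
            if arg <= 0.0:
                return -math.inf
            total += p * math.log(arg)
        return total

    # The objective is concave in nu, so a ternary search finds its maximum.
    lo, hi = 0.0, 1.0 / (1.0 - mu)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return max(0.0, objective(0.5 * (lo + hi)))


# Hypothetical suboptimal arm: Bernoulli(0.5) when the best expectation is 0.9.
divergence = d_min([0.0, 1.0], [0.5, 0.5], 0.9)
print(divergence, 1.0 / divergence)  # D_min and the coefficient of log n in the bound
```

For a distribution on $\{0,1\}$ the result should agree with the Bernoulli Kullback-Leibler divergence, which gives a quick sanity check on the assumed dual form.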

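Visualization of $D_{\min}(F, \mu) = \min_{H \in \mathcal{A}:\, \mathrm{E}(H) \ge \mu} D(F \| H)$
[Figure: the family $\mathcal{A}$ containing $F$, the subset $\{H \in \mathcal{A} : \mathrm{E}(H) \ge \mu\}$ bounded by $\mathrm{E}(H) = \mu$ (with $\mathrm{E}(H)$ large beyond it), and $D_{\min}(F, \mu)$ drawn from $F$ to this subset]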

DMED policy
• Deterministic Minimum Empirical Divergence policy
For each loop, DMED chooses arms to pull in this way:
1. For each arm $i$, check the condition
$$T_i(n)\, D_{\min}(\hat{F}_i(n), \hat{\mu}^*(n)) \le \log n$$
   – $\hat{F}_i(n)$: empirical distribution of arm $i$ at the $n$-th round
   – $\hat{\mu}^*(n)$: maximum sample mean at the $n$-th round
   (The condition is always true for the currently best arm)
2. Pull all of the arms for which the condition is true
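The two steps above can be sketched as a short simulation. This is a simplified illustration, not the authors' implementation: it assumes Bernoulli rewards (for empirical distributions on $\{0,1\}$, $D_{\min}$ reduces to the Bernoulli Kullback-Leibler divergence), it pulls every arm once before the first loop, and it evaluates the condition for all arms at the start of each loop. The names `run_dmed`, `bernoulli_kl` and `d_min_bernoulli` are hypothetical.

```python
import math
import random
from typing import List, Sequence


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), with 0 log 0 = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0 if p == q else math.inf
    kl = 0.0
    if p > 0.0:
        kl += p * math.log(p / q)
    if p < 1.0:
        kl += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return kl


def d_min_bernoulli(p_hat: float, mu: float) -> float:
    """D_min for a 0/1 empirical distribution with mean p_hat."""
    return 0.0 if p_hat >= mu else bernoulli_kl(p_hat, mu)


def run_dmed(means: Sequence[float], horizon: int, seed: int = 0) -> List[int]:
    """Simulate the simplified DMED loop; returns the pull count of each arm."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k

    def pull(i: int) -> None:
        sums[i] += 1.0 if rng.random() < means[i] else 0.0
        pulls[i] += 1

    for i in range(k):  # assumption: one initial pull per arm
        pull(i)
    n = k
    while n < horizon:
        mu_hat = [sums[i] / pulls[i] for i in range(k)]
        mu_star = max(mu_hat)
        # Step 1: check T_i(n) * D_min(F_i_hat, mu_star_hat) <= log n for each arm.
        chosen = [i for i in range(k)
                  if pulls[i] * d_min_bernoulli(mu_hat[i], mu_star) <= math.log(n)]
        # Step 2: pull every arm that passed the check.
        for i in chosen:
            if n >= horizon:
                break
            pull(i)
            n += 1
    return pulls


print(run_dmed([0.9, 0.5], horizon=2000))
```

Because the currently best arm always satisfies the condition, every loop pulls at least one arm, so the simulation terminates after `horizon` rounds.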

Main theorem
Under the DMED policy, for every suboptimal arm $i$,
$$\mathrm{E}_F[T_i(n)] \le \left( \frac{1}{D_{\min}(F_i, \mu^*)} + o(1) \right) \log n$$
Asymptotic bound:
$$\mathrm{E}_F[T_i(n)] \ge \left( \frac{1}{D_{\min}(F_i, \mu^*)} - o(1) \right) \log n$$
DMED is asymptotically optimal

Intuitive interpretation (1)
• Assume $K = 2$ and consider the event
  – $\hat{\mu}_1(n) < \hat{\mu}_2(n) = \hat{\mu}^*(n)$
  – $T_1(n) \ll T_2(n)$
• How likely is arm 1 actually the best?
  – $\mu_2 \approx \hat{\mu}_2$ is far more likely than $\mu_1 \approx \hat{\mu}_1$
• How likely is the hypothesis $\mu_1 \ge \hat{\mu}_2$?

Intuitive interpretation (2)
• By Sanov's theorem in large deviation theory,
$$P[\text{empirical distribution from } F_1 \text{ comes close to } \hat{F}_1] \approx \exp(-T_1(n)\, D(\hat{F}_1 \| F_1))$$
  where $T_1(n)$ is the number of samples
• Maximum likelihood of $\mu_1 \ge \hat{\mu}^*$ is
$$\max_{H \in \mathcal{A}:\, \mathrm{E}(H) \ge \hat{\mu}^*} \exp(-T_1(n)\, D(\hat{F}_1 \| H)) = \exp\!\left( -T_1(n) \min_{H \in \mathcal{A}:\, \mathrm{E}(H) \ge \hat{\mu}^*} D(\hat{F}_1 \| H) \right) = \exp(-T_1(n)\, D_{\min}(\hat{F}_1, \hat{\mu}^*))$$
[Figure: the family $\mathcal{A}$ with $\hat{F}_1$, the boundary $\mathrm{E}(H) = \hat{\mu}^*$, the divergence $D(\hat{F}_1 \| F)$ to a candidate $F$, and $D_{\min}(\hat{F}_1, \hat{\mu}^*)$]

Intuitive interpretation (3)
• Maximum likelihood that arm $i$ is actually the best:
$$\exp(-T_i(n)\, D_{\min}(\hat{F}_i, \hat{\mu}^*))$$
• In the DMED policy, arm $i$ is pulled when
$$T_i(n)\, D_{\min}(\hat{F}_i, \hat{\mu}^*) \le \log n$$
  that is, when the maximum likelihood above is at least $1/n$
  – Arm $i$ is pulled if
    ‣ the maximum likelihood is large
    ‣ the round number $n$ is large
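A small numerical example may help connect the likelihood view with the pulling rule. The sketch below assumes 0/1 rewards, so that $D_{\min}$ reduces to the Bernoulli Kullback-Leibler divergence, and assumes $0 < \hat{\mu}^* < 1$. The function names and the numbers are hypothetical.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), assuming 0 < q < 1."""
    kl = 0.0
    if p > 0.0:
        kl += p * math.log(p / q)
    if p < 1.0:
        kl += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return kl


def max_likelihood_best(t_i: int, mu_hat_i: float, mu_hat_star: float) -> float:
    """exp(-T_i * D_min): maximum likelihood that arm i is actually the best."""
    if mu_hat_i >= mu_hat_star:
        return 1.0
    return math.exp(-t_i * bernoulli_kl(mu_hat_i, mu_hat_star))


def dmed_pulls(t_i: int, mu_hat_i: float, mu_hat_star: float, n: int) -> bool:
    """Arm i is pulled when the maximum likelihood is at least 1/n."""
    return max_likelihood_best(t_i, mu_hat_i, mu_hat_star) >= 1.0 / n


# Arm with 10 samples and sample mean 0.3, while the best sample mean is 0.6.
print(max_likelihood_best(10, 0.3, 0.6))  # about 0.16
print(dmed_pulls(10, 0.3, 0.6, n=5))      # False: 1/5 exceeds the likelihood
print(dmed_pulls(10, 0.3, 0.6, n=100))    # True: the round number is large enough
```

The same arm with the same statistics is skipped early on and pulled later, which reflects the second condition in the slide: a larger round number $n$ lowers the threshold $1/n$.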

Proof of the optimality
• Assume $K = 2$ and $\mu_2 < \mu_1 = \mu^*$ (arm 1 is the best)
• Two events are essential for the proof:
  – $A_n$: the estimators $\hat{F}_i(n), \hat{\mu}_i(n)$ are already close to $F_i, \mu_i$
  – $B_n$: $\hat{\mu}_2(n) \approx \mu_2$, but $\hat{\mu}_1(n) < \mu_2\ (< \mu_1)$ (arm 1 seems inferior)
• With $J_n$ the arm pulled at the $n$-th round, so that $\{J_n = 2\}$ means "arm 2 is pulled at the $n$-th round",
$$T_2(N) = \sum_{n=1}^{N} \Big( I[\{J_n = 2\} \cap A_n] + I[\{J_n = 2\} \cap B_n] + I[\{J_n = 2\} \cap A_n^c \cap B_n^c] \Big)$$
• The sum of the first terms is $\approx \dfrac{\log N}{D_{\min}(F_2, \mu_1)}$, the sum of the second terms is $O(1)$, and the sum of the third terms is $O(1)$

After the convergence
• Arm 2 is pulled when $T_2(n)\, D_{\min}(\hat{F}_2(n), \hat{\mu}^*(n)) \le \log n$
• On the event $A_n$, $D_{\min}(\hat{F}_2(n), \hat{\mu}^*(n)) \approx D_{\min}(F_2, \mu^*)$ holds because $D_{\min}(F, \mu)$ is continuous
If $A_n$ is true, arm 2 is pulled only while
$$T_2(n) \lesssim \frac{\log n}{D_{\min}(F_2, \mu^*)}$$
is true. Therefore
$$\sum_{n=1}^{N} I[\{J_n = 2\} \cap A_n] \lesssim \frac{\log N}{D_{\min}(F_2, \mu^*)}$$

Before the convergence (1)
• $B_n$: $\hat{\mu}_2 \approx \mu_2$ and $\hat{\mu}_1 < \mu_2\ (< \mu_1)$
• We will show
$$\mathrm{E}\!\left[ \sum_{n=1}^{N} I[\{J_n = 2\} \cap B_n] \right] \le \mathrm{E}\!\left[ \sum_{n=1}^{N} I[B_n] \right] = O(1)$$
• Focus on $\hat{F}_1(n)$ of the event $B_n$
• $\mathcal{A}$ is compact (w.r.t. Lévy distance)
• $G_\delta$: $\delta$-ball with center $G$
Take the summation over finitely many balls. It is sufficient to show, for arbitrary $G \in \mathcal{A}$ s.t. $\mathrm{E}(G) \le \mu_2$,
$$\mathrm{E}\!\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\}] \right] = O(1)$$
[Figure: the family $\mathcal{A}$ with $F_1$, the boundary $\mathrm{E}(H) = \mu_2$, the region corresponding to $B_n$, and a ball $G_\delta$ inside it]

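Before the convergence (2)
• $B_n$: $\hat{\mu}_2 \approx \mu_2$ and $\hat{\mu}_1 < \mu_2\ (< \mu_1)$
• We will show
$$\mathrm{E}\!\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\}] \right] = O(1)$$
• Decompose by the number of samples from arm 1:
$$\mathrm{E}\!\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\}] \right] = \sum_{t=1}^{\infty} \mathrm{E}\!\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}] \right]$$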

Before the convergence (3)
• We will show
$$\sum_{t=1}^{\infty} \mathrm{E}\!\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}] \right] = O(1)$$
• For each $t$,
$$\mathrm{E}\!\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}] \right] \le P_{F_1}\!\left[ \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\} \right] \times \max \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}]$$
$$\le \exp\!\Big( -t \big( D_{\min}(G, \mu_1) - D_{\min}(G, \mu_2) \big) \Big)$$

Before the convergence (4)
$$\mathrm{E}\!\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}] \right] \le \exp\!\Big( -t \big( D_{\min}(G, \mu_1) - D_{\min}(G, \mu_2) \big) \Big) \le \exp(-tC)$$
[Figure: the family $\mathcal{A}$ with $F_1$, the ball $G_\delta$, the boundaries $\mathrm{E}(H) = \mu_1$ and $\mathrm{E}(H) = \mu_2$, the divergences $D_{\min}(G, \mu_1)$ and $D_{\min}(G, \mu_2)$, and the constant $C$]
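The bound $\exp(-tC)$ is what makes the sum over $t$ finite. As a short worked step, assuming the constant satisfies $C > 0$ and does not depend on $N$ (the slide does not state this explicitly), the geometric series gives:

```latex
\sum_{t=1}^{\infty} \mathrm{E}\!\left[ \sum_{n=1}^{N}
  I\big[ B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\} \big] \right]
\;\le\; \sum_{t=1}^{\infty} e^{-tC}
\;=\; \frac{e^{-C}}{1 - e^{-C}}
\;=\; \frac{1}{e^{C} - 1}
\;=\; O(1).
```

Summing this bound over the finitely many balls $G_\delta$ that cover the relevant part of $\mathcal{A}$ then keeps the total at $O(1)$.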
