An Asymptotically Optimal Bandit Algorithm for Bounded Support Models Junya Honda and Akimichi Takemura The University of Tokyo COLT 2010
Outline
• Introduction
• DMED policy
  – Proof of the optimality
  – Efficient computation
• Simulation results
• Conclusion
Multiarmed bandit problem
• Model of a gambler playing a slot machine with multiple arms
• Example of a dilemma between exploration and exploitation
• $K$-armed stochastic bandit problem
  – Burnetas and Katehakis derived an asymptotic bound of the regret
• Model of reward distributions with support in [0,1]
  – UCB policies by Auer et al. are widely used in practice
  – Bound-achieving policies have not been known
  – We propose the DMED policy, which achieves the bound
Notation
• $\mathcal{A}$: family of distributions with support in [0,1]
• $F_i \in \mathcal{A}$: probability distribution of arm $i = 1, \dots, K$
• $\mu_i = \mathrm{E}(F_i)$: expectation of arm $i$ ($\mathrm{E}(F)$: expectation of distribution $F$)
• $\mu^* = \max_i \mu_i$: maximum expectation of arms
• $T_i(n)$: number of times that arm $i$ has been pulled through the first $n$ rounds
Goal: minimize the regret
  $\sum_{i:\, \mu_i < \mu^*} (\mu^* - \mu_i)\, T_i(n)$
by reducing each $T_i(n)$ for suboptimal arm $i$
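As a small illustration of the regret formula above, the following sketch evaluates it for given pull counts. The arm means and counts are hypothetical values chosen for the example, not results from the talk.

```python
def regret(means, pulls):
    """Regret sum over suboptimal arms of (mu* - mu_i) * T_i(n)."""
    best = max(means)
    # The optimal arm contributes (mu* - mu*) * T = 0, so it can stay in the sum.
    return sum((best - m) * t for m, t in zip(means, pulls))


# Hypothetical example: three arms after n = 100 rounds.
example_means = [0.5, 0.4, 0.2]
example_pulls = [80, 15, 5]
example_regret = regret(example_means, example_pulls)
```

Only the counts of the suboptimal arms matter, which is why the bounds in the talk are stated per arm in terms of $T_i(n)$.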
Asymptotic bound: Burnetas and Katehakis (1996)
• Under any policy satisfying a mild condition (consistency), for all $F = (F_1, \dots, F_K) \in \mathcal{A}^K$ and suboptimal $i$,
  $\mathrm{E}_F[T_i(n)] \ge \left( \frac{1}{D_{\min}(F_i, \mu^*)} - o(1) \right) \log n$
  where
  $D_{\min}(F, \mu) = \min_{H \in \mathcal{A}:\, \mathrm{E}(H) \ge \mu} D(F \| H)$
  $D(F \| H) = \mathrm{E}_F\left[ \log \frac{\mathrm{d}F}{\mathrm{d}H} \right]$: Kullback-Leibler divergence
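To make the bound concrete, here is a minimal sketch for the special case of Bernoulli arms. It assumes that for $F = \mathrm{Bernoulli}(p)$ with $p < \mu$ the minimizing $H$ is $\mathrm{Bernoulli}(\mu)$, so that $D_{\min}$ reduces to a Bernoulli KL divergence; the function names and the numbers are illustrative only.

```python
import math


def kl_bernoulli(p, q):
    """D(Bernoulli(p) || Bernoulli(q)) for 0 < q < 1, with 0 * log 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)


def d_min_bernoulli(p, mu):
    """D_min(F, mu) for F = Bernoulli(p), under the assumption stated above."""
    return 0.0 if p >= mu else kl_bernoulli(p, mu)


def lower_bound_pulls(p, mu_star, n):
    """Leading term log(n) / D_min(F_i, mu*) for a suboptimal arm (p < mu_star)."""
    return math.log(n) / d_min_bernoulli(p, mu_star)


# Hypothetical arm with mean 0.4 competing against a best mean of 0.5.
example_bound = lower_bound_pulls(0.4, 0.5, 10_000)
```

The closer a suboptimal arm is to the best one, the smaller $D_{\min}$ becomes and the more pulls any consistent policy must spend on it.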
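Visualization of $D_{\min}(F, \mu) = \min_{H \in \mathcal{A}:\, \mathrm{E}(H) \ge \mu} D(F \| H)$
[Figure: the family $\mathcal{A}$ containing $F$, the boundary $\mathrm{E}(H) = \mu$, the region $\{H \in \mathcal{A} : \mathrm{E}(H) \ge \mu\}$ on the side of large $\mathrm{E}(H)$, and $D_{\min}(F, \mu)$ drawn from $F$ to that region.]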
DMED policy
• Deterministic Minimum Empirical Divergence policy
For each loop, DMED chooses arms to pull in this way:
1. For each arm $i$, check the condition
   $T_i(n)\, D_{\min}(\hat{F}_i(n), \hat{\mu}^*(n)) \le \log n$
   where $\hat{F}_i(n)$ is the empirical distribution of arm $i$ at the $n$-th round and $\hat{\mu}^*(n)$ is the maximum sample mean at the $n$-th round
   (The condition is always true for the currently best arm)
2. Pull all of the arms for which the condition is true
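The two steps above can be sketched as a short simulation loop. This is a simplified reading of the slide, not the authors' exact procedure: it recomputes the set of arms once per loop and pulls each of them once. The helper `d_min` assumes the dual form $D_{\min}(F, \mu) = \max_{0 \le \nu \le 1/(1-\mu)} \mathrm{E}_F[\log(1 - (X - \mu)\nu)]$, maximized here by ternary search; the names `d_min`, `dmed`, and `bernoulli`, and the arm means, are hypothetical.

```python
import math
import random


def d_min(samples, mu, iters=100):
    """Approximate D_min(F_hat, mu) for the empirical distribution of `samples`.

    Assumption: the dual form max over nu in [0, 1/(1-mu)] of
    mean(log(1 - (x - mu) * nu)), which is concave in nu.
    """
    mean = sum(samples) / len(samples)
    if mean >= mu:
        return 0.0           # F_hat itself already has expectation >= mu
    if mu >= 1.0:
        return math.inf      # only a point mass at 1 has expectation 1

    def objective(nu):
        return sum(math.log(1.0 - (x - mu) * nu) for x in samples) / len(samples)

    lo, hi = 0.0, (1.0 - 1e-12) / (1.0 - mu)   # stay strictly inside the domain
    for _ in range(iters):                      # ternary search on a concave function
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if objective(a) < objective(b):
            lo = a
        else:
            hi = b
    return max(0.0, objective((lo + hi) / 2.0))


def dmed(arms, horizon, rng):
    """Simplified DMED loop; `arms` are functions rng -> reward in [0, 1]."""
    rewards = [[arm(rng)] for arm in arms]      # initialization: pull every arm once
    n = len(arms)
    while n < horizon:
        best_mean = max(sum(r) / len(r) for r in rewards)
        # Step 1: arms satisfying T_i(n) * D_min(F_hat_i, mu_hat*) <= log n.
        to_pull = [i for i, r in enumerate(rewards)
                   if len(r) * d_min(r, best_mean) <= math.log(n)]
        # Step 2: pull all of them (the currently best arm is always included).
        for i in to_pull:
            if n >= horizon:
                break
            rewards[i].append(arms[i](rng))
            n += 1
    return [len(r) for r in rewards]


def bernoulli(p):
    """Hypothetical reward model: 1 with probability p, else 0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


pull_counts = dmed([bernoulli(0.8), bernoulli(0.3)], horizon=300, rng=random.Random(0))
```

Because `d_min` returns 0 for the arm with the maximum sample mean, every loop pulls at least one arm, so the simulation always reaches the horizon.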
Main theorem
Under the DMED policy, for every suboptimal arm $i$,
  $\mathrm{E}_F[T_i(n)] \le \left( \frac{1}{D_{\min}(F_i, \mu^*)} + o(1) \right) \log n$
Asymptotic bound:
  $\mathrm{E}_F[T_i(n)] \ge \left( \frac{1}{D_{\min}(F_i, \mu^*)} - o(1) \right) \log n$
DMED is asymptotically optimal
Intuitive interpretation (1)
• Assume $K = 2$ and consider the event
  • $\hat{\mu}_1(n) < \hat{\mu}_2(n) = \hat{\mu}^*(n)$
  • $T_1(n) \ll T_2(n)$
• How likely is arm 1 actually the best?
  – $\mu_2 \approx \hat{\mu}_2$ is far more likely than $\mu_1 \approx \hat{\mu}_1$
• How likely is the hypothesis $\mu_1 \ge \hat{\mu}_2$?
Intuitive interpretation (2)
• By Sanov's theorem in large deviation theory,
  $P[\text{empirical distribution from } F_1 \text{ comes close to } \hat{F}_1] \approx \exp(-T_1(n)\, D(\hat{F}_1 \| F_1))$
  where $T_1(n)$ is the number of samples
• Maximum likelihood of $\mu_1 \ge \hat{\mu}^*$ is
  $\max_{H \in \mathcal{A}:\, \mathrm{E}(H) \ge \hat{\mu}^*} \exp(-T_1(n)\, D(\hat{F}_1 \| H))$
  $= \exp\left( -T_1(n) \min_{H \in \mathcal{A}:\, \mathrm{E}(H) \ge \hat{\mu}^*} D(\hat{F}_1 \| H) \right)$
  $= \exp(-T_1(n)\, D_{\min}(\hat{F}_1, \hat{\mu}^*))$
[Figure: within $\mathcal{A}$, the empirical distribution $\hat{F}_1$, the true $F_1$, the boundary $\mathrm{E}(H) = \hat{\mu}^*$, and the divergences $D(\hat{F}_1 \| F_1)$ and $D_{\min}(\hat{F}_1, \hat{\mu}^*)$.]
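The Sanov-type approximation can be checked numerically in the simplest setting. The sketch below uses Bernoulli samples, where the probability of observing a given empirical frequency is an exact binomial term, and compares its exponential decay rate with the KL divergence. The sample size and parameters are hypothetical.

```python
import math


def kl_bernoulli(q, p):
    """D(Bernoulli(q) || Bernoulli(p)) for 0 < p < 1, with 0 * log 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(q, p) + term(1 - q, 1 - p)


def empirical_rate(t, k, p):
    """-(1/t) * log P[exactly k ones among t Bernoulli(p) samples]."""
    log_prob = (math.log(math.comb(t, k))
                + k * math.log(p)
                + (t - k) * math.log(1 - p))
    return -log_prob / t


# Hypothetical case: t = 1000 samples from Bernoulli(0.5), empirical mean 0.3.
rate = empirical_rate(1000, 300, 0.5)
divergence = kl_bernoulli(0.3, 0.5)
```

The two quantities differ only by a term of order $(\log t)/t$, which is the sense in which the probability behaves like $\exp(-t\, D(\hat{F}_1 \| F_1))$.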
Intuitive interpretation (3)
• Maximum likelihood that arm $i$ is actually the best:
  $\exp(-T_i(n)\, D_{\min}(\hat{F}_i, \hat{\mu}^*))$
• In the DMED policy, arm $i$ is pulled when
  $T_i(n)\, D_{\min}(\hat{F}_i, \hat{\mu}^*) \le \log n$
  that is, when this maximum likelihood is at least $1/n$
  – Arm $i$ is pulled if
    ‣ the maximum likelihood is large
    ‣ the round number $n$ is large
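The equivalence between the pull condition and a likelihood threshold of $1/n$ can be spelled out in a few lines. This is a sketch with hypothetical inputs; `t` stands for $T_i(n)$ and `d` for $D_{\min}(\hat{F}_i, \hat{\mu}^*)$, both treated as given numbers.

```python
import math


def max_likelihood(t, d):
    """exp(-T_i(n) * D_min): maximum likelihood that the arm is actually the best."""
    return math.exp(-t * d)


def is_pulled(t, d, n):
    """DMED condition T_i(n) * D_min <= log n, i.e. max_likelihood(t, d) >= 1 / n."""
    return t * d <= math.log(n)


def first_round_pulled(t, d):
    """Smallest round n at which the condition holds for fixed t and d."""
    return math.ceil(math.exp(t * d))


# Hypothetical arm: 10 pulls so far, empirical divergence 0.5.
threshold_round = first_round_pulled(10, 0.5)
```

With `t` and `d` fixed, the arm becomes eligible again once $n$ grows past $\exp(t \cdot d)$, which is why a larger round number makes a pull more likely.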
Proof of the optimality
• Assume $K = 2$ and $\mu_2 < \mu_1 = \mu^*$ (arm 1 is the best)
• Two events are essential for the proof:
  – $A_n$: the estimators $\hat{F}_i(n), \hat{\mu}_i(n)$ are already close to $F_i, \mu_i$
  – $B_n$: $\hat{\mu}_2(n) \approx \mu_2$, but $\hat{\mu}_1(n) < \mu_2\ (< \mu_1)$ (arm 1 seems inferior)
• With $J_n$ the arm pulled at the $n$-th round, so that $\{J_n = 2\}$ means "arm 2 is pulled at the $n$-th round",
  $T_2(N) = \sum_{n=1}^{N} \left( I[\{J_n = 2\} \cap A_n] + I[\{J_n = 2\} \cap B_n] + I[\{J_n = 2\} \cap A_n^c \cap B_n^c] \right)$
• The three sums contribute, respectively, $\approx \frac{\log N}{D_{\min}(F_2, \mu_1)}$, $O(1)$, and $O(1)$
After the convergence
• Arm 2 is pulled when $T_2(n)\, D_{\min}(\hat{F}_2(n), \hat{\mu}^*(n)) \le \log n$
• On the event $A_n$, $D_{\min}(\hat{F}_2(n), \hat{\mu}^*(n)) \approx D_{\min}(F_2, \mu^*)$ holds because $D_{\min}(F, \mu)$ is continuous
If $A_n$ is true, arm 2 is pulled only while
  $T_2(n) \lesssim \frac{\log n}{D_{\min}(F_2, \mu^*)}$
is true, so
  $\sum_{n=1}^{N} I[\{J_n = 2\} \cap A_n] \lesssim \frac{\log N}{D_{\min}(F_2, \mu^*)}$
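The counting argument above can be illustrated with an idealized loop in which the empirical divergence is replaced by a fixed value, as if the estimators had already converged. This is a sketch under that assumption; the value 0.02 and the horizon are hypothetical.

```python
import math


def count_pulls(d_min_value, horizon):
    """Count pulls of an arm that is pulled at round n iff T * D_min <= log n.

    Assumes D_min is a fixed constant (estimators already converged) and
    that the arm has been pulled once before the loop starts.
    """
    pulls = 1
    for n in range(2, horizon + 1):
        if pulls * d_min_value <= math.log(n):
            pulls += 1
    return pulls


# Hypothetical divergence D_min(F_2, mu*) = 0.02 and N = 100000 rounds.
horizon = 100_000
pulls = count_pulls(0.02, horizon)
bound = math.log(horizon) / 0.02
```

Each pull happens only while the count is at most $\log n / D_{\min}$, so the final count cannot exceed $\log N / D_{\min} + 1$, which matches the leading term in the main theorem.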
Before the convergence (1)
• $B_n$: $\hat{\mu}_2 \approx \mu_2$ and $\hat{\mu}_1 < \mu_2\ (< \mu_1)$
• We will show
  $\mathrm{E}\left[ \sum_{n=1}^{N} I[\{J_n = 2\} \cap B_n] \right] \le \mathrm{E}\left[ \sum_{n=1}^{N} I[B_n] \right] = O(1)$
• Focus on $\hat{F}_1(n)$ on the event $B_n$
• $\mathcal{A}$ is compact (w.r.t. Lévy distance)
  – It is sufficient to show, for arbitrary $G \in \mathcal{A}$ s.t. $\mathrm{E}(G) \le \mu_2$,
    $\mathrm{E}\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\}] \right] = O(1)$
    where $G_\delta$ is the $\delta$-ball with center $G$
  – Then take the summation over finitely many balls
[Figure: within $\mathcal{A}$, the true $F_1$, the boundary $\mathrm{E}(H) = \mu_2$, the region of $\hat{F}_1(n)$ corresponding to $B_n$, and a ball $G_\delta$ inside it.]
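Before the convergence (2)
• $B_n$: $\hat{\mu}_2 \approx \mu_2$ and $\hat{\mu}_1 < \mu_2\ (< \mu_1)$
• We will show
  $\mathrm{E}\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\}] \right] = O(1)$
• Decompose by the number of pulls of arm 1:
  $\mathrm{E}\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\}] \right] = \sum_{t=1}^{\infty} \mathrm{E}\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}] \right]$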
Before the convergence (3)
• We will show
  $\sum_{t=1}^{\infty} \mathrm{E}\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}] \right] = O(1)$
• For each $t$,
  $\mathrm{E}\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}] \right]$
  $\le P_{F_1}[\{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}] \times \max \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}]$
  $\le \exp\left( -t \left( D_{\min}(G, \mu_1) - D_{\min}(G, \mu_2) \right) \right)$
Before the convergence (4)
  $\mathrm{E}\left[ \sum_{n=1}^{N} I[B_n \cap \{\hat{F}_1(n) \in G_\delta\} \cap \{T_1(n) = t\}] \right]$
  $\le \exp\left( -t \left( D_{\min}(G, \mu_1) - D_{\min}(G, \mu_2) \right) \right)$
  $\le \exp(-tC)$
[Figure: within $\mathcal{A}$, the true $F_1$, the ball $G_\delta$, the boundaries $\mathrm{E}(H) = \mu_1$ and $\mathrm{E}(H) = \mu_2$, the divergences $D_{\min}(G, \mu_1)$ and $D_{\min}(G, \mu_2)$, and the gap labeled $C$.]
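The final step relies on the bound $\exp(-tC)$ being summable over $t$. Assuming $C > 0$, the sum is a geometric series with limit $1/(e^{C} - 1)$, which does not depend on $N$. The sketch below checks this numerically for a hypothetical value of $C$.

```python
import math


def partial_sum(c, terms):
    """sum_{t=1}^{terms} exp(-t * c)."""
    return sum(math.exp(-t * c) for t in range(1, terms + 1))


def series_limit(c):
    """Closed form of sum_{t=1}^{infinity} exp(-t * c) for c > 0."""
    return 1.0 / math.expm1(c)


# Hypothetical gap C = 0.1.
approx_total = partial_sum(0.1, 2000)
exact_total = series_limit(0.1)
```

A smaller gap $C$ gives a larger constant, but for any fixed $C > 0$ the total stays $O(1)$ in $N$, which is what the slides labeled "Before the convergence" set out to show.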