An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes
Prokopis C. Prokopiou, Peter E. Caines, and Aditya Mahajan
McGill University
May 6, 2015
The Multi-Armed Bandit (MAB) Problem
At each step a Decision Maker (DM) faces the following sequential allocation problem:
- must allocate a unit resource between several competing actions/projects;
- obtains a random reward with an unknown probability distribution.
The DM must design a policy to maximize the cumulative expected reward asymptotically in time.
Stylized model to understand the exploration-exploitation trade-off
- Imagine a slot machine with multiple arms.
- The gambler must choose one arm to pull at each time instant.
- He/she wins a random reward following some unknown probability distribution.
- His/her objective is to choose a policy that maximizes the cumulative expected reward over the long term.
Real examples
- Internet routing: sequential transmission of packets between a source and a destination. The DM must choose one route among several alternatives. Reward = transmission time or transmission cost of the packet.
- Cognitive radio communications: the DM must choose which channel to use in each time slot among several alternatives. Reward = number of bits sent in each slot.
- Advertisement placement: the DM must choose which advertisement to show to the next visitor of a web page among a finite set of alternatives. Reward = number of clicks.
Literature Overview
i.i.d. rewards
- Lai and Robbins (1985) constructed a policy that achieves the asymptotically optimal regret of O(log T).
- Agrawal (1995) constructed index-type policies that depend on the sample mean of the reward process and achieve the asymptotically optimal regret of O(log T).
- Auer et al. (2002) constructed an index-type policy, called UCB1, whose regret is O(log T) uniformly in time.
Markov rewards
- Tekin et al. (2010) proposed an index-based policy that achieves an asymptotically optimal regret of O(log T).
The Reward Process and the Regret
- Reward processes $\{Y^k_n\}_{n=1}^{\infty}$, $k = 1, \ldots, K$, defined on a common measurable space $(\Omega, \mathcal{A})$.
- Set of probability measures $\{P^k_\theta ;\ \theta \in \Theta^k\}$ for each machine, where $\Theta^k$ is a known finite set, for which $f^k_\theta$ denotes the probability density and $\mu^k_\theta$ denotes the mean.
- Best machine: $k^* \triangleq \arg\max_{k \in \{1,\ldots,K\}} \{ \mu^k_{\theta^*_k} \}$.
- The true parameter for machine $k$ is denoted $\theta^*_k$.
Allocation policy and Expected Regret
Allocation policy
A mapping $\phi_t : \mathbb{R}^{t-1} \to \{1, \ldots, K\}$ that indicates the arm to be selected at the instant $t$:
$$u_t = \phi_t(Z_1, \ldots, Z_{t-1}),$$
where $Z_1, \ldots, Z_{t-1}$ denote the rewards gained up until $t-1$.
Expected Regret
$$R_T(\phi) = \sum_{k=1}^{K} \left( \mu^{k^*}_{\theta^*_{k^*}} - \mu^k_{\theta^*_k} \right) E(n^k_T),$$
where
$$n^k_t = \begin{cases} n^k_{t-1} + 1 & \text{if } u_t = k, \\ n^k_{t-1} & \text{if } u_t \neq k. \end{cases}$$
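As a quick numerical illustration of this regret definition (not from the slides; the arm means and expected pull counts below are made-up placeholders), the regret is the pull-count-weighted sum of the gaps to the best mean:

```python
# Toy check of the regret definition; means and pull counts are placeholders.
mu = {1: 0.8, 2: 0.5, 3: 0.3}        # assumed true means mu^k_{theta*_k}
pulls_T = {1: 900, 2: 60, 3: 40}     # assumed expected pull counts E[n^k_T]

mu_star = max(mu.values())           # mean of the best machine k*
regret_T = sum((mu_star - mu[k]) * pulls_T[k] for k in mu)
print(regret_T)                      # (0.8-0.5)*60 + (0.8-0.3)*40 = 38.0
```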
The Multi-Armed Bandit Problem
Definition
The MAB problem is to define a policy $\phi = \{\phi_t ;\ t \in \mathbb{Z}_{>0}\}$ in order to minimize the rate of growth of $R_T(\phi)$ as $T \to \infty$.
Index policies and Upper Confidence Bounds
Index policy $\phi^g$
A policy that depends on a set $g$ of indices for each arm and chooses the arm with the highest index at each time.
Upper Confidence Bounds (UCB) [Agrawal (1995)]
A set $g$ of indices is a UCB if it satisfies the following conditions:
1. $g_{t,n}$ is non-decreasing in $t \geq n$, for each fixed $n \in \mathbb{Z}_{>0}$.
2. Let $y^k_1, y^k_2, \ldots, y^k_n$ be a sequence of observations from machine $k$. Then, for any $z < \mu^k_{\theta^*_k}$,
$$P_{\theta^*_k}\left( g_{t,n}(y^k_1, \ldots, y^k_n) < z, \text{ for some } n \leq t \right) = o(t^{-1}).$$
The Proposed Allocation (UCB) policy
Consider a set of index functions $g$ with
$$g^k_{t,n}(y^k_1, \ldots, y^k_n) \triangleq \hat{\mu}^k_n + t/C,$$
where $t \in \mathbb{Z}_{>0}$, $n \triangleq n^k_t \in \{1, \ldots, t\}$, $C \in \mathbb{R}$ and $k \in \{1, \ldots, K\}$, and $\hat{\mu}^k_n$ is the maximum likelihood estimate of the mean of $Y^k$. Then,
- if $t \leq K$: $\phi^g$ samples from each process $Y^k$ once;
- if $t > K$: $\phi^g$ samples from $Y^{u_t}$, where $u_t = \arg\max \{ g^k_{t, n^k_t} ;\ k \in \{1, \ldots, K\} \}$.
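A minimal simulation sketch of this policy, assuming i.i.d. Gaussian placeholder rewards so that the sample mean can stand in for the maximum likelihood estimate $\hat{\mu}^k_n$; the reward distributions, the horizon, and the constant C below are illustrative choices, not values from the paper.

```python
import random

def run_policy(sample, K, T, C=10.0):
    """Sketch of phi^g with index g^k_{t,n} = mu_hat^k_n + t/C.

    `sample(k)` draws one reward from machine k; the sample mean is used here
    as a stand-in for the maximum likelihood estimate mu_hat^k_n.
    """
    totals = [0.0] * K   # running reward sums per machine
    counts = [0] * K     # n^k_t: number of samples taken from machine k
    for t in range(1, T + 1):
        if t <= K:
            u = t - 1    # sample each process Y^k once
        else:
            u = max(range(K), key=lambda k: totals[k] / counts[k] + t / C)
        totals[u] += sample(u)
        counts[u] += 1
    return counts

# Hypothetical two-armed example with Gaussian rewards of means 1.0 and 0.5.
means = [1.0, 0.5]
print(run_policy(lambda k: random.gauss(means[k], 1.0), K=2, T=5000))
```

Since the bonus $t/C$ is common to every arm at a given $t$, the argmax above coincides with the argmax of the mean estimates; the $t/C$ term appears to be there to satisfy the UCB conditions on the previous slide rather than to discriminate between arms.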
The main results
Theorem
Under suitable technical assumptions, the regret of the proposed policy satisfies $R_T(\phi^g) = o(T^{1+\delta})$ for some $\delta > 0$.
The proposed index policy works when the reward processes are ARMA processes with unknown means and variances.
Preliminaries on MLE
Definition
A sequence of estimates $\{\hat{\theta}_n\}_{n=1}^{\infty}$ is called a maximum likelihood estimate if
$$f_{\hat{\theta}_n}(y_1, \ldots, y_n) \geq \max_{\theta \in \Theta} \{ f_\theta(y_1, \ldots, y_n) \}, \quad P_{\theta^*}\text{-a.s.}$$
Definition
$\{\hat{\theta}_n\}_{n=1}^{\infty}$ is called a (strongly) consistent estimator if $\hat{\theta}_n \neq \theta^*$ finitely often, $P_{\theta^*}$-a.s.
Assumption 1
Let $P_{\theta,n}$ denote the restriction of $P_\theta$ to the $\sigma$-field $\mathcal{A}_n$, $n \geq 0$. Then, for all $\theta \in \Theta$ and $n \geq 0$, $P_{\theta,n}$ is absolutely continuous with respect to $P_{\theta^*,n}$.
Preliminaries on MLE
Assumption 2
For every $\theta \in \Theta$, let $f_{\theta,n}$ be the density function associated with $P_{\theta,n}$. Define
$$h_{\theta,n}(y_n \mid y^{n-1}) = \frac{f_{\theta,n}(y_n \mid y^{n-1})}{f_{\theta^*,n}(y_n \mid y^{n-1})},$$
where $y^n \triangleq (y_1, \ldots, y_n)$. Then, for every $\varepsilon > 0$, there exists $\alpha(\varepsilon) > 1$ such that
$$P_{\theta^*}\left( 0 \leq h_{\hat{\theta}_{n-1},n}(y_n \mid y^{n-1}) \leq \alpha, \text{ for all } n > |\Theta| \right) < \varepsilon,$$
where $\hat{\theta}_n \in \Theta$.
Theorem 1 (PEC, 1975)
Under Assumptions 1 and 2, the sequence of the maximum likelihood estimates is (strongly) consistent.
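A small sketch of maximum likelihood estimation over a known finite parameter set, matching the definition above in the simplest i.i.d. case; the three-point Θ and the unit-variance Gaussian densities are hypothetical choices for illustration.

```python
import math

def mle_finite(theta_set, observations, log_density):
    """Pick the theta in the finite set Theta that maximizes the joint
    log-likelihood of the observations (i.i.d. case for simplicity)."""
    return max(theta_set,
               key=lambda theta: sum(log_density(y, theta) for y in observations))

# Hypothetical example: Theta = {0.0, 0.5, 1.0}, unit-variance Gaussian density.
def log_gauss(y, mean):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - mean) ** 2

print(mle_finite([0.0, 0.5, 1.0], [0.9, 1.1, 0.7, 1.3], log_gauss))   # -> 1.0
```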
Assumptions on the model
Assumption 3
For every arm $k$, there is a consistent estimator $\hat{\vartheta}^k = \{\hat{\vartheta}^k_1, \hat{\vartheta}^k_2, \ldots\}$.
Assumption 4 (the Summable Wrong And Corrected (SWAC) condition)
For all machines $k \in \{1, \ldots, K\}$, the sequence of estimates $\hat{\theta}^k_1, \ldots, \hat{\theta}^k_n, \ldots$ satisfies the following condition:
$$P^k_{\theta^*}\left( \hat{\theta}^k_{n-1} \neq \theta^*_k,\ \hat{\theta}^k_m = \theta^*_k\ \forall m \geq n \right) < \frac{C_k}{n^{3+\beta}},$$
for some $C_k \in \mathbb{R}_{>0}$, $\beta \in \mathbb{R}_{>0}$, and for all $n \in \mathbb{Z}_{>0}$.
The Lock-on time
Definition
For a consistent sequence of estimates $\hat{\theta}^k_1, \ldots, \hat{\theta}^k_n, \ldots$, the lock-on time refers to the least $N$ such that for all $n \geq N$, $\hat{\theta}_n = \theta^*$, $P_{\theta^*}$-a.s.
Lemma 1
Let $N_k$ be the lock-on time for estimator $\hat{\theta}^k$. Then, under Assumption 4,
$$E\{N_k^{2+\alpha}\} < \infty, \quad \forall k \in \{1, \ldots, K\},\ 0 < \alpha < \beta,$$
where $\beta$ appears in Assumption 4.
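The lock-on time of a realized estimate sequence can be read off directly; a minimal sketch (the example sequence is made up), returning the lock-on time within the observed horizon, or None if the sequence has not locked on by its end.

```python
def lock_on_time(estimates, theta_star):
    """Least N (1-indexed) such that estimates[n] == theta_star for all n >= N
    within the observed sequence; None if it never locks on."""
    N = None
    for n, theta_hat in enumerate(estimates, start=1):
        if theta_hat != theta_star:
            N = None              # a later mistake pushes the lock-on time forward
        elif N is None:
            N = n
    return N

print(lock_on_time(['a', 'b', 'a', 'a', 'a'], 'a'))   # -> 3
```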
Performance of $\phi^g$
Theorem 2
If Assumptions 3 and 4 hold, then for each $k \in \{1, \ldots, K\}$, the proposed index function
$$g^k_{t,n}(y^k_1, \ldots, y^k_n) \triangleq \hat{\mu}^k_n + t/C$$
is an Upper Confidence Bound (UCB).
Theorem 3
If Assumptions 3 and 4 hold, then the regret of the proposed policy $\phi^g$ satisfies $R_T(\phi^g) = o(T^{1+\delta})$, for some $\delta > 0$.
A MAB Problem for ARMA Processes
Consider a bandit system with reward processes generated by the following ARMA process:
$$S : \quad x^k_{n+1} = \lambda^k x^k_n + w^k_n, \qquad y^k_n = x^k_n, \qquad \forall n \in \mathbb{Z}_{\geq 0},\ k \in \{1, 2\},$$
where $x^k_n, y^k_n, w^k_n \in \mathbb{R}$, $n \in \mathbb{Z}_{\geq 0}$, and $w^k_n$ is i.i.d. $\sim N(0, \sigma_k^2)$, independent of $x^k_0$.
Assumptions:
- The parameter space of the system contains two alternatives: $\Theta^k = \{\theta^*_k, \theta_k\}$; $\theta_k \triangleq (\lambda_k, \sigma_k)$, $k \in \{1, 2\}$.
- For each system $|\lambda^k| < 1$ and each process $y^k_n$ is stationary.
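A sketch simulating the reward processes of S exactly as written; the particular $(\lambda^k, \sigma^k)$ values, horizon, seeds, and initial condition are placeholders.

```python
import random

def simulate_arm(lam, sigma, T, x0=0.0, seed=None):
    """Generate y_0, ..., y_{T-1} from x_{n+1} = lam * x_n + w_n, y_n = x_n,
    with w_n i.i.d. N(0, sigma^2). For exact stationarity, x0 could instead be
    drawn from N(0, sigma^2 / (1 - lam^2))."""
    rng = random.Random(seed)
    x, ys = x0, []
    for _ in range(T):
        ys.append(x)
        x = lam * x + rng.gauss(0.0, sigma)
    return ys

# Hypothetical two-armed system with |lambda^k| < 1 (stationary regime).
y1 = simulate_arm(lam=0.6, sigma=1.0, T=1000, seed=1)
y2 = simulate_arm(lam=0.9, sigma=0.5, T=1000, seed=2)
```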
A MAB Problem for ARMA Processes
Problem Description
At each step $t$, the player
- chooses to observe a sample from machine $k \in \{1, 2\}$;
- pays a cost equal to the square of the minimum one-step prediction error $\upsilon^k_t$ of the next observation $y^k_t$ given the past observations $y^k_1, \ldots, y^k_{t-1}$.
The Expected Regret
$$R_T(\phi^g) = \sum_{i=1}^{T} \left( E\left[ \left( \upsilon^{u_i}_{n^{u_i}_i} \right)^2 \right] - \min_{k \in \{1,2\}} E\left[ \left( \upsilon^k_{n^k_i} \right)^2 \right] \right),$$
where $u_i$ denotes the arm chosen at time $i$ by the proposed index policy $\phi^g$.
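A sketch of the per-step cost as interpreted here: the squared one-step prediction error under the AR(1) predictor $\hat{y}_{t|t-1} = \lambda y_{t-1}$ (made explicit on the next slide), with `lam_hat` standing for whichever parameter value the predictor uses, e.g. the current estimate; the numbers are placeholders.

```python
def prediction_cost(y_prev, y_curr, lam_hat):
    """Squared one-step prediction error of y_t given y_{t-1}, using the
    AR(1) predictor y_hat_{t|t-1} = lam_hat * y_{t-1}."""
    return (y_curr - lam_hat * y_prev) ** 2

print(prediction_cost(y_prev=1.2, y_curr=0.5, lam_hat=0.6))   # (0.5 - 0.72)**2
```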
Preliminary results for ARMA Processes
The negative log-likelihood function of the reward process can be written as
$$-\log f(y^n; \lambda, \sigma) = \frac{n}{2}\log 2\pi + \frac{1}{2}\log\frac{\sigma^2}{1-\lambda^2} + \frac{1-\lambda^2}{2\sigma^2}\,y_1^2 + \frac{n-1}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\sum_{i=2}^{n}\left(y_i - y_{i|i-1}\right)^2,$$
where $y_{i|i-1} \triangleq E(y_i \mid y^{i-1}) = \lambda y_{i-1}$, and $y_i - y_{i|i-1}$ is the prediction error process.
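A sketch of this negative log-likelihood and its use to select between two candidate parameters in $\Theta^k$; the data-generating values and the candidate pairs below are placeholders, and the expression follows the reconstruction above (exact Gaussian AR(1) likelihood with stationary initial condition).

```python
import math
import random

def neg_log_lik(ys, lam, sigma):
    """Exact Gaussian negative log-likelihood of an AR(1) sample y_1, ..., y_n
    with stationary initial condition, following the expression on this slide."""
    n, s2 = len(ys), sigma ** 2
    nll = (n / 2) * math.log(2 * math.pi)
    nll += 0.5 * math.log(s2 / (1 - lam ** 2))        # y_1 ~ N(0, s2 / (1 - lam^2))
    nll += (1 - lam ** 2) / (2 * s2) * ys[0] ** 2
    nll += (n - 1) / 2 * math.log(s2)
    nll += sum((ys[i] - lam * ys[i - 1]) ** 2 for i in range(1, n)) / (2 * s2)
    return nll

# Placeholder data from an arm with (lambda, sigma) = (0.6, 1.0).
rng = random.Random(3)
ys, x = [], 0.0
for _ in range(500):
    ys.append(x)
    x = 0.6 * x + rng.gauss(0.0, 1.0)

candidates = [(0.6, 1.0), (0.2, 1.5)]                 # Theta^k = {theta*_k, theta_k}
theta_hat = min(candidates, key=lambda th: neg_log_lik(ys, *th))
print(theta_hat)                                      # expected to pick (0.6, 1.0)
```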