An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes
Prokopis C. Prokopiou, Peter E. Caines, and Aditya Mahajan
McGill University
May 6, 2015
The Multi-Armed Bandit (MAB) Problem
At each step a Decision Maker (DM) faces the following sequential allocation problem:
- must allocate a unit resource between several competing actions/projects;
- obtains a random reward with an unknown probability distribution.
The DM must design a policy to maximize the cumulative expected reward asymptotically in time.
Stylized model to understand the exploration-exploitation trade-off
- Imagine a slot machine with multiple arms.
- The gambler must choose one arm to pull at each time instant.
- He/she wins a random reward following some unknown probability distribution.
- His/her objective is to choose a policy that maximizes the cumulative expected reward over the long term.
Real examples
- Internet routing: sequential transmission of packets between a source and a destination. The DM must choose one route among several alternatives. Reward = transmission time or transmission cost of the packet.
- Cognitive radio communications: the DM must choose which channel to use in each time slot among several alternatives. Reward = number of bits sent in each slot.
- Advertisement placement: the DM must choose which advertisement to show to the next visitor of a web page among a finite set of alternatives. Reward = number of clicks.
Literature Overview
i.i.d. rewards
- Lai and Robbins (1985) constructed a policy that achieves the asymptotically optimal regret of O(log T).
- Agrawal (1995) constructed index-type policies that depend on the sample mean of the reward process and achieve the asymptotically optimal regret of O(log T).
- Auer et al. (2002) constructed an index-type policy, called UCB1, whose regret is O(log T) uniformly in time.
Markov rewards
- Tekin et al. (2010) proposed an index-based policy that achieves an asymptotically optimal regret of O(log T).
The Reward Process and the Regret
- Reward processes $\{Y^k_n\}_{n=1}^{\infty}$, $k = 1, \ldots, K$, defined on a common measurable space $(\Omega, \mathcal{A})$.
- Set of probability measures $\{P^k_\theta ;\ \theta \in \Theta^k\}$ for each machine, where $\Theta^k$ is a known finite set, for which $f^k_\theta$ denotes the probability density and $\mu^k_\theta$ denotes the mean.
- Best machine: $k^* \triangleq \arg\max_{k \in \{1,\ldots,K\}} \{ \mu^k_{\theta^*_k} \}$.
- The true parameter for machine $k$ is denoted $\theta^*_k$.
Allocation policy and Expected Regret
Allocation policy
A mapping $\phi_t : \mathbb{R}^{t-1} \to \{1, \ldots, K\}$ that indicates the arm to be selected at the instant $t$:
$$u_t = \phi_t(Z_1, \ldots, Z_{t-1}),$$
where $Z_1, \ldots, Z_{t-1}$ denote the rewards gained up until $t-1$.
Expected Regret
$$R_T(\phi) = \sum_{k=1}^{K} \left( \mu^{k^*}_{\theta^*_{k^*}} - \mu^k_{\theta^*_k} \right) E(n^k_T),$$
where
$$n^k_t = \begin{cases} n^k_{t-1} + 1 & \text{if } u_t = k, \\ n^k_{t-1} & \text{if } u_t \neq k. \end{cases}$$
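As a quick numerical illustration of this regret definition (not from the slides; the arm means and expected pull counts below are made-up placeholders), the regret is the pull-count-weighted sum of the gaps to the best mean:

```python
# Toy check of the regret definition; means and pull counts are placeholders.
mu = {1: 0.8, 2: 0.5, 3: 0.3}        # assumed true means mu^k_{theta*_k}
pulls_T = {1: 900, 2: 60, 3: 40}     # assumed expected pull counts E[n^k_T]

mu_star = max(mu.values())           # mean of the best machine k*
regret_T = sum((mu_star - mu[k]) * pulls_T[k] for k in mu)
print(regret_T)                      # (0.8-0.5)*60 + (0.8-0.3)*40 = 38.0
```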
The Multi-Armed Bandit Problem
Definition
The MAB problem is to define a policy $\phi = \{\phi_t ;\ t \in \mathbb{Z}_{>0}\}$ in order to minimize the rate of growth of $R_T(\phi)$ as $T \to \infty$.
Index policies and Upper Confidence Bounds
Index policy $\phi^g$
A policy that depends on a set $g$ of indices for each arm and chooses the arm with the highest index at each time.
Upper Confidence Bounds (UCB) [Agrawal (1995)]
A set $g$ of indices is a UCB if it satisfies the following conditions:
1. $g_{t,n}$ is non-decreasing in $t \geq n$, for each fixed $n \in \mathbb{Z}_{>0}$.
2. Let $y^k_1, y^k_2, \ldots, y^k_n$ be a sequence of observations from machine $k$. Then, for any $z < \mu^k_{\theta^*_k}$,
$$P_{\theta^*_k}\left( g_{t,n}(y^k_1, \ldots, y^k_n) < z, \text{ for some } n \leq t \right) = o(t^{-1}).$$
The Proposed Allocation (UCB) policy
Consider a set of index functions $g$ with
$$g^k_{t,n}(y^k_1, \ldots, y^k_n) \triangleq \hat{\mu}^k_n + t/C,$$
where $t \in \mathbb{Z}_{>0}$, $n \triangleq n^k_t \in \{1, \ldots, t\}$, $C \in \mathbb{R}$ and $k \in \{1, \ldots, K\}$, and $\hat{\mu}^k_n$ is the maximum likelihood estimate of the mean of $Y^k$. Then,
- if $t \leq K$: $\phi^g$ samples from each process $Y^k$ once;
- if $t > K$: $\phi^g$ samples from $Y^{u_t}$, where $u_t = \arg\max \{ g^k_{t, n^k_t} ;\ k \in \{1, \ldots, K\} \}$.
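A minimal simulation sketch of this policy, assuming i.i.d. Gaussian placeholder rewards so that the sample mean can stand in for the maximum likelihood estimate $\hat{\mu}^k_n$; the reward distributions, the horizon, and the constant C below are illustrative choices, not values from the paper.

```python
import random

def run_policy(sample, K, T, C=10.0):
    """Sketch of phi^g with index g^k_{t,n} = mu_hat^k_n + t/C.

    `sample(k)` draws one reward from machine k; the sample mean is used here
    as a stand-in for the maximum likelihood estimate mu_hat^k_n.
    """
    totals = [0.0] * K   # running reward sums per machine
    counts = [0] * K     # n^k_t: number of samples taken from machine k
    for t in range(1, T + 1):
        if t <= K:
            u = t - 1    # sample each process Y^k once
        else:
            u = max(range(K), key=lambda k: totals[k] / counts[k] + t / C)
        totals[u] += sample(u)
        counts[u] += 1
    return counts

# Hypothetical two-armed example with Gaussian rewards of means 1.0 and 0.5.
means = [1.0, 0.5]
print(run_policy(lambda k: random.gauss(means[k], 1.0), K=2, T=5000))
```

Since the bonus $t/C$ is common to every arm at a given $t$, the argmax above coincides with the argmax of the mean estimates; the $t/C$ term appears to be there to satisfy the UCB conditions on the previous slide rather than to discriminate between arms.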
The main results
Theorem
Under suitable technical assumptions, the regret of the proposed policy satisfies $R_T(\phi^g) = o(T^{1+\delta})$ for some $\delta > 0$.
The proposed index policy works when the reward processes are ARMA processes with unknown means and variances.
Preliminaries on MLE
Definition
A sequence of estimates $\{\hat{\theta}_n\}_{n=1}^{\infty}$ is called a maximum likelihood estimate if
$$f_{\hat{\theta}_n}(y_1, \ldots, y_n) \geq \max_{\theta \in \Theta} \{ f_\theta(y_1, \ldots, y_n) \}, \quad P_{\theta^*}\text{-a.s.}$$
Definition
$\{\hat{\theta}_n\}_{n=1}^{\infty}$ is called a (strongly) consistent estimator if $\hat{\theta}_n \neq \theta^*$ finitely often, $P_{\theta^*}$-a.s.
Assumption 1
Let $P_{\theta,n}$ denote the restriction of $P_\theta$ to the $\sigma$-field $\mathcal{A}_n$, $n \geq 0$. Then, for all $\theta \in \Theta$ and $n \geq 0$, $P_{\theta,n}$ is absolutely continuous with respect to $P_{\theta^*,n}$.
Preliminaries on MLE
Assumption 2
For every $\theta \in \Theta$, let $f_{\theta,n}$ be the density function associated with $P_{\theta,n}$. Define
$$h_{\theta,n}(y_n \mid y^{n-1}) = \frac{f_{\theta,n}(y_n \mid y^{n-1})}{f_{\theta^*,n}(y_n \mid y^{n-1})},$$
where $y^n \triangleq (y_1, \ldots, y_n)$. Then, for every $\varepsilon > 0$, there exists $\alpha(\varepsilon) > 1$ such that
$$P_{\theta^*}\left( 0 \leq h_{\hat{\theta}_{n-1},n}(y_n \mid y^{n-1}) \leq \alpha, \text{ for all } n > |\Theta| \right) < \varepsilon,$$
where $\hat{\theta}_n \in \Theta$.
Theorem 1 (PEC, 1975)
Under Assumptions 1 and 2, the sequence of the maximum likelihood estimates is (strongly) consistent.
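A small sketch of maximum likelihood estimation over a known finite parameter set, matching the definition above in the simplest i.i.d. case; the three-point Θ and the unit-variance Gaussian densities are hypothetical choices for illustration.

```python
import math

def mle_finite(theta_set, observations, log_density):
    """Pick the theta in the finite set Theta that maximizes the joint
    log-likelihood of the observations (i.i.d. case for simplicity)."""
    return max(theta_set,
               key=lambda theta: sum(log_density(y, theta) for y in observations))

# Hypothetical example: Theta = {0.0, 0.5, 1.0}, unit-variance Gaussian density.
def log_gauss(y, mean):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - mean) ** 2

print(mle_finite([0.0, 0.5, 1.0], [0.9, 1.1, 0.7, 1.3], log_gauss))   # -> 1.0
```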
Assumptions on the model
Assumption 3
For every arm $k$, there is a consistent estimator $\hat{\vartheta}^k = \{\hat{\vartheta}^k_1, \hat{\vartheta}^k_2, \ldots\}$.
Assumption 4 (the Summable Wrong And Corrected (SWAC) condition)
For all machines $k \in \{1, \ldots, K\}$, the sequence of estimates $\hat{\theta}^k_1, \ldots, \hat{\theta}^k_n, \ldots$ satisfies the following condition:
$$P^k_{\theta^*}\left( \hat{\theta}^k_{n-1} \neq \theta^*_k,\ \hat{\theta}^k_m = \theta^*_k\ \forall m \geq n \right) < \frac{C_k}{n^{3+\beta}},$$
for some $C_k \in \mathbb{R}_{>0}$, $\beta \in \mathbb{R}_{>0}$, and for all $n \in \mathbb{Z}_{>0}$.
The Lock-on time
Definition
For a consistent sequence of estimates $\hat{\theta}^k_1, \ldots, \hat{\theta}^k_n, \ldots$, the lock-on time refers to the least $N$ such that for all $n \geq N$, $\hat{\theta}_n = \theta^*$, $P_{\theta^*}$-a.s.
Lemma 1
Let $N_k$ be the lock-on time for estimator $\hat{\theta}^k$. Then, under Assumption 4,
$$E\{N_k^{2+\alpha}\} < \infty, \quad \forall k \in \{1, \ldots, K\},\ 0 < \alpha < \beta,$$
where $\beta$ appears in Assumption 4.
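The lock-on time of a realized estimate sequence can be read off directly; a minimal sketch (the example sequence is made up), returning the lock-on time within the observed horizon, or None if the sequence has not locked on by its end.

```python
def lock_on_time(estimates, theta_star):
    """Least N (1-indexed) such that estimates[n] == theta_star for all n >= N
    within the observed sequence; None if it never locks on."""
    N = None
    for n, theta_hat in enumerate(estimates, start=1):
        if theta_hat != theta_star:
            N = None              # a later mistake pushes the lock-on time forward
        elif N is None:
            N = n
    return N

print(lock_on_time(['a', 'b', 'a', 'a', 'a'], 'a'))   # -> 3
```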
Performance of $\phi^g$
Theorem 2
If Assumptions 3 and 4 hold, then for each $k \in \{1, \ldots, K\}$, the proposed index function
$$g^k_{t,n}(y^k_1, \ldots, y^k_n) \triangleq \hat{\mu}^k_n + t/C$$
is an Upper Confidence Bound (UCB).
Theorem 3
If Assumptions 3 and 4 hold, then the regret of the proposed policy $\phi^g$ satisfies $R_T(\phi^g) = o(T^{1+\delta})$, for some $\delta > 0$.
A MAB Problem for ARMA Processes
Consider a bandit system with reward processes generated by the following ARMA process:
$$S : \quad x^k_{n+1} = \lambda^k x^k_n + w^k_n, \qquad y^k_n = x^k_n, \qquad \forall n \in \mathbb{Z}_{\geq 0},\ k \in \{1, 2\},$$
where $x^k_n, y^k_n, w^k_n \in \mathbb{R}$, $n \in \mathbb{Z}_{\geq 0}$, and $w^k_n$ is i.i.d. $\sim N(0, \sigma_k^2)$, independent of $x^k_0$.
Assumptions:
- The parameter space of the system contains two alternatives: $\Theta^k = \{\theta^*_k, \theta_k\}$; $\theta_k \triangleq (\lambda_k, \sigma_k)$, $k \in \{1, 2\}$.
- For each system $|\lambda^k| < 1$ and each process $y^k_n$ is stationary.
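A sketch simulating the reward processes of S exactly as written; the particular $(\lambda^k, \sigma^k)$ values, horizon, seeds, and initial condition are placeholders.

```python
import random

def simulate_arm(lam, sigma, T, x0=0.0, seed=None):
    """Generate y_0, ..., y_{T-1} from x_{n+1} = lam * x_n + w_n, y_n = x_n,
    with w_n i.i.d. N(0, sigma^2). For exact stationarity, x0 could instead be
    drawn from N(0, sigma^2 / (1 - lam^2))."""
    rng = random.Random(seed)
    x, ys = x0, []
    for _ in range(T):
        ys.append(x)
        x = lam * x + rng.gauss(0.0, sigma)
    return ys

# Hypothetical two-armed system with |lambda^k| < 1 (stationary regime).
y1 = simulate_arm(lam=0.6, sigma=1.0, T=1000, seed=1)
y2 = simulate_arm(lam=0.9, sigma=0.5, T=1000, seed=2)
```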
A MAB Problem for ARMA Processes
Problem Description
At each step $t$, the player
- chooses to observe a sample from machine $k \in \{1, 2\}$;
- pays a cost equal to the square of the minimum one-step prediction error $\upsilon^k_t$ of the next observation $y^k_t$ given the past observations $y^k_1, \ldots, y^k_{t-1}$.
The Expected Regret
$$R_T(\phi^g) = \sum_{i=1}^{T} \left( E\left[ \left( \upsilon^{u_i}_{n^{u_i}_i} \right)^2 \right] - \min_{k \in \{1,2\}} E\left[ \left( \upsilon^k_{n^k_i} \right)^2 \right] \right),$$
where $u_i$ denotes the arm chosen at time $i$ by the proposed index policy $\phi^g$.
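A sketch of the per-step cost as interpreted here: the squared one-step prediction error under the AR(1) predictor $\hat{y}_{t|t-1} = \lambda y_{t-1}$ (made explicit on the next slide), with `lam_hat` standing for whichever parameter value the predictor uses, e.g. the current estimate; the numbers are placeholders.

```python
def prediction_cost(y_prev, y_curr, lam_hat):
    """Squared one-step prediction error of y_t given y_{t-1}, using the
    AR(1) predictor y_hat_{t|t-1} = lam_hat * y_{t-1}."""
    return (y_curr - lam_hat * y_prev) ** 2

print(prediction_cost(y_prev=1.2, y_curr=0.5, lam_hat=0.6))   # (0.5 - 0.72)**2
```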
Preliminary results for ARMA Processes
The negative log-likelihood function of the reward process can be written as
$$-\log f(y^n; \lambda, \sigma) = \frac{n}{2}\log 2\pi + \frac{1}{2}\log\frac{\sigma^2}{1-\lambda^2} + \frac{1-\lambda^2}{2\sigma^2}\,y_1^2 + \frac{n-1}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\sum_{i=2}^{n}\left(y_i - y_{i|i-1}\right)^2,$$
where $y_{i|i-1} \triangleq E(y_i \mid y^{i-1}) = \lambda y_{i-1}$, and $y_i - y_{i|i-1}$ is the prediction error process.
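A sketch of this negative log-likelihood and its use to select between two candidate parameters in $\Theta^k$; the data-generating values and the candidate pairs below are placeholders, and the expression follows the reconstruction above (exact Gaussian AR(1) likelihood with stationary initial condition).

```python
import math
import random

def neg_log_lik(ys, lam, sigma):
    """Exact Gaussian negative log-likelihood of an AR(1) sample y_1, ..., y_n
    with stationary initial condition, following the expression on this slide."""
    n, s2 = len(ys), sigma ** 2
    nll = (n / 2) * math.log(2 * math.pi)
    nll += 0.5 * math.log(s2 / (1 - lam ** 2))        # y_1 ~ N(0, s2 / (1 - lam^2))
    nll += (1 - lam ** 2) / (2 * s2) * ys[0] ** 2
    nll += (n - 1) / 2 * math.log(s2)
    nll += sum((ys[i] - lam * ys[i - 1]) ** 2 for i in range(1, n)) / (2 * s2)
    return nll

# Placeholder data from an arm with (lambda, sigma) = (0.6, 1.0).
rng = random.Random(3)
ys, x = [], 0.0
for _ in range(500):
    ys.append(x)
    x = 0.6 * x + rng.gauss(0.0, 1.0)

candidates = [(0.6, 1.0), (0.2, 1.5)]                 # Theta^k = {theta*_k, theta_k}
theta_hat = min(candidates, key=lambda th: neg_log_lik(ys, *th))
print(theta_hat)                                      # expected to pick (0.6, 1.0)
```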