Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part 2
Sébastien Bubeck, Theory Group
The linear bandit problem, Auer [2002]

Known parameters: compact action set $\mathcal{A} \subset \mathbb{R}^n$, adversary's action set $\mathcal{L} \subset \mathbb{R}^n$, number of rounds $T$.

Protocol: For each round $t = 1, 2, \ldots, T$, the adversary chooses a loss vector $\ell_t \in \mathcal{L}$ and simultaneously the player chooses $a_t \in \mathcal{A}$ based on past observations and receives a loss/observation $Y_t = \ell_t^\top a_t$.

$$R_T = \mathbb{E} \sum_{t=1}^{T} \ell_t^\top a_t - \min_{a \in \mathcal{A}} \mathbb{E} \sum_{t=1}^{T} \ell_t^\top a .$$

Other models: In the i.i.d. model we assume that there is some underlying $\theta \in \mathcal{L}$ such that $\mathbb{E}(Y_t \mid a_t) = \theta^\top a_t$. In the Bayesian model we assume that we have a prior distribution $\nu$ over the sequence $(\ell_1, \ldots, \ell_T)$ (in this case the expectation in $R_T$ is also over $(\ell_1, \ldots, \ell_T) \sim \nu$). Alternatively we could assume a prior over $\theta$.

Example: Part 1 was about $\mathcal{A} = \{e_1, \ldots, e_n\}$ and $\mathcal{L} = [0, 1]^n$.

Assumption: unless specified otherwise we assume $\mathcal{L} = \mathcal{A}^{\circ} := \{\ell : \sup_{a \in \mathcal{A}} |\ell^\top a| \leq 1\}$.
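To make the protocol concrete, here is a minimal simulation sketch in Python. The instance is made up: it uses the i.i.d. model with $\mathcal{A} = \{e_1, \ldots, e_n\}$ and $\mathcal{L} = [0,1]^n$ from Part 1, and a player who picks actions uniformly at random, so it only illustrates the interaction loop and the regret computation, not a good strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

n, T = 5, 1000
actions = np.eye(n)                    # A = {e_1, ..., e_n}, one action per row
theta = rng.uniform(0.0, 1.0, size=n)  # i.i.d. model: E(Y_t | a_t) = theta^T a_t

cum_loss = 0.0
for t in range(T):
    a_t = actions[rng.integers(n)]     # player's choice (here: uniform, for illustration)
    y_t = theta @ a_t                  # loss/observation Y_t = l_t^T a_t (noiseless here)
    cum_loss += y_t

best_fixed = T * (actions @ theta).min()   # best fixed action in hindsight
print("regret estimate:", cum_loss - best_fixed)
```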
Example: path planning

[Figure: a layered graph whose edges carry losses $\ell_1, \ldots, \ell_n$ chosen by the adversary; the player selects a path from source to sink.]

Loss suffered: $\ell_2 + \ell_7 + \cdots + \ell_n$.

Feedback:
- Full Info: $\ell_1, \ell_2, \ldots, \ell_n$
- Semi-Bandit: $\ell_2, \ell_7, \ldots, \ell_n$
- Bandit: $\ell_2 + \ell_7 + \cdots + \ell_n$
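Path planning fits the linear bandit template by encoding each path as a 0/1 incidence vector over the $n$ edges, so that the loss of a path is exactly the linear form $\ell^\top a$. A toy sketch (the two-path graph below is made up for illustration):

```python
import numpy as np

# Made-up graph with n = 4 edges and two source-to-sink paths; each path is
# a 0/1 incidence vector over the edges, so its loss is the linear form l^T a.
n = 4
paths = {
    "top":    np.array([1, 0, 1, 0]),    # uses edges 1 and 3
    "bottom": np.array([0, 1, 0, 1]),    # uses edges 2 and 4
}
losses = np.array([0.2, 0.5, 0.1, 0.3])  # adversary's per-edge losses l_1, ..., l_n

for name, a in paths.items():
    # Bandit feedback reveals only this total; semi-bandit would reveal
    # losses[a == 1]; full information would reveal all of losses.
    print(name, "suffers", losses @ a)
```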
Thompson Sampling for linear bandit after RVR14

Assume $\mathcal{A} = \{a_1, \ldots, a_{|\mathcal{A}|}\}$. Recall from Part 1 that TS satisfies

$$\sum_i \pi_t(i) \big( \bar{\ell}_t(i) - \bar{\ell}_t(i, i) \big) \leq C \sqrt{\sum_{i,j} \pi_t(i) \pi_t(j) \big( \bar{\ell}_t(i, j) - \bar{\ell}_t(i) \big)^2} \;\Rightarrow\; R_T \leq C \sqrt{T \log(|\mathcal{A}|) / 2},$$

where $\bar{\ell}_t(i) = \mathbb{E}_t\, \ell_t(i)$ and $\bar{\ell}_t(i, j) = \mathbb{E}_t\big(\ell_t(i) \mid i^* = j\big)$.

Writing $\bar{\ell}_t(i) = a_i^\top \bar{\ell}_t$, $\bar{\ell}_t(i, j) = a_i^\top \bar{\ell}_t^{\,j}$, and $M_{i,j} = \sqrt{\pi_t(i) \pi_t(j)}\; a_i^\top (\bar{\ell}_t - \bar{\ell}_t^{\,j})$, we want to show that $\mathrm{Tr}(M) \leq C \, \|M\|_F$.

Using the eigenvalue formula for the trace and the Frobenius norm one can see that $\mathrm{Tr}(M)^2 \leq \mathrm{rank}(M) \, \|M\|_F^2$. Moreover the rank of $M$ is at most $n$ since $M = U V^\top$ where $U, V \in \mathbb{R}^{|\mathcal{A}| \times n}$ (the $i$-th row of $U$ is $\sqrt{\pi_t(i)}\, a_i$ and for $V$ it is $\sqrt{\pi_t(i)} \, (\bar{\ell}_t - \bar{\ell}_t^{\,i})$). Thus one can take $C = \sqrt{n}$.
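The rank argument is easy to check numerically. The sketch below builds $M = U V^\top$ from random hypothetical values of $\pi_t$, the $a_i$, $\bar{\ell}_t$, and the $\bar{\ell}_t^{\,j}$, and verifies both $\mathrm{rank}(M) \leq n$ and $|\mathrm{Tr}(M)| \leq \sqrt{n}\, \|M\|_F$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random hypothetical instance: |A| actions in dimension n.
num_actions, n = 50, 3
pi = rng.dirichlet(np.ones(num_actions))   # pi_t, a distribution over actions
A = rng.normal(size=(num_actions, n))      # the actions a_i as rows
l_bar = rng.normal(size=n)                 # bar l_t
l_j = rng.normal(size=(num_actions, n))    # bar l_t^j, one row per j

U = np.sqrt(pi)[:, None] * A               # i-th row: sqrt(pi_t(i)) a_i
V = np.sqrt(pi)[:, None] * (l_bar - l_j)   # i-th row: sqrt(pi_t(i)) (bar l_t - bar l_t^i)
M = U @ V.T                                # M_ij = sqrt(pi_t(i) pi_t(j)) a_i^T (bar l_t - bar l_t^j)

tr, fro = np.trace(M), np.linalg.norm(M, "fro")
print(np.linalg.matrix_rank(M) <= n)       # True: M = U V^T has rank at most n
print(abs(tr) <= np.sqrt(n) * fro)         # True: Tr(M)^2 <= rank(M) ||M||_F^2
```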
Thompson Sampling for linear bandit after RVR14

1. TS satisfies $R_T \leq \sqrt{n T \log(|\mathcal{A}|)}$. To appreciate the improvement recall that without the linear structure one would get a regret of order $\sqrt{|\mathcal{A}| \, T}$, and that $|\mathcal{A}|$ can be exponential in the dimension $n$ (think of the path planning example).

2. Provided that one can efficiently sample from the posterior on $\ell_t$ (or on $\theta$), TS just requires at each step one linear optimization over $\mathcal{A}$ (see the sketch after this list).

3. The TS regret bound is optimal in the following sense. W.l.o.g. one can assume $|\mathcal{A}| \leq (10 T)^n$ and thus TS satisfies $R_T = O(n \sqrt{T \log T})$ for any action set. Furthermore one can show that there exists an action set and a prior such that for any strategy one has $R_T = \Omega(n \sqrt{T})$; see Dani, Hayes and Kakade [2008], Rusmevichientong and Tsitsiklis [2010], and Audibert, Bubeck and Lugosi [2011, 2014].
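To illustrate point 2, here is a minimal Thompson Sampling sketch for the i.i.d. linear model under an assumed Gaussian prior $\theta \sim \mathcal{N}(0, I_n)$ and Gaussian observation noise, so the posterior is Gaussian in closed form; each round then costs one posterior sample plus one linear optimization over $\mathcal{A}$. The instance (action set, noise level, horizon) is made up:

```python
import numpy as np

rng = np.random.default_rng(2)

n, T, sigma = 3, 500, 0.1
A = rng.normal(size=(20, n))          # finite action set, one action per row
theta = rng.normal(size=n)            # true parameter, unknown to the player

prec = np.eye(n)                      # posterior precision (prior: theta ~ N(0, I))
b = np.zeros(n)                       # accumulates a_t * Y_t / sigma^2
for t in range(T):
    cov = np.linalg.inv(prec)
    theta_s = rng.multivariate_normal(cov @ b, cov)  # one posterior sample
    a_t = A[np.argmin(A @ theta_s)]   # one linear optimization over A
    y_t = theta @ a_t + sigma * rng.normal()
    prec += np.outer(a_t, a_t) / sigma**2            # conjugate Gaussian update
    b += a_t * y_t / sigma**2

print("mean loss of final arm vs best:", theta @ a_t, (A @ theta).min())
```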
Adversarial linear bandit after Dani, Hayes, Kakade [2008]

Recall from Part 1 that exponential weights satisfies, for any $\tilde{\ell}_t$ such that $\mathbb{E}\, \tilde{\ell}_t(i) = \ell_t(i)$ and $\tilde{\ell}_t(i) \geq 0$,

$$R_T \leq \frac{\max_i \mathrm{Ent}(\delta_i \,\|\, p_1)}{\eta} + \frac{\eta}{2} \, \mathbb{E} \sum_t \mathbb{E}_{I \sim p_t}\, \tilde{\ell}_t(I)^2 .$$
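As a reminder of where this bound is used, here is a minimal sketch of exponential weights run with the nonnegative unbiased estimator $\tilde{\ell}_t(i) = \ell_t(i) \mathbb{1}\{I_t = i\} / p_t(i)$ from Part 1 (i.e. Exp3 on the $n$-armed case); the adversary and parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

n, T, eta = 10, 5000, 0.05
losses = rng.uniform(0.0, 1.0, size=(T, n))   # made-up oblivious adversary

log_w = np.zeros(n)                           # log-weights; p_1 is uniform
played = 0.0
for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()                              # p_t, exponential weights
    i = rng.choice(n, p=p)                    # play I_t ~ p_t, observe l_t(I_t)
    played += losses[t, i]
    log_w[i] -= eta * losses[t, i] / p[i]     # tilde_l_t is zero off I_t: unbiased, >= 0

print("regret:", played - losses.sum(axis=0).min())
```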