Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part 2
Sébastien Bubeck, Theory Group
The linear bandit problem, Auer [2002]

Known parameters: compact action set $\mathcal{A} \subset \mathbb{R}^n$, adversary's action set $\mathcal{L} \subset \mathbb{R}^n$, number of rounds $T$.

Protocol: For each round $t = 1, 2, \ldots, T$, the adversary chooses a loss vector $\ell_t \in \mathcal{L}$ and simultaneously the player chooses $a_t \in \mathcal{A}$ based on past observations and receives a loss/observation $Y_t = \ell_t^\top a_t$.

$$R_T = \mathbb{E} \sum_{t=1}^{T} \ell_t^\top a_t - \min_{a \in \mathcal{A}} \mathbb{E} \sum_{t=1}^{T} \ell_t^\top a .$$

Other models: In the i.i.d. model we assume that there is some underlying $\theta \in \mathcal{L}$ such that $\mathbb{E}(Y_t \mid a_t) = \theta^\top a_t$. In the Bayesian model we assume that we have a prior distribution $\nu$ over the sequence $(\ell_1, \ldots, \ell_T)$ (in this case the expectation in $R_T$ is also over $(\ell_1, \ldots, \ell_T) \sim \nu$). Alternatively we could assume a prior over $\theta$.

Example: Part 1 was about $\mathcal{A} = \{e_1, \ldots, e_n\}$ and $\mathcal{L} = [0, 1]^n$.

Assumption: unless specified otherwise we assume $\mathcal{L} = \mathcal{A}^{\circ} := \{\ell : \sup_{a \in \mathcal{A}} |\ell^\top a| \leq 1\}$.
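To make the protocol concrete, here is a minimal simulation sketch in Python. The instance is made up: it uses the i.i.d. model with $\mathcal{A} = \{e_1, \ldots, e_n\}$ and $\mathcal{L} = [0,1]^n$ from Part 1, and a player who picks actions uniformly at random, so it only illustrates the interaction loop and the regret computation, not a good strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

n, T = 5, 1000
actions = np.eye(n)                    # A = {e_1, ..., e_n}, one action per row
theta = rng.uniform(0.0, 1.0, size=n)  # i.i.d. model: E(Y_t | a_t) = theta^T a_t

cum_loss = 0.0
for t in range(T):
    a_t = actions[rng.integers(n)]     # player's choice (here: uniform, for illustration)
    y_t = theta @ a_t                  # loss/observation Y_t = l_t^T a_t (noiseless here)
    cum_loss += y_t

best_fixed = T * (actions @ theta).min()   # best fixed action in hindsight
print("regret estimate:", cum_loss - best_fixed)
```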
Example: path planning

[Figure: a layered graph whose edges carry losses $\ell_1, \ldots, \ell_n$ chosen by the adversary; the player selects a path from source to sink.]

Loss suffered: $\ell_2 + \ell_7 + \cdots + \ell_n$.

Feedback:
- Full Info: $\ell_1, \ell_2, \ldots, \ell_n$
- Semi-Bandit: $\ell_2, \ell_7, \ldots, \ell_n$
- Bandit: $\ell_2 + \ell_7 + \cdots + \ell_n$
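Path planning fits the linear bandit template by encoding each path as a 0/1 incidence vector over the $n$ edges, so that the loss of a path is exactly the linear form $\ell^\top a$. A toy sketch (the two-path graph below is made up for illustration):

```python
import numpy as np

# Made-up graph with n = 4 edges and two source-to-sink paths; each path is
# a 0/1 incidence vector over the edges, so its loss is the linear form l^T a.
n = 4
paths = {
    "top":    np.array([1, 0, 1, 0]),    # uses edges 1 and 3
    "bottom": np.array([0, 1, 0, 1]),    # uses edges 2 and 4
}
losses = np.array([0.2, 0.5, 0.1, 0.3])  # adversary's per-edge losses l_1, ..., l_n

for name, a in paths.items():
    # Bandit feedback reveals only this total; semi-bandit would reveal
    # losses[a == 1]; full information would reveal all of losses.
    print(name, "suffers", losses @ a)
```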
Thompson Sampling for linear bandit after RVR14

Assume $\mathcal{A} = \{a_1, \ldots, a_{|\mathcal{A}|}\}$. Recall from Part 1 that TS satisfies

$$\sum_i \pi_t(i) \big( \bar{\ell}_t(i) - \bar{\ell}_t(i, i) \big) \leq C \sqrt{\sum_{i,j} \pi_t(i) \pi_t(j) \big( \bar{\ell}_t(i, j) - \bar{\ell}_t(i) \big)^2} \;\Rightarrow\; R_T \leq C \sqrt{T \log(|\mathcal{A}|) / 2},$$

where $\bar{\ell}_t(i) = \mathbb{E}_t\, \ell_t(i)$ and $\bar{\ell}_t(i, j) = \mathbb{E}_t\big(\ell_t(i) \mid i^* = j\big)$.

Writing $\bar{\ell}_t(i) = a_i^\top \bar{\ell}_t$, $\bar{\ell}_t(i, j) = a_i^\top \bar{\ell}_t^{\,j}$, and $M_{i,j} = \sqrt{\pi_t(i) \pi_t(j)}\; a_i^\top (\bar{\ell}_t - \bar{\ell}_t^{\,j})$, we want to show that $\mathrm{Tr}(M) \leq C \, \|M\|_F$.

Using the eigenvalue formula for the trace and the Frobenius norm one can see that $\mathrm{Tr}(M)^2 \leq \mathrm{rank}(M) \, \|M\|_F^2$. Moreover the rank of $M$ is at most $n$ since $M = U V^\top$ where $U, V \in \mathbb{R}^{|\mathcal{A}| \times n}$ (the $i$-th row of $U$ is $\sqrt{\pi_t(i)}\, a_i$ and for $V$ it is $\sqrt{\pi_t(i)} \, (\bar{\ell}_t - \bar{\ell}_t^{\,i})$). Thus one can take $C = \sqrt{n}$.
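The rank argument is easy to check numerically. The sketch below builds $M = U V^\top$ from random hypothetical values of $\pi_t$, the $a_i$, $\bar{\ell}_t$, and the $\bar{\ell}_t^{\,j}$, and verifies both $\mathrm{rank}(M) \leq n$ and $|\mathrm{Tr}(M)| \leq \sqrt{n}\, \|M\|_F$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random hypothetical instance: |A| actions in dimension n.
num_actions, n = 50, 3
pi = rng.dirichlet(np.ones(num_actions))   # pi_t, a distribution over actions
A = rng.normal(size=(num_actions, n))      # the actions a_i as rows
l_bar = rng.normal(size=n)                 # bar l_t
l_j = rng.normal(size=(num_actions, n))    # bar l_t^j, one row per j

U = np.sqrt(pi)[:, None] * A               # i-th row: sqrt(pi_t(i)) a_i
V = np.sqrt(pi)[:, None] * (l_bar - l_j)   # i-th row: sqrt(pi_t(i)) (bar l_t - bar l_t^i)
M = U @ V.T                                # M_ij = sqrt(pi_t(i) pi_t(j)) a_i^T (bar l_t - bar l_t^j)

tr, fro = np.trace(M), np.linalg.norm(M, "fro")
print(np.linalg.matrix_rank(M) <= n)       # True: M = U V^T has rank at most n
print(abs(tr) <= np.sqrt(n) * fro)         # True: Tr(M)^2 <= rank(M) ||M||_F^2
```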
Thompson Sampling for linear bandit after RVR14

1. TS satisfies $R_T \leq \sqrt{n T \log(|\mathcal{A}|)}$. To appreciate the improvement recall that without the linear structure one would get a regret of order $\sqrt{|\mathcal{A}| \, T}$, and that $|\mathcal{A}|$ can be exponential in the dimension $n$ (think of the path planning example).

2. Provided that one can efficiently sample from the posterior on $\ell_t$ (or on $\theta$), TS just requires at each step one linear optimization over $\mathcal{A}$ (see the sketch after this list).

3. The TS regret bound is optimal in the following sense. W.l.o.g. one can assume $|\mathcal{A}| \leq (10 T)^n$ and thus TS satisfies $R_T = O(n \sqrt{T \log T})$ for any action set. Furthermore one can show that there exists an action set and a prior such that for any strategy one has $R_T = \Omega(n \sqrt{T})$; see Dani, Hayes and Kakade [2008], Rusmevichientong and Tsitsiklis [2010], and Audibert, Bubeck and Lugosi [2011, 2014].
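To illustrate point 2, here is a minimal Thompson Sampling sketch for the i.i.d. linear model under an assumed Gaussian prior $\theta \sim \mathcal{N}(0, I_n)$ and Gaussian observation noise, so the posterior is Gaussian in closed form; each round then costs one posterior sample plus one linear optimization over $\mathcal{A}$. The instance (action set, noise level, horizon) is made up:

```python
import numpy as np

rng = np.random.default_rng(2)

n, T, sigma = 3, 500, 0.1
A = rng.normal(size=(20, n))          # finite action set, one action per row
theta = rng.normal(size=n)            # true parameter, unknown to the player

prec = np.eye(n)                      # posterior precision (prior: theta ~ N(0, I))
b = np.zeros(n)                       # accumulates a_t * Y_t / sigma^2
for t in range(T):
    cov = np.linalg.inv(prec)
    theta_s = rng.multivariate_normal(cov @ b, cov)  # one posterior sample
    a_t = A[np.argmin(A @ theta_s)]   # one linear optimization over A
    y_t = theta @ a_t + sigma * rng.normal()
    prec += np.outer(a_t, a_t) / sigma**2            # conjugate Gaussian update
    b += a_t * y_t / sigma**2

print("mean loss of final arm vs best:", theta @ a_t, (A @ theta).min())
```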
Adversarial linear bandit after Dani, Hayes, Kakade [2008]

Recall from Part 1 that exponential weights satisfies, for any $\tilde{\ell}_t$ such that $\mathbb{E}\, \tilde{\ell}_t(i) = \ell_t(i)$ and $\tilde{\ell}_t(i) \geq 0$,

$$R_T \leq \frac{\max_i \mathrm{Ent}(\delta_i \,\|\, p_1)}{\eta} + \frac{\eta}{2} \, \mathbb{E} \sum_t \mathbb{E}_{I \sim p_t}\, \tilde{\ell}_t(I)^2 .$$
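As a reminder of where this bound is used, here is a minimal sketch of exponential weights run with the nonnegative unbiased estimator $\tilde{\ell}_t(i) = \ell_t(i) \mathbb{1}\{I_t = i\} / p_t(i)$ from Part 1 (i.e. Exp3 on the $n$-armed case); the adversary and parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

n, T, eta = 10, 5000, 0.05
losses = rng.uniform(0.0, 1.0, size=(T, n))   # made-up oblivious adversary

log_w = np.zeros(n)                           # log-weights; p_1 is uniform
played = 0.0
for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()                              # p_t, exponential weights
    i = rng.choice(n, p=p)                    # play I_t ~ p_t, observe l_t(I_t)
    played += losses[t, i]
    log_w[i] -= eta * losses[t, i] / p[i]     # tilde_l_t is zero off I_t: unbiased, >= 0

print("regret:", played - losses.sum(axis=0).min())
```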