

  1. An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting
     Cem Kalkanlı, Ayfer Özgür
     Stanford University
     ISIT, June 2020

  2. The Gaussian Linear Bandit Problem
     Compact action set $\mathcal{U}$: $\|u\|_2 \le c$ for any $u \in \mathcal{U}$
     Reward at time $t$: $Y_{u_t} = \theta^T u_t + \eta_t$, where $\theta \in \mathbb{R}^d$ with $\theta \sim \mathcal{N}(\mu, K)$, and $\eta_t \in \mathbb{R}$ with $\eta_t \sim \mathcal{N}(0, \sigma^2)$
     Optimal action and reward: $u^* = \arg\max_{u \in \mathcal{U}} \theta^T u$ and $Y_{u^*,t} = \theta^T u^* + \eta_t$
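As a concrete reference point, here is a minimal numpy sketch of this reward model, assuming a finite action set of vectors with $\|u\|_2 \le c$; the class name `GaussianLinearBandit` and its interface are illustrative choices, not from the talk.

```python
import numpy as np

class GaussianLinearBandit:
    """Reward model of slide 2: Y_{u_t} = theta^T u_t + eta_t."""

    def __init__(self, mu, K, sigma, rng):
        self.rng = rng
        self.theta = rng.multivariate_normal(mu, K)  # theta ~ N(mu, K), fixed for the run
        self.sigma = sigma                           # noise std: eta_t ~ N(0, sigma^2)

    def reward(self, u):
        # Noisy linear reward observed after playing action u
        return self.theta @ u + self.sigma * self.rng.standard_normal()

    def optimal_action(self, actions):
        # u* = argmax_{u in U} theta^T u, for a finite U stacked row-wise
        return actions[np.argmax(actions @ self.theta)]
```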

  3. A Policy and the Performance Criterion
     Past $t-1$ observations: $\mathcal{H}_{t-1} = \{u_1, Y_{u_1}, \ldots, u_{t-1}, Y_{u_{t-1}}\}$, with $\mathcal{H}_0 = \emptyset$
     A policy $\pi = (\pi_1, \pi_2, \pi_3, \ldots)$ selects actions via $\mathbb{P}(u_t \in \cdot \mid \mathcal{H}_{t-1}) = \pi_t(\mathcal{H}_{t-1})(\cdot)$
     The performance criterion for a policy $\pi$ is its Bayesian regret: $R(T, \pi) = \sum_{t=1}^{T} \mathbb{E}[Y_{u^*,t} - Y_{u_t}]$
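Since the Bayesian regret averages over both $\theta$ and the noise, it can be estimated by Monte Carlo. A hedged sketch, reusing the environment above; the `policy` factory with `select`/`update` methods is my own convention, not the talk's. The noise terms cancel in expectation, so only $\theta^T(u^* - u_t)$ is accumulated.

```python
def bayes_regret(policy, actions, mu, K, sigma, T, n_mc, seed=0):
    """Monte Carlo estimate of R(T, pi) = sum_{t=1}^T E[Y_{u*,t} - Y_{u_t}]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_mc):
        env = GaussianLinearBandit(mu, K, sigma, rng)   # fresh theta ~ N(mu, K)
        u_star = env.optimal_action(actions)
        agent = policy(mu, K, sigma, actions, rng)      # fresh policy state per run
        for _ in range(T):
            u = agent.select()
            agent.update(u, env.reward(u))
            total += env.theta @ (u_star - u)           # per-step expected gap
    return total / n_mc
```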

  4. Posterior of θ
     Claim: $\theta \mid \mathcal{H}_t \sim \mathcal{N}(\mu_t, K_t)$ for any non-negative integer $t$, where $\mu_t = \mathbb{E}[\theta \mid \mathcal{H}_t]$ and $K_t = \mathbb{E}[(\theta - \mathbb{E}[\theta \mid \mathcal{H}_t])(\theta - \mathbb{E}[\theta \mid \mathcal{H}_t])^T \mid \mathcal{H}_t]$
     Proof idea: assume $\theta \mid \mathcal{H}_{t-1} \sim \mathcal{N}(\mu_{t-1}, K_{t-1})$; then $\theta$ is independent of $u_t$ given $\mathcal{H}_{t-1}$, and $(\theta, Y_{u_t})$ is a Gaussian random vector given $\{\mathcal{H}_{t-1}, u_t\}$
     Result: $\theta \mid \mathcal{H}_t \sim \mathcal{N}(\mu_t, K_t)$
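The slide states the conjugacy claim without the closed-form recursion. Assuming the standard Gaussian conjugate update (a rank-one, Kalman-style formula that follows from $(\theta, Y_{u_t})$ being jointly Gaussian; the exact expression is not on the slide), one step looks like:

```python
def posterior_update(mu_prev, K_prev, u, y, sigma):
    """One step of theta | H_t ~ N(mu_t, K_t) after observing (u, y)."""
    Ku = K_prev @ u                    # K_{t-1} u
    s = u @ Ku + sigma**2              # predictive variance of Y_u given H_{t-1}, u
    gain = Ku / s                      # Kalman-style gain vector
    mu_t = mu_prev + gain * (y - mu_prev @ u)   # shift mean toward the observation
    K_t = K_prev - np.outer(gain, Ku)           # shrink covariance along u
    return mu_t, K_t
```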

  5. Thompson Sampling
     Proposed by Thompson (1933)
     Posterior matching: $\mathbb{P}(u_t \in B \mid \mathcal{H}_{t-1}) = \mathbb{P}(u^* \in B \mid \mathcal{H}_{t-1})$
     Strong empirical performance in online services, display advertising, and online revenue management

  6. Thompson Sampling for the Gaussian Linear Bandit
     Implementation at each round $t$:
     Select $u_t$: (1) sample $\hat{\theta}_t \sim \mathcal{N}(\mu_{t-1}, K_{t-1})$; (2) set $u_t = \arg\max_{u \in \mathcal{U}} \hat{\theta}_t^T u$
     Compute the posterior of $\theta$ given $\mathcal{H}_t$: $\mu_t \leftarrow \mathbb{E}[\theta \mid \mathcal{H}_t]$, $K_t \leftarrow \mathbb{E}[(\theta - \mathbb{E}[\theta \mid \mathcal{H}_t])(\theta - \mathbb{E}[\theta \mid \mathcal{H}_t])^T \mid \mathcal{H}_t]$
     Notation: Thompson sampling is denoted $\pi^{TS}$; its Bayesian regret is $R(T, \pi^{TS})$
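A minimal sketch of this loop for a finite action set, reusing `posterior_update` from above; the class name and interface are illustrative, not the authors' implementation.

```python
class ThompsonSampling:
    """pi^TS: sample theta_hat from the posterior, act greedily, then update."""

    def __init__(self, mu, K, sigma, actions, rng):
        self.mu, self.K = mu.copy(), K.copy()
        self.sigma, self.actions, self.rng = sigma, actions, rng

    def select(self):
        # 1) Sample theta_hat ~ N(mu_{t-1}, K_{t-1})
        theta_hat = self.rng.multivariate_normal(self.mu, self.K)
        # 2) Play u_t = argmax_{u in U} theta_hat^T u
        return self.actions[np.argmax(self.actions @ theta_hat)]

    def update(self, u, y):
        # Refresh the posterior; sampling from it is what yields posterior matching
        self.mu, self.K = posterior_update(self.mu, self.K, u, y, self.sigma)
```

With the earlier sketch, `bayes_regret(ThompsonSampling, actions, mu, K, sigma, T, n_mc)` then estimates $R(T, \pi^{TS})$.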

  7. Prior Work
     Lower bound: $R(T, \pi) \gtrsim \sqrt{T}$ for any policy $\pi$ in a certain Gaussian linear bandit setting (Rusmevichientong & Tsitsiklis, 2010)
     Thompson sampling:
     1. $R(T, \pi^{TS}) \lesssim \log(T)\sqrt{T}$ (Russo & Van Roy, 2014)
     2. $R(T, \pi^{TS}) \lesssim \sqrt{T}$ when $|\mathcal{U}| < \infty$ (Russo & Van Roy, 2016)
     3. $R(T, \pi^{TS}) \lesssim \sqrt{T \log(T)}$ when $\theta$ and $\mathcal{U}$ are bounded, a setting that does not include the Gaussian linear bandit (Dong & Van Roy, 2018)

  8. Main Result
     Theorem: The Bayesian regret of Thompson sampling in the Gaussian linear bandit setup satisfies
     $R(T, \pi^{TS}) \le d\sqrt{T(\sigma^2 + c^2 \mathrm{Tr}(K)) \log\left(1 + \frac{T}{d}\right)}$
     Within a factor of $\sqrt{\log(T)}$ of the $\Omega(\sqrt{T})$ lower bound (Rusmevichientong & Tsitsiklis, 2010)
     Improves the state-of-the-art upper bound by a factor of $\sqrt{\log(T)}$ for action sets with infinitely many elements (previous bound: $O(\log(T)\sqrt{T})$ by Russo & Van Roy, 2014)
     Same $T$ dependency as the bound of Dong & Van Roy (2018), even though $\theta$ here has unbounded support, unlike in their setting
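To give the theorem a concrete scale, a quick numeric reading of the bound as reconstructed above, with arbitrary example parameters (these values are illustrative, not from the talk):

```python
d, sigma, c, T = 5, 1.0, 1.0, 10_000
K = np.eye(d)                                  # prior covariance, Tr(K) = d here
bound = d * np.sqrt(T * (sigma**2 + c**2 * np.trace(K)) * np.log(1 + T / d))
print(f"regret bound ~ {bound:.0f}, i.e. sublinear versus T = {T}")
```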

  9. A Cauchy–Schwarz Type Inequality
     Proposition: Let $X_1$ and $X_2$ be arbitrary i.i.d. $\mathbb{R}^m$-valued random variables and let $f_1, f_2: \mathbb{R}^m \to \mathbb{R}^d$ be measurable maps with $\mathbb{E}[\|f_1(X_1)\|_2^2], \mathbb{E}[\|f_2(X_1)\|_2^2] < \infty$. Then
     $|\mathbb{E}[f_1(X_1)^T f_2(X_1)]| \le \sqrt{d\, \mathbb{E}[(f_1(X_1)^T f_2(X_2))^2]}$
     Reduces to the Cauchy–Schwarz inequality when $d = 1$; a similar statement holds when $d > 1$
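A quick Monte Carlo sanity check of the proposition; the particular maps $f_1, f_2$, dimensions, and sample sizes are arbitrary choices of mine, with both expectations estimated from i.i.d. samples:

```python
rng = np.random.default_rng(0)
m, d, n = 3, 4, 200_000
A1, A2 = rng.standard_normal((d, m)), rng.standard_normal((d, m))
f1 = lambda x: np.tanh(x @ A1.T)     # a measurable map R^m -> R^d
f2 = lambda x: (x**2) @ A2.T         # another one, with finite second moments
X1, X2 = rng.standard_normal((n, m)), rng.standard_normal((n, m))  # i.i.d. samples
lhs = abs(np.mean(np.sum(f1(X1) * f2(X1), axis=1)))               # |E[f1(X1)^T f2(X1)]|
rhs = np.sqrt(d * np.mean(np.sum(f1(X1) * f2(X2), axis=1) ** 2))  # sqrt(d E[(f1(X1)^T f2(X2))^2])
print(lhs <= rhs, lhs, rhs)          # the proposition predicts lhs <= rhs
```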

  10. Single Step Regret
     Lemma: Let $G > 0$ be such that $G \ge \mathrm{Tr}(K)$. Then
     $\mathbb{E}[Y_{u^*,1} - Y_{u_1}] \le \sqrt{d(\sigma^2 + c^2 G)\, \mathbb{E}\left[\log\left(1 + \frac{u_1^T K u_1}{\sigma^2 + c^2 G}\right)\right]}$
     Mutual information decomposition: $I(\theta; u_1, Y_{u_1}) = I(\theta; u_1) + I(\theta; Y_{u_1} \mid u_1) = \mathbb{E}_{u \sim u_1}[I(\theta; Y_u)]$
     Since $\theta$ and $Y_u$ are jointly Gaussian random variables: $\mathbb{E}_{u \sim u_1}[I(\theta; Y_u)] = \mathbb{E}\left[\log\left(1 + \frac{u_1^T K u_1}{\sigma^2}\right)\right]$
     Similar to the information ratio concept used by Russo & Van Roy (2016) and Dong & Van Roy (2018), but using the mutual information as is instead of a discrete entropy term

  11. Proof of the Lemma
     1. By the earlier proposition:
        $\mathbb{E}[Y_{u^*,1} - Y_{u_1}] = \mathbb{E}[(\theta - \mu)^T u^*] \le \sqrt{d\, \mathbb{E}[((\theta - \mu)^T u_1)^2]} = \sqrt{d\, \mathbb{E}[u_1^T K u_1]}$
     2. Since $u_1^T K u_1 \le \sigma^2 + c^2 \mathrm{Tr}(K) \le \sigma^2 + c^2 G$ and $x \le \log(1 + x)$ for any $x \in [0, 1]$:
        $u_1^T K u_1 = (\sigma^2 + c^2 G)\,\frac{u_1^T K u_1}{\sigma^2 + c^2 G} \le (\sigma^2 + c^2 G)\log\left(1 + \frac{u_1^T K u_1}{\sigma^2 + c^2 G}\right)$

  12. An Overview of the Main Theorem's Proof
     1. Apply the lemma conditionally on $\mathcal{H}_{t-1}$, with $G = \mathrm{Tr}(K)$ (valid since $\mathrm{Tr}(K_{t-1}) \le \mathrm{Tr}(K)$):
        $\mathbb{E}[Y_{u^*,t} - Y_{u_t} \mid \mathcal{H}_{t-1}] \le \sqrt{d(\sigma^2 + c^2 \mathrm{Tr}(K))\, \mathbb{E}\left[\log\left(1 + \frac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2 \mathrm{Tr}(K)}\right) \,\middle|\, \mathcal{H}_{t-1}\right]}$
        By Jensen's inequality,
        $\mathbb{E}[Y_{u^*,t} - Y_{u_t}] \le \sqrt{d(\sigma^2 + c^2 \mathrm{Tr}(K))\, \mathbb{E}\left[\log\left(1 + \frac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2 \mathrm{Tr}(K)}\right)\right]}$

  13. An Overview of the Main Theorem's Proof (cont.)
     2. Overall bound on the Bayesian regret, by Cauchy–Schwarz over the $T$ rounds:
        $\sum_{t=1}^{T} \mathbb{E}[Y_{u^*,t} - Y_{u_t}] \le \sqrt{Td(\sigma^2 + c^2 \mathrm{Tr}(K))\, \mathbb{E}\left[\sum_{t=1}^{T} \log\left(1 + \frac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2 \mathrm{Tr}(K)}\right)\right]}$
     3. Show that $\sum_{t=1}^{T} \log\left(1 + \frac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2 \mathrm{Tr}(K)}\right) \le d \log\left(1 + \frac{T}{d}\right)$:
        $1 + \frac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2 \mathrm{Tr}(K)} \le 1 + \frac{u_t^T \left(K^{-1} + \frac{1}{\sigma^2 + c^2 \mathrm{Tr}(K)} \sum_{i=1}^{t-1} u_i u_i^T\right)^{-1} u_t}{\sigma^2 + c^2 \mathrm{Tr}(K)} = \frac{\det\left(K^{-1} + \frac{1}{\sigma^2 + c^2 \mathrm{Tr}(K)} \sum_{i=1}^{t} u_i u_i^T\right)}{\det\left(K^{-1} + \frac{1}{\sigma^2 + c^2 \mathrm{Tr}(K)} \sum_{i=1}^{t-1} u_i u_i^T\right)}$
        The determinant ratio telescopes over $t$, which yields the $d \log(1 + T/d)$ bound
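Step 3 can be checked numerically: the middle quantity in the chain equals the determinant ratio (matrix determinant lemma) and telescopes into a log-det difference. A sketch with arbitrary dimensions and actions placed on the sphere $\|u\|_2 = c$ (all choices mine):

```python
rng = np.random.default_rng(1)
d, T, c, sigma = 4, 500, 1.0, 1.0
K = np.eye(d)
s = sigma**2 + c**2 * np.trace(K)                      # sigma^2 + c^2 Tr(K)
U = rng.standard_normal((T, d))
U = c * U / np.linalg.norm(U, axis=1, keepdims=True)   # actions with ||u||_2 = c
M = np.linalg.inv(K)                                   # running K^{-1} + (1/s) sum_i u_i u_i^T
total = 0.0
for u in U:
    # log(1 + u^T M^{-1} u / s) is exactly the log-det ratio on the slide, and it
    # upper-bounds log(1 + u^T K_{t-1} u / s) since K_{t-1} <= M^{-1} (as s >= sigma^2)
    total += np.log(1 + u @ np.linalg.solve(M, u) / s)
    M += np.outer(u, u) / s
print(total <= d * np.log(1 + T / d), total, d * np.log(1 + T / d))
```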
