An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting

Cem Kalkanlı, Ayfer Özgür
Stanford University
ISIT, June 2020
The Gaussian Linear Bandit Problem

- Compact action set $\mathcal{U}$: $\|u\|_2 \le c$ for any $u \in \mathcal{U}$
- Reward at time $t$: $Y_{u_t} = \theta^T u_t + \eta_t$, where $\theta \in \mathbb{R}^d$, $\theta \sim \mathcal{N}(\mu, K)$, $\eta_t \sim \mathcal{N}(0, \sigma^2)$, $\eta_t \in \mathbb{R}$
- Optimal action and reward: $u^* = \arg\max_{u \in \mathcal{U}} \theta^T u$, $\quad Y_{u^*,t} = \theta^T u^* + \eta_t$
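A minimal simulation sketch of this setup (not part of the talk): the dimension, noise level, and the finite set of directions standing in for the compact action set $\mathcal{U}$ are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, sigma = 3, 1.0, 0.5                   # illustrative dimension, norm bound, noise level
mu, K = np.zeros(d), np.eye(d)              # prior: theta ~ N(mu, K)
theta = rng.multivariate_normal(mu, K)      # hidden parameter, drawn once from the prior

# A finite set of directions on the sphere ||u||_2 = c, standing in for the compact set U
U = [c * v / np.linalg.norm(v) for v in rng.standard_normal((50, d))]

def reward(u):
    """Noisy linear reward Y_u = theta^T u + eta, with eta ~ N(0, sigma^2)."""
    return theta @ u + sigma * rng.standard_normal()

u_star = max(U, key=lambda u: theta @ u)    # optimal action for this draw of theta
```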
A Policy and the Performance Criterion

- Past $t-1$ observations: $\mathcal{H}_{t-1} = \{u_1, Y_{u_1}, \ldots, u_{t-1}, Y_{u_{t-1}}\}$, $\quad \mathcal{H}_0 = \emptyset$
- A policy $\pi = (\pi_1, \pi_2, \pi_3, \ldots)$: $\mathbb{P}(u_t \in \cdot \mid \mathcal{H}_{t-1}) = \pi_t(\mathcal{H}_{t-1})(\cdot)$
- The performance criterion for the policy $\pi$, the Bayesian regret: $R(T, \pi) = \sum_{t=1}^{T} \mathbb{E}[Y_{u^*,t} - Y_{u_t}]$
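Continuing the sketch above, the Bayesian regret of a simple baseline policy (picking actions uniformly at random) can be estimated by Monte Carlo. The helper below is hypothetical and reuses the names from the previous sketch; redrawing $\theta$ from the prior in each run reflects the outer expectation in $R(T, \pi)$, and the noise $\eta_t$ cancels inside the expectation.

```python
def bayes_regret_uniform(T, n_runs=2000):
    """Monte-Carlo estimate of R(T, pi) for a policy that picks u_t uniformly from U.

    Reuses mu, K, U and rng from the sketch above. Each run redraws theta from the
    prior (the outer expectation in the Bayesian regret); the noise eta_t cancels
    inside E[Y_{u*,t} - Y_{u_t}], so only theta^T (u* - u_t) is accumulated.
    """
    total = 0.0
    for _ in range(n_runs):
        th = rng.multivariate_normal(mu, K)
        best = max(U, key=lambda u: th @ u)
        picks = (U[i] for i in rng.integers(len(U), size=T))
        total += sum(th @ (best - u) for u in picks)
    return total / n_runs
```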
Posterior of $\theta$

Claim: $\theta \mid \mathcal{H}_t \sim \mathcal{N}(\mu_t, K_t)$ for any non-negative integer $t$, where
$\mu_t = \mathbb{E}[\theta \mid \mathcal{H}_t]$, $\quad K_t = \mathbb{E}[(\theta - \mathbb{E}[\theta \mid \mathcal{H}_t])(\theta - \mathbb{E}[\theta \mid \mathcal{H}_t])^T \mid \mathcal{H}_t]$

- Assume $\theta \mid \mathcal{H}_{t-1} \sim \mathcal{N}(\mu_{t-1}, K_{t-1})$
- $\theta$ is independent of $u_t$ given $\mathcal{H}_{t-1}$
- $(\theta, Y_{u_t})$ is a Gaussian random vector given $\{\mathcal{H}_{t-1}, u_t\}$
- Result: $\theta \mid \mathcal{H}_t \sim \mathcal{N}(\mu_t, K_t)$
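The claim translates into the standard Gaussian (Bayesian linear regression) update. Below is a minimal sketch of one posterior step in rank-one form; the function name and the rank-one formulation are implementation choices, not something prescribed by the talk.

```python
import numpy as np

def posterior_update(mu_prev, K_prev, u, y, sigma2):
    """One Gaussian posterior step: given theta | H_{t-1} ~ N(mu_prev, K_prev) and an
    observation y = theta^T u + eta with eta ~ N(0, sigma2), return (mu_t, K_t)."""
    Ku = K_prev @ u
    s = sigma2 + u @ Ku                            # predictive variance of y given H_{t-1}, u
    mu_t = mu_prev + Ku * (y - u @ mu_prev) / s    # mean shifts toward the observation
    K_t = K_prev - np.outer(Ku, Ku) / s            # rank-one shrinkage of the covariance
    return mu_t, K_t
```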
Thompson Sampling

- Proposed by Thompson (1933)
- Posterior matching: $\mathbb{P}(u_t \in B \mid \mathcal{H}_{t-1}) = \mathbb{P}(u^* \in B \mid \mathcal{H}_{t-1})$
- Significant empirical performance in online services, display advertising, and online revenue management
Thompson Sampling for the Gaussian Linear Bandit

Implementation:
1. Select $u_t$:
   - Sample $\hat{\theta}_t \sim \mathcal{N}(\mu_{t-1}, K_{t-1})$
   - $u_t = \arg\max_{u \in \mathcal{U}} \hat{\theta}_t^T u$
2. Compute the posterior of $\theta$ given $\mathcal{H}_t$:
   - $\mu_t \leftarrow \mathbb{E}[\theta \mid \mathcal{H}_t]$
   - $K_t \leftarrow \mathbb{E}[(\theta - \mathbb{E}[\theta \mid \mathcal{H}_t])(\theta - \mathbb{E}[\theta \mid \mathcal{H}_t])^T \mid \mathcal{H}_t]$

Keywords:
- Thompson sampling: $\pi^{TS}$
- The Bayesian regret of Thompson sampling: $R(T, \pi^{TS})$
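A hedged sketch of the full loop, reusing the environment and the posterior_update helper from the earlier sketches; the horizon and the regret bookkeeping are illustrative only.

```python
def run_thompson_sampling(T=200):
    """Thompson sampling on the environment sketched earlier.

    Reuses theta, U, reward, mu, K, sigma, u_star, rng and posterior_update from the
    previous sketches. Returns the final posterior and the realized cumulative
    pseudo-regret sum_t theta^T (u* - u_t), tracked only for inspection.
    """
    mu_t, K_t = mu.copy(), K.copy()
    regret = 0.0
    for _ in range(T):
        theta_hat = rng.multivariate_normal(mu_t, K_t)   # sample from the current posterior
        u_t = max(U, key=lambda u: theta_hat @ u)        # act greedily on the sample
        y_t = reward(u_t)                                # observe Y_{u_t}
        mu_t, K_t = posterior_update(mu_t, K_t, u_t, y_t, sigma**2)
        regret += theta @ (u_star - u_t)
    return mu_t, K_t, regret
```

Averaging the returned regret over many independent runs (with $\theta$ redrawn from the prior each time) gives a Monte-Carlo approximation of $R(T, \pi^{TS})$.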
Prior Work

- Lower bound: $R(T, \pi) \gtrsim \sqrt{T}$ for any policy $\pi$ in a certain Gaussian linear bandit setting (Rusmevichientong & Tsitsiklis, 2010)
- Thompson sampling:
  1. $R(T, \pi^{TS}) \lesssim \log(T)\sqrt{T}$ (Russo & Van Roy, 2014)
  2. $R(T, \pi^{TS}) \lesssim \sqrt{T}$ when $|\mathcal{U}| < \infty$ (Russo & Van Roy, 2016)
  3. $R(T, \pi^{TS}) \lesssim \sqrt{T\log(T)}$ when $\theta$ and $\mathcal{U}$ are bounded, not including the Gaussian linear bandit (Dong & Van Roy, 2018)
Main Result

Theorem. The Bayesian regret of Thompson sampling in the Gaussian linear bandit setup satisfies
$R(T, \pi^{TS}) \le d\sqrt{T(\sigma^2 + c^2\,\mathrm{Tr}(K))\log\big(1 + \tfrac{T}{d}\big)}$.

- Within $\sqrt{\log(T)}$ of optimality compared with the lower bound of $\Omega(\sqrt{T})$ (Rusmevichientong & Tsitsiklis, 2010)
- Improves the state-of-the-art upper bound by an order of $\sqrt{\log(T)}$ for the case of an action set with infinitely many elements (previous bound: $O(\log(T)\sqrt{T})$ by Russo & Van Roy (2014))
- Same $T$ dependency as the bound of Dong & Van Roy (2018), even though $\theta$ here has unbounded support, unlike in their setting
Cauchy–Schwarz Type Inequality

Proposition. Let $X_1$ and $X_2$ be arbitrary i.i.d. $\mathbb{R}^m$-valued random variables and $f_1, f_2 : \mathbb{R}^m \to \mathbb{R}^d$ measurable maps with $\mathbb{E}[\|f_1(X_1)\|_2^2], \mathbb{E}[\|f_2(X_1)\|_2^2] < \infty$. Then
$|\mathbb{E}[f_1(X_1)^T f_2(X_1)]| \le \sqrt{d\,\mathbb{E}[(f_1(X_1)^T f_2(X_2))^2]}$.

- Reduces to the Cauchy–Schwarz inequality when $d = 1$
- A similar statement holds when $d > 1$
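A quick numerical illustration of the proposition (not from the paper): the maps $f_1, f_2$ below are arbitrary choices, and both sides are Monte-Carlo estimates, so the comparison holds only up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 2, 3, 200_000                      # illustrative dimensions and sample size

# Arbitrary measurable maps f1, f2 : R^m -> R^d, chosen only for illustration
def f1(x):
    return np.stack([x[:, 0], x[:, 1], x[:, 0] * x[:, 1]], axis=1)

def f2(x):
    return np.stack([x[:, 0] + 0.5, x[:, 1] ** 2, x[:, 0] * x[:, 1]], axis=1)

X1 = rng.standard_normal((n, m))             # i.i.d. samples of X_1
X2 = rng.standard_normal((n, m))             # ... and of X_2, independent of X_1

lhs = abs(np.mean(np.sum(f1(X1) * f2(X1), axis=1)))
rhs = np.sqrt(d * np.mean(np.sum(f1(X1) * f2(X2), axis=1) ** 2))
print(f"|E[f1(X1)^T f2(X1)]| ~ {lhs:.3f}  <=  sqrt(d E[(f1(X1)^T f2(X2))^2]) ~ {rhs:.3f}")
```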
Single-Step Regret

Lemma. Let $G > 0$ be such that $G \ge \mathrm{Tr}(K)$. Then
$\mathbb{E}[Y_{u^*,1} - Y_{u_1}] \le \sqrt{d(\sigma^2 + c^2 G)\,\mathbb{E}\big[\log\big(1 + \tfrac{u_1^T K u_1}{\sigma^2 + c^2 G}\big)\big]}$.

- $I(\theta; u_1, Y_{u_1}) = I(\theta; u_1) + I(\theta; Y_{u_1} \mid u_1) = \mathbb{E}_{u \sim u_1}[I(\theta; Y_u)]$, since $u_1$ depends only on the independent sample $\hat{\theta}_1$, so $I(\theta; u_1) = 0$
- $\theta$ and $Y_u$ are jointly Gaussian random variables, so $\mathbb{E}_{u \sim u_1}[I(\theta; Y_u)] = \tfrac{1}{2}\,\mathbb{E}\big[\log\big(1 + \tfrac{u_1^T K u_1}{\sigma^2}\big)\big]$
- Similar to the information ratio concept used by Russo & Van Roy (2016) and Dong & Van Roy (2018)
- Uses the mutual information directly, instead of a discrete entropy term
Proof of the Lemma

1. Use the earlier proposition:
$\mathbb{E}[Y_{u^*,1} - Y_{u_1}] = \mathbb{E}[(\theta - \mu)^T u^*] \le \sqrt{d\,\mathbb{E}[((\theta - \mu)^T u_1)^2]} = \sqrt{d\,\mathbb{E}[u_1^T K u_1]}$

2. Use $u_1^T K u_1 \le \sigma^2 + c^2\mathrm{Tr}(K) \le \sigma^2 + c^2 G$ and $x \le 2\log(1+x)$ for any $x \in [0,1]$:
$u_1^T K u_1 = (\sigma^2 + c^2 G)\,\tfrac{u_1^T K u_1}{\sigma^2 + c^2 G} \le 2(\sigma^2 + c^2 G)\log\big(1 + \tfrac{u_1^T K u_1}{\sigma^2 + c^2 G}\big)$
An Overview of the Main Theorem's Proof

1. Use the lemma:
$\mathbb{E}[Y_{u^*,t} - Y_{u_t} \mid \mathcal{H}_{t-1}] \le \sqrt{d(\sigma^2 + c^2\mathrm{Tr}(K))\,\mathbb{E}\big[\log\big(1 + \tfrac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2\mathrm{Tr}(K)}\big) \,\big|\, \mathcal{H}_{t-1}\big]}$

Jensen's inequality $\Rightarrow$
$\mathbb{E}[Y_{u^*,t} - Y_{u_t}] \le \sqrt{d(\sigma^2 + c^2\mathrm{Tr}(K))\,\mathbb{E}\big[\log\big(1 + \tfrac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2\mathrm{Tr}(K)}\big)\big]}$
An Overview of the Main Theorem's Proof (cont.)

2. Overall bound on the Bayesian regret:
$\sum_{t=1}^{T}\mathbb{E}[Y_{u^*,t} - Y_{u_t}] \le \sqrt{Td(\sigma^2 + c^2\mathrm{Tr}(K))}\;\mathbb{E}\Big[\sqrt{\textstyle\sum_{t=1}^{T}\log\big(1 + \tfrac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2\mathrm{Tr}(K)}\big)}\Big]$

3. Show that $\sum_{t=1}^{T}\log\big(1 + \tfrac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2\mathrm{Tr}(K)}\big) \le d\log\big(1 + \tfrac{T}{d}\big)$:
$1 + \tfrac{u_t^T K_{t-1} u_t}{\sigma^2 + c^2\mathrm{Tr}(K)} \le 1 + \tfrac{u_t^T\big(K^{-1} + \tfrac{1}{\sigma^2 + c^2\mathrm{Tr}(K)}\sum_{i=1}^{t-1} u_i u_i^T\big)^{-1} u_t}{\sigma^2 + c^2\mathrm{Tr}(K)} = \tfrac{\det\big(K^{-1} + \tfrac{1}{\sigma^2 + c^2\mathrm{Tr}(K)}\sum_{i=1}^{t} u_i u_i^T\big)}{\det\big(K^{-1} + \tfrac{1}{\sigma^2 + c^2\mathrm{Tr}(K)}\sum_{i=1}^{t-1} u_i u_i^T\big)}$
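For completeness, a sketch of how step 3 finishes (not reproduced verbatim from the paper): writing $C = \sigma^2 + c^2\mathrm{Tr}(K)$, the determinant ratios telescope over $t$, and an AM–GM bound on the eigenvalues together with $u_i^T K u_i \le c^2\mathrm{Tr}(K) \le C$ gives the claimed bound.

```latex
% Writing C = \sigma^2 + c^2 \operatorname{Tr}(K), telescoping the determinant
% ratios from the previous display over t = 1, ..., T gives
\begin{align*}
\sum_{t=1}^{T}\log\Big(1+\frac{u_t^T K_{t-1} u_t}{C}\Big)
  &\le \log\frac{\det\big(K^{-1}+\frac{1}{C}\sum_{i=1}^{T}u_iu_i^T\big)}{\det(K^{-1})}
   = \log\det\Big(I+\frac{1}{C}K^{1/2}\Big(\sum_{i=1}^{T}u_iu_i^T\Big)K^{1/2}\Big)\\
  &\le d\log\Big(1+\frac{1}{dC}\sum_{i=1}^{T}u_i^T K u_i\Big)
   \le d\log\Big(1+\frac{T c^2\operatorname{Tr}(K)}{dC}\Big)
   \le d\log\Big(1+\frac{T}{d}\Big),
\end{align*}
% where the AM-GM inequality is applied to the eigenvalues of the matrix inside
% the determinant, and the last two steps use u_i^T K u_i <= c^2 Tr(K) <= C.
```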