Thompson Sampling Algorithms for Mean-Variance Bandits (Qiuyu Zhu, Vincent Y. F. Tan, ICML 2020)


  1. Thompson Sampling Algorithms for Mean-Variance Bandits. Qiuyu Zhu, Vincent Y. F. Tan. Institute of Operations Research and Analytics, National University of Singapore. ICML 2020.

  2. Stochastic multi-armed bandit

Problem formulation. A stochastic multi-armed bandit is a collection of distributions $\nu = (P_1, P_2, \ldots, P_K)$, where $K$ is the number of arms. In each period $t \in [T]$:
1. the player picks an arm $i(t) \in [K]$;
2. the player observes the reward $X_{i(t),t} \sim P_{i(t)}$ of the chosen arm.

Learning policy. A policy $\pi$ maps the history to an arm,
$$ i(t) = \pi\big(t,\, i(1), X_{i(1),1}, \ldots, i(t-1), X_{i(t-1),t-1}\big), \qquad t = 1, \ldots, T, $$
so the player can only use past observations when making the current decision.
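To make the interaction protocol concrete, here is a minimal Python sketch of the bandit/policy interface described above. This is our own illustration; the class and function names are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianBandit:
    """A stochastic K-armed bandit: pulling arm i returns one draw from P_i = N(mean_i, var_i)."""
    def __init__(self, means, variances):
        self.means = np.asarray(means, dtype=float)
        self.vars = np.asarray(variances, dtype=float)

    def pull(self, i):
        return rng.normal(self.means[i], np.sqrt(self.vars[i]))

def run(policy, bandit, T):
    """Generic interaction loop: the policy only sees its own past (arm, reward) pairs."""
    history = []                       # [(i(1), X_{i(1),1}), ..., (i(t-1), X_{i(t-1),t-1})]
    for t in range(1, T + 1):
        i = policy(t, history)         # i(t) = pi(t, past arms and rewards)
        x = bandit.pull(i)
        history.append((i, x))
    return history

# Example: a non-adaptive uniform-random policy on the three-arm instance used later in the talk
bandit = GaussianBandit(means=[1.0, 3.0, 3.3], variances=[3.0, 0.1, 4.0])
uniform_policy = lambda t, history: int(rng.integers(0, 3))
rewards = [x for _, x in run(uniform_policy, bandit, T=1000)]
print("average reward of uniform play:", np.mean(rewards))
```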

  3. The learning objective

Objective: minimise the expected cumulative regret
$$ R_n = \mathbb{E}\Big[\sum_{t=1}^{n} \big(X_{i^*,t} - X_{i(t),t}\big)\Big] = \mathbb{E}\Big[\sum_{t=1}^{n} \big(\mu^* - \mu_{i(t)}\big)\Big] = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[T_{i,n}], $$
where $\mu_i$ is the mean of arm $i$, $i^* = \arg\max_i \mu_i$, $\Delta_i = \mu^* - \mu_i$, and $T_{i,n} = \sum_{t=1}^{n} \mathbf{1}\{i(t)=i\}$.
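The last equality is the standard regret decomposition; for completeness, the one-line argument (our own wording) is to regroup the sum by arm:
$$ \mathbb{E}\Big[\sum_{t=1}^{n}\big(\mu^* - \mu_{i(t)}\big)\Big] = \mathbb{E}\Big[\sum_{i=1}^{K}\big(\mu^* - \mu_i\big)\sum_{t=1}^{n}\mathbf{1}\{i(t)=i\}\Big] = \sum_{i=1}^{K}\Delta_i\,\mathbb{E}[T_{i,n}]. $$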

  4. Motivation

  5. Motivation
Mean $= (-1.44,\ 3.00,\ 3.12)$.

  6. Motivation
True reward distributions: Arm 1 $\sim \mathcal{N}(1,\ 3)$, Arm 2 $\sim \mathcal{N}(3,\ 0.1)$, Arm 3 $\sim \mathcal{N}(3.3,\ 4)$.

  7. Motivation
True reward distributions: Arm 1 $\sim \mathcal{N}(1,\ 3)$, Arm 2 $\sim \mathcal{N}(3,\ 0.1)$, Arm 3 $\sim \mathcal{N}(3.3,\ 4)$.
Some applications require a trade-off between risk and return.

  8. Mean-variance multi-armed bandit

Definition 1 (Mean-Variance). The mean-variance of an arm $i$ with mean $\mu_i$, variance $\sigma_i^2$ and coefficient of absolute risk tolerance $\rho > 0$ is defined as
$$ \mathrm{MV}_i = \rho\,\mu_i - \sigma_i^2 . $$

Definition 2 (Empirical Mean-Variance). Given i.i.d. samples $\{X_{i,t}\}_{t=1}^{s}$ from the distribution $\nu_i$, the empirical mean-variance is defined as
$$ \widehat{\mathrm{MV}}_{i,s} = \rho\,\hat\mu_{i,s} - \hat\sigma_{i,s}^2, $$
where $\hat\mu_{i,s}$ and $\hat\sigma_{i,s}^2$ are the empirical mean and variance, respectively.
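A minimal sketch of Definition 2 in Python (our own illustration, not the authors' code; the slides do not say whether the biased or unbiased variance estimator is meant, so we use the biased $1/s$ form to match the later definition of $\hat\sigma_n^2(\pi)$):

```python
import numpy as np

def empirical_mean_variance(samples, rho):
    """Empirical mean-variance of a single arm: rho * mu_hat - sigma_hat^2 (Definition 2)."""
    samples = np.asarray(samples, dtype=float)
    mu_hat = samples.mean()
    var_hat = samples.var()   # biased (1/s) variance, matching sigma_hat^2_n(pi) on the next slide
    return rho * mu_hat - var_hat

# Example: 1000 draws from an arm with mean 3 and variance 0.1, risk tolerance rho = 1
rng = np.random.default_rng(0)
x = rng.normal(3.0, np.sqrt(0.1), size=1000)
print(empirical_mean_variance(x, rho=1.0))   # close to 1 * 3 - 0.1 = 2.9
```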

  9. The learning objective

For a given policy $\pi$, let $\{Z_t\}_{t=1}^{n}$ denote the rewards it collects over $n$ rounds. Its empirical mean-variance is
$$ \widehat{\mathrm{MV}}_n(\pi) = \rho\,\hat\mu_n(\pi) - \hat\sigma_n^2(\pi), \quad \text{where} \quad \hat\mu_n(\pi) = \frac{1}{n}\sum_{t=1}^{n} Z_t \quad \text{and} \quad \hat\sigma_n^2(\pi) = \frac{1}{n}\sum_{t=1}^{n}\big(Z_t - \hat\mu_n(\pi)\big)^2 . $$

Definition 3 (Regret). The expected regret of a policy $\pi$ over $n$ rounds is defined as
$$ \mathbb{E}[R_n(\pi)] = n\Big(\mathrm{MV}_1 - \mathbb{E}\big[\widehat{\mathrm{MV}}_n(\pi)\big]\Big), $$
where we assume the first arm is the best arm.

  10. The variances

Law of total variance:
$$ \mathrm{Var}(\text{reward}) = \mathbb{E}\big[\mathrm{Var}(\text{reward} \mid \text{arm})\big] + \mathrm{Var}\big(\mathbb{E}[\text{reward} \mid \text{arm}]\big). $$

(Figure 1: the reward process.)
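As a concrete illustration (our own arithmetic, assuming for simplicity a policy that picks each of the three motivation arms uniformly at random):
$$ \mathbb{E}\big[\mathrm{Var}(X \mid \text{arm})\big] = \tfrac{1}{3}(3 + 0.1 + 4) \approx 2.37, \qquad \mathrm{Var}\big(\mathbb{E}[X \mid \text{arm}]\big) = \tfrac{1}{3}\big(1^2 + 3^2 + 3.3^2\big) - \Big(\tfrac{1 + 3 + 3.3}{3}\Big)^2 \approx 1.04, $$
so the overall reward variance is roughly $2.37 + 1.04 \approx 3.41$: both the within-arm and the between-arm terms matter when the objective penalises variance.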

  11. Pseudo-regret

Definition 4. The expected pseudo-regret of a policy $\pi$ over $n$ rounds is defined as
$$ \mathbb{E}\big[\widetilde{R}_n(\pi)\big] = \sum_{i=2}^{K} \mathbb{E}[T_{i,n}]\,\Delta_i + \frac{1}{n}\sum_{i=1}^{K}\sum_{j\neq i} \mathbb{E}[T_{i,n} T_{j,n}]\,\Gamma_{i,j}^2, $$
where $\Delta_i = \sigma_i^2 - \sigma_1^2 - \rho(\mu_i - \mu_1)$ is the gap between $\mathrm{MV}_i$ and $\mathrm{MV}_1$, and $\Gamma_{i,j}$ is the gap between $\mu_i$ and $\mu_j$.

Lemma 1. The difference between the expected regret and the expected pseudo-regret can be bounded as follows:
$$ \mathbb{E}[R_n(\pi)] \le \mathbb{E}\big[\widetilde{R}_n(\pi)\big] + 3\sum_{i=1}^{K}\sigma_i^2 . $$

  12. Pseudo-regret

Simplification of the pseudo-regret:
$$ \frac{1}{n}\sum_{i=1}^{K}\sum_{j\neq i} \mathbb{E}[T_{i,n} T_{j,n}]\,\Gamma_{i,j}^2 \;\le\; 2\sum_{i=2}^{K} \mathbb{E}[T_{i,n}]\,\Gamma_{i,\max}^2, \qquad (1) $$
where $\Gamma_{i,\max}^2 = \max\{(\mu_i - \mu_j)^2 : j = 1, \ldots, K\}$.

By applying Definition 4, Lemma 1 and Eqn. (1), it suffices to bound the expected number of pulls of the suboptimal arms, $\mathbb{E}[T_{i,n}]$.
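A sketch of why (1) holds (our own reconstruction of a standard argument, using $\sum_j T_{j,n} = n$, $T_{1,n} \le n$, and $\Gamma_{i,j}^2 \le \Gamma_{i,\max}^2$): split off the $i = 1$ terms,
$$ \frac{1}{n}\sum_{i=1}^{K}\sum_{j\neq i} \mathbb{E}[T_{i,n} T_{j,n}]\,\Gamma_{i,j}^2 = \frac{1}{n}\sum_{i=2}^{K}\sum_{j\neq i} \mathbb{E}[T_{i,n} T_{j,n}]\,\Gamma_{i,j}^2 + \frac{1}{n}\sum_{j=2}^{K} \mathbb{E}[T_{1,n} T_{j,n}]\,\Gamma_{1,j}^2 $$
$$ \le \frac{1}{n}\sum_{i=2}^{K}\Gamma_{i,\max}^2\,\mathbb{E}\Big[T_{i,n}\sum_{j\neq i}T_{j,n}\Big] + \frac{1}{n}\sum_{j=2}^{K}\Gamma_{j,\max}^2\,\mathbb{E}[n\,T_{j,n}] \le 2\sum_{i=2}^{K}\mathbb{E}[T_{i,n}]\,\Gamma_{i,\max}^2 . $$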

  13. Thompson Sampling

True reward distributions: $\mathcal{N}(1,\ 3)$, $\mathcal{N}(3,\ 0.1)$, $\mathcal{N}(3.3,\ 4)$.
$t = 0$ → Samples: $(1.30,\ 1.22,\ -0.07)$ → Play arm 1 → Get reward $-1.44$ → Update posteriors.

  14. Thompson Sampling

True reward distributions: $\mathcal{N}(1,\ 3)$, $\mathcal{N}(3,\ 0.1)$, $\mathcal{N}(3.3,\ 4)$.
$t = 1$ → Samples: $(0.17,\ -0.24,\ 0.65)$ → Play arm 3 → Get reward $0.62$ → Update posteriors.

  15. Thompson Sampling

True reward distributions: $\mathcal{N}(1,\ 3)$, $\mathcal{N}(3,\ 0.1)$, $\mathcal{N}(3.3,\ 4)$.
$t = 10$ → Samples: $(-0.24,\ 2.15,\ 3.23)$ → Play arm 2 → Get reward $2.12$ → Update posteriors.

  16. TS algorithm for mean learning

Algorithm 1: Thompson Sampling for Mean Learning (MTS)
1: Input: $\hat\mu_{i,0} = 0$, $T_{i,0} = 0$, $\alpha_{i,0} = \tfrac{1}{2}$, $\beta_{i,0} = \tfrac{1}{2}$.
2: for $t = 1, 2, \ldots$ do
3:   Sample $\theta_i(t)$ from $\mathcal{N}\big(\hat\mu_{i,t-1},\ 1/(T_{i,t-1}+1)\big)$.
4:   Play arm $i(t) = \arg\max_i\ \rho\,\theta_i(t) - 2\beta_{i,t-1}$ and observe $X_{i(t),t}$.
5:   $(\hat\mu_{i(t),t}, T_{i(t),t}, \alpha_{i(t),t}, \beta_{i(t),t}) = \mathrm{Update}\big(\hat\mu_{i(t),t-1}, T_{i(t),t-1}, \alpha_{i(t),t-1}, \beta_{i(t),t-1}, X_{i(t),t}\big)$.
6: end for
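One round of the MTS sampling and arm-selection step in Python, as a rough sketch only: the posterior statistics below are made-up example values, and the variance-penalty term $2\beta_{i,t-1}$ is taken verbatim from line 4 of the slide rather than from the paper's full specification.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 2.0

# Example posterior statistics after some pulls (illustrative values, not from the talk)
mu_hat = np.array([1.1, 2.9, 3.2])    # empirical means mu_hat_{i,t-1}
counts = np.array([10, 20, 8])        # pull counts T_{i,t-1}
beta   = np.array([14.0, 1.2, 15.0])  # beta_{i,t-1}

# Step 3: theta_i(t) ~ N(mu_hat_{i,t-1}, 1/(T_{i,t-1} + 1))
theta = rng.normal(mu_hat, 1.0 / np.sqrt(counts + 1.0))
# Step 4: pick the arm maximising the sampled mean minus the variance penalty
print("chosen arm:", int(np.argmax(rho * theta - 2.0 * beta)))
```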

  17. Regret bound

Theorem 1. If $\rho > \max\{\sigma_1^2/\Gamma_{1,i} : i = 2, \ldots, K\}$, the asymptotic expected regret incurred by MTS for mean-variance Gaussian bandits satisfies
$$ \lim_{n\to\infty} \frac{\mathbb{E}[R_n(\mathrm{MTS})]}{\log n} \le \sum_{i=2}^{K} \big(\Delta_i + 2\Gamma_{i,\max}^2\big)\,\frac{2\rho^2}{\big(\rho\,\Gamma_{1,i} - \sigma_1^2\big)^2} . $$

Remark 1 (The bound). Since $\Delta_i = \sigma_i^2 - \sigma_1^2 + \rho\,\Gamma_{1,i}$, as $\rho$ tends to $+\infty$ we observe that
$$ \lim_{n\to\infty} \frac{\mathbb{E}[R_n(\mathrm{MTS})]}{\rho \log n} \le \sum_{i=2}^{K} \frac{2}{\Gamma_{1,i}} . $$
This bound is near-optimal according to [Agrawal and Goyal, 2012].
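A quick way to see the limit in Remark 1 (our own one-line computation from the Theorem 1 bound): for large $\rho$, $\Delta_i = \sigma_i^2 - \sigma_1^2 + \rho\Gamma_{1,i} \sim \rho\,\Gamma_{1,i}$ and $(\rho\Gamma_{1,i} - \sigma_1^2)^2 \sim \rho^2\Gamma_{1,i}^2$, so
$$ \frac{\Delta_i + 2\Gamma_{i,\max}^2}{\rho}\cdot\frac{2\rho^2}{\big(\rho\Gamma_{1,i} - \sigma_1^2\big)^2} \;\xrightarrow{\ \rho\to\infty\ }\; \Gamma_{1,i}\cdot\frac{2}{\Gamma_{1,i}^2} = \frac{2}{\Gamma_{1,i}}, $$
and summing over the suboptimal arms gives the stated bound.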

  18. TS algorithm for variance learning

Algorithm 2: Thompson Sampling for Variance Learning (VTS)
1: Input: $\hat\mu_{i,0} = 0$, $T_{i,0} = 0$, $\alpha_{i,0} = \tfrac{1}{2}$, $\beta_{i,0} = \tfrac{1}{2}$.
2: for $t = 1, 2, \ldots$ do
3:   Sample $\tau_i(t)$ from $\mathrm{Gamma}(\alpha_{i,t-1}, \beta_{i,t-1})$.
4:   Play arm $i(t) = \arg\max_{i \in [K]}\ \rho\,\hat\mu_{i,t-1} - 1/\tau_i(t)$ and observe $X_{i(t),t}$.
5:   $(\hat\mu_{i(t),t}, T_{i(t),t}, \alpha_{i(t),t}, \beta_{i(t),t}) = \mathrm{Update}\big(\hat\mu_{i(t),t-1}, T_{i(t),t-1}, \alpha_{i(t),t-1}, \beta_{i(t),t-1}, X_{i(t),t}\big)$.
6: end for
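One round of the VTS index computation in Python, again purely illustrative: the posterior statistics are example values, and since numpy's Gamma sampler is parameterised by shape and scale, we pass scale = 1/beta to match the shape-rate $\mathrm{Gamma}(\alpha, \beta)$ used in the algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 2.0

# Example posterior statistics after some pulls (illustrative values, not from the talk)
mu_hat = np.array([1.1, 2.9, 3.2])     # empirical means mu_hat_{i,t-1}
alpha  = np.array([5.5, 10.5, 4.5])    # Gamma shape parameters alpha_{i,t-1}
beta   = np.array([14.0, 1.2, 15.0])   # Gamma rate parameters beta_{i,t-1}

# Steps 3-4: sample one precision per arm and maximise the mean-variance index
tau = rng.gamma(shape=alpha, scale=1.0 / beta)   # tau_i(t) ~ Gamma(alpha, beta) in rate form
index = rho * mu_hat - 1.0 / tau
print("chosen arm:", int(np.argmax(index)))
```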

  19. Regret bound

Theorem 2. Let $h(x) = \tfrac{1}{2}(x - 1 - \log x)$. If $\rho \le \min\{\Delta_i/\Gamma_{1,i} : \Delta_i/\Gamma_{1,i} > 0\}$, the asymptotic regret incurred by VTS for mean-variance Gaussian bandits satisfies
$$ \lim_{n\to\infty} \frac{\mathbb{E}[R_n(\mathrm{VTS})]}{\log n} \le \sum_{i=2}^{K} \frac{\Delta_i + 2\Gamma_{i,\max}^2}{h\big(\sigma_i^2/\sigma_1^2\big)} . $$

Remark 2 (Order optimality). Vakili and Zhao (2015) proved that the expected regret of any consistent algorithm is $\Omega\big((\log n)/\Delta^2\big)$, where $\Delta = \min_{i\neq 1} \Delta_i$. Since $h(x) = (x-1)^2/4 + o\big((x-1)^2\big)$ as $x \to 1$, MTS and VTS are order optimal in both $n$ and $\Delta$.
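The local expansion quoted in Remark 2 follows from the Taylor series of the logarithm (our own one-line check): since $\log x = (x-1) - \tfrac{1}{2}(x-1)^2 + o\big((x-1)^2\big)$ as $x \to 1$,
$$ h(x) = \tfrac{1}{2}\big(x - 1 - \log x\big) = \tfrac{1}{2}\Big(\tfrac{1}{2}(x-1)^2 + o\big((x-1)^2\big)\Big) = \frac{(x-1)^2}{4} + o\big((x-1)^2\big). $$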

  20. TS algorithm for mean-variance learning

Algorithm 3: Thompson Sampling for Mean-Variance Bandits (MVTS)
1: Input: $\hat\mu_{i,0} = 0$, $T_{i,0} = 0$, $\alpha_{i,0} = \tfrac{1}{2}$, $\beta_{i,0} = \tfrac{1}{2}$.
2: for $t = 1, 2, \ldots$ do
3:   Sample $\tau_i(t)$ from $\mathrm{Gamma}(\alpha_{i,t-1}, \beta_{i,t-1})$.
4:   Sample $\theta_i(t)$ from $\mathcal{N}\big(\hat\mu_{i,t-1},\ 1/(T_{i,t-1}+1)\big)$.
5:   Play arm $i(t) = \arg\max_{i \in [K]}\ \rho\,\theta_i(t) - 1/\tau_i(t)$ and observe $X_{i(t),t}$.
6:   $(\hat\mu_{i(t),t}, T_{i(t),t}, \alpha_{i(t),t}, \beta_{i(t),t}) = \mathrm{Update}\big(\hat\mu_{i(t),t-1}, T_{i(t),t-1}, \alpha_{i(t),t-1}, \beta_{i(t),t-1}, X_{i(t),t}\big)$.
7: end for
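A self-contained Python sketch of MVTS on the three-arm instance from the motivation slides. The slides do not show the Update routine, so the bookkeeping below (running mean, $\alpha$ incremented by $1/2$ per pull, $\beta$ accumulating half the squared deviations) is an assumption on our part, not the authors' exact rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three-arm Gaussian instance from the motivation: N(1, 3), N(3, 0.1), N(3.3, 4)  (mean, variance)
true_means = np.array([1.0, 3.0, 3.3])
true_vars  = np.array([3.0, 0.1, 4.0])
K, horizon, rho = 3, 20000, 2.0

# Posterior statistics, initialised as in Algorithm 3
mu_hat = np.zeros(K)
counts = np.zeros(K)
alpha  = np.full(K, 0.5)
beta   = np.full(K, 0.5)

for t in range(1, horizon + 1):
    # Steps 3-4: sample a precision and a mean for every arm
    tau   = rng.gamma(shape=alpha, scale=1.0 / beta)          # Gamma(alpha, beta) in rate form
    theta = rng.normal(mu_hat, 1.0 / np.sqrt(counts + 1.0))   # N(mu_hat, 1/(T+1))
    # Step 5: maximise the sampled mean-variance index
    i = int(np.argmax(rho * theta - 1.0 / tau))
    x = rng.normal(true_means[i], np.sqrt(true_vars[i]))
    # Step 6: assumed Update routine (running mean + Normal/Gamma-style bookkeeping)
    counts[i] += 1
    delta = x - mu_hat[i]
    mu_hat[i] += delta / counts[i]
    alpha[i] += 0.5
    beta[i]  += 0.5 * delta * (x - mu_hat[i])   # Welford increment of half the sum of squared deviations

print("pulls per arm:", counts.astype(int))
# Rough empirical mean-variance estimates per arm (2*beta/T approximates sigma_hat^2)
print("empirical MV estimates:", rho * mu_hat - (2.0 * beta) / np.maximum(counts, 1.0))
```

With $\rho = 2$, the mean-variance optimal arm here is arm 2 ($\mathrm{MV}_2 = 6 - 0.1 = 5.9$ versus $-1$ and $2.6$), so the simulation should concentrate its pulls on arm 2.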

  21. Hierarchical structure of Thompson samples

Figure 2: Hierarchical structure of the mean-variance Thompson samples in MVTS. The empirical statistics satisfy $\hat\mu_{i,T_{i,t}} \sim \mathcal{N}\big(\mu_i,\ \sigma_i^2/T_{i,t}\big)$ and $2\beta_{i,t}/\sigma_i^2 \sim \chi_{s-1}^2$; conditioned on them, MVTS draws $\theta_{i,t} \sim \mathcal{N}\big(\hat\mu_{i,T_{i,t}},\ 1/T_{i,t}\big)$ and $\tau_{i,t} \sim \mathrm{Gamma}(\alpha_{i,t}, \beta_{i,t})$, which are combined into the sampled index $\widehat{\mathrm{MV}}_{i,t} = \rho\,\theta_{i,t} - 1/\tau_{i,t}$.
