

1. Weighted Linear Bandits for Non-Stationary Environments. Yoan Russac (1), Claire Vernade (2) and Olivier Cappé (1). (1) CNRS, Inria, ENS, Université PSL; (2) DeepMind.

2. The Model. Roadmap: 1. The Model, 2. Related work, 3. Concentration Result, 4. Application to Non-Stationary Linear Bandits, 5. Empirical Performances.

3. The Model. The Non-Stationary Linear Model.
At time $t$, the learner has access to a time-dependent finite set of arbitrary actions $\mathcal{A}_t = \{A_{t,1}, \dots, A_{t,K_t}\}$, where $A_{t,k} \in \mathbb{R}^d$ (with $\|A_{t,k}\|_2 \le L$). Actions can only be probed one at a time: the learner chooses an action $A_t \in \mathcal{A}_t$ and observes only the noisy linear reward $X_t = A_t^\top \theta^\star_t + \eta_t$, where $\eta_t$ is a $\sigma$-subgaussian random noise.
Specificity of the model: non-stationarity ($\theta^\star_t$ depends on $t$) and an unstructured action set.
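To make the observation model concrete, here is a minimal Python sketch of one round. The names (sample_action_set, reward) are illustrative rather than from the paper, and Gaussian noise is used as one example of a $\sigma$-subgaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma, L = 3, 5, 0.1, 1.0   # illustrative dimensions and constants

def sample_action_set(K, d, L):
    """Draw a finite set of K arbitrary actions in R^d with ||a||_2 <= L."""
    A = rng.normal(size=(K, d))
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A * np.minimum(1.0, L / norms)   # rescale any action longer than L

def reward(action, theta_star):
    """Noisy linear reward X_t = <A_t, theta*_t> + eta_t."""
    return action @ theta_star + sigma * rng.normal()

theta_star = rng.normal(size=d)             # current unknown parameter theta*_t
A_t = sample_action_set(K, d, L)            # action set revealed at round t
X_t = reward(A_t[0], theta_star)            # play one action, observe its reward
```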

4. The Model. Optimality Criteria: Dynamic Regret Minimization.
$$\max \; \mathbb{E}\Big[\sum_{t=1}^{T} X_t\Big] \;\Longleftrightarrow\; \min \; \mathbb{E}\Big[\sum_{t=1}^{T} \max_{a \in \mathcal{A}_t} \langle a, \theta^\star_t \rangle - X_t\Big] \;\Longleftrightarrow\; \min \; \mathbb{E}\Big[\underbrace{\sum_{t=1}^{T} \max_{a \in \mathcal{A}_t} \langle a - A_t, \theta^\star_t \rangle}_{\text{dynamic regret}}\Big]$$
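As a worked example of the last expression, a small hypothetical helper evaluating the dynamic regret of a played trajectory against the per-round oracle (argument names are illustrative):

```python
import numpy as np

def dynamic_regret(action_sets, played, thetas):
    """action_sets: list of (K_t, d) arrays; played: list of chosen d-vectors A_t;
    thetas: list of d-vectors theta*_t. At each round the comparator is the best
    action for the *current* parameter."""
    total = 0.0
    for A_t, a_t, theta_t in zip(action_sets, played, thetas):
        total += np.max(A_t @ theta_t) - a_t @ theta_t
    return total
```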

5. The Model. Difference to Specific Cases.
1. When $\mathcal{A}_t \to I_d$ (the actions are the canonical basis vectors), the model reduces to the (non-stationary) multi-armed bandit model. If $\theta^\star_t = \theta^\star$, there is a single best action $a^\star$, and it is only necessary to control the deviations of $\hat\theta_t$ in the principal directions.
2. If $\mathcal{A}_t \to I_d \otimes A_t$ (a block-diagonal structure with $A_t$ repeated on the diagonal), with $(A_t)_{t \ge 1}$ i.i.d., $\epsilon$-greedy exploration may be efficient.

6. The Model. Non-Stationarity and Bandits.
Two different approaches are commonly used to deal with non-stationary bandits: detecting changes in the distribution of the arms, or building methods that are (somewhat) robust to variations of the environment. Their performance depends on the assumptions made on the sequence of environment parameters $(\theta^\star_t)_{t \ge 1}$. In abruptly changing environments, changepoint-detection methods are more efficient, but they may fail in slowly-changing environments. We expect robust policies to perform well in both types of environment.

7. The Model. Our Approach.
We focus only on robust policies. With that in mind, the non-stationarity of the parameter $\theta^\star_t$ is measured with the variation budget
$$\sum_{s=1}^{T-1} \|\theta^\star_s - \theta^\star_{s+1}\|_2 \le B_T$$
A large variation budget can be due either to a few large changes of $\theta^\star_t$ or to frequent but small deviations.
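The following short sketch computes the variation budget for a parameter trajectory and illustrates the last remark: an abrupt single switch and a slow linear drift can have exactly the same budget (the trajectories below are invented for illustration).

```python
import numpy as np

def variation_budget(thetas):
    """B_T = sum_{s=1}^{T-1} ||theta*_s - theta*_{s+1}||_2 for a (T, d) array."""
    return np.sum(np.linalg.norm(np.diff(thetas, axis=0), axis=1))

# One abrupt change of norm sqrt(2) vs. 99 small drifts summing to sqrt(2).
abrupt = np.vstack([np.tile([1.0, 0.0], (50, 1)), np.tile([0.0, 1.0], (50, 1))])
drift = np.column_stack([np.linspace(1.0, 0.0, 100), np.linspace(0.0, 1.0, 100)])
print(variation_budget(abrupt), variation_budget(drift))   # both close to 1.414
```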

8. Related work. Roadmap: 1. The Model, 2. Related work, 3. Concentration Result, 4. Application to Non-Stationary Linear Bandits, 5. Empirical Performances.

9. Related work. Some references.
Garivier et al. (2011), On upper-confidence bound policies for switching bandit problems, COLT. Introduce sliding-window and exponential-discounting algorithms, analyze them in the abrupt-changes setting and provide a $O(T^{1/2})$ lower bound.
Besbes et al. (2014), Stochastic multi-armed-bandit problem with non-stationary rewards, NeurIPS. Consider the variation budget, prove a $O(T^{2/3})$ lower bound and analyze an epoch-based variant of Exp3.
Wu et al. (2018), Learning contextual bandits in a non-stationary environment, ACM SIGIR. Introduce an algorithm (called dLinUCB) based on change detection for the linear bandit.
Cheung et al. (2019), Learning to optimize under non-stationarity, AISTATS. Adapt the sliding-window algorithm to the linear bandit.

10. Related work. Garivier et al. paper.
Sliding-Window UCB algorithm: at time $t$ the SW-UCB policy selects the action
$$A_t = \arg\max_{i \in \{1,\dots,K\}} \; \frac{\sum_{s=t-\tau+1}^{t} X_s \mathbb{1}(I_s = i)}{\sum_{s=t-\tau+1}^{t} \mathbb{1}(I_s = i)} + \sqrt{\frac{\xi \log(\min(t,\tau))}{\sum_{s=t-\tau+1}^{t} \mathbb{1}(I_s = i)}}$$
Discounted UCB algorithm: at time $t$ the D-UCB policy selects the action
$$A_t = \arg\max_{i \in \{1,\dots,K\}} \; \frac{\sum_{s=1}^{t} \gamma^{t-s} X_s \mathbb{1}(I_s = i)}{\sum_{s=1}^{t} \gamma^{t-s} \mathbb{1}(I_s = i)} + 2\sqrt{\frac{\xi \log\big((1-\gamma^{t})/(1-\gamma)\big)}{\sum_{s=1}^{t} \gamma^{t-s} \mathbb{1}(I_s = i)}}, \qquad \text{with } \gamma < 1$$
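A sketch of the D-UCB index as displayed above, with illustrative names (the SW-UCB index is analogous, with a hard window of length $\tau$ instead of discounting):

```python
import numpy as np

def d_ucb_index(rewards, arms, i, gamma, xi=0.5):
    """Discounted-UCB index of arm i; rewards[s], arms[s] store X_{s+1}, I_{s+1}."""
    rewards, arms = np.asarray(rewards, dtype=float), np.asarray(arms)
    t = len(rewards)
    w = gamma ** np.arange(t - 1, -1, -1)          # gamma^{t-s} for s = 1..t
    mask = (arms == i).astype(float)
    N_i = np.sum(w * mask)                         # discounted pull count of arm i
    if N_i == 0:
        return np.inf                              # force exploration of unseen arms
    mean_i = np.sum(w * mask * rewards) / N_i      # discounted empirical mean
    n_t = (1 - gamma ** t) / (1 - gamma)           # total discounted count
    return mean_i + 2 * np.sqrt(xi * np.log(n_t) / N_i)
```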

11. Concentration Result. Roadmap: 1. The Model, 2. Related work, 3. Concentration Result, 4. Application to Non-Stationary Linear Bandits, 5. Empirical Performances.

12. Concentration Result. Assumptions.
At each round $t \ge 1$ the learner receives a finite set of arbitrary feasible actions $\mathcal{A}_t \subset \mathbb{R}^d$ and selects an $\mathcal{F}_t$-measurable action $A_t \in \mathcal{A}_t$, where $\mathcal{F}_t = \sigma(X_1, A_1, \dots, X_{t-1}, A_{t-1})$.
Other assumptions:
Sub-Gaussian random noise: $\eta_t$ is, conditionally on the past, $\sigma$-subgaussian.
Bounded actions: $\forall t \ge 1, \forall a \in \mathcal{A}_t, \|a\|_2 \le L$.
Bounded parameters: $\forall t \ge 1, \|\theta^\star_t\|_2 \le S$, and $\forall t \ge 1, \forall a \in \mathcal{A}_t, |\langle a, \theta^\star_t \rangle| \le 1$.

13. Concentration Result. Weighted Least Squares Estimator.
Least squares estimator:
$$\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t} (X_s - A_s^\top\theta)^2 + \frac{\lambda}{2}\|\theta\|_2^2$$
Weighted least squares estimator:
$$\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t} w_s (X_s - A_s^\top\theta)^2 + \frac{\lambda_t}{2}\|\theta\|_2^2$$
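Since the weighted objective is a strictly convex quadratic, it can be minimized in closed form (the closed form is stated on the next slide). A minimal sketch with illustrative names:

```python
import numpy as np

def weighted_ls(actions, rewards, weights, lam):
    """Weighted regularized least squares:
    theta_hat = (sum_s w_s A_s A_s^T + lam I_d)^{-1} sum_s w_s A_s X_s.
    actions: (t, d) array of A_s; rewards: (t,) array of X_s; weights: (t,) array of w_s."""
    d = actions.shape[1]
    V = actions.T @ (weights[:, None] * actions) + lam * np.eye(d)
    b = actions.T @ (weights * rewards)
    return np.linalg.solve(V, b)
```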

14. Concentration Result. Scale-Invariance Property.
The weighted least squares estimator is given by
$$\hat\theta_t = \Big(\sum_{s=1}^{t} w_s A_s A_s^\top + \lambda_t I_d\Big)^{-1} \sum_{s=1}^{t} w_s A_s X_s$$
$\hat\theta_t$ is unchanged if all the weights $w_s$ and the regularization parameter $\lambda_t$ are multiplied by the same constant $\alpha$.
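A quick numerical check of this property on synthetic data (all names and values here are illustrative): rescaling the weights and the regularization by the same $\alpha$ leaves the estimate unchanged, because both the design matrix and the right-hand side are multiplied by $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))                                      # past actions A_s
X = A @ np.array([0.5, -0.2, 0.1]) + 0.1 * rng.normal(size=50)    # rewards X_s
w = 0.95 ** np.arange(50, 0, -1)                                  # weights w_s

def estimate(weights, lam):
    V = A.T @ (weights[:, None] * A) + lam * np.eye(3)
    return np.linalg.solve(V, A.T @ (weights * X))

alpha = 7.3
assert np.allclose(estimate(w, 1.0), estimate(alpha * w, alpha * 1.0))
```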

15. Concentration Result. The Case of Exponential Weights.
Exponential discount (time-dependent weights $w_{t,s} = \gamma^{t-s}$):
$$\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t} \gamma^{t-s} (X_s - A_s^\top\theta)^2 + \frac{\lambda}{2}\|\theta\|_2^2$$
Time-independent weights:
$$\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t} \Big(\frac{1}{\gamma}\Big)^{s} (X_s - A_s^\top\theta)^2 + \frac{\lambda}{2\gamma^{t}}\|\theta\|_2^2$$
The two formulations are equivalent, due to scale-invariance.
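The equivalence is a one-line consequence of the scale-invariance property: dividing the first objective by the positive constant $\gamma^{t}$ does not change the arg min, and
$$\gamma^{-t}\left[\sum_{s=1}^{t}\gamma^{t-s}\,(X_s - A_s^\top\theta)^2 + \frac{\lambda}{2}\|\theta\|_2^2\right] = \sum_{s=1}^{t}\Big(\frac{1}{\gamma}\Big)^{s}(X_s - A_s^\top\theta)^2 + \frac{\lambda}{2\gamma^{t}}\|\theta\|_2^2 .$$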

16. Concentration Result. Concentration Result.
Theorem 1. Assume that $\theta^\star_t = \theta^\star$. For any $\mathcal{F}_t$-predictable sequences of actions $(A_t)_{t \ge 1}$ and positive weights $(w_t)_{t \ge 1}$, and for all $\delta > 0$, with probability higher than $1-\delta$,
$$\forall t, \quad \|\hat\theta_t - \theta^\star\|_{V_t \widetilde V_t^{-1} V_t} \le \frac{\lambda_t}{\sqrt{\mu_t}}\, S + \sigma \sqrt{2\log(1/\delta) + d \log\Big(1 + \frac{L^2 \sum_{s=1}^{t} w_s^2}{d\,\mu_t}\Big)}$$
where
$$V_t = \sum_{s=1}^{t} w_s A_s A_s^\top + \lambda_t I_d, \qquad \widetilde V_t = \sum_{s=1}^{t} w_s^2 A_s A_s^\top + \mu_t I_d .$$
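For reference, the confidence radius of Theorem 1 written out as a function (a sketch with illustrative argument names, following the reconstruction of the bound above):

```python
import numpy as np

def confidence_radius(weights, lam_t, mu_t, S, sigma, L, d, delta):
    """Radius (lam_t / sqrt(mu_t)) S + sigma sqrt(2 log(1/delta)
    + d log(1 + L^2 sum_s w_s^2 / (d mu_t)))."""
    sum_w2 = np.sum(np.asarray(weights, dtype=float) ** 2)
    return (lam_t / np.sqrt(mu_t)) * S + sigma * np.sqrt(
        2 * np.log(1 / delta) + d * np.log(1 + L ** 2 * sum_w2 / (d * mu_t)))
```

With unit weights and $\lambda_t = \mu_t = \lambda$, this reduces to the usual radius $\sqrt{\lambda}\,S + \sigma\sqrt{2\log(1/\delta) + d\log(1 + L^2 t/(d\lambda))}$, consistent with the remark on the next slide.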

17. Concentration Result. On the Control of Deviations in the $V_t \widetilde V_t^{-1} V_t$ Norm.
For the unweighted least squares estimator, the deviation bound of [Abbasi-Yadkori et al., 2011] features the $\|\hat\theta_t - \theta^\star\|_{V_t}$ norm. Here, the $V_t \widetilde V_t^{-1} V_t$ norm comes from the observation that the variance terms are related to the $w_s^2$, which are featured in $\widetilde V_t$, whereas the weighted least squares estimator (and the matrix $V_t$) is defined with the $w_s$.
Remark: when $w_t = 1$, taking $\lambda_t = \mu_t$ yields $V_t \widetilde V_t^{-1} V_t = V_t$ and the usual concentration inequality.

18. Concentration Result. On the Role of $\mu_t$.
The sequence of parameters $(\mu_t)_{t \ge 1}$ is instrumental (it results from the use of the method of mixtures) and could theoretically be chosen completely independently of $\lambda_t$ and $w_t$. But taking $\mu_t$ proportional to $\lambda_t^2$ ensures that $V_t \widetilde V_t^{-1} V_t$ becomes scale-invariant, $\lambda_t / \sqrt{\mu_t}$ becomes scale-invariant, and $\sum_{s=1}^{t} w_s^2 / \mu_t$ becomes scale-invariant. This yields a scale-invariant concentration inequality!

19. Concentration Result. On the Use of Time-Dependent Regularization Parameters.
Using a time-dependent regularization parameter $\lambda_t$ is required to avoid vanishing regularization, in the sense that $d \log\big(1 + L^2 \sum_{s=1}^{t} w_s^2 / (d\,\mu_t)\big)$ should not dominate the radius of the confidence region as $t$ increases. In the setting with exponentially increasing weights ($w_s = \gamma^{-s}$): $\lambda_t \propto w_t$ and $\mu_t \propto \lambda_t^2$.

20. Application to Non-Stationary Linear Bandits. Roadmap: 1. The Model, 2. Related work, 3. Concentration Result, 4. Application to Non-Stationary Linear Bandits, 5. Empirical Performances.

21. Application to Non-Stationary Linear Bandits. D-LinUCB Algorithm (1).
Algorithm 1: D-LinUCB
Input: probability $\delta$, subgaussianity constant $\sigma$, dimension $d$, regularization $\lambda$, upper bound for actions $L$, upper bound for parameters $S$, discount factor $\gamma$.
Initialization: $b = 0_{\mathbb{R}^d}$, $V = \lambda I_d$, $\widetilde V = \lambda I_d$, $\hat\theta = 0_{\mathbb{R}^d}$.
For $t \ge 1$ do:
Receive $\mathcal{A}_t$ and compute
$$\beta_{t-1} = \sqrt{\lambda}\, S + \sigma \sqrt{2\log\Big(\frac{1}{\delta}\Big) + d \log\Big(1 + \frac{L^2 (1 - \gamma^{2(t-1)})}{\lambda d (1 - \gamma^2)}\Big)}$$
For $a \in \mathcal{A}_t$, compute $\mathrm{UCB}(a) = a^\top \hat\theta + \beta_{t-1}\sqrt{a^\top V^{-1} \widetilde V V^{-1} a}$.
$A_t = \arg\max_a \mathrm{UCB}(a)$.
Play action $A_t$ and receive reward $X_t$.
Updating phase: $V = \gamma V + A_t A_t^\top + (1-\gamma)\lambda I_d$, $\widetilde V = \gamma^2 \widetilde V + A_t A_t^\top + (1-\gamma^2)\lambda I_d$, $b = \gamma b + X_t A_t$, $\hat\theta = V^{-1} b$.
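The following is a compact Python sketch of this pseudo-code (not the authors' reference implementation; the default parameter values are illustrative and the variable names mirror the slide).

```python
import numpy as np

class DLinUCB:
    """Sketch of D-LinUCB with discounted design matrices V and V~."""

    def __init__(self, d, delta=0.05, sigma=0.1, lam=1.0, L=1.0, S=1.0, gamma=0.99):
        self.d, self.delta, self.sigma = d, delta, sigma
        self.lam, self.L, self.S, self.gamma = lam, L, S, gamma   # gamma < 1
        self.b = np.zeros(d)
        self.V = lam * np.eye(d)          # V   (discounted design matrix)
        self.V2 = lam * np.eye(d)         # V~  (squared-discount design matrix)
        self.theta_hat = np.zeros(d)
        self.t = 1

    def select(self, action_set):
        """Return the index of the UCB-maximizing action in a (K, d) array."""
        g2 = self.gamma ** 2
        beta = np.sqrt(self.lam) * self.S + self.sigma * np.sqrt(
            2 * np.log(1 / self.delta)
            + self.d * np.log(1 + self.L ** 2 * (1 - g2 ** (self.t - 1))
                              / (self.lam * self.d * (1 - g2))))
        V_inv = np.linalg.inv(self.V)
        M = V_inv @ self.V2 @ V_inv       # V^{-1} V~ V^{-1}
        widths = np.sqrt(np.maximum(
            np.einsum('ij,jk,ik->i', action_set, M, action_set), 0.0))
        return int(np.argmax(action_set @ self.theta_hat + beta * widths))

    def update(self, action, reward):
        """Discounted updates of V, V~, b and the estimate theta_hat."""
        g, g2, lam, d = self.gamma, self.gamma ** 2, self.lam, self.d
        outer = np.outer(action, action)
        self.V = g * self.V + outer + (1 - g) * lam * np.eye(d)
        self.V2 = g2 * self.V2 + outer + (1 - g2) * lam * np.eye(d)
        self.b = g * self.b + reward * action
        self.theta_hat = np.linalg.solve(self.V, self.b)
        self.t += 1
```

On each round, call select on the current action set, play the returned action, then call update with that action and the observed reward.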

22. Application to Non-Stationary Linear Bandits. D-LinUCB Algorithm (2).
Thanks to the scale-invariance property, for numerical stability of the implementation, we consider the time-dependent weights $w_{t,s} = \gamma^{t-s}$ for $1 \le s \le t$. The weighted least squares estimator is the solution of
$$\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t} \gamma^{t-s}(X_s - \langle A_s, \theta\rangle)^2 + \frac{\lambda}{2}\|\theta\|_2^2$$
This form is numerically stable and can be implemented recursively (but we revert to the standard form for the analysis).
