Linear Bandits: From Theory to Applications


  1. Linear Bandits: From Theory to Applications. Claire Vernade, DeepMind – Foundations Team. Credits: Csaba Szepesvári and Tor Lattimore for their blog.

  2. Sequential Decision Making

  3. Real-World Sequential Decision Making

  4. Table of Contents
     1. Linear Bandits
     2. Real-World Setting: Delayed Feedback

  5. Linear Bandits

  6. Linear Bandits
     1. In round $t$, observe the action set $\mathcal{A}_t \subset \mathbb{R}^d$.
     2. The learner chooses $A_t \in \mathcal{A}_t$ and receives $X_t$, satisfying $\mathbb{E}[X_t \mid \mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t] = \langle A_t, \theta_* \rangle := f_{\theta_*}(A_t)$ for some unknown $\theta_*$.
     3. Light-tailed noise: $X_t - \langle A_t, \theta_* \rangle = \eta_t \sim \mathcal{N}(0, 1)$.
     Goal: keep the regret
     $$R_n = \mathbb{E}\Big[\sum_{t=1}^n \Big( \max_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle - X_t \Big)\Big]$$
     small.
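To make the protocol concrete, here is a minimal simulation sketch of this interaction (the action-set distribution, the uniformly random baseline policy, and all constants are illustrative assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 5, 10, 1000
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)        # unknown theta*, normalised so mean rewards stay bounded

pseudo_regret = 0.0
for t in range(n):
    # Round t: an action set A_t of K unit vectors is revealed (arbitrary distribution).
    actions = rng.standard_normal((K, d))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)

    chosen = actions[rng.integers(K)]            # placeholder policy: choose uniformly at random
    reward = chosen @ theta_star + rng.standard_normal()   # X_t = <A_t, theta*> + eta_t, eta_t ~ N(0, 1)

    # Regret increment: best mean reward in A_t minus mean reward of the chosen action.
    pseudo_regret += (actions @ theta_star).max() - chosen @ theta_star

print(f"pseudo-regret of the uniform policy after {n} rounds: {pseudo_regret:.1f}")
```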

  7. Real-World Setting
     Typical setting: a user, represented by a feature vector $u_t$, shows up, and we have a finite set of (correlated) actions $(a_1, \dots, a_K)$. Some function $\Phi$ joins these vectors pairwise to create a contextualized action set:
     $$\Phi(u_t, a_i) = a_{t,i} \in \mathbb{R}^d \quad \forall i \in [K], \qquad \mathcal{A}_t = \{a_{t,1}, \dots, a_{t,K}\}.$$
     No assumption needs to be made on the joining function $\Phi$, since the bandit takes over the decision step only once this contextualized action set has been formed. It is therefore equivalent to assume $\mathcal{A}_t \sim \Pi(\mathbb{R}^d)$ for some arbitrary distribution $\Pi$, or $\mathcal{A}_1, \dots, \mathcal{A}_n$ fixed arbitrarily by the environment.
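As an illustration only (the slides deliberately leave $\Phi$ unspecified), one common joining function is the flattened outer product of user and action features:

```python
import numpy as np

def phi(user, action):
    """Hypothetical joining function: flattened outer product of user and action features,
    giving a contextualized action vector in R^(d_u * d_a)."""
    return np.outer(user, action).ravel()

rng = np.random.default_rng(1)
u_t = rng.standard_normal(4)                                  # user features u_t
base_actions = [rng.standard_normal(3) for _ in range(5)]     # raw actions a_1, ..., a_K

# Contextualized action set A_t = {phi(u_t, a_i) : i in [K]}, here a subset of R^12.
A_t = np.stack([phi(u_t, a) for a in base_actions])
print(A_t.shape)                                              # (5, 12)
```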

  8. Toolbox of the Optimist
     Say the reward in round $t$ is $X_t$ and the action in round $t$ is $A_t \in \mathbb{R}^d$:
     $$X_t = \langle A_t, \theta_* \rangle + \eta_t.$$
     We want to estimate $\theta_*$; the regularized least-squares estimator is
     $$\hat\theta_t = V_t^{-1} \sum_{s=1}^t A_s X_s, \qquad V_0 = \lambda I, \qquad V_t = V_0 + \sum_{s=1}^t A_s A_s^\top.$$
     Choice of confidence regions (ellipsoids) $\mathcal{C}_t$:
     $$\mathcal{C}_t := \Big\{ \theta \in \mathbb{R}^d : \|\theta - \hat\theta_{t-1}\|^2_{V_{t-1}} \le \beta_t \Big\},$$
     where, for $A$ positive definite, $\|x\|^2_A = x^\top A x$.
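A small sketch of these statistics maintained online (the class name and the default $\lambda = 1$ are mine; the radius $\beta$ is passed in, with a concrete choice given on the confidence-ellipsoid slide below):

```python
import numpy as np

class RegularizedLeastSquares:
    """Maintains V_t = lambda*I + sum_s A_s A_s^T and b_t = sum_s A_s X_s,
    so that theta_hat_t = V_t^{-1} b_t."""

    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)
        self.b = np.zeros(d)

    def update(self, action, reward):
        self.V += np.outer(action, action)
        self.b += reward * action

    @property
    def theta_hat(self):
        return np.linalg.solve(self.V, self.b)

    def in_ellipsoid(self, theta, beta):
        """Membership test for C = {theta : ||theta - theta_hat||_V^2 <= beta}."""
        diff = theta - self.theta_hat
        return float(diff @ self.V @ diff) <= beta
```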

  9. LinUCB
     "Choose the best action in the best environment amongst the plausible ones." Choose $\mathcal{C}_t$ with a suitable $(\beta_t)_t$ and let
     $$A_t = \operatorname*{argmax}_{a \in \mathcal{A}_t} \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle.$$
     Or, more concretely, for each action $a \in \mathcal{A}_t$, compute the "optimistic index"
     $$U_t(a) = \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle.$$
     Maximising a linear function over a closed convex set, the solution is explicit:
     $$A_t = \operatorname*{argmax}_a U_t(a) = \operatorname*{argmax}_a \Big( \langle a, \hat\theta_{t-1} \rangle + \sqrt{\beta_t}\, \|a\|_{V_{t-1}^{-1}} \Big).$$
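A minimal sketch of the resulting action rule, reusing the least-squares statistics from the sketch above (function names and the environment call are mine):

```python
import numpy as np

def linucb_index(action, theta_hat, V, beta):
    """Optimistic index U_t(a) = <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}}."""
    return action @ theta_hat + np.sqrt(beta * (action @ np.linalg.solve(V, action)))

def linucb_choose(actions, theta_hat, V, beta):
    """Return the row index of the optimistic action in the action set."""
    return int(np.argmax([linucb_index(a, theta_hat, V, beta) for a in actions]))

# Typical round, with `rls` an instance of the RegularizedLeastSquares sketch above:
#   i = linucb_choose(actions, rls.theta_hat, rls.V, beta_t)
#   reward = environment(actions[i])     # hypothetical environment call
#   rls.update(actions[i], reward)
```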

  10. Optimism in the Face of Uncertainty Principle

  11. Regret Bound
     Assumptions:
     1. Bounded scalar mean reward: $|\langle a, \theta_* \rangle| \le 1$ for any $a \in \cup_t \mathcal{A}_t$.
     2. Bounded actions: for any $a \in \cup_t \mathcal{A}_t$, $\|a\|_2 \le L$.
     3. Honest confidence intervals: there exists $\delta \in (0, 1)$ such that, with probability $1 - \delta$, for all $t \in [n]$, $\theta_* \in \mathcal{C}_t$ for some choice of $(\beta_t)_{t \le n}$.
     Theorem (LinUCB Regret). Let the conditions listed above hold. Then, with probability $1 - \delta$, the regret of LinUCB satisfies
     $$\hat R_n \le \sqrt{8 d n \beta_n \log\Big( \frac{d\lambda + n L^2}{d\lambda} \Big)}.$$

  12. Proof
     Jensen's inequality shows that
     $$\hat R_n = \sum_{t=1}^n \langle A_t^* - A_t, \theta_* \rangle =: \sum_{t=1}^n r_t \le \sqrt{n \sum_{t=1}^n r_t^2},$$
     where $A_t^* = \operatorname*{argmax}_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle$.
     Let $\tilde\theta_t$ be the vector that realizes the maximum over the ellipsoid: $\tilde\theta_t \in \mathcal{C}_t$ such that $\langle A_t, \tilde\theta_t \rangle = U_t(A_t)$. From the definition of LinUCB,
     $$\langle A_t^*, \theta_* \rangle \le U_t(A_t^*) \le U_t(A_t) = \langle A_t, \tilde\theta_t \rangle.$$
     Then,
     $$r_t \le \langle A_t, \tilde\theta_t - \theta_* \rangle \le \|A_t\|_{V_{t-1}^{-1}} \|\tilde\theta_t - \theta_*\|_{V_{t-1}} \le 2 \|A_t\|_{V_{t-1}^{-1}} \sqrt{\beta_t}.$$

  13. Elliptical Potential Lemma
     So we now have a new upper bound,
     $$\hat R_n = \sum_{t=1}^n r_t \le \sqrt{n \sum_{t=1}^n r_t^2} \le 2 \sqrt{n \beta_n \sum_{t=1}^n \big( 1 \wedge \|A_t\|^2_{V_{t-1}^{-1}} \big)}.$$
     Lemma (Abbasi-Yadkori et al. (2011)). Let $x_1, \dots, x_n \in \mathbb{R}^d$, $V_t = V_0 + \sum_{s=1}^t x_s x_s^\top$, $t \in [n]$, and $L \ge \max_t \|x_t\|_2$. Then,
     $$\sum_{t=1}^n \big( 1 \wedge \|x_t\|^2_{V_{t-1}^{-1}} \big) \le 2 \log\Big( \frac{\det V_n}{\det V_0} \Big) \le 2 d \log\Big( \frac{\operatorname{trace}(V_0) + n L^2}{d \det^{1/d}(V_0)} \Big).$$
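A quick numerical sanity check of the lemma (the random unit-norm $x_t$ and the choice $V_0 = \lambda I$ are arbitrary choices for the illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, lam, L = 5, 2000, 1.0, 1.0
V = lam * np.eye(d)
trace_V0 = np.trace(V)
logdet_V0 = np.linalg.slogdet(V)[1]

lhs = 0.0
for t in range(n):
    x = rng.standard_normal(d)
    x *= L / np.linalg.norm(x)                           # ||x_t||_2 = L
    lhs += min(1.0, float(x @ np.linalg.solve(V, x)))    # 1 ∧ ||x_t||^2_{V_{t-1}^{-1}}
    V += np.outer(x, x)

middle = 2 * (np.linalg.slogdet(V)[1] - logdet_V0)       # 2 log(det V_n / det V_0)
rhs = 2 * d * np.log((trace_V0 + n * L**2) / (d * np.exp(logdet_V0 / d)))
print(lhs <= middle <= rhs)                              # expected: True
print(lhs, middle, rhs)
```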

  14. Confidence Ellipsoids
     Assumptions: $\|\theta_*\| \le S$, and let $(A_s)_s$, $(\eta_s)_s$ be such that for any $1 \le s \le t$, $\eta_s \mid \mathcal{F}_{s-1} \sim \mathrm{subG}(1)$, where $\mathcal{F}_s = \sigma(A_1, \eta_1, \dots, A_{s-1}, \eta_{s-1}, A_s)$.
     Fix $\delta \in (0, 1)$. Let
     $$\beta_{t+1} = \sqrt{\lambda}\, S + \sqrt{2 \log\frac{1}{\delta} + \log\frac{\det V_t(\lambda)}{\lambda^d}} \le \sqrt{\lambda}\, S + \sqrt{2 \log\frac{1}{\delta} + d \log\frac{d\lambda + n L^2}{d\lambda}},$$
     and
     $$\mathcal{C}_{t+1} = \Big\{ \theta \in \mathbb{R}^d : \|\hat\theta_t - \theta\|_{V_t(\lambda)} \le \beta_{t+1} \Big\}.$$
     Theorem. $\mathcal{C}_{t+1}$ is a confidence set for $\theta_*$ at level $1 - \delta$: $\mathbb{P}(\theta_* \in \mathcal{C}_{t+1}) \ge 1 - \delta$.
     Proof: see Chapter 20 of Bandit Algorithms (www.banditalgs.com).
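A sketch of this radius computed from the empirical design matrix, using the determinant form (the values of $S$, $\lambda$ and $\delta$ below are placeholders):

```python
import numpy as np

def confidence_radius(V, lam, S, delta):
    """beta_{t+1} = sqrt(lam) * S + sqrt(2 log(1/delta) + log(det V_t(lam) / lam^d))."""
    d = V.shape[0]
    logdet_V = np.linalg.slogdet(V)[1]
    return np.sqrt(lam) * S + np.sqrt(2 * np.log(1 / delta) + logdet_V - d * np.log(lam))

# Illustrative call on a small design matrix after a handful of observations.
d, lam = 5, 1.0
V = lam * np.eye(d) + 3.0 * np.outer(np.ones(d), np.ones(d))
print(confidence_radius(V, lam, S=1.0, delta=0.05))
```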

  15. History
     • Abe and Long [4] introduced stochastic linear bandits into the machine learning literature.
     • Auer [6] was the first to consider optimism for linear bandits (LinRel, SupLinRel). Main restriction: $|\mathcal{A}_t| < +\infty$.
     • Confidence ellipsoids: Dani et al. [8] (ConfidenceBall$_2$), Rusmevichientong and Tsitsiklis [11] (Uncertainty Ellipsoid Policy), Abbasi-Yadkori et al. [3] (OFUL).
     • The name LinUCB comes from Chu et al. [7].
     • Alternative routes:
       • Explore-then-commit for action sets with a smooth boundary: Abbasi-Yadkori [1], Abbasi-Yadkori et al. [2], Rusmevichientong and Tsitsiklis [11].
       • Phased elimination.
       • Thompson sampling.

  16. Summary
     Theorem (LinUCB Regret). Let the conditions listed above hold. Then, with probability $1 - \delta$, the regret of LinUCB satisfies
     $$\hat R_n \le \sqrt{8 d n \beta_n \log\Big( \frac{\operatorname{trace}(V_0) + n L^2}{d \det^{1/d}(V_0)} \Big)} = O(d\sqrt{n}).$$
     Linear bandits are an elegant model of the exploration-exploitation dilemma when actions are correlated. The main ingredients of the regret analysis are:
     • bounding the instantaneous regret using the definition of optimism;
     • a maximal concentration inequality holding for a randomized, sequential design;
     • the Elliptical Potential Lemma.

  17. Real-World Setting: Delayed Feedback

  18. In a real-world application, rewards are delayed ...

  19. In a real-world application, rewards are delayed ... and censored.

  20. Delayed Linear Bandits
     Modified setting: at round $t \ge 1$,
     • receive the contextualized action set $\mathcal{A}_t = \{a_1, \dots, a_K\}$ and choose an action $A_t \in \mathcal{A}_t$;
     • two random variables are generated but not observed: $X_t \sim \mathcal{B}(\theta^\top A_t)$ and $D_t \sim \mathcal{D}(\tau)$;
     • at time $t + D_t$ the reward $X_t$ of action $A_t$ is disclosed ...
     • ... unless $D_t > m$: if the delay is too long, the reward is discarded.
     New parameter: $0 < m < T$ is the cut-off time of the system. If the delay is longer, the reward is never received. The delay distribution $\mathcal{D}(\tau)$ characterizes the proportion of converting actions: $\tau_m = \mathbb{P}(D_t \le m)$.
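A sketch of how the hidden pair $(X_t, D_t)$ and the censored observation can be simulated (the geometric delay distribution matches the simulation slide further below; clipping the Bernoulli mean to $[0, 1]$ is an assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(3)
mean_delay, m = 100, 250          # E[D_t] and the cut-off of the system

def play(action, theta):
    """Draw the hidden pair (X_t, D_t) when `action` is played."""
    x = rng.binomial(1, float(np.clip(action @ theta, 0.0, 1.0)))  # X_t ~ Bernoulli(theta^T A_t)
    delay = rng.geometric(1.0 / mean_delay)                        # D_t, here geometric with mean 100
    return x, delay

def observed_at(t, s, x, delay):
    """Reward of round s as seen at round t: disclosed at s + D_s, discarded if D_s > m."""
    if delay > m or s + delay > t:
        return None               # nothing observed (yet, or ever if censored)
    return x

x, delay = play(np.array([0.2, 0.1]), np.array([1.0, 2.0]))
print(x, delay, observed_at(t=300, s=1, x=x, delay=delay))
```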

  21. A New Estimator
     We now have:
     $$V_t = \sum_{s=1}^{t-1} A_s A_s^\top, \qquad \tilde b_t = \sum_{s=1}^{t-1} A_s X_s \mathbb{1}\{D_s \le m\},$$
     where $\tilde b_t$ contains additional, non-identically distributed samples:
     $$\tilde b_t = \sum_{s=1}^{t-m} A_s X_s \mathbb{1}\{D_s \le m\} + \sum_{s=t-m+1}^{t-1} A_s X_s \mathbb{1}\{D_s \le t-s\}.$$
     The "conditionally biased" least-squares estimator includes every received feedback:
     $$\hat\theta^b_t = V_t^{-1} \tilde b_t.$$
     Baseline: use the previous estimator but discard the last $m$ steps,
     $$\hat\theta^{\mathrm{disc}}_t = V_{t-m}^{-1} b_{t-m}, \qquad \text{with } \mathbb{E}[\hat\theta^{\mathrm{disc}}_t \mid \mathcal{F}_t] \approx \tau_m \theta.$$
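A direct, non-incremental sketch of the two estimators above; storing the full history keeps the indicator structure explicit (function name and 0-based round indices are mine):

```python
import numpy as np

def delayed_estimators(actions, rewards, delays, t, m, lam=1.0):
    """Compute theta_hat^b_t (uses every reward received by round t, censored at m)
    and theta_hat^disc_t (discards the last m rounds entirely)."""
    d = actions.shape[1]
    V, V_disc = lam * np.eye(d), lam * np.eye(d)
    b_tilde, b_disc = np.zeros(d), np.zeros(d)
    for s in range(t):
        a = actions[s]
        V += np.outer(a, a)
        # Reward of round s is available at round t iff D_s <= m and D_s <= t - s.
        b_tilde += a * rewards[s] * (delays[s] <= min(m, t - s))
        if s < t - m:                              # baseline: rounds older than the cut-off only
            V_disc += np.outer(a, a)
            b_disc += a * rewards[s] * (delays[s] <= m)
    theta_b = np.linalg.solve(V, b_tilde)          # "conditionally biased" estimator
    theta_disc = np.linalg.solve(V_disc, b_disc)   # E[theta_disc | F_t] ~ tau_m * theta
    return theta_b, theta_disc
```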

  22. Confidence Interval and the D-LinUCB Policy
     We remark that
     $$\hat\theta^b_t - \tau_m \theta = \underbrace{\hat\theta^b_t - \hat\theta^{\mathrm{disc}}_{t+m}}_{\text{finite bias}} + \underbrace{\hat\theta^{\mathrm{disc}}_{t+m} - \tau_m \theta}_{\text{same as before}}.$$
     For the new $\mathcal{C}_t$, we have new optimistic indices
     $$A_t = \operatorname*{argmax}_{a \in \mathcal{A}_t} \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle.$$
     But now, the solution has an extra (vanishing) bias term:
     $$A_t = \operatorname*{argmax}_a \Big( \langle a, \hat\theta^b_t \rangle + \beta_t \|a\|_{V_{t-1}^{-1}} + m \|a\|_{V_{t-1}^{-2}} \Big).$$
     D-LinUCB: an easy, straightforward, harmless modification of LinUCB, with regret guarantees in the delayed-feedback setting.
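The change to the action rule is a one-liner on top of the LinUCB sketch above; a sketch of the modified index (function names are mine, and $\hat\theta^b_t$ is the biased estimator from the previous sketch):

```python
import numpy as np

def d_linucb_index(action, theta_hat_b, V, beta, m):
    """Index <a, theta_hat^b> + beta * ||a||_{V^{-1}} + m * ||a||_{V^{-2}}."""
    V_inv_a = np.linalg.solve(V, action)
    return (action @ theta_hat_b
            + beta * np.sqrt(action @ V_inv_a)
            + m * np.sqrt(V_inv_a @ V_inv_a))

def d_linucb_choose(actions, theta_hat_b, V, beta, m):
    """Pick the action with the largest biased-optimistic index."""
    return int(np.argmax([d_linucb_index(a, theta_hat_b, V, beta, m) for a in actions]))
```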

  23. Regret Bound
     Theorem (D-LinUCB Regret). Under the same conditions as before, with $V_0 = \lambda I$, with probability $1 - \delta$ the regret of D-LinUCB satisfies
     $$\hat R_n \le \tau_m^{-1} \sqrt{8 d n \beta_n \log\Big( \frac{\operatorname{trace}(V_0) + n L^2}{d \det^{1/d}(V_0)} \Big)} + \frac{d m \tau_m^{-1}}{\lambda - 1} \log\Big( 1 + \frac{n}{d(\lambda - 1)} \Big).$$

  24. Simulations
     We fix $n = 3000$ and generate geometric delays with $\mathbb{E}[D_t] = 100$. In a real setting, this would correspond to an experiment that lasts 3 hours, with average delays of 6 minutes. Then, we let the cut-off vary, $m \in \{250, 500, 1000\}$, i.e. waiting times of 15 min, 30 min and 1 hour, respectively.
     Figure 1: Comparison of the simulated behaviors of D-LinUCB and (waiting) LinUCB. [Three panels, one per value of $m$; each plots the regret $R(T)$ against the round $t$ for DeLinUCB and WaiLinUCB.]

  25. Conclusions
     • Linear bandits are a powerful and well-understood way of solving the exploration-exploitation trade-off in a metric space;
     • the techniques have been extended to generalized linear models by Filippi et al. [9]
     • and to kernel regression by Valko et al. [12, 13].
     • Yet, including constraints and external sources of noise in real-world applications is challenging.
     • Some use cases challenge the bandit model assumptions...
     • ... and then it is time to open the box of MDPs (e.g. UCRL and KL-UCRL, Auer et al. [5], Filippi et al. [10]).
