

  1. Linear Bandits
     Dávid Pál
     Google, New York & Department of Computing Science, University of Alberta
     dpal@google.com
     November 2, 2011
     Joint work with Yasin Abbasi-Yadkori and Csaba Szepesvári

  2. Linear Bandits
     In round $t = 1, 2, \dots$:
     - Choose an action $X_t$ from a set $D_t \subset \mathbb{R}^d$.
     - Receive a reward $\langle X_t, \theta_* \rangle$ + random noise.
     - The weight vector $\theta_*$ is unknown but fixed.
     - Goal: maximize the total reward.
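As a supplement (not part of the original slides), here is a minimal Python sketch of this interaction protocol. The environment, decision sets, noise scale, and the placeholder `choose_action` strategy are all hypothetical stand-ins for whatever bandit algorithm is under study.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_rounds, R = 5, 1000, 0.1
theta_star = rng.normal(size=d)  # unknown to the learner

def choose_action(decision_set):
    # Placeholder strategy: pick uniformly at random; a real bandit
    # algorithm would balance exploration and exploitation here.
    return decision_set[rng.integers(len(decision_set))]

total_reward = 0.0
for t in range(n_rounds):
    D_t = rng.normal(size=(20, d))                 # decision set for round t
    x_t = choose_action(D_t)                       # choose X_t from D_t
    y_t = x_t @ theta_star + rng.normal(scale=R)   # <X_t, theta*> + noise
    total_reward += y_t
```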

  3. Motivation
     - Exploration & exploitation with side information
     - action = arm = ad = feature vector
     - reward = click

  4. Outline
     - Formal model & regret
     - Algorithm: the Optimism in the Face of Uncertainty principle
     - Confidence sets for least squares
     - Sparse models: Online-to-Confidence-Set Conversion

  5. Formal model
     Unknown but fixed weight vector $\theta_* \in \mathbb{R}^d$. In round $t = 1, 2, \dots$:
     - Receive $D_t \subset \mathbb{R}^d$
     - Choose an action $X_t \in D_t$
     - Receive a reward $Y_t = \langle X_t, \theta_* \rangle + \eta_t$
     The noise is conditionally $R$-sub-Gaussian, i.e.
     $\mathbb{E}[e^{\gamma \eta_t} \mid X_{1:t}, \eta_{1:t-1}] \le \exp\left(\frac{\gamma^2 R^2}{2}\right)$ for all $\gamma \in \mathbb{R}$.

  6. Sub-Gaussianity
     Definition: A random variable $Z$ is $R$-sub-Gaussian for some $R \ge 0$ if
     $\mathbb{E}[e^{\gamma Z}] \le \exp\left(\frac{\gamma^2 R^2}{2}\right)$ for all $\gamma \in \mathbb{R}$.
     The condition implies that
     - $\mathbb{E}[Z] = 0$
     - $\mathrm{Var}[Z] \le R^2$
     Examples:
     - Zero-mean, bounded in an interval of length $2R$ (Hoeffding-Azuma)
     - Zero-mean Gaussian with variance $\le R^2$
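A quick numerical sanity check of the definition (my addition, not from the slides): by Hoeffding's lemma, a zero-mean variable supported on $[-R, R]$ is $R$-sub-Gaussian, so a Monte Carlo estimate of its moment generating function should stay below the sub-Gaussian bound.

```python
import numpy as np

rng = np.random.default_rng(0)
R = 1.0
z = rng.uniform(-R, R, size=1_000_000)   # zero-mean, bounded in [-R, R]

for gamma in (0.5, 1.0, 2.0):
    mgf = np.mean(np.exp(gamma * z))      # estimate of E[exp(gamma Z)]
    bound = np.exp(gamma**2 * R**2 / 2)   # sub-Gaussian upper bound
    print(f"gamma={gamma}: mgf ~ {mgf:.4f} <= bound {bound:.4f}")
```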

  7. Regret
     - If we knew $\theta_*$, then in round $t$ we'd choose the action
       $X_t^* = \operatorname{argmax}_{x \in D_t} \langle x, \theta_* \rangle$
     - Regret is our reward in $n$ rounds relative to $X_t^*$:
       $\mathrm{Regret}_n = \sum_{t=1}^n \langle X_t^*, \theta_* \rangle - \sum_{t=1}^n \langle X_t, \theta_* \rangle$
     - We want $\mathrm{Regret}_n / n \to 0$ as $n \to \infty$.
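In simulation, where the experimenter knows $\theta_*$, this quantity is straightforward to compute; a small sketch (the function name is mine):

```python
import numpy as np

def cumulative_regret(theta_star, decision_sets, actions):
    """Regret_n = sum over t of <X*_t, theta*> - <X_t, theta*>."""
    regret = 0.0
    for D_t, x_t in zip(decision_sets, actions):
        best = max(x @ theta_star for x in D_t)  # reward of the optimal action X*_t
        regret += best - x_t @ theta_star
    return regret
```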

  8. Optimism in the Face of Uncertainty Principle
     - Maintain a confidence set $C_t \subseteq \mathbb{R}^d$ such that $\theta_* \in C_t$ with high probability.
     - In round $t$, choose
       $(X_t, \tilde{\theta}_t) = \operatorname{argmax}_{(x, \theta) \in D_t \times C_{t-1}} \langle x, \theta \rangle$
     - $\tilde{\theta}_t$ is an "optimistic" estimate of $\theta_*$.
     - The UCB algorithm is a special case.
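For the ellipsoidal confidence sets built from least squares (next slides), the joint argmax is easy to compute: for fixed $x$, $\max_{\theta \in C} \langle x, \theta \rangle = \langle x, \hat{\theta} \rangle + \beta \|x\|_{V^{-1}}$ by Cauchy-Schwarz in the $V$-norm, so the optimistic choice reduces to maximizing an upper confidence bound over the actions. A sketch under that assumption (all names are mine):

```python
import numpy as np

def optimistic_action(D_t, theta_hat, V, beta):
    """Return argmax over x in D_t of max over theta in C of <x, theta>,
    where C = {theta : ||theta - theta_hat||_V <= beta} is an ellipsoid."""
    V_inv = np.linalg.inv(V)
    # UCB(x) = <x, theta_hat> + beta * ||x||_{V^{-1}}
    widths = np.sqrt(np.einsum('ij,jk,ik->i', D_t, V_inv, D_t))
    ucb = D_t @ theta_hat + beta * widths
    return D_t[np.argmax(ucb)]
```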

  9. Least Squares
     - Data $(X_1, Y_1), \dots, (X_n, Y_n)$ such that $Y_t \approx \langle X_t, \theta_* \rangle$
     - Stack them into matrices: $X_{1:n}$ is $n \times d$ and $Y_{1:n}$ is $n \times 1$
     - Least squares estimate: $\hat{\theta}_n = (X_{1:n}^\top X_{1:n} + \lambda I)^{-1} X_{1:n}^\top Y_{1:n}$
     - Let $V_n = X_{1:n}^\top X_{1:n} + \lambda I$

     Theorem: If $\|\theta_*\|_2 \le S$, then with probability at least $1 - \delta$, for all $t$, $\theta_*$ lies in
     $C_t = \left\{ \theta : \|\hat{\theta}_t - \theta\|_{V_t} \le R \sqrt{2 \ln \frac{\det(V_t)^{1/2}}{\delta \det(\lambda I)^{1/2}}} + S \sqrt{\lambda} \right\}$
     where $\|v\|_A = \sqrt{v^\top A v}$ is the matrix $A$-norm.
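A direct transcription of the estimator and the confidence radius into Python (a sketch; `slogdet` is used for numerical stability, and the function names are mine):

```python
import numpy as np

def ridge_estimate(X, Y, lam):
    """theta_hat = (X^T X + lam I)^{-1} X^T Y, with V = X^T X + lam I."""
    V = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(V, X.T @ Y), V

def confidence_radius(V, lam, R, S, delta):
    """R * sqrt(2 ln(det(V)^{1/2} / (delta det(lam I)^{1/2}))) + S sqrt(lam)."""
    d = V.shape[0]
    _, logdet_V = np.linalg.slogdet(V)
    log_ratio = 0.5 * logdet_V - 0.5 * d * np.log(lam) + np.log(1.0 / delta)
    return R * np.sqrt(2.0 * log_ratio) + S * np.sqrt(lam)
```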

  10. Confidence Set $C_t$
      [Figure: the ellipsoid $C_t$ centered at $\hat{\theta}_t$, containing $\theta_*$, with $\tilde{\theta}_{t+1}$ on its boundary.]
      - The least squares solution $\hat{\theta}_t$ is the center of $C_t$.
      - $\theta_*$ lies somewhere in $C_t$ w.h.p.
      - The next optimistic estimate $\tilde{\theta}_{t+1}$ is on the boundary of $C_t$.

  11. Comparison with Previous Confidence Sets
      - Our bound:
        $\|\hat{\theta}_t - \theta_*\|_{V_t} \le R \sqrt{2 \ln \frac{\det(V_t)^{1/2}}{\delta \det(\lambda I)^{1/2}}} + S \sqrt{\lambda}$
      - [Dani et al.(2008)] If $\|\theta_*\|_2, \|X_t\|_2 \le 1$, then for a specific $\lambda$:
        $\|\hat{\theta}_t - \theta_*\|_{V_t} \le R \max\left\{ \sqrt{128\, d \ln(t) \ln(t^2/\delta)},\ \frac{8}{3} \ln(t^2/\delta) \right\}$
      - [Rusmevichientong and Tsitsiklis(2010)] If $\|X_t\|_2 \le 1$:
        $\|\hat{\theta}_t - \theta_*\|_{V_t} \le 2 R \kappa \sqrt{d \ln t + \ln(t^2/\delta)} + \sqrt{\lambda}\, S \ln t$
        where $\kappa = \sqrt{3 + 2 \ln((1 + \lambda d)/\lambda)}$.
      Our bound doesn't depend on $t$.

  12. Regret of the Bandit Algorithm
      Theorem ([Dani et al.(2008)]): If $\|\theta_*\|_2 \le 1$ and the $D_t$'s are subsets of the unit 2-ball, then with probability at least $1 - \delta$,
      $\mathrm{Regret}_n \le O(R d \sqrt{n} \cdot \mathrm{polylog}(n, d, 1/\delta))$
      We get the same result with a smaller $\mathrm{polylog}(n, d, 1/\delta)$ factor.
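To see the $\sqrt{n}$ scaling empirically, here is a self-contained simulation of the optimistic algorithm with the least-squares confidence set from slide 9. It is a sketch under arbitrary problem parameters (the dimension, action sets, and noise level are my choices), not a tuned implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, R, S, delta = 5, 2000, 1.0, 0.1, 1.0, 0.01
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)               # ||theta*||_2 = 1 <= S

V, XtY, regret = lam * np.eye(d), np.zeros(d), 0.0
for t in range(1, n + 1):
    D_t = rng.normal(size=(20, d))
    D_t /= np.linalg.norm(D_t, axis=1, keepdims=True)  # actions in the unit ball
    theta_hat = np.linalg.solve(V, XtY)                # ridge estimate
    logdet_ratio = np.linalg.slogdet(V)[1] - d * np.log(lam)
    beta = R * np.sqrt(logdet_ratio + 2 * np.log(1 / delta)) + S * np.sqrt(lam)
    V_inv = np.linalg.inv(V)
    ucb = D_t @ theta_hat + beta * np.sqrt(np.einsum('ij,jk,ik->i', D_t, V_inv, D_t))
    x = D_t[np.argmax(ucb)]                            # optimistic action
    y = x @ theta_star + rng.normal(scale=R)           # noisy reward
    V += np.outer(x, x)                                # update V_t and X^T Y
    XtY += y * x
    regret += (D_t @ theta_star).max() - x @ theta_star

print(f"cumulative regret after {n} rounds: {regret:.1f}, per round: {regret / n:.4f}")
```

The per-round regret printed at the end should shrink as $n$ grows, consistent with $\mathrm{Regret}_n / n \to 0$.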

  13. Sparse Bandits
      What if $\theta_*$ is sparse?
      - It's not a good idea to use least squares.
      - Better to use, e.g., $L_1$-regularization.
      - How do we construct confidence sets?
      Our new technique: Online-to-Confidence-Set Conversion
      - Similar to Online-to-Batch Conversion, but very different.
      - We start with an online prediction algorithm.

  14. Online Prediction Algorithms
      In round $t$:
      - Receive $X_t \in \mathbb{R}^d$
      - Predict $\hat{Y}_t \in \mathbb{R}$
      - Receive the correct label $Y_t \in \mathbb{R}$
      - Suffer loss $(Y_t - \hat{Y}_t)^2$
      No assumptions whatsoever on $(X_1, Y_1), (X_2, Y_2), \dots$
      There are heaps of algorithms with this structure:
      - online gradient descent [Zinkevich(2003)] (sketched below)
      - online least-squares [Azoury and Warmuth(2001), Vovk(2001)]
      - exponentiated gradient [Kivinen and Warmuth(1997)]
      - online LASSO (??)
      - SeqSEW [Gerchinovitz(2011), Dalalyan and Tsybakov(2007)]
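As a concrete instance of this protocol, a minimal online gradient descent predictor in the spirit of [Zinkevich(2003)]; the step size and projection radius below are arbitrary choices of mine:

```python
import numpy as np

def ogd_predict(stream, d, eta=0.1, radius=1.0):
    """Online gradient descent on the squared loss over a stream of (x_t, y_t)."""
    theta, losses = np.zeros(d), []
    for x, y in stream:
        y_hat = theta @ x                          # predict before seeing y_t
        losses.append((y - y_hat) ** 2)            # suffer squared loss
        theta = theta - eta * 2 * (y_hat - y) * x  # gradient of (y - <x, theta>)^2
        norm = np.linalg.norm(theta)
        if norm > radius:                          # project back onto the ball
            theta *= radius / norm
    return losses
```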

  15. Online Prediction Algorithms, cont'd
      - Regret with respect to a linear predictor $\theta \in \mathbb{R}^d$:
        $\rho_n(\theta) = \sum_{t=1}^n (Y_t - \hat{Y}_t)^2 - \sum_{t=1}^n (Y_t - \langle X_t, \theta \rangle)^2$
      - Prediction algorithms come with "regret bounds" $B_n$: for all $n$, $\rho_n(\theta) \le B_n$.
      - $B_n$ depends on $n$, $d$, $\theta$, and possibly $X_1, X_2, \dots, X_n$ and $Y_1, Y_2, \dots, Y_n$.
      - Typically, $B_n = O(\sqrt{n})$ or $B_n = O(\log n)$.

  16. Online-to-Confidence-Set Conversion
      - Data $(X_1, Y_1), \dots, (X_n, Y_n)$ where $Y_t = \langle X_t, \theta_* \rangle + \eta_t$ and $\eta_t$ is conditionally $R$-sub-Gaussian.
      - Predictions $\hat{Y}_1, \hat{Y}_2, \dots, \hat{Y}_n$
      - Regret bound $\rho_n(\theta_*) \le B_n$

      Theorem (Conversion): With probability at least $1 - \delta$, for all $n$, $\theta_*$ lies in
      $C_n = \left\{ \theta \in \mathbb{R}^d : \sum_{t=1}^n (\hat{Y}_t - \langle X_t, \theta \rangle)^2 \le 1 + 2 B_n + 32 R^2 \ln \frac{R \sqrt{8} + \sqrt{1 + B_n}}{\delta} \right\}$
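A small sketch of the resulting confidence set, transcribing the theorem's right-hand side (the function names are mine):

```python
import numpy as np

def conversion_radius_sq(B_n, R, delta):
    """1 + 2 B_n + 32 R^2 ln((R sqrt(8) + sqrt(1 + B_n)) / delta)."""
    return 1 + 2 * B_n + 32 * R**2 * np.log((R * np.sqrt(8) + np.sqrt(1 + B_n)) / delta)

def in_confidence_set(theta, X, Y_hat, B_n, R, delta):
    """theta is in C_n iff the sum of (Y_hat_t - <X_t, theta>)^2 is within the radius."""
    return np.sum((Y_hat - X @ theta) ** 2) <= conversion_radius_sq(B_n, R, delta)
```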

  17. Optimistic Algorithm with Conversion
      Theorem: If $|\langle x, \theta_* \rangle| \le 1$ for all $x \in D_t$ and all $t$, then with probability at least $1 - \delta$, for all $n$, the regret of the Optimistic Algorithm is
      $\mathrm{Regret}_n \le O\left(\sqrt{d\, n\, B_n} \cdot \mathrm{polylog}(n, d, 1/\delta, B_n)\right)$

  18. Bandits combined with SeqSEW
      Theorem ([Gerchinovitz(2011)]): If $\|\theta\|_\infty \le 1$ and $\|\theta\|_0 \le p$, then the SeqSEW algorithm has regret bound $\rho_n(\theta) \le B_n = O(p \log(nd))$.
      Suppose $\|\theta_*\|_2 \le 1$ and $\|\theta_*\|_0 \le p$. Via the conversion, the Optimistic Algorithm has regret
      $O(R \sqrt{p d n} \cdot \mathrm{polylog}(n, d, 1/\delta))$,
      which is better than $O(R d \sqrt{n} \cdot \mathrm{polylog}(n, d, 1/\delta))$.

  19. Open problems
      - Confidence sets for batch algorithms, e.g., offline LASSO.
      - An adaptive bandit algorithm that doesn't need $p$ upfront.

  20. Questions? Read papers at http://david.palenica.com/

  21. References
      Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211–246, 2001.
      Arnak S. Dalalyan and Alexandre B. Tsybakov. Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th Annual Conference on Learning Theory, pages 97–111, 2007.
      Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Rocco Servedio and Tong Zhang, editors, Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 355–366, 2008.
      Sébastien Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. In Proceedings of the 24th Annual Conference on Learning Theory (COLT 2011), 2011.
      Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, January 1997.
      Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
      Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.
      Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
