  1. Matrix-Free Preconditioning in Online Learning. Ashok Cutkosky, Tamas Sarlos (Google Research)

  2. Online Optimization. For $t = 1, \dots, T$, repeat: (1) the learner chooses a point $w_t$; (2) the environment presents the learner with a gradient $g_t$ (think $\mathbb{E}[g_t] = \nabla F(w_t)$); (3) the learner suffers loss $\langle g_t, w_t \rangle$. The objective is to minimize the regret: $R_T(w^\star) = \sum_{t=1}^T \underbrace{\langle g_t, w_t \rangle}_{\text{loss suffered}} - \underbrace{\langle g_t, w^\star \rangle}_{\text{benchmark loss}}$

  3. Online Optimization (continued). Running an online algorithm on a stochastic optimization problem guarantees $F(\overline{w}_T) - F(w^\star) \le \frac{R_T(w^\star)}{T}$, where $\overline{w}_T$ is the average iterate.
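The protocol on the slides above can be sketched in a few lines of Python. The environment here (noisy gradients of a quadratic $F$) and the trivial learner are illustrative assumptions, not part of the talk:

```python
# Sketch of the online optimization protocol: learner predicts w_t,
# environment returns a stochastic gradient g_t with E[g_t] = grad F(w_t),
# and we accumulate the regret sum <g_t, w_t> - <g_t, w_star>.
# The quadratic F and ZeroLearner below are purely illustrative.
import numpy as np

def online_protocol(learner, T=1000, d=5, seed=0):
    rng = np.random.default_rng(seed)
    w_star = np.ones(d)                               # hypothetical minimizer
    regret = 0.0
    for _ in range(T):
        w_t = learner.predict()
        # noisy gradient of F(w) = 0.5 * ||w - w_star||^2
        g_t = (w_t - w_star) + 0.1 * rng.standard_normal(d)
        regret += g_t @ w_t - g_t @ w_star            # loss suffered - benchmark loss
        learner.update(g_t)
    return regret

class ZeroLearner:
    """Trivial learner that always plays w_t = 0 (for illustration only)."""
    def __init__(self, d=5):
        self.w = np.zeros(d)
    def predict(self):
        return self.w
    def update(self, g):
        pass
```

Any learner exposing `predict`/`update` can be plugged into this loop; the trivial one naturally incurs large positive regret against $w^\star$.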

  4. The Classic Algorithm: Gradient Descent. $w_{t+1} = w_t - \eta_t g_t$

  5. The Classic Algorithm: Gradient Descent (continued). Gradient descent obtains regret $R_T(w^\star) \le \sqrt{\|w^\star\|^2 \sum_{t=1}^T \|g_t\|^2}$
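A minimal online gradient descent sketch follows. The adaptive step size $\eta_t = D / \sqrt{\sum_{s \le t} \|g_s\|^2}$ is one standard choice that yields a bound of the above form; the constant `D` (a guess for $\|w^\star\|$) and the small epsilon are illustrative assumptions:

```python
# Online gradient descent with an adaptive step size
# eta_t = D / sqrt(sum of squared gradient norms so far).
# D is a hypothetical bound on ||w_star||; 1e-12 avoids division by zero.
import numpy as np

class OnlineGradientDescent:
    def __init__(self, d, D=1.0):
        self.w = np.zeros(d)
        self.D = D
        self.sum_sq = 0.0          # running sum of ||g_s||^2

    def predict(self):
        return self.w

    def update(self, g):
        self.sum_sq += g @ g
        eta = self.D / (np.sqrt(self.sum_sq) + 1e-12)
        self.w = self.w - eta * g  # w_{t+1} = w_t - eta_t g_t
```

On a simple stochastic quadratic this stays close to the minimizer and keeps cumulative regret far below the trivial linear-in-$T$ growth.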

  6. Gradient Descent (figure).

  7. Preconditioning (Deterministic). The gradient $\nabla F(w)$ may not point towards the minimum $w^\star$.

  8. Preconditioning (Deterministic, continued). Key idea: "preconditioning" means ignoring irrelevant directions.

  9. Preconditioning (Stochastic). Noise can also make $g_t$ not point towards the minimum.

  10. Regret Bounds. The regret of un-preconditioned stochastic gradient descent (with the appropriate learning rate) is $R_T(w^\star) \le \sqrt{\|w^\star\|^2 \sum_{t=1}^T \|g_t\|^2} = O(\sqrt{T})$. An ideal preconditioned algorithm should obtain regret $R_T(w^\star) \le \sqrt{\sum_{t=1}^T \langle w^\star, g_t \rangle^2} = O(\sqrt{T})$.
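The gap between these two bounds is easy to see numerically. In the hypothetical setup below, only one of $d$ coordinates is relevant to $w^\star$, so the un-preconditioned bound pays for gradient energy in all $d$ directions while the ideal bound only pays for the relevant one:

```python
# Numerical comparison of the two bound quantities above when only one
# of d directions matters: sqrt(||w*||^2 sum ||g_t||^2) vs
# sqrt(sum <w*, g_t>^2). The Gaussian gradients are an illustrative choice.
import numpy as np

def compare_bounds(T=1000, d=100, seed=0):
    rng = np.random.default_rng(seed)
    w_star = np.zeros(d)
    w_star[0] = 1.0                        # only coordinate 0 is relevant
    G = rng.standard_normal((T, d))        # gradients with energy in all d directions
    unprec = np.sqrt((w_star @ w_star) * (G ** 2).sum())
    ideal = np.sqrt(((G @ w_star) ** 2).sum())
    return unprec, ideal
```

Here `unprec` scales like $\sqrt{Td}$ while `ideal` scales like $\sqrt{T}$, so the ratio grows like $\sqrt{d}$.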

  11. Regret Bound Picture (figure).

  12. Goals. Want a regret bound as good as if we had ignored irrelevant directions (up to constants/logs).

  13. Using the Covariance Matrix. The typical approach to preconditioning maintains the matrix $G = \sum_{t=1}^T g_t g_t^\top$ and computes various inverses and square roots of $G$. This can obtain the guarantee [CO18; KL17]: $R_T(w^\star) \le \sqrt{d \sum_{t=1}^T \langle w^\star, g_t \rangle^2}$
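A sketch of this covariance-matrix approach makes the costs concrete. The step size, epsilon, and the particular $G^{-1/2}$ update rule below are illustrative choices, not the exact algorithm from the cited papers:

```python
# Full-matrix preconditioning sketch: maintain G = sum_t g_t g_t^T and
# precondition updates by (an inverse square root of) G. Note the O(d^2)
# memory and O(d^3) eigendecomposition per step -- exactly the costs that
# motivate a matrix-free method. Step size / eps are illustrative.
import numpy as np

class FullMatrixPreconditioner:
    def __init__(self, d, eta=1.0, eps=1e-8):
        self.w = np.zeros(d)
        self.G = np.zeros((d, d))      # running sum of outer products: O(d^2) memory
        self.eta, self.eps = eta, eps

    def update(self, g):
        self.G += np.outer(g, g)                       # O(d^2) per step
        vals, vecs = np.linalg.eigh(self.G)            # O(d^3) per step
        inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, self.eps))) @ vecs.T
        self.w = self.w - self.eta * inv_sqrt @ g      # preconditioned step
```

Even this small sketch makes the $d^2$ storage and per-step matrix factorization visible, which is the bottleneck the next slide discusses.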

  14. Issues with Using the Covariance Matrix. $d^2$ time is too slow; there is a lot of work on compressing the matrix to try to make some tradeoff [Luo+16; GKS18; Aga+18].

  15. Issues with Using the Covariance Matrix (continued). The regret bound might not even be better! $\sqrt{d \sum_{t=1}^T \langle w^\star, g_t \rangle^2} \overset{?}{\le} \sqrt{\|w^\star\|^2 \sum_{t=1}^T \|g_t\|^2}$

  16. Goals. 1: Want a regret bound as good as if we had ignored irrelevant directions (up to constants/logs). 2: Want an efficient algorithm ($O(d)$ time per update in $d$ dimensions). 3: Want to never do worse than non-preconditioned algorithms.

  17. Goals (continued). We will achieve 2 and 3, and sometimes 1.

  18. Our Contribution. We provide an online learning algorithm that: runs in $O(d)$ time per update; always achieves regret $R_T(w^\star) \le \|w^\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2}$; and, when $-\left\langle \sum_{t=1}^T g_t,\ w^\star / \|w^\star\| \right\rangle \ge \sqrt{\sum_{t=1}^T \|g_t\|^2}$, achieves $R_T(w^\star) \le \sqrt{\sum_{t=1}^T \langle w^\star, g_t \rangle^2}$

  19. Unpacking the Condition. We need $-\left\langle \sum_{t=1}^T g_t,\ w^\star / \|w^\star\| \right\rangle \ge \sqrt{\sum_{t=1}^T \|g_t\|^2}$ for preconditioned regret. If the $g_t$ are mean-zero independent random variables, then standard concentration results say: $-\left\langle \sum_{t=1}^T g_t,\ w^\star / \|w^\star\| \right\rangle \le \left\| \sum_{t=1}^T g_t \right\| = \Theta\!\left( \sqrt{\sum_{t=1}^T \|g_t\|^2} \right)$

  20. Unpacking the Condition (continued). We achieve preconditioning whenever there is any "signal" in the gradients.
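A quick simulation illustrates the concentration claim: for mean-zero independent gradients, $\|\sum_t g_t\|$ is on the order of $\sqrt{\sum_t \|g_t\|^2}$, so pure noise sits right at the boundary of the condition. The Gaussian gradients below are an illustrative assumption:

```python
# Numerical check: for mean-zero i.i.d. gradients, ||sum_t g_t|| and
# sqrt(sum_t ||g_t||^2) are the same order of magnitude, so the
# preconditioning condition holds exactly when there is real "signal".
import numpy as np

def signal_vs_noise(T=5000, d=10, seed=0):
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((T, d))        # mean-zero independent gradients
    norm_of_sum = np.linalg.norm(G.sum(axis=0))
    sqrt_sum_sq = np.sqrt((G ** 2).sum())
    return norm_of_sum, sqrt_sum_sq
```

Both quantities scale like $\sqrt{Td}$ here, so their ratio stays bounded; adding any consistent drift to the gradients pushes $\|\sum_t g_t\|$ strictly above the noise level.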

  21. Coin Betting [OP16]. Define the wealth: $\text{Wealth}_T = 1 - \sum_{t=1}^T \langle g_t, w_t \rangle$

  22. Coin Betting [OP16] (continued). High wealth implies low regret: $R_T(w^\star) = \underbrace{1 - \sum_{t=1}^T \langle g_t, w^\star \rangle}_{\text{out of our control}} - \text{Wealth}_T$

  23. Coin Betting [OP16] (continued). At every iteration, choose a betting fraction $v_t \in \mathbb{R}^d$ and use $w_t = v_t \cdot \text{Wealth}_{t-1}$
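A one-dimensional coin-betting sketch shows the wealth mechanics. The Krichevsky-Trofimov fraction $v_t = (\sum_{s<t} -g_s)/t$ used below is a classic choice from the coin-betting literature, not necessarily the rule from this talk, and gradients are assumed to lie in $[-1, 1]$:

```python
# Minimal 1-D coin betting: bet w_t = v_t * Wealth_{t-1} and update
# Wealth_t = Wealth_{t-1} - g_t * w_t. The betting fraction v_t is the
# Krichevsky-Trofimov estimate (an illustrative standard choice), and
# we assume |g_t| <= 1 so that |v_t| < 1 keeps wealth positive.
def coin_betting(gradients):
    wealth = 1.0
    sum_neg_g = 0.0            # running sum of -g_s
    bets = []
    for t, g in enumerate(gradients, start=1):
        v = sum_neg_g / t      # KT betting fraction, always in (-1, 1)
        w = v * wealth         # w_t = v_t * Wealth_{t-1}
        wealth -= g * w        # Wealth_t = Wealth_{t-1} - g_t * w_t
        sum_neg_g += -g
        bets.append(w)
    return wealth, bets
```

On a maximally biased "coin" ($g_t = -1$ for all $t$) wealth grows exponentially, while on a balanced $\pm 1$ sequence it stays positive but small, matching the high-wealth-implies-low-regret picture.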

  24. An Oracle Value for $v$ Yields a Good Algorithm. Set $v_t = v^\star \approx \frac{w^\star}{\|w^\star\| \sqrt{\sum_{t=1}^T \langle g_t, w^\star \rangle^2}}$. Then $R_T(w^\star) \le \sqrt{\sum_{t=1}^T \langle w^\star, g_t \rangle^2}$. There are no matrices here! But we don't know this magic value for $v$.

  25. Online Learning Inside Online Learning [CO18]. Define $\ell_t(v) = -\log(1 - \langle g_t, v \rangle)$ and the inner regret $R_T^v(v^\star) := \sum_{t=1}^T \ell_t(v_t) - \ell_t(v^\star)$. If $R_T^v(v^\star) = O(\log T)$, then the final regret $R_T(w^\star)$ is the same as if we had used the constant $v_t = v^\star$.

  26. Online Learning Inside Online Learning [CO18] (continued). We can use online learning to choose the $v_t$!

  27. Overview of Algorithm Strategy. There exists an unknown $v^\star$ that would give preconditioned regret. We can choose $v_t$ using online convex optimization on the losses $\ell_t(v) = -\log(1 - \langle g_t, v \rangle)$. If we get $R_T^v(v^\star) = \sum_{t=1}^T \ell_t(v_t) - \ell_t(v^\star) = O(\log T)$, then we are as good as picking $v^\star$ from the beginning. So how can we obtain logarithmic regret?
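The nested structure above can be sketched directly: an inner online learner picks the betting fraction $v_t$ by descending on $\ell_t(v) = -\log(1 - \langle g_t, v \rangle)$, and the outer learner plays $w_t = v_t \cdot \text{Wealth}_{t-1}$. The inner step size and the clipping of $v$ to a small ball are illustrative choices, not the exact inner algorithm from the paper:

```python
# "Online learning inside online learning" sketch: inner gradient descent
# on l_t(v) = -log(1 - <g_t, v>) chooses the betting fraction; the outer
# player bets w_t = v_t * Wealth_{t-1}. Step size and v-clipping are
# illustrative, chosen to keep 1 - <g_t, v> positive for |g_t| <= 1.
import numpy as np

def betting_through_onl(gradients, inner_eta=0.1, v_max=0.5):
    d = len(gradients[0])
    v = np.zeros(d)
    wealth = 1.0
    ws = []
    for g in gradients:
        w = v * wealth                     # outer prediction w_t = v_t * Wealth_{t-1}
        ws.append(w)
        wealth -= g @ w                    # wealth update
        # inner step: gradient of l(v) = -log(1 - <g, v>) at the current v
        denom = max(1.0 - g @ v, 1e-6)
        grad_v = g / denom
        v = v - inner_eta * grad_v
        # keep v in a small ball so the log stays defined (illustrative)
        norm = np.linalg.norm(v)
        if norm > v_max:
            v *= v_max / norm
    return wealth, ws
```

When the gradients carry consistent signal, the inner learner quickly finds a good betting direction and wealth compounds, which is exactly the behavior the regret reduction exploits.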

  28. How to Obtain Logarithmic Regret? Strategy: remember that the constant $v^\star$ we need to compete with is $v^\star = \frac{w^\star}{\|w^\star\| \sqrt{\sum_{t=1}^T \langle g_t, w^\star \rangle^2}}$, so $\|v^\star\| = O(1/\sqrt{T})$ usually. This means that we can use a non-preconditioned online learning algorithm to obtain logarithmic regret: $R_T^v(v^\star) \le \|v^\star\| \sqrt{T} = O(1)$
