Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise


  1. Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise. Andrei Kulunchakov, Julien Mairal (Inria Grenoble). ML in the Real World, Criteo.

  2. Publications (Andrei Kulunchakov)
     - A. Kulunchakov and J. Mairal. Estimate Sequences for Variance-Reduced Stochastic Composite Optimization. International Conference on Machine Learning (ICML), 2019.
     - A. Kulunchakov and J. Mairal. Estimate Sequences for Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise. arXiv preprint arXiv:1901.08788, 2019.

  3. Context. Many subspace identification approaches require solving a composite optimization problem
         min_{x ∈ R^p} { F(x) := f(x) + ψ(x) },
     where f is L-smooth and convex, and ψ is convex.

  4. Context. Many subspace identification approaches require solving a composite optimization problem
         min_{x ∈ R^p} { F(x) := f(x) + ψ(x) },
     where f is L-smooth and convex, and ψ is convex.
     Two settings of interest. Particularly interesting structures in machine learning are
         f(x) = (1/n) Σ_{i=1}^n f_i(x)   or   f(x) = E[f̃(x, ξ)].
     Those can typically be addressed with
     - variants of SGD for the general stochastic case;
     - variance-reduced algorithms such as SVRG, SAGA, MISO, SARAH, SDCA, Katyusha, ...
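To make the two settings concrete, here is a minimal NumPy sketch, not taken from the talk: the squared losses, the ℓ1 regularizer ψ, and all variable names are illustrative assumptions. It evaluates the composite objective F(x) = f(x) + ψ(x) once with f as a finite sum and once with f as an expectation estimated by sampling.

```python
import numpy as np

# Illustrative problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2, psi(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
n, p, lam = 100, 10, 0.1
A = rng.normal(size=(n, p))
b = rng.normal(size=n)

def psi(x):
    # Non-smooth convex regularizer (here: l1 norm).
    return lam * np.abs(x).sum()

def f_finite_sum(x):
    # f(x) = (1/n) * sum_i f_i(x): every term can be visited explicitly.
    return 0.5 * np.mean((A @ x - b) ** 2)

def f_expectation(x, n_samples=1000):
    # f(x) = E[f~(x, xi)]: only sampled estimates are available; here xi
    # indexes a random data point plus an additive random perturbation.
    idx = rng.integers(0, n, size=n_samples)
    noise = rng.normal(scale=0.1, size=(n_samples, p))
    residuals = ((A[idx] + noise) * x).sum(axis=1) - b[idx]
    return 0.5 * np.mean(residuals ** 2)

x0 = rng.normal(size=p)
print("finite-sum  F(x0):", f_finite_sum(x0) + psi(x0))
print("expectation F(x0):", f_expectation(x0) + psi(x0))
```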

  5. Basics of gradient-based optimization: smooth vs non-smooth.
     [Figure: (a) a smooth function; (b) a non-smooth function.]
     An important quantity to quantify smoothness is the Lipschitz constant of the gradient:
         ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.

  6. Basics of gradient-based optimization: smooth vs non-smooth.
     [Figure: (a) a smooth function; (b) a non-smooth function.]
     An important quantity to quantify smoothness is the Lipschitz constant of the gradient:
         ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.
     If f is twice differentiable, L may be chosen as the largest eigenvalue of the Hessian ∇²f. This is an upper bound on the function curvature.

  7. Basics of gradient-based optimization: convex vs non-convex.
     [Figure: (a) a non-convex function; (b) a convex function; (c) a strongly-convex function.]
     An important quantity to quantify convexity is the strong-convexity constant µ:
         f(x) ≥ f(y) + ∇f(y)ᵀ(x − y) + (µ/2) ‖x − y‖².

  8. Basics of gradient-based optimization: convex vs non-convex.
     [Figure: (a) a non-convex function; (b) a convex function; (c) a strongly-convex function.]
     An important quantity to quantify convexity is the strong-convexity constant µ:
         f(x) ≥ f(y) + ∇f(y)ᵀ(x − y) + (µ/2) ‖x − y‖².
     If f is twice differentiable, µ may be chosen as the smallest eigenvalue of the Hessian ∇²f. This is a lower bound on the function curvature.
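As a toy illustration of these two constants, the following sketch (the ridge-regression objective and all names are assumptions, not from the slides) reads L and µ off the extreme eigenvalues of a constant Hessian and reports the condition number L/µ that the next slides focus on.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 20, 1e-2
A = rng.normal(size=(n, p))
b = rng.normal(size=n)

# f(x) = (1/2n) ||Ax - b||^2 + (lam/2) ||x||^2 has the constant Hessian
# H = (1/n) A^T A + lam * I, so L and mu do not depend on x.
H = A.T @ A / n + lam * np.eye(p)
eigs = np.linalg.eigvalsh(H)        # eigenvalues in ascending order

L, mu = eigs[-1], eigs[0]           # largest / smallest eigenvalue
print(f"L = {L:.3f}, mu = {mu:.4f}, condition number L/mu = {L / mu:.1f}")
```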

  9. Basics of gradient-based optimization.
     [Figure from F. Bach.]
     Why is the condition number L/µ important?

  10. Basics of gradient-based optimization.
      [Figure from F. Bach: trajectory of gradient descent with optimal step size.]
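Since the figure itself is not reproduced here, the following sketch (the 2D quadratic, the step size 1/L, and all names are illustrative assumptions) traces gradient descent on an ill-conditioned problem and shows how progress is throttled by the condition number L/µ.

```python
import numpy as np

# Ill-conditioned 2D quadratic: f(x) = 0.5 * x^T diag(L, mu) x.
L, mu = 100.0, 1.0
H = np.diag([L, mu])

def grad(x):
    return H @ x

x = np.array([1.0, 1.0])
step = 1.0 / L                      # classical step size for an L-smooth function
trajectory = [x.copy()]
for _ in range(200):
    x = x - step * grad(x)
    trajectory.append(x.copy())

# The high-curvature coordinate is solved in a single step, while the
# low-curvature one only shrinks by a factor (1 - mu/L) per iteration:
# overall progress is governed by the condition number L/mu.
print("after 200 iterations:", trajectory[-1])
```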

  11. Variance reduction (1/2). Consider two random variables X, Y and define Z = X − Y + E[Y]. Then,
          E[Z] = E[X],
          Var(Z) = Var(X) + Var(Y) − 2 cov(X, Y).
      The variance of Z may be smaller if X and Y are positively correlated.

  12. Variance reduction (1/2). Consider two random variables X, Y and define Z = X − Y + E[Y]. Then,
          E[Z] = E[X],
          Var(Z) = Var(X) + Var(Y) − 2 cov(X, Y).
      The variance of Z may be smaller if X and Y are positively correlated.
      Why is it useful for stochastic optimization?
      - Step sizes for SGD have to decrease to ensure convergence.
      - With variance reduction, one may use larger constant step sizes.
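A tiny simulation of this identity, purely illustrative (the Gaussian distributions and the correlation level are assumptions): Z = X − Y + E[Y] keeps the mean of X while its variance drops when X and Y are positively correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Build positively correlated X and Y: Y is a noisy copy of X.
X = rng.normal(loc=2.0, scale=1.0, size=n)
Y = X + rng.normal(scale=0.3, size=n)        # cov(X, Y) > 0
Z = X - Y + Y.mean()                         # Y.mean() stands in for E[Y]

print(f"E[X]   ~ {X.mean():.3f},  E[Z]   ~ {Z.mean():.3f}")   # same mean
print(f"Var(X) ~ {X.var():.3f},  Var(Z) ~ {Z.var():.3f}")     # Var(Z) << Var(X)
```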

  13. Variance reduction for smooth functions (2/2)
      SVRG:
          x_t = x_{t−1} − γ (∇f_{i_t}(x_{t−1}) − ∇f_{i_t}(y) + ∇f(y)),
      where y is updated every epoch and E[∇f_{i_t}(y) | F_{t−1}] = ∇f(y).
      SAGA:
          x_t = x_{t−1} − γ (∇f_{i_t}(x_{t−1}) − y^{i_t}_{t−1} + (1/n) Σ_{i=1}^n y^i_{t−1}),
      where E[y^{i_t}_{t−1} | F_{t−1}] = (1/n) Σ_{i=1}^n y^i_{t−1} and
          y^i_t = ∇f_i(x_{t−1}) if i = i_t,   y^i_t = y^i_{t−1} otherwise.
      MISO/Finito: for n ≥ L/µ, same form as SAGA but
          (1/n) Σ_{i=1}^n y^i_{t−1} = −µ x_{t−1}   and
          y^i_t = ∇f_i(x_{t−1}) − µ x_{t−1} if i = i_t,   y^i_t = y^i_{t−1} otherwise.
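To connect the SVRG update above to code, here is a minimal least-squares sketch; it is not the talk's implementation, and the problem, epoch length, and step size are assumptions. The anchor gradient ∇f(y) is recomputed once per epoch and the inner updates use the variance-reduced estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 20
A = rng.normal(size=(n, p))
b = rng.normal(size=n)

def grad_i(x, i):
    # Gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2.
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / n

L = np.linalg.eigvalsh(A.T @ A / n)[-1]
gamma = 1.0 / (3.0 * L)             # a conservative constant step size

x = np.zeros(p)
for epoch in range(30):
    y = x.copy()
    gy = full_grad(y)               # anchor gradient, recomputed every epoch
    for _ in range(n):
        i = rng.integers(n)
        g = grad_i(x, i) - grad_i(y, i) + gy   # variance-reduced estimate
        x = x - gamma * g
    if epoch % 10 == 0:
        print(epoch, 0.5 * np.mean((A @ x - b) ** 2))
```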

  14. Complexity of SGD variants. We consider the worst-case complexity for finding a point x̄ such that E[F(x̄) − F*] ≤ ε for
          min_{x ∈ R^p} { F(x) := E[f̃(x, ξ)] + ψ(x) }.
      In this talk, we consider the µ-strongly convex case only.
      Complexity of SGD with iterate averaging:
          O((L/µ) log(C_0/ε)) + O(σ²/(µε)),
      under the (strong) assumption that the gradient estimates have bounded variance σ².

  15. Complexity of SGD variants. We consider the worst-case complexity for finding a point x̄ such that E[F(x̄) − F*] ≤ ε for
          min_{x ∈ R^p} { F(x) := E[f̃(x, ξ)] + ψ(x) }.
      In this talk, we consider the µ-strongly convex case only.
      Complexity of SGD with iterate averaging:
          O((L/µ) log(C_0/ε)) + O(σ²/(µε)),
      under the (strong) assumption that the gradient estimates have bounded variance σ².
      Complexity of accelerated SGD [Ghadimi and Lan, 2013]:
          O(√(L/µ) log(C_0/ε)) + O(σ²/(µε)).

  16. Complexity for finite sums. We consider the worst-case complexity for finding a point x̄ such that E[F(x̄) − F*] ≤ ε for
          min_{x ∈ R^p} { F(x) := (1/n) Σ_{i=1}^n f_i(x) + ψ(x) }.
      Complexity of SAGA/SVRG/SDCA/MISO/S2GD:
          O((n + L̄/µ) log(C_0/ε))   with   L̄ = (1/n) Σ_{i=1}^n L_i.
      Complexity of GD and acc-GD:
          O(n (L/µ) log(C_0/ε))   vs.   O(n √(L/µ) log(C_0/ε)).
      See also SDCA [Shalev-Shwartz and Zhang, 2014] and Catalyst [Lin et al., 2018].

  17. Complexity for finite sums. We consider the worst-case complexity for finding a point x̄ such that E[F(x̄) − F*] ≤ ε for
          min_{x ∈ R^p} { F(x) := (1/n) Σ_{i=1}^n f_i(x) + ψ(x) }.
      Complexity of SAGA/SVRG/SDCA/MISO/S2GD:
          O((n + L̄/µ) log(C_0/ε))   with   L̄ = (1/n) Σ_{i=1}^n L_i.
      Complexity of Katyusha [Allen-Zhu, 2017]:
          O((n + √(n L̄/µ)) log(C_0/ε)).
      See also SDCA [Shalev-Shwartz and Zhang, 2014] and Catalyst [Lin et al., 2018].
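To get a feel for these bounds, a back-of-the-envelope script (the values of n, L̄/µ, and the target accuracy are made-up, and the constants hidden in the O(·) notation are ignored) comparing the total gradient counts they predict:

```python
import math

n = 100_000                # number of functions f_i (assumed)
kappa = 1e7                # condition number L_bar / mu (assumed; kappa > n)
log_term = math.log(1e4)   # log(C_0 / eps) for some target accuracy (assumed)

costs = {
    "GD":        n * kappa * log_term,
    "acc-GD":    n * math.sqrt(kappa) * log_term,
    "SVRG-like": (n + kappa) * log_term,                 # SAGA/SVRG/SDCA/MISO/S2GD
    "Katyusha":  (n + math.sqrt(n * kappa)) * log_term,
}
for name, cost in costs.items():
    print(f"{name:10s} ~ {cost:.2e} gradient evaluations")
# In this ill-conditioned regime (kappa larger than n), the accelerated
# finite-sum rate of Katyusha is markedly better than the non-accelerated one.
```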

  18. Contributions without acceleration. We extend and generalize the concept of estimate sequences introduced by Nesterov to
      - provide a unified proof of convergence for SAGA/random-SVRG/MISO;
      - make them adaptive to an unknown µ (previously known for SAGA only);
      - make them robust to stochastic noise, e.g., for solving
            f(x) = (1/n) Σ_{i=1}^n f_i(x)   with   f_i(x) = E[f̃_i(x, ξ)],
        with complexity
            O((n + L̄/µ) log(C_0/ε)) + O(σ̃²/(µε))   with   σ̃² ≪ σ²,
        where σ̃² is the variance due to small perturbations;
      - obtain new variants of the above algorithms with the same guarantees.

  19. The stochastic finite-sum problem
          min_{x ∈ R^p} { F(x) := (1/n) Σ_{i=1}^n f_i(x) + ψ(x) }   with   f_i(x) = E[f̃_i(x, ξ)].
      [Figures: data augmentation on digits (left); dropout on text (right).]
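One way to read f_i(x) = E[f̃_i(x, ξ)] is that each example is re-perturbed every time it is visited, so only noisy gradients of f_i are available. A minimal sketch, where the additive-Gaussian augmentation and all names are illustrative assumptions rather than the experiments from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 32
A = rng.normal(size=(n, p))        # "clean" training examples
b = rng.normal(size=n)

def stochastic_grad(x, i, noise_scale=0.1):
    # One sample of grad f~_i(x, xi): the i-th example is re-augmented
    # (here with additive Gaussian noise) at every access, so the exact
    # gradient of f_i(x) = E[f~_i(x, xi)] is never computed.
    a = A[i] + rng.normal(scale=noise_scale, size=p)
    return (a @ x - b[i]) * a       # gradient of 0.5 * (a^T x - b_i)^2

x = np.zeros(p)
i = rng.integers(n)
print(stochastic_grad(x, i)[:5])    # a noisy gradient for one augmented example
```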

  20. Contributions with acceleration
      - We propose a new accelerated SGD algorithm for composite optimization with optimal complexity
            O(√(L/µ) log(C_0/ε)) + O(σ²/(µε)).
      - We propose an accelerated variant of SVRG for the stochastic finite-sum problem with complexity
            O((n + √(n L̄/µ)) log(C_0/ε)) + O(σ̃²/(µε))   with   σ̃² ≪ σ².
        When σ̃ = 0, the complexity matches that of Katyusha.

  21. A classical iteration
          x_k ← Prox_{η_k ψ}[x_{k−1} − η_k g_k]   with   E[g_k | F_k] = ∇f(x_{k−1}).

  22. A classical iteration
          x_k ← Prox_{η_k ψ}[x_{k−1} − η_k g_k]   with   E[g_k | F_k] = ∇f(x_{k−1}).
      It covers SGD, SAGA, SVRG, and composite variants.
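For intuition, here is a minimal sketch of this iteration with ψ = λ‖·‖₁, whose proximal operator is soft-thresholding. The ℓ1 choice, the plain-SGD gradient estimate g_k, and the decreasing step size are illustrative assumptions; plugging in a SAGA or SVRG estimate for g_k would give the composite variants mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 300, 50, 0.05
A = rng.normal(size=(n, p))
b = rng.normal(size=n)

def prox_l1(z, thresh):
    # Prox of thresh * ||.||_1 : soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

x = np.zeros(p)
for k in range(1, 5001):
    i = rng.integers(n)
    g = (A[i] @ x - b[i]) * A[i]          # unbiased estimate of grad f(x_{k-1})
    eta = 1.0 / (0.1 * k + 10.0)          # decreasing step size for plain SGD
    x = prox_l1(x - eta * g, eta * lam)   # x_k <- Prox_{eta*psi}[x_{k-1} - eta*g_k]

obj = 0.5 * np.mean((A @ x - b) ** 2) + lam * np.abs(x).sum()
print(f"final objective: {obj:.4f}, nonzeros in x: {int((x != 0).sum())}")
```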
