
SGD without Replacement: Sharper Rates for General Smooth Convex Functions



  1. SGD without Replacement: Sharper Rates for General Smooth Convex Functions. Dheeraj Nagaraj, Massachusetts Institute of Technology. June 12, 2019. Joint work with Praneeth Netrapalli and Prateek Jain (MSR India).

  2. Overview
     1. Introduction
     2. Current Results
     3. Our Results
     4. Main Techniques

  3. SGD with Replacement (SGD)
     - Consider observations ξ_1, …, ξ_n and a convex loss function f(·, ξ_i) : R^d → R.
     - Empirical risk minimization:
       $x^* = \arg\min_{x \in D} \hat{F}(x) := \arg\min_{x \in D} \frac{1}{n} \sum_{i=1}^{n} f(x, \xi_i)$.
     - SGD with replacement (SGD): fix a step-size sequence α_t ≥ 0 and start at x_0 ∈ D. At every time step, generate an independent random variable I_t ~ Unif([n]) and update
       $x_{t+1} = x_t - \alpha_t \nabla f(x_t, \xi_{I_t})$.
     - Easy to analyze, since the independence of I_t ensures that $\mathbb{E}_{I_t}\, \nabla f(x_t, \xi_{I_t}) = \nabla \hat{F}(x_t)$.
     - Sharp non-asymptotic guarantees are available, but this sampling scheme is seldom used in practice.
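To make the update concrete, here is a minimal Python sketch of SGD with replacement (my illustration, not code from the talk); `grad_f` stands in for an assumed per-example gradient oracle for ∇f(·, ξ), and the projection onto the domain D is omitted for brevity.

```python
# Minimal sketch of SGD with replacement (illustrative; grad_f is an assumed
# per-example gradient oracle, and projection onto the domain D is omitted).
import numpy as np

def sgd_with_replacement(grad_f, data, x0, alphas, rng=None):
    """Run len(alphas) steps, sampling one example uniformly at random per step."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    n = len(data)
    for alpha in alphas:
        i = rng.integers(n)                  # I_t ~ Unif([n]), independent across steps
        x = x - alpha * grad_f(x, data[i])   # x_{t+1} = x_t - alpha_t * grad f(x_t, xi_{I_t})
    return x
```

Because each I_t is drawn independently of the current iterate, the sampled gradient is an unbiased estimate of ∇F̂(x_t), which is exactly what makes the classical analysis go through.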

  4. SGD without Replacement (SGDo)
     - In practice, the order of the data is fixed (say ξ_1, …, ξ_n) and the data points are processed in this order, one after another. One such pass is called an epoch; the algorithm is run for K epochs.
     - A randomized version of this 'gets rid' of the bad orderings. SGD without replacement (SGDo): at the beginning of the k-th epoch, draw an independent uniformly random permutation σ_k and update
       $x_{k,i} = x_{k,i-1} - \alpha_{k,i} \nabla f(x_{k,i-1};\, \xi_{\sigma_k(i)})$.
     - This is closer to the algorithm implemented in practice, but it is harder to analyze, since
       $\mathbb{E}\,\nabla f(x_{k,i};\, \xi_{\sigma_k(i)}) \neq \mathbb{E}\,\nabla \hat{F}(x_{k,i})$.
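For comparison, a sketch of the without-replacement variant described above (again illustrative, with the same assumed `grad_f` oracle and no projection): one fresh uniform permutation per epoch, with every data point visited exactly once per epoch.

```python
# Minimal sketch of SGD without replacement / random reshuffling (illustrative;
# grad_f is the same assumed per-example gradient oracle, projection omitted).
import numpy as np

def sgd_without_replacement(grad_f, data, x0, num_epochs, alpha, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    n = len(data)
    for k in range(num_epochs):
        sigma = rng.permutation(n)              # sigma_k: fresh uniform permutation each epoch
        for i in sigma:                         # one epoch: each example used exactly once
            x = x - alpha * grad_f(x, data[i])  # x_{k,i} = x_{k,i-1} - alpha * grad f(x_{k,i-1}; xi_{sigma_k(i)})
    return x
```

The only change relative to the previous sketch is the index sequence: a permutation per epoch instead of i.i.d. draws, which is precisely what breaks the unbiasedness used to analyze SGD with replacement.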

  5. Experimental Observations
     - Experiments¹ found that on many problems SGDo converges at a rate of O(1/K²), which is faster than SGD, which converges at O(1/K). (K = number of epochs.)
     - Theoretically, it had not even been shown that SGDo 'matches' the rate of SGD for all K.
     ¹ Léon Bottou. "Curiously fast convergence of some stochastic gradient descent algorithms". In: Proceedings of the Symposium on Learning and Data Science, Paris. 2009.
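An experiment in this spirit can be sketched in a few lines. The code below is a hypothetical least-squares setup of my own, with heuristic step sizes chosen only for illustration: it runs both samplers for K epochs and prints the suboptimality F̂(x) − F̂(x*) after selected epochs.

```python
# Illustrative experiment (hypothetical least-squares problem, heuristic step sizes):
# compare SGD with replacement vs. SGD without replacement over K epochs.
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 10, 50
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)    # empirical risk minimizer
F = lambda x: 0.5 * np.mean((A @ x - b) ** 2)     # empirical risk \hat F
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]    # gradient of f(x, xi_i)
L_max = np.max(np.sum(A ** 2, axis=1))            # smoothness bound for the individual losses

def run(without_replacement):
    x = np.zeros(d)
    errs = []
    for k in range(K):
        alpha = 1.0 / (L_max * (k + 1))           # decaying step size (heuristic choice)
        idx = rng.permutation(n) if without_replacement else rng.integers(0, n, size=n)
        for i in idx:
            x = x - alpha * grad_i(x, i)
        errs.append(F(x) - F(x_star))
    return errs

err_sgdo, err_sgd = run(True), run(False)
for k in (0, 4, 9, 24, 49):
    print(f"epoch {k+1:2d}: SGDo {err_sgdo[k]:.3e}   SGD {err_sgd[k]:.3e}")
```

On small problems like this, the without-replacement run typically shows a noticeably faster per-epoch decay, consistent with the O(1/K²) vs. O(1/K) observation above.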

  6. Currently Known Bounds

  7. Small number of Epochs
     - Assumptions: f(·; ξ_i) is L-smooth, ‖∇f(·; ξ_i)‖ ≤ G, and diam(W) ≤ D.
     - Suboptimality (leading order, general convex case): $O\!\left(\frac{G D}{\sqrt{nK}}\right)$.
     - Suboptimality (leading order, µ-strongly convex case): $O\!\left(\frac{G^2 \log(nK)}{\mu\, n K}\right)$.
     - Shamir's result² only works for generalized linear functions and when K = 1. All other "acceleration" results hold only when K is very large.
     ² Ohad Shamir. "Without-replacement sampling for stochastic gradient methods". In: Advances in Neural Information Processing Systems. 2016, pp. 46–54.
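The general-case bound can be read as matching SGD with replacement run for the same total number of stochastic steps (my reading of the slide, with T denoting the total step count):

```latex
% With T = nK total stochastic gradient steps, the general convex bound above is
\[
  O\!\left(\frac{G D}{\sqrt{nK}}\right) \;=\; O\!\left(\frac{G D}{\sqrt{T}}\right),
\]
% i.e., the same leading-order rate as T steps of SGD with replacement, so SGDo
% matches SGD for every number of epochs K -- the question raised on the
% "Experimental Observations" slide.
```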

  8. Large number of Epochs
     - Assumptions: f(·; ξ_i) is L-smooth, ‖∇f(·; ξ_i)‖ ≤ G, and F̂ is µ-strongly convex.
     - When K ≳ κ², suboptimality: $O\!\left(\frac{\kappa^2 G^2 (\log nK)^2}{\mu\, n K^2}\right)$.
     - Previous results³ require Hessian smoothness and K ≥ κ^{1.5}√n to give a suboptimality of $O\!\left(\frac{\kappa^4}{n^2 K^2} + \frac{\kappa^4}{K^3}\right)$.
     - Without the smoothness assumption, there can be no acceleration.
     ³ Jeffery Z. HaoChen and Suvrit Sra. "Random Shuffling Beats SGD after Finite Epochs". In: arXiv preprint arXiv:1806.10077 (2018).
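To see how this regime lines up with the small-K strongly convex rate from the previous slide, here is a quick comparison (my arithmetic, not a claim from the talk):

```latex
% Back-of-the-envelope: the large-K bound improves on the small-K strongly convex
% bound exactly when
\[
  \frac{\kappa^2 G^2 (\log nK)^2}{\mu n K^2} \;\le\; \frac{G^2 \log(nK)}{\mu n K}
  \quad\Longleftrightarrow\quad
  K \;\ge\; \kappa^2 \log(nK),
\]
% which matches, up to the log factor, the K >~ kappa^2 regime stated above.
```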

  9. Main Techniques
     - Main bottleneck in the analysis: $\mathbb{E}\,\nabla f(x_{k,i};\, \xi_{\sigma_k(i)}) \neq \mathbb{E}\,\nabla \hat{F}(x_{k,i})$.
     - If σ'_k is a permutation independent of σ_k, then $\mathbb{E}\,\nabla f(x_{k,i};\, \xi_{\sigma'_k(i)}) = \mathbb{E}\,\nabla \hat{F}(x_{k,i})$.
     - Therefore, $\mathbb{E}\,\nabla f(x_{k,i};\, \xi_{\sigma_k(i)}) = \mathbb{E}\,\nabla \hat{F}(x_{k,i}) + O\!\bigl(d_W\bigl(x_{k,i} \mid \sigma_k(i) = r,\; x_{k,i}\bigr)\bigr)$, where d_W compares the conditional and unconditional distributions of the iterate (a Wasserstein-type distance).
     - Through coupling arguments: $d_W\bigl(x_{k,i} \mid \sigma_k(i) = r,\; x_{k,i}\bigr) \lesssim \alpha_{k,0}\, G$.
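One step the slide compresses is why replacing σ_k with an independent permutation σ'_k recovers the full gradient in expectation; a short restatement of that step (my sketch of the argument, not a quote from the talk):

```latex
% Why an independent permutation recovers the full gradient (sketch): if sigma'_k
% is independent of the randomness defining x_{k,i}, then conditioning on x_{k,i}
% and averaging over sigma'_k(i) ~ Unif([n]) gives
\[
  \mathbb{E}\,\nabla f\bigl(x_{k,i};\,\xi_{\sigma'_k(i)}\bigr)
    = \mathbb{E}\Bigl[\tfrac{1}{n}\sum_{j=1}^{n} \nabla f(x_{k,i};\,\xi_j)\Bigr]
    = \mathbb{E}\,\nabla \hat{F}(x_{k,i}).
\]
% The remaining work, handled by the coupling argument, is to bound (via d_W) the
% price of swapping sigma'_k back for the actual permutation sigma_k that
% generated x_{k,i}.
```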

  10. Automatic Variance Reduction and Acceleration
     - For the smooth and strongly convex case, $\nabla \hat{F}(x^*) = 0 = \frac{1}{n} \sum_{i=1}^{n} \nabla f(x^*, \xi_{\sigma_k(i)})$. (Note that this does not hold with independent sampling.)
     - Therefore, when x_{k,0} ≈ x*, we show by coupling arguments that
       $0 \approx \nabla \hat{F}(x_{k,0}) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla f(x_{k,i}, \xi_{\sigma_k(i)})$.
     - This is similar to the variance reduction seen in modifications of SGD such as SAGA, SVRG, etc.
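To see concretely why the per-epoch gradient average at x* vanishes exactly under a permutation but not under independent sampling, here is a small numerical check (an illustrative least-squares setup of my own, not from the paper):

```python
# Illustrative check: at the empirical minimizer x*, averaging per-example gradients
# over one without-replacement epoch gives exactly grad \hat F(x*) = 0 (up to floating
# point), while an i.i.d. with-replacement sample of the same size generally does not.
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizer of \hat F
grads = (A @ x_star - b)[:, None] * A            # row i: grad f(x*, xi_i)

perm = rng.permutation(n)                        # one SGDo epoch: every index exactly once
iid = rng.integers(0, n, size=n)                 # with-replacement sample of the same size
print(np.linalg.norm(grads[perm].mean(axis=0)))  # numerically ~0
print(np.linalg.norm(grads[iid].mean(axis=0)))   # typically of order 1/sqrt(n), not 0
```

This exact cancellation over a full epoch is the 'automatic variance reduction' the slide refers to; the coupling arguments extend it from x* itself to iterates with x_{k,0} ≈ x*.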

  11. References
     - Bottou, Léon. "Curiously fast convergence of some stochastic gradient descent algorithms". In: Proceedings of the Symposium on Learning and Data Science, Paris. 2009.
     - Gürbüzbalaban, Mert, Asu Ozdaglar, and Pablo Parrilo. "Why random reshuffling beats stochastic gradient descent". In: arXiv preprint arXiv:1510.08560 (2015).
     - HaoChen, Jeffery Z. and Suvrit Sra. "Random Shuffling Beats SGD after Finite Epochs". In: arXiv preprint arXiv:1806.10077 (2018).
     - Shamir, Ohad. "Without-replacement sampling for stochastic gradient methods". In: Advances in Neural Information Processing Systems. 2016, pp. 46–54.

  12. Questions?
