Optimal Mini-Batch and Step Sizes for SAGA

Nidham Gazagnadou¹, joint work with Robert M. Gower¹ & Joseph Salmon²
¹ LTCI, Télécom Paris, Institut Polytechnique de Paris, France
² IMAG, Univ Montpellier, CNRS, Montpellier, France
This work was supported by grants from Région Île-de-France.
The Optimization Problem

• Goal: find $w^* \in \arg\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$

  where
  – $n$ i.i.d. observations $(a_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ or $\mathbb{R}^d \times \{-1, 1\}$
  – $f_i : \mathbb{R}^d \to \mathbb{R}$ is $L_i$-smooth for all $i \in [n]$
  – $f$ is $L$-smooth and $\mu$-strongly convex

• Covered problems
  – Ridge regression
  – Regularized logistic regression
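As a concrete, hypothetical instantiation (not taken from the slides), the sketch below sets up a small ridge-regression problem in Julia and computes the constants $L_i$, $L_{\max}$, $L$ and $\mu$ used throughout. The data `A`, `y` and the regularization parameter `lambda` are arbitrary placeholders.

```julia
# Minimal ridge-regression instance:
#   f_i(w) = 1/2 (a_i' w - y_i)^2 + (lambda/2) ||w||^2.
# The loss part of f_i is ||a_i||^2-smooth, so the regularized f_i is
# (L_i + lambda)-smooth; mu below is the strong convexity of the regularized f.
using LinearAlgebra, Random

Random.seed!(42)
n, d, lambda = 100, 20, 1e-1
A = randn(d, n)                                  # columns a_i are the features
y = randn(n)

f_i(w, i)    = 0.5 * (dot(A[:, i], w) - y[i])^2 + 0.5 * lambda * norm(w)^2
grad_i(w, i) = (dot(A[:, i], w) - y[i]) * A[:, i] + lambda * w
f(w)         = sum(f_i(w, i) for i in 1:n) / n
grad(w)      = sum(grad_i(w, i) for i in 1:n) / n

L_i   = [norm(A[:, i])^2 for i in 1:n]           # smoothness of the losses
L_max = maximum(L_i)
L     = eigmax(Symmetric(A * A' / n))            # smoothness of the average loss
mu    = eigmin(Symmetric(A * A' / n)) + lambda   # strong convexity of f
```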
Reformulation of the ERM

• Sampling vector: let $v \in \mathbb{R}^n$ be a random vector with distribution $\mathcal{D}$ such that $\mathbb{E}_{\mathcal{D}}[v_i] = 1$ for all $i \in [n] := \{1, \dots, n\}$

• ERM stochastic reformulation
  find $w^* \in \arg\min_{w \in \mathbb{R}^d} \mathbb{E}_{\mathcal{D}}\left[ f_v(w) := \frac{1}{n} \sum_{i=1}^n v_i f_i(w) \right]$
  leading to an unbiased gradient estimate
  $\mathbb{E}_{\mathcal{D}}[\nabla f_v(w)] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\mathcal{D}}[v_i]\, \nabla f_i(w) = \nabla f(w)$

• Arbitrary sampling includes all mini-batching strategies, such as sampling $b \in [n]$ elements without replacement:
  $\mathbb{P}\left[ v = \frac{n}{b} \sum_{i \in B} e_i \right] = \frac{1}{\binom{n}{b}}$ for all $B \subseteq [n]$, $|B| = b$
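The following sketch, reusing the hypothetical ridge-regression setup above, builds the $b$-without-replacement sampling vector and checks empirically that the resulting gradient estimate is unbiased. The helpers `sample_v` and `grad_v` are illustrative names, not part of StochOpt.jl.

```julia
using LinearAlgebra, Random, Statistics

# Draw v = (n/b) * sum_{i in B} e_i for a uniformly random B with |B| = b.
function sample_v(n, b)
    B = randperm(n)[1:b]       # b indices sampled without replacement
    v = zeros(n)
    v[B] .= n / b
    return v
end

# Literal definition of the estimate: nabla f_v(w) = (1/n) sum_i v_i * nabla f_i(w).
grad_v(w, v) = sum(v[i] * grad_i(w, i) for i in 1:n) / n

w = randn(d)
b = 10
estimates = [grad_v(w, sample_v(n, b)) for _ in 1:10_000]
println(norm(mean(estimates) - grad(w)))   # ~0 up to Monte Carlo error
```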
Focus on b-Mini-Batch SAGA

The algorithm
– Sample a mini-batch $B \subseteq [n] := \{1, \dots, n\}$ with $|B| = b$
– Build the gradient estimate
  $g(w_k) = \frac{1}{b} \sum_{i \in B} \nabla f_i(w_k) - \frac{1}{b} \sum_{i \in B} J^k_{:i} + \frac{1}{n} J^k e$
  where $e$ is the all-ones vector and $J^k_{:i}$ the $i$-th column of $J^k \in \mathbb{R}^{d \times n}$
– Take a step: $w_{k+1} = w_k - \gamma\, g(w_k)$
– Update the Jacobian estimate: $J^{k+1}_{:i} = \nabla f_i(w_k)$ for all $i \in B$

Our contribution: optimal mini-batch and step size

[Figure: example of a SAGA run on real data (slice dataset) — relative distance to the optimum vs. epochs, comparing $\gamma_{\text{Defazio}} = 6.10 \cdot 10^{-5}$ ($b_{\text{Defazio}} = 1$), $\gamma_{\text{Hofmann}} = 1.59 \cdot 10^{-3}$ ($b_{\text{Hofmann}} = 20$), $\gamma_{\text{practical}} = 1.20 \cdot 10^{-2}$ ($b_{\text{practical}} = 70$), and a grid-searched step size $3.13 \cdot 10^{-2}$ at $b_{\text{practical}} = 70$.]
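Below is a minimal Julia sketch of the b-mini-batch SAGA update described above (an illustration, not the StochOpt.jl implementation). It assumes `grad_i` and `n` from the earlier blocks; the mini-batch size `b` and step size `gamma` are supplied by the caller.

```julia
using Random

function minibatch_saga(grad_i, w0, n, b, gamma; epochs = 30)
    w = copy(w0)
    J = zeros(length(w0), n)                   # Jacobian estimate, column i ~ grad f_i; here initialized at zero
    for k in 1:div(epochs * n, b)              # n/b inner iterations per pass over the data
        B = randperm(n)[1:b]                   # mini-batch, sampled without replacement
        G = hcat([grad_i(w, i) for i in B]...) # fresh gradients grad f_i(w_k), i in B
        # g(w_k) = (1/b) sum_{i in B} grad f_i(w_k) - (1/b) sum_{i in B} J[:,i] + (1/n) J e
        g = vec(sum(G, dims = 2)) / b .- vec(sum(J[:, B], dims = 2)) / b .+
            vec(sum(J, dims = 2)) / n
        w .-= gamma .* g                       # take a step
        J[:, B] = G                            # update the sampled Jacobian columns
    end
    return w
end
```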
Key Constant: Expected Smoothness

Definition (expected smoothness constant). We say that $f$ is $\mathcal{L}$-smooth in expectation if, for every $w \in \mathbb{R}^d$,
  $\mathbb{E}_{\mathcal{D}}\left[ \|\nabla f_v(w) - \nabla f_v(w^*)\|_2^2 \right] \le 2 \mathcal{L}\, (f(w) - f(w^*))$

• Total complexity of b-mini-batch SAGA, for a given $\epsilon > 0$:
  $K_{\text{total}}(b) = \max\left\{ \frac{4b(\mathcal{L} + \lambda)}{\mu},\ n + \frac{n-b}{n-1} \frac{4(L_{\max} + \lambda)}{\mu} \right\} \log\left(\frac{1}{\epsilon}\right)$
  where $\lambda$ is the regularization parameter and $L_{\max} := \max_{i = 1, \dots, n} L_i$

• For the step size
  $\gamma = \frac{1}{4 \max\left\{ \mathcal{L} + \lambda,\ \frac{1}{b} \frac{n-b}{n-1} L_{\max} + \frac{\mu}{4} \frac{n}{b} \right\}}$

Problem: computing $\mathcal{L}$ exactly is intractable most of the time.
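In code, the total complexity and the corresponding step size can be written as below. This is a sketch assuming the constants `lambda`, `mu`, `n` and `L_max` from the earlier ridge-regression block, with `L_hat` standing for whichever estimate of $\mathcal{L}$ is available.

```julia
# Total complexity bound and step size for b-mini-batch SAGA, transcribed
# from the formulas above.
K_total(L_hat, b; epsilon = 1e-4) =
    max(4b * (L_hat + lambda) / mu,
        n + (n - b) / (n - 1) * 4 * (L_max + lambda) / mu) * log(1 / epsilon)

step_size(L_hat, b) =
    1 / (4 * max(L_hat + lambda,
                 (1 / b) * (n - b) / (n - 1) * L_max + mu * n / (4b)))
```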
Our Estimates of the Expected Smoothness

Theorem (upper bounds on $\mathcal{L}$). When sampling $b$ points without replacement:

• Simple bound
  $\mathcal{L} \le \mathcal{L}_{\text{simple}}(b) := \frac{n}{b} \frac{b-1}{n-1} \bar{L} + \frac{1}{b} \frac{n-b}{n-1} L_{\max}$

• Bernstein bound
  $\mathcal{L} \le \mathcal{L}_{\text{Bernstein}}(b) := 2 \frac{n}{b} \frac{b-1}{n-1} L + \frac{1}{b} \left( \frac{n-b}{n-1} + \frac{4}{3} \log d \right) L_{\max}$

where $\bar{L} := \frac{1}{n} \sum_{i=1}^n L_i$ and $L_{\max} := \max_{i \in [n]} L_i$

• Practical estimate
  $\mathcal{L}_{\text{practical}}(b) := \frac{n}{b} \frac{b-1}{n-1} L + \frac{1}{b} \frac{n-b}{n-1} L_{\max}$

[Figure: estimates of $\mathcal{L}$ as a function of the mini-batch size on artificial data ($n = d = 24$).]
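The three estimates translate directly into code. The sketch below assumes `n`, `d`, `L`, `L_i` and `L_max` from the earlier ridge-regression block; the function names are illustrative.

```julia
L_bar = sum(L_i) / n                     # average individual smoothness constant

L_simple(b)    = (n / b) * (b - 1) / (n - 1) * L_bar +
                 (1 / b) * (n - b) / (n - 1) * L_max
L_bernstein(b) = 2 * (n / b) * (b - 1) / (n - 1) * L +
                 (1 / b) * ((n - b) / (n - 1) + (4 / 3) * log(d)) * L_max
L_practical(b) = (n / b) * (b - 1) / (n - 1) * L +
                 (1 / b) * (n - b) / (n - 1) * L_max
```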
Optimal Mini-Batch from the Practical Estimate

For a precision $\epsilon > 0$, the total complexity is
  $K_{\text{total}}(b) = \max\left\{ \frac{4b(\mathcal{L}_{\text{practical}}(b) + \lambda)}{\mu},\ n + \frac{n-b}{n-1} \frac{4(L_{\max} + \lambda)}{\mu} \right\} \log\left(\frac{1}{\epsilon}\right)$

leading to the optimal mini-batch size
  $b^*_{\text{practical}} \in \arg\min_{b \in [n]} K_{\text{total}}(b) \implies b^*_{\text{practical}} = \left\lfloor 1 + \frac{\mu (n-1)}{4L} \right\rfloor$

[Figure: total complexity vs. mini-batch size on the slice dataset ($\lambda = 10^{-1}$); the empirical optimum $b_{\text{empirical}}$ and the predicted $b_{\text{practical}} = 70$ are marked.]
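Putting the pieces together, one can compare a brute-force minimization of $K_{\text{total}}$ built on $\mathcal{L}_{\text{practical}}$ with the closed-form expression above. This sketch assumes the functions and constants defined in the previous blocks.

```julia
K_practical(b) = K_total(L_practical(b), b)

# Brute force over b = 1..n; the index of the minimum equals the batch size.
b_star_search      = argmin([K_practical(b) for b in 1:n])
# Closed-form expression from the slide above.
b_star_closed_form = floor(Int, 1 + mu * (n - 1) / (4L))

gamma_star = step_size(L_practical(b_star_closed_form), b_star_closed_form)
```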
Summary

Take-home message
• Use the optimal mini-batch and step sizes now available for SAGA!

What was done
• Build computable estimates of $\mathcal{L}$
• Give optimal settings $(b, \gamma)$ for mini-batch SAGA ⇒ faster convergence of $w_k \to w^*$ as $k \to \infty$
• Provide convincing numerical improvements on real datasets
• All the Julia code is available at https://github.com/gowerrobert/StochOpt.jl
References (1/2)

• F. Bach. "Sharp analysis of low-rank kernel matrix approximations". In: ArXiv e-prints (Aug. 2012). arXiv: 1208.2015 [cs.LG].
• C.-C. Chang and C.-J. Lin. "LIBSVM: A library for support vector machines". In: ACM Transactions on Intelligent Systems and Technology 2.3 (Apr. 2011), pp. 1–27.
• A. Defazio, F. Bach, and S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". In: Advances in Neural Information Processing Systems 27. 2014, pp. 1646–1654.
• R. M. Gower, P. Richtárik, and F. Bach. "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching". In: arXiv preprint arXiv:1805.02632 (2018).
• D. Gross and V. Nesme. "Note on sampling without replacing from a finite collection of matrices". In: arXiv preprint arXiv:1001.2738 (2010).
• W. Hoeffding. "Probability inequalities for sums of bounded random variables". In: Journal of the American Statistical Association 58.301 (1963), pp. 13–30.
References (2/2)

• R. Johnson and T. Zhang. "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction". In: Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 315–323.
• H. Robbins and S. Monro. "A stochastic approximation method". In: Annals of Mathematical Statistics 22 (1951), pp. 400–407.
• M. Schmidt, N. Le Roux, and F. Bach. "Minimizing finite sums with the stochastic average gradient". In: Mathematical Programming 162.1 (2017), pp. 83–112.
• J. A. Tropp. "An Introduction to Matrix Concentration Inequalities". In: ArXiv e-prints (Jan. 2015). arXiv: 1501.01571 [math.PR].
• J. A. Tropp. "Improved analysis of the subsampled randomized Hadamard transform". In: Advances in Adaptive Data Analysis 3.1–2 (2011), pp. 115–126.
• J. A. Tropp. "User-Friendly Tail Bounds for Sums of Random Matrices". In: Foundations of Computational Mathematics 12.4 (2012), pp. 389–434.