Optimal Mini-Batch and Step Sizes for SAGA

Nidham Gazagnadou¹, joint work with Robert M. Gower¹ & Joseph Salmon²
¹ LTCI, Télécom Paris, Institut Polytechnique de Paris, France
² IMAG, Univ Montpellier, CNRS, Montpellier, France
This work was supported by grants from Région Île-de-France.
The Optimization Problem

• Goal: find $w^* \in \arg\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$

  where
  – $n$ i.i.d. observations $(a_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ or $\mathbb{R}^d \times \{-1, 1\}$
  – $f_i : \mathbb{R}^d \to \mathbb{R}$ is $L_i$-smooth for all $i \in [n]$
  – $f$ is $L$-smooth and $\mu$-strongly convex

• Covered problems
  – Ridge regression
  – Regularized logistic regression
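As a concrete, hypothetical instantiation (not taken from the slides), the sketch below sets up a small ridge-regression problem in Julia and computes the constants $L_i$, $L_{\max}$, $L$ and $\mu$ used throughout. The data `A`, `y` and the regularization parameter `lambda` are arbitrary placeholders.

```julia
# Minimal ridge-regression instance:
#   f_i(w) = 1/2 (a_i' w - y_i)^2 + (lambda/2) ||w||^2.
# The loss part of f_i is ||a_i||^2-smooth, so the regularized f_i is
# (L_i + lambda)-smooth; mu below is the strong convexity of the regularized f.
using LinearAlgebra, Random

Random.seed!(42)
n, d, lambda = 100, 20, 1e-1
A = randn(d, n)                                  # columns a_i are the features
y = randn(n)

f_i(w, i)    = 0.5 * (dot(A[:, i], w) - y[i])^2 + 0.5 * lambda * norm(w)^2
grad_i(w, i) = (dot(A[:, i], w) - y[i]) * A[:, i] + lambda * w
f(w)         = sum(f_i(w, i) for i in 1:n) / n
grad(w)      = sum(grad_i(w, i) for i in 1:n) / n

L_i   = [norm(A[:, i])^2 for i in 1:n]           # smoothness of the losses
L_max = maximum(L_i)
L     = eigmax(Symmetric(A * A' / n))            # smoothness of the average loss
mu    = eigmin(Symmetric(A * A' / n)) + lambda   # strong convexity of f
```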
Reformulation of the ERM

• Sampling vector: let $v \in \mathbb{R}^n$ be a random vector with distribution $\mathcal{D}$ such that $\mathbb{E}_{\mathcal{D}}[v_i] = 1$ for all $i \in [n] := \{1, \dots, n\}$

• ERM stochastic reformulation
  find $w^* \in \arg\min_{w \in \mathbb{R}^d} \mathbb{E}_{\mathcal{D}}\left[ f_v(w) := \frac{1}{n} \sum_{i=1}^n v_i f_i(w) \right]$
  leading to an unbiased gradient estimate
  $\mathbb{E}_{\mathcal{D}}[\nabla f_v(w)] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\mathcal{D}}[v_i]\, \nabla f_i(w) = \nabla f(w)$

• Arbitrary sampling includes all mini-batching strategies, such as sampling $b \in [n]$ elements without replacement:
  $\mathbb{P}\left[ v = \frac{n}{b} \sum_{i \in B} e_i \right] = \frac{1}{\binom{n}{b}}$ for all $B \subseteq [n]$, $|B| = b$
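The following sketch, reusing the hypothetical ridge-regression setup above, builds the $b$-without-replacement sampling vector and checks empirically that the resulting gradient estimate is unbiased. The helpers `sample_v` and `grad_v` are illustrative names, not part of StochOpt.jl.

```julia
using LinearAlgebra, Random, Statistics

# Draw v = (n/b) * sum_{i in B} e_i for a uniformly random B with |B| = b.
function sample_v(n, b)
    B = randperm(n)[1:b]       # b indices sampled without replacement
    v = zeros(n)
    v[B] .= n / b
    return v
end

# Literal definition of the estimate: nabla f_v(w) = (1/n) sum_i v_i * nabla f_i(w).
grad_v(w, v) = sum(v[i] * grad_i(w, i) for i in 1:n) / n

w = randn(d)
b = 10
estimates = [grad_v(w, sample_v(n, b)) for _ in 1:10_000]
println(norm(mean(estimates) - grad(w)))   # ~0 up to Monte Carlo error
```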
Focus on b-Mini-Batch SAGA

The algorithm
– Sample a mini-batch $B \subseteq [n] := \{1, \dots, n\}$ with $|B| = b$
– Build the gradient estimate
  $g(w_k) = \frac{1}{b} \sum_{i \in B} \nabla f_i(w_k) - \frac{1}{b} \sum_{i \in B} J^k_{:i} + \frac{1}{n} J^k e$
  where $e$ is the all-ones vector and $J^k_{:i}$ the $i$-th column of $J^k \in \mathbb{R}^{d \times n}$
– Take a step: $w_{k+1} = w_k - \gamma\, g(w_k)$
– Update the Jacobian estimate: $J^{k+1}_{:i} = \nabla f_i(w_k)$ for all $i \in B$

Our contribution: optimal mini-batch and step size

[Figure: example of a SAGA run on real data (slice dataset) — relative distance to the optimum vs. epochs, comparing $\gamma_{\text{Defazio}} = 6.10 \cdot 10^{-5}$ ($b_{\text{Defazio}} = 1$), $\gamma_{\text{Hofmann}} = 1.59 \cdot 10^{-3}$ ($b_{\text{Hofmann}} = 20$), $\gamma_{\text{practical}} = 1.20 \cdot 10^{-2}$ ($b_{\text{practical}} = 70$), and a grid-searched step size $3.13 \cdot 10^{-2}$ at $b_{\text{practical}} = 70$.]
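Below is a minimal Julia sketch of the b-mini-batch SAGA update described above (an illustration, not the StochOpt.jl implementation). It assumes `grad_i` and `n` from the earlier blocks; the mini-batch size `b` and step size `gamma` are supplied by the caller.

```julia
using Random

function minibatch_saga(grad_i, w0, n, b, gamma; epochs = 30)
    w = copy(w0)
    J = zeros(length(w0), n)                   # Jacobian estimate, column i ~ grad f_i; here initialized at zero
    for k in 1:div(epochs * n, b)              # n/b inner iterations per pass over the data
        B = randperm(n)[1:b]                   # mini-batch, sampled without replacement
        G = hcat([grad_i(w, i) for i in B]...) # fresh gradients grad f_i(w_k), i in B
        # g(w_k) = (1/b) sum_{i in B} grad f_i(w_k) - (1/b) sum_{i in B} J[:,i] + (1/n) J e
        g = vec(sum(G, dims = 2)) / b .- vec(sum(J[:, B], dims = 2)) / b .+
            vec(sum(J, dims = 2)) / n
        w .-= gamma .* g                       # take a step
        J[:, B] = G                            # update the sampled Jacobian columns
    end
    return w
end
```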
Key Constant: Expected Smoothness

Definition (expected smoothness constant). We say that $f$ is $\mathcal{L}$-smooth in expectation if, for every $w \in \mathbb{R}^d$,
  $\mathbb{E}_{\mathcal{D}}\left[ \|\nabla f_v(w) - \nabla f_v(w^*)\|_2^2 \right] \le 2 \mathcal{L}\, (f(w) - f(w^*))$

• Total complexity of b-mini-batch SAGA, for a given $\epsilon > 0$:
  $K_{\text{total}}(b) = \max\left\{ \frac{4b(\mathcal{L} + \lambda)}{\mu},\ n + \frac{n-b}{n-1} \frac{4(L_{\max} + \lambda)}{\mu} \right\} \log\left(\frac{1}{\epsilon}\right)$
  where $\lambda$ is the regularization parameter and $L_{\max} := \max_{i = 1, \dots, n} L_i$

• For the step size
  $\gamma = \frac{1}{4 \max\left\{ \mathcal{L} + \lambda,\ \frac{1}{b} \frac{n-b}{n-1} L_{\max} + \frac{\mu}{4} \frac{n}{b} \right\}}$

Problem: computing $\mathcal{L}$ exactly is intractable most of the time.
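In code, the total complexity and the corresponding step size can be written as below. This is a sketch assuming the constants `lambda`, `mu`, `n` and `L_max` from the earlier ridge-regression block, with `L_hat` standing for whichever estimate of $\mathcal{L}$ is available.

```julia
# Total complexity bound and step size for b-mini-batch SAGA, transcribed
# from the formulas above.
K_total(L_hat, b; epsilon = 1e-4) =
    max(4b * (L_hat + lambda) / mu,
        n + (n - b) / (n - 1) * 4 * (L_max + lambda) / mu) * log(1 / epsilon)

step_size(L_hat, b) =
    1 / (4 * max(L_hat + lambda,
                 (1 / b) * (n - b) / (n - 1) * L_max + mu * n / (4b)))
```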
Our Estimates of the Expected Smoothness

Theorem (upper bounds on $\mathcal{L}$). When sampling $b$ points without replacement:

• Simple bound
  $\mathcal{L} \le \mathcal{L}_{\text{simple}}(b) := \frac{n}{b} \frac{b-1}{n-1} \bar{L} + \frac{1}{b} \frac{n-b}{n-1} L_{\max}$

• Bernstein bound
  $\mathcal{L} \le \mathcal{L}_{\text{Bernstein}}(b) := 2 \frac{n}{b} \frac{b-1}{n-1} L + \frac{1}{b} \left( \frac{n-b}{n-1} + \frac{4}{3} \log d \right) L_{\max}$

where $\bar{L} := \frac{1}{n} \sum_{i=1}^n L_i$ and $L_{\max} := \max_{i \in [n]} L_i$

• Practical estimate
  $\mathcal{L}_{\text{practical}}(b) := \frac{n}{b} \frac{b-1}{n-1} L + \frac{1}{b} \frac{n-b}{n-1} L_{\max}$

[Figure: estimates of $\mathcal{L}$ as a function of the mini-batch size on artificial data ($n = d = 24$).]
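The three estimates translate directly into code. The sketch below assumes `n`, `d`, `L`, `L_i` and `L_max` from the earlier ridge-regression block; the function names are illustrative.

```julia
L_bar = sum(L_i) / n                     # average individual smoothness constant

L_simple(b)    = (n / b) * (b - 1) / (n - 1) * L_bar +
                 (1 / b) * (n - b) / (n - 1) * L_max
L_bernstein(b) = 2 * (n / b) * (b - 1) / (n - 1) * L +
                 (1 / b) * ((n - b) / (n - 1) + (4 / 3) * log(d)) * L_max
L_practical(b) = (n / b) * (b - 1) / (n - 1) * L +
                 (1 / b) * (n - b) / (n - 1) * L_max
```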
Optimal Mini-Batch from the Practical Estimate

For a precision $\epsilon > 0$, the total complexity is
  $K_{\text{total}}(b) = \max\left\{ \frac{4b(\mathcal{L}_{\text{practical}}(b) + \lambda)}{\mu},\ n + \frac{n-b}{n-1} \frac{4(L_{\max} + \lambda)}{\mu} \right\} \log\left(\frac{1}{\epsilon}\right)$

leading to the optimal mini-batch size
  $b^*_{\text{practical}} \in \arg\min_{b \in [n]} K_{\text{total}}(b) \implies b^*_{\text{practical}} = \left\lfloor 1 + \frac{\mu (n-1)}{4L} \right\rfloor$

[Figure: total complexity vs. mini-batch size on the slice dataset ($\lambda = 10^{-1}$); the empirical optimum $b_{\text{empirical}}$ and the predicted $b_{\text{practical}} = 70$ are marked.]
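Putting the pieces together, one can compare a brute-force minimization of $K_{\text{total}}$ built on $\mathcal{L}_{\text{practical}}$ with the closed-form expression above. This sketch assumes the functions and constants defined in the previous blocks.

```julia
K_practical(b) = K_total(L_practical(b), b)

# Brute force over b = 1..n; the index of the minimum equals the batch size.
b_star_search      = argmin([K_practical(b) for b in 1:n])
# Closed-form expression from the slide above.
b_star_closed_form = floor(Int, 1 + mu * (n - 1) / (4L))

gamma_star = step_size(L_practical(b_star_closed_form), b_star_closed_form)
```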
Summary

Take-home message
• Use the optimal mini-batch and step sizes now available for SAGA!

What was done
• Build computable estimates of $\mathcal{L}$
• Give optimal settings $(b, \gamma)$ for mini-batch SAGA ⇒ faster convergence of $w_k \to w^*$ as $k \to \infty$
• Provide convincing numerical improvements on real datasets
• All the Julia code is available at https://github.com/gowerrobert/StochOpt.jl
References (1/2)

• F. Bach. "Sharp analysis of low-rank kernel matrix approximations". In: ArXiv e-prints (Aug. 2012). arXiv: 1208.2015 [cs.LG].
• C.-C. Chang and C.-J. Lin. "LIBSVM: A library for support vector machines". In: ACM Transactions on Intelligent Systems and Technology 2.3 (Apr. 2011), pp. 1–27.
• A. Defazio, F. Bach, and S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". In: Advances in Neural Information Processing Systems 27. 2014, pp. 1646–1654.
• R. M. Gower, P. Richtárik, and F. Bach. "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching". In: arXiv preprint arXiv:1805.02632 (2018).
• D. Gross and V. Nesme. "Note on sampling without replacing from a finite collection of matrices". In: arXiv preprint arXiv:1001.2738 (2010).
• W. Hoeffding. "Probability inequalities for sums of bounded random variables". In: Journal of the American Statistical Association 58.301 (1963), pp. 13–30.
References (2/2)

• R. Johnson and T. Zhang. "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction". In: Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 315–323.
• H. Robbins and S. Monro. "A stochastic approximation method". In: Annals of Mathematical Statistics 22 (1951), pp. 400–407.
• M. Schmidt, N. Le Roux, and F. Bach. "Minimizing finite sums with the stochastic average gradient". In: Mathematical Programming 162.1 (2017), pp. 83–112.
• J. A. Tropp. "An Introduction to Matrix Concentration Inequalities". In: ArXiv e-prints (Jan. 2015). arXiv: 1501.01571 [math.PR].
• J. A. Tropp. "Improved analysis of the subsampled randomized Hadamard transform". In: Advances in Adaptive Data Analysis 3.1–2 (2011), pp. 115–126.
• J. A. Tropp. "User-Friendly Tail Bounds for Sums of Random Matrices". In: Foundations of Computational Mathematics 12.4 (2012), pp. 389–434.