Improving SGD via Variance Reduction
GD converges fast, but each step is computationally slow; SGD steps are fast, but convergence is slow.
The slow convergence is due to the inherent variance of the stochastic gradient.
SGD as a statistical estimator of the gradient: let g_i = ∇f_i.
- Unbiasedness: E g_i = (1/n) Σ_{i=1}^n g_i = ∇f.
- Error of using g_i to approximate ∇f: variance E ‖g_i − E g_i‖²_2.
Statistical thinking: relate variance to optimization; design other unbiased gradient estimators with smaller variance.
T. Zhang Big Data Optimization 11 / 73
Improving SGD using Variance Reduction
The idea leads to modern stochastic algorithms for big-data machine learning with fast convergence rates:
- Collins et al. (2008): for special problems, with a relatively complicated algorithm (Exponentiated Gradient on the dual)
- Le Roux, Schmidt, Bach (NIPS 2012): a variant of SGD called SAG (Stochastic Average Gradient), and later SAGA (Defazio, Bach, Lacoste-Julien, NIPS 2014)
- Johnson and Zhang (NIPS 2013): SVRG (Stochastic Variance Reduced Gradient)
- Shalev-Shwartz and Zhang (JMLR 2013): SDCA (Stochastic Dual Coordinate Ascent), and later a variant with Zheng Qu and Peter Richtarik
Outline
Background: big data optimization in machine learning: special structure
Single machine optimization
- stochastic gradient (1st order) versus batch gradient: pros and cons
- algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
- algorithm 2: SAGA (Stochastic Average Gradient amélioré)
- algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
- algorithm 4: accelerated SDCA (with Nesterov acceleration)
Distributed optimization
- algorithm 5: accelerated minibatch SDCA
- algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd-order stochastic sampling
Relating Statistical Variance to Optimization
Want to optimize min_w f(w). Full gradient: ∇f(w).
Given an unbiased random estimator g_i of ∇f(w) and the SGD rule w → w − η g_i, the reduction of the objective satisfies
  E f(w − η g_i) ≤ f(w) − (η − η²L/2) ‖∇f(w)‖²_2 + (η²L/2) E ‖g_i − E g_i‖²_2,
where the middle term is non-random and the last term is the variance of the estimator.
Smaller variance implies bigger reduction.
Statistical Thinking: variance reduction techniques
Given an unbiased estimator g_i of ∇f, how to design other unbiased estimators with reduced variance?
- Control variates (leads to SVRG): find g̃_i ≈ g_i and use the estimator g'_i := g_i − g̃_i + E g̃_i.
- Importance sampling (Zhao and Zhang, ICML 2014): sample g_i proportionally to ρ_i (with E ρ_i = 1) and use the estimator g_i/ρ_i.
- Stratified sampling (Zhao and Zhang).
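As a quick numerical check of the control-variate idea (an illustration, not from the slides; the quadratic f_i and all names are assumptions), the sketch below compares the variance of the plain estimator g_i with that of g'_i = g_i − g̃_i + E g̃_i, where g̃_i is the component gradient at a reference point w̃ near w:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_i(w, i):
    # gradient of f_i(w) = 0.5*(A[i] @ w - b[i])**2
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    return A.T @ (A @ w - b) / n

w = rng.normal(size=d)
w_ref = w + 0.01 * rng.normal(size=d)   # reference point w~ close to w
mu_ref = full_grad(w_ref)               # E g~_i, a full gradient at w~

plain, cv = [], []
for i in range(n):
    g = grad_i(w, i)
    g_cv = g - grad_i(w_ref, i) + mu_ref  # control-variate estimator
    plain.append(g)
    cv.append(g_cv)

plain, cv = np.array(plain), np.array(cv)
# both estimators are unbiased for the full gradient ...
assert np.allclose(plain.mean(axis=0), full_grad(w))
assert np.allclose(cv.mean(axis=0), full_grad(w))
# ... but the control-variate estimator has much smaller variance
var_plain = ((plain - full_grad(w))**2).sum(axis=1).mean()
var_cv = ((cv - full_grad(w))**2).sum(axis=1).mean()
print(var_cv < var_plain)
```

The closer w̃ is to w, the smaller the variance of g'_i, which is exactly what SVRG exploits with its snapshot point.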
Stochastic Variance Reduced Gradient (SVRG) I
Objective function
  f(w) = (1/n) Σ_{i=1}^n f_i(w) = (1/n) Σ_{i=1}^n f̃_i(w),
where
  f̃_i(w) = f_i(w) − (∇f_i(w̃) − ∇f(w̃))⊤ w,
and the subtracted terms sum to zero over i.
Pick w̃ to be an approximate solution (close to w_*). The SVRG rule (control variates) is
  w_t = w_{t−1} − η_t ∇f̃_i(w_{t−1}) = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)],
where the bracketed estimator has small variance.
Stochastic Variance Reduced Gradient (SVRG) II
Assume that w̃ ≈ w_* and w_{t−1} ≈ w_*. Then
  ∇f_i(w_{t−1}) ≈ ∇f_i(w̃)  and  ∇f(w̃) ≈ ∇f(w_*) = 0.
This means
  ∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃) → 0.
It is possible to choose a constant step size η_t = η instead of requiring η_t → 0.
One can achieve linear convergence with SVRG:
  E f(w_t) − f(w_*) = O((1 − ρ̃)^t),
where ρ̃ = O(λn/(L + λn)); convergence is faster than GD.
SVRG Algorithm
Procedure SVRG
  Parameters: update frequency m and learning rate η
  Initialize w̃_0
  Iterate: for s = 1, 2, ...
    w̃ = w̃_{s−1}
    μ̃ = (1/n) Σ_{i=1}^n ∇ψ_i(w̃)
    w_0 = w̃
    Iterate: for t = 1, 2, ..., m
      Randomly pick i_t ∈ {1, ..., n} and update the weight:
        w_t = w_{t−1} − η (∇ψ_{i_t}(w_{t−1}) − ∇ψ_{i_t}(w̃) + μ̃)
    end
    Set w̃_s = w_m
  end
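The procedure above can be sketched in a few lines of Python. This is an illustrative implementation on a synthetic ridge-regularized least-squares problem; the data, step size, and stage length are assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
lam = 0.1

def grad_i(w, i):
    # gradient of psi_i(w) = 0.5*(A[i] @ w - b[i])**2 + 0.5*lam*||w||^2
    return (A[i] @ w - b[i]) * A[i] + lam * w

def objective(w):
    return 0.5 * np.mean((A @ w - b)**2) + 0.5 * lam * w @ w

def svrg(m=2 * n, eta=0.01, n_stages=30):
    w_tilde = np.zeros(d)
    for s in range(n_stages):
        # full gradient mu~ at the snapshot point w~
        mu = np.mean([grad_i(w_tilde, i) for i in range(n)], axis=0)
        w = w_tilde.copy()
        for t in range(m):
            i = rng.integers(n)
            # variance-reduced stochastic step with a CONSTANT step size
            w -= eta * (grad_i(w, i) - grad_i(w_tilde, i) + mu)
        w_tilde = w
    return w_tilde

w_svrg = svrg()
w_exact = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)
print(objective(w_svrg) - objective(w_exact))  # small suboptimality gap
```

Note the constant step size η: the snapshot gradient ∇ψ_i(w̃) − μ̃ acts as a control variate, so no decaying schedule is needed.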
SVRG vs. Batch Gradient Descent: fast convergence
Number of examples needed to achieve ε accuracy:
- Batch GD: Õ(n · (L/λ) log(1/ε))
- SVRG: Õ((n + L/λ) log(1/ε))
Assume each loss f_i is L-smooth and the objective function is λ-strongly convex.
SVRG has fast convergence: the condition number is effectively reduced.
The gain of SVRG over the batch algorithm is significant when n is large.
SVRG: variance
[Figure: training variance versus epochs.] Convex case (left): least squares on MNIST; nonconvex case (right): neural nets on CIFAR-10. The numbers in the legends are learning rates.
SVRG: convergence
[Figure: training objective versus epochs.] Convex case (left): least squares on MNIST; nonconvex case (right): neural nets on CIFAR-10. The numbers in the legends are learning rates.
Variance Reduction using Importance Sampling (combined with SVRG)
  f(w) = (1/n) Σ_{i=1}^n f_i(w).
L_i: smoothness parameter of f_i(w); λ: strong convexity parameter of f(w).
Number of examples needed to achieve ε accuracy:
- With uniform sampling: Õ((n + L/λ) log(1/ε)), where L = max_i L_i
- With importance sampling: Õ((n + L̄/λ) log(1/ε)), where L̄ = n^{−1} Σ_{i=1}^n L_i
Outline
Background: big data optimization in machine learning: special structure
Single machine optimization
- stochastic gradient (1st order) versus batch gradient: pros and cons
- algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
- algorithm 2: SAGA (Stochastic Average Gradient amélioré)
- algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
- algorithm 4: accelerated SDCA (with Nesterov acceleration)
Distributed optimization
- algorithm 5: accelerated minibatch SDCA
- algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd-order stochastic sampling
Motivation
Solve
  w_* = arg min_w f(w),   f(w) = (1/n) Σ_{i=1}^n f_i(w).
SGD with variance reduction via SVRG:
  w_t = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)],
where the bracketed term has small variance.
Compute the full gradient ∇f(w̃) periodically at an intermediate point w̃.
How to avoid computing ∇f(w̃)? Answer: keep previously calculated gradients.
Stochastic Average Gradient amélioré: SAGA
Initialize: g̃_i = ∇f_i(w_0) and g̃ = (1/n) Σ_{j=1}^n g̃_j.
SAGA update rule: randomly select i, and
  w_t = w_{t−1} − η_t [∇f_i(w_{t−1}) − g̃_i + g̃]
  g̃ = g̃ + (∇f_i(w_{t−1}) − g̃_i)/n
  g̃_i = ∇f_i(w_{t−1})
Equivalent to:
  w_t = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃_i) + (1/n) Σ_{j=1}^n ∇f_j(w̃_j)],   then w̃_i = w_{t−1},
where the bracketed estimator has small variance.
Compare to SVRG:
  w_t = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)].
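The three-line SAGA update can be sketched as follows (an illustrative implementation on a synthetic ridge-regularized least-squares problem; the data, step size, and iteration count are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
lam = 0.1

def grad_i(w, i):
    # gradient of f_i(w) = 0.5*(A[i] @ w - b[i])**2 + 0.5*lam*||w||^2
    return (A[i] @ w - b[i]) * A[i] + lam * w

def saga(eta=0.01, n_iters=20000):
    w = np.zeros(d)
    g_table = np.array([grad_i(w, i) for i in range(n)])  # stored gradients g~_i
    g_avg = g_table.mean(axis=0)                          # running average g~
    for t in range(n_iters):
        i = rng.integers(n)
        g_new = grad_i(w, i)
        # SAGA step: unbiased estimator whose variance vanishes as w -> w*
        w -= eta * (g_new - g_table[i] + g_avg)
        # refresh the table and its average in O(d) work
        g_avg += (g_new - g_table[i]) / n
        g_table[i] = g_new
    return w

w_saga = saga()
w_exact = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)
print(np.linalg.norm(w_saga - w_exact))  # close to the exact minimizer
```

Unlike SVRG, no periodic full-gradient pass is needed; the price is O(n·d) memory for the gradient table.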
Variance Reduction
The gradient estimator of SAGA is unbiased:
  E [∇f_i(w_{t−1}) − ∇f_i(w̃_i) + (1/n) Σ_{j=1}^n ∇f_j(w̃_j)] = ∇f(w_{t−1}).
Since w̃_i → w_*, we have
  ∇f_i(w_{t−1}) − ∇f_i(w̃_i) + (1/n) Σ_{j=1}^n ∇f_j(w̃_j) → 0.
Therefore the variance of the gradient estimator goes to zero.
Theory of SAGA
Similar to SVRG, we have fast convergence for SAGA. Number of examples needed to achieve ε accuracy:
- Batch GD: Õ(n · (L/λ) log(1/ε))
- SVRG: Õ((n + L/λ) log(1/ε))
- SAGA: Õ((n + L/λ) log(1/ε))
Assume each loss f_i is L-smooth and the objective function is λ-strongly convex.
Outline
Background: big data optimization in machine learning: special structure
Single machine optimization
- stochastic gradient (1st order) versus batch gradient: pros and cons
- algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
- algorithm 2: SAGA (Stochastic Average Gradient amélioré)
- algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
- algorithm 4: accelerated SDCA (with Nesterov acceleration)
Distributed optimization
- algorithm 5: accelerated minibatch SDCA
- algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd-order stochastic sampling
Motivation of SDCA: regularized loss minimization
Assume we want to solve the Lasso problem:
  min_w [ (1/n) Σ_{i=1}^n (w⊤x_i − y_i)² + λ ‖w‖_1 ]
or the ridge regression problem:
  min_w [ (1/n) Σ_{i=1}^n (w⊤x_i − y_i)² + (λ/2) ‖w‖²_2 ],
each the sum of a loss term and a regularization term.
Goal: solve regularized loss minimization problems as fast as we can.
- Solution: proximal Stochastic Dual Coordinate Ascent (Prox-SDCA).
- Can show: fast convergence of SDCA.
Loss Minimization with L2 Regularization
  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(w⊤x_i) + (λ/2) ‖w‖²_2.
Examples:
                          φ_i(z)                   Lipschitz   smooth
  SVM                     max{0, 1 − y_i z}        ✓           ✗
  Logistic regression     log(1 + exp(−y_i z))     ✓           ✓
  Abs-loss regression     |z − y_i|                ✓           ✗
  Square-loss regression  (z − y_i)²               ✗           ✓
Dual Formulation
Primal problem:
  w_* = arg min_w P(w) := (1/n) Σ_{i=1}^n φ_i(w⊤x_i) + (λ/2) ‖w‖²_2.
Dual problem:
  α_* = arg max_{α ∈ R^n} D(α) := (1/n) Σ_{i=1}^n −φ*_i(−α_i) − (λ/2) ‖ (1/(λn)) Σ_{i=1}^n α_i x_i ‖²_2,
where the convex conjugate (dual) is defined as
  φ*_i(a) = sup_z (a z − φ_i(z)).
Relationship of Primal and Dual Solutions
Weak duality: P(w) ≥ D(α) for all w and α.
Strong duality: P(w_*) = D(α_*), with the relationship
  w_* = (1/(λn)) Σ_{i=1}^n α_{*,i} x_i,   α_{*,i} = −φ'_i(w_*⊤ x_i).
Duality gap: for any w and α,
  P(w) − D(α) ≥ P(w) − P(w_*),
i.e. the duality gap upper-bounds the primal sub-optimality.
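Weak and strong duality can be checked numerically. The sketch below (illustrative, not from the slides) instantiates the primal/dual pair for ridge regression, φ_i(z) = ½(z − y_i)², whose conjugate is φ*_i(a) = ½a² + a y_i, so that −φ*_i(−α_i) = α_i y_i − ½α_i²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def primal(w):
    # P(w) = (1/n) sum_i 0.5*(w'x_i - y_i)^2 + 0.5*lam*||w||^2
    return 0.5 * np.mean((X @ w - y)**2) + 0.5 * lam * w @ w

def dual(alpha):
    # -phi*_i(-a_i) = a_i*y_i - 0.5*a_i^2 for the square loss
    v = X.T @ alpha / (lam * n)
    return np.mean(alpha * y - 0.5 * alpha**2) - 0.5 * lam * v @ v

# weak duality holds for ANY primal/dual pair
for _ in range(100):
    w = rng.normal(size=d)
    alpha = rng.normal(size=n)
    assert primal(w) >= dual(alpha)

# at the dual optimum, alpha_i = y_i - x_i'w with w = (1/(lam*n)) sum_i alpha_i x_i,
# which gives the linear system (I + X X'/(lam*n)) alpha = y
alpha_star = np.linalg.solve(np.eye(n) + X @ X.T / (lam * n), y)
w_star = X.T @ alpha_star / (lam * n)
print(primal(w_star) - dual(alpha_star))  # duality gap ~ 0 (strong duality)
```

The same gap computation is what SDCA uses as a certified stopping criterion.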
Example: Linear Support Vector Machine
Primal formulation:
  P(w) = (1/n) Σ_{i=1}^n max(0, 1 − w⊤x_i y_i) + (λ/2) ‖w‖²_2.
Dual formulation:
  D(α) = (1/n) Σ_{i=1}^n α_i y_i − (λ/2) ‖ (1/(λn)) Σ_{i=1}^n α_i x_i ‖²_2,   α_i y_i ∈ [0, 1].
Relationship:
  w_* = (1/(λn)) Σ_{i=1}^n α_{*,i} x_i.
Dual Coordinate Ascent (DCA)
Solve the dual problem using coordinate ascent:
  max_{α ∈ R^n} D(α),
and keep the corresponding primal solution via the relationship
  w = (1/(λn)) Σ_{i=1}^n α_i x_i.
DCA: at each iteration, optimize D(α) with respect to a single coordinate, while the rest of the coordinates are kept intact.
Stochastic Dual Coordinate Ascent (SDCA): choose the updated coordinate uniformly at random.
SMO (John Platt), Liblinear (Hsieh et al.), etc. implemented DCA.
SDCA vs. SGD — update rule
Stochastic Gradient Descent (SGD) update rule:
  w^(t+1) = (1 − 1/t) w^(t) − (1/(λt)) φ'_i(w^(t)⊤ x_i) x_i.
SDCA update rule:
  1. Δ_i = argmax_{Δ ∈ R} D(α^(t) + Δ e_i)
  2. w^(t+1) = w^(t) + (Δ_i/(λn)) x_i
Rather similar update rules. SDCA has several advantages:
- Stopping criterion: duality gap smaller than a given value.
- No need to tune the learning rate.
SDCA vs. SGD — update rule — Example
SVM with the hinge loss: φ_i(w) = max{0, 1 − y_i w⊤x_i}.
SGD update rule:
  w^(t+1) = (1 − 1/t) w^(t) + (1/(λt)) 1[y_i x_i⊤ w^(t) < 1] y_i x_i.
SDCA update rule:
  1. Δ_i = y_i max(0, min(1, (1 − y_i x_i⊤ w^(t−1)) / (‖x_i‖²_2/(λn)) + y_i α_i^(t−1))) − α_i^(t−1)
  2. α^(t+1) = α^(t) + Δ_i e_i
  3. w^(t+1) = w^(t) + (Δ_i/(λn)) x_i
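The three SDCA steps above translate directly into code. The sketch below runs them on synthetic linearly-labeled data (the data, λ, and epoch count are assumptions, not from the slides) and reports the duality gap, which SDCA can use as its stopping criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)  # labels in {-1, +1}

def primal(w):
    return np.mean(np.maximum(0, 1 - y * (X @ w))) + 0.5 * lam * w @ w

def dual(alpha):
    # hinge-loss dual; feasible when alpha_i * y_i is in [0, 1]
    v = X.T @ alpha / (lam * n)
    return np.mean(alpha * y) - 0.5 * lam * v @ v

alpha = np.zeros(n)
w = np.zeros(d)            # maintained as (1/(lam*n)) * sum_i alpha_i x_i
for t in range(50 * n):    # 50 epochs of single-coordinate updates
    i = rng.integers(n)
    # closed-form coordinate maximization for the hinge loss
    delta = y[i] * max(0.0, min(1.0,
        (1 - y[i] * (X[i] @ w)) / (X[i] @ X[i] / (lam * n)) + y[i] * alpha[i]
    )) - alpha[i]
    alpha[i] += delta
    w += delta / (lam * n) * X[i]

gap = primal(w) - dual(alpha)
print(gap)  # duality gap: nonnegative, shrinks with more epochs
```

No learning rate appears anywhere: each step solves its one-dimensional subproblem exactly.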
SDCA vs. SGD — experimental observations
[Figure: primal sub-optimality (log scale) versus epochs for SDCA, SDCA-Perm, and SGD on the CCAT dataset, λ = 10^{−6}, smoothed loss.]
The convergence of SDCA is shockingly fast! How to explain this?
SDCA vs. SGD — experimental observations
[Figure: primal sub-optimality (log scale) versus epochs for SDCA, SDCA-Perm, and SGD on the CCAT dataset, λ = 10^{−5}, hinge loss.]
How to understand the convergence behavior?
Derivation of SDCA I
Consider the following optimization problem:
  w_* = arg min_w f(w),   f(w) = n^{−1} Σ_{i=1}^n φ_i(w) + 0.5 λ w⊤w.
The optimality condition is
  n^{−1} Σ_{i=1}^n ∇φ_i(w_*) + λ w_* = 0.
We have the dual representation:
  w_* = Σ_{i=1}^n α_{*,i},   α_{*,i} = −(1/(λn)) ∇φ_i(w_*).
Derivation of SDCA II
If we maintain the relationship w = Σ_{i=1}^n α_i, then the SGD rule is
  w_t = w_{t−1} − η ∇φ_i(w_{t−1}) − η λ w_{t−1},
where the stochastic term ∇φ_i(w_{t−1}) has large variance, with the property
  E[w_t | w_{t−1}] = w_{t−1} − η ∇f(w_{t−1}).
The dual representation of the SGD rule is
  α_{t,j} = (1 − ηλ) α_{t−1,j} − η ∇φ_i(w_{t−1}) δ_{i,j}.
Derivation of SDCA III
The alternative SDCA rule is to replace −ηλ w_{t−1} by −ηλn α_i: the primal update is
  w_t = w_{t−1} − η (∇φ_i(w_{t−1}) + λn α_i),
where the stochastic term has small variance, and the dual update is
  α_{t,j} = α_{t−1,j} − η (∇φ_i(w_{t−1}) + λn α_i) δ_{i,j}.
It is unbiased: E[w_t | w_{t−1}] = w_{t−1} − η ∇f(w_{t−1}).
Benefit of SDCA
Variance reduction effect: as w → w_* and α → α_*,
  ∇φ_i(w_{t−1}) + λn α_i → 0,
thus the stochastic variance goes to zero.
Fast convergence rate result:
  E f(w_t) − f(w_*) = O(μ^t),   where μ = 1 − O(λn/(1 + λn)).
Convergence is fast even when λ = O(1/n); better than the batch method.
Fast Convergence of SDCA
The number of iterations needed to achieve ε accuracy:
- For L-smooth loss: Õ((n + L/λ) log(1/ε))
- For non-smooth but G-Lipschitz loss (bounded gradient): Õ(n + G²/(λε))
Similar to that of SVRG; effective when n is large.
SDCA vs. DCA — Randomization is Crucial!
[Figure: primal sub-optimality (log scale) versus epochs for SDCA, DCA-Cyclic, SDCA-Perm, and the theoretical bound, on the CCAT dataset, λ = 10^{−4}, smoothed hinge loss.]
Randomization is crucial!
Proximal SDCA for General Regularizer
Want to solve:
  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤ w) + λ g(w),
where the X_i are matrices and g(·) is strongly convex. Examples:
- Multi-class logistic loss:
    φ_i(X_i⊤ w) = ln Σ_{ℓ=1}^K exp(w⊤ X_{i,ℓ}) − w⊤ X_{i,y_i}.
- L1–L2 regularization:
    g(w) = (1/2) ‖w‖²_2 + (σ/λ) ‖w‖_1.
Dual Formulation
Primal:
  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤ w) + λ g(w).
Dual:
  max_α D(α) := (1/n) Σ_{i=1}^n −φ*_i(−α_i) − λ g*( (1/(λn)) Σ_{i=1}^n X_i α_i ),
with the relationship
  w = ∇g*( (1/(λn)) Σ_{i=1}^n X_i α_i ).
Prox-SDCA: extension of SDCA to an arbitrary strongly convex g(w).
Prox-SDCA
Dual:
  max_α D(α) := (1/n) Σ_{i=1}^n −φ*_i(−α_i) − λ g*(v),   v = (1/(λn)) Σ_{i=1}^n X_i α_i.
Assume g(w) is strongly convex in a norm ‖·‖_P with dual norm ‖·‖_D.
For each α, with the corresponding v and w, define the prox-dual
  D̃_α(Δα) = (1/n) Σ_{i=1}^n −φ*_i(−(α_i + Δα_i))
             − λ [ g*(v) + ∇g*(v)⊤ (1/(λn)) Σ_{i=1}^n X_i Δα_i + (1/2) ‖ (1/(λn)) Σ_{i=1}^n X_i Δα_i ‖²_D ],
where the bracketed expression is an upper bound of g*(·).
Prox-SDCA: randomly pick i and update Δα_i by maximizing D̃_α(·).
Proximal SDCA for L1–L2 Regularization
Algorithm:
- Keep dual α and v = (λn)^{−1} Σ_i α_i X_i
- Randomly pick i
- Find Δ_i by approximately maximizing
    −φ*_i(α_i + Δ_i) − trunc(v, σ/λ)⊤ X_i Δ_i − (1/(2λn)) ‖X_i‖²_2 Δ_i²,
  where
    φ*_i(α_i + Δ) = (α_i + Δ) Y_i ln((α_i + Δ) Y_i) + (1 − (α_i + Δ) Y_i) ln(1 − (α_i + Δ) Y_i)
- α = α + Δ_i · e_i
- v = v + (λn)^{−1} Δ_i · X_i
- Let w = trunc(v, σ/λ)
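The trunc(v, σ/λ) operator used above is coordinate-wise soft-thresholding, i.e. ∇g*(v) for g(w) = ½‖w‖²_2 + (σ/λ)‖w‖_1; it is what produces exactly sparse primal iterates. A minimal sketch:

```python
import numpy as np

def trunc(v, delta):
    # soft-thresholding: the map w = grad g*(v) for
    # g(w) = 0.5*||w||_2^2 + delta*||w||_1; shrinks each
    # coordinate toward 0 by delta and zeroes small ones
    return np.sign(v) * np.maximum(np.abs(v) - delta, 0.0)

v = np.array([1.5, -0.3, 0.0, -2.0])
print(trunc(v, 0.5))  # coordinates become 1.0, 0.0, 0.0, -1.5
```

Note that the dual vector v stays dense; sparsity appears only in the truncated primal w.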
Solving L1 with Smooth Loss
Want to solve L1 regularization to accuracy ε with smooth φ_i:
  (1/n) Σ_{i=1}^n φ_i(w) + σ ‖w‖_1.
Apply Prox-SDCA with the extra term 0.5 λ ‖w‖²_2, where λ = O(ε): the number of iterations needed by Prox-SDCA is Õ(n + 1/ε).
Compare to (number of examples needed to go through):
- Dual Averaging SGD (Xiao): Õ(1/ε²)
- FISTA (Nesterov's batch accelerated proximal gradient): Õ(n/√ε)
Prox-SDCA wins in the statistically interesting regime: ε > Ω(1/n²).
Can design an accelerated Prox-SDCA that is always superior to FISTA.
Outline
Background: big data optimization in machine learning: special structure
Single machine optimization
- stochastic gradient (1st order) versus batch gradient: pros and cons
- algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
- algorithm 2: SAGA (Stochastic Average Gradient amélioré)
- algorithm 3: SDCA (Stochastic Dual Coordinate Ascent)
- algorithm 4: accelerated SDCA (with Nesterov acceleration)
Distributed optimization
- algorithm 5: accelerated minibatch SDCA
- algorithm 6: DANE (Distributed Approximate NEwton-type method); behaves like 2nd-order stochastic sampling
Accelerated Prox-SDCA
Solving:
  P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤ w) + λ g(w).
The convergence rate of Prox-SDCA depends on O(1/λ); this is inferior to acceleration when λ is very small (≪ O(1/n)), since accelerated methods have an O(1/√λ) dependency.
Inner-outer Iteration Accelerated Prox-SDCA:
- Pick a suitable κ = Θ(1/n) and β
- For t = 2, 3, ... (outer iteration):
  - Let g̃_t(w) = λ g(w) + 0.5 κ ‖w − y^(t−1)‖²_2 (κ-strongly convex)
  - Let P̃_t(w) = P(w) − λ g(w) + g̃_t(w) (redefine P(·); κ-strongly convex)
  - Approximately solve P̃_t(w) for (w^(t), α^(t)) with Prox-SDCA (inner iteration)
  - Let y^(t) = w^(t) + β (w^(t) − w^(t−1)) (acceleration)
Performance Comparisons
  Problem           Algorithm                      Runtime (× d)
  SVM               SGD                            1/(λε)
                    AGD (Nesterov)                 n √(1/(λε))
                    Acc-Prox-SDCA                  n + min{ 1/(λε), √(n/(λε)) }
  Lasso             SGD and variants               1/ε²
                    Stochastic Coordinate Descent  n/ε
                    FISTA                          n √(1/ε)
                    Acc-Prox-SDCA                  n + min{ 1/ε, √(n/ε) }
  Ridge Regression  SGD, SDCA                      n + 1/λ
                    AGD                            n √(1/λ)
                    Acc-Prox-SDCA                  n + min{ 1/λ, √(n/λ) }
Experiments with L1–L2 regularization
Smoothed hinge loss + (λ/2)‖w‖²_2 + σ‖w‖_1 on the CCAT dataset with σ = 10^{−5}.
[Figure: objective versus iterations for AccProxSDCA, ProxSDCA, and FISTA; left: λ = 10^{−7}, right: λ = 10^{−9}.]
Additional Related Work on Acceleration
Methods achieving fast accelerated convergence comparable to Acc-Prox-SDCA.
Upper bounds:
- Qihang Lin, Zhaosong Lu, Lin Xiao, An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization, arXiv, 2014 (APCG — accelerated proximal coordinate gradient)
- Yuchen Zhang, Lin Xiao, Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization, ICML 2015
Matching lower bound:
- Alekh Agarwal and Leon Bottou, A Lower Bound for the Optimization of Finite Sums, ICML 2015