
Stochastic Optimization Techniques for Big Data Machine Learning

Tong Zhang, Rutgers University & Baidu Inc.


Improving SGD via Variance Reduction

GD converges fast but its per-iteration computation is slow; SGD computation is fast but convergence is slow, due to the inherent variance of the gradient estimate.

SGD as a statistical estimator of the gradient: let g_i = ∇f_i(w).
Unbiasedness: E g_i = (1/n) Σ_{i=1}^n g_i = ∇f(w).
Error of using g_i to approximate ∇f: the variance E‖g_i − E g_i‖².

Statistical thinking: relate variance to optimization, and design other unbiased gradient estimators with smaller variance.
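The unbiasedness and variance statements above are easy to check numerically. Below is a minimal numpy sketch on a synthetic least-squares problem (the data, shapes, and names are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)

# Per-example gradients g_i = grad f_i(w) for f_i(w) = 0.5 * (x_i'w - y_i)^2.
G = (X @ w - y)[:, None] * X
full_grad = G.mean(axis=0)   # unbiasedness: E g_i = (1/n) sum_i g_i = grad f(w)

# Variance of the single-example estimator: E ||g_i - E g_i||^2.
variance = np.mean(np.sum((G - full_grad) ** 2, axis=1))
print("||grad f(w)||^2:", np.sum(full_grad ** 2))
print("variance of g_i:", variance)
```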

Improving SGD using Variance Reduction

The idea leads to modern stochastic algorithms for big data machine learning with fast convergence rates:
Collins et al. (2008): for special problems, with a relatively complicated algorithm (Exponentiated Gradient on the dual).
Le Roux, Schmidt, Bach (NIPS 2012): a variant of SGD called SAG (stochastic average gradient), and later SAGA (Defazio, Bach, Lacoste-Julien, NIPS 2014).
Johnson and Zhang (NIPS 2013): SVRG (stochastic variance reduced gradient).
Shalev-Shwartz and Zhang (JMLR 2013): SDCA (stochastic dual coordinate ascent), and later a variant with Zheng Qu and Peter Richtarik.

Outline

Background: big data optimization in machine learning: special structure.
Single machine optimization
  stochastic gradient (1st order) versus batch gradient: pros and cons
  algorithm 1: SVRG (stochastic variance reduced gradient)
  algorithm 2: SAGA (stochastic average gradient amélioré)
  algorithm 3: SDCA (stochastic dual coordinate ascent)
  algorithm 4: accelerated SDCA (with Nesterov acceleration)
Distributed optimization
  algorithm 5: accelerated minibatch SDCA
  algorithm 6: DANE (Distributed Approximate NEwton-type method), which behaves like 2nd-order stochastic sampling

Relating Statistical Variance to Optimization

Want to optimize min_w f(w); full gradient ∇f(w).
Given an unbiased random estimator g_i of ∇f(w) and the SGD rule w → w − η g_i, the reduction of the objective satisfies
  E f(w − η g_i) ≤ f(w) − (η − η²L/2) ‖∇f(w)‖² + (η²L/2) E‖g_i − E g_i‖²,
where the first subtracted term is non-random and the last term is the variance.
Smaller variance implies a bigger reduction.

Statistical Thinking: Variance Reduction Techniques

Given an unbiased estimator g_i of ∇f, how can we design other unbiased estimators with reduced variance?
Control variates (leads to SVRG): find g̃_i ≈ g_i and use the estimator g′_i := g_i − g̃_i + E g̃_i.
Importance sampling (Zhao and Zhang, ICML 2014): sample g_i with probability proportional to ρ_i (E ρ_i = 1) and use the estimator g_i/ρ_i.
Stratified sampling (Zhao and Zhang).
A small numerical illustration of the control-variate construction follows below.
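The following numpy sketch checks that the control-variate estimator g_i − g̃_i + E g̃_i stays unbiased while its variance drops when g̃_i correlates with g_i. Here g̃_i is taken to be the gradient at a fixed reference point, mirroring the SVRG choice described next; the toy data and all names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grads(w):
    # per-example gradients of f_i(w) = 0.5 * (x_i'w - y_i)^2
    return (X @ w - y)[:, None] * X

w_ref = np.zeros(d)                    # reference point w~ (any point near the iterate works)
w = w_ref + 0.01 * rng.normal(size=d)  # current iterate, close to w~

G, G_ref = grads(w), grads(w_ref)
mean_ref = G_ref.mean(axis=0)          # E g~_i, computed once

cv = G - G_ref + mean_ref              # control-variate estimator: g_i - g~_i + E g~_i

def variance(est):
    m = est.mean(axis=0)
    return np.mean(np.sum((est - m) ** 2, axis=1))

# Both estimators have the same mean (unbiased), but very different variance.
print("variance, plain estimator g_i      :", variance(G))
print("variance, control-variate estimator:", variance(cv))
```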

Stochastic Variance Reduced Gradient (SVRG) I

Objective function
  f(w) = (1/n) Σ_{i=1}^n f_i(w) = (1/n) Σ_{i=1}^n f̃_i(w),
where
  f̃_i(w) = f_i(w) − (∇f_i(w̃) − ∇f(w̃))⊤w
and the correction terms (∇f_i(w̃) − ∇f(w̃)) sum to zero over i.
Pick w̃ to be an approximate solution (close to w_*). The SVRG rule (control variates) is
  w_t = w_{t−1} − η_t ∇f̃_i(w_{t−1}) = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)],
where the bracketed gradient estimator has small variance.

Stochastic Variance Reduced Gradient (SVRG) II

Assume that w̃ ≈ w_* and w_{t−1} ≈ w_*. Then
  ∇f_i(w_{t−1}) ≈ ∇f_i(w̃) and ∇f(w̃) ≈ ∇f(w_*) = 0.
This means ∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃) → 0.
It is possible to choose a constant step size η_t = η instead of requiring η_t → 0.
One can achieve linear convergence with SVRG:
  E f(w_t) − f(w_*) = O((1 − ρ̃)^t),  where ρ̃ = O(λn/(L + λn));
convergence is faster than GD.

SVRG Algorithm

Procedure SVRG
  Parameters: update frequency m and learning rate η
  Initialize w̃_0
  Iterate: for s = 1, 2, ...
    w̃ = w̃_{s−1}
    μ̃ = (1/n) Σ_{i=1}^n ∇ψ_i(w̃)
    w_0 = w̃
    Iterate: for t = 1, 2, ..., m
      Randomly pick i_t ∈ {1, ..., n} and update the weight:
      w_t = w_{t−1} − η (∇ψ_{i_t}(w_{t−1}) − ∇ψ_{i_t}(w̃) + μ̃)
    end
    Set w̃_s = w_m
  end
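A direct Python transcription of the procedure above may help make it concrete. This is a minimal sketch, not tuned code; the least-squares test problem, step size, and stage lengths are illustrative assumptions:

```python
import numpy as np

def svrg(grad_i, n, w0, eta, m, n_stages, seed=0):
    """SVRG for f(w) = (1/n) * sum_i f_i(w); grad_i(w, i) returns grad f_i(w).

    Follows the slide: m is the update frequency, eta the learning rate."""
    rng = np.random.default_rng(seed)
    w_tilde = w0.copy()
    for s in range(n_stages):
        mu = np.mean([grad_i(w_tilde, i) for i in range(n)], axis=0)  # full gradient at w~
        w = w_tilde.copy()
        for t in range(m):
            i = int(rng.integers(n))
            # variance-reduced gradient: grad f_i(w) - grad f_i(w~) + mu
            w -= eta * (grad_i(w, i) - grad_i(w_tilde, i) + mu)
        w_tilde = w
    return w_tilde

# Toy usage on least squares (synthetic data; parameter choices are assumptions):
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ np.arange(1.0, d + 1)
grad = lambda w, i: (X[i] @ w - y[i]) * X[i]
w = svrg(grad, n, np.zeros(d), eta=0.01, m=2 * n, n_stages=50)
print(np.round(w, 2))  # approaches [1. 2. 3. 4. 5.]
```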

SVRG vs. Batch Gradient Descent: fast convergence

Number of examples needed to achieve ε accuracy, assuming L-smooth losses f_i and a λ-strongly convex objective:
  Batch GD: Õ(n · (L/λ) log(1/ε))
  SVRG: Õ((n + L/λ) log(1/ε))
SVRG has fast convergence: the condition number L/λ is effectively reduced.
The gain of SVRG over the batch algorithm is significant when n is large.

SVRG: variance

[Figure: variance of the gradient estimator during training. Convex case (left): least squares on MNIST; nonconvex case (right): neural nets on CIFAR-10. The numbers in the legends are learning rates.]

SVRG: convergence

[Figure: training objective during training. Convex case (left): least squares on MNIST; nonconvex case (right): neural nets on CIFAR-10. The numbers in the legends are learning rates.]

Variance Reduction using Importance Sampling (combined with SVRG)

  f(w) = (1/n) Σ_{i=1}^n f_i(w).
L_i: smoothness parameter of f_i(w); λ: strong convexity parameter of f(w).
Number of examples needed to achieve ε accuracy:
  with uniform sampling: Õ((n + L/λ) log(1/ε)), where L = max_i L_i;
  with importance sampling: Õ((n + L̄/λ) log(1/ε)), where L̄ = n⁻¹ Σ_{i=1}^n L_i.
A sketch of how the sampling distribution is built follows below.
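One plausible way to realize this in code: sample example i with probability proportional to L_i and reweight the gradient, as in the earlier estimator g_i/ρ_i. The sketch below assumes a linear model whose per-example smoothness is L_i = ‖x_i‖² (true, e.g., for losses with φ″ ≤ 1); the data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 10
X = rng.normal(size=(n, d)) * rng.uniform(0.1, 3.0, size=(n, 1))  # heterogeneous rows

L = np.sum(X ** 2, axis=1)   # per-example smoothness L_i = ||x_i||^2 (assumption)
p = L / L.sum()              # sampling probabilities p_i proportional to L_i
rho = n * p                  # rho_i = n * p_i, so that (1/n) sum_i rho_i = 1

# Unbiased reweighted estimator: draw i ~ p and use g_i / rho_i, since
# E[g_i / rho_i] = sum_i p_i * g_i / (n * p_i) = (1/n) sum_i g_i = grad f(w).
print("uniform sampling pays      L   = max_i L_i =", round(float(L.max()), 2))
print("importance sampling pays  L-bar = mean L_i =", round(float(L.mean()), 2))
```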

Outline

Background: big data optimization in machine learning: special structure.
Single machine optimization
  stochastic gradient (1st order) versus batch gradient: pros and cons
  algorithm 1: SVRG (stochastic variance reduced gradient)
  algorithm 2: SAGA (stochastic average gradient amélioré)
  algorithm 3: SDCA (stochastic dual coordinate ascent)
  algorithm 4: accelerated SDCA (with Nesterov acceleration)
Distributed optimization
  algorithm 5: accelerated minibatch SDCA
  algorithm 6: DANE (Distributed Approximate NEwton-type method), which behaves like 2nd-order stochastic sampling

Motivation

Solve
  w_* = arg min_w f(w),  f(w) = (1/n) Σ_{i=1}^n f_i(w).
SGD with variance reduction via SVRG:
  w_t = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)]  (small variance).
Compute the full gradient ∇f(w̃) periodically at an intermediate w̃.
How to avoid computing ∇f(w̃)? Answer: keep previously calculated gradients.

Stochastic Average Gradient amélioré: SAGA

Initialize: g̃_i = ∇f_i(w_0) and g̃ = (1/n) Σ_{j=1}^n g̃_j.
SAGA update rule: randomly select i, and
  w_t = w_{t−1} − η_t [∇f_i(w_{t−1}) − g̃_i + g̃]
  g̃ = g̃ + (∇f_i(w_{t−1}) − g̃_i)/n
  g̃_i = ∇f_i(w_{t−1})
Equivalent to:
  w_t = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃_i) + (1/n) Σ_{j=1}^n ∇f_j(w̃_j)],  with w̃_i = w_{t−1}
(small variance). Compare to SVRG:
  w_t = w_{t−1} − η_t [∇f_i(w_{t−1}) − ∇f_i(w̃) + ∇f(w̃)]  (small variance).
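The update rule translates directly into code: keep a table of the most recent gradient of each example plus its running average. A minimal sketch under the same toy least-squares assumptions as the SVRG example:

```python
import numpy as np

def saga(grad_i, n, w0, eta, n_iters, seed=0):
    """SAGA for f(w) = (1/n) * sum_i f_i(w); grad_i(w, i) returns grad f_i(w)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    g = np.stack([grad_i(w, i) for i in range(n)])  # table: g~_i = grad f_i(w_0)
    g_avg = g.mean(axis=0)                          # g~ = average of table entries
    for _ in range(n_iters):
        i = int(rng.integers(n))
        gi_new = grad_i(w, i)
        w -= eta * (gi_new - g[i] + g_avg)          # variance-reduced step
        g_avg += (gi_new - g[i]) / n                # maintain the average in O(d)
        g[i] = gi_new
    return w

# Toy usage (synthetic data; step size and iteration count are assumptions):
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ np.arange(1.0, d + 1)
grad = lambda w, i: (X[i] @ w - y[i]) * X[i]
print(np.round(saga(grad, n, np.zeros(d), eta=0.01, n_iters=20000), 2))
```

Unlike SVRG, no full-gradient passes are needed after initialization, at the cost of storing the gradient table (a single scalar per example for generalized linear models).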

Variance Reduction

The gradient estimator of SAGA is unbiased:
  E[∇f_i(w_{t−1}) − ∇f_i(w̃_i) + (1/n) Σ_{j=1}^n ∇f_j(w̃_j)] = ∇f(w_{t−1}).
Since w̃_i → w_*, we have
  ∇f_i(w_{t−1}) − ∇f_i(w̃_i) + (1/n) Σ_{j=1}^n ∇f_j(w̃_j) → 0.
Therefore the variance of the gradient estimator goes to zero.

Theory of SAGA

Similar to SVRG, we have fast convergence for SAGA. Number of examples needed to achieve ε accuracy, assuming L-smooth losses f_i and a λ-strongly convex objective:
  Batch GD: Õ(n · (L/λ) log(1/ε))
  SVRG: Õ((n + L/λ) log(1/ε))
  SAGA: Õ((n + L/λ) log(1/ε))

Outline

Background: big data optimization in machine learning: special structure.
Single machine optimization
  stochastic gradient (1st order) versus batch gradient: pros and cons
  algorithm 1: SVRG (stochastic variance reduced gradient)
  algorithm 2: SAGA (stochastic average gradient amélioré)
  algorithm 3: SDCA (stochastic dual coordinate ascent)
  algorithm 4: accelerated SDCA (with Nesterov acceleration)
Distributed optimization
  algorithm 5: accelerated minibatch SDCA
  algorithm 6: DANE (Distributed Approximate NEwton-type method), which behaves like 2nd-order stochastic sampling

Motivation of SDCA: regularized loss minimization

Assume we want to solve the Lasso problem
  min_w [(1/n) Σ_{i=1}^n (w⊤x_i − y_i)² + λ‖w‖₁]
or the ridge regression problem
  min_w [(1/n) Σ_{i=1}^n (w⊤x_i − y_i)² + (λ/2)‖w‖²]
(a loss term plus a regularization term).
Goal: solve regularized loss minimization problems as fast as we can.
Solution: proximal stochastic dual coordinate ascent (Prox-SDCA). Can show: fast convergence of SDCA.

Loss Minimization with L2 Regularization

  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(w⊤x_i) + (λ/2)‖w‖².
Examples:

  Loss                     φ_i(z)                    Lipschitz   smooth
  SVM                      max{0, 1 − y_i z}         ✓           ✗
  Logistic regression      log(1 + exp(−y_i z))      ✓           ✓
  Abs-loss regression      |z − y_i|                 ✓           ✗
  Square-loss regression   (z − y_i)²                ✗           ✓

Dual Formulation

Primal problem:
  w_* = arg min_w P(w) := (1/n) Σ_{i=1}^n φ_i(w⊤x_i) + (λ/2)‖w‖²
Dual problem:
  α_* = arg max_{α ∈ R^n} D(α) := (1/n) Σ_{i=1}^n −φ*_i(−α_i) − (λ/2) ‖(1/(λn)) Σ_{i=1}^n α_i x_i‖²,
where the convex conjugate (dual) is defined as
  φ*_i(a) = sup_z (a z − φ_i(z)).
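As a concrete instance of the conjugate definition (this worked example is not on the slide), the square loss used in ridge regression has a closed-form conjugate:

```latex
% Conjugate of the square loss \phi_i(z) = (z - y_i)^2:
%   \phi_i^*(a) = \sup_z \left( a z - (z - y_i)^2 \right).
% The supremum is attained where a - 2(z - y_i) = 0, i.e. z = y_i + a/2, giving
\[
  \phi_i^*(a) = a\Bigl(y_i + \tfrac{a}{2}\Bigr) - \tfrac{a^2}{4}
              = a\,y_i + \tfrac{a^2}{4}.
\]
% Substituting -\alpha_i for a shows that the ridge-regression dual D(\alpha)
% is an explicit concave quadratic, which coordinate ascent maximizes in
% closed form.
```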

Relationship of Primal and Dual Solutions

Weak duality: P(w) ≥ D(α) for all w and α.
Strong duality: P(w_*) = D(α_*), with the relationship
  w_* = (1/(λn)) Σ_{i=1}^n α_{*,i} x_i,  α_{*,i} = −φ′_i(w_*⊤x_i).
Duality gap: for any w and α,
  P(w) − D(α) ≥ P(w) − P(w_*),
so the duality gap upper-bounds the primal sub-optimality.

Example: Linear Support Vector Machine

Primal formulation:
  P(w) = (1/n) Σ_{i=1}^n max(0, 1 − w⊤x_i y_i) + (λ/2)‖w‖²
Dual formulation:
  D(α) = (1/n) Σ_{i=1}^n α_i y_i − (λ/2) ‖(1/(λn)) Σ_{i=1}^n α_i x_i‖²,  with α_i y_i ∈ [0, 1].
Relationship:
  w_* = (1/(λn)) Σ_{i=1}^n α_{*,i} x_i

Dual Coordinate Ascent (DCA)

Solve the dual problem using coordinate ascent,
  max_{α ∈ R^n} D(α),
and keep the corresponding primal solution using the relationship
  w = (1/(λn)) Σ_{i=1}^n α_i x_i.
DCA: at each iteration, optimize D(α) with respect to a single coordinate while the remaining coordinates are kept intact.
Stochastic Dual Coordinate Ascent (SDCA): choose the updated coordinate uniformly at random.
SMO (John Platt), Liblinear (Hsieh et al.), and others implement DCA.

SDCA vs. SGD: update rule

Stochastic Gradient Descent (SGD) update rule:
  w^(t+1) = (1 − 1/t) w^(t) − (1/(λt)) φ′_i(w^(t)⊤x_i) x_i
SDCA update rule:
  1. Δ_i = argmax_{Δ ∈ R} D(α^(t) + Δ e_i)
  2. w^(t+1) = w^(t) + (Δ_i/(λn)) x_i
Rather similar update rules, but SDCA has several advantages:
  stopping criterion: stop once the duality gap falls below a given value;
  no need to tune the learning rate.

SDCA vs. SGD: update rule, example

SVM with the hinge loss: φ_i(w) = max{0, 1 − y_i w⊤x_i}.
SGD update rule:
  w^(t+1) = (1 − 1/t) w^(t) + (1/(λt)) 1[y_i x_i⊤w^(t) < 1] y_i x_i
SDCA update rule:
  1. Δ_i = y_i max(0, min(1, (1 − y_i x_i⊤w^(t))/(‖x_i‖²/(λn)) + y_i α_i^(t))) − α_i^(t)
  2. α^(t+1) = α^(t) + Δ_i e_i
  3. w^(t+1) = w^(t) + (Δ_i/(λn)) x_i
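The closed-form coordinate step above is straightforward to implement. A minimal sketch of SDCA for the hinge-loss SVM, maintaining w = (1/(λn)) Σ_i α_i x_i throughout (the synthetic data and parameter choices are assumptions):

```python
import numpy as np

def sdca_svm_hinge(X, y, lam, n_epochs, seed=0):
    """SDCA for the hinge-loss SVM, using the closed-form coordinate update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                       # maintains w = (1/(lam*n)) sum_i alpha_i x_i
    sq_norms = np.sum(X ** 2, axis=1)
    for _ in range(n_epochs * n):
        i = int(rng.integers(n))
        # maximize the dual over coordinate i in closed form:
        delta = y[i] * max(0.0, min(1.0,
                    (1.0 - y[i] * (X[i] @ w)) / (sq_norms[i] / (lam * n))
                    + y[i] * alpha[i])) - alpha[i]
        alpha[i] += delta
        w += (delta / (lam * n)) * X[i]
    return w, alpha

# Toy usage on linearly separable data:
rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ np.ones(d))
w, alpha = sdca_svm_hinge(X, y, lam=1e-3, n_epochs=20)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```

In practice one would stop once the duality gap P(w) − D(α) drops below a tolerance, per the previous slide.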

SDCA vs. SGD: experimental observations

[Figure: primal sub-optimality (log scale) versus number of epochs on the CCAT dataset, λ = 10⁻⁶, smoothed loss; curves for SDCA, SDCA-Perm, and SGD.]
The convergence of SDCA is shockingly fast! How to explain this?

SDCA vs. SGD: experimental observations

[Figure: the same comparison on the CCAT dataset, λ = 10⁻⁵, hinge loss; curves for SDCA, SDCA-Perm, and SGD.]
How to understand the convergence behavior?

Derivation of SDCA I

Consider the following optimization problem:
  w_* = arg min_w f(w),  f(w) = n⁻¹ Σ_{i=1}^n φ_i(w) + 0.5 λ w⊤w.
The optimality condition is
  n⁻¹ Σ_{i=1}^n ∇φ_i(w_*) + λ w_* = 0.
We have the dual representation
  w_* = Σ_{i=1}^n α*_i,  where α*_i = −(1/(λn)) ∇φ_i(w_*).

Derivation of SDCA II

If we maintain the relationship w = Σ_{i=1}^n α_i, then the SGD rule is
  w_t = w_{t−1} − η ∇φ_i(w_{t−1}) − ηλ w_{t−1},
where the stochastic term ∇φ_i(w_{t−1}) has large variance, with the property
  E[w_t | w_{t−1}] = w_{t−1} − η ∇f(w_{t−1}).
The dual representation of the SGD rule is
  α_{t,j} = (1 − ηλ) α_{t−1,j} − η ∇φ_i(w_{t−1}) δ_{i,j}.

Derivation of SDCA III

The alternative SDCA rule is to replace −ηλ w_{t−1} by −ηλn α_i: the primal update is
  w_t = w_{t−1} − η (∇φ_i(w_{t−1}) + λn α_i)  (small variance),
and the dual update is
  α_{t,j} = α_{t−1,j} − η (∇φ_i(w_{t−1}) + λn α_i) δ_{i,j}.
It is unbiased: E[w_t | w_{t−1}] = w_{t−1} − η ∇f(w_{t−1}).
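A minimal sketch of this "dual-free" form of the update, keeping w = Σ_i α_i in sync with the per-example dual vectors α_i. The ridge test problem, step size, and iteration count are untuned assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ np.arange(1.0, d + 1)
lam, eta = 0.1, 0.005

# phi_i(w) = 0.5 * (x_i'w - y_i)^2, so f(w) = (1/n) sum_i phi_i(w) + 0.5*lam*w'w.
grad_phi = lambda w, i: (X[i] @ w - y[i]) * X[i]

alpha = np.zeros((n, d))       # one (vector-valued) dual variable per example
w = alpha.sum(axis=0)          # maintain the relationship w = sum_i alpha_i
for t in range(50000):
    i = int(rng.integers(n))
    # replace -eta*lam*w by -eta*lam*n*alpha_i; unbiased since E[n*alpha_i] = w
    step = eta * (grad_phi(w, i) + lam * n * alpha[i])
    alpha[i] -= step
    w -= step
print(np.round(w, 2))   # approaches the ridge solution (shrunk toward zero)
```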

Benefit of SDCA

Variance reduction effect: as w → w_* and α → α_*,
  ∇φ_i(w_{t−1}) + λn α_i → 0,
thus the stochastic variance goes to zero.
Fast convergence rate result:
  E f(w_t) − f(w_*) = O(μ^t),  where μ = 1 − O(λn/(1 + λn)).
The convergence rate is fast even when λ = O(1/n); better than batch methods.

Fast Convergence of SDCA

The number of iterations needed to achieve ε accuracy:
  for L-smooth losses: Õ((n + L/λ) log(1/ε));
  for non-smooth but G-Lipschitz losses (bounded gradient): Õ(n + G²/(λε)).
Similar to SVRG, and effective when n is large.

SDCA vs. DCA: Randomization is Crucial!

[Figure: dual sub-optimality (log scale) versus epochs on the CCAT dataset, λ = 10⁻⁴, smoothed hinge loss; curves for SDCA, DCA-Cyclic, SDCA-Perm, and the theoretical bound.]
Randomization is crucial!

Proximal SDCA for a General Regularizer

Want to solve
  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤w) + λ g(w),
where the X_i are matrices and g(·) is strongly convex. Examples:
  multi-class logistic loss:
    φ_i(X_i⊤w) = ln Σ_{ℓ=1}^K exp(w⊤X_{i,ℓ}) − w⊤X_{i,y_i};
  L1-L2 regularization:
    g(w) = (1/2)‖w‖² + (σ/λ)‖w‖₁.

Dual Formulation

Primal:
  min_w P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤w) + λ g(w)
Dual:
  max_α D(α) := (1/n) Σ_{i=1}^n −φ*_i(−α_i) − λ g*((1/(λn)) Σ_{i=1}^n X_i α_i)
with the relationship
  w = ∇g*((1/(λn)) Σ_{i=1}^n X_i α_i).
Prox-SDCA: an extension of SDCA to arbitrary strongly convex g(w).

Prox-SDCA

Dual:
  max_α D(α) := (1/n) Σ_{i=1}^n −φ*_i(−α_i) − λ g*(v),  v = (1/(λn)) Σ_{i=1}^n X_i α_i.
Assume g(w) is strongly convex with respect to a norm ‖·‖_P whose dual norm is ‖·‖_D.
For each α, with the corresponding v and w, define the prox-dual
  D̃_α(Δα) = (1/n) Σ_{i=1}^n −φ*_i(−(α_i + Δα_i))
             − λ [ g*(v) + ∇g*(v)⊤ (1/(λn)) Σ_{i=1}^n X_i Δα_i + (1/2) ‖(1/(λn)) Σ_{i=1}^n X_i Δα_i‖²_D ],
where the bracketed expression is an upper bound of g*(·).
Prox-SDCA: randomly pick i and update Δα_i by maximizing D̃_α(·).

Proximal SDCA for L1-L2 Regularization

Algorithm:
  Keep the dual α and v = (λn)⁻¹ Σ_i α_i X_i.
  Randomly pick i.
  Find Δ_i by approximately maximizing
    −φ*_i(α_i + Δ_i) − trunc(v, σ/λ)⊤X_i Δ_i − (1/(2λn)) ‖X_i‖² Δ_i²,
  where
    φ*_i(α_i + Δ) = (α_i + Δ)Y_i ln((α_i + Δ)Y_i) + (1 − (α_i + Δ)Y_i) ln(1 − (α_i + Δ)Y_i).
  Update α = α + Δ_i e_i and v = v + (λn)⁻¹ Δ_i X_i.
Let w = trunc(v, σ/λ).
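Reading the notation as in the Prox-SDCA paper, trunc(v, c) is the coordinate-wise soft-thresholding map; it equals ∇g*(v) for g(w) = ½‖w‖² + (σ/λ)‖w‖₁, which is why the primal point w is read off from v this way and comes out sparse. A one-line sketch:

```python
import numpy as np

def trunc(v, c):
    # coordinate-wise soft-thresholding: trunc(v, c)_j = sign(v_j) * max(|v_j| - c, 0)
    return np.sign(v) * np.maximum(np.abs(v) - c, 0.0)

# Example: with g(w) = 0.5*||w||^2 + (sigma/lam)*||w||_1, the primal iterate is
# w = grad g*(v) = trunc(v, sigma/lam); small coordinates of v map to exact zeros.
v = np.array([0.3, -0.05, 1.2, -0.8, 0.02])
print(trunc(v, 0.1))   # -> [ 0.2 -0.   1.1 -0.7  0. ]
```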

Solving L1 with Smooth Loss

Want to solve L1 regularization to accuracy ε with smooth φ_i:
  (1/n) Σ_{i=1}^n φ_i(w) + σ‖w‖₁.
Apply Prox-SDCA with the extra term 0.5 λ‖w‖², where λ = O(ε): the number of iterations needed by Prox-SDCA is Õ(n + 1/ε).
Compare to (in the number of examples needed to go through):
  Dual Averaging SGD (Xiao): Õ(1/ε²);
  FISTA (Nesterov's batch accelerated proximal gradient): Õ(n/√ε).
Prox-SDCA wins in the statistically interesting regime ε > Ω(1/n²).
One can design an accelerated Prox-SDCA that is always superior to FISTA.

Outline

Background: big data optimization in machine learning: special structure.
Single machine optimization
  stochastic gradient (1st order) versus batch gradient: pros and cons
  algorithm 1: SVRG (stochastic variance reduced gradient)
  algorithm 2: SAGA (stochastic average gradient amélioré)
  algorithm 3: SDCA (stochastic dual coordinate ascent)
  algorithm 4: accelerated SDCA (with Nesterov acceleration)
Distributed optimization
  algorithm 5: accelerated minibatch SDCA
  algorithm 6: DANE (Distributed Approximate NEwton-type method), which behaves like 2nd-order stochastic sampling

Accelerated Prox-SDCA

Solving:
  P(w) := (1/n) Σ_{i=1}^n φ_i(X_i⊤w) + λ g(w)
The convergence rate of Prox-SDCA depends on O(1/λ). This is inferior to accelerated methods, whose dependency is O(1/√λ), when λ is very small (≪ O(1/n)).

Inner-outer iteration accelerated Prox-SDCA:
  Pick a suitable κ = Θ(1/n) and β.
  For t = 2, 3, ... (outer iteration):
    Let g̃_t(w) = λ g(w) + 0.5 κ ‖w − y^(t−1)‖²  (κ-strongly convex).
    Let P̃_t(w) = P(w) − λ g(w) + g̃_t(w)  (redefines P(·) to be κ-strongly convex).
    Approximately solve P̃_t(w) for (w^(t), α^(t)) with Prox-SDCA (inner iteration).
    Let y^(t) = w^(t) + β (w^(t) − w^(t−1))  (acceleration).
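The outer loop is independent of the inner solver's details, so it can be sketched generically. Below, an exact quadratic solve stands in for the approximate inner Prox-SDCA run; the test problem and the values of κ, β, and the iteration count are illustrative assumptions only:

```python
import numpy as np

def accelerated_outer_loop(inner_solver, d, kappa, beta, n_outer):
    """Inner-outer acceleration: inner_solver(y, kappa) approximately minimizes
    P(w) + 0.5 * kappa * ||w - y||^2 (on the slide, via Prox-SDCA)."""
    w_prev = np.zeros(d)
    y = np.zeros(d)
    for t in range(n_outer):
        w = inner_solver(y, kappa)       # inner iteration on the kappa-strongly-convex problem
        y = w + beta * (w - w_prev)      # acceleration (extrapolation) step
        w_prev = w
    return w_prev

# Toy usage: P(w) = 0.5*w'Aw - b'w with tiny strong convexity, and an exact
# quadratic solve standing in for the inner Prox-SDCA run (assumption).
rng = np.random.default_rng(0)
d = 20
M = rng.normal(size=(d, d))
A = M.T @ M / d + 1e-4 * np.eye(d)
b = rng.normal(size=d)
inner = lambda y, kappa: np.linalg.solve(A + kappa * np.eye(d), b + kappa * y)
w = accelerated_outer_loop(inner, d, kappa=0.1, beta=0.9, n_outer=200)
print("residual ||Aw - b||:", np.linalg.norm(A @ w - b))
```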

Performance Comparisons

Runtimes (up to constants and log factors) to reach ε accuracy:

  SVM
    SGD:             1/(λε)
    AGD (Nesterov):  n √(1/(λε))
    Acc-Prox-SDCA:   n + min{1/(λε), √(n/(λε))}

  Lasso
    SGD and variants:               d/ε²
    Stochastic Coordinate Descent:  n/ε
    FISTA:                          n √(1/ε)
    Acc-Prox-SDCA:                  n + min{1/ε, √(n/ε)}

  Ridge Regression
    SGD, SDCA:       n + 1/λ
    AGD:             n √(1/λ)
    Acc-Prox-SDCA:   n + min{1/λ, √(n/λ)}

Experiments with L1-L2 Regularization

Smoothed hinge loss + (λ/2)‖w‖² + σ‖w‖₁ on the CCAT dataset with σ = 10⁻⁵.
[Figure: objective versus iterations for AccProxSDCA, ProxSDCA, and FISTA; left panel: λ = 10⁻⁷, right panel: λ = 10⁻⁹.]

Additional Related Work on Acceleration

Methods achieving fast accelerated convergence comparable to Acc-Prox-SDCA.
Upper bounds:
  Qihang Lin, Zhaosong Lu, Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. arXiv, 2014. (APCG: accelerated proximal coordinate gradient)
  Yuchen Zhang, Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. ICML 2015.
Matching lower bound:
  Alekh Agarwal, Leon Bottou. A Lower Bound for the Optimization of Finite Sums. ICML 2015.
