Recent Progress in Stochastic Algorithms for Big Data Optimization
Tong Zhang, Rutgers University & Baidu Inc.
Collaborators: Shai Shalev-Shwartz, Rie Johnson, Lin Xiao, Ohad Shamir, and Nathan Srebro
Outline
- Background: the big data optimization problem
- First-order stochastic gradient versus batch gradient: pros and cons
- Stochastic gradient algorithms with variance reduction
  - Algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
  - Algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
  - Algorithm 3: accelerated SDCA (with Nesterov acceleration)
- Strategies for distributed computing
  - Algorithm 4: DANE (Distributed Approximate NEwton-type method); behaves like second-order stochastic sampling
Mathematical Problem
The big data optimization problem in machine learning:
\[
\min_w f(w), \qquad f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)
\]
- Special structure: a sum over the data.
- Big data (n large) requires distributed training.
Assumptions on the loss function
- λ-strong convexity:
\[
f(w') \ge f(w) + \nabla f(w)^\top (w' - w) + \frac{\lambda}{2}\|w' - w\|_2^2
\]
- L-smoothness:
\[
f_i(w') \le f_i(w) + \nabla f_i(w)^\top (w' - w) + \frac{L}{2}\|w' - w\|_2^2
\]
Example: Computational Advertising
Large-scale regularized logistic regression:
\[
\min_w \frac{1}{n}\sum_{i=1}^n
\underbrace{\left[\ln\!\left(1 + e^{-w^\top x_i y_i}\right) + \frac{\lambda}{2}\|w\|_2^2\right]}_{f_i(w)}
\]
- Data (x_i, y_i) with y_i ∈ {±1}; parameter vector w.
- Each f_i is λ-strongly convex and L-smooth with L = 0.25 max_i ‖x_i‖₂² + λ.
- Big data: n ∼ 10–100 billion; high dimension: dim(x_i) ∼ 10–100 billion.
- How can we solve such big optimization problems efficiently?
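To make the per-example structure concrete, here is a minimal NumPy sketch (not from the slides) of f_i, its gradient, and the full objective for this logistic regression problem; the function names and array conventions are assumptions.

```python
import numpy as np

def f_i(w, x_i, y_i, lam):
    """Per-example objective: logistic loss plus L2 regularization."""
    return np.log1p(np.exp(-y_i * x_i.dot(w))) + 0.5 * lam * w.dot(w)

def grad_f_i(w, x_i, y_i, lam):
    """Gradient of f_i: -y_i * x_i / (1 + exp(y_i x_i^T w)) + lam * w."""
    margin = y_i * x_i.dot(w)
    return -y_i * x_i / (1.0 + np.exp(margin)) + lam * w

def f(w, X, y, lam):
    """Full objective f(w) = (1/n) sum_i f_i(w)."""
    margins = y * X.dot(w)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * w.dot(w)
```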
Statistical Thinking: Sampling
- Objective function:
\[
f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)
\]
  Sampling the objective function: only optimizes an approximate objective.
- First-order gradient:
\[
\nabla f(w) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w)
\]
  Sampling the first-order gradient (stochastic gradient): converges to the exact optimum; with variance reduction, at a fast rate.
- Second-order gradient:
\[
\nabla^2 f(w) = \frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w)
\]
  Sampling the second-order gradient (stochastic Newton): converges to the exact optimum at a fast rate; well suited to distributed computing.
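As an illustration of the second-order sampling idea, the following is a minimal sketch of one subsampled-Newton step for the logistic objective above: exact gradient, but a Hessian estimated from a random mini-batch. This is not the DANE algorithm from the outline; the batch size, the dense Hessian solve, and the helper names are assumptions.

```python
import numpy as np

def subsampled_newton_step(w, X, y, lam, batch_size=1000, rng=None):
    """One Newton-type step: exact full gradient, Hessian estimated
    from a random subsample of the data."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape

    # Exact full gradient of f(w) = (1/n) sum_i [ln(1+exp(-y_i x_i^T w)) + lam/2 ||w||^2].
    sigma = 1.0 / (1.0 + np.exp(y * X.dot(w)))
    grad = -(X * (y * sigma)[:, None]).mean(axis=0) + lam * w

    # Hessian estimate: (1/|B|) sum_{i in B} s_i (1 - s_i) x_i x_i^T + lam * I.
    idx = rng.choice(n, size=min(batch_size, n), replace=False)
    Xb = X[idx]
    sb = 1.0 / (1.0 + np.exp(y[idx] * Xb.dot(w)))
    H = (Xb * (sb * (1 - sb))[:, None]).T.dot(Xb) / len(idx) + lam * np.eye(d)

    return w - np.linalg.solve(H, grad)
```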
Batch Optimization Method: Gradient Descent
Solve
\[
w_* = \arg\min_w f(w), \qquad f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w).
\]
Gradient Descent (GD):
\[
w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}).
\]
How fast does this method converge to the optimal solution?
- General result: converges to a local minimum under suitable conditions; the convergence rate depends on properties of f(·).
- For λ-strongly convex and L-smooth problems the rate is linear:
\[
f(w_k) - f(w_*) = O\!\left((1-\rho)^k\right),
\]
  where ρ = O(λ/L) is the inverse condition number.
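A minimal sketch of the GD iteration applied to the logistic objective above, assuming a constant step size η = 1/L; the iteration count and zero initialization are placeholders.

```python
import numpy as np

def gradient_descent(X, y, lam, num_iters=100):
    """Batch gradient descent w_k = w_{k-1} - eta * grad f(w_{k-1})
    with constant step size eta = 1/L."""
    n, d = X.shape
    L = 0.25 * np.max(np.sum(X**2, axis=1)) + lam  # smoothness constant from the slides
    eta = 1.0 / L
    w = np.zeros(d)
    for _ in range(num_iters):
        sigma = 1.0 / (1.0 + np.exp(y * X.dot(w)))
        grad = -(X * (y * sigma)[:, None]).mean(axis=0) + lam * w
        w -= eta * grad                            # full gradient: one pass over all n examples
    return w
```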
Stochastic Approximate Gradient Computation
If
\[
f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w),
\]
GD requires computing the full gradient, which is extremely costly:
\[
\nabla f(w) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w).
\]
Idea: stochastic optimization uses a random sample (mini-batch) B to approximate the gradient:
\[
\nabla f(w) \approx \frac{1}{|B|}\sum_{i \in B} \nabla f_i(w).
\]
- This is an unbiased estimator.
- It is much cheaper to compute, but it introduces variance.
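A minimal sketch of mini-batch SGD with this estimator for the logistic objective; the batch size, constant step size, and step count are assumptions (in practice a decaying or averaged step size is needed for the sublinear rate quoted on the next slide).

```python
import numpy as np

def sgd(X, y, lam, eta=0.1, batch_size=32, num_steps=1000, rng=None):
    """Mini-batch SGD: each step uses the unbiased estimate
    (1/|B|) sum_{i in B} grad f_i(w) in place of the full gradient."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        idx = rng.integers(0, n, size=batch_size)   # sample a mini-batch uniformly
        Xb, yb = X[idx], y[idx]
        sigma = 1.0 / (1.0 + np.exp(yb * Xb.dot(w)))
        grad_est = -(Xb * (yb * sigma)[:, None]).mean(axis=0) + lam * w
        w -= eta * grad_est                         # cheap step, but grad_est has variance
    return w
```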
SGD versus GD
- SGD: faster computation per step, but sublinear convergence due to the variance of the gradient approximation:
\[
f(w_t) - f(w_*) = \tilde{O}(1/t).
\]
- GD: slower computation per step, but linear convergence:
\[
f(w_t) - f(w_*) = O\!\left((1-\rho)^t\right).
\]
Improving SGD via Variance Reduction
- GD converges fast, but each step is slow; SGD steps are fast, but convergence is slow.
- The slow convergence is due to the inherent variance of the gradient estimate.
- SGD as a statistical estimator of the gradient: let g_i = ∇f_i.
  - Unbiasedness: E g_i = (1/n) Σ_{i=1}^n g_i = ∇f.
  - Error of using g_i to approximate ∇f: the variance E‖g_i − E g_i‖₂².
- Statistical thinking: relate the variance to the optimization; design other unbiased gradient estimators with smaller variance.
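One estimator of this kind, developed later in the talk as SVRG (algorithm 1 in the outline), adds a control variate built from a full gradient at a reference point w̃: g = ∇f_i(w) − ∇f_i(w̃) + ∇f(w̃). A minimal sketch for the logistic objective; the helper names are assumptions and the reference-point update schedule is omitted.

```python
import numpy as np

def svrg_style_estimator(w, w_ref, grad_full_ref, x_i, y_i, lam):
    """Unbiased gradient estimate with reduced variance:
       g = grad f_i(w) - grad f_i(w_ref) + grad f(w_ref).
    E[g] = grad f(w), and the variance shrinks as w and w_ref approach w_*."""
    def grad_f_i(v):
        s = 1.0 / (1.0 + np.exp(y_i * x_i.dot(v)))
        return -y_i * s * x_i + lam * v
    return grad_f_i(w) - grad_f_i(w_ref) + grad_full_ref
```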
Relating Statistical Variance to Optimization
- We want to optimize min_w f(w); the full gradient is ∇f(w).
- Given an unbiased random estimator g_i of ∇f(w) and the SGD rule w → w − η g_i, the expected reduction of the objective satisfies
\[
\mathbb{E}\, f(w - \eta g_i) \;\le\; f(w)
\;-\; \underbrace{\left(\eta - \frac{\eta^2 L}{2}\right)\|\nabla f(w)\|_2^2}_{\text{non-random}}
\;+\; \frac{\eta^2 L}{2}\,\underbrace{\mathbb{E}\,\|g_i - \mathbb{E} g_i\|_2^2}_{\text{variance}}.
\]
- Smaller variance implies a bigger reduction.
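The inequality follows from the L-smoothness assumption applied to the step w → w − η g_i, together with the bias–variance decomposition E‖g_i‖² = ‖E g_i‖² + E‖g_i − E g_i‖²; a short derivation:

```latex
% L-smoothness of f applied to the step w -> w - \eta g_i, then expectation over g_i:
\begin{align*}
\mathbb{E}\, f(w - \eta g_i)
  &\le f(w) - \eta\, \nabla f(w)^\top \mathbb{E} g_i
       + \frac{\eta^2 L}{2}\, \mathbb{E}\,\|g_i\|_2^2 \\
  &= f(w) - \eta\, \|\nabla f(w)\|_2^2
       + \frac{\eta^2 L}{2}\left( \|\nabla f(w)\|_2^2
       + \mathbb{E}\,\|g_i - \mathbb{E} g_i\|_2^2 \right) \\
  &= f(w) - \left(\eta - \frac{\eta^2 L}{2}\right)\|\nabla f(w)\|_2^2
       + \frac{\eta^2 L}{2}\, \mathbb{E}\,\|g_i - \mathbb{E} g_i\|_2^2 .
\end{align*}
```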
Outline (recap): next, stochastic gradient algorithms with variance reduction (SVRG, SDCA, accelerated SDCA), followed by strategies for distributed computing (DANE).