Recent Progress in Stochastic Algorithms for Big Data Optimization
Tong Zhang, Rutgers University & Baidu Inc.
Collaborators: Shai Shalev-Shwartz, Rie Johnson, Lin Xiao, Ohad Shamir, and Nathan Srebro
Outline
- Background: the big data optimization problem
- First-order stochastic gradient versus batch gradient: pros and cons
- Stochastic gradient algorithms with variance reduction
  - Algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
  - Algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
  - Algorithm 3: accelerated SDCA (with Nesterov acceleration)
- Strategies for distributed computing
  - Algorithm 4: DANE (Distributed Approximate NEwton-type method); behaves like second-order stochastic sampling
Mathematical Problem
The big data optimization problem in machine learning:
\[
\min_w f(w), \qquad f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)
\]
- Special structure: a sum over the data.
- Big data (n large) requires distributed training.
Assumptions on the loss function
- λ-strong convexity:
\[
f(w') \ge f(w) + \nabla f(w)^\top (w' - w) + \frac{\lambda}{2}\|w' - w\|_2^2
\]
- L-smoothness:
\[
f_i(w') \le f_i(w) + \nabla f_i(w)^\top (w' - w) + \frac{L}{2}\|w' - w\|_2^2
\]
Example: Computational Advertising
Large-scale regularized logistic regression:
\[
\min_w \frac{1}{n}\sum_{i=1}^n
\underbrace{\left[\ln\!\left(1 + e^{-w^\top x_i y_i}\right) + \frac{\lambda}{2}\|w\|_2^2\right]}_{f_i(w)}
\]
- Data (x_i, y_i) with y_i ∈ {±1}; parameter vector w.
- Each f_i is λ-strongly convex and L-smooth with L = 0.25 max_i ‖x_i‖₂² + λ.
- Big data: n ∼ 10–100 billion; high dimension: dim(x_i) ∼ 10–100 billion.
- How can we solve such big optimization problems efficiently?
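To make the per-example structure concrete, here is a minimal NumPy sketch (not from the slides) of f_i, its gradient, and the full objective for this logistic regression problem; the function names and array conventions are assumptions.

```python
import numpy as np

def f_i(w, x_i, y_i, lam):
    """Per-example objective: logistic loss plus L2 regularization."""
    return np.log1p(np.exp(-y_i * x_i.dot(w))) + 0.5 * lam * w.dot(w)

def grad_f_i(w, x_i, y_i, lam):
    """Gradient of f_i: -y_i * x_i / (1 + exp(y_i x_i^T w)) + lam * w."""
    margin = y_i * x_i.dot(w)
    return -y_i * x_i / (1.0 + np.exp(margin)) + lam * w

def f(w, X, y, lam):
    """Full objective f(w) = (1/n) sum_i f_i(w)."""
    margins = y * X.dot(w)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * w.dot(w)
```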
Statistical Thinking: Sampling
- Objective function:
\[
f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)
\]
  Sampling the objective function: only optimizes an approximate objective.
- First-order gradient:
\[
\nabla f(w) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w)
\]
  Sampling the first-order gradient (stochastic gradient): converges to the exact optimum; with variance reduction, at a fast rate.
- Second-order gradient:
\[
\nabla^2 f(w) = \frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w)
\]
  Sampling the second-order gradient (stochastic Newton): converges to the exact optimum at a fast rate; well suited to distributed computing.
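As an illustration of the second-order sampling idea, the following is a minimal sketch of one subsampled-Newton step for the logistic objective above: exact gradient, but a Hessian estimated from a random mini-batch. This is not the DANE algorithm from the outline; the batch size, the dense Hessian solve, and the helper names are assumptions.

```python
import numpy as np

def subsampled_newton_step(w, X, y, lam, batch_size=1000, rng=None):
    """One Newton-type step: exact full gradient, Hessian estimated
    from a random subsample of the data."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape

    # Exact full gradient of f(w) = (1/n) sum_i [ln(1+exp(-y_i x_i^T w)) + lam/2 ||w||^2].
    sigma = 1.0 / (1.0 + np.exp(y * X.dot(w)))
    grad = -(X * (y * sigma)[:, None]).mean(axis=0) + lam * w

    # Hessian estimate: (1/|B|) sum_{i in B} s_i (1 - s_i) x_i x_i^T + lam * I.
    idx = rng.choice(n, size=min(batch_size, n), replace=False)
    Xb = X[idx]
    sb = 1.0 / (1.0 + np.exp(y[idx] * Xb.dot(w)))
    H = (Xb * (sb * (1 - sb))[:, None]).T.dot(Xb) / len(idx) + lam * np.eye(d)

    return w - np.linalg.solve(H, grad)
```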
Batch Optimization Method: Gradient Descent
Solve
\[
w_* = \arg\min_w f(w), \qquad f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w).
\]
Gradient Descent (GD):
\[
w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}).
\]
How fast does this method converge to the optimal solution?
- General result: converges to a local minimum under suitable conditions; the convergence rate depends on properties of f(·).
- For λ-strongly convex and L-smooth problems the rate is linear:
\[
f(w_k) - f(w_*) = O\!\left((1-\rho)^k\right),
\]
  where ρ = O(λ/L) is the inverse condition number.
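A minimal sketch of the GD iteration applied to the logistic objective above, assuming a constant step size η = 1/L; the iteration count and zero initialization are placeholders.

```python
import numpy as np

def gradient_descent(X, y, lam, num_iters=100):
    """Batch gradient descent w_k = w_{k-1} - eta * grad f(w_{k-1})
    with constant step size eta = 1/L."""
    n, d = X.shape
    L = 0.25 * np.max(np.sum(X**2, axis=1)) + lam  # smoothness constant from the slides
    eta = 1.0 / L
    w = np.zeros(d)
    for _ in range(num_iters):
        sigma = 1.0 / (1.0 + np.exp(y * X.dot(w)))
        grad = -(X * (y * sigma)[:, None]).mean(axis=0) + lam * w
        w -= eta * grad                            # full gradient: one pass over all n examples
    return w
```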
Stochastic Approximate Gradient Computation
If
\[
f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w),
\]
GD requires computing the full gradient, which is extremely costly:
\[
\nabla f(w) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w).
\]
Idea: stochastic optimization uses a random sample (mini-batch) B to approximate the gradient:
\[
\nabla f(w) \approx \frac{1}{|B|}\sum_{i \in B} \nabla f_i(w).
\]
- This is an unbiased estimator.
- It is much cheaper to compute, but it introduces variance.
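A minimal sketch of mini-batch SGD with this estimator for the logistic objective; the batch size, constant step size, and step count are assumptions (in practice a decaying or averaged step size is needed for the sublinear rate quoted on the next slide).

```python
import numpy as np

def sgd(X, y, lam, eta=0.1, batch_size=32, num_steps=1000, rng=None):
    """Mini-batch SGD: each step uses the unbiased estimate
    (1/|B|) sum_{i in B} grad f_i(w) in place of the full gradient."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        idx = rng.integers(0, n, size=batch_size)   # sample a mini-batch uniformly
        Xb, yb = X[idx], y[idx]
        sigma = 1.0 / (1.0 + np.exp(yb * Xb.dot(w)))
        grad_est = -(Xb * (yb * sigma)[:, None]).mean(axis=0) + lam * w
        w -= eta * grad_est                         # cheap step, but grad_est has variance
    return w
```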
SGD versus GD
- SGD: faster computation per step, but sublinear convergence due to the variance of the gradient approximation:
\[
f(w_t) - f(w_*) = \tilde{O}(1/t).
\]
- GD: slower computation per step, but linear convergence:
\[
f(w_t) - f(w_*) = O\!\left((1-\rho)^t\right).
\]
Improving SGD via Variance Reduction
- GD converges fast, but each step is slow; SGD steps are fast, but convergence is slow.
- The slow convergence is due to the inherent variance of the gradient estimate.
- SGD as a statistical estimator of the gradient: let g_i = ∇f_i.
  - Unbiasedness: E g_i = (1/n) Σ_{i=1}^n g_i = ∇f.
  - Error of using g_i to approximate ∇f: the variance E‖g_i − E g_i‖₂².
- Statistical thinking: relate the variance to the optimization; design other unbiased gradient estimators with smaller variance.
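One estimator of this kind, developed later in the talk as SVRG (algorithm 1 in the outline), adds a control variate built from a full gradient at a reference point w̃: g = ∇f_i(w) − ∇f_i(w̃) + ∇f(w̃). A minimal sketch for the logistic objective; the helper names are assumptions and the reference-point update schedule is omitted.

```python
import numpy as np

def svrg_style_estimator(w, w_ref, grad_full_ref, x_i, y_i, lam):
    """Unbiased gradient estimate with reduced variance:
       g = grad f_i(w) - grad f_i(w_ref) + grad f(w_ref).
    E[g] = grad f(w), and the variance shrinks as w and w_ref approach w_*."""
    def grad_f_i(v):
        s = 1.0 / (1.0 + np.exp(y_i * x_i.dot(v)))
        return -y_i * s * x_i + lam * v
    return grad_f_i(w) - grad_f_i(w_ref) + grad_full_ref
```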
Relating Statistical Variance to Optimization
- We want to optimize min_w f(w); the full gradient is ∇f(w).
- Given an unbiased random estimator g_i of ∇f(w) and the SGD rule w → w − η g_i, the expected reduction of the objective satisfies
\[
\mathbb{E}\, f(w - \eta g_i) \;\le\; f(w)
\;-\; \underbrace{\left(\eta - \frac{\eta^2 L}{2}\right)\|\nabla f(w)\|_2^2}_{\text{non-random}}
\;+\; \frac{\eta^2 L}{2}\,\underbrace{\mathbb{E}\,\|g_i - \mathbb{E} g_i\|_2^2}_{\text{variance}}.
\]
- Smaller variance implies a bigger reduction.
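The inequality follows from the L-smoothness assumption applied to the step w → w − η g_i, together with the bias–variance decomposition E‖g_i‖² = ‖E g_i‖² + E‖g_i − E g_i‖²; a short derivation:

```latex
% L-smoothness of f applied to the step w -> w - \eta g_i, then expectation over g_i:
\begin{align*}
\mathbb{E}\, f(w - \eta g_i)
  &\le f(w) - \eta\, \nabla f(w)^\top \mathbb{E} g_i
       + \frac{\eta^2 L}{2}\, \mathbb{E}\,\|g_i\|_2^2 \\
  &= f(w) - \eta\, \|\nabla f(w)\|_2^2
       + \frac{\eta^2 L}{2}\left( \|\nabla f(w)\|_2^2
       + \mathbb{E}\,\|g_i - \mathbb{E} g_i\|_2^2 \right) \\
  &= f(w) - \left(\eta - \frac{\eta^2 L}{2}\right)\|\nabla f(w)\|_2^2
       + \frac{\eta^2 L}{2}\, \mathbb{E}\,\|g_i - \mathbb{E} g_i\|_2^2 .
\end{align*}
```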
Outline (recap): next, stochastic gradient algorithms with variance reduction (SVRG, SDCA, accelerated SDCA), followed by strategies for distributed computing (DANE).