
Stochastic Cubic Regularization for Fast Nonconvex Optimization



  1. Stochastic Cubic Regularization for Fast Nonconvex Optimization
     Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan
     Presented by Achin Jain, University of Pennsylvania, STAT991, Spring 2019

  2. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References

  3. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References (Section 1 – Motivation)

  4. Motivation
     \min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}[f(x;\xi)], \quad f nonconvex and f(x;\xi) stochastic
     Variants of stochastic optimization:
     1. Offline setting: minimize the empirical loss over a fixed amount of data
     2. Online setting: minimize the empirical loss when data arrives sequentially
     Applications:
     • Large-scale statistics and machine learning problems
     • Example: optimization of deep neural networks

  5. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References (Section 2 – Objectives)

  6. Survey of (stochastic) gradient descent algorithms for finding ε-second-order stationary points

  7. Cubic-regularized gradient descent
     Gradient descent:
     x_{t+1} = \arg\min_x \left[ f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{L}{2} \|x - x_t\|^2 \right]
     State-of-the-art convergence:
     1. Hessian-free perturbed GD: O(\epsilon^{-2}) [Jin et al., 2017]
     Cubic-regularized gradient descent:
     x_{t+1} = \arg\min_x \left[ f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2} (x - x_t)^\top \nabla^2 f(x_t) (x - x_t) + \frac{\rho}{6} \|x - x_t\|^3 \right]
     State-of-the-art convergence:
     1. full Hessian: O(\epsilon^{-1.5}) [Nesterov and Polyak, 2006]
     2. Hessian-vector product evaluations w/o acceleration: O(\epsilon^{-2}) [Carmon and Duchi, 2016]
     3. Hessian-vector product evaluations w/ acceleration: O(\epsilon^{-1.75}) [Carmon et al., 2018]
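As a concrete toy illustration of the two updates above (not the authors' code): the sketch below compares the closed-form gradient-descent step, which minimizes the quadratic upper bound, with the cubic-regularized step, which minimizes the cubic model and is handed to a generic scipy optimizer here only because the dimension is tiny. The test function and the constants L and rho are illustrative assumptions.

```python
# Toy comparison of the two model-minimization updates above.
# The objective, L, and rho are illustrative assumptions, not values from the paper.
import numpy as np
from scipy.optimize import minimize

def f(x):                                   # small nonconvex test function
    return x[0]**4 - 2 * x[0]**2 + x[1]**2

def grad(x):
    return np.array([4 * x[0]**3 - 4 * x[0], 2 * x[1]])

def hess(x):
    return np.array([[12 * x[0]**2 - 4, 0.0], [0.0, 2.0]])

L, rho = 20.0, 30.0                         # assumed smoothness constants
x_t = np.array([0.3, 1.0])
g, H = grad(x_t), hess(x_t)

# Gradient descent step = exact minimizer of the quadratic upper bound.
x_gd = x_t - g / L

# Cubic-regularized step = minimizer of the cubic model m_t; solved with a
# generic optimizer here since d = 2 (the paper instead uses a Hessian-vector
# product subsolver, described on the later slides).
def m_t(x):
    d = x - x_t
    return f(x_t) + g @ d + 0.5 * d @ H @ d + rho / 6 * np.linalg.norm(d)**3

x_cubic = minimize(m_t, x_t).x
print("GD step:", x_gd, " cubic step:", x_cubic)
```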

  8. Cubic-regularized stochastic gradient descent
     Stochastic gradient descent:
     x_{t+1} = \arg\min_x \left[ f(x_t) + g(x_t, \xi_t)^\top (x - x_t) + \frac{L}{2} \|x - x_t\|^2 \right], \quad \mathbb{E}[g(x_t, \xi_t)] = \nabla f(x_t)
     State-of-the-art convergence:
     1. noisy SGD: O(\epsilon^{-4}) [Ge et al., 2015]
     2. Hessian-vector product evaluations w/ variance reduction: O(\epsilon^{-3.5}) [Allen-Zhu, 2018]
     3. gradient evaluations w/ variance reduction: O(\epsilon^{-3.5}) [Allen-Zhu and Li, 2018]
     Stochastic cubic-regularized gradient descent [this paper]:
     x_{t+1} = \arg\min_x \left[ f(x_t) + g_t^\top (x - x_t) + \frac{1}{2} (x - x_t)^\top B_t (x - x_t) + \frac{\rho}{6} \|x - x_t\|^3 \right]
     State-of-the-art convergence:
     1. Hessian-vector product evaluations: O(\epsilon^{?}) [Tripuraneni et al., 2018] (see the comparison table two slides ahead)

  9. Problem statement
     \min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}[f(x;\xi)], \quad f nonconvex and f(x;\xi) stochastic
     1. Can we design a fully stochastic variant of the cubic-regularized Newton method?
     2. Is such an algorithm faster than stochastic gradient descent?

  10. What’s coming up
      \min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}[f(x;\xi)], \quad f nonconvex and f(x;\xi) stochastic
      Comparison of different stochastic optimization algorithms to find an ε-second-order stationary point:

      Method                            Run-time      Variance reduction   Type
      SGD [Ge et al., 2015]             O(ε^{-4})     not needed           1st order
      Natasha 2 [Allen-Zhu, 2018]       O(ε^{-3.5})   needed               2nd order
      Neon 2 [Allen-Zhu and Li, 2018]   O(ε^{-3.5})   needed               2nd order
      SCR [this paper]                  O(ε^{-3.5})   not needed           2nd order

  11. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References (Section 3 – Algorithm)

  12. Assumptions
      Assumption 1. The function f(x) has L-Lipschitz gradients and ρ-Lipschitz Hessians: for all x_1 and x_2,
      \|\nabla f(x_1) - \nabla f(x_2)\| \le L \|x_1 - x_2\|, \qquad \|\nabla^2 f(x_1) - \nabla^2 f(x_2)\| \le \rho \|x_1 - x_2\|
      Assumption 2. The stochastic gradients and stochastic Hessians, and their variances, are bounded:
      \|\nabla f(x, \xi) - \nabla f(x)\| \le M_1, \qquad \mathbb{E}\left[\|\nabla f(x, \xi) - \nabla f(x)\|^2\right] \le \sigma_1^2
      \|\nabla^2 f(x, \xi) - \nabla^2 f(x)\| \le M_2, \qquad \mathbb{E}\left[\|\nabla^2 f(x, \xi) - \nabla^2 f(x)\|^2\right] \le \sigma_2^2

  13. Cubic-regularized gradient descent
      In the deterministic setting, we minimize a local upper bound on the function: the second-order Taylor expansion plus a cubic regularization term [Nesterov and Polyak, 2006].
      m_t(x) = f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2} (x - x_t)^\top \nabla^2 f(x_t) (x - x_t) + \frac{\rho}{6} \|x - x_t\|^3
      x_{t+1} = \arg\min_x m_t(x)
      In the stochastic setting,
      1. we only have access to stochastic gradients and Hessians, not the true gradient and Hessian,
      2. our only means of interaction with the Hessian is through Hessian-vector products, and
      3. the cubic submodel m_t(x) cannot be solved exactly in practice, only up to some tolerance.
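Point 2 is what keeps the method scalable: a Hessian-vector product costs roughly as much as a gradient evaluation. Below is a minimal sketch of one simple way to form such a product, a central finite difference of gradients (automatic differentiation is the usual alternative); the toy gradient and the step size h are assumptions for illustration.

```python
# Minimal sketch: Hessian-vector products without ever forming the Hessian,
# via a central finite difference of gradients. The toy gradient and the
# step size h are illustrative assumptions.
import numpy as np

def grad(x):                                 # gradient of f(x) = x1^4 - 2 x1^2 + x2^2
    return np.array([4 * x[0]**3 - 4 * x[0], 2 * x[1]])

def hvp(grad_fn, x, v, h=1e-5):
    """Approximate (nabla^2 f(x)) @ v with two gradient evaluations."""
    return (grad_fn(x + h * v) - grad_fn(x - h * v)) / (2 * h)

x = np.array([0.3, 1.0])
v = np.array([1.0, 0.0])
print(hvp(grad, x, v))                       # approx [12*0.3**2 - 4, 0] = [-2.92, 0]
```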

  14. Stochastic Cubic Regularization (Meta-algorithm)
      In the deterministic setting, we minimize
      m_t(x) = f(x_t) + (x - x_t)^\top \nabla f(x_t) + \frac{1}{2} (x - x_t)^\top \nabla^2 f(x_t) (x - x_t) + \frac{\rho}{6} \|x - x_t\|^3, \qquad x_{t+1} = \arg\min_x m_t(x)
      In the stochastic setting, writing \Delta := x - x_t, we minimize
      \tilde{m}(\Delta) = \Delta^\top g_t + \frac{1}{2} \Delta^\top B_t[\Delta] + \frac{\rho}{6} \|\Delta\|^3
      where g_t is a stochastic gradient and B_t[\Delta] is a stochastic Hessian-vector product, so that \tilde{m}(\Delta) = m_t(x_t + \Delta) - m_t(x_t). Then
      1. \Delta_{t+1} = \arg\min_\Delta \tilde{m}(\Delta) (we need a cubic subsolver),
      2. x_{t+1} = x_t + \Delta_{t+1},
      3. the subsolver will not recover the exact minimizer \Delta^\star = \arg\min_\Delta \tilde{m}(\Delta), only an approximation. A numpy sketch of the resulting outer loop is given below.
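A schematic sketch of this meta-algorithm (an illustration, not the authors' implementation): each outer step draws a minibatch gradient estimate g_t and a minibatch Hessian-vector-product operator B_t[.], minimizes the stochastic cubic submodel with a simple inner solver, and moves to x_t + \Delta. The toy sampling model, batch sizes, step counts, and the plain gradient-descent inner solver are all assumptions; the paper's actual subsolver (next slides) adds a Cauchy step and a perturbation.

```python
# Schematic sketch of the meta-algorithm on this slide: sample g_t and B_t[.],
# minimize m~(Delta) = Delta^T g + 0.5 Delta^T B[Delta] + rho/6 ||Delta||^3,
# then step to x + Delta. Sampling model, batch sizes, and constants are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def stoch_grad(x, batch):           # stand-in for g(x, xi) averaged over a minibatch
    noise = rng.normal(scale=0.1, size=(batch, x.size)).mean(axis=0)
    return np.array([4 * x[0]**3 - 4 * x[0], 2 * x[1]]) + noise

def stoch_hvp(x, v, batch):         # stand-in for a minibatch Hessian-vector product B_t[v]
    noise = rng.normal(scale=0.1, size=(batch, x.size)).mean(axis=0)
    return np.array([(12 * x[0]**2 - 4) * v[0], 2 * v[1]]) + noise

def subsolver(g, hvp, rho, steps=50, eta=0.01):
    """Plain gradient descent on m~(Delta); the paper's subsolver also uses a
    Cauchy step and a small perturbation of g (see the next slides)."""
    delta = np.zeros_like(g)
    for _ in range(steps):
        grad_m = g + hvp(delta) + rho / 2 * np.linalg.norm(delta) * delta
        delta -= eta * grad_m
    return delta

def stochastic_cubic_regularization(x0, rho=30.0, outer_iters=20, n1=64, n2=16):
    x = x0.copy()
    for _ in range(outer_iters):
        g = stoch_grad(x, n1)
        hvp = lambda v, x=x: stoch_hvp(x, v, n2)
        x = x + subsolver(g, hvp, rho)
    return x

print(stochastic_cubic_regularization(np.array([0.3, 1.0])))
```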

  15. Stochastic cubic regularization (meta-algorithm)
      \tilde{m}(\Delta) = \Delta^\top g_t + \frac{1}{2} \Delta^\top B[\Delta] + \frac{\rho}{6} \|\Delta\|^3 = m_t(x_t + \Delta) - m_t(x_t), \qquad \Delta = \arg\min_\Delta \tilde{m}(\Delta)

  16. Gradient descent as a cubic subsolver
      • lines 1–3 of the subsolver: when the gradient estimate g is large, the submodel \tilde{m}(\Delta) may be ill-conditioned, so instead of running gradient descent the iterate moves a single step along the g direction (a Cauchy step), which already guarantees sufficient descent [Carmon and Duchi, 2016]
      • line 6 of the subsolver: the algorithm adds a small perturbation to g to avoid the hard case for the cubic submodel
      A sketch of such a subsolver follows.
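The sketch below assumes the standard closed-form Cauchy step along -g for the large-gradient case and plain gradient descent on the perturbed submodel otherwise; the threshold on ||g||, the step size, and the perturbation scale are illustrative choices in the spirit of the paper's Algorithm 2, not its exact constants.

```python
# Sketch of a gradient-descent cubic subsolver: a Cauchy step along -g when the
# gradient is large, otherwise gradient descent on m~ with a slightly perturbed g.
# Thresholds, step size, and perturbation scale are illustrative assumptions.
import numpy as np

def cubic_subsolver(g, hvp, rho, L, eps, T, rng=np.random.default_rng(0)):
    if np.linalg.norm(g) >= L**2 / rho:
        # Large-gradient case: closed-form minimizer of m~ along the -g direction
        # (the Cauchy step), which already gives sufficient descent.
        a = (g @ hvp(g)) / (rho * np.linalg.norm(g)**2)
        R = -a + np.sqrt(a**2 + 2 * np.linalg.norm(g) / rho)
        return -R * g / np.linalg.norm(g)
    # Otherwise: perturb g to avoid the hard case, then run gradient descent on m~.
    zeta = rng.normal(size=g.shape)
    g_tilde = g + (np.sqrt(eps * rho) / L) * zeta / np.linalg.norm(zeta)
    delta, eta = np.zeros_like(g), 1.0 / (20.0 * L)
    for _ in range(T):
        grad_m = g_tilde + hvp(delta) + rho / 2 * np.linalg.norm(delta) * delta
        delta -= eta * grad_m
    return delta

# Tiny smoke test with an explicit 2x2 Hessian (illustrative values).
B = np.array([[-2.0, 0.0], [0.0, 3.0]])
print(cubic_subsolver(np.array([0.1, 0.2]), lambda v: B @ v, rho=10.0, L=5.0, eps=1e-2, T=200))
```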

  17. Cubic final solver
      • Algorithm 2 may return an inexact \Delta
      • lines 2, 4: run gradient descent on the submodel again, but to a higher precision, before returning the final point (see the sketch below)
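A minimal sketch of that higher-precision pass, assuming gradient descent on the submodel \tilde{m} with a stopping rule on its gradient norm; the step size and tolerance are illustrative.

```python
# Sketch of a final solver: re-minimize the last cubic submodel by gradient
# descent to a tighter tolerance before returning x_t + Delta.
# Step size and tolerance are illustrative assumptions.
import numpy as np

def cubic_finalsolver(g, hvp, rho, L, eps, max_iters=10_000):
    delta = np.zeros_like(g)
    eta = 1.0 / (20.0 * L)
    for _ in range(max_iters):
        grad_m = g + hvp(delta) + rho / 2 * np.linalg.norm(delta) * delta
        if np.linalg.norm(grad_m) <= eps / 2:   # higher-precision stopping rule
            break
        delta -= eta * grad_m
    return delta
```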

  18. Claims
      Condition 1. For a small constant c, Cubic-Subsolver(g, B[\cdot], \epsilon) terminates within T(\epsilon) gradient iterations (which may depend on c) and returns a \Delta satisfying at least one of the following:
      1. the parameter change results in submodel and function decreases that are both sufficiently large;
      2. if that fails to hold, \Delta is not too large relative to the true solution \Delta^\star, and the cubic submodel is solved to precision c\,\rho \|\Delta^\star\|^3 when \Delta^\star is large.
      Theorem 1. There exists an absolute constant c such that if f(x) satisfies Assumptions 1 and 2, Cubic-Subsolver satisfies Condition 1 with c, \Delta_f \ge f(x_0) - f^*, and
      \epsilon \le \min\left\{ \frac{\sigma_1^2}{c M_1}, \; \frac{\sigma_2^4}{c^2 M_2^2 \rho} \right\},
      then for all \delta > 0, Algorithm 1 will output an \epsilon-second-order stationary point of f with probability at least 1 - \delta within
      O\!\left( \frac{\sqrt{\rho}\,\Delta_f}{\epsilon^{1.5}} \left( \frac{\sigma_1^2}{\epsilon^2} + \frac{\sigma_2^2}{\rho \epsilon}\, T(\epsilon) \right) \right)
      total stochastic gradient and Hessian-vector product evaluations.

  19. Claims
      Lemma 1. There exists an absolute constant c such that, under the same assumptions on f(x) and the same choice of parameters n_1, n_2 as in Theorem 1, Algorithm 2 satisfies Condition 1 with probability at least 1 - \delta with
      T(\epsilon) \le O\!\left( \frac{L}{\sqrt{\rho \epsilon}} \right)
      Corollary 1. Under the same settings as Theorem 1, if we instantiate Cubic-Subsolver with Algorithm 2 and
      \epsilon \le \min\left\{ \frac{\sigma_1^2}{c M_1}, \; \frac{\sigma_2^4}{c^2 M_2^2 \rho} \right\},
      then Algorithm 1 will output an \epsilon-second-order stationary point of f with probability at least 1 - \delta within
      O\!\left( \frac{\sqrt{\rho}\,\Delta_f}{\epsilon^{1.5}} \left( \frac{\sigma_1^2}{\epsilon^2} + \frac{\sigma_2^2 L}{\rho^{1.5} \epsilon^{1.5}} \right) \right)
      total stochastic gradient and Hessian-vector product evaluations.
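One way to read the bound in Corollary 1 (an interpretation, not a statement from the slides): roughly \sqrt{\rho}\,\Delta_f / \epsilon^{1.5} cubic steps, each using about \sigma_1^2/\epsilon^2 stochastic gradients and about \sigma_2^2/(\rho\epsilon) Hessian-vector samples per subsolver iteration, with T(\epsilon) = L/\sqrt{\rho\epsilon} subsolver iterations. The snippet below just multiplies these factors; the numeric values plugged in are arbitrary illustrations.

```python
# Reading off the oracle budget implied by Corollary 1, up to constants and
# log factors. All concrete numbers below are purely illustrative.
import math

def scr_budget(eps, rho, delta_f, sigma1, sigma2, L):
    outer = math.sqrt(rho) * delta_f / eps**1.5      # number of cubic steps
    n1 = sigma1**2 / eps**2                          # gradient batch per step
    n2 = sigma2**2 / (rho * eps)                     # Hessian-vector batch per subsolver iteration
    T = L / math.sqrt(rho * eps)                     # subsolver iterations (Lemma 1)
    return outer * (n1 + n2 * T)                     # total oracle calls

print(f"{scr_budget(eps=1e-2, rho=1.0, delta_f=10.0, sigma1=1.0, sigma2=1.0, L=10.0):.2e}")
```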

  20. Proof Sketch
      Claim 1. If x_{t+1} is not an \epsilon-second-order stationary point of f(x), then the cubic submodel has large descent m_t(x_{t+1}) - m_t(x_t).
      Claim 2. If the cubic submodel has large descent m_t(x_{t+1}) - m_t(x_t), then the true function also has large descent f(x_{t+1}) - f(x_t).

  21. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References (Section 4 – Experiments)

  22. Synthetic Nonconvex Problem
      \min_{x \in \mathbb{R}^2} \; w(x_1) + 10 x_2^2, \qquad w a piecewise cubic function
      Algorithm 1 is able to escape the saddle point at the origin and converge to one of the global minima faster than SGD.
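The slide does not reproduce the exact piecewise cubic w used in the paper, so the block below defines an illustrative stand-in (an assumption, not the paper's function) with the same qualitative features: a strict saddle at the origin and two symmetric global minima.

```python
# Illustrative stand-in for the synthetic problem (not the paper's exact w):
# w is piecewise cubic with w'(0) = 0 and w''(0) < 0, so f(x) = w(x1) + 10*x2^2
# has a strict saddle at the origin and global minima near (+-1, 0).
import numpy as np

def w(u):
    return np.abs(u)**3 - 1.5 * u**2        # piecewise cubic, minima at u = +-1

def f(x):
    return w(x[0]) + 10.0 * x[1]**2

print(f(np.zeros(2)), f(np.array([1.0, 0.0])), f(np.array([-1.0, 0.0])))  # 0.0 -0.5 -0.5
```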

  23. Deep Autoencoder
      Encoder: (28 × 28) → 512 → 256 → 128 → 32
      Decoder: (28 × 28) ← 512 ← 256 ← 128 ← 32
      Objective: minimize the pixelwise L_2 reconstruction loss
