
Stochastic Cubic Regularization for Fast Nonconvex Optimization



  1. Stochastic Cubic Regularization for Fast Nonconvex Optimization
     Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan
     Presented by Achin Jain, University of Pennsylvania, STAT991, Spring 2019

  2. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References

  3. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References (Section 1 – Motivation)

  4. Motivation
     \min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}[f(x;\xi)], \quad f nonconvex and f(x;\xi) stochastic
     Variants of stochastic optimization:
     1. Offline setting: minimize the empirical loss over a fixed amount of data
     2. Online setting: minimize the empirical loss when data arrives sequentially
     Applications:
     • Large-scale statistics and machine learning problems
     • Example: optimization of deep neural networks

  5. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References (Section 2 – Objectives)

  6. Survey of (stochastic) gradient descent algorithms for finding ε-second-order stationary points

  7. Cubic-regularized gradient descent
     Gradient descent:
     x_{t+1} = \arg\min_x \left[ f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{L}{2} \|x - x_t\|^2 \right]
     State-of-the-art convergence:
     1. Hessian-free perturbed GD: O(\epsilon^{-2}) [Jin et al., 2017]
     Cubic-regularized gradient descent:
     x_{t+1} = \arg\min_x \left[ f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2} (x - x_t)^\top \nabla^2 f(x_t) (x - x_t) + \frac{\rho}{6} \|x - x_t\|^3 \right]
     State-of-the-art convergence:
     1. full Hessian: O(\epsilon^{-1.5}) [Nesterov and Polyak, 2006]
     2. Hessian-vector product evaluations w/o acceleration: O(\epsilon^{-2}) [Carmon and Duchi, 2016]
     3. Hessian-vector product evaluations w/ acceleration: O(\epsilon^{-1.75}) [Carmon et al., 2018]
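As a concrete toy illustration of the two updates above (not the authors' code): the sketch below compares the closed-form gradient-descent step, which minimizes the quadratic upper bound, with the cubic-regularized step, which minimizes the cubic model and is handed to a generic scipy optimizer here only because the dimension is tiny. The test function and the constants L and rho are illustrative assumptions.

```python
# Toy comparison of the two model-minimization updates above.
# The objective, L, and rho are illustrative assumptions, not values from the paper.
import numpy as np
from scipy.optimize import minimize

def f(x):                                   # small nonconvex test function
    return x[0]**4 - 2 * x[0]**2 + x[1]**2

def grad(x):
    return np.array([4 * x[0]**3 - 4 * x[0], 2 * x[1]])

def hess(x):
    return np.array([[12 * x[0]**2 - 4, 0.0], [0.0, 2.0]])

L, rho = 20.0, 30.0                         # assumed smoothness constants
x_t = np.array([0.3, 1.0])
g, H = grad(x_t), hess(x_t)

# Gradient descent step = exact minimizer of the quadratic upper bound.
x_gd = x_t - g / L

# Cubic-regularized step = minimizer of the cubic model m_t; solved with a
# generic optimizer here since d = 2 (the paper instead uses a Hessian-vector
# product subsolver, described on the later slides).
def m_t(x):
    d = x - x_t
    return f(x_t) + g @ d + 0.5 * d @ H @ d + rho / 6 * np.linalg.norm(d)**3

x_cubic = minimize(m_t, x_t).x
print("GD step:", x_gd, " cubic step:", x_cubic)
```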

  8. Cubic-regularized stochastic gradient descent
     Stochastic gradient descent:
     x_{t+1} = \arg\min_x \left[ f(x_t) + g(x_t, \xi_t)^\top (x - x_t) + \frac{L}{2} \|x - x_t\|^2 \right], \quad \mathbb{E}[g(x_t, \xi_t)] = \nabla f(x_t)
     State-of-the-art convergence:
     1. noisy SGD: O(\epsilon^{-4}) [Ge et al., 2015]
     2. Hessian-vector product evaluations w/ variance reduction: O(\epsilon^{-3.5}) [Allen-Zhu, 2018]
     3. gradient evaluations w/ variance reduction: O(\epsilon^{-3.5}) [Allen-Zhu and Li, 2018]
     Stochastic cubic-regularized gradient descent [this paper]:
     x_{t+1} = \arg\min_x \left[ f(x_t) + g_t^\top (x - x_t) + \frac{1}{2} (x - x_t)^\top B_t (x - x_t) + \frac{\rho}{6} \|x - x_t\|^3 \right]
     State-of-the-art convergence:
     1. Hessian-vector product evaluations: O(\epsilon^{?}) [Tripuraneni et al., 2018] (see the comparison table two slides ahead)

  9. Problem statement
     \min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}[f(x;\xi)], \quad f nonconvex and f(x;\xi) stochastic
     1. Can we design a fully stochastic variant of the cubic-regularized Newton method?
     2. Is such an algorithm faster than stochastic gradient descent?

  10. What’s coming up
      \min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}[f(x;\xi)], \quad f nonconvex and f(x;\xi) stochastic
      Comparison of different stochastic optimization algorithms to find an ε-second-order stationary point:

      Method                            Run-time      Variance reduction   Type
      SGD [Ge et al., 2015]             O(ε^{-4})     not needed           1st order
      Natasha 2 [Allen-Zhu, 2018]       O(ε^{-3.5})   needed               2nd order
      Neon 2 [Allen-Zhu and Li, 2018]   O(ε^{-3.5})   needed               2nd order
      SCR [this paper]                  O(ε^{-3.5})   not needed           2nd order

  11. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References (Section 3 – Algorithm)

  12. Assumptions
      Assumption 1. The function f(x) has L-Lipschitz gradients and ρ-Lipschitz Hessians: for all x_1 and x_2,
      \|\nabla f(x_1) - \nabla f(x_2)\| \le L \|x_1 - x_2\|, \qquad \|\nabla^2 f(x_1) - \nabla^2 f(x_2)\| \le \rho \|x_1 - x_2\|
      Assumption 2. The stochastic gradients and stochastic Hessians, and their variances, are bounded:
      \|\nabla f(x, \xi) - \nabla f(x)\| \le M_1, \qquad \mathbb{E}\left[\|\nabla f(x, \xi) - \nabla f(x)\|^2\right] \le \sigma_1^2
      \|\nabla^2 f(x, \xi) - \nabla^2 f(x)\| \le M_2, \qquad \mathbb{E}\left[\|\nabla^2 f(x, \xi) - \nabla^2 f(x)\|^2\right] \le \sigma_2^2

  13. Cubic-regularized gradient descent
      In the deterministic setting, we minimize a local upper bound on the function: the second-order Taylor expansion plus a cubic regularization term [Nesterov and Polyak, 2006].
      m_t(x) = f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2} (x - x_t)^\top \nabla^2 f(x_t) (x - x_t) + \frac{\rho}{6} \|x - x_t\|^3
      x_{t+1} = \arg\min_x m_t(x)
      In the stochastic setting,
      1. we only have access to stochastic gradients and Hessians, not the true gradient and Hessian,
      2. our only means of interaction with the Hessian is through Hessian-vector products, and
      3. the cubic submodel m_t(x) cannot be solved exactly in practice, only up to some tolerance.
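Point 2 is what keeps the method scalable: a Hessian-vector product costs roughly as much as a gradient evaluation. Below is a minimal sketch of one simple way to form such a product, a central finite difference of gradients (automatic differentiation is the usual alternative); the toy gradient and the step size h are assumptions for illustration.

```python
# Minimal sketch: Hessian-vector products without ever forming the Hessian,
# via a central finite difference of gradients. The toy gradient and the
# step size h are illustrative assumptions.
import numpy as np

def grad(x):                                 # gradient of f(x) = x1^4 - 2 x1^2 + x2^2
    return np.array([4 * x[0]**3 - 4 * x[0], 2 * x[1]])

def hvp(grad_fn, x, v, h=1e-5):
    """Approximate (nabla^2 f(x)) @ v with two gradient evaluations."""
    return (grad_fn(x + h * v) - grad_fn(x - h * v)) / (2 * h)

x = np.array([0.3, 1.0])
v = np.array([1.0, 0.0])
print(hvp(grad, x, v))                       # approx [12*0.3**2 - 4, 0] = [-2.92, 0]
```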

  14. Stochastic Cubic Regularization (Meta-algorithm)
      In the deterministic setting, we minimize
      m_t(x) = f(x_t) + (x - x_t)^\top \nabla f(x_t) + \frac{1}{2} (x - x_t)^\top \nabla^2 f(x_t) (x - x_t) + \frac{\rho}{6} \|x - x_t\|^3, \qquad x_{t+1} = \arg\min_x m_t(x)
      In the stochastic setting, writing \Delta := x - x_t, we minimize
      \tilde{m}(\Delta) = \Delta^\top g_t + \frac{1}{2} \Delta^\top B_t[\Delta] + \frac{\rho}{6} \|\Delta\|^3
      where g_t is a stochastic gradient and B_t[\Delta] is a stochastic Hessian-vector product, so that \tilde{m}(\Delta) = m_t(x_t + \Delta) - m_t(x_t). Then
      1. \Delta_{t+1} = \arg\min_\Delta \tilde{m}(\Delta) (we need a cubic subsolver),
      2. x_{t+1} = x_t + \Delta_{t+1},
      3. the subsolver will not recover the exact minimizer \Delta^\star = \arg\min_\Delta \tilde{m}(\Delta), only an approximation. A numpy sketch of the resulting outer loop is given below.
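A schematic sketch of this meta-algorithm (an illustration, not the authors' implementation): each outer step draws a minibatch gradient estimate g_t and a minibatch Hessian-vector-product operator B_t[.], minimizes the stochastic cubic submodel with a simple inner solver, and moves to x_t + \Delta. The toy sampling model, batch sizes, step counts, and the plain gradient-descent inner solver are all assumptions; the paper's actual subsolver (next slides) adds a Cauchy step and a perturbation.

```python
# Schematic sketch of the meta-algorithm on this slide: sample g_t and B_t[.],
# minimize m~(Delta) = Delta^T g + 0.5 Delta^T B[Delta] + rho/6 ||Delta||^3,
# then step to x + Delta. Sampling model, batch sizes, and constants are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def stoch_grad(x, batch):           # stand-in for g(x, xi) averaged over a minibatch
    noise = rng.normal(scale=0.1, size=(batch, x.size)).mean(axis=0)
    return np.array([4 * x[0]**3 - 4 * x[0], 2 * x[1]]) + noise

def stoch_hvp(x, v, batch):         # stand-in for a minibatch Hessian-vector product B_t[v]
    noise = rng.normal(scale=0.1, size=(batch, x.size)).mean(axis=0)
    return np.array([(12 * x[0]**2 - 4) * v[0], 2 * v[1]]) + noise

def subsolver(g, hvp, rho, steps=50, eta=0.01):
    """Plain gradient descent on m~(Delta); the paper's subsolver also uses a
    Cauchy step and a small perturbation of g (see the next slides)."""
    delta = np.zeros_like(g)
    for _ in range(steps):
        grad_m = g + hvp(delta) + rho / 2 * np.linalg.norm(delta) * delta
        delta -= eta * grad_m
    return delta

def stochastic_cubic_regularization(x0, rho=30.0, outer_iters=20, n1=64, n2=16):
    x = x0.copy()
    for _ in range(outer_iters):
        g = stoch_grad(x, n1)
        hvp = lambda v, x=x: stoch_hvp(x, v, n2)
        x = x + subsolver(g, hvp, rho)
    return x

print(stochastic_cubic_regularization(np.array([0.3, 1.0])))
```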

  15. Stochastic cubic regularization (meta-algorithm)
      \tilde{m}(\Delta) = \Delta^\top g_t + \frac{1}{2} \Delta^\top B[\Delta] + \frac{\rho}{6} \|\Delta\|^3 = m_t(x_t + \Delta) - m_t(x_t), \qquad \Delta = \arg\min_\Delta \tilde{m}(\Delta)

  16. Gradient descent as a cubic subsolver
      • lines 1–3 of the subsolver: when the gradient estimate g is large, the submodel \tilde{m}(\Delta) may be ill-conditioned, so instead of running gradient descent the iterate moves a single step along the g direction (a Cauchy step), which already guarantees sufficient descent [Carmon and Duchi, 2016]
      • line 6 of the subsolver: the algorithm adds a small perturbation to g to avoid the hard case for the cubic submodel
      A sketch of such a subsolver follows.
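The sketch below assumes the standard closed-form Cauchy step along -g for the large-gradient case and plain gradient descent on the perturbed submodel otherwise; the threshold on ||g||, the step size, and the perturbation scale are illustrative choices in the spirit of the paper's Algorithm 2, not its exact constants.

```python
# Sketch of a gradient-descent cubic subsolver: a Cauchy step along -g when the
# gradient is large, otherwise gradient descent on m~ with a slightly perturbed g.
# Thresholds, step size, and perturbation scale are illustrative assumptions.
import numpy as np

def cubic_subsolver(g, hvp, rho, L, eps, T, rng=np.random.default_rng(0)):
    if np.linalg.norm(g) >= L**2 / rho:
        # Large-gradient case: closed-form minimizer of m~ along the -g direction
        # (the Cauchy step), which already gives sufficient descent.
        a = (g @ hvp(g)) / (rho * np.linalg.norm(g)**2)
        R = -a + np.sqrt(a**2 + 2 * np.linalg.norm(g) / rho)
        return -R * g / np.linalg.norm(g)
    # Otherwise: perturb g to avoid the hard case, then run gradient descent on m~.
    zeta = rng.normal(size=g.shape)
    g_tilde = g + (np.sqrt(eps * rho) / L) * zeta / np.linalg.norm(zeta)
    delta, eta = np.zeros_like(g), 1.0 / (20.0 * L)
    for _ in range(T):
        grad_m = g_tilde + hvp(delta) + rho / 2 * np.linalg.norm(delta) * delta
        delta -= eta * grad_m
    return delta

# Tiny smoke test with an explicit 2x2 Hessian (illustrative values).
B = np.array([[-2.0, 0.0], [0.0, 3.0]])
print(cubic_subsolver(np.array([0.1, 0.2]), lambda v: B @ v, rho=10.0, L=5.0, eps=1e-2, T=200))
```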

  17. Cubic final solver
      • Algorithm 2 may return an inexact \Delta
      • lines 2, 4: run gradient descent on the submodel again, but to a higher precision, before returning the final point (see the sketch below)
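A minimal sketch of that higher-precision pass, assuming gradient descent on the submodel \tilde{m} with a stopping rule on its gradient norm; the step size and tolerance are illustrative.

```python
# Sketch of a final solver: re-minimize the last cubic submodel by gradient
# descent to a tighter tolerance before returning x_t + Delta.
# Step size and tolerance are illustrative assumptions.
import numpy as np

def cubic_finalsolver(g, hvp, rho, L, eps, max_iters=10_000):
    delta = np.zeros_like(g)
    eta = 1.0 / (20.0 * L)
    for _ in range(max_iters):
        grad_m = g + hvp(delta) + rho / 2 * np.linalg.norm(delta) * delta
        if np.linalg.norm(grad_m) <= eps / 2:   # higher-precision stopping rule
            break
        delta -= eta * grad_m
    return delta
```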

  18. Claims
      Condition 1. For a small constant c, Cubic-Subsolver(g, B[\cdot], \epsilon) terminates within T(\epsilon) gradient iterations (which may depend on c) and returns a \Delta satisfying at least one of the following:
      1. the parameter change results in submodel and function decreases that are both sufficiently large;
      2. if that fails to hold, \Delta is not too large relative to the true solution \Delta^\star, and the cubic submodel is solved to precision c\,\rho \|\Delta^\star\|^3 when \Delta^\star is large.
      Theorem 1. There exists an absolute constant c such that if f(x) satisfies Assumptions 1 and 2, Cubic-Subsolver satisfies Condition 1 with c, \Delta_f \ge f(x_0) - f^*, and
      \epsilon \le \min\left\{ \frac{\sigma_1^2}{c M_1}, \; \frac{\sigma_2^4}{c^2 M_2^2 \rho} \right\},
      then for all \delta > 0, Algorithm 1 will output an \epsilon-second-order stationary point of f with probability at least 1 - \delta within
      O\!\left( \frac{\sqrt{\rho}\,\Delta_f}{\epsilon^{1.5}} \left( \frac{\sigma_1^2}{\epsilon^2} + \frac{\sigma_2^2}{\rho \epsilon}\, T(\epsilon) \right) \right)
      total stochastic gradient and Hessian-vector product evaluations.

  19. Claims
      Lemma 1. There exists an absolute constant c such that, under the same assumptions on f(x) and the same choice of parameters n_1, n_2 as in Theorem 1, Algorithm 2 satisfies Condition 1 with probability at least 1 - \delta with
      T(\epsilon) \le O\!\left( \frac{L}{\sqrt{\rho \epsilon}} \right)
      Corollary 1. Under the same settings as Theorem 1, if we instantiate Cubic-Subsolver with Algorithm 2 and
      \epsilon \le \min\left\{ \frac{\sigma_1^2}{c M_1}, \; \frac{\sigma_2^4}{c^2 M_2^2 \rho} \right\},
      then Algorithm 1 will output an \epsilon-second-order stationary point of f with probability at least 1 - \delta within
      O\!\left( \frac{\sqrt{\rho}\,\Delta_f}{\epsilon^{1.5}} \left( \frac{\sigma_1^2}{\epsilon^2} + \frac{\sigma_2^2 L}{\rho^{1.5} \epsilon^{1.5}} \right) \right)
      total stochastic gradient and Hessian-vector product evaluations.
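One way to read the bound in Corollary 1 (an interpretation, not a statement from the slides): roughly \sqrt{\rho}\,\Delta_f / \epsilon^{1.5} cubic steps, each using about \sigma_1^2/\epsilon^2 stochastic gradients and about \sigma_2^2/(\rho\epsilon) Hessian-vector samples per subsolver iteration, with T(\epsilon) = L/\sqrt{\rho\epsilon} subsolver iterations. The snippet below just multiplies these factors; the numeric values plugged in are arbitrary illustrations.

```python
# Reading off the oracle budget implied by Corollary 1, up to constants and
# log factors. All concrete numbers below are purely illustrative.
import math

def scr_budget(eps, rho, delta_f, sigma1, sigma2, L):
    outer = math.sqrt(rho) * delta_f / eps**1.5      # number of cubic steps
    n1 = sigma1**2 / eps**2                          # gradient batch per step
    n2 = sigma2**2 / (rho * eps)                     # Hessian-vector batch per subsolver iteration
    T = L / math.sqrt(rho * eps)                     # subsolver iterations (Lemma 1)
    return outer * (n1 + n2 * T)                     # total oracle calls

print(f"{scr_budget(eps=1e-2, rho=1.0, delta_f=10.0, sigma1=1.0, sigma2=1.0, L=10.0):.2e}")
```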

  20. Proof Sketch
      Claim 1. If x_{t+1} is not an \epsilon-second-order stationary point of f(x), then the cubic submodel has large descent m_t(x_{t+1}) - m_t(x_t).
      Claim 2. If the cubic submodel has large descent m_t(x_{t+1}) - m_t(x_t), then the true function also has large descent f(x_{t+1}) - f(x_t).

  21. Outline: 1. Motivation  2. Objectives  3. Algorithm  4. Experiments  5. References (Section 4 – Experiments)

  22. Synthetic Nonconvex Problem
      \min_{x \in \mathbb{R}^2} \; w(x_1) + 10 x_2^2, \qquad w a piecewise cubic function
      Algorithm 1 is able to escape the saddle point at the origin and converge to one of the global minima faster than SGD.
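The slide does not reproduce the exact piecewise cubic w used in the paper, so the block below defines an illustrative stand-in (an assumption, not the paper's function) with the same qualitative features: a strict saddle at the origin and two symmetric global minima.

```python
# Illustrative stand-in for the synthetic problem (not the paper's exact w):
# w is piecewise cubic with w'(0) = 0 and w''(0) < 0, so f(x) = w(x1) + 10*x2^2
# has a strict saddle at the origin and global minima near (+-1, 0).
import numpy as np

def w(u):
    return np.abs(u)**3 - 1.5 * u**2        # piecewise cubic, minima at u = +-1

def f(x):
    return w(x[0]) + 10.0 * x[1]**2

print(f(np.zeros(2)), f(np.array([1.0, 0.0])), f(np.array([-1.0, 0.0])))  # 0.0 -0.5 -0.5
```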

  23. Deep Autoencoder
      Encoder: (28 × 28) → 512 → 256 → 128 → 32
      Decoder: (28 × 28) ← 512 ← 256 ← 128 ← 32
      Objective: minimize the pixelwise L_2 reconstruction loss
