ICML 2019
PA-GD: On the Convergence of Perturbed Alternating Gradient Descent to Second-Order Stationary Points for Structured Nonconvex Optimization
Presenter: Songtao Lu, University of Minnesota Twin Cities
Joint work with Mingyi Hong and Zhengdao Wang
Co-authors
• Mingyi Hong, University of Minnesota
• Zhengdao Wang, Iowa State University
Agenda
• Motivation
  – A class of structured nonconvex problems
• What we plan to achieve
  – Random perturbation: convergence rate of alternating gradient descent (A-GD) to second-order stationary points (SOSPs) with high probability
• Numerical results
  – Two-layer linear neural networks
  – Matrix factorization
• Concluding remarks
Block Structured Nonconvex Optimization
• Consider the following problem P:
    minimize_{x, y} f(x, y)
• f(x, y): R^d → R is a smooth nonconvex function
  – x ∈ R^{d_x}
  – y ∈ R^{d_y}
  – d = d_x + d_y
Motivation: Nice Landscapes
• High dimensional problems: strict saddle points are common
• There are some nice/benign block structured problems [R. Ge et al., 2017; J. Lee et al., 2018]
  – All local minima are global minima
  – Saddle points: objective value very poor compared with local minima
  – Every saddle point is strict (the Hessian matrix has at least one negative eigenvalue)
[Figure: landscapes of minimize_{x ∈ R^{2×1}} ‖xx^T − M‖_F^2 with indefinite M, marking the point [0, 0]^T and labeling a local maximum, a strict saddle, and a local minimum]
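As a quick numerical illustration (my own sketch, not from the slides): for f(x) = ‖xx^T − M‖_F^2 the Hessian at the origin works out to −4M, so any M with a positive eigenvalue makes [0, 0]^T a strict saddle. The specific indefinite M below is an arbitrary choice for illustration.

```python
import numpy as np

# Toy example from the slide: f(x) = ||x x^T - M||_F^2 over x in R^2.
# The Hessian is 4(||x||^2 I + 2 x x^T - M), so at x = 0 it equals -4M:
# a positive eigenvalue of M gives the origin a negative-curvature direction.
M = np.array([[1.0, 0.0],
              [0.0, -1.0]])               # an indefinite M, chosen for illustration

def hessian(x):
    return 4.0 * (np.dot(x, x) * np.eye(2) + 2.0 * np.outer(x, x) - M)

print(np.linalg.eigvalsh(hessian(np.zeros(2))))   # [-4.  4.] -> strict saddle at 0
```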
Optimality Conditions
• Common definition of first-order stationary points (FOSPs):
  If ‖∇f(x, y)‖ ≤ ε, where ε > 0, then (x, y) is an ε-FOSP.
• Common definition of SOSPs:
  If ‖∇f(x, y)‖ ≤ ε and λ_min(∇²f(x, y)) ≥ −γ, where ε, γ > 0, then (x, y) is an (ε, γ)-SOSP.
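These two conditions are easy to check numerically when the gradient and Hessian are available. The snippet below is a minimal sketch (not from the paper); the test matrix and tolerances are illustrative assumptions.

```python
import numpy as np

def is_sosp(grad, hess, eps, gamma):
    """Check the (eps, gamma)-SOSP conditions at a point:
    gradient norm <= eps and smallest Hessian eigenvalue >= -gamma."""
    first_order = np.linalg.norm(grad) <= eps
    second_order = np.linalg.eigvalsh(hess).min() >= -gamma
    return first_order and second_order

# Example: a 2-D quadratic with an indefinite Hessian -> the origin is a
# first-order stationary point but not an SOSP (it is a strict saddle).
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])                 # eigenvalues 3 and -1
grad_at_origin = H @ np.zeros(2)
print(is_sosp(grad_at_origin, H, eps=1e-6, gamma=1e-3))   # False
```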
Literature
Algorithms with convergence guarantees to SOSPs:
• Second-order methods (one block)
  – Trust region method [Conn et al., 2000]
  – Cubic regularized Newton's method [Nesterov & Polyak, 2006]
  – Hybrid of first-order and second-order methods [Reddi et al., 2018]
• First-order methods (one block)
  – Perturbed gradient descent (PGD) [Jin et al., 2017]
  – Stochastic first-order method (NEgative-curvature-Originated-from-Noise, NEON) [Xu et al., 2017]
  – Neon2 (finding local minima via first-order oracles) [Allen-Zhu et al., 2017]
  – Accelerated methods [Carmon et al., 2016; Jin et al., 2018; Xu et al., 2018]
  – Many more
Literature
• Block structured nonconvex optimization (asymptotic):
  – Block coordinate descent (BCD) [Song et al., 2017; Lee et al., 2017]
  – Alternating direction method of multipliers (ADMM) [Hong et al., 2018]
• But none of these works has shown the convergence rate of block coordinate descent to SOSPs, even for the two-block case.
• Gradient descent can take an exponential number of iterations to escape saddle points [Du et al., 2017]
Motivation: Block Structured Nonconvex Problems
• Many problems naturally have block structure.
• We can obtain faster numerical convergence by leveraging the block structure of the problem.
Motivation: Block Structured Nonconvex Problems
• Matrix factorization [Jain et al., 2013]
    minimize_{X ∈ R^{n×k}, Y ∈ R^{m×k}} (1/2) ‖XY^T − M‖_F^2
• Matrix sensing [Sun et al., 2014]
    minimize_{X ∈ R^{n×k}, Y ∈ R^{m×k}} (1/2) ‖A(XY^T − M)‖_F^2
  A: linear measurement operator satisfying the restricted isometry property (RIP) condition
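For concreteness, here is a minimal numpy sketch (an illustration, not code from the paper) of the matrix factorization objective and its two block gradients, which are exactly the quantities an alternating method cycles over; the problem sizes and random data are arbitrary.

```python
import numpy as np

def mf_loss(X, Y, M):
    """f(X, Y) = 0.5 * ||X Y^T - M||_F^2."""
    return 0.5 * np.linalg.norm(X @ Y.T - M, 'fro') ** 2

def mf_grads(X, Y, M):
    """Block gradients of the matrix factorization objective."""
    R = X @ Y.T - M            # residual
    return R @ Y, R.T @ X      # grad_X, grad_Y

# Random instance (sizes are illustrative only)
rng = np.random.default_rng(0)
n, m, k = 30, 20, 5
M = rng.standard_normal((n, k)) @ rng.standard_normal((m, k)).T
X, Y = rng.standard_normal((n, k)), rng.standard_normal((m, k))
gX, gY = mf_grads(X, Y, M)
print(mf_loss(X, Y, M), np.linalg.norm(gX), np.linalg.norm(gY))
```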
Motivation of This Work
Can we solve these nice block structured nonconvex problems to SOSPs?
Alternating Gradient Descent
• Iterates of A-GD [Bertsekas, 1999]:
    x^{(t+1)} = x^{(t)} − η ∇_x f(x^{(t)}, y^{(t)})      (1)
    y^{(t+1)} = y^{(t)} − η ∇_y f(x^{(t+1)}, y^{(t)})     (2)
• Step-size: η ≤ 1/L_max
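A minimal sketch of iterations (1)-(2), assuming the caller supplies the two block gradients and a step size η ≤ 1/L_max; the toy objective in the usage example is my own.

```python
import numpy as np

def agd(x, y, grad_x, grad_y, eta, num_iters):
    """Alternating gradient descent: x is updated first, then y with the fresh x."""
    for _ in range(num_iters):
        x = x - eta * grad_x(x, y)      # eq. (1)
        y = y - eta * grad_y(x, y)      # eq. (2) uses the already-updated x
    return x, y

# Tiny usage example on f(x, y) = 0.5*(x - 1)^2 + 0.5*(y - x)^2
gx = lambda x, y: (x - 1.0) + (x - y)
gy = lambda x, y: (y - x)
x_star, y_star = agd(np.array([0.0]), np.array([0.0]), gx, gy, eta=0.4, num_iters=200)
print(x_star, y_star)    # both approach 1
```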
Motivation of Alternating Gradient Descent
• Example: minimize_{x_1, x_2} x^T M x, with
    M = [1, a; a, 1]
• Whole problem: L = 1 + a
• Block-wise: L_max = 1
[Figure: trajectories of GD vs. A-GD for a = 1000]
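To see the effect of the two constants, the sketch below (my own, assuming a 1/2 factor in the objective so that the gradient Lipschitz constants are L = 1 + a and L_max = 1 as stated) counts how many iterations GD with step 1/(1+a) and A-GD with step 1/L_max need to push the objective below a fixed negative level when started near the saddle at the origin.

```python
import numpy as np

a = 1000.0
M = np.array([[1.0, a],
              [a, 1.0]])
f = lambda x: 0.5 * x @ M @ x            # f = 0.5 x^T M x, saddle at the origin

def iters_to_escape(step, alternating, level=-1e6, max_iters=10_000):
    # start near the saddle, slightly along the negative-curvature direction (1, -1)
    x = np.array([1e-3, -1e-3])
    for t in range(1, max_iters + 1):
        if alternating:
            x[0] -= step * (M[0] @ x)    # block 1 update, then block 2 with fresh x[0]
            x[1] -= step * (M[1] @ x)
        else:
            x = x - step * (M @ x)       # full gradient step
        if f(x) < level:
            return t
    return max_iters

print("GD   (eta = 1/(1+a)):", iters_to_escape(1.0 / (1.0 + a), alternating=False))
print("A-GD (eta = 1/L_max):", iters_to_escape(1.0, alternating=True))
```

With a = 1000, A-GD drops below the level within a couple of iterations, while GD with its much smaller admissible step size needs noticeably more.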
Motivation of Alternating Gradient Descent
• A-GD:
  – numerically good
  – may take a long time to escape from saddle points
• PA-GD: numerically good, with convergence rate guarantees
Matrix Factorization
A two-layer linear neural network:
    minimize_{U ∈ R^{n×k}, V ∈ R^{m×k}} Σ_{i=1}^{l} ‖ỹ_i − UV^T x̃_i‖_2^2 = ‖Ỹ − UV^T X̃‖_F^2      (3)
• Ỹ and X̃: n = 100, m = 40, k = 20, l = 20, entries drawn from CN(0, 1)
[Figure: convergence comparison between GD and PA-GD (proposed) for learning a two-layer neural network, where ε = 10^{−10}, g_th = ε/10, t_th = 10/ε^{1/2}, r = ε/10]
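A sketch of how such a synthetic instance could be set up (my own assumptions: real standard-normal data with Ỹ ∈ R^{n×l} and X̃ ∈ R^{m×l}), together with objective (3) and its block gradients:

```python
import numpy as np

# Synthetic two-layer linear network instance with the dimensions stated above
# (n=100, m=40, k=20, l=20); real standard-normal data is an assumption here.
rng = np.random.default_rng(0)
n, m, k, l = 100, 40, 20, 20
Xt = rng.standard_normal((m, l))          # assumed shape: X tilde, m x l
Yt = rng.standard_normal((n, l))          # assumed shape: Y tilde, n x l

def loss(U, V):
    """Objective (3): ||Ytilde - U V^T Xtilde||_F^2."""
    return np.linalg.norm(Yt - U @ V.T @ Xt, 'fro') ** 2

def grads(U, V):
    R = Yt - U @ V.T @ Xt                 # residual, n x l
    return -2 * R @ Xt.T @ V, -2 * Xt @ R.T @ U   # grad_U (n x k), grad_V (m x k)

U, V = rng.standard_normal((n, k)), rng.standard_normal((m, k))
gU, gV = grads(U, V)
print(loss(U, V), gU.shape, gV.shape)
```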
Connection with Existing Works
Algorithm                            | Iterations    | (ε, γ)-SOSP
PGD [Jin et al., 2017]               | Õ(1/ε^2)      | (ε, ε^{1/2})
NEON+SGD [Xu and Yang, 2017]         | Õ(1/ε^4)      | (ε, ε^{1/2})
NEON2+SGD [Allen-Zhu and Li, 2017]   | Õ(1/ε^4)      | (ε, ε^{1/2})
NEON+ [Xu et al., 2017]              | Õ(1/ε^{7/4})  | (ε, ε^{1/2})
Accelerated PGD [Jin et al., 2018]   | Õ(1/ε^{7/4})  | (ε, ε^{1/2})
BCD [Song et al., 2017]              | N/A           | (0, 0)
BCD [Lee et al., 2017]               | N/A           | (0, 0)
PA-GD [This Work]                    | Õ(1/ε^2)      | (ε, ε^{1/2})
Convergence rates of algorithms to SOSPs using first-order information, where p ≥ 4.
Connection with Existing Works
                              | Asymptotic convergence to SOSPs       | Convergence rate to SOSPs
Gradient descent              | Lee et al., 2017                      | Jin et al., 2017
Alternating gradient descent  | Lee et al., 2017; Song et al., 2017   | This Work
Challenge of the Problem
• Variable coupling
• Consider a biconvex objective function
    f(x, y) = [x, y] [1, 2; 2, 1] [x; y] = x^2 + 4xy + y^2
• Block-wise: convex
• Whole problem: nonconvex!
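A two-line numerical check of this claim (my own sketch): the diagonal entries of the coupling matrix are positive, so the function is convex in each block alone, yet the matrix has a negative eigenvalue, so the joint problem is nonconvex.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])
print(np.diag(A))                 # [1. 1.]  -> convex in x alone and in y alone
print(np.linalg.eigvalsh(A))      # [-1.  3.] -> indefinite: f(x, y) is jointly nonconvex
```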
Adding Random Noise
• Initialize iterates at (0, 0)
[Figure: iterate trajectories of A-GD vs. A-GD + random noise]
Perturbed Gradient Descent
• Perturbed gradient descent [Jin et al., 2017]
  For t = 1, ...
    Step 1: Gradient descent step
    Step 2: If the size of the gradient is small (near a saddle point), add a perturbation (to extract negative curvature)
    Step 3: If there is no decrease after the perturbation over t_th iterations, return
Perturbed Alternating Gradient Descent
Let z^{(t)} = [x^{(t)}; y^{(t)}]
Thresholds:
• g_th: gradient size
• f_th: objective value
• t_th: number of iterations
Input: z^{(1)}, η, r, g_th, f_th, t_th
For t = 1, ...
  Update x^{(t+1)} by A-GD
  If ‖∇_x f(x^{(t)}, y^{(t)})‖^2 + ‖∇_y f(x^{(t+1)}, y^{(t)})‖^2 ≤ g_th^2 and t − t_pert > t_th
    Add random perturbation to z^{(t)}
    Update x^{(t+1)} by A-GD
  EndIf
  Update y^{(t+1)} by A-GD
  If t − t_pert = t_th and f(z^{(t)}) − f(z̃^{(t_pert)}) > −f_th
    return z̃^{(t_pert)}
  EndIf
Perturbed Alternating Gradient Descent
• Add perturbation:
    z̃^{(t)} ← z^{(t)} and t_pert ← t
    z^{(t)} = z̃^{(t)} + ξ^{(t)}, where the random noise ξ^{(t)} follows a uniform distribution on the interval [0, r]
• t_th: the minimum number of iterations between adding two perturbations
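Putting the pieces together, here is a minimal runnable sketch of PA-GD (my own rendering of the pseudocode above, not the authors' implementation); the noise model and the point at which the stopping test evaluates the objective are simplifications.

```python
import numpy as np

def pagd(f, grad_x, grad_y, x, y, eta, r, g_th, f_th, t_th,
         max_iters=10_000, seed=0):
    """Sketch of PA-GD following the pseudocode above; f, grad_x, grad_y and
    all thresholds are supplied by the caller."""
    rng = np.random.default_rng(seed)
    t_pert = -np.inf                     # iteration at which noise was last added
    x_saved, y_saved = x, y              # iterate saved when the perturbation was added
    for t in range(1, max_iters + 1):
        gx = grad_x(x, y)
        x_new = x - eta * gx                          # A-GD step on the x block
        gy = grad_y(x_new, y)
        # Near a stationary point, and long enough after the last perturbation:
        if gx @ gx + gy @ gy <= g_th ** 2 and t - t_pert > t_th:
            x_saved, y_saved, t_pert = x.copy(), y.copy(), t
            x = x + rng.uniform(0.0, r, size=x.shape)     # perturb z = (x, y)
            y = y + rng.uniform(0.0, r, size=y.shape)
            x_new = x - eta * grad_x(x, y)                # redo the x update
        x = x_new
        y = y - eta * grad_y(x, y)                        # A-GD step on the y block
        # No sufficient decrease within t_th iterations after the perturbation:
        # stop and return the saved point (declared an approximate SOSP).
        if t - t_pert == t_th and f(x, y) - f(x_saved, y_saved) > -f_th:
            return x_saved, y_saved
    return x, y
```

The block gradients from the matrix factorization sketch earlier can be plugged in directly, with thresholds such as those listed for the experiment (g_th = ε/10, t_th = 10/ε^{1/2}, r = ε/10).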
Main Assumptions
A1. Function f: smooth with Lipschitz continuous gradient:
    ‖∇f(x) − ∇f(x′)‖ ≤ L ‖x − x′‖, ∀ x, x′
A2. Function f: smooth with block-wise Lipschitz continuous gradient:
    ‖∇_x f(x, y) − ∇_x f(x′, y)‖ ≤ L_x ‖x − x′‖, ∀ x, x′
    ‖∇_y f(x, y) − ∇_y f(x, y′)‖ ≤ L_y ‖y − y′‖, ∀ y, y′
    Further, let L_max := max{L_x, L_y} ≤ L.
A3. Function f has Lipschitz continuous Hessian:
    ‖∇²f(x) − ∇²f(x′)‖ ≤ ρ ‖x − x′‖, ∀ x, x′
Convergence Rate
Theorem 1. Under assumptions A1–A3, with step-size η ≤ 1/L_max, with high probability the iterates generated by PA-GD converge to an ε-SOSP (x, y) satisfying
    ‖∇f(x, y)‖ ≤ ε and λ_min(∇²f(x, y)) ≥ −√(ρε)
in the following number of iterations:
    Õ(1/ε^2)      (4)
where Õ hides a polylog(d) factor.
Convergence Analysis is Challenging (One Block)
• W.l.o.g. set x^{(1)} = 0
• The recursion of gradient descent (mean value theorem):
    x^{(t+1)} = x^{(t)} − η ∇_x f(x^{(t)})                                      (5)
              = x^{(t)} − η ∇_x f(0) − η ( ∫_0^1 ∇²f(θ x^{(t)}) dθ ) x^{(t)}     (6)
  where θ ∈ [0, 1]
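Step (6) is the identity ∇f(x) = ∇f(0) + (∫_0^1 ∇²f(θx) dθ) x. A small numerical check on a test function of my own choosing (not from the slides):

```python
import numpy as np

# Test function f(x) = sum(x_i^3) + 0.5 ||x||^2, with closed-form derivatives.
grad = lambda x: 3 * x ** 2 + x
hess = lambda x: np.diag(6 * x + 1)

x = np.array([0.3, -0.7, 1.1])
# Approximate the integral of the Hessian along the segment [0, x] by quadrature.
thetas = np.linspace(0.0, 1.0, 2001)
H_bar = np.mean([hess(th * x) for th in thetas], axis=0)
print(np.allclose(grad(x), grad(np.zeros(3)) + H_bar @ x, atol=1e-3))   # True
```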
Convergence Analysis is Challenging (Two Blocks)
• Recall z^{(t)} := [x^{(t)}; y^{(t)}] and w.l.o.g. set z^{(1)} = 0
• The recursion of A-GD (mean value theorem):
    z^{(t+1)} = [x^{(t+1)}; y^{(t+1)}] = [x^{(t)}; y^{(t)}] − η [∇_x f(x^{(t)}, y^{(t)}); ∇_y f(x^{(t+1)}, y^{(t)})]      (7)
              = z^{(t)} − η ∇f(0) − η ( ∫_0^1 H_u^{(t)} dθ ) z^{(t)} − η ( ∫_0^1 H_l^{(t)} dθ ) z^{(t+1)}                 (8)
  where θ ∈ [0, 1] and
    H_u^{(t)} := [ ∇²_{xx} f(θx^{(t)}, θy^{(t)}), ∇²_{xy} f(θx^{(t)}, θy^{(t)}); 0, ∇²_{yy} f(θx^{(t+1)}, θy^{(t)}) ]
    H_l^{(t)} := [ 0, 0; ∇²_{yx} f(θx^{(t+1)}, θy^{(t)}), 0 ]
Idea of Proof
• Let z* be a strict saddle point, H = ∇²f(z*), and z^{(1)} = 0.
• The dynamics of the perturbed gradient descent iterates:
    z^{(t+1)} = (I − ηH) z^{(t)} − η Δ^{(t)} z^{(t)} − η ∇f(0)                                      (9)
• The dynamics of the PA-GD iterates:
    z^{(t+1)} = M^{−1} T z^{(t)} − η M^{−1} Δ_u^{(t)} z^{(t)} − η M^{−1} Δ_l^{(t)} z^{(t+1)}         (10)
  where M := I + η H_l, T := I − η H_u, and
    H_u = [ ∇²_{xx} f(z*), ∇²_{xy} f(z*); 0, ∇²_{yy} f(z*) ]
    H_l = [ 0, 0; ∇²_{yx} f(z*), 0 ]