ICML 2019
PA-GD: On the Convergence of Perturbed Alternating Gradient Descent to Second-Order Stationary Points for Structured Nonconvex Optimization
Presenter: Songtao Lu, University of Minnesota Twin Cities
Joint work with Mingyi Hong and Zhengdao Wang
Co-authors
• Mingyi Hong, University of Minnesota
• Zhengdao Wang, Iowa State University
Agenda
• Motivation
  – A class of structured nonconvex problems
• What we plan to achieve
  – Random perturbation: convergence rate of alternating gradient descent (A-GD) to second-order stationary points (SOSPs) with high probability
• Numerical results
  – Two-layer linear neural networks
  – Matrix factorization
• Concluding remarks
Block Structured Nonconvex Optimization
• Consider the following problem P:
    minimize_{x, y} f(x, y)
• f(x, y): R^d → R is a smooth nonconvex function
  – x ∈ R^{d_x}
  – y ∈ R^{d_y}
  – d = d_x + d_y
Motivation: Nice Landscapes
• High dimensional problems: strict saddle points are common
• There are some nice/benign block structured problems [R. Ge et al., 2017; J. Lee et al., 2018]
  – All local minima are global minima
  – Saddle points: objective value very poor compared with local minima
  – Every saddle point is strict (the Hessian matrix has at least one negative eigenvalue)
[Figure: landscapes of minimize_{x ∈ R^{2×1}} ‖xx^T − M‖_F^2 with indefinite M, marking the point [0, 0]^T and labeling a local maximum, a strict saddle, and a local minimum]
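As a quick numerical illustration (my own sketch, not from the slides): for f(x) = ‖xx^T − M‖_F^2 the Hessian at the origin works out to −4M, so any M with a positive eigenvalue makes [0, 0]^T a strict saddle. The specific indefinite M below is an arbitrary choice for illustration.

```python
import numpy as np

# Toy example from the slide: f(x) = ||x x^T - M||_F^2 over x in R^2.
# The Hessian is 4(||x||^2 I + 2 x x^T - M), so at x = 0 it equals -4M:
# a positive eigenvalue of M gives the origin a negative-curvature direction.
M = np.array([[1.0, 0.0],
              [0.0, -1.0]])               # an indefinite M, chosen for illustration

def hessian(x):
    return 4.0 * (np.dot(x, x) * np.eye(2) + 2.0 * np.outer(x, x) - M)

print(np.linalg.eigvalsh(hessian(np.zeros(2))))   # [-4.  4.] -> strict saddle at 0
```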
Optimality Conditions
• Common definition of first-order stationary points (FOSPs):
  If ‖∇f(x, y)‖ ≤ ε, where ε > 0, then (x, y) is an ε-FOSP.
• Common definition of SOSPs:
  If ‖∇f(x, y)‖ ≤ ε and λ_min(∇²f(x, y)) ≥ −γ, where ε, γ > 0, then (x, y) is an (ε, γ)-SOSP.
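These two conditions are easy to check numerically when the gradient and Hessian are available. The snippet below is a minimal sketch (not from the paper); the test matrix and tolerances are illustrative assumptions.

```python
import numpy as np

def is_sosp(grad, hess, eps, gamma):
    """Check the (eps, gamma)-SOSP conditions at a point:
    gradient norm <= eps and smallest Hessian eigenvalue >= -gamma."""
    first_order = np.linalg.norm(grad) <= eps
    second_order = np.linalg.eigvalsh(hess).min() >= -gamma
    return first_order and second_order

# Example: a 2-D quadratic with an indefinite Hessian -> the origin is a
# first-order stationary point but not an SOSP (it is a strict saddle).
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])                 # eigenvalues 3 and -1
grad_at_origin = H @ np.zeros(2)
print(is_sosp(grad_at_origin, H, eps=1e-6, gamma=1e-3))   # False
```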
Literature
Algorithms with convergence guarantees to SOSPs:
• Second-order methods (one block)
  – Trust region method [Conn et al., 2000]
  – Cubic regularized Newton's method [Nesterov & Polyak, 2006]
  – Hybrid of first-order and second-order methods [Reddi et al., 2018]
• First-order methods (one block)
  – Perturbed gradient descent (PGD) [Jin et al., 2017]
  – Stochastic first-order method (NEgative-curvature-Originated-from-Noise, NEON) [Xu et al., 2017]
  – Neon2 (finding local minima via first-order oracles) [Allen-Zhu et al., 2017]
  – Accelerated methods [Carmon et al., 2016; Jin et al., 2018; Xu et al., 2018]
  – Many more
Literature
• Block structured nonconvex optimization (asymptotic):
  – Block coordinate descent (BCD) [Song et al., 2017; Lee et al., 2017]
  – Alternating direction method of multipliers (ADMM) [Hong et al., 2018]
• But none of these works has shown the convergence rate of block coordinate descent to SOSPs, even for the two-block case.
• Gradient descent can take an exponential number of iterations to escape saddle points [Du et al., 2017]
Motivation: Block Structured Nonconvex Problems
• Many problems naturally have block structure.
• We can obtain faster numerical convergence by leveraging the block structure of the problem.
Motivation: Block Structured Nonconvex Problems
• Matrix factorization [Jain et al., 2013]
    minimize_{X ∈ R^{n×k}, Y ∈ R^{m×k}} (1/2) ‖XY^T − M‖_F^2
• Matrix sensing [Sun et al., 2014]
    minimize_{X ∈ R^{n×k}, Y ∈ R^{m×k}} (1/2) ‖A(XY^T − M)‖_F^2
  A: linear measurement operator satisfying the restricted isometry property (RIP) condition
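For concreteness, here is a minimal numpy sketch (an illustration, not code from the paper) of the matrix factorization objective and its two block gradients, which are exactly the quantities an alternating method cycles over; the problem sizes and random data are arbitrary.

```python
import numpy as np

def mf_loss(X, Y, M):
    """f(X, Y) = 0.5 * ||X Y^T - M||_F^2."""
    return 0.5 * np.linalg.norm(X @ Y.T - M, 'fro') ** 2

def mf_grads(X, Y, M):
    """Block gradients of the matrix factorization objective."""
    R = X @ Y.T - M            # residual
    return R @ Y, R.T @ X      # grad_X, grad_Y

# Random instance (sizes are illustrative only)
rng = np.random.default_rng(0)
n, m, k = 30, 20, 5
M = rng.standard_normal((n, k)) @ rng.standard_normal((m, k)).T
X, Y = rng.standard_normal((n, k)), rng.standard_normal((m, k))
gX, gY = mf_grads(X, Y, M)
print(mf_loss(X, Y, M), np.linalg.norm(gX), np.linalg.norm(gY))
```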
Motivation of This Work
Can we solve these nice block structured nonconvex problems to SOSPs?
Alternating Gradient Descent
• Iterates of A-GD [Bertsekas, 1999]:
    x^{(t+1)} = x^{(t)} − η ∇_x f(x^{(t)}, y^{(t)})      (1)
    y^{(t+1)} = y^{(t)} − η ∇_y f(x^{(t+1)}, y^{(t)})     (2)
• Step-size: η ≤ 1/L_max
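A minimal sketch of iterations (1)-(2), assuming the caller supplies the two block gradients and a step size η ≤ 1/L_max; the toy objective in the usage example is my own.

```python
import numpy as np

def agd(x, y, grad_x, grad_y, eta, num_iters):
    """Alternating gradient descent: x is updated first, then y with the fresh x."""
    for _ in range(num_iters):
        x = x - eta * grad_x(x, y)      # eq. (1)
        y = y - eta * grad_y(x, y)      # eq. (2) uses the already-updated x
    return x, y

# Tiny usage example on f(x, y) = 0.5*(x - 1)^2 + 0.5*(y - x)^2
gx = lambda x, y: (x - 1.0) + (x - y)
gy = lambda x, y: (y - x)
x_star, y_star = agd(np.array([0.0]), np.array([0.0]), gx, gy, eta=0.4, num_iters=200)
print(x_star, y_star)    # both approach 1
```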
Motivation of Alternating Gradient Descent
• Example: minimize_{x_1, x_2} x^T M x, with
    M = [1, a; a, 1]
• Whole problem: L = 1 + a
• Block-wise: L_max = 1
[Figure: trajectories of GD vs. A-GD for a = 1000]
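To see the effect of the two constants, the sketch below (my own, assuming a 1/2 factor in the objective so that the gradient Lipschitz constants are L = 1 + a and L_max = 1 as stated) counts how many iterations GD with step 1/(1+a) and A-GD with step 1/L_max need to push the objective below a fixed negative level when started near the saddle at the origin.

```python
import numpy as np

a = 1000.0
M = np.array([[1.0, a],
              [a, 1.0]])
f = lambda x: 0.5 * x @ M @ x            # f = 0.5 x^T M x, saddle at the origin

def iters_to_escape(step, alternating, level=-1e6, max_iters=10_000):
    # start near the saddle, slightly along the negative-curvature direction (1, -1)
    x = np.array([1e-3, -1e-3])
    for t in range(1, max_iters + 1):
        if alternating:
            x[0] -= step * (M[0] @ x)    # block 1 update, then block 2 with fresh x[0]
            x[1] -= step * (M[1] @ x)
        else:
            x = x - step * (M @ x)       # full gradient step
        if f(x) < level:
            return t
    return max_iters

print("GD   (eta = 1/(1+a)):", iters_to_escape(1.0 / (1.0 + a), alternating=False))
print("A-GD (eta = 1/L_max):", iters_to_escape(1.0, alternating=True))
```

With a = 1000, A-GD drops below the level within a couple of iterations, while GD with its much smaller admissible step size needs noticeably more.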
Motivation of Alternating Gradient Descent
• A-GD:
  – numerically good
  – may take a long time to escape from saddle points
• PA-GD: numerically good, with convergence rate guarantees
Matrix Factorization
A two-layer linear neural network:
    minimize_{U ∈ R^{n×k}, V ∈ R^{m×k}} Σ_{i=1}^{l} ‖ỹ_i − UV^T x̃_i‖_2^2 = ‖Ỹ − UV^T X̃‖_F^2      (3)
• Ỹ and X̃: n = 100, m = 40, k = 20, l = 20, entries drawn from CN(0, 1)
[Figure: convergence comparison between GD and PA-GD (proposed) for learning a two-layer neural network, where ε = 10^{−10}, g_th = ε/10, t_th = 10/ε^{1/2}, r = ε/10]
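A sketch of how such a synthetic instance could be set up (my own assumptions: real standard-normal data with Ỹ ∈ R^{n×l} and X̃ ∈ R^{m×l}), together with objective (3) and its block gradients:

```python
import numpy as np

# Synthetic two-layer linear network instance with the dimensions stated above
# (n=100, m=40, k=20, l=20); real standard-normal data is an assumption here.
rng = np.random.default_rng(0)
n, m, k, l = 100, 40, 20, 20
Xt = rng.standard_normal((m, l))          # assumed shape: X tilde, m x l
Yt = rng.standard_normal((n, l))          # assumed shape: Y tilde, n x l

def loss(U, V):
    """Objective (3): ||Ytilde - U V^T Xtilde||_F^2."""
    return np.linalg.norm(Yt - U @ V.T @ Xt, 'fro') ** 2

def grads(U, V):
    R = Yt - U @ V.T @ Xt                 # residual, n x l
    return -2 * R @ Xt.T @ V, -2 * Xt @ R.T @ U   # grad_U (n x k), grad_V (m x k)

U, V = rng.standard_normal((n, k)), rng.standard_normal((m, k))
gU, gV = grads(U, V)
print(loss(U, V), gU.shape, gV.shape)
```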
Connection with Existing Works
Algorithm                            | Iterations    | (ε, γ)-SOSP
PGD [Jin et al., 2017]               | Õ(1/ε^2)      | (ε, ε^{1/2})
NEON+SGD [Xu and Yang, 2017]         | Õ(1/ε^4)      | (ε, ε^{1/2})
NEON2+SGD [Allen-Zhu and Li, 2017]   | Õ(1/ε^4)      | (ε, ε^{1/2})
NEON+ [Xu et al., 2017]              | Õ(1/ε^{7/4})  | (ε, ε^{1/2})
Accelerated PGD [Jin et al., 2018]   | Õ(1/ε^{7/4})  | (ε, ε^{1/2})
BCD [Song et al., 2017]              | N/A           | (0, 0)
BCD [Lee et al., 2017]               | N/A           | (0, 0)
PA-GD [This Work]                    | Õ(1/ε^2)      | (ε, ε^{1/2})
Convergence rates of algorithms to SOSPs using first-order information, where p ≥ 4.
Connection with Existing Works
                              | Asymptotic convergence to SOSPs       | Convergence rate to SOSPs
Gradient descent              | Lee et al., 2017                      | Jin et al., 2017
Alternating gradient descent  | Lee et al., 2017; Song et al., 2017   | This Work
Challenge of the Problem
• Variable coupling
• Consider a biconvex objective function
    f(x, y) = [x, y] [1, 2; 2, 1] [x; y] = x^2 + 4xy + y^2
• Block-wise: convex
• Whole problem: nonconvex!
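A two-line numerical check of this claim (my own sketch): the diagonal entries of the coupling matrix are positive, so the function is convex in each block alone, yet the matrix has a negative eigenvalue, so the joint problem is nonconvex.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])
print(np.diag(A))                 # [1. 1.]  -> convex in x alone and in y alone
print(np.linalg.eigvalsh(A))      # [-1.  3.] -> indefinite: f(x, y) is jointly nonconvex
```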
Adding Random Noise
• Initialize iterates at (0, 0)
[Figure: iterate trajectories of A-GD vs. A-GD + random noise]
Perturbed Gradient Descent
• Perturbed gradient descent [Jin et al., 2017]
  For t = 1, ...
    Step 1: Gradient descent step
    Step 2: If the size of the gradient is small (near a saddle point), add a perturbation (to extract negative curvature)
    Step 3: If there is no decrease after the perturbation over t_th iterations, return
Perturbed Alternating Gradient Descent
Let z^{(t)} = [x^{(t)}; y^{(t)}]
Thresholds:
• g_th: gradient size
• f_th: objective value
• t_th: number of iterations
Input: z^{(1)}, η, r, g_th, f_th, t_th
For t = 1, ...
  Update x^{(t+1)} by A-GD
  If ‖∇_x f(x^{(t)}, y^{(t)})‖^2 + ‖∇_y f(x^{(t+1)}, y^{(t)})‖^2 ≤ g_th^2 and t − t_pert > t_th
    Add random perturbation to z^{(t)}
    Update x^{(t+1)} by A-GD
  EndIf
  Update y^{(t+1)} by A-GD
  If t − t_pert = t_th and f(z^{(t)}) − f(z̃^{(t_pert)}) > −f_th
    return z̃^{(t_pert)}
  EndIf
Perturbed Alternating Gradient Descent
• Add perturbation:
    z̃^{(t)} ← z^{(t)} and t_pert ← t
    z^{(t)} = z̃^{(t)} + ξ^{(t)}, where the random noise ξ^{(t)} follows a uniform distribution on the interval [0, r]
• t_th: the minimum number of iterations between adding two perturbations
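Putting the pieces together, here is a minimal runnable sketch of PA-GD (my own rendering of the pseudocode above, not the authors' implementation); the noise model and the point at which the stopping test evaluates the objective are simplifications.

```python
import numpy as np

def pagd(f, grad_x, grad_y, x, y, eta, r, g_th, f_th, t_th,
         max_iters=10_000, seed=0):
    """Sketch of PA-GD following the pseudocode above; f, grad_x, grad_y and
    all thresholds are supplied by the caller."""
    rng = np.random.default_rng(seed)
    t_pert = -np.inf                     # iteration at which noise was last added
    x_saved, y_saved = x, y              # iterate saved when the perturbation was added
    for t in range(1, max_iters + 1):
        gx = grad_x(x, y)
        x_new = x - eta * gx                          # A-GD step on the x block
        gy = grad_y(x_new, y)
        # Near a stationary point, and long enough after the last perturbation:
        if gx @ gx + gy @ gy <= g_th ** 2 and t - t_pert > t_th:
            x_saved, y_saved, t_pert = x.copy(), y.copy(), t
            x = x + rng.uniform(0.0, r, size=x.shape)     # perturb z = (x, y)
            y = y + rng.uniform(0.0, r, size=y.shape)
            x_new = x - eta * grad_x(x, y)                # redo the x update
        x = x_new
        y = y - eta * grad_y(x, y)                        # A-GD step on the y block
        # No sufficient decrease within t_th iterations after the perturbation:
        # stop and return the saved point (declared an approximate SOSP).
        if t - t_pert == t_th and f(x, y) - f(x_saved, y_saved) > -f_th:
            return x_saved, y_saved
    return x, y
```

The block gradients from the matrix factorization sketch earlier can be plugged in directly, with thresholds such as those listed for the experiment (g_th = ε/10, t_th = 10/ε^{1/2}, r = ε/10).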
Main Assumptions
A1. Function f: smooth with Lipschitz continuous gradient:
    ‖∇f(x) − ∇f(x′)‖ ≤ L ‖x − x′‖, ∀ x, x′
A2. Function f: smooth with block-wise Lipschitz continuous gradient:
    ‖∇_x f(x, y) − ∇_x f(x′, y)‖ ≤ L_x ‖x − x′‖, ∀ x, x′
    ‖∇_y f(x, y) − ∇_y f(x, y′)‖ ≤ L_y ‖y − y′‖, ∀ y, y′
    Further, let L_max := max{L_x, L_y} ≤ L.
A3. Function f has Lipschitz continuous Hessian:
    ‖∇²f(x) − ∇²f(x′)‖ ≤ ρ ‖x − x′‖, ∀ x, x′
Convergence Rate
Theorem 1. Under assumptions A1–A3, with step-size η ≤ 1/L_max, with high probability the iterates generated by PA-GD converge to an ε-SOSP (x, y) satisfying
    ‖∇f(x, y)‖ ≤ ε and λ_min(∇²f(x, y)) ≥ −√(ρε)
in the following number of iterations:
    Õ(1/ε^2)      (4)
where Õ hides a polylog(d) factor.
Convergence Analysis is Challenging (One Block)
• W.l.o.g. set x^{(1)} = 0
• The recursion of gradient descent (mean value theorem):
    x^{(t+1)} = x^{(t)} − η ∇_x f(x^{(t)})                                      (5)
              = x^{(t)} − η ∇_x f(0) − η ( ∫_0^1 ∇²f(θ x^{(t)}) dθ ) x^{(t)}     (6)
  where θ ∈ [0, 1]
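Step (6) is the identity ∇f(x) = ∇f(0) + (∫_0^1 ∇²f(θx) dθ) x. A small numerical check on a test function of my own choosing (not from the slides):

```python
import numpy as np

# Test function f(x) = sum(x_i^3) + 0.5 ||x||^2, with closed-form derivatives.
grad = lambda x: 3 * x ** 2 + x
hess = lambda x: np.diag(6 * x + 1)

x = np.array([0.3, -0.7, 1.1])
# Approximate the integral of the Hessian along the segment [0, x] by quadrature.
thetas = np.linspace(0.0, 1.0, 2001)
H_bar = np.mean([hess(th * x) for th in thetas], axis=0)
print(np.allclose(grad(x), grad(np.zeros(3)) + H_bar @ x, atol=1e-3))   # True
```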
Convergence Analysis is Challenging (Two Blocks)
• Recall z^{(t)} := [x^{(t)}; y^{(t)}] and w.l.o.g. set z^{(1)} = 0
• The recursion of A-GD (mean value theorem):
    z^{(t+1)} = [x^{(t+1)}; y^{(t+1)}] = [x^{(t)}; y^{(t)}] − η [∇_x f(x^{(t)}, y^{(t)}); ∇_y f(x^{(t+1)}, y^{(t)})]      (7)
              = z^{(t)} − η ∇f(0) − η ( ∫_0^1 H_u^{(t)} dθ ) z^{(t)} − η ( ∫_0^1 H_l^{(t)} dθ ) z^{(t+1)}                 (8)
  where θ ∈ [0, 1] and
    H_u^{(t)} := [ ∇²_{xx} f(θx^{(t)}, θy^{(t)}), ∇²_{xy} f(θx^{(t)}, θy^{(t)}); 0, ∇²_{yy} f(θx^{(t+1)}, θy^{(t)}) ]
    H_l^{(t)} := [ 0, 0; ∇²_{yx} f(θx^{(t+1)}, θy^{(t)}), 0 ]
Idea of Proof
• Let z* be a strict saddle point, H = ∇²f(z*), and z^{(1)} = 0.
• The dynamics of the perturbed gradient descent iterates:
    z^{(t+1)} = (I − ηH) z^{(t)} − η Δ^{(t)} z^{(t)} − η ∇f(0)                                      (9)
• The dynamics of the PA-GD iterates:
    z^{(t+1)} = M^{−1} T z^{(t)} − η M^{−1} Δ_u^{(t)} z^{(t)} − η M^{−1} Δ_l^{(t)} z^{(t+1)}         (10)
  where M := I + η H_l, T := I − η H_u, and
    H_u = [ ∇²_{xx} f(z*), ∇²_{xy} f(z*); 0, ∇²_{yy} f(z*) ]
    H_l = [ 0, 0; ∇²_{yx} f(z*), 0 ]