Alternating Minimizations Converge to Second-order Optimal Solutions
Qiuwei Li¹, joint work with Zhihui Zhu² and Gongguo Tang¹
¹Colorado School of Mines  ²Johns Hopkins University
Why is Alternating Minimization so popular?

Many optimization problems have variables with natural partitions:

minimize_{x,y} f(x, y)

Example (matrix completion): min_{X,Y} ∥XY^⊤ − M⋆∥²_Ω, the squared error over the observed entries Ω.

[Figure: factor-graph illustration of XY^⊤ with row factors u_k(i), column factors v_k(j), and observed entries Ω]

Applications: dictionary learning, nonnegative MF, matrix sensing/completion, tensor decomposition, blind deconvolution, EM algorithm, games, ...
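As a concrete instance of this bi-convex structure, here is a minimal sketch of the matrix-completion objective f(X, Y) = ∥XY^⊤ − M⋆∥²_Ω; the sizes n1, n2, r and the mask standing in for Ω are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 30, 20, 3                         # illustrative problem sizes
M_star = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
mask = rng.random((n1, n2)) < 0.5             # observed entries (stand-in for Omega)

def f(X, Y):
    """Squared error over observed entries: convex in X for fixed Y, and vice versa."""
    residual = (X @ Y.T - M_star) * mask
    return np.sum(residual ** 2)
```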
Why is Alternating Minimization so popular?

AltMin: y_k = argmin_y f(x_{k−1}, y);  x_k = argmin_x f(x, y_k)

Advantages
✤ Simple to implement: no stepsize tuning
✤ Good empirical performance

Disadvantages
❖ No global optimality guarantee for general problems
❖ Only first-order convergence guarantees exist

Our Approach: provide a second-order convergence guarantee to partially resolve the issue of "no global optimality guarantee".
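A minimal sketch of the update rule above, with the exact partial minimizations approximated by an off-the-shelf solver; the helper name alt_min and the tiny test objective are illustrative, not from the talk:

```python
import numpy as np
from scipy.optimize import minimize

def alt_min(f, x0, y0, iters=50):
    """y_k = argmin_y f(x_{k-1}, y); x_k = argmin_x f(x, y_k), each solved numerically."""
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    for _ in range(iters):
        y = minimize(lambda v: f(x, v), y).x   # y-update with x held fixed
        x = minimize(lambda u: f(u, y), x).x   # x-update with y held fixed
    return x, y

# Tiny bi-convex test problem: f(x, y) = (x*y - 1)^2 + 0.1*(x^2 + y^2).
x_hat, y_hat = alt_min(lambda x, y: (x[0] * y[0] - 1) ** 2 + 0.1 * (x[0] ** 2 + y[0] ** 2),
                       [1.0], [0.5])
```

Each sweep only solves the two convex subproblems, which is where the "no stepsize tuning" advantage comes from.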
Theorem 1. Assume f is strongly bi-convex with a full-rank cross Hessian at all strict saddles. Then AltMin almost surely converges to a 2nd-order stationary point from random initialization.
Why is second-order convergence enough?

For many problems, all saddles are strict (there is a direction of negative curvature) and there are no spurious local minima (all local minima are globally optimal). Hence:

2nd-order optimal solution = globally optimal solution

Matrix factorization [1], matrix sensing [2], matrix completion [3], dictionary learning [4], blind deconvolution [5], tensor decomposition [6]

[1] Jain et al. Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot.
[2] Bhojanapalli et al. Global Optimality of Local Search for Low Rank Matrix Recovery.
[3] Ge et al. Matrix Completion Has No Spurious Local Minimum.
[4] Sun et al. Complete Dictionary Recovery over the Sphere.
[5] Zhang et al. On the Global Geometry of Sphere-Constrained Sparse Blind Deconvolution.
[6] Ge et al. Online Stochastic Gradient for Tensor Decomposition.
1st-order convergence + avoiding strict saddles = 2nd-order convergence

It suffices to show that alternating minimization avoids strict saddles!
How to show avoiding strict saddles?

A Key Result
Lee et al. [1, 2] use the Stable Manifold Theorem [3] to show that iterations defined by a global diffeomorphism avoid unstable fixed points.

An Improved Version (Zero-Property Theorem [4] + Max-Rank Theorem [5])
This work relaxes the global-diffeomorphism condition: a local diffeomorphism (at all unstable fixed points) already avoids unstable fixed points.

General Recipe (illustrated numerically below)
(1) Construct the algorithm mapping g and show it is a local diffeomorphism (i.e., show Dg is nonsingular);
(2) Show all strict saddles of f are unstable fixed points of g.

[1] Lee et al. Gradient Descent Converges to Minimizers.
[2] Lee et al. First-order Methods Almost Always Avoid Saddle Points.
[3] Shub. Global Stability of Dynamical Systems.
[4] Ponomarev et al. Submersions and Preimages of Sets of Measure Zero.
[5] Bamber and van Santen. How Many Parameters Can a Model Have and Still Be Testable?
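The recipe can be checked numerically on a toy problem. The sketch below (all names and constants are illustrative) takes f(x, y) = ½(a x² + 2c xy + b y²), which is strongly bi-convex for a, b > 0 and has a strict saddle at the origin whenever c² > ab, and verifies that the AltMin map g satisfies |Dg(0)| > 1 there:

```python
import numpy as np
from scipy.optimize import minimize_scalar

a, b, c = 1.0, 1.0, 2.0                        # a, b > 0 and c^2 > a*b: strict saddle at 0
f = lambda x, y: 0.5 * (a * x**2 + 2 * c * x * y + b * y**2)

def g(x):
    """One AltMin sweep x -> psi(phi(x)), inner argmins solved numerically."""
    y = minimize_scalar(lambda v: f(x, v)).x     # phi(x) = argmin_y f(x, y)
    return minimize_scalar(lambda u: f(u, y)).x  # psi(y) = argmin_x f(x, y)

# Step (1): Dg at the fixed point x* = 0 via a central finite difference.
eps = 1e-4
Dg = (g(eps) - g(-eps)) / (2 * eps)              # closed form gives c^2 / (a*b) = 4

# Step (2): the strict saddle (0, 0) is an unstable fixed point of g.
H = np.array([[a, c], [c, b]])                   # Hessian of f at the origin
print("min eigenvalue of Hessian:", np.linalg.eigvalsh(H).min())  # negative: strict saddle
print("|Dg(0)| =", abs(Dg), "> 1:", abs(Dg) > 1)                   # unstable fixed point
```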
A Proof Sketch

Construct the mapping:
y_k = φ(x_{k−1}) = argmin_y f(x_{k−1}, y)
x_k = ψ(y_k) = argmin_x f(x, y_k)
⟹ x_k = g(x_{k−1}) ≐ ψ(φ(x_{k−1}))

Compute the Jacobian (implicit function theorem + chain rule):
Dg(x⋆) ∼ (∇²_x f(x⋆,y⋆)^{−1/2} ∇²_{xy} f(x⋆,y⋆) ∇²_y f(x⋆,y⋆)^{−1/2}) (∇²_y f(x⋆,y⋆)^{−1/2} ∇²_{yx} f(x⋆,y⋆) ∇²_x f(x⋆,y⋆)^{−1/2}) = LL^⊤, where L ≐ ∇²_x f(x⋆,y⋆)^{−1/2} ∇²_{xy} f(x⋆,y⋆) ∇²_y f(x⋆,y⋆)^{−1/2}.

Show all strict saddles are "unstable" (connect Dg with a Schur complement of the Hessian):
∇²f(x⋆,y⋆) = diag(∇²_x f(x⋆,y⋆)^{1/2}, ∇²_y f(x⋆,y⋆)^{1/2}) · Φ · diag(∇²_x f(x⋆,y⋆)^{1/2}, ∇²_y f(x⋆,y⋆)^{1/2}), with Φ ≐ [I_n, L; L^⊤, I_m].

Finally, by a Schur complement theorem:
∇²f(x⋆,y⋆) ⋡ 0 (i.e., has a negative eigenvalue) ⟺ Φ ⋡ 0 ⟺ Φ/I_m ≐ I_n − LL^⊤ ⋡ 0 ⟺ ∥L∥ > 1 ⟺ ρ(Dg(x⋆)) > 1. □
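This chain of equivalences can be sanity-checked numerically on a block quadratic f(x, y) = ½[x; y]^⊤ H [x; y] with H = [A, C; C^⊤, B], for which ∇²_x f = A, ∇²_y f = B, ∇²_{xy} f = C, and the AltMin map is linear with Dg = A^{−1}CB^{−1}C^⊤. A sketch under these assumptions (the specific blocks below are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

n, m = 4, 3
A = np.diag([2.0, 3.0, 4.0, 5.0])          # nabla_x^2 f at (x*, y*), positive definite
B = np.diag([2.0, 3.0, 4.0])               # nabla_y^2 f at (x*, y*), positive definite
C = 4.0 * np.eye(n, m)                     # cross Hessian nabla_xy^2 f (full rank)

# Jacobian of the AltMin map and the matrix L from the proof sketch.
Dg = np.linalg.solve(A, C) @ np.linalg.solve(B, C.T)      # A^{-1} C B^{-1} C^T
Ah, Bh = np.real(sqrtm(A)), np.real(sqrtm(B))
L = np.linalg.solve(Ah, C) @ np.linalg.inv(Bh)            # A^{-1/2} C B^{-1/2}

H = np.block([[A, C], [C.T, B]])                          # full Hessian of f
strict_saddle = np.linalg.eigvalsh(H).min() < 0           # Hessian not PSD
schur_neg = np.linalg.eigvalsh(np.eye(n) - L @ L.T).min() < 0  # Phi/I = I - L L^T not PSD
big_norm = np.linalg.norm(L, 2) > 1                       # ||L|| > 1
unstable = np.max(np.abs(np.linalg.eigvals(Dg))) > 1      # rho(Dg) > 1

print(strict_saddle, schur_neg, big_norm, unstable)       # the four conditions coincide (all True here)
```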
Proximal Alternating Minimization

minimize_{x,y} f(x, y)

Proximal AltMin:
x_k = argmin_x f(x, y_{k−1}) + (λ/2)∥x − x_{k−1}∥²₂
y_k = argmin_y f(x_k, y) + (λ/2)∥y − y_{k−1}∥²₂

Key Assumption (Lipschitz bi-smoothness): max{∥∇²_x f(x, y)∥, ∥∇²_y f(x, y)∥} ≤ L for all x, y

Theorem 2. Assume f is L-Lipschitz bi-smooth and λ > L. Then Proximal AltMin almost surely converges to a 2nd-order stationary point from random initialization.
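A minimal sketch of the proximal variant, with the regularized subproblems again handed to a generic solver; the helper name and the choice of λ are illustrative, and Theorem 2 requires λ > L for the actual guarantee:

```python
import numpy as np
from scipy.optimize import minimize

def prox_alt_min(f, x0, y0, lam, iters=50):
    """x_k = argmin_x f(x, y_{k-1}) + (lam/2)||x - x_{k-1}||^2, then the symmetric y-update."""
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    for _ in range(iters):
        x = minimize(lambda u: f(u, y) + 0.5 * lam * np.sum((u - x) ** 2), x).x
        y = minimize(lambda v: f(x, v) + 0.5 * lam * np.sum((v - y) ** 2), y).x
    return x, y

# Example on the same toy objective as before; lam is picked for illustration only.
x_hat, y_hat = prox_alt_min(lambda x, y: (x[0] * y[0] - 1) ** 2 + 0.1 * (x[0] ** 2 + y[0] ** 2),
                            [1.0], [0.5], lam=1.0)
```

With λ > L each proximal subproblem is strongly convex (since ∥∇²_x f∥, ∥∇²_y f∥ ≤ L), which plays the role that strong bi-convexity plays in Theorem 1.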