The Power of Nonconvex Optimization in Solving Random Quadratic Systems of Equations
Yuxin Chen (Princeton), Emmanuel Candès (Stanford)
Y. Chen and E. J. Candès, Communications on Pure and Applied Mathematics, vol. 70, no. 5, pp. 822-883, May 2017
Agenda
1. The power of nonconvex optimization in solving random quadratic systems of equations (Aug. 28)
2. Random initialization and implicit regularization in nonconvex statistical estimation (Aug. 29)
3. The projected power method: an efficient nonconvex algorithm for joint discrete assignment from pairwise data (Sep. 3)
4. Spectral method meets asymmetry: two recent stories (Sep. 4)
5. Inference and uncertainty quantification for noisy matrix completion (Sep. 5)
Focus: nonconvex optimization for (high-dimensional) statistics
Nonconvex problems are everywhere
Maximum likelihood estimation is usually nonconvex:
    maximize_x  ℓ(x; data)   subject to  x ∈ S
where ℓ may be nonconcave and the constraint set S may be nonconvex.
Examples:
• low-rank matrix completion
• robust principal component analysis
• graph clustering
• dictionary learning
• blind deconvolution
• learning neural nets
• ...
Nonconvex optimization may be super scary
There may be bumps everywhere and exponentially many local optima, e.g. for a 1-layer neural net (Auer, Herbster, Warmuth '96; Vu '98).
Example: solving quadratic programs is hard
Finding a maximum cut in a graph amounts to
    maximize_x  x^⊤ W x   subject to  x_i^2 = 1,  i = 1, …, n
[Figure: illustration of a graph cut; fig credit: Coding Horror]
One strategy: convex relaxation
Can relax into convex problems by
• finding convex surrogates (e.g. compressed sensing, matrix completion)
• lifting into higher dimensions (e.g. Max-Cut)
Example of convex surrogate: low-rank matrix completion
— Fazel '02; Recht, Parrilo, Fazel '10; Candès, Recht '09
    minimize_M  rank(M)   subject to data constraints
        ↓ convex surrogate
    minimize_M  ‖M‖_nuc   subject to data constraints
A robust variation is used every day by Netflix.
Problem: operates in the full matrix space even though the target matrix is low-rank.
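To make the surrogate concrete, here is a minimal sketch (not code from the talk) of nuclear-norm minimization for matrix completion, written with the cvxpy modeling library; the problem size, the random rank-2 ground truth, and the sampling mask are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 30, 2
M_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r ground truth
mask = rng.random((n, n)) < 0.5                                      # which entries are observed

M = cp.Variable((n, n))
objective = cp.Minimize(cp.normNuc(M))                               # convex surrogate for rank(M)
constraints = [M[i, j] == M_true[i, j]                               # "data constraints"
               for i, j in zip(*np.nonzero(mask))]
cp.Problem(objective, constraints).solve()

print("relative error:", np.linalg.norm(M.value - M_true) / np.linalg.norm(M_true))
```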
Example of lifting: Max-Cut — Goemans, Williamson '95
    maximize_x  x^⊤ W x   subject to  x_i^2 = 1,  i = 1, …, n
Let X = xx^⊤:
    maximize_X  ⟨X, W⟩   subject to  X_ii = 1, i = 1, …, n;  X ⪰ 0;  rank(X) = 1
Problem: explosion in dimensions (ℝ^n → ℝ^{n×n}).
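As an illustration of the lifting step (again not code from the talk), the sketch below drops the rank-1 constraint, solves the resulting SDP with cvxpy, and rounds the solution back to a sign vector with a random hyperplane in the spirit of Goemans-Williamson; the random weight matrix and problem size are assumptions.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n = 12
W = rng.random((n, n))
W = (W + W.T) / 2                      # symmetric nonnegative weights
np.fill_diagonal(W, 0)

X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Maximize(cp.trace(W @ X)),     # <X, W>
                  [cp.diag(X) == 1, X >> 0])        # X_ii = 1, X PSD; rank(X)=1 dropped
prob.solve()

# round the SDP solution back to x in {-1, +1}^n via a random hyperplane
L = np.linalg.cholesky(X.value + 1e-6 * np.eye(n))  # X = L L^T, rows of L are the v_i
x = np.sign(L @ rng.standard_normal(n))
```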
How about optimizing nonconvex problems directly without lifting?
A case study: solving random quadratic systems of equations
Solving quadratic systems of equations
    y = |Ax|^2
[Figure: a small numerical example of A, x, and y = |Ax|^2]
Solve for x ∈ ℂ^n from m quadratic equations
    y_k ≈ |⟨a_k, x⟩|^2,   k = 1, …, m
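A minimal numpy sketch of this measurement model; the dimensions and the real-valued setting are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 100, 600                  # n unknowns, m equations (m a few times n)
x = rng.standard_normal(n)       # ground-truth signal (real case for simplicity)
A = rng.standard_normal((m, n))  # rows a_k^T are the sensing vectors
y = (A @ x) ** 2                 # y_k = |<a_k, x>|^2,  k = 1, ..., m
```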
Motivation: a missing phase problem in imaging science
Detectors record intensities of diffracted rays:
• x(t_1, t_2) → Fourier transform x̂(f_1, f_2)
• intensity of electrical field: |x̂(f_1, f_2)|^2 = |∫∫ x(t_1, t_2) e^{-i2π(f_1 t_1 + f_2 t_2)} dt_1 dt_2|^2
Phase retrieval: recover the true signal x(t_1, t_2) from intensity measurements.
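For intuition, here is a small numpy sketch of phaseless Fourier measurements on a discretized image; the 64×64 random test image is an assumption, and np.fft.fft2 stands in for the continuous transform.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.random((64, 64))                    # discretized object x(t1, t2)
intensity = np.abs(np.fft.fft2(x)) ** 2     # detector records |x_hat(f1, f2)|^2 only
# the Fourier phase is lost; phase retrieval asks to recover x from `intensity`
```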
Motivation: latent variable models
Example: mixture of regression
• Samples {(y_k, x_k)} are drawn from one of two unknown regressors β and −β:
    y_k ≈ ⟨x_k, β⟩ with prob. 0.5,   y_k ≈ ⟨x_k, −β⟩ else   (labels: latent variables)
  — equivalent to observing |y_k|^2 ≈ |⟨x_k, β⟩|^2
• Goal: estimate β
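A small simulation sketch of this model; the dimensions, noise level, and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 20, 500
beta = rng.standard_normal(n)
X = rng.standard_normal((m, n))              # covariates x_k as rows
signs = rng.choice([1.0, -1.0], size=m)      # latent labels: +beta or -beta
y = signs * (X @ beta) + 0.01 * rng.standard_normal(m)
# squaring removes the latent sign: y**2 is (approximately) (X @ beta)**2
```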
Motivation: learning neural nets with quadratic activation
— Soltanolkotabi, Javanmard, Lee '17; Li, Ma, Zhang '17
• input features: a;  hidden-layer weights: X = [x_1, …, x_r];  activation: σ(z) = z^2
• output: y = Σ_{i=1}^r σ(a^⊤ x_i) = Σ_{i=1}^r (a^⊤ x_i)^2
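A one-line numpy rendering of this network's forward pass; the dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 10, 3
X = rng.standard_normal((n, r))   # hidden-layer weights x_1, ..., x_r as columns
a = rng.standard_normal(n)        # input features
y = np.sum((a @ X) ** 2)          # output: sum_i sigma(a^T x_i) with sigma(z) = z^2
```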
An equivalent view: low-rank factorization
Lifting: introduce X = xx^∗ to linearize the constraints:
    y_k = |a_k^∗ x|^2 = a_k^∗ (xx^∗) a_k   ⟹   y_k = a_k^∗ X a_k
    find X ⪰ 0   s.t.  y_k = a_k^∗ X a_k,  k = 1, …, m;   rank(X) = 1
Works well if {a_k} are random, but huge increase in dimensions.
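A minimal sketch of the lifted program, again with cvxpy; the trace-minimization objective (a common convex proxy once the rank constraint is dropped, as in PhaseLift) and the small real-valued instance are assumptions beyond the slide.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
n, m = 15, 90
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x_true) ** 2

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0] + [A[k] @ X @ A[k] == y[k] for k in range(m)]  # a_k^T X a_k = y_k
cp.Problem(cp.Minimize(cp.trace(X)), constraints).solve()

# if X is (numerically) rank one, its top eigenvector recovers x up to global sign
w, V = np.linalg.eigh(X.value)
x_hat = np.sqrt(max(w[-1], 0.0)) * V[:, -1]
```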
Prior art (before our work)
    y = |Ax|^2,  A ∈ ℝ^{m×n};  n: # unknowns;  m: sample size (# eqns)
Sample complexity vs. computational cost:
• convex relaxation: sample complexity ≍ n, but computationally infeasible for large problems
• Wirtinger flow: sample complexity ≍ n log n, computational cost ≍ mn^2
• alt-min (fresh samples at each iteration): sample complexity ≍ n log^3 n, computational cost ≍ mn^2
A glimpse of our results
Our algorithm achieves both low sample complexity and low computational cost.
This work: random quadratic systems are solvable in linear time!
✓ minimal sample size   ✓ optimal statistical accuracy
A first impulse: the maximum likelihood estimate
    maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
• Gaussian data:  y_k ∼ |a_k^∗ x|^2 + N(0, σ^2),   ℓ_k(z) = −( y_k − |a_k^∗ z|^2 )^2
• Poisson data:  y_k ∼ Poisson(|a_k^∗ x|^2),   ℓ_k(z) = −|a_k^∗ z|^2 + y_k log |a_k^∗ z|^2
Problem: −ℓ is nonconvex, with many local stationary points.
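To make the Gaussian-data objective concrete, here is a minimal sketch of plain gradient descent on −ℓ (equivalently, ascent on ℓ) in the real-valued case; the random initialization, step size, and iteration count are illustrative assumptions, and in general a careful initialization and step-size choice are needed for this nonconvex problem.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 50, 400
x = rng.standard_normal(n)                  # ground truth
A = rng.standard_normal((m, n))             # sensing vectors a_k as rows
y = (A @ x) ** 2                            # noiseless measurements y_k = (a_k^T x)^2

def grad_neg_loglik(z):
    """Gradient of (1/m) * sum_k (y_k - (a_k^T z)^2)^2, i.e. of -ell(z)."""
    Az = A @ z
    return (4.0 / m) * (A.T @ ((Az ** 2 - y) * Az))

scale = y.mean()                            # ~ ||x||^2, since E|a_k^T x|^2 = ||x||^2
z = rng.standard_normal(n)                  # naive random initialization
for _ in range(2000):                       # plain gradient descent on -ell
    z -= (0.02 / scale) * grad_neg_loglik(z)

# x is identifiable only up to a global sign; no convergence guarantee is claimed here
err = min(np.linalg.norm(z - x), np.linalg.norm(z + x)) / np.linalg.norm(x)
print("relative error (up to sign):", err)
```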