The Power of Nonconvex Optimization in Solving Random Quadratic Systems of Equations


  1. The Power of Nonconvex Optimization in Solving Random Quadratic Systems of Equations. Yuxin Chen (Princeton), Emmanuel Candès (Stanford). Reference: Y. Chen and E. J. Candès, Communications on Pure and Applied Mathematics, vol. 70, no. 5, pp. 822-883, May 2017.

  2. Agenda
     1. The power of nonconvex optimization in solving random quadratic systems of equations (Aug. 28)
     2. Random initialization and implicit regularization in nonconvex statistical estimation (Aug. 29)
     3. The projected power method: an efficient nonconvex algorithm for joint discrete assignment from pairwise data (Sep. 3)
     4. Spectral methods meets asymmetry: two recent stories (Sep. 4)
     5. Inference and uncertainty quantification for noisy matrix completion (Sep. 5)

  3. Nonconvex optimization and (high-dimensional) statistics

  4. Nonconvex problems are everywhere. Maximum likelihood estimation is usually nonconvex:
        maximize_x   ℓ(x; data)     (objective may be nonconcave)
        subject to   x ∈ S          (constraint set may be nonconvex)

  5. Nonconvex problems are everywhere. Maximum likelihood estimation is usually nonconvex:
        maximize_x   ℓ(x; data)     (objective may be nonconcave)
        subject to   x ∈ S          (constraint set may be nonconvex)
     • low-rank matrix completion
     • robust principal component analysis
     • graph clustering
     • dictionary learning
     • blind deconvolution
     • learning neural nets
     • ...

  6. Nonconvex optimization may be super scary: there may be bumps everywhere and exponentially many local optima, e.g. for a 1-layer neural net (Auer, Herbster, Warmuth ’96; Vu ’98).

  7. Example: solving quadratic programs is hard. Finding a maximum cut in a graph amounts to
        maximize_x   x^⊤ W x
        subject to   x_i^2 = 1,   i = 1, ..., n
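A brute-force illustration in Python (not from the slides) of why this combinatorial problem is hard: enumerating every feasible x ∈ {−1, +1}^n takes 2^n objective evaluations. The weight matrix below is a made-up example.

    import itertools
    import numpy as np

    # Hypothetical symmetric weight matrix of a small 4-node graph.
    W = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)
    n = W.shape[0]

    # Brute force: try every x in {-1, +1}^n, i.e. 2^n candidates.
    best_val, best_x = -np.inf, None
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        x = np.array(signs)
        val = x @ W @ x            # objective x^T W x; x_i^2 = 1 holds by construction
        if val > best_val:
            best_val, best_x = val, x

    print(best_x, best_val)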

  8. Example: solving quadratic programs is hard. (figure omitted; credit: Coding Horror)

  9. One strategy: convex relaxation. Can relax into convex problems by
     • finding convex surrogates (e.g. compressed sensing, matrix completion)
     • lifting into higher dimensions (e.g. Max-Cut)

  10. Example of convex surrogate: low-rank matrix completion — Fazel ’02; Recht, Parrilo, Fazel ’10; Candès, Recht ’09
        minimize_M   rank(M)        subject to data constraints
        ⇓ convex surrogate
        minimize_M   nuc-norm(M)    subject to data constraints

  11. Example of convex surrogate: low-rank matrix completion — Fazel ’02; Recht, Parrilo, Fazel ’10; Candès, Recht ’09
        minimize_M   rank(M)        subject to data constraints
        ⇓ convex surrogate
        minimize_M   nuc-norm(M)    subject to data constraints
      A robust variant is used every day by Netflix.

  12. Example of convex surrogate: low-rank matrix completion — Fazel ’02; Recht, Parrilo, Fazel ’10; Candès, Recht ’09
        minimize_M   rank(M)        subject to data constraints
        ⇓ convex surrogate
        minimize_M   nuc-norm(M)    subject to data constraints
      A robust variant is used every day by Netflix.
      Problem: it operates in the full matrix space even though the target matrix is low-rank.
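To make the surrogate concrete, here is a minimal sketch in Python/cvxpy (not from the slides) of nuclear-norm minimization under exact observation constraints; the matrix size, rank, and sampling rate are made-up choices.

    import numpy as np
    import cvxpy as cp

    # Hypothetical setup: a 20x20 rank-2 ground truth with roughly 40% of entries observed.
    rng = np.random.default_rng(0)
    n, r = 20, 2
    M_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    mask = (rng.random((n, n)) < 0.4).astype(float)   # 1 = observed entry, 0 = missing

    # Convex surrogate: minimize the nuclear norm while matching the observed entries.
    M = cp.Variable((n, n))
    objective = cp.Minimize(cp.normNuc(M))
    constraints = [cp.multiply(mask, M) == mask * M_true]
    cp.Problem(objective, constraints).solve()

    print("relative error:", np.linalg.norm(M.value - M_true) / np.linalg.norm(M_true))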

  13. Example of lifting: Max-Cut — Goemans, Williamson ’95
        maximize_x   x^⊤ W x
        subject to   x_i^2 = 1,   i = 1, ..., n

  14. Example of lifting: Max-Cut — Goemans, Williamson ’95
        maximize_x   x^⊤ W x     subject to   x_i^2 = 1,   i = 1, ..., n
      Let X = xx^⊤:
        maximize_X   ⟨X, W⟩
        subject to   X_{i,i} = 1,   i = 1, ..., n
                     X ⪰ 0,   rank(X) = 1

  16. Example of lifting: Max-Cut — Goemans, Williamson ’95
        maximize_x   x^⊤ W x     subject to   x_i^2 = 1,   i = 1, ..., n
      Let X = xx^⊤:
        maximize_X   ⟨X, W⟩
        subject to   X_{i,i} = 1,   i = 1, ..., n
                     X ⪰ 0,   rank(X) = 1
      Problem: explosion in dimensions (ℝ^n → ℝ^{n×n}).
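A minimal sketch in Python/cvxpy (not from the slides): dropping the nonconvex rank(X) = 1 constraint from the lifted program gives the Goemans-Williamson semidefinite relaxation; the weight matrix here is made up.

    import numpy as np
    import cvxpy as cp

    # Hypothetical symmetric weight matrix for a 5-node graph.
    rng = np.random.default_rng(1)
    A = rng.random((5, 5))
    W = (A + A.T) / 2
    np.fill_diagonal(W, 0.0)

    # Lifted variable X plays the role of xx^T; without rank(X) = 1 the problem is a convex SDP.
    X = cp.Variable((5, 5), symmetric=True)
    constraints = [cp.diag(X) == 1, X >> 0]        # X_ii = 1 and X positive semidefinite
    prob = cp.Problem(cp.Maximize(cp.trace(W @ X)), constraints)
    prob.solve()

    print("SDP value:", prob.value)   # an upper bound on the rank-1 (combinatorial) optimum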

  17. How about optimizing nonconvex problems directly without lifting?

  18. A case study: solving random quadratic systems of equations

  19. Solving quadratic systems of equations: y = |Ax|^2. (figure: a small numerical example showing x, A, Ax, and the entrywise squares |Ax|^2) Solve for x ∈ ℂ^n from m quadratic equations: y_k ≈ |⟨a_k, x⟩|^2, k = 1, ..., m.
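A minimal sketch in Python (not from the slides) of this measurement model with complex Gaussian sensing vectors; the dimensions are made up.

    import numpy as np

    # Hypothetical sizes: n unknowns, m quadratic measurements.
    rng = np.random.default_rng(0)
    n, m = 8, 50

    x = rng.standard_normal(n) + 1j * rng.standard_normal(n)       # unknown signal in C^n
    A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)

    # Phaseless measurements: y_k = |<a_k, x>|^2, k = 1, ..., m.
    y = np.abs(A @ x) ** 2

    # Global phase ambiguity: e^{i phi} x produces exactly the same measurements.
    assert np.allclose(np.abs(A @ (np.exp(1j * 0.7) * x)) ** 2, y)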

  20. Motivation: a missing phase problem in imaging science. Detectors record intensities of diffracted rays:
        x(t_1, t_2)  →  Fourier transform x̂(f_1, f_2)
        intensity of electrical field:  |x̂(f_1, f_2)|^2 = | ∫ x(t_1, t_2) e^{-i 2π (f_1 t_1 + f_2 t_2)} dt_1 dt_2 |^2
      Phase retrieval: recover the true signal x(t_1, t_2) from intensity measurements.
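A discrete analogue in Python (not from the slides): with a 2-D DFT, only squared Fourier magnitudes are recorded, and the lost phase creates ambiguities; the image is a made-up array.

    import numpy as np

    # Hypothetical discrete image.
    rng = np.random.default_rng(0)
    img = rng.random((16, 16))

    # What the detector records: squared magnitudes of the 2-D Fourier transform.
    intensity = np.abs(np.fft.fft2(img)) ** 2

    # The phase is lost: the circularly reversed image has exactly the same magnitudes
    # (a classic trivial ambiguity for real-valued images).
    reversed_img = np.roll(np.flip(img), shift=(1, 1), axis=(0, 1))
    assert np.allclose(np.abs(np.fft.fft2(reversed_img)) ** 2, intensity)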

  21. Motivation: latent variable models. Example: mixture of regressions.
      Samples {(y_k, x_k)} are drawn from one of two unknown regressors β and −β:
        y_k ≈ ⟨x_k, β⟩    with prob. 0.5
        y_k ≈ ⟨x_k, −β⟩   else          (labels: latent variables)

  22. Motivation: latent variable models. Example: mixture of regressions.
      Samples {(y_k, x_k)} are drawn from one of two unknown regressors β and −β:
        y_k ≈ ⟨x_k, β⟩    with prob. 0.5
        y_k ≈ ⟨x_k, −β⟩   else          (labels: latent variables)
      — equivalent to observing |y_k|^2 ≈ |⟨x_k, β⟩|^2
      • Goal: estimate β
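A minimal sketch in Python (not from the slides) of this reduction: squaring the responses removes the latent sign and leaves a quadratic system in β; the dimensions are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 10, 200
    beta = rng.standard_normal(n)
    X = rng.standard_normal((m, n))

    # Each sample comes from beta or -beta with probability 1/2 (latent labels).
    signs = rng.choice([-1.0, 1.0], size=m)
    y = signs * (X @ beta)

    # Squaring eliminates the latent sign: |y_k|^2 = |<x_k, beta>|^2 (exactly, in the noiseless case).
    assert np.allclose(y ** 2, (X @ beta) ** 2)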

  23. Motivation: learning neural nets with quadratic activation — Soltanolkotabi, Javanmard, Lee ’17; Li, Ma, Zhang ’17. (figure: a one-hidden-layer network with input layer, hidden layer, and output layer)
      Input features: a;  weights: X = [x_1, ..., x_r];  activation σ(z) = z^2
        output:  y = Σ_{i=1}^r σ(a^⊤ x_i) = Σ_{i=1}^r (a^⊤ x_i)^2
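A minimal sketch in Python (not from the slides) of this forward pass, showing that the output is a quadratic form a^⊤ (XX^⊤) a, i.e. a quadratic measurement of a rank-r matrix; the sizes are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 6, 3                        # input dimension, number of hidden units
    X = rng.standard_normal((d, r))    # weight matrix with columns x_1, ..., x_r
    a = rng.standard_normal(d)         # input features

    # Quadratic activation sigma(z) = z^2, summed over the r hidden units.
    y = np.sum((a @ X) ** 2)

    # Equivalently y = a^T (X X^T) a, a quadratic measurement of the rank-r matrix X X^T.
    assert np.isclose(y, a @ (X @ X.T) @ a)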

  24. An equivalent view: low-rank factorization. Lifting: introduce X = xx^* to linearize the constraints:
        y_k = |a_k^* x|^2 = a_k^* (xx^*) a_k   ⟹   y_k = a_k^* X a_k

  25. An equivalent view: low-rank factorization. Lifting: introduce X = xx^* to linearize the constraints:
        y_k = |a_k^* x|^2 = a_k^* (xx^*) a_k   ⟹   y_k = a_k^* X a_k
        find    X ⪰ 0
        s.t.    y_k = a_k^* X a_k,   k = 1, ..., m
                rank(X) = 1

  27. An equivalent view: low-rank factorization. Lifting: introduce X = xx^* to linearize the constraints:
        y_k = |a_k^* x|^2 = a_k^* (xx^*) a_k   ⟹   y_k = a_k^* X a_k
        find    X ⪰ 0
        s.t.    y_k = a_k^* X a_k,   k = 1, ..., m
                rank(X) = 1
      Works well if {a_k} are random, but huge increase in dimensions.
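A quick numerical check in Python (not from the slides) that the lifted linear constraints in X reproduce the original quadratic constraints in x; the sizes are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 5, 12
    x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

    # Quadratic measurements of x ...
    y = np.abs(A @ x) ** 2

    # ... equal linear measurements of the lifted rank-1 matrix X = x x^*.
    X = np.outer(x, x.conj())
    y_lifted = np.real(np.einsum("ki,ij,kj->k", A.conj(), X, A))   # y_k = a_k^* X a_k

    assert np.allclose(y, y_lifted)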

  28.–32. Prior art (before our work): y = |Ax|^2, A ∈ ℝ^{m×n}; n: # unknowns; m: sample size (# eqns).
      (figure, built up across these slides: each method is placed on a plot of sample complexity vs. computational cost; the regions below about n samples and below about mn flops are marked infeasible)
      • convex relaxation: on the order of n samples, but computationally very expensive
      • Wirtinger flow: about n log n samples, about mn^2 flops
      • alternating minimization (fresh samples at each iteration): about n log^3 n samples

  33. A glimpse of our results: (same plot, now with our algorithm added at about n samples and about mn flops) This work: random quadratic systems are solvable in linear time!

  34. A glimpse of our results: This work: random quadratic systems are solvable in linear time!
      ✓ minimal sample size
      ✓ optimal statistical accuracy

  35. A first impulse: maximum likelihood estimate
        maximize_z   ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)

  36. A first impulse: maximum likelihood estimate
        maximize_z   ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
      • Gaussian data:  y_k ~ |a_k^* x|^2 + N(0, σ^2),    ℓ_k(z) = −( y_k − |a_k^* z|^2 )^2

  37. A first impulse: maximum likelihood estimate
        maximize_z   ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
      • Gaussian data:  y_k ~ |a_k^* x|^2 + N(0, σ^2),    ℓ_k(z) = −( y_k − |a_k^* z|^2 )^2
      • Poisson data:   y_k ~ Poisson(|a_k^* x|^2),       ℓ_k(z) = −|a_k^* z|^2 + y_k log |a_k^* z|^2

  38. A first impulse: maximum likelihood estimate
        maximize_z   ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
      • Gaussian data:  y_k ~ |a_k^* x|^2 + N(0, σ^2),    ℓ_k(z) = −( y_k − |a_k^* z|^2 )^2
      • Poisson data:   y_k ~ Poisson(|a_k^* x|^2),       ℓ_k(z) = −|a_k^* z|^2 + y_k log |a_k^* z|^2
      Problem: −ℓ is nonconvex and has many local stationary points.
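A minimal sketch in Python (not from the slides) of these two per-sample log-likelihoods, written as functions one could probe numerically; the data-generation settings are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, sigma = 5, 40, 0.1
    x = rng.standard_normal(n)
    A = rng.standard_normal((m, n))
    y_gauss = (A @ x) ** 2 + sigma * rng.standard_normal(m)    # Gaussian model
    y_pois = rng.poisson((A @ x) ** 2).astype(float)           # Poisson model

    def loglik_gaussian(z):
        # l(z) = -(1/m) sum_k (y_k - |a_k^T z|^2)^2
        return -np.mean((y_gauss - (A @ z) ** 2) ** 2)

    def loglik_poisson(z):
        # l(z) = (1/m) sum_k ( -|a_k^T z|^2 + y_k log |a_k^T z|^2 )
        q = (A @ z) ** 2
        return np.mean(-q + y_pois * np.log(q))

    # Both objectives are invariant under z -> -z, so the maximizer is never unique
    # (the +/- x ambiguity); the same value is attained at x and -x.
    print(loglik_gaussian(x), loglik_gaussian(-x))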
