Implicit Regularization in Nonconvex Statistical Estimation
Yuxin Chen (Electrical Engineering, Princeton University), with Cong Ma (Princeton ORFE), Kaizheng Wang (Princeton ORFE), and Yuejie Chi (CMU ECE)


  1. Implicit Regularization in Nonconvex Statistical Estimation. Yuxin Chen, Electrical Engineering, Princeton University

  2. Cong Ma (Princeton ORFE), Kaizheng Wang (Princeton ORFE), Yuejie Chi (CMU ECE)

  3. Nonconvex estimation problems are everywhere. Empirical risk minimization is usually nonconvex:
      minimize_x  ℓ(x; y)    ← may be nonconvex
      subject to  x ∈ S      ← may be nonconvex

  4. Nonconvex estimation problems are everywhere. Empirical risk minimization is usually nonconvex:
      minimize_x  ℓ(x; y)    ← may be nonconvex
      subject to  x ∈ S      ← may be nonconvex
      Examples: low-rank matrix completion, graph clustering, dictionary learning, mixture models, deep learning, ...

  5. Nonconvex optimization may be super scary: there may be bumps everywhere and exponentially many local optima, e.g. 1-layer neural nets (Auer, Herbster, Warmuth ’96; Vu ’98).

  7. ... but is sometimes much nicer than we think: under certain statistical models, we see benign global geometry with no spurious local optima. (Fig credit: Sun, Qu & Wright)

  8. ... but is sometimes much nicer than we think: statistical models → benign landscape → exploit geometry → efficient algorithms

  9. Optimization-based methods: two-stage approach. [Figure: an initial guess x⁰ computed from the data lands in a basin of attraction around the truth.] • Start from an appropriate initial point

  10. Optimization-based methods: two-stage approach. [Figure: the initial guess x⁰ lands in a basin of attraction around the truth, and the subsequent iterates x¹, x², ... stay inside it.] • Start from an appropriate initial point • Proceed via some iterative optimization algorithm

  11. Roles of regularization • Prevents overfitting and improves generalization ◦ e.g. ℓ₁ penalization, SCAD, nuclear norm penalization, ...

  12. Roles of regularization • Prevents overfitting and improves generalization ◦ e.g. ℓ₁ penalization, SCAD, nuclear norm penalization, ... • Improves computation by stabilizing search directions ◦ e.g. trimming, projection, regularized loss

  13. Roles of regularization • Prevents overfitting and improves generalization ◦ e.g. ℓ₁ penalization, SCAD, nuclear norm penalization, ... • Improves computation by stabilizing search directions ← focus of this talk ◦ e.g. trimming, projection, regularized loss

  14. 3 representative nonconvex problems: phase retrieval, matrix completion, blind deconvolution

  15. Regularized methods. Phase retrieval: trimming; matrix completion: regularized cost + projection; blind deconvolution: regularized cost + projection

  16. Regularized vs. unregularized methods.
      phase retrieval:      regularized = trimming;                       unregularized = suboptimal comput. cost
      matrix completion:    regularized = regularized cost + projection;  unregularized = ?
      blind deconvolution:  regularized = regularized cost + projection;  unregularized = ?

  17. Regularized vs. unregularized methods.
      phase retrieval:      regularized = trimming;                       unregularized = suboptimal comput. cost
      matrix completion:    regularized = regularized cost + projection;  unregularized = ?
      blind deconvolution:  regularized = regularized cost + projection;  unregularized = ?
      Are unregularized methods suboptimal for nonconvex estimation?

  18. Missing phase problem. Detectors record intensities of diffracted rays: the electric field x(t₁, t₂) is mapped to its Fourier transform x̂(f₁, f₂). (Fig credit: Stanford SLAC) The detector measures the intensity of the electric field,
      |x̂(f₁, f₂)|² = |∬ x(t₁, t₂) e^{−i2π(f₁t₁ + f₂t₂)} dt₁ dt₂|²

  19. Missing phase problem. Detectors record intensities of diffracted rays: the electric field x(t₁, t₂) is mapped to its Fourier transform x̂(f₁, f₂). (Fig credit: Stanford SLAC) The detector measures the intensity of the electric field,
      |x̂(f₁, f₂)|² = |∬ x(t₁, t₂) e^{−i2π(f₁t₁ + f₂t₂)} dt₁ dt₂|²
      Phase retrieval: recover the signal x(t₁, t₂) from the intensity |x̂(f₁, f₂)|².
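
A minimal numpy sketch of this measurement model (my own illustrative discrete analogue, not from the talk): the detector keeps only the squared magnitude of the Fourier transform, so the phase information is lost.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))          # a toy 2-D object x(t1, t2)

X_hat = np.fft.fft2(x)                     # discrete analogue of the Fourier transform x_hat(f1, f2)
intensity = np.abs(X_hat) ** 2             # what the detector records: |x_hat(f1, f2)|^2

# The phase of X_hat has been discarded, so naively inverting the magnitudes does not recover x.
x_naive = np.fft.ifft2(np.sqrt(intensity)).real
print(np.allclose(x_naive, x))             # False: recovering x from `intensity` is phase retrieval
```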

  20. Solving quadratic systems of equations: y = |Ax|². [Figure: a small numerical example of the map x ↦ |Ax|².] Recover x♮ ∈ ℝⁿ from m random quadratic measurements
      yₖ = |aₖ⊤x♮|²,   k = 1, ..., m.
      Assume w.l.o.g. ‖x♮‖₂ = 1.
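
A hedged numpy sketch of this Gaussian-design measurement model (the sizes n, m and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 1000                          # illustrative dimensions

x_true = rng.standard_normal(n)
x_true /= np.linalg.norm(x_true)          # w.l.o.g. ||x_true||_2 = 1

A = rng.standard_normal((m, n))           # design matrix with rows a_k ~ N(0, I_n)
y = (A @ x_true) ** 2                     # y_k = (a_k^T x_true)^2: phaseless measurements
```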

  21. Wirtinger flow (Candès, Li, Soltanolkotabi ’14). Empirical risk minimization:
      minimize_x  f(x) = (1/4m) Σ_{k=1}^m ((aₖ⊤x)² − yₖ)²

  22. Wirtinger flow (Candès, Li, Soltanolkotabi ’14). Empirical risk minimization:
      minimize_x  f(x) = (1/4m) Σ_{k=1}^m ((aₖ⊤x)² − yₖ)²
      • Initialization by spectral method
      • Gradient iterations: for t = 0, 1, ...:  x^{t+1} = x^t − η ∇f(x^t)
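
The two steps above translate almost line by line into code. Here is a minimal real-valued sketch (illustrative only: the step size, iteration count, and the crude norm estimate sqrt(mean(y)) are heuristic choices, not the tuning from Candès, Li & Soltanolkotabi ’14):

```python
import numpy as np

def wirtinger_flow(A, y, iters=2000, eta=0.1):
    """Spectral initialization followed by plain gradient descent (real-valued sketch)."""
    m, n = A.shape

    # Spectral initialization: leading eigenvector of Y = (1/m) sum_k y_k a_k a_k^T,
    # rescaled with sqrt(mean(y)) as a crude estimate of ||x_true||_2.
    Y = (A.T * y) @ A / m
    _, eigvecs = np.linalg.eigh(Y)
    x = np.sqrt(np.mean(y)) * eigvecs[:, -1]

    # Gradient iterations on f(x) = (1/4m) * sum_k ((a_k^T x)^2 - y_k)^2.
    for _ in range(iters):
        Ax = A @ x
        grad = A.T @ ((Ax ** 2 - y) * Ax) / m
        x = x - eta * grad
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 100, 1000
    x_true = rng.standard_normal(n)
    x_true /= np.linalg.norm(x_true)
    A = rng.standard_normal((m, n))
    y = (A @ x_true) ** 2

    x_hat = wirtinger_flow(A, y)
    # The signal is identifiable only up to a global sign.
    err = min(np.linalg.norm(x_hat - x_true), np.linalg.norm(x_hat + x_true))
    print("estimation error:", err)
```

The spectral initializer works because E[(1/m) Σₖ yₖ aₖaₖ⊤] = I + 2x♮x♮⊤, whose leading eigenvector is ±x♮.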

  23. Gradient descent theory revisited. Two standard conditions that enable geometric convergence of GD

  24. Gradient descent theory revisited. Two standard conditions that enable geometric convergence of GD: • (local) restricted strong convexity (or regularity condition)

  25. Gradient descent theory revisited. Two standard conditions that enable geometric convergence of GD: • (local) restricted strong convexity (or regularity condition) • (local) smoothness, i.e. ∇²f(x) ≻ 0 and well-conditioned

  26. Gradient descent theory revisited. f is said to be α-strongly convex and β-smooth if
      0 ≺ αI ⪯ ∇²f(x) ⪯ βI,  ∀x.
      ℓ₂ error contraction: GD with η = 1/β obeys
      ‖x^{t+1} − x♮‖₂ ≤ (1 − α/β) ‖x^{t} − x♮‖₂
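
A quick numerical sanity check of this contraction on a toy quadratic, where the constant Hessian H has spectrum in [α, β] (my own illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 50, 1.0, 10.0

# f(x) = 0.5 * (x - x_star)^T H (x - x_star): alpha-strongly convex and beta-smooth.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(np.linspace(alpha, beta, n)) @ Q.T
x_star = rng.standard_normal(n)

x = rng.standard_normal(n)
eta = 1.0 / beta
for _ in range(10):
    prev = np.linalg.norm(x - x_star)
    x = x - eta * (H @ (x - x_star))          # gradient step with step size 1/beta
    ratio = np.linalg.norm(x - x_star) / prev
    assert ratio <= 1 - alpha / beta + 1e-10  # per-step contraction factor at most 1 - alpha/beta
```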

  27. Gradient descent theory revisited.
      ‖x^{t+1} − x♮‖₂ ≤ (1 − α/β) ‖x^{t} − x♮‖₂
      [Figure: the iterates contract toward x♮ within the region of local strong convexity + smoothness.]

  31. Gradient descent theory revisited.
      0 ≺ αI ⪯ ∇²f(x) ⪯ βI,  ∀x
      ℓ₂ error contraction: GD with η = 1/β obeys
      ‖x^{t+1} − x♮‖₂ ≤ (1 − α/β) ‖x^{t} − x♮‖₂
      • Condition number β/α determines the rate of convergence

  32. Gradient descent theory revisited.
      0 ≺ αI ⪯ ∇²f(x) ⪯ βI,  ∀x
      ℓ₂ error contraction: GD with η = 1/β obeys
      ‖x^{t+1} − x♮‖₂ ≤ (1 − α/β) ‖x^{t} − x♮‖₂
      • Condition number β/α determines the rate of convergence
      • Attains ε-accuracy within O((β/α) log(1/ε)) iterations
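
For completeness, the iteration count follows by iterating the contraction and using 1 − u ≤ e^{−u}:

```latex
\|x^{T}-x^{\natural}\|_{2}
  \;\le\; \Bigl(1-\tfrac{\alpha}{\beta}\Bigr)^{T}\|x^{0}-x^{\natural}\|_{2}
  \;\le\; e^{-T\alpha/\beta}\,\|x^{0}-x^{\natural}\|_{2},
\qquad\text{so}\quad
T \;\ge\; \frac{\beta}{\alpha}\log\frac{1}{\varepsilon}
\;\;\Longrightarrow\;\;
\|x^{T}-x^{\natural}\|_{2} \;\le\; \varepsilon\,\|x^{0}-x^{\natural}\|_{2}.
```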

  33. What does this optimization theory say about WF? i.i.d. Gaussian designs: aₖ ~ N(0, Iₙ), 1 ≤ k ≤ m

  34. What does this optimization theory say about WF? i.i.d. Gaussian designs: aₖ ~ N(0, Iₙ), 1 ≤ k ≤ m.
      Population level (infinite samples):
      E[∇²f(x)] = 3(‖x‖₂² I + 2xx⊤) − (‖x♮‖₂² I + 2x♮x♮⊤)
      which is locally positive definite and well-conditioned.
      Consequence: WF converges within O(log(1/ε)) iterations if m → ∞.
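
The population Hessian displayed above comes from differentiating the empirical risk twice and applying the Gaussian moment identity E[(a⊤x)² aa⊤] = ‖x‖₂² I + 2xx⊤ for a ~ N(0, Iₙ):

```latex
\nabla^{2} f(x) \;=\; \frac{1}{m}\sum_{k=1}^{m}\Bigl[\,3\bigl(a_{k}^{\top}x\bigr)^{2}-y_{k}\Bigr]a_{k}a_{k}^{\top},
\qquad
\mathbb{E}\bigl[\nabla^{2} f(x)\bigr]
  \;=\; 3\bigl(\|x\|_{2}^{2}\,I + 2xx^{\top}\bigr)
  \;-\; \bigl(\|x^{\natural}\|_{2}^{2}\,I + 2x^{\natural}x^{\natural\top}\bigr).
```

At x = x♮ with ‖x♮‖₂ = 1 this equals 2I + 4x♮x♮⊤, whose eigenvalues lie in [2, 6], so the population condition number is only 3.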

  35. What does this optimization theory say about WF? i.i.d. Gaussian designs: aₖ ~ N(0, Iₙ), 1 ≤ k ≤ m.
      Finite-sample level (m ≍ n log n): ∇²f(x) ≻ 0

  36. What does this optimization theory say about WF? i.i.d. Gaussian designs: aₖ ~ N(0, Iₙ), 1 ≤ k ≤ m.
      Finite-sample level (m ≍ n log n): ∇²f(x) ≻ 0, but ill-conditioned (even locally): condition number ≍ n

  37. What does this optimization theory say about WF? i.i.d. Gaussian designs: aₖ ~ N(0, Iₙ), 1 ≤ k ≤ m.
      Finite-sample level (m ≍ n log n): ∇²f(x) ≻ 0, but ill-conditioned (even locally): condition number ≍ n.
      Consequence (Candès et al ’14): WF attains ε-accuracy within O(n log(1/ε)) iterations if m ≍ n log n.
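
The ill-conditioning is a worst-case phenomenon: design vectors have norm ‖aₖ‖₂ ≈ √n, so at points aligned with some aₖ the curvature along aₖ blows up, even though the curvature at the truth along a random direction stays near its population value 2. The sketch below is my own illustrative experiment (the constant 3 in the sample size and the 0.5 offset are arbitrary), not a computation from the talk:

```python
import numpy as np

def directional_curvature(A, y, x, u):
    """u^T (Hessian of f at x) u, for f(x) = (1/4m) sum_k ((a_k^T x)^2 - y_k)^2 and ||u||_2 = 1."""
    m = A.shape[0]
    w = 3 * (A @ x) ** 2 - y                    # per-measurement Hessian weights 3(a_k^T x)^2 - y_k
    return np.sum(w * (A @ u) ** 2) / m

rng = np.random.default_rng(0)
for n in [100, 300, 900]:
    m = int(3 * n * np.log(n))                  # sample size on the order of n log n
    x_true = rng.standard_normal(n)
    x_true /= np.linalg.norm(x_true)
    A = rng.standard_normal((m, n))
    y = (A @ x_true) ** 2

    u_rand = rng.standard_normal(n)
    u_rand /= np.linalg.norm(u_rand)
    benign = directional_curvature(A, y, x_true, u_rand)   # stays near 2, matching the population Hessian

    a1 = A[0] / np.linalg.norm(A[0])
    x_adv = x_true + 0.5 * a1                   # still a constant distance from x_true
    spiky = directional_curvature(A, y, x_adv, a1)          # grows roughly like n / log n

    print(f"n={n:4d}   curvature at truth = {benign:6.2f}   worst-case-style curvature = {spiky:8.1f}")
```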

  38. What does this optimization theory say about WF? i.i.d. Gaussian designs: aₖ ~ N(0, Iₙ), 1 ≤ k ≤ m.
      Finite-sample level (m ≍ n log n): ∇²f(x) ≻ 0, but ill-conditioned (even locally): condition number ≍ n.
      Consequence (Candès et al ’14): WF attains ε-accuracy within O(n log(1/ε)) iterations if m ≍ n log n.
      Too slow ... can we accelerate it?

  39. One solution: truncated WF (Chen, Candès ’15). Regularize / trim gradient components to accelerate convergence.
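
To make the trimming idea concrete, here is a simplified sketch of a trimmed gradient. It is not the algorithm of Chen & Candès ’15, which truncates the gradient of a Poisson log-likelihood with carefully calibrated constants; this version applies an analogous rule to the quadratic loss used earlier, and the thresholds alpha_ub and alpha_h are arbitrary illustrative values:

```python
import numpy as np

def trimmed_gradient(A, y, x, alpha_ub=5.0, alpha_h=6.0):
    """Gradient of f(x) = (1/4m) sum_k ((a_k^T x)^2 - y_k)^2, summed only over 'regular' terms."""
    m = A.shape[0]
    Ax = A @ x
    residual = Ax ** 2 - y
    norm_x = np.linalg.norm(x)

    # Drop measurements whose a_k is too aligned with the current iterate, or whose residual
    # is abnormally large compared with the average residual.
    keep = (np.abs(Ax) <= alpha_ub * norm_x) & \
           (np.abs(residual) <= alpha_h * np.mean(np.abs(residual)) * np.abs(Ax) / norm_x)

    return A[keep].T @ (residual[keep] * Ax[keep]) / m
```

Discarding the few spiky components caps the effective smoothness along the trajectory, which is what allows a much more aggressive step size than the worst-case η = O(1/n).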

  40. But wait a minute ... WF converges in O(n) iterations

  41. But wait a minute ... WF converges in O(n) iterations. Step size taken to be ηₜ = O(1/n).

  42. But wait a minute ... WF converges in O(n) iterations. Step size taken to be ηₜ = O(1/n). This choice is suggested by generic optimization theory: with smoothness β ≍ n, the safe step size is η = 1/β ≍ 1/n.
