  1. Nonconvex Demixing from Bilinear Measurements (Yuanming Shi)

  2. Outline
  - Motivations
    - Blind deconvolution meets blind demixing
  - Two vignettes:
    - A: Implicitly regularized Wirtinger flow
      - Why nonconvex optimization?
      - Implicitly regularized Wirtinger flow
    - B: Matrix optimization over manifolds
      - Why manifold optimization?
      - Riemannian optimization for blind demixing

  3. Motivations: Blind deconvolution meets blind demixing

  4. Blind deconvolution
  - In many science and engineering problems, the observed signal can be modeled as y = f ⊛ g, where ⊛ is the convolution operator, f is a physical signal of interest, and g is the impulse response of the sensory system.
  - Applications: astronomy, neuroscience, image processing, computer vision, wireless communications, microscopy data processing, ...
  - Blind deconvolution: estimate f and g given only y.
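
A minimal NumPy sketch of this forward model (assuming circular convolution; the signal length, kernel support, and variable names are illustrative, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128

f = rng.standard_normal(n)                         # physical signal of interest
g = np.zeros(n)
g[:8] = rng.standard_normal(8)                     # short impulse response of the sensor

# Observation: circular convolution y = f ⊛ g, computed via the FFT
y = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

# Blind deconvolution asks for f and g given only y (plus structural priors).
```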

  5. Image deblurring
  - Blurred images due to camera shake can be modeled as a convolution of the latent sharp image and a kernel capturing the motion of the camera. (Fig. credit: Chi; panels: natural image, kernel)
  - How to find the high-resolution image and the blurring kernel simultaneously?

  6. Microscopy data analysis
  - Defects: the electronic structure of the material is contaminated by randomly and sparsely distributed "defects". (Fig.: doped graphene; credit: Wright)
  - How to determine the locations and characteristic signatures of the defects?

  7. Blind demixing
  - The received measurement consists of the sum of all convolved signals: y = f_1 ⊛ g_1 + ... + f_s ⊛ g_s. (Figs.: convolutional dictionary learning (multi-kernel); low-latency communication for IoT)
  - Applications: IoT, dictionary learning, neural spike sorting, ...
  - Blind demixing: estimate all pairs {f_i, g_i} given only y. (A toy forward model is sketched below.)
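
Extending the single-convolution sketch above, a toy version of the demixed observation (again assuming circular convolutions and illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 128, 3                                      # signal length, number of components

F = rng.standard_normal((s, n))                    # signals of interest f_1, ..., f_s
G = np.zeros((s, n))
G[:, :8] = rng.standard_normal((s, 8))             # short impulse responses g_1, ..., g_s

# Received measurement: sum of all convolved signals (circular convolution via FFT)
y = sum(np.fft.ifft(np.fft.fft(F[i]) * np.fft.fft(G[i])).real for i in range(s))

# Blind demixing asks for every pair (F[i], G[i]) given only y.
```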

  8. Convolutional dictionary learning
  - The observed signal is the superposition of several convolutions. (Fig. credit: Wright; panels: experiment on synthetic image, experiment on microscopy image)
  - How to recover multiple kernels and the corresponding activation signals?

  9. Low-latency communications for IoT
  - Packet structure: metadata (preamble (PA) and header (H)) plus data. (Figs.: long data packet in current wireless systems; short data packet in IoT)
  - Proposal: transmitters just send overhead-free signals, and the receiver can still extract the information.
  - How to detect data without channel estimation in multi-user environments?

  10. Demixing from a bilinear model?

  11. Bilinear model
  - Translate into the frequency domain...
  - Subspace assumptions: f_i and g_i lie in some known low-dimensional subspaces, f_i = B h_i and g_i = A_i x_i, where B is a partial Fourier basis.
  - Demixing from bilinear measurements: y_j = sum_i b_j^H h_i x_i^H a_ij, for j = 1, ..., m.
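
A small NumPy sketch of generating such bilinear measurements (a stand-in design: i.i.d. complex Gaussian a_ij and a random orthonormal B in place of a partial Fourier basis; all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
s, K, N, m = 3, 10, 10, 200        # number of users, subspace dims, sample size

# Ground-truth subspace coefficients h_i, x_i for each of the s components
h = [rng.standard_normal(K) + 1j * rng.standard_normal(K) for _ in range(s)]
x = [rng.standard_normal(N) + 1j * rng.standard_normal(N) for _ in range(s)]

# Known designs: rows b_j of B (orthonormal stand-in for a partial Fourier basis)
# and sampling vectors a_ij (i.i.d. complex Gaussian)
B = np.linalg.qr(rng.standard_normal((m, K)))[0]
A = rng.standard_normal((s, m, N)) + 1j * rng.standard_normal((s, m, N))

# Bilinear measurements: y_j = sum_i (b_j^H h_i) * (x_i^H a_ij)
y = sum((B.conj() @ h[i]) * (A[i] @ np.conj(x[i])) for i in range(s))
```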

  12. An equivalent view: low-rank factorization
  - Lifting: introduce the rank-one matrices M_i = h_i x_i^H to linearize the bilinear constraints.
  - This yields a low-rank matrix optimization problem.
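
In symbols, the lifting step can be written as follows (a sketch of the standard lifted model using the notation reconstructed above, with the inner product ⟨B, M⟩ = tr(B^H M); it is not copied verbatim from the slide):

```latex
y_j \;=\; \sum_{i=1}^{s} \boldsymbol{b}_j^{\mathsf H}\boldsymbol{h}_i\,\boldsymbol{x}_i^{\mathsf H}\boldsymbol{a}_{ij}
     \;=\; \sum_{i=1}^{s} \big\langle \boldsymbol{b}_j\boldsymbol{a}_{ij}^{\mathsf H},\; \boldsymbol{M}_i \big\rangle,
\qquad \boldsymbol{M}_i = \boldsymbol{h}_i\boldsymbol{x}_i^{\mathsf H},\quad \operatorname{rank}(\boldsymbol{M}_i)=1 .
```

Each measurement is now linear in the lifted variables M_i; the bilinear structure has been traded for a rank-one constraint.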

  13. Convex relaxation
  - Ling and Strohmer (TIT'2017) proposed to solve the nuclear norm minimization problem: minimize sum_i ||M_i||_* subject to the linear measurement constraints (B: partial Fourier basis).
  - Sample-efficient: exact recovery with sufficiently many samples, provided h_i is incoherent w.r.t. B.
  - Computationally expensive: an SDP in the lifted space.
  - Can we solve the nonconvex matrix optimization problem directly?
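
A hedged CVXPY sketch of the lifted nuclear-norm program, for a single component and a real-valued stand-in design (variable names and sizes are illustrative and not from the slides):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
K, N, m = 8, 8, 120                      # illustrative sizes

# Real-valued stand-in design and a rank-one ground truth M* = h x^T
B = rng.standard_normal((m, K))
A = rng.standard_normal((m, N))
h, x = rng.standard_normal(K), rng.standard_normal(N)
y = (B @ h) * (A @ x)                    # y_j = <b_j a_j^T, h x^T> = (b_j^T h)(a_j^T x)

M = cp.Variable((K, N))
measurements = cp.sum(cp.multiply(B @ M, A), axis=1)   # j-th entry: b_j^T M a_j
prob = cp.Problem(cp.Minimize(cp.norm(M, "nuc")), [measurements == y])
prob.solve()                             # SDP in the lifted K-by-N space
```

The point of the slide is the cost of this route: the optimization variable lives in the lifted matrix space, so the SDP becomes expensive as K and N grow, motivating the nonconvex approach that follows.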

  14. Vignette A: Implicitly regularized Wirtinger flow

  15. Why nonconvex optimization?

  16. Nonconvex problems are everywhere
  - Empirical risk minimization is usually nonconvex:
    - low-rank matrix completion
    - blind deconvolution/demixing
    - dictionary learning
    - phase retrieval
    - mixture models
    - deep learning
    - ...

  17. Nonconvex optimization may be super scary
  - Challenges: saddle points, local optima, bumps, ... (Fig. credit: Chen)
  - Fact: they are usually solved on a daily basis via simple algorithms like (stochastic) gradient descent.

  18. Statistical models come to the rescue
  - Blessing: when data are generated by certain statistical models, problems are often much nicer than worst-case instances. (Fig. credit: Chen)

  19. First-order stationary points
  - Saddle points and local minima. (Fig. panels: saddle points / local maxima; local minima)

  20. First-order stationary points
  - Applications: PCA, matrix completion, dictionary learning, etc.
  - Local minima: either all local minima are global minima, or all local minima are as good as global minima.
  - Saddle points: can be very poor compared to global minima, and there may be many of them.
  - Bottom line: local minima are much more desirable than saddle points.
  - How to escape saddle points efficiently?

  21. Statistics meets optimization
  - Proposal: separation of landscape analysis and generic algorithm design. (Fig. credit: Chen)
  - Landscape analysis (statistics): all local minima are global minima
    - dictionary learning (Sun et al. '15)
    - phase retrieval (Sun et al. '16)
    - matrix completion (Ge et al. '16)
    - synchronization (Bandeira et al. '16)
    - inverting deep neural nets (Hand et al. '17)
    - ...
  - Generic algorithms (optimization): all the saddle points can be escaped
    - gradient descent (Lee et al. '16)
    - trust region method (Sun et al. '16)
    - perturbed GD (Jin et al. '17)
    - cubic regularization (Agarwal et al. '17)
    - Natasha (Allen-Zhu '17)
    - ...
  - Issue: conservative computational guarantees for specific problems (e.g., phase retrieval, blind deconvolution, matrix completion).

  22. Solution: blending landscape and convergence analysis via implicitly regularized Wirtinger flow

  23. A natural least-squares formulation
  - Goal: demixing from bilinear measurements, given the measurements y_j and the design vectors b_j, a_ij.
  - Least-squares loss: f({h_i, x_i}) = sum_j | sum_i b_j^H h_i x_i^H a_ij - y_j |^2.
  - Pros: computationally efficient in the natural parameter space.
  - Cons: f is nonconvex: bilinear terms and scaling ambiguity.

  24. Wirtinger flow
  - Least-squares minimization via Wirtinger flow (Candes, Li, Soltanolkotabi '14).
  - Spectral initialization by the top eigenvector/singular vectors of a data matrix built from the measurements.
  - Gradient iterations on the least-squares loss (Wirtinger derivatives for the complex variables).
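
A simplified sketch of the two stages for a single, real-valued bilinear component under an i.i.d. Gaussian design (a toy version; the actual algorithm works with complex variables via Wirtinger derivatives and handles all s components jointly):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, m, eta, T = 10, 10, 400, 0.02, 500

# Ground truth and i.i.d. Gaussian designs (real-valued stand-in)
h_star, x_star = rng.standard_normal(K), rng.standard_normal(N)
B, A = rng.standard_normal((m, K)), rng.standard_normal((m, N))
y = (B @ h_star) * (A @ x_star)            # bilinear measurements y_j = (b_j^T h*)(a_j^T x*)

# Spectral initialization: (1/m) sum_j y_j b_j a_j^T equals h* x*^T in expectation
M = (B * y[:, None]).T @ A / m
U, S, Vt = np.linalg.svd(M)
scale = np.sqrt(S[0])
h, x = scale * U[:, 0], scale * Vt[0]

# Gradient iterations on f(h, x) = (1/2m) sum_j ((b_j^T h)(a_j^T x) - y_j)^2
for _ in range(T):
    r = (B @ h) * (A @ x) - y              # residuals
    grad_h = B.T @ (r * (A @ x)) / m
    grad_x = A.T @ (r * (B @ h)) / m
    h, x = h - eta * grad_h, x - eta * grad_x
```

The recovered pair (h, x) is only determined up to the scaling (and sign) ambiguity noted on the previous slide, i.e., the outer product h x^T is what converges to h* x*^T.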

  25. Two-stage approach
  - Initialize within a local basin sufficiently close to the ground truth (i.e., strongly convex, no saddle points or spurious local minima).
  - Iterative refinement via some iterative optimization algorithm. (Fig. credit: Chen)

  26. Gradient descent theory
  - Two standard conditions that enable geometric convergence of GD:
    - (local) restricted strong convexity
    - (local) smoothness
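
A generic statement of these two conditions and the contraction they buy (standard optimization facts rather than formulas copied from the slide): suppose f has minimizer z* with ∇f(z*) = 0 and, in a neighborhood of z*,

```latex
\underbrace{\langle \nabla f(\boldsymbol z)-\nabla f(\boldsymbol z^\star),\; \boldsymbol z-\boldsymbol z^\star\rangle \;\ge\; \alpha\,\|\boldsymbol z-\boldsymbol z^\star\|_2^2}_{\text{(local) restricted strong convexity}},
\qquad
\underbrace{\|\nabla f(\boldsymbol z)-\nabla f(\boldsymbol z^\star)\|_2 \;\le\; \beta\,\|\boldsymbol z-\boldsymbol z^\star\|_2}_{\text{(local) smoothness}} .
```

Then for gradient descent z^{t+1} = z^t - η ∇f(z^t) with η ≤ α/β², expanding the square gives ||z^{t+1} - z*||² ≤ (1 - 2ηα + η²β²) ||z^t - z*||² ≤ (1 - ηα) ||z^t - z*||², i.e., geometric (linear) convergence inside that neighborhood.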

  27. Gradient descent theory
  - Question: which region enjoys both strong convexity and smoothness?
    - The iterate is not far from the ground truth (convexity).
    - The iterate is incoherent w.r.t. the sampling vectors (incoherence region for smoothness).
  - Prior works suggest enforcing explicit regularization (e.g., a regularized loss [Ling & Strohmer '17]) to promote incoherence.

  28. Our finding: WF is implicitly regularized
  - WF (GD) implicitly forces the iterates to remain incoherent with the sampling vectors.
  - This cannot be derived from generic optimization theory.
  - It relies on a finer statistical analysis of the entire trajectory of GD. (Fig.: region of local strong convexity and smoothness)

  29. Key proof idea: leave-one-out analysis
  - Introduce leave-one-out iterates by running WF without the l-th sample.
  - The leave-one-out iterate is independent of the l-th sampling vector.
  - The leave-one-out iterate stays close to the true iterate, so the true iterate is nearly independent of (i.e., nearly orthogonal to) the l-th sampling vector.
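
Schematically, the construction looks as follows (notation mine, not from the slides; r_j denotes the j-th residual of the least-squares loss):

```latex
f^{(l)}(\boldsymbol z) \;=\; \sum_{j \ne l} \big|\, r_j(\boldsymbol z) \,\big|^2,
\qquad
\boldsymbol z^{t+1,(l)} \;=\; \boldsymbol z^{t,(l)} \;-\; \eta\, \nabla f^{(l)}\big(\boldsymbol z^{t,(l)}\big).
```

Because z^{t,(l)} never touches the l-th sample, it is statistically independent of it; showing that z^{t,(l)} stays uniformly close to the true iterate z^t then transfers this independence, approximately, to z^t itself.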

  30. Theoretical guarantees
  - With i.i.d. Gaussian design, WF (regularization-free) achieves:
    - incoherence of the iterates
    - a near-linear convergence rate
  - Summary:
    - Sample size:
    - Stepsize: vs. [Ling & Strohmer '17]
    - Computational complexity: vs. [Ling & Strohmer '17]

  31. Numerical results
  - Stepsize:
  - Number of users:
  - Sample size:
  - Linear convergence: WF attains ε-accuracy within a modest number of iterations.

  32. Is carefully designed initialization necessary?

  33. Numerical results of randomly initialized WF
  - Stepsize:
  - Number of users:
  - Sample size:
  - Initial point:
  - Randomly initialized WF enters the local basin within a modest number of iterations.

  34. Analysis: population dynamics
  - Population level (infinite samples):
    - Signal strength: the alignment parameter between the iterate and the ground truth.
    - Size of the residual component.
    - The state evolution of these two quantities drives the iterates into the local basin. (Fig.: state evolution, local basin)

  36. Analysis: finite-sample analysis
  - The population-level analysis holds approximately if the finite-sample fluctuation term is well-controlled. (Fig. credit: Chen)
  - The fluctuation term is well-controlled if the iterate is independent of the sampling vectors.
  - Key analysis ingredient: show that the iterate is "nearly independent" of each sampling vector, so that the fluctuation remains well-controlled in this region.

  37. Theoretical guarantees
  - With i.i.d. Gaussian design, WF with random initialization achieves:
  - Summary:
    - Stepsize:
    - Sample size:
    - Stage I: reach the local basin within a modest number of iterations.
    - Stage II: linear convergence.
    - Computational complexity:

  38. Vignette B: Matrix optimization over manifolds
  - Optimization over Riemannian manifolds (non-Euclidean geometry).

  39. Why manifold optimization?

  40. What is manifold optimization?
  - Manifold (or manifold-constrained) optimization problem: minimize f(x) subject to x ∈ M, where f is a smooth function and M is a Riemannian manifold. (A minimal sketch of the resulting gradient scheme follows below.)
  - Examples of M: spheres, orthonormal bases (Stiefel), rotations, positive definite matrices, fixed-rank matrices, Euclidean distance matrices, semidefinite fixed-rank matrices, linear subspaces (Grassmann), phases, essential matrices, fixed-rank tensors, Euclidean spaces, ...
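
As a concrete illustration of the project-and-retract pattern behind Riemannian optimization, here is a toy Riemannian gradient descent on the unit sphere (a stand-in example, not the manifold actually used for blind demixing):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
C = rng.standard_normal((n, n))
C = (C + C.T) / 2                        # symmetric matrix

# Minimize f(x) = -x^T C x over the unit sphere (leading-eigenvector problem)
def egrad(x):
    return -2 * C @ x                    # Euclidean gradient

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
eta = 0.01
for _ in range(2000):
    g = egrad(x)
    rgrad = g - (x @ g) * x              # project onto the tangent space at x
    x = x - eta * rgrad                  # gradient step in the tangent direction
    x /= np.linalg.norm(x)               # retraction: map back onto the sphere
```

The same Riemannian-gradient-plus-retraction template applies when the manifold is, e.g., a product of fixed-rank matrix manifolds, which is the setting relevant to Riemannian optimization for blind demixing.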
