The Algorithmic Frontiers of Atomic Norm Minimization: Relaxation, Discretization, and Randomization
Benjamin Recht, University of California, Berkeley
Linear Inverse Problems
• Find me a solution of y = Φx, where Φ is n × p with n < p.
• Of the infinite collection of solutions, which one should we pick?
• Leverage structure: sparsity, rank, smoothness, symmetry.
• How do we design algorithms to solve underdetermined systems with structural priors?
Atomic Decompositions
• Write the model as a weighted combination of atoms: model = Σ (weights × atoms).
• Search for the best linear combination of the fewest atoms.
• "rank" = fewest atoms needed to describe the model.
Atomic Norms
• Given a basic set of atoms A, define the function
  ‖x‖_A = inf { t > 0 : x ∈ t·conv(A) }
• Under mild conditions, we get a norm:
  ‖x‖_A = inf { Σ_{a∈A} |c_a| : x = Σ_{a∈A} c_a·a }
• IDEA: minimize ‖z‖_A subject to Φz = y
• When does this work?
• How do we solve the optimization problem?
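In the simplest case the atoms are the signed standard basis vectors {±e_1, …, ±e_p}, ‖z‖_A is the ℓ₁ norm, and the IDEA above is basis pursuit. A minimal MATLAB sketch, assuming the Optimization Toolbox's linprog is available; the problem sizes below are illustrative.

% Minimal sketch: atoms = signed standard basis vectors, so ||z||_A = ||z||_1
% and "minimize ||z||_A s.t. Phi*z = y" is basis pursuit, a linear program.
n = 20; p = 60; s = 3;                         % underdetermined: n < p
Phi = randn(n, p);
x0  = zeros(p, 1); x0(randperm(p, s)) = randn(s, 1);   % sparse ground truth
y   = Phi * x0;
% min sum(u)  s.t.  -u <= z <= u,  Phi*z = y   (stacked variables [z; u])
f   = [zeros(p,1); ones(p,1)];
A   = [ eye(p), -eye(p); -eye(p), -eye(p)];
b   = zeros(2*p, 1);
Aeq = [Phi, zeros(n, p)];
zu  = linprog(f, A, b, Aeq, y);
zhat = zu(1:p);                                 % recovered signal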
Atomic Norm Minimization
IDEA: minimize ‖z‖_A subject to Φz = y
• Generalizes existing, powerful methods.
• A rigorous framework for developing new analysis algorithms.
• Precise, tight bounds on the number of measurements needed for model recovery.
• One algorithm prototype for a myriad of data-analysis applications.
Chandrasekaran, Recht, Parrilo, and Willsky
Union of Subspaces
• X has structured sparsity: a linear combination of elements from a set of subspaces {U_g}.
• Atomic set: unit-norm vectors living in one of the U_g.
Permutations and Rankings
• X is a sum of a few permutation matrices.
• Examples: multi-object tracking, ranked elections, BCS.
• Convex hull of the permutation matrices: the doubly stochastic matrices.
Moments
• Convex hull of [1, t, t², t³, t⁴, …], t ∈ T, some basic set.
• Applications: system identification, image processing, numerical integration, statistical inference.
• Solve with semidefinite programming.
Cut Matrices
• Sums of rank-one sign matrices.
• Applications: collaborative filtering, clustering in genetic networks, combinatorial approximation algorithms.
• Approximate with semidefinite programming.
Low-rank Tensors
• Sums of rank-one tensors.
• Applications: computer vision, image processing, hyperspectral imaging, neuroscience.
• Approximate with alternating least squares.
Algorithms
minimize_z ‖Φz − y‖²₂ + µ‖z‖_A
• Naturally amenable to a projected gradient algorithm:
  z_{k+1} = Π_{ηµ}(z_k − η·Φ*r_k),  with residual r_k = Φz_k − y
• "Shrinkage": Π_τ(z) = argmin_u ½‖z − u‖²₂ + τ‖u‖_A
• A similar algorithm applies with an atomic norm constraint.
• The same basic ingredients work for ALM, ADM, Bregman, Mirror Prox, etc.
• The key question: how to compute the shrinkage?
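A minimal MATLAB sketch of this iteration for the sparse-atom case, where the shrinkage Π_τ is entrywise soft thresholding. It reuses Phi, y, and p from the basis-pursuit sketch above; the step size and iteration count are illustrative choices, not prescribed by the talk.

% Proximal-gradient sketch for sparse atoms: shrinkage = soft thresholding.
% Minimizes (1/2)||Phi*z - y||^2 + mu*||z||_1 (same problem up to rescaling mu).
mu  = 0.1;
eta = 1 / norm(Phi)^2;                           % 1 / Lipschitz constant of the gradient
soft = @(z, t) sign(z) .* max(abs(z) - t, 0);    % Pi_t for the l1 atomic norm
z = zeros(p, 1);
for k = 1:500
    r = Phi*z - y;                               % residual r_k
    z = soft(z - eta*(Phi'*r), eta*mu);          % z_{k+1} = Pi_{eta*mu}(z_k - eta*Phi'*r_k)
end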
Shrinkage
Π_τ(z) = argmin_u ½‖z − u‖²₂ + τ‖u‖_A
Λ_τ(z) = argmin_{‖v‖*_A ≤ τ} ½‖z − v‖²₂
z = Π_τ(z) + Λ_τ(z)
• Dual norm: ‖v‖*_A = max_{a∈A} ⟨v, a⟩
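For the ℓ₁ atomic norm the dual norm is ℓ∞, so Λ_τ is just clipping to [−τ, τ]. A quick MATLAB check of the decomposition z = Π_τ(z) + Λ_τ(z) in that special case, purely as an illustration.

% Numerical check of the Moreau identity z = Pi_tau(z) + Lambda_tau(z)
% for the l1 atomic norm (dual norm = l-infinity).
tau = 0.7;
z   = randn(10, 1);
Pi_tau     = sign(z) .* max(abs(z) - tau, 0);    % soft threshold
Lambda_tau = max(min(z, tau), -tau);             % projection onto the l-inf ball
norm(z - (Pi_tau + Lambda_tau))                  % should be ~0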
Relaxations
‖v‖*_A = max_{a∈A} ⟨v, a⟩
• The dual norm is efficiently computable if the set of atoms is polyhedral or semidefinite representable.
• A₁ ⊂ A₂  ⇒  ‖x‖*_{A₁} ≤ ‖x‖*_{A₂}  and  ‖x‖_{A₂} ≤ ‖x‖_{A₁}
• Convex relaxations of the atoms yield approximations to the norm. NB: the sample complexity increases.
• A hierarchy of relaxations based on θ-bodies yields progressively tighter bounds on the atomic norm.
Theta Bodies
• Suppose A is an algebraic variety: A = { x : f(x) = 0 ∀ f ∈ I }
• ‖v‖*_A = max_{a∈A} ⟨v, a⟩ ≤ τ  ⟺  ⟨v, a⟩ = τ − q(a), where q = h + g with h(x) ≥ 0 for all x (nonnegative everywhere) and g ∈ I (vanishes on the atoms).
• Relaxation: restrict h to be a sum of squares.
• Gives a lower bound on the atomic norm.
• Solvable by semidefinite programming (Gouveia, Parrilo, and Thomas, 2010).
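A toy sanity check of this certificate, assuming CVX is installed: take the atoms A = {−1, +1} on the real line (the variety a² − 1 = 0), so the atomic norm is |x| and the dual norm is |v|. The degree-2 sum-of-squares certificate recovers the dual norm exactly here; this is only an illustration, not how one would implement θ-body hierarchies at scale.

% Toy theta-body certificate (assumes CVX is installed): find the smallest tau
% such that tau - v*a = h(a) + lambda0*(a^2 - 1) with h(a) = [1 a]*H*[1;a] SOS.
v = 1.3;                              % test point for the dual norm
cvx_begin sdp quiet
    variables tau lambda0
    variable H(2,2) symmetric
    minimize(tau)
    H >= 0                            % h is a sum of squares
    H(1,1) - lambda0 == tau;          % match the constant coefficient
    2*H(1,2)         == -v;           % match the linear coefficient
    H(2,2) + lambda0 == 0;            % match the quadratic coefficient
cvx_end
[tau, abs(v)]                         % tau should equal |v| = 1.3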
Approximation through Discretization
• Relaxations: A₁ ⊂ A₂ ⇒ ‖x‖_{A₂} ≤ ‖x‖_{A₁}
• Let A_ε ⊂ A be a finite net.
• Let Ψ be a matrix whose columns are the atoms in A_ε. Then
  ‖x‖_{A_ε} = inf { Σ_k |c_k| : x = Ψc }
  an ℓ₁ norm, plus extra equality constraints.
• Often we can compute explicit bounds λ_ε such that
  λ_ε·‖x‖_{A_ε} ≤ ‖x‖_A ≤ ‖x‖_{A_ε}
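A MATLAB sketch of this discretization for one continuously parameterized family: the atoms a(ω) are sampled sinusoids with frequency ω on a finite grid, so the discretized atomic-norm problem becomes an ℓ₁ problem over the dictionary Ψ. The grid size, these particular atoms, and the denoising formulation are illustrative choices.

% Discretization sketch: atoms a(omega) = cos(omega*t), t = 0..n-1, with
% omega restricted to a finite grid; the discretized norm is an l1 norm over Psi.
n = 64;  m = 512;                               % signal length, grid size
t = (0:n-1)';
omega_grid = linspace(0, pi, m);
Psi = cos(t * omega_grid);                      % columns are gridded atoms
Psi = Psi ./ vecnorm(Psi);                      % normalize the atoms
xsig = 2*cos(0.9*t) - cos(2.1*t);               % signal made of two tones
% Discretized atomic-norm denoising: min_c (1/2)||Psi*c - xsig||^2 + mu*||c||_1
mu = 0.05; eta = 1/norm(Psi)^2; c = zeros(m,1);
soft = @(z, s) sign(z) .* max(abs(z) - s, 0);
for k = 1:2000
    c = soft(c - eta*(Psi'*(Psi*c - xsig)), eta*mu);
end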
Discretization Theory
• Discretize the parameter space to get a finite number of grid points.
• Enforce a finite number of constraints in the dual:
  |⟨Φ*z, a(ω_j)⟩| ≤ 1,  ω_j ∈ Ω_m
• Equivalently, in the primal, replace the atomic norm with a discrete one.
• What happens to the solutions as the discretization is refined?
Convergence in Dual
• Assumption: there exist parameters ω₁, …, ω_n such that a(ω₁), …, a(ω_n) are linearly independent.
• Enforce finite constraints in the dual: |⟨Φ*z, a(ω_j)⟩| ≤ 1, ω_j ∈ Ω_m
Theorem:
• The discretized optimal objectives converge to the original objective.
• Any solution sequence {ẑ_m} of the discretized problems has a subsequence that converges to the solution set of the original problem.
• For the LASSO dual, the convergence speed is O(ρ_m).
[Figure: log₂|f_m − f*| and log₂‖ẑ_m − ẑ‖ versus log₂(m), with discretized dual solutions shown for m = 32, 128, 512.]
Single Molecule Imaging Courtesy of Zhuang Research Lab
Single Molecule Imaging
• Bundles of 8 tubes of 30 nm diameter.
• Sparse density: 81,049 molecules across 12,000 frames.
• Resolution: 64 × 64 pixels; pixel size: 100 nm × 100 nm.
• Field of view: 6400 nm × 6400 nm; target resolution: 10 nm × 10 nm.
• Discretize the FOV into 640 × 640 pixels.
• Image model:
  I(x, y) = Σ_j c_j·PSF(x − x_j, y − y_j),  (x_j, y_j) ∈ [0, 6400]²,  (x, y) ∈ {50, 150, …, 6350}
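A scaled-down MATLAB sketch of the discretized formulation for a single frame: build a dictionary whose columns are the PSF centered at each candidate grid location, then recover molecule positions by ℓ₁-regularized least squares. The Gaussian PSF, its width, and the grid sizes are illustrative assumptions (chosen so the dictionary fits in memory), not the parameters of the actual benchmark pipeline.

% One-frame sketch: dictionary of Gaussian PSFs on a fine candidate grid,
% observed on a coarse camera grid; recover molecules with l1 regularization.
npix = 16;  ngrid = 64;  sigma = 1.2;            % camera grid, recon grid, PSF width (pixels)
[xo, yo] = meshgrid(0.5:npix);                   % observed pixel centers
[xg, yg] = meshgrid(linspace(0, npix, ngrid));   % candidate molecule positions
Psi = exp(-((xo(:) - xg(:)').^2 + (yo(:) - yg(:)').^2) / (2*sigma^2));
ctrue = zeros(ngrid^2, 1); ctrue(randperm(ngrid^2, 3)) = 1;   % 3 molecules
I = Psi*ctrue + 0.01*randn(npix^2, 1);           % simulated noisy frame
mu = 0.05; eta = 1/norm(Psi)^2; c = zeros(ngrid^2, 1);
soft = @(z, s) sign(z) .* max(abs(z) - s, 0);
for k = 1:1000
    c = soft(c - eta*(Psi'*(Psi*c - I)), eta*mu);
end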
Single Molecule Imaging
Single Molecule Imaging
[Figure: Precision, Recall, Jaccard index, and F-score versus radius (0–50 nm), comparing the sparse (atomic-norm) method with CoG and quickPALM.]
Atomic norms in sparse approximation
• Greedy approximations:
  ‖f − f_n‖_{L₂} ≤ c₀·‖f‖_A / √n
• f_n: the best n-term approximation to a function f in the convex hull of A.
• Maurey, Jones, and Barron (1980s–90s); DeVore and Temlyakov (1996).
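A MATLAB sketch of the kind of greedy scheme these bounds describe: at each step select the atom most correlated with the residual, then refit the weights by least squares (orthogonal matching pursuit style). The random dictionary and synthetic target below are illustrative assumptions.

% Greedy n-term approximation over a finite dictionary of unit-norm atoms.
d = 100;  N = 500;
Psi = randn(d, N); Psi = Psi ./ vecnorm(Psi);    % unit-norm atoms
f = Psi(:, [7 42 300]) * [1; -2; 0.5];           % target in the span of a few atoms
S = [];  r = f;
for n = 1:10
    [~, j] = max(abs(Psi' * r));                 % atom most correlated with residual
    S = union(S, j);
    c = Psi(:, S) \ f;                           % refit weights on selected atoms
    r = f - Psi(:, S) * c;                       % new residual
    fprintf('n = %2d, residual = %.2e\n', n, norm(r));
end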
If greedy is hard…
• Training networks of the form f(x) = Σ_k c_k·φ(x; θ_k) is hard.
• But for fixed θ_k, fitting the weights c_k is a feasible (convex, least-squares) problem.
• Can we just not optimize the θ_k?
• What if we randomly sample the parameters?
• Fix parameterized basis functions φ(x; ω) and a probability distribution p(ω).
• Our target space will be functions of the form f(x) = ∫ c(ω)·φ(x; ω) dω with |c(ω)| ≤ C·p(ω).
• Example: Fourier basis functions φ(x; ω) = exp(i⟨ω, x⟩) with Gaussian parameters ω.
• If f lies in this class with Gaussian p, that means the frequency distribution of f has subgaussian tails.
Theorem 1: Let f be in the target class with norm at most C, and let ω₁, …, ω_n be sampled i.i.d. from p. Then, with probability at least 1 − δ, fitting only the weights c_k on N training samples yields
  Generalization Error ≤ Estimation Error + Approximation Error,
where (up to constants depending on C and δ) the estimation error scales like O(1/√N) and the approximation error like O(1/√n).
• It's a finite-sized basis set!
• Choosing n proportional to N gives overall convergence of O(1/√N).
% Approximates Gaussian process regression
% with a Gaussian kernel of variance gamma
% lambda: regularization parameter
% dataset: X is d x N, y is 1 x N
% test:    xtest is d x 1
% D: dimensionality of the random feature map

% training
w = randn(D, size(X,1));                              % random Gaussian frequencies
b = 2*pi*rand(D,1);                                   % random phases
Z = cos(sqrt(gamma)*w*X + repmat(b,1,size(X,2)));     % D x N random feature matrix

% Equivalent to
%   alpha = (lambda*eye(D) + Z*Z')\(Z*y(:));
% but solved matrix-free with symmlq:
alpha = symmlq(@(v)(lambda*v(:) + Z*(Z'*v(:))), Z*y(:), 1e-6, 2000);

% testing
ztest = alpha(:)' * cos(sqrt(gamma)*w*xtest(:) + b);
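A hypothetical usage sketch for the snippet above on synthetic one-dimensional data; the data, gamma, lambda, and D below are illustrative choices, not values from the talk.

% Hypothetical setup for the training/testing script above.
N = 200; d = 1; D = 500;
gamma = 10; lambda = 1e-3;
X = rand(d, N);                          % d x N training inputs
y = sin(8*X) + 0.1*randn(1, N);          % 1 x N noisy targets
xtest = 0.5*ones(d, 1);                  % a single d x 1 test point
% ...then run the training and testing lines above;
% ztest approximates the kernel-ridge prediction at xtest.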
• Relaxation: hierarchies for approximating complex priors via semidefinite programming.
• Discretization: fast convergence in the dual for models that admit tight discretizations.
• Randomization: efficient, practical algorithms in place of greedy methods.
• Challenge: integrate these ideas into fast, greedy algorithms.