Latent Structure Beyond Sparse Codes
Benjamin Recht
Department of EECS and Statistics, University of California, Berkeley
Sparse Codes
• ~2.5x overcomplete, Gabor-like atoms
• Redundancy, robustness, and sparsity
• Which mathematical representations can be learned robustly?
Sparse Approximation
Lasso
• # patients << # peaks
• If very few peaks are needed for diagnosis, search for a sparse set of markers
Compressed Sensing
• Use the fact that images are sparse in a wavelet basis to reduce the number of measurements required for signal acquisition
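A minimal sketch of the biomarker scenario above, assuming scikit-learn; the sizes, coefficients, noise level, and regularization weight `alpha` are illustrative, not from the slides.

```python
# Far fewer patients than spectral peaks; the lasso picks out a sparse
# set of predictive markers.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_patients, n_peaks, n_markers = 80, 500, 5        # n << p

X = rng.standard_normal((n_patients, n_peaks))     # peak intensities
true_coef = np.zeros(n_peaks)
true_coef[rng.choice(n_peaks, n_markers, replace=False)] = 3.0
y = X @ true_coef + 0.1 * rng.standard_normal(n_patients)

model = Lasso(alpha=0.2).fit(X, y)
print("selected peaks:", np.flatnonzero(model.coef_))
```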
Cardinality Minimization
• PROBLEM: Find the vector of lowest cardinality that satisfies/approximates the underdetermined linear system
Φ : ℝ^p → ℝ^n,  Φx = y
• NP-HARD:
– Reduction from EXACT-COVER
– Hard to approximate
– Known exact algorithms require enumeration
• HEURISTIC: Replace cardinality with the ℓ1 norm
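A sketch of the ℓ1 heuristic as a linear program, using SciPy; the split x = u − v with u, v ≥ 0 is the standard reformulation, and the problem sizes are illustrative assumptions.

```python
# Minimize ||x||_1 subject to Phi x = y, via linear programming.
import numpy as np
from scipy.optimize import linprog

def l1_min(Phi, y):
    n, p = Phi.shape
    c = np.ones(2 * p)                    # sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([Phi, -Phi])         # Phi @ (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p))
    u, v = res.x[:p], res.x[p:]
    return u - v

rng = np.random.default_rng(1)
p, n, s = 100, 50, 5
x_true = np.zeros(p); x_true[:s] = rng.standard_normal(s)
Phi = rng.standard_normal((n, p))
x_hat = l1_min(Phi, Phi @ x_true)
print("recovery error:", np.linalg.norm(x_hat - x_true))
```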
Where rank structure appears:
• Recommender Systems: rank of the data matrix
• Geometric Structure: rank of the Gram matrix
• Quantum Tomography: rank of the density matrix
• Seismic Imaging: rank of the unfolded tensor
Affine Rank Minimization
• PROBLEM: Find the matrix of lowest rank that satisfies/approximates the underdetermined linear system
Φ : ℝ^{p1×p2} → ℝ^n,  Φ(X) = y
• NP-HARD:
– Reduces to solving polynomial equations
– Hard to approximate
– Exact algorithms are awful
• HEURISTIC: Replace rank with the nuclear norm
Heuristic: Gradient Descent
IDEA: Replace rank with the nuclear norm:
minimize ‖X‖_*  subject to  Φ(X) = b
Factor the p1 × p2 matrix as M = L Rᵀ, with L of size p1 × r and R of size r × p2 (stored as p2 × r).
• Step 1: Pick (i, j) and compute the residual e = L_i R_jᵀ − M_ij
• Step 2: Take a mixture of the current model and the corrected model (α, β > 0):
  L_i ← α L_i − β e R_j
  R_j ← α R_j − β e L_i
Some guy on livejournal, 2006
Succeeds when the number of samples is Õ(r(p1 + p2))
Fazel, Parrilo, Recht 2007; Candès and Recht 2008
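A sketch of the factored stochastic-gradient update described on this slide, in NumPy; the step sizes α, β, the initialization scale, and the number of passes are illustrative choices, not prescribed by the slide.

```python
# Pick an observed entry (i, j), form the residual e = L_i . R_j - M_ij,
# then mix the current factors with the corrected ones.
import numpy as np

def sgd_step(L, R, i, j, M_ij, alpha=1.0, beta=0.01):
    e = L[i] @ R[j] - M_ij
    L_i_old = L[i].copy()
    L[i] = alpha * L[i] - beta * e * R[j]
    R[j] = alpha * R[j] - beta * e * L_i_old
    return L, R

rng = np.random.default_rng(0)
p1, p2, r = 50, 60, 3
M = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))
L = 0.1 * rng.standard_normal((p1, r))
R = 0.1 * rng.standard_normal((p2, r))
omega = [(rng.integers(p1), rng.integers(p2)) for _ in range(2000)]  # observed entries
for _ in range(50):
    for i, j in omega:
        L, R = sgd_step(L, R, i, j, M[i, j])
print("mean squared error on observed entries:",
      np.mean([(L[i] @ R[j] - M[i, j]) ** 2 for i, j in omega]))
```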
System Identification
• Observe a time series y_1, y_2, …, y_T driven by the input u_1, u_2, …, u_T (Na et al., 2012)
• What is a principled way to build a parsimonious model for the input-output responses?
• System Identification: find a dynamical model that agrees with time series data
• All linear systems are combinations of single-pole filters.
• Leverage this structure for new algorithms and analysis.
Shah, Bhaskar, Tang, and Recht 2012
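A hedged sketch of the "combination of single-pole filters" idea: build a dictionary of responses of single-pole filters (poles on a grid) to the observed input, then fit a sparse combination with the lasso. The pole grid, the toy system, and the regularization weight are illustrative assumptions, not the construction from the cited paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
T = 200
u = rng.standard_normal(T)                      # input sequence

poles = np.linspace(-0.95, 0.95, 39)            # candidate single poles
def pole_response(a, u):
    y = np.zeros_like(u)
    for t in range(1, len(u)):
        y[t] = a * y[t - 1] + u[t - 1]          # single-pole filter x_t = a x_{t-1} + u_{t-1}
    return y

D = np.column_stack([pole_response(a, u) for a in poles])   # dictionary of responses

# "True" system: a combination of two single-pole filters, plus noise.
y = 2.0 * pole_response(0.5, u) - 1.0 * pole_response(-0.7, u)
y += 0.05 * rng.standard_normal(T)

fit = Lasso(alpha=0.05).fit(D, y)
print("selected poles:", poles[np.flatnonzero(fit.coef_)])
```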
Linear Inverse Problems
• Find me a solution of y = Φx, where Φ is n × p with n < p
• Of the infinite collection of solutions, which one should we pick?
• Leverage structure: sparsity, rank, smoothness, symmetry
• How do we design algorithms to solve underdetermined systems with priors?
Sparsity
• 1-sparse vectors of Euclidean norm 1
• Their convex hull is the unit ball of the ℓ1 norm:
‖x‖₁ = Σ_{i=1}^p |x_i|
minimize ‖x‖₁  subject to  Φx = y
(Figure: the affine space Φx = y meets the ℓ1 ball at a sparse vertex.)
Compressed Sensing: Candès, Romberg, Tao, Donoho, Tanner, etc.
Rank
• 2×2 symmetric matrices [[x, y], [y, z]], plotted in 3D
• The rank-1 matrices of Euclidean norm 1 lie on the surface x² + 2y² + z² = 1
• Convex hull: the unit ball of the nuclear norm ‖X‖_* = Σ_i σ_i(X)
Nuclear Norm Heuristic: Fazel 2002; R, Fazel, and Parrilo 2007
Rank Minimization / Matrix Completion
Integer Programming
• Integer solutions: all components of x are ±1
• Their convex hull is the hypercube, the unit ball of the ℓ∞ norm
minimize ‖x‖_∞  subject to  Φx = y
(Figure: the affine space Φx = y meets the hypercube at a vertex with ±1 entries.)
Donoho and Tanner 2008; Mangasarian and Recht 2009
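A sketch of the ℓ∞ heuristic as a linear program with SciPy: an auxiliary variable t bounds every coordinate, and the LP recovers a ±1 vector from random measurements. Problem sizes are illustrative assumptions.

```python
# Minimize ||x||_inf subject to Phi x = y.
import numpy as np
from scipy.optimize import linprog

def linf_min(Phi, y):
    n, p = Phi.shape
    # variables z = [x (p entries), t (1 entry)]; minimize t with -t <= x_i <= t
    c = np.zeros(p + 1); c[-1] = 1.0
    A_ub = np.block([[ np.eye(p), -np.ones((p, 1))],
                     [-np.eye(p), -np.ones((p, 1))]])
    b_ub = np.zeros(2 * p)
    A_eq = np.hstack([Phi, np.zeros((n, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * p + [(0, None)])
    return res.x[:p]

rng = np.random.default_rng(2)
p, n = 60, 40                               # n > p/2, the rate quoted later
x_true = rng.choice([-1.0, 1.0], size=p)
Phi = rng.standard_normal((n, p))
x_hat = linf_min(Phi, Phi @ x_true)
print("max deviation from the +/-1 solution:", np.abs(x_hat - x_true).max())
```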
Parsimonious Models
model = Σ weights × atoms
• Search for the best linear combination of the fewest atoms
• “rank” = fewest atoms needed to describe the model
Atomic Norms
• Given a basic set of atoms A, define the function
‖x‖_A = inf{ t > 0 : x ∈ t·conv(A) }
• When A is centrosymmetric, we get a norm:
‖x‖_A = inf{ Σ_{a∈A} |c_a| : x = Σ_{a∈A} c_a a }
IDEA: minimize ‖z‖_A subject to Φz = y
• When can we compute this?
• When does this work?
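For a finite, centrosymmetric atomic set, the second formula above is itself a linear program. A minimal sketch with SciPy; using the signed standard basis vectors as atoms (so ‖·‖_A reduces to ‖·‖₁) is just an illustrative sanity check.

```python
# ||x||_A = smallest total weight needed to write x as a combination of atoms.
import numpy as np
from scipy.optimize import linprog

def atomic_norm(x, atoms):
    """atoms: array of shape (num_atoms, p); the atomic set is closed under negation."""
    A = np.vstack([atoms, -atoms]).T          # columns: atoms and their negatives
    c = np.ones(A.shape[1])
    res = linprog(c, A_eq=A, b_eq=x, bounds=[(0, None)] * A.shape[1])
    return res.fun

p = 5
atoms = np.eye(p)                             # standard basis => the l1 norm
x = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
print(atomic_norm(x, atoms), np.abs(x).sum()) # both should be 3.5
```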
Union of Subspaces
• X has structured sparsity: a linear combination of elements from a set of subspaces {U_g}
• Atomic set: unit-norm vectors living in one of the U_g
Permutations and Rankings
• X is a sum of a few permutation matrices
• Examples: multi-object tracking, ranked elections, BCS
• Convex hull of the permutation matrices: doubly stochastic matrices
• Moments: convex hull of [1, t, t², t³, t⁴, …], t ∈ T, some basic set
  – System identification, image processing, numerical integration, statistical inference
  – Solve with semidefinite programming
• Cut matrices: sums of rank-one sign matrices
  – Collaborative filtering, clustering in genetic networks, combinatorial approximation algorithms
  – Approximate with semidefinite programming
• Low-rank tensors: sums of rank-one tensors
  – Computer vision, image processing, hyperspectral imaging, neuroscience
  – Approximate with alternating least squares
Atomic norms in sparse approximation
• Greedy approximations: ‖f − f_n‖_{L2} ≤ c₀ ‖f‖_A / √n
• f_n is the best n-term approximation to a function f in the convex hull of A
• Maurey, Jones, and Barron (1980s–90s)
• DeVore and Temlyakov (1996)
• Random Feature Heuristics (Rahimi and R, 2007)
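A hedged sketch of one simple greedy scheme in this family: repeatedly pick the atom most correlated with the residual, matching-pursuit style. This is an illustrative instance of an n-term greedy approximation, not the specific construction from the cited papers, and the dictionary and target are made up.

```python
import numpy as np

def greedy_approx(f, atoms, n_terms):
    """atoms: (num_atoms, p) with unit-norm rows; returns an n-term approximation of f."""
    approx = np.zeros_like(f)
    for _ in range(n_terms):
        residual = f - approx
        k = np.argmax(np.abs(atoms @ residual))        # best-correlated atom
        approx = approx + (atoms[k] @ residual) * atoms[k]
    return approx

rng = np.random.default_rng(0)
p, num_atoms = 100, 500
atoms = rng.standard_normal((num_atoms, p))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)
f = atoms[:20].sum(axis=0)                             # a combination of 20 atoms
for n in (5, 20, 80):
    print(n, np.linalg.norm(f - greedy_approx(f, atoms, n)))
```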
Tangent Cones
• The set of directions that decrease the norm from x forms a cone:
T_A(x) = { d : ‖x + αd‖_A ≤ ‖x‖_A for some α > 0 }
minimize ‖z‖_A subject to Φz = y
• x is the unique minimizer if the intersection of this cone with the null space of Φ equals {0}
(Figure: the affine space {z : Φz = y} touching the sublevel set {z : ‖z‖_A ≤ ‖x‖_A} only at x.)
Mean Width
• Support function: S_C(d) = sup_{x∈C} ⟨d, x⟩
• S_C(d) + S_C(−d) measures the width of C when projected onto the span of d
• Mean width: w(C) = ∫_{S^{p−1}} S_C(u) du
• When does a random subspace U in ℝ^p intersect a convex cone C only at the origin?
• Gordon (1988): with high probability if
codim(U) ≥ p · w(C ∩ S^{p−1})²
where w(C ∩ S^{p−1}) = ∫_{S^{p−1}} S_{C ∩ S^{p−1}}(u) du is the mean width.
• Corollary: for inverse problems, if Φ is a random Gaussian matrix with n rows, we need
n ≥ p · w(T_A(x) ∩ S^{p−1})²
for exact recovery of x.
Rates
• Hypercube: n ≥ p/2
• Sparse vectors, p-dimensional, sparsity s:
n ≥ 2s log(p/s) + (5/4)s
• Block sparse, M groups (possibly overlapping), maximum group size B, k active groups:
n ≥ k (√(2 log(M − k)) + √B)²
• Low-rank matrices, p1 × p2 (p1 ≤ p2), rank r:
n ≥ 3r(p1 + p2 − r)
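The rates above are easy to evaluate numerically; a small sketch (the example dimensions are arbitrary, and the expressions simply transcribe the hypercube, sparse, and low-rank rates as listed on this slide):

```python
import numpy as np

def n_hypercube(p):
    return p / 2

def n_sparse(p, s):
    return 2 * s * np.log(p / s) + 5 * s / 4

def n_low_rank(p1, p2, r):
    return 3 * r * (p1 + p2 - r)

print(n_hypercube(p=60))                    # matches the earlier +/-1 example
print(n_sparse(p=10_000, s=100))            # sparse vector in 10,000 dimensions
print(n_low_rank(p1=200, p2=300, r=5))      # rank-5 matrix of size 200 x 300
```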
Robust Recovery (deterministic)
• Suppose we observe y = Φx + w with ‖w‖₂ ≤ δ
minimize ‖z‖_A subject to ‖Φz − y‖ ≤ δ
• If x̂ is an optimal solution, then ‖x − x̂‖₂ ≤ 2δ/ε provided that
n ≥ p · w(T_A(x) ∩ S^{p−1})² / (1 − ε)²
(Figure: the tube {z : ‖Φz − y‖ ≤ δ} intersected with the sublevel set {z : ‖z‖_A ≤ ‖x‖_A}.)
Robust Recovery (statistical)
• Suppose we observe y = Φx + w
minimize ‖Φz − y‖₂² + μ‖z‖_A
• If x̂ is an optimal solution, then ‖Φx − Φx̂‖₂ ≤ μ‖x‖_A provided that μ ≥ E_w[‖Φ*w‖*_A]
• And under an additional “cone condition” on cone{ u : ‖x + u‖_A ≤ ‖x‖_A + γ‖u‖ },
‖x − x̂‖₂ ≤ η(x, A, Φ, γ) · μ
Bhaskar, Tang, and Recht 2011
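A hedged sketch of the simplest instance of this regularized estimator: ℓ1 atoms with identity measurements, where the minimizer of ½‖z − y‖₂² + μ‖z‖₁ is soft-thresholding and μ is set near E‖w‖_∞ ≈ σ√(2 log p). The sizes, sparsity, signal level, and choice of μ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, s, sigma = 10_000, 50, 1.0
x_star = np.zeros(p); x_star[:s] = 10.0
y = x_star + sigma * rng.standard_normal(p)

mu = sigma * np.sqrt(2 * np.log(p))                    # roughly E[||w||_inf]
x_hat = np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)   # soft-thresholding

print("per-coordinate MSE :", np.mean((x_hat - x_star) ** 2))
print("sigma^2 s log(p)/p :", sigma**2 * s * np.log(p) / p)   # the rate on the next slide
```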
Denoising Rates (re-derivations)
• Sparse vectors, p-dimensional, sparsity s:
(1/p) ‖x̂ − x⋆‖₂² = O(σ² s log(p) / p)
• Low-rank matrices, p1 × p2 (p1 < p2), rank r:
(1/(p1 p2)) ‖X̂ − X⋆‖_F² = O(σ² r / p1)
Atomic Norm Minimization
Chandrasekaran, Recht, Parrilo, and Willsky 2010
IDEA: minimize ‖z‖_A subject to Φz = y
• Generalizes existing, powerful methods
• Rigorous formula for developing new algorithms and analysis
• Tightest bounds on the number of measurements needed for model recovery in all common models
• One algorithm prototype for many data-mining applications
Learning representations
|⟨Φx, Φz⟩| ≈ |⟨x, z⟩|
ASSUME:
• very sparse vectors: support s < N^{1/2}/log(N)
• very incoherent dictionary (much more than RIP)
• number of observations is much bigger than N
APPROACH:
• the Gram matrix of the y vectors indicates overlapping support
• use graph algorithms to identify single dictionary elements one at a time
Arora, Ge, and Moitra; Agarwal, Anandkumar, and Netrapalli
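A hedged sketch of the first step only: with an incoherent dictionary and very sparse codes, |⟨y_i, y_j⟩| tends to be large when the codes' supports overlap, so the Gram matrix of the observations exposes an "overlap graph" that graph algorithms can then mine. The dictionary, sparsity, and sizes below are illustrative, and this is not the full algorithm from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, s, m = 200, 100, 3, 400           # dictionary size, dimension, sparsity, #samples
A = rng.standard_normal((p, N))
A /= np.linalg.norm(A, axis=0)          # roughly incoherent dictionary

supports = [rng.choice(N, s, replace=False) for _ in range(m)]
Y = np.stack([A[:, S] @ rng.standard_normal(s) for S in supports], axis=1)

G = np.abs(Y.T @ Y)                     # Gram matrix of the observations
overlap = np.array([[len(np.intersect1d(Si, Sj)) > 0 for Sj in supports]
                    for Si in supports])
off_diag = ~np.eye(m, dtype=bool)

print("mean |<y_i,y_j>|, overlapping supports:", G[overlap & off_diag].mean())
print("mean |<y_i,y_j>|, disjoint supports   :", G[~overlap & off_diag].mean())
```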
Extended representations
C = π(K ∩ L)
convex body = linear map of (cone ∩ affine space)
C = π(K ∩ L): examples
• ℓ1 ball (cross-polytope): K = ℝ^{2d}_+, L = { y : Σ_{i=1}^{2d} y_i = 1 }, π = [I  −I]
• K = S^{d1+d2}_+, L = { Z : trace(Z) = 1 }, π([[A, B], [Bᵀ, C]]) = B
• Hypercube: K = ℝ^{2d}_+, L = { y : y_i + y_{i+d} = 1, 1 ≤ i ≤ d }, π = [I  −I]
• K = S^{d+1}_+, L = { Z = [[T, x], [xᵀ, u]] : T Toeplitz, T_11 = u = 1 }, π([[T, x], [xᵀ, u]]) = x
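A quick numerical check of the first lift above: points of the simplex {y ≥ 0, Σ y_i = 1} in ℝ^{2d}, pushed through π = [I  −I], land in the ℓ1 ball, and the simplex vertices map to the ±e_i vertices. The dimension d is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
pi = np.hstack([np.eye(d), -np.eye(d)])     # the linear map [I  -I]

for _ in range(5):
    y = rng.random(2 * d)
    y /= y.sum()                            # a point of K ∩ L (nonnegative, sums to 1)
    x = pi @ y
    print("||x||_1 <= 1:", np.abs(x).sum() <= 1 + 1e-12)

print(pi @ np.eye(2 * d))                   # simplex vertices -> +/- standard basis vectors
```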
Extended representations
C = π(K ∩ L): convex body = linear map of (cone ∩ affine space)
• Polar body: C* = { y : ⟨x, y⟩ ≤ 1 for all x ∈ C }
• C has a lift into K if there are maps A : C → K and B : C* → K* such that
1 − ⟨x, y⟩ = ⟨A(x), B(y)⟩
for all extreme points x ∈ C and y ∈ C*
• Representation learning becomes matrix factorization
Gouveia, Parrilo, and Thomas