

  1. Latent Structure Beyond Sparse Codes. Benjamin Recht, Department of EECS and Statistics, University of California, Berkeley.

  2. Sparse Codes. [Figure: learned Gabor-like filters at 2.5x redundancy; redundancy, robustness, and sparsity.] Which mathematical representations can be learned robustly?

  3. Sparse Approximation. Lasso: the number of patients is far smaller than the number of peaks; if very few peaks are needed for diagnosis, search for a sparse set of markers. Compressed Sensing: use the fact that images are sparse in a wavelet basis to reduce the number of measurements required for signal acquisition.

  4. Cardinality Minimization. PROBLEM: find the vector of lowest cardinality that satisfies/approximates the underdetermined linear system Φx = y, where Φ : ℝ^p → ℝ^n with n < p. NP-HARD: reduces to EXACT-COVER; hard to approximate; known exact algorithms require enumeration. HEURISTIC: replace cardinality with the ℓ1 norm.
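
To make the heuristic concrete, here is a minimal sketch of the ℓ1 relaxation (basis pursuit) posed as a linear program; the problem sizes and random data are illustrative assumptions, not from the talk.

    # A minimal sketch of the l1 heuristic for cardinality minimization
    # (basis pursuit), posed as a linear program.
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, p, s = 40, 100, 5                    # measurements, dimension, sparsity

    Phi = rng.standard_normal((n, p))       # random Gaussian measurement map
    x_true = np.zeros(p)
    x_true[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
    y = Phi @ x_true

    # min ||x||_1 s.t. Phi x = y, via the split x = xp - xm with xp, xm >= 0:
    # minimize 1'(xp + xm) subject to [Phi, -Phi][xp; xm] = y.
    c = np.ones(2 * p)
    A_eq = np.hstack([Phi, -Phi])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    x_hat = res.x[:p] - res.x[p:]

    print("recovery error:", np.linalg.norm(x_hat - x_true))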

  5. Rank structure across applications: Recommender Systems (rank of the data matrix); Geometric Structure (rank of the Gram matrix); Quantum Tomography (rank of the density matrix); Seismic Imaging (rank of the unfolded tensor).

  6. Affine Rank Minimization. PROBLEM: find the matrix of lowest rank that satisfies/approximates the underdetermined linear system Φ(X) = y, where Φ : ℝ^{p1×p2} → ℝ^n. NP-HARD: reduces to solving polynomial equations; hard to approximate; exact algorithms are awful. HEURISTIC: replace rank with the nuclear norm.

  7. Heuristic: Gradient Descent. IDEA: replace rank with the nuclear norm, minimize ‖X‖_* subject to Φ(X) = b, and factor the p1 × p2 variable as X = LR with L of size p1 × r and R of size r × p2. Step 1: pick (i, j) and compute the residual e = ⟨L_i, R_j⟩ − M_ij. Step 2: take a mixture of the current model and the corrected model (α, β > 0): L_i ← αL_i − βeR_j, R_j ← αR_j − βeL_i. (Some guy on livejournal, 2006; Fazel, Parrilo, Recht, 2007.) Succeeds when the number of samples is Õ(r(p1 + p2)): Candès and Recht, 2008.
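
A minimal sketch of this stochastic update for matrix completion, following the two steps on the slide with the mixing factor α fixed to 1; the rank, step size, sampling rate, and random data are illustrative assumptions.

    # Stochastic-gradient matrix-factorization sketch: fit X = L R to the
    # observed entries M_ij by the residual-correction update above.
    import numpy as np

    rng = np.random.default_rng(0)
    p1, p2, r = 50, 60, 3
    M = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))  # rank-r target
    observed = [(i, j) for i in range(p1) for j in range(p2) if rng.random() < 0.4]

    L = 0.1 * rng.standard_normal((p1, r))
    R = 0.1 * rng.standard_normal((r, p2))
    beta = 0.05                                   # step size (alpha = 1 here)

    for epoch in range(50):
        for k in rng.permutation(len(observed)):
            i, j = observed[k]
            e = L[i, :] @ R[:, j] - M[i, j]       # residual at entry (i, j)
            Li = L[i, :].copy()
            L[i, :] -= beta * e * R[:, j]         # corrected model for row factor
            R[:, j] -= beta * e * Li              # corrected model for column factor

    print("relative error:", np.linalg.norm(L @ R - M) / np.linalg.norm(M))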

  8. System Identification. Observe a time series y_1, y_2, ..., y_T driven by the input u_1, u_2, ..., u_T (Na et al., 2012). What is a principled way to build a parsimonious model for the input-output responses? System identification: find a dynamical model that agrees with the time series data. All linear systems are combinations of single-pole filters; leverage this structure for new algorithms and analysis. Shah, Bhaskar, Tang, and Recht, 2012.

  9. Linear Inverse Problems. Find a solution of y = Φx, where Φ is n × p with n < p. Of the infinite collection of solutions, which one should we pick? Leverage structure: sparsity, rank, smoothness, symmetry. How do we design algorithms to solve underdetermined systems with such priors?

  10. Sparsity. The 1-sparse vectors of Euclidean norm 1 are the signed standard basis vectors ±e_i; their convex hull is the unit ball of the ℓ1 norm, ‖x‖_1 = Σ_{i=1}^p |x_i|.

  11. minimize ‖x‖_1 subject to Φx = y. [Figure: in the (x_1, x_2) plane, the ℓ1 ball meets the affine space {x : Φx = y} at a sparse vertex.] Compressed Sensing: Candès, Romberg, Tao, Donoho, Tanner, etc.

  12. Rank. 2×2 symmetric matrices plotted in 3D: the rank-1 matrices of unit Euclidean norm trace out the surface x² + z² + 2y² = 1.

  13. Rank. 2×2 symmetric matrices plotted in 3D, rank-1 surface x² + z² + 2y² = 1. Convex hull: the unit ball of the nuclear norm, ‖X‖_* = Σ_i σ_i(X).

  14. Nuclear Norm Heuristic. 2×2 matrices plotted in 3D; ‖X‖_* = Σ_i σ_i(X) serves as the convex surrogate for rank in rank minimization/matrix completion. Fazel 2002; Recht, Fazel, and Parrilo 2007.
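
A minimal sketch of the nuclear norm heuristic for matrix completion, written with cvxpy's normNuc atom; the matrix sizes, rank, and 80% sampling rate are illustrative assumptions.

    # Nuclear norm heuristic: min ||X||_* s.t. X agrees with M on observed entries.
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    p1, p2, r = 15, 20, 2
    M = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))  # rank-r matrix
    mask = rng.random((p1, p2)) < 0.8          # observe ~80% of the entries

    X = cp.Variable((p1, p2))
    constraints = [X[i, j] == M[i, j]
                   for i in range(p1) for j in range(p2) if mask[i, j]]
    cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()

    print("relative error:", np.linalg.norm(X.value - M) / np.linalg.norm(M))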

  15. Integer Programming. Integer solutions: all components of x are ±1, i.e., the vertices (±1, ±1). Their convex hull is the unit ball of the ℓ∞ norm (the solid square).

  16. minimize ‖x‖_∞ subject to Φx = y. [Figure: in the (x_1, x_2) plane, the ℓ∞ ball meets the affine space {x : Φx = y} at a vertex with ±1 entries.] Donoho and Tanner 2008; Mangasarian and Recht 2009.
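
The ℓ∞ minimization above is also a linear program: introduce a scalar t and minimize it subject to −t ≤ x_i ≤ t and Φx = y. A minimal sketch, with illustrative sizes and data:

    # l-infinity heuristic for recovering a +/-1 sign vector, as an LP
    # in the variables z = [x; t].
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    n, p = 60, 80                                 # note n >= p/2 (see slide 25)
    Phi = rng.standard_normal((n, p))
    x_true = rng.choice([-1.0, 1.0], size=p)      # a sign vector to recover
    y = Phi @ x_true

    c = np.concatenate([np.zeros(p), [1.0]])      # objective: t
    A_ub = np.block([[np.eye(p), -np.ones((p, 1))],    #  x_i - t <= 0
                     [-np.eye(p), -np.ones((p, 1))]])  # -x_i - t <= 0
    b_ub = np.zeros(2 * p)
    A_eq = np.hstack([Phi, np.zeros((n, 1))])     # Phi x = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * p + [(0, None)])
    x_hat = res.x[:p]
    print("max deviation from +/-1:", np.abs(np.abs(x_hat) - 1).max())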

  17. Parsimonious Models. A model is a linear combination of atoms with weights: x = Σ_a c_a a. Search for the best linear combination of the fewest atoms; here "rank" = the fewest atoms needed to describe the model.

  18. Atomic Norms. Given a basic set of atoms A, define the function ‖x‖_A = inf{t > 0 : x ∈ t·conv(A)}. When A is centrosymmetric, we get a norm: ‖x‖_A = inf{Σ_{a∈A} |c_a| : x = Σ_{a∈A} c_a a}. IDEA: minimize ‖z‖_A subject to Φz = y. When can we compute this? When does this work?
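
For a finite, centrosymmetric atom set the second formula is directly computable as a linear program: with ±a both in A we can take the weights nonnegative. A minimal sketch; the signed-basis atom set (which makes ‖·‖_A the ℓ1 norm) is an illustrative choice so the answer is easy to check.

    # ||x||_A = min sum c_a  s.t.  x = sum c_a a,  c >= 0, for atoms closed
    # under negation.
    import numpy as np
    from scipy.optimize import linprog

    def atomic_norm(x, atoms):
        """atoms: p x m matrix whose columns are the atoms (closed under negation)."""
        p, m = atoms.shape
        res = linprog(np.ones(m), A_eq=atoms, b_eq=x, bounds=(0, None))
        return res.fun if res.success else np.inf

    p = 5
    atoms = np.hstack([np.eye(p), -np.eye(p)])   # +/- e_i: conv(A) is the l1 ball
    x = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
    print(atomic_norm(x, atoms), np.abs(x).sum())  # both should equal 3.5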

  19. Union of Subspaces: x has structured sparsity, i.e., it is a linear combination of elements from a set of subspaces {U_g}; the atomic set is the unit-norm vectors living in one of the U_g. Permutations and Rankings: X is a sum of a few permutation matrices; examples include multi-object tracking, ranked elections, and the BCS. Convex hull of the permutation matrices: the doubly stochastic matrices.

  20. Moments: convex hull of [1, t, t², t³, t⁴, ...], t ∈ T, some basic set; used in system identification, image processing, numerical integration, and statistical inference; solve with semidefinite programming. Cut matrices: sums of rank-one sign matrices; used in collaborative filtering, clustering in genetic networks, and combinatorial approximation algorithms; approximate with semidefinite programming. Low-rank tensors: sums of rank-one tensors; used in computer vision, image processing, hyperspectral imaging, and neuroscience; approximate with alternating least squares.

  21. Atomic norms in sparse approximation. Greedy approximations: ‖f − f_n‖_{L2} ≤ c_0 ‖f‖_A / √n, where f_n is the best n-term approximation to a function f in the convex hull of A. Maurey, Jones, and Barron (1980s-90s); DeVore and Temlyakov (1996); random feature heuristics (Rahimi and Recht, 2007).
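
A minimal greedy sketch in this spirit: a Frank-Wolfe (conditional gradient) loop, which after n steps holds a combination of at most n atoms. The slide does not fix an algorithm, so Frank-Wolfe is a swapped-in standard choice; the atom set, target, radius τ, and step rule are illustrative assumptions.

    # Greedy n-term approximation of f over the ball {||z||_A <= tau} for a
    # finite atom set, by Frank-Wolfe on 0.5*||z - f||^2.
    import numpy as np

    rng = np.random.default_rng(2)
    p, m, tau = 30, 200, 4.0
    atoms = rng.standard_normal((p, m))
    atoms /= np.linalg.norm(atoms, axis=0)        # unit-norm atoms
    coeffs = np.array([1.0, -0.8, 0.6, -0.4, 0.5])  # ||f||_A <= 3.3 < tau
    f = atoms[:, :5] @ coeffs

    f_n = np.zeros(p)
    for n in range(1, 51):
        grad = f_n - f                            # gradient of 0.5*||f_n - f||^2
        scores = atoms.T @ grad
        k = np.argmax(np.abs(scores))             # linear minimization: best atom
        s = -np.sign(scores[k]) * tau * atoms[:, k]
        gamma = 2.0 / (n + 1)                     # standard Frank-Wolfe step size
        f_n = (1 - gamma) * f_n + gamma * s       # combination of <= n atoms
    print("residual after 50 atoms:", np.linalg.norm(f_n - f))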

  22. Tangent Cones. The set of directions that decrease the norm from x forms a cone: T_A(x) = {d : ‖x + αd‖_A ≤ ‖x‖_A for some α > 0}. For minimize ‖z‖_A subject to Φz = y, x is the unique minimizer if the intersection of this cone with the null space of Φ equals {0}.

  23. Mean Width. Support function: S_C(d) = sup_{x∈C} ⟨d, x⟩; the quantity S_C(d) + S_C(−d) measures the width of C when projected onto the span of d. Mean width: w(C) = ∫_{S^{p−1}} S_C(u) du.
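
The mean width is easy to estimate by Monte Carlo: average the support function over random directions on the sphere. A minimal sketch using the uniform probability measure on S^{p−1} (normalization conventions vary); the ℓ1 ball is an illustrative choice of C, for which S_C(u) = ‖u‖_∞ in closed form.

    # Monte Carlo estimate of the mean width of the l1 ball in R^p.
    import numpy as np

    rng = np.random.default_rng(3)
    p, trials = 50, 20000

    u = rng.standard_normal((trials, p))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # uniform directions on S^{p-1}

    S = np.abs(u).max(axis=1)                       # S_C(u) = ||u||_inf for the l1 ball
    print("estimated mean width of the l1 ball in R^50:", S.mean())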

  24. When does a random subspace U in ℝ^p intersect a convex cone C only at the origin? Gordon (1988): with high probability, if codim(U) ≥ p·w(C ∩ S^{p−1})², where w(C ∩ S^{p−1}) = ∫_{S^{p−1}} S_C(u) du is the mean width. Corollary: for inverse problems, if Φ is a random Gaussian matrix with n rows, we need n ≥ p·w(T_A(x) ∩ S^{p−1})² for exact recovery of x.

  25. Rates. Hypercube: n ≥ p/2. Sparse vectors (dimension p, sparsity s): n ≥ 2s·log(p/s) + 5s/4. Block sparse (M groups, possibly overlapping, maximum group size B, k active groups): n ≥ k(√(2·log(M − k)) + √B)². Low-rank matrices (p1 × p2, p1 < p2, rank r): n ≥ 3r(p1 + p2 − r).
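
A minimal sketch that evaluates these sample-count rates at concrete sizes; the numbers plugged in are illustrative, not from the talk.

    # Evaluate the measurement bounds from this slide.
    import numpy as np

    def n_sparse(p, s):
        return 2 * s * np.log(p / s) + 1.25 * s          # 2 s log(p/s) + 5s/4

    def n_block_sparse(M, B, k):
        return k * (np.sqrt(2 * np.log(M - k)) + np.sqrt(B)) ** 2

    def n_low_rank(p1, p2, r):
        return 3 * r * (p1 + p2 - r)

    print("sparse, p=10000, s=100:   ", int(n_sparse(10_000, 100)))
    print("block,  M=1000, B=10, k=5:", int(n_block_sparse(1000, 10, 5)))
    print("matrix, 500 x 600, r=10:  ", int(n_low_rank(500, 600, 10)))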

  26. Robust Recovery (deterministic). Suppose we observe y = Φx + w with ‖w‖_2 ≤ δ, and solve: minimize ‖z‖_A subject to ‖Φz − y‖ ≤ δ. If x̂ is an optimal solution, then ‖x − x̂‖ ≤ 2δ/ε, provided that n ≥ p·w(T_A(x) ∩ S^{p−1})²/(1 − ε)².

  27. Robust Recovery (statistical). Suppose we observe y = Φx + w and solve: minimize ½‖Φz − y‖² + μ‖z‖_A. If x̂ is an optimal solution, then ‖Φx − Φx̂‖² ≤ μ‖x‖_A, provided that μ ≥ E_w[‖Φ*w‖_A*] (the dual atomic norm of the mapped noise). Under an additional "cone condition" on cone{u : ‖x + u‖_A ≤ ‖x‖_A + γ‖u‖}, also ‖x − x̂‖_2 ≤ η(x, A, Φ, γ)·μ. Bhaskar, Tang, and Recht 2011.

  28. Denoising Rates (re-derivations). Sparse vectors (dimension p, sparsity s): (1/p)‖x̂ − x*‖_2² = O(σ²·s·log(p)/p). Low-rank matrices (p1 × p2, p1 < p2, rank r): (1/(p1·p2))‖x̂ − x*‖_F² = O(σ²·r/p1).
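
In the sparse-vector case with Φ = I, the regularized problem from slide 27 has a closed-form solution: soft thresholding, the prox of μ‖·‖_1. A minimal sketch checking the quoted rate; the level μ ≈ σ√(2 log p) (approximating the dual norm E‖w‖_∞) and all sizes are illustrative assumptions.

    # Atomic norm denoising, l1 case: min 0.5*||z - y||^2 + mu*||z||_1.
    import numpy as np

    rng = np.random.default_rng(4)
    p, s, sigma = 10_000, 100, 1.0

    x_star = np.zeros(p)
    x_star[rng.choice(p, s, replace=False)] = 10.0
    y = x_star + sigma * rng.standard_normal(p)          # y = x* + w

    mu = sigma * np.sqrt(2 * np.log(p))                  # ~ dual-norm level of the noise
    x_hat = np.sign(y) * np.maximum(np.abs(y) - mu, 0.0) # soft threshold = prox of mu*||.||_1

    mse = np.mean((x_hat - x_star) ** 2)
    print("per-coordinate MSE:", mse,
          " vs sigma^2 s log(p)/p =", sigma**2 * s * np.log(p) / p)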

  29. Atomic Norm Minimization (Chandrasekaran, Recht, Parrilo, and Willsky 2010). IDEA: minimize ‖z‖_A subject to Φz = y. Generalizes existing, powerful methods; gives a rigorous formula for developing new algorithms and analysis; gives the tightest bounds on the number of measurements needed for model recovery in all common models; one algorithm prototype serves many data-mining applications.

  30. Learning representations. ASSUME: very sparse vectors (support s < N^{1/2}/log(N)); a very incoherent dictionary (much more than RIP); the number of observations is much bigger than N. Since |⟨Φx, Φz⟩| ≈ |⟨x, z⟩|, the Gram matrix of the observed y vectors indicates overlapping support; use graph algorithms to identify single dictionary elements one at a time. Arora, Ge, and Moitra; Agarwal, Anandkumar, and Netrapalli.
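
A minimal sketch of the Gram-matrix idea only: for sparse samples y_i = Φx_i with an incoherent Φ, |⟨y_i, y_j⟩| tends to be large exactly when the supports of x_i and x_j overlap, so thresholding the Gram matrix gives an overlap graph. The fixed threshold is a crude stand-in for the graph-based tests in the cited papers, and all sizes are illustrative.

    # Build an overlap graph from the Gram matrix of sparse observations.
    import numpy as np

    rng = np.random.default_rng(5)
    n, N, s, m = 100, 200, 3, 300            # signal dim, dict size, sparsity, samples

    Phi = rng.standard_normal((n, N)) / np.sqrt(n)   # incoherent random dictionary
    X = np.zeros((N, m))
    for i in range(m):
        X[rng.choice(N, s, replace=False), i] = rng.standard_normal(s)
    Y = Phi @ X

    G = np.abs(Y.T @ Y)                      # Gram matrix of the observations
    np.fill_diagonal(G, 0.0)
    overlap_graph = G > 0.5                  # edges ~ samples sharing a dict element

    true_overlap = (np.abs(X).T @ np.abs(X)) > 0
    np.fill_diagonal(true_overlap, False)
    print("edge agreement with true support overlap:",
          (overlap_graph == true_overlap).mean())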

  31. Extended representations. C = π(K ∩ L): a convex body C expressed as the linear map π of the intersection of a cone K with an affine space L.

  32. Examples of C = π(K ∩ L):
  • ℓ1 ball (cross-polytope): K = ℝ^{2d}_+, L = {y : Σ_{i=1}^{2d} y_i = 1}, π(y) = [I  −I]·y.
  • Nuclear-norm ball: K = S^{d1+d2}_+, L = {Z : trace(Z) = 1}, π([[A, B], [Bᵀ, C]]) = B.
  • Hypercube with vertices (±1, ..., ±1): K = ℝ^{2d}_+, L = {y : y_i + y_{i+d} = 1, 1 ≤ i ≤ d}, π(y) = [I  −I]·y.
  • Moment example: K = S^{d+1}_+, L = {Z = [[T, x], [xᵀ, u]] : T Toeplitz, T_{11} = u = 1}, π(Z) = x.
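
A minimal numerical check of the first lift above: pushing the simplex {y ≥ 0, Σy_i = 1} in ℝ^{2d} through π(y) = [I  −I]·y lands inside the ℓ1 ball, and the vertices ±e_i are images of simplex vertices. The Dirichlet sampling scheme is an illustrative assumption.

    # Verify pi(K ∩ L) ⊆ {||x||_1 <= 1} for the cross-polytope lift.
    import numpy as np

    rng = np.random.default_rng(6)
    d, trials = 4, 10_000
    pi = np.hstack([np.eye(d), -np.eye(d)])          # the linear map [I, -I]

    y = rng.dirichlet(np.ones(2 * d), size=trials)   # points of K ∩ L (the simplex)
    x = y @ pi.T
    print("max l1 norm over samples (should be <= 1):",
          np.abs(x).sum(axis=1).max())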

  33. Extended representations. C = π(K ∩ L): linear map π, cone K, affine space L. Polar body: C* = {y : ⟨x, y⟩ ≤ 1 for all x ∈ C}. C has a lift into K if there are maps A : C → K and B : C* → K* such that 1 − ⟨x, y⟩ = ⟨A(x), B(y)⟩ for all extreme points x ∈ C and y ∈ C*. Representation learning becomes matrix factorization. Gouveia, Parrilo, and Thomas.
