

  1. Structured sparse methods for matrix factorization Francis Bach Sierra team, INRIA - École Normale Supérieure March 2011 Joint work with R. Jenatton, J. Mairal, G. Obozinski

  2. Structured sparse methods for matrix factorization Outline • Learning problems on matrices • Sparse methods for matrices – Sparse principal component analysis – Dictionary learning • Structured sparse PCA – Sparsity-inducing norms and overlapping groups – Structure on dictionary elements – Structure on decomposition coefficients

  3. Learning on matrices - Collaborative filtering • Given n_X “movies” x ∈ X and n_Y “customers” y ∈ Y • Predict the “rating” z(x, y) ∈ Z of customer y for movie x • Training data: large n_X × n_Y incomplete matrix Z that describes the known ratings of some customers for some movies • Goal: complete the matrix. [Figure: partially observed rating matrix with entries in {1, 2, 3}]
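As a concrete illustration of the completion task above (not part of the slides), here is a minimal NumPy sketch of one common approach: alternating ridge regressions on the observed entries of a rank-k factorization Z ≈ U Vᵀ. All names (`als_complete`, `mask`, the parameter values) are illustrative.

```python
import numpy as np

def als_complete(Z, mask, k=5, lam=0.1, n_iters=50, seed=0):
    """Complete a partially observed matrix Z (mask[i, j] = True where the rating is known)
    with a rank-k factorization Z ≈ U V^T, by alternating regularized least squares."""
    rng = np.random.default_rng(seed)
    n_x, n_y = Z.shape
    U = rng.standard_normal((n_x, k))
    V = rng.standard_normal((n_y, k))
    for _ in range(n_iters):
        # Update each row of U from the ratings observed in that row.
        for i in range(n_x):
            obs = mask[i]
            if obs.any():
                Vo = V[obs]
                U[i] = np.linalg.solve(Vo.T @ Vo + lam * np.eye(k), Vo.T @ Z[i, obs])
        # Update each row of V symmetrically, from the ratings observed in that column.
        for j in range(n_y):
            obs = mask[:, j]
            if obs.any():
                Uo = U[obs]
                V[j] = np.linalg.solve(Uo.T @ Uo + lam * np.eye(k), Uo.T @ Z[obs, j])
    return U @ V.T  # completed matrix
```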

  4. Learning on matrices - Image denoising • Simultaneously denoise all patches of a given image • Example from Mairal, Bach, Ponce, Sapiro, and Zisserman (2009c)

  5. Learning on matrices - Source separation • Single microphone (Benaroya et al., 2006; Févotte et al., 2009)

  6. Learning on matrices - Multi-task learning • k linear prediction tasks on the same covariates x ∈ R^p – k weight vectors w_j ∈ R^p – Joint matrix of predictors W = (w_1, . . . , w_k) ∈ R^{p×k} • Classical applications – Transfer learning – Multi-category classification (one task per class) (Amit et al., 2007) • Share parameters between tasks – Joint variable or feature selection (Obozinski et al., 2009; Pontil et al., 2007)
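One way to implement the "joint variable or feature selection" idea above is an ℓ1/ℓ2 penalty on the rows of W, which zeroes out the same covariates for every task. The sketch below uses scikit-learn's MultiTaskLasso as a stand-in for the cited methods, on synthetic data with illustrative parameter values.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, p, k = 100, 30, 4                         # samples, covariates, tasks
W_true = np.zeros((p, k))
W_true[:5] = rng.standard_normal((5, k))     # only 5 features are relevant, shared by all tasks
X = rng.standard_normal((n, p))
Y = X @ W_true + 0.1 * rng.standard_normal((n, k))

# The l1/l2 penalty sum_j ||W_j||_2 removes entire rows of W at once,
# i.e. the same covariates are discarded for every task.
model = MultiTaskLasso(alpha=0.1).fit(X, Y)
W_hat = model.coef_.T                        # shape (p, k)
print("selected rows:", np.flatnonzero(np.linalg.norm(W_hat, axis=1) > 1e-8))
```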

  7. Learning on matrices - Dimension reduction • Given data matrix X = (x_1^⊤, . . . , x_n^⊤) ∈ R^{n×p} – Principal component analysis: x_i ≈ D α_i – K-means: x_i ≈ d_k ⇒ X = DA

  8. Sparsity in machine learning • Assumption: y = w^⊤ x + ε, with w ∈ R^p sparse – Proxy for interpretability – Allows high-dimensional inference: log p = O(n) • Sparsity and convexity (ℓ1-norm regularization): min_{w ∈ R^p} L(w) + λ ||w||_1 [Figure: ℓ2 ball and ℓ1 ball in the (w_1, w_2) plane]
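A small self-contained illustration of the point above (not from the slides): ℓ1-regularized least squares recovering a sparse w when p ≫ n, using scikit-learn's Lasso. The sizes and the regularization strength are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                                   # high-dimensional regime: p >> n
w_true = np.zeros(p)
w_true[:5] = [1.0, -2.0, 1.5, 0.5, -1.0]         # sparse ground truth
X = rng.standard_normal((n, p))
y = X @ w_true + 0.05 * rng.standard_normal(n)

# l1-regularized least squares: min_w ||y - Xw||^2 / (2n) + alpha * ||w||_1
lasso = Lasso(alpha=0.05).fit(X, y)
print("non-zero coefficients recovered at indices:", np.flatnonzero(lasso.coef_))
```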

  9. Two types of sparsity for matrices M ∈ R^{n×p} I - Directly on the elements of M • Many zero elements: M_{ij} = 0 • Many zero rows (or columns): (M_{i1}, . . . , M_{ip}) = 0

  10. Two types of sparsity for matrices M ∈ R^{n×p} II - Through a factorization of M • Matrix M = UV^⊤, U ∈ R^{n×k} and V ∈ R^{p×k} • Low rank: k small [diagram: M = U V^⊤ with few columns] • Sparse decomposition: U sparse [diagram: M = U V^⊤ with U sparse]

  11. Structured (sparse) matrix factorizations • Matrix M = UV^⊤, U ∈ R^{n×k} and V ∈ R^{p×k} • Structure on U and/or V – Low-rank: U and V have few columns – Dictionary learning / sparse PCA: U has many zeros – Clustering (k-means): U ∈ {0, 1}^{n×k}, U1 = 1 – Pointwise positivity: non-negative matrix factorization (NMF) – Specific patterns of zeros – Low-rank + sparse (Candès et al., 2009) – etc. • Many applications • Many open questions: algorithms, identifiability, evaluation

  12. Sparse principal component analysis • Given data X = (x_1^⊤, . . . , x_n^⊤) ∈ R^{p×n}, two views of PCA: – Analysis view: find the projection d ∈ R^p of maximum variance (with deflation to obtain more components) – Synthesis view: find the basis d_1, . . . , d_k such that all x_i have low reconstruction error when decomposed on this basis • For regular PCA, the two views are equivalent

  13. Sparse principal component analysis • Given data X = (x_1^⊤, . . . , x_n^⊤) ∈ R^{p×n}, two views of PCA: – Analysis view: find the projection d ∈ R^p of maximum variance (with deflation to obtain more components) – Synthesis view: find the basis d_1, . . . , d_k such that all x_i have low reconstruction error when decomposed on this basis • For regular PCA, the two views are equivalent • Sparse (and/or non-negative) extensions – Interpretability – High-dimensional inference – The two views are different – For the analysis view, see d'Aspremont, Bach, and El Ghaoui (2008)

  14. Sparse principal component analysis Synthesis view • Find d_1, . . . , d_k ∈ R^p sparse so that Σ_{i=1}^n min_{α_i ∈ R^k} || x_i − Σ_{j=1}^k (α_i)_j d_j ||_2^2 = Σ_{i=1}^n min_{α_i ∈ R^k} || x_i − D α_i ||_2^2 is small – Look for A = (α_1, . . . , α_n) ∈ R^{k×n} and D = (d_1, . . . , d_k) ∈ R^{p×k} such that D is sparse and ||X − DA||_F^2 is small

  15. Sparse principal component analysis Synthesis view • Find d_1, . . . , d_k ∈ R^p sparse so that Σ_{i=1}^n min_{α_i ∈ R^k} || x_i − Σ_{j=1}^k (α_i)_j d_j ||_2^2 = Σ_{i=1}^n min_{α_i ∈ R^k} || x_i − D α_i ||_2^2 is small – Look for A = (α_1, . . . , α_n) ∈ R^{k×n} and D = (d_1, . . . , d_k) ∈ R^{p×k} such that D is sparse and ||X − DA||_F^2 is small • Sparse formulation (Witten et al., 2009; Bach et al., 2008) – Penalize/constrain d_j by the ℓ1-norm for sparsity – Penalize/constrain α_i by the ℓ2-norm to avoid trivial solutions: min_{D,A} Σ_{i=1}^n ||x_i − D α_i||_2^2 + λ Σ_{j=1}^k ||d_j||_1 s.t. ∀ i, ||α_i||_2 ≤ 1
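For readers who want to try the synthesis view in practice, scikit-learn's SparsePCA solves a closely related formulation (ℓ1 penalty on the components, which play the role of the sparse dictionary D, with an ℓ2 constraint on the codes whose exact form differs slightly from the slide). A minimal, hedged usage sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))        # n = 100 observations in R^30

# Sparse components (the analogue of a sparse dictionary D),
# with sparsity controlled by the l1 penalty weight `alpha`.
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0).fit(X)
D = spca.components_                      # shape (5, 30), many exact zeros
codes = spca.transform(X)                 # shape (100, 5), the decomposition coefficients
print("fraction of zeros in the components:", np.mean(D == 0))
```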

  16. Sparse PCA vs. dictionary learning • Sparse PCA : x i ≈ D α i , D sparse

  17. Sparse PCA vs. dictionary learning • Sparse PCA : x i ≈ D α i , D sparse • Dictionary learning : x i ≈ D α i , α i sparse

  18. Structured matrix factorizations (Bach et al., 2008) • min_{D,A} Σ_{i=1}^n ||x_i − D α_i||_2^2 + λ Σ_{j=1}^k ||d_j||_⋆ s.t. ∀ i, ||α_i||_• ≤ 1 • min_{D,A} Σ_{i=1}^n ||x_i − D α_i||_2^2 + λ Σ_{i=1}^n ||α_i||_• s.t. ∀ j, ||d_j||_⋆ ≤ 1 • Optimization by alternating minimization (non-convex) • α_i: decomposition coefficients (or “code”), d_j: dictionary elements • Two related/equivalent problems: – Sparse PCA = sparse dictionary (ℓ1-norm on d_j) – Dictionary learning = sparse decompositions (ℓ1-norm on α_i) (Olshausen and Field, 1997; Elad and Aharon, 2006; Lee et al., 2007)
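To make the alternating-minimization scheme concrete, here is a toy NumPy/scikit-learn sketch of the dictionary-learning variant (ℓ1 penalty on the α_i, ℓ2 constraint on the d_j). It is a naive batch illustration under stated assumptions, not the authors' algorithm; all names and parameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def dictionary_learning(X, k=10, lam=0.1, n_iters=20, seed=0):
    """Toy alternating minimization for
       min_{D, A} ||X - D A||_F^2 + lam * sum_i ||a_i||_1  s.t.  ||d_j||_2 <= 1,
    the 'dictionary learning' problem of the slide (columns of X are the x_i)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, n))
    for _ in range(n_iters):
        # 1) Sparse coding: l1-penalized regression of each column of X on the fixed D.
        #    sklearn's Lasso minimizes ||x - D a||^2 / (2p) + alpha ||a||_1, hence alpha = lam / (2p).
        coder = Lasso(alpha=lam / (2 * p), fit_intercept=False, max_iter=2000)
        for i in range(n):
            A[:, i] = coder.fit(D, X[:, i]).coef_
        # 2) Dictionary update: least squares in D, then project columns onto the l2 unit ball.
        D = X @ np.linalg.pinv(A)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, A
```

Swapping which factor gets the ℓ1 penalty and which gets the ℓ2 constraint turns this into the sparse-PCA variant, exactly the symmetry the slide points out.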

  19. Dictionary learning for image denoising • y = x + ε (y: measurements, x: original image, ε: noise)

  20. Dictionary learning for image denoising • Solving the denoising problem (Elad and Aharon, 2006) – Extract all overlapping 8 × 8 patches x_i ∈ R^64 – Form the matrix X = (x_1^⊤, . . . , x_n^⊤) ∈ R^{n×64} – Solve a matrix factorization problem: min_{D,A} ||X − DA||_F^2 = min_{D,A} Σ_{i=1}^n ||x_i − D α_i||_2^2 where A is sparse, and D is the dictionary – Each patch is decomposed into x_i = D α_i – Average the reconstruction D α_i of each patch x_i to reconstruct a full-sized image • The number of patches n is large (= number of pixels)
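The patch-based pipeline above can be sketched with scikit-learn utilities (extract_patches_2d, MiniBatchDictionaryLearning, reconstruct_from_patches_2d). This is a rough stand-in for the cited method, with a random array in place of a real noisy image and no claim of matching the original parameters.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

# img: 2-D grayscale array in [0, 1]; here a random stand-in for a noisy image.
rng = np.random.default_rng(0)
img = rng.random((64, 64))

# 1) Extract all overlapping 8x8 patches and flatten them into the rows of X.
patches = extract_patches_2d(img, (8, 8))
X = patches.reshape(len(patches), -1)
mean = X.mean(axis=1, keepdims=True)
X = X - mean                                  # remove the DC component of each patch

# 2) Learn a dictionary D with sparse codes A (X ≈ codes @ D in sklearn's convention).
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1.0, random_state=0)
codes = dico.fit_transform(X)

# 3) Reconstruct each patch from its sparse code and average the overlapping reconstructions.
X_hat = codes @ dico.components_ + mean
denoised = reconstruct_from_patches_2d(X_hat.reshape(patches.shape), img.shape)
```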

  21. Online optimization for dictionary learning min_{A ∈ R^{k×n}, D ∈ D} Σ_{i=1}^n ( ||x_i − D α_i||_2^2 + λ ||α_i||_1 ), where D ≜ { D ∈ R^{p×k} s.t. ∀ j = 1, . . . , k, ||d_j||_2 ≤ 1 }. • Classical optimization alternates between D and A. • Good results, but very slow!

  22. Online optimization for dictionary learning min_{D ∈ D} Σ_{i=1}^n min_{α_i} ( ||x_i − D α_i||_2^2 + λ ||α_i||_1 ), where D ≜ { D ∈ R^{p×k} s.t. ∀ j = 1, . . . , k, ||d_j||_2 ≤ 1 }. • Classical optimization alternates between D and A. • Good results, but very slow! • Online learning (Mairal, Bach, Ponce, and Sapiro, 2009a) can – handle potentially infinite datasets – adapt to dynamic training sets – online code ( http://www.di.ens.fr/willow/SPAMS/ )
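The online algorithm itself is implemented in the SPAMS package linked above. As a rough sketch of the same idea (process data in mini-batches and refine the dictionary on the fly, so the full dataset never has to be stored), scikit-learn's MiniBatchDictionaryLearning exposes partial_fit; the batches below are random stand-ins for streamed patches.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
dico = MiniBatchDictionaryLearning(n_components=50, alpha=1.0, batch_size=256, random_state=0)

# Stream mini-batches of signals; the dictionary is updated after each batch,
# which is what allows potentially infinite or dynamic training sets.
for _ in range(100):
    batch = rng.standard_normal((256, 64))    # stand-in for a fresh batch of 8x8 patches
    dico.partial_fit(batch)

D = dico.components_                          # current dictionary, shape (50, 64)
```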

  23. Denoising result (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009c)

  24. Denoising result (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009c)

  25. What does the dictionary D look like?

  26. Inpainting a 12-Mpixel photograph

  27. Inpainting a 12-Mpixel photograph

  28. Inpainting a 12-Mpixel photograph

  29. Inpainting a 12-Mpixel photograph

  30. Alternative usages of dictionary learning Computer vision • Use the “code” α as representation of observations for subsequent processing (Raina et al., 2007; Yang et al., 2009) • Adapt dictionary elements to specific tasks (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b) – Discriminative training for weakly supervised pixel classification (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2008)
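A minimal sketch of the first point above, using scikit-learn pieces rather than the cited methods: learn a dictionary, then feed the sparse codes α to a linear classifier as the feature representation. The data, labels, and parameters are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))           # stand-in for image patches / descriptors
y = rng.integers(0, 2, size=200)             # stand-in labels

# The dictionary-learning step transforms each observation into its sparse code,
# which then serves as the input representation for the classifier.
clf = make_pipeline(
    DictionaryLearning(n_components=30, alpha=1.0, transform_algorithm="lasso_lars", random_state=0),
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)
```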

  31. Structured sparse methods for matrix factorization Outline • Learning problems on matrices • Sparse methods for matrices – Sparse principal component analysis – Dictionary learning • Structured sparse PCA – Sparsity-inducing norms and overlapping groups – Structure on dictionary elements – Structure on decomposition coefficients

  32. Sparsity-inducing norms • min_{α ∈ R^p} f(α) + λ ψ(α), where f(α) is the data-fitting term and ψ(α) the sparsity-inducing norm • Regularizing by a sparsity-inducing norm ψ • Most popular choice for ψ – ℓ1-norm: ||α||_1 = Σ_{j=1}^p |α_j| – Lasso (Tibshirani, 1996), basis pursuit (Chen et al., 2001) – ℓ1-norm only encodes cardinality • Structured sparsity – Certain patterns are favored – Improvement of interpretability and prediction performance
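Since the slides now turn to norms ψ beyond the ℓ1-norm, a short sketch (not from the slides) of how such penalties are typically handled algorithmically: proximal gradient descent, where the only ψ-dependent ingredient is its proximal operator; soft-thresholding for the ℓ1-norm, group soft-thresholding for a non-overlapping group norm. Overlapping groups, as used later for structured sparse PCA, need a more involved proximal computation.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def group_soft_threshold(v, t, groups):
    """Proximal operator of t * sum_g ||v_g||_2 for non-overlapping groups:
    entire groups are zeroed out at once, the basic structured-sparsity effect."""
    out = np.zeros_like(v)
    for g in groups:
        norm = np.linalg.norm(v[g])
        if norm > t:
            out[g] = (1 - t / norm) * v[g]
    return out

def ista(X, y, lam, prox, n_iters=500):
    """Proximal gradient for min_a 0.5 * ||y - X a||_2^2 + lam * psi(a)."""
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the smooth part's gradient
    a = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ a - y)
        a = prox(a - grad / L, lam / L)      # gradient step followed by the proximal step
    return a
```

For instance, `ista(X, y, 0.1, soft_threshold)` solves the lasso, while `ista(X, y, 0.1, lambda v, t: group_soft_threshold(v, t, groups))` with a partition `groups` of the indices gives the (non-overlapping) group lasso.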
