Structured sparse methods for matrix factorization
Francis Bach
Sierra team, INRIA - Ecole Normale Supérieure
March 2011
Joint work with R. Jenatton, J. Mairal, G. Obozinski
Structured sparse methods for matrix factorization Outline • Learning problems on matrices • Sparse methods for matrices – Sparse principal component analysis – Dictionary learning • Structured sparse PCA – Sparsity-inducing norms and overlapping groups – Structure on dictionary elements – Structure on decomposition coefficients
Learning on matrices - Collaborative filtering
• Given n_X “movies” x ∈ X and n_Y “customers” y ∈ Y
• Predict the “rating” z(x, y) ∈ Z of customer y for movie x
• Training data: large n_X × n_Y incomplete matrix Z that describes the known ratings of some customers for some movies
• Goal: complete the matrix
[Figure: partially observed rating matrix with a few known entries]
Learning on matrices - Image denoising • Simultaneously denoise all patches of a given image • Example from Mairal, Bach, Ponce, Sapiro, and Zisserman (2009c)
Learning on matrices - Source separation
• Single microphone (Benaroya et al., 2006; Févotte et al., 2009)
Learning on matrices - Multi-task learning • k linear prediction tasks on same covariates x ∈ R p – k weight vectors w j ∈ R p – Joint matrix of predictors W = ( w 1 , . . . , w k ) ∈ R p × k • Classical applications – Transfer learning – Multi-category classification (one task per class) (Amit et al., 2007) • Share parameters between tasks – Joint variable or feature selection (Obozinski et al., 2009; Pontil et al., 2007)
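As a concrete illustration of joint variable/feature selection across tasks, here is a minimal sketch using scikit-learn's MultiTaskLasso, whose mixed ℓ1/ℓ2 penalty zeroes out entire rows of W (features discarded by all k tasks at once). The data, regularization level and choice of solver are illustrative assumptions, not the methods of the cited papers.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
n, p, k = 100, 30, 4                        # samples, covariates, tasks
W_true = np.zeros((p, k))
W_true[:5] = rng.randn(5, k)                # only 5 features are relevant, shared by all tasks
X = rng.randn(n, p)
Y = X @ W_true + 0.1 * rng.randn(n, k)

model = MultiTaskLasso(alpha=0.1).fit(X, Y)
W_hat = model.coef_.T                       # shape (p, k): one row per feature
selected = np.flatnonzero(np.any(W_hat != 0, axis=1))
print("jointly selected features:", selected)
```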
Learning on matrices - Dimension reduction
• Given data matrix X = (x_1^⊤, . . . , x_n^⊤) ∈ R^{n×p}
– Principal component analysis: x_i ≈ D α_i
– K-means: x_i ≈ d_k ⇒ X = DA
Sparsity in machine learning
• Assumption: y = w^⊤ x + ε, with w ∈ R^p sparse
– Proxy for interpretability
– Allows high-dimensional inference: log p = O(n)
• Sparsity and convexity (ℓ1-norm regularization):
    min_{w ∈ R^p} L(w) + λ ‖w‖_1
[Figure: ℓ1-ball vs. ℓ2-ball in the (w_1, w_2) plane]
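A minimal sketch of the point above: ℓ1-regularized least squares (the lasso) recovers a sparse w even when p exceeds n. The synthetic data and the regularization level are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, p = 80, 200                       # high-dimensional: p > n
w_true = np.zeros(p)
w_true[:8] = rng.randn(8)            # sparse ground truth
X = rng.randn(n, p)
y = X @ w_true + 0.05 * rng.randn(n)

lasso = Lasso(alpha=0.05).fit(X, y)  # min_w (1/2n)||y - Xw||_2^2 + alpha*||w||_1
print("nonzeros in w_hat:", np.count_nonzero(lasso.coef_))
```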
Two types of sparsity for matrices M ∈ R^{n×p}
I - Directly on the elements of M
• Many zero elements: M_{ij} = 0
• Many zero rows (or columns): (M_{i1}, . . . , M_{ip}) = 0
Two types of sparsity for matrices M ∈ R^{n×p}
II - Through a factorization of M = UV^⊤
• Matrix M = UV^⊤, U ∈ R^{n×k} and V ∈ R^{p×k}
• Low rank: k small
• Sparse decomposition: U sparse
Structured (sparse) matrix factorizations
• Matrix M = UV^⊤, U ∈ R^{n×k} and V ∈ R^{p×k}
• Structure on U and/or V
– Low-rank: U and V have few columns
– Dictionary learning / sparse PCA: U has many zeros
– Clustering (k-means): U ∈ {0,1}^{n×k}, U1 = 1
– Pointwise positivity: non-negative matrix factorization (NMF)
– Specific patterns of zeros
– Low-rank + sparse (Candès et al., 2009)
– etc.
• Many applications
• Many open questions: algorithms, identifiability, evaluation
Sparse principal component analysis
• Given data X = (x_1, . . . , x_n) ∈ R^{p×n}, two views of PCA:
– Analysis view: find the projection d ∈ R^p of maximum variance (with deflation to obtain more components)
– Synthesis view: find the basis d_1, . . . , d_k such that all x_i have low reconstruction error when decomposed on this basis
• For regular PCA, the two views are equivalent
Sparse principal component analysis
• Given data X = (x_1, . . . , x_n) ∈ R^{p×n}, two views of PCA:
– Analysis view: find the projection d ∈ R^p of maximum variance (with deflation to obtain more components)
– Synthesis view: find the basis d_1, . . . , d_k such that all x_i have low reconstruction error when decomposed on this basis
• For regular PCA, the two views are equivalent
• Sparse (and/or non-negative) extensions
– Interpretability
– High-dimensional inference
– The two views are different
– For the analysis view, see d’Aspremont, Bach, and El Ghaoui (2008)
Sparse principal component analysis - Synthesis view
• Find d_1, . . . , d_k ∈ R^p sparse so that
    Σ_{i=1}^n min_{α_i ∈ R^k} ‖ x_i − Σ_{j=1}^k (α_i)_j d_j ‖_2^2 = Σ_{i=1}^n min_{α_i ∈ R^k} ‖ x_i − D α_i ‖_2^2 is small
– Look for A = (α_1, . . . , α_n) ∈ R^{k×n} and D = (d_1, . . . , d_k) ∈ R^{p×k} such that D is sparse and ‖X − DA‖_F^2 is small
Sparse principal component analysis - Synthesis view
• Find d_1, . . . , d_k ∈ R^p sparse so that
    Σ_{i=1}^n min_{α_i ∈ R^k} ‖ x_i − Σ_{j=1}^k (α_i)_j d_j ‖_2^2 = Σ_{i=1}^n min_{α_i ∈ R^k} ‖ x_i − D α_i ‖_2^2 is small
– Look for A = (α_1, . . . , α_n) ∈ R^{k×n} and D = (d_1, . . . , d_k) ∈ R^{p×k} such that D is sparse and ‖X − DA‖_F^2 is small
• Sparse formulation (Witten et al., 2009; Bach et al., 2008)
– Penalize/constrain d_j by the ℓ1-norm for sparsity
– Penalize/constrain α_i by the ℓ2-norm to avoid trivial solutions
    min_{D, A} Σ_{i=1}^n ‖ x_i − D α_i ‖_2^2 + λ Σ_{j=1}^k ‖ d_j ‖_1   s.t.  ∀ i, ‖ α_i ‖_2 ≤ 1
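A minimal sketch of this synthesis view using scikit-learn's SparsePCA, which puts an ℓ1 penalty on the dictionary elements (components) and an ℓ2 constraint on the codes; its exact parameterization differs slightly from the formulation above, and the data and the value of alpha are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.RandomState(0)
X = rng.randn(200, 50)                          # n = 200 observations in R^50 (rows)

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0).fit(X)
D = spca.components_                            # (5, 50): sparse dictionary elements d_j
A = spca.transform(X)                           # (200, 5): decomposition coefficients alpha_i
print("fraction of zeros in D:", np.mean(D == 0))
```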
Sparse PCA vs. dictionary learning • Sparse PCA : x i ≈ D α i , D sparse
Sparse PCA vs. dictionary learning • Sparse PCA : x i ≈ D α i , D sparse • Dictionary learning : x i ≈ D α i , α i sparse
Structured matrix factorizations (Bach et al., 2008)
• min_{D, A} Σ_{i=1}^n ‖ x_i − D α_i ‖_2^2 + λ Σ_{j=1}^k ‖ d_j ‖_⋆   s.t.  ∀ i, ‖ α_i ‖_• ≤ 1
• min_{D, A} Σ_{i=1}^n ‖ x_i − D α_i ‖_2^2 + λ Σ_{i=1}^n ‖ α_i ‖_•   s.t.  ∀ j, ‖ d_j ‖_⋆ ≤ 1
• Optimization by alternating minimization (non-convex)
• α_i: decomposition coefficients (or “code”); d_j: dictionary elements
• Two related/equivalent problems:
– Sparse PCA = sparse dictionary (ℓ1-norm on d_j)
– Dictionary learning = sparse decompositions (ℓ1-norm on α_i) (Olshausen and Field, 1997; Elad and Aharon, 2006; Lee et al., 2007)
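To make the alternating scheme concrete, here is a bare-bones NumPy sketch for the second problem above (ℓ1-norm on the codes α_i, unit ℓ2 constraint on the d_j). The ISTA inner loop, the projection-based dictionary update and the iteration counts are simplifying assumptions, not the algorithms of the cited papers.

```python
import numpy as np

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def dict_learning_alt_min(X, k, lam, n_outer=30, n_inner=50):
    """X: (p, n) data with columns x_i. Returns D (p, k) and A (k, n)."""
    p, n = X.shape
    rng = np.random.RandomState(0)
    D = rng.randn(p, k)
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    A = np.zeros((k, n))
    for _ in range(n_outer):
        # Sparse coding: ISTA on 0.5*||X - D A||_F^2 + lam*||A||_1
        L = max(np.linalg.norm(D, 2) ** 2, 1e-12)   # Lipschitz constant of the gradient
        for _ in range(n_inner):
            grad = D.T @ (D @ A - X)
            A = soft_threshold(A - grad / L, lam / L)
        # Dictionary update: unconstrained least squares, then project each
        # column onto the unit l2-ball
        D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(k))
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)
    return D, A

X = np.random.RandomState(1).randn(20, 100)
D, A = dict_learning_alt_min(X, k=10, lam=0.2)
print("fraction of zeros in A:", np.mean(A == 0))
```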
Dictionary learning for image denoising
    y = x + ε
(y: measurements, x: original image, ε: noise)
Dictionary learning for image denoising
• Solving the denoising problem (Elad and Aharon, 2006)
– Extract all overlapping 8 × 8 patches x_i ∈ R^64
– Form the matrix X = (x_1^⊤, . . . , x_n^⊤) ∈ R^{n×64}
– Solve a matrix factorization problem:
    min_{D, A} ‖X − DA‖_F^2 = min_{D, A} Σ_{i=1}^n ‖ x_i − D α_i ‖_2^2
  where A is sparse, and D is the dictionary
– Each patch is decomposed into x_i = D α_i
– Average the reconstructions D α_i of the overlapping patches to reconstruct a full-sized image
• The number of patches n is large (= number of pixels)
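A hedged sketch of this pipeline, using scikit-learn's patch utilities and MiniBatchDictionaryLearning in place of the authors' code; the stand-in image, patch size and hyper-parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d

rng = np.random.RandomState(0)
clean = rng.rand(64, 64)                       # stand-in for a real grayscale image
noisy = clean + 0.1 * rng.randn(64, 64)

patches = extract_patches_2d(noisy, (8, 8))    # all overlapping 8x8 patches
X = patches.reshape(len(patches), -1)          # rows x_i in R^64
mean = X.mean(axis=1, keepdims=True)           # remove each patch's mean before coding
X_centered = X - mean

dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)
A = dico.fit(X_centered).transform(X_centered)        # sparse codes alpha_i
X_denoised = A @ dico.components_ + mean              # x_i ~ D alpha_i

denoised = reconstruct_from_patches_2d(
    X_denoised.reshape(patches.shape), noisy.shape)   # average overlapping patches
print("MSE noisy vs denoised:",
      np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))
```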
Online optimization for dictionary learning
    min_{A ∈ R^{k×n}, D ∈ 𝒟}  Σ_{i=1}^n ( ‖ x_i − D α_i ‖_2^2 + λ ‖ α_i ‖_1 )
    𝒟 ≜ { D ∈ R^{p×k} s.t. ∀ j = 1, . . . , k, ‖ d_j ‖_2 ≤ 1 }
• Classical optimization alternates between D and A
• Good results, but very slow!
Online optimization for dictionary learning
    min_{D ∈ 𝒟}  Σ_{i=1}^n  min_{α_i} ( ‖ x_i − D α_i ‖_2^2 + λ ‖ α_i ‖_1 )
    𝒟 ≜ { D ∈ R^{p×k} s.t. ∀ j = 1, . . . , k, ‖ d_j ‖_2 ≤ 1 }
• Classical optimization alternates between D and A
• Good results, but very slow!
• Online learning (Mairal, Bach, Ponce, and Sapiro, 2009a) can
– handle potentially infinite datasets
– adapt to dynamic training sets
– online code (http://www.di.ens.fr/willow/SPAMS/)
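A sketch of the online regime: stream mini-batches through scikit-learn's MiniBatchDictionaryLearning via partial_fit, as a stand-in for the SPAMS toolbox linked above. The synthetic stream, batch size and hyper-parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)

for _ in range(200):                  # stands in for a potentially infinite stream
    batch = rng.randn(256, 64)        # 256 new signals x_i in R^64
    dico.partial_fit(batch)           # one online update of the dictionary D

print("dictionary shape:", dico.components_.shape)   # (64 atoms, 64 dimensions)
```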
Denoising result (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009c)
Denoising result (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009c)
What does the dictionary D look like?
Inpainting a 12-Mpixel photograph
Inpainting a 12-Mpixel photograph
Inpainting a 12-Mpixel photograph
Inpainting a 12-Mpixel photograph
Alternative usages of dictionary learning Computer vision • Use the “code” α as representation of observations for subsequent processing (Raina et al., 2007; Yang et al., 2009) • Adapt dictionary elements to specific tasks (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b) – Discriminative training for weakly supervised pixel classification (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2008)
Structured sparse methods for matrix factorization Outline • Learning problems on matrices • Sparse methods for matrices – Sparse principal component analysis – Dictionary learning • Structured sparse PCA – Sparsity-inducing norms and overlapping groups – Structure on dictionary elements – Structure on decomposition coefficients
Sparsity-inducing norms
    min_{α ∈ R^p}  f(α)  +  λ ψ(α)
(f: data-fitting term; ψ: sparsity-inducing norm)
• Regularizing by a sparsity-inducing norm ψ
• Most popular choice for ψ
– ℓ1-norm: ‖α‖_1 = Σ_{j=1}^p |α_j|
– Lasso (Tibshirani, 1996), basis pursuit (Chen et al., 2001)
– ℓ1-norm only encodes cardinality
• Structured sparsity
– Certain patterns are favored
– Improvement of interpretability and prediction performance
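To make the last contrast concrete, the sketch below compares the proximal operator of the ℓ1-norm (componentwise soft-thresholding, which only controls how many coefficients are zero) with that of a non-overlapping group ℓ1/ℓ2-norm (block soft-thresholding, which zeroes whole groups of coefficients at once); the grouping and threshold are illustrative assumptions.

```python
import numpy as np

def prox_l1(alpha, t):
    """Prox of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(alpha) * np.maximum(np.abs(alpha) - t, 0.0)

def prox_group_l2(alpha, groups, t):
    """Prox of t * sum_g ||alpha_g||_2 for disjoint groups: block soft-thresholding."""
    out = alpha.copy()
    for g in groups:
        norm = np.linalg.norm(alpha[g])
        out[g] = 0.0 if norm <= t else (1.0 - t / norm) * alpha[g]
    return out

alpha = np.array([0.3, -0.1, 2.0, 0.2, -0.4, 1.5])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(prox_l1(alpha, 0.5))                 # zeroes small individual entries
print(prox_group_l2(alpha, groups, 0.5))   # zeroes the small group as a block
```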