New matrix norms for structured matrix estimation Jean-Philippe Vert Optimization and Statistical Learning workshop Les Houches, France, Jan 11-16, 2015
Outline: 1. Atomic norms; 2. Sparse matrices with disjoint column supports; 3. Low-rank matrices with sparse factors. (Image credit: http://www.homemade-gifts-made-easy.com/make-paper-lanterns.html)
Outline: 1. Atomic norms; 2. Sparse matrices with disjoint column supports; 3. Low-rank matrices with sparse factors
Atomic norm (Chandrasekaran et al., 2012)
Definition: given a set of atoms $\mathcal{A}$, the associated atomic norm is
$$\|x\|_{\mathcal{A}} = \inf \{ t > 0 \mid x \in t\,\mathrm{conv}(\mathcal{A}) \}.$$
NB: this is really a norm if $\mathcal{A}$ is centrally symmetric and spans $\mathbb{R}^p$.
Primal and dual forms of the norm:
$$\|x\|_{\mathcal{A}} = \inf \Big\{ \sum_{a \in \mathcal{A}} c_a \;\Big|\; x = \sum_{a \in \mathcal{A}} c_a a, \; c_a \geq 0 \ \forall a \in \mathcal{A} \Big\}, \qquad \|x\|_{\mathcal{A}}^* = \sup_{a \in \mathcal{A}} \langle a, x \rangle.$$
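For a finite atom set, the primal form above is just a linear program, so it can be evaluated numerically in a few lines. The sketch below is a hypothetical helper (`atomic_norm`, a name introduced here for illustration), not code from the talk; it recovers the $\ell_1$ norm when the atoms are $\pm e_k$.

```python
import numpy as np
from scipy.optimize import linprog

def atomic_norm(x, atoms):
    """Atomic norm of x over a finite atom list, via the primal LP."""
    A = np.column_stack(atoms)            # p x |A| matrix, one column per atom
    cost = np.ones(A.shape[1])            # minimize the total weight sum
    res = linprog(cost, A_eq=A, b_eq=x, bounds=(0, None))
    return res.fun

p = 4
atoms = [s * e for e in np.eye(p) for s in (1.0, -1.0)]   # A = {±e_k}
x = np.array([1.0, -2.0, 0.0, 3.0])
print(atomic_norm(x, atoms))              # 6.0, i.e. ||x||_1
```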
Examples
Vector $\ell_1$-norm: $x \in \mathbb{R}^p \mapsto \|x\|_1$, with atoms $\mathcal{A} = \{ \pm e_k \mid 1 \leq k \leq p \}$.
Matrix trace norm: $Z \in \mathbb{R}^{m_1 \times m_2} \mapsto \|Z\|_*$ (sum of singular values), with atoms $\mathcal{A} = \{ ab^\top : a \in \mathbb{R}^{m_1}, \ b \in \mathbb{R}^{m_2}, \ \|a\|_2 = \|b\|_2 = 1 \}$.
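The trace-norm atoms form an infinite set, but the SVD of $Z$ already provides an atomic decomposition whose weights are the singular values. The quick check below (synthetic $Z$, my own sketch) confirms that this weight sum matches NumPy's nuclear norm.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # Z = sum_i s_i u_i v_i^T
print(s.sum())                                     # weight of this atomic decomposition
print(np.linalg.norm(Z, "nuc"))                    # trace (nuclear) norm: same value
```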
Group lasso (Yuan and Lin, 2006)
For $x \in \mathbb{R}^p$ and $\mathcal{G} = \{g_1, \ldots, g_G\}$ a partition of $[1, p]$:
$$\|x\|_{1,2} = \sum_{g \in \mathcal{G}} \|x_g\|_2$$
is the atomic norm associated to the set of atoms
$$\mathcal{A}_{\mathcal{G}} = \bigcup_{g \in \mathcal{G}} \{ u \in \mathbb{R}^p : \mathrm{supp}(u) = g, \ \|u\|_2 = 1 \}.$$
Example: $\mathcal{G} = \{\{1,2\},\{3\}\}$ gives $\|x\|_{1,2} = \|(x_1, x_2)^\top\|_2 + \|x_3\|_2 = \sqrt{x_1^2 + x_2^2} + |x_3|$.
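A direct sketch of the partition case, reproducing the small example above (the toy vector and groups are chosen here for illustration):

```python
import numpy as np

def group_lasso_norm(x, groups):
    """||x||_{1,2} = sum over groups g of ||x_g||_2, for a partition of the indices."""
    return sum(np.linalg.norm(x[list(g)]) for g in groups)

x = np.array([3.0, 4.0, -2.0])
groups = [[0, 1], [2]]                  # G = {{1,2},{3}}, in 0-based indexing
print(group_lasso_norm(x, groups))      # sqrt(3^2 + 4^2) + |-2| = 7.0
```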
Group lasso with overlaps
How to generalize the group lasso when the groups overlap?
Set features to zero by groups (Jenatton et al., 2011): $\|x\|_{1,2} = \sum_{g \in \mathcal{G}} \|x_g\|_2$.
Select the support as a union of groups (Jacob et al., 2009): $\|x\|_{\mathcal{A}_{\mathcal{G}}}$; see also MKL (Bach et al., 2004).
Example: $\mathcal{G} = \{\{1,2\},\{2,3\}\}$; a sketch of the union-of-supports norm on this example follows below.
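For the overlapping groups $\mathcal{G} = \{\{1,2\},\{2,3\}\}$, the Jacob et al. norm splits $x$ into latent vectors supported on the two groups, so only the shared coordinate $x_2$ needs to be split. The one-dimensional reduction below is my own illustration of this toy case, not the general algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def latent_group_lasso(x):
    """Jacob et al. norm for G = {{1,2},{2,3}}: write x = v + w with supp(v) in {1,2},
    supp(w) in {2,3}; only x_2 is shared, so optimize over its split t."""
    x1, x2, x3 = x
    objective = lambda t: np.hypot(x1, t) + np.hypot(x2 - t, x3)
    return minimize_scalar(objective).fun

print(latent_group_lasso([1.0, 0.5, 2.0]))
```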
Outline: 1. Atomic norms; 2. Sparse matrices with disjoint column supports; 3. Low-rank matrices with sparse factors
Joint work with... Kevin Vervier, Pierre Mahé, Jean-Baptiste Veyrieras (Biomerieux) Alexandre d’Aspremont (CNRS/ENS)
Columns with disjoint supports
Motivation: multiclass or multitask classification problems where we want to select features specific to each class or task.
Example: recognize the identity and emotion of a person from an image (Romera-Paredes et al., 2012), or build a hierarchical coarse-to-fine classifier (Xiao et al., 2011; Hwang et al., 2011).
From disjoint supports to orthogonal columns
Two vectors $v_1$ and $v_2$ have disjoint supports iff $|v_1|$ and $|v_2|$ are orthogonal.
If $\Omega_{\mathrm{ortho}}(X)$ is a norm to estimate matrices with orthogonal columns, then
$$\Omega_{\mathrm{disjoint}}(X) = \Omega_{\mathrm{ortho}}(|X|) = \min_{-W \leq X \leq W} \Omega_{\mathrm{ortho}}(W)$$
is a norm to estimate matrices with disjoint column supports.
How to estimate matrices with orthogonal columns? Note: this is more general than orthogonal matrices.
Penalty for orthogonal columns
For $X = [x_1, \ldots, x_p] \in \mathbb{R}^{n \times p}$ we want $x_i^\top x_j = 0$ for $i \neq j$.
A natural "relaxation": $\Omega(X) = \sum_{i \neq j} |x_i^\top x_j|$. But this is not convex.
Convex penalty for orthogonal columns
$$\Omega_K(X) = \sum_{i=1}^p K_{ii} \|x_i\|_2^2 + \sum_{i \neq j} K_{ij} |x_i^\top x_j|$$
Theorem (Xiao et al., 2011): if $\bar{K}$ is positive semidefinite, then $\Omega_K$ is convex, where
$$\bar{K}_{ij} = \begin{cases} |K_{ii}| & \text{if } i = j, \\ -|K_{ij}| & \text{otherwise.} \end{cases}$$
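The penalty itself is easy to evaluate from the Gram matrix of the columns; the helper below (`omega_K`, a name introduced here) is a minimal sketch of the formula above.

```python
import numpy as np

def omega_K(X, K):
    """Omega_K(X) = sum_i K_ii ||x_i||_2^2 + sum_{i != j} K_ij |x_i^T x_j|."""
    G = np.abs(X.T @ X)          # |x_i^T x_j|; diagonal entries are ||x_i||_2^2
    return float(np.sum(K * G))

# Orthogonal columns: only the diagonal terms K_ii ||x_i||^2 remain.
X = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
K = np.ones((2, 2))
print(omega_K(X, K))             # 1^2 + 2^2 = 5.0
```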
Can we be tighter?
$$\Omega_K(X) = \sum_{i=1}^p \|x_i\|_2^2 + \sum_{i \neq j} K_{ij} |x_i^\top x_j|$$
Can we be tighter?
$$\Omega_K(X) = \sum_{i=1}^p \|x_i\|_2^2 + \sum_{i \neq j} K_{ij} |x_i^\top x_j|$$
Let $\mathcal{O}$ be the set of matrices of unit Frobenius norm with orthogonal columns:
$$\mathcal{O} = \left\{ X \in \mathbb{R}^{n \times p} : X^\top X \text{ is diagonal and } \mathrm{Trace}(X^\top X) = 1 \right\}.$$
Note that $\Omega_K(X) = 1$ for all $X \in \mathcal{O}$.
The atomic norm $\|X\|_{\mathcal{O}}$ associated to $\mathcal{O}$ is the tightest convex penalty to recover the atoms in $\mathcal{O}$!
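A quick random check of the claim that $\Omega_K(X) = 1$ on $\mathcal{O}$ (assuming unit diagonal weights, as in the formula above; the instance is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # orthonormal columns
X = Q * rng.uniform(0.5, 2.0, size=3)              # rescaled columns stay orthogonal
X /= np.linalg.norm(X, "fro")                      # Trace(X^T X) = 1, so X is in O
K = np.ones((3, 3))                                # unit diagonal, nonnegative off-diagonal
G = np.abs(X.T @ X)
print(np.sum(K * G))                               # ~1.0: Omega_K(X) = 1 on O
```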
Optimality of $\Omega_K$ for $p = 2$
Theorem (Vervier, Mahé, d'Aspremont, Veyrieras and V., 2014): for any $X \in \mathbb{R}^{n \times 2}$,
$$\|X\|_{\mathcal{O}}^2 = \Omega_K(X) \quad \text{with} \quad K = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.$$
Case $p > 2$
In general $\Omega_K(X) \neq \|X\|_{\mathcal{O}}^2$, but sparse combinations of matrices in $\mathcal{O}$ may not be interesting anyway...
Theorem (Vervier et al., 2014): for any $p \geq 2$, let $K$ be a symmetric $p$-by-$p$ matrix with non-negative entries such that
$$K_{ii} = \sum_{j \neq i} K_{ij} \quad \forall i = 1, \ldots, p.$$
Then
$$\Omega_K(X) = \sum_{i < j} K_{ij} \, \|(x_i, x_j)\|_{\mathcal{O}}^2.$$
Simulations
Regression $Y = XW + \epsilon$ where $W$ has disjoint column supports, $n = p = 10$.
[Figure: MSE and support disjointness as a function of training set size (10 to 50), comparing ridge regression, the lasso, the penalty of Xiao et al., and the disjoint-supports norm.]
Example: multiclass classification of MS spectra
[Figure: spectra-by-features data matrix with class labels HAE, YER, ESH, SHG, ENT, CIT, STR, CLO, LIS, BAC.]
Outline: 1. Atomic norms; 2. Sparse matrices with disjoint column supports; 3. Low-rank matrices with sparse factors
Joint work with... Emile Richard (Stanford) Guillaume Obozinski (Ecole des Ponts - ParisTech)
Low-rank matrices with sparse factors
$$X = \sum_{i=1}^r u_i v_i^\top$$
The factors are not orthogonal a priori: this differs from assuming that the SVD of $X$ is sparse.
Dictionary learning
$$\min_{A \in \mathbb{R}^{k \times n}, \, D \in \mathbb{R}^{p \times k}} \ \sum_{i=1}^n \|x_i - D\alpha_i\|_2^2 + \lambda \sum_{i=1}^n \|\alpha_i\|_1 \quad \text{s.t.} \quad \forall j, \ \|d_j\|_2 \leq 1.$$
[Figure: $X^\top \approx D\alpha$ factorization.]
E.g. overcomplete dictionaries for natural images: sparse decomposition (Elad and Aharon, 2006).
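As a concrete reference point, scikit-learn's DictionaryLearning optimizes an objective of the same form (squared loss plus $\ell_1$ on the codes, unit-norm atoms). The snippet below is a sketch on synthetic data with arbitrary hyper-parameters, not the setup of the talk.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                 # n samples x p features (synthetic)

dl = DictionaryLearning(n_components=8, alpha=1.0, max_iter=50, random_state=0)
codes = dl.fit_transform(X)                        # sparse codes alpha_i, shape (n, k)
D = dl.components_                                 # dictionary atoms d_j,   shape (k, p)
print(codes.shape, D.shape, np.mean(codes == 0))   # fraction of zero coefficients
```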
Dictionary learning / sparse PCA
$$\min_{A \in \mathbb{R}^{k \times n}, \, D \in \mathbb{R}^{p \times k}} \ \sum_{i=1}^n \|x_i - D\alpha_i\|_2^2 + \lambda \sum_{i=1}^n \|\alpha_i\|_1 \quad \text{s.t.} \quad \forall j, \ \|d_j\|_2 \leq 1.$$
Dictionary learning: e.g. overcomplete dictionaries for natural images, sparse decomposition (Elad and Aharon, 2006).
Sparse PCA: e.g. microarray data, sparse dictionary (Witten et al., 2009; Bach et al., 2008).
Sparsity of the loadings vs. sparsity of the dictionary elements.
Applications
Low-rank factorization with "community structure": modeling clusters or community structure in social networks or recommender systems (Richard et al., 2012).
Subspace clustering (Wang et al., 2013): up to an unknown permutation, $X^\top = \left[ X_1^\top \ \cdots \ X_K^\top \right]$ with each $X_k$ low rank, so that there exists a low-rank matrix $Z_k$ such that $X_k = Z_k X_k$. Finally, $X = ZX$ with $Z = \mathrm{BlkDiag}(Z_1, \ldots, Z_K)$.
Sparse PCA from $\hat{\Sigma}_n$.
Sparse bilinear regression $y = x^\top M x' + \varepsilon$.
Existing approaches
Bi-convex formulations:
$$\min_{U, V} \ L(UV^\top) + \lambda (\|U\|_1 + \|V\|_1), \quad U \in \mathbb{R}^{n \times r}, \ V \in \mathbb{R}^{p \times r}.$$
Convex formulation for sparse and low-rank matrices (Doan and Vavasis, 2013; Richard et al., 2012):
$$\min_Z \ L(Z) + \lambda \|Z\|_1 + \mu \|Z\|_*$$
The factors are not necessarily sparse as $r$ increases. A minimal alternating sketch of the bi-convex approach follows below.
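A minimal sketch of the bi-convex route, assuming a squared loss and taking one ISTA (proximal gradient) step on each factor per iteration; this is my own illustrative implementation, not the algorithm of the cited papers.

```python
import numpy as np

def soft_threshold(A, t):
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def biconvex_sparse_factorization(Y, r, lam, n_iter=300, seed=0):
    """Alternating ISTA steps on U and V for 0.5||Y - U V^T||_F^2 + lam(||U||_1 + ||V||_1)."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((Y.shape[0], r))
    V = rng.standard_normal((Y.shape[1], r))
    for _ in range(n_iter):
        sU = 1.0 / (np.linalg.norm(V, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant in U
        U = soft_threshold(U + sU * (Y - U @ V.T) @ V, sU * lam)
        sV = 1.0 / (np.linalg.norm(U, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant in V
        V = soft_threshold(V + sV * (Y - U @ V.T).T @ U, sV * lam)
    return U, V

Y = np.random.default_rng(1).standard_normal((30, 20))
U, V = biconvex_sparse_factorization(Y, r=3, lam=0.5)
print(np.mean(U == 0), np.mean(V == 0))              # sparsity of the estimated factors
```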
A new formulation for sparse matrix factorization
Assumptions: $X = \sum_{i=1}^r a_i b_i^\top$, where all left factors $a_i$ have supports of size $k$ and all right factors $b_i$ have supports of size $q$.
Goals:
Propose a convex formulation for sparse matrix factorization that can handle multiple sparse factors, makes it possible to identify the sparse factors themselves, and leads to better statistical performance than the $\ell_1$/trace norms.
Propose algorithms based on this formulation.
The $(k,q)$-rank of a matrix
Sparse unit vectors: $\mathcal{A}_j^n = \{ a \in \mathbb{R}^n : \|a\|_0 \leq j, \ \|a\|_2 = 1 \}$.
$(k,q)$-rank of an $m_1 \times m_2$ matrix $Z$:
$$r_{k,q}(Z) = \min \Big\{ r : Z = \sum_{i=1}^r c_i a_i b_i^\top, \ \forall i, \ (a_i, b_i, c_i) \in \mathcal{A}_k^{m_1} \times \mathcal{A}_q^{m_2} \times \mathbb{R}_+ \Big\}$$
$$= \min \Big\{ \|c\|_0 : Z = \sum_{i=1}^\infty c_i a_i b_i^\top, \ \forall i, \ (a_i, b_i, c_i) \in \mathcal{A}_k^{m_1} \times \mathcal{A}_q^{m_2} \times \mathbb{R}_+ \Big\}$$
[Figure: a matrix $Z$ with $r_{k,q}(Z) = 3$.]
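Building a matrix with small $(k,q)$-rank is easy: sum a few rank-one terms whose factors are $k$- and $q$-sparse unit vectors. The synthetic construction below (the helper `sparse_unit` is mine) gives $r_{k,q}(Z) \leq r$ by construction.

```python
import numpy as np

def sparse_unit(m, s, rng):
    """A unit vector of dimension m with exactly s nonzero entries."""
    a = np.zeros(m)
    a[rng.choice(m, size=s, replace=False)] = rng.standard_normal(s)
    return a / np.linalg.norm(a)

rng = np.random.default_rng(0)
m1, m2, k, q, r = 30, 20, 4, 3, 3
Z = sum(rng.uniform(1, 2) * np.outer(sparse_unit(m1, k, rng), sparse_unit(m2, q, rng))
        for _ in range(r))
print(np.linalg.matrix_rank(Z), int((np.abs(Z) > 0).sum()))   # rank <= 3, at most k*q*r nonzeros
```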
The $(k,q)$ trace norm (Richard et al., 2014)
For a matrix $Z \in \mathbb{R}^{m_1 \times m_2}$ we have:
combinatorial penalties: $\|Z\|_0$ and $\mathrm{rank}(Z)$;
convex relaxations: $\|Z\|_1$ and $\|Z\|_*$.
The $(k,q)$ trace norm (Richard et al., 2014)
For a matrix $Z \in \mathbb{R}^{m_1 \times m_2}$ we have:
combinatorial penalties: $\|Z\|_0$ (the $(1,1)$-rank), $r_{k,q}(Z)$ (the $(k,q)$-rank), $\mathrm{rank}(Z)$ (the $(m_1,m_2)$-rank);
convex relaxations: $\|Z\|_1$ and $\|Z\|_*$ — and what for $r_{k,q}$?
The $(k,q)$ trace norm (Richard et al., 2014)
For a matrix $Z \in \mathbb{R}^{m_1 \times m_2}$ we have:
combinatorial penalties: $\|Z\|_0$ (the $(1,1)$-rank), $r_{k,q}(Z)$ (the $(k,q)$-rank), $\mathrm{rank}(Z)$ (the $(m_1,m_2)$-rank);
convex relaxations: $\|Z\|_1$, $\Omega_{k,q}(Z)$, $\|Z\|_*$.
The $(k,q)$ trace norm $\Omega_{k,q}(Z)$ is the atomic norm associated with
$$\mathcal{A}_{k,q} := \left\{ ab^\top \mid a \in \mathcal{A}_k^{m_1}, \ b \in \mathcal{A}_q^{m_2} \right\},$$
namely:
$$\Omega_{k,q}(Z) = \inf \Big\{ \|c\|_1 : Z = \sum_{i=1}^\infty c_i a_i b_i^\top, \ \forall i, \ (a_i, b_i, c_i) \in \mathcal{A}_k^{m_1} \times \mathcal{A}_q^{m_2} \times \mathbb{R}_+ \Big\}.$$
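$\Omega_{k,q}$ is hard to evaluate exactly in general, but the definition immediately sandwiches it: any explicit $(k,q)$-sparse decomposition gives the upper bound $\|c\|_1$, and since $\mathcal{A}_{k,q}$ is contained in the trace-norm atom set, the trace norm is a lower bound. The synthetic check below illustrates both bounds (my own sketch, reusing the construction from the $(k,q)$-rank slide).

```python
import numpy as np

def sparse_unit(m, s, rng):
    a = np.zeros(m)
    a[rng.choice(m, size=s, replace=False)] = rng.standard_normal(s)
    return a / np.linalg.norm(a)

rng = np.random.default_rng(1)
m1, m2, k, q, r = 30, 20, 4, 3, 3
c = rng.uniform(1, 2, size=r)                      # positive weights of the decomposition
Z = sum(c[i] * np.outer(sparse_unit(m1, k, rng), sparse_unit(m2, q, rng)) for i in range(r))
print(np.linalg.norm(Z, "nuc"), "<= Omega_kq(Z) <=", c.sum())
```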