MIT 9.520/6.860, Fall 2017
Statistical Learning Theory and Applications
Class 20: Dictionary Learning
What is data representation?

Let X be a data space.

[Figure: a set M ⊂ X is mapped by the representation Φ to Φ(M) ⊂ F, and back by the reconstruction Ψ to Ψ ∘ Φ(M) ⊂ X.]

A data representation is a map Φ : X → F, from the data space to a representation space F.

A data reconstruction is a map Ψ : F → X.
Road map

Last class:
◮ Prologue: Learning theory and data representation
◮ Part I: Data representations by design

This class:
◮ Part II: Data representations by unsupervised learning
  – Dictionary Learning
  – PCA
  – Sparse coding
  – K-means, K-flats

Next class:
◮ Part III: Deep data representations
Notation

X : data space
◮ X = R^d or X = C^d (also more general later)
◮ x ∈ X

F : representation space
◮ F = R^p or F = C^p
◮ z ∈ F

Data representation: Φ : X → F, so that ∀ x ∈ X, ∃ z ∈ F : Φ(x) = z.

Data reconstruction: Ψ : F → X, so that ∀ z ∈ F, ∃ x ∈ X : Ψ(z) = x.
Why learning?

Ideally: automatic, autonomous learning
◮ with as little prior information as possible, but also...
◮ ...with as little human supervision as possible.

f(x) = ⟨w, Φ(x)⟩_F,   ∀ x ∈ X

Two-step learning scheme:
◮ supervised or unsupervised learning of Φ : X → F
◮ supervised learning of w in F
Unsupervised representation learning

Samples from a distribution ρ on the input space X:

S = {x_1, ..., x_n} ∼ ρ^n

Training set S from ρ (supported on X_ρ).

Goal: find Φ(x) which is "good" not only for S but for other x ∼ ρ.

Principles for unsupervised learning of "good" representations?
Unsupervised representation learning principles

Two main concepts:

1. Similarity preservation: it holds that
   Φ(x) ∼ Φ(x′) ⇔ x ∼ x′,   ∀ x, x′ ∈ X.

2. Reconstruction: there exists a map Ψ : F → X such that
   Ψ ∘ Φ(x) ∼ x,   ∀ x ∈ X.
Plan

We will first introduce a reconstruction-based framework for learning data representations, and then discuss several examples in some detail.

We will mostly consider X = R^d and F = R^p.
◮ Representation: Φ : X → F.
◮ Reconstruction: Ψ : F → X.

If the maps are linear:
◮ Representation: Φ(x) = Cx (coding)
◮ Reconstruction: Ψ(z) = Dz (decoding)
Reconstruction based data representation

Basic idea: the quality of a representation Φ is measured by the reconstruction error provided by an associated reconstruction Ψ,

‖x − Ψ ∘ Φ(x)‖,

where Ψ ∘ Φ denotes the composition of Φ and Ψ.
Empirical data and population

Given S = {x_1, ..., x_n}, minimize the empirical reconstruction error

Ê(Φ, Ψ) = (1/n) Σ_{i=1}^n ‖x_i − Ψ ∘ Φ(x_i)‖²,

as a proxy to the expected reconstruction error

E(Φ, Ψ) = ∫_X dρ(x) ‖x − Ψ ∘ Φ(x)‖²,

where ρ is the data distribution (fixed but unknown).
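As a concrete illustration (not from the slides), the empirical reconstruction error for a linear coding/decoding pair can be computed directly. A minimal sketch in Python, where the matrices C and D and the random data are hypothetical:

```python
import numpy as np

# Minimal sketch: empirical reconstruction error for linear maps
# Phi(x) = C x (coding) and Psi(z) = D z (decoding).
# X is the n x d data matrix; C is p x d, D is d x p.
def empirical_reconstruction_error(X, C, D):
    Z = X @ C.T            # codes z_i = C x_i, shape (n, p)
    X_hat = Z @ D.T        # reconstructions Psi(Phi(x_i)) = D C x_i
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
n, d, p = 100, 20, 5
X = rng.standard_normal((n, d))
D = np.linalg.qr(rng.standard_normal((d, p)))[0]   # orthonormal columns
C = D.T                                            # Phi(x) = D* x
print(empirical_reconstruction_error(X, C, D))
```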
Empirical data and population

min_{Φ, Ψ} E(Φ, Ψ),   E(Φ, Ψ) = ∫_X dρ(x) ‖x − Ψ ∘ Φ(x)‖².

Caveat: reconstruction alone is not enough... copying the data, i.e. Ψ ∘ Φ = I, gives zero reconstruction error!
Parsimonious reconstruction

Reconstruction is meaningful only with constraints!
◮ constraints implement some form of parsimonious reconstruction,
◮ identified with a form of regularization,
◮ the choice of the constraints corresponds to different algorithms.

Fundamental difference with supervised learning: the problem is not well defined!
Parsimonious reconstruction

[Figure: the data set M ⊂ X is mapped by Φ to Φ(M) ⊂ F and reconstructed by Ψ as Ψ ∘ Φ(M) ⊂ X.]
Dictionary learning

‖x − Ψ ∘ Φ(x)‖

Let X = R^d, F = R^p.

1. Linear reconstruction: Ψ(z) = Dz, D ∈ D, with D a subset of the space of linear maps from F to X.

2. Nearest neighbor representation:

   Φ(x) = Φ_Ψ(x) = argmin_{z ∈ F_λ} ‖x − Dz‖²,   D ∈ D, F_λ ⊂ F.
Linear reconstruction and dictionaries

A reconstruction D ∈ D can be identified with a d × p dictionary matrix with columns a_1, ..., a_p ∈ R^d.

Reconstruction of x ∈ X corresponds to a suitable linear expansion on the dictionary D, with coefficients β_k = z_k, z ∈ F_λ:

x = Dz = Σ_{k=1}^p a_k z_k = Σ_{k=1}^p a_k β_k,   β_1, ..., β_p ∈ R.
Nearest neighbor representation

Φ(x) = Φ_Ψ(x) = argmin_{z ∈ F_λ} ‖x − Dz‖²,   D ∈ D, F_λ ⊂ F.

This is a nearest neighbor (NN) representation since, for D ∈ D and letting X_λ = D F_λ, Φ(x) provides the closest point to x in X_λ:

d(x, X_λ) = min_{x′ ∈ X_λ} ‖x − x′‖² = min_{z′ ∈ F_λ} ‖x − Dz′‖².
Nearest neighbor representation (cont.)

NN representations are defined by a constrained inverse problem,

min_{z ∈ F_λ} ‖x − Dz‖².

Alternatively, let F_λ = F and add a regularization term R : F → R:

min_{z ∈ F} { ‖x − Dz‖² + λ R(z) }.

Note: the two formulations coincide for R(z) = 1_{F_λ}(z) (the indicator function of F_λ), z ∈ F.
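As an illustration of the regularized formulation (not part of the slides), the simple choice R(z) = ‖z‖² gives a closed-form representation for a fixed dictionary. A minimal sketch, where the dictionary D, the data, and the value of λ are hypothetical:

```python
import numpy as np

# Minimal sketch, assuming the quadratic regularizer R(z) = ||z||^2.
# For a fixed dictionary D (d x p),
#   Phi(x) = argmin_z ||x - D z||^2 + lam * ||z||^2
# has the closed form z = (D^T D + lam * I)^(-1) D^T x.
def ridge_code(D, x, lam=0.1):
    p = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(p), D.T @ x)

rng = np.random.default_rng(0)
d, p = 10, 4
D = rng.standard_normal((d, p))
x = rng.standard_normal(d)
z = ridge_code(D, x)
print(np.linalg.norm(x - D @ z))   # reconstruction error for this code
```

Other choices of R, e.g. a sparsity-inducing norm as in sparse coding, lead to different representation algorithms.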
Dictionary learning

Empirical reconstruction error minimization,

min_{Φ, Ψ} Ê(Φ, Ψ) = min_{Φ, Ψ} (1/n) Σ_{i=1}^n ‖x_i − Ψ ∘ Φ(x_i)‖²,

for joint dictionary and representation learning:

min_{D ∈ D} (1/n) Σ_{i=1}^n min_{z_i ∈ F_λ} ‖x_i − D z_i‖²

(the outer minimization over D is dictionary learning, the inner minimization over z_i is representation learning).

Dictionary learning:
◮ learning a regularized representation on a dictionary,
◮ while simultaneously learning the dictionary itself.
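One natural way to attack this joint problem is alternating minimization: fix D and solve for the codes z_i, then fix the codes and update D. The sketch below is an illustration only, using the ridge-regularized codes from above rather than a constraint set F_λ; the column renormalization is a common heuristic to remove the scale ambiguity between D and the codes.

```python
import numpy as np

# Minimal sketch of alternating minimization for dictionary learning
# (assumption: ridge-regularized codes; sparse coding would replace the
# code update with an l1-regularized solver).
def dictionary_learning(X, p, lam=0.1, n_iter=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((d, p))
    D /= np.linalg.norm(D, axis=0)                                # unit-norm atoms
    for _ in range(n_iter):
        # Representation step: z_i = argmin_z ||x_i - D z||^2 + lam ||z||^2
        Z = np.linalg.solve(D.T @ D + lam * np.eye(p), D.T @ X.T).T
        # Dictionary step: D = argmin_D sum_i ||x_i - D z_i||^2 (least squares)
        D = np.linalg.lstsq(Z, X, rcond=None)[0].T
        D /= np.linalg.norm(D, axis=0) + 1e-12                    # renormalize atoms
    return D, Z
```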
Examples

The DL framework encompasses a number of approaches:
◮ PCA (& kernel PCA)
◮ K-SVD
◮ Sparse coding
◮ K-means
◮ K-flats
◮ ...
Principal Component Analysis (PCA)

Let F_λ = F_k = R^k, k ≤ min{n, d}, and

D = {D : F → X linear | D*D = I}.

◮ D is a d × k matrix with orthogonal, unit-norm columns.
◮ Reconstruction:
  Dz = Σ_{j=1}^k a_j z_j,   z ∈ F.
◮ Representation:
  D* : X → F,   D*x = (⟨a_1, x⟩, ..., ⟨a_k, x⟩),   x ∈ X.
PCA and subset selection

DD* : X → X,   DD*x = Σ_{j=1}^k a_j ⟨a_j, x⟩,   x ∈ X.

P = DD* is a projection¹ onto the subspace of R^d spanned by a_1, ..., a_k.

¹ P = P² (idempotent)
Rewriting PCA

min_{D ∈ D} (1/n) Σ_{i=1}^n min_{z_i ∈ F_k} ‖x_i − D z_i‖²   (inner minimization: representation learning).

Note that

Φ(x) = D*x = argmin_{z ∈ F_k} ‖x − Dz‖²,   ∀ x ∈ X,

so we can rewrite the minimization (setting z = D*x) as

min_{D ∈ D} (1/n) Σ_{i=1}^n ‖x_i − DD*x_i‖².

Subspace learning: finding the k-dimensional orthogonal projection DD* with the best (empirical) reconstruction.
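A quick numerical check of the step above (an illustration, not from the slides): for D with orthonormal columns, D*x solves the inner least-squares problem, and DD* is an orthogonal projection. The matrices and data below are hypothetical:

```python
import numpy as np

# For D with orthonormal columns (D^T D = I), the best code is z = D^T x,
# and P = D D^T is the orthogonal projection onto span{a_1, ..., a_k}.
rng = np.random.default_rng(1)
d, k = 8, 3
D = np.linalg.qr(rng.standard_normal((d, k)))[0]   # D^T D = I
x = rng.standard_normal(d)

z_closed = D.T @ x                                  # Phi(x) = D* x
z_lstsq = np.linalg.lstsq(D, x, rcond=None)[0]      # argmin_z ||x - D z||^2
print(np.allclose(z_closed, z_lstsq))               # True

P = D @ D.T
print(np.allclose(P @ P, P))                        # projection: P^2 = P
```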
Learning a linear representation with PCA

Subspace learning: finding the k-dimensional orthogonal projection with the best reconstruction.

[Figure: data points in X lying close to a low-dimensional linear subspace.]
PCA computation

Recall the solution for k = 1. For all x ∈ X,

DD*x = ⟨a, x⟩ a,   ‖x − ⟨a, x⟩ a‖² = ‖x‖² − |⟨a, x⟩|²,

with a ∈ R^d such that ‖a‖ = 1. Then, equivalently:

min_{D ∈ D} (1/n) Σ_{i=1}^n ‖x_i − DD*x_i‖²   ⇔   max_{a ∈ R^d, ‖a‖=1} (1/n) Σ_{i=1}^n |⟨a, x_i⟩|².
PCA computation (cont.)

Let X̂_n be the n × d data matrix and V = (1/n) X̂_nᵀ X̂_n. Then

(1/n) Σ_{i=1}^n |⟨a, x_i⟩|² = (1/n) Σ_{i=1}^n ⟨a, x_i⟩ ⟨a, x_i⟩ = ⟨a, (1/n) Σ_{i=1}^n ⟨a, x_i⟩ x_i⟩ = ⟨a, Va⟩.

Then, equivalently:

max_{a ∈ R^d, ‖a‖=1} (1/n) Σ_{i=1}^n |⟨a, x_i⟩|²   ⇔   max_{a ∈ R^d, ‖a‖=1} ⟨a, Va⟩.
PCA is an eigenproblem

max_{a ∈ R^d, ‖a‖=1} ⟨a, Va⟩

◮ Solutions are the stationary points of the Lagrangian
  L(a, λ) = ⟨a, Va⟩ − λ(‖a‖² − 1).
◮ Setting ∂L/∂a = 0 gives Va = λa, hence ⟨a, Va⟩ = λ.

The optimization problem is solved by the eigenvector of V associated with the largest eigenvalue.

Note: the reasoning extends to k > 1 – the solution is given by the first k eigenvectors of V.
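Putting the pieces together in code (an illustrative sketch, not from the slides): form V = (1/n) X̂ᵀX̂, take its top-k eigenvectors as the dictionary columns, and reconstruct via DD*. The data below is hypothetical; centering is taken up on the affine-dictionary slide.

```python
import numpy as np

# Minimal PCA sketch via the eigenproblem of V = (1/n) X^T X.
def pca_dictionary(X, k):
    n = X.shape[0]
    V = (X.T @ X) / n
    eigvecs = np.linalg.eigh(V)[1]            # eigenvectors, ascending eigenvalues
    return eigvecs[:, ::-1][:, :k]            # columns a_1, ..., a_k

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))
D = pca_dictionary(X, k=3)
X_hat = X @ D @ D.T                           # reconstructions D D* x_i
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))   # empirical reconstruction error
```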
PCA model

Assumes the support of the data distribution is well approximated by a low-dimensional linear subspace.

[Figure: data points in X lying close to a low-dimensional linear subspace.]

Can we consider an affine representation?
Can we consider non-linear representations using PCA?
PCA and affine dictionaries

Consider the problem, with D as in PCA:

min_{D ∈ D, b ∈ R^d} (1/n) Σ_{i=1}^n min_{z_i ∈ F_k} ‖x_i − D z_i − b‖².

The above problem is equivalent to

min_{D ∈ D} (1/n) Σ_{i=1}^n ‖x̄_i − DD* x̄_i‖²   (with P = DD* the projection as before),

with x̄_i = x_i − m, i = 1, ..., n, where m is the empirical mean of the data.

Note: computations are unchanged, but one needs to consider centered data.
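A minimal sketch of the affine case (illustration only; the data and the choice of k are hypothetical): center the data, run PCA on the centered samples, and add the mean back when reconstructing.

```python
import numpy as np

# Affine PCA sketch: center, project, then add the mean back.
def affine_pca_reconstruct(X, k):
    m = X.mean(axis=0)                          # empirical mean m
    Xc = X - m                                  # centered data x_i - m
    V = (Xc.T @ Xc) / Xc.shape[0]               # V = (1/n) Xc^T Xc
    D = np.linalg.eigh(V)[1][:, ::-1][:, :k]    # top-k eigenvectors
    return Xc @ D @ D.T + m                     # D D* (x_i - m) + m

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 6)) + 5.0         # data with a nonzero mean
X_hat = affine_pca_reconstruct(X, k=2)
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```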