RegML 2016, Class 7: Dictionary Learning
Lorenzo Rosasco, UNIGE-MIT-IIT
June 30, 2016
Data representation
A mapping of the data into a new format better suited for further processing.
Data representation (cont.)
Given a data space X, a data representation is a map Φ : X → F to a representation space F. Different names in different fields:
◮ machine learning: feature map
◮ signal processing: analysis operator/transform
◮ information theory: encoder
◮ computational geometry: embedding
Supervised or unsupervised?
Supervised (labelled/annotated) data are expensive! Ideally, a good data representation should reduce the need for (human) annotation...
⇒ Unsupervised learning of Φ
Unsupervised representation learning
Samples $S = \{x_1, \ldots, x_n\}$ from a distribution ρ on the input space X are available.
What are the principles for learning a "good" representation in an unsupervised fashion?
Unsupervised representation learning principles
Two main concepts:
1. Reconstruction: there exists a map Ψ : F → X such that $\Psi \circ \Phi(x) \sim x$, for all $x \in X$.
2. Similarity preservation: it holds that $\Phi(x) \sim \Phi(x') \Leftrightarrow x \sim x'$, for all $x, x' \in X$.
Most unsupervised work has focused on reconstruction rather than on similarity. We give an overview next.
Reconstruction based data representation
Basic idea: the quality of a representation Φ is measured by the reconstruction error provided by an associated reconstruction Ψ,
$$\|x - \Psi \circ \Phi(x)\|.$$
Empirical data and population
Given $S = \{x_1, \ldots, x_n\}$, minimize the empirical reconstruction error
$$\hat{\mathcal{E}}(\Phi, \Psi) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi \circ \Phi(x_i)\|^2,$$
as a proxy to the expected reconstruction error
$$\mathcal{E}(\Phi, \Psi) = \int d\rho(x)\, \|x - \Psi \circ \Phi(x)\|^2,$$
where ρ is the data distribution (fixed but unknown).
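As an illustration (not part of the original slides), here is a minimal numpy sketch of the empirical reconstruction error for a generic encoder/decoder pair; the random data and the least-squares encoder are placeholder choices, used only to exercise the formula.

```python
import numpy as np

# Minimal sketch (assuming X = R^d, F = R^p) of the empirical
# reconstruction error (1/n) * sum_i ||x_i - Psi(Phi(x_i))||^2
# for a generic encoder Phi and decoder Psi.

def empirical_reconstruction_error(X, encode, decode):
    """X: (n, d) data matrix; encode: x -> beta; decode: beta -> x_hat."""
    residuals = X - np.array([decode(encode(x)) for x in X])
    return np.mean(np.sum(residuals ** 2, axis=1))

# Toy example: a fixed random linear pair (not learned).
rng = np.random.default_rng(0)
n, d, p = 100, 20, 5
X = rng.standard_normal((n, d))
A = rng.standard_normal((d, p))                            # dictionary-like matrix Psi
encode = lambda x: np.linalg.lstsq(A, x, rcond=None)[0]    # Phi(x): least-squares code
decode = lambda beta: A @ beta                             # Psi(beta): linear reconstruction
print(empirical_reconstruction_error(X, encode, decode))
```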
Empirical data and population (cont.)
$$\min_{\Phi, \Psi} \mathcal{E}(\Phi, \Psi), \qquad \mathcal{E}(\Phi, \Psi) = \int d\rho(x)\, \|x - \Psi \circ \Phi(x)\|^2.$$
Caveat: reconstruction alone is not enough... copying the data, i.e. $\Psi \circ \Phi = I$, gives zero reconstruction error!
Dictionary learning
$$\|x - \Psi \circ \Phi(x)\|$$
Let $X = \mathbb{R}^d$, $F = \mathbb{R}^p$.
1. Linear reconstruction: $\Psi \in D$, with $D$ a subset of the space of linear maps from F to X.
2. Nearest neighbor representation: for $\Psi \in D$,
$$\Phi(x) = \Phi_\Psi(x) = \arg\min_{\beta \in F_\lambda} \|x - \Psi\beta\|^2,$$
where $F_\lambda$ is a subset of F.
Linear reconstruction and dictionaries
Each reconstruction $\Psi \in D$ can be identified with a dictionary matrix with columns $a_1, \ldots, a_p \in \mathbb{R}^d$. The reconstruction of an input $x \in X$ corresponds to a suitable linear expansion on the dictionary,
$$x = \sum_{j=1}^{p} a_j \beta_j, \qquad \beta_1, \ldots, \beta_p \in \mathbb{R}.$$
Nearest neighbor representation
$$\Phi(x) = \Phi_\Psi(x) = \arg\min_{\beta \in F_\lambda} \|x - \Psi\beta\|^2.$$
The above representation is called nearest neighbor (NN) since, for $\Psi \in D$ and $X_\lambda = \Psi F_\lambda$, the representation $\Phi(x)$ provides the closest point to x in $X_\lambda$:
$$d(x, X_\lambda) = \min_{x' \in X_\lambda} \|x - x'\|^2 = \min_{\beta \in F_\lambda} \|x - \Psi\beta\|^2.$$
Nearest neighbor representation (cont.)
NN representations are defined by a constrained inverse problem,
$$\min_{\beta \in F_\lambda} \|x - \Psi\beta\|^2.$$
Alternatively, let $F_\lambda = F$ and add a regularization term $R_\lambda : F \to \mathbb{R}$,
$$\min_{\beta \in F} \left\{ \|x - \Psi\beta\|^2 + R_\lambda(\beta) \right\}.$$
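As a hedged illustration of the regularized variant: if one takes $R_\lambda(\beta) = \lambda\|\beta\|^2$ (a squared-norm penalty, not the sparsity-inducing one used later), the representation has a closed form. A minimal numpy sketch, with a random dictionary as placeholder:

```python
import numpy as np

# Sketch: nearest neighbor representation with a squared-norm penalty,
#   Phi(x) = argmin_beta ||x - Psi beta||^2 + lam * ||beta||^2,
# which has the closed form beta = (Psi^T Psi + lam I)^{-1} Psi^T x.
# (With an l1 penalty one would instead use the iterative scheme shown
# in the sparse coding slides below.)

def ridge_representation(Psi, x, lam):
    p = Psi.shape[1]
    return np.linalg.solve(Psi.T @ Psi + lam * np.eye(p), Psi.T @ x)

rng = np.random.default_rng(0)
Psi = rng.standard_normal((20, 8))     # d x p dictionary (placeholder)
x = rng.standard_normal(20)
beta = ridge_representation(Psi, x, lam=0.1)
print(np.linalg.norm(x - Psi @ beta))  # reconstruction error for this x
```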
Dictionary learning
Then
$$\min_{\Psi, \Phi} \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi \circ \Phi(x_i)\|^2$$
becomes
$$\underbrace{\min_{\Psi \in D}}_{\text{dictionary learning}} \frac{1}{n} \sum_{i=1}^{n} \underbrace{\min_{\beta_i \in F_\lambda} \|x_i - \Psi\beta_i\|^2}_{\text{representation learning}}.$$
Dictionary learning:
◮ learning a regularized representation on a dictionary...
◮ while simultaneously learning the dictionary itself.
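The examples that follow all attack this nested problem by alternating between the two minimizations. As a rough sketch (my own illustration, not the slides' algorithm), here is the scheme with both steps instantiated as plain unconstrained least squares, i.e. $D$ = all linear maps and $F_\lambda = F$, which reduces to an alternating least-squares factorization; PCA, sparse coding and K-means differ only in how the two steps are constrained.

```python
import numpy as np

# Generic alternating minimization for
#   min_{Psi in D} (1/n) sum_i min_{beta_i in F_lambda} ||x_i - Psi beta_i||^2.
# Sketch only: both steps are unconstrained least squares here.

def learn_dictionary(X, p, n_iter=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Psi = rng.standard_normal((d, p))                      # initial dictionary
    for _ in range(n_iter):
        B = np.linalg.lstsq(Psi, X.T, rcond=None)[0].T     # representation step: n x p codes
        Psi = np.linalg.lstsq(B, X, rcond=None)[0].T       # dictionary step: d x p dictionary
    return Psi, B

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
Psi, B = learn_dictionary(X, p=3)
print(np.mean(np.sum((X - B @ Psi.T) ** 2, axis=1)))       # empirical reconstruction error
```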
Examples
The framework introduced above encompasses a large number of approaches:
◮ PCA (& kernel PCA)
◮ K-SVD
◮ Sparse coding
◮ K-means
◮ K-flats
◮ ...
Example 1: Principal Component Analysis (PCA)
Let $F_\lambda = F_k = \mathbb{R}^k$, $k \le \min\{n, d\}$, and
$$D = \{\Psi : F \to X \text{ linear} \mid \Psi^*\Psi = I\}.$$
◮ Ψ is a d × k matrix with orthogonal, unit norm columns,
$$\Psi\beta = \sum_{j=1}^{k} a_j \beta_j, \qquad \beta \in F.$$
◮ $\Psi^* : X \to F$, $\Psi^* x = (\langle a_1, x\rangle, \ldots, \langle a_k, x\rangle)$, $x \in X$.
PCA & best subspace
◮ $\Psi\Psi^* : X \to X$, $\Psi\Psi^* x = \sum_{j=1}^{k} a_j \langle a_j, x\rangle$, $x \in X$.
(Figure: decomposition of x into its projection $\langle x, a\rangle a$ along a and the residual $x - \langle x, a\rangle a$.)
◮ $P = \Psi\Psi^*$ is the projection ($P = P^2$) onto the subspace of $\mathbb{R}^d$ spanned by $a_1, \ldots, a_k$.
Rewriting PCA
Note that, for all $x \in X$,
$$\Phi(x) = \Psi^* x = \arg\min_{\beta \in F_k} \|x - \Psi\beta\|^2,$$
so that we can rewrite the PCA minimization as
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi\Psi^* x_i\|^2.$$
Subspace learning: the problem of finding a k-dimensional orthogonal projection giving the best reconstruction.
PCA computation
Let $\hat{X}$ be the n × d data matrix and $C = \frac{1}{n}\hat{X}^T\hat{X}$. The PCA optimization problem is solved by the eigenvectors of C associated with the k largest eigenvalues.
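A minimal numpy sketch of this computation (assuming centered data; the variable names and toy data are mine):

```python
import numpy as np

# PCA as dictionary learning: the dictionary Psi collects the k eigenvectors
# of C = (1/n) X^T X with largest eigenvalues; Phi(x) = Psi^T x and the
# reconstruction is Psi Psi^T x.

def pca_dictionary(X, k):
    n = X.shape[0]
    C = (X.T @ X) / n
    eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    Psi = eigvecs[:, -k:][:, ::-1]               # top-k eigenvectors, d x k
    return Psi

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
X = X - X.mean(axis=0)                           # center the data
Psi = pca_dictionary(X, k=3)
X_hat = X @ Psi @ Psi.T                          # reconstructions Psi Psi^* x_i
print(np.mean(np.sum((X - X_hat) ** 2, axis=1))) # empirical reconstruction error
```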
Learning a linear representation with PCA
Subspace learning: the problem of finding a k-dimensional orthogonal projection giving the best reconstruction.
PCA assumes the support of the data distribution to be well approximated by a low dimensional linear subspace.
PCA beyond linearity
(Figures: data in X supported on a nonlinear set, not well described by a single low dimensional linear subspace.)
Kernel PCA
Consider a feature map $\varphi : X \to \mathcal{H}$ and the associated (reproducing) kernel
$$K(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}.$$
We can consider the empirical reconstruction in the feature space,
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \min_{\beta_i \in \mathcal{H}} \|\varphi(x_i) - \Psi\beta_i\|_{\mathcal{H}}^2.$$
Connection to manifold learning...
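A hedged sketch of one standard kernel PCA recipe (center the kernel matrix, eigendecompose, use the rescaled eigenvectors to obtain the representations); the Gaussian kernel and its width are placeholder choices:

```python
import numpy as np

# Kernel PCA sketch: the feature-space reconstruction is never formed
# explicitly; the representation of x_i is read off the eigenvectors
# of the centered kernel matrix K_ij = K(x_i, x_j).

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_pca(X, k, sigma=1.0):
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:k]          # top-k components
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas                           # n x k representations

# Toy nonlinear data: a noisy circle, a case where linear PCA struggles.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))
Z = kernel_pca(X, k=2, sigma=0.5)
print(Z.shape)
```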
Example 2: Sparse coding
One of the first and most famous dictionary learning techniques. It corresponds to
◮ $F = \mathbb{R}^p$, $p \ge d$,
◮ $F_\lambda = \{\beta \in F : \|\beta\|_1 \le \lambda\}$, $\lambda > 0$,
◮ $D = \{\Psi : F \to X \text{ linear} \mid \|\Psi e_j\| \le 1,\ j = 1, \ldots, p\}$.
Hence,
$$\underbrace{\min_{\Psi \in D}}_{\text{dictionary learning}} \frac{1}{n} \sum_{i=1}^{n} \underbrace{\min_{\beta_i \in F_\lambda} \|x_i - \Psi\beta_i\|^2}_{\text{sparse representation}}.$$
Sparse coding (cont.)
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \min_{\beta_i \in \mathbb{R}^p,\ \|\beta_i\|_1 \le \lambda} \|x_i - \Psi\beta_i\|^2$$
◮ The problem is not convex... but it is separately convex in the $\beta_i$'s and Ψ.
◮ An alternating minimization is fairly natural (other approaches are possible, see e.g. [Schnass '15, Elad et al. '06]).
Representation computation
Given a dictionary, the problems
$$\min_{\beta \in F_\lambda} \|x_i - \Psi\beta\|^2, \qquad i = 1, \ldots, n,$$
are convex and correspond to sparse representation problems. They can be solved using convex optimization techniques.
Splitting/proximal methods:
$$\beta^{t+1} = T_{\gamma\lambda}\big(\beta^t - \gamma\, \Psi^*(\Psi\beta^t - x_i)\big), \qquad \beta^0 \text{ given}, \quad t = 0, \ldots, T_{\max},$$
with $T_{\gamma\lambda}$ the soft-thresholding operator.
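A minimal sketch of this iteration for a single input x, written for the penalized form $\frac{1}{2}\|x - \Psi\beta\|^2 + \lambda\|\beta\|_1$ (which is what the soft-thresholding update corresponds to); the step size, penalty level, and toy data are placeholder choices:

```python
import numpy as np

# ISTA-style iteration for the sparse representation of a single x:
#   beta^{t+1} = T_{gamma*lam}( beta^t - gamma * Psi^T (Psi beta^t - x) ),
# with T the elementwise soft-thresholding operator.

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code(Psi, x, lam, n_iter=500):
    gamma = 1.0 / np.linalg.norm(Psi, 2) ** 2    # step size <= 1 / ||Psi||^2
    beta = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        grad = Psi.T @ (Psi @ beta - x)          # gradient of 0.5 * ||x - Psi beta||^2
        beta = soft_threshold(beta - gamma * grad, gamma * lam)
    return beta

rng = np.random.default_rng(0)
Psi = rng.standard_normal((20, 50))
Psi /= np.linalg.norm(Psi, axis=0)               # unit-norm atoms
beta_true = np.zeros(50)
beta_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
x = Psi @ beta_true + 0.01 * rng.standard_normal(20)
beta = sparse_code(Psi, x, lam=0.05)
print(np.nonzero(np.abs(beta) > 1e-3)[0])        # estimated (sparse) support
```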
Dictionary computation
Given $\Phi(x_i) = \beta_i$, $i = 1, \ldots, n$, we have
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi \circ \Phi(x_i)\|^2 = \min_{\Psi \in D} \frac{1}{n} \big\|\hat{X} - B\Psi^*\big\|_F^2,$$
where B is the n × p matrix with rows $\beta_i$, $i = 1, \ldots, n$, and $\|\cdot\|_F$ denotes the Frobenius norm. It is a convex problem, solvable via standard techniques.
Splitting/proximal methods:
$$\Psi^{t+1} = P\big(\Psi^t - \gamma_t (\Psi^t B^* - \hat{X}^*) B\big), \qquad \Psi^0 \text{ given}, \quad t = 0, \ldots, T_{\max},$$
where P is the projection corresponding to the constraints, applied to each column $\Psi^j$:
$$P(\Psi^j) = \Psi^j / \|\Psi^j\| \ \text{ if } \|\Psi^j\| > 1, \qquad P(\Psi^j) = \Psi^j \ \text{ if } \|\Psi^j\| \le 1.$$
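A sketch of the corresponding projected gradient update for the dictionary with the codes B held fixed; the step size and number of iterations are placeholder choices:

```python
import numpy as np

# Projected gradient step for the dictionary, codes B fixed:
# objective f(Psi) = (1/(2n)) * ||X - B Psi^T||_F^2, with gradient
# (1/n) * (Psi B^T - X^T) B, followed by column-wise projection onto
# the unit ball ||Psi_j|| <= 1.

def project_columns(Psi):
    norms = np.maximum(np.linalg.norm(Psi, axis=0), 1.0)
    return Psi / norms                            # rescale only columns with norm > 1

def dictionary_step(Psi, B, X, gamma):
    n = X.shape[0]
    grad = (Psi @ B.T - X.T) @ B / n
    return project_columns(Psi - gamma * grad)

rng = np.random.default_rng(0)
n, d, p = 100, 20, 30
X = rng.standard_normal((n, d))
B = rng.standard_normal((n, p))                   # fixed codes (placeholder)
Psi = rng.standard_normal((d, p))
for _ in range(200):
    Psi = dictionary_step(Psi, B, X, gamma=0.5 * n / np.linalg.norm(B, 2) ** 2)
print(np.linalg.norm(X - B @ Psi.T) ** 2 / n)     # empirical reconstruction error
```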
Sparse coding model
◮ Sparse coding assumes the support of the data distribution to be a union of subspaces, i.e. all possible $\binom{p}{s}$ s-dimensional subspaces in $\mathbb{R}^p$, where s is the sparsity level.
◮ More general penalties correspond to more general geometric assumptions.
Example 3: K-means & vector quantization
K-means is typically seen as a clustering algorithm in machine learning... but it is also a classical vector quantization approach. Here we revisit this point of view from a data representation perspective. K-means corresponds to
◮ $F_\lambda = F_k = \{e_1, \ldots, e_k\}$, the canonical basis in $\mathbb{R}^k$, $k \le n$,
◮ $D = \{\Psi : F \to X \mid \text{linear}\}$.
K-means computation
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \min_{\beta_i \in \{e_1, \ldots, e_k\}} \|x_i - \Psi\beta_i\|^2$$
The K-means problem is not convex.
Alternating minimization:
1. Initialize the dictionary $\Psi^0$.
2. Let $\Phi(x_i) = \beta_i$, $i = 1, \ldots, n$, be the solutions of the problems
$$\min_{\beta \in \{e_1, \ldots, e_k\}} \|x_i - \Psi\beta\|^2, \qquad i = 1, \ldots, n,$$
and set $V_j = \{x \in S \mid \Phi(x) = e_j\}$ (multiple points have the same representation since $k \le n$).
3. Letting $a_j = \Psi e_j$, we can write
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi \circ \Phi(x_i)\|^2 = \min_{a_1, \ldots, a_k \in \mathbb{R}^d} \frac{1}{n} \sum_{j=1}^{k} \sum_{x \in V_j} \|x - a_j\|^2.$$
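A minimal sketch of this alternating scheme (Lloyd-style iterations; initializing the centers by sampling data points is my own choice, and the toy data are placeholders):

```python
import numpy as np

# K-means as dictionary learning: codes are canonical basis vectors
# (cluster assignments), dictionary columns a_j are cluster centers.

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    A = X[rng.choice(len(X), size=k, replace=False)]         # k x d centers
    for _ in range(n_iter):
        # representation step: assign each x_i to its nearest center
        dists = np.sum((X[:, None, :] - A[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dists, axis=1)
        # dictionary step: a_j = mean of the points in V_j
        for j in range(k):
            if np.any(labels == j):
                A[j] = X[labels == j].mean(axis=0)
    return A, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((100, 2)) + c for c in ([0, 0], [5, 5], [0, 5])])
A, labels = kmeans(X, k=3)
print(A)                                                      # recovered centers
```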