RegML 2020 Class 7: Dictionary learning (Lorenzo Rosasco)


1. RegML 2020 Class 7: Dictionary learning. Lorenzo Rosasco, UNIGE-MIT-IIT

2. Data representation
A mapping of data into a new format better suited for further processing.

3. Data representation (cont.)
Given a data space X, a data representation is a map Φ : X → F to a representation space F. It has different names in different fields:
- machine learning: feature map
- signal processing: analysis operator / transform
- information theory: encoder
- computational geometry: embedding

4. Outline
Part II: Data representation by learning
- Dictionary learning
- Metric learning

5. Supervised or unsupervised?
Supervised (labelled/annotated) data are expensive! Ideally, a good data representation should reduce the need for (human) annotation...
⇒ unsupervised learning of Φ.

6. Unsupervised representation learning
Samples S = {x_1, ..., x_n} from a distribution ρ on the input space X are available. What principles can guide the learning of a "good" representation in an unsupervised fashion?

7–8. Unsupervised representation learning principles
Two main concepts:
1. Reconstruction: there exists a map Ψ : F → X such that Ψ ∘ Φ(x) ∼ x for all x ∈ X.
2. Similarity preservation: Φ(x) ∼ Φ(x') ⇔ x ∼ x' for all x ∈ X.
Most unsupervised work has focused on reconstruction rather than on similarity; we give an overview next.

9. Reconstruction-based data representation
Basic idea: the quality of a representation Φ is measured by the reconstruction error provided by an associated reconstruction Ψ,
$\| x - \Psi \circ \Phi(x) \|$.

10–11. Empirical data and population
Given S = {x_1, ..., x_n}, minimize the empirical reconstruction error
$\hat{\mathcal{E}}(\Phi, \Psi) = \frac{1}{n} \sum_{i=1}^{n} \| x_i - \Psi \circ \Phi(x_i) \|^2$
as a proxy for the expected reconstruction error
$\mathcal{E}(\Phi, \Psi) = \int \mathrm{d}\rho(x)\, \| x - \Psi \circ \Phi(x) \|^2$,
where ρ is the data distribution (fixed but unknown).
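The empirical error above is straightforward to evaluate once Φ and Ψ are fixed. A minimal numpy sketch; the function name and the callable interface for Φ and Ψ are illustrative choices, not from the slides:

```python
import numpy as np

def empirical_reconstruction_error(X, phi, psi):
    """(1/n) * sum_i ||x_i - Psi(Phi(x_i))||^2 over the rows of X.

    X   : (n, d) array of samples
    phi : callable mapping a (d,) input to its representation
    psi : callable mapping a representation back to a (d,) reconstruction
    """
    residuals = np.array([x - psi(phi(x)) for x in X])   # x_i - Psi(Phi(x_i))
    return np.mean(np.sum(residuals ** 2, axis=1))
```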

12. Empirical data and population
$\min_{\Phi, \Psi} \mathcal{E}(\Phi, \Psi), \qquad \mathcal{E}(\Phi, \Psi) = \int \mathrm{d}\rho(x)\, \| x - \Psi \circ \Phi(x) \|^2$
Caveat: reconstruction alone is not enough... copying the data, i.e. Ψ ∘ Φ = I, gives zero reconstruction error!

13–14. Dictionary learning
Consider the reconstruction error $\| x - \Psi \circ \Phi(x) \|$ and let X = R^d, F = R^p, with
1. linear reconstruction: Ψ ∈ D, where D is a subset of the space of linear maps from F to X;
2. nearest neighbor representation: for Ψ ∈ D,
$\Phi(x) = \Phi_{\Psi}(x) = \arg\min_{\beta \in \mathcal{F}_{\lambda}} \| x - \Psi \beta \|^2$,
where F_λ is a subset of F.

15–16. Linear reconstruction and dictionaries
Each reconstruction Ψ ∈ D can be identified with a dictionary matrix with columns a_1, ..., a_p ∈ R^d. The reconstruction of an input x ∈ X corresponds to a suitable linear expansion on the dictionary,
$x = \sum_{j=1}^{p} a_j \beta_j, \qquad \beta_1, \ldots, \beta_p \in \mathbb{R}$.
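In matrix form the expansion is simply x = Ψβ with Ψ the d × p dictionary matrix. A tiny numpy illustration; the dimensions and the random dictionary and coefficients are placeholders:

```python
import numpy as np

d, p = 10, 25                  # ambient dimension and number of atoms (illustrative)
Psi = np.random.randn(d, p)    # dictionary matrix with columns a_1, ..., a_p
beta = np.random.randn(p)      # coefficient vector

x = Psi @ beta                 # linear expansion sum_j a_j * beta_j
```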

17. Nearest neighbor representation
$\Phi(x) = \Phi_{\Psi}(x) = \arg\min_{\beta \in \mathcal{F}_{\lambda}} \| x - \Psi \beta \|^2$
The above representation is called nearest neighbor (NN) since, for Ψ ∈ D and X_λ = Ψ F_λ, the representation Φ(x) provides the closest point to x in X_λ:
$d(x, X_{\lambda}) = \min_{x' \in X_{\lambda}} \| x - x' \|^2 = \min_{\beta \in \mathcal{F}_{\lambda}} \| x - \Psi \beta \|^2$.

18–19. Nearest neighbor representation (cont.)
NN representations are defined by a constrained inverse problem,
$\min_{\beta \in \mathcal{F}_{\lambda}} \| x - \Psi \beta \|^2$.
Alternatively, let F_λ = F and add a regularization term R_λ : F → R,
$\min_{\beta \in \mathcal{F}} \left\{ \| x - \Psi \beta \|^2 + R_{\lambda}(\beta) \right\}$.
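As a concrete, hedged instance of the regularized form, one can take R_λ(β) = λ‖β‖², which admits a closed-form code (the ℓ1 choice, which enforces sparsity, appears later under sparse coding). The sketch below assumes that ridge choice:

```python
import numpy as np

def nn_representation_ridge(x, Psi, lam):
    """Code Phi(x) = argmin_beta ||x - Psi beta||^2 + lam * ||beta||^2.

    Illustrative choice R_lambda(beta) = lam * ||beta||^2; the minimizer has
    the closed form beta = (Psi^T Psi + lam * I)^{-1} Psi^T x.
    """
    p = Psi.shape[1]
    return np.linalg.solve(Psi.T @ Psi + lam * np.eye(p), Psi.T @ x)
```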

20. Dictionary learning
Then
$\min_{\Psi, \Phi} \frac{1}{n} \sum_{i=1}^{n} \| x_i - \Psi \circ \Phi(x_i) \|^2$
becomes
$\underbrace{\min_{\Psi \in \mathcal{D}}}_{\text{dictionary learning}} \frac{1}{n} \sum_{i=1}^{n} \underbrace{\min_{\beta_i \in \mathcal{F}_{\lambda}} \| x_i - \Psi \beta_i \|^2}_{\text{representation learning}}$
Dictionary learning:
- learning a regularized representation on a dictionary...
- while simultaneously learning the dictionary itself.
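A common way to attack this joint problem is alternating minimization: fix Ψ and compute the codes β_i, then fix the codes and update Ψ. The sketch below is illustrative, not the specific algorithm of the slides; it assumes a ridge penalty λ‖β‖² in the coding step and unit-norm atoms:

```python
import numpy as np

def dictionary_learning(X, p, lam=0.1, n_iter=50, seed=0):
    """Alternating minimization sketch for
    min_Psi (1/n) sum_i min_beta ||x_i - Psi beta||^2 + lam * ||beta||^2.

    X : (n, d) data matrix;  p : number of dictionary atoms.
    Returns the d x p dictionary Psi and the (p, n) matrix of codes.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Psi = rng.standard_normal((d, p))
    Psi /= np.linalg.norm(Psi, axis=0)                 # unit-norm atoms

    for _ in range(n_iter):
        # representation step: code every x_i with the current dictionary (ridge, closed form)
        B = np.linalg.solve(Psi.T @ Psi + lam * np.eye(p), Psi.T @ X.T)
        # dictionary step: least-squares update of Psi given the codes
        Psi = X.T @ B.T @ np.linalg.pinv(B @ B.T)
        Psi /= np.linalg.norm(Psi, axis=0) + 1e-12     # renormalize atoms
    return Psi, B
```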

21. Examples
The framework introduced above encompasses a large number of approaches:
- PCA (and kernel PCA)
- K-SVD
- sparse coding
- K-means
- K-flats
- ...

22–24. Example 1: Principal Component Analysis (PCA)
Let F_λ = F_k = R^k, with k ≤ min{n, d}, and
$\mathcal{D} = \{ \Psi : \mathcal{F} \to X \text{ linear} \mid \Psi^* \Psi = I \}$.
- Ψ is a d × k matrix with orthogonal, unit-norm columns,
$\Psi \beta = \sum_{j=1}^{k} a_j \beta_j, \qquad \beta \in \mathcal{F}$.
- Ψ* : X → F,
$\Psi^* x = (\langle a_1, x \rangle, \ldots, \langle a_k, x \rangle), \qquad x \in X$.

25. PCA and the best subspace
- ΨΨ* : X → X,
$\Psi \Psi^* x = \sum_{j=1}^{k} a_j \langle a_j, x \rangle, \qquad x \in X$.
(figure: a vector x decomposed into its projection ⟨x, a⟩ a along an atom a and the residual x − ⟨x, a⟩ a)
- P = ΨΨ* is the projection (P = P²) onto the subspace of R^d spanned by a_1, ..., a_k.
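A quick numerical check of these properties; the orthonormal columns are obtained here from a QR factorization of a random matrix, purely for illustration:

```python
import numpy as np

d, k = 5, 2                                    # illustrative dimensions
Psi, _ = np.linalg.qr(np.random.randn(d, k))   # d x k matrix with orthonormal columns

P = Psi @ Psi.T                                # projection onto span{a_1, ..., a_k}
x = np.random.randn(d)

assert np.allclose(P @ P, P)                   # idempotent: P = P^2
x_proj = P @ x                                 # closest point to x in the subspace
```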

26–27. Rewriting PCA
Note that
$\Phi(x) = \Psi^* x = \arg\min_{\beta \in \mathcal{F}_k} \| x - \Psi \beta \|^2, \qquad \forall x \in X$,
so that we can rewrite the PCA minimization as
$\min_{\Psi \in \mathcal{D}} \frac{1}{n} \sum_{i=1}^{n} \| x_i - \Psi \Psi^* x_i \|^2$.
Subspace learning: the problem of finding the k-dimensional orthogonal projection giving the best reconstruction.

28–29. PCA computation
Let $\hat{X}$ be the n × d data matrix and $C = \frac{1}{n} \hat{X}^{\top} \hat{X}$.
The PCA optimization problem is solved by the eigenvectors of C associated with the k largest eigenvalues.
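A minimal numpy sketch of this computation; centering of the data is left to the caller, and the function and variable names are illustrative:

```python
import numpy as np

def pca_dictionary(X, k):
    """Return the d x k matrix Psi whose columns are the top-k eigenvectors of C = (1/n) X^T X."""
    n = X.shape[0]
    C = (X.T @ X) / n
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]         # keep the top-k eigenvectors

# usage: codes = X @ Psi gives the representations Psi^T x_i (row-wise),
# and codes @ Psi.T gives the reconstructions Psi Psi^T x_i.
```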

30. Learning a linear representation with PCA
Subspace learning: the problem of finding the k-dimensional orthogonal projection giving the best reconstruction.
(figure: data in the space X)
PCA assumes the support of the data distribution to be well approximated by a low-dimensional linear subspace.

31–33. PCA beyond linearity
(figure sequence in the data space X)

34. Kernel PCA
Consider a feature map φ : X → H and the associated (reproducing) kernel
$K(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$.
We can consider the empirical reconstruction in the feature space,
$\min_{\Psi \in \mathcal{D}} \frac{1}{n} \sum_{i=1}^{n} \min_{\beta_i \in \mathcal{H}} \| \varphi(x_i) - \Psi \beta_i \|_{\mathcal{H}}^2$.
Connection to manifold learning...
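In practice kernel PCA only needs the n × n Gram matrix. The sketch below uses the standard centered-kernel-matrix formulation; the Gaussian kernel and its bandwidth are illustrative choices, not mandated by the slides:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)); sigma is illustrative."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_pca(X, k, kernel=gaussian_kernel):
    """Representations of the training points via eigendecomposition of the centered Gram matrix."""
    K = kernel(X, X)
    n = K.shape[0]
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one        # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:k]               # top-k components
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))
```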

35–39. Example 2: Sparse coding
One of the first and most famous dictionary learning techniques. It corresponds to
- F = R^p with p ≥ d,
- $\mathcal{F}_{\lambda} = \{ \beta \in \mathcal{F} : \| \beta \|_1 \leq \lambda \}$, λ > 0,
- $\mathcal{D} = \{ \Psi : \mathcal{F} \to X \mid \| \Psi e_j \| \leq 1,\ j = 1, \ldots, p \}$.
Hence,
$\underbrace{\min_{\Psi \in \mathcal{D}}}_{\text{dictionary learning}} \frac{1}{n} \sum_{i=1}^{n} \underbrace{\min_{\beta_i \in \mathcal{F}_{\lambda}} \| x_i - \Psi \beta_i \|^2}_{\text{sparse representation}}$
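For a runnable example, scikit-learn implements dictionary learning with an ℓ1 penalty on the codes (the penalized rather than the constrained-ball formulation) and unit-norm atoms. A minimal sketch with placeholder data; the number of atoms and the penalty weight are illustrative:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.randn(200, 20)        # placeholder data: n = 200 samples in d = 20 dimensions

# p = 50 atoms (p >= d, overcomplete); alpha weights the l1 penalty on the codes
dl = DictionaryLearning(n_components=50, alpha=1.0, max_iter=200,
                        transform_algorithm="lasso_lars", random_state=0)
codes = dl.fit_transform(X)         # sparse representations beta_i, shape (200, 50)
atoms = dl.components_              # learned dictionary, shape (50, 20); rows are the atoms a_j
```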
