RegML 2016, Class 7: Dictionary Learning
Lorenzo Rosasco, UNIGE-MIT-IIT
June 30, 2016
Data representation
A mapping of the data into a new format better suited for further processing.
Data representation (cont.)
Given a data space X, a data representation is a map Φ : X → F to a representation space F. Different names in different fields:
◮ machine learning: feature map
◮ signal processing: analysis operator/transform
◮ information theory: encoder
◮ computational geometry: embedding
Supervised or unsupervised?
Supervised (labelled/annotated) data are expensive! Ideally, a good data representation should reduce the need for (human) annotation...
⇒ Unsupervised learning of Φ
Unsupervised representation learning
Samples $S = \{x_1, \ldots, x_n\}$ from a distribution ρ on the input space X are available.
What are the principles for learning a "good" representation in an unsupervised fashion?
Unsupervised representation learning principles
Two main concepts:
1. Reconstruction: there exists a map Ψ : F → X such that $\Psi \circ \Phi(x) \sim x$, for all $x \in X$.
2. Similarity preservation: it holds that $\Phi(x) \sim \Phi(x') \Leftrightarrow x \sim x'$, for all $x, x' \in X$.
Most unsupervised work has focused on reconstruction rather than on similarity. We give an overview next.
Reconstruction based data representation
Basic idea: the quality of a representation Φ is measured by the reconstruction error provided by an associated reconstruction Ψ,
$$\|x - \Psi \circ \Phi(x)\|.$$
Empirical data and population
Given $S = \{x_1, \ldots, x_n\}$, minimize the empirical reconstruction error
$$\hat{\mathcal{E}}(\Phi, \Psi) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi \circ \Phi(x_i)\|^2,$$
as a proxy to the expected reconstruction error
$$\mathcal{E}(\Phi, \Psi) = \int d\rho(x)\, \|x - \Psi \circ \Phi(x)\|^2,$$
where ρ is the data distribution (fixed but unknown).
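As an illustration (not part of the original slides), here is a minimal numpy sketch of the empirical reconstruction error for a generic encoder/decoder pair; the random data and the least-squares encoder are placeholder choices, used only to exercise the formula.

```python
import numpy as np

# Minimal sketch (assuming X = R^d, F = R^p) of the empirical
# reconstruction error (1/n) * sum_i ||x_i - Psi(Phi(x_i))||^2
# for a generic encoder Phi and decoder Psi.

def empirical_reconstruction_error(X, encode, decode):
    """X: (n, d) data matrix; encode: x -> beta; decode: beta -> x_hat."""
    residuals = X - np.array([decode(encode(x)) for x in X])
    return np.mean(np.sum(residuals ** 2, axis=1))

# Toy example: a fixed random linear pair (not learned).
rng = np.random.default_rng(0)
n, d, p = 100, 20, 5
X = rng.standard_normal((n, d))
A = rng.standard_normal((d, p))                            # dictionary-like matrix Psi
encode = lambda x: np.linalg.lstsq(A, x, rcond=None)[0]    # Phi(x): least-squares code
decode = lambda beta: A @ beta                             # Psi(beta): linear reconstruction
print(empirical_reconstruction_error(X, encode, decode))
```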
Empirical data and population (cont.)
$$\min_{\Phi, \Psi} \mathcal{E}(\Phi, \Psi), \qquad \mathcal{E}(\Phi, \Psi) = \int d\rho(x)\, \|x - \Psi \circ \Phi(x)\|^2.$$
Caveat: reconstruction alone is not enough... copying the data, i.e. $\Psi \circ \Phi = I$, gives zero reconstruction error!
Dictionary learning
$$\|x - \Psi \circ \Phi(x)\|$$
Let $X = \mathbb{R}^d$, $F = \mathbb{R}^p$.
1. Linear reconstruction: $\Psi \in D$, with $D$ a subset of the space of linear maps from F to X.
2. Nearest neighbor representation: for $\Psi \in D$,
$$\Phi(x) = \Phi_\Psi(x) = \arg\min_{\beta \in F_\lambda} \|x - \Psi\beta\|^2,$$
where $F_\lambda$ is a subset of F.
Linear reconstruction and dictionaries
Each reconstruction $\Psi \in D$ can be identified with a dictionary matrix with columns $a_1, \ldots, a_p \in \mathbb{R}^d$. The reconstruction of an input $x \in X$ corresponds to a suitable linear expansion on the dictionary,
$$x = \sum_{j=1}^{p} a_j \beta_j, \qquad \beta_1, \ldots, \beta_p \in \mathbb{R}.$$
Nearest neighbor representation
$$\Phi(x) = \Phi_\Psi(x) = \arg\min_{\beta \in F_\lambda} \|x - \Psi\beta\|^2.$$
The above representation is called nearest neighbor (NN) since, for $\Psi \in D$ and $X_\lambda = \Psi F_\lambda$, the representation $\Phi(x)$ provides the closest point to x in $X_\lambda$:
$$d(x, X_\lambda) = \min_{x' \in X_\lambda} \|x - x'\|^2 = \min_{\beta \in F_\lambda} \|x - \Psi\beta\|^2.$$
Nearest neighbor representation (cont.)
NN representations are defined by a constrained inverse problem,
$$\min_{\beta \in F_\lambda} \|x - \Psi\beta\|^2.$$
Alternatively, let $F_\lambda = F$ and add a regularization term $R_\lambda : F \to \mathbb{R}$,
$$\min_{\beta \in F} \left\{ \|x - \Psi\beta\|^2 + R_\lambda(\beta) \right\}.$$
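As a hedged illustration of the regularized variant: if one takes $R_\lambda(\beta) = \lambda\|\beta\|^2$ (a squared-norm penalty, not the sparsity-inducing one used later), the representation has a closed form. A minimal numpy sketch, with a random dictionary as placeholder:

```python
import numpy as np

# Sketch: nearest neighbor representation with a squared-norm penalty,
#   Phi(x) = argmin_beta ||x - Psi beta||^2 + lam * ||beta||^2,
# which has the closed form beta = (Psi^T Psi + lam I)^{-1} Psi^T x.
# (With an l1 penalty one would instead use the iterative scheme shown
# in the sparse coding slides below.)

def ridge_representation(Psi, x, lam):
    p = Psi.shape[1]
    return np.linalg.solve(Psi.T @ Psi + lam * np.eye(p), Psi.T @ x)

rng = np.random.default_rng(0)
Psi = rng.standard_normal((20, 8))     # d x p dictionary (placeholder)
x = rng.standard_normal(20)
beta = ridge_representation(Psi, x, lam=0.1)
print(np.linalg.norm(x - Psi @ beta))  # reconstruction error for this x
```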
Dictionary learning
Then
$$\min_{\Psi, \Phi} \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi \circ \Phi(x_i)\|^2$$
becomes
$$\underbrace{\min_{\Psi \in D}}_{\text{dictionary learning}} \frac{1}{n} \sum_{i=1}^{n} \underbrace{\min_{\beta_i \in F_\lambda} \|x_i - \Psi\beta_i\|^2}_{\text{representation learning}}.$$
Dictionary learning:
◮ learning a regularized representation on a dictionary...
◮ while simultaneously learning the dictionary itself.
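The examples that follow all attack this nested problem by alternating between the two minimizations. As a rough sketch (my own illustration, not the slides' algorithm), here is the scheme with both steps instantiated as plain unconstrained least squares, i.e. $D$ = all linear maps and $F_\lambda = F$, which reduces to an alternating least-squares factorization; PCA, sparse coding and K-means differ only in how the two steps are constrained.

```python
import numpy as np

# Generic alternating minimization for
#   min_{Psi in D} (1/n) sum_i min_{beta_i in F_lambda} ||x_i - Psi beta_i||^2.
# Sketch only: both steps are unconstrained least squares here.

def learn_dictionary(X, p, n_iter=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Psi = rng.standard_normal((d, p))                      # initial dictionary
    for _ in range(n_iter):
        B = np.linalg.lstsq(Psi, X.T, rcond=None)[0].T     # representation step: n x p codes
        Psi = np.linalg.lstsq(B, X, rcond=None)[0].T       # dictionary step: d x p dictionary
    return Psi, B

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
Psi, B = learn_dictionary(X, p=3)
print(np.mean(np.sum((X - B @ Psi.T) ** 2, axis=1)))       # empirical reconstruction error
```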
Examples
The framework introduced above encompasses a large number of approaches:
◮ PCA (& kernel PCA)
◮ K-SVD
◮ Sparse coding
◮ K-means
◮ K-flats
◮ ...
Example 1: Principal Component Analysis (PCA)
Let $F_\lambda = F_k = \mathbb{R}^k$, $k \le \min\{n, d\}$, and
$$D = \{\Psi : F \to X \text{ linear} \mid \Psi^*\Psi = I\}.$$
◮ Ψ is a d × k matrix with orthogonal, unit norm columns,
$$\Psi\beta = \sum_{j=1}^{k} a_j \beta_j, \qquad \beta \in F.$$
◮ $\Psi^* : X \to F$, $\Psi^* x = (\langle a_1, x\rangle, \ldots, \langle a_k, x\rangle)$, $x \in X$.
PCA & best subspace
◮ $\Psi\Psi^* : X \to X$, $\Psi\Psi^* x = \sum_{j=1}^{k} a_j \langle a_j, x\rangle$, $x \in X$.
(Figure: decomposition of x into its projection $\langle x, a\rangle a$ along a and the residual $x - \langle x, a\rangle a$.)
◮ $P = \Psi\Psi^*$ is the projection ($P = P^2$) onto the subspace of $\mathbb{R}^d$ spanned by $a_1, \ldots, a_k$.
Rewriting PCA
Note that, for all $x \in X$,
$$\Phi(x) = \Psi^* x = \arg\min_{\beta \in F_k} \|x - \Psi\beta\|^2,$$
so that we can rewrite the PCA minimization as
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi\Psi^* x_i\|^2.$$
Subspace learning: the problem of finding a k-dimensional orthogonal projection giving the best reconstruction.
PCA computation
Let $\hat{X}$ be the n × d data matrix and $C = \frac{1}{n}\hat{X}^T\hat{X}$. The PCA optimization problem is solved by the eigenvectors of C associated with the k largest eigenvalues.
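A minimal numpy sketch of this computation (assuming centered data; the variable names and toy data are mine):

```python
import numpy as np

# PCA as dictionary learning: the dictionary Psi collects the k eigenvectors
# of C = (1/n) X^T X with largest eigenvalues; Phi(x) = Psi^T x and the
# reconstruction is Psi Psi^T x.

def pca_dictionary(X, k):
    n = X.shape[0]
    C = (X.T @ X) / n
    eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    Psi = eigvecs[:, -k:][:, ::-1]               # top-k eigenvectors, d x k
    return Psi

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
X = X - X.mean(axis=0)                           # center the data
Psi = pca_dictionary(X, k=3)
X_hat = X @ Psi @ Psi.T                          # reconstructions Psi Psi^* x_i
print(np.mean(np.sum((X - X_hat) ** 2, axis=1))) # empirical reconstruction error
```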
Learning a linear representation with PCA
Subspace learning: the problem of finding a k-dimensional orthogonal projection giving the best reconstruction.
PCA assumes the support of the data distribution to be well approximated by a low dimensional linear subspace.
PCA beyond linearity
(Figures: data in X supported on a nonlinear set, not well described by a single low dimensional linear subspace.)
Kernel PCA
Consider a feature map $\varphi : X \to \mathcal{H}$ and the associated (reproducing) kernel
$$K(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}.$$
We can consider the empirical reconstruction in the feature space,
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \min_{\beta_i \in \mathcal{H}} \|\varphi(x_i) - \Psi\beta_i\|_{\mathcal{H}}^2.$$
Connection to manifold learning...
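A hedged sketch of one standard kernel PCA recipe (center the kernel matrix, eigendecompose, use the rescaled eigenvectors to obtain the representations); the Gaussian kernel and its width are placeholder choices:

```python
import numpy as np

# Kernel PCA sketch: the feature-space reconstruction is never formed
# explicitly; the representation of x_i is read off the eigenvectors
# of the centered kernel matrix K_ij = K(x_i, x_j).

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_pca(X, k, sigma=1.0):
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:k]          # top-k components
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas                           # n x k representations

# Toy nonlinear data: a noisy circle, a case where linear PCA struggles.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))
Z = kernel_pca(X, k=2, sigma=0.5)
print(Z.shape)
```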
Example 2: Sparse coding
One of the first and most famous dictionary learning techniques. It corresponds to
◮ $F = \mathbb{R}^p$, $p \ge d$,
◮ $F_\lambda = \{\beta \in F : \|\beta\|_1 \le \lambda\}$, $\lambda > 0$,
◮ $D = \{\Psi : F \to X \text{ linear} \mid \|\Psi e_j\| \le 1,\ j = 1, \ldots, p\}$.
Hence,
$$\underbrace{\min_{\Psi \in D}}_{\text{dictionary learning}} \frac{1}{n} \sum_{i=1}^{n} \underbrace{\min_{\beta_i \in F_\lambda} \|x_i - \Psi\beta_i\|^2}_{\text{sparse representation}}.$$
Sparse coding (cont.)
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \min_{\beta_i \in \mathbb{R}^p,\ \|\beta_i\|_1 \le \lambda} \|x_i - \Psi\beta_i\|^2$$
◮ The problem is not convex... but it is separately convex in the $\beta_i$'s and Ψ.
◮ An alternating minimization is fairly natural (other approaches are possible, see e.g. [Schnass '15, Elad et al. '06]).
Representation computation
Given a dictionary, the problems
$$\min_{\beta \in F_\lambda} \|x_i - \Psi\beta\|^2, \qquad i = 1, \ldots, n,$$
are convex and correspond to sparse representation problems. They can be solved using convex optimization techniques.
Splitting/proximal methods:
$$\beta^{t+1} = T_{\gamma\lambda}\big(\beta^t - \gamma\, \Psi^*(\Psi\beta^t - x_i)\big), \qquad \beta^0 \text{ given}, \quad t = 0, \ldots, T_{\max},$$
with $T_{\gamma\lambda}$ the soft-thresholding operator.
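A minimal sketch of this iteration for a single input x, written for the penalized form $\frac{1}{2}\|x - \Psi\beta\|^2 + \lambda\|\beta\|_1$ (which is what the soft-thresholding update corresponds to); the step size, penalty level, and toy data are placeholder choices:

```python
import numpy as np

# ISTA-style iteration for the sparse representation of a single x:
#   beta^{t+1} = T_{gamma*lam}( beta^t - gamma * Psi^T (Psi beta^t - x) ),
# with T the elementwise soft-thresholding operator.

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code(Psi, x, lam, n_iter=500):
    gamma = 1.0 / np.linalg.norm(Psi, 2) ** 2    # step size <= 1 / ||Psi||^2
    beta = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        grad = Psi.T @ (Psi @ beta - x)          # gradient of 0.5 * ||x - Psi beta||^2
        beta = soft_threshold(beta - gamma * grad, gamma * lam)
    return beta

rng = np.random.default_rng(0)
Psi = rng.standard_normal((20, 50))
Psi /= np.linalg.norm(Psi, axis=0)               # unit-norm atoms
beta_true = np.zeros(50)
beta_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
x = Psi @ beta_true + 0.01 * rng.standard_normal(20)
beta = sparse_code(Psi, x, lam=0.05)
print(np.nonzero(np.abs(beta) > 1e-3)[0])        # estimated (sparse) support
```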
Dictionary computation
Given $\Phi(x_i) = \beta_i$, $i = 1, \ldots, n$, we have
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi \circ \Phi(x_i)\|^2 = \min_{\Psi \in D} \frac{1}{n} \big\|\hat{X} - B\Psi^*\big\|_F^2,$$
where B is the n × p matrix with rows $\beta_i$, $i = 1, \ldots, n$, and $\|\cdot\|_F$ denotes the Frobenius norm. It is a convex problem, solvable via standard techniques.
Splitting/proximal methods:
$$\Psi^{t+1} = P\big(\Psi^t - \gamma_t (\Psi^t B^* - \hat{X}^*) B\big), \qquad \Psi^0 \text{ given}, \quad t = 0, \ldots, T_{\max},$$
where P is the projection corresponding to the constraints, applied to each column $\Psi^j$:
$$P(\Psi^j) = \Psi^j / \|\Psi^j\| \ \text{ if } \|\Psi^j\| > 1, \qquad P(\Psi^j) = \Psi^j \ \text{ if } \|\Psi^j\| \le 1.$$
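A sketch of the corresponding projected gradient update for the dictionary with the codes B held fixed; the step size and number of iterations are placeholder choices:

```python
import numpy as np

# Projected gradient step for the dictionary, codes B fixed:
# objective f(Psi) = (1/(2n)) * ||X - B Psi^T||_F^2, with gradient
# (1/n) * (Psi B^T - X^T) B, followed by column-wise projection onto
# the unit ball ||Psi_j|| <= 1.

def project_columns(Psi):
    norms = np.maximum(np.linalg.norm(Psi, axis=0), 1.0)
    return Psi / norms                            # rescale only columns with norm > 1

def dictionary_step(Psi, B, X, gamma):
    n = X.shape[0]
    grad = (Psi @ B.T - X.T) @ B / n
    return project_columns(Psi - gamma * grad)

rng = np.random.default_rng(0)
n, d, p = 100, 20, 30
X = rng.standard_normal((n, d))
B = rng.standard_normal((n, p))                   # fixed codes (placeholder)
Psi = rng.standard_normal((d, p))
for _ in range(200):
    Psi = dictionary_step(Psi, B, X, gamma=0.5 * n / np.linalg.norm(B, 2) ** 2)
print(np.linalg.norm(X - B @ Psi.T) ** 2 / n)     # empirical reconstruction error
```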
Sparse coding model
◮ Sparse coding assumes the support of the data distribution to be a union of subspaces, i.e. all possible $\binom{p}{s}$ s-dimensional subspaces in $\mathbb{R}^p$, where s is the sparsity level.
◮ More general penalties correspond to more general geometric assumptions.
Example 3: K-means & vector quantization
K-means is typically seen as a clustering algorithm in machine learning... but it is also a classical vector quantization approach. Here we revisit this point of view from a data representation perspective. K-means corresponds to
◮ $F_\lambda = F_k = \{e_1, \ldots, e_k\}$, the canonical basis in $\mathbb{R}^k$, $k \le n$,
◮ $D = \{\Psi : F \to X \mid \text{linear}\}$.
K-means computation
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \min_{\beta_i \in \{e_1, \ldots, e_k\}} \|x_i - \Psi\beta_i\|^2$$
The K-means problem is not convex.
Alternating minimization:
1. Initialize the dictionary $\Psi^0$.
2. Let $\Phi(x_i) = \beta_i$, $i = 1, \ldots, n$, be the solutions of the problems
$$\min_{\beta \in \{e_1, \ldots, e_k\}} \|x_i - \Psi\beta\|^2, \qquad i = 1, \ldots, n,$$
and set $V_j = \{x \in S \mid \Phi(x) = e_j\}$ (multiple points have the same representation since $k \le n$).
3. Letting $a_j = \Psi e_j$, we can write
$$\min_{\Psi \in D} \frac{1}{n} \sum_{i=1}^{n} \|x_i - \Psi \circ \Phi(x_i)\|^2 = \min_{a_1, \ldots, a_k \in \mathbb{R}^d} \frac{1}{n} \sum_{j=1}^{k} \sum_{x \in V_j} \|x - a_j\|^2.$$
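A minimal sketch of this alternating scheme (Lloyd-style iterations; initializing the centers by sampling data points is my own choice, and the toy data are placeholders):

```python
import numpy as np

# K-means as dictionary learning: codes are canonical basis vectors
# (cluster assignments), dictionary columns a_j are cluster centers.

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    A = X[rng.choice(len(X), size=k, replace=False)]         # k x d centers
    for _ in range(n_iter):
        # representation step: assign each x_i to its nearest center
        dists = np.sum((X[:, None, :] - A[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dists, axis=1)
        # dictionary step: a_j = mean of the points in V_j
        for j in range(k):
            if np.any(labels == j):
                A[j] = X[labels == j].mean(axis=0)
    return A, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((100, 2)) + c for c in ([0, 0], [5, 5], [0, 5])])
A, labels = kmeans(X, k=3)
print(A)                                                      # recovered centers
```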