Machine Learning 2 DS 4420 - Spring 2020 Dimensionality reduction I Byron C Wallace
Some slides today borrow from Percy Liang (Stanford); other material is from the MML book (Deisenroth, Faisal, and Ong).
Motivation • We often want to work with high-dimensional data (e.g., images), and we often have lots of it. • Such data is computationally expensive to store and work with.
Dimensionality Reduction. Fundamental idea: exploit redundancy in the data; find a lower-dimensional representation. [Figure: two scatter plots over axes $x_1$ and $x_2$, illustrating 2-D data with redundant structure and its lower-dimensional projection.]
Example (from lecture 5): dimensionality reduction via k-means. This highlights the natural connection between dimensionality reduction and compression.
Dimensionality reduction. Goal: map high-dimensional data onto lower-dimensional data in a manner that preserves distances/similarities. [Figure: original data (4 dims) vs. projection with PCA (2 dims).] Objective: the projection should "preserve" relative distances.
Linear dimensionality reduction. Idea: project a high-dimensional vector onto a lower-dimensional space, e.g. $x \in \mathbb{R}^{361}$ mapped to $z = U^\top x \in \mathbb{R}^{10}$.
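As a concrete (hypothetical) illustration of this projection, the numpy sketch below maps 361-dimensional vectors down to 10 dimensions; the basis U here is just a random orthonormal matrix, whereas PCA would choose it to maximize the captured variance.

```python
import numpy as np

d, k = 361, 10                                 # original and reduced dimensionality
rng = np.random.default_rng(0)

# Placeholder basis: in PCA, the columns of U would be the top-k principal directions.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # d x k with orthonormal columns

x = rng.normal(size=d)                         # a high-dimensional vector, x in R^361
z = U.T @ x                                    # compressed representation, z in R^10
x_tilde = U @ z                                # reconstruction back in R^361
```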
Linear dimensionality reduction. [Diagram: original $x \in \mathbb{R}^D$ is compressed to $z \in \mathbb{R}^M$ and reconstructed as $\tilde{x} \in \mathbb{R}^D$.]
Objective. Key intuition: $\underbrace{\text{variance of data}}_{\text{fixed}} = \underbrace{\text{captured variance}}_{\text{want large}} + \underbrace{\text{reconstruction error}}_{\text{want small}}$
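This identity is easy to verify numerically. The following sketch (not from the slides) centers some synthetic data, projects it onto an arbitrary orthonormal basis, and checks that total variance equals captured variance plus reconstruction error; PCA simply picks the basis that makes the captured term as large as possible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))                  # d x n data
X -= X.mean(axis=1, keepdims=True)             # center the data

# Any orthonormal U satisfies the identity; PCA chooses the one maximizing captured variance.
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))
Z = U.T @ X                                    # projections
X_rec = U @ Z                                  # reconstructions

total_var = np.sum(X ** 2) / X.shape[1]
captured  = np.sum(Z ** 2) / X.shape[1]
recon_err = np.sum((X - X_rec) ** 2) / X.shape[1]
print(np.isclose(total_var, captured + recon_err))   # True
```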
Principal Component Analysis (on board)
In Sum: Principal Component Analysis
Data $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$; orthonormal basis $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$, whose columns are eigenvectors of the covariance matrix; the eigendecomposition gives $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$.
Idea: take the top-$k$ eigenvectors to maximize captured variance.
Getting the eigenvalues, two ways:
• Direct eigenvalue decomposition of the covariance matrix $S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top = \frac{1}{N} X X^\top$
• Singular Value Decomposition (SVD)
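A minimal numpy sketch of the first route, on made-up data: form the covariance matrix, eigendecompose it, and keep the top-k eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 100))                 # d x n data
X -= X.mean(axis=1, keepdims=True)            # center the data first
N = X.shape[1]

S = (X @ X.T) / N                             # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)          # eigh, since S is symmetric
order = np.argsort(eigvals)[::-1]             # eigh returns ascending order; flip it
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
U_k = eigvecs[:, :k]                          # top-k principal directions
Z = U_k.T @ X                                 # k x n compressed data
```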
Singular Value Decomposition
Idea: decompose the $d \times n$ matrix $X$ into
1. an $n \times n$ basis $V$ (unitary matrix),
2. a $d \times n$ matrix $\Sigma$ (diagonal), and
3. a $d \times d$ basis $U$ (unitary matrix),
so that $X = \underbrace{U}_{d \times d}\, \underbrace{\Sigma}_{d \times n}\, \underbrace{V^\top}_{n \times n}$.
SVD for PCA
$\underbrace{X}_{D \times N} = \underbrace{U}_{D \times D}\, \underbrace{\Sigma}_{D \times N}\, \underbrace{V^\top}_{N \times N}$, so
$S = \frac{1}{N} X X^\top = \frac{1}{N} U \Sigma \underbrace{V^\top V}_{= I} \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top$
It turns out the columns of $U$ are the eigenvectors of $X X^\top$ (with eigenvalues $\sigma_i^2 / N$ for $S$).
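The sketch below (synthetic data, not from the slides) checks this equivalence numerically: the eigenvalues of S equal the squared singular values divided by N, and the left singular vectors of X match the eigenvectors of S up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 100))                        # d x n data
X -= X.mean(axis=1, keepdims=True)
N = X.shape[1]

# Route 1: eigendecomposition of the covariance.
S = (X @ X.T) / N
evals, evecs = np.linalg.eigh(S)
evals, evecs = evals[::-1], evecs[:, ::-1]           # descending order

# Route 2: SVD of the data matrix itself.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(evals, sigma ** 2 / N))            # eigenvalues = sigma^2 / N
# Columns agree up to sign, so |U^T evecs| is (approximately) the identity.
print(np.allclose(np.abs(U.T @ evecs), np.eye(4), atol=1e-6))
```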
Principal Component Analysis. [Figure: Example 10.3 from the MML book, an embedding of the MNIST digits.]
Principal Component Analysis. [Figure: projections onto the top 2 components vs. the bottom 2 components.] Data: three varieties of wheat (Kama, Rosa, Canadian). Attributes: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, length of groove.
Eigen-faces [Turk & Pentland 1991]
• $d$ = number of pixels
• Each $x_i \in \mathbb{R}^d$ is a face image
• $x_{ji}$ = intensity of the $j$-th pixel in image $i$
Factorize $X \in \mathbb{R}^{d \times n}$ as $X \approx U Z$ with $U \in \mathbb{R}^{d \times k}$ and $Z = (z_1 \cdots z_n) \in \mathbb{R}^{k \times n}$.
Idea: $z_i$ is a more "meaningful" representation of the $i$-th face than $x_i$; we can use $z_i$ for nearest-neighbor classification.
Much faster: $O(dk + nk)$ time instead of $O(dn)$ when $n, d \gg k$.
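A hedged sketch of this pipeline using scikit-learn; the Olivetti faces dataset and the choice of k = 50 are illustrative stand-ins, not details from the original paper. Note that scikit-learn stores data as n x d (samples in rows), the transpose of the slides' convention.

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

faces = fetch_olivetti_faces()                  # 400 images, 64x64 = 4096 pixels each
X, y = faces.data, faces.target                 # X is n x d (sklearn convention)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pca = PCA(n_components=50)                      # k = 50 "eigenfaces"
Z_tr = pca.fit_transform(X_tr)                  # n x k codes
Z_te = pca.transform(X_te)

knn = KNeighborsClassifier(n_neighbors=1).fit(Z_tr, y_tr)
print(knn.score(Z_te, y_te))                    # nearest-neighbor accuracy in eigenface space
```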
Aside: How many components?
• The magnitude of the eigenvalues indicates the fraction of variance captured.
• [Plot: eigenvalues $\lambda_i$ versus $i$ for a face image dataset, falling from roughly 1353 to roughly 287 over the first several components.]
• Eigenvalues typically drop off sharply, so we don't need that many components.
• Of course, variance isn't everything...
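One common recipe, sketched below on synthetic data, is to look at the cumulative explained variance ratio and keep the smallest k that captures some target fraction (the 95% threshold here is just an illustrative choice).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with a few dominant directions (a stand-in for a real dataset).
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(500, 50))

pca = PCA().fit(X)                              # keep all components
explained = pca.explained_variance_ratio_       # fraction of variance per component
cumulative = np.cumsum(explained)
k = int(np.searchsorted(cumulative, 0.95)) + 1  # smallest k capturing ~95% of the variance
print(k, explained[:10])
```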
Latent Semantic Analysis [Deerwester et al. 1990]
• $d$ = number of words in the vocabulary
• Each $x_i \in \mathbb{R}^d$ is a vector of word counts
• $x_{ji}$ = frequency of word $j$ in document $i$
Factorize $X \in \mathbb{R}^{d \times n}$ (rows indexed by words such as game, stocks, chairman, the, ..., wins; columns are documents of raw counts) as $X \approx U Z$ with $U \in \mathbb{R}^{d \times k}$ and $Z = (z_1 \cdots z_n) \in \mathbb{R}^{k \times n}$ holding real-valued loadings.
How should we measure similarity between two documents? $z_1^\top z_2$ is probably better than $x_1^\top x_2$.
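A small scikit-learn sketch of this idea on a made-up toy corpus; TruncatedSVD is a standard way to compute the LSA factorization on sparse count matrices, and the two-component choice is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (made up for illustration).
docs = [
    "the team wins the game",
    "stocks fall as the chairman resigns",
    "the chairman buys stocks",
    "a close game, but the home team wins",
]

counts = CountVectorizer().fit_transform(docs)      # n x d document-term counts
lsa = TruncatedSVD(n_components=2, random_state=0)  # k = 2 latent dimensions
Z = lsa.fit_transform(counts)                       # n x k document embeddings

# Compare documents in latent space rather than raw count space.
print(cosine_similarity(Z))
```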
Probabilistic PCA • If we define a prior over $z$, then we can sample from the latent space and "hallucinate" (generate) new images.
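A sketch of that generative story under the usual probabilistic PCA model $z \sim \mathcal{N}(0, I)$, $x = Wz + \mu + \varepsilon$; the parameters W, mu, and sigma below are random placeholders rather than values fit to data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma = 64, 5, 0.1

# In practice W, mu, and sigma would be fit to data (e.g., by maximum likelihood);
# here they are random placeholders just to show the generative process.
W  = rng.normal(size=(d, k))
mu = rng.normal(size=d)

z = rng.standard_normal(k)                         # sample from the prior p(z) = N(0, I)
x = W @ z + mu + sigma * rng.standard_normal(d)    # "hallucinated" observation
```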
Limitations of linearity. [Figure: two 2-D datasets, one where PCA is effective and one where PCA is ineffective.]
Nonlinear PCA
[Figure: a "broken" linear solution (direction $u_1$) vs. the desired solution, a curve through the data.]
We want the desired solution: $S = \{(x_1, x_2) : x_2 = u_1 x_1^2\}$.
Idea: use kernels. We can get this as $S = \{\phi(x) = Uz\}$ with the feature map $\phi(x) = (x_1^2, x_2)^\top$:
linear dimensionality reduction in $\phi(x)$-space $\Leftrightarrow$ nonlinear dimensionality reduction in $x$-space.
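A quick numpy sketch (synthetic parabola data, with an assumed coefficient of 0.5) showing that ordinary linear PCA applied to $\phi(x) = (x_1^2, x_2)$ recovers the nonlinear structure: the top principal direction in feature space aligns with the relation $x_2 \approx 0.5\,x_1^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
x2 = 0.5 * x1 ** 2 + 0.05 * rng.normal(size=200)    # data near a parabola
X = np.stack([x1, x2])                              # 2 x n, nonlinear structure in x-space

Phi = np.stack([x1 ** 2, x2])                       # feature map phi(x) = (x_1^2, x_2)
Phi_c = Phi - Phi.mean(axis=1, keepdims=True)

# Ordinary (linear) PCA in phi-space: top eigenvector of the covariance.
S = Phi_c @ Phi_c.T / Phi_c.shape[1]
evals, evecs = np.linalg.eigh(S)
u1 = evecs[:, -1]                                   # top principal direction in phi-space
print(u1)   # roughly proportional to (1, 0.5), i.e., the relation x_2 ~ 0.5 * x_1^2
```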
Kernel PCA
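In practice we avoid computing $\phi(x)$ explicitly and work through a kernel instead. Below is a minimal scikit-learn sketch on the same synthetic parabola; the RBF kernel and gamma value are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
x2 = 0.5 * x1 ** 2 + 0.05 * rng.normal(size=200)
X = np.column_stack([x1, x2])                  # n x 2 (sklearn convention)

# Kernel PCA never forms phi(x) explicitly; it works through the kernel matrix.
kpca = KernelPCA(n_components=1, kernel="rbf", gamma=1.0)
Z = kpca.fit_transform(X)                      # 1-D nonlinear embedding of the parabola
print(Z[:5].ravel())
```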
Wrapping up
• PCA is a linear model for dimensionality reduction that finds a mapping to a lower-dimensional space which maximizes variance.
• We saw that this is equivalent to performing an eigendecomposition of the covariance matrix of X.
• Next time: auto-encoders and neural compression for non-linear projections.