Principal Component Analysis (PCA)
CE-717: Machine Learning
Sharif University of Technology, Spring 2016
Soleymani
Dimensionality Reduction: Feature Selection vs. Feature Extraction
- Feature selection: select a subset of the given feature set
  $[x_1, \dots, x_d]^T \rightarrow [x_{i_1}, \dots, x_{i_{d'}}]^T$
- Feature extraction: apply a linear or non-linear transform to the original feature space
  $[x_1, \dots, x_d]^T \rightarrow [x'_1, \dots, x'_{d'}]^T = f([x_1, \dots, x_d]^T)$, with $d' < d$
Feature Extraction
- Mapping of the original data to another space
- The criterion for feature extraction can differ based on the problem setting:
  - Unsupervised task: minimize the information loss (reconstruction error)
  - Supervised task: maximize the class discrimination in the projected space
- Feature extraction algorithms, linear methods:
  - Unsupervised: e.g., Principal Component Analysis (PCA)
  - Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)
Feature Extraction
- Unsupervised feature extraction: a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or only the transformed data:
  $X = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix} \rightarrow X' = \begin{bmatrix} x_1'^{(1)} & \cdots & x_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ x_1'^{(N)} & \cdots & x_{d'}'^{(N)} \end{bmatrix}$
- Supervised feature extraction: a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ found from the data $X$ together with the labels $\mathbf{y} = [y^{(1)}, \dots, y^{(N)}]^T$, or only the transformed data $X'$
Unsupervised Feature Reduction
- Visualization: projection of high-dimensional data onto 2D or 3D
- Data compression: efficient storage, communication, or retrieval
- Pre-processing: reduce the number of features to improve accuracy in supervised learning tasks
  - Helps avoid overfitting
- Noise removal
  - e.g., "noise" in images introduced by minor lighting variations, slightly different imaging conditions, etc.
Linear Transformation
- For a linear transformation, we find an explicit mapping $f(\mathbf{x}) = A^T \mathbf{x}$ that can also transform new data vectors.
- Original data $\mathbf{x} \in \mathbb{R}^d$, transformation matrix $A^T \in \mathbb{R}^{d' \times d}$, reduced data $\mathbf{x}' = A^T \mathbf{x} \in \mathbb{R}^{d'}$, with $d' < d$.
Linear Transformation
- Linear transformations are simple mappings: $\mathbf{x}' = A^T \mathbf{x}$, where
  $A = \begin{bmatrix} a_{11} & \cdots & a_{1d'} \\ \vdots & \ddots & \vdots \\ a_{d1} & \cdots & a_{dd'} \end{bmatrix} = [\mathbf{a}_1, \dots, \mathbf{a}_{d'}]$
- Component-wise: $x'_j = \mathbf{a}_j^T \mathbf{x}$ for $j = 1, \dots, d'$
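A minimal NumPy sketch of this linear transform (not part of the original slides; the matrix $A$ here is an arbitrary random projection rather than one learned by PCA):

```python
# Linear feature transform x' = A^T x, applied to a single vector and to a data matrix.
import numpy as np

rng = np.random.default_rng(0)
d, d_prime = 5, 2                          # original and reduced dimensions (illustrative)
A = rng.standard_normal((d, d_prime))      # columns a_1, ..., a_{d'}

x = rng.standard_normal(d)                 # a single data vector
x_new = A.T @ x                            # x'_j = a_j^T x for j = 1, ..., d'
print(x_new.shape)                         # (2,)

# The same mapping applied to a whole N x d data matrix: all rows transformed at once.
X = rng.standard_normal((100, d))
X_new = X @ A                              # shape (100, d')
```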
Linear Dimensionality Reduction
- Unsupervised:
  - Principal Component Analysis (PCA) [we will discuss]
  - Independent Component Analysis (ICA) [we will discuss]
  - Singular Value Decomposition (SVD)
  - Multidimensional Scaling (MDS)
  - Canonical Correlation Analysis (CCA)
Principal Component Analysis (PCA)
- Also known as the Karhunen-Loève (KL) transform
- Principal Components (PCs): orthogonal vectors ordered by the fraction of the total information (variation) in the corresponding directions
- Finds the directions along which the data approximately lie
- When the data is projected onto the first PC, the variance of the projected data is maximized
- PCA is an orthogonal projection of the data onto a subspace such that the variance of the projected data is maximized
Principal Component Analysis (PCA)
- The "best" linear subspace (i.e., the one with the least reconstruction error of the mean-centered data):
  - The axes are rotated to new (principal) axes such that principal axis 1 has the highest variance, ..., principal axis i has the i-th highest variance
  - The principal axes are uncorrelated: the covariance between each pair of principal axes is zero
- Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible
- PCs are found as the "best" eigenvectors of the covariance matrix of the data points
Principal Components
- If the data has a Gaussian distribution $N(\boldsymbol{\mu}, \Sigma)$, the direction of largest variance is the eigenvector of $\Sigma$ corresponding to the largest eigenvalue of $\Sigma$
- [Figure: Gaussian data cloud with principal directions $\mathbf{v}_1$ and $\mathbf{v}_2$]
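A small illustrative check of this statement, assuming a hand-picked 2D covariance (not from the slides): the empirical direction of largest variance of Gaussian samples matches the top eigenvector of $\Sigma$.

```python
# For Gaussian data N(mu, Sigma), the sample direction of largest variance
# agrees (up to sign) with the eigenvector of Sigma with the largest eigenvalue.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[3.0, 1.0],
                  [1.0, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=5000)

# Top eigenvector of the true covariance
eigvals, eigvecs = np.linalg.eigh(Sigma)
v1_true = eigvecs[:, np.argmax(eigvals)]

# Top eigenvector of the sample covariance
S = np.cov(X, rowvar=False)
w, V = np.linalg.eigh(S)
v1_est = V[:, np.argmax(w)]

print(abs(v1_true @ v1_est))   # close to 1.0: same direction up to sign
```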
PCA: Steps
- Input: $N \times d$ data matrix $X$ (each row contains a $d$-dimensional data point)
  1. $\boldsymbol{\mu} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}^{(i)}$
  2. $X \leftarrow$ subtract the mean $\boldsymbol{\mu}$ from each row of $X$
  3. $C = \frac{1}{N} X^T X$ (covariance matrix)
  4. Calculate the eigenvalues and eigenvectors of $C$
  5. Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $A = [\mathbf{v}_1, \dots, \mathbf{v}_{d'}]$
  6. $X' = XA$ (the first column of $X'$ is the first PC, ..., the $d'$-th column is the $d'$-th PC)
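A direct NumPy sketch of these steps, under the conventions of this slide; the function and variable names (`pca_fit`, `d_prime`) are my own and not from the course material:

```python
import numpy as np

def pca_fit(X, d_prime):
    """X: (N, d) data matrix; returns the mean, the projection matrix A (d, d'), and X'."""
    mu = X.mean(axis=0)                     # step 1: mean of the data points
    Xc = X - mu                             # step 2: subtract the mean from each row
    C = (Xc.T @ Xc) / X.shape[0]            # step 3: covariance matrix (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(C)    # step 4: eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues in decreasing order
    A = eigvecs[:, order[:d_prime]]         # step 5: top-d' eigenvectors as columns of A
    X_new = Xc @ A                          # step 6: project (first column = first PC scores)
    return mu, A, X_new

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
mu, A, X_new = pca_fit(X, d_prime=2)
print(A.shape, X_new.shape)                 # (5, 2) (200, 2)
```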
Covariance Matrix
- $\boldsymbol{\mu}_x = E[\mathbf{x}] = [E(x_1), \dots, E(x_d)]^T$
- $\Sigma = E\left[(\mathbf{x} - \boldsymbol{\mu}_x)(\mathbf{x} - \boldsymbol{\mu}_x)^T\right]$
- ML estimate of the covariance matrix from data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$:
  $\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})^T = \frac{1}{N} X^T X$,
  where $\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}^{(i)}$ and $X = \begin{bmatrix} (\mathbf{x}^{(1)} - \hat{\boldsymbol{\mu}})^T \\ \vdots \\ (\mathbf{x}^{(N)} - \hat{\boldsymbol{\mu}})^T \end{bmatrix}$ is the mean-centered data matrix
- From here on we assume the data are mean-centered: $\mathbf{x}$ in the later slides denotes the mean-removed $\mathbf{x} - \hat{\boldsymbol{\mu}}$
Correlation Matrix
- With mean-centered data $X = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$:
  $\frac{1}{N} X^T X = \frac{1}{N} \begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(N)} \\ \vdots & \ddots & \vdots \\ x_d^{(1)} & \cdots & x_d^{(N)} \end{bmatrix} \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix} = \frac{1}{N} \begin{bmatrix} \sum_{n=1}^{N} x_1^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_1^{(n)} x_d^{(n)} \\ \vdots & \ddots & \vdots \\ \sum_{n=1}^{N} x_d^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_d^{(n)} x_d^{(n)} \end{bmatrix}$
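A quick numeric check of this element-wise form (my own illustration, not from the slides): the $(j, k)$ entry of $\frac{1}{N} X^T X$ equals $\frac{1}{N} \sum_n x_j^{(n)} x_k^{(n)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.standard_normal((N, d))
X = X - X.mean(axis=0)                    # mean-centered, as assumed in the slides

S = (X.T @ X) / N                         # matrix form
S_elementwise = np.empty((d, d))
for j in range(d):
    for k in range(d):
        S_elementwise[j, k] = np.sum(X[:, j] * X[:, k]) / N

print(np.allclose(S, S_elementwise))      # True
```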
Two Interpretations
- Maximum variance subspace: PCA finds the vectors $\mathbf{a}$ such that projections onto them capture the maximum variance in the data:
  $\frac{1}{N} \sum_{n=1}^{N} (\mathbf{a}^T \mathbf{x}^{(n)})^2 = \frac{1}{N} \mathbf{a}^T X^T X \mathbf{a}$
- Minimum reconstruction error: PCA finds the vectors $\mathbf{a}$ such that projection onto them yields the minimum MSE reconstruction:
  $\frac{1}{N} \sum_{n=1}^{N} \left\| \mathbf{x}^{(n)} - (\mathbf{a}^T \mathbf{x}^{(n)}) \mathbf{a} \right\|^2$
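A numeric sketch of why the two views coincide, assuming a unit-norm direction and centered data (my own check, not from the slides): per sample, squared norm = captured variance + reconstruction error, so maximizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
X = X - X.mean(axis=0)

a = rng.standard_normal(4)
a /= np.linalg.norm(a)                          # unit-norm direction

proj = X @ a                                    # a^T x^(n) for each sample
variance = (proj ** 2).mean()                   # (1/N) a^T X^T X a
recon_err = ((X - np.outer(proj, a)) ** 2).sum(axis=1).mean()
total = (X ** 2).sum(axis=1).mean()             # (1/N) sum_n ||x^(n)||^2 (fixed)

print(np.isclose(recon_err, total - variance))  # True: Pythagoras per sample
```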
Least Squares Error Interpretation
- PCs are linear least squares fits to the samples, each orthogonal to the previous PCs:
  - The first PC is a minimum-distance fit to a vector in the original feature space
  - The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC
  - And so on
Example [figure]
Example [figure]
Least Squares Error and Maximum Variance Views Are Equivalent (1-D Interpretation)
- Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squared projections onto that line (Pythagoras):
  red$^2$ + blue$^2$ = green$^2$, where green is the length of the (mean-removed) data vector from the origin, blue is its projection onto the line, and red is its distance to the line
- green$^2$ is fixed $\Rightarrow$ maximizing blue$^2$ is equivalent to minimizing red$^2$
First PC
- The first PC is the direction of greatest variability in the data
- We will show that the first PC is the eigenvector of the covariance matrix corresponding to the maximum eigenvalue of this matrix
- If $\|\mathbf{a}\| = 1$, the projection of a $d$-dimensional $\mathbf{x}$ onto $\mathbf{a}$ is $\mathbf{a}^T \mathbf{x}$:
  $\|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \frac{\mathbf{a}^T \mathbf{x}}{\|\mathbf{a}\| \|\mathbf{x}\|} = \mathbf{a}^T \mathbf{x}$
First PC
- $\underset{\mathbf{a}}{\mathrm{argmax}} \; \frac{1}{N} \sum_{n=1}^{N} (\mathbf{a}^T \mathbf{x}^{(n)})^2 = \frac{1}{N} \mathbf{a}^T X^T X \mathbf{a}$ subject to $\mathbf{a}^T \mathbf{a} = 1$
- Setting the derivative of the Lagrangian to zero:
  $\frac{\partial}{\partial \mathbf{a}} \left[ \frac{1}{N} \mathbf{a}^T X^T X \mathbf{a} + \lambda (1 - \mathbf{a}^T \mathbf{a}) \right] = 0 \;\Rightarrow\; \frac{1}{N} X^T X \mathbf{a} = \lambda \mathbf{a}$
- So $\mathbf{a}$ is an eigenvector of the sample covariance matrix $\frac{1}{N} X^T X$
- The eigenvalue $\lambda$ is the amount of variance along that direction:
  Variance $= \frac{1}{N} \mathbf{a}^T X^T X \mathbf{a} = \mathbf{a}^T \lambda \mathbf{a} = \lambda$
- So, if we seek the direction with the largest variance, it is the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix
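A short illustrative check of this derivation on synthetic data (not from the slides): the top eigenvector's eigenvalue equals the variance of the data projected onto it, and no random unit direction captures more variance.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic data
X = X - X.mean(axis=0)

C = (X.T @ X) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
a1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
lam1 = eigvals[-1]

var_along_a1 = ((X @ a1) ** 2).mean()      # (1/N) a^T X^T X a
print(np.isclose(var_along_a1, lam1))      # True: the eigenvalue is the variance

# Random unit directions never beat the first PC (Rayleigh quotient bound)
for _ in range(5):
    r = rng.standard_normal(3)
    r /= np.linalg.norm(r)
    assert ((X @ r) ** 2).mean() <= lam1 + 1e-9
```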
PCA: Uncorrelated Features
- $\mathbf{x}' = A^T \mathbf{x}$
  $R_{x'} = E[\mathbf{x}' \mathbf{x}'^T] = E[A^T \mathbf{x} \mathbf{x}^T A] = A^T E[\mathbf{x} \mathbf{x}^T] A = A^T R_x A$
- If $A = [\mathbf{a}_1, \dots, \mathbf{a}_d]$ where $\mathbf{a}_1, \dots, \mathbf{a}_d$ are orthonormal eigenvectors of $R_x$:
  $R_{x'} = A^T R_x A = A^T A \Lambda A^T A = \Lambda$
  $\Rightarrow E[x'_i x'_j] = 0$ for all $i \neq j$, $i, j = 1, \dots, d$,
  so mutually uncorrelated features are obtained
- Completely uncorrelated features avoid information redundancies
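A small sketch (my own) verifying this on synthetic data: projecting onto all orthonormal eigenvectors of the sample covariance yields features whose covariance matrix is diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 4)) @ rng.standard_normal((4, 4))   # correlated features
X = X - X.mean(axis=0)

C = (X.T @ X) / X.shape[0]
eigvals, A = np.linalg.eigh(C)             # columns of A: orthonormal eigenvectors

X_new = X @ A                              # x' = A^T x applied row-wise
C_new = (X_new.T @ X_new) / X.shape[0]     # covariance of the transformed features

off_diag = C_new - np.diag(np.diag(C_new))
print(np.allclose(off_diag, 0, atol=1e-10))   # True: features are uncorrelated
print(np.allclose(np.diag(C_new), eigvals))   # True: variances are the eigenvalues
```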
PCA Derivation: Mean Square Error Approximation
- Incorporating all eigenvectors in $A = [\mathbf{a}_1, \dots, \mathbf{a}_d]$:
  $\mathbf{x}' = A^T \mathbf{x} \;\Rightarrow\; A\mathbf{x}' = A A^T \mathbf{x} = \mathbf{x}$
- So if $d' = d$, then $\mathbf{x}$ can be reconstructed exactly from $\mathbf{x}'$
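A quick illustrative check of the exact-reconstruction claim (not from the slides): with all $d$ orthonormal eigenvectors, $A A^T = I$, so mapping back recovers the data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
X = X - X.mean(axis=0)

C = (X.T @ X) / X.shape[0]
_, A = np.linalg.eigh(C)                   # full set of d orthonormal eigenvectors

X_new = X @ A                              # x' = A^T x
X_back = X_new @ A.T                       # A x' = A A^T x
print(np.allclose(X_back, X))              # True: exact reconstruction when d' = d
```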