PCA & ICA
CE-717: Machine Learning
Sharif University of Technology
Spring 2018
Soleymani
Dimensionality Reduction: Feature Selection vs. Feature Extraction
- Feature selection: select a subset of a given feature set,
  $[y_1, \dots, y_d]^T \mapsto [y_{i_1}, \dots, y_{i_{d'}}]^T$
- Feature extraction: a linear or non-linear transform on the original feature space,
  $[y_1, \dots, y_d]^T \mapsto \mathbf{z} = f([y_1, \dots, y_d]^T) \in \mathbb{R}^{d'}$
- In both cases $d' < d$.
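To make the distinction concrete, here is a minimal NumPy sketch (not from the slides); the particular columns kept and the mixing matrix W are arbitrary, hypothetical choices for illustration.

    import numpy as np

    X = np.arange(12, dtype=float).reshape(4, 3)   # 4 samples, d = 3 original features

    # Feature selection: keep a subset of the original columns (here features 0 and 2)
    X_selected = X[:, [0, 2]]                      # the kept features are unchanged

    # Feature extraction: d' = 2 new features, each a linear mix of all d old ones
    W = np.array([[ 0.5,  0.5, 0.0],               # hypothetical linear map z = W y
                  [-0.5,  0.0, 0.5]])
    X_extracted = X @ W.T                          # shape (4, 2)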
Feature Extraction
- Mapping of the original data to another space.
- The criterion for feature extraction differs based on the problem setting:
  - Unsupervised task: minimize the information loss (reconstruction error).
  - Supervised task: maximize the class discrimination in the projected space.
- Feature extraction algorithms:
  - Linear methods
    - Unsupervised: e.g., Principal Component Analysis (PCA)
    - Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)
  - Non-linear methods
    - Supervised: e.g., MLP neural networks
    - Unsupervised: e.g., autoencoders
Feature Extraction
- Unsupervised feature extraction: find a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data. Input: the data matrix
  $\mathbf{X} = \begin{bmatrix} y_1^{(1)} & \cdots & y_d^{(1)} \\ \vdots & \ddots & \vdots \\ y_1^{(N)} & \cdots & y_d^{(N)} \end{bmatrix}$;
  output: the mapping $f$, or only the transformed data
  $\mathbf{X}' = \begin{bmatrix} y_1'^{(1)} & \cdots & y_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ y_1'^{(N)} & \cdots & y_{d'}'^{(N)} \end{bmatrix}$.
- Supervised feature extraction: find a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data $\mathbf{X}'$, given the data matrix $\mathbf{X}$ together with the targets $\mathbf{z} = [z^{(1)}, \dots, z^{(N)}]^T$.
Unsupervised Feature Reduction
- Visualization and interpretation: projection of high-dimensional data onto 2D or 3D.
- Data compression: efficient storage, communication, and retrieval.
- Pre-processing: improve accuracy by reducing the number of features.
  - As a preprocessing step to reduce dimensions for supervised learning tasks.
  - Helps avoid overfitting.
- Noise removal
  - E.g., "noise" in images introduced by minor lighting variations or slightly different imaging conditions.
Linear Transformation
- For a linear transformation, we find an explicit mapping $f(\mathbf{x}) = \mathbf{A}^T\mathbf{x}$ that can also transform new data vectors.
- Original data: $\mathbf{x} \in \mathbb{R}^d$; transformation: $\mathbf{A}^T \in \mathbb{R}^{d' \times d}$; reduced data: $\mathbf{x}' = \mathbf{A}^T\mathbf{x} \in \mathbb{R}^{d'}$, with $d' < d$.
Linear Transformation
- Linear transformations are simple mappings:
  $\mathbf{x}' = \mathbf{A}^T\mathbf{x}$, i.e., $x_j' = \mathbf{a}_j^T\mathbf{x}$ for $j = 1, \dots, d'$
- $\mathbf{A} = [\mathbf{a}_1, \dots, \mathbf{a}_{d'}] = \begin{bmatrix} a_{11} & \cdots & a_{1d'} \\ \vdots & \ddots & \vdots \\ a_{d1} & \cdots & a_{dd'} \end{bmatrix}$ is a $d \times d'$ matrix whose columns define the mapping.
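A minimal NumPy sketch of applying such a mapping to a whole data set (not from the slides; the matrix A below is random purely to illustrate the shapes -- PCA, discussed next, instead chooses its columns as eigenvectors of the covariance matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, d_prime = 100, 5, 2            # d' < d

    X = rng.normal(size=(N, d))          # rows are data points x^(i) in R^d
    A = rng.normal(size=(d, d_prime))    # columns a_1, ..., a_d' define the mapping

    # x' = A^T x for each point; with data points as rows this is X' = X A
    X_prime = X @ A
    print(X_prime.shape)                 # (100, 2)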
Linear Dimensionality Reduction
- Unsupervised
  - Principal Component Analysis (PCA)
  - Independent Component Analysis (ICA)
  - Singular Value Decomposition (SVD)
  - Multidimensional Scaling (MDS)
  - Canonical Correlation Analysis (CCA)
  - ...
Principal Component Analysis (PCA)
- Also known as the Karhunen-Loève (KL) transform.
- Principal components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions.
- Find the directions along which the data approximately lie.
- When the data is projected onto the first PC, the variance of the projected data is maximized.
Principal Component Analysis (PCA)
- The "best" linear subspace (i.e., the one providing the least reconstruction error of the data):
  - Find the mean-reduced (centered) data.
  - The axes are rotated to new (principal) axes such that:
    - Principal axis 1 has the highest variance, ..., principal axis i has the i-th highest variance.
    - The principal axes are uncorrelated: the covariance between each pair of principal axes is zero.
- Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible.
- PCs can be found as the "best" eigenvectors of the covariance matrix of the data points.
Principal components
- If the data has a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the direction of the largest variance can be found as the eigenvector of $\boldsymbol{\Sigma}$ that corresponds to the largest eigenvalue of $\boldsymbol{\Sigma}$.
  [Figure: data cloud with principal directions $\mathbf{u}_1$ and $\mathbf{u}_2$.]
Example: random direction
Example: principal component
Covariance Matrix
- $\boldsymbol{\mu} = E[\mathbf{x}] = \begin{bmatrix} E[x_1] \\ \vdots \\ E[x_d] \end{bmatrix}$
- $\boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T]$
- ML estimate of the covariance matrix from data points $\{\mathbf{x}^{(i)}\}_{i=1}^N$:
  $\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^N (\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})^T = \frac{1}{N}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$
  where $\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}$ and $\tilde{\mathbf{X}} = \begin{bmatrix} (\mathbf{x}^{(1)} - \hat{\boldsymbol{\mu}})^T \\ \vdots \\ (\mathbf{x}^{(N)} - \hat{\boldsymbol{\mu}})^T \end{bmatrix}$ is the mean-centered data matrix.
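A quick NumPy check (not from the slides) that the ML estimate written as an outer-product sum matches the matrix form $\frac{1}{N}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$; note both divide by N, unlike the default unbiased estimator:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))        # N = 200 data points in R^3

    mu_hat = X.mean(axis=0)              # sample mean
    X_tilde = X - mu_hat                 # mean-centered data matrix

    # ML estimate: (1/N) * X_tilde^T X_tilde
    Sigma_hat = X_tilde.T @ X_tilde / X.shape[0]

    # Same quantity via the outer-product sum and via np.cov with bias=True
    Sigma_sum = sum(np.outer(x - mu_hat, x - mu_hat) for x in X) / X.shape[0]
    assert np.allclose(Sigma_hat, Sigma_sum)
    assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))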
PCA: Steps
- Input: $N \times d$ data matrix $\mathbf{X}$ (each row contains a $d$-dimensional data point).
- Compute the mean $\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}$ and subtract it from the rows of $\mathbf{X}$.
- $\boldsymbol{\Sigma} = \frac{1}{N}\mathbf{X}^T\mathbf{X}$ (covariance matrix of the centered data).
- Calculate the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$.
- Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_{d'}]$ (first PC, ..., $d'$-th PC).
- $\mathbf{X}' = \mathbf{X}\mathbf{A}$.
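A minimal NumPy sketch of these steps (the function name and the d_prime argument are placeholders chosen here for illustration):

    import numpy as np

    def pca(X, d_prime):
        """PCA as in the steps above: center, covariance, eigendecompose, project."""
        N = X.shape[0]
        x_bar = X.mean(axis=0)                    # (1/N) * sum_i x^(i)
        X_c = X - x_bar                           # subtract the mean from the rows
        Sigma = X_c.T @ X_c / N                   # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
        order = np.argsort(eigvals)[::-1]         # largest eigenvalues first
        A = eigvecs[:, order[:d_prime]]           # columns = top-d' eigenvectors (PCs)
        return X_c @ A, A, eigvals[order]         # projected data, PCs, variances

    # usage
    X = np.random.default_rng(2).normal(size=(100, 5))
    X_prime, A, lam = pca(X, d_prime=2)
    print(X_prime.shape, A.shape)                 # (100, 2) (5, 2)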
Find principal components
- Assume that the data is centered.
- Find the vector $\mathbf{v}$ that maximizes the sample variance of the projected data:
  $\arg\max_{\mathbf{v}} \frac{1}{N}\sum_{j=1}^N (\mathbf{v}^T\mathbf{x}^{(j)})^2 = \frac{1}{N}\mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v}$ subject to $\mathbf{v}^T\mathbf{v} = 1$
- Lagrangian: $L(\mathbf{v}, \lambda) = \mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v} - \lambda(\mathbf{v}^T\mathbf{v} - 1)$
  $\frac{\partial L}{\partial \mathbf{v}} = 0 \;\Rightarrow\; 2\mathbf{X}^T\mathbf{X}\mathbf{v} - 2\lambda\mathbf{v} = 0 \;\Rightarrow\; \mathbf{X}^T\mathbf{X}\mathbf{v} = \lambda\mathbf{v}$
Find principal components
- For symmetric matrices, there exist eigenvectors that are orthogonal.
- Let $\mathbf{v}_1, \dots, \mathbf{v}_d$ denote the eigenvectors of $\mathbf{X}^T\mathbf{X}$, chosen such that:
  $\mathbf{v}_i^T\mathbf{v}_j = 0 \;\; \forall i \neq j$, and $\mathbf{v}_i^T\mathbf{v}_i = 1 \;\; \forall i$
Find principal components
- $\mathbf{X}^T\mathbf{X}\mathbf{v} = \lambda\mathbf{v} \;\Rightarrow\; \mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v} = \lambda\mathbf{v}^T\mathbf{v} = \lambda$
- $\lambda$ denotes the amount of variance along the found dimension $\mathbf{v}$ (called the energy along that dimension).
- Eigenvalues: $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots$
- The first PC $\mathbf{v}_1$ is the eigenvector of the sample covariance matrix $\mathbf{X}^T\mathbf{X}$ associated with the largest eigenvalue.
- The 2nd PC $\mathbf{v}_2$ is the eigenvector of the sample covariance matrix $\mathbf{X}^T\mathbf{X}$ associated with the second largest eigenvalue.
- And so on...
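The slides obtain the PCs from a full eigendecomposition; as a sketch of a common alternative (not from the slides), the leading eigenvector alone can be found by power iteration on X^T X, assuming centered data:

    import numpy as np

    def first_pc_power_iteration(X, n_iter=200):
        """Leading eigenvector of X^T X via power iteration (X assumed centered)."""
        S = X.T @ X
        v = np.random.default_rng(3).normal(size=X.shape[1])
        for _ in range(n_iter):
            v = S @ v
            v /= np.linalg.norm(v)        # keep v^T v = 1
        lam = v @ S @ v                   # Rayleigh quotient = energy along v
        return v, lam

    X = np.random.default_rng(4).normal(size=(500, 4)) * np.array([3.0, 1.0, 0.5, 0.2])
    X -= X.mean(axis=0)
    v1, lam1 = first_pc_power_iteration(X)
    w, V = np.linalg.eigh(X.T @ X)        # reference: full eigendecomposition
    assert np.isclose(abs(v1 @ V[:, -1]), 1.0)   # same direction up to sign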
Another Interpretation: Least Squares Error
- PCs are linear least squares fits to the samples, each orthogonal to the previous PCs:
  - The first PC is a minimum-distance fit to a vector in the original feature space.
  - The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.
  - ...
Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)
- When the data are mean-removed:
  - Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras).
  [Figure: a data point, its projection onto a line through the origin, and the perpendicular residual.]
Two interpretations
- Maximum variance subspace:
  $\arg\max_{\mathbf{v}} \sum_{j=1}^N (\mathbf{v}^T\mathbf{x}^{(j)})^2 = \mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v}$
- Minimum reconstruction error:
  $\arg\min_{\mathbf{v}} \sum_{j=1}^N \left\| \mathbf{x}^{(j)} - (\mathbf{v}^T\mathbf{x}^{(j)})\mathbf{v} \right\|^2$
- For each data point $\mathbf{x}$ (see figure): blue^2 + red^2 = green^2, where green is the vector from the origin to $\mathbf{x}$, red is its projection $(\mathbf{v}^T\mathbf{x})\mathbf{v}$ onto the line, and blue is the residual. Since green^2 is fixed (it only depends on the data), maximizing red^2 is equivalent to minimizing blue^2.
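A small numerical check of this equivalence (a sketch, not from the slides): for any unit vector v and mean-removed data, the projected energy plus the reconstruction error equals the total energy, and the top eigenvector wins on both criteria simultaneously.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))
    X -= X.mean(axis=0)                              # mean-removed data

    def proj_var(v):  return np.sum((X @ v) ** 2)                    # sum of red^2
    def recon_err(v): return np.sum((X - np.outer(X @ v, v)) ** 2)   # sum of blue^2

    v = rng.normal(size=3); v /= np.linalg.norm(v)   # an arbitrary unit vector
    total = np.sum(X ** 2)                           # sum of green^2 (fixed)
    assert np.isclose(proj_var(v) + recon_err(v), total)

    # The eigenvector with the largest eigenvalue maximizes proj_var and minimizes recon_err
    w, V = np.linalg.eigh(X.T @ X)
    v1 = V[:, -1]
    assert proj_var(v1) >= proj_var(v) and recon_err(v1) <= recon_err(v)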
PCA: Uncorrelated Features
- $\mathbf{x}' = \mathbf{A}^T\mathbf{x}$
  $\boldsymbol{\Sigma}_{\mathbf{x}'} = E[\mathbf{x}'\mathbf{x}'^T] = E[\mathbf{A}^T\mathbf{x}\mathbf{x}^T\mathbf{A}] = \mathbf{A}^T E[\mathbf{x}\mathbf{x}^T]\mathbf{A} = \mathbf{A}^T\boldsymbol{\Sigma}_{\mathbf{x}}\mathbf{A}$
- If $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_d]$ where $\mathbf{v}_1, \dots, \mathbf{v}_d$ are orthonormal eigenvectors of $\boldsymbol{\Sigma}_{\mathbf{x}}$:
  $\boldsymbol{\Sigma}_{\mathbf{x}'} = \mathbf{A}^T\boldsymbol{\Sigma}_{\mathbf{x}}\mathbf{A} = \mathbf{A}^T\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^T\mathbf{A} = \boldsymbol{\Lambda}$
  $\Rightarrow E[x_i' x_j'] = 0 \;\; \forall i \neq j, \quad i, j = 1, \dots, d'$
- Then mutually uncorrelated features are obtained.
- Completely uncorrelated features avoid information redundancies.
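A short NumPy check (a sketch, not from the slides) that projecting centered data onto the orthonormal eigenvectors of its covariance yields features whose sample covariance is diagonal, i.e., uncorrelated:

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))   # correlated features
    X -= X.mean(axis=0)

    Sigma_x = X.T @ X / X.shape[0]
    _, A = np.linalg.eigh(Sigma_x)        # columns: orthonormal eigenvectors of Sigma_x

    X_prime = X @ A                       # x' = A^T x for every row
    Sigma_xp = X_prime.T @ X_prime / X.shape[0]

    # Off-diagonal entries of the projected covariance are (numerically) zero
    off_diag = Sigma_xp - np.diag(np.diag(Sigma_xp))
    assert np.allclose(off_diag, 0, atol=1e-10)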
Reconstruction
- With $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_{d'}]$:
  $\mathbf{x}' = \mathbf{A}^T\mathbf{x} = \begin{bmatrix} \mathbf{v}_1^T\mathbf{x} \\ \vdots \\ \mathbf{v}_{d'}^T\mathbf{x} \end{bmatrix}$
- Incorporating all eigenvectors in $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_d]$:
  $\mathbf{A}\mathbf{x}' = \mathbf{A}\mathbf{A}^T\mathbf{x} = \mathbf{x} \;\Rightarrow\; \mathbf{x} = \mathbf{A}\mathbf{x}'$
  $\Rightarrow$ If $d' = d$, then $\mathbf{x}$ can be reconstructed exactly from $\mathbf{x}'$.
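A brief sketch (not from the slides) contrasting exact reconstruction with all d eigenvectors against the approximation left when only d' < d are kept:

    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.normal(size=(500, 5)) * np.array([3.0, 2.0, 1.0, 0.3, 0.1])
    X -= X.mean(axis=0)

    _, V = np.linalg.eigh(X.T @ X / X.shape[0])
    V = V[:, ::-1]                           # columns ordered by decreasing eigenvalue

    # d' = d: A A^T = I, so reconstruction is exact
    assert np.allclose((X @ V) @ V.T, X)

    # d' = 2 < d: x_hat = A x' only approximates x; some error remains
    A2 = V[:, :2]
    X_hat = (X @ A2) @ A2.T
    err = np.sum((X - X_hat) ** 2) / np.sum(X ** 2)
    print(f"relative reconstruction error with d'=2: {err:.4f}")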
PCA Derivation: Relation between Eigenvalues and Variances
- The $j$-th largest eigenvalue of $\boldsymbol{\Sigma}_{\mathbf{x}}$ is the variance along the $j$-th PC:
  $\mathrm{var}(x_j') = \mathbf{v}_j^T\boldsymbol{\Sigma}_{\mathbf{x}}\mathbf{v}_j = \lambda_j$