PCA & ICA - CE-717: Machine Learning, Sharif University of Technology (slide transcript)
1. PCA & ICA
   CE-717: Machine Learning
   Sharif University of Technology, Spring 2018
   Soleymani

2. Dimensionality Reduction: Feature Selection vs. Feature Extraction
   - Feature selection
     - Select a subset of a given feature set: [y_1, ..., y_d]^T → [y_{i_1}, ..., y_{i_{d'}}]^T
   - Feature extraction
     - A linear or non-linear transform on the original feature space: [y_1, ..., y_d]^T → z = g(y), with z ∈ R^{d'}
   - In both cases d' < d.

3. Feature Extraction
   - Mapping of the original data to another space
   - The criterion for feature extraction can differ based on the problem setting:
     - Unsupervised task: minimize the information loss (reconstruction error)
     - Supervised task: maximize the class discrimination in the projected space
   - Feature extraction algorithms
     - Linear methods
       - Unsupervised: e.g., Principal Component Analysis (PCA)
       - Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)
     - Non-linear methods
       - Supervised: e.g., MLP neural networks
       - Unsupervised: e.g., autoencoders

4. Feature Extraction
   - Unsupervised feature extraction: a mapping g: R^d → R^{d'}
     - Uses only the N×d data matrix Y (rows y^(1), ..., y^(N))
     - Output: the transformed N×d' data matrix Y'
   - Supervised feature extraction: a mapping g: R^d → R^{d'}
     - Uses the data matrix Y together with the labels z = [z^(1), ..., z^(N)]^T
     - Output: the transformed N×d' data matrix Y'

5. Unsupervised Feature Reduction
   - Visualization and interpretation: projection of high-dimensional data onto 2D or 3D
   - Data compression: efficient storage, communication, or retrieval
   - Pre-processing: improve accuracy by reducing the number of features
     - As a preprocessing step to reduce dimensions for supervised learning tasks
     - Helps avoid overfitting
   - Noise removal
     - E.g., "noise" in images introduced by minor lighting variations or slightly different imaging conditions

6. Linear Transformation
   - For a linear transformation, we find an explicit mapping g(y) = B^T y that can also transform new data vectors:
     y' = B^T y,   y ∈ R^d (original data),   y' ∈ R^{d'} (reduced data),   B^T ∈ R^{d'×d},   d' < d

7. Linear Transformation
   - Linear transformations are simple mappings:
     x' = A^T x,   i.e.,   x'_k = a_k^T x,   k = 1, ..., d'
   - A = [a_1, ..., a_{d'}] ∈ R^{d×d'}; each column a_k defines one new feature.
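Not from the slides: a minimal NumPy sketch of the linear map y' = B^T y from the previous slide, applied both to a whole data matrix and to a new data vector. The data and the projection matrix B here are synthetic, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_reduced = 5, 2                      # original and reduced dimensions (d' < d)

Y = rng.normal(size=(100, d))            # N x d data matrix (rows are data points)
B = rng.normal(size=(d, d_reduced))      # d x d' matrix; columns define the new features

Y_reduced = Y @ B                        # y'^(i) = B^T y^(i) for every row at once
y_new = rng.normal(size=d)               # a previously unseen data vector
y_new_reduced = B.T @ y_new              # the same explicit mapping transforms new data

print(Y_reduced.shape)                   # (100, 2)
print(y_new_reduced.shape)               # (2,)
```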

8. Linear Dimensionality Reduction
   - Unsupervised
     - Principal Component Analysis (PCA)
     - Independent Component Analysis (ICA)
     - Singular Value Decomposition (SVD)
     - Multidimensional Scaling (MDS)
     - Canonical Correlation Analysis (CCA)
     - ...

9. Principal Component Analysis (PCA)
   - Also known as the Karhunen-Loève (KL) transform
   - Principal Components (PCs): orthogonal vectors, ordered by the fraction of the total information (variation) in the corresponding directions
   - Find the directions along which the data approximately lie
   - When the data are projected onto the first PC, the variance of the projected data is maximized

10. Principal Component Analysis (PCA)
    - The "best" linear subspace (i.e., the one providing the least reconstruction error of the data):
      - Work with the mean-centered data
      - The axes are rotated to new (principal) axes such that:
        - Principal axis 1 has the highest variance, ...
        - Principal axis i has the i-th highest variance
      - The principal axes are uncorrelated: the covariance between each pair of principal axes is zero
    - Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible
    - PCs can be found as the "best" eigenvectors of the covariance matrix of the data points

11. Principal Components
    - If the data have a Gaussian distribution N(μ, Σ), the direction of largest variance is given by the eigenvector of Σ that corresponds to the largest eigenvalue of Σ.
    - [Figure: a Gaussian point cloud with principal directions w_1 and w_2.]
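Not part of the slides: a small NumPy check of this claim on a synthetic 2-D Gaussian with a known covariance Σ. The top eigenvector of Σ should match the direction of largest sample variance, up to sign.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])                 # known covariance of the Gaussian
mu = np.array([2.0, -1.0])

Y = rng.multivariate_normal(mu, Sigma, size=20000)

# Eigenvector of Sigma with the largest eigenvalue (eigh returns ascending order).
vals, vecs = np.linalg.eigh(Sigma)
w1 = vecs[:, -1]

# Empirical direction of largest variance, from the sample covariance.
S_hat = np.cov(Y, rowvar=False)
vals_hat, vecs_hat = np.linalg.eigh(S_hat)
w1_hat = vecs_hat[:, -1]

# The two directions agree up to sign.
print(abs(w1 @ w1_hat))                        # close to 1.0
```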

12. Example: random direction
    - [Figure: data projected onto a random direction.]

13. Example: principal component
    - [Figure: the same data projected onto the first principal component.]

14. Covariance Matrix
    - Mean vector: μ_y = [μ_1, ..., μ_d]^T = [E(y_1), ..., E(y_d)]^T
    - Covariance matrix: Σ = E[(y − μ_y)(y − μ_y)^T]
    - ML estimate of the covariance matrix from the data points {y^(i)}_{i=1}^N:
      Σ̂ = (1/N) Σ_{i=1}^N (y^(i) − μ)(y^(i) − μ)^T = (1/N) Ỹ^T Ỹ
      where μ = (1/N) Σ_{i=1}^N y^(i) and Ỹ is the mean-centered data matrix whose i-th row is (y^(i) − μ)^T.
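Not in the slides: a short NumPy sketch of the ML estimate Σ̂ = (1/N) Ỹ^T Ỹ, checked against np.cov with bias=True (which also divides by N rather than N − 1). The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(500, 3))            # N x d data matrix

mu = Y.mean(axis=0)                      # sample mean of the data points
Y_centered = Y - mu                      # mean-centered data (one row per point)

Sigma_hat = (Y_centered.T @ Y_centered) / len(Y)   # (1/N) Ỹᵀ Ỹ

# np.cov with bias=True also divides by N, so the two estimates coincide.
assert np.allclose(Sigma_hat, np.cov(Y, rowvar=False, bias=True))
print(Sigma_hat)
```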

15. PCA: Steps
    - Input: N×d data matrix Y (each row contains a d-dimensional data point)
    - μ = (1/N) Σ_{i=1}^N y^(i)
    - Ỹ ← Y with the mean vector μ subtracted from each row
    - Σ = (1/N) Ỹ^T Ỹ (covariance matrix)
    - Compute the eigenvalues and eigenvectors of Σ
    - Pick the d' eigenvectors corresponding to the largest eigenvalues and put them in the columns of B = [w_1, ..., w_{d'}] (first PC, ..., d'-th PC)
    - Y' = Ỹ B
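A minimal NumPy implementation of the steps listed above (a sketch, not the course's reference code); the explicit eigenvalue sort and the synthetic data are my additions.

```python
import numpy as np

def pca(Y, d_reduced):
    """PCA following the slide's steps: center, covariance, eigendecompose, project."""
    mu = Y.mean(axis=0)                          # mean of the data points
    Y_centered = Y - mu                          # subtract the mean from every row
    Sigma = (Y_centered.T @ Y_centered) / len(Y) # (1/N) Ỹᵀ Ỹ
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # ascending order for a symmetric matrix
    order = np.argsort(eigvals)[::-1][:d_reduced]
    B = eigvecs[:, order]                        # columns: first PC, ..., d'-th PC
    return Y_centered @ B, B, eigvals[order], mu

rng = np.random.default_rng(3)
Y = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated synthetic data
Y_reduced, B, top_eigvals, mu = pca(Y, d_reduced=2)
print(Y_reduced.shape, top_eigvals)              # (200, 2) and the two largest variances
```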

16. Find Principal Components
    - Assume that the data are centered.
    - Find the vector w that maximizes the sample variance of the projected data:
      argmax_w (1/N) Σ_{j=1}^N (w^T y^(j))^2 = (1/N) w^T Y^T Y w   s.t.   w^T w = 1
    - Lagrangian: L(w, λ) = w^T Y^T Y w − λ (w^T w − 1)
      ∂L/∂w = 0 ⇒ 2 Y^T Y w − 2λw = 0 ⇒ Y^T Y w = λw
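Not from the slides: a quick numerical sanity check that the eigenvector obtained from Y^T Y w = λw does beat arbitrary unit directions in projected sample variance, on synthetic centered data.

```python
import numpy as np

rng = np.random.default_rng(4)
Y = rng.normal(size=(300, 3)) @ np.diag([3.0, 1.0, 0.3])   # synthetic data
Y = Y - Y.mean(axis=0)                                     # center it exactly

def proj_variance(w):
    w = w / np.linalg.norm(w)                              # enforce wᵀw = 1
    return (Y @ w) @ (Y @ w) / len(Y)                      # (1/N) Σ_j (wᵀ y^(j))²

eigvals, eigvecs = np.linalg.eigh(Y.T @ Y)
w_best = eigvecs[:, -1]                                    # eigenvector with the largest eigenvalue

random_dirs = rng.normal(size=(1000, 3))
assert all(proj_variance(w_best) >= proj_variance(w) - 1e-12 for w in random_dirs)
print(proj_variance(w_best), eigvals[-1] / len(Y))         # equal: λ_max / N
```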

17. Find Principal Components
    - For symmetric matrices, there exist eigenvectors that are orthogonal.
    - Let w_1, ..., w_d denote the eigenvectors of Y^T Y such that:
      w_j^T w_k = 0 for all j ≠ k,   and   w_j^T w_j = 1 for all j

18. Find Principal Components
    - Y^T Y w = λw ⇒ w^T Y^T Y w = λ w^T w = λ
    - λ is the amount of variance along the found dimension w (called the energy along that dimension).
    - Eigenvalues: λ_1 ≥ λ_2 ≥ λ_3 ≥ ...
    - The first PC w_1 is the eigenvector of the sample covariance matrix Y^T Y associated with the largest eigenvalue.
    - The 2nd PC w_2 is the eigenvector of the sample covariance matrix Y^T Y associated with the second largest eigenvalue.
    - And so on ...

19. Another Interpretation: Least Squares Error
    - PCs are linear least squares fits to the samples, each orthogonal to the previous PCs:
      - The first PC is a minimum-distance fit to a vector in the original feature space.
      - The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.
      - ...

20. Least Squares Error and Maximum Variance Views Are Equivalent (1-D Interpretation)
    - When the data are mean-removed:
      - Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras).
    - [Figure: a data point, its projection onto a line through the origin, and the residual.]
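A worked one-liner of the Pythagorean step (my phrasing, not the slide's): for a unit vector w and a mean-removed point y, the squared length splits into the squared projection plus the squared distance to the line, so with the total Σ_j ‖y_j‖² fixed the two objectives are equivalent.

```latex
% Decompose y along the unit vector w (w^\top w = 1) and its orthogonal residual:
%   y = (w^\top y)\,w + r, \qquad r = y - (w^\top y)\,w, \qquad w^\top r = 0
\|y\|^2 = (w^\top y)^2 + \|\,y - (w^\top y)\,w\,\|^2
% Summing over the mean-removed points: \sum_j \|y_j\|^2 is constant, so maximizing
% \sum_j (w^\top y_j)^2 is the same as minimizing \sum_j \|y_j - (w^\top y_j)\,w\|^2.
```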

21. Two Interpretations
    - Maximum variance subspace: argmax_w Σ_{j=1}^N (w^T y^(j))^2 = w^T Y^T Y w
    - Minimum reconstruction error: argmin_w Σ_{j=1}^N ‖y^(j) − (w^T y^(j)) w‖^2
    - For each data point (green), the squared projection (red) plus the squared distance to the line (blue) satisfy blue² + red² = green². Since green² is fixed by the data, maximizing red² is equivalent to minimizing blue².
    - [Figure: a point y, the line through the origin along w, and the projection w^T y.]
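Not part of the slides: a numerical check, on synthetic mean-removed data, that blue² + red² = green² holds and therefore the variance-maximizing first PC is also the reconstruction-error minimizer.

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(size=(400, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Y = Y - Y.mean(axis=0)                         # mean-removed data

w = np.linalg.eigh(Y.T @ Y)[1][:, -1]          # first PC (unit vector)

proj = Y @ w                                   # red: signed projections wᵀy^(j)
resid = Y - np.outer(proj, w)                  # blue: y^(j) − (wᵀy^(j)) w

green2 = (Y ** 2).sum()                        # fixed total: Σ_j ‖y^(j)‖²
red2 = (proj ** 2).sum()                       # maximized by the first PC
blue2 = (resid ** 2).sum()                     # therefore minimized by the first PC

print(np.isclose(green2, red2 + blue2))        # True: blue² + red² = green² overall
```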

22. PCA: Uncorrelated Features
    - y' = B^T y
    - S_{y'} = E[y' y'^T] = E[B^T y y^T B] = B^T E[y y^T] B = B^T S_y B
    - If B = [b_1, ..., b_d], where b_1, ..., b_d are orthonormal eigenvectors of S_y:
      S_{y'} = B^T S_y B = B^T B Λ B^T B = Λ
      ⇒ E[y'_j y'_k] = 0 for all j ≠ k,   j, k = 1, ..., d
    - Then mutually uncorrelated features are obtained.
    - Completely uncorrelated features avoid information redundancies.
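Not from the slides: a NumPy sketch confirming that projecting zero-mean data onto the orthonormal eigenvectors of S_y gives features whose covariance is the diagonal matrix Λ (off-diagonals vanish up to round-off). The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(6)
Y = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # correlated synthetic features
Y = Y - Y.mean(axis=0)                                     # zero mean, as the slide assumes

S_y = (Y.T @ Y) / len(Y)                                   # covariance of the original features
eigvals, B = np.linalg.eigh(S_y)                           # B: orthonormal eigenvectors of S_y

Y_new = Y @ B                                              # y' = Bᵀ y for every row
S_new = (Y_new.T @ Y_new) / len(Y)                         # covariance of the new features

# Off-diagonal entries vanish (up to round-off): S_{y'} = Bᵀ S_y B = Λ
print(np.allclose(S_new, np.diag(np.diag(S_new))))         # True
print(np.allclose(np.diag(S_new), eigvals))                # True
```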

23. Reconstruction
    - y' = B^T y = [w_1^T y, ..., w_{d'}^T y]^T,   with B = [w_1, ..., w_{d'}]
    - Incorporating all eigenvectors in B = [w_1, ..., w_d]:
      B y' = B B^T y = y ⇒ y = B y'
      ⟹ If d' = d, then y can be reconstructed exactly from y'.
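Not in the slides: a NumPy sketch on synthetic centered data showing exact reconstruction when all d eigenvectors are kept (B B^T = I) and an approximate reconstruction when only d' < d PCs are kept.

```python
import numpy as np

rng = np.random.default_rng(7)
Y = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Y = Y - Y.mean(axis=0)                                   # centered data

S_y = (Y.T @ Y) / len(Y)
eigvals, eigvecs = np.linalg.eigh(S_y)
order = np.argsort(eigvals)[::-1]

y = Y[0]                                                 # one data point

B_full = eigvecs[:, order]                               # all d eigenvectors: d' = d
y_exact = B_full @ (B_full.T @ y)                        # B Bᵀ y = y, exact reconstruction
print(np.allclose(y_exact, y))                           # True

B_small = eigvecs[:, order[:2]]                          # keep only the first 2 PCs: d' < d
y_approx = B_small @ (B_small.T @ y)                     # approximate reconstruction
print(np.linalg.norm(y - y_approx))                      # small, but generally nonzero
```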

24. PCA Derivation: Relation Between Eigenvalues and Variances
    - The k-th largest eigenvalue of S_y is the variance along the k-th PC:
      var(y'_k) = w_k^T S_y w_k = λ_k
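Not part of the slides: a final NumPy check, on synthetic centered data, that the variance of each projected coordinate equals the corresponding eigenvalue of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(8)
Y = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 3))
Y = Y - Y.mean(axis=0)

S_y = (Y.T @ Y) / len(Y)                         # sample covariance (1/N convention)
eigvals, eigvecs = np.linalg.eigh(S_y)
order = np.argsort(eigvals)[::-1]                # sort PCs by decreasing eigenvalue

for k in range(3):
    w_k = eigvecs[:, order[k]]
    var_k = ((Y @ w_k) ** 2).mean()              # variance of the k-th projected coordinate
    print(np.isclose(var_k, eigvals[order[k]]))  # True: var(y'_k) = λ_k
```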
