

  1. Principal Component Analysis (PCA), CE-717: Machine Learning, Sharif University of Technology, Spring 2016, Soleymani

  2. Dimensionality Reduction: Feature Selection vs. Feature Extraction
  - Feature selection: select a subset of a given feature set,
    $[x_1, \ldots, x_d]^T \rightarrow [x_{j_1}, \ldots, x_{j_{d'}}]^T$
  - Feature extraction: a linear or non-linear transform on the original feature space,
    $[x_1, \ldots, x_d]^T \rightarrow [x'_1, \ldots, x'_{d'}]^T = f([x_1, \ldots, x_d]^T)$, with $d' < d$
    (a small code sketch contrasting the two follows below).
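
As a concrete illustration of the difference, here is a minimal sketch (toy data and variable names are mine, not from the slides): feature selection keeps a subset of the original columns unchanged, while linear feature extraction mixes all columns through a transformation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))          # 5 samples, d = 4 original features

# Feature selection: keep a subset of the existing columns (e.g., features 0 and 2).
selected = X[:, [0, 2]]              # shape (5, 2), values are unchanged

# Feature extraction: transform to d' = 2 new features x' = A^T x (here an arbitrary A).
A = rng.normal(size=(4, 2))          # d x d' transformation matrix
extracted = X @ A                    # shape (5, 2), each new feature mixes all originals

print(selected.shape, extracted.shape)   # (5, 2) (5, 2)
```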

  3. Feature Extraction
  - Mapping of the original data to another space.
  - The criterion for feature extraction can differ based on the problem setting:
    - Unsupervised task: minimize the information loss (reconstruction error).
    - Supervised task: maximize the class discrimination in the projected space.
  - Feature extraction algorithms, linear methods:
    - Unsupervised: e.g., Principal Component Analysis (PCA)
    - Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)

  4. Feature Extraction
  - Unsupervised feature extraction: a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$.
    Only the data matrix $X \in \mathbb{R}^{N \times d}$ (rows $x^{(1)}, \ldots, x^{(N)}$) is fed to the feature-extraction step, which produces the transformed data $X' \in \mathbb{R}^{N \times d'}$ (rows $x'^{(1)}, \ldots, x'^{(N)}$).
  - Supervised feature extraction: a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$.
    Both the data matrix $X$ and the labels $y = [y^{(1)}, \ldots, y^{(N)}]^T$ are fed to the feature-extraction step, which produces the transformed data $X'$.

  5. Unsupervised Feature Reduction
  - Visualization: projection of high-dimensional data onto 2D or 3D.
  - Data compression: efficient storage, communication, or retrieval.
  - Pre-processing: can improve accuracy by reducing the number of features.
    - As a preprocessing step, it reduces dimensionality for supervised learning tasks and helps avoid overfitting.
  - Noise removal
    - E.g., "noise" in images introduced by minor lighting variations, slightly different imaging conditions, etc.

  6. Linear Transformation
  - For a linear transformation, we find an explicit mapping $f(x) = A^T x$ that can also transform new data vectors:
    original data $x \in \mathbb{R}^d$, transformation $A^T \in \mathbb{R}^{d' \times d}$, reduced data $x' = A^T x \in \mathbb{R}^{d'}$, with $d' < d$
    (see the sketch below).
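
A tiny sketch of the point that the same explicit mapping also applies to new data vectors; $A$ here is an arbitrary example matrix rather than one learned by PCA:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, -0.5]])          # d x d' projection matrix (arbitrary example)

x_train = np.array([1.0, 2.0, 3.0, 4.0])
x_new   = np.array([0.5, -1.0, 2.0, 0.0])   # a previously unseen point

# The explicit mapping f(x) = A^T x applies to any vector in R^d.
print(A.T @ x_train)   # reduced representation of a training point
print(A.T @ x_new)     # the same mapping applied to new data
```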

  7. Linear Transformation
  - Linear transformations are simple mappings: $x' = A^T x$, with
    $A = \begin{bmatrix} a_{11} & \cdots & a_{1d'} \\ \vdots & \ddots & \vdots \\ a_{d1} & \cdots & a_{dd'} \end{bmatrix} \in \mathbb{R}^{d \times d'}$
  - Component by component, $x'_j = a_j^T x$ for $j = 1, \ldots, d'$, where $a_j$ denotes the $j$-th column of $A$.

  8. Linear Dimensionality Reduction
  - Unsupervised
    - Principal Component Analysis (PCA) [we will discuss]
    - Independent Component Analysis (ICA) [we will discuss]
    - Singular Value Decomposition (SVD)
    - Multidimensional Scaling (MDS)
    - Canonical Correlation Analysis (CCA)

  9. Principal Component Analysis (PCA)
  - Also known as the Karhunen-Loève (KL) transform.
  - Principal Components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions.
  - Find the directions along which the data approximately lie.
  - When the data are projected onto the first PC, the variance of the projected data is maximized.
  - PCA is an orthogonal projection of the data onto a subspace such that the variance of the projected data is maximized.

  10. Principal Component Analysis (PCA)
  - The "best" linear subspace (i.e., the one providing the least reconstruction error of the data):
    - Work with the mean-removed (centered) data.
    - The axes are rotated to new (principal) axes such that:
      - Principal axis 1 has the highest variance, ..., principal axis i has the i-th highest variance.
      - The principal axes are uncorrelated: the covariance between each pair of principal axes is zero.
  - Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible.
  - PCs can be found as the "best" eigenvectors of the covariance matrix of the data points.

  11. Principal Components
  - If the data have a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$, the direction of largest variance is given by the eigenvector of $\Sigma$ that corresponds to the largest eigenvalue of $\Sigma$ (a small numerical check follows below).
    (Figure: a data ellipse with principal directions $v_1$ and $v_2$.)
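
To see this numerically, one can sample from a 2-D Gaussian and check that the top eigenvector of $\Sigma$ captures the most variance; the covariance values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 0.0])
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])                 # true covariance, elongated ellipse

X = rng.multivariate_normal(mu, Sigma, size=5000)

eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
v1 = eigvecs[:, -1]                            # eigenvector of the largest eigenvalue

# Variance of the data projected onto v1 is close to the largest eigenvalue,
# and larger than along the other eigenvector.
print(np.var(X @ v1), eigvals[-1])
print(np.var(X @ eigvecs[:, 0]), eigvals[0])
```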

  12. PCA: Steps
  - Input: $N \times d$ data matrix $X$ (each row contains a $d$-dimensional data point).
  - $\mu = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$
  - $X \leftarrow$ subtract the mean $\mu$ from each row of $X$.
  - $C = \frac{1}{N} X^T X$ (covariance matrix)
  - Calculate the eigenvalues and eigenvectors of $C$.
  - Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $A = [v_1, \ldots, v_{d'}]$ (from the first PC to the $d'$-th PC).
  - $X' = X A$ (a minimal implementation of these steps follows below).
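
A minimal NumPy implementation of exactly these steps; the function and variable names are my own and this is a sketch rather than reference code from the course:

```python
import numpy as np

def pca(X, d_prime):
    """Project the N x d data matrix X onto its top d_prime principal components."""
    mu = X.mean(axis=0)                        # mean of the data points
    Xc = X - mu                                # subtract the mean from each row
    C = (Xc.T @ Xc) / X.shape[0]               # covariance matrix (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # sort descending
    A = eigvecs[:, order[:d_prime]]            # columns: first PC, ..., d'-th PC
    return Xc @ A, A, mu                       # X' = X A, plus A and mu for reuse

# Toy usage: reduce 3-D correlated data to 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.2]])
X_reduced, A, mu = pca(X, 2)
print(X_reduced.shape)                         # (200, 2)
```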

  13. Covariance Matrix
  - $\mu_x = E[x] = [E(x_1), \ldots, E(x_d)]^T$, $\quad \Sigma = E[(x - \mu_x)(x - \mu_x)^T]$
  - ML estimate of the covariance matrix from the data points $\{x^{(i)}\}_{i=1}^{N}$:
    $\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^T = \frac{1}{N} \bar{X}^T \bar{X}$,
    where $\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$ and $\bar{X}$ is the mean-centered data matrix with rows $(x^{(i)} - \hat{\mu})^T$.
  - From here on we assume the data are mean-removed; $x$ in the later slides denotes the mean-centered data (a quick numerical check of the estimate follows below).
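
A quick check that the matrix form $\frac{1}{N}\bar{X}^T\bar{X}$ matches the sum-of-outer-products form, sketched on toy data (note that np.cov uses $1/(N-1)$ by default, so bias=True is passed to get the $1/N$ ML estimate):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
mu = X.mean(axis=0)
Xc = X - mu                                    # mean-centered data matrix

Sigma_matrix_form = (Xc.T @ Xc) / X.shape[0]   # (1/N) Xbar^T Xbar
Sigma_sum_form = sum(np.outer(x - mu, x - mu) for x in X) / X.shape[0]

print(np.allclose(Sigma_matrix_form, Sigma_sum_form))          # True
print(np.allclose(Sigma_matrix_form, np.cov(X.T, bias=True)))  # True (1/N normalization)
```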

  14. Correlation Matrix
  - With the data matrix $X = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$:
    $\frac{1}{N} X^T X = \frac{1}{N} \begin{bmatrix} \sum_{n=1}^{N} x_1^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_1^{(n)} x_d^{(n)} \\ \vdots & \ddots & \vdots \\ \sum_{n=1}^{N} x_d^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_d^{(n)} x_d^{(n)} \end{bmatrix}$
  - For mean-centered data this is exactly the sample covariance matrix of the previous slide.

  15. Two Interpretations
  - Maximum variance subspace: PCA finds vectors $a$ such that projections onto these vectors capture maximum variance in the data,
    $\frac{1}{N} \sum_{n=1}^{N} (a^T x^{(n)})^2 = \frac{1}{N} a^T X^T X a$
  - Minimum reconstruction error: PCA finds vectors $a$ such that projection onto these vectors yields minimum MSE reconstruction,
    $\frac{1}{N} \sum_{n=1}^{N} \| x^{(n)} - (a^T x^{(n)})\, a \|^2$
  - (A numerical comparison of the two objectives follows below.)
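
The link between the two objectives can be checked numerically: for any unit vector $a$, the captured variance plus the mean squared reconstruction error equals the total mean squared norm of the centered data, so maximizing one minimizes the other. A sketch with arbitrary data (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
X = X - X.mean(axis=0)                          # mean-centered data

a = rng.normal(size=4)
a = a / np.linalg.norm(a)                       # arbitrary unit direction

proj = X @ a                                    # a^T x_n for every sample
variance_captured = np.mean(proj ** 2)          # (1/N) sum (a^T x_n)^2
reconstruction_mse = np.mean(np.sum((X - np.outer(proj, a)) ** 2, axis=1))
total = np.mean(np.sum(X ** 2, axis=1))

print(np.isclose(variance_captured + reconstruction_mse, total))   # True
```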

  16. Least Squares Error Interpretation
  - PCs are linear least-squares fits to the samples, each orthogonal to the previous PCs:
    - The first PC is a minimum-distance fit to a vector in the original feature space.
    - The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.
    - And so on.

  17. Example (figure not captured in this transcript)

  18. Example (figure not captured in this transcript)

  19. Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)
  - Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras).
  - In the figure: red² + blue² = green², where green is the (mean-removed) data vector from the origin, blue is its projection onto the line, and red is its distance to the line.
  - green² is fixed, so maximizing blue² is equivalent to minimizing red² (the same argument is written out symbolically below).
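
The Pythagoras argument written out in symbols, as a short sketch (notation follows the earlier slides: $a$ is a unit vector along the line and $x^{(n)}$ are the mean-centered data points):

```latex
% Decompose a centered data point x into its projection onto the unit vector a
% and the residual orthogonal to a.
x = (a^T x)\,a + \bigl(x - (a^T x)\,a\bigr),
\qquad a^T \bigl(x - (a^T x)\,a\bigr) = a^T x - (a^T x)(a^T a) = 0.
% Pythagoras: the squared lengths of the two orthogonal parts add up.
\|x\|^2 = (a^T x)^2 + \bigl\|x - (a^T x)\,a\bigr\|^2.
% Summing over all N data points, the left-hand side does not depend on a, so
% maximizing the projected variance is the same as minimizing the residual error:
\sum_{n=1}^{N} (a^T x^{(n)})^2
  = \sum_{n=1}^{N} \|x^{(n)}\|^2
  - \sum_{n=1}^{N} \bigl\|x^{(n)} - (a^T x^{(n)})\,a\bigr\|^2.
```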

  20. First PC
  - The first PC is the direction of greatest variability in the data.
  - We will show that the first PC is the eigenvector of the covariance matrix corresponding to the maximum eigenvalue of this matrix.
  - If $\|a\| = 1$, the (signed) length of the projection of a $d$-dimensional $x$ onto $a$ is
    $\|x\| \cos\theta = \|x\| \frac{a^T x}{\|x\|\,\|a\|} = a^T x$.

  21. First PC
  - $\underset{a}{\operatorname{argmax}} \ \frac{1}{N} \sum_{n=1}^{N} (a^T x^{(n)})^2 = \frac{1}{N} a^T X^T X a \quad \text{s.t. } a^T a = 1$
  - Setting the derivative of the Lagrangian to zero:
    $\frac{\partial}{\partial a} \left[ \frac{1}{N} a^T X^T X a + \lambda (1 - a^T a) \right] = 0 \;\Rightarrow\; \frac{1}{N} X^T X a = \lambda a$
  - So $a$ is an eigenvector of the sample covariance matrix $\frac{1}{N} X^T X$.
  - The eigenvalue $\lambda$ denotes the amount of variance along that dimension:
    Variance $= \frac{1}{N} a^T X^T X a = a^T \left(\frac{1}{N} X^T X\right) a = a^T \lambda a = \lambda$
  - So, if we seek the dimension with the largest variance, it is the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix (a numerical check follows below).
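
A quick numerical check of this conclusion, as a sketch with synthetic data and my own variable names: the variance of the data projected onto the top eigenvector equals the largest eigenvalue, and no other unit direction captures more.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
X = X - X.mean(axis=0)                                     # mean-centered

C = (X.T @ X) / X.shape[0]                 # sample covariance (1/N) X^T X
eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
a = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue

# Variance along the first PC equals the largest eigenvalue.
print(np.isclose(np.mean((X @ a) ** 2), eigvals[-1]))      # True

# Random unit directions never capture more variance than the first PC.
dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
variances = np.mean((X @ dirs.T) ** 2, axis=0)             # variance along each direction
print(variances.max() <= eigvals[-1] * (1 + 1e-9))         # True
```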

  22. PCA: Uncorrelated Features
  - $x' = A^T x$, so $\Sigma_{x'} = E[x' x'^T] = E[A^T x x^T A] = A^T E[x x^T] A = A^T \Sigma_x A$
  - If $A = [a_1, \ldots, a_d]$, where $a_1, \ldots, a_d$ are orthonormal eigenvectors of $\Sigma_x$ (so $\Sigma_x = A \Lambda A^T$):
    $\Sigma_{x'} = A^T \Sigma_x A = A^T A \Lambda A^T A = \Lambda \;\Rightarrow\; E[x'_i x'_j] = 0$ for all $i \neq j$, $i, j = 1, \ldots, d$
  - Thus mutually uncorrelated features are obtained.
  - Completely uncorrelated features avoid information redundancies (see the numerical check below).
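
A small numerical confirmation of this decorrelation, sketched with arbitrary toy data: after projecting onto all the eigenvectors, the sample covariance of the new features is diagonal up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 3)) @ np.array([[1.0, 0.4, 0.0],
                                           [0.0, 1.0, 0.7],
                                           [0.0, 0.0, 1.0]])
X = X - X.mean(axis=0)

C = (X.T @ X) / X.shape[0]
_, A = np.linalg.eigh(C)                        # orthonormal eigenvectors as columns

X_new = X @ A                                   # x' = A^T x for every sample
C_new = (X_new.T @ X_new) / X_new.shape[0]      # covariance of the new features

off_diagonal = C_new - np.diag(np.diag(C_new))
print(np.allclose(off_diagonal, 0.0, atol=1e-10))   # True: features are uncorrelated
```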

  23. PCA Derivation: Mean Square Error Approximation
  - Incorporating all eigenvectors in $A = [a_1, \ldots, a_d]$:
    $x' = A^T x \;\Rightarrow\; A x' = A A^T x = x$ (since the columns of $A$ are orthonormal, $A A^T = I$), so $x = A x'$
  - $\Rightarrow$ If $d' = d$, then $x$ can be reconstructed exactly from $x'$ (a sketch of exact vs. truncated reconstruction follows below).
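
A final sketch of the reconstruction property (toy data and names are mine): with all $d$ eigenvectors the reconstruction $A x'$ is exact, while keeping only $d' < d$ of them gives an approximation whose mean squared error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

C = (X.T @ X) / X.shape[0]
eigvals, A = np.linalg.eigh(C)                  # all d orthonormal eigenvectors

# d' = d: x = A x' exactly, because A A^T = I.
X_full = (X @ A) @ A.T
print(np.allclose(X_full, X))                   # True

# d' = 2 < d: reconstruction is only approximate; the mean squared error
# equals the sum of the discarded (smallest) eigenvalues.
A2 = A[:, -2:]                                  # keep the two largest eigenvalues
X_approx = (X @ A2) @ A2.T
mse = np.mean(np.sum((X - X_approx) ** 2, axis=1))
print(np.isclose(mse, eigvals[:-2].sum()))      # True
```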
