  1. Geometric Data Analysis: Principal Component Analysis. MAT 6480W / STT 6705V, Guy Wolf (guy.wolf@umontreal.ca), Université de Montréal, Fall 2019.

  2. Outline: 1. Preprocessing for data simplification (Sampling; Aggregation; Discretization; Density estimation; Dimensionality reduction). 2. Principal component analysis (PCA) (Autoencoder; Variance maximization; Singular value decomposition (SVD)).

  3. Preprocessing for data simplification / Sampling: Select a subset of representative data points instead of processing the entire data. A sampled subset is useful only if its analysis yields the same patterns, results, and conclusions as the analysis of the entire data. [Figure: the same dataset sampled at 8000, 2000, and 500 points.]

  4. Preprocessing for data simplification / Sampling: Select a subset of representative data points instead of processing the entire data. Common sampling approaches - Random: an equal probability of selecting any particular item. Without replacement: iteratively select and remove items. With replacement: selected items remain in the population. Stratified: draw random samples from each partition. Choosing a sufficient sample size is often crucial for effective sampling.
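A minimal numpy sketch of these three approaches (the dataset X, the group labels, and the sample size are hypothetical, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 2))          # hypothetical dataset of 8000 points
groups = rng.integers(0, 4, size=8000)  # hypothetical stratum label per point
n = 500

# Random sampling without replacement: each point can be drawn at most once.
idx_without = rng.choice(len(X), size=n, replace=False)

# Random sampling with replacement: drawn points stay in the population.
idx_with = rng.choice(len(X), size=n, replace=True)

# Stratified sampling: draw a proportional random sample from each partition
# (rounding may make the total slightly smaller than n).
idx_strat = np.concatenate([
    rng.choice(np.where(groups == g)[0],
               size=int(n * np.mean(groups == g)),
               replace=False)
    for g in np.unique(groups)
])

sample = X[idx_strat]
```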

  5. Preprocessing for data simplification / Sampling: Example - choose enough samples to guarantee that at least one representative is selected from each distinct group/cluster/profile in the data.

  6. Preprocessing for data simplification / Aggregation: Instead of sampling representative data points, we can coarse-grain the data by aggregating attributes or data points together. Aggregation - combining several attributes into a single feature, or several data points into a single observation. Examples: change monthly revenues to annual revenues; analyze neighborhoods instead of houses; provide the average rating of a season (not per episode).
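A small pandas sketch of the first example (all column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly revenues for three years (36 observations).
rng = np.random.default_rng(0)
monthly = pd.DataFrame({
    "year": np.repeat([2017, 2018, 2019], 12),
    "month": np.tile(np.arange(1, 13), 3),
    "revenue": rng.uniform(50, 150, size=36),
})

# Aggregate data points: one observation per year instead of per month.
annual = monthly.groupby("year", as_index=False)["revenue"].sum()

# The same pattern covers the other examples, e.g. (hypothetical columns):
# houses.groupby("neighborhood")["price"].mean()
```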

  7. Preprocessing for data simplification / Discretization: It is sometimes convenient to transform the entire data to nominal (or ordinal) attributes. Discretization - transformation of continuous attributes (or ones with an infinite range) to discrete ones with a finite range. Discretization can be done in a supervised manner (e.g., using class labels) or in an unsupervised manner (e.g., using clustering).
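A minimal Python sketch of two unsupervised variants, equal-width binning and clustering-based binning (the attribute and the number of bins are hypothetical); a supervised variant would additionally use class labels, e.g., via impurity-minimizing splits:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # hypothetical continuous attribute

# Equal-width discretization: 5 bins spanning the observed range.
edges = np.linspace(x.min(), x.max(), 6)
equal_width_codes = np.digitize(x, edges[1:-1])   # values in {0, ..., 4}

# Clustering-based discretization: bin = nearest 1-D k-means center.
kmeans_codes = KMeans(n_clusters=5, n_init=10).fit_predict(x.reshape(-1, 1))
```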

  8. Preprocessing for data simplification / Discretization: Supervised discretization based on minimizing impurity. [Figure: the same data discretized with 3 values per axis and with 5 values per axis.]

  9. Preprocessing for data simplification / Discretization: Unsupervised discretization. [Figure: unsupervised discretization of the data.]

  10. Preprocessing for data simplification / Density estimation: Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one.

  11. Preprocessing for data simplification / Density estimation: Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one. Cell-based density estimation.

  12. Preprocessing for data simplification / Density estimation: Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one. Center-based density estimation.
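A minimal numpy sketch of both variants on a hypothetical 2-D dataset (the grid size and radius are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))  # hypothetical 2-D dataset

# Cell-based: partition space into a grid and use the normalized cell counts.
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=20)
cell_density = counts / counts.max()             # rescaled to [0, 1]

# Center-based: for each point, count neighbours within a fixed radius r.
r = 0.5
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
neighbour_counts = (dists < r).sum(axis=1) - 1   # exclude the point itself
point_density = neighbour_counts / neighbour_counts.max()
```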

  13. Preprocessing for data simplification / Dimensionality reduction: The dimensionality of data is generally determined by the number of attributes or features that represent each data point. Curse of dimensionality - a general term for various phenomena that arise when analyzing and processing high-dimensional data. Common theme - statistical significance is difficult, impractical, or even impossible to obtain due to the sparsity of the data in high dimensions, which causes poor performance of classical statistical methods compared to low-dimensional data. Common solution - reduce the dimensionality of the data as part of its (pre)processing.
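A small numpy illustration of one such phenomenon, distance concentration (hypothetical uniform data): as the dimension grows, the nearest and farthest points from a query become almost equally far away, so distance-based statistics lose contrast.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))   # 500 hypothetical points in dimension d
    q = rng.uniform(size=d)          # a query point
    dists = np.linalg.norm(X - q, axis=1)
    print(d, dists.min() / dists.max())   # ratio creeps toward 1 as d grows
```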

  14. Preprocessing for data simplification / Dimensionality reduction: There are several approaches to represent the data in a lower dimension, which can generally be split into two types. Dimensionality reduction approaches - Feature selection/weighting: select a subset of existing features and only use them in the analysis, while possibly also assigning them importance weights to eliminate redundant information. Feature extraction/construction: create new features by extracting relevant information from the original features. PCA and MDS are two of the most common dimensionality reduction methods in data analysis, but many others exist as well.

  15. Preprocessing for data simplification / Feature subset selection: Ideally - choose the best feature subset out of all possible combinations. Impractical - there are $2^n$ choices for $n$ attributes! Feature selection approaches - Embedded methods: choose the best features for a task as part of the data mining algorithm (e.g., decision trees). Filter methods: choose features that optimize a general criterion (e.g., minimal correlation) as part of data preprocessing, using an efficient search algorithm. Wrapper methods: first formulate and handle a data mining task to select features, and then use the resulting subset to solve the real task. Alternatively, expert knowledge can sometimes be used to eliminate redundant and unnecessary features.
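A minimal sketch of a filter method, assuming a simple low-pairwise-correlation criterion (the threshold, data, and function name are hypothetical):

```python
import numpy as np

def correlation_filter(X, threshold=0.9):
    """Greedily keep features whose absolute correlation with every
    already-kept feature stays below `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

# Hypothetical data: feature 2 is (almost) a copy of feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=200)
print(correlation_filter(X))   # -> [0, 1]; the redundant feature is dropped
```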

  16. Principal component analysis

  17. Principal component analysis. [Figure: a cloud of data points.]

  18. Principal component analysis. [Figure: a cloud of data points.] Assume: avg = 0 (the data is centered). Find: the best $k$-dimensional projection.

  19. Principal component analysis: Projection on principal components. [Figure: data points and the principal components onto which they are projected.]

  20. Principal component analysis: Projection on principal components. [Figure: data in a 3D space projected along the leading principal component $\lambda_1 \varphi_1$ onto a 1D space.]

  21. Principal component analysis: What is the best projection? Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S\, x$.

  22. Principal component analysis: What is the best projection? Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S\, x$. $\Downarrow$ Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $S = \mathrm{span}\{u_1, \ldots, u_k\}$ and $\|x - \hat{x}\|$ is minimal over the data with $\hat{x} = \mathrm{proj}_S\, x$.

  23. Principal component analysis: What is the best projection? Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S\, x$. $\Downarrow$ Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $S = \mathrm{span}\{u_1, \ldots, u_k\}$ and $\|x - \hat{x}\|$ is minimal over the data with $\hat{x} = \mathrm{proj}_S\, x$. $\Downarrow$ Find $k$ vectors $u_1, \ldots, u_k$ s.t. $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ is minimal with $\hat{x} = \mathrm{proj}_{\mathrm{span}\{u_1, \ldots, u_k\}}\, x$.

  24. Principal component analysis: What is the best projection? Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S\, x$. $\Downarrow$ Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $S = \mathrm{span}\{u_1, \ldots, u_k\}$ and $\|x - \hat{x}\|$ is minimal over the data with $\hat{x} = \mathrm{proj}_S\, x$. $\Downarrow$ Find $k$ vectors $u_1, \ldots, u_k$ s.t. $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ is minimal with $\hat{x} = \mathrm{proj}_{\mathrm{span}\{u_1, \ldots, u_k\}}\, x$. How do we find these vectors $u_1, \ldots, u_k$?
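The deck develops the answer via the autoencoder view, variance maximization, and the SVD (see the outline). As a preview, a minimal numpy sketch of the standard PCA solution, taking $u_1, \ldots, u_k$ to be the top-$k$ eigenvectors of the sample covariance of the centered data (the data here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # rows = points x_i
Xc = X - X.mean(axis=0)          # enforce the avg = 0 assumption
k = 3

# u_1, ..., u_k: top-k eigenvectors of the (n x n) sample covariance matrix.
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :k]              # columns u_1, ..., u_k

# x_hat = proj_span{u_1,...,u_k} x and the average reconstruction error.
X_hat = Xc @ U @ U.T
err = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
```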

  25. Principal component analysis / Autoencoder: Minimize $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ s.t. $\hat{x} = \mathrm{proj}_{\mathrm{span}\{u_1, \ldots, u_k\}}\, x$. Input layer: $x[1], x[2], x[3], x[4], x[5]$; hidden layer: $h[1], h[2], h[3]$ with $h_i = W x_i$; output layer: $\hat{x}[1], \hat{x}[2], \hat{x}[3], \hat{x}[4], \hat{x}[5]$ with $\hat{x}_i = U h_i$. $\arg\min_{W \in \mathbb{R}^{k \times n},\, U \in \mathbb{R}^{n \times k}} \sum_{i=1}^{N} \|x_i - U W x_i\|^2$.
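A minimal gradient-descent sketch of this linear autoencoder objective (not the course's implementation; the data, initialization, learning rate, and iteration count are arbitrary choices). The reconstruction error should decrease toward roughly the PCA reconstruction error for the same $k$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
X = X - X.mean(axis=0)                  # zero-mean data, rows = points
n, k, lr = X.shape[1], 3, 1e-3

W = rng.normal(scale=0.1, size=(k, n))  # encoder:  h_i = W x_i
U = rng.normal(scale=0.1, size=(n, k))  # decoder:  x_hat_i = U h_i

for _ in range(2000):
    H = X @ W.T                         # hidden codes, one row per point
    R = H @ U.T - X                     # residuals x_hat_i - x_i
    # Gradients of sum_i ||x_i - U W x_i||^2 with respect to U and W.
    grad_U = 2 * R.T @ H
    grad_W = 2 * (U.T @ R.T @ X)
    U -= lr * grad_U / len(X)
    W -= lr * grad_W / len(X)

print(np.mean(np.sum((X - X @ W.T @ U.T) ** 2, axis=1)))  # mean recon. error
```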

  26. Principal component analysis / Reconstruction error minimization: We only need to consider orthonormal vectors $u_1, \ldots, u_k$ (i.e., $\|u_i\| = 1$, $\langle u_i, u_j \rangle = 0$ for $i \neq j$) that form a basis for the subspace. We can then extend this set to form a basis $u_1, \ldots, u_n$ for the entire $\mathbb{R}^n$. Then, we can write $x = \sum_{j=1}^{n} \langle x, u_j \rangle u_j = \sum_{j=1}^{n} u_j u_j^T x$ and $\mathrm{proj}_{\mathrm{span}\{u_1, \ldots, u_k\}}\, x = \sum_{j=1}^{k} u_j u_j^T x$. We now consider the reconstruction error $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$.
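A short numpy check of these identities, using an orthonormal basis obtained from a QR factorization (the dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

# An orthonormal basis u_1, ..., u_n of R^n (columns of Q from a QR step).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
x = rng.normal(size=n)

# x = sum_j <x, u_j> u_j = sum_j u_j u_j^T x  (expansion in the full basis).
x_rebuilt = sum(np.dot(x, Q[:, j]) * Q[:, j] for j in range(n))
assert np.allclose(x, x_rebuilt)

# proj_span{u_1,...,u_k} x = sum_{j<=k} u_j u_j^T x, and its squared error.
x_hat = Q[:, :k] @ Q[:, :k].T @ x
err = np.sum((x - x_hat) ** 2)
```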
