
Applied Machine Learning: Dimensionality Reduction using PCA (Siamak Ravanbakhsh)



  1. Applied Machine Learning: Dimensionality reduction using PCA. Siamak Ravanbakhsh, COMP 551 (Fall 2020).

  2. Learning objectives: What is dimensionality reduction? What is it good for? Linear dimensionality reduction: Principal Component Analysis. Relation to Singular Value Decomposition.

  3. Motivation. Scenario: we are given high-dimensional data and asked to make sense of it! Real-world data is high-dimensional: we can't visualize beyond 3D; features may not have any semantics (the value of a pixel vs. happy/sad); processing and storage are costly; and many features may not vary much in our dataset (e.g., background pixels in face images). Dimensionality reduction: faithfully represent the data in low dimensions. We can often do this with real-world data (the manifold hypothesis). How do we do it?

  4. Dimensionality reduction: faithfully represent the data in low dimensions. How to do it? Learn a mapping between low-dimensional coordinates $z^{(n)} \in \mathbb{R}^2$ and the high-dimensional data $x^{(n)} \in \mathbb{R}^3$. Some methods give this mapping in both directions, and some only in one direction.

  5. Dimensionality reduction: faithfully represent the data in low dimensions. How to do it? Learn a mapping between a low-dimensional Euclidean space and our data: $x^{(n)} \in \mathbb{R}^{400}$ (each image is 20x20) and $z^{(n)} \in \mathbb{R}^2$. (image: Wikipedia)

  6. Principal Component Analysis (PCA). PCA is a linear dimensionality reduction method: $z^{(n)} = Q^\top x^{(n)}$, with $x^{(n)} \in \mathbb{R}^3$, $z^{(n)} \in \mathbb{R}^2$, and $Q \in \mathbb{R}^{3 \times 2}$, where $Q$ has orthonormal columns, $Q^\top Q = I$. It follows that the pseudo-inverse of $Q$ is $Q^\dagger = (Q^\top Q)^{-1} Q^\top = Q^\top$.
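
A minimal NumPy sketch of this projection (my own illustration, not from the slides): Q, x, and z below are placeholders, and the orthonormal columns come from a QR decomposition of a random matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(3, 2)))   # Q in R^{3x2} with orthonormal columns
    assert np.allclose(Q.T @ Q, np.eye(2))         # Q^T Q = I, so the pseudo-inverse is Q^T
    x = rng.normal(size=3)                         # a point in R^3
    z = Q.T @ x                                    # its 2-D code z = Q^T x
    x_hat = Q @ z                                  # reconstruction Q Q^T x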

  7. PCA: optimization objective. PCA is a linear dimensionality reduction method: $z^{(n)} = Q^\top x^{(n)}$, with $x^{(n)} \in \mathbb{R}^{784}$ (each image has 28x28 = 784 pixels), $z^{(n)} \in \mathbb{R}^2$, and $Q \in \mathbb{R}^{784 \times 2}$. Faithfulness is measured by the reconstruction error: $\min_Q \sum_n \lVert x^{(n)} - Q Q^\top x^{(n)} \rVert_2^2 \;\; \text{s.t. } Q^\top Q = I$.
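
A hedged sketch of evaluating this objective for a fixed $Q$ (the data below is random, a stand-in for the images; this is not the course code):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 784))                 # N = 100 fake "images", D = 784
    Q, _ = np.linalg.qr(rng.normal(size=(784, 2)))  # some Q with orthonormal columns, D' = 2
    reconstruction = X @ Q @ Q.T                    # row n is (Q Q^T x^(n))^T
    error = np.sum((X - reconstruction) ** 2)       # the objective PCA minimizes over Q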

  8. PCA: optimization objective. Faithfulness is measured by the reconstruction error: $\min_Q \sum_n \lVert x^{(n)} - Q Q^\top x^{(n)} \rVert_2^2 \;\; \text{s.t. } Q^\top Q = I$. Strategy: find a $D \times D$ matrix $Q = \begin{bmatrix} Q_{1,1} & \dots & Q_{1,D} \\ \vdots & \ddots & \vdots \\ Q_{D,1} & \dots & Q_{D,D} \end{bmatrix}$ and only use $D'$ of its columns. Since $Q$ is orthogonal, we can think of it as a change of coordinates. (figure: the standard basis $(1,0,0), (0,1,0), (0,0,1)$ mapped to the new orthonormal basis $q_1, q_2, q_3$)

  9. PCA: optimization objective. Strategy: find a $D \times D$ matrix $Q$ and only use $D'$ of its columns. Since $Q$ is orthogonal, we can think of it as a change of coordinates. We want to change coordinates such that coordinates $1, 2, \dots, D'$ best explain the data, for any given $D'$. (figure: a $D' = 2$ example, with the standard basis rotated to the new basis $q_1, q_2$)

  10. In other words: find a change of coordinates using an orthonormal matrix $Q = \begin{bmatrix} Q_{1,1} & \dots & Q_{1,D} \\ \vdots & \ddots & \vdots \\ Q_{D,1} & \dots & Q_{D,D} \end{bmatrix}$ such that the first new coordinate has maximum variance (lowest reconstruction error), the second coordinate has the next largest variance, and so on. Along which of these directions does the data have the highest variance? Call that direction the vector $q_1$. The projection of a point onto it is $\frac{x^{(n)\top} q_1}{\lVert q_1 \rVert_2} = x^{(n)\top} q_1$ (since $\lVert q_1 \rVert_2 = 1$), and the projection of the whole dataset is $z_1 = X q_1$.

  11. Covariance matrix. Find a change of coordinates using an orthonormal matrix; the first new coordinate has maximum variance. The projection of the whole dataset is $z_1 = X q_1$. Assuming the features have zero mean, maximize the variance of the projection: $\max_{q_1} \frac{1}{N} z_1^\top z_1 = \max_{q_1} \frac{1}{N} q_1^\top X^\top X q_1 = \max_{q_1} q_1^\top \Sigma q_1$, where $\Sigma = \frac{1}{N} X^\top X$ is the $D \times D$ covariance matrix. Recall that $\Sigma = \frac{1}{N} \sum_n (x^{(n)} - 0)(x^{(n)} - 0)^\top$ because the mean is zero, and $\Sigma_{i,j} = \mathrm{Cov}[X_{:,i}, X_{:,j}] = \frac{1}{N} \sum_n x_i^{(n)} x_j^{(n)}$ is the sample covariance of features $i$ and $j$.
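
A short sketch of this covariance computation, assuming X is an N x D NumPy array (the array here is a random placeholder):

    import numpy as np

    X = np.random.default_rng(0).normal(size=(500, 10))
    Xc = X - X.mean(axis=0)               # center: zero-mean features
    Sigma = Xc.T @ Xc / Xc.shape[0]       # Sigma = (1/N) X^T X, a D x D matrix
    # Sigma[i, j] is the sample covariance of features i and j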

  12. Eigenvalue decomposition. Find a change of coordinates using an orthogonal matrix; the first new coordinate has maximum variance: $\max_{q_1} q_1^\top \Sigma q_1 \;\; \text{s.t. } \lVert q_1 \rVert = 1$. The covariance matrix is symmetric and positive semi-definite: $(\frac{1}{N} X^\top X)^\top = \frac{1}{N} X^\top X$, and $a^\top \Sigma a = \frac{1}{N} a^\top X^\top X a = \frac{1}{N} \lVert X a \rVert_2^2 \geq 0 \;\; \forall a$. Any symmetric matrix has the decomposition $\Sigma = Q \Lambda Q^\top$ (as we see shortly, using $Q$ here is not a coincidence), where $Q$ is a $D \times D$ orthogonal matrix ($Q^\top Q = Q Q^\top = I$) whose columns are the eigenvectors, and $\Lambda$ is diagonal with the corresponding eigenvalues on its diagonal; positive semi-definiteness means these are non-negative.

  13. Principal directions. Find a change of coordinates using an orthogonal matrix; the first new coordinate has maximum variance: $q_1^* = \arg\max_{q_1} q_1^\top \Sigma q_1 \;\; \text{s.t. } \lVert q_1 \rVert = 1$. Using the eigenvalue decomposition, $\max_{q_1} q_1^\top Q \Lambda Q^\top q_1 = \lambda_1$: the maximizing direction is the eigenvector with the largest eigenvalue (the first column of $Q$), so $q_1 = Q_{:,1}$ is the first principal direction, the second eigenvector $q_2 = Q_{:,2}$ gives the second principal direction, and so on. So for PCA we need to find the eigenvectors of the covariance matrix.
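
A sketch of one possible implementation (not necessarily the course's) that finds the principal directions as the top eigenvectors of the covariance matrix; the helper name `principal_directions` is made up for illustration:

    import numpy as np

    def principal_directions(X, d_prime):
        Xc = X - X.mean(axis=0)
        Sigma = Xc.T @ Xc / Xc.shape[0]
        lam, Q = np.linalg.eigh(Sigma)      # eigh returns eigenvalues in ascending order
        order = np.argsort(lam)[::-1]       # reorder so the largest eigenvalue comes first
        return Q[:, order[:d_prime]], lam[order]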

  14. Reducing dimensionality. The projection onto the principal direction $q_i$ is given by $X q_i$. Think of the projection $X Q$ as a change of coordinates: we can use the first $D'$ coordinates, $Z = X Q_{:,:D'}$, to reduce the dimensionality while capturing a lot of the variance in the data. We can project back into the original coordinates using $\tilde{X} = Z Q_{:,:D'}^\top$ (the reconstruction).
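
A sketch of the projection and reconstruction on this slide, reusing the hypothetical `principal_directions` helper from the previous sketch (the data is a random placeholder):

    import numpy as np

    X = np.random.default_rng(0).normal(size=(200, 50))
    mu = X.mean(axis=0)
    Q_d, _ = principal_directions(X, d_prime=5)
    Z = (X - mu) @ Q_d              # Z = X Q_{:,:D'}: the low-dimensional coordinates
    X_tilde = Z @ Q_d.T + mu        # project back: the reconstruction of X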

  15. Example: digits dataset. Let's only work with the digit 2: images $x^{(1)}, x^{(2)}, \dots$ with $x^{(n)} \in \mathbb{R}^{784}$. Center the data, form the $784 \times 784$ covariance matrix $\Sigma$, and find its eigenvectors, the principal directions $q_1, q_2, \dots, q_{20}, \dots$. Use the first 20 directions to reduce the dimensionality from 784 to 20: the $i$-th PC coefficient $x^\top q_i$ gives the new coordinates, so using 20 numbers we can represent each image with good accuracy.
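
As a rough equivalent of this pipeline (assuming scikit-learn is available; the random array below stands in for the digit-2 images):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).normal(size=(1000, 784))  # placeholder for the digit images
    pca = PCA(n_components=20)
    Z = pca.fit_transform(X)           # 20 PC coefficients per image
    X_rec = pca.inverse_transform(Z)   # approximate reconstruction from those 20 numbers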

  16. Example 2: digits dataset. 3D embedding of MNIST digits (https://projector.tensorflow.org/): for $x^{(n)} \in \mathbb{R}^{784}$, the embedding's 3D coordinates are $X q_1, X q_2, X q_3$.

  17. There is another way to do PCA without using the covariance matrix.

  18. Singular Value Decomposition (SVD). Any $N \times D$ real matrix has the decomposition $X = U S V^\top$, where $U$ is an $N \times N$ orthogonal matrix, $S$ is an $N \times D$ rectangular diagonal matrix, and $V$ is a $D \times D$ orthogonal matrix. Compressed SVD: assuming $N > D$, we can ignore the last $N - D$ columns of $U$ and the last $N - D$ rows of $S$ (why? those rows of $S$ are zero), which gives $X = U S V^\top$ with $U \in \mathbb{R}^{N \times D}$, $S \in \mathbb{R}^{D \times D}$, $V \in \mathbb{R}^{D \times D}$; similarly, if $D > N$ we can compress $V$ and $S$. Here $u_i^\top u_j = 0$ and $v_i^\top v_j = 0$ for all $i \neq j$, and $s_i \geq 0$: the $\{u_i\}$ are the left singular vectors, the $s_i$ are the singular values, and the $\{v_i\}$ are the right singular vectors.
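
A sketch of the full and compressed (thin) SVD with NumPy, for a small random matrix with N > D:

    import numpy as np

    X = np.random.default_rng(0).normal(size=(6, 3))        # N = 6 > D = 3
    U_full, s, Vt = np.linalg.svd(X, full_matrices=True)    # U_full: 6x6, s: length 3, Vt: 3x3
    U, s, Vt = np.linalg.svd(X, full_matrices=False)        # compressed: U is 6x3
    assert np.allclose((U * s) @ Vt, X)                     # X = U S V^T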

  19. Singular Value Decomposition (SVD) (optional). It is as if we are finding orthonormal bases $U$ and $V$ for $\mathbb{R}^N$ and $\mathbb{R}^D$, such that $X$ simply scales the $i$-th basis vector of $\mathbb{R}^D$ and maps it to the $i$-th basis vector of $\mathbb{R}^N$. (figure: an $N = D = 2$ example showing the rotation $V^\top$ and the scaling $S$ by $s_1$ and $s_2$)

  20. Singular value & eigenvalue decomposition. Recall that for PCA we used the eigenvalue decomposition of $\Sigma = \frac{1}{N} X^\top X$. How does it relate to SVD? $X^\top X = (U S V^\top)^\top (U S V^\top) = V S^\top U^\top U S V^\top = V S^2 V^\top$. Comparing to $\frac{1}{N} X^\top X = Q \Lambda Q^\top$: the eigenvectors of $\Sigma$ are the right singular vectors of $X$, i.e. $Q = V$ (and $\Lambda = \frac{1}{N} S^2$). So for PCA we could use the SVD.
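
A quick numerical check of this relation (my own sketch, random placeholder data): the eigenvalues of $\frac{1}{N} X^\top X$ should equal $s_i^2 / N$, and the eigenvectors should match the right singular vectors up to sign.

    import numpy as np

    X = np.random.default_rng(0).normal(size=(100, 5))
    X = X - X.mean(axis=0)
    N = X.shape[0]
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    lam, Q = np.linalg.eigh(X.T @ X / N)
    lam, Q = lam[::-1], Q[:, ::-1]                  # sort into descending order
    assert np.allclose(lam, s ** 2 / N)             # eigenvalues are squared singular values / N
    assert np.allclose(np.abs(Q), np.abs(Vt.T))     # eigenvectors match V, up to sign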

  21. Picking the number of PCs (optional). The number of PCs in PCA is a hyper-parameter; how should we choose it? Each new principal direction explains some variance in the data, $a_d = \frac{1}{N} \sum_n z_d^{(n)2}$, and by definition of PCA we have $a_1 \geq a_2 \geq \dots \geq a_D$. We can divide by the total variance to get a ratio $r_i = \frac{a_i}{\sum_d a_d}$. Example: for our digits example, plotting the cumulative sum of the variance ratios up to a PC shows that we can explain 90% of the variance in the data using 100 PCs; the first few principal directions explain most of the variance in the data!
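
A sketch of these variance ratios, assuming Z is the N x D matrix of all PC coefficients of centered data (names are illustrative):

    import numpy as np

    def variance_ratios(Z):
        a = (Z ** 2).mean(axis=0)      # a_d = (1/N) sum_n (z_d^(n))^2
        return a / a.sum()             # r_d = a_d / sum of all a_d

    # cumulative explained variance, e.g. to find how many PCs reach 90%:
    # np.cumsum(variance_ratios(Z))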

  22. Picking the number of PCs (optional). Recall that for picking the principal direction we maximized the variance of the PC: $\max_{\lVert q \rVert = 1} q^\top \Sigma q = \max_{\lVert q \rVert = 1} \frac{1}{N} q^\top X^\top X q = \max_{\lVert q \rVert = 1} q^\top Q \Lambda Q^\top q = \lambda_1$. So the variance ratios are also given by $r_i = \frac{\lambda_i}{\sum_d \lambda_d}$, and we can also use the eigenvalues to pick the number of PCs. Digits example: the two estimates of the variance ratios do match.
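
A small check (with random placeholder data) that the two estimates of the variance ratios agree, as the slide notes for the digits example:

    import numpy as np

    X = np.random.default_rng(0).normal(size=(300, 8))
    Xc = X - X.mean(axis=0)
    lam, Q = np.linalg.eigh(Xc.T @ Xc / Xc.shape[0])
    lam, Q = lam[::-1], Q[:, ::-1]               # descending eigenvalues
    Z = Xc @ Q                                   # all PC coefficients
    a = (Z ** 2).mean(axis=0)                    # variance along each PC
    assert np.allclose(a / a.sum(), lam / lam.sum())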

  23. Matrix factorization. PCA and SVD perform matrix factorization: $X \approx (X Q_{:,:D'}) \, Q_{:,:D'}^\top = Z \, Q_{:,:D'}^\top$, where $Z = X Q_{:,:D'}$ is the $N \times D'$ factor matrix of low-dimensional features (the PC coefficients), and $Q_{:,:D'}^\top$ is the $D' \times D$ factor loading matrix, whose rows are orthonormal (the principal components). This gives a low-rank approximation to our original matrix $X$: we can use it to compress the matrix, or to get a "smooth" reconstruction of $X$ (remove noise or fill in missing values).
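
A sketch of such a rank-D' approximation via the truncated SVD (the matrix and D' below are placeholders):

    import numpy as np

    X = np.random.default_rng(0).normal(size=(50, 20))
    d_prime = 3
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_lowrank = (U[:, :d_prime] * s[:d_prime]) @ Vt[:d_prime, :]   # rank-D' approximation of X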
