Principal component analysis

DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science
https://cims.nyu.edu/~cfgranda/pages/MTDS_spring20/index.html
Carlos Fernandez-Granda
Outline: Discussion, Covariance matrix, The spectral theorem, Principal component analysis, Dimensionality reduction via PCA, Gaussian random vectors


  1. Step By the induction assumption there exist γ_1, ..., γ_{d−1} and w_1, ..., w_{d−1} such that
      γ_1 = max_{||y||_2 = 1} y^T B y,   w_1 = arg max_{||y||_2 = 1} y^T B y,
      γ_k = max_{||y||_2 = 1, y ⊥ w_1, ..., w_{k−1}} y^T B y,   w_k = arg max_{||y||_2 = 1, y ⊥ w_1, ..., w_{k−1}} y^T B y,   2 ≤ k ≤ d − 1

  2. Step For any x ∈ span(u_1)^⊥, x = V_⊥ y for some y ∈ R^{d−1}, so
      max_{||x||_2 = 1, x ⊥ u_1} x^T A x = max_{||x||_2 = 1, x ⊥ u_1} x^T (A − λ_1 u_1 u_1^T) x
      = max_{||x||_2 = 1, x ⊥ u_1} x^T V_⊥ V_⊥^T (A − λ_1 u_1 u_1^T) V_⊥ V_⊥^T x
      = max_{||y||_2 = 1} y^T B y   (with B := V_⊥^T (A − λ_1 u_1 u_1^T) V_⊥)
      = γ_1
      Inspired by this, set u_k := V_⊥ w_{k−1} for k = 2, ..., d; then u_1, ..., u_d are an orthonormal basis

  3. Step: eigenvectors
      A u_k = V_⊥ V_⊥^T (A − λ_1 u_1 u_1^T) V_⊥ V_⊥^T V_⊥ w_{k−1} = V_⊥ B w_{k−1} = γ_{k−1} V_⊥ w_{k−1} = λ_k u_k
      so u_k is an eigenvector of A with eigenvalue λ_k := γ_{k−1}

  4. Step Let x ∈ span(u_1)^⊥ be orthogonal to u_{k′}, where 2 ≤ k′ ≤ d. Then there is y ∈ R^{d−1} such that x = V_⊥ y and
      w_{k′−1}^T y = w_{k′−1}^T V_⊥^T V_⊥ y = u_{k′}^T x = 0

  5. Step: eigenvalues Let x ∈ span(u_1)^⊥ be orthogonal to u_{k′}, where 2 ≤ k′ ≤ d. Then there is y ∈ R^{d−1} such that x = V_⊥ y and w_{k′−1}^T y = 0, so
      max_{||x||_2 = 1, x ⊥ u_1, ..., u_{k−1}} x^T A x = max_{||x||_2 = 1, x ⊥ u_1, ..., u_{k−1}} x^T V_⊥ V_⊥^T (A − λ_1 u_1 u_1^T) V_⊥ V_⊥^T x
      = max_{||y||_2 = 1, y ⊥ w_1, ..., w_{k−2}} y^T B y = γ_{k−1} = λ_k

  6. Covariance matrix The spectral theorem Principal component analysis Dimensionality reduction via PCA Gaussian random vectors

  7. Spectral theorem If A ∈ R^{d×d} is symmetric, then it has an eigendecomposition
      A = [u_1 u_2 ··· u_d] diag(λ_1, λ_2, ..., λ_d) [u_1 u_2 ··· u_d]^T
      The eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_d are real
      The eigenvectors u_1, u_2, ..., u_d are real and orthogonal
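As a quick numerical illustration (not part of the slides), the following NumPy sketch builds a random symmetric matrix, computes its eigendecomposition with numpy.linalg.eigh, and checks that A = U diag(λ) U^T with orthonormal eigenvectors.

```python
import numpy as np

# build a random symmetric matrix and check the spectral theorem numerically
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                       # symmetrize so that A is symmetric

eigvals, U = np.linalg.eigh(A)          # eigh returns eigenvalues in ascending order
eigvals, U = eigvals[::-1], U[:, ::-1]  # reorder so that lambda_1 >= ... >= lambda_d

print(np.allclose(A, U @ np.diag(eigvals) @ U.T))  # A = U diag(lambda) U^T -> True
print(np.allclose(U.T @ U, np.eye(5)))             # eigenvectors are orthonormal -> True
```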

  8. Variance in the direction of a fixed vector v If the random vector x̃ has covariance matrix Σ_x̃, then
      Var(v^T x̃) = v^T Σ_x̃ v
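This identity is easy to check by simulation. A minimal sketch, assuming a made-up covariance matrix and direction (neither is from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                # assumed covariance matrix of the random vector
v = np.array([0.6, -0.8])                     # a fixed unit-norm direction

# draw many samples of a zero-mean random vector with covariance Sigma
x = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)

print(np.var(x @ v))                          # empirical variance of v^T x
print(v @ Sigma @ v)                          # v^T Sigma v (should be close)
```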

  9. Principal directions Let u_1, ..., u_d and λ_1 > ... > λ_d be the eigenvectors/eigenvalues of Σ_x̃
      λ_1 = max_{||v||_2 = 1} Var(v^T x̃)
      u_1 = arg max_{||v||_2 = 1} Var(v^T x̃)
      λ_k = max_{||v||_2 = 1, v ⊥ u_1, ..., u_{k−1}} Var(v^T x̃),   2 ≤ k ≤ d
      u_k = arg max_{||v||_2 = 1, v ⊥ u_1, ..., u_{k−1}} Var(v^T x̃),   2 ≤ k ≤ d

  10. Principal components Let c(x̃) := x̃ − E(x̃)
      p̃c[i] := u_i^T c(x̃),   1 ≤ i ≤ d,   is the i-th principal component
      Var(p̃c[i]) = λ_i,   1 ≤ i ≤ d

  11. Principal components are uncorrelated For i ≠ j,
      E(p̃c[i] p̃c[j]) = E(u_i^T c(x̃) u_j^T c(x̃)) = u_i^T E(c(x̃) c(x̃)^T) u_j = u_i^T Σ_x̃ u_j = λ_i u_i^T u_j = 0

  12. Principal components For a dataset X containing x_1, x_2, ..., x_n ∈ R^d:
      1. Compute the sample covariance matrix Σ_X
      2. The eigendecomposition of Σ_X yields the principal directions u_1, ..., u_d
      3. Center the data and compute the principal components pc_i[j] := u_j^T c(x_i), 1 ≤ i ≤ n, 1 ≤ j ≤ d, where c(x_i) := x_i − av(X)
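A minimal NumPy sketch of these three steps (function and variable names are my own, not from the slides):

```python
import numpy as np

def pca(X):
    """PCA of a dataset X with rows x_1, ..., x_n in R^d.

    Returns the principal directions (columns of U, in decreasing eigenvalue
    order), the eigenvalues, and the principal components pc[i, j]."""
    av = X.mean(axis=0)                    # av(X): sample mean
    C = X - av                             # centered data c(x_i)
    Sigma = C.T @ C / X.shape[0]           # sample covariance matrix Sigma_X
    eigvals, U = np.linalg.eigh(Sigma)     # ascending order
    order = np.argsort(eigvals)[::-1]      # sort into lambda_1 >= ... >= lambda_d
    eigvals, U = eigvals[order], U[:, order]
    pc = C @ U                             # pc[i, j] = u_j^T c(x_i)
    return U, eigvals, pc

# example on 200 synthetic two-dimensional points
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
U, eigvals, pc = pca(X)
print(eigvals)                             # sample variances along the principal directions
print(np.var(pc, axis=0))                  # matches eigvals up to numerical error
```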

  13. First principal direction [figure: data plotted as centered longitude vs. centered latitude, with the first principal direction]

  14. First principal component [figure: density of the first principal component]

  15. Second principal direction [figure: data plotted as centered longitude vs. centered latitude, with the second principal direction]

  16. Second principal component [figure: density of the second principal component]

  17. Sample variance in the direction of a fixed vector v
      var(P_v X) = v^T Σ_X v

  18. Principal directions Let u_1, ..., u_d and λ_1 > ... > λ_d be the eigenvectors/eigenvalues of Σ_X
      λ_1 = max_{||v||_2 = 1} var(P_v X)
      u_1 = arg max_{||v||_2 = 1} var(P_v X)
      λ_k = max_{||v||_2 = 1, v ⊥ u_1, ..., u_{k−1}} var(P_v X),   2 ≤ k ≤ d
      u_k = arg max_{||v||_2 = 1, v ⊥ u_1, ..., u_{k−1}} var(P_v X),   2 ≤ k ≤ d

  19. Sample variance = 229 (sample std = 15.1) [figure: centered longitude vs. centered latitude scatter with the selected direction]

  20. Sample variance = 229 (sample std = 15.1) [figure: density of the component in the selected direction]

  21. Sample variance = 531 (sample std = 23.1) [figure: centered longitude vs. centered latitude scatter with the first principal direction]

  22. Sample variance = 531 (sample std = 23.1) [figure: density of the first principal component]

  23. Sample variance = 46.2 (sample std = 6.80) [figure: centered longitude vs. centered latitude scatter with the second principal direction]

  24. Sample variance = 46.2 (sample std = 6.80) [figure: density of the second principal component]

  25. PCA of faces Data set of 400 64 × 64 images from 40 subjects (10 per subject) Each face is vectorized and interpreted as a vector in R 4096

  26. PCA of faces [figure: images of the center and principal directions PD 1–PD 5, with the values 330, 251, 192, 152, 130 displayed below them]

  27. PCA of faces [figure: images of principal directions PD 10, 15, 20, 30, 40, 50, with the values 90.2, 70.8, 58.7, 45.1, 36.0, 30.8 displayed below them]

  28. PCA of faces [figure: images of principal directions PD 100, 150, 200, 250, 300, 359, with the values 19.0, 13.7, 10.3, 8.01, 6.14, 3.06 displayed below them]

  29. Covariance matrix The spectral theorem Principal component analysis Dimensionality reduction via PCA Gaussian random vectors

  30. Dimensionality reduction Data with a large number of features can be difficult to analyze or process. Dimensionality reduction is a useful preprocessing step. If data are modeled as vectors in R^p, we can reduce the dimension by projecting onto R^k, where k < p. For orthogonal projections, the new representation is ⟨v_1, x⟩, ⟨v_2, x⟩, ..., ⟨v_k, x⟩ for a basis v_1, ..., v_k of the subspace that we project onto. Problem: how do we choose the subspace? Possible criterion: capture as much sample variance as possible

  31. Captured variance For any orthonormal v_1, ..., v_k
      ∑_{i=1}^{k} var(P_{v_i} X) = ∑_{i=1}^{k} (1/n) ∑_{j=1}^{n} v_i^T c(x_j) c(x_j)^T v_i = ∑_{i=1}^{k} v_i^T Σ_X v_i
      By the spectral theorem, the eigenvectors optimize each individual term

  32. Eigenvectors also optimize the sum For any symmetric A ∈ R^{d×d} with eigenvectors u_1, ..., u_k,
      ∑_{i=1}^{k} u_i^T A u_i ≥ ∑_{i=1}^{k} v_i^T A v_i
      for any k orthonormal vectors v_1, ..., v_k

  33. Proof by induction on k Base case (k = 1): follows from the spectral theorem

  34. Step Let S := span(v_1, ..., v_k). For any orthonormal basis b_1, ..., b_k of S, VV^T = BB^T (where V := [v_1 ··· v_k] and B := [b_1 ··· b_k]), so the choice of basis does not change the cost function:
      ∑_{i=1}^{k} v_i^T A v_i = trace(V^T A V) = trace(A V V^T) = trace(A B B^T) = ∑_{i=1}^{k} b_i^T A b_i
      Let's choose wisely

  35. Step We choose a unit-norm b ∈ S orthogonal to u_1, ..., u_{k−1} (such a b exists because dim(S) = k while u_1, ..., u_{k−1} span only a (k−1)-dimensional subspace). By the spectral theorem, u_k^T A u_k ≥ b^T A b. Now choose an orthonormal basis b_1, b_2, ..., b_k for S so that b_k := b. By the induction assumption,
      ∑_{i=1}^{k−1} u_i^T A u_i ≥ ∑_{i=1}^{k−1} b_i^T A b_i

  36. Conclusion For any k orthonormal vectors v_1, ..., v_k,
      ∑_{i=1}^{k} var(pc[i]) ≥ ∑_{i=1}^{k} var(P_{v_i} X),
      where pc[i] := {pc_1[i], ..., pc_n[i]} = P_{u_i} X
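A small numerical check of this inequality on synthetic data (my own illustration, not from the slides): the variance captured by the first k principal directions is compared to the variance captured by k random orthonormal directions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 6, 500, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
C = X - X.mean(axis=0)
Sigma = C.T @ C / n                               # sample covariance matrix

eigvals, U = np.linalg.eigh(Sigma)
eigvals, U = eigvals[::-1], U[:, ::-1]            # descending eigenvalue order

pca_captured = eigvals[:k].sum()                  # variance captured by the first k PDs

V, _ = np.linalg.qr(rng.standard_normal((d, k)))  # k random orthonormal directions
random_captured = np.trace(V.T @ Sigma @ V)       # variance they capture

print(pca_captured >= random_captured)            # True
```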

  37. Faces
      x_i^{reduced} := av(X) + ∑_{j=1}^{7} pc_i[j] u_j
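In code, this rank-7 approximation amounts to projecting the centered data onto the first principal directions and adding the mean back. A sketch in the notation of slide 12 (the function name is mine):

```python
import numpy as np

def reconstruct(X, U, k):
    """Approximate each row of X (one data point per row) by its projection
    onto the first k principal directions (columns of U), plus the mean."""
    av = X.mean(axis=0)
    pc = (X - av) @ U[:, :k]           # pc_i[j] = u_j^T c(x_i), j = 1, ..., k
    return av + pc @ U[:, :k].T        # av(X) + sum_j pc_i[j] u_j

# e.g. X_reduced = reconstruct(X_faces, U, 7) for the face example above
```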

  38. Projection onto the first 7 principal directions [figure: a face expressed as a weighted combination of the center image and PD 1–PD 7; displayed coefficients: 8613, −2459, +665, −180, +301, +566, +638, +403]

  39. Projection onto the first k principal directions [figure: original signal and its projections onto the first 5, 10, 20, 30, 50, 100, 150, 200, 250, 300, and 359 principal directions]

  40. Nearest-neighbor classification Training set of points and labels {x_1, l_1}, ..., {x_n, l_n}. To classify a new data point y, find
      i* := arg min_{1 ≤ i ≤ n} ||y − x_i||_2
      and assign l_{i*} to y. Cost: O(nd) to classify a new point

  41. Nearest neighbors in principal-component space Idea: project onto the first k principal directions beforehand. Cost reduced to O(nk). Computing the eigendecomposition is costly, but it only needs to be done once

  42. Face recognition Training set: 360 64 × 64 images from 40 different subjects (9 each). Test set: 1 new image from each subject. We model each image as a vector in R^4096 (d = 4096). To classify, we: 1. Project onto the first k principal directions 2. Apply nearest-neighbor classification using the ℓ2-norm distance in R^k (a sketch of this pipeline follows below)
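A minimal sketch of this two-step pipeline (names are my own; the principal directions U are assumed to come from the eigendecomposition of the training set's sample covariance, e.g. as in the sketch after slide 12):

```python
import numpy as np

def classify(X_train, labels, X_test, U, k):
    """Project training and test images onto the first k principal directions,
    then give each test image the label of its nearest training neighbor
    (Euclidean distance in R^k). `labels` is an array of length n."""
    av = X_train.mean(axis=0)
    P = U[:, :k]                               # first k principal directions
    Z_train = (X_train - av) @ P               # training set represented in R^k
    Z_test = (X_test - av) @ P                 # test set represented in R^k
    # pairwise squared distances between test and training projections
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=-1)
    return labels[np.argmin(d2, axis=1)]       # label of the nearest neighbor
```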

  43. Performance [figure: number of classification errors (0–30) vs. number of principal components (10–100)]

  44. Nearest neighbor in R^41 [figure: test image, its projection, the closest projection, and the corresponding training image]

  45. Dimensionality reduction for visualization Motivation: visualize high-dimensional features projected onto 2D or 3D. Example: seeds from three different varieties of wheat: Kama, Rosa, and Canadian. Features: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, length of kernel groove
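A sketch of how such a 2D visualization can be produced (my own illustration; the data below is a synthetic placeholder, not the wheat-seed dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_first_two_pcs(X, labels):
    """Scatter plot of the data projected onto the first two principal directions."""
    C = X - X.mean(axis=0)
    _, U = np.linalg.eigh(C.T @ C / len(X))
    U = U[:, ::-1]                                 # descending eigenvalue order
    Z = C @ U[:, :2]                               # first two principal components
    for lab in np.unique(labels):
        m = labels == lab
        plt.scatter(Z[m, 0], Z[m, 1], label=str(lab))
    plt.xlabel("First principal component")
    plt.ylabel("Second principal component")
    plt.legend()
    plt.show()

# placeholder demo with synthetic 7-dimensional "seed" features
rng = np.random.default_rng(4)
X = np.vstack([rng.standard_normal((50, 7)) + 3 * k for k in range(3)])
labels = np.repeat(np.array(["Kama", "Rosa", "Canadian"]), 50)
plot_first_two_pcs(X, labels)
```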

  46. Projection onto the first two PDs [figure: scatter of the first vs. second principal component]

  47. Projection onto the last two PDs [figure: scatter of the (d−1)-th vs. d-th principal component]

  48. Covariance matrix The spectral theorem Principal component analysis Dimensionality reduction via PCA Gaussian random vectors

  49. Gaussian random variables The pdf of a Gaussian or normal random variable ã with mean µ and standard deviation σ is given by
      f_ã(a) = (1 / (√(2π) σ)) exp(−(a − µ)² / (2σ²))

  50. Gaussian random variables [figure: pdfs f_ã(a) for (µ = 2, σ = 1), (µ = 0, σ = 2), and (µ = 0, σ = 4)]

  51. Gaussian random variables
      µ = ∫_{−∞}^{∞} a f_ã(a) da
      σ² = ∫_{−∞}^{∞} (a − µ)² f_ã(a) da

  52. Linear transformation of a Gaussian If ã is a Gaussian random variable with mean µ and standard deviation σ, then for any α, β ∈ R,
      b̃ := αã + β
      is a Gaussian random variable with mean αµ + β and standard deviation |α|σ
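A quick simulation (not in the slides) that checks the stated mean and standard deviation of b̃ = αã + β for made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 2.0, 1.5
alpha, beta = -3.0, 4.0

a = rng.normal(mu, sigma, size=500_000)   # samples of the Gaussian a
b = alpha * a + beta                      # linearly transformed samples

print(b.mean(), alpha * mu + beta)        # both approximately -2.0
print(b.std(), abs(alpha) * sigma)        # both approximately 4.5
```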

  53. Proof Let α > 0 (the proof for α < 0 is very similar).
      F_b̃(b) = P(b̃ ≤ b) = P(αã + β ≤ b) = P(ã ≤ (b − β)/α)
      = ∫_{−∞}^{(b−β)/α} (1 / (√(2π) σ)) exp(−(a − µ)² / (2σ²)) da
      = ∫_{−∞}^{b} (1 / (√(2π) ασ)) exp(−(w − αµ − β)² / (2α²σ²)) dw     (change of variables w := αa + β)
      Differentiating with respect to b:
      f_b̃(b) = (1 / (√(2π) ασ)) exp(−(b − αµ − β)² / (2α²σ²))

  54. Gaussian random vector A Gaussian random vector x̃ is a random vector with joint pdf
      f_x̃(x) = (1 / √((2π)^d |Σ|)) exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))
      where µ ∈ R^d is the mean and Σ ∈ R^{d×d} the covariance matrix; Σ is positive definite (positive eigenvalues)
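As a sanity check (my own, not from the slides), the formula can be evaluated directly and compared against scipy.stats.multivariate_normal for made-up values of µ, Σ, and x:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])                     # assumed mean
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # assumed covariance (positive definite)
x = np.array([0.3, 0.7])                       # point at which to evaluate the pdf

d = len(mu)
diff = x - mu
pdf = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
      / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

print(pdf)                                               # direct evaluation of the formula
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))    # should match
```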

  55. Contour surfaces Set of points at which the pdf is constant (assuming µ = 0):
      c = x^T Σ^{−1} x = x^T U Λ^{−1} U^T x = ∑_{i=1}^{d} (u_i^T x)² / λ_i
      This is an ellipsoid with axes proportional to √λ_i

  56. 2D example µ = 0,
      Σ = [ 0.5  −0.3 ; −0.3  0.5 ]
      λ_1 = 0.8,  λ_2 = 0.2
      u_1 = [ 1/√2, −1/√2 ]^T,  u_2 = [ 1/√2, 1/√2 ]^T
      What does the ellipse look like?
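The eigendecomposition can be verified numerically; a minimal check (my own) using numpy.linalg.eigh. Combined with the previous slide, it also answers the question: the contours are ellipses with long axis along u_1 (proportional to √0.8) and short axis along u_2 (proportional to √0.2).

```python
import numpy as np

Sigma = np.array([[0.5, -0.3],
                  [-0.3, 0.5]])
eigvals, U = np.linalg.eigh(Sigma)      # ascending order: 0.2, 0.8
print(eigvals[::-1])                    # [0.8, 0.2] = lambda_1, lambda_2
print(U[:, ::-1])                       # columns are u_1, u_2 up to sign:
                                        # (1, -1)/sqrt(2) and (1, 1)/sqrt(2)
```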

  57. Contour surfaces [figure: contour lines of the 2D pdf at levels 10^{−4}, 10^{−2}, 10^{−1}, 0.24, and 0.37, over x[1] and x[2] in [−1.5, 1.5]]
