Step
By the induction assumption there exist $\gamma_1, \dots, \gamma_{d-1}$ and $w_1, \dots, w_{d-1}$ such that
$$\gamma_1 = \max_{\|y\|_2 = 1} y^T B y, \qquad w_1 = \arg\max_{\|y\|_2 = 1} y^T B y,$$
$$\gamma_k = \max_{\|y\|_2 = 1,\ y \perp w_1, \dots, w_{k-1}} y^T B y, \qquad w_k = \arg\max_{\|y\|_2 = 1,\ y \perp w_1, \dots, w_{k-1}} y^T B y, \qquad 2 \le k \le d-1$$
Step
For any $x \in \operatorname{span}(u_1)^\perp$, $x = V_\perp y$ for some $y \in \mathbb{R}^{d-1}$, so
$$\max_{\|x\|_2 = 1,\ x \perp u_1} x^T A x = \max_{\|x\|_2 = 1,\ x \perp u_1} x^T (A - \lambda_1 u_1 u_1^T) x = \max_{\|x\|_2 = 1,\ x \perp u_1} x^T V_\perp V_\perp^T (A - \lambda_1 u_1 u_1^T) V_\perp V_\perp^T x = \max_{\|y\|_2 = 1} y^T B y = \gamma_1$$
Inspired by this, define $u_k := V_\perp w_{k-1}$ for $k = 2, \dots, d$; then $u_1, \dots, u_d$ form an orthonormal basis.
Step: eigenvectors
$$A u_k = V_\perp V_\perp^T (A - \lambda_1 u_1 u_1^T) V_\perp V_\perp^T V_\perp w_{k-1} = V_\perp B w_{k-1} = \gamma_{k-1} V_\perp w_{k-1} = \lambda_k u_k$$
so $u_k$ is an eigenvector of $A$ with eigenvalue $\lambda_k := \gamma_{k-1}$.
Step: eigenvalues
Let $x \in \operatorname{span}(u_1)^\perp$ be orthogonal to $u_{k'}$, where $2 \le k' \le d$. There is $y \in \mathbb{R}^{d-1}$ such that $x = V_\perp y$ and
$$w_{k'-1}^T y = w_{k'-1}^T V_\perp^T V_\perp y = u_{k'}^T x = 0$$
Therefore
$$\max_{\|x\|_2 = 1,\ x \perp u_1, \dots, u_{k-1}} x^T A x = \max_{\|x\|_2 = 1,\ x \perp u_1, \dots, u_{k-1}} x^T V_\perp V_\perp^T (A - \lambda_1 u_1 u_1^T) V_\perp V_\perp^T x = \max_{\|y\|_2 = 1,\ y \perp w_1, \dots, w_{k-2}} y^T B y = \gamma_{k-1} = \lambda_k$$
Covariance matrix The spectral theorem Principal component analysis Dimensionality reduction via PCA Gaussian random vectors
Spectral theorem
If $A \in \mathbb{R}^{d \times d}$ is symmetric, then it has an eigendecomposition
$$A = \begin{bmatrix} u_1 & u_2 & \cdots & u_d \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{bmatrix} \begin{bmatrix} u_1 & u_2 & \cdots & u_d \end{bmatrix}^T$$
Eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ are real
Eigenvectors $u_1, u_2, \dots, u_d$ are real and orthogonal
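As a quick numerical sketch (not part of the slides), the eigendecomposition of a symmetric matrix can be computed with NumPy's `np.linalg.eigh`; it returns eigenvalues in ascending order, so we reorder them to match the convention $\lambda_1 \ge \dots \ge \lambda_d$.

```python
import numpy as np

# Sketch: eigendecomposition of a symmetric matrix with NumPy.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                      # symmetrize to obtain a symmetric A

eigvals, eigvecs = np.linalg.eigh(A)   # eigh is for symmetric/Hermitian matrices
order = np.argsort(eigvals)[::-1]      # reorder so that lambda_1 >= ... >= lambda_d
lam, U = eigvals[order], eigvecs[:, order]

# Check A = U diag(lam) U^T and that U has orthonormal columns
assert np.allclose(A, U @ np.diag(lam) @ U.T)
assert np.allclose(U.T @ U, np.eye(4))
```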
Variance in direction of a fixed vector $v$
If a random vector $\tilde{x}$ has covariance matrix $\Sigma_{\tilde{x}}$, then
$$\operatorname{Var}(v^T \tilde{x}) = v^T \Sigma_{\tilde{x}} v$$
Principal directions
Let $u_1, \dots, u_d$ and $\lambda_1 > \dots > \lambda_d$ be the eigenvectors/eigenvalues of $\Sigma_{\tilde{x}}$. Then
$$\lambda_1 = \max_{\|v\|_2 = 1} \operatorname{Var}(v^T \tilde{x}), \qquad u_1 = \arg\max_{\|v\|_2 = 1} \operatorname{Var}(v^T \tilde{x}),$$
$$\lambda_k = \max_{\|v\|_2 = 1,\ v \perp u_1, \dots, u_{k-1}} \operatorname{Var}(v^T \tilde{x}), \qquad u_k = \arg\max_{\|v\|_2 = 1,\ v \perp u_1, \dots, u_{k-1}} \operatorname{Var}(v^T \tilde{x}), \qquad 2 \le k \le d$$
Principal components
Let $c(\tilde{x}) := \tilde{x} - \operatorname{E}(\tilde{x})$. Then
$$\operatorname{pc}[i] := u_i^T c(\tilde{x}), \qquad 1 \le i \le d,$$
is the $i$th principal component, and $\operatorname{Var}(\operatorname{pc}[i]) = \lambda_i$ for $1 \le i \le d$.
Principal components are uncorrelated
For $i \ne j$,
$$\operatorname{E}(\operatorname{pc}[i]\,\operatorname{pc}[j]) = \operatorname{E}\big(u_i^T c(\tilde{x})\, c(\tilde{x})^T u_j\big) = u_i^T \operatorname{E}\big(c(\tilde{x})\, c(\tilde{x})^T\big) u_j = u_i^T \Sigma_{\tilde{x}} u_j = \lambda_i\, u_i^T u_j = 0$$
Principal components
For a dataset $X$ containing $x_1, x_2, \dots, x_n \in \mathbb{R}^d$:
1. Compute the sample covariance matrix $\Sigma_X$
2. The eigendecomposition of $\Sigma_X$ yields the principal directions $u_1, \dots, u_d$
3. Center the data and compute the principal components
$$\operatorname{pc}_i[j] := u_j^T c(x_i), \qquad 1 \le i \le n, \ 1 \le j \le d, \qquad \text{where } c(x_i) := x_i - \operatorname{av}(X)$$
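A minimal NumPy sketch of these three steps (the function name and the toy data are my own; the sample covariance uses the $1/n$ normalization that appears later in the captured-variance slide):

```python
import numpy as np

def pca(X):
    """X: n x d data matrix (one sample per row). Returns principal
    directions (columns of U), eigenvalues, and principal components."""
    n = X.shape[0]
    c = X - X.mean(axis=0)               # 1. center the data, c(x_i) = x_i - av(X)
    Sigma = c.T @ c / n                  # 2. sample covariance matrix (1/n convention)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]    # sort eigenvalues in decreasing order
    lam, U = eigvals[order], eigvecs[:, order]
    pc = c @ U                           # 3. pc[i, j] = u_j^T c(x_i)
    return U, lam, pc

# The sample variance of each principal component matches the eigenvalues
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3)) @ np.diag([3.0, 1.0, 0.3])
U, lam, pc = pca(X)
print(np.allclose(pc.var(axis=0), lam))  # True
```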
First principal direction (figure: centered data, centered longitude vs. centered latitude, with the first principal direction overlaid)
First principal component (figure: density of the first principal component)
Second principal direction (figure: centered data, centered longitude vs. centered latitude, with the second principal direction overlaid)
Second principal component (figure: density of the second principal component)
Sample variance in direction of a fixed vector $v$
$$\operatorname{var}(P_v X) = v^T \Sigma_X v$$
Principal directions
Let $u_1, \dots, u_d$ and $\lambda_1 > \dots > \lambda_d$ be the eigenvectors/eigenvalues of $\Sigma_X$. Then
$$\lambda_1 = \max_{\|v\|_2 = 1} \operatorname{var}(P_v X), \qquad u_1 = \arg\max_{\|v\|_2 = 1} \operatorname{var}(P_v X),$$
$$\lambda_k = \max_{\|v\|_2 = 1,\ v \perp u_1, \dots, u_{k-1}} \operatorname{var}(P_v X), \qquad u_k = \arg\max_{\|v\|_2 = 1,\ v \perp u_1, \dots, u_{k-1}} \operatorname{var}(P_v X), \qquad 2 \le k \le d$$
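A small numerical check of both claims on synthetic data (all names are illustrative): the sample variance of the component along a unit vector $v$ equals $v^T \Sigma_X v$, and no unit vector beats the first principal direction.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
c = X - X.mean(axis=0)
Sigma = c.T @ c / X.shape[0]

# var(P_v X) = v^T Sigma_X v for a fixed unit vector v
v = np.array([0.6, 0.8])
proj = c @ v                                  # components of the centered data along v
print(np.isclose(proj.var(), v @ Sigma @ v))  # True

# random unit vectors never capture more variance than the first principal direction
lam1 = np.linalg.eigh(Sigma)[0][-1]           # largest eigenvalue of Sigma_X
vs = rng.standard_normal((1000, 2))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)
print((np.einsum('ij,jk,ik->i', vs, Sigma, vs) <= lam1 + 1e-12).all())  # True
```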
Sample variance = 229 (sample std = 15.1) (figure: centered data, centered longitude vs. centered latitude, with the selected direction shown)
Sample variance = 229 (sample std = 15.1) (figure: density of the component in the selected direction)
Sample variance = 531 (sample std = 23.1) (figure: centered data, centered longitude vs. centered latitude, with the first principal direction shown)
Sample variance = 531 (sample std = 23.1) (figure: density of the first principal component)
Sample variance = 46.2 (sample std = 6.80) (figure: centered data, centered longitude vs. centered latitude, with the second principal direction shown)
Sample variance = 46.2 (sample std = 6.80) (figure: density of the second principal component)
PCA of faces
Data set of 400 $64 \times 64$ images from 40 subjects (10 per subject). Each face is vectorized and interpreted as a vector in $\mathbb{R}^{4096}$.
PCA of faces (figure: the center and the principal directions PD 1 through PD 5 shown as images, with the values 330, 251, 192, 152, 130 below them)
PCA of faces (figure: principal directions PD 10, PD 15, PD 20, PD 30, PD 40, PD 50 shown as images, with the values 90.2, 70.8, 58.7, 45.1, 36.0, 30.8 below them)
PCA of faces (figure: principal directions PD 100, PD 150, PD 200, PD 250, PD 300, PD 359 shown as images, with the values 19.0, 13.7, 10.3, 8.01, 6.14, 3.06 below them)
Covariance matrix The spectral theorem Principal component analysis Dimensionality reduction via PCA Gaussian random vectors
Dimensionality reduction
Data with a large number of features can be difficult to analyze or process. Dimensionality reduction is a useful preprocessing step.
If the data are modeled as vectors in $\mathbb{R}^p$, we can reduce the dimension by projecting onto $\mathbb{R}^k$, where $k < p$. For orthogonal projections, the new representation is
$$\langle v_1, x \rangle, \ \langle v_2, x \rangle, \ \dots, \ \langle v_k, x \rangle$$
for a basis $v_1, \dots, v_k$ of the subspace that we project onto.
Problem: How do we choose the subspace?
Possible criterion: Capture as much sample variance as possible
Captured variance
For any orthonormal $v_1, \dots, v_k$,
$$\sum_{i=1}^k \operatorname{var}(P_{v_i} X) = \sum_{i=1}^k \frac{1}{n} \sum_{j=1}^n v_i^T c(x_j)\, c(x_j)^T v_i = \sum_{i=1}^k v_i^T \Sigma_X v_i$$
By the spectral theorem, the eigenvectors optimize each individual term.
Eigenvectors also optimize the sum
For any symmetric $A \in \mathbb{R}^{d \times d}$ with leading eigenvectors $u_1, \dots, u_k$ (ordered so that $\lambda_1 \ge \dots \ge \lambda_d$),
$$\sum_{i=1}^k u_i^T A u_i \ge \sum_{i=1}^k v_i^T A v_i$$
for any $k$ orthonormal vectors $v_1, \dots, v_k$.
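Before the proof, a numerical sanity check of this inequality under stated assumptions (a random symmetric $A$ and random orthonormal $v_1, \dots, v_k$ generated via a QR factorization):

```python
import numpy as np

# Check: for symmetric A, the top-k eigenvectors maximize sum_i v_i^T A v_i
# over sets of k orthonormal vectors.
rng = np.random.default_rng(3)
d, k = 6, 3
M = rng.standard_normal((d, d))
A = (M + M.T) / 2

eigvals, eigvecs = np.linalg.eigh(A)
Uk = eigvecs[:, np.argsort(eigvals)[::-1][:k]]        # top-k eigenvectors
best = np.trace(Uk.T @ A @ Uk)                        # sum_i u_i^T A u_i

for _ in range(1000):
    V, _ = np.linalg.qr(rng.standard_normal((d, k)))  # k random orthonormal vectors
    assert np.trace(V.T @ A @ V) <= best + 1e-10
```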
Proof by induction on $k$
Base case ($k = 1$)? Follows from the spectral theorem.
Step
Let $S := \operatorname{span}(v_1, \dots, v_k)$. For any orthonormal basis $b_1, \dots, b_k$ of $S$,
$$VV^T = BB^T$$
so the choice of basis does not change the cost function:
$$\sum_{i=1}^k v_i^T A v_i = \operatorname{trace}\big(V^T A V\big) = \operatorname{trace}\big(A V V^T\big) = \operatorname{trace}\big(A B B^T\big) = \sum_{i=1}^k b_i^T A b_i$$
Let's choose wisely.
Step
We choose $b \in S$ orthogonal to $u_1, \dots, u_{k-1}$. By the spectral theorem,
$$u_k^T A u_k \ge b^T A b$$
Now choose an orthonormal basis $b_1, b_2, \dots, b_k$ for $S$ such that $b_k := b$. By the induction assumption,
$$\sum_{i=1}^{k-1} u_i^T A u_i \ge \sum_{i=1}^{k-1} b_i^T A b_i$$
Conclusion
For any $k$ orthonormal vectors $v_1, \dots, v_k$,
$$\sum_{i=1}^k \operatorname{var}(\operatorname{pc}[i]) \ge \sum_{i=1}^k \operatorname{var}(P_{v_i} X),$$
where $\operatorname{pc}[i] := \{\operatorname{pc}_1[i], \dots, \operatorname{pc}_n[i]\} = P_{u_i} X$
Faces
$$x_i^{\text{reduced}} := \operatorname{av}(X) + \sum_{j=1}^{7} \operatorname{pc}_i[j]\, u_j$$
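A sketch of this reconstruction on toy data (the dimensions and names are placeholders, not the faces data set): keep the first $k$ principal components of each point and map them back through the corresponding principal directions.

```python
import numpy as np

# Reconstruct each point from its first k principal components:
# x_i^reduced = av(X) + sum_{j <= k} pc_i[j] u_j   (k = 7 in the slide).
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 50))
k = 7

mean = X.mean(axis=0)
c = X - mean
Sigma = c.T @ c / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma)
U = eigvecs[:, np.argsort(eigvals)[::-1]]     # principal directions, decreasing eigenvalue

pc = c @ U[:, :k]                    # first k principal components of every point
X_reduced = mean + pc @ U[:, :k].T   # rank-k reconstruction
print(X_reduced.shape)               # (200, 50), same shape as X
```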
Projection onto first 7 principal directions (figure: the projection written as a weighted sum of the center and PD 1 through PD 7 images; displayed weights: 8613, −2459, +665, −180, +301, +566, +638, +403)
Projection onto first k principal directions (figure: the original image, labeled "Signal", alongside its projections onto the first 5, 10, 20, 30, 50, 100, 150, 200, 250, 300 and 359 principal directions)
Nearest-neighbor classification
Training set of points and labels $\{x_1, l_1\}, \dots, \{x_n, l_n\}$. To classify a new data point $y$, find
$$i^* := \arg\min_{1 \le i \le n} \|y - x_i\|_2,$$
and assign $l_{i^*}$ to $y$.
Cost: $O(nd)$ to classify a new point
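A minimal brute-force sketch of this classifier (the names and toy data are my own):

```python
import numpy as np

# Brute-force nearest-neighbor classification, O(n d) per query.
def nn_classify(X_train, labels, y):
    dists = np.linalg.norm(X_train - y, axis=1)   # ||y - x_i||_2 for every training point
    return labels[np.argmin(dists)]               # label of the closest training point

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
print(nn_classify(X_train, labels, np.array([4.5, 5.2])))  # 1
```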
Nearest neighbors in principal-component space
Idea: Project onto the first $k$ principal directions beforehand
Cost reduced to $O(nk)$
Computing the eigendecomposition is costly, but only needs to be done once
Face recognition
Training set: 360 $64 \times 64$ images from 40 different subjects (9 each)
Test set: 1 new image from each subject
We model each image as a vector in $\mathbb{R}^{4096}$ ($d = 4096$)
To classify we:
1. Project onto the first $k$ principal directions
2. Apply nearest-neighbor classification using the $\ell_2$-norm distance in $\mathbb{R}^k$
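Putting the two steps together, a sketch of the full pipeline on generic arrays (this is my own illustrative code, not the course implementation):

```python
import numpy as np

# Project onto the first k principal directions, then nearest neighbor in R^k.
def pca_nn_classify(X_train, labels, X_test, k):
    mean = X_train.mean(axis=0)
    c = X_train - mean
    Sigma = c.T @ c / X_train.shape[0]
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    Uk = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # first k principal directions

    train_pc = c @ Uk                                # training set represented in R^k
    test_pc = (X_test - mean) @ Uk                   # project test points with the same mean/Uk
    preds = []
    for p in test_pc:
        i_star = np.argmin(np.linalg.norm(train_pc - p, axis=1))
        preds.append(labels[i_star])
    return np.array(preds)
```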
Performance (figure: number of classification errors as a function of the number of principal components, from 10 to 100)
Nearest neighbor in $\mathbb{R}^{41}$ (figure: a test image, its projection, the closest projection in the training set, and the corresponding training image)
Dimensionality reduction for visualization
Motivation: Visualize high-dimensional features projected onto 2D or 3D
Example: Seeds from three different varieties of wheat: Kama, Rosa and Canadian
Features:
◮ Area
◮ Perimeter
◮ Compactness
◮ Length of kernel
◮ Width of kernel
◮ Asymmetry coefficient
◮ Length of kernel groove
Projection onto two first PDs (figure: scatter of the data in the plane of the first and second principal components)
Projection onto two last PDs (figure: scatter of the data in the plane of the (d−1)th and dth principal components)
Covariance matrix The spectral theorem Principal component analysis Dimensionality reduction via PCA Gaussian random vectors
Gaussian random variables
The pdf of a Gaussian or normal random variable $\tilde{a}$ with mean $\mu$ and standard deviation $\sigma$ is given by
$$f_{\tilde{a}}(a) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(a-\mu)^2}{2\sigma^2}}$$
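Writing the pdf out directly in NumPy (a sketch; the grid-based sum is only a rough check that the pdf integrates to 1):

```python
import numpy as np

# The Gaussian pdf, written directly from the formula above.
def gaussian_pdf(a, mu, sigma):
    return np.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Crude Riemann-sum check that the pdf integrates to approximately 1
grid = np.linspace(-50, 50, 200001)
print(gaussian_pdf(grid, mu=0.0, sigma=2.0).sum() * (grid[1] - grid[0]))  # ~1.0
```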
Gaussian random variables (figure: pdfs $f_{\tilde{a}}(a)$ for $\mu = 2, \sigma = 1$; $\mu = 0, \sigma = 2$; and $\mu = 0, \sigma = 4$)
Gaussian random variables
$$\mu = \int_{-\infty}^{\infty} a\, f_{\tilde{a}}(a)\, \mathrm{d}a, \qquad \sigma^2 = \int_{-\infty}^{\infty} (a - \mu)^2 f_{\tilde{a}}(a)\, \mathrm{d}a$$
Linear transformation of a Gaussian
If $\tilde{a}$ is a Gaussian random variable with mean $\mu$ and standard deviation $\sigma$, then for any $\alpha, \beta \in \mathbb{R}$,
$$\tilde{b} := \alpha \tilde{a} + \beta$$
is a Gaussian random variable with mean $\alpha\mu + \beta$ and standard deviation $|\alpha|\sigma$.
Proof
Let $\alpha > 0$ (the proof for $\alpha < 0$ is very similar). Then
$$F_{\tilde{b}}(b) = \operatorname{P}\big(\tilde{b} \le b\big) = \operatorname{P}(\alpha \tilde{a} + \beta \le b) = \operatorname{P}\left(\tilde{a} \le \frac{b - \beta}{\alpha}\right) = \int_{-\infty}^{\frac{b-\beta}{\alpha}} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(a-\mu)^2}{2\sigma^2}}\, \mathrm{d}a = \int_{-\infty}^{b} \frac{1}{\sqrt{2\pi}\,\alpha\sigma}\, e^{-\frac{(w-\alpha\mu-\beta)^2}{2\alpha^2\sigma^2}}\, \mathrm{d}w$$
using the change of variables $w := \alpha a + \beta$. Differentiating with respect to $b$:
$$f_{\tilde{b}}(b) = \frac{1}{\sqrt{2\pi}\,\alpha\sigma}\, e^{-\frac{(b-\alpha\mu-\beta)^2}{2\alpha^2\sigma^2}}$$
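An empirical check of this result by sampling (the parameter values are arbitrary):

```python
import numpy as np

# If a ~ N(mu, sigma^2), then alpha*a + beta should have mean alpha*mu + beta
# and standard deviation |alpha|*sigma.
rng = np.random.default_rng(5)
mu, sigma, alpha, beta = 1.0, 2.0, -3.0, 0.5
a = rng.normal(mu, sigma, size=1_000_000)
b = alpha * a + beta
print(b.mean(), alpha * mu + beta)          # both approximately -2.5
print(b.std(), abs(alpha) * sigma)          # both approximately 6.0
```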
Gaussian random vector
A Gaussian random vector $\tilde{x}$ is a random vector with joint pdf
$$f_{\tilde{x}}(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
where $\mu \in \mathbb{R}^d$ is the mean and $\Sigma \in \mathbb{R}^{d \times d}$ the covariance matrix, which is positive definite (positive eigenvalues).
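A direct NumPy sketch of this joint pdf, evaluated with the covariance matrix of the 2D example that follows (the function name is my own):

```python
import numpy as np

# The joint pdf of a Gaussian random vector, written from the formula above.
def gaussian_vector_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

mu = np.zeros(2)
Sigma = np.array([[0.5, -0.3], [-0.3, 0.5]])            # covariance of the 2D example below
print(gaussian_vector_pdf(mu, mu, Sigma))               # pdf at the mean, ~0.398
```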
Contour surfaces
The set of points at which the pdf is constant satisfies (assuming $\mu = 0$)
$$c = x^T \Sigma^{-1} x = x^T U \Lambda^{-1} U^T x = \sum_{i=1}^d \frac{(u_i^T x)^2}{\lambda_i},$$
an ellipsoid with axes proportional to $\sqrt{\lambda_i}$.
2D example
$$\mu = 0, \qquad \Sigma = \begin{bmatrix} 0.5 & -0.3 \\ -0.3 & 0.5 \end{bmatrix}$$
$$\lambda_1 = 0.8, \quad u_1 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}, \qquad \lambda_2 = 0.2, \quad u_2 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}$$
What does the ellipse look like?
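A quick numerical answer to the question (a sketch; the signs of the eigenvectors returned by `eigh` may differ):

```python
import numpy as np

Sigma = np.array([[0.5, -0.3], [-0.3, 0.5]])
eigvals, eigvecs = np.linalg.eigh(Sigma)    # ascending order: [0.2, 0.8]
print(eigvals)
print(eigvecs)                              # columns are u_2 and u_1 (up to sign)

# The contours are ellipses whose long axis points along u_1 = (1, -1)/sqrt(2),
# with half-axis proportional to sqrt(0.8), and whose short axis points along
# u_2 = (1, 1)/sqrt(2), proportional to sqrt(0.2).
```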
Contour surfaces (figure: contour lines of the pdf of the 2D example in the x[1]–x[2] plane, at levels $10^{-1}$, $10^{-2}$ and $10^{-4}$)