PCA objective 2: projected variance

Empirical distribution: uniform over $x_1, \ldots, x_n$
Expectation (think sum over data points): $\hat{\mathbb{E}}[f(x)] = \frac{1}{n} \sum_{i=1}^n f(x_i)$
Variance (think sum of squares if centered): $\widehat{\mathrm{var}}[f(x)] + (\hat{\mathbb{E}}[f(x)])^2 = \hat{\mathbb{E}}[f(x)^2] = \frac{1}{n} \sum_{i=1}^n f(x_i)^2$
Assume data is centered: $\hat{\mathbb{E}}[x] = 0$ (what's $\hat{\mathbb{E}}[U^\top x]$?)

Objective: maximize variance of projected data
$$\max_{U \in \mathbb{R}^{d \times k},\; U^\top U = I} \hat{\mathbb{E}}[\| U^\top x \|^2]$$
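To make the notation concrete, here is a small numpy sketch (not from the slides) of the empirical expectation and the projected-variance objective $\hat{\mathbb{E}}[\|U^\top x\|^2]$ for an arbitrary orthonormal $U$; the data and $U$ are randomly generated purely for illustration.

```python
# Sketch: empirical expectation and the projected-variance objective
# E_hat[||U^T x||^2] for a candidate orthonormal U (random data, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 200, 2
X = rng.normal(size=(d, n))             # columns are data points x_1, ..., x_n
X -= X.mean(axis=1, keepdims=True)      # center so that E_hat[x] = 0

# Orthonormal U (d x k), e.g. from a QR factorization of a random matrix
U, _ = np.linalg.qr(rng.normal(size=(d, k)))

# E_hat[f(x)] = (1/n) sum_i f(x_i); here f(x) = ||U^T x||^2
proj = U.T @ X                          # k x n matrix of projected points
projected_variance = np.mean(np.sum(proj**2, axis=0))
print(projected_variance)
```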
Equivalence in two objectives

Key intuition:
$$\underbrace{\text{variance of data}}_{\text{fixed}} = \underbrace{\text{captured variance}}_{\text{want large}} + \underbrace{\text{reconstruction error}}_{\text{want small}}$$

Pythagorean decomposition: $x = UU^\top x + (I - UU^\top)x$
(picture: right triangle with hypotenuse $\|x\|$ and legs $\|UU^\top x\|$ and $\|(I - UU^\top)x\|$)

Take expectations; note that multiplying by $U$ (orthonormal columns) preserves length, so $\|UU^\top x\| = \|U^\top x\|$:
$$\hat{\mathbb{E}}[\|x\|^2] = \hat{\mathbb{E}}[\|U^\top x\|^2] + \hat{\mathbb{E}}[\|x - UU^\top x\|^2]$$

Minimize reconstruction error $\leftrightarrow$ Maximize captured variance
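The decomposition is easy to check numerically. The sketch below (an illustration, not code from the slides) draws random centered data and a random orthonormal $U$ and verifies that total variance equals captured variance plus reconstruction error.

```python
# Sketch: numerically check E_hat[||x||^2] = E_hat[||U^T x||^2] + E_hat[||x - U U^T x||^2]
# for any U with orthonormal columns (random data, illustrative only).
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 6, 500, 3
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)            # centered data
U, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal U

total    = np.mean(np.sum(X**2, axis=0))                     # variance of data (fixed)
captured = np.mean(np.sum((U.T @ X)**2, axis=0))             # captured variance
residual = np.mean(np.sum((X - U @ (U.T @ X))**2, axis=0))   # reconstruction error
assert np.isclose(total, captured + residual)
```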
Finding one principal component

Input data: $X = (x_1 \; \cdots \; x_n)$

Objective: maximize variance of projected data
$$\max_{\|u\|=1} \hat{\mathbb{E}}[(u^\top x)^2]
= \max_{\|u\|=1} \frac{1}{n} \sum_{i=1}^n (u^\top x_i)^2
= \max_{\|u\|=1} \frac{1}{n} \|u^\top X\|^2
= \max_{\|u\|=1} u^\top \Big(\tfrac{1}{n} XX^\top\Big) u
= \text{largest eigenvalue of } C \stackrel{\text{def}}{=} \tfrac{1}{n} XX^\top$$
($C$ is the covariance matrix of the data)
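As a sanity check, the following sketch computes the first principal component as the top eigenvector of $C = \frac{1}{n}XX^\top$ and compares it with plain power iteration; the data is synthetic and the comparison is purely illustrative.

```python
# Sketch: find the first principal component as the top eigenvector of
# C = (1/n) X X^T, and compare with power iteration (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 300
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

C = (X @ X.T) / n                       # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
u = eigvecs[:, -1]                      # top eigenvector, ||u|| = 1
print("captured variance:", eigvals[-1])

# Power iteration converges to the same direction (up to sign)
v = rng.normal(size=d)
for _ in range(1000):
    v = C @ v
    v /= np.linalg.norm(v)
print("agreement:", abs(u @ v))         # close to 1
```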
How many principal components?

• Similar to the question of "How many clusters?"
• The magnitude of the eigenvalues indicates the fraction of variance captured.
• (plot: eigenvalues $\lambda_i$ versus $i$ on a face image dataset, falling from about 1353.2 to 287.1 over the first several components)
• Eigenvalues typically drop off sharply, so we don't need that many.
• Of course, variance isn't everything...
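One common heuristic, sketched below with hypothetical inputs, is to keep the smallest $k$ whose leading eigenvalues capture a chosen fraction of the total variance (the 0.9 threshold here is an arbitrary choice, not prescribed by the slides).

```python
# Sketch: pick k from the eigenvalue spectrum by the fraction of variance captured.
import numpy as np

def choose_k(X, threshold=0.9):
    """X is d x n with data points as columns; threshold is a design choice."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigvals = np.linalg.eigvalsh((Xc @ Xc.T) / Xc.shape[1])[::-1]  # descending
    frac = np.cumsum(eigvals) / eigvals.sum()        # cumulative variance captured
    return int(np.searchsorted(frac, threshold) + 1)  # smallest k reaching threshold
```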
Computing PCA

Method 1: eigendecomposition
$U$ contains the eigenvectors of the covariance matrix $C = \frac{1}{n} XX^\top$
Computing $C$ already takes $O(nd^2)$ time (very expensive)

Method 2: singular value decomposition (SVD)
Find $X = U_{d\times d} \Sigma_{d\times n} V^\top_{n\times n}$ where $U^\top U = I_{d\times d}$, $V^\top V = I_{n\times n}$, and $\Sigma$ is diagonal
Computing the top $k$ singular vectors takes only $O(ndk)$

Relationship between eigendecomposition and SVD:
Left singular vectors are the principal components ($XX^\top = U \Sigma^2 U^\top$, so $C = \frac{1}{n} U \Sigma^2 U^\top$)
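The two methods can be compared directly in numpy; the sketch below (synthetic data, illustrative only) checks that the top-$k$ eigenvectors of $C$ and the top-$k$ left singular vectors of $X$ span the same subspace, and that $\lambda_i = \sigma_i^2 / n$.

```python
# Sketch comparing the two routes on a small example; both recover the same
# principal subspace (columns may differ by sign).
import numpy as np

rng = np.random.default_rng(3)
d, n, k = 8, 100, 3
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

# Method 1: eigendecomposition of C = (1/n) X X^T
evals, evecs = np.linalg.eigh((X @ X.T) / n)
U1 = evecs[:, ::-1][:, :k]               # top-k eigenvectors, descending order

# Method 2: SVD of X; left singular vectors are the principal components
U2, S, Vt = np.linalg.svd(X, full_matrices=False)
U2 = U2[:, :k]

print(np.abs(np.sum(U1 * U2, axis=0)))   # ~ [1, 1, 1]: same directions up to sign
print(np.allclose(evals[::-1][:k], S[:k]**2 / n))  # lambda_i = sigma_i^2 / n
```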
Roadmap

• Principal component analysis (PCA)
  – Basic principles
  – Case studies
  – Kernel PCA
  – Probabilistic PCA
• Canonical correlation analysis (CCA)
• Fisher discriminant analysis (FDA)
• Summary

(next: PCA case studies)
Eigen-faces [Turk and Pentland, 1991]

• $d$ = number of pixels
• Each $x_i \in \mathbb{R}^d$ is a face image
• $x_{ji}$ = intensity of the $j$-th pixel in image $i$

$$X_{d\times n} \approx U_{d\times k} Z_{k\times n}, \qquad Z = (z_1 \; \cdots \; z_n)$$

Idea: $z_i$ is a more "meaningful" representation of the $i$-th face than $x_i$
Can use $z_i$ for nearest-neighbor classification
Much faster: $O(dk + nk)$ time instead of $O(dn)$ when $n, d \gg k$
Why no time savings for a linear classifier?
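A minimal sketch of this pipeline is below; the data layout, the choice of $k$, and the 1-nearest-neighbor rule are illustrative assumptions, not details taken from Turk and Pentland.

```python
# Sketch of the eigen-faces pipeline: learn a k-dimensional basis, then classify
# a query face by nearest neighbor in the reduced space (hypothetical data).
import numpy as np

def pca_fit(X, k):
    """X is d x n with columns as (already centered) face images."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]                      # d x k basis of eigen-faces

def nearest_neighbor_label(U, X_train, y_train, x_query):
    Z_train = U.T @ X_train              # k x n codes, computed once
    z_query = U.T @ x_query              # k-dimensional code of the query
    dists = np.linalg.norm(Z_train - z_query[:, None], axis=0)
    return y_train[np.argmin(dists)]
```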
Latent Semantic Analysis [Deerwester et al., 1990]

• $d$ = number of words in the vocabulary
• Each $x_i \in \mathbb{R}^d$ is a vector of word counts
• $x_{ji}$ = frequency of word $j$ in document $i$

$$X_{d\times n} \approx U_{d\times k} Z_{k\times n}$$
(example: a term-document count matrix with rows for words such as "game", "stocks", "chairman", "the", ..., "wins" is approximated by $U$, whose columns assign a weight to each word, times $Z = (z_1 \; \cdots \; z_n)$, the low-dimensional document representations)

How to measure similarity between two documents? $z_1^\top z_2$ is probably better than $x_1^\top x_2$
Applications: information retrieval
Note: no computational savings; the original $x$ is already sparse
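The sketch below runs LSA on a tiny made-up term-document matrix and compares document similarity in the raw count space with similarity in the $k$-dimensional space; all counts and the choice $k=2$ are hypothetical.

```python
# Sketch: LSA on a small hypothetical term-document count matrix; compare
# similarity in raw count space vs. the k-dimensional LSA space.
import numpy as np

# rows = words, columns = documents (made-up counts)
X = np.array([[1., 0., 3., 2.],
              [2., 1., 0., 0.],
              [4., 3., 1., 0.],
              [8., 7., 7., 6.],
              [0., 0., 2., 3.]])

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = np.diag(S[:k]) @ Vt[:k]              # k x n document representations

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("raw space:", cosine(X[:, 0], X[:, 1]))
print("LSA space:", cosine(Z[:, 0], Z[:, 1]))
```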
Network anomaly detection [Lakhina, '05]

$x_{ji}$ = amount of traffic on link $j$ in the network during time interval $i$

Model assumption: total traffic is the sum of flows along a few "paths"
Apply PCA: each principal component intuitively represents a "path"
Anomaly when traffic deviates from the first few principal components
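A minimal version of the residual-based detector might look as follows; the traffic matrix, the number of components, and the threshold are all assumptions for illustration.

```python
# Sketch of a residual-based detector: project each time interval onto the
# top-k principal components and flag intervals with large reconstruction error.
import numpy as np

def fit_normal_subspace(X, k):
    """X is d x n: traffic on d links over n time intervals (training data)."""
    mu = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mu, full_matrices=False)
    return mu, U[:, :k]

def anomaly_scores(X_new, mu, U):
    R = (X_new - mu) - U @ (U.T @ (X_new - mu))   # residual outside the subspace
    return np.linalg.norm(R, axis=0)               # one score per time interval

# flag = anomaly_scores(X_new, mu, U) > threshold   # threshold chosen by the operator
```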
Unsupervised POS tagging [Schütze, '95]

Part-of-speech (POS) tagging task:
Input:  I like reducing the dimensionality of data .
Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .

Each $x_i$ is (the context distribution of) a word; $x_{ji}$ is the number of times word $i$ appeared in context $j$

Key idea: words appearing in similar contexts tend to have the same POS tags, so cluster using the contexts of each word type
Problem: contexts are too sparse
Solution: run PCA first, then cluster using the new representation
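A sketch of that recipe is below, assuming scikit-learn is available; the dimensions, the number of clusters, and the use of k-means are illustrative choices rather than Schütze's exact procedure.

```python
# Sketch: reduce the sparse word-by-context count matrix with PCA/SVD, then
# cluster the resulting word representations.
import numpy as np
from sklearn.cluster import KMeans

def cluster_words(X, k_dims=50, n_tags=10, seed=0):
    """X is d x n: rows are contexts, columns are word types (counts)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = (np.diag(S[:k_dims]) @ Vt[:k_dims]).T     # n x k_dims word representations
    return KMeans(n_clusters=n_tags, random_state=seed).fit_predict(Z)
```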
Multi-task learning [Ando & Zhang, '05]

• Have $n$ related tasks (classify documents for various users)
• Each task has a linear classifier with weights $x_i$
• Want to share structure between the classifiers

One step of their procedure: given $n$ linear classifiers $x_1, \ldots, x_n$, run PCA to identify shared structure:
$$X = (x_1 \; \cdots \; x_n) \approx UZ$$
Each principal component is an eigen-classifier

Other step of their procedure: retrain the classifiers, regularizing towards the subspace $U$
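A sketch of just the PCA step, under the assumption that the per-task weight vectors are already trained (the retraining step with regularization towards $U$ is not shown):

```python
# Sketch: stack the n per-task weight vectors and extract a shared k-dimensional
# subspace U whose columns act as "eigen-classifiers".
import numpy as np

def shared_subspace(weight_vectors, k):
    """weight_vectors: list of n trained weight vectors, each of dimension d."""
    X = np.column_stack(weight_vectors)            # d x n matrix of classifiers
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]                                # shared structure across tasks
```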
PCA summary

• Intuition: capture variance of the data or minimize reconstruction error
• Algorithm: eigendecomposition of the covariance matrix, or SVD
• Impact: reduce storage (from $O(nd)$ to $O(nk)$), reduce time complexity
• Advantages: simple, fast
• Applications: eigen-faces, eigen-documents, network anomaly detection, etc.
Roadmap

• Principal component analysis (PCA)
  – Basic principles
  – Case studies
  – Kernel PCA
  – Probabilistic PCA
• Canonical correlation analysis (CCA)
• Fisher discriminant analysis (FDA)
• Summary

(next: Kernel PCA)
Limitations of linearity

(figures: a dataset where "PCA is effective" and one where "PCA is ineffective")

The problem is that the PCA subspace is linear:
$$S = \{ x = Uz : z \in \mathbb{R}^k \}$$
In this example: $S = \{ (x_1, x_2) : x_2 = \tfrac{u_2}{u_1} x_1 \}$
Going beyond linearity: quick solution

(figures: "Broken solution" vs. "Desired solution")

We want the desired solution: $S = \{ (x_1, x_2) : x_2 = \tfrac{u_2}{u_1} x_1^2 \}$
We can get this: $S = \{ \phi(x) = Uz \}$ with $\phi(x) = (x_1^2, x_2)^\top$

Linear dimensionality reduction in $\phi(x)$ space $\Leftrightarrow$ Nonlinear dimensionality reduction in $x$ space

In general, can set $\phi(x) = (x_1, x_1^2, x_1 x_2, \sin(x_1), \ldots)^\top$

Problems: (1) ad-hoc and tedious; (2) $\phi(x)$ is large, so computationally expensive
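The sketch below applies this trick with $\phi(x) = (x_1^2, x_2)$ to synthetic points near a parabola: in $\phi$-space a single principal component captures nearly all the variance. The data-generating curve $x_2 = 2x_1^2$ is an arbitrary choice for illustration.

```python
# Sketch: PCA after an explicit feature map phi; with phi(x) = (x_1^2, x_2),
# data lying near a parabola becomes (nearly) one-dimensional.
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.uniform(-1, 1, size=300)
x2 = 2.0 * x1**2 + 0.01 * rng.normal(size=300)   # noisy parabola x_2 ~ 2 x_1^2
Phi = np.vstack([x1**2, x2])                      # feature map applied to each point

Phi_c = Phi - Phi.mean(axis=1, keepdims=True)
U, S, _ = np.linalg.svd(Phi_c, full_matrices=False)
print(S[1] / S[0])        # small: one component suffices in phi-space
print(U[:, 0])            # direction ~ +/- (1, 2)/sqrt(5), i.e. x_2 ~ 2 x_1^2
```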
Towards kernels

Representer theorem: the PCA solution is a linear combination of the $x_i$'s
Why? Recall the PCA eigenvalue problem: $XX^\top u = \lambda u$
Notice that $u = X\alpha = \sum_{i=1}^n \alpha_i x_i$ for some weights $\alpha$
Analogy with SVMs: weight vector $w = X\alpha$

Key fact: PCA only needs inner products $K = X^\top X$
Why? Use the representer theorem on the PCA objective:
$$\max_{\|u\|=1} u^\top XX^\top u = \max_{\alpha^\top X^\top X \alpha = 1} \alpha^\top (X^\top X)(X^\top X) \alpha$$
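This is straightforward to verify numerically: the sketch below computes the top-component projections once from $XX^\top$ (primal) and once from $K = X^\top X$ alone (dual), on synthetic centered data, and checks that they agree up to sign.

```python
# Sketch: the projections u^T X can be computed from the Gram matrix K = X^T X alone.
import numpy as np

rng = np.random.default_rng(5)
d, n = 5, 40
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

# Primal: top eigenvector u of X X^T, then project the data
evals, evecs = np.linalg.eigh(X @ X.T)
u = evecs[:, -1]
proj_primal = u.T @ X

# Dual: top eigenvector alpha of K; u = X alpha / ||X alpha||, so
# u^T X = alpha^T K / sqrt(lambda_top), using only inner products
K = X.T @ X
kvals, kvecs = np.linalg.eigh(K)
alpha = kvecs[:, -1]
proj_dual = (alpha @ K) / np.sqrt(kvals[-1])

print(np.allclose(np.abs(proj_primal), np.abs(proj_dual)))  # True (up to sign)
```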
Kernel PCA

Kernel function: $k(x_1, x_2)$ such that $K$, the kernel matrix formed by $K_{ij} = k(x_i, x_j)$, is positive semi-definite
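Putting the pieces together, a minimal kernel PCA might be implemented as below; the RBF kernel and its bandwidth are illustrative assumptions, and only the projections of the training points are computed.

```python
# Sketch of kernel PCA via the eigendecomposition of the centered kernel matrix.
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """X is d x n; returns the k-dimensional representations of the n points."""
    sq = np.sum(X**2, axis=0)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X.T @ X))  # RBF kernel
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                                  # center in feature space
    eigvals, A = np.linalg.eigh(Kc)                 # ascending eigenvalues
    eigvals, A = eigvals[::-1][:k], A[:, ::-1][:, :k]
    A = A / np.sqrt(np.maximum(eigvals, 1e-12))     # unit-norm feature-space directions
    return (Kc @ A).T                               # k x n projected coordinates
```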