COMS 4721: Machine Learning for Data Science
Lecture 19, 4/6/2017
Prof. John Paisley
Department of Electrical Engineering & Data Science Institute
Columbia University
PRINCIPAL COMPONENT ANALYSIS
DIMENSIONALITY REDUCTION

We're given data x_1, ..., x_n, where x ∈ R^d. This data is often high-dimensional, but the "information" doesn't use the full d dimensions. For example, a set of images that differ only by shifts and a rotation could be represented with three numbers, since they have three degrees of freedom: two for the shifts and a third for the rotation.

Principal component analysis can be thought of as a way of automatically mapping data x_i into a new low-dimensional coordinate system.
◮ It captures most of the information in the data in a few dimensions.
◮ Extensions allow us to handle missing data and "unwrap" the data.
PRINCIPAL COMPONENT ANALYSIS

Example: How can we approximate this data using a unit-length vector q? (Figure: data points x_i and the line defined by q.)

◮ q is a unit-length vector, q^T q = 1.
◮ Red dot: the length q^T x_i along the axis after projecting x_i onto the line defined by q.
◮ The vector (q^T x_i) q stretches q to the corresponding red dot.

So what's a good q? How about minimizing the squared approximation error,

$$q = \arg\min_{q} \sum_{i=1}^{n} \| x_i - q q^T x_i \|^2 \quad \text{subject to } q^T q = 1.$$

Here qq^T x_i = (q^T x_i) q is the approximation of x_i obtained by stretching q to the "red dot."
PCA: THE FIRST PRINCIPAL COMPONENT

This is related to the problem of finding the largest eigenvalue,

$$q = \arg\min_{q} \sum_{i=1}^{n} \| x_i - q q^T x_i \|^2 \quad \text{s.t. } q^T q = 1$$
$$\;\; = \arg\min_{q} \sum_{i=1}^{n} x_i^T x_i - q^T \Big(\underbrace{\sum_{i=1}^{n} x_i x_i^T}_{=\, XX^T}\Big) q.$$

We've defined X = [x_1, ..., x_n]. Since the first term doesn't depend on q and we have a negative sign in front of the second term, equivalently we solve

$$q = \arg\max_{q} \; q^T (XX^T) q \quad \text{subject to } q^T q = 1.$$

This is the eigendecomposition problem:
◮ q is the first eigenvector of XX^T
◮ λ = q^T (XX^T) q is the first eigenvalue
PCA: GENERAL

The general form of PCA considers K eigenvectors,

$$Q = \arg\min_{q_1,\dots,q_K} \sum_{i=1}^{n} \Big\| x_i - \underbrace{\sum_{k=1}^{K} (x_i^T q_k)\, q_k}_{\text{approximates } x_i} \Big\|^2 \quad \text{s.t. } q_k^T q_{k'} = \begin{cases} 1, & k = k' \\ 0, & k \neq k' \end{cases}$$
$$\;\; = \arg\min_{q_1,\dots,q_K} \sum_{i=1}^{n} x_i^T x_i - \sum_{k=1}^{K} q_k^T \Big(\underbrace{\sum_{i=1}^{n} x_i x_i^T}_{=\, XX^T}\Big) q_k.$$

The vectors in Q = [q_1, ..., q_K] give us a K-dimensional subspace with which to represent the data:

$$x_{\text{proj}} = \begin{bmatrix} q_1^T x \\ \vdots \\ q_K^T x \end{bmatrix}, \qquad x \approx \sum_{k=1}^{K} (q_k^T x)\, q_k = Q x_{\text{proj}}.$$

The eigenvectors of XX^T can be learned using built-in software, for example as in the sketch below.
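As a concrete illustration, here is a minimal numpy sketch that computes the top-K eigenvectors of XX^T and the projected coordinates. The function name, data sizes, and random data are illustrative assumptions, not part of the lecture.

import numpy as np

def pca(X, K):
    """Top-K eigenvectors of XX^T and the K-dimensional coordinates of each column of X."""
    C = X @ X.T                               # d x d matrix XX^T
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending eigenvalues for symmetric C
    order = np.argsort(eigvals)[::-1][:K]     # indices of the K largest eigenvalues
    Q = eigvecs[:, order]                     # Q = [q_1, ..., q_K]
    return Q, Q.T @ X                         # x_proj has q_k^T x_i in entry (k, i)

# Example: project mean-subtracted data from R^10 to R^2 and form the approximation Q x_proj.
X = np.random.randn(10, 100)
X -= X.mean(axis=1, keepdims=True)            # subtract the mean of each dimension
Q, X_proj = pca(X, K=2)
X_approx = Q @ X_proj                         # x_i is approximated by sum_k (q_k^T x_i) q_k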
EIGENVALUES, EIGENVECTORS AND THE SVD

An equivalent formulation of the problem is to find (λ, q) such that

$$(XX^T)\, q = \lambda q.$$

Since XX^T is a PSD matrix, there are r ≤ min{d, n} such pairs,

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad q_k^T q_k = 1, \qquad q_k^T q_{k'} = 0 \ \ (k \neq k').$$

Why is XX^T PSD? Using the SVD, X = USV^T, we have that

$$(XX^T) = U S^2 U^T \;\Rightarrow\; Q = U, \qquad \lambda_i = (S^2)_{ii} \ge 0.$$

Preprocessing: Usually we first subtract off the mean of each dimension of x.
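Continuing the numpy sketch above, the SVD connection can be checked directly (X is the matrix generated there; eigenvector columns may differ from U only in sign):

U, S, Vt = np.linalg.svd(X, full_matrices=False)
lambdas = S**2        # lambda_i = (S^2)_ii >= 0, returned in descending order
# U[:, :K] spans the same subspace as the Q found by eigendecomposition of XX^T.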
PCA: EXAMPLE OF PROJECTING FROM R^3 TO R^2

For this data, most information (structure in the data) can be captured in R^2.

(left) The original data in R^3. The hyperplane is defined by q_1 and q_2.
(right) The new coordinates for the data: $x_i \rightarrow x_{\text{proj},i} = \begin{bmatrix} x_i^T q_1 \\ x_i^T q_2 \end{bmatrix}.$
EXAMPLE: DIGITS

Data: 16 × 16 images of handwritten 3's (as vectors in R^256).

Above: The mean image and the first four eigenvectors q with their eigenvalues
λ_1 = 3.4·10^5, λ_2 = 2.8·10^5, λ_3 = 2.4·10^5, λ_4 = 1.6·10^5.

Above: Reconstructing a 3 using the first M − 1 eigenvectors plus the mean, for M = 1, 10, 50, 250 (shown next to the original), with approximation

$$x \approx \text{mean} + \sum_{k=1}^{M-1} (x^T q_k)\, q_k.$$
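A small sketch of this reconstruction formula, assuming Q is a numpy array holding the eigenvectors as columns, mean is the mean image as a vector in R^256, and x is a vectorized digit; the function and variable names are placeholders.

def reconstruct(x, Q, mean, M):
    # x ≈ mean + sum_{k=1}^{M-1} (x^T q_k) q_k, using the first M-1 eigenvectors in Q.
    # With M = 1 no eigenvectors are used and the result is just the mean image.
    coeffs = Q[:, :M-1].T @ x                 # the weights x^T q_k
    return mean + Q[:, :M-1] @ coeffs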
PROBABILISTIC PCA
PCA AND THE SVD

We've discussed how any matrix X has a singular value decomposition,

$$X = USV^T, \qquad U^T U = I, \qquad V^T V = I,$$

and S is a diagonal matrix with non-negative entries. Therefore,

$$XX^T = U S^2 U^T \;\Leftrightarrow\; (XX^T) U = U S^2.$$

U is a matrix of eigenvectors, and S^2 is a diagonal matrix of eigenvalues.
A MODELING APPROACH TO PCA

Using the SVD perspective of PCA, we can also derive a probabilistic model for the problem and use the EM algorithm to learn it. This model has several advantages, including:
◮ It handles the problem of missing data.
◮ It allows us to learn additional parameters such as the noise variance.
◮ It provides a framework that can be extended to more complex models.
◮ It gives distributions used to characterize uncertainty in predictions.
PROBABILISTIC PCA

In effect, this is a new matrix factorization model.
◮ With the SVD, we had X = USV^T.
◮ We now approximate X ≈ WZ, where
  ◮ W is a d × K matrix. In different settings this is called a "factor loadings" matrix or a "dictionary." It's like the matrix of eigenvectors, but with no orthonormality constraint.
  ◮ The i-th column of Z is called z_i ∈ R^K. Think of it as a low-dimensional representation of x_i.

The generative process of probabilistic PCA is

$$x_i \sim N(W z_i, \sigma^2 I), \qquad z_i \sim N(0, I).$$

In this case, we don't know W or any of the z_i.
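A short sketch of sampling from this generative process; the dimensions and noise level below are arbitrary choices for illustration.

import numpy as np

d, K, n, sigma2 = 20, 3, 500, 0.25            # illustrative sizes and noise level (assumptions)
W = np.random.randn(d, K)                     # unknown in practice; drawn here only to simulate
Z = np.random.randn(K, n)                     # z_i ~ N(0, I), stacked as the columns of Z
X = W @ Z + np.sqrt(sigma2) * np.random.randn(d, n)   # x_i ~ N(W z_i, sigma^2 I)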
THE LIKELIHOOD

Maximum likelihood

Our goal is to find the maximum likelihood solution of the matrix W under the marginal distribution, i.e., with the z_i vectors integrated out,

$$W_{ML} = \arg\max_{W} \ln p(x_1, \dots, x_n \,|\, W) = \arg\max_{W} \sum_{i=1}^{n} \ln p(x_i \,|\, W).$$

This is intractable because p(x_i | W) = N(x_i | 0, σ²I + WW^T), where

$$N(x_i \,|\, 0, \sigma^2 I + WW^T) = \frac{1}{(2\pi)^{\frac{d}{2}}\, |\sigma^2 I + WW^T|^{\frac{1}{2}}}\; e^{-\frac{1}{2} x_i^T (\sigma^2 I + WW^T)^{-1} x_i}.$$

We can set up an EM algorithm that uses the vectors z_1, ..., z_n.
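For reference, a minimal sketch of evaluating this marginal log likelihood with scipy, assuming X stores the x_i as columns and W and sigma2 are given; the function name is a placeholder.

import numpy as np
from scipy.stats import multivariate_normal

def marginal_log_likelihood(X, W, sigma2):
    # sum_i ln N(x_i | 0, sigma^2 I + W W^T), with the columns of X as the x_i
    d = X.shape[0]
    cov = sigma2 * np.eye(d) + W @ W.T
    return multivariate_normal.logpdf(X.T, mean=np.zeros(d), cov=cov).sum()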
EM FOR PROBABILISTIC PCA

Setup

The marginal log likelihood can be expressed using EM as

$$\sum_{i=1}^{n} \ln \int p(x_i, z_i \,|\, W)\, dz_i = \underbrace{\sum_{i=1}^{n} \int q(z_i) \ln \frac{p(x_i, z_i \,|\, W)}{q(z_i)}\, dz_i}_{\leftarrow\ \mathcal{L}} + \underbrace{\sum_{i=1}^{n} \int q(z_i) \ln \frac{q(z_i)}{p(z_i \,|\, x_i, W)}\, dz_i}_{\leftarrow\ KL}$$

EM Algorithm: Remember that EM has two iterated steps:
1. Set q(z_i) = p(z_i | x_i, W) for each i (making KL = 0) and calculate L.
2. Maximize L with respect to W.

Again, for this to work well we need that
◮ we can calculate the posterior distribution p(z_i | x_i, W), and
◮ maximizing L is easy, i.e., we update W using a simple equation.
THE ALGORITHM

EM for Probabilistic PCA

Given: Data x_{1:n}, x_i ∈ R^d, and the model x_i ∼ N(W z_i, σ²I), z_i ∼ N(0, I), z_i ∈ R^K.
Output: Point estimate of W and posterior distribution on each z_i.

E-Step: Set each q(z_i) = p(z_i | x_i, W) = N(z_i | µ_i, Σ_i), where

$$\Sigma_i = (I + W^T W / \sigma^2)^{-1}, \qquad \mu_i = \Sigma_i W^T x_i / \sigma^2.$$

M-Step: Update W by maximizing the objective L from the E-step,

$$W = \Big(\sum_{i=1}^{n} x_i \mu_i^T\Big)\Big(\sum_{i=1}^{n} \big(\sigma^2 I + \mu_i \mu_i^T + \Sigma_i\big)\Big)^{-1}.$$

Iterate the E and M steps until the increase in Σ_{i=1}^n ln p(x_i | W) is "small."

Comment:
◮ The probabilistic framework gives a way to learn K and σ² as well.
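A hedged sketch of these updates, assuming σ² is known and held fixed. The initialization, the fixed number of iterations (rather than monitoring the marginal likelihood), and the variable names are assumptions; the M-step follows the expression written on the slide above.

import numpy as np

def em_ppca(X, K, sigma2, n_iters=100):
    d, n = X.shape
    W = 0.1 * np.random.randn(d, K)                        # random initialization (an assumption)
    for _ in range(n_iters):
        # E-step: q(z_i) = N(mu_i, Sigma); Sigma is the same for every i in this model
        Sigma = np.linalg.inv(np.eye(K) + W.T @ W / sigma2)
        Mu = Sigma @ W.T @ X / sigma2                      # K x n, column i is mu_i
        # M-step: W = (sum_i x_i mu_i^T)(sum_i (sigma^2 I + mu_i mu_i^T + Sigma_i))^{-1}
        A = X @ Mu.T                                       # sum_i x_i mu_i^T
        B = n * (sigma2 * np.eye(K) + Sigma) + Mu @ Mu.T   # sum_i (sigma^2 I + mu_i mu_i^T + Sigma_i)
        W = np.linalg.solve(B.T, A.T).T                    # right-multiply A by B^{-1}
    return W, Mu, Sigma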
EXAMPLE: IMAGE PROCESSING

(Figure: an 8 × 8 patch extracted from an image; the patches form a data matrix X, e.g., 64 × 262,144.)

For image problems such as denoising or inpainting (missing data):
◮ Extract overlapping patches (e.g., 8 × 8) and vectorize them to construct X.
◮ Model X with a factor model such as probabilistic PCA.
◮ Approximate x_i ≈ W µ_i, where µ_i is the posterior mean of z_i.
◮ Reconstruct the image by replacing each x_i with W µ_i (and averaging overlapping patches), as sketched below.
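A rough sketch of this patch pipeline for a grayscale image, assuming the reconstructed patches are available as the columns of a matrix X_hat (for example X_hat = W @ Mu from the EM sketch above); the patch size and helper names are assumptions.

import numpy as np

def extract_patches(img, p=8):
    """Vectorize all overlapping p x p patches into the columns of X (p*p x num_patches)."""
    H, W_ = img.shape
    cols = [img[r:r+p, c:c+p].reshape(-1)
            for r in range(H - p + 1) for c in range(W_ - p + 1)]
    return np.stack(cols, axis=1)

def reconstruct_image(X_hat, shape, p=8):
    """Place the reconstructed patches (columns of X_hat) back and average the overlaps."""
    H, W_ = shape
    out = np.zeros(shape)
    counts = np.zeros(shape)
    k = 0
    for r in range(H - p + 1):
        for c in range(W_ - p + 1):
            out[r:r+p, c:c+p] += X_hat[:, k].reshape(p, p)
            counts[r:r+p, c:c+p] += 1
            k += 1
    return out / counts

# Usage sketch: X = extract_patches(img); fit the factor model; X_hat = W @ Mu;
# denoised = reconstruct_image(X_hat, img.shape)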
EXAMPLE: DENOISING

(Figure: noisy image on the left, denoised image on the right.) The noise variance parameter σ² was learned for this example.
EXAMPLE: MISSING DATA

Another somewhat extreme example:
◮ The image is 480 × 320 × 3 (RGB dimension).
◮ Throw away 80% of the pixels at random.
◮ (Figure: (left) missing data, (middle) reconstruction, (right) original image.)
KERNEL PCA
KERNEL PCA

We've seen how we can take an algorithm that uses dot products, x^T x, and generalize it with a nonlinear kernel. This generalization can be made to PCA.

Recall: With PCA we find the eigenvectors of the matrix Σ_{i=1}^n x_i x_i^T = XX^T.
◮ Let φ(x) be a feature mapping from R^d to R^D, where D ≫ d.
◮ We want to solve the eigendecomposition

$$\Big(\sum_{i=1}^{n} \phi(x_i)\,\phi(x_i)^T\Big) q_k = \lambda_k q_k$$

without having to work in the higher-dimensional space.
◮ That is, how can we do PCA without explicitly using φ(·) and q?
KERNEL PCA

Notice that we can reorganize the operations of the eigendecomposition:

$$q_k = \sum_{i=1}^{n} \underbrace{\big(\phi(x_i)^T q_k / \lambda_k\big)}_{=\, a_{ki}}\, \phi(x_i).$$

That is, the eigenvector q_k = Σ_{i=1}^n a_{ki} φ(x_i) for some vector a_k ∈ R^n. The trick is that instead of learning q_k, we'll learn a_k.

Plug this equation for q_k back into the eigendecomposition:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_{kj}\, \underbrace{\phi(x_i)^T \phi(x_j)}_{=\, K(x_i, x_j)}\, \phi(x_i) = \lambda_k \sum_{i=1}^{n} a_{ki}\, \phi(x_i),$$

and multiply both sides by φ(x_l)^T for each l ∈ {1, ..., n}.
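The slides stop at this multiplication, but as a sketch of where it leads, one can eigendecompose the n × n kernel matrix and work with the coefficient vectors a_k directly. The RBF kernel and its bandwidth, the omission of kernel-matrix centering, and the normalization a_k^T K a_k = 1 (so that the implicit q_k has unit length) are assumptions and standard choices not spelled out above.

import numpy as np

def rbf_kernel(X, b=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / b) for the columns x_i of X (an assumed kernel)."""
    sq = np.sum(X**2, axis=0)
    D = sq[:, None] + sq[None, :] - 2 * X.T @ X
    return np.exp(-D / b)

def kernel_pca(X, K_dims, b=1.0):
    Kmat = rbf_kernel(X, b)                          # n x n kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kmat)
    order = np.argsort(eigvals)[::-1][:K_dims]       # largest eigenvalues first
    # Scale so each implicit eigenvector q_k = sum_i a_ki phi(x_i) satisfies
    # q_k^T q_k = a_k^T K a_k = 1.
    A = eigvecs[:, order] / np.sqrt(eigvals[order])  # column k is a_k
    projections = Kmat @ A                           # entry (i, k) is q_k^T phi(x_i)
    return A, projections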