Recasting Principal Components
R.W. Oldford, University of Waterloo


  1. Recasting Principal Components R.W. Oldford University of Waterloo

  2. Reducing dimensions - recasting the problem of principal components

The principal axes V for a set of data X^T = [x_1, ..., x_n] can be found in one of two ways. Either:
◮ via the eigen-decomposition of X^T X = V D_λ V^T, or
◮ via the singular value decomposition X = U D_σ V^T.

Either way, the principal components are formed as Y = XV = U D_σ, which means that all we need are U and D_σ; we don't really need V. We can get these two matrices from a different eigen-decomposition, namely

$$Y Y^T = U D_\sigma (U D_\sigma)^T = U D_\sigma^2 U^T = X X^T.$$

Note that this matrix is n × n, i.e. it depends only on the sample size n and not on the dimension p. So there is a choice: either the p × p matrix X^T X or the n × n matrix X X^T.
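Below is a minimal R sketch (not from the slides; it uses simulated data) showing that the principal components Y = U D_σ obtained from the SVD of X agree with those obtained from the eigen-decomposition of the n × n matrix X X^T, so V is never needed.

# Sketch with simulated, centred data
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)

# Route 1: SVD of X gives Y = U D_sigma
sv <- svd(X)
Y_svd <- sv$u %*% diag(sv$d)

# Route 2: eigen-decomposition of X X^T gives U and D_sigma^2
ed <- eigen(X %*% t(X), symmetric = TRUE)
Y_gram <- ed$vectors[, 1:p] %*% diag(sqrt(pmax(ed$values[1:p], 0)))

# The two routes agree up to the sign of each column
max(abs(abs(Y_svd) - abs(Y_gram)))   # effectively zero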

  3. Reducing dimensions - recasting the problem of principal components

The choice X X^T has an interesting structure:

$$
X X^T =
\begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}
[x_1, x_2, \ldots, x_n]
=
\begin{bmatrix}
x_1^T x_1 & x_1^T x_2 & \cdots & x_1^T x_n \\
x_2^T x_1 & x_2^T x_2 & \cdots & x_2^T x_n \\
\vdots & \vdots & \ddots & \vdots \\
x_n^T x_1 & x_n^T x_2 & \cdots & x_n^T x_n
\end{bmatrix}
$$

Note that
◮ the (i, j) element x_i^T x_j is an inner product
◮ this matrix of inner products is often called the Gram matrix
◮ if the data were centred (Σ_{i=1}^n x_i = 0) then each row and column above sums to 0.
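A quick R check (not from the slides) of the last bullet: for centred data the Gram matrix X X^T has rows and columns that sum to zero.

# Centre some arbitrary data and form the Gram matrix
Xc <- scale(matrix(rnorm(20 * 3), 20, 3), center = TRUE, scale = FALSE)
G  <- Xc %*% t(Xc)      # n x n matrix of inner products x_i^T x_j
range(rowSums(G))       # all essentially zero
range(colSums(G))       # all essentially zero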

  4. Reducing dimensions - problems with principal axes

Principal axes are, well, just that . . . axes. That is, they correspond to directions v_1, ..., v_k onto which the data x_1, ..., x_n, when projected orthogonally, have a maximal spread. We look to remove axes (directions) for which the sum of the squared lengths of the projections is relatively small. For this to work, the data have to be (nearly) restricted to a linear subspace.

  5. Reducing dimensions - problems with principal axes

What if the data lie in a very restricted part of the space, but it's just not linear? [Figure: points scattered around the perimeter of a circle.] Clearly, these points lie in a very compact region of the plane, but not along any direction vector (or principal axis). The data are still essentially only one-dimensional, just arranged around the perimeter of a circle.

  6. Reducing dimensions - problems with principal axes

Possible solutions?

1. Change variables, e.g. to polar coordinates:

$$(x_1, x_2) \rightarrow (f_1, f_2) = \left( \sqrt{x_1^2 + x_2^2},\; \arccos\!\left( \frac{x_1}{\sqrt{x_1^2 + x_2^2}} \right) \right)$$

2. Somehow follow the points around the circle?
◮ That is, try to find the nonlinear manifold on (or near) which the points lie.
◮ Somehow preserve local structure: we want points that are near each other (along the nonlinear manifold) to also be near each other in the reduced (linear) space.

  7. Reducing dimensions - changing variables

x1 <- data[, 1]
x2 <- data[, 2]
newdata <- data.frame(f1 = sqrt(x1^2 + x2^2),
                      f2 = acos(x1 / sqrt(x1^2 + x2^2)))
colnames(newdata) <- c("r", "theta")
newdata <- scale(newdata, center = TRUE, scale = FALSE)
svd_data <- svd(newdata)
svd_data$d
## [1] 1.103150e+01 7.430655e-16

  8. Reducing dimensions - changing variables

(The same code and output as on the previous slide.) The second singular value is essentially zero: this non-linear transformation has produced a coordinate system in which all the data lie in a one-dimensional linear subspace.

  9. Reducing dimensions - changing variables

Alternatively, we might try a transformation that adds non-linear coordinates, say

$$(x_1, x_2) \rightarrow (f_1, f_2, f_3, f_4) = (x_1, x_2, x_1^2, x_2^2)$$

[Figure: scatterplot matrix of the four coordinates f1, f2, f3, f4.]
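The following R sketch (not from the slides; it assumes, as in the earlier code, that `data` holds the two-dimensional circle data) carries out this augmented transformation and the corresponding SVD, producing the singular values and eigen-vector discussed on the next slide.

# Augment with the squared coordinates and redo the (centred) SVD
x1 <- data[, 1]
x2 <- data[, 2]
newdata <- cbind(f1 = x1, f2 = x2, f3 = x1^2, f4 = x2^2)
newdata <- scale(newdata, center = TRUE, scale = FALSE)
svd_new <- svd(newdata)
svd_new$d        # the last singular value should be essentially zero
svd_new$v[, 4]   # the corresponding eigen-vector v_4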

  10. Reducing dimensions - changing variables

A principal component analysis on the transformed data produces the singular values:

## [1] 8.790603e+00 8.461389e+00 5.923241e+00 1.750249e-15

The last of these is essentially zero; the corresponding eigen-vector v_4 is

## [1] 0.000000e+00 -3.052969e-16 7.071068e-01 7.071068e-01

Note the linear structure in (f_3, f_4). This would be picked up by the last eigen-vector which, as seen above, is approximately (0, 0, 1/√2, 1/√2)^T. The line visible in the (f_3, f_4) panel of the scatterplot matrix is orthogonal to the direction (1, 1) (i.e. to the eigen-vector v_4). Note also that there would be 3 principal components; more than the original dimensionality of the data!

  11. Reducing dimensions - changing variables

More generally, if x ∈ R^p, we can consider a mapping ψ : R^p → R^m.
◮ ψ could be non-linear
◮ m could be larger than p
◮ f_i = ψ(x_i) for i = 1, ..., n are called "feature vectors" by some writers, and the range of ψ the "feature space" F ⊂ R^m.

Note that while the dimensionality increases, the number of points n stays the same. Whether working in the original data space or in the constructed feature space, a principal component analysis can be obtained via an n × n Gram matrix. The dimensionality of the feature space could be much larger than the dimensionality of the data (i.e. m ≫ p); it could even be infinite! The principal component analysis (PCA) will never need a matrix larger than the corresponding n × n Gram matrix.
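As an illustration (a sketch, not from the slides), here is a PCA computed entirely from the n × n Gram matrix F F^T of feature vectors, using a hypothetical feature map psi; the larger m-dimensional side is never decomposed.

# A hypothetical feature map from R^2 to R^4
psi <- function(x) c(x[1], x[2], x[1]^2, x[2]^2)
X <- matrix(rnorm(2 * 50), ncol = 2)
Fmat <- t(apply(X, 1, psi))                        # n x m matrix of feature vectors
Fmat <- scale(Fmat, center = TRUE, scale = FALSE)  # centre the features
G  <- Fmat %*% t(Fmat)                             # n x n Gram matrix
ed <- eigen(G, symmetric = TRUE)
Y  <- ed$vectors %*% diag(sqrt(pmax(ed$values, 0)))  # Y = U D_sigma
head(Y[, 1:4])   # only the first m = 4 components are non-trivial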

  12. Reducing dimensions - changing variables

F^T = [f_1, ..., f_n] is the m × n matrix of feature vectors, which can be a nuisance to work with if m ≫ n and impossible if m = ∞. The corresponding Gram matrix for the feature space, K = [k_ij] = F F^T, however, is always n × n (even if m = ∞). All we need to be able to do is determine the inner products

$$k_{ij} = f_i^T f_j = \psi(x_i)^T \psi(x_j) = \langle \psi(x_i), \psi(x_j) \rangle = K(x_i, x_j), \text{ say.}$$

It looks like we only need to choose the function K(x_i, x_j). That is, we never need to calculate any feature vector f_i = ψ(x_i) (or even determine the function ψ(·)) if we have the function K(x_i, x_j), a function of pairs of vectors in the data space. K(x_i, x_j) is called a kernel function (N.B. not to be confused with "kernel density" estimates), and this move from ψ functions to kernel functions is sometimes called the "kernel trick".
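A minimal R sketch (not from the slides) of this idea: the n × n kernel matrix is built directly from a kernel function K(x, y), without ever computing a feature vector. The helper `kernel_matrix` is hypothetical, not part of any package.

# Build the n x n kernel matrix from a kernel function K(x, y)
kernel_matrix <- function(X, K) {
  n <- nrow(X)
  outer(1:n, 1:n, Vectorize(function(i, j) K(X[i, ], X[j, ])))
}
# Sanity check: the linear kernel K(x, y) = x^T y reproduces X X^T
K_lin <- function(x, y) sum(x * y)
X <- matrix(rnorm(10 * 3), 10, 3)
max(abs(kernel_matrix(X, K_lin) - X %*% t(X)))   # effectively zero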

  13. Reducing dimensions - changing variables

A number of kernel functions have been proposed. Three common choices are:

1. Polynomial of degree d, scale parameter σ, and offset θ:

$$K(x, y) = (\sigma x^T y + \theta)^d$$

For example, suppose p = 3 and d = 2 (with σ = 1 and θ = 0); then

$$
\begin{aligned}
K(x, y) &= (x_1 y_1 + x_2 y_2 + x_3 y_3)^2 \\
        &= x_1^2 y_1^2 + x_2^2 y_2^2 + x_3^2 y_3^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 x_3 y_1 y_3 + 2 x_2 x_3 y_2 y_3 \\
        &= \left( x_1^2,\ x_2^2,\ x_3^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2}\,x_1 x_3,\ \sqrt{2}\,x_2 x_3 \right)
           \begin{pmatrix} y_1^2 \\ y_2^2 \\ y_3^2 \\ \sqrt{2}\,y_1 y_2 \\ \sqrt{2}\,y_1 y_3 \\ \sqrt{2}\,y_2 y_3 \end{pmatrix}
\end{aligned}
$$

So ψ(x) = (x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3)^T.
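A small numerical check in R (not from the slides) that, for p = 3, d = 2, σ = 1, and θ = 0, the polynomial kernel equals the inner product of the explicit feature vectors ψ(x) and ψ(y) given above.

K_poly <- function(x, y) (sum(x * y))^2
psi <- function(x) c(x[1]^2, x[2]^2, x[3]^2,
                     sqrt(2) * x[1] * x[2],
                     sqrt(2) * x[1] * x[3],
                     sqrt(2) * x[2] * x[3])
x <- rnorm(3); y <- rnorm(3)
K_poly(x, y) - sum(psi(x) * psi(y))   # effectively zero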

  14. Reducing dimensions - changing variables

2. Radial basis function (Gaussian), with scale parameter σ:

$$K(x, y) = \exp\!\left( \frac{-\|x - y\|^2}{2\sigma^2} \right)$$

To see that this is also an inner product, consider a series expansion of e^t. The feature space is infinite dimensional.

3. Sigmoid (hyperbolic tangent), with scale σ and offset θ:

$$K(x, y) = \tanh\!\left( \sigma x^T y + \theta \right)$$

There is a theorem from functional analysis (involving reproducing kernel Hilbert spaces, hence the name) called Mercer's theorem which gives conditions under which a function K(x, y) can be expressed as a dot product.
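For concreteness, the three kernels could be written as R functions as follows (a sketch, not from the slides; the names and default parameter values are illustrative only).

K_poly    <- function(x, y, d = 2, sigma = 1, theta = 0) (sigma * sum(x * y) + theta)^d
K_rbf     <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / (2 * sigma^2))
K_sigmoid <- function(x, y, sigma = 1, theta = 0) tanh(sigma * sum(x * y) + theta)
x <- rnorm(3); y <- rnorm(3)
c(K_poly(x, y), K_rbf(x, y), K_sigmoid(x, y))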

  15. Reducing dimensions - changing variables

The kernel function needs to be such that the vectors f_1, ..., f_n in the feature space are centred, i.e. Σ_{i=1}^n f_i = 0. Recall that the kernel matrix K is a Gram matrix, so its elements are the inner products f_i^T f_j in the feature space. A simple way to effect this is to ensure that the kernel matrix K has rows and columns that sum to zero. That is, replace K by K*, obtained by removing means from the front and back:

$$
\begin{aligned}
K^* &= \left( I_n - \mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T \right) K \left( I_n - \mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T \right) \\
    &= \left( I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \right) K \left( I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \right) \\
    &= K - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T K - \tfrac{1}{n} K \mathbf{1}\mathbf{1}^T + \tfrac{1}{n^2}\mathbf{1}\mathbf{1}^T K \mathbf{1}\mathbf{1}^T \\
    &= K - \mathbf{1}\left( \tfrac{1}{n}\mathbf{1}^T K \right) - \left( \tfrac{1}{n} K \mathbf{1} \right)\mathbf{1}^T + \tfrac{1}{n^2}\left( \mathbf{1}^T K \mathbf{1} \right)\mathbf{1}\mathbf{1}^T
\end{aligned}
$$

Or, in words: subtract the row means, subtract the column means, and add back in the overall mean.
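In R, this double-centring could be sketched as follows (not from the slides; `centre_kernel` is a hypothetical helper). Multiplying by J = I_n − (1/n)11^T on both sides removes the row and column means and adds back the overall mean.

centre_kernel <- function(K) {
  n <- nrow(K)
  J <- diag(n) - matrix(1 / n, n, n)   # I_n - (1/n) 1 1^T
  J %*% K %*% J
}
# Check on an arbitrary Gram-type matrix
A <- matrix(rnorm(25), 5, 5)
K <- A %*% t(A)
Kstar <- centre_kernel(K)
range(rowSums(Kstar)); range(colSums(Kstar))   # all essentially zero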
