Reducing dimensionality: Principal components
R.W. Oldford
Reducing dimensions

Recall how orthogonal projections work. Given $V = [v_1, \ldots, v_k]$, an orthogonal projection matrix $P$ is easily constructed as
$$P = V (V^T V)^{-1} V^T.$$
And, if the column vectors of $V$ form an orthonormal basis for the subspace $S$, then $P = V V^T$.

So far, we have only considered the choice where the $v_i$s are unit vectors in the direction of the original data axes $e_i$. Projections onto these directions simply return the scatterplots on pairs of the original variates. Are there other directions which would do as well (or possibly better)?

Imagine that we have $n$ points $x_1, \ldots, x_n \in \mathbb{R}^p$, centred so that $\sum_{i=1}^n x_i = \mathbf{0}$, and denote by $X = [x_1, \ldots, x_n]^T$ the $n \times p$ real matrix whose $i$th row is $x_i^T$.

Being centred, we can now ask whether the data truly lie in a linear subspace of $\mathbb{R}^p$. And, if they do, can we find that subspace? Alternatively, do they lie nearly in a linear subspace, and could we find it?
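To make the construction concrete, here is a minimal R sketch (not from the original slides; the matrix V below is an arbitrary illustrative choice) of forming $P = V(V^TV)^{-1}V^T$ and checking that it reduces to $VV^T$ once the columns are orthonormalized:

# Sketch: build the orthogonal projection matrix P = V (V^T V)^{-1} V^T
# for an illustrative V whose two columns span a subspace of R^3.
V <- cbind(c(1, 1, 0), c(0, 1, 1))      # columns need not be orthonormal
P <- V %*% solve(t(V) %*% V) %*% t(V)   # general projection formula
x <- c(3, -1, 2)
P %*% x                                 # orthogonal projection of x onto span(V)
# Orthonormalizing the columns first (e.g. via QR) gives the simpler form V V^T.
Q <- qr.Q(qr(V))                        # orthonormal basis for the same subspace
all.equal(P, Q %*% t(Q))                # TRUE: both give the same projection matrix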
Reducing dimensions - Finding the principal axes

For example, suppose $n = 20$ and $p = 2$, so that a point cloud might look like:

[Figure: scatterplot of 20 points in the plane.]

The points $x_1, \ldots, x_n$ lie in the plane, but do not occupy all of it. They appear to lie nearly in a one-dimensional subspace of $\mathbb{R}^2$.
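As an illustrative sketch (the numbers are arbitrary, not the data in the figure), such a centred, nearly one-dimensional point cloud could be generated in R as follows:

# Sketch: a centred point cloud with n = 20, p = 2 that nearly lies on a line.
set.seed(314)                                # arbitrary seed for reproducibility
n <- 20
t <- rnorm(n, sd = 3)                        # position along the dominant direction
noise <- rnorm(n, sd = 0.3)                  # small deviation off that direction
X <- cbind(t + noise, t - noise)             # two highly correlated variates
X <- scale(X, center = TRUE, scale = FALSE)  # centre so the column means are 0
plot(X, asp = 1, xlab = "x1", ylab = "x2")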
Reducing dimensions - Finding the principal axes

We can think about orthogonally projecting the points (or equivalently, vectors) $x_1, \ldots, x_n$ onto any direction vector $a \in \mathbb{R}^p$ (i.e. $\|a\| = 1$).

[Figure: the point cloud with a direction vector $a$ and the orthogonal projection of a point $x_k$ onto it.]
Reducing dimensions - Finding the principal axes

The orthogonal projection of the point $x_k$ onto $a$ (i.e. onto $\mathrm{span}\{a\}$) is
$$(a a^T) x_k = w_k \, a,$$
a vector in the direction $\mathrm{sign}(w_k)\, a$ of length $|w_k|$, with $w_k = a^T x_k$. Note that the squared length of this projection is
$$w_k^2 = \|a a^T x_k\|^2 = x_k^T a a^T x_k = a^T x_k x_k^T a.$$

Since every point can be projected onto the direction vector $a$, we might ask which vector would maximize (or minimize) the sum of the squared lengths of the projections. That is, onto which direction would the projections have the largest average squared length? Because the points are already centred about $\mathbf{0}$, this is the same as asking in which direction the original data points are most (or least) spread out (i.e. most or least variable).
Reducing dimensions - Finding the principal axes

Mathematically, we want to find the direction vector $a$ which maximizes (minimizes) the sum $\sum_{k=1}^n w_k^2$. This sum can in turn be expressed in terms of the original points in the point cloud as:
$$\sum_{k=1}^n w_k^2 = \sum_{k=1}^n x_k^T a a^T x_k = \sum_{k=1}^n a^T x_k x_k^T a = a^T \left( \sum_{k=1}^n x_k x_k^T \right) a = a^T (X^T X) a$$
where $X^T = [x_1, x_2, \ldots, x_n]$ is the $p \times n$ matrix of data vectors.
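As a quick numerical check (a sketch reusing the centred matrix X simulated earlier), the sum of squared projection lengths onto a unit vector $a$ agrees with the quadratic form $a^T(X^TX)a$:

# Sketch: verify sum_k w_k^2 = a^T (X^T X) a for an arbitrary unit vector a.
a <- c(1, 2) / sqrt(5)               # an arbitrary direction with ||a|| = 1
w <- X %*% a                         # w_k = a^T x_k, one entry per data point
sum(w^2)                             # sum of squared projection lengths
t(a) %*% (t(X) %*% X) %*% a          # the same value via the quadratic form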
Reducing dimensions - Finding the principal axes

The maximization (minimization) problem can now be expressed as follows. Find $a \in \mathbb{R}^p$ which maximizes (minimizes) $a^T (X^T X) a$ subject to the constraint that $a^T a = 1$.

We can write this as an unconstrained optimization by introducing a Lagrange multiplier $\lambda$. The problem then becomes to find $\lambda \in \mathbb{R}$ and $a \in \mathbb{R}^p$ which maximize (minimize)
$$a^T (X^T X) a + \lambda (1 - a^T a),$$
which we now simply differentiate with respect to $a$, set to zero, solve, etc.

Note that the objective function to be maximized (minimized) is a quadratic form in $a$. More generally, a quadratic form in $z \in \mathbb{R}^p$ can always be written as
$$Q(z) = z^T A z + b^T z + c$$
where $A \in \mathbb{R}^{p \times p}$, $b \in \mathbb{R}^p$, and $c \in \mathbb{R}$ are all constants (wlog $A = A^T$).
Reducing dimensions - Finding the principal axes

Differentiating the quadratic form $Q(z) = z^T A z + b^T z + c$ with respect to the vector $z$ gives
$$\frac{\partial}{\partial z} Q(z) = 2 A z + b.$$

For our problem, we have the variable vector $z = a$ and constants $c = \lambda$, $b = \mathbf{0}$, and $A = X^T X - \lambda I_p$. Differentiating with respect to $a$ and setting the result to $\mathbf{0}$ gives the set of equations
$$2 (X^T X - \lambda I_p) a = \mathbf{0} \quad \text{or} \quad (X^T X) a = \lambda a.$$
Differentiating with respect to $\lambda$, setting to zero, and solving yields $a^T a = 1$.

Which should look familiar . . . ?
Reducing dimensions - Finding the principal axes

The solutions to the system of equations
$$(X^T X) a = \lambda a \quad \text{and} \quad a^T a = 1$$
are an eigen-vector $a$ of the real symmetric matrix $X^T X$ and its corresponding eigen-value $\lambda$. The quadratic form we are maximizing (minimizing) is then
$$a^T (X^T X) a = \lambda a^T a = \lambda.$$
To maximize (minimize) this quadratic form, we choose the eigen-vector corresponding to the largest (smallest) eigen-value.

Denote the solution to this problem as $v_1$ (or $v_p$ for the minimization problem). The solution $v_1$ (or $v_p$) will be the eigen-vector of $X^T X$ which corresponds to its largest (or smallest) eigen-value $\lambda_1$ (or $\lambda_p$).

Putting all eigen-vectors into an orthogonal matrix $V = [v_1, \cdots, v_p]$, ordered by the eigen-values $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$, we have
$$X^T X = V D_\lambda V^T,$$
the eigen-decomposition of $X^T X$, with $D_\lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$.
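In R, the principal axes can be found with the base eigen() function; a sketch, again reusing the simulated X (for a symmetric matrix, eigen() returns the eigen-values in decreasing order):

# Sketch: the maximizing (minimizing) direction is the eigen-vector of X^T X
# with the largest (smallest) eigen-value.
eig <- eigen(t(X) %*% X, symmetric = TRUE)
lambda <- eig$values                 # lambda_1 >= ... >= lambda_p
V <- eig$vectors                     # columns v_1, ..., v_p are the principal axes
v1 <- V[, 1]                         # direction of greatest spread
vp <- V[, ncol(V)]                   # direction of least spread
t(v1) %*% (t(X) %*% X) %*% v1        # equals lambda[1]
t(vp) %*% (t(X) %*% X) %*% vp        # equals lambda[length(lambda)]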
Reducing dimensions - Finding the principal axes

The figure below shows $v_1$ (and $v_2$) for this data.

[Figure: the point cloud with the principal axes $v_1$ and $v_2$ drawn through it; $v_1$ points along the direction of greatest spread and $v_2$ is orthogonal to it.]
Reducing dimensions - Finding the principal axes

Consider a change of variables $y = V^T x$ (with $V^T V = I_p = V V^T$). A data point in the original coordinate system is
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix} = [e_1, e_2, \cdots, e_p] \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix} = V y$$
and in the new coordinate system becomes
$$x = [v_1, v_2, \cdots, v_p] \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix} = V V^T x.$$

The values $y_1, \ldots, y_p$ are coordinates on new axes $v_1, \ldots, v_p$. The axes were chosen so that the coordinates are most spread out for variable $y_1$, next for variable $y_2$, . . . , and least for $y_p$.

The axes $v_1, \ldots, v_p$ are called the principal axes and the variables $y_1, \ldots, y_p$ the principal components. (N.B. sometimes both axes and variables are called the principal components . . . sigh.)

Note that the transformed variates (the principal components) $y_i$ and $y_j$ are now uncorrelated for $i \ne j$ (since the $y$ points are now aligned along their principal axes).
Reducing dimensions - Finding the principal axes

For our example, the transform $y = V^T x$ yields
$$y_1 = v_1^T x = c_1 x_1 + c_2 x_2 + \cdots + c_p x_p$$
as a weighted linear combination of the original $x$ variables (using the entries $c_1, \ldots, c_p$ of $v_1$ as weights).

The following figure shows the points as they appear in the new coordinate system (e.g. $y_k = V^T x_k$).

[Figure: the point cloud plotted in the $(y_1, y_2)$ coordinate system, spread mainly along the $y_1$ axis.]

Note that the transformation rotates the points into position and (in this example) reflects them through one (or more) of the principal axes.
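A short sketch (reusing X and the eigen-vector matrix V from the sketches above) of the change of variables and of the claim that the principal components are uncorrelated:

# Sketch: rotate the centred data into principal-axis coordinates and check
# that the resulting components are uncorrelated.
Y <- X %*% V                         # row k is y_k^T = x_k^T V, i.e. y_k = V^T x_k
round(cov(Y), 10)                    # (near-)diagonal: off-diagonal covariances are ~ 0
apply(Y, 2, var)                     # variances decrease from y_1 to y_p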
Reducing dimensions - Finding the principal axes

For a real example, consider only the versicolor species of the iris data and the first three variates. In three dimensions, the plot looks like

library(loon)
## Loading required package: tcltk
data <- l_scale3D(iris[iris$Species == "versicolor", 1:4])
p3D <- l_plot3D(data[, 1:3], showGuides = TRUE)
plot(p3D)

[Figure: 3D scatterplot of the scaled versicolor data, with axes including Sepal.Length and Sepal.Width.]

This can now be rotated by hand to get to the principal components.
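Rather than rotating by hand, the principal axes can also be computed directly; a sketch using base R's prcomp() on the same versicolor subset (standardizing the variates, which may differ in detail from l_scale3D's scaling):

# Sketch: compute the principal components of the versicolor subset directly.
versicolor <- iris[iris$Species == "versicolor", 1:4]
pc <- prcomp(versicolor, center = TRUE, scale. = TRUE)
pc$rotation                          # columns are the principal axes v_1, ..., v_p
summary(pc)                          # standard deviations and proportions of variance
pairs(pc$x)                          # scatterplot matrix of the principal components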
Reducing dimensions

Note that
$$\lambda_j = \lambda_j v_j^T v_j = v_j^T (\lambda_j v_j) = v_j^T (X^T X) v_j = (X v_j)^T (X v_j) = z^T z = \sum_{i=1}^n z_i^2$$
for real values $z_i$, and so $\lambda_j \ge 0$ for all $j = 1, \ldots, p$.

Note that each $z_i = v_j^T x_i$ is the coordinate of the data point $x_i$ projected onto the principal axis $v_j$. So, if $\lambda_j = 0$, then the projection of every point onto the direction $v_j$ is identically 0! That is, the data lie in a space orthogonal to $v_j$.

Suppose there is a value $d < p$ such that $\lambda_1 \ge \cdots \ge \lambda_d > 0$, and $\lambda_j = 0$ for $j > d$. Then the points $x_1, \ldots, x_n$ lie in a $d$-dimensional subspace of $\mathbb{R}^p$ defined by the principal axes $v_1, \ldots, v_d$. That is, $x_i \in \mathrm{span}\{v_1, \ldots, v_d\} \subset \mathbb{R}^p$ for all $i = 1, \ldots, n$.

Question: What if we only have $\lambda_j \approx 0$ for $j > d$?

Answer: The points $x_1, \ldots, x_n$ nearly lie in a $d$-dimensional subspace. Perhaps we can reduce consideration to $d$ dimensions.
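A sketch of how the eigen-values might be inspected in practice (reusing the eigen-decomposition computed above) to judge how small $d$ can be taken:

# Sketch: decide how many principal axes to keep by how fast the
# eigen-values (all >= 0) decay.
lambda / sum(lambda)                 # proportion of total squared length per axis
cumsum(lambda) / sum(lambda)         # cumulative proportion; pick d where this is near 1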
Reducing dimensions

Let $Y^T = [y_1, \ldots, y_n]$ be the $p \times n$ matrix of the points $y_1, \ldots, y_n$ in the coordinate system of the principal axes. Note that these coordinates in the new space are simply given by
$$Y = X [v_1, \ldots, v_p] = X V.$$
The $i$th column of $Y$ is the $i$th principal component. When we want to reduce the dimensionality to only those dimensions defined by the first $d$ principal components, we could right-multiply $X$ by the $p \times d$ matrix $[v_1, \ldots, v_d]$, or equivalently simply select the first $d$ columns of $Y$.

There is a particularly handy decomposition of a real rectangular $n \times p$ matrix $X$ called the singular value decomposition:
$$X = U D_\sigma V^T$$
where $U$ is an $n \times p$ matrix with the property that $U^T U = I_p$, $V$ is a $p \times p$ matrix with $V^T V = V V^T = I_p$, and $D_\sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$.

The scalars $\sigma_i$ are called the singular values of $X$, the columns of $U$ the left singular vectors, and the columns of $V$ the right singular vectors.
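A sketch connecting the two decompositions, again with the simulated X: the right singular vectors are the principal axes, $\sigma_j^2 = \lambda_j$, and $Y = X V = U D_\sigma$:

# Sketch: the SVD of the centred data matrix gives the principal axes directly.
s <- svd(X)                          # X = U D_sigma V^T
s$d^2                                # squared singular values = eigen-values of X^T X
Y1 <- X %*% s$v                      # principal components via X V
Y2 <- s$u %*% diag(s$d)              # the same via U D_sigma
all.equal(Y1, Y2)                    # TRUE (up to numerical error)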