Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (1) and Wagner Meira Jr. (2)
(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 7: Dimensionality Reduction
Dimensionality Reduction

The goal of dimensionality reduction is to find a lower-dimensional representation of the data matrix D that avoids the curse of dimensionality.

Given an n × d data matrix, each point x_i = (x_{i1}, x_{i2}, ..., x_{id})^T is a vector in the ambient d-dimensional vector space spanned by the d standard basis vectors e_1, e_2, ..., e_d.

Given any other set of d orthonormal vectors u_1, u_2, ..., u_d, we can re-express each point x as

x = a_1 u_1 + a_2 u_2 + ... + a_d u_d

where a = (a_1, a_2, ..., a_d)^T represents the coordinates of x in the new basis. More compactly:

x = U a

where U is the d × d orthogonal matrix whose ith column comprises the ith basis vector u_i. Thus U^{-1} = U^T, and we have

a = U^T x
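As a quick numerical illustration (not from the book's code), the following minimal NumPy sketch builds an arbitrary orthonormal basis U via a QR decomposition and checks the round trip a = U^T x, x = U a; the point x and the basis are synthetic.

```python
# Minimal sketch: change of basis with an orthonormal U (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)                    # a point in the ambient d-dim space

# Orthonormal basis from the QR decomposition of a random matrix.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))

a = U.T @ x                               # coordinates of x in the new basis
x_back = U @ a                            # reconstruct x from its coordinates

print(np.allclose(x, x_back))             # True: U is orthogonal, so U^{-1} = U^T
```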
Optimal Basis: Projection in Lower Dimensional Space

There are potentially infinite choices for the orthonormal basis vectors. Our goal is to choose an optimal basis that preserves essential information about D.

We are interested in finding the optimal r-dimensional representation of D, with r ≪ d. The projection of x onto the first r basis vectors is given as

x' = a_1 u_1 + a_2 u_2 + ... + a_r u_r = \sum_{i=1}^{r} a_i u_i = U_r a_r

where U_r and a_r comprise the r basis vectors and the corresponding coordinates, respectively. Also, restricting a = U^T x to the first r terms, we have

a_r = U_r^T x

The r-dimensional projection of x is thus given as

x' = U_r U_r^T x = P_r x

where P_r = U_r U_r^T = \sum_{i=1}^{r} u_i u_i^T is the orthogonal projection matrix for the subspace spanned by the first r basis vectors.
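A minimal sketch of these identities, with a synthetic point and basis (names such as U_r and P_r mirror the slide's notation and are not library APIs):

```python
# Minimal sketch: r-dimensional projection x' = U_r U_r^T x = P_r x.
import numpy as np

rng = np.random.default_rng(1)
d, r = 4, 2
x = rng.normal(size=d)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthonormal basis

U_r = U[:, :r]                  # d x r matrix of the first r basis vectors
a_r = U_r.T @ x                 # r coordinates of x in the reduced basis
P_r = U_r @ U_r.T               # d x d orthogonal projection matrix
x_proj = P_r @ x                # projection of x onto span(u_1, ..., u_r)

print(np.allclose(x_proj, U_r @ a_r))   # True: P_r x = U_r a_r
print(np.allclose(P_r @ P_r, P_r))      # True: P_r is idempotent, as a projection must be
```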
Optimal Basis: Error Vector

Given the projected vector x' = P_r x, the corresponding error vector is the projection onto the remaining d − r basis vectors:

ε = \sum_{i=r+1}^{d} a_i u_i = x − x'

The error vector ε is orthogonal to x'.

The goal of dimensionality reduction is to seek an r-dimensional basis that gives the best possible approximation x'_i over all the points x_i ∈ D. Alternatively, we seek to minimize the error ε_i = x_i − x'_i over all the points.
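Continuing in the same spirit, a short self-contained sketch (synthetic data again) that checks the orthogonality of the error vector to the projection:

```python
# Minimal sketch: the error vector eps = x - x' is orthogonal to x'.
import numpy as np

rng = np.random.default_rng(2)
d, r = 4, 2
x = rng.normal(size=d)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
U_r = U[:, :r]

x_proj = U_r @ (U_r.T @ x)      # projection onto the first r basis vectors
eps = x - x_proj                # error: the component along the remaining d - r vectors

print(np.isclose(eps @ x_proj, 0.0))          # True: eps is orthogonal to x'
print(np.allclose(U_r.T @ eps, np.zeros(r)))  # True: eps has no component along u_1..u_r
```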
[Figure: Iris data in 3D (axes X1, X2, X3) with the optimal one-dimensional basis vector u_1.]
[Figure: Iris data in 3D (axes X1, X2, X3) with the optimal two-dimensional basis spanned by u_1 and u_2.]
Principal Component Analysis

Principal Component Analysis (PCA) is a technique that seeks an r-dimensional basis that best captures the variance in the data.

The direction with the largest projected variance is called the first principal component. The orthogonal direction that captures the second largest projected variance is called the second principal component, and so on.

The direction that maximizes the variance is also the one that minimizes the mean squared error.
Principal Component: Direction of Most Variance

We seek the unit vector u that maximizes the projected variance of the points. Let D be centered, and let Σ be its covariance matrix.

The projection of x_i on u is given as

x'_i = ( (u^T x_i) / (u^T u) ) u = (u^T x_i) u = a_i u

Across all the points, the projected variance along u is

σ²_u = (1/n) \sum_{i=1}^{n} (a_i − μ_u)² = (1/n) \sum_{i=1}^{n} u^T x_i x_i^T u = u^T ( (1/n) \sum_{i=1}^{n} x_i x_i^T ) u = u^T Σ u

where μ_u = 0 because D is centered.

We have to find the optimal basis vector u that maximizes the projected variance σ²_u = u^T Σ u, subject to the constraint that u^T u = 1. The maximization objective is given as

max_u J(u) = u^T Σ u − α (u^T u − 1)
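A minimal sketch (synthetic centered data, arbitrary unit vector u) checking numerically that the variance of the projected coordinates equals u^T Σ u:

```python
# Minimal sketch: projected variance along a unit vector u equals u^T Sigma u.
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 3
D = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated synthetic data
D = D - D.mean(axis=0)                                  # center the data

Sigma = (D.T @ D) / n                  # sample covariance matrix of centered D
u = rng.normal(size=d)
u = u / np.linalg.norm(u)              # unit vector

a = D @ u                              # projected coordinates a_i = u^T x_i
print(np.isclose(a.var(), u @ Sigma @ u))   # True: projected variance = u^T Sigma u
```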
Principal Component: Direction of Most Variance

Given the objective max_u J(u) = u^T Σ u − α (u^T u − 1), we solve it by setting the derivative of J(u) with respect to u to the zero vector, to obtain

∂/∂u [ u^T Σ u − α (u^T u − 1) ] = 0

that is,

2 Σ u − 2 α u = 0, which implies Σ u = α u

Thus α is an eigenvalue of the covariance matrix Σ, with the associated eigenvector u. Taking the dot product with u on both sides, we have

σ²_u = u^T Σ u = u^T α u = α u^T u = α

To maximize the projected variance σ²_u, we thus choose the largest eigenvalue λ_1 of Σ, and the dominant eigenvector u_1 specifies the direction of most variance, also called the first principal component.
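A minimal sketch of this eigenvalue route to the first principal component, using NumPy's symmetric eigendecomposition on synthetic centered data (not the book's code):

```python
# Minimal sketch: first principal component = dominant eigenvector of Sigma.
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 3
D = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
D = D - D.mean(axis=0)                  # center the data

Sigma = (D.T @ D) / n                   # covariance matrix
evals, evecs = np.linalg.eigh(Sigma)    # eigh: symmetric input, ascending eigenvalues

lambda1 = evals[-1]                     # largest eigenvalue
u1 = evecs[:, -1]                       # dominant eigenvector: first principal component

# The variance of the data projected onto u1 equals lambda1.
print(np.isclose((D @ u1).var(), lambda1))
```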
[Figure: Iris data in 3D (axes X1, X2, X3) with the first principal component u_1.]
Minimum Squared Error Approach

The direction that maximizes the projected variance is also the one that minimizes the average squared error. The mean squared error (MSE) objective is

MSE(u) = (1/n) \sum_{i=1}^{n} ||ε_i||² = (1/n) \sum_{i=1}^{n} ||x_i − x'_i||² = (1/n) \sum_{i=1}^{n} ||x_i||² − u^T Σ u

Since the first term is fixed for a given dataset D, we see that the direction u_1 that maximizes the variance is also the one that minimizes the MSE. Further,

(1/n) \sum_{i=1}^{n} ||x_i||² = var(D) = tr(Σ) = \sum_{i=1}^{d} σ²_i

Thus, for the direction u_1 that minimizes the MSE, we have

MSE(u_1) = var(D) − u_1^T Σ u_1 = var(D) − λ_1
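A minimal sketch verifying MSE(u_1) = var(D) − λ_1 numerically on synthetic centered data (variable names are illustrative):

```python
# Minimal sketch: MSE of projecting onto u1 equals var(D) - lambda_1.
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 3
D = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
D = D - D.mean(axis=0)                              # center the data

Sigma = (D.T @ D) / n
evals, evecs = np.linalg.eigh(Sigma)
lambda1, u1 = evals[-1], evecs[:, -1]               # largest eigenpair

# Mean squared error of projecting every point onto u1.
D_proj = np.outer(D @ u1, u1)                       # rows are x_i' = (u1^T x_i) u1
mse = np.mean(np.sum((D - D_proj) ** 2, axis=1))

total_var = np.trace(Sigma)                         # var(D) = tr(Sigma)
print(np.isclose(mse, total_var - lambda1))         # True
```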