Principal component analysis
Course of Machine Learning, Master Degree in Computer Science
University of Rome “Tor Vergata”
Giorgio Gambosi, a.a. 2018-2019
Curse of dimensionality
In general, many features means high-dimensional spaces, and high dimensions lead to difficulties in machine learning algorithms (lower reliability, or need of a large number of coefficients); this is denoted as the curse of dimensionality:
• sparseness of data
• increase in the number of coefficients; for example, for dimension D and order 3 of the polynomial,
  y(x, w) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij} x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{k=1}^{D} w_{ijk} x_i x_j x_k
  and, for order M, the number of coefficients is O(D^M)
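As a rough illustration of this growth (a sketch, not part of the original slides), the following Python snippet counts the number of distinct monomials of degree at most M in D variables; like the full coefficient tensors of the expansion above, this count grows on the order of D^M.

```python
# Count how many coefficients an order-M polynomial in D variables needs.
# The number of distinct monomials of degree <= M is C(D + M, M) = O(D^M).
from math import comb

def n_coefficients(D, M):
    # distinct monomials of degree at most M in D variables
    return comb(D + M, M)

for D in (10, 100, 784):            # 784 = 28 x 28, as in the digit example later
    print(D, n_coefficients(D, 3), D ** 3)
```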
Dimensionality reduction
• for any given classifier, the training set size required to obtain a certain accuracy grows exponentially wrt the number of features (curse of dimensionality)
• it is important to bound the number of features, identifying and discarding the least discriminant ones
Discriminant features
• Discriminant feature: makes it possible to distinguish between two classes
• Non-discriminant feature: does not allow classes to be distinguished
Searching hyperplanes for the dataset
• verify whether training set elements lie on a hyperplane (a space of lower dimensionality), apart from a limited variability (which could be seen as noise)
• principal component analysis looks for a d′-dimensional subspace (d′ < d) such that the projection of elements onto such a subspace is a “faithful” representation of the original dataset
• by a “faithful” representation we mean that distances between elements and their projections are small, possibly minimal
PCA for d′ = 0
• Objective: represent all d-dimensional vectors x_1, ..., x_n by means of a unique vector x_0, in the most faithful way, that is, so that
  J(x_0) = \sum_{i=1}^{n} ||x_0 - x_i||^2
  is minimum
• it is easy to show that J(x_0) is minimum for
  x_0 = m = \frac{1}{n} \sum_{i=1}^{n} x_i
PCA for d′ = 0
• In fact,
  J(x_0) = \sum_{i=1}^{n} ||(x_0 - m) - (x_i - m)||^2
         = \sum_{i=1}^{n} ||x_0 - m||^2 - 2 \sum_{i=1}^{n} (x_0 - m)^T (x_i - m) + \sum_{i=1}^{n} ||x_i - m||^2
         = \sum_{i=1}^{n} ||x_0 - m||^2 - 2 (x_0 - m)^T \sum_{i=1}^{n} (x_i - m) + \sum_{i=1}^{n} ||x_i - m||^2
         = \sum_{i=1}^{n} ||x_0 - m||^2 + \sum_{i=1}^{n} ||x_i - m||^2
• since
  \sum_{i=1}^{n} (x_i - m) = \sum_{i=1}^{n} x_i - n \cdot m = n \cdot m - n \cdot m = 0
• the second term is independent of x_0, while the first one is equal to zero for x_0 = m
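This fact can be checked numerically; the sketch below (on synthetic Gaussian data, an assumption made only for illustration) compares J at the sample mean with J at randomly perturbed points.

```python
# Check that the sample mean m minimizes J(x0) = sum_i ||x0 - x_i||^2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # n = 100 vectors in d = 5 dimensions
m = X.mean(axis=0)

def J(x0):
    return np.sum(np.linalg.norm(x0 - X, axis=1) ** 2)

# J at the mean is never larger than at nearby perturbed points
best_perturbed = min(J(m + rng.normal(scale=0.1, size=5)) for _ in range(20))
print(J(m), best_perturbed)
```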
PCA for d′ = 1
• a single vector is too concise a representation of the dataset: anything related to data variability gets lost
• a more interesting case is the one where vectors are projected onto a line passing through m
PCA for d′ = 1
• let u_1 be a unit vector (||u_1|| = 1) in the line direction: the line equation is x = \alpha u_1 + m, where \alpha is the distance of x from m along the line
• let \tilde{x}_i = \alpha_i u_1 + m be the projection of x_i (i = 1, ..., n) onto the line: given x_1, ..., x_n, we wish to find the set of projections minimizing the quadratic error
PCA for d′ = 1
The quadratic error is defined as
  J(\alpha_1, ..., \alpha_n, u_1) = \sum_{i=1}^{n} ||\tilde{x}_i - x_i||^2
    = \sum_{i=1}^{n} ||(m + \alpha_i u_1) - x_i||^2
    = \sum_{i=1}^{n} ||\alpha_i u_1 - (x_i - m)||^2
    = \sum_{i=1}^{n} \alpha_i^2 ||u_1||^2 - 2 \sum_{i=1}^{n} \alpha_i u_1^T (x_i - m) + \sum_{i=1}^{n} ||x_i - m||^2
    = \sum_{i=1}^{n} \alpha_i^2 - 2 \sum_{i=1}^{n} \alpha_i u_1^T (x_i - m) + \sum_{i=1}^{n} ||x_i - m||^2
PCA for d′ = 1
Its derivative wrt \alpha_k is
  \frac{\partial}{\partial \alpha_k} J(\alpha_1, ..., \alpha_n, u_1) = 2 \alpha_k - 2 u_1^T (x_k - m)
which is zero when \alpha_k = u_1^T (x_k - m) (the orthogonal projection of x_k onto the line).
The second derivative
  \frac{\partial^2}{\partial \alpha_k^2} J(\alpha_1, ..., \alpha_n, u_1) = 2
turns out to be positive, showing that what we have found is indeed a minimum.
PCA for d′ = 1
To derive the best direction u_1 of the line, we consider the covariance matrix of the dataset
  S = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)(x_i - m)^T
By plugging the values computed for \alpha_i into the definition of J(\alpha_1, ..., \alpha_n, u_1), we get
  J(u_1) = \sum_{i=1}^{n} \alpha_i^2 - 2 \sum_{i=1}^{n} \alpha_i^2 + \sum_{i=1}^{n} ||x_i - m||^2
    = - \sum_{i=1}^{n} [u_1^T (x_i - m)]^2 + \sum_{i=1}^{n} ||x_i - m||^2
    = - \sum_{i=1}^{n} u_1^T (x_i - m)(x_i - m)^T u_1 + \sum_{i=1}^{n} ||x_i - m||^2
    = - n u_1^T S u_1 + \sum_{i=1}^{n} ||x_i - m||^2
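The last identity can be verified numerically; the sketch below (synthetic data and an arbitrary unit direction u_1, both assumptions made for illustration) checks that the reconstruction error equals -n u_1^T S u_1 + \sum_i ||x_i - m||^2.

```python
# Verify J(u1) = -n * u1^T S u1 + sum_i ||x_i - m||^2 for an arbitrary unit u1.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
n, d = X.shape
m = X.mean(axis=0)
S = (X - m).T @ (X - m) / n            # covariance matrix as defined above

u1 = rng.normal(size=d)
u1 /= np.linalg.norm(u1)               # unit direction of the line

alpha = (X - m) @ u1                   # alpha_i = u1^T (x_i - m)
X_proj = m + np.outer(alpha, u1)       # projections x~_i onto the line
J_direct = np.sum(np.linalg.norm(X_proj - X, axis=1) ** 2)
J_formula = -n * (u1 @ S @ u1) + np.sum(np.linalg.norm(X - m, axis=1) ** 2)
print(np.isclose(J_direct, J_formula))  # True
```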
PCA for d′ = 1
• u_1^T (x_i - m) is the projection of x_i onto the line
• the product u_1^T (x_i - m)(x_i - m)^T u_1 is then the variance of the projection of x_i wrt the mean m
• the sum
  \sum_{i=1}^{n} u_1^T (x_i - m)(x_i - m)^T u_1 = n u_1^T S u_1
  is the overall variance of the projections of the vectors x_i wrt the mean m
PCA for d′ = 1
Minimizing J(u_1) is equivalent to maximizing u_1^T S u_1. That is, J(u_1) is minimum if u_1 is the direction which keeps the maximum amount of variance in the dataset.
Hence, we wish to maximize u_1^T S u_1 (wrt u_1), with the constraint ||u_1|| = 1.
By applying Lagrange multipliers, this is equivalent to maximizing
  u = u_1^T S u_1 - \lambda_1 (u_1^T u_1 - 1)
This can be done by setting the first derivative wrt u_1 to 0,
  \frac{\partial u}{\partial u_1} = 2 S u_1 - 2 \lambda_1 u_1 = 0
obtaining
  S u_1 = \lambda_1 u_1
PCA for d′ = 1
Note that:
• u is maximized if u_1 is an eigenvector of S
• u_1^T S u_1 = u_1^T \lambda_1 u_1 = \lambda_1 u_1^T u_1 = \lambda_1
• the overall variance of the projections is then equal to the corresponding eigenvalue
• the variance of the projections is then maximized (and the error minimized) if u_1 is the eigenvector of S corresponding to the maximum eigenvalue \lambda_1
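A quick numerical illustration of this result (on an assumed 2-dimensional Gaussian sample, used only as an example): the leading eigenvector of S maximizes the projected variance, and that variance coincides with the largest eigenvalue \lambda_1.

```python
# The eigenvector of S with the largest eigenvalue maximizes the projected
# variance, and that variance equals the eigenvalue itself.
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 1.0]], size=1000)
m = X.mean(axis=0)
S = (X - m).T @ (X - m) / len(X)

eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
u1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
proj_var = np.mean(((X - m) @ u1) ** 2)
print(eigvals[-1], proj_var)           # the two values coincide
```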
PCA for d′ > 1
• The quadratic error is minimized by projecting vectors onto a hyperplane defined by the directions associated to the d′ eigenvectors corresponding to the d′ largest eigenvalues of S
• The projections of vectors onto that hyperplane are distributed as a d′-dimensional distribution which keeps the maximum possible amount of data variability
• If we assume data are modeled by a d-dimensional gaussian distribution with mean \mu and covariance matrix \Sigma, PCA returns a d′-dimensional subspace corresponding to the hyperplane defined by the eigenvectors associated to the d′ largest eigenvalues of \Sigma
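The construction for d′ > 1 can be sketched in a few lines of Python (function and variable names are illustrative, not from the slides): project the centered data onto the top-d′ eigenvectors of S, then reconstruct in the original space.

```python
# PCA for d' > 1: project onto the d' eigenvectors of S with largest eigenvalues.
import numpy as np

def pca_project(X, d_prime):
    m = X.mean(axis=0)
    S = (X - m).T @ (X - m) / len(X)
    eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    U = eigvecs[:, -d_prime:][:, ::-1]          # top-d' directions, largest first
    Z = (X - m) @ U                             # d'-dimensional coordinates
    X_rec = m + Z @ U.T                         # reconstruction in the original space
    return Z, X_rec

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
Z, X_rec = pca_project(X, d_prime=3)
print(Z.shape, np.sum((X - X_rec) ** 2))        # (500, 3) and the residual quadratic error
```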
An example of PCA
• Digit recognition (D = 28 × 28 = 784)
Choosing d′
The eigenvalue size distribution is usually characterized by a fast initial decrease, followed by a slow one.
This makes it possible to identify the number of eigenvalues to keep, and thus the dimensionality of the projections.
Choosing d′
Eigenvalues measure the amount of distribution variance kept in the projection.
Let us consider, for each k < d, the value
  r_k = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{d} \lambda_i^2}
which provides a measure of the variance fraction associated to the k largest eigenvalues.
When r_1 < ... < r_d are known, a certain fraction p of variance can be kept by setting
  d′ = \min \{ i \in \{1, ..., d\} : r_i > p \}
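A short sketch of this rule (the eigenvalue spectrum below is assumed, purely for illustration), computing the ratios r_k exactly as defined above and picking the smallest index exceeding p:

```python
# Choose d' as the smallest k whose variance ratio r_k exceeds a target fraction p.
import numpy as np

eigvals = np.array([5.2, 3.1, 1.4, 0.6, 0.2, 0.05])   # assumed spectrum, sorted descending
r = np.cumsum(eigvals ** 2) / np.sum(eigvals ** 2)     # r_k as defined on the slide
p = 0.95
d_prime = int(np.argmax(r > p)) + 1                    # first index (1-based) with r_k > p
print(r, d_prime)
```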
Singular value decomposition
Singular Value Decomposition
Let W ∈ ℝ^{n×m} be a matrix of rank r ≤ min(n, m), and let n > m. Then, there exist
• U ∈ ℝ^{n×r} orthonormal (that is, U^T U = I_r)
• V ∈ ℝ^{m×r} orthonormal (that is, V^T V = I_r)
• Σ ∈ ℝ^{r×r} diagonal
such that
  W = U Σ V^T
where W is n × m, U is n × r, Σ is r × r, and V^T is r × m.
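This factorization is available numerically; a minimal sketch with NumPy (assuming W has full column rank, so r = m and the thin SVD returned by the library matches the statement above):

```python
# Thin SVD: W = U Sigma V^T with orthonormal U, V and diagonal Sigma.
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(8, 3))                    # n = 8 > m = 3, rank r = 3
U, s, Vt = np.linalg.svd(W, full_matrices=False)
Sigma = np.diag(s)

print(np.allclose(W, U @ Sigma @ Vt))          # W = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(3)))         # U orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(3)))       # V orthonormal
```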
SVD in greater detail
Let us consider the matrix A = W^T W ∈ ℝ^{m×m}. Observe that:
• by definition, A has the same rank as W, that is, r
• A is symmetric: in fact, a_{ij} = w_i^T w_j by definition, where w_k is the k-th column of W; by the commutativity of the scalar product, a_{ij} = w_i^T w_j = w_j^T w_i = a_{ji}
• A is positive semidefinite, that is, x^T A x ≥ 0 for all non-null x ∈ ℝ^m:
  x^T A x = x^T (W^T W) x = (W x)^T (W x) = ||W x||^2 ≥ 0
SVD in greater detail
All eigenvalues of A are real. In fact,
• let \lambda ∈ ℂ be an eigenvalue of A, and let v ∈ ℂ^m be a corresponding eigenvector: then, A v = \lambda v and \bar{v}^T A v = \bar{v}^T \lambda v = \lambda \bar{v}^T v
• observe that, in general, it must also be that the complex conjugates \bar{\lambda} and \bar{v} are themselves an eigenvalue-eigenvector pair for A: then, A \bar{v} = \bar{\lambda} \bar{v}. Since
  \bar{\lambda} \bar{v}^T = (\bar{\lambda} \bar{v})^T = (A \bar{v})^T = \bar{v}^T A^T = \bar{v}^T A
  by the symmetry of A, it derives that \bar{v}^T A v = \bar{\lambda} \bar{v}^T v
• as a consequence, \lambda \bar{v}^T v = \bar{\lambda} \bar{v}^T v, that is, \lambda ||v||^2 = \bar{\lambda} ||v||^2
• since v ≠ 0 (being an eigenvector), it must be \lambda = \bar{\lambda}, hence \lambda ∈ ℝ
SVD in greater detail
The eigenvectors of A corresponding to different eigenvalues are orthogonal.
• let v_1, v_2 ∈ ℂ^m be two eigenvectors, with corresponding distinct eigenvalues \lambda_1, \lambda_2
• then, by the symmetry of A,
  \lambda_1 (v_1^T v_2) = (\lambda_1 v_1)^T v_2 = (A v_1)^T v_2 = v_1^T A^T v_2 = v_1^T A v_2 = v_1^T \lambda_2 v_2 = \lambda_2 (v_1^T v_2)
• as a consequence, (\lambda_1 - \lambda_2) v_1^T v_2 = 0
• since \lambda_1 ≠ \lambda_2, it must be v_1^T v_2 = 0, that is, v_1, v_2 must be orthogonal
If an eigenvalue \lambda' has multiplicity m > 1, it is always possible to find a set of m orthonormal eigenvectors associated to \lambda'. As a result, there exists a set of eigenvectors of A which provides an orthonormal basis.
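A numerical illustration of these properties (a sketch on a random matrix, not from the slides): the eigenvalues of A = W^T W are real and non-negative, its eigenvectors form an orthonormal basis, and they relate to the SVD of W in that the squared singular values of W are exactly the eigenvalues of A.

```python
# A = W^T W is symmetric positive semidefinite with an orthonormal eigenbasis;
# its eigenvalues are the squared singular values of W.
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(size=(10, 4))
A = W.T @ W

eigvals, V = np.linalg.eigh(A)                 # real eigenvalues, orthonormal eigenvectors
print(np.all(eigvals >= -1e-12))               # non-negative up to round-off
print(np.allclose(V.T @ V, np.eye(4)))         # orthonormal basis

_, s, _ = np.linalg.svd(W)
print(np.allclose(np.sort(s ** 2), eigvals))   # eigenvalues of A = squared singular values
```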