Principal Component Analysis and Autoencoders
Shuiwang Ji
Department of Computer Science & Engineering, Texas A&M University
Orthogonal Matrices

1. An orthogonal matrix is a square matrix whose columns and rows are orthogonal unit vectors, i.e., orthonormal vectors. That is, if a matrix $Q$ is orthogonal, we have $Q^T Q = Q Q^T = I$.
2. It follows that $Q^{-1} = Q^T$, a very useful property as it provides an easy way to compute the inverse.
3. For an orthogonal $n \times n$ matrix $Q = [q_1, q_2, \dots, q_n]$, where $q_i \in \mathbb{R}^n$, $i = 1, 2, \dots, n$, it is easy to see that $q_i^T q_j = 0$ when $i \neq j$ and $q_i^T q_i = 1$.
4. Furthermore, suppose $Q_1 = [q_1, q_2, \dots, q_i]$ and $Q_2 = [q_{i+1}, q_{i+2}, \dots, q_n]$. We have $Q_1^T Q_1 = I$ and $Q_2^T Q_2 = I$, but $Q_1 Q_1^T \neq I$ and $Q_2 Q_2^T \neq I$.
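As an illustration (not part of the original slides), the following minimal NumPy sketch builds an orthogonal matrix from a QR decomposition of a random matrix and checks the properties above; all names and sizes are illustrative.

```python
import numpy as np

# Build an orthogonal Q via QR decomposition of a random square matrix.
n = 5
Q, _ = np.linalg.qr(np.random.randn(n, n))  # Q has orthonormal columns

I = np.eye(n)
print(np.allclose(Q.T @ Q, I))             # True
print(np.allclose(Q @ Q.T, I))             # True
print(np.allclose(np.linalg.inv(Q), Q.T))  # True: the inverse is just the transpose

# A tall sub-block Q1 satisfies Q1^T Q1 = I but not Q1 Q1^T = I.
Q1 = Q[:, :3]
print(np.allclose(Q1.T @ Q1, np.eye(3)))   # True
print(np.allclose(Q1 @ Q1.T, I))           # False: Q1 Q1^T is a projection, not I
```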
Eigen-Decomposition

1. A square $n \times n$ matrix $S$ with $n$ linearly independent eigenvectors can be factorized as $S = Q \Lambda Q^{-1}$, where $Q$ is the square $n \times n$ matrix whose columns are eigenvectors of $S$, and $\Lambda$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues.
2. Note that only diagonalizable matrices can be factorized in this way.
3. If $S$ is a symmetric matrix, its eigenvectors are orthogonal. Thus $Q$ is an orthogonal matrix and we have $S = Q \Lambda Q^T$.
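A small NumPy sketch, assuming a randomly generated symmetric matrix, that verifies the factorization $S = Q \Lambda Q^T$ with orthogonal $Q$:

```python
import numpy as np

n = 4
A = np.random.randn(n, n)
S = A + A.T                      # make S symmetric

eigvals, Q = np.linalg.eigh(S)   # eigh returns orthonormal eigenvectors for symmetric input
Lam = np.diag(eigvals)

print(np.allclose(Q @ Lam @ Q.T, S))    # True: S = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(n)))  # True: Q is orthogonal
```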
Singular Value Decomposition

The singular value decomposition (SVD) of an $m \times n$ real matrix $R$ (without loss of generality, we assume $m \geq n$) can be written as $R = U \tilde{\Sigma} V^T$, where $U$ is an orthogonal $m \times m$ matrix, $V$ is an orthogonal $n \times n$ matrix, and $\tilde{\Sigma}$ is an $m \times n$ diagonal matrix with non-negative real values on its diagonal. That is, $U^T U = U U^T = I_{m \times m}$, $V^T V = V V^T = I_{n \times n}$, and
$$\tilde{\Sigma} = \begin{bmatrix} \Sigma_{n \times n} \\ 0 \end{bmatrix}_{m \times n}, \qquad \Sigma_{n \times n} = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n \end{bmatrix}, \qquad (1)$$
where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0$ are known as singular values. If $\mathrm{rank}(R) = r$ ($r \leq n$), we have $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ and $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_n = 0$.
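A hedged NumPy sketch of the full SVD on a random matrix (shapes and names are illustrative, not from the slides):

```python
import numpy as np

m, n = 6, 4
R = np.random.randn(m, n)

U, s, Vt = np.linalg.svd(R, full_matrices=True)  # U: m x m, s: length n, Vt: n x n
Sigma_tilde = np.zeros((m, n))
Sigma_tilde[:n, :n] = np.diag(s)                 # Sigma on top, zero rows below

print(np.allclose(U @ Sigma_tilde @ Vt, R))      # True: R = U Sigma_tilde V^T
print(np.all(s[:-1] >= s[1:]), np.all(s >= 0))   # singular values are sorted and non-negative
```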
Relation to Eigen-Decomposition

The columns of $U$ (left-singular vectors) are orthonormal eigenvectors of $R R^T$, and the columns of $V$ (right-singular vectors) are orthonormal eigenvectors of $R^T R$. In other words, we have $R R^T = U \Lambda_1 U^{-1}$ and $R^T R = V \Lambda_2 V^{-1}$, where $\Lambda_1$ and $\Lambda_2$ share the same nonzero diagonal entries $\sigma_i^2$. It is easy to verify this, as we have
$$R^T R = (U \tilde{\Sigma} V^T)^T (U \tilde{\Sigma} V^T) = V \begin{bmatrix} \Sigma & 0 \end{bmatrix} U^T U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T = V \Sigma^2 V^T,$$
$$R R^T = (U \tilde{\Sigma} V^T)(U \tilde{\Sigma} V^T)^T = U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T V \begin{bmatrix} \Sigma & 0 \end{bmatrix} U^T = U \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix} U^T,$$
and $V^T = V^{-1}$, $U^T = U^{-1}$.
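The sketch below (illustrative, assuming NumPy and random data) checks that the squared singular values of $R$ are eigenvalues of $R^T R$ and that each right-singular vector is a corresponding eigenvector:

```python
import numpy as np

m, n = 6, 4
R = np.random.randn(m, n)
U, s, Vt = np.linalg.svd(R, full_matrices=False)

eigvals, _ = np.linalg.eigh(R.T @ R)        # eigenvalues in ascending order
print(np.allclose(np.sort(s**2), eigvals))  # sigma_i^2 are the eigenvalues of R^T R

# Each right-singular vector v_i satisfies (R^T R) v_i = sigma_i^2 v_i.
for i in range(n):
    v = Vt[i]
    print(np.allclose(R.T @ R @ v, (s[i]**2) * v))  # True for every i
```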
SVD and eigen-decomposition

1. Under what conditions are the SVD and the eigen-decomposition the same? First, $R$ must be a symmetric matrix, i.e., $R = R^T$. Second, $R$ must be positive semi-definite, i.e., $\forall x \in \mathbb{R}^n$, $x^T R x \geq 0$.
2. The difference between $\Lambda$ in the eigen-decomposition and $\Sigma$ in the SVD is that the diagonal entries of $\Lambda$ can be negative, while the diagonal entries of $\Sigma$ are non-negative. What are the fundamental reasons underlying this difference? Why do the requirements on the singular values in the SVD (non-negative and in sorted order) not limit the generality of the SVD?
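As a small numerical illustration of the first point (not from the slides), the sketch below constructs a symmetric positive semi-definite matrix and checks that its singular values coincide with its eigenvalues:

```python
import numpy as np

n = 4
B = np.random.randn(n, n)
R = B @ B.T                       # symmetric positive semi-definite by construction

eigvals = np.linalg.eigvalsh(R)   # ascending, all >= 0 (up to round-off)
_, s, _ = np.linalg.svd(R)        # descending, all >= 0

print(np.allclose(np.sort(s), eigvals))  # singular values equal eigenvalues here

# For a symmetric but indefinite matrix, a negative eigenvalue lambda instead
# appears in the SVD as the singular value |lambda|, so the two factorizations differ.
```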
Compact SVD

If $\mathrm{rank}(R) = r$ ($r \leq n$), we have
$$R = U \tilde{\Sigma} V^T = [u_1, u_2, \dots, u_r, \dots, u_m] \begin{bmatrix} \sigma_1 & & & & \\ & \ddots & & & \\ & & \sigma_r & & \\ & & & 0 & \\ & & & & \ddots \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{bmatrix}.$$
By removing the zero components, we obtain the compact SVD
$$R = U_r \Sigma_r V_r^T = [u_1, u_2, \dots, u_r] \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_r^T \end{bmatrix} = [\sigma_1 u_1, \sigma_2 u_2, \dots, \sigma_r u_r] \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_r^T \end{bmatrix} = \sum_{i=1}^r \sigma_i u_i v_i^T,$$
where $\mathrm{rank}(\sigma_i u_i v_i^T) = 1$, $i = 1, 2, \dots, r$.
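A minimal sketch, assuming NumPy and a randomly constructed rank-$r$ matrix, showing that the compact SVD and the sum of rank-1 terms both reconstruct $R$:

```python
import numpy as np

m, n, r = 8, 5, 3
R = np.random.randn(m, r) @ np.random.randn(r, n)   # rank r by construction

U, s, Vt = np.linalg.svd(R, full_matrices=False)
Ur, sr, Vtr = U[:, :r], s[:r], Vt[:r]               # compact SVD factors

print(np.allclose(Ur @ np.diag(sr) @ Vtr, R))       # True: R = U_r Sigma_r V_r^T

# Equivalent rank-1 expansion: R = sum_i sigma_i u_i v_i^T
R_sum = sum(sr[i] * np.outer(Ur[:, i], Vtr[i]) for i in range(r))
print(np.allclose(R_sum, R))                        # True
```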
Truncated SVD and Best Low-Rank Approximation

We can also approximate the matrix $R$ using the $k$ largest singular values as
$$R_k = U_k \Sigma_k V_k^T = \sum_{i=1}^k \sigma_i u_i v_i^T.$$
Apparently, $R \neq R_k$ unless $\mathrm{rank}(R) \leq k$. This approximation is the best in the following sense:
$$\min_{B:\, \mathrm{rank}(B) \leq k} \|R - B\|_F = \|R - R_k\|_F = \sqrt{\sum_{i=k+1}^n \sigma_i^2},$$
$$\min_{B:\, \mathrm{rank}(B) \leq k} \|R - B\|_2 = \|R - R_k\|_2 = \sigma_{k+1},$$
where $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|_2$ denotes the spectral norm, defined as the largest singular value of the matrix. That is, $R_k$ is the best rank-$k$ approximation to $R$ in terms of both the Frobenius norm and the spectral norm. Note the difference in approximation errors when different matrix norms are used.
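A hedged NumPy sketch (random data, illustrative sizes) computing the rank-$k$ truncation and checking both error formulas:

```python
import numpy as np

m, n, k = 10, 6, 2
R = np.random.randn(m, n)
U, s, Vt = np.linalg.svd(R, full_matrices=False)

R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]            # best rank-k approximation

fro_err = np.linalg.norm(R - R_k, 'fro')
spec_err = np.linalg.norm(R - R_k, 2)               # spectral norm

print(np.isclose(fro_err, np.sqrt(np.sum(s[k:]**2))))  # sqrt of sum of discarded sigma_i^2
print(np.isclose(spec_err, s[k]))                      # sigma_{k+1} (s is 0-indexed)
```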
What is PCA?

1. Principal Component Analysis (PCA) is a statistical procedure that can be used to achieve feature (dimensionality) reduction.
2. Note that feature reduction is different from feature selection. Feature reduction still uses all of the original features, combining them into a smaller set of new features, while feature selection keeps only a subset of the original features.
3. The goal of PCA is to project the high-dimensional features onto a lower-dimensional space with maximal variance and minimal reconstruction error simultaneously.
4. We derive PCA by maximizing the variance, and then show that the solution also minimizes the reconstruction error.
5. In machine learning, PCA is an unsupervised learning technique, and therefore does not need labels.
PCA to 1D

1. To introduce PCA, we start from the simple case where PCA projects the features onto a 1-dimensional space.
2. Formally, suppose we have $n$ $p$-dimensional ($p > 1$) features $x_1, x_2, \dots, x_n \in \mathbb{R}^p$.
3. Let $a \in \mathbb{R}^p$ represent a projection such that $a^T x_i = z_i$, $i = 1, 2, \dots, n$, where $z_1, z_2, \dots, z_n \in \mathbb{R}$.
4. PCA aims to solve
$$a^* = \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^n (z_i - \bar{z})^2.$$
5. Note that $\frac{1}{n} \sum_{i=1}^n (z_i - \bar{z})^2$ is the variance of the reduced data, which means that PCA tries to find the projection that maximizes the variance of the reduced data.
PCA to 1D

Since
$$\bar{z} = \frac{1}{n} \sum_{i=1}^n z_i = \frac{1}{n} \sum_{i=1}^n a^T x_i = a^T \left( \frac{1}{n} \sum_{i=1}^n x_i \right) = a^T \bar{x},$$
the problem can be written as
$$\begin{aligned}
a^* &= \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^n (z_i - \bar{z})^2 \\
&= \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^n (a^T x_i - \bar{z})^2 \\
&= \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^n (a^T x_i - a^T \bar{x})^2 \\
&= \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^n a^T (x_i - \bar{x})(x_i - \bar{x})^T a \\
&= \arg\max_{\|a\| = 1} a^T \underbrace{\left( \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T \right)}_{p \times p \text{ covariance matrix}} a \\
&= \arg\max_{\|a\| = 1} a^T C a,
\end{aligned}$$
where $C = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$ denotes the covariance matrix.
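A minimal NumPy sketch (random data; names are illustrative) showing that the 1-D projection maximizing the variance is the top eigenvector of the covariance matrix, and that the resulting variance equals the top eigenvalue:

```python
import numpy as np

n, p = 200, 5
X = np.random.randn(n, p) @ np.random.randn(p, p)   # rows of X are the samples x_i

x_bar = X.mean(axis=0)
C = (X - x_bar).T @ (X - x_bar) / n                 # p x p covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)                # ascending eigenvalues
a = eigvecs[:, -1]                                  # top eigenvector, ||a|| = 1

z = X @ a                                           # projected 1-D data z_i = a^T x_i
print(np.isclose(z.var(), eigvals[-1]))             # max variance = top eigenvalue
```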
PCA to a $k$-dimensional space

1. What if we want to project the features to a $k$-dimensional space? Then the PCA problem becomes
$$A^* = \arg\max_{A \in \mathbb{R}^{p \times k}:\, A^T A = I_k} \mathrm{trace}\left( A^T C A \right), \qquad (2)$$
where $A = [a_1, a_2, \dots, a_k] \in \mathbb{R}^{p \times k}$. Note that when projecting onto a $k$-dimensional space, PCA requires the different projection vectors to be orthogonal. Also, the trace above is the sum of the variances after projecting the data onto each of the $k$ directions, as
$$\mathrm{trace}\left( A^T C A \right) = \sum_{i=1}^k a_i^T C a_i.$$
Ky Fan Theorem

1. Solving the problem in Eqn. (2) requires the following theorem.
2. Theorem (Ky Fan). Let $H \in \mathbb{R}^{n \times n}$ be a symmetric matrix with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ and corresponding eigenvectors $U = [u_1, \dots, u_n]$. Then
$$\lambda_1 + \cdots + \lambda_k = \max_{A \in \mathbb{R}^{n \times k}:\, A^T A = I_k} \mathrm{trace}\left( A^T H A \right),$$
and the optimal $A^*$ is given by $A^* = [u_1, \dots, u_k] Q$, with $Q$ an arbitrary $k \times k$ orthogonal matrix. $\square$
Solutions to PCA

1. Note that in Eqn. (2), the covariance matrix $C$ is a symmetric matrix. Given the above theorem, we directly obtain
$$\lambda_1 + \cdots + \lambda_k = \max_{A \in \mathbb{R}^{p \times k}:\, A^T A = I_k} \mathrm{trace}\left( A^T C A \right), \qquad A^* = [u_1, \dots, u_k] Q,$$
where $\lambda_1, \dots, \lambda_k$ are the $k$ largest eigenvalues of the covariance matrix $C$, and the solution $A^*$ is the matrix whose columns are the corresponding eigenvectors (up to the orthogonal matrix $Q$).
2. It also follows from the above theorem that solutions to PCA are not unique; they differ by an orthogonal matrix. We use the special case where $Q = I$, i.e., $A^* = [u_1, \dots, u_k]$.
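Putting the pieces together, here is a minimal sketch of PCA to $k$ dimensions via the top-$k$ eigenvectors of the covariance matrix, assuming NumPy and random data; the helper function `pca` is a hypothetical name introduced only for illustration.

```python
import numpy as np

def pca(X, k):
    """Rows of X are samples; returns projection matrix A (p x k) and scores Z (n x k)."""
    x_bar = X.mean(axis=0)
    C = (X - x_bar).T @ (X - x_bar) / X.shape[0]    # p x p covariance matrix
    _, eigvecs = np.linalg.eigh(C)                  # eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :k]                     # k eigenvectors with largest eigenvalues
    return A, (X - x_bar) @ A

X = np.random.randn(300, 10) @ np.random.randn(10, 10)
A, Z = pca(X, k=3)

# Captured variance trace(A^T C A) equals the sum of the k largest eigenvalues (Ky Fan).
C = np.cov(X.T, bias=True)
top_k_sum = np.sort(np.linalg.eigvalsh(C))[::-1][:3].sum()
print(np.isclose(np.trace(A.T @ C @ A), top_k_sum))  # True
```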