  1. Principal Component Analysis and Autoencoders. Shuiwang Ji, Department of Computer Science & Engineering, Texas A&M University.

  2. Orthogonal Matrices
  1 An orthogonal matrix is a square matrix whose columns and rows are orthogonal unit vectors, i.e., orthonormal vectors. That is, if a matrix $Q$ is an orthogonal matrix, we have $Q^\top Q = Q Q^\top = I$.
  2 This leads to $Q^{-1} = Q^\top$, which is a very useful property as it provides an easy way to compute the inverse.
  3 For an orthogonal $n \times n$ matrix $Q = [q_1, q_2, \ldots, q_n]$, where $q_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, n$, it is easy to see that $q_i^\top q_j = 0$ when $i \neq j$ and $q_i^\top q_i = 1$.
  4 Furthermore, suppose $Q_1 = [q_1, q_2, \ldots, q_i]$ and $Q_2 = [q_{i+1}, q_{i+2}, \ldots, q_n]$. Then $Q_1^\top Q_1 = I$ and $Q_2^\top Q_2 = I$, but $Q_1 Q_1^\top \neq I$ and $Q_2 Q_2^\top \neq I$.
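  A minimal NumPy sketch (not from the slides; the variable names are illustrative) checking these properties on an orthogonal matrix obtained from a QR factorization:

      import numpy as np

      n, i = 5, 3
      Q, _ = np.linalg.qr(np.random.randn(n, n))   # Q is an n x n orthogonal matrix

      I = np.eye(n)
      print(np.allclose(Q.T @ Q, I), np.allclose(Q @ Q.T, I))   # True True
      print(np.allclose(np.linalg.inv(Q), Q.T))                 # inverse equals transpose

      Q1, Q2 = Q[:, :i], Q[:, i:]                  # first i columns, remaining n - i columns
      print(np.allclose(Q1.T @ Q1, np.eye(i)))     # True:  Q1^T Q1 = I
      print(np.allclose(Q1 @ Q1.T, I))             # False: Q1 Q1^T != I (rank i < n)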

  3. Eigen-Decomposition
  1 A square $n \times n$ matrix $S$ with $n$ linearly independent eigenvectors can be factorized as $S = Q \Lambda Q^{-1}$, where $Q$ is the square $n \times n$ matrix whose columns are eigenvectors of $S$, and $\Lambda$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues.
  2 Note that only diagonalizable matrices can be factorized in this way.
  3 If $S$ is a symmetric matrix, its eigenvectors are orthogonal. Thus $Q$ is an orthogonal matrix and we have $S = Q \Lambda Q^\top$.
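  A small sketch (assuming NumPy; not part of the slides) of the eigen-decomposition of a symmetric matrix:

      import numpy as np

      A = np.random.randn(4, 4)
      S = A + A.T                                  # construct a symmetric matrix
      evals, Q = np.linalg.eigh(S)                 # eigh: eigen-decomposition of a symmetric matrix
      Lam = np.diag(evals)

      print(np.allclose(Q @ Lam @ Q.T, S))         # S = Q Lambda Q^T
      print(np.allclose(Q.T @ Q, np.eye(4)))       # the eigenvector matrix is orthogonal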

  4. Singular Value Decomposition
  The singular value decomposition (SVD) of an $m \times n$ real matrix $A$ (without loss of generality, we assume $m \geq n$) can be written as $A = U \tilde{\Sigma} V^\top$, where $U$ is an orthogonal $m \times m$ matrix, $V$ is an orthogonal $n \times n$ matrix, and $\tilde{\Sigma}$ is a diagonal $m \times n$ matrix with non-negative real values on the diagonal. That is,
  $$U^\top U = U U^\top = I_{m \times m}, \qquad V^\top V = V V^\top = I_{n \times n},$$
  $$\tilde{\Sigma} = \begin{bmatrix} \Sigma_{n \times n} \\ 0 \end{bmatrix}_{m \times n}, \qquad
  \Sigma_{n \times n} = \begin{bmatrix}
  \sigma_1 & 0 & 0 & \cdots & 0 \\
  0 & \sigma_2 & 0 & \cdots & 0 \\
  0 & 0 & \sigma_3 & \cdots & 0 \\
  & & & \ddots & \\
  0 & 0 & 0 & \cdots & \sigma_n
  \end{bmatrix}, \qquad (1)$$
  where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0$ are known as singular values. If $\mathrm{rank}(A) = r$ ($r \leq n$), we have $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ and $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_n = 0$.
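  A sketch (assuming NumPy) of the full SVD $A = U \tilde{\Sigma} V^\top$ with $m \geq n$, where the padded $\tilde{\Sigma}$ is built explicitly:

      import numpy as np

      m, n = 6, 4
      A = np.random.randn(m, n)
      U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: m x m, s: n singular values, Vt: n x n

      Sigma_tilde = np.zeros((m, n))
      Sigma_tilde[:n, :n] = np.diag(s)                  # stack Sigma_{n x n} above a zero block

      print(np.allclose(U @ Sigma_tilde @ Vt, A))       # exact reconstruction
      print(np.all(s[:-1] >= s[1:]), np.all(s >= 0))    # singular values are sorted and non-negative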

  5. Relation to Eigen-Decomposition
  The columns of $U$ (left-singular vectors) are orthonormal eigenvectors of $A A^\top$, and the columns of $V$ (right-singular vectors) are orthonormal eigenvectors of $A^\top A$. In other words, we have $A A^\top = U \Lambda U^{-1}$ and $A^\top A = V \Lambda V^{-1}$. It is easy to verify this, as
  $$A^\top A = \left( U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^\top \right)^\top \left( U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^\top \right) = V \begin{bmatrix} \Sigma & 0 \end{bmatrix} \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^\top = V \Sigma^2 V^\top,$$
  $$A A^\top = \left( U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^\top \right) \left( U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^\top \right)^\top = U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} \begin{bmatrix} \Sigma & 0 \end{bmatrix} U^\top = U \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix} U^\top,$$
  using $V^\top = V^{-1}$ and $U^\top = U^{-1}$.
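  A numerical check (assuming NumPy) that the right-singular vectors are eigenvectors of $A^\top A$ with eigenvalues equal to the squared singular values:

      import numpy as np

      A = np.random.randn(6, 4)
      U, s, Vt = np.linalg.svd(A, full_matrices=False)

      # Eigenvalues of A^T A, sorted in decreasing order, equal the squared singular values.
      evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
      print(np.allclose(evals, s**2))

      # Each right-singular vector v_i satisfies (A^T A) v_i = sigma_i^2 v_i.
      V = Vt.T
      print(np.allclose(A.T @ A @ V, V * s**2))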

  6. SVD and Eigen-Decomposition
  1 Under what conditions are the SVD and the eigen-decomposition the same? First, $A$ must be a symmetric matrix, i.e., $A = A^\top$. Second, $A$ must be positive semi-definite, i.e., $\forall x \in \mathbb{R}^n$, $x^\top A x \geq 0$.
  2 The difference between $\Lambda$ in the eigen-decomposition and $\Sigma$ in the SVD is that the diagonal entries of $\Lambda$ can be negative, while the diagonal entries of $\Sigma$ are non-negative. What are the fundamental reasons underlying this difference? Why do the requirements on the singular values in the SVD (non-negative and in sorted order) not limit the generality of the SVD?
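  A small illustration (assuming NumPy; not from the slides) of the sign difference raised above: for a symmetric matrix, the singular values are the absolute values of the eigenvalues, so the two decompositions coincide only when the matrix is also positive semi-definite.

      import numpy as np

      S = np.array([[2.0, 0.0],
                    [0.0, -3.0]])                 # symmetric but not positive semi-definite

      evals = np.linalg.eigvalsh(S)               # [-3.  2.]
      svals = np.linalg.svd(S, compute_uv=False)  # [3. 2.]
      print(evals, svals)
      print(np.allclose(np.sort(np.abs(evals))[::-1], svals))   # True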

  7. Compact SVD
  If $\mathrm{rank}(A) = r$ ($r \leq n$), we have
  $$A = U \tilde{\Sigma} V^\top = [u_1, u_2, \ldots, u_r, \ldots, u_m]
  \begin{bmatrix}
  \sigma_1 & & & & \\
  & \ddots & & & \\
  & & \sigma_r & & \\
  & & & 0 & \\
  & & & & \ddots
  \end{bmatrix}_{m \times n}
  \begin{bmatrix} v_1^\top \\ v_2^\top \\ \vdots \\ v_n^\top \end{bmatrix}.$$
  By removing the zero components, we obtain
  $$A = U_r \Sigma_r V_r^\top = [u_1, u_2, \ldots, u_r]
  \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix}
  \begin{bmatrix} v_1^\top \\ v_2^\top \\ \vdots \\ v_r^\top \end{bmatrix}
  = [\sigma_1 u_1, \sigma_2 u_2, \ldots, \sigma_r u_r]
  \begin{bmatrix} v_1^\top \\ v_2^\top \\ \vdots \\ v_r^\top \end{bmatrix}
  = \sum_{i=1}^{r} \sigma_i u_i v_i^\top,$$
  where $\mathrm{rank}(\sigma_i u_i v_i^\top) = 1$, $i = 1, 2, \ldots, r$.
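  A sketch (assuming NumPy) of the compact SVD and the rank-one expansion $A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$ for a matrix of known rank $r$:

      import numpy as np

      m, n, r = 6, 4, 2
      A = np.random.randn(m, r) @ np.random.randn(r, n)   # a rank-r matrix by construction

      U, s, Vt = np.linalg.svd(A)
      Ur, Sr, Vtr = U[:, :r], np.diag(s[:r]), Vt[:r, :]   # keep only the r nonzero components
      print(np.allclose(Ur @ Sr @ Vtr, A))                # the compact SVD reproduces A

      A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
      print(np.allclose(A_sum, A))                        # sum of r rank-one matrices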

  8. Truncated SVD and Best Low-Rank Approximation
  We can also approximate the matrix $A$ with the $k$ largest singular values as
  $$A_k = U_k \Sigma_k V_k^\top = \sum_{i=1}^{k} \sigma_i u_i v_i^\top.$$
  Apparently, $A \neq A_k$ unless $\mathrm{rank}(A) = k$. This approximation is the best in the following sense:
  $$\min_{B:\, \mathrm{rank}(B) \leq k} \|A - B\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{n} \sigma_i^2},$$
  $$\min_{B:\, \mathrm{rank}(B) \leq k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1},$$
  where $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|_2$ denotes the spectral norm, defined as the largest singular value of the matrix. That is, $A_k$ is the best rank-$k$ approximation to $A$ in terms of both the Frobenius norm and the spectral norm. Note the difference in approximation errors when different matrix norms are used.
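  A quick check (assuming NumPy) of the two error formulas for the rank-$k$ truncated SVD:

      import numpy as np

      A = np.random.randn(8, 5)
      k = 2
      U, s, Vt = np.linalg.svd(A, full_matrices=False)
      A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

      fro_err = np.linalg.norm(A - A_k, 'fro')
      spec_err = np.linalg.norm(A - A_k, 2)
      print(np.isclose(fro_err, np.sqrt(np.sum(s[k:] ** 2))))   # sqrt of the sum of trailing sigma^2
      print(np.isclose(spec_err, s[k]))                         # sigma_{k+1} (0-indexed as s[k])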

  9. What is PCA?
  1 Principal Component Analysis (PCA) is a statistical procedure that can be used to achieve feature (dimensionality) reduction.
  2 Note that feature reduction is different from feature selection. Feature reduction constructs new features that combine all of the original features, while feature selection keeps only a subset of the original features.
  3 The goal of PCA is to project the high-dimensional features to a lower-dimensional space with maximal variance and minimal reconstruction error simultaneously.
  4 We derive PCA based on maximizing variance, and then we show that the solution also minimizes the reconstruction error.
  5 In machine learning, PCA is an unsupervised learning technique and therefore does not need labels.

  10. PCA to 1D
  1 To introduce PCA, we start from the simple case where PCA projects the features to a 1-dimensional space.
  2 Formally, suppose we have $n$ $p$-dimensional ($p > 1$) features $x_1, x_2, \ldots, x_n \in \mathbb{R}^p$.
  3 Let $a \in \mathbb{R}^p$ represent a projection such that $a^\top x_i = z_i$, $i = 1, 2, \ldots, n$, where $z_1, z_2, \ldots, z_n \in \mathbb{R}^1$.
  4 PCA aims to solve
  $$a^* = \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2.$$
  5 Note that the variance of the reduced data is
  $$\frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2,$$
  which means that PCA tries to find the projection with the maximum variance in the reduced data.

  11. PCA to 1D
  Since
  $$\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i = \frac{1}{n} \sum_{i=1}^{n} a^\top x_i = a^\top \Big( \frac{1}{n} \sum_{i=1}^{n} x_i \Big) = a^\top \bar{x},$$
  the problem can be written as
  $$\begin{aligned}
  a^* &= \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2 \\
      &= \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} (a^\top x_i - \bar{z})^2 \\
      &= \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} (a^\top x_i - a^\top \bar{x})^2 \\
      &= \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} a^\top (x_i - \bar{x})(x_i - \bar{x})^\top a \\
      &= \arg\max_{\|a\|=1} a^\top \underbrace{\Big( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top \Big)}_{p \times p \text{ covariance matrix}} a \\
      &= \arg\max_{\|a\|=1} a^\top C a,
  \end{aligned}$$
  where $C = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top$ denotes the covariance matrix.
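  A sketch (assuming NumPy; the synthetic data is only for illustration) of the 1D objective: among unit vectors $a$, the top eigenvector of the covariance matrix $C$ maximizes the projected variance $a^\top C a$.

      import numpy as np

      n, p = 200, 5
      X = np.random.randn(n, p) @ np.random.randn(p, p)   # rows are the features x_i
      x_bar = X.mean(axis=0)
      C = (X - x_bar).T @ (X - x_bar) / n                  # p x p covariance matrix

      evals, evecs = np.linalg.eigh(C)                     # ascending eigenvalues
      a_star = evecs[:, -1]                                # eigenvector of the largest eigenvalue

      a_rand = np.random.randn(p)
      a_rand /= np.linalg.norm(a_rand)                     # a random competing unit vector
      print(a_star @ C @ a_star >= a_rand @ C @ a_rand)    # True: a* attains the maximum variance
      print(np.isclose(a_star @ C @ a_star, evals[-1]))    # the maximum equals the largest eigenvalue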

  12. PCA to a k-Dimensional Space
  1 What if we want to project the features to a $k$-dimensional space? Then the PCA problem becomes
  $$A^* = \arg\max_{A \in \mathbb{R}^{p \times k}:\, A^\top A = I_k} \mathrm{trace}\left( A^\top C A \right), \qquad (2)$$
  where $A = [a_1, a_2, \cdots, a_k] \in \mathbb{R}^{p \times k}$. Note that when projecting onto a $k$-dimensional space, PCA requires the different projection vectors to be orthogonal. Also, the trace above is the sum of the variances after projecting the data onto each of the $k$ directions, as
  $$\mathrm{trace}\left( A^\top C A \right) = \sum_{i=1}^{k} a_i^\top C a_i.$$

  13. Ky Fan Theorem
  1 Solving the problem in Eqn. (2) requires the following theorem.
  2 Theorem (Ky Fan). Let $H \in \mathbb{R}^{n \times n}$ be a symmetric matrix with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ and corresponding eigenvectors $U = [u_1, \ldots, u_n]$. Then
  $$\lambda_1 + \cdots + \lambda_k = \max_{A \in \mathbb{R}^{n \times k}:\, A^\top A = I_k} \mathrm{trace}\left( A^\top H A \right),$$
  and the optimal $A^*$ is given by $A^* = [u_1, \ldots, u_k] Q$, with $Q$ an arbitrary orthogonal matrix. $\square$
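  A numerical check (assuming NumPy) of the theorem: the trace objective over matrices with orthonormal columns is attained by the top-$k$ eigenvectors, and its value is the sum of the $k$ largest eigenvalues.

      import numpy as np

      n, k = 6, 2
      M = np.random.randn(n, n)
      H = M + M.T                                       # a symmetric matrix

      evals, evecs = np.linalg.eigh(H)                  # ascending eigenvalues
      A_star = evecs[:, -k:]                            # eigenvectors of the k largest eigenvalues
      best = np.trace(A_star.T @ H @ A_star)
      print(np.isclose(best, evals[-k:].sum()))         # equals lambda_1 + ... + lambda_k

      A_rand, _ = np.linalg.qr(np.random.randn(n, k))   # a random feasible A with orthonormal columns
      print(best >= np.trace(A_rand.T @ H @ A_rand) - 1e-12)   # no feasible A does better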

  14. Solutions to PCA
  1 Note that in Eqn. (2), the covariance matrix $C$ is a symmetric matrix. Given the above theorem, we directly obtain
  $$\lambda_1 + \cdots + \lambda_k = \max_{A \in \mathbb{R}^{p \times k}:\, A^\top A = I_k} \mathrm{trace}\left( A^\top C A \right), \qquad A^* = [u_1, \ldots, u_k] Q,$$
  where $\lambda_1, \ldots, \lambda_k$ are the $k$ largest eigenvalues of the covariance matrix $C$, and the solution $A^*$ is the matrix whose columns are the corresponding eigenvectors.
  2 It also follows from the above theorem that solutions to PCA are not unique; they differ by an orthogonal matrix. We use the special case where $Q = I$, i.e., $A^* = [u_1, \ldots, u_k]$.
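  A minimal end-to-end sketch (assuming NumPy; the function name pca and its interface are illustrative) that follows the solution above: form the covariance matrix, take the eigenvectors of the $k$ largest eigenvalues as $A^*$, and project the centered data onto the $k$-dimensional space.

      import numpy as np

      def pca(X, k):
          """X: n x p data matrix (one feature vector per row); returns (A_star, Z)."""
          x_bar = X.mean(axis=0)
          Xc = X - x_bar                                # center the data
          C = Xc.T @ Xc / X.shape[0]                    # p x p covariance matrix
          evals, evecs = np.linalg.eigh(C)              # ascending eigenvalues
          A_star = evecs[:, ::-1][:, :k]                # eigenvectors of the k largest eigenvalues
          Z = Xc @ A_star                               # n x k reduced representation
          return A_star, Z

      X = np.random.randn(100, 10)
      A_star, Z = pca(X, k=3)
      print(A_star.shape, Z.shape)                      # (10, 3) (100, 3)
      print(np.allclose(A_star.T @ A_star, np.eye(3)))  # orthonormal projection directions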
