CSC 411 Lecture 12: Principal Component Analysis



  1. CSC 411 Lecture 12: Principal Component Analysis. Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla, University of Toronto.

  2. Overview. Today we'll cover the first unsupervised learning algorithm for this course: principal component analysis (PCA). Dimensionality reduction means mapping the data to a lower-dimensional space, which saves computation and memory, reduces overfitting, and lets us visualize in 2 dimensions. PCA is a linear model with a closed-form solution, and it's useful for understanding lots of other algorithms: autoencoders and matrix factorizations (next lecture). Today's lecture is very linear-algebra-heavy, especially orthogonal matrices and eigendecompositions. Don't worry if you don't get it immediately; the next few lectures won't build on it, and it's not on the midterm (which only covers up through Lecture 9).

  3. Projection onto a Subspace. The code is $z = U^\top (x - \mu)$, where the columns of $U$ form an orthonormal basis for a subspace $S$. The projection of a point $x$ onto $S$ is the point $\tilde{x} \in S$ closest to $x$. In machine learning, $\tilde{x}$ is also called the reconstruction of $x$, and $z$ is its representation, or code.
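
As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the projection and reconstruction step. The data and the orthonormal basis U are random placeholders standing in for the ones PCA will eventually choose.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # N = 100 points in D = 5 dimensions
mu = X.mean(axis=0)                    # mean of the data

# Orthonormal basis for a K = 2 dimensional subspace S (a random one here;
# PCA will tell us how to choose it).
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))

Z = (X - mu) @ U                       # codes:           z = U^T (x - mu)
X_tilde = mu + Z @ U.T                 # reconstructions: x~ = mu + U z
print(Z.shape, X_tilde.shape)          # (100, 2) (100, 5)
```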

  4. Projection onto a Subspace. If we have a $K$-dimensional subspace in a $D$-dimensional input space, then $x \in \mathbb{R}^D$ and $z \in \mathbb{R}^K$. If the data points $x$ all lie close to the subspace, then we can approximate distances, dot products, etc. in terms of these same operations on the code vectors $z$. If $K \ll D$, then it's much cheaper to work with $z$ than $x$. A mapping to a space that's easier to manipulate or visualize is called a representation, and learning such a mapping is representation learning. Mapping data to a low-dimensional space is called dimensionality reduction.

  5. Learning a Subspace. How do we choose a good subspace $S$? We need to choose a vector $\mu$ and a $D \times K$ matrix $U$ with orthonormal columns. Set $\mu$ to the mean of the data,
$$ \mu = \frac{1}{N} \sum_{i=1}^N x^{(i)} $$
There are two criteria. Minimize the reconstruction error,
$$ \min \ \frac{1}{N} \sum_{i=1}^N \| x^{(i)} - \tilde{x}^{(i)} \|^2 $$
or maximize the variance of the code vectors,
$$ \max \ \sum_j \mathrm{Var}(z_j) = \frac{1}{N} \sum_j \sum_i \big( z_j^{(i)} - \bar{z}_j \big)^2 = \frac{1}{N} \sum_i \| z^{(i)} - \bar{z} \|^2 = \frac{1}{N} \sum_i \| z^{(i)} \|^2 $$
(Exercise: show $\bar{z} = 0$.) Note: here, $\bar{z}$ denotes the mean, not a derivative.

  6. Learning a Subspace. These two criteria are equivalent! I.e., we'll show
$$ \frac{1}{N} \sum_{i=1}^N \| x^{(i)} - \tilde{x}^{(i)} \|^2 = \text{const} - \frac{1}{N} \sum_i \| z^{(i)} \|^2 $$
Observation: by unitarity, $\| \tilde{x}^{(i)} - \mu \| = \| U z^{(i)} \| = \| z^{(i)} \|$. By the Pythagorean theorem,
$$ \underbrace{\frac{1}{N} \sum_{i=1}^N \| \tilde{x}^{(i)} - \mu \|^2}_{\text{projected variance}} + \underbrace{\frac{1}{N} \sum_{i=1}^N \| x^{(i)} - \tilde{x}^{(i)} \|^2}_{\text{reconstruction error}} = \underbrace{\frac{1}{N} \sum_{i=1}^N \| x^{(i)} - \mu \|^2}_{\text{constant}} $$
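
This identity is easy to check numerically. The following sketch (toy data and an arbitrary orthonormal $U$, chosen here for illustration) verifies that projected variance plus reconstruction error equals the total variance about the mean, for any orthonormal basis.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # data, N x D
mu = X.mean(axis=0)
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # any orthonormal D x K basis
X_tilde = mu + (X - mu) @ U @ U.T             # reconstructions

projected_variance   = np.mean(np.sum((X_tilde - mu) ** 2, axis=1))
reconstruction_error = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
total_variance       = np.mean(np.sum((X - mu) ** 2, axis=1))

# Pythagorean identity from the slide: holds for any orthonormal U, so
# maximizing projected variance = minimizing reconstruction error.
assert np.isclose(projected_variance + reconstruction_error, total_variance)
```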

  7. Principal Component Analysis. Choosing a subspace to maximize the projected variance, or minimize the reconstruction error, is called principal component analysis (PCA). Recall the spectral decomposition: a symmetric matrix $A$ has a full set of eigenvectors, which can be chosen to be orthogonal. This gives a decomposition $A = Q \Lambda Q^\top$, where $Q$ is orthogonal and $\Lambda$ is diagonal. The columns of $Q$ are eigenvectors, and the diagonal entries $\lambda_j$ of $\Lambda$ are the corresponding eigenvalues. I.e., symmetric matrices are diagonal in some basis. A symmetric matrix $A$ is positive semidefinite iff each $\lambda_j \ge 0$.

  8. Principal Component Analysis. Consider the empirical covariance matrix
$$ \Sigma = \frac{1}{N} \sum_{i=1}^N ( x^{(i)} - \mu )( x^{(i)} - \mu )^\top $$
Recall: covariance matrices are symmetric and positive semidefinite. The optimal PCA subspace is spanned by the top $K$ eigenvectors of $\Sigma$; more precisely, choose the first $K$ of any orthonormal eigenbasis for $\Sigma$. The general case is tricky, but we'll show this for $K = 1$. These eigenvectors are called principal components, analogous to the principal axes of an ellipse.
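
A minimal NumPy sketch of this recipe (toy data; `np.linalg.eigh` returns eigenvalues in ascending order, so the columns are reversed to take the top $K$). The function name `pca` is just an illustration, not code from the course.

```python
import numpy as np

def pca(X, K):
    """Fit PCA: the mean and the top-K eigenvectors of the empirical covariance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)            # empirical covariance, D x D
    lam, Q = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    U = Q[:, ::-1][:, :K]                 # top-K eigenvectors as columns
    return mu, U

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
mu, U = pca(X, K=2)
Z = (X - mu) @ U                          # low-dimensional codes
```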

  9. Deriving PCA. For $K = 1$, we are fitting a unit vector $u$, and the code is a scalar $z = u^\top (x - \mu)$. Then
$$ \frac{1}{N} \sum_i [ z^{(i)} ]^2 = \frac{1}{N} \sum_i \big( u^\top ( x^{(i)} - \mu ) \big)^2 = \frac{1}{N} \sum_{i=1}^N u^\top ( x^{(i)} - \mu )( x^{(i)} - \mu )^\top u $$
$$ = u^\top \Big[ \frac{1}{N} \sum_{i=1}^N ( x^{(i)} - \mu )( x^{(i)} - \mu )^\top \Big] u = u^\top \Sigma u $$
$$ = u^\top Q \Lambda Q^\top u \qquad \text{(spectral decomposition)} $$
$$ = a^\top \Lambda a \qquad \text{for } a = Q^\top u $$
$$ = \sum_{j=1}^D \lambda_j a_j^2 $$

  10. Deriving PCA. Maximize $a^\top \Lambda a = \sum_{j=1}^D \lambda_j a_j^2$ for $a = Q^\top u$. This is a change of basis to the eigenbasis of $\Sigma$. Assume the $\lambda_j$ are in sorted order and, for simplicity, all distinct. Observation: since $u$ is a unit vector, then by unitarity, $a$ is also a unit vector, i.e. $\sum_j a_j^2 = 1$. By inspection, set $a_1 = \pm 1$ and $a_j = 0$ for $j \neq 1$. Hence, $u = Q a = q_1$ (the top eigenvector). A similar argument shows that the $k$th principal component is the $k$th eigenvector of $\Sigma$. If you're interested, look up the Courant-Fischer theorem.
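
To make the $K = 1$ claim concrete, the following sketch (toy data; the variable names are placeholders) compares $u^\top \Sigma u$ for the top eigenvector against many random unit vectors; none of them exceeds the top eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
Sigma = np.cov(X, rowvar=False, bias=True)                # empirical covariance

lam, Q = np.linalg.eigh(Sigma)                            # ascending eigenvalues
q1 = Q[:, -1]                                             # top eigenvector
best = q1 @ Sigma @ q1                                    # equals lam[-1]

# u^T Sigma u for many random unit vectors never exceeds the top eigenvalue.
u = rng.normal(size=(1000, 4))
u /= np.linalg.norm(u, axis=1, keepdims=True)
assert np.all(np.einsum('ij,jk,ik->i', u, Sigma, u) <= best + 1e-12)
```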

  11. Decorrelation. Interesting fact: the dimensions of $z$ are decorrelated. For now, let $\mathrm{Cov}$ denote the empirical covariance. Then
$$ \mathrm{Cov}(z) = \mathrm{Cov}\big( U^\top ( x - \mu ) \big) = U^\top \mathrm{Cov}(x)\, U = U^\top \Sigma U = U^\top Q \Lambda Q^\top U = \begin{bmatrix} I & 0 \end{bmatrix} \Lambda \begin{bmatrix} I \\ 0 \end{bmatrix} \quad \text{(by orthogonality)} $$
which is the top left $K \times K$ block of $\Lambda$. If the covariance matrix is diagonal, this means the features are uncorrelated. This is why PCA was originally invented (in 1901!).
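
A quick numerical check of the decorrelation claim (toy data; the names are illustrative): the empirical covariance of the codes comes out diagonal, with the top $K$ eigenvalues of $\Sigma$ on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)

lam, Q = np.linalg.eigh(Sigma)             # ascending eigenvalues
U = Q[:, ::-1][:, :3]                      # top K = 3 principal components
Z = (X - mu) @ U                           # codes (zero mean)

# Covariance of the codes is diagonal: the dimensions of z are decorrelated,
# and the diagonal holds the top K eigenvalues of Sigma.
Cov_z = Z.T @ Z / len(Z)
assert np.allclose(Cov_z, np.diag(lam[::-1][:3]))
```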

  12. Recap. Dimensionality reduction aims to find a low-dimensional representation of the data. PCA projects the data onto a subspace which maximizes the projected variance, or equivalently, minimizes the reconstruction error. The optimal subspace is given by the top eigenvectors of the empirical covariance matrix. PCA gives a set of decorrelated features.

  13. Applying PCA to faces. Consider running PCA on 2429 19x19 grayscale images (CBCL data). We can get good reconstructions with only 3 components. PCA for pre-processing: we can apply a classifier to the latent representation. For face/non-face discrimination on test data, PCA with 3 components obtains 79% accuracy vs. 76.8% for a Gaussian mixture model (GMM) with 84 states. (We'll cover GMMs later in the course.) PCA can also be good for visualization.
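
The slide's CBCL experiment is not reproduced here, but as a hedged sketch of the "classifier on the latent representation" workflow, the following uses scikit-learn on synthetic stand-in data (the shapes mimic 2429 flattened 19x19 patches; the labels are fake, so the printed accuracy means nothing).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 2429 19x19 grayscale patches (the CBCL data
# itself is not loaded here).
rng = np.random.default_rng(0)
X = rng.normal(size=(2429, 19 * 19))
y = rng.integers(0, 2, size=2429)            # fake face / non-face labels

pca = PCA(n_components=3).fit(X)             # learn a 3-dimensional subspace
Z = pca.transform(X)                         # latent representation (codes)

clf = LogisticRegression().fit(Z, y)         # classify in the low-dim space
print("train accuracy:", clf.score(Z, y))
```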

  14. Applying PCA to faces: Learned basis. Principal components of face images (“eigenfaces”).

  15. Applying PCA to digits.

  16. Next. Two more interpretations of PCA, which have interesting generalizations: (1) autoencoders; (2) matrix factorization (next lecture).

  17. Autoencoders. An autoencoder is a feed-forward neural net whose job is to take an input $x$ and predict $x$. To make this non-trivial, we need to add a bottleneck layer whose dimension is much smaller than the input.

  18. Linear Autoencoders. Why autoencoders? To map high-dimensional data to two dimensions for visualization, and to learn abstract features in an unsupervised way so you can apply them to a supervised task; unlabeled data can be much more plentiful than labeled data.

  19. Linear Autoencoders. The simplest kind of autoencoder has one hidden layer, linear activations, and squared error loss,
$$ \mathcal{L}( x, \tilde{x} ) = \| x - \tilde{x} \|^2 $$
This network computes $\tilde{x} = W_2 W_1 x$, which is a linear function. If $K \ge D$, we can choose $W_2$ and $W_1$ such that $W_2 W_1$ is the identity matrix; this isn't very interesting. But suppose $K < D$: then $W_1$ maps $x$ to a $K$-dimensional space, so it's doing dimensionality reduction.

  20. Linear Autoencoders. Observe that the output of the autoencoder must lie in a $K$-dimensional subspace spanned by the columns of $W_2$. We saw that the best possible $K$-dimensional subspace in terms of reconstruction error is the PCA subspace. The autoencoder can achieve this by setting $W_1 = U^\top$ and $W_2 = U$. Therefore, the optimal weights for a linear autoencoder are just the principal components!
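
A small sketch checking this claim numerically, under the assumption that the data is centered before encoding: with $W_1 = U^\top$ and $W_2 = U$, the linear autoencoder attains the PCA reconstruction error, and a random subspace of the same dimension does no better. Toy data and names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))   # correlated toy data
mu = X.mean(axis=0)
Xc = X - mu                                               # centered data

# PCA subspace: top K = 2 eigenvectors of the empirical covariance.
lam, Q = np.linalg.eigh(Xc.T @ Xc / len(X))
U = Q[:, ::-1][:, :2]

# Linear autoencoder with the weights suggested on the slide, applied to
# the centered inputs: encoder W1 = U^T, decoder W2 = U.
W1, W2 = U.T, U
X_tilde = mu + Xc @ W1.T @ W2.T                           # x~ = W2 W1 (x - mu) + mu
pca_error = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

# No other 2-dimensional subspace reconstructs the data better.
V, _ = np.linalg.qr(rng.normal(size=(6, 2)))              # a random subspace
other_error = np.mean(np.sum((Xc - Xc @ V @ V.T) ** 2, axis=1))
assert pca_error <= other_error + 1e-12
```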

  21. Nonlinear Autoencoders. Deep nonlinear autoencoders learn to project the data not onto a subspace, but onto a nonlinear manifold: the image of the decoder. This is a kind of nonlinear dimensionality reduction.
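
For completeness, here is a hedged PyTorch sketch of a small nonlinear autoencoder; the layer sizes, the 784-dimensional input, and the training loop are arbitrary choices, not taken from the slides.

```python
import torch
import torch.nn as nn

# A small nonlinear autoencoder: the decoder's image is a curved 2-D manifold
# in input space rather than a linear subspace. All sizes are placeholders.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 2))
decoder = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, 784))

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(64, 784)                     # stand-in batch of inputs
for step in range(100):                      # minimize squared reconstruction error
    x_tilde = decoder(encoder(x))
    loss = ((x - x_tilde) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```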

  22. Nonlinear Autoencoders. Nonlinear autoencoders can learn more powerful codes for a given dimensionality, compared with linear autoencoders (PCA).

  23. Nonlinear Autoencoders. Here's a 2-dimensional autoencoder representation of newsgroup articles. They're color-coded by topic, but the algorithm wasn't given the labels.
