MACHINE LEARNING – 2013
Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and Kernel PCA
Practicals Next Week
Next week's practical session on computer takes place in room GR C0 02!
Why reduce the data dimensionality?
Reducing the dimensionality of the dataset at hand makes subsequent computation more tractable.
Idea: only a few of the dimensions matter; the projections of the data along the remaining dimensions do not contain informative structure of the data (already a form of generalization).
Why reduce the data dimensionality?
The curse of dimensionality refers to the exponential growth of the volume covered by the parameters' values to be tested as the dimensionality increases.
In ML, analyzing high-dimensional data is made particularly difficult because:
- Often one does not have enough observations to get good estimates (i.e. to sample all parameters sufficiently).
- Adding more dimensions (hence more features) can increase the noise, and hence the error.
Principal Component Analysis
Principal Component Analysis is a method widely used in engineering and science. Its principle is based on a statistical analysis of the correlations underpinning the dataset at hand.
Its popularity is due to the fact that:
- Its computation is simple and tractable, with an analytical solution.
- Its result can easily be visualized, usually through a 2- or 3-dimensional graph.
Co-Variance, Correlation
The covariance and the correlation are measures of the dependency between two variables. Given two variables x and y (assuming that x and y are both zero mean):

cov(x, y) = E[xy] − E[x]E[y]

corr(x, y) = cov(x, y) / sqrt( var(x) var(y) )

x and y are said to be uncorrelated if their covariance is zero: corr(x, y) = 0 and cov(x, y) = 0.
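As a quick numerical check of these two formulas, here is a minimal NumPy sketch (not part of the original slides); the sample data and coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two zero-mean, correlated variables (illustrative data only).
x = rng.standard_normal(1000)
y = 0.8 * x + 0.6 * rng.standard_normal(1000)

# Sample estimates of the formulas above.
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # cov(x, y) = E[xy] - E[x]E[y]
corr_xy = cov_xy / np.sqrt(np.var(x) * np.var(y))   # corr(x, y) = cov / sqrt(var * var)

print(cov_xy, corr_xy)              # both close to 0.8 for this construction
print(np.corrcoef(x, y)[0, 1])      # NumPy's built-in estimate, for comparison
```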
Co-Variance Matrix
If X = {x^i}_{i=1..M}, with x^i = (x^i_j)_{j=1..N}, is a multidimensional dataset containing M N-dimensional datapoints, the covariance matrix of X is given by:

C = E[X X^T] =
  [ cov(X_1, X_1)  ...  cov(X_1, X_N) ]
  [      ...       ...       ...      ]
  [ cov(X_N, X_1)  ...  cov(X_N, X_N) ]

C is diagonal when the data X are decorrelated along each dimension.

The rows X_j, j = 1..N, represent the coordinates of each datapoint with respect to the j-th basis vector. The columns of X contain the M datapoints.

C = E[X X^T] ~ X X^T, since the expectation is only a normalization factor.
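A minimal sketch of this covariance matrix, assuming the slide's convention that the M datapoints are stored as the columns of an N x M matrix X (the data are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

N, M = 3, 500                            # dimensions, number of datapoints
X = rng.standard_normal((N, M))          # columns are datapoints, rows are dimensions
X = X - X.mean(axis=1, keepdims=True)    # zero-mean each dimension

C = (X @ X.T) / M                        # C ~ E[X X^T]; 1/M is the normalization factor
print(C.shape)                                   # (N, N)
print(np.allclose(C, np.cov(X, bias=True)))      # matches NumPy's covariance estimate
```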
Purpose of PCA
Goal: find a better representation of the dataset at hand, so as to simplify subsequent computation.
- Assumes a linear transformation.
- Assumes maximal variance as the criterion.
[Figure: a raw 2D dataset and its projection onto the first two principal components.]
PCA: Principle
PRINCIPLE: Define a low-dimensional manifold in the original space. Represent each data point X by its projection Y onto this manifold.
FORMALISM: Consider a data set of M N-dimensional data points X = {x^i}_{i=1..M}, with x^i = (x^i_j)_{j=1..N}, i = 1,...,M.
PCA aims at finding a linear map A: R^N -> R^q, with q <= N, such that Y = AX, where Y = {y^1, ..., y^M} and each y^i lies in R^q.
PCA: Principle
There are three equivalent methods for performing PCA:
1. Maximize the variance of the projection (Hotelling, 1933). In other words, this method tries to maximize the spread of the projected data.
2. Minimize the reconstruction error (Pearson, 1901), i.e. minimize the squared distance between the original data and their "estimate" in a low-dimensional manifold.
3. Maximum-likelihood estimation of the parameters of a latent variable model (Tipping and Bishop, 1999).
Standard PCA: Variance Maximization through Eigenvalue Decomposition
Algorithm:
1. Determine the direction (vector) along which the variance of the data is maximal.
2. Determine an orthonormal basis composed of the directions obtained in 1. The projections of the data onto these axes are uncorrelated.
Standard PCA: Variance Maximization through Eigenvalue Decomposition
Algorithm:
1) Zero mean: X' = X − E[X]
2) Compute the covariance matrix: C = E[X' X'^T]
3) Compute the eigenvalues λ_i using det(C − λ_i I) = 0, i = 1..N
4) Compute the eigenvectors e_i using C e_i = λ_i e_i
5) Choose the first q ≤ N eigenvectors e_1, ..., e_q, with λ_1 ≥ λ_2 ≥ ... ≥ λ_q
6) Project the data onto the new basis: Y = A_q X', where the rows of A_q are the chosen eigenvectors e_1, ..., e_q
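A minimal NumPy sketch of the six steps above, again assuming datapoints stored as columns; q and the toy data are illustrative, not part of the lecture:

```python
import numpy as np

def pca_eig(X, q):
    """Standard PCA by eigendecomposition; X is N x M (columns = datapoints)."""
    # 1) zero mean
    Xp = X - X.mean(axis=1, keepdims=True)
    # 2) covariance matrix
    C = (Xp @ Xp.T) / X.shape[1]
    # 3-4) eigenvalues and eigenvectors of the symmetric matrix C
    eigvals, eigvecs = np.linalg.eigh(C)
    # 5) keep the q eigenvectors with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:q]
    A_q = eigvecs[:, order].T          # rows of A_q are the chosen eigenvectors e_1..e_q
    # 6) project the centred data onto the new basis
    Y = A_q @ Xp
    return Y, A_q, eigvals[order]

# Toy usage: 100 points in 5 dimensions projected to q = 2.
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 100))
Y, A_q, lam = pca_eig(X, q=2)
print(Y.shape, A_q.shape)              # (2, 100) (2, 5)
```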
Standard PCA: Example — Demo: PCA for Face Classification
[Figure: projection of the image datapoints onto the first and second principal components; two classes with 20 and 16 examples each, separated by a line.]
Projecting a set of images of two classes (two different persons) onto the first two principal components extracts features particular to each class, which can then be used for classification.
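The demo itself is not reproduced here; the sketch below only illustrates the idea with scikit-learn, using randomly generated "images" in place of the two face classes (all data, sizes, and class offsets are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)

# Placeholder "images": 20 + 16 flattened 32x32 images, two classes whose
# pixel means differ slightly so that the classes are separable.
class_a = rng.standard_normal((20, 32 * 32)) + 0.5
class_b = rng.standard_normal((16, 32 * 32)) - 0.5
images = np.vstack([class_a, class_b])
labels = np.array([0] * 20 + [1] * 16)

# Project every image onto the first two principal components.
features = PCA(n_components=2).fit_transform(images)

# Fit a linear separating line in the 2D principal-component space.
clf = LinearSVC().fit(features, labels)
print(clf.score(features, labels))     # training accuracy on the toy data
```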
Principal Component Analysis
LIMITATIONS OF STANDARD AND RECONSTRUCTION-ERROR (MSE) PCA:
The variance-covariance matrix needs to be calculated:
- This can be very computation-intensive for large datasets with a high number of dimensions.
- It does not deal properly with missing data: incomplete data must either be discarded or imputed using ad-hoc methods.
- Outliers can unduly affect the analysis.
Probabilistic PCA addresses some of the above limitations.
Probabilistic PCA
The data X = {x^i}_{i=1..M} are samples of the distribution of the random variable x.
x is generated by the latent variable z following:

x = W z + μ + ε

The latent variable z corresponds to the unobserved variable. It is a lower-dimensional representation of the data and their dependencies.
In Probabilistic PCA, the latent variable model thus consists of:
- x: observed variables (dimension N)
- z: latent variables (dimension q), with q < N
Fewer dimensions result in more parsimonious models.
Probabilistic PCA
The data X = {x^i}_{i=1..M} are samples of the distribution of the random variable x.
x is generated by the latent variable z following:

x = W z + μ + ε

Assumptions:
- The latent variables z are centered and white, i.e. z ~ N(0, I).
- μ is a parameter (usually the mean of the data x).
- The noise ε follows a zero-mean Gaussian distribution N(0, Ψ), with Ψ a diagonal covariance matrix.
- W is an N × q matrix.
The variance of the noise being diagonal implies conditional independence of the observables given the latent variables; z encapsulates all correlations across the original dimensions.
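A minimal generative sketch of this latent variable model, assuming (as on the next slides) an isotropic noise covariance σ²I; the values of W, μ, σ, and the sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

N, q, M = 5, 2, 1000
W = rng.standard_normal((N, q))      # N x q loading matrix (arbitrary here)
mu = np.arange(N, dtype=float)       # offset, e.g. the data mean
sigma = 0.1                          # noise standard deviation

Z = rng.standard_normal((q, M))              # latent variables z ~ N(0, I)
E = sigma * rng.standard_normal((N, M))      # noise eps ~ N(0, sigma^2 I)
X = W @ Z + mu[:, None] + E                  # x = W z + mu + eps, one column per sample

print(X.shape)                               # (5, 1000)
```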
Probabilistic PCA
[Figure: the prior distribution p(z) over the latent variable z, under the assumptions of the previous slide.]
Probabilistic PCA
[Figure: the prior p(z) in latent space and the conditional p(x|z_1) in data space, an isotropic Gaussian centered at W z_1 + μ.]
Assuming further an isotropic Gaussian noise model ε ~ N(0, σ² I), the conditional probability distribution of the observables given the latent variables, p(x|z), is given by:

p(x|z) = N(x | W z + μ, σ² I)
Probabilistic PCA
[Figure: the marginal p(x) in data space; the axes of the ellipse correspond to the columns of W, i.e. to the eigenvectors of the covariance matrix XX^T.]
The marginal distribution p(x) can be computed by integrating out the latent variable z and is then:

p(x) = N(x | μ, W W^T + σ² I)

W, μ and σ² are open parameters; they can be learned through maximum likelihood.
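Continuing the generative sketch above, the sample covariance of the generated data should approach W W^T + σ² I; the check below uses the same made-up parameters and is only illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N, q, M = 5, 2, 100000
W = rng.standard_normal((N, q))
mu = np.zeros(N)
sigma = 0.1

Z = rng.standard_normal((q, M))
X = W @ Z + mu[:, None] + sigma * rng.standard_normal((N, M))

C_empirical = np.cov(X, bias=True)               # sample covariance of the observations
C_model = W @ W.T + sigma**2 * np.eye(N)         # W W^T + sigma^2 I from the marginal p(x)
print(np.abs(C_empirical - C_model).max())       # small for large M
```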
Probabilistic PCA through Maximum Likelihood
If we set B = W W^T + σ² I, one can then compute the log-likelihood:

ln L(μ, B) = −(M/2) [ N ln(2π) + ln|B| + tr(B^{-1} C) ]

where C is the sample covariance matrix of the complete set of M datapoints X = {x^1, ..., x^M}.
The maximum-likelihood estimate of μ is the mean of the dataset X. The parameters W and σ² (and hence B) are estimated through E-M.
(See the lecture notes for the values of B and the exercises for the derivation.)
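A hedged sketch of evaluating this log-likelihood for given W, μ, and σ² (the E-M / closed-form estimates themselves are in the lecture notes and are not reproduced here; the test parameters below are arbitrary):

```python
import numpy as np

def ppca_log_likelihood(X, W, mu, sigma2):
    """ln L = -(M/2) [ N ln(2*pi) + ln|B| + tr(B^{-1} C) ], with B = W W^T + sigma^2 I."""
    N, M = X.shape
    Xc = X - mu[:, None]
    C = (Xc @ Xc.T) / M                            # sample covariance of the dataset
    B = W @ W.T + sigma2 * np.eye(N)
    sign, logdet_B = np.linalg.slogdet(B)
    trace_term = np.trace(np.linalg.solve(B, C))   # tr(B^{-1} C) without explicit inverse
    return -0.5 * M * (N * np.log(2 * np.pi) + logdet_B + trace_term)

# Example call with arbitrary (non-optimal) parameters.
rng = np.random.default_rng(5)
X = rng.standard_normal((5, 200))
print(ppca_log_likelihood(X, rng.standard_normal((5, 2)), X.mean(axis=1), 0.5))
```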