Machine Learning: Dimensionality Reduction
Hamid R. Rabiee, Jafar Muhammadi, Alireza Ghasemi
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1/
Agenda
Dimensionality Reduction
Feature Extraction
Feature Extraction Approaches
Linear Methods: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Multiple Discriminant Analysis (MDA), PCA vs. LDA
Linear Methods Drawbacks
Nonlinear Dimensionality Reduction: ISOMAP, Locally Linear Embedding (LLE), ISOMAP vs. LLE
Dimensionality Reduction
Feature Selection (discussed in the previous session): select the best subset from a given feature set.
Feature Extraction (discussed today): create new features based on the original feature set; transforms are usually involved.
Why Dimensionality Reduction?
Most machine learning and data mining techniques may not be effective for high-dimensional data.
Curse of dimensionality: query accuracy and efficiency degrade rapidly as the dimension increases.
The intrinsic dimension may be small. For example, the number of genes responsible for a certain type of disease may be small.
Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage and retrieval.
Noise removal: positive effect on query accuracy.
Adopted from slides of Arizona State University
Feature Extraction
A feature extractor maps each sample $X_i$ to a new representation $Y_i$:
$$X_i = (x_{i1}, x_{i2}, \dots, x_{id})^T \;\longmapsto\; Y_i = f(X_i) = (y_{i1}, y_{i2}, \dots, y_{im})^T, \quad \text{usually } m \ll d$$
For example:
$$X = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} \;\longmapsto\; Y = \begin{pmatrix} x_1 + x_2 \\ x_3 + x_4 \end{pmatrix}$$
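A minimal sketch of this kind of linear feature extraction (NumPy assumed; the map summing coordinate pairs follows the reconstructed toy example above):

```python
import numpy as np

# One hypothetical 4-dimensional sample (d = 4)
x = np.array([1.0, 2.0, 3.0, 4.0])

# Linear feature extractor mapping d = 4 down to m = 2:
# y1 = x1 + x2, y2 = x3 + x4 (the toy example above)
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

y = A @ x      # extracted features
print(y)       # [3. 7.]
```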
Feature Extraction Approaches
The best f(x) is most likely a non-linear function, but linear functions are easier to find.
Linear approaches:
Principal Component Analysis (PCA), also known as the Karhunen-Loeve Expansion (KLE) (will be discussed)
Linear Discriminant Analysis (LDA) (will be discussed)
Multiple Discriminant Analysis (MDA) (will be discussed)
Independent Component Analysis (ICA)
Projection Pursuit
Factor Analysis
Multidimensional Scaling (MDS)
Feature Extraction Approaches
Non-linear approaches:
Kernel PCA
ISOMAP
Locally Linear Embedding (LLE)
Neural networks:
Feed-forward neural networks: high-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Ref: Hinton, G. E. and Salakhutdinov, R. R. (2006). "Reducing the dimensionality of data with neural networks." Science, Vol. 313, No. 5786, pp. 504-507, 28 July 2006.
Self-Organizing Map (SOM): a clustering approach to dimensionality reduction that transforms the data onto a lower-dimensional lattice.
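A rough sketch of the autoencoder idea described above (assuming TensorFlow/Keras is available; the data, layer sizes, and training settings are illustrative only and are not the architecture from the cited paper):

```python
import numpy as np
import tensorflow as tf

# Toy high-dimensional data; sizes, layers, and epochs are illustrative only
X = np.random.default_rng(0).normal(size=(1000, 64)).astype("float32")

# Multilayer network with a small central layer, trained to reconstruct its input
inp = tf.keras.Input(shape=(64,))
h = tf.keras.layers.Dense(32, activation="relu")(inp)
code = tf.keras.layers.Dense(2, name="code")(h)     # low-dimensional code
h = tf.keras.layers.Dense(32, activation="relu")(code)
out = tf.keras.layers.Dense(64)(h)                  # reconstruction of the input

autoencoder = tf.keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

# The encoder half maps data to the 2-D codes
encoder = tf.keras.Model(inp, code)
codes = encoder.predict(X, verbose=0)
print(codes.shape)   # (1000, 2)
```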
Feature Extraction Approaches
Another view:
Unsupervised approaches: PCA, LLE, Self-Organizing Map
Supervised approaches: LDA, MDA
Principal Component Analysis (PCA)
Main idea: seek the most accurate data representation in a lower-dimensional space.
Example in 2-D: project the data onto a 1-D subspace (a line) that minimizes the projection error.
Notice that the good line to project onto lies in the direction of largest variance.
(Figure: a line along the direction of largest variance gives small projection errors and is a good line to project onto; a line in a low-variance direction gives large projection errors and is a bad choice.)
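A small sketch of this observation (NumPy assumed; the data and candidate directions are chosen by hand): the squared projection error is small for the high-variance direction and large otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D data stretched along the x-axis, so the largest variance is in that direction
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
X -= X.mean(axis=0)

def projection_error(X, direction):
    """Sum of squared distances from the points to the line through the origin
    spanned by `direction`."""
    d = direction / np.linalg.norm(direction)
    proj = np.outer(X @ d, d)          # projections of the points onto the line
    return np.sum((X - proj) ** 2)

print(projection_error(X, np.array([1.0, 0.0])))  # small error: high-variance direction
print(projection_error(X, np.array([0.0, 1.0])))  # large error: low-variance direction
```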
Principal Component Analysis (PCA)
PCA preserves the largest variances in the data.
What is the direction of largest variance in the data?
Hint: if x has a multivariate Gaussian distribution N(μ, Σ), the direction of largest variance is given by the eigenvector corresponding to the largest eigenvalue of Σ.
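A quick numeric illustration of the hint (a sketch assuming NumPy; the covariance matrix is made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
# Sample from N(mu, Sigma); this Sigma has eigenvalues 5 and 1, with the largest
# variance along the direction (1, 1) / sqrt(2)
mu = np.zeros(2)
Sigma = np.array([[3.0, 2.0],
                  [2.0, 3.0]])
X = rng.multivariate_normal(mu, Sigma, size=5000)

# Eigen-decomposition of the sample covariance matrix
vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
top = vecs[:, np.argmax(vals)]   # eigenvector of the largest eigenvalue

print(vals)   # approximately [1, 5]
print(top)    # approximately +/- [0.707, 0.707], the direction of largest variance
```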
Principal Component Analysis (PCA)
From this idea we can derive the following algorithm (details in the next slides).
PCA algorithm:
X: input n×d data matrix (each row a d-dimensional sample)
Subtract the mean of X from each row of X; the new data has zero mean (centered data)
Σ: covariance matrix of the centered X
Find the eigenvectors and eigenvalues of Σ
C: the M eigenvectors with the largest eigenvalues, each in a column (a d×M matrix); the eigenvalues give the importance of each component
Y (transformed data): transform X using C, i.e. Y = X C; the new dimensionality is M (M << d)
Q: How much is the data energy loss?
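A minimal sketch of these steps (NumPy assumed; the function name and return values are illustrative, not part of the course material). The reported "energy kept" is the fraction of total eigenvalue mass carried by the retained components, so the energy loss is its complement.

```python
import numpy as np

def pca(X, M):
    """PCA following the steps above: center the data, eigendecompose the
    covariance matrix, keep the M leading eigenvectors, and project.
    X is an (n, d) array with one sample per row."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # zero-mean (centered) data
    Sigma = np.cov(Xc, rowvar=False)           # d x d covariance matrix
    vals, vecs = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]             # sort descending by eigenvalue
    C = vecs[:, order[:M]]                     # d x M matrix of top eigenvectors
    Y = Xc @ C                                 # transformed (n, M) data
    energy_kept = vals[order[:M]].sum() / vals.sum()
    return Y, C, mean, energy_kept

# Usage on random data; the energy loss is 1 - energy_kept
X = np.random.default_rng(2).normal(size=(100, 5))
Y, C, mean, energy_kept = pca(X, M=2)
print(Y.shape, 1.0 - energy_kept)
```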
Principal Component Analysis (PCA)
(Illustration: a scatter of points with the first and second principal components drawn alongside the original axes.)
Principal Component Analysis (PCA)
Example: (figure on the original slide)
Principal Component Analysis (PCA)
Adopted from lectures of Duncan Fyfe Gillies
Principal Component Analysis (PCA)
Drawbacks:
PCA was designed for accurate data representation, not for data classification.
It preserves as much variance in the data as possible.
If the direction of maximum variance is important for classification, PCA will work (can you give an example?).
However, the direction of maximum variance may be useless for classification, as in the sketch below.
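A small illustrative sketch of this drawback (NumPy assumed; the two-class data is synthetic): the classes are separated along a low-variance axis, so the first principal component keeps the irrelevant high-variance axis instead.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two classes separated along the second (low-variance) axis, while the first
# axis carries large within-class variance that is irrelevant to the labels
class0 = rng.normal(size=(200, 2)) * np.array([5.0, 0.3]) + np.array([0.0, -1.0])
class1 = rng.normal(size=(200, 2)) * np.array([5.0, 0.3]) + np.array([0.0, +1.0])
X = np.vstack([class0, class1])

Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
first_pc = vecs[:, np.argmax(vals)]

print(first_pc)   # approximately +/- [1, 0]: PCA keeps the noisy, non-discriminative axis
# Projecting onto this first principal component mixes the two classes together,
# while the discarded low-variance direction [0, 1] is the one that separates them.
```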
PCA Derivation
PCA can be derived from several viewpoints:
Minimum projection error (least squares error)
Maximum information gain (maximum variance)
Or via neural networks
The result is the same in every case.
Least squares error == maximum variance: this follows from the Pythagorean theorem (illustrated in the figure on the original slide), as sketched below.
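A short sketch of the equivalence (zero-mean data assumed; $\hat{x}_j$ denotes the orthogonal projection of $x_j$ onto the subspace):
$$\|x_j\|^2 = \|\hat{x}_j\|^2 + \|x_j - \hat{x}_j\|^2 \;\;\Rightarrow\;\; \sum_{j=1}^{n} \|x_j - \hat{x}_j\|^2 = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{j=1}^{n} \|\hat{x}_j\|^2 .$$
Since $\sum_j \|x_j\|^2$ is fixed by the data, minimizing the total projection error (left side) is the same as maximizing $\sum_j \|\hat{x}_j\|^2$, i.e. the variance retained in the subspace.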
PCA Derivation
We want to find the most accurate representation of the d-dimensional data $D = \{x_1, x_2, \dots, x_n\}$ in some subspace $W$ of dimension $k < d$.
Let $\{e_1, e_2, \dots, e_k\}$ be an orthonormal basis for $W$. Any vector in $W$ can be written as $\sum_{i=1}^{k} \alpha_i e_i$ (the $e_i$ are d-dimensional vectors in the original space).
Thus $x_1$ will be represented by some vector in $W$:
$$\hat{x}_1 = \sum_{i=1}^{k} \alpha_{1i} e_i$$
The error of this representation is
$$\text{error} = \left\| x_1 - \sum_{i=1}^{k} \alpha_{1i} e_i \right\|^2$$
Then the total error is
$$J = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji} e_i \right\|^2 = \sum_{j=1}^{n} \|x_j\|^2 \;-\; 2\sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}\, x_j^T e_i \;+\; \sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}^2$$
PCA Derivation
To minimize $J$, take partial derivatives and enforce the constraint that $\{e_1, e_2, \dots, e_k\}$ are orthonormal.
$$J(e_1, \dots, e_k, \alpha_{11}, \dots, \alpha_{nk}) = \sum_{j=1}^{n} \|x_j\|^2 - 2\sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}\, x_j^T e_i + \sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}^2$$
First take the partial derivative with respect to $\alpha_{ml}$:
$$\frac{\partial J}{\partial \alpha_{ml}} = -2\, x_m^T e_l + 2\,\alpha_{ml}$$
Setting it to zero, the optimal value of $\alpha_{ml}$ is
$$-2\, x_m^T e_l + 2\,\alpha_{ml} = 0 \;\;\Rightarrow\;\; \alpha_{ml} = x_m^T e_l$$
Plug the optimal value of $\alpha_{ml}$ back into $J$:
$$J(e_1, \dots, e_k) = \sum_{j=1}^{n} \|x_j\|^2 - 2\sum_{j=1}^{n}\sum_{i=1}^{k} (x_j^T e_i)^2 + \sum_{j=1}^{n}\sum_{i=1}^{k} (x_j^T e_i)^2 = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{j=1}^{n}\sum_{i=1}^{k} (x_j^T e_i)^2$$
$$= \sum_{j=1}^{n} \|x_j\|^2 - \sum_{i=1}^{k} e_i^T S\, e_i, \qquad S = \sum_{j=1}^{n} x_j x_j^T$$
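Since $\sum_j \|x_j\|^2$ does not depend on the basis, minimizing $J$ is equivalent to maximizing $\sum_{i=1}^{k} e_i^T S e_i$ subject to orthonormality, whose solution is the $k$ eigenvectors of the scatter matrix $S$ with the largest eigenvalues. A quick numeric check of this claim (a sketch assuming NumPy; the data size and subspace dimension are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
X -= X.mean(axis=0)                   # zero-mean data, as in the derivation
S = X.T @ X                           # scatter matrix S = sum_j x_j x_j^T

def J(X, E):
    """Total squared reconstruction error for an orthonormal basis E (d x k)."""
    proj = X @ E @ E.T                # projection of each sample onto span(E)
    return np.sum((X - proj) ** 2)

vals, vecs = np.linalg.eigh(S)
top2 = vecs[:, np.argsort(vals)[::-1][:2]]     # two leading eigenvectors of S

# Compare against a random orthonormal 2-D basis
Q, _ = np.linalg.qr(rng.normal(size=(4, 2)))
print(J(X, top2))   # smallest achievable error for a 2-D subspace
print(J(X, Q))      # larger (or equal) for any other 2-D orthonormal basis
```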