
Machine Learning: Dimensionality Reduction - Hamid R. Rabiee, Jafar Muhammadi - PowerPoint PPT Presentation



  1. Machine Learning: Dimensionality Reduction. Hamid R. Rabiee, Jafar Muhammadi, Alireza Ghasemi. Spring 2015. http://ce.sharif.edu/courses/93-94/2/ce717-1/

  2. Agenda
     - Dimensionality Reduction
     - Feature Extraction
     - Feature Extraction Approaches
     - Linear Methods
       - Principal Component Analysis (PCA)
       - Linear Discriminant Analysis (LDA)
       - Multiple Discriminant Analysis (MDA)
       - PCA vs. LDA
       - Linear Methods Drawbacks
     - Nonlinear Dimensionality Reduction
       - ISOMAP
       - Locally Linear Embedding (LLE)
       - ISOMAP vs. LLE

  3. Dimensionality Reduction
     - Feature Selection (discussed last time)
       - Select the best subset of a given feature set
     - Feature Extraction (will be discussed today)
       - Create new features based on the original feature set
       - Transforms are usually involved

  4. Why Dimensionality Reduction?
     - Most machine learning and data mining techniques may not be effective for high-dimensional data
       - Curse of dimensionality: query accuracy and efficiency degrade rapidly as the dimension increases
     - The intrinsic dimension may be small
       - For example, the number of genes responsible for a certain type of disease may be small
     - Visualization: projection of high-dimensional data onto 2D or 3D
     - Data compression: efficient storage and retrieval
     - Noise removal: positive effect on query accuracy
     Adapted from slides of Arizona State University

  5. Feature Extraction
     - A feature extractor maps each sample X_i to a lower-dimensional Y_i:
       $X_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^T \;\xrightarrow{\text{Feature Extractor}}\; Y_i = f(X_i) = [y_{i1}, y_{i2}, \ldots, y_{im}]^T$, usually with $m \ll d$
     - For example, a 4-dimensional $X = [x_1, x_2, x_3, x_4]^T$ may be mapped to a 2-dimensional $Y$ whose first component combines $x_1, x_2$ and whose second combines $x_3, x_4$
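As a concrete illustration of the mapping above, the following is a minimal NumPy sketch (not from the slides) of a linear feature extractor; the weight matrix W, which simply averages pairs of features, is a hypothetical choice used only to show the shapes involved.

```python
import numpy as np

# Toy data: n = 5 samples, each with d = 4 original features.
X = np.random.randn(5, 4)

# Hypothetical linear extractor, written for the whole data matrix as Y = X @ W.
# The first new feature averages (x1, x2) and the second averages (x3, x4), so m = 2.
W = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 0.5],
              [0.0, 0.5]])

Y = X @ W            # shape (5, 2): each row is the extracted feature vector Y_i
print(Y.shape)
```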

  6. Feature Extraction Approaches
     - The best f(x) is most likely a non-linear function, but linear functions are easier to find
     - Linear approaches
       - Principal Component Analysis (PCA), also known as the Karhunen-Loeve Expansion (KLE) (will be discussed)
       - Linear Discriminant Analysis (LDA) (will be discussed)
       - Multiple Discriminant Analysis (MDA) (will be discussed)
       - Independent Component Analysis (ICA)
       - Projection Pursuit
       - Factor Analysis
       - Multidimensional Scaling (MDS)

  7. Feature Extraction Approaches
     - Non-linear approaches (a sketch of the first three on toy data follows this slide)
       - Kernel PCA
       - ISOMAP
       - Locally Linear Embedding (LLE)
       - Neural networks
         - Feed-forward neural networks: high-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct the high-dimensional input vectors
         - Ref: Hinton, G. E. and Salakhutdinov, R. R. (2006). "Reducing the dimensionality of data with neural networks." Science, Vol. 313, No. 5786, pp. 504-507, 28 July 2006.
       - Self-Organizing Map (SOM)
         - A clustering approach to dimensionality reduction: transforms data onto a lower-dimensional lattice
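The first three non-linear methods above are available in scikit-learn; the following is a minimal sketch (not from the slides) that runs them on a toy Swiss-roll data set. Parameter values such as n_neighbors=10 and gamma=0.05 are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # 3-D points on a curved sheet

# Kernel PCA: PCA in a kernel-induced feature space (RBF kernel here).
Y_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.05).fit_transform(X)

# ISOMAP: preserves geodesic (along-the-manifold) distances.
Y_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# LLE: preserves the local linear reconstruction weights of each neighborhood.
Y_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)

print(Y_kpca.shape, Y_iso.shape, Y_lle.shape)   # each (1000, 2)
```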

  8. Feature Extraction Approaches
     - Another view
       - Unsupervised approaches: PCA, LLE, Self-Organizing Map
       - Supervised approaches: LDA, MDA

  9. Principal Component Analysis (PCA)
     - Main idea: seek the most accurate data representation in a lower-dimensional space
     - Example in 2-D: project the data onto the 1-D subspace (a line) that minimizes the projection error
     - Notice that the good line to project onto lies in the direction of largest variance
     (Figure: the line along the direction of largest variance gives small projection errors and is a good line to project to; the orthogonal line gives large projection errors and is a bad line to project to.)

  10. Principal Component Analysis (PCA)
     - PCA preserves the largest variances in the data
     - What is the direction of largest variance in the data?
     - Hint: if x has a multivariate Gaussian distribution N(μ, Σ), the direction of largest variance is given by the eigenvector corresponding to the largest eigenvalue of Σ
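A minimal NumPy sketch (not from the slides) that checks this hint numerically; the stretched-Gaussian data set and the chosen direction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_direction = np.array([3.0, 1.0]) / np.sqrt(10.0)     # assumed "long axis" of the cloud

# Sample 2-D Gaussian data with std 3 along the first axis and 0.5 along the second,
# then rotate the cloud so its long axis lies along true_direction.
samples = rng.normal(size=(2000, 2)) * np.array([3.0, 0.5])
R = np.column_stack([true_direction, [-true_direction[1], true_direction[0]]])
X = samples @ R.T

Sigma = np.cov(X, rowvar=False)                            # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)                   # eigenvalues in ascending order
top = eigvecs[:, -1]                                       # eigenvector of the largest eigenvalue

print(np.abs(top @ true_direction))                        # close to 1: the directions align
```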

  11. Principal Component Analysis (PCA)
     - We can derive the following algorithm (details in the next slides)
     - PCA algorithm
       - X ← input n x d data matrix (each row a d-dimensional sample)
       - X ← subtract the mean of X from each row of X; the new data has zero mean (centered data)
       - Σ ← covariance matrix of X
       - Find the eigenvectors and eigenvalues of Σ
       - C ← the M eigenvectors with the largest eigenvalues, each in a column (a d x M matrix); the eigenvalues indicate the importance of each component
       - Y (transformed data) ← transform X using C (Y = X * C)
     - The number of new dimensions is M (M << d)
     - Q: How much is the data energy loss? (See the sketch after this slide.)
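A minimal NumPy implementation (not from the slides) of the algorithm as listed above; the toy data at the bottom is an illustrative assumption.

```python
import numpy as np

def pca(X, M):
    """Project the n x d data matrix X onto its top-M principal components."""
    X_centered = X - X.mean(axis=0)                 # subtract the mean from each row
    Sigma = np.cov(X_centered, rowvar=False)        # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # sort descending by eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    C = eigvecs[:, :M]                              # d x M matrix of top eigenvectors
    Y = X_centered @ C                              # n x M transformed data
    retained = eigvals[:M].sum() / eigvals.sum()    # fraction of variance ("energy") kept
    return Y, C, retained

X = np.random.randn(200, 10) @ np.random.randn(10, 10)   # toy correlated data
Y, C, retained = pca(X, M=3)
print(Y.shape, f"variance retained: {retained:.2%}")
```

Regarding the question on the slide: with this formulation, the energy lost by keeping only M components is the sum of the discarded eigenvalues divided by the sum of all eigenvalues, i.e. 1 minus the retained fraction computed above.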

  12. Principal Component Analysis (PCA)
     - Illustration: (figure showing a scatter of points with the original axes, the first principal component along the direction of greatest spread, and the second principal component orthogonal to it)

  13. Principal Component Analysis (PCA)
     - Example: (figure)

  14. Principal Component Analysis (PCA)
     (Figure adapted from lectures of Duncan Fyfe Gillies.)

  15. Principal Component Analysis (PCA)
     - Drawbacks
       - PCA was designed for accurate data representation, not for data classification
       - It preserves as much variance in the data as possible
       - If the direction of maximum variance is important for classification, PCA will work (can you give an example?)
       - However, the direction of maximum variance may be useless for classification (a toy comparison with LDA follows this slide)
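A minimal scikit-learn sketch (not from the slides) of this drawback: two classes whose separation is orthogonal to the direction of maximum variance. PCA's 1-D projection mixes them, while LDA's keeps them apart. The data set and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 500
# Both classes spread widely along x (high variance), separated only along y.
X0 = np.column_stack([rng.normal(0, 5, n), rng.normal(-1.5, 0.3, n)])
X1 = np.column_stack([rng.normal(0, 5, n), rng.normal(+1.5, 0.3, n)])
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

z_pca = PCA(n_components=1).fit_transform(X).ravel()                     # unsupervised projection
z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y).ravel()  # supervised projection

def separation(z, y):
    """Distance between class means in units of the overall standard deviation of z."""
    return abs(z[y == 0].mean() - z[y == 1].mean()) / z.std()

print("PCA separation:", round(separation(z_pca, y), 2))   # near 0: classes overlap
print("LDA separation:", round(separation(z_lda, y), 2))   # large: classes separated
```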

  16. PCA Derivation
     - PCA can be derived from several viewpoints:
       - Minimum projection error [least squares error]
       - Maximum information gain [maximum variance]
       - Or via neural nets
     - The result is the same!
     - Least squares error == maximum variance: by the Pythagorean theorem, for each centered point x the squared norm splits as ||x||^2 = ||projection||^2 + ||error||^2; since ||x||^2 is fixed, minimizing the total squared projection error is equivalent to maximizing the variance captured by the projection (a numerical check follows this slide)
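A minimal NumPy sketch (not from the slides) checking this Pythagorean split numerically for an arbitrary subspace; the random data and the subspace dimension k = 2 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
X -= X.mean(axis=0)                                   # centered data

# An arbitrary orthonormal basis of a k = 2 dimensional subspace, obtained via QR.
E, _ = np.linalg.qr(rng.normal(size=(5, 2)))          # 5 x 2, orthonormal columns

proj = X @ E @ E.T                                    # projection of each row onto the subspace
err = X - proj                                        # projection error

total = (X ** 2).sum()
print(np.allclose(total, (proj ** 2).sum() + (err ** 2).sum()))   # True: Pythagorean split holds
```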

  17. PCA Derivation
     - We want to find the most accurate representation of the d-dimensional data D = {x_1, x_2, ..., x_n} in some subspace W of dimension k < d
     - Let {e_1, e_2, ..., e_k} be an orthonormal basis for W. Any vector in W can be written as $\sum_{i=1}^{k} \alpha_i e_i$; the e_i are d-dimensional vectors in the original space
     - Thus x_1 will be represented by some vector in W: $\sum_{i=1}^{k} \alpha_{1i} e_i$
     - The error of this representation is $\left\| x_1 - \sum_{i=1}^{k} \alpha_{1i} e_i \right\|^2$
     - Then, using the orthonormality of the e_i, the total error over all samples is
       $J = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji} e_i \right\|^2 = \sum_{j=1}^{n} \left\| x_j \right\|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}\, x_j^t e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2$

  18. PCA Derivation
     - To minimize J, take partial derivatives and enforce the constraint that {e_1, e_2, ..., e_k} are orthonormal:
       $J(e_1, \ldots, e_k, \alpha_{11}, \ldots, \alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j \right\|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}\, x_j^t e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2$
     - First take partial derivatives with respect to $\alpha_{ml}$:
       $\frac{\partial J}{\partial \alpha_{ml}} = -2\, x_m^t e_l + 2\, \alpha_{ml}$
     - Thus the optimal value for $\alpha_{ml}$ is
       $-2\, x_m^t e_l + 2\, \alpha_{ml} = 0 \;\Rightarrow\; \alpha_{ml} = x_m^t e_l$
     - Plug the optimal $\alpha_{ml}$ back into J:
       $J(e_1, \ldots, e_k) = \sum_{j=1}^{n} \left\| x_j \right\|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)^2 + \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)^2 = \sum_{j=1}^{n} \left\| x_j \right\|^2 - \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)^2$
       $= \sum_{j=1}^{n} \left\| x_j \right\|^2 - \sum_{i=1}^{k} e_i^t \left( \sum_{j=1}^{n} x_j x_j^t \right) e_i = \sum_{j=1}^{n} \left\| x_j \right\|^2 - \sum_{i=1}^{k} e_i^t S\, e_i, \quad \text{where } S = \sum_{j=1}^{n} x_j x_j^t$
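A minimal NumPy check (not from the slides) of where this derivation is heading: with the optimal coefficients, J equals the total squared norm minus the sum of e_i^t S e_i, and it is smallest when the e_i are the eigenvectors of the scatter matrix S with the largest eigenvalues. The toy data set is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))   # toy data, rows are the x_j
S = X.T @ X                                               # scatter matrix sum_j x_j x_j^t
k = 2

def J(E):
    """Total squared reconstruction error for an orthonormal d x k basis E."""
    return (X ** 2).sum() - np.trace(E.T @ S @ E)

eigvals, eigvecs = np.linalg.eigh(S)                      # eigenvalues in ascending order
E_top = eigvecs[:, -k:]                                   # eigenvectors of the k largest eigenvalues
E_rand, _ = np.linalg.qr(rng.normal(size=(6, k)))         # an arbitrary orthonormal basis

print(J(E_top) <= J(E_rand))                              # True: the top eigenvectors minimize J
```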
