  1. Dimension Reduction. CSE 6242 / CX 4242. Thanks: Prof. Jaegul Choo, Dr. Ramakrishnan Kannan, Prof. Le Song

  2. What is Dimension Reduction? Data is a matrix indexed by dimension index (d) and data item index (n), with columns as data items; dimension reduction maps it to a low-dim representation. Why do it, and how big can this matrix get? (Attribute = Feature = Variable = Dimension.)

  3. Image Data. Raw images are serialized/rasterized into a single vector of pixel values, so each image becomes one high-dimensional data item. In a 4K (4096x2160) image there are 8.8 million pixels in total. [Figure: raw image, its pixel values, and the serialized pixel vector]

  4. Video Data. Each frame's pixel values are serialized/rasterized into one column, so the dimensions are huge: a 4096x2160 frame gives 8,847,360 dimensions, and at 30 fps a 2-minute video yields a matrix of size 8,847,360 x 3,600. [Figure: raw frames, pixel values, and the serialized pixel columns]
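A minimal NumPy sketch of the arithmetic behind these sizes; the tiny 4x4 "video" at the end is only there to illustrate the shapes, since the real matrix would be far too large to allocate:

    import numpy as np

    # One 4K frame, serialized (rasterized) into a single column vector.
    height, width = 2160, 4096
    pixels_per_frame = height * width        # 8,847,360 dimensions per frame
    frames = 30 * 120                        # 30 fps for 2 minutes = 3,600 frames
    print(pixels_per_frame, frames)          # 8847360 3600

    # The full data matrix would be d x n = 8,847,360 x 3,600, so we only
    # illustrate the shapes with a tiny stand-in: five 4x4 frames.
    tiny_video = np.random.randint(0, 256, size=(5, 4, 4))
    data_matrix = tiny_video.reshape(5, -1).T   # each column = one serialized frame
    print(data_matrix.shape)                    # (16, 5): d=16 pixels, n=5 frames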

  5. Text Documents. Bag-of-words vector. Document 1 = "Life of Pi won Oscar"; Document 2 = "Life of Pi is also a book."

      Vocabulary   Doc 1   Doc 2
      Life           1       1
      Pi             1       1
      movies         0       0
      ...
      also           0       1
      oscar          1       0
      book           0       1
      won            1       0
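A small sketch of how these bag-of-words vectors can be built; the lowercase, punctuation-stripping tokenizer is an illustrative assumption, not part of the slide:

    import re

    docs = ["Life of Pi won Oscar", "Life of Pi is also a book."]
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

    vocabulary = sorted(set(w for doc in tokenized for w in doc))
    vectors = [[doc.count(w) for w in vocabulary] for doc in tokenized]

    for word, counts in zip(vocabulary, zip(*vectors)):
        print(f"{word:>5}  Doc1={counts[0]}  Doc2={counts[1]}")
    # The counts for life, pi, oscar, won, also, and book match the table above.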

  6. Two Axes of Data Set. Data items: how many data items are there? Dimensions: how many dimensions represent each item? The matrix is indexed by data item index (n) and dimension index (d), and can be stored with columns as data items vs. rows as data items; we will use columns as data items during the lecture.

  7. Dimension Reduction. High-dim data (dimension index d, data item index n) goes into Dimension Reduction and comes out as low-dim data with a reduced number of dimensions k, where k is user-specified. The method may also take additional info about the data and other parameters, and it produces a dim-reducing transformation that can be applied to new data.

  8. Benefits of Dimension Reduction. Obviously: compression, visualization, and faster computation (e.g., computing distances between 100,000-dim vs. 10-dim vectors). More importantly: noise removal (improving data quality), since it separates the data into General Pattern + Sparse + Noise (or is the noise the important signal?). It also works as pre-processing for better performance, e.g., in microarray data analysis, information retrieval, face recognition, protein disorder prediction, network intrusion detection, document categorization, and speech recognition.

  9. Two Main Techniques. 1. Feature selection: selects a subset of the original variables as reduced dimensions, relevant for a particular task (e.g., the number of genes responsible for a particular disease may be small). 2. Feature extraction: each reduced dimension combines multiple original dimensions, so the original dataset is transformed into some other numbers. (Feature = Variable = Dimension.)

  10. Feature Selection. What is the optimal subset of m features that maximizes a given criterion? Widely-used criteria: information gain, correlation, etc. These are typically combinatorial optimization problems, so greedy methods are popular. Forward selection: empty set → add one variable at a time. Backward elimination: entire set → remove one variable at a time.
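A minimal sketch of greedy forward selection. The scoring criterion used here (absolute correlation between the average of the chosen features and a target y) is only an illustrative stand-in for the criteria named above:

    import numpy as np

    def forward_selection(X, y, m):
        """Start from the empty set and greedily add the single feature
        that most improves the criterion, m times."""
        def score(cols):
            # Illustrative criterion: |correlation| of the selected
            # features (averaged) with the target.
            return abs(np.corrcoef(X[:, cols].mean(axis=1), y)[0, 1])
        selected, remaining = [], list(range(X.shape[1]))
        for _ in range(m):
            best = max(remaining, key=lambda j: score(selected + [j]))
            selected.append(best)
            remaining.remove(best)
        return selected

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 3 * X[:, 2] + 0.1 * rng.normal(size=100)   # feature 2 is most informative
    print(forward_selection(X, y, m=2))            # picks feature 2 first

Backward elimination is the mirror image: start from all features and repeatedly drop the one whose removal hurts the criterion least.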

  11. Feature Extraction

  12. Aspects of Dimension Reduction. Linear vs. Nonlinear; Unsupervised vs. Supervised; Global vs. Local; Feature vectors vs. Similarity (as an input).

  13. Linear vs. Nonlinear. Linear: represents each reduced dimension as a linear combination of the original dimensions, i.e., of the form aX + b where a, X, and b are vectors/matrices. E.g., Y1 = 3*X1 - 4*X2 + 0.3*X3 - 1.5*X4; Y2 = 2*X1 + 3.2*X2 - X3 + 2*X4. Naturally capable of mapping new data to the same space. Example (Dimension Reduction maps the X table to the Y table):

      Original     D1   D2        Reduced     D1      D2
      X1            1    1        Y1          1.75   -0.27
      X2            1    0        Y2         -0.21    0.58
      X3            0    2
      X4            1    1
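A minimal sketch of such a linear map. The projection matrix below encodes exactly the Y1/Y2 formulas from the slide and is applied to the two data items from the X table; note that the slide's reduced values (1.75, -0.27, ...) come from a different, learned projection, so the numbers printed here are only illustrative:

    import numpy as np

    # Y1 = 3*X1 - 4*X2 + 0.3*X3 - 1.5*X4
    # Y2 = 2*X1 + 3.2*X2 -   X3 + 2*X4
    A = np.array([[3.0, -4.0,  0.3, -1.5],
                  [2.0,  3.2, -1.0,  2.0]])   # k x d projection matrix

    X = np.array([[1.0, 1.0],    # X1 for data items D1, D2 (slide's table)
                  [1.0, 0.0],    # X2
                  [0.0, 2.0],    # X3
                  [1.0, 1.0]])   # X4 -> a d x n data matrix

    Y = A @ X    # k x n reduced data; any new column maps the same way
    print(Y)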

  14. Linear vs. Nonlinear. Linear: represents each reduced dimension as a linear combination of the original dimensions, e.g., Y1 = 3*X1 - 4*X2 + 0.3*X3 - 1.5*X4, Y2 = 2*X1 + 3.2*X2 - X3 + 2*X4; naturally capable of mapping new data to the same space. Nonlinear: more complicated, but generally more powerful; a recently popular topic.

  15. Unsupervised vs. Supervised. Unsupervised: uses only the input data. [Diagram: high-dim data plus the number of dimensions and other parameters → Dimension Reduction → low-dim data and a transformer for new data; additional info about the data is not used]

  16. Unsupervised vs. Supervised. Supervised: uses the input data + additional info about the data. [Diagram: same pipeline as before, now with the additional-info input feeding into Dimension Reduction]

  17. Unsupervised vs. Supervised. Supervised: uses the input data + additional info about the data, e.g., a grouping label for each item.

  18. Global vs. Local. Dimension reduction typically tries to preserve all the relationships/distances in the data, but information loss is unavoidable. Then what should we emphasize more? Global: treats all pairwise distances as equally important, which focuses on preserving large distances. Local: focuses on small distances and neighborhood relationships; an active research area, e.g., manifold learning.

  19. Feature vectors vs. Similarity (as an input). Typical setup: feature vectors as the input. [Diagram: high-dim data (dimension index d, data item index n) plus the number of dimensions k, additional info about the data, and other parameters → Dimension Reduction → low-dim data (k dimensions) and a transformer for new data]

  20. Feature vectors vs. Similarity (as an input). Alternatively, the method takes a similarity matrix instead: the (i, j)-th component indicates the similarity between the i-th and j-th data items. Assuming the distance is a metric, the similarity matrix is symmetric. [Diagram: similarity matrix, plus the number of dimensions, additional info, and other parameters → Dimension Reduction → low-dim data and a transformer for new data]

  21. Feature vectors vs. Similarity (as an input). Internally, such a method converts the feature vectors to a similarity matrix before performing dimension reduction: high-dim data (d x n) → similarity matrix (n x n) → Dimension Reduction → low-dim data (k x n). This path is also known as graph embedding.
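A minimal sketch of the conversion from a d x n feature matrix to an n x n symmetric similarity matrix. The Gaussian kernel on Euclidean distances is an illustrative choice; the slides only require that the matrix be symmetric:

    import numpy as np

    def similarity_matrix(X, sigma=1.0):
        """X: d x n feature matrix -> n x n symmetric similarity matrix."""
        sq_norms = (X ** 2).sum(axis=0)
        d2 = sq_norms[:, None] + sq_norms[None, :] - 2 * X.T @ X
        d2 = np.maximum(d2, 0)                    # guard against tiny negatives
        return np.exp(-d2 / (2 * sigma ** 2))     # Gaussian (RBF) similarity

    X = np.random.default_rng(0).normal(size=(10, 6))   # d=10, n=6
    S = similarity_matrix(X)
    print(S.shape, np.allclose(S, S.T))                  # (6, 6) True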

  22. Feature vectors vs. Similarity (as an input). Why is it called graph embedding? The similarity matrix can be viewed as a graph in which each similarity value is an edge weight: high-dim data (d x n) → similarity matrix (graph) → low-dim data.

  23. Methods. Traditional: principal component analysis (PCA), multidimensional scaling (MDS), linear discriminant analysis (LDA). Advanced (nonlinear, kernelized, manifold learning): isometric feature mapping (Isomap). * Matlab codes are available at http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html

  24. Principal Component Analysis. Finds the axis showing the largest variation and projects all points onto this axis; the reduced dimensions (PC1, PC2, ...) are orthogonal. Algorithm: eigen-decomposition. Pros: fast. Cons: limited performance. Properties: linear, unsupervised, global, feature vectors as input. Image source: http://en.wikipedia.org/wiki/Principal_component_analysis

  25. PCA – Some Questions. Algorithm: subtract the mean from the dataset (X - μ); form the covariance matrix (X - μ)'(X - μ); perform an eigen-decomposition of this covariance matrix. Key questions: Why the covariance matrix? SVD on the original matrix vs. eigen-decomposition of the covariance matrix?
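A minimal NumPy sketch of the recipe on this slide (center, covariance, eigen-decomposition); the number of reduced dimensions k is chosen by the user:

    import numpy as np

    def pca(X, k):
        """X: n x d data matrix (rows are data items). Returns n x k scores."""
        Xc = X - X.mean(axis=0)                # subtract the mean (X - mu)
        C = Xc.T @ Xc / (len(X) - 1)           # d x d covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)   # eigh: for symmetric matrices
        order = np.argsort(eigvals)[::-1]      # largest variance first
        W = eigvecs[:, order[:k]]              # top-k principal axes
        return Xc @ W                          # project onto PC1, PC2, ...

    X = np.random.default_rng(0).normal(size=(200, 5))
    print(pca(X, k=2).shape)                   # (200, 2)

As for the second question: the right singular vectors of the centered matrix Xc (from an SVD) span the same principal axes, and the SVD route is often preferred numerically because it avoids forming the covariance matrix explicitly.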

  26. Multidimensional Scaling (MDS). Main idea: tries to preserve the given pairwise distances in the low-dimensional space, matching each ideal distance with the corresponding low-dim distance. Metric MDS preserves the given distance values; nonmetric MDS, used when you only know/care about the ordering of distances, preserves only the orderings of the distance values. Algorithm: gradient-descent type. (c.f. classical MDS is the same as PCA.) Properties: nonlinear, unsupervised, global, similarity input.
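A minimal sketch of metric and nonmetric MDS using scikit-learn; the particular solver and its parameters are implementation details beyond the slide, so treat this as one possible instantiation:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import MDS

    X = np.random.default_rng(0).normal(size=(50, 20))   # 50 items, 20 dims
    D = squareform(pdist(X))                              # ideal pairwise distances

    # Metric MDS: preserve the given distance values.
    Y_metric = MDS(n_components=2, dissimilarity="precomputed",
                   metric=True, random_state=0).fit_transform(D)

    # Nonmetric MDS: preserve only the ordering of the distances (slower).
    Y_nonmetric = MDS(n_components=2, dissimilarity="precomputed",
                      metric=False, random_state=0).fit_transform(D)
    print(Y_metric.shape, Y_nonmetric.shape)              # (50, 2) (50, 2)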

  27. Multidimensional Scaling. Pros: widely used (works well in general). Cons: slow (an n-body problem), and nonmetric MDS is even slower than metric MDS. Fast algorithms are available: the Barnes-Hut algorithm and GPU-based implementations.

  28. Linear Discriminant Analysis. What if clustering information is available? LDA tries to separate the clusters by putting different clusters as far apart as possible and making each cluster as compact as possible.
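A minimal sketch of the two quantities LDA trades off; the within-/between-class scatter-matrix formulation below is the standard one, stated here as background rather than taken from the slide:

    import numpy as np

    def scatter_matrices(X, y):
        """X: n x d data, y: cluster labels. Returns (S_within, S_between)."""
        mu = X.mean(axis=0)
        d = X.shape[1]
        Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            Sw += (Xc - mu_c).T @ (Xc - mu_c)    # how compact cluster c is
            diff = (mu_c - mu)[:, None]
            Sb += len(Xc) * diff @ diff.T        # how far cluster c sits from the overall mean
        return Sw, Sb

    # LDA's projection directions are the leading eigenvectors of inv(Sw) @ Sb:
    # large between-class scatter, small within-class scatter.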

  29. Aspects of Dimension Reduction: Unsupervised vs. Supervised. Supervised: uses the input data + additional info about the data, e.g., a grouping label. [Diagram: high-dim data plus additional info, the number of dimensions, and other parameters → Dimension Reduction → low-dim data and a transformer for new data]

  30. Linear Discriminant Analysis (LDA) vs. Principal Component Analysis. 2D visualization of a mixture of 7 Gaussians in 1,000 dimensions: linear discriminant analysis (supervised) vs. principal component analysis (unsupervised). [Figure: side-by-side 2D scatter plots of the two projections]
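A sketch in the spirit of this comparison: a synthetic mixture of 7 Gaussians in 1,000 dimensions, projected to 2D by supervised LDA and unsupervised PCA. The data generator and the scikit-learn classes are assumptions; the slide's actual dataset is not specified:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    means = rng.normal(scale=3.0, size=(7, 1000))              # 7 cluster centers
    X = np.vstack([m + rng.normal(size=(100, 1000)) for m in means])
    y = np.repeat(np.arange(7), 100)                           # grouping labels

    Y_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses labels
    Y_pca = PCA(n_components=2).fit_transform(X)                            # ignores labels
    print(Y_lda.shape, Y_pca.shape)    # (700, 2) (700, 2)
    # LDA typically separates the 7 groups far more cleanly in 2D,
    # because it is allowed to use the labels that PCA never sees.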
