  1. DIMENSIONALITY REDUCTION. Matthieu Bloch, April 21, 2020.

  2. MULTIDIMENSIONAL SCALING
     There are situations for which Euclidean distance is not appropriate.
     Suppose we have access to a dissimilarity matrix $\Delta = [\delta_{ij}]$ built from some dissimilarity function $d$, i.e., $\delta_{ij} = d(x_i, x_j)$.
     A dissimilarity matrix satisfies $\delta_{ij} \geq 0$, $\delta_{ii} = 0$, and $\delta_{ij} = \delta_{ji}$; the triangle inequality is not required.
     Multidimensional scaling (MDS): find a dimension $d$ and points $y_1, \dots, y_n \in \mathbb{R}^d$ such that $\|y_i - y_j\| \approx \delta_{ij}$.
     In general, a perfect embedding into the desired dimension will not exist.
     Many variants of MDS exist, based on the choice of loss, whether $\Delta$ is completely known, and how faithfully the dissimilarities must be preserved.
     Two types of MDS:
       Metric MDS: try to ensure that $\|y_i - y_j\| \approx \delta_{ij}$.
       Non-metric MDS: try to ensure that $\|y_i - y_j\| \leq \|y_k - y_\ell\|$ whenever $\delta_{ij} \leq \delta_{k\ell}$.
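A minimal metric-MDS sketch in Python (not from the slides): the toy dataset, the cityblock dissimilarity, and the use of scikit-learn's MDS solver are illustrative assumptions.

```python
# Minimal metric-MDS sketch: embed a precomputed dissimilarity matrix into R^2.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # toy high-dimensional data
Delta = squareform(pdist(X, metric="cityblock"))    # a non-Euclidean dissimilarity matrix

mds = MDS(n_components=2, dissimilarity="precomputed", metric=True, random_state=0)
Y = mds.fit_transform(Delta)                        # y_1, ..., y_n in R^2
print(Y.shape, mds.stress_)                         # embedding and residual stress
```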

  3. EUCLIDEAN EMBEDDINGS
     Assume $\Delta$ is completely known (no missing entry) and $\delta_{ij} = \|x_i - x_j\|_2$.
     Algorithm to create the embedding:
       Form $B \triangleq -\frac{1}{2} H \Delta^{(2)} H$, where $\Delta^{(2)} \triangleq [\delta_{ij}^2]$ and $H \triangleq I - \frac{1}{n}\mathbf{1}\mathbf{1}^\intercal$ is the centering matrix.
       Compute the eigendecomposition $B = U \Lambda U^\intercal$.
       Return $Y \triangleq U_d \Lambda_d^{1/2}$, where $U_d$ consists of the first $d$ columns of $U$ and $\Lambda_d$ consists of the first $d$ rows and columns of $\Lambda$.
     Where is this coming from?
     Theorem (Eckart-Young Theorem). The above algorithm returns the best rank-$d$ approximation of $B$, in the sense that it minimizes $\|B - Y Y^\intercal\|_F$ and $\|B - Y Y^\intercal\|_2$.
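A small NumPy sketch of the algorithm above, assuming a Euclidean distance matrix as input; variable names are mine.

```python
# Classical (Euclidean) MDS following the slide's recipe.
import numpy as np

def classical_mds(Delta: np.ndarray, d: int) -> np.ndarray:
    """Embed an n x n Euclidean distance matrix Delta into R^d."""
    n = Delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * H @ (Delta ** 2) @ H              # doubly centered squared distances
    evals, evecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:d]            # keep the d largest eigenpairs
    L = np.clip(evals[idx], 0, None)             # guard against tiny negative eigenvalues
    return evecs[:, idx] * np.sqrt(L)            # Y = U_d Lambda_d^{1/2}
```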

  4. (figure slide)

  5. (figure slide)

  6. (figure slide)

  7. PCA VS MDS
     Suppose we have $x_1, \dots, x_n \in \mathbb{R}^p$ and a centered data matrix $X \in \mathbb{R}^{n \times p}$ such that the $i$-th row of $X$ is $x_i^\intercal$. Set $\delta_{ij} = \|x_i - x_j\|_2$.
     PCA computes an eigendecomposition of $X^\intercal X = V \Sigma^2 V^\intercal$.
       Equivalent to computing the SVD of $X = U \Sigma V^\intercal$.
       The new representation is computed as $Y = X V_d = U_d \Sigma_d$, where $V_d$ consists of the first $d$ columns of $V$.
     MDS computes an eigendecomposition of the Gram matrix $B = X X^\intercal = U \Sigma^2 U^\intercal$.
       Return $Y = U_d \Sigma_d$.
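A quick numerical check of this equivalence (my own illustration, not from the slides): on centered data, the PCA scores from the SVD of X coincide with the classical-MDS embedding obtained from the Gram matrix, up to column signs.

```python
# Check that PCA scores and the Gram-matrix (MDS) embedding agree up to sign.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X = X - X.mean(axis=0)                       # center the data
d = 3

# PCA route: SVD of X, scores Y = U_d Sigma_d
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Y_pca = U[:, :d] * S[:d]

# MDS route: eigendecomposition of the Gram matrix B = X X^T
evals, evecs = np.linalg.eigh(X @ X.T)
idx = np.argsort(evals)[::-1][:d]
Y_mds = evecs[:, idx] * np.sqrt(evals[idx])

# Agreement up to a sign flip per column
signs = np.sign(np.sum(Y_pca * Y_mds, axis=0))
print(np.allclose(Y_pca, Y_mds * signs))     # True
```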

  8. PCA VS MDS
     There is a subtle difference between PCA and MDS.
     PCA gives us access to $V_d$ and $Y$: we can extract features for new points and reconstruct approximations $\hat{x}_i$.
     MDS only returns $Y$: we would need to recover $V_d$. How can we extract features in MDS and compute $\hat{x}_i$? It is also important to be able to add a new point.
     Lemma. Assume we have access to $\Delta$ and want to add a new point $x_0$ to our embedding. Define $b_0 \in \mathbb{R}^n$ with entries
     $b_{0i} \triangleq -\tfrac{1}{2}\Big(\delta_{0i}^2 - \tfrac{1}{n}\sum_{j} \delta_{0j}^2 - \tfrac{1}{n}\sum_{j} \delta_{ij}^2 + \tfrac{1}{n^2}\sum_{j,k} \delta_{jk}^2\Big).$
     Then $y_0 = \Lambda_d^{-1/2} U_d^\intercal b_0$, where $U_d$ consists of the first $d$ columns of $U$ from the decomposition of $B$.

  9. EXTENSIONS OF MDS
     Classical MDS minimizes the loss function $L(Y) = \sum_{i,j} \left(b_{ij} - y_i^\intercal y_j\right)^2$.
     Many other choices exist. A common choice is the stress function
     $S(Y) = \sum_{i<j} w_{ij} \left(\delta_{ij} - \|y_i - y_j\|\right)^2,$
     where the $w_{ij}$ are fixed weights: setting $w_{ij} = 0$ handles missing data, and putting larger weight on small $\delta_{ij}$ penalizes error on nearby points.
     Nonlinear embeddings:
       High-dimensional data sets can have nonlinear structure that is not captured by linear methods.
       Kernelize PCA and MDS with a nonlinear feature map $\Phi$: use PCA on $\{\Phi(x_i)\}$ or MDS on $\delta_{ij} = \|\Phi(x_i) - \Phi(x_j)\|$.

  10. COMPUTING KERNEL PCA
      Input: dataset $\{x_i\}_{i=1}^n$, kernel $k$, dimension $d$.
      Kernel PCA:
        1. Form $\tilde{K} \triangleq H K H$, where $K \triangleq [k(x_i, x_j)]$ is the kernel matrix and $H \triangleq I - \frac{1}{n}\mathbf{1}\mathbf{1}^\intercal$ is the centering matrix.
        2. Compute the eigendecomposition $\tilde{K} = U \Lambda U^\intercal$.
        3. Set $Y^\intercal$ to the first $d$ rows of $\Lambda^{1/2} U^\intercal$.
      The projection of the transformed data onto the $j$-th component is computed with $\sum_{i=1}^n \alpha_i^{(j)} k(x, x_i)$, with $\alpha^{(j)}$ computed as $u_j / \sqrt{\lambda_j}$.
      No computation in the large-dimensional Hilbert space!
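A NumPy sketch of this recipe, assuming a Gaussian (RBF) kernel; the kernel choice and bandwidth are illustrative, not from the slides.

```python
# Kernel PCA on the training data: center the kernel matrix, eigendecompose, scale.
import numpy as np

def kernel_pca(X: np.ndarray, d: int, gamma: float = 1.0) -> np.ndarray:
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                      # kernel matrix K_ij = k(x_i, x_j)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # centered kernel matrix
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:d]            # d largest eigenpairs
    return evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0, None))   # Y = U_d Lambda_d^{1/2}
```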

  11. ISOMETRIC FEATURE MAPPING
      Can be viewed as an extension of MDS.
      Assumes that the data lies in a low-dimensional manifold (looks Euclidean in small neighborhoods).
      Given a dataset $\{x_i\}_{i=1}^n$, try to compute an estimate $\hat{\delta}_{ij}$ of the geodesic distance along the manifold.
      (Figure: Swiss roll manifold.)
      How do we estimate the geodesic distance?

  12. ESTIMATING GEODESIC DISTANCE
      Compute shortest paths using a proximity graph.
      Form a matrix $W$ as follows:
        1. For every $x_i$, define a local neighborhood $\mathcal{N}_i$ (e.g., the $k$ nearest neighbors, or all $x_j$ s.t. $\|x_i - x_j\| \leq \epsilon$).
        2. For each $i, j$, set $W_{ij} = \|x_i - x_j\|$ if $x_j \in \mathcal{N}_i$ and $W_{ij} = \infty$ (no edge) otherwise.
      $W$ is a weighted adjacency matrix of the proximity graph.
      Compute $\hat{\Delta}$ by setting $\hat{\delta}_{ij}$ to the length of the shortest path from node $i$ to node $j$ in the graph described by $W$.
      Can then compute the embedding from $\hat{\Delta}$ similarly to MDS.
      Challenge: Isomap can become inaccurate for points far apart.
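A sketch of this procedure (parameter choices are mine), reusing the classical_mds helper sketched after slide 3 for the final embedding step.

```python
# Isomap sketch: k-NN proximity graph -> shortest-path geodesic estimates -> classical MDS.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X: np.ndarray, d: int = 2, k: int = 10) -> np.ndarray:
    # Weighted k-NN adjacency matrix W (edge weights = Euclidean distances)
    W = kneighbors_graph(X, n_neighbors=k, mode="distance")
    # Geodesic distance estimate: all-pairs shortest paths on the graph
    Delta_hat = shortest_path(W, method="D", directed=False)
    if np.isinf(Delta_hat).any():
        raise ValueError("proximity graph is disconnected; increase k")
    return classical_mds(Delta_hat, d)           # embed as in classical MDS
```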

  13. (figure slide)

  14. LOCALLY LINEAR EMBEDDING (LLE)
      Idea: a data manifold that is globally nonlinear still appears linear in local pieces.
      Don't try to explicitly model global geodesic distances.
      Try to preserve structure in the data by patching together local pieces of the manifold.
      LLE algorithm for a dataset $\{x_i\}_{i=1}^n$:
        1. For each $x_i$, define a local neighborhood $\mathcal{N}_i$.
        2. Solve $\widehat{W} = \arg\min_W \sum_i \|x_i - \sum_{j \in \mathcal{N}_i} W_{ij} x_j\|_2^2$ subject to $\sum_j W_{ij} = 1$.
        3. Fix $\widehat{W}$ and solve $\min_Y \sum_i \|y_i - \sum_j \widehat{W}_{ij} y_j\|_2^2$ subject to $\sum_i y_i = 0$ and $\frac{1}{n} Y^\intercal Y = I$.
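A minimal sketch (not from the slides) that runs scikit-learn's implementation of this two-step procedure on a Swiss-roll dataset; parameter values are illustrative.

```python
# LLE on a Swiss roll via scikit-learn.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)                      # low-dimensional embedding y_1, ..., y_n
print(Y.shape, lle.reconstruction_error_)
```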

  15. (figure slide)

  16. LOCALLY LINEAR EMBEDDING (LLE)
      Step 3 is an eigenvalue problem in compact form: minimize $\mathrm{tr}\!\left(Y^\intercal (I - \widehat{W})^\intercal (I - \widehat{W}) Y\right)$ subject to the constraints above.
      Same problem encountered in PCA! Use the eigendecomposition of $M \triangleq (I - \widehat{W})^\intercal (I - \widehat{W})$ and keep the eigenvectors associated with the smallest nonzero eigenvalues.
      Can compute the embedding of a new point $x_0$ as $y_0 = \sum_j w_j y_j$, with the weights $w$ computed from the same constrained least-squares problem as in step 2.
      Demo notebook.

  17. KERNEL DENSITY ESTIMATION
      Density estimation problem: given samples $\{x_i\}_{i=1}^n$ drawn from an unknown density $f$, estimate $f$.
      (Figure: density estimation problem.)
      Applications: classification, clustering, anomaly detection, etc.

  18. KERNEL DENSITY ESTIMATION
      The general form of the kernel density estimate is $\hat{f}(x) \triangleq \frac{1}{n} \sum_{i=1}^n \frac{1}{h^d} K\!\left(\frac{x - x_i}{h}\right)$.
      $K$ is called a kernel; $h$ is the bandwidth.
      The estimate is non-parametric, also known as the Parzen window method.
      Looks like a ridge-regression kernel estimate but with equal weights, and $K$ need not be an inner-product kernel.
      A kernel should satisfy $K(u) \geq 0$ and $\int K(u)\,\mathrm{d}u = 1$.
      Plenty of standard kernels: rectangular, Gaussian, etc.
      Demo: kernel density estimation.
      How do we choose $h$?
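A one-dimensional sketch of this estimator with a Gaussian kernel; the data and bandwidth are illustrative assumptions.

```python
# 1-D kernel density estimate with a Gaussian kernel.
import numpy as np

def kde_gauss(x_grid: np.ndarray, samples: np.ndarray, h: float) -> np.ndarray:
    """Evaluate f_hat(x) = (1/n) sum_i (1/h) K((x - x_i)/h) on a grid (d = 1)."""
    u = (x_grid[:, None] - samples[None, :]) / h        # (grid, n) scaled differences
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)      # Gaussian kernel
    return K.mean(axis=1) / h

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])
x_grid = np.linspace(-5, 5, 200)
f_hat = kde_gauss(x_grid, samples, h=0.3)
print(f_hat.sum() * (x_grid[1] - x_grid[0]))   # integrates to roughly 1
```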

  19. KERNEL DENSITY ESTIMATION
      Theorem. Let $\hat{f}_n$ be a kernel density estimate based on kernel $K$. Suppose we scale the bandwidth $h_n$ with $n$ so that $h_n \to 0$ and $n h_n^d \to \infty$. Then $\hat{f}_n$ is consistent: $\int |\hat{f}_n(x) - f(x)|\,\mathrm{d}x \to 0$.
      Seems like a very powerful result: kernel density estimation always works given enough points.
      In practice, choose $h \approx 1.06\,\hat{\sigma}\,n^{-1/5}$ in one dimension (Silverman's rule of thumb).
      Can also use model-selection techniques (split the dataset into training and testing).
      Ugly truth: kernel density estimation only works well with a lot of points in low dimension.
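A sketch of the one-dimensional Silverman rule quoted above; the helper name is mine and the data are illustrative.

```python
# Silverman's rule of thumb (1-D) for picking the KDE bandwidth.
import numpy as np

def silverman_bandwidth(samples: np.ndarray) -> float:
    """h = 1.06 * sigma_hat * n^(-1/5) for a 1-D Gaussian-kernel KDE."""
    n = samples.size
    return 1.06 * samples.std(ddof=1) * n ** (-1 / 5)

rng = np.random.default_rng(0)
samples = rng.normal(size=2000)
h = silverman_bandwidth(samples)
print(round(h, 3))   # data-driven bandwidth; plug into kde_gauss(x_grid, samples, h)
```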

  20. CLUSTERING
      Clustering problem: given samples $\{x_i\}_{i=1}^n$, assign points to disjoint subsets called clusters, so that points in the same cluster are more similar to each other than to points in different clusters.
      A clustering is a map $C: \{1, \dots, n\} \to \{1, \dots, k\}$, with $k$ the number of clusters; how do we choose $k$?
      Definition (within-cluster scatter). $W(C) \triangleq \frac{1}{2} \sum_{\ell=1}^{k} \sum_{i: C(i)=\ell} \sum_{j: C(j)=\ell} \|x_i - x_j\|_2^2.$
      $k$-means clustering: find $C^*$ minimizing $W(C)$.
      Lemma. The number of possible clusterings is given by Stirling's numbers of the second kind, $S(n, k)$.
      There is no known efficient search strategy for this space: an exact solution by complete enumeration has intractable complexity.
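A small helper (my own illustration, not from the slides) that evaluates the within-cluster scatter W(C) for a given assignment; the data and labels are illustrative.

```python
# Evaluate the within-cluster scatter W(C) for a given clustering.
import numpy as np

def within_cluster_scatter(X: np.ndarray, labels: np.ndarray) -> float:
    """W(C) = 1/2 sum_l sum_{C(i)=l} sum_{C(j)=l} ||x_i - x_j||^2."""
    total = 0.0
    for l in np.unique(labels):
        Xl = X[labels == l]
        diffs = Xl[:, None, :] - Xl[None, :, :]
        total += 0.5 * np.sum(diffs ** 2)
    return total

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
print(within_cluster_scatter(X, labels))
```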

  21. SUB-OPTIMAL K-MEANS CLUSTERING
      We want to find $C^* = \arg\min_C W(C)$.
      Lemma. $W(C) = \sum_{\ell=1}^{k} n_\ell \sum_{i: C(i)=\ell} \|x_i - \bar{x}_\ell\|_2^2$, where $n_\ell \triangleq |\{i: C(i)=\ell\}|$ and $\bar{x}_\ell \triangleq \frac{1}{n_\ell} \sum_{i: C(i)=\ell} x_i$.
      Lemma. For a fixed clustering $C$ we have $\bar{x}_\ell = \arg\min_{m} \sum_{i: C(i)=\ell} \|x_i - m\|_2^2$.
      Solve the enlarged optimization problem $\min_{C,\, \{m_\ell\}} \sum_{\ell=1}^{k} n_\ell \sum_{i: C(i)=\ell} \|x_i - m_\ell\|_2^2$.

  22. ALTERNATING OPTIMIZATION PROCEDURE
      To find a $k$-means solution, alternate the following two steps:
        1. Given $C$, choose $\{m_\ell\}$ to minimize the objective.
        2. Given $\{m_\ell\}$, choose $C$ to minimize the objective.
      The solution to subproblem 1 is $m_\ell = \bar{x}_\ell$, the mean of the current cluster $\ell$.
      The solution to subproblem 2 is $C(i) = \arg\min_{\ell} \|x_i - m_\ell\|_2^2$, i.e., assign each point to its closest mean.
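A sketch of this alternating procedure (Lloyd's algorithm); the initialization and stopping rule are my own illustrative choices.

```python
# k-means via alternating optimization (Lloyd's algorithm).
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # random data points as initial means
    for _ in range(n_iter):
        # Subproblem 2: assign each point to its closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Subproblem 1: recompute each mean as the average of its cluster
        new_means = np.array([X[labels == l].mean(axis=0) if np.any(labels == l)
                              else means[l] for l in range(k)])
        if np.allclose(new_means, means):                  # stop when the means no longer move
            break
        means = new_means
    return labels, means
```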

  23. K-MEANS REMARKS
      Algorithmic notes:
        The algorithm is typically initialized with the means chosen as random points in the dataset.
        Use several random initializations to avoid bad local minima.
      Cluster boundaries are parts of hyperplanes.
        Regions are intersections of half-spaces, hence convex.
        $k$-means fails if clusters are non-convex.
        The geometry changes if we change the norm.

  24. GAUSSIAN MIXTURE MODELS
      Extend the idea behind $k$-means clustering to allow for more general cluster shapes: clusters are elliptical.
      Cluster $\ell$ can be modeled using a multivariate Gaussian density $\mathcal{N}(x; \mu_\ell, \Sigma_\ell)$ with mean $\mu_\ell \in \mathbb{R}^d$ and covariance $\Sigma_\ell$.
      The full data set is modeled using a Gaussian mixture model (GMM) $f_\theta(x) = \sum_{\ell=1}^{k} \pi_\ell\, \mathcal{N}(x; \mu_\ell, \Sigma_\ell)$, where $\pi_\ell \geq 0$, $\sum_\ell \pi_\ell = 1$, and $\theta \triangleq \{\pi_\ell, \mu_\ell, \Sigma_\ell\}_{\ell=1}^k$.
      Cluster estimation is done by performing MLE on the GMM.
      Interpretation of the GMM: introduce a state variable $Z$ such that $\mathbb{P}(Z = \ell) = \pi_\ell$ and $X \mid Z = \ell \sim \mathcal{N}(\mu_\ell, \Sigma_\ell)$; every realization $x_i$ comes with a hidden realization $z_i$ of the state variable.
      The challenge is to perform clustering without observing the hidden states.
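A sketch (my own illustration) of the hidden-state interpretation: draw a hidden state z_i with probabilities pi, then draw x_i from the corresponding Gaussian; the mixture parameters are illustrative.

```python
# Sample from a GMM through its latent state variable.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                         # mixture weights
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # component means
covs = np.array([np.eye(2), 0.5 * np.eye(2), [[1.0, 0.8], [0.8, 1.0]]])

n = 1000
z = rng.choice(len(pi), size=n, p=pi)                  # hidden states (unobserved in practice)
X = np.stack([rng.multivariate_normal(mus[l], covs[l]) for l in z])
print(X.shape, np.bincount(z) / n)                     # empirical weights approximate pi
```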

  25. MAXIMUM LIKELIHOOD ESTIMATION
      Example (easy): MLE of a single multivariate Gaussian.
      Example (hard): MLE of a mixture of multivariate Gaussians with incomplete data.
      Example (ideal): MLE of a mixture of multivariate Gaussians with complete data.

  26. EXPECTATION-MAXIMIZATION (EM) ALGORITHM
      An efficient algorithm to address incomplete data.
      Key idea: work with the MLE for complete data and average out the unobserved hidden states.
      EM Algorithm:
        1. Initialize $\theta^{(0)}$.
        2. For $t = 1, 2, \dots$:
           E step: evaluate $Q(\theta; \theta^{(t-1)}) \triangleq \mathbb{E}\!\left[\log p_\theta(X, Z) \mid X, \theta^{(t-1)}\right]$, where the expectation is over the hidden states $Z$ given the observations and the current parameters.
           Maximization step: set $\theta^{(t)} = \arg\max_\theta Q(\theta; \theta^{(t-1)})$.
      Lemma. The algorithm gets monotonically better: $\log p_{\theta^{(t)}}(X) \geq \log p_{\theta^{(t-1)}}(X)$.
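A sketch of these EM iterations specialized to a GMM (my own implementation, not the course demo); the initialization, covariance regularization, and fixed iteration count are illustrative choices.

```python
# EM for a Gaussian mixture model.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) for _ in range(k)])
    for _ in range(n_iter):
        # E step: responsibilities r_il = P(Z_i = l | x_i, theta)
        dens = np.stack([pi[l] * multivariate_normal.pdf(X, mus[l], covs[l])
                         for l in range(k)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means, and covariances from the responsibilities
        Nl = r.sum(axis=0)
        pi = Nl / n
        mus = (r.T @ X) / Nl[:, None]
        covs = np.array([((r[:, l, None] * (X - mus[l])).T @ (X - mus[l])) / Nl[l]
                         + 1e-6 * np.eye(d) for l in range(k)])
    return pi, mus, covs
```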
