

  1. Introduction to Machine Learning
     Unsupervised learning: latent space analysis and clustering
     Yifeng Tao, School of Computer Science, Carnegie Mellon University
     Slides adapted from Tom Mitchell, David Sontag, Ziv Bar-Joseph

  2. Outline
     o Dimension reduction / latent space analysis
       o PCA
       o ICA
       o t-SNE
     o Clustering
       o K-means
       o GMM
       o Hierarchical/agglomerative clustering

  3. Unsupervised mapping to lower dimension
     o Instead of choosing a subset of features, create new features (dimensions) defined as functions over all features
     o Don't consider class labels, just the data points

  4. Principal Components Analysis
     o Given data points in d-dimensional space, project into a lower-dimensional space while preserving as much information as possible
       o E.g., find the best planar approximation to 3-D data
       o E.g., find the best planar approximation to 10^4-dimensional data
     o In particular, choose the projection that minimizes the squared error in reconstructing the original data
     [Slide from Tom Mitchell]

  5. PCA: Find Projections to Minimize Reconstruction Error
     o Assume the data is a set of N d-dimensional vectors, where the n-th vector is x_n
     o We can represent these exactly in terms of any d orthogonal unit basis vectors u_1, ..., u_d: x_n = sum_{i=1}^{d} z_ni u_i, where z_ni = x_n · u_i
     [Slide from Tom Mitchell]

  6. PCA
     o Keeping only the first M components gives the reconstruction x̂_n = sum_{i=1}^{M} z_ni u_i, with total reconstruction error sum_n ||x_n - x̂_n||^2
     o Note we get zero error if M = d, so all error is due to the missing components
     o The error is minimized by choosing u_1, ..., u_M to be the top M eigenvectors of the sample covariance matrix (the principal components)
     [Slide from Tom Mitchell]
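(Not from the slides.) A minimal numpy sketch of this reconstruction-error view of PCA: center the data, take the top-M eigenvectors of the covariance matrix as the basis, project and reconstruct, and check that the error shrinks as M grows and is essentially zero when M = d. The synthetic data here is made up for illustration.

```python
import numpy as np

def pca_reconstruction_error(X, M):
    """Project X onto its top-M principal components and return the
    reconstruction plus the total squared reconstruction error."""
    mean = X.mean(axis=0)
    Xc = X - mean                                # center the data
    cov = np.cov(Xc, rowvar=False)               # d x d sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :M]                  # top-M eigenvectors (d x M)
    Z = Xc @ U                                   # coordinates z_ni in the new basis (N x M)
    X_hat = Z @ U.T + mean                       # map back to the original space
    error = np.sum((X - X_hat) ** 2)             # total squared reconstruction error
    return X_hat, error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])   # mostly planar data in 3-D
for M in (1, 2, 3):
    _, err = pca_reconstruction_error(X, M)
    print(f"M={M}: reconstruction error {err:.4f}")         # ~0 when M = d = 3
```

In practice, sklearn.decomposition.PCA computes the same projection via an SVD of the centered data matrix.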

  7. PCA
     o A more rigorous derivation is given in Bishop's textbook
     [Slide from Tom Mitchell]

  8.-11. PCA Example (figure slides)
     [Slides from Tom Mitchell]

  12. Independent Components Analysis
     o PCA seeks directions <Y_1 ... Y_M> in feature space X that minimize reconstruction error
     o ICA seeks directions <Y_1 ... Y_M> that are most statistically independent, i.e., that minimize I(Y), the mutual information between the Y_j: I(Y) = sum_j H(Y_j) - H(Y), where H(Y) is the entropy of Y
     o Widely used in signal processing
     [Slide from Tom Mitchell]

  13. ICA example
     o Both PCA and ICA try to find a set of vectors, a basis, for the data. So you can write any point (vector) in your data as a linear combination of the basis.
     o In PCA the basis you want to find is the one that best explains the variability of your data.
     o In ICA the basis you want to find is the one in which each vector is an independent component of your data.
     [Slide from https://www.quora.com/What-is-the-difference-between-PCA-and-ICA]
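(Not from the slides.) A small sketch of the classic blind-source-separation use of ICA with scikit-learn's FastICA; the sources and mixing matrix below are made up for illustration. Two independent signals are mixed linearly, and ICA recovers directions that are statistically independent, while PCA only decorrelates them.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                                   # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                          # source 2: square wave
S = np.c_[s1, s2] + 0.05 * rng.normal(size=(2000, 2))  # independent sources + noise

A = np.array([[1.0, 0.5],                            # hypothetical mixing matrix
              [0.5, 1.0]])
X = S @ A.T                                          # observed mixed signals

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)  # unmixed sources
S_pca = PCA(n_components=2).fit_transform(X)                      # decorrelated, not unmixed
```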

  14. t-Distributed Stochastic Neighbor Embedding (t-SNE)
     o Nonlinear dimensionality reduction technique
     o Manifold learning
     [Figure from https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py]

  15. t-SNE
     o Two stages:
       o First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked while dissimilar points have an extremely small probability of being picked.
       o Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback-Leibler divergence between the two distributions with respect to the locations of the points in the map.
     o Minimized using gradient descent
     [Slide from https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding]
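(Not from the slides.) A minimal usage sketch with scikit-learn's TSNE on the bundled digits dataset, as a small stand-in for the MNIST figures on the next slide; the perplexity value is an assumption, and the embedding changes noticeably with it.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 images of 8x8 digits, 64-D features

# Embed into 2-D; the KL divergence between the high- and low-dimensional
# neighbor distributions is minimized internally by gradient descent.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)                            # (1797, 2): one 2-D point per image
```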

  16. t-SNE example
     o Visualizing MNIST
     [Figure from https://lvdmaaten.github.io/tsne/]

  17. Clustering
     o Unsupervised learning: requires data, but no labels
     o Detect patterns, e.g. in:
       o Grouping emails or search results
       o Customer shopping patterns
       o Regions of images
     o Useful when you don't know what you're looking for
     [Slide from David Sontag]

  18. Clustering
     o Basic idea: group together similar instances
     o Example: 2D point patterns
     [Slide from David Sontag]

  19. o The clustering result can be quite different depending on which grouping rule (notion of similarity) is used.
     [Slide from David Sontag]

  20. Distance measure
     o What could "similar" mean? One option: small (squared) Euclidean distance
     o Clustering results are crucially dependent on the measure of similarity (or distance) between the "points" to be clustered
     o What properties should a distance measure have?
       o Symmetry: D(A, B) = D(B, A)
         o Otherwise, we could say A looks like B but B does not look like A
       o Positivity and self-similarity: D(A, B) >= 0, and D(A, B) = 0 iff A = B
         o Otherwise there will be different objects that we cannot tell apart
       o Triangle inequality: D(A, B) + D(B, C) >= D(A, C)
         o Otherwise one could say "A is like B, B is like C, but A is not like C at all"
     [Slide from David Sontag]
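(Not from the slides.) A tiny numpy check of these properties for Euclidean distance on a few arbitrary points. As a side note, the squared version keeps symmetry and positivity but can violate the triangle inequality, which is one reason the choice of measure matters.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two points."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

A, B, C = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])

print(euclidean(A, B) == euclidean(B, A))                     # symmetry
print(euclidean(A, B) >= 0 and euclidean(A, A) == 0)          # positivity / self-similarity
print(euclidean(A, B) + euclidean(B, C) >= euclidean(A, C))   # triangle inequality: True

# Squared Euclidean distance fails the triangle inequality for these
# collinear points: 1^2 + 1^2 = 2 < 2^2 = 4.
print(euclidean(A, B) ** 2 + euclidean(B, C) ** 2 >= euclidean(A, C) ** 2)  # False
```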

  21. Clustering algorithms
     o Partition algorithms
       o K-means
       o Mixture of Gaussians
       o Spectral clustering (on graphs; not discussed in this lecture)
     o Hierarchical algorithms
       o Bottom-up: agglomerative (sketched below)
       o Top-down: divisive (not discussed in this lecture)
     [Slide from David Sontag]
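(Not from the slides.) Since agglomerative clustering is only named here, the following is just an illustrative scipy sketch on made-up blobs: every point starts in its own cluster, and the closest pair of clusters is merged repeatedly until the dendrogram is cut at the desired number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),        # three synthetic blobs
               rng.normal(3, 0.3, size=(20, 2)),
               rng.normal([0, 3], 0.3, size=(20, 2))])

Z = linkage(X, method="average")                  # bottom-up merges, average-linkage distance
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(np.bincount(labels))                        # roughly 20 points per cluster
```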

  22. Clustering examples
     o Image segmentation
     o Goal: break up the image into meaningful or perceptually similar regions
     [Slide from David Sontag]

  23. Clustering examples
     o Clustering gene expression data

  24.-25. K-Means
     o An iterative clustering algorithm
     o Initialize: pick K random points as cluster centers
     o Alternate:
       o Assign each data point to the closest cluster center
       o Change each cluster center to the average of its assigned points
     o Stop when no points' assignments change
     [Slides from David Sontag]
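(Not from the slides.) A plain numpy sketch of exactly this loop: initialize by sampling K data points, alternate assignment and mean updates, and stop when no assignment changes. It is for illustration only and ignores edge cases such as clusters that become empty.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means: returns (cluster centers, point assignments)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # init: K random data points
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (N, K)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):                # stop: no assignment changed
            break
        assign = new_assign
        # Update step: move each center to the mean of its assigned points.
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
centers, assign = kmeans(X, K=2)
print(centers)            # roughly [0, 0] and [4, 4]
```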

  26. K-means clustering: Example
     o Pick K random points as cluster centers (means)
     o Shown here for K = 2
     [Slide from David Sontag]

  27. K-means clustering: Example
     o Iterative Step 1: assign data points to the closest cluster center
     [Slide from David Sontag]

  28. K-means clustering: Example
     o Iterative Step 2: change the cluster center to the average of the assigned points
     [Slide from David Sontag]

  29. K-means clustering: Example
     o Repeat until convergence
     [Slide from David Sontag]

  30.-32. K-means clustering: Example (figure slides)
     [Slides from David Sontag]

  33. Properties of K-means algorithm
     o Guaranteed to converge in a finite number of iterations
     o Running time per iteration:
       o Assign data points to the closest cluster center: O(KN) time
       o Change each cluster center to the average of its assigned points: O(N) time
     [Slide from David Sontag]

  34. K-means convergence
     [Slide from David Sontag]

  35.-36. Example: K-Means for Segmentation (figure slides)
     [Slides from David Sontag]
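(Not from the slides.) A hedged sketch of the usual recipe behind such segmentation figures; the image path is a placeholder. Treat every pixel's RGB value as a 3-D point, cluster the colors with K-means, and repaint each pixel with its cluster center.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.jpg"), dtype=float) / 255.0   # placeholder path
h, w, _ = img.shape
pixels = img.reshape(-1, 3)                        # one 3-D color point per pixel

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
segmented = km.cluster_centers_[km.labels_]        # repaint each pixel with its center
segmented = segmented.reshape(h, w, 3)             # back to image shape

Image.fromarray((segmented * 255).astype(np.uint8)).save("photo_segmented.jpg")
```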

  37. Initialization
     o The K-means algorithm is a heuristic
       o It requires initial means, and it does matter what you pick!
       o What can go wrong?
     o Various schemes exist for preventing this kind of thing: variance-based split/merge, initialization heuristics
       o E.g., multiple random initializations, k-means++
     [Slide from David Sontag]
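(Not from the slides.) One way to apply both remedies with scikit-learn's KMeans on made-up blob data: init="k-means++" spreads the initial centers out, and n_init reruns the whole algorithm from several initializations and keeps the run with the lowest within-cluster sum of squares (inertia).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

# k-means++ seeding plus 10 random restarts; the best run (lowest inertia) is kept.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)        # within-cluster sum of squared distances of the kept run
```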

  38. K-Means Getting Stuck
     o A local optimum:
     [Slide from David Sontag]

  39. K-means not able to properly cluster
     o Spectral clustering will help in this case.
     [Slide from David Sontag]

  40. Changing the features (distance function) can help
     [Slide from David Sontag]

  41. Reconsidering "hard assignments"?
     o Clusters may overlap
     o Some clusters may be "wider" than others
     o Distances can be deceiving
     [Slide from Ziv Bar-Joseph]

  42. Gaussian Mixture Models
     [Slide from Ziv Bar-Joseph]
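(Not from the slides.) To make the soft-assignment idea concrete, a brief scikit-learn sketch on synthetic data, not the slide's example: a GMM models the density as a weighted sum of Gaussians, p(x) = sum_k pi_k N(x | mu_k, Sigma_k), is fit with EM, and predict_proba returns the probability of each cluster for every point instead of a single hard label.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (200, 2)),      # a wide cluster
               rng.normal(3, 0.3, (200, 2))])     # a narrower, partly overlapping cluster

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

resp = gmm.predict_proba(X[:3])   # soft assignments: P(cluster k | point), rows sum to 1
hard = gmm.predict(X[:3])         # hard labels, for comparison with K-means
print(resp, hard)
```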
