Introduction to Machine Learning
Unsupervised learning: latent space analysis and clustering
Yifeng Tao, School of Computer Science, Carnegie Mellon University
Slides adapted from Tom Mitchell, David Sontag, Ziv Bar-Joseph
Outline
o Dimension reduction / latent space analysis
  o PCA
  o ICA
  o t-SNE
o Clustering
  o K-means
  o GMM
  o Hierarchical / agglomerative clustering
Unsupervised mapping to lower dimension
o Instead of choosing a subset of the features, create new features (dimensions) defined as functions over all features
o Don't consider class labels, just the data points
Principal Component Analysis
o Given data points in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible
  o E.g., find the best planar approximation to 3-D data
  o E.g., find the best planar approximation to 10^4-dimensional data
o In particular, choose the projection that minimizes the squared error in reconstructing the original data
[Slide from Tom Mitchell]
PCA: Find Projections to Minimize Reconstruction Error
o Assume the data are a set of N d-dimensional vectors, where the n-th vector is x^(n) = <x_1^(n), ..., x_d^(n)>
o We can represent these exactly in terms of any d orthogonal basis vectors u_1, ..., u_d: x^(n) = Σ_{i=1}^{d} z_i^(n) u_i
[Slide from Tom Mitchell]
PCA
o If each vector is reconstructed from only the first M of its d components, note that we get zero error when M = d, so all of the reconstruction error is due to the discarded components.
[Slide from Tom Mitchell]
PCA
o A more rigorous derivation can be found in Bishop's Pattern Recognition and Machine Learning.
[Slide from Tom Mitchell]
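For reference, here is a compact statement of the result the derivation arrives at (following the standard treatment in Bishop's book; the notation u_i, z_i, λ_i is introduced here rather than taken from the slides, and the data are assumed to be centered):

```latex
% Keep only the first M of the d components:
\hat{\mathbf{x}}^{(n)} = \sum_{i=1}^{M} z_i^{(n)} \mathbf{u}_i,
\qquad z_i^{(n)} = \mathbf{u}_i^{\top} \mathbf{x}^{(n)}
% The mean squared reconstruction error
E_M = \frac{1}{N} \sum_{n=1}^{N} \big\lVert \mathbf{x}^{(n)} - \hat{\mathbf{x}}^{(n)} \big\rVert^2
% is minimized by taking u_1, ..., u_M to be the top-M eigenvectors of the sample
% covariance matrix, and the minimum equals the sum of the discarded eigenvalues:
E_M^{\min} = \sum_{i=M+1}^{d} \lambda_i
```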
PCA Example [figures from Tom Mitchell's slides, not reproduced here]
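To make the recipe concrete, here is a minimal NumPy sketch of PCA by eigendecomposition of the covariance matrix (an illustration only; the synthetic data and variable names are made up, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data that is "almost planar" (tiny variance along the third axis)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])

X_mean = X.mean(axis=0)
Xc = X - X_mean                                    # PCA works on centered data
cov = np.cov(Xc, rowvar=False, bias=True)          # d x d sample covariance
eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

M = 2                                              # best planar approximation
U = eigvecs[:, :M]                                 # top-M principal directions
Z = Xc @ U                                         # M-dimensional codes
X_hat = Z @ U.T + X_mean                           # reconstruction from M components

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, eigvals[M:].sum())                      # reconstruction error equals the sum of discarded eigenvalues
```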
Independent Component Analysis
o PCA seeks directions <Y_1 ... Y_M> in feature space X that minimize reconstruction error
o ICA seeks directions <Y_1 ... Y_M> that are most statistically independent, i.e., that minimize the mutual information I(Y) between the Y_j:
  I(Y) = Σ_j H(Y_j) − H(Y), where H(Y) is the entropy of Y
o Widely used in signal processing
[Slide from Tom Mitchell]
ICA example
o Both PCA and ICA try to find a set of vectors, a basis, for the data, so that any point (vector) in the data can be written as a linear combination of the basis vectors.
o In PCA the basis you want to find is the one that best explains the variability of your data.
o In ICA the basis you want to find is the one in which each vector is an independent component of your data.
[Slide from https://www.quora.com/What-is-the-difference-between-PCA-and-ICA]
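A rough sketch of that contrast on a classic blind source separation setup (the signals and mixing matrix below are invented for illustration; scikit-learn's FastICA and PCA serve as stand-ins for the general algorithms):

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                                  # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                         # source 2: square wave
S = np.c_[s1, s2] + 0.05 * rng.normal(size=(2000, 2))

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                          # mixing matrix
X = S @ A.T                                         # two observed mixtures of the two sources

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)  # recovers the sources (up to order/scale)
Y_pca = PCA(n_components=2).fit_transform(X)                      # orthogonal max-variance directions instead
```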
t-Distributed Stochastic Neighbor Embedding (t-SNE)
o Nonlinear dimensionality reduction technique
o Manifold learning
[Figure from https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py]
t-SNE
o Two stages:
  o First, t-SNE constructs a probability distribution over pairs of high-dimensional objects such that similar objects have a high probability of being picked, while dissimilar points have an extremely small probability of being picked.
  o Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback-Leibler divergence between the two distributions with respect to the locations of the points in the map.
o Minimized using gradient descent
[Slide from https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding]
t-SNE example
o Visualizing MNIST
[Figure from https://lvdmaaten.github.io/tsne/]
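A hedged sketch of a typical t-SNE visualization pipeline with scikit-learn (the digits dataset is used as a small stand-in for MNIST, and the perplexity value is just a common choice, not one prescribed by the slides):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                 # 1797 images, 64 features each
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(X)         # KL divergence minimized by gradient descent

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the scikit-learn digits data")
plt.show()
```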
Clustering
o Unsupervised learning
o Requires data, but no labels
o Detects patterns, e.g., in
  o Groups of emails or search results
  o Customer shopping patterns
  o Regions of images
o Useful when you don't know what you're looking for
[Slide from David Sontag]
Clustering
o Basic idea: group together similar instances
o Example: 2-D point patterns
[Slide from David Sontag]
o The clustering result can be quite different depending on the grouping rule used.
[Slide from David Sontag]
Distance measure
o What could "similar" mean?
  o One option: small (squared) Euclidean distance
o Clustering results are crucially dependent on the measure of similarity (or distance) between the "points" to be clustered
o What properties should a distance measure have?
  o Symmetry: D(A, B) = D(B, A)
    o Otherwise, we could say A looks like B but B does not look like A
  o Positivity and self-similarity: D(A, B) >= 0, and D(A, B) = 0 iff A = B
    o Otherwise there will be different objects that we cannot tell apart
  o Triangle inequality: D(A, B) + D(B, C) >= D(A, C)
    o Otherwise one could say "A is like B, B is like C, but A is not like C at all"
[Slide from David Sontag]
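A tiny sanity check of these properties for the Euclidean distance (illustrative values only; as a side note, the squared Euclidean distance satisfies symmetry and positivity but not, in general, the triangle inequality, while the unsquared distance satisfies all three):

```python
import numpy as np

def sq_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum((a - b) ** 2))

A, B, C = [0.0, 0.0], [1.0, 1.0], [2.0, 0.0]                     # arbitrary toy points
assert sq_euclidean(A, B) == sq_euclidean(B, A)                  # symmetry
assert sq_euclidean(A, A) == 0.0 and sq_euclidean(A, B) > 0.0    # self-similarity, positivity

dist = lambda p, q: np.sqrt(sq_euclidean(p, q))
assert dist(A, B) + dist(B, C) >= dist(A, C)                     # triangle inequality (unsquared distance)
```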
Clustering algorithms
o Partition algorithms
  o K-means
  o Mixture of Gaussians
  o Spectral clustering (on graphs; not discussed in this lecture)
o Hierarchical algorithms
  o Bottom up: agglomerative
  o Top down: divisive (not discussed in this lecture)
[Slide from David Sontag]
Clustering examples
o Image segmentation
o Goal: break up the image into meaningful or perceptually similar regions
[Slide from David Sontag]
Clustering examples
o Clustering gene expression data
K-Means
o An iterative clustering algorithm
  o Initialize: pick K random points as cluster centers
  o Alternate:
    o Assign data points to the closest cluster center
    o Change each cluster center to the average of its assigned points
  o Stop when no point's assignment changes
o (A minimal code sketch follows the worked example below.)
[Slide from David Sontag]
K-means clustering: Example
o Pick K random points as cluster centers (means); shown here for K = 2
o Iterative Step 1: assign data points to the closest cluster center
o Iterative Step 2: change each cluster center to the average of its assigned points
o Repeat until convergence
[Slides from David Sontag]
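The code sketch referenced above: a minimal NumPy implementation of the steps just illustrated (for exposition only; not optimized, and the toy data are made up):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: random init, assign to closest center, re-average, repeat."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()  # K random points as initial means
    assign = np.full(len(X), -1)
    for _ in range(n_iters):
        # Step 1: assign each point to its closest center (O(KN) distance evaluations)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                   # no assignment changed -> stop
        assign = new_assign
        # Step 2: move each center to the average of its assigned points (O(N))
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign

# Toy usage on two well-separated blobs (made-up data):
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=(0.0, 0.0), size=(100, 2)),
               rng.normal(loc=(6.0, 6.0), size=(100, 2))])
centers, labels = kmeans(X, K=2)
```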
Properties of K-means algorithm
o Guaranteed to converge in a finite number of iterations
o Running time per iteration:
  o Assigning data points to the closest cluster center: O(KN) time
  o Changing each cluster center to the average of its assigned points: O(N) time
[Slide from David Sontag]
K-means convergence [Slide from David Sontag]
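One standard way to see why it converges (this argument is not spelled out on the slide itself): both steps can only decrease the within-cluster sum of squares,

```latex
J(c_1, \dots, c_N, \boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K)
  = \sum_{n=1}^{N} \big\lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_{c_n} \big\rVert^2
% The assignment step minimizes J over the labels c_n with the means fixed;
% the update step minimizes J over the means \mu_k with the labels fixed.
% J is bounded below by zero and there are finitely many possible assignments,
% so the alternation must stop changing after finitely many iterations.
```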
Example: K-Means for Segmentation [Slides from David Sontag; figures not reproduced]
Initialization
o The K-means algorithm is a heuristic
  o It requires initial means
  o It does matter what you pick!
o What can go wrong? A bad initialization can leave the algorithm stuck in a poor local optimum.
o Various schemes exist for preventing this kind of thing: variance-based split / merge, initialization heuristics
  o E.g., multiple initializations, k-means++
[Slide from David Sontag]
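For instance, a sketch using scikit-learn's built-in k-means++ seeding together with multiple restarts (the dataset and parameter values below are synthetic and purely illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # synthetic data with 4 true clusters
km = KMeans(n_clusters=4, init="k-means++", n_init=10,        # smart seeding + 10 random restarts
            random_state=0).fit(X)
print(km.inertia_)                                            # within-cluster sum of squares of the best run
```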
K-Means Getting Stuck
o A local optimum (figure not reproduced)
[Slide from David Sontag]
K-means unable to cluster properly
o Spectral clustering will help in this case.
[Slide from David Sontag]
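A sketch of one classic instance of this failure mode, two concentric rings (the figure on the original slide is not reproduced here, so the ring-shaped data below are only a stand-in; spectral clustering itself is not covered in this lecture):

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)  # two concentric rings
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # cuts straight across the rings
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               random_state=0).fit_predict(X)               # recovers the two rings
```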
Changing the features (distance function) can help
[Slide from David Sontag]
Reconsidering "hard assignments"?
o Clusters may overlap
o Some clusters may be "wider" than others
o Distances can be deceiving
[Slide from Ziv Bar-Joseph]
Gaussian Mixture Models
o Model the data as a mixture of K Gaussians, so each point gets a soft (probabilistic) cluster assignment; the parameters are typically fit with the EM algorithm.
[Slide from Ziv Bar-Joseph]
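A minimal sketch of soft assignments with a Gaussian mixture fit by EM, using scikit-learn (the data, cluster widths, and overlap below are invented to echo the points on the previous slide):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(200, 2)),    # tight cluster
               rng.normal(loc=2.0, scale=1.5, size=(200, 2))])   # wider, overlapping cluster

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
resp = gmm.predict_proba(X)      # soft responsibilities: one probability per cluster per point
hard = gmm.predict(X)            # hard labels are still available if needed
print(resp[:3].round(2))
```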