Department of Computer Science
CSCI 5622: Machine Learning
Chenhao Tan
Lecture 18: Clustering
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen
Learning objectives
• Learn about general clustering
• Learn about the K-Means algorithm
• Learn about Gaussian Mixture Models
Supervised learning
• Data: X
• Labels: Y
Unsupervised learning
• Data: X
• Latent structure: Z
Clustering
• One important unsupervised method is clustering
• Goal: Organize data into classes
Clustering applications – Microarray Gene Expression data
From: "Skin layer-specific transcriptional profiles in normal and recessive yellow (Mc1re/Mc1re) mice" by April and Barsh in Pigment Cell Research (2006)
Clustering applications – Medical Imaging
Clustering applications – Community detection
News Media
Clustering
• One important unsupervised method is clustering
• Goal: Organize data into classes
• Classes are hard to define
• Different data representations may lead to different clusterings
Clustering
• One important unsupervised method is clustering
• Goal: Organize data into classes
• Data have high in-class similarity
• Data have low out-of-class similarity
Clustering - Similarity
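To make "high in-class similarity, low out-of-class similarity" concrete, here is a minimal sketch. It assumes Euclidean distance as the dissimilarity measure (the slides do not fix one) and uses a tiny made-up dataset with hand-assigned classes; the helper name `mean_pairwise_distance` is hypothetical.

```python
import math
from itertools import combinations

def mean_pairwise_distance(pairs):
    """Average Euclidean distance over a list of point pairs."""
    dists = [math.dist(p, q) for p, q in pairs]
    return sum(dists) / len(dists)

# Toy 2-D data, invented for illustration: two hand-labelled classes.
class_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
class_b = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

# In-class: pairs drawn from the same class.
within = mean_pairwise_distance(
    list(combinations(class_a, 2)) + list(combinations(class_b, 2))
)
# Out-of-class: pairs with one point from each class.
between = mean_pairwise_distance([(p, q) for p in class_a for q in class_b])

print(f"mean in-class distance:     {within:.2f}")
print(f"mean out-of-class distance: {between:.2f}")
```

A good clustering under this measure has a small in-class distance relative to the out-of-class distance. A different distance or data representation may rank the same grouping differently.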
K-Means
• Simplest clustering method
• Iterative in nature
• Reasonably fast
• Very popular in practice (though with more bells and whistles)
• Requires real-valued data
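The iterative procedure can be sketched as follows. This is a minimal version of the standard (Lloyd-style) formulation, not necessarily the exact variant shown in the lecture figures. It assumes Euclidean distance and initialization by sampling k data points; the function name and parameters are illustrative.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd-style K-Means on a list of equal-length numeric tuples.

    Returns (centroids, assignments, loss), where loss is the sum of
    squared distances from each point to its assigned centroid.
    """
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct data points as centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids are stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    loss = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))
    return centroids, assignments, loss

# Toy data, invented for illustration: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments, loss = kmeans(points, k=2)
print(assignments, round(loss, 3))
```

Taking the mean in the update step is why the method needs real-valued data: a mean is not defined for categorical attributes.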
More K-Means
• Animations: http://shabal.in/visuals/kmeans/4.html
K-Means in numbers
K-Means
• Weaknesses
  • Doesn't really work with categorical data
  • Usually only converges to a local minimum
  • Have to determine the number of clusters
  • Can be sensitive to outliers
  • Only generates convex clusters
K-Means - Weaknesses
• Doesn't really work with categorical data
  • Fix: Use K-Modes instead
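A minimal sketch of the two ingredients that K-Modes swaps in, assuming its usual formulation: a matching dissimilarity (count of differing attributes) in place of Euclidean distance, and the per-attribute mode in place of the mean. The function names and the toy records are hypothetical, and the full clustering loop is omitted.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Per-attribute most frequent value; plays the role of the mean."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Toy categorical records (colour, size, shape), invented for illustration.
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = cluster_mode(cluster)
print(mode)  # ('red', 'small', 'round')
print([matching_dissimilarity(r, mode) for r in cluster])  # [0, 1, 1]
```

With these two pieces, the assign-then-update loop has the same shape as in K-Means.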
K-Means - Weaknesses
• Usually only converges to a local minimum
  • Fix: Do several runs with random initializations and choose the one with the lowest loss
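The restart fix can be sketched as below. The block carries its own compact K-Means so that it runs on its own. It assumes Euclidean distance, initialization from k random data points, and seeds 0 to 9 as the "random" runs; all names are illustrative.

```python
import math
import random

def kmeans(points, k, seed, max_iters=100):
    """One K-Means run from a random init; returns (centroids, loss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k random data points
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    loss = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, loss

def best_of_restarts(points, k, n_runs=10):
    """Run K-Means with seeds 0..n_runs-1 and keep the lowest-loss run."""
    runs = [kmeans(points, k, seed) for seed in range(n_runs)]
    return min(runs, key=lambda run: run[1])

# Corners of a wide rectangle: splitting left/right gives loss 1.0, while
# splitting top/bottom is a stable but much worse solution (loss 16.0).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
centroids, loss = best_of_restarts(points, k=2)
print(sorted(centroids), loss)
```

A run whose two initial centroids fall on the same short side of the rectangle stays at the top/bottom split, which is why a single run is not enough. Keeping the lowest-loss run guards against this, though it still does not guarantee the global minimum in general.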
K-Means - Weaknesses
• Have to determine the number of clusters
  • Fix: Use the elbow method
  • Run K-Means for different values of k and look at the loss function; pick the k after which the loss stops dropping sharply
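A minimal sketch of the elbow procedure, assuming the loss is the sum of squared Euclidean distances to the nearest centroid and that each k is run from several random initializations with the best loss kept. The compact K-Means is included so the block runs on its own. The data and all names are invented for illustration.

```python
import math
import random

def kmeans_loss(points, k, seed, max_iters=100):
    """Run K-Means once from a random init and return the final loss
    (sum of squared distances to the nearest centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

# Toy data: three tight groups of four points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0),
    (5.0, 10.0), (5.0, 11.0), (6.0, 10.0), (6.0, 11.0),
]

# Best loss over 10 seeds for each candidate k.
losses = {
    k: min(kmeans_loss(points, k, seed) for seed in range(10))
    for k in range(1, 7)
}
for k, loss in losses.items():
    print(k, round(loss, 2))
```

Plotting these losses against k and looking for the bend gives the elbow. For data shaped like this toy set, one would expect the steep drop to end around k = 3, with only small gains after that.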
Gaussian Mixture Models
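The generative story of a Gaussian Mixture Model can be sketched as a sampler. This assumes the standard two-step formulation: first draw a latent component z according to the mixing weights, then draw the observation x from that component's Gaussian. The 1-D setting, the parameter values, and the function name are illustrative, not taken from the slides.

```python
import random

def sample_gmm(weights, means, stds, n, seed=0):
    """Draw n (z, x) pairs from a 1-D Gaussian mixture.

    z is the latent component index, x the observed value.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: choose a component according to the mixing weights.
        z = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: draw the observation from that component's Gaussian.
        x = rng.gauss(means[z], stds[z])
        samples.append((z, x))
    return samples

# Illustrative parameters (not from the slides): two components.
weights, means, stds = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]
samples = sample_gmm(weights, means, stds, n=1000)
share_of_first = sum(1 for z, _ in samples if z == 0) / len(samples)
print(f"fraction drawn from component 0: {share_of_first:.2f}")
```

In clustering, only the x values are observed; the component index z is the latent structure to be recovered.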
Recap
• K-Means is the most commonly used clustering algorithm
• We learned the Gaussian Mixture Model's generative story
• We will learn the EM algorithm next week