Hierarchical Clustering 36-350: Data Mining 25 September 2006
Last time... • Unsupervised learning problems; finding clusters • k-means • divide into k clusters to minimize within-cluster variance*cluster size • local search, local minima
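The quantity k-means minimizes can be written out explicitly. The notation here is an assumption, not from the slides: C ranges over the k clusters, n_C is the size of cluster C, m_C is its mean, and "variance" is read as the mean squared distance from the cluster mean.

```latex
\sum_{C} n_C \,\mathrm{Var}(C)
  = \sum_{C} n_C \cdot \frac{1}{n_C} \sum_{i \in C} \lVert x_i - m_C \rVert^2
  = \sum_{C} \sum_{i \in C} \lVert x_i - m_C \rVert^2
```

Under that reading, "variance*cluster size" summed over clusters is the within-cluster sum of squares.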
Limits of k-Means • Local search can get stuck • Random starts help • Sum-of-squares likes ball-shaped clusters • How to pick k? • No relations between clusters
Hierarchical Clustering • Basic idea: cluster the clusters • High-level clusters contain multiple low-level clusters • Clusters are now related • Don’t need to choose k • Assumes a hierarchy makes sense...
Ward’s Method 1. Start with every point in its own cluster 2. For each pair of clusters, calculate “merging cost” = increase in sum of squares 3. Merge least-costly pair 4. Stop when merging cost takes a big jump
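The four steps above can be sketched in plain Python. This is a minimal, deliberately naive illustration, not the lecture's implementation: the function names (`centroid`, `sum_of_squares`, `merging_cost`, `ward`) are hypothetical, points are assumed to be tuples of floats, and instead of stopping automatically at a "big jump" the sketch records every merging cost so the jump can be inspected afterwards.

```python
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def merging_cost(a, b):
    """Increase in total sum of squares caused by merging clusters a and b."""
    return sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)

def ward(points):
    """Merge down to one cluster; return (cost, clusters) after each merge."""
    clusters = [[p] for p in points]          # step 1: every point alone
    history = []
    while len(clusters) > 1:
        # step 2: merging cost for every pair; step 3: pick the cheapest pair
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merging_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        cost = merging_cost(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append((cost, [list(c) for c in clusters]))
    return history

# Two tight pairs far apart: the last merge is where the cost jumps.
history = ward([(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)])
costs = [cost for cost, _ in history]
```

For step 4, one would look through `costs` for the first merge whose cost is much larger than the ones before it and keep the clustering from just before that merge. What counts as "much larger" is a judgment call that the sketch leaves open.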
Ward’s method applied to the images from lecture 3: ocean, tigers, flowers • Jump in merging cost suggests 3 clusters - almost exactly right ones, too (but thinks flower5 is a tiger) [Figure: dendrogram of the ocean, tiger, and flower images; plot of merging cost against number of clusters]
• Don’t have to choose k • Sum of squares is worse, generally, than k-means (for equal k) • more constrained search • prefers to merge small clusters, all else equal
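The preference for merging small clusters can be read off a standard closed form for the increase in sum of squares. The notation is assumed here, not taken from the slides: A and B are the two clusters, n_A and n_B their sizes, m_A and m_B their means.

```latex
\Delta(A, B) = \frac{n_A \, n_B}{n_A + n_B} \, \lVert m_A - m_B \rVert^2
```

For a fixed distance between the two centers, the factor n_A n_B / (n_A + n_B) grows with the cluster sizes, so a pair of small clusters is cheaper to merge than a pair of large ones.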
Minimizing the mean distance from the center tends to make spheres, which can be silly [Figure: k-means and Ward’s clusterings side by side; note how Ward’s is less balanced]
Single-link clustering 1. Start with every point in its own cluster 2. Calculate gaps between every pair of clusters = distance between the two closest points, one from each cluster 3. Merge clusters with smallest gap
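A minimal sketch of these steps in plain Python follows. The slide gives no stopping rule, so stopping at a requested number of clusters `k` is an assumption made here for illustration; the names `gap` and `single_link` are hypothetical, and Euclidean distance is assumed.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def gap(a, b):
    """Distance between the two closest points, one from each cluster."""
    return min(dist(p, q) for p in a for q in b)

def single_link(points, k):
    """Repeatedly merge the pair of clusters with the smallest gap.

    Stopping once k clusters remain is an assumption of this sketch.
    """
    clusters = [[p] for p in points]          # step 1: every point alone
    while len(clusters) > k:
        # step 2: gap for every pair; step 3: merge the closest pair
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

# A chain of evenly spaced points stays together, however long it gets,
# because each link in the chain is shorter than the gap to the far pair.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
far_pair = [(10.0, 0.0), (11.0, 0.0)]
result = single_link(chain + far_pair, k=2)
```

Because only the closest pair of points matters, the method follows elongated, non-spherical shapes that a sum-of-squares criterion would cut apart.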
[Figure: k-means, Ward’s, and single-link clusterings of the same data]
Examples where single-link doesn’t work so well [Figure: k-means, Ward’s, and single-link clusterings of the same data]