
Clustering kMeans, Expectation Maximization, Self-Organizing Maps - PowerPoint PPT Presentation



  1. Clustering kMeans, Expectation Maximization, Self-Organizing Maps

  2. Outline • K-means clustering • Hierarchical clustering • Incremental clustering • Probability-based clustering • Self-Organizing Maps

  3. Classification vs. Clustering Classification: Supervised learning (labels given)

  4. Classification vs. Clustering. Clustering: Unsupervised learning (labels unknown); find a “natural” grouping of the instances

  5. Many Applications! • Basically, everywhere labels are unknown / uncertain / too expensive • Marketing: find groups of similar customers • Astronomy: find groups of similar stars, galaxies • Earthquake studies: cluster earthquake epicenters along continental faults • Genomics: find groups of genes with similar expression

  6. Clustering Methods: Terminology. Non-overlapping vs. overlapping clustering

  7. Clustering Methods: Terminology. Bottom-up (agglomerative) vs. top-down clustering

  8. Clustering Methods: Terminology. Hierarchical clustering

  9. Clustering Methods: Terminology. Deterministic vs. probabilistic clustering

  10. K-Means Clustering

  11-12. K-means clustering (k=3) [scatter plot of instances in the X-Y plane]: pick k random points as the initial cluster centers (k1, k2, k3)

  13. K-means clustering (k=3): assign each point to the nearest cluster center

  14-17. K-means clustering (k=3): move each cluster center to the mean of its cluster

  18-22. K-means clustering (k=3): reassign points to the nearest cluster center

  23-26. K-means clustering (k=3): repeat steps 3-4 until the cluster centers converge (don’t / hardly move)

  27. K-means (works with numeric data only): 1) pick K random points as the initial cluster centers; 2) assign every item to its nearest cluster center (e.g. using Euclidean distance); 3) move each cluster center to the mean of its assigned items; 4) repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
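To make the four steps concrete, here is a minimal NumPy sketch of the procedure; the function name, the tolerance and the seed handling are illustrative choices, not part of the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal k-means sketch following steps 1-4 on slide 27."""
    rng = np.random.default_rng(seed)
    # 1) pick k random points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) move each center to the mean of its assigned items
        #    (keep a center in place if its cluster happens to be empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4) stop once the centers hardly move any more
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, labels
        centers = new_centers
    return centers, labels
```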

  28. K-means clustering: another example http://www.youtube.com/watch?v=zaKjh2N8jN4#!

  29. Discussion • The result can vary significantly depending on the initial choice of centers • K-means can get trapped in a local minimum (the slide’s example shows an unlucky placement of the initial cluster centers among the instances) • To increase the chance of finding the global optimum: restart with different random seeds
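A minimal sketch of the random-restart idea, assuming scikit-learn is available: run k-means from several different seeds and keep the run with the lowest within-cluster sum of squares (inertia). The generated data is purely illustrative; scikit-learn's n_init parameter performs the same restarts internally.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three illustrative blobs of 2-D points
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

best = None
for seed in range(10):                                    # 10 random restarts
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km                                         # keep the lowest-inertia run

print("best within-cluster sum of squares:", best.inertia_)
```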

  30. K-means clustering summary. Advantages: simple and understandable; items are assigned to clusters automatically. Disadvantages: must pick the number of clusters beforehand; all items are forced into a single cluster; sensitive to outliers.

  31-33. K-means: variations • K-medoids: instead of the mean, use the median of each cluster (the mean of 1, 3, 5, 7, 1009 is 205; the median of 1, 3, 5, 7, 1009 is 5) • For large databases, use sampling
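A quick check of the numbers on this slide, using only Python's standard library:

```python
import statistics

values = [1, 3, 5, 7, 1009]
print(statistics.mean(values))    # 205 -- dragged far up by the outlier 1009
print(statistics.median(values))  # 5   -- unaffected by the outlier
```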

  34. Hierarchical Clustering

  35. Bottom-up vs. top-down clustering • Bottom-up / agglomerative: start with single-instance clusters; at each step, join the two “closest” clusters (the slide’s figure shows instances A-F merging into DE, BC, DEF, BCDEF and finally ABCDEF) • Top-down: start with one universal cluster, split it into two clusters, and proceed recursively on each subset

  36. Hierarchical clustering • A hierarchical clustering is represented in a dendrogram: a tree structure containing the hierarchical clusters • Clusters sit in the leaves; each internal node is the union of its child clusters
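A minimal sketch of bottom-up (agglomerative) clustering and its dendrogram, assuming SciPy and Matplotlib are available; the six 2-D points and their labels are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# six illustrative 2-D points forming three obvious pairs
X = np.array([[0.0, 0.0], [0.5, 0.2],
              [5.0, 5.0], [5.2, 4.8],
              [9.0, 0.5], [9.1, 0.4]])

# agglomerative clustering: repeatedly join the two "closest" clusters
Z = linkage(X, method="single")   # also: "complete", "average" (see slide 37)
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.show()
```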

  37. Distance Between Clusters • Centroid: distance between centroids (sometimes hard to compute, e.g. the mean of molecules?) • Single link: smallest distance between points • Complete link: largest distance between points • Average link: average distance between points. (Figure: clusters {A, B} and {C, D}; single link distance = 1, complete link distance = 2, average link distance = 1.5 = (d(A,C)+d(A,D)+d(B,C)+d(B,D))/4.)

  38. Distance Between Clusters (continued) • Group-average: group the two clusters into one, then take the average distance between all points (incl. d(A,B) and d(C,D))
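A minimal sketch of the three inter-cluster distances. The 1-D coordinates below are invented, but chosen so that the results match the figure on slide 37 (single link 1, complete link 2, average link 1.5).

```python
import numpy as np

# clusters {A, B} and {C, D} on a line (illustrative coordinates)
cluster1 = np.array([[0.0], [0.5]])   # A, B
cluster2 = np.array([[1.5], [2.0]])   # C, D

# all pairwise distances between points of the two clusters
d = np.linalg.norm(cluster1[:, None, :] - cluster2[None, :, :], axis=2)

print("single link  :", d.min())    # 1.0  (smallest pairwise distance)
print("complete link:", d.max())    # 2.0  (largest pairwise distance)
print("average link :", d.mean())   # 1.5  (mean over all four pairs)
```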

  39. Incremental Clustering

  40. Clustering weather data:

      ID  Outlook   Temp.  Humidity  Windy
      A   Sunny     Hot    High      False
      B   Sunny     Hot    High      True
      C   Overcast  Hot    High      False
      D   Rainy     Mild   High      False
      E   Rainy     Cool   Normal    False
      F   Rainy     Cool   Normal    True
      G   Overcast  Cool   Normal    True
      H   Sunny     Mild   High      False
      I   Sunny     Cool   Normal    False
      J   Rainy     Mild   Normal    False
      K   Sunny     Mild   Normal    True
      L   Overcast  Mild   High      True
      M   Overcast  Hot    Normal    False
      N   Rainy     Mild   High      True

  41. Clustering weather data (same table as slide 40), clustered incrementally: start new clusters, up to a point

  42. Category Utility • Category utility measures the overall quality of a clustering • It is a quadratic loss function • Nominal attributes: defined over clusters C_l, attributes a_i and values v_ij • Numeric attributes: similar, assuming a Gaussian distribution • Intuitively, good clusters allow us to predict the value of new data points: Pr[a_i = v_ij | C_l] > Pr[a_i = v_ij] • The 1/k factor is a penalty for using many clusters (avoids overfitting)
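The formula itself appears on the slide only as an image. As a hedged reconstruction, the standard category-utility definition for nominal attributes (as in Witten & Frank, with clusters indexed by l to keep them apart from the attribute index i) is

\[
CU(C_1,\dots,C_k) = \frac{1}{k}\sum_{l=1}^{k}\Pr[C_l]\sum_{i}\sum_{j}\left(\Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2\right)
\]

For numeric attributes, the sums over squared probabilities are replaced by their continuous analogues under a Gaussian assumption, giving terms of the form \(\frac{1}{2\sqrt{\pi}}\left(\frac{1}{\sigma_{il}} - \frac{1}{\sigma_i}\right)\), where \(\sigma_{il}\) is the standard deviation of attribute \(a_i\) within cluster \(C_l\) and \(\sigma_i\) its standard deviation over all the data.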

  43. Clustering weather data (same table as slide 40)

  44. Clustering weather data: the maximum number of clusters depends on k

  45. Clustering weather data: join the new instance with the most similar leaf, forming a new cluster

  46. Clustering weather data (continued)
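To make the incremental procedure above concrete, here is a simplified, flat (non-hierarchical) sketch in Python: each arriving instance either joins the existing cluster that most improves category utility or starts a new cluster. This is only an approximation of the tree-building shown on slides 40-46, and all names are illustrative.

```python
from collections import Counter

ATTRS = ["Outlook", "Temp", "Humidity", "Windy"]

def category_utility(clusters, data):
    """Category utility for nominal attributes (cf. slide 42)."""
    n, k = len(data), len(clusters)
    # overall value probabilities Pr[a_i = v_ij]
    base = {a: Counter(row[a] for row in data) for a in ATTRS}
    cu = 0.0
    for members in clusters:
        p_cluster = len(members) / n
        within = {a: Counter(row[a] for row in members) for a in ATTRS}
        for a in ATTRS:
            cond = sum((c / len(members)) ** 2 for c in within[a].values())
            prior = sum((c / n) ** 2 for c in base[a].values())
            cu += p_cluster * (cond - prior)
    return cu / k

def incremental_cluster(instances):
    clusters = [[instances[0]]]          # first instance starts the first cluster
    seen = [instances[0]]
    for inst in instances[1:]:
        seen.append(inst)
        candidates = []
        # option 1: try adding the instance to each existing cluster
        for i in range(len(clusters)):
            trial = [c + [inst] if j == i else c for j, c in enumerate(clusters)]
            candidates.append((category_utility(trial, seen), trial))
        # option 2: start a new cluster for the instance
        trial = clusters + [[inst]]
        candidates.append((category_utility(trial, seen), trial))
        clusters = max(candidates, key=lambda t: t[0])[1]   # keep the best option
    return clusters

# first five weather instances (rows A-E of the table on slide 40)
weather = [
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Windy": "False"},
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Windy": "True"},
    {"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "High",   "Windy": "False"},
    {"Outlook": "Rainy",    "Temp": "Mild", "Humidity": "High",   "Windy": "False"},
    {"Outlook": "Rainy",    "Temp": "Cool", "Humidity": "Normal", "Windy": "False"},
]
print(incremental_cluster(weather))
```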
