


  1. Clustering Lecture notes

  2. Clustering is an exploratory, unsupervised method. Data within a cluster are similar to each other and dissimilar to data in other clusters; the aim is finding structure in the data, for understanding and/or summarising it. Clusterings may be hard, soft/fuzzy, or exclusive: in hard clustering each point belongs to exactly one cluster, while in fuzzy/soft clustering a point belongs to several clusters with membership degrees (the slide illustrates a point with memberships 70%, 15%, 13% and 2%).

  3. Types of clusters: Well separated – points are closer to the others in their cluster than to any other cluster. Centre-based – membership is based on distance to the cluster centres. Contiguity-based. Density-based – high-density areas separated by low-density areas. Conceptual – points sharing some general attribute. Overlapping – points in an intersection belong to both clusters.

  4. The notion of a cluster is ambiguous.

  5. Applications: Biology – compare species in different environments. Medicine – group variants of illnesses or similar genes; differentiate tissues in PET scans. Marketing – consumers with similar shopping habits; grouping shopping items. Social sciences – crime analysis (greatest incidences of different crimes); poll data. Computer science – data compression; anomaly detection. Internet – web searches (categories of results); social media analysis. Other: locating network towers for optimum signal coverage; Netflix (clustering similar viewing habits); grouping skulls from archaeological digs based on measurements.

  6. “Distance” metric. Minkowski metric: for n features and points x, y, d(x, y) = (sum over i of |x_i - y_i|^q)^(1/q). Two common cases: Manhattan, q = 1 ("cityblock"), and Euclidean, q = 2 ("as the crow flies"). Other metrics: Mahalanobis distance – absolute, without redundancies; Pearson correlation (unit independent) – covariance(x, y) / [std(x) std(y)], bad for high-dimensional data; binary data – Russel, Dice, Yule index, …; cosine (documents: keywords); Gower's distance (mixed data). Alternatives (squared distances): squared Euclidean, squared Pearson. Magnitude and units affect the result (e.g. body height vs. toe length) -> standardise (mean = 0, std = 1), but this may affect variability.
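
A minimal sketch of these metrics in Python, assuming NumPy and SciPy are available; the sample vectors and the height/toe-length matrix are made up purely for illustration:

    import numpy as np
    from scipy.spatial import distance

    # Two hypothetical feature vectors (not taken from the slides).
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])

    # Minkowski metric: d(x, y) = (sum_i |x_i - y_i|^q)^(1/q)
    def minkowski(x, y, q):
        return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

    print(minkowski(x, y, 1))          # Manhattan / cityblock (q = 1)
    print(minkowski(x, y, 2))          # Euclidean (q = 2)
    print(distance.cityblock(x, y))    # same results via SciPy
    print(distance.euclidean(x, y))
    print(distance.cosine(x, y))       # cosine distance = 1 - cosine similarity

    # Pearson correlation (unit independent): covariance(x, y) / [std(x) * std(y)]
    r = np.corrcoef(x, y)[0, 1]
    print(1 - r)                       # a common correlation-based distance

    # Standardising (mean = 0, std = 1) before clustering, as the slide suggests:
    X = np.array([[170.0, 25.0], [180.0, 27.0], [160.0, 23.0]])   # e.g. height vs. toe length
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    print(X_std)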

  7. Linkage criteria – how a cluster's location is represented when measuring distances between clusters: nearest (single) neighbour – sensitive to noise and outliers; farthest (complete) neighbour – also sensitive to noise and outliers, favours globular shapes; average – average over all pairwise point distances (a middle ground between single and complete); centroid – a virtual "average" point computed per feature; medoid – a real, "median" point. A code sketch follows below.
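
A sketch of how these linkage criteria translate into code; the two point sets here are hypothetical, not the clusters A and B of the next slide:

    import numpy as np
    from scipy.spatial.distance import cdist

    # Hypothetical 2-D clusters, for illustration only.
    A = np.array([[1.0, 1.0], [2.0, 1.0], [2.0, 2.0]])
    B = np.array([[5.0, 4.0], [6.0, 4.0], [6.0, 3.0]])

    D = cdist(A, B, metric="cityblock")     # all pairwise Manhattan distances

    single   = D.min()                      # nearest (single) neighbour
    complete = D.max()                      # farthest (complete) neighbour
    average  = D.mean()                     # average over all point pairs

    # Centroid linkage: distance between the (virtual) mean points.
    centroid = np.abs(A.mean(axis=0) - B.mean(axis=0)).sum()

    # Medoid linkage: distance between the most central real point of each cluster.
    def medoid(points):
        within = cdist(points, points, metric="cityblock")
        return points[within.sum(axis=1).argmin()]

    medoid_dist = np.abs(medoid(A) - medoid(B)).sum()

    print(single, complete, average, centroid, medoid_dist)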

  8. Example calculation – scatter plot of two small 2-D clusters, A (lower left) and B (upper right), used for the distance comparison on the next slide.

  9. Distance comparison between clusters A and B:

     Linkage    Manhattan       Euclidean
     Single     4               sqrt(8)  = 2.83
     Complete   7               sqrt(29) = 5.39
     Average    48/9 = 5.33     4.00
     Centroid   16/3 = 5.33     sqrt(153/9) = 4.12
     Medoid     6               sqrt(18) = 4.24

  10. Hierarchical clustering builds a (hierarchical) tree of nested clusters; the levels of the tree are the steps of the clustering process. Agglomerative vs. divisive = "bottom-up" vs. "top-down". Time complexity is at least O(n^2), and divisive clustering is often more computationally demanding.

  11. Dendrogram – stepwise merges (or splits) of points a–e, drawn with the merge height on the vertical axis.

  12. Dendrogram terminology: Root – the starting point containing all points. Branch point – a splitting/merging point. Leaf – a point in its own cluster. Threshold – the selected limit that determines the best number of clusters.
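
A minimal sketch of building such a dendrogram with SciPy; the data are random and purely illustrative, and the threshold value is arbitrary:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

    # Agglomerative ("bottom-up") clustering with average linkage and Manhattan metric.
    Z = linkage(X, method="average", metric="cityblock")

    # Cutting the tree at a chosen threshold yields the flat clusters.
    labels = fcluster(Z, t=4.0, criterion="distance")
    print(labels)

    dendrogram(Z)                 # root at the top, leaves at the bottom
    plt.axhline(4.0, ls="--")     # the chosen threshold
    plt.show()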

  13. Example calculation – revisited. Manhattan metric with centroid linkage; initial distance matrix for points a–e:

          a   b   c   d   e
      a   0   1   4   5   5
      b   1   0   2   6   6
      c   4   2   0   7   7
      d   5   6   7   0   2
      e   5   6   7   2   0

  14. Example calculation – revisited. After merging a and b into cluster ab (Manhattan metric, centroid linkage):

           ab    c     d     e
      ab   0     3.5   5.5   5.5
      c    3.5   0     7     7
      d    5.5   7     0     2
      e    5.5   7     2     0

  15. Example calculation – revisited. After merging d and e into cluster de:

           ab    c     de
      ab   0     4.5   5
      c    4.5   0     7
      de   5     7     0

  16. Example calculation – revisited. Final step: cluster ab merges with c, leaving two clusters, abc and de (Manhattan metric with centroid linkage).
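
The merge sequence above can be reproduced with a short loop. SciPy's built-in centroid linkage assumes the Euclidean metric, so a hand-rolled sketch using Manhattan distances between centroids is shown instead; the coordinates for a–e are made up, since the original scatter plot is not reproduced here, and the merge order may therefore differ from the slides:

    import numpy as np

    # Hypothetical 2-D coordinates for points a-e (not the slide's exact values).
    points = {
        "a": np.array([1.0, 5.0]),
        "b": np.array([2.0, 5.0]),
        "c": np.array([3.0, 4.0]),
        "d": np.array([5.0, 2.0]),
        "e": np.array([6.0, 1.0]),
    }
    clusters = {name: [p] for name, p in points.items()}

    def centroid(members):
        return np.mean(members, axis=0)

    def manhattan(u, v):
        return np.abs(u - v).sum()

    # Agglomerative loop: repeatedly merge the two clusters whose centroids are closest.
    while len(clusters) > 2:
        names = list(clusters)
        i, j = min(
            ((p, q) for p in names for q in names if p < q),
            key=lambda pq: manhattan(centroid(clusters[pq[0]]), centroid(clusters[pq[1]])),
        )
        clusters[i + j] = clusters.pop(i) + clusters.pop(j)
        print("merged", i, "and", j, "->", sorted(clusters))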

  17. More detailed example

  18. k-means is a partitional clustering method: specify the number of clusters k and find the most compact solution. Finding the optimal partition ordinarily has time complexity O(n^(dk+1)) for d features and k clusters, but faster algorithms exist (Lloyd's algorithm is linear in the number of points per iteration). However, k-means falls into local minima (several trials are needed), only handles numeric data, and does not exclude redundant features.

  19. k-means – Initialise. Initialise the k cluster centroids, for example: fully randomly – may give different clusters on each run, so do multiple runs; first centroid random (or the average of the data), then pick each remaining centroid as the point most distant from the existing centroids – may select outliers; or k-means++ – the first centroid is fully random, the others are chosen randomly with probability proportional to D², the squared distance to the nearest centroid already chosen. Then iterate (next slide).
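
A sketch of the k-means++ rule described above, in plain NumPy with made-up data (the function name kmeans_pp_init is just for this example):

    import numpy as np

    def kmeans_pp_init(X, k, rng):
        # First centroid: fully random.
        centroids = [X[rng.integers(len(X))]]
        # Remaining centroids: chosen with probability proportional to D^2,
        # the squared distance to the nearest centroid picked so far.
        for _ in range(k - 1):
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
            centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centroids)

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 2))
    print(kmeans_pp_init(X, k=3, rng=rng))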

  20. k-means – Iterate. In each iteration: calculate the centroids, calculate the distances from each point to the centroids, and move each point to its closest centroid. Repeat until either no points move between clusters, fewer than X% of points move, or the maximum number of iterations is reached.
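
Putting the iteration together, a minimal Lloyd-style k-means loop might look like the sketch below (simple random initialisation, stopping when no points move or the iteration limit is hit; the data are synthetic):

    import numpy as np

    def kmeans(X, k, max_iter=100, rng=None):
        rng = rng or np.random.default_rng()
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # simple random init
        labels = np.full(len(X), -1)
        for _ in range(max_iter):
            # Calculate distances to the centroids and move each point to the closest one.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):                 # no points moved
                break
            labels = new_labels
            # Recalculate the centroids from the new assignment.
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return labels, centroids

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    labels, centroids = kmeans(X, k=2, rng=rng)
    print(centroids)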

  21. k-means – issues. Empty clusters – replace the centroid with the point farthest from the existing clusters, or with a point from the cluster with the highest sum of squared errors (SSE). Outliers – points with a high contribution to the cluster SSE are often removed, but in some fields, such as finance, the outliers are exactly what matters.

  22. Visualisation – k = 10; the white crosses are the centroids.

  23. Silhouette. Which k to use? The silhouette validates performance based on intra- and inter-cluster distances: a(i) is the average dissimilarity of point i to the other points in its cluster, and b(i) is the lowest average dissimilarity of i to any cluster it is not a member of. Values lie in [-1, 1]. It is calculated for each point, so it is very time-demanding.
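
In practice the silhouette coefficient s(i) = (b(i) - a(i)) / max(a(i), b(i)) is usually computed with a library; a sketch using scikit-learn on synthetic data to compare several values of k:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(i * 5, 1, (50, 2)) for i in range(3)])   # three synthetic blobs

    # silhouette_score averages s(i) over all points; higher is better.
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))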

  24. Calinski-Harabasz Index. Faster than the silhouette and better for large data sets; performance is based on the average intra- and inter-cluster SSE (traces of the within- and between-cluster dispersion matrices); see the sketch below.
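
The slide's formula is not reproduced above; the standard definition of the index is CH = [Tr(B_k) / (k - 1)] / [Tr(W_k) / (n - k)], where B_k and W_k are the between- and within-cluster dispersion matrices. A sketch using scikit-learn's implementation on synthetic data:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(i * 5, 1, (500, 2)) for i in range(4)])   # four synthetic blobs

    # CH = [Tr(B_k) / (k - 1)] / [Tr(W_k) / (n - k)]; higher is better, and it is
    # much cheaper to evaluate than the silhouette on large data sets.
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, calinski_harabasz_score(X, labels))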

  25. Postprocessing. We may still want to improve the SSE of the result. To decrease SSE (more clusters): split the cluster with the largest SSE or standard deviation, or open a new cluster using the point most distant from any existing cluster. To decrease the number of clusters (with the smallest SSE increase): disperse a cluster and reassign its points to the clusters whose SSE increases the least, or merge the two clusters with the closest centroids.
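
A small helper for the per-cluster SSE that these heuristics rely on (a sketch; the labels and centroids would come from whatever clustering was run, and the example assignment below is made up):

    import numpy as np

    def cluster_sse(X, labels, centroids):
        # Sum of squared errors per cluster: used to pick which cluster to split
        # (largest SSE) or which change increases the total SSE the least.
        return {
            j: float(np.sum((X[labels == j] - c) ** 2))
            for j, c in enumerate(centroids)
        }

    # Made-up example assignment:
    X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]])
    labels = np.array([0, 0, 1, 1, 1])
    centroids = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
    print(cluster_sse(X, labels, centroids))   # cluster 1 has the largest SSE -> split candidate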
