

  1. Unsupervised Machine Learning and Data Mining 
 DS 5230 / DS 4420 - Fall 2018 
 Lecture 10, Jan-Willem van de Meent

  2. Clustering

  3. Clustering • Unsupervised learning (no labels for training) • Group data into similar classes that • Maximize intra-cluster similarity • Minimize inter-cluster similarity

  4. Four Types of Clustering 1. Centroid-based (K-means, K-medoids) Notion of Clusters: Voronoi tessellation

  5. Four Types of Clustering 2. Connectivity-based (Hierarchical) Notion of Clusters: Cut off dendrogram at some depth

  6. Four Types of Clustering 3. Density-based (DBSCAN, OPTICS) Notion of Clusters: Connected regions of high density

  7. Four Types of Clustering 4. Distribution-based (Mixture Models) Notion of Clusters: Distributions on features

  8. Review: K-means Clustering 
 Objective: sum of squared errors (SSE), 
 $\mathrm{SSE} = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\,\lVert x_n - \mu_k \rVert^2$, 
 where $z_n$ is a one-hot assignment vector and $\mu_k$ is the center for cluster $k$. 
 Alternate between two steps: 
 1. Minimize SSE w.r.t. $z_n$ 
 2. Minimize SSE w.r.t. $\mu_k$ 
 [Figure: example data with centroids μ1, μ2, μ3]

  9. K-means Clustering [Figure: 2-D scatter of points with centroids μ1, μ2, μ3] Assign each point to the closest centroid, then update centroids to the average of their points

  10. K-means Clustering [Figure: updated assignments and centroid positions] Assign each point to the closest centroid, then update centroids to the average of their points

  11. K-means Clustering [Figure: assignments and centroids after another iteration] Repeat until convergence 
 (no points reassigned, means unchanged)

  12. K-means Clustering [Figure: the converged clustering] Repeat until convergence 
 (no points reassigned, means unchanged)
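A minimal sketch of the two-step loop described above, assuming NumPy is available; `X` is an (N, D) data array and `K` the number of clusters (the function name and defaults are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct random data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its closest centroid (minimize SSE w.r.t. z_n)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its points (minimize SSE w.r.t. mu_k)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```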

  13. “Good” Initialization of Centroids [Figure: K-means iterations 1-6; centroids (+) start spread across the data and converge to the true clusters]

  14. “Bad” Initialization of Centroids [Figure: K-means iterations 1-5; centroids (+) converge to a suboptimal clustering]

  15. Importance of Initial Centroids What is the chance of randomly selecting one initial point from each of the K clusters? (Assume each cluster has size n = N/K.) The probability is K! n^K / (Kn)^K = K!/K^K, which is tiny even for moderate K (for K = 10, roughly 0.00036). Implication: We will almost always have 
 multiple initial centroids in the same cluster.
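A quick sanity check on that probability (a small sketch; the function name is illustrative, the formula K!/K^K is the one derived above):

```python
from math import factorial

def prob_one_centroid_per_cluster(K):
    # Probability that K uniformly chosen initial centroids land in
    # K different, equally sized clusters: K! * n^K / (K*n)^K = K!/K^K
    return factorial(K) / K**K

for K in (2, 5, 10):
    print(K, prob_one_centroid_per_cluster(K))
# K = 10 gives roughly 0.00036, so a fully "good" random start is rare.
```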

  16. Example: 10 Clusters [Figure: K-means iterations 1-4 on data with 5 pairs of clusters, two initial points in each pair]

  17. Example: 10 Clusters [Figure: K-means iterations 1-4 on the same data; 5 pairs of clusters, two initial points in each pair]

  18. Importance of Initial Centroids Initialization tricks • Use multiple restarts • Initialize with hierarchical clustering • Select more than K points, then 
 keep the most widely separated points (see the sketch below)
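In practice the restart trick is a one-liner in a library; here is a sketch assuming scikit-learn is installed (the k-means++ initializer is a "spread out the initial centroids" strategy in the spirit of the last bullet; `X` is an assumed (N, D) data array):

```python
from sklearn.cluster import KMeans

# n_init runs K-means from several random initializations and keeps the
# solution with the lowest SSE; init="k-means++" picks initial centroids
# that are far apart instead of uniformly at random.
km = KMeans(n_clusters=3, n_init=10, init="k-means++", random_state=0)
labels = km.fit_predict(X)   # cluster index for each point in X
print(km.inertia_)           # SSE of the best run
```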


  19. Choosing K [Figure: the same data clustered with K=1 (SSE=873), K=2 (SSE=173), and K=3 (SSE=134)]

  20. Choosing K [Figure: cost function (SSE) plotted against K = 1 to 6] “Elbow finding” (a.k.a. “knee finding”): 
 set K to the value just above the “abrupt” increase in cost (we’ll talk about better methods later in this course)
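A sketch of how such an elbow plot can be produced (assumes scikit-learn and matplotlib; variable names are illustrative and `X` is an assumed data array):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 7)
sse = []
for k in ks:                       # X: (N, D) data array, assumed defined
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)        # within-cluster sum of squares for this K

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE (cost function)")
plt.title("Elbow / knee finding")
plt.show()
```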

  21. K-means Limitations: Differing Sizes [Figure: original points vs. K-means result with 3 clusters]

  22. K-means Limitations: Different Densities [Figure: original points vs. K-means result with 3 clusters]

  23. K-means Limitations: Non-globular Shapes [Figure: original points vs. K-means result with 2 clusters]

  24. Overcoming K-means Limitations Intuition: “Combine” smaller clusters into larger clusters • One Solution: Hierarchical Clustering • Another Solution: Density-based Clustering

  25. Hierarchical Clustering

  26. Dendrogram ( a.k.a. a similarity tree ) The similarity D(A,B) of A and B is represented as the height 
 of their lowest shared 
 internal node. Example tree (Newick format): (Bovine: 0.69395, (Spider Monkey: 0.390, (Gibbon: 0.36079, (Orang: 0.33636, (Gorilla: 0.17147, 
 (Chimp: 0.19268, Human: 0.11927): 0.08386): 0.06124): 0.15057): 0.54939);

  27. Dendrogram ( a.k.a. a similarity tree ) This representation is natural when measuring 
 genetic similarity: D(A,B) is the distance 
 to the common ancestor of A and B. (Same Newick tree as on the previous slide.)
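To inspect such a tree programmatically, one option is to parse the Newick string, e.g. with Biopython (a sketch; assumes Biopython is installed, spaces in names replaced by underscores, and the parentheses balanced relative to the slide):

```python
from io import StringIO
from Bio import Phylo

# Newick string adapted from the slide (underscores added, parens balanced)
newick = ("(Bovine:0.69395,(Spider_Monkey:0.390,(Gibbon:0.36079,"
          "(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927)"
          ":0.08386):0.06124):0.15057):0.54939));")
tree = Phylo.read(StringIO(newick), "newick")
Phylo.draw_ascii(tree)  # prints the dendrogram as text
```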

  28. Example: Iris data Iris Setosa Iris versicolor Iris virginica https://en.wikipedia.org/wiki/Iris_flower_data_set

  29. Hierarchical Clustering ( Euclidean Distance ) [Figure: dendrogram of the Iris data under Euclidean distance] https://en.wikipedia.org/wiki/Iris_flower_data_set
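A sketch of reproducing such a dendrogram with SciPy (assumes scikit-learn for the Iris data and matplotlib for plotting; the choice of linkage method is an assumption, the slide only states Euclidean distance):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import load_iris

X = load_iris().data                                   # 150 x 4 feature matrix
Z = linkage(X, method="average", metric="euclidean")   # agglomerative merge tree
dendrogram(Z, no_labels=True)
plt.ylabel("merge distance")
plt.show()
```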

  30. Edit Distance Distance between Patty and Selma: change dress color (1 point), change earring shape (1 point), change hair part (1 point); D(Patty, Selma) = 3. Distance between Marge and Selma: change dress color (1 point), add earrings (1 point), decrease height (1 point), take up smoking (1 point), lose weight (1 point); D(Marge, Selma) = 5. Edit distance can be defined for any set of discrete features.

  31. Edit Distance for Strings • Transform string Q into string C using only Substitution, Insertion, and Deletion. • Assume each of these operators has a cost associated with it (here Substitution = 1 unit, Insertion = 1 unit, Deletion = 1 unit). • The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. How similar are “Peter” and “Piotr”? D(Peter, Piotr) = 3: Peter → Piter (Substitution, i for e) → Pioter (Insertion, o) → Piotr (Deletion, e). [Figure: variants of the name: Pedro, Petros, Peter, Piotr, Piero, Pyotr, Pietro, Pierre]
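A standard dynamic-programming implementation of this edit (Levenshtein) distance, as a sketch rather than code from the lecture:

```python
def edit_distance(q, c):
    """Minimum number of substitutions, insertions, and deletions
    needed to transform string q into string c (each costs 1 unit)."""
    m, n = len(q), len(c)
    # dp[i][j] = edit distance between q[:i] and c[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of q[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of c[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution (or match)
    return dp[m][n]

print(edit_distance("Peter", "Piotr"))  # -> 3, as on the slide
```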

  32. Hierarchical Clustering ( Edit Distance ) Pedro (Portuguese); Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian). Cristovao (Portuguese); Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English). Miguel (Portuguese); Michalis (Greek), Michael (English), Mick (Irish). [Figure: dendrogram of these names under edit distance]

  33. Meaningful Patterns Edit distance yields a clustering according to geography. (Slide from Eamonn Keogh.) Pedro (Portuguese/Spanish); Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)

  34. Spurious Patterns In general, clusterings will only be as meaningful as your distance metric. Some apparent clusters are spurious; there is no connection between the grouped items. [Figure: dendrogram of flags: South Georgia & South Sandwich Islands, St. Helena & Dependencies, U.K., Serbia & Montenegro (Yugoslavia), AUSTRALIA, ANGUILLA, FRANCE, NIGER, INDIA, IRELAND, BRAZIL]

  35. Spurious Patterns In general, clusterings will only be as meaningful as your distance metric. [Same figure, with the groupings labelled “Former UK colonies” and “No relation”] One grouping is meaningful (former UK colonies); the other is spurious: there is no connection between the countries clustered together.

  36. “Correct” Number of Clusters One use of a dendrogram is to determine the “correct” number of clusters.

  37. “Correct” Number of Clusters Determine the number of clusters by looking at the distance at which branches merge: cut the dendrogram where the merge distance jumps.

  38. Detecting Outliers The single isolated branch is suggestive of a data point that is very different from all the others. [Figure: dendrogram with one isolated branch labelled “Outlier”]

  39. Bottom-up vs Top-down The number of dendrograms with n leaves = (2n - 3)! / [2^(n - 2) (n - 2)!] 
 Number of Leaves    Number of Possible Dendrograms 
 2                   1 
 3                   3 
 4                   15 
 5                   105 
 ...                 ... 
 10                  34,459,425 
 Since we cannot test all possible trees, we will have to do a heuristic search over possible trees. We could do this: Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
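A quick check of that count (a sketch; the function name is illustrative, the formula is the one stated on the slide):

```python
from math import factorial

def num_dendrograms(n):
    # Number of rooted binary trees with n labelled leaves:
    # (2n - 3)! / (2**(n - 2) * (n - 2)!), i.e. the double factorial (2n - 3)!!
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))
# -> 1, 3, 15, 105, 34459425, matching the table above
```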
