Clustering with k-means - Introduction to Machine Learning


  1. INTRODUCTION TO MACHINE LEARNING Clustering with k-means

  2. Introduction to Machine Learning Clustering, what? ● Cluster: collection of objects ● Similar within a cluster ● Dissimilar between clusters ● Clustering: grouping objects in clusters ● No labels: unsupervised classification ● Many possible clusterings

  3. Introduction to Machine Learning Clustering, why? ● Pattern Analysis ● Targeted Marketing Programs ● Visualise Data ● Student Segmentations ● Pre-processing Step ● Data Mining ● Outlier Detection ● …

  4. Introduction to Machine Learning Clustering, how? ● Measure of Similarity: d(…, …) ● Numerical variables: metrics such as Euclidean, Manhattan, … ● Categorical variables: construct your own distance ● Clustering Methods: ● k-means ● Hierarchical ● Many variations ● …

  5. Introduction to Machine Learning Compactness and Separation ● Within Cluster Sum of Squares (WSS): squared distance of every object to the centroid of its cluster, summed over all k clusters ● Measure of compactness: minimise WSS ● Between Cluster Sum of Squares (BSS): squared distance of each cluster centroid to the sample mean, weighted by the number of objects in the cluster, summed over all k clusters ● Measure of separation: maximise BSS
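The two quantities on this slide can be written out explicitly; the notation below (clusters C_j with centroids c_j and sizes n_j, sample mean x̄) is ours, not from the slides:

```latex
\mathrm{WSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2
\qquad
\mathrm{BSS} = \sum_{j=1}^{k} n_j \,\lVert c_j - \bar{x} \rVert^2
```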

  6. Introduction to Machine Learning k-Means Algorithm Goal: Partition data into k disjoint subsets. Let's take k = 3 [scatter plot of the unclustered data]

  7. Introduction to Machine Learning k-Means Algorithm Goal: Partition data into k disjoint subsets, k = 3 1. Randomly assign k centroids [scatter plot: data with 3 random centroids]

  8. Introduction to Machine Learning k-Means Algorithm Goal: Partition data into k disjoint subsets, k = 3 1. Randomly assign k centroids 2. Assign data to closest centroid [scatter plot: data coloured by closest centroid]

  9. Introduction to Machine Learning k-Means Algorithm Goal: Partition data into k disjoint subsets, k = 3 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Move centroids to average location [scatter plot: centroids moved to the mean of their clusters]

  10. Introduction to Machine Learning k-Means Algorithm Goal: Partition data into k disjoint subsets, k = 3 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Move centroids to average location 4. Repeat steps 2 and 3 [scatter plot: updated assignments and centroids]

  11. Introduction to Machine Learning k-Means Algorithm Goal: Partition data into k disjoint subsets, k = 3 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Move centroids to average location 4. Repeat steps 2 and 3 The algorithm has converged! [scatter plot: final clusters]
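The four steps above can be sketched directly in base R. This is an illustrative toy version, not the `stats::kmeans` implementation; the function name and the `iris` example data are ours:

```r
# Toy k-means following the slide's four steps (illustrative sketch only;
# use stats::kmeans in practice).
simple_kmeans <- function(data, k, iterations = 10) {
  data <- as.matrix(data)
  # 1. Randomly pick k observations as the starting centroids
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  for (i in seq_len(iterations)) {
    # 2. Assign each object to the closest centroid (squared Euclidean)
    assignment <- apply(data, 1, function(row) {
      which.min(colSums((t(centroids) - row)^2))
    })
    # 3. Move each centroid to the average location of its objects
    for (j in seq_len(k)) {
      members <- data[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
    # 4. Steps 2 and 3 repeat (here for a fixed number of iterations;
    # stats::kmeans instead stops once assignments no longer change)
  }
  list(cluster = assignment, centers = centroids)
}

set.seed(1)
res <- simple_kmeans(iris[, 1:4], k = 3)
table(res$cluster)
```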

  12. Introduction to Machine Learning Choosing k ● Goal: find the k that minimizes WSS ● Problem: WSS keeps decreasing as k increases! ● Solution: fix k where WSS starts decreasing slowly, e.g. WSS / TSS < 0.2

  13. Introduction to Machine Learning Choosing k: Scree Plot ● Scree plot: visualize the ratio WSS / TSS as a function of k ● Look for the elbow in the plot ● Here: choose k = 3 [scree plot: WSS / TSS drops steeply until k = 3, then flattens]
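The elbow procedure can be run with a short loop: fit k-means for a range of k values and record the WSS / TSS ratio. The `iris` dataset and the range 1 to 7 are our example choices:

```r
# Elbow / scree procedure: WSS / TSS ratio for k = 1..7.
set.seed(1)
data <- iris[, 1:4]             # example dataset
ratio <- sapply(1:7, function(k) {
  km <- kmeans(data, centers = k, nstart = 20)
  km$tot.withinss / km$totss    # WSS / TSS
})
plot(1:7, ratio, type = "b", xlab = "k", ylab = "WSS / TSS")
```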

  14. Introduction to Machine Learning k-Means in R > my_km <- kmeans(data, centers, nstart) ● centers: starting centroids or #clusters ● nstart: #times R restarts with different centroids ● Distance: Euclidean metric > my_km$tot.withinss WSS > my_km$betweenss BSS
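A concrete run of the call on the slide, with `iris` as our example data:

```r
# stats::kmeans usage as on the slide.
set.seed(1)
my_km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
my_km$tot.withinss                    # WSS: minimise
my_km$betweenss                       # BSS: maximise
my_km$tot.withinss + my_km$betweenss  # WSS + BSS = total sum of squares
```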

  15. INTRODUCTION TO MACHINE LEARNING Let’s practice!

  16. INTRODUCTION TO MACHINE LEARNING Performance and Scaling

  17. Introduction to Machine Learning Cluster Evaluation ● Not trivial! There is no ground truth ● No true labels ● No true response ● Evaluation method? Depends on the goal ● Goal: compact and separated clusters Measurable!

  18. Introduction to Machine Learning Cluster Measures ● WSS and BSS give a good indication ● Underlying idea: compare the variance within clusters to the separation between clusters ● Alternatives: ● Diameter ● Intercluster Distance

  19. Introduction to Machine Learning Diameter ● Diameter of a cluster: the largest distance between two objects in the cluster ● Measure of compactness [scatter plot: objects, one cluster, distances between its objects]

  20. Introduction to Machine Learning Intercluster Distance ● Intercluster distance: the distance between objects of two different clusters ● Measure of separation [scatter plot: objects, two clusters, distances between their objects]

  21. Introduction to Machine Learning Dunn's Index [scatter plot: clusters with the minimal intercluster distance and the maximal diameter marked]

  22. Introduction to Machine Learning Dunn's Index ● Higher Dunn index: better separated / more compact clusters ● Notes: ● High computational cost ● Worst-case indicator
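The slides do not spell out the formula, but the standard definition of Dunn's index combines exactly the two measures just introduced: the smallest intercluster distance divided by the largest cluster diameter (notation ours):

```latex
D = \frac{\min_{i \neq j} \, \delta(C_i, C_j)}{\max_{1 \le m \le k} \operatorname{diam}(C_m)}
```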

  23. Introduction to Machine Learning Alternative measures ● Internal Validation: based on intrinsic knowledge ● BIC Index ● Silhouette's Index ● External Validation: based on previous knowledge ● Hubert's Correlation ● Jaccard's Coefficient

  24. Introduction to Machine Learning Evaluating in R Libraries: cluster and clValid Dunn's Index: > dunn(clusters = my_km, Data = ...) ● clusters: cluster partitioning vector ● Data: original dataset
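To make the definition concrete without depending on the clValid package, Dunn's index can also be computed by hand in base R. This is our own sketch (function name and `iris` example ours); `clValid::dunn` is the convenient way to do it in practice:

```r
# Manual Dunn's index: min intercluster distance / max diameter.
dunn_index <- function(data, cluster) {
  d <- as.matrix(dist(data))          # Euclidean distances between objects
  # Diameter: largest within-cluster distance
  diam <- max(sapply(unique(cluster), function(j) {
    max(d[cluster == j, cluster == j])
  }))
  # Separation: smallest distance between objects in different clusters
  sep <- min(sapply(unique(cluster), function(j) {
    min(d[cluster == j, cluster != j])
  }))
  sep / diam
}

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
di <- dunn_index(iris[, 1:4], km$cluster)
di
```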

  25. Introduction to Machine Learning Scale Issues ● Metrics are often scale dependent! ● Which pair is most similar? (Age, Income, IQ) ● X1 = (28, 72000, 120) ● X2 = (56, 73000, 80) ● X3 = (29, 74500, 118) ● Intuition: (X1, X3) ● Euclidean: (X1, X2) ● Solution: rescale income to units of 1000 $
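The slide's example can be checked directly: with raw units the Euclidean distance is dominated by income, while after rescaling the intuitive pair wins:

```r
# Raw vs. standardized Euclidean distance on the slide's three people.
X <- rbind(X1 = c(28, 72000, 120),
           X2 = c(56, 73000, 80),
           X3 = c(29, 74500, 118))
dist(X)          # X1-X2 comes out closest: income dominates
dist(scale(X))   # after standardizing, X1-X3 is closest, as intuition says
```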

  26. Introduction to Machine Learning Standardizing ● Problem: multiple variables on different scales ● Solution: standardize your data 1. Subtract the mean 2. Divide by the standard deviation > scale(data) ● Note: standardized variables have a different interpretation

  27. INTRODUCTION TO MACHINE LEARNING Let’s practice!

  28. INTRODUCTION TO MACHINE LEARNING Hierarchical Clustering

  29. Introduction to Machine Learning Hierarchical Clustering Hierarchy: ● Which objects cluster first? ● Which cluster pairs merge, and when? Bottom-up: ● Starts from the objects ● Builds a hierarchy of clusters

  30. Introduction to Machine Learning Bottom-Up: Algorithm Pre: Calculate distances between objects [diagram: the objects]

  31. Introduction to Machine Learning Bottom-Up: Algorithm Pre: Calculate distances between objects [diagram: the objects with their pairwise distances]

  32. Introduction to Machine Learning Bottom-Up: Algorithm 1. Put every object in its own cluster

  33. Introduction to Machine Learning Bottom-Up: Algorithm 2. Find the closest pair of clusters and merge them

  34. Introduction to Machine Learning Bottom-Up: Algorithm 3. Compute distances between the new cluster and the old ones

  35. Introduction to Machine Learning Bottom-Up: Algorithm 4. Repeat steps 2 and 3

  38. Introduction to Machine Learning Bottom-Up: Algorithm 4. Repeat steps 2 and 3 until only one cluster remains
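The merging steps above are exactly what `stats::hclust` performs; its `merge` and `height` components record which pair was merged at each step and at what distance. The six `iris` rows are our toy example:

```r
# Trace of the bottom-up merges on a tiny dataset.
d <- dist(iris[1:6, 1:4])         # pre-step: pairwise distances
hc <- hclust(d, method = "complete")
hc$merge                          # row i: the two clusters merged at step i
hc$height                         # the distance at which each merge happened
```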

  39. Introduction to Machine Learning Linkage Methods ● Single-Linkage: minimal distance between clusters ● Complete-Linkage: maximal distance between clusters ● Average-Linkage: average distance between clusters ● Different linkage methods can give different clusterings

  40. Introduction to Machine Learning Single-Linkage Minimal distance between objects of the two clusters

  41. Introduction to Machine Learning Complete-Linkage Maximal distance between objects of the two clusters

  42. Introduction to Machine Learning Single-Linkage: Chaining ● Often undesired

  45. Introduction to Machine Learning Single-Linkage: Chaining ● Often undesired ● Can be a great outlier detector

  46. Introduction to Machine Learning Dendrogram [diagram: merges drawn at increasing height; a horizontal cut through the tree selects the clusters; the leaves are the objects]

  47. Introduction to Machine Learning Hierarchical Clustering in R Library: stats > dist(x, method) ● x: dataset ● method: distance > hclust(d, method) ● d: distance matrix ● method: linkage
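A full run of the two calls on the slide, with `iris` as our example data; `plot` and `cutree` are standard companions, though not shown on the slide:

```r
# dist + hclust usage as on the slide.
d <- dist(iris[, 1:4], method = "euclidean")   # distance matrix
hc <- hclust(d, method = "average")            # average linkage
plot(hc)                                       # draw the dendrogram
clusters <- cutree(hc, k = 3)                  # cut the tree into 3 clusters
table(clusters)
```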

  48. Introduction to Machine Learning Hierarchical: Pros and Cons ● Pros ● In-depth analysis of different patterns ● Choice of linkage methods ● Cons ● High computational cost ● Can never undo merges

  49. Introduction to Machine Learning k-Means: Pros and Cons ● Pros ● Can undo assignments (objects can move between clusters) ● Fast computations ● Cons ● Fixed #clusters ● Dependent on starting centroids

  50. INTRODUCTION TO MACHINE LEARNING Let’s practice!
