INTRODUCTION TO MACHINE LEARNING Clustering with k-means
Clustering, what?
● Cluster: a collection of objects
  ● Similar within a cluster
  ● Dissimilar between clusters
● Clustering: grouping objects into clusters
  ● No labels: unsupervised classification
  ● Many possible clusterings
Clustering, why?
● Pattern Analysis
● Targeted Marketing Programs
● Visualising Data
● Student Segmentation
● Pre-processing Step
● Data Mining
● Outlier Detection
● …
Clustering, how?
● Measure of similarity: d(…, …)
  ● Numerical variables: metrics such as Euclidean, Manhattan, …
  ● Categorical variables: construct your own distance
● Clustering methods
  ● k-means
  ● Hierarchical
  ● Many variations
  ● …
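A minimal sketch of computing such distances in R with the built-in dist() function; the three two-dimensional points are made up for illustration:

x <- rbind(c(1, 2), c(4, 6), c(5, 2))   # three toy observations
dist(x, method = "euclidean")           # straight-line distance
dist(x, method = "manhattan")           # sum of absolute coordinate differences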
Compactness and Separation
● Within cluster sums of squares (WSS), a measure of compactness, to be minimised:
  WSS = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2
  with k the number of clusters, C_i cluster i, m_i its centroid, and one term per object.
● Between cluster sums of squares (BSS), a measure of separation, to be maximised:
  BSS = \sum_{i=1}^{k} |C_i| \, \lVert m_i - \bar{x} \rVert^2
  with |C_i| the number of objects in cluster i and \bar{x} the sample mean.
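As a sketch, both sums can be computed directly from these definitions; data (a numeric matrix) and cluster (a membership vector) are assumed inputs here:

wss_bss <- function(data, cluster) {
  data <- as.matrix(data)
  m <- colMeans(data)                        # overall sample mean
  wss <- 0; bss <- 0
  for (i in unique(cluster)) {
    ci <- data[cluster == i, , drop = FALSE]
    mi <- colMeans(ci)                       # cluster centroid
    wss <- wss + sum(sweep(ci, 2, mi)^2)     # squared distances to the centroid
    bss <- bss + nrow(ci) * sum((mi - m)^2)  # weighted centroid-to-mean distance
  }
  c(WSS = wss, BSS = bss)
}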
k-Means Algorithm
Goal: partition the data into k disjoint subsets. Let's take k = 3.
1. Randomly assign k centroids
2. Assign each observation to its closest centroid
3. Move each centroid to the average location of its assigned observations
4. Repeat steps 2 and 3 until the assignments no longer change
The algorithm has converged!
[Scatter plots illustrate each step on a two-dimensional example dataset]
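A bare-bones sketch of this loop in R, assuming a numeric matrix data and that no cluster ever becomes empty:

k <- 3
centroids <- data[sample(nrow(data), k), ]    # 1. random initial centroids
repeat {
  d <- as.matrix(dist(rbind(centroids, data)))[1:k, -(1:k)]
  members <- apply(d, 2, which.min)           # 2. assign to the closest centroid
  moved <- apply(data, 2, function(col) tapply(col, members, mean))  # 3. move centroids
  if (all(moved == centroids)) break          # 4. repeat until nothing changes
  centroids <- moved
}

In practice you would call kmeans() instead; the loop only mirrors the four steps above.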
Choosing k
● Goal: find the k that minimizes WSS
● Problem: WSS keeps decreasing as k increases!
● Solution: fix k where WSS starts decreasing slowly, e.g. WSS / TSS < 0.2
Choosing k: Scree Plot
Scree plot: visualizing the ratio WSS / TSS as a function of k.
Look for the elbow in the plot: here, choose k = 3.
[Scree plot: WSS / TSS from 0.2 to 1.0 on the y-axis against k = 1, …, 7]
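A sketch of how such a plot is built, assuming the numeric dataset data from before:

ratios <- sapply(1:7, function(k) {
  km <- kmeans(data, centers = k, nstart = 20)
  km$tot.withinss / km$totss               # WSS / TSS
})
plot(1:7, ratios, type = "b", xlab = "k", ylab = "WSS / TSS")
abline(h = 0.2, lty = 2)                   # the 0.2 rule of thumb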
k-Means in R
> my_km <- kmeans(data, centers, nstart)
● centers: the starting centroids, or the number of clusters
● nstart: how many times R restarts with different random centroids
● Distance: Euclidean metric
> my_km$tot.withinss   # WSS
> my_km$betweenss      # BSS
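For example, assuming data holds numeric features only:

set.seed(1)                                # starting centroids are random
my_km <- kmeans(data, centers = 3, nstart = 20)
my_km$cluster                              # cluster membership per observation
my_km$centers                              # final centroid coordinates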
INTRODUCTION TO MACHINE LEARNING Let’s practice!
INTRODUCTION TO MACHINE LEARNING Performance and Scaling
Cluster Evaluation
Not trivial! There is no ground truth:
● No true labels
● No true response
Evaluation methods? They depend on the goal.
Our goal: compact and separated clusters. That is measurable!
Cluster Measures
● WSS and BSS give a good indication
● Underlying idea: compare the separation between clusters with the variance within clusters
● Alternatives:
  ● Diameter
  ● Intercluster distance
Diameter
The largest distance between two objects within the same cluster: a measure of compactness.
[Scatter plot: the diameter drawn between the two farthest objects of one cluster]
Intercluster Distance
The smallest distance between objects of two different clusters: a measure of separation.
[Scatter plot: the intercluster distance drawn between the two closest objects of different clusters]
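Both quantities follow directly from the distance matrix. A sketch, with data and the membership vector cluster assumed, and intercluster distance taken as the smallest object-to-object distance:

d <- as.matrix(dist(data))
diameters <- sapply(unique(cluster), function(i)
  max(d[cluster == i, cluster == i]))           # largest within-cluster distance
pairs <- combn(unique(cluster), 2)
between <- apply(pairs, 2, function(p)
  min(d[cluster == p[1], cluster == p[2]]))     # smallest between-cluster distance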
Dunn's Index
[Scatter plot: the example data with the diameters and intercluster distances highlighted]
Dunn's Index
Higher Dunn: better separated and/or more compact clusters.
Notes:
● High computational cost
● A worst-case indicator
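For reference, the standard definition combines the two measures from the previous slides: the smallest intercluster distance divided by the largest diameter,

D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{1 \le l \le k} \operatorname{diam}(C_l)}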
Alternative Measures
● Internal validation: based on intrinsic information
  ● BIC index
  ● Silhouette index
● External validation: based on prior knowledge
  ● Hubert's correlation
  ● Jaccard's coefficient
Evaluating in R
Libraries: cluster and clValid
Dunn's index:
> dunn(clusters = my_km$cluster, Data = ...)
● clusters: the cluster membership (partitioning) vector
● Data: the original dataset
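For example, with the kmeans fit and dataset assumed from before:

library(clValid)
dunn(clusters = my_km$cluster, Data = data)   # higher is better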
Scale Issues
Metrics are often scale dependent!
Which pair is most similar? (Age, Income, IQ)
● X1 = (28, 72000, 120)
● X2 = (56, 73000, 80)
● X3 = (29, 74500, 118)
Intuition: (X1, X3). Euclidean distance: (X1, X2), because the income differences dominate all others.
Solution: rescale, e.g. express income in units of 1000 $.
Standardizing
Problem: multiple variables on different scales.
Solution: standardize your data.
1. Subtract the mean
2. Divide by the standard deviation
> scale(data)
Note: after standardizing, the variables have a different interpretation.
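A short sketch, again assuming the numeric dataset data:

scaled <- scale(data)            # per column: subtract mean, divide by sd
round(colMeans(scaled), 10)      # every variable now has mean 0
apply(scaled, 2, sd)             # and standard deviation 1
my_km <- kmeans(scaled, centers = 3, nstart = 20)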
INTRODUCTION TO MACHINE LEARNING Let’s practice!
INTRODUCTION TO MACHINE LEARNING Hierarchical Clustering
Hierarchical Clustering
Hierarchy:
● Which objects cluster first?
● Which cluster pairs merge, and when?
Bottom-up:
● Starts from the individual objects
● Builds a hierarchy of clusters
Bottom-Up: Algorithm
Pre: calculate the distances between all objects.
[Diagram: the objects and their pairwise distances]
Bottom-Up: Algorithm
1. Put every object in its own cluster
2. Find the closest pair of clusters and merge them
3. Compute the distances between the new cluster and the old ones
4. Repeat steps 2 and 3 until only one cluster remains
Linkage Methods
● Single linkage: minimal distance between clusters
● Complete linkage: maximal distance between clusters
● Average linkage: average distance between clusters
Different linkage methods lead to different clusterings (see the R sketch after the next two slides).
Single Linkage
Minimal distance between the objects of the two clusters.
Complete Linkage
Maximal distance between the objects of the two clusters.
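In R's hclust() these correspond to the method argument; a sketch with an assumed numeric dataset data:

d <- dist(data)                                # Euclidean by default
hc_single   <- hclust(d, method = "single")    # minimal inter-object distance
hc_complete <- hclust(d, method = "complete")  # maximal inter-object distance
hc_average  <- hclust(d, method = "average")   # average inter-object distance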
Single Linkage: Chaining
● Clusters can grow by repeatedly absorbing the single nearest object, forming long chains
● Often undesired
● Can be a great outlier detector
Dendrogram
[Dendrogram: the leaves are the objects; each merge is drawn at its merge height; cutting the tree at a chosen height yields the clusters]
Hierarchical Clustering in R
Library: stats
> dist(x, method)
● x: the dataset
● method: the distance metric
> hclust(d, method)
● d: the distance matrix
● method: the linkage method
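Putting the pieces together, assuming a numeric dataset data:

d  <- dist(data, method = "euclidean")   # pairwise distance matrix
hc <- hclust(d, method = "complete")     # bottom-up clustering
plot(hc)                                 # draw the dendrogram
memb <- cutree(hc, k = 3)                # cut the tree into 3 clusters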
Hierarchical: Pros and Cons
● Pros
  ● In-depth analysis: different cuts reveal different patterns
  ● Choice of linkage methods
● Cons
  ● High computational cost
  ● Can never undo merges
k-Means: Pros and Cons
● Pros
  ● Can undo earlier assignments
  ● Fast computation
● Cons
  ● Fixed number of clusters
  ● Dependent on the starting centroids
INTRODUCTION TO MACHINE LEARNING Let’s practice!