INTRODUCTION TO MACHINE LEARNING Clustering with k-means
Clustering, what?
● Cluster: a collection of objects
  ● Similar within a cluster
  ● Dissimilar between clusters
● Clustering: grouping objects into clusters
  ● No labels: unsupervised classification
  ● Many possible clusterings
Clustering, why?
● Pattern Analysis
● Targeted Marketing Programs
● Visualising Data
● Student Segmentation
● Pre-processing Step
● Data Mining
● Outlier Detection
● …
Clustering, how?
● Measure of similarity: d(…, …)
  ● Numerical variables: metrics such as Euclidean, Manhattan, …
  ● Categorical variables: construct your own distance
● Clustering methods
  ● k-means
  ● Hierarchical
  ● Many variations
  ● …
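A minimal sketch of computing such distances in R with the built-in dist() function; the three two-dimensional points are made up for illustration:

x <- rbind(c(1, 2), c(4, 6), c(5, 2))   # three toy observations
dist(x, method = "euclidean")           # straight-line distance
dist(x, method = "manhattan")           # sum of absolute coordinate differences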
Compactness and Separation
● Within cluster sums of squares (WSS), a measure of compactness, to be minimised:
  WSS = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2
  with k the number of clusters, C_i cluster i, m_i its centroid, and one term per object.
● Between cluster sums of squares (BSS), a measure of separation, to be maximised:
  BSS = \sum_{i=1}^{k} |C_i| \, \lVert m_i - \bar{x} \rVert^2
  with |C_i| the number of objects in cluster i and \bar{x} the sample mean.
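As a sketch, both sums can be computed directly from these definitions; data (a numeric matrix) and cluster (a membership vector) are assumed inputs here:

wss_bss <- function(data, cluster) {
  data <- as.matrix(data)
  m <- colMeans(data)                        # overall sample mean
  wss <- 0; bss <- 0
  for (i in unique(cluster)) {
    ci <- data[cluster == i, , drop = FALSE]
    mi <- colMeans(ci)                       # cluster centroid
    wss <- wss + sum(sweep(ci, 2, mi)^2)     # squared distances to the centroid
    bss <- bss + nrow(ci) * sum((mi - m)^2)  # weighted centroid-to-mean distance
  }
  c(WSS = wss, BSS = bss)
}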
k-Means Algorithm
Goal: partition the data into k disjoint subsets. Let's take k = 3.
1. Randomly assign k centroids
2. Assign each observation to its closest centroid
3. Move each centroid to the average location of its assigned observations
4. Repeat steps 2 and 3 until the assignments no longer change
The algorithm has converged!
[Scatter plots illustrate each step on a two-dimensional example dataset]
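A bare-bones sketch of this loop in R, assuming a numeric matrix data and that no cluster ever becomes empty:

k <- 3
centroids <- data[sample(nrow(data), k), ]    # 1. random initial centroids
repeat {
  d <- as.matrix(dist(rbind(centroids, data)))[1:k, -(1:k)]
  members <- apply(d, 2, which.min)           # 2. assign to the closest centroid
  moved <- apply(data, 2, function(col) tapply(col, members, mean))  # 3. move centroids
  if (all(moved == centroids)) break          # 4. repeat until nothing changes
  centroids <- moved
}

In practice you would call kmeans() instead; the loop only mirrors the four steps above.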
Choosing k
● Goal: find the k that minimizes WSS
● Problem: WSS keeps decreasing as k increases!
● Solution: fix k where WSS starts decreasing slowly, e.g. WSS / TSS < 0.2
Choosing k: Scree Plot
Scree plot: visualizing the ratio WSS / TSS as a function of k.
Look for the elbow in the plot: here, choose k = 3.
[Scree plot: WSS / TSS from 0.2 to 1.0 on the y-axis against k = 1, …, 7]
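A sketch of how such a plot is built, assuming the numeric dataset data from before:

ratios <- sapply(1:7, function(k) {
  km <- kmeans(data, centers = k, nstart = 20)
  km$tot.withinss / km$totss               # WSS / TSS
})
plot(1:7, ratios, type = "b", xlab = "k", ylab = "WSS / TSS")
abline(h = 0.2, lty = 2)                   # the 0.2 rule of thumb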
k-Means in R
> my_km <- kmeans(data, centers, nstart)
● centers: the starting centroids, or the number of clusters
● nstart: how many times R restarts with different random centroids
● Distance: Euclidean metric
> my_km$tot.withinss   # WSS
> my_km$betweenss      # BSS
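For example, assuming data holds numeric features only:

set.seed(1)                                # starting centroids are random
my_km <- kmeans(data, centers = 3, nstart = 20)
my_km$cluster                              # cluster membership per observation
my_km$centers                              # final centroid coordinates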
INTRODUCTION TO MACHINE LEARNING Let’s practice!
INTRODUCTION TO MACHINE LEARNING Performance and Scaling
Cluster Evaluation
Not trivial! There is no ground truth:
● No true labels
● No true response
Evaluation methods? They depend on the goal.
Our goal: compact and separated clusters. That is measurable!
Cluster Measures
● WSS and BSS give a good indication
● Underlying idea: compare the separation between clusters with the variance within clusters
● Alternatives:
  ● Diameter
  ● Intercluster distance
Diameter
The largest distance between two objects within the same cluster: a measure of compactness.
[Scatter plot: the diameter drawn between the two farthest objects of one cluster]
Intercluster Distance
The smallest distance between objects of two different clusters: a measure of separation.
[Scatter plot: the intercluster distance drawn between the two closest objects of different clusters]
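Both quantities follow directly from the distance matrix. A sketch, with data and the membership vector cluster assumed, and intercluster distance taken as the smallest object-to-object distance:

d <- as.matrix(dist(data))
diameters <- sapply(unique(cluster), function(i)
  max(d[cluster == i, cluster == i]))           # largest within-cluster distance
pairs <- combn(unique(cluster), 2)
between <- apply(pairs, 2, function(p)
  min(d[cluster == p[1], cluster == p[2]]))     # smallest between-cluster distance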
Dunn's Index
[Scatter plot: the example data with the diameters and intercluster distances highlighted]
Dunn's Index
Higher Dunn: better separated and/or more compact clusters.
Notes:
● High computational cost
● A worst-case indicator
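For reference, the standard definition combines the two measures from the previous slides: the smallest intercluster distance divided by the largest diameter,

D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{1 \le l \le k} \operatorname{diam}(C_l)}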
Alternative Measures
● Internal validation: based on intrinsic information
  ● BIC index
  ● Silhouette index
● External validation: based on prior knowledge
  ● Hubert's correlation
  ● Jaccard's coefficient
Evaluating in R
Libraries: cluster and clValid
Dunn's index:
> dunn(clusters = my_km$cluster, Data = ...)
● clusters: the cluster membership (partitioning) vector
● Data: the original dataset
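For example, with the kmeans fit and dataset assumed from before:

library(clValid)
dunn(clusters = my_km$cluster, Data = data)   # higher is better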
Scale Issues
Metrics are often scale dependent!
Which pair is most similar? (Age, Income, IQ)
● X1 = (28, 72000, 120)
● X2 = (56, 73000, 80)
● X3 = (29, 74500, 118)
Intuition: (X1, X3). Euclidean distance: (X1, X2), because the income differences dominate all others.
Solution: rescale, e.g. express income in units of 1000 $.
Standardizing
Problem: multiple variables on different scales.
Solution: standardize your data.
1. Subtract the mean
2. Divide by the standard deviation
> scale(data)
Note: after standardizing, the variables have a different interpretation.
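A short sketch, again assuming the numeric dataset data:

scaled <- scale(data)            # per column: subtract mean, divide by sd
round(colMeans(scaled), 10)      # every variable now has mean 0
apply(scaled, 2, sd)             # and standard deviation 1
my_km <- kmeans(scaled, centers = 3, nstart = 20)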
INTRODUCTION TO MACHINE LEARNING Let’s practice!
INTRODUCTION TO MACHINE LEARNING Hierarchical Clustering
Hierarchical Clustering
Hierarchy:
● Which objects cluster first?
● Which cluster pairs merge, and when?
Bottom-up:
● Starts from the individual objects
● Builds a hierarchy of clusters
Bottom-Up: Algorithm
Pre: calculate the distances between all objects.
[Diagram: the objects and their pairwise distances]
Bottom-Up: Algorithm
1. Put every object in its own cluster
2. Find the closest pair of clusters and merge them
3. Compute the distances between the new cluster and the old ones
4. Repeat steps 2 and 3 until only one cluster remains
Linkage Methods
● Single linkage: minimal distance between clusters
● Complete linkage: maximal distance between clusters
● Average linkage: average distance between clusters
Different linkage methods lead to different clusterings (see the R sketch after the next two slides).
Single Linkage
Minimal distance between the objects of the two clusters.
Complete Linkage
Maximal distance between the objects of the two clusters.
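In R's hclust() these correspond to the method argument; a sketch with an assumed numeric dataset data:

d <- dist(data)                                # Euclidean by default
hc_single   <- hclust(d, method = "single")    # minimal inter-object distance
hc_complete <- hclust(d, method = "complete")  # maximal inter-object distance
hc_average  <- hclust(d, method = "average")   # average inter-object distance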
Single Linkage: Chaining
● Clusters can grow by repeatedly absorbing the single nearest object, forming long chains
● Often undesired
● Can be a great outlier detector
Dendrogram
[Dendrogram: the leaves are the objects; each merge is drawn at its merge height; cutting the tree at a chosen height yields the clusters]
Hierarchical Clustering in R
Library: stats
> dist(x, method)
● x: the dataset
● method: the distance metric
> hclust(d, method)
● d: the distance matrix
● method: the linkage method
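Putting the pieces together, assuming a numeric dataset data:

d  <- dist(data, method = "euclidean")   # pairwise distance matrix
hc <- hclust(d, method = "complete")     # bottom-up clustering
plot(hc)                                 # draw the dendrogram
memb <- cutree(hc, k = 3)                # cut the tree into 3 clusters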
Hierarchical: Pros and Cons
● Pros
  ● In-depth analysis: different cuts reveal different patterns
  ● Choice of linkage methods
● Cons
  ● High computational cost
  ● Can never undo merges
k-Means: Pros and Cons
● Pros
  ● Can undo earlier assignments
  ● Fast computation
● Cons
  ● Fixed number of clusters
  ● Dependent on the starting centroids
INTRODUCTION TO MACHINE LEARNING Let’s practice!