A quick review

The clustering problem:
- Different representations
- Homogeneity vs. separation
- Many possible distance metrics
- Many possible linkage approaches
- Method matters; metric matters; definitions matter.
[Figure: a hierarchical clustering result — a tree with leaf nodes (c1, c2, c3, c4, …), branch nodes, and a root — built from the distance matrix below; cutting the tree yields five clusters.]

Distance matrix:

0.00  4.00  6.00  3.50  1.00
4.00  0.00  6.00  2.00  4.50
6.00  6.00  0.00  5.50  6.50
3.50  2.00  5.50  0.00  4.00
1.00  4.50  6.50  4.00  0.00
The “philosophy” of clustering - Summary
K-means clustering
Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein
K-means clustering: A different approach
- Requires the number of clusters, k, to be specified in advance (in contrast to hierarchical clustering)
- Produces a single flat partition rather than a tree of nested clusters (in contrast to hierarchical clustering)
- Searches for the partition iteratively (in contrast to hierarchical clustering, and to the methods we have learned so far)
What constitutes a good clustering solution?
(What exactly are we trying to find?)
Defining a good clustering solution

[Figure: scatter plots of points plotted by expression in condition 1 vs. expression in condition 2; one candidate solution marks a red cluster center and a green cluster center.]
The K-means approach

A clustering of n observations/points into k clusters is 'good' if each observation is assigned to the cluster with the nearest mean/center.
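This criterion can be written as a direct check (a minimal sketch, assuming Euclidean distance; the function names are illustrative, not from the lecture):

```python
import math

def nearest_center(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    dists = [math.dist(point, c) for c in centers]
    return dists.index(min(dists))

def is_good_clustering(points, labels, centers):
    """True iff every point is assigned to the cluster with the nearest center."""
    return all(nearest_center(p, centers) == labels[i]
               for i, p in enumerate(points))
```

Note that this only verifies the property for a given partition; it does not tell us how to find such a partition.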
K-means clustering

[Figure: points partitioned into two clusters, with cluster 1's center (mean) and cluster 2's center (mean) marked.]
K-means clustering: Chicken and egg?

But how do we find a clustering solution with this property, i.e., a partition such that each observation belongs to the cluster with the nearest mean/center?
- I do not know the means before I determine the partitioning.
- I do not know the partitioning before I determine the means.
The K-means clustering algorithm
An iterative approach
Start with some random locations of means/centers, partition into clusters according to these centers, then correct the centers according to the clusters, and repeat [similar to EM (expectation-maximization) algorithms]
K-means clustering algorithm
1. Pick k points at random as the initial centers.
2. Assign each element to the cluster with the closest center.
3. Re-calculate each center as the mean of its assigned elements.
4. Repeat steps 2-3 until one of the termination conditions is reached:
   i. The clusters are the same as in the previous iteration (stable solution)
   ii. The clusters are as in some previous iteration (cycle)
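The steps above can be sketched in plain Python (an illustrative implementation, not the lecture's code; it checks only the stable-solution termination condition, plus an iteration cap):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: random initial centers, assign, re-calculate, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: k random points as centers
    labels = None
    for _ in range(max_iter):
        # step 2: assign each point to the closest center
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:             # termination: stable solution
            break
        labels = new_labels
        # step 3: re-calculate each center as the mean of its assigned points
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:                      # keep old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return labels, centers

# toy demo: two well-separated groups of 2D points (made-up data)
labels, centers = kmeans([(0, 0), (0, 1), (1, 0),
                          (10, 10), (10, 11), (11, 10)], k=2)
```

Handling a cluster that loses all its points (here, by keeping its old center) is a design choice; implementations differ on this.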
How can we do this efficiently?
Assigning elements to the closest center

[Figure: with two centers, A and B, the perpendicular bisector splits the plane into the region closer to A and the region closer to B; adding a third center, C, further splits it into the regions closest to A, closest to B, and closest to C.]
Voronoi diagram

A partition of space into cells, determined by distances to a specified discrete set of "centers" in the space (each colored cell represents the collection of all points in this space that are closer to a specific center than to any other). Voronoi diagrams have many applications (e.g., the 1854 Broad Street cholera outbreak map, and many others).
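The colored-cell picture can be reproduced by labeling every point on a grid with its nearest center (a toy sketch; the centers A, B, C and the grid size are made up for illustration):

```python
import math

# three arbitrary example centers in a 10x10 region
centers = {"A": (2, 2), "B": (8, 2), "C": (5, 8)}

def voronoi_cell(point):
    """Name of the center closest to point, i.e., point's Voronoi cell."""
    return min(centers, key=lambda name: math.dist(point, centers[name]))

# label a grid of points by their Voronoi cell (one "colored cell" per center)
grid = {(x, y): voronoi_cell((x, y)) for x in range(11) for y in range(11)}
```

The assignment step of k-means is exactly this computation, with the current cluster means playing the role of the Voronoi centers.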
K-means clustering example
1. Points are randomly generated.
2. Two points are randomly chosen as the initial centers (stars).
3. Each point is assigned to the cluster with the closest center.
4. Once the points are partitioned into clusters, the centers are re-calculated.
5. The new centers are used to partition the points again.
6. The points are re-assigned to clusters and the centers are re-calculated again.
7. The new centers are again used to partition the points.
8. The points are re-assigned into clusters.
9. The process repeats until the set of centers remains stable.
K-means clustering: Summary
- Converges quickly (sometimes 1 iteration results in a stable solution).
- The solution found can depend on the initial choice of centers.
K-means clustering: Variations
- An EM-like (expectation-maximization) variant maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
- Other variants refine a solution by swapping points between clusters.
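To contrast probabilistic with deterministic assignments, here is a toy soft-assignment sketch using spherical Gaussian-like weights exp(-d²/2). This is a deliberate simplification: a real EM/mixture-model fit would also estimate covariances and mixing weights, which this sketch omits:

```python
import math

def soft_assignment(point, centers):
    """Probabilistic (soft) cluster memberships for one point.

    Weights are exp(-d^2 / 2), a spherical-Gaussian stand-in; the
    full EM algorithm would also fit covariances and mixing weights.
    """
    w = [math.exp(-math.dist(point, c) ** 2 / 2) for c in centers]
    total = sum(w)
    return [x / total for x in w]
```

A point near one center gets a membership close to 1 for that cluster; a point midway between two centers gets 0.5 for each, rather than being forced into one cluster.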
An important take-home message

[Figure (D'haeseleer, 2005): the same data set clustered by hierarchical clustering and by K-means clustering, illustrating that different methods can produce different solutions.]
What else are we missing?