

SLIDE 1

A quick review

The clustering problem:

  • Different representations
  • Homogeneity vs. separation
  • Many possible distance metrics
  • Many possible linkage approaches
  • Method matters; metric matters; definitions matter

SLIDE 2

A quick review

  • Hierarchical clustering:
  • Takes as input a distance matrix
  • Progressively regroups the closest objects/groups
  • The result is a tree - intermediate nodes represent clusters
  • Branch lengths represent distances between clusters
[Dendrogram: leaf nodes Object 2, Object 4, Object 1, Object 3, Object 5; branch nodes c1–c4; the root at the top]

Distance matrix:

           Object 1  Object 2  Object 3  Object 4  Object 5
Object 1       0.00      4.00      6.00      3.50      1.00
Object 2       4.00      0.00      6.00      2.00      4.50
Object 3       6.00      6.00      0.00      5.50      6.50
Object 4       3.50      2.00      5.50      0.00      4.00
Object 5       1.00      4.50      6.50      4.00      0.00
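For concreteness, this exact distance matrix can be fed to an off-the-shelf hierarchical clustering routine. A minimal sketch using SciPy (not part of the original slides; the "average" linkage method is an arbitrary choice, since the slide does not specify one):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# The 5x5 distance matrix from the slide (Objects 1-5).
D = np.array([
    [0.00, 4.00, 6.00, 3.50, 1.00],
    [4.00, 0.00, 6.00, 2.00, 4.50],
    [6.00, 6.00, 0.00, 5.50, 6.50],
    [3.50, 2.00, 5.50, 0.00, 4.00],
    [1.00, 4.50, 6.50, 4.00, 0.00],
])

# linkage() expects a condensed distance matrix; squareform() converts it.
# The result progressively merges the closest objects/groups into a tree.
Z = linkage(squareform(D), method="average")

dendrogram(Z, labels=["Object 1", "Object 2", "Object 3", "Object 4", "Object 5"])
plt.show()
```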

SLIDE 3

Hierarchical clustering result

Five clusters

SLIDE 4
The “philosophy” of clustering - Summary

  • An “unsupervised learning” problem
  • No single solution is necessarily the true/correct one!
  • There is usually a tradeoff between homogeneity and separation:
  • More clusters → increased homogeneity but decreased separation
  • Fewer clusters → increased separation but reduced homogeneity
  • Method matters; metric matters; definitions matter
  • In most cases, heuristic methods or approximations are used.

SLIDE 5

Clustering

k-means clustering

Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

SLIDE 6

K-means clustering: A different approach

  • Clear definition of a ‘good’ clustering solution (in contrast to hierarchical clustering)
  • Divisive rather than agglomerative (in contrast to hierarchical clustering)
  • Obtained solution is non-hierarchical (in contrast to hierarchical clustering)
  • A new algorithmic approach (unlike any algorithm we have learned so far)

SLIDE 7

What constitutes a good clustering solution?

(What exactly are we trying to find?)

SLIDE 8

Defining a good clustering solution

[Scatter plot: expression in condition 1 vs. expression in condition 2]

SLIDE 9

Defining a good clustering solution

[Two scatter panels: expression in condition 1 vs. expression in condition 2]

SLIDE 10

Defining a good clustering solution

[Two scatter panels (expression in condition 1 vs. condition 2) with the red and green cluster centers marked]

The k-means approach: Clustering of n observations/points into k clusters is ‘good’ if each observation is assigned to the cluster with the nearest mean/center.
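In code, this notion of a ‘good’ solution can be checked directly: every point must be assigned to its nearest cluster center. A minimal sketch in NumPy (illustrative; the function name and array layout are assumptions, not from the slides):

```python
import numpy as np

def is_good_clustering(points, labels, centers):
    """True iff each observation is assigned to the cluster with the
    nearest mean/center -- the k-means notion of a 'good' solution."""
    # Distance from every point to every center (n_points x k matrix).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Each point's nearest center must be the one it is labeled with.
    return bool(np.all(dists.argmin(axis=1) == labels))
```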

SLIDE 11

Defining a good clustering solution

[Scatter plot: condition 1 vs. condition 2]

SLIDE 12

Defining a good clustering solution

[Four scatter panels (condition 1 vs. condition 2) comparing candidate clusterings; one is marked with an X]

SLIDE 13
K-means clustering

  • An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center

[Scatter plot with cluster 1 and cluster 2 centers (means) marked]

SLIDE 14

But how do we find a clustering solution with this property?

SLIDE 15
K-means clustering: Chicken and egg?

  • An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
  • Note the two components of this definition:
  • Partitioning of n points into clusters
  • Clusters’ means
  • A chicken and egg problem:

I do not know the means before I determine the partitioning.
I do not know the partitioning before I determine the means.

SLIDE 16

The k-means clustering algorithm

An iterative approach

  • Key principle - cluster around mobile centers:

Start with some random locations of means/centers, partition into clusters according to these centers, then correct the centers according to the clusters, and repeat [similar to EM (expectation-maximization) algorithms]

SLIDE 17
K-means clustering algorithm

  • The number of centers, k, has to be specified a priori
  • Algorithm:
  1. Arbitrarily select k initial centers
  2. Assign each element to the closest center
  3. Re-calculate centers (mean position of the assigned elements)
  4. Repeat 2 and 3 until …
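A minimal sketch of this loop in Python/NumPy (illustrative, not from the slides; initializing from k random data points and stopping when the labels stabilize are assumptions made concrete here):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=None):
    """Minimal k-means: assign to nearest center, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # 1. Arbitrarily select k initial centers (here: k random data points).
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # 2. Assign each element to the closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4. Stop when the clusters are the same as in the previous iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. Re-calculate centers (mean position of the assigned elements).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels
```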

SLIDE 18
K-means clustering algorithm

  • The number of centers, k, has to be specified a priori
  • Algorithm:
  1. Arbitrarily select k initial centers
  2. Assign each element to the closest center
  3. Re-calculate centers (mean position of the assigned elements)
  4. Repeat 2 and 3 until one of the following termination conditions is reached:
     i.  The clusters are the same as in the previous iteration (stable solution)
     ii. The clusters are the same as in some previous iteration (a cycle)
     iii. The difference between two iterations is small (?)
     iv. The maximum number of iterations has been reached
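These termination conditions could be checked as follows (a sketch, not from the slides; the cycle check stores past labelings as tuples, and condition iii is reduced to a user-chosen tolerance on how far the centers moved in the last iteration, which the slide leaves open):

```python
import numpy as np

def should_stop(labels, prev_labels, seen, shift, iteration,
                tol=1e-4, max_iter=100):
    """Return True if any of the four termination conditions holds.
    `seen` is a caller-maintained set of past labelings; `shift` is the
    total distance the centers moved in the last iteration."""
    if prev_labels is not None and np.array_equal(labels, prev_labels):
        return True          # i.   stable solution
    if tuple(labels) in seen:
        return True          # ii.  same clusters as in some previous iteration (cycle)
    if shift < tol:
        return True          # iii. difference between two iterations is small
    if iteration >= max_iter:
        return True          # iv.  maximum number of iterations reached
    seen.add(tuple(labels))
    return False
```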

SLIDE 19
K-means clustering algorithm

(Same algorithm and termination conditions as on the previous slide.)

How can we do this efficiently?

SLIDE 20
Assigning elements to the closest center

  • Could be computationally intensive …

[Scatter plot of points with two centers, A and B]

SLIDE 21
Assigning elements to the closest center

  • Could be computationally intensive …
  • Preprocessing (by partitioning the space) can help

[Scatter plot of points with two centers, A and B]

SLIDE 22
Assigning elements to the closest center

  • Could be computationally intensive …
  • Preprocessing (by partitioning the space) can help

[The plane is divided by the A–B bisector: one region is closer to A than to B, the other closer to B than to A]
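Before any preprocessing, the brute-force version of this step is simple to write down. A sketch in NumPy (the data and the two centers are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1000, 2))          # n points in the plane
centers = np.array([[0.25, 0.25],       # center A
                    [0.75, 0.75]])      # center B

# Distance from every point to every center, then pick the nearest:
# O(n * k) work per iteration, which motivates the space partitioning.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
assignment = dists.argmin(axis=1)       # 0 = closer to A, 1 = closer to B
```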

SLIDE 23

Assigning elements to the closest center

  • Could be computationally intensive …
  • Preprocessing (by partitioning the space) can help

[A third center C is added; bisectors now mark regions closer to A than to B, closer to B than to A, and closer to B than to C]
SLIDE 24

Assigning elements to the closest center

  • Could be computationally intensive …
  • Preprocessing (by partitioning the space) can help

[The complete partition: cells of points closest to A, closest to B, and closest to C]
SLIDE 25

Assigning elements to the closest center

  • Could be computationally intensive …
  • Preprocessing (by partitioning the space) can help

[The partition around centers A, B, and C, with each point assigned to its cell]
SLIDE 26
Voronoi diagram

  • Decomposition of a metric space determined by distances to a specified discrete set of “centers” in the space (each colored cell represents the collection of all points in the space that are closer to a specific center than to any other)
  • Several algorithms exist to find the Voronoi diagram
  • Numerous applications (e.g., the 1854 Broad Street cholera outbreak in Soho, England, aviation, and many others)
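SciPy can compute and plot a Voronoi diagram directly; a minimal sketch (the ten random centers are just for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

rng = np.random.default_rng(0)
centers = rng.random((10, 2))    # the discrete set of "centers"

vor = Voronoi(centers)           # computes the cells, ridges, and vertices
voronoi_plot_2d(vor)             # each cell: points closer to its center than to any other
plt.show()
```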

SLIDE 27
K-means clustering algorithm

(Same algorithm and termination conditions as before; step 2, assigning each element to the closest center, can now be done efficiently using a Voronoi diagram.)

SLIDE 28

K-means clustering example

  • Two sets of points randomly generated:
  • 200 centered on (0,0)
  • 50 centered on (1,1)
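This setup is easy to reproduce; a sketch with scikit-learn (the spread of the two clouds is an assumption, since the slides do not state it):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(200, 2)),  # 200 points around (0,0)
    rng.normal(loc=(1, 1), scale=0.3, size=(50, 2)),   # 50 points around (1,1)
])

# init="random" mimics the slides: two points randomly chosen as centers.
km = KMeans(n_clusters=2, init="random", n_init=1, random_state=0).fit(points)
print(km.cluster_centers_)   # the final, stable centers
print(km.n_iter_)            # iterations until convergence
```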
SLIDE 29

K-means clustering example

  • Two points are randomly chosen as centers (stars)

SLIDE 30

K-means clustering example

  • Each dot can now be assigned to the cluster with the closest center

SLIDE 31

K-means clustering example

  • First partition into clusters

SLIDE 32
K-means clustering example

  • Centers are re-calculated

SLIDE 33

K-means clustering example

  • And are again used to partition the points

SLIDE 34

K-means clustering example

  • Second partition into clusters

SLIDE 35

K-means clustering example

  • Re-calculating centers again

SLIDE 36

K-means clustering example

  • And we can again partition the points

SLIDE 37

K-means clustering example

  • Third partition into clusters

SLIDE 38

K-means clustering example

  • After 6 iterations, the calculated centers remain stable

SLIDE 39

K-means clustering: Summary

  • The convergence of k-means is usually quite fast (sometimes one iteration results in a stable solution)
  • K-means is time- and memory-efficient
  • Strengths:
  • Simple to use
  • Fast
  • Can be used with very large data sets
  • Weaknesses:
  • The number of clusters has to be predetermined
  • The results may vary depending on the initial choice of centers
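The second weakness is commonly mitigated (though not covered on this slide) by restarting from several random initializations and keeping the best run. In scikit-learn this is the n_init parameter, for example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

points, _ = make_blobs(n_samples=250, centers=2, random_state=0)  # toy data

# Run the full algorithm from 10 random starts; keep the run with the
# lowest within-cluster sum of squares (inertia).
km = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(points)
print(km.inertia_)   # objective value of the best run
```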

SLIDE 40

K-means clustering: Variations

  • Expectation-maximization (EM): maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
  • k-means++: attempts to choose better starting points.
  • Some variations attempt to escape local optima by swapping points between clusters.
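A sketch of how the first two variations look in practice with scikit-learn (the synthetic dataset is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

points, _ = make_blobs(n_samples=250, centers=2, random_state=0)

# k-means++ seeding: spreads the initial centers apart before iterating
# (this is scikit-learn's default initialization).
km = KMeans(n_clusters=2, init="k-means++", n_init=1, random_state=0).fit(points)

# EM with a Gaussian mixture: probabilistic assignments and full
# multivariate Gaussians instead of plain means.
gm = GaussianMixture(n_components=2, random_state=0).fit(points)
print(gm.predict_proba(points[:3]))   # soft cluster memberships
```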

SLIDE 41

An important take-home message

[Figure (D’haeseleer, 2005) contrasting a hierarchical clustering result with a k-means clustering result, marked with a “?”]

SLIDE 42

What else are we missing?

SLIDE 43
What else are we missing?

  • What if the clusters are not “linearly separable”? (A sketch of one alternative follows.)
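k-means draws linear (Voronoi) boundaries between centers, so it cannot separate, say, two interleaving half-moons. One standard alternative, not covered in these slides, is spectral clustering; a sketch with scikit-learn:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half-moons: not linearly separable.
points, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# k-means cuts the moons with a straight boundary (a poor fit) ...
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# ... while spectral clustering follows the nonlinear structure by
# clustering on a nearest-neighbor similarity graph instead of raw distance.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors", random_state=0)
sc_labels = sc.fit_predict(points)
```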

SLIDE 44
SLIDE 45

Defining a good clustering solution: The k-means approach

[Scatter plot: condition 1 vs. condition 2]

SLIDE 46

Defining a good clustering solution: The k-means approach

[Four scatter panels: condition 1 vs. condition 2]