Clustering: Hierarchical clustering and k-means clustering



SLIDE 1

Clustering

Hierarchical clustering and k-means clustering

Genome 373 Genomic Informatics Elhanan Borenstein

SLIDE 2
  • The clustering problem: partition genes into distinct sets with high homogeneity and high separation

  • Many different representations
  • Many possible distance metrics
  • Metric matters
  • Homogeneity vs separation

A quick review

SLIDE 3
  • A good clustering solution should have two features:

1. High homogeneity: homogeneity measures the similarity between genes assigned to the same cluster.
2. High separation: separation measures the distance/dissimilarity between clusters. (If two clusters have similar expression patterns, they should probably be merged into one cluster.) A short sketch of computing both criteria follows below.

The clustering problem
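A minimal sketch, assuming NumPy, of one way the two criteria could be quantified for a given partition; the exact definitions vary, so the function names and formulas here are illustrative choices rather than the course's definitions:

```python
import numpy as np

def homogeneity(points, labels):
    """Average distance from each point to the mean of its own cluster (lower = more homogeneous)."""
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        total += np.linalg.norm(members - members.mean(axis=0), axis=1).sum()
    return total / len(points)

def separation(points, labels):
    """Average distance between cluster means (higher = better separated)."""
    means = np.array([points[labels == c].mean(axis=0) for c in np.unique(labels)])
    pairs = [(i, j) for i in range(len(means)) for j in range(i + 1, len(means))]
    return float(np.mean([np.linalg.norm(means[i] - means[j]) for i, j in pairs]))
```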

SLIDE 4
  • An “unsupervised learning” problem
  • No single solution is necessarily the true/correct one!
  • There is usually a tradeoff between homogeneity and separation:
  • More clusters → increased homogeneity but decreased separation
  • Fewer clusters → increased separation but reduced homogeneity
  • Method matters; metric matters; definitions matter
  • There are many formulations of the clustering problem; most of them are NP-hard (why?)
  • In most cases, heuristic methods or approximations are used.

The “philosophy” of clustering

SLIDE 5
  • Many algorithms:
  • Hierarchical clustering
  • k-means
  • self-organizing maps (SOM)
  • kNN
  • PCC
  • CAST
  • CLICK
  • The results (i.e., the obtained clusters) can vary drastically depending on:
  • Clustering method
  • Parameters specific to each clustering method (e.g., number of centers for the k-means method, agglomeration rule for hierarchical clustering, etc.)

One problem, numerous solutions

SLIDE 6

Hierarchical clustering

SLIDE 7
  • An agglomerative clustering method
  • Takes as input a distance matrix
  • Progressively regroups the closest objects/groups
  • The result is a tree - intermediate nodes represent clusters
  • Branch lengths represent distances between clusters

Hierarchical clustering

[Figure: tree (dendrogram) representation of the hierarchical clustering. The leaf nodes are Objects 1-5; the branch nodes c1-c4 are intermediate clusters; the root is the single cluster containing all objects.]

Tree representation

             Object 1   Object 2   Object 3   Object 4   Object 5
Object 1       0.00       4.00       6.00       3.50       1.00
Object 2       4.00       0.00       6.00       2.00       4.50
Object 3       6.00       6.00       0.00       5.50       6.50
Object 4       3.50       2.00       5.50       0.00       4.00
Object 5       1.00       4.50       6.50       4.00       0.00

Distance matrix
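As a concrete illustration, a minimal sketch (assuming SciPy and NumPy are available) that builds the tree from exactly this distance matrix; the linkage method and labels are illustrative choices:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# Pairwise distances between Objects 1-5, copied from the matrix above.
D = np.array([
    [0.00, 4.00, 6.00, 3.50, 1.00],
    [4.00, 0.00, 6.00, 2.00, 4.50],
    [6.00, 6.00, 0.00, 5.50, 6.50],
    [3.50, 2.00, 5.50, 0.00, 4.00],
    [1.00, 4.50, 6.50, 4.00, 0.00],
])

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="average")
print(Z)  # each row: indices of the two merged clusters, their distance, size of the new cluster

# Draws the tree (requires matplotlib).
dendrogram(Z, labels=[f"Object {i}" for i in range(1, 6)])
```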

SLIDE 8

mmm… Déjà vu anyone?

SLIDE 9
  • 1. Assign each object to a separate cluster.
  • 2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.
  • 3. Repeat 2 until there is a single cluster.

Hierarchical clustering algorithm

SLIDE 10
  • 1. Assign each object to a separate cluster.
  • 2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.
  • 3. Repeat 2 until there is a single cluster.

Hierarchical clustering algorithm

SLIDE 11

Hierarchical clustering

  • One needs to define a (dis)similarity metric between two groups. There are several possibilities (a from-scratch sketch follows below):
  • Average linkage: the average distance between objects from groups A and B
  • Single linkage: the distance between the closest objects from groups A and B
  • Complete linkage: the distance between the most distant objects from groups A and B

  • 1. Assign each object to a separate cluster.
  • 2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.
  • 3. Repeat 2 until there is a single cluster.
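A minimal from-scratch sketch of steps 1-3, assuming NumPy and a precomputed distance matrix D (a NumPy array such as the one above); the linkage rule can be any of the three definitions listed:

```python
import numpy as np

def group_distance(D, a, b, linkage="average"):
    """Distance between clusters a and b (lists of object indices)."""
    pair = D[np.ix_(a, b)]
    if linkage == "single":
        return pair.min()    # closest objects from the two groups
    if linkage == "complete":
        return pair.max()    # most distant objects from the two groups
    return pair.mean()       # average over all pairs of objects

def agglomerate(D, linkage="average"):
    # Step 1: assign each object to a separate cluster.
    clusters = [[i] for i in range(len(D))]
    merges = []
    # Step 3: repeat until there is a single cluster.
    while len(clusters) > 1:
        # Step 2: find the pair of clusters with the shortest distance...
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: group_distance(D, clusters[p[0]], clusters[p[1]], linkage))
        merges.append((clusters[i], clusters[j]))
        # ...and regroup them into a single cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] + clusters[j]]
    return merges  # the sequence of merges defines the tree
```

On the distance matrix above, for example, the first merge joins Objects 1 and 5 (distance 1.00); since the indices in the code are 0-based, this appears as clusters [0] and [4].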
SLIDE 12

Impact of the agglomeration rule

These four trees were built from the same distance matrix, using four different agglomeration rules.

Note: these trees were computed from a matrix of random numbers. The impression of structure is thus a complete artifact.

Single linkage typically creates nested clusters; complete linkage creates more balanced trees.

SLIDE 13

Hierarchical clustering result


Five clusters

SLIDE 14

K-means clustering

Divisive, non-hierarchical

SLIDE 15
  • An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
  • Note that this is a somewhat strange definition:
  • Assignment of a point to a cluster is based on the proximity of the point to the cluster mean
  • But the cluster mean is calculated based on all the points assigned to the cluster.

K-means clustering

[Figure: points partitioned into two clusters, with the cluster 1 mean and cluster 2 mean marked.]

SLIDE 16
  • An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
  • The chicken and egg problem:
  I do not know the means before I determine the partitioning into clusters.
  I do not know the partitioning into clusters before I determine the means.
  • Key principle - cluster around mobile centers:
  • Start with some random locations of means/centers, partition into clusters according to these centers, and then correct the centers according to the clusters (somewhat similar to the expectation-maximization algorithm)

K-means clustering: Chicken and egg

SLIDE 17
  • The number of centers, k, has to be specified a priori
  • Algorithm:
  • 1. Arbitrarily select k initial centers
  • 2. Assign each element to the closest center
  • 3. Re-calculate centers (mean position of the assigned elements)
  • 4. Repeat 2 and 3 until ….

K-means clustering algorithm

SLIDE 18
  • The number of centers, k, has to be specified a priori
  • Algorithm:
  • 1. Arbitrarily select k initial centers
  • 2. Assign each element to the closest center
  • 3. Re-calculate centers (mean position of the assigned elements)
  • 4. Repeat 2 and 3 until one of the following termination conditions is reached:
  i. The clusters are the same as in the previous iteration
  ii. The difference between two iterations is smaller than a specified threshold
  iii. The maximum number of iterations has been reached

K-means clustering algorithm

How can we do this efficiently?

SLIDE 19
  • Assigning elements to the closest center

Partitioning the space

[Figure: points in the plane with two centers, A and B.]

SLIDE 20
  • Assigning elements to the closest center

Partitioning the space

[Figure: the plane is split into the region of points closer to A than to B and the region closer to B than to A.]

SLIDE 21
  • Assigning elements to the closest center

Partitioning the space

[Figure: a third center C is added; boundary lines mark points closer to A than to B, closer to B than to A, and closer to B than to C.]

SLIDE 22
  • Assigning elements to the closest center

Partitioning the space

[Figure: the plane is partitioned into three cells: points closest to A, closest to B, and closest to C.]

SLIDE 23
  • Assigning elements to the closest center

Partitioning the space

[Figure: the resulting partition of the plane around centers A, B, and C.]

SLIDE 24
  • Decomposition of a metric space determined by distances to a specified discrete set of “centers” in the space
  • Each colored cell represents the collection of all points in this space that are closer to a specific center s than to any other center
  • Several algorithms exist to find the Voronoi diagram (the nearest-center assignment itself is sketched below).

Voronoi diagram
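A minimal sketch, assuming NumPy, of the nearest-center (Voronoi-cell) assignment that step 2 of k-means relies on; the array names and sizes are illustrative:

```python
import numpy as np

points = np.random.rand(250, 2)    # n points in the plane
centers = np.random.rand(3, 2)     # k centers (e.g., A, B, C)

# Squared distance from every point to every center, shape (n, k), via broadcasting.
d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Index of the closest center for each point: each point falls in that center's Voronoi cell.
assignment = d2.argmin(axis=1)
```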

SLIDE 25
  • The number of centers, k, has to be specified a priori
  • Algorithm:
  • 1. Arbitrarily select k initial centers
  • 2. Assign each element to the closest center (Voronoi)
  • 3. Re-calculate centers (mean position of the assigned elements)
  • 4. Repeat 2 and 3 until one of the following termination conditions is reached:
  i. The clusters are the same as in the previous iteration
  ii. The difference between two iterations is smaller than a specified threshold
  iii. The maximum number of iterations has been reached
  (A from-scratch sketch of this loop follows below.)

K-means clustering algorithm
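A minimal from-scratch sketch of this loop, assuming NumPy; the function and parameter names are illustrative, and a tolerance-based stop stands in for termination conditions i/ii:

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily select k initial centers (here, k distinct random points).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):                                  # condition iii
        # Step 2: assign each element to the closest center (Voronoi assignment).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: re-calculate centers as the mean position of the assigned elements.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Conditions i/ii: stop when the centers (and hence the clusters) barely move.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```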

SLIDE 26

K-means clustering example

  • Two sets of points randomly generated (sketched below):
  • 200 centered on (0,0)
  • 50 centered on (1,1)
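A minimal sketch of this setup, assuming NumPy and the kmeans() sketch above; the spread of the point clouds is an assumption, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(373)
cloud_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2))   # 200 points around (0,0)
cloud_b = rng.normal(loc=(1.0, 1.0), scale=0.5, size=(50, 2))    # 50 points around (1,1)
points = np.vstack([cloud_a, cloud_b])

centers, labels = kmeans(points, k=2)
print(centers)               # final cluster means (the "stars" in the following slides)
print(np.bincount(labels))   # number of points assigned to each cluster
```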
SLIDE 27

K-means clustering example

  • Two points are randomly chosen as centers (stars)

SLIDE 28

K-means clustering example

  • Each dot can now be assigned to the cluster with the closest center

SLIDE 29

K-means clustering example

  • First partition into clusters

SLIDE 30
  • Centers are re-calculated

K-means clustering example

SLIDE 31

K-means clustering example

  • The new centers are again used to partition the points

SLIDE 32

K-means clustering example

  • Second partition into clusters

SLIDE 33

K-means clustering example

  • Re-calculating centers again

SLIDE 34

K-means clustering example

  • And we can again partition the points

SLIDE 35

K-means clustering example

  • Third partition into clusters

SLIDE 36

K-means clustering example

  • After 6 iterations: the calculated centers remain stable

SLIDE 37

K-means clustering: Summary

  • The convergence of k-means is usually quite fast (sometimes one iteration results in a stable solution)
  • K-means is time- and memory-efficient
  • Strengths:
  • Simple to use
  • Fast
  • Can be used with very large data sets
  • Weaknesses:
  • The number of clusters has to be predetermined
  • The results may vary depending on the initial choice of centers (one common remedy is sketched below)
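Since the result can depend on the initial centers, a common remedy is to run the algorithm several times from different random starts and keep the best solution. A minimal sketch, assuming NumPy and the kmeans() sketch above:

```python
import numpy as np

def kmeans_restarts(points, k, n_restarts=10):
    best = None
    for seed in range(n_restarts):
        centers, labels = kmeans(points, k, seed=seed)
        # Within-cluster sum of squares: lower means more homogeneous clusters.
        wcss = ((points - centers[labels]) ** 2).sum()
        if best is None or wcss < best[0]:
            best = (wcss, centers, labels)
    return best[1], best[2]
```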

SLIDE 38

K-means clustering: Variations

  • Expectation-maximization (EM): maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
  • k-means++: attempts to choose better starting points (see the sketch below).
  • Some variations attempt to escape local optima by swapping points between clusters
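A minimal sketch, assuming NumPy, of the k-means++ idea mentioned above: each new starting center is sampled with probability proportional to its squared distance from the centers already chosen, which tends to spread the starting points out. The function name is illustrative:

```python
import numpy as np

def kmeans_pp_init(points, k, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    centers = [points[rng.integers(len(points))]]   # first center: chosen uniformly at random
    while len(centers) < k:
        # Squared distance from each point to its nearest already-chosen center.
        d2 = ((points[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                        # far-away points are more likely to be picked
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)
```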

SLIDE 39

The take-home message

D’haeseleer, 2005

Hierarchical clustering or K-means clustering?

SLIDE 40
SLIDE 41

What else are we missing?

SLIDE 42
  • What if the clusters are not “linearly separable”?

What else are we missing?

SLIDE 43

Clustering in both dimensions

  • We can cluster genes, conditions (samples), or both.