Clustering Hierarchical clustering and k-mean clustering Genome - PowerPoint PPT Presentation

Clustering: hierarchical clustering and k-means clustering
Genome 373: Genomic Informatics
Elhanan Borenstein

A quick review
• The clustering problem: partition genes into distinct sets with high homogeneity and high separation
• Many different representations
• Many possible distance metrics
• Metric matters
• Homogeneity vs. separation

The clustering problem
• A good clustering solution should have two features:
  1. High homogeneity: homogeneity measures the similarity between genes assigned to the same cluster.
  2. High separation: separation measures the distance/dissimilarity between clusters. (If two clusters have similar expression patterns, then they should probably be merged into one cluster.)
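The two features above can be made concrete in more than one way. The sketch below is one possible reading, not the definition used in the lecture: it assumes Euclidean distance, scores homogeneity as the mean distance between points in the same cluster (smaller is more homogeneous), and scores separation as the mean distance between points in different clusters (larger is better separated). The function names and the toy expression profiles are hypothetical.

```python
from itertools import combinations
from math import dist


def mean_within_distance(clusters):
    """Mean distance over all pairs of points that share a cluster.

    Assumes at least one cluster contains two or more points.
    """
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_between_distance(clusters):
    """Mean distance over all pairs of points that lie in different clusters.

    Assumes at least two non-empty clusters.
    """
    pairs = [
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    ]
    return sum(pairs) / len(pairs)


# Hypothetical two-condition expression profiles, already grouped into clusters.
clusters = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(4.0, 0.0), (4.0, 1.0)],
]

print("within-cluster distance: ", mean_within_distance(clusters))
print("between-cluster distance:", mean_between_distance(clusters))
```

A clustering with a small within-cluster value and a large between-cluster value scores well on both features under these assumed definitions.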

The “philosophy” of clustering
• An “unsupervised learning” problem
• No single solution is necessarily the true/correct one!
• There is usually a tradeoff between homogeneity and separation:
  • More clusters → increased homogeneity but decreased separation
  • Fewer clusters → increased separation but reduced homogeneity
• Method matters; metric matters; definitions matter.
• There are many formulations of the clustering problem; most of them are NP-hard (why?).
• In most cases, heuristic methods or approximations are used.
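The slide leaves the "(why?)" open. The sketch below is not a proof of NP-hardness, only a piece of intuition for why exhaustive search is off the table: the number of ways to split n labeled objects into k non-empty clusters (the Stirling number of the second kind) grows explosively with n. The function name `count_partitions` is a hypothetical choice.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_partitions(n, k):
    """Number of ways to split n labeled objects into k non-empty clusters."""
    if n == k:
        return 1          # every object alone (also covers n == k == 0)
    if k == 0 or k > n:
        return 0          # impossible configurations
    # The n-th object either joins one of the k existing clusters
    # or starts a new cluster of its own.
    return k * count_partitions(n - 1, k) + count_partitions(n - 1, k - 1)


for n in (10, 20, 50, 100):
    print(n, "objects into 3 clusters:", count_partitions(n, 3))
```

Even for a modest number of genes the count is far too large to enumerate, which is one reason heuristics such as hierarchical clustering and k-means are used in practice.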

One problem, numerous solutions
• Many algorithms:
  • Hierarchical clustering
  • k-means
  • Self-organizing maps (SOM)
  • kNN
  • PCC
  • CAST
  • CLICK
• The results (i.e., the obtained clusters) can vary drastically depending on:
  • Clustering method
  • Parameters specific to each clustering method (e.g., number of centers for the k-means method, agglomeration rule for hierarchical clustering, etc.)

Hierarchical clustering

Hierarchical clustering
• An agglomerative clustering method
• Takes as input a distance matrix
• Progressively regroups the closest objects/groups
• The result is a tree: intermediate nodes represent clusters
• Branch lengths represent distances between clusters

Distance matrix
            object 1  object 2  object 3  object 4  object 5
object 1      0.00      4.00      6.00      3.50      1.00
object 2      4.00      0.00      6.00      2.00      4.50
object 3      6.00      6.00      0.00      5.50      6.50
object 4      3.50      2.00      5.50      0.00      4.00
object 5      1.00      4.50      6.50      4.00      0.00

[Figure: tree representation of objects 1-5, with the leaf nodes, internal nodes c1-c4, a branch, and the root labeled.]

mmm… Déjà vu anyone?

Hierarchical clustering algorithm
1. Assign each object to a separate cluster.
2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.
3. Repeat 2 until there is a single cluster.

Hierarchical clustering
1. Assign each object to a separate cluster.
2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.
3. Repeat 2 until there is a single cluster.
• One needs to define a (dis)similarity metric between two groups. There are several possibilities:
  • Average linkage: the average distance between objects from groups A and B
  • Single linkage: the distance between the closest objects from groups A and B
  • Complete linkage: the distance between the most distant objects from groups A and B
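The three steps and the three linkage rules can be sketched in a few lines. This is a minimal, unoptimized illustration rather than the implementation behind the lecture's figures: the matrix `D` is the five-object distance matrix from the earlier slide, while the names `LINKAGE` and `agglomerate` are hypothetical. Indices are 0-based, so index 0 is object 1.

```python
# Distance matrix from the slide (objects 1-5 become indices 0-4).
D = [
    [0.00, 4.00, 6.00, 3.50, 1.00],
    [4.00, 0.00, 6.00, 2.00, 4.50],
    [6.00, 6.00, 0.00, 5.50, 6.50],
    [3.50, 2.00, 5.50, 0.00, 4.00],
    [1.00, 4.50, 6.50, 4.00, 0.00],
]

# Each rule turns the list of object-to-object distances between two groups
# into a single group-to-group distance.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerate(dist, linkage="average"):
    """Return the merges, in order, as (group_a, group_b, distance)."""
    rule = LINKAGE[linkage]
    clusters = [(i,) for i in range(len(dist))]    # 1. one cluster per object
    merges = []
    while len(clusters) > 1:                       # 3. repeat until one cluster
        best = None
        for i in range(len(clusters)):             # 2. find the closest pair
            for j in range(i + 1, len(clusters)):
                d = rule([dist[a][b] for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges


for name in LINKAGE:
    print(name, [d for _, _, d in agglomerate(D, name)])
```

With this matrix all three rules merge the same pairs in the same order (objects 1 and 5 first, then objects 2 and 4), but the distances at which the larger groups join differ, which is what changes the branch lengths of the tree.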

Impact of the agglomeration rule
• These four trees were built from the same distance matrix, using 4 different agglomeration rules.
• Single linkage typically creates nesting clusters.
• Complete linkage creates more balanced trees.
• Note: these trees were computed from a matrix of random numbers. The impression of structure is thus a complete artifact.

Hierarchical clustering result: five clusters

K-means clustering: divisive, non-hierarchical

K-means clustering
• An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
[Figure: two clusters of points, labeled "cluster_1 mean" and "cluster_2 mean".]
• Note that this is a somewhat strange definition:
  • Assignment of a point to a cluster is based on the proximity of the point to the cluster mean
  • But the cluster mean is calculated based on all the points assigned to the cluster.

K-means clustering: chicken and egg
• An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
• The chicken and egg problem:
  • I do not know the means before I determine the partitioning into clusters
  • I do not know the partitioning into clusters before I determine the means
• Key principle: cluster around mobile centers
  • Start with some random locations of means/centers, partition into clusters according to these centers, and then correct the centers according to the clusters (somewhat similar to the expectation-maximization algorithm)

K-means clustering algorithm
• The number of centers, k, has to be specified a priori
• Algorithm (how can we do this efficiently?):
  1. Arbitrarily select k initial centers
  2. Assign each element to the closest center
  3. Re-calculate centers (mean position of the assigned elements)
  4. Repeat 2 and 3 until one of the following termination conditions is reached:
     i. The clusters are the same as in the previous iteration
     ii. The difference between two iterations is smaller than a specified threshold
     iii. The maximum number of iterations has been reached

Partitioning the space
• Assigning elements to the closest center
[Figure sequence: with two centers A and B, the space splits into the region closer to A than to B and the region closer to B than to A; adding a third center C yields three regions: closest to A, closest to B, and closest to C.]

Voronoi diagram
• Decomposition of a metric space determined by distances to a specified discrete set of “centers” in the space
• Each colored cell represents the collection of all points in this space that are closer to a specific center s than to any other center
• Several algorithms exist to find the Voronoi diagram.
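For the assignment step of k-means, the diagram itself never has to be constructed: finding which Voronoi cell a point falls in amounts to finding its nearest center. The sketch below does this by brute force, checking every center for each point, and assumes Euclidean distance; the function name `nearest_center` and the coordinates of centers A, B, and C are hypothetical.

```python
from math import dist


def nearest_center(point, centers):
    """Index of the center closest to `point`, i.e. the Voronoi cell it lies in.

    Ties go to the center with the lowest index.
    """
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


# Hypothetical positions for the three centers A, B, and C.
centers = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
names = "ABC"

for point in [(0.5, 0.2), (3.5, 0.1), (2.0, 2.5)]:
    print(point, "is closest to", names[nearest_center(point, centers)])
```

This costs one distance computation per center for every point, which is usually acceptable when k is small; spatial index structures are an option when it is not.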

K-means clustering algorithm
• The number of centers, k, has to be specified a priori
• Algorithm:
  1. Arbitrarily select k initial centers
  2. Assign each element to the closest center (Voronoi)
  3. Re-calculate centers (mean position of the assigned elements)
  4. Repeat 2 and 3 until one of the following termination conditions is reached:
     i. The clusters are the same as in the previous iteration
     ii. The difference between two iterations is smaller than a specified threshold
     iii. The maximum number of iterations has been reached
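The four steps translate fairly directly into code. The following is a minimal sketch under a few assumptions that the slides do not specify: points are tuples of numbers, distance is Euclidean, the initial centers are drawn at random from the data, and an empty cluster simply keeps its previous center. Only termination conditions i and iii are implemented; the threshold test (condition ii) is left out for brevity. The function name `kmeans` and its parameters are hypothetical.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # 1. arbitrary initial centers
    labels = None
    for _ in range(max_iter):                    # iii. cap on iterations
        # 2. assign each element to the closest center
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:                 # i. clusters did not change
            break
        labels = new_labels
        # 3. re-calculate each center as the mean of its assigned elements
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                          # assumption: keep old center if empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

Which group ends up with label 0 depends on the random initial centers, so only the grouping itself is meaningful, not the label values.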

K-means clustering example
• Two sets of points randomly generated:
  • 200 centered on (0,0)
  • 50 centered on (1,1)
• Two points are randomly chosen as centers (stars)
• Each dot can now be assigned to the cluster with the closest center
• First partition into clusters
• Centers are re-calculated
• And are again used to partition the points
• Second partition into clusters
• Re-calculating centers again
• And we can again partition the points
• Third partition into clusters
• After 6 iterations: the calculated centers remain stable

K-means clustering: summary
• The convergence of k-means is usually quite fast (sometimes 1 iteration results in a stable solution)
• K-means is time- and memory-efficient
• Strengths:
  • Simple to use
  • Fast
  • Can be used with very large data sets
• Weaknesses:
  • The number of clusters has to be predetermined
  • The results may vary depending on the initial choice of centers

K-means clustering: variations
• Expectation-maximization (EM): maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
• k-means++: attempts to choose better starting points.
• Some variations attempt to escape local optima by swapping points between clusters
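As an illustration of the "better starting points" idea, the sketch below follows the usual description of k-means++ seeding: the first center is picked uniformly at random, and each further center is picked with probability proportional to its squared distance from the nearest center chosen so far. It assumes Euclidean distance and at least k distinct points; the function name `kmeanspp_init` and the toy data are hypothetical.

```python
import random
from math import dist


def kmeanspp_init(points, k, seed=0):
    """Pick k initial centers from `points`, favoring points far from earlier picks."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]               # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest already-chosen center.
        # Points that are already centers get weight 0 and cannot be re-picked.
        weights = [min(dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical toy data: two tight pairs of points far from each other.
points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
print(kmeanspp_init(points, k=2))
```

The returned centers would replace the arbitrary selection in step 1 of the algorithm; the remaining steps are unchanged.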

The take-home message
• Hierarchical clustering or k-means clustering? (D’haeseleer, 2005)

What else are we missing?
• What if the clusters are not “linearly separable”?

Clustering in both dimensions
• We can cluster genes, conditions (samples), or both.
