Clustering Expression Data
www.cs.washington.edu/527

Clustering Expression Data
• Why cluster gene expression data?
  – Tissue classification
  – Find biologically related genes
  – First step in inferring regulatory networks
  – Look for common promoter elements
  – Hypothesis generation
  – One of the tools of choice for expression analysis
• What has been done?
  – Hierarchical average-link [Eisen et al. 98]
  – Self-Organizing Maps (SOM) [Tamayo et al. 99]
  – CAST [Ben-Dor et al. 99]
  – Support Vector Machines (SVM) [Grundy et al. 00]
  – etc., etc., etc.
• Why so many methods?
  – Clustering is NP-hard, even with simple objectives and data
  – Hard problem: high dimensionality, noise, …
  – ∴ many heuristic, local search, & approximation algorithms
  – No clear winner

Clustering Algorithms
• Partitional
  – k-means, variously initialized (Hartigan 1975)
  – CAST (Ben-Dor et al. 1999)
• Hierarchical
  – single-link
  – average-link
  – complete-link
• Random (as a control)
  – Randomly assign genes to clusters
• Others
Clustering 101
Ka Yee Yeung
Center for Expression Arrays
University of Washington

The following slides are largely from http://staff.washington.edu/kayee/research.html. Errors are mine.

Overview
• What is clustering?
• Similarity/distance metrics
• Hierarchical clustering algorithms
  – Made popular by Stanford, i.e. [Eisen et al. 1998]
• K-means
  – Made popular by many groups, e.g. [Tavazoie et al. 1999]
• Self-organizing map (SOM)
  – Made popular by Whitehead, i.e. [Tamayo et al. 1999]

What is clustering?
• Group similar objects together
• Objects in the same cluster (group) are more similar to each other than objects in different clusters
• Data exploratory tool

How to define similarity?
[Figure: an n-genes × p-experiments raw data matrix (rows such as X and Y are gene profiles) is converted into an n × n similarity matrix]
• Similarity metric:
  – A measure of pairwise similarity or dissimilarity
  – Examples:
    • Correlation coefficient
    • Euclidean distance
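A minimal sketch of the raw-matrix-to-similarity-matrix step described above, assuming NumPy and a tiny made-up expression matrix (rows = genes, columns = experiments); the data and variable names are illustrative, not from the slides or the cited papers.

```python
# Minimal sketch: turn an n-genes x p-experiments raw matrix into an
# n x n similarity (correlation) and dissimilarity (Euclidean) matrix.
# The toy data below is illustrative only.
import numpy as np

raw = np.array([            # rows = genes, columns = experiments
    [ 1.0, 0.0, -1.0, 0.0],
    [ 3.0, 2.0,  1.0, 2.0],
    [-1.0, 0.0,  1.0, 0.0],
])

# Pairwise Pearson correlation between gene profiles (rows).
similarity = np.corrcoef(raw)

# Pairwise Euclidean distance between gene profiles.
diff = raw[:, None, :] - raw[None, :, :]
dissimilarity = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(similarity, 2))     # n x n correlation matrix
print(np.round(dissimilarity, 2))  # n x n distance matrix
```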
Similarity metrics
• Euclidean distance:
  $d(X, Y) = \sqrt{\sum_{j=1}^{p} (X[j] - Y[j])^2}$
• Correlation coefficient:
  $\rho(X, Y) = \dfrac{\sum_{j=1}^{p} (X[j] - \bar{X})(Y[j] - \bar{Y})}{\sqrt{\sum_{j=1}^{p} (X[j] - \bar{X})^2}\,\sqrt{\sum_{j=1}^{p} (Y[j] - \bar{Y})^2}}$, where $\bar{X} = \frac{1}{p}\sum_{j=1}^{p} X[j]$

Example
Four expression profiles over p = 4 experiments:
  X = ( 1, 0, -1, 0)
  Y = ( 3, 2,  1, 2)
  Z = (-1, 0,  1, 0)
  W = ( 2, 0, -2, 0)
[Figure: the four profiles plotted across the 4 experiments, values ranging from -3 to 4]
  Correlation(X,Y) =  1    Distance(X,Y) = 4
  Correlation(X,Z) = -1    Distance(X,Z) = 2.83
  Correlation(X,W) =  1    Distance(X,W) = 1.41

Lessons from the example
• Correlation – direction only
• Euclidean distance – magnitude & direction
• Min # attributes (experiments) to compute pairwise similarity:
  – >= 2 attributes for Euclidean distance
  – >= 3 attributes for correlation
• Array data is noisy ⇒ need many experiments to robustly estimate pairwise similarity

Clustering algorithms
• Inputs:
  – Raw data matrix or similarity matrix
  – Number of clusters or some other parameters
• Many different classifications of clustering algorithms:
  – Hierarchical vs. partitional
  – Heuristic-based vs. model-based
  – Soft vs. hard
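As a quick numerical check of the similarity example above, the sketch below recomputes the correlation and Euclidean values for X, Y, Z, and W; the use of NumPy and the helper-function names are assumptions, not part of the slides.

```python
# Reproduces the slide's example values for correlation vs. Euclidean distance.
import numpy as np

X = np.array([ 1, 0, -1, 0], dtype=float)
Y = np.array([ 3, 2,  1, 2], dtype=float)
Z = np.array([-1, 0,  1, 0], dtype=float)
W = np.array([ 2, 0, -2, 0], dtype=float)

def correlation(a, b):
    """Pearson correlation coefficient, as defined on the slide."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

def euclidean(a, b):
    """Euclidean distance, as defined on the slide."""
    return np.sqrt(((a - b) ** 2).sum())

for name, v in [("Y", Y), ("Z", Z), ("W", W)]:
    print(f"Correlation(X,{name}) = {correlation(X, v):+.0f}   "
          f"Distance(X,{name}) = {euclidean(X, v):.2f}")
# Correlation(X,Y) = +1   Distance(X,Y) = 4.00
# Correlation(X,Z) = -1   Distance(X,Z) = 2.83
# Correlation(X,W) = +1   Distance(X,W) = 1.41
```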
Hierarchical Clustering [Hartigan 1975]
• Agglomerative (bottom-up)
• Algorithm:
  – Initialize: each item a cluster
  – Iterate:
    • select two most similar clusters
    • merge them
  – Halt: when required number of clusters is reached
[Figure: dendrogram built up by successive merges]

Hierarchical: Single Link
• cluster similarity = similarity of two most similar members
+ Fast
– Potentially long and skinny clusters

Example: single link
Initial distance matrix for items 1–5:

       1   2   3   4   5
  1    0
  2    2   0
  3    6   3   0
  4   10   9   7   0
  5    9   8   5   4   0

Merge 1 and 2 (distance 2); update distances to the new cluster:
  d((1,2),3) = min{ d(1,3), d(2,3) } = min{ 6, 3 }  = 3
  d((1,2),4) = min{ d(1,4), d(2,4) } = min{ 10, 9 } = 9
  d((1,2),5) = min{ d(1,5), d(2,5) } = min{ 9, 8 }  = 8

        (1,2)   3   4   5
  (1,2)   0
  3       3     0
  4       9     7   0
  5       8     5   4   0

Example: single link (continued)
Merge (1,2) and 3 (distance 3); update:
  d((1,2,3),4) = min{ d((1,2),4), d(3,4) } = min{ 9, 7 } = 7
  d((1,2,3),5) = min{ d((1,2),5), d(3,5) } = min{ 8, 5 } = 5

          (1,2,3)   4   5
  (1,2,3)    0
  4          7      0
  5          5      4   0

[Dendrogram: merge heights shown on the vertical axis, 1–5]
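A minimal sketch of the agglomerative loop above, run with single-link distances on the example matrix; the dictionary representation and helper names are illustrative choices, and the final merge it prints corresponds to the step shown on the next slide.

```python
# Naive agglomerative single-link clustering on the 5-item example matrix.
import itertools

# D[(i, j)] = distance between items i and j (i < j), from the example matrix.
D = {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 3, (2, 4): 9, (2, 5): 8,
    (3, 4): 7, (3, 5): 5,
    (4, 5): 4,
}

def dist(a, b):
    """Single-link distance: minimum over all cross-cluster item pairs."""
    return min(D[tuple(sorted((i, j)))] for i in a for j in b)

clusters = [frozenset([i]) for i in range(1, 6)]
while len(clusters) > 1:
    # Select the two most similar (closest) clusters ...
    a, b = min(itertools.combinations(clusters, 2), key=lambda p: dist(*p))
    print(f"merge {sorted(a)} and {sorted(b)} at distance {dist(a, b)}")
    # ... and merge them.
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
# merge [1] and [2] at distance 2
# merge [3] and [1, 2] at distance 3
# merge [4] and [5] at distance 4
# merge [1, 2, 3] and [4, 5] at distance 5
```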
Example: single link (conclusion)
Merge 4 and 5 (distance 4); update:
  d((1,2,3),(4,5)) = min{ d((1,2,3),4), d((1,2,3),5) } = min{ 7, 5 } = 5
The two remaining clusters, (1,2,3) and (4,5), merge at distance 5.
[Dendrogram of the full merge sequence; sometimes drawn to scale]

Hierarchical: Complete Link
• cluster similarity = similarity of two least similar members
+ tight clusters
– slow

Example: complete link
Starting from the same distance matrix, merge 1 and 2 (distance 2); update:
  d((1,2),3) = max{ d(1,3), d(2,3) } = max{ 6, 3 }  = 6
  d((1,2),4) = max{ d(1,4), d(2,4) } = max{ 10, 9 } = 10
  d((1,2),5) = max{ d(1,5), d(2,5) } = max{ 9, 8 }  = 9

        (1,2)   3   4   5
  (1,2)   0
  3       6     0
  4      10     7   0
  5       9     5   4   0

Example: complete link (continued)
Merge 4 and 5 (distance 4); update:
  d((1,2),(4,5)) = max{ d((1,2),4), d((1,2),5) } = max{ 10, 9 } = 10
  d(3,(4,5))     = max{ d(3,4), d(3,5) }         = max{ 7, 5 }  = 7

         (1,2)   3   (4,5)
  (1,2)    0
  3        6     0
  (4,5)   10     7    0
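As a cross-check, the complete-link example can be reproduced with SciPy's hierarchical clustering routines (an external library, not part of the slides); the merge heights it reports, 2, 4, 6, 10, match the worked steps, with the last two corresponding to the slide that follows.

```python
# Cross-check of the complete-link example with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Full symmetric distance matrix for items 1..5 from the example.
D = np.array([
    [ 0,  2,  6, 10,  9],
    [ 2,  0,  3,  9,  8],
    [ 6,  3,  0,  7,  5],
    [10,  9,  7,  0,  4],
    [ 9,  8,  5,  4,  0],
], dtype=float)

# linkage() expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(D), method='complete')
print(Z[:, 2])   # merge heights: 2, 4, 6, 10
```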
Example: complete link (conclusion)
Merge (1,2) and 3 (distance 6); update:
  d((1,2,3),(4,5)) = max{ d((1,2),(4,5)), d(3,(4,5)) } = max{ 10, 7 } = 10
The two remaining clusters, (1,2,3) and (4,5), merge at distance 10.
[Dendrogram of the full merge sequence]

Hierarchical: Average Link
• cluster similarity = average similarity of all pairs
+ tight clusters
– slow

Example: average link
Starting from the same distance matrix, merge 1 and 2 (distance 2); update:
  d((1,2),3) = (d(1,3) + d(2,3)) / 2 = (6 + 3) / 2  = 4.5
  d((1,2),4) = (d(1,4) + d(2,4)) / 2 = (10 + 9) / 2 = 9.5
  d((1,2),5) = (d(1,5) + d(2,5)) / 2 = (9 + 8) / 2  = 8.5

        (1,2)   3     4   5
  (1,2)   0
  3       4.5   0
  4       9.5   7     0
  5       8.5   5     4   0

Example: average link (continued)
Merge 4 and 5 (distance 4); update:
  d((1,2),(4,5)) = (d(1,4) + d(1,5) + d(2,4) + d(2,5)) / 4 = (10 + 9 + 9 + 8) / 4 = 9
  d(3,(4,5))     = (d(3,4) + d(3,5)) / 2                   = (7 + 5) / 2          = 6

         (1,2)   3     (4,5)
  (1,2)    0
  3        4.5   0
  (4,5)    9     6      0
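The same SciPy cross-check applies here, assuming its 'average' (UPGMA) method matches the slide's "average similarity of all pairs" definition; it reports merge heights 2, 4, 4.5, and 8, where the last two merges continue the example beyond the steps shown above.

```python
# Cross-check of the average-link example with SciPy's 'average' (UPGMA) method.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Same distance matrix as in the single- and complete-link examples.
D = np.array([
    [ 0,  2,  6, 10,  9],
    [ 2,  0,  3,  9,  8],
    [ 6,  3,  0,  7,  5],
    [10,  9,  7,  0,  4],
    [ 9,  8,  5,  4,  0],
], dtype=float)

Z = linkage(squareform(D), method='average')
print(Z[:, 2])   # merge heights: 2, 4, 4.5, 8
```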