Comparative Clustering Analysis of Gene Expression Profiles
Qun Shan, Genetics Division, MCB Department, University of California at Berkeley
Charless Fowlkes, Serge Belongie (now at UCSD), Jitendra Malik, Computer Science Department, University of California at Berkeley
Take Home Messages
- Part I: The GeneCut Program
  - a global, hierarchical clustering program
  - based on normalized cut
- Part II: Application to Rosetta's dataset
  - Both GeneCut and the hierarchical clustering program used in the original paper capture essentially the same obvious clusters
  - Comparative clustering analysis provides an avenue for generating testable hypotheses
Challenges & Motivations
- Unknown ground truth and sparse data in a multidimensional space
- None of the numerous clustering programs is perfect
- Our solutions
  - provide a state-of-the-art clustering program
  - comparative clustering analysis
  - user knowledge of how these different clustering programs work
Related Work
- Hierarchical Clustering Program (Eisen et al., 1998) is the most popular
  - simple, fast, and free
  - nice data visualization interface
  - Heuristic algorithm: greedy and pairwise
- SOM (Tamayo et al., 1999)
- CLICK (Sharan & Shamir, 2000)
- Support Vector Machine (Brown, 2000)
Why Should We Care About GeneCut?
- GeneCut uses a state-of-the-art clustering algorithm (Kannan et al., 2000)
- GeneCut offers a global clustering method
- Comparative clustering analysis
GeneCut Does Pairwise Clustering
- Pairwise clustering methods are based directly on the distances between all pairs of feature vectors in the data set
- In contrast, central clustering, used by k-means-based clustering programs, requires a small number of prototypical feature vectors
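As a rough illustration of what "all pairs of feature vectors" means in practice, the sketch below builds a symmetric affinity matrix from a few expression profiles. The Gaussian kernel on Euclidean distance, the `sigma` parameter, the function name `pairwise_affinity`, and the toy profiles are all assumptions made for this example; the slides do not say which similarity measure GeneCut uses.

```python
import math

def pairwise_affinity(profiles, sigma=1.0):
    """Return a symmetric matrix W with W[u][v] = exp(-d(u, v)^2 / (2 sigma^2)).

    The Gaussian kernel is an illustrative choice, not GeneCut's documented one.
    """
    n = len(profiles)
    W = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u, n):
            # Squared Euclidean distance between the two feature vectors.
            d2 = sum((a - b) ** 2 for a, b in zip(profiles[u], profiles[v]))
            W[u][v] = W[v][u] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W

# Three made-up expression profiles (rows are genes, columns are experiments).
profiles = [[0.0, 0.1, 0.2], [0.1, 0.1, 0.3], [2.0, 1.9, 2.2]]
W = pairwise_affinity(profiles)
```

Every entry of `W` is computed from one pair of profiles, so no prototype vectors are needed; the price is a matrix that grows quadratically with the number of genes.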
Central Clustering Does Not Handle Transitivity Well
GeneCut does Global Clustering
Some Terminology for Graph Partitioning
- How do we bipartition a graph?
\[ \mathrm{assoc}(A, A') = \sum_{u \in A,\; v \in A'} W(u, v), \qquad A \text{ and } A' \text{ not necessarily disjoint} \]
\[ \mathrm{cut}(A, B) = \sum_{u \in A,\; v \in B} W(u, v), \qquad \text{with } A \cap B = \emptyset \]
Normalized Cut: A Measure of Dissimilarity
- Minimum cut is not appropriate since it favors cutting small pieces.
- Normalized Cut, Ncut:
\[ \mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)} \]
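The definitions of assoc, cut, and Ncut can be checked on a small example. The sketch below is a minimal illustration on a made-up seven-node graph (two tight triangles joined by a weak edge, plus one loosely attached node); the graph, the weights, and the function names are assumptions for this example only.

```python
def assoc(W, A, B):
    """Sum of W[u][v] over u in A and v in B (A and B may overlap)."""
    return sum(W[u][v] for u in A for v in B)

def ncut(W, A, B):
    """Normalized cut of a bipartition (A, B), with A and B disjoint."""
    V = list(A) + list(B)
    c = assoc(W, A, B)  # cut(A, B) is the same sum taken over disjoint sets
    return c / assoc(W, A, V) + c / assoc(W, B, V)

# Toy graph: triangles {0, 1, 2} and {3, 4, 5}, a weak bridge 2-3,
# and node 6 hanging off node 5 by an even weaker edge.
n = 7
W = [[0.0] * n for _ in range(n)]
for u, v, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.3), (5, 6, 0.2)]:
    W[u][v] = W[v][u] = w

balanced = ([0, 1, 2], [3, 4, 5, 6])   # splits the graph at the bridge
isolated = ([6], [0, 1, 2, 3, 4, 5])   # chops off the single loose node

cut_balanced = assoc(W, *balanced)
cut_isolated = assoc(W, *isolated)
ncut_balanced = ncut(W, *balanced)
ncut_isolated = ncut(W, *isolated)
```

On this graph the plain cut value is smaller for the partition that isolates node 6, while Ncut is smaller for the balanced split, which is the behavior the slide describes.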
Solving the Normalized Cut Problem
- The exact discrete solution to Ncut is NP-complete, even on a regular grid [Papadimitriou’97]
- Drawing on spectral graph theory, a good approximation can be obtained by solving a generalized eigenvalue problem.
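In the standard normalized cut formulation, the generalized eigenvalue problem is (D - W) y = λ D y, where D is the diagonal matrix of node degrees, and the eigenvector with the second smallest eigenvalue is thresholded to obtain the two groups. The sketch below illustrates this on a toy graph using plain power iteration; the solver choice, the function name `spectral_bipartition`, the threshold at zero, and the graph itself are assumptions for this example, not a description of how GeneCut is implemented. It assumes a connected graph in which every node has positive degree.

```python
import math

def spectral_bipartition(W, iters=200):
    """Split nodes by the sign of the second generalized eigenvector of
    (D - W) y = lambda * D y, found by power iteration (small graphs only)."""
    n = len(W)
    d = [sum(row) for row in W]               # node degrees
    s = [math.sqrt(x) for x in d]
    # M = D^{-1/2} W D^{-1/2}; the shift (M + I) / 2 moves eigenvalues into [0, 1]
    # so that power iteration converges to the largest remaining one.
    M = [[(W[i][j] / (s[i] * s[j]) + (1.0 if i == j else 0.0)) / 2.0
          for j in range(n)] for i in range(n)]
    norm = math.sqrt(sum(d))
    v1 = [x / norm for x in s]                # known top eigenvector D^{1/2} 1
    z = [float(i) for i in range(n)]          # deterministic starting vector
    for _ in range(iters):
        z = [sum(M[i][j] * z[j] for j in range(n)) for i in range(n)]
        dot = sum(a * b for a, b in zip(z, v1))
        z = [a - dot * b for a, b in zip(z, v1)]   # remove the trivial solution
        length = math.sqrt(sum(a * a for a in z))
        z = [a / length for a in z]
    y = [z[i] / s[i] for i in range(n)]       # map back: y = D^{-1/2} z
    return [1 if val > 0 else 0 for val in y]

# Toy graph: two triangles joined by one weak edge between nodes 2 and 3.
n = 6
W = [[0.0] * n for _ in range(n)]
for u, v, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    W[u][v] = W[v][u] = w

labels = spectral_bipartition(W)
```

Power iteration is used here only to keep the example dependency-free; for real data a sparse or dense eigensolver from a numerical library would be the usual choice.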
Approximating Using Random Samples
- Solving a large eigenvalue problem is computationally expensive
- An approximate solution is possible using a small subset of random samples → Nyström approximation
- Originally developed in 1928 for the solution of eigenfunction problems
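For reference, the standard matrix form of the Nyström extension can be written as follows. The block names A, B, C and the symbols U, Λ are notation introduced here, not taken from the slides, and this is the generic form of the approximation rather than a statement of GeneCut's exact procedure. Order the points so that the n randomly sampled ones come first, and partition the affinity matrix as

\[ W = \begin{bmatrix} A & B \\ B^{T} & C \end{bmatrix} \]

where A holds the affinities among the samples and B the affinities between the samples and the remaining points. Only A and B are computed; the large block C is approximated implicitly:

\[ \hat{W} = \begin{bmatrix} A \\ B^{T} \end{bmatrix} A^{-1} \begin{bmatrix} A & B \end{bmatrix} = \begin{bmatrix} A & B \\ B^{T} & B^{T} A^{-1} B \end{bmatrix} \]

If A = U Λ U^T is the eigendecomposition of the small block, the eigenvectors of the full matrix are approximated by extending U to the remaining points:

\[ \bar{U} = \begin{bmatrix} U \\ B^{T} U \Lambda^{-1} \end{bmatrix} \]

The extended vectors are not orthogonal in general, and applying this to normalized cut also requires estimating the degree normalization from the sampled blocks; both steps are omitted here.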
Summary
- GeneCut is based on normalized cut
- global, hierarchical clustering
- Recursive K-way partitioning
- Stable clustering results
- Nyström approximation
GeneCut: Web Interface
Rosetta Data Set
GeneCut: hierarchical trees
Ergosterol Cluster
- erg2, erg3, ERG11 (tet promoter), HMG2 (tet promoter), yer044c (haploid), Itraconazole, Lovastatin, Terbinafine, hmg1 (haploid)
Cell Wall Cluster
- yar014c, spf1, fks1 (haploid), anp1, 2-deoxy-D-glucose, glucosamine, swi4, swi5, gas1, Tunicamycin, yer083c
- Yar014C = BUD14: unknown function; however, it interacts with cell wall related proteins GLC7p and YOL154p
- Swi4p-Swi6p activates genes involved in cell wall biosynthesis
- FKS1 is a plasma membrane protein
Mitochondria Cluster
GeneCut: A Close Look
Genes in The Ergosterol Cluster
A Close Look at YER044C
YER044C May Share Functions with Genes in This Cluster:
Where Does YER044C Fit in the Ergosterol Pathway?
Proteolysis Model for Regulation of Ergosterol Biosynthesis (A. Vik, 2001)
Concluding Remarks
- GeneCut is a global clustering analysis program for gene expression data
- GeneCut is based on the normalized cut algorithm and incorporates features such as k-way clustering and the Nyström approximation
- Exploration of gene expression profiles through comparative clustering analysis