Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Karsten Borgwardt February 21 to March 4, 2011 Machine Learning & Computational Biology Research Group MPIs Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Clustering in bioinformatics: Microarrays
- Clustering is a widely used tool in microarray analysis.
- Class discovery is an important problem in microarray studies for two reasons: either the classes are completely unknown beforehand, or it is unknown whether a known class contains interesting subclasses.

Clustering in bioinformatics: Examples
- Classes unknown:
  - Does a disease affect gene expression in a particular tissue?
  - Does gene expression differ between two groups in a particular condition?
- Subclasses unknown:
  - Are there subtypes of a disease?
  - Is there even a hierarchy of subclasses within one disease?

Clustering in bioinformatics: Popularity
- Clustering tools are available in the large microarray database NCBI Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
- 3002 PubMed hits for 'microarray clustering'
- Recent editorial of OUP Bioinformatics

Distance metrics: Euclidean distance and Pearson's correlation
- Euclidean distance of genes x and y over n samples, or of samples x and y over n genes:
$$ d_{xy} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1} $$
- Pearson correlation of genes x and y over n samples, or of samples x and y over n genes, where $\bar{x}$ is the mean of x and $\bar{y}$ is the mean of y:
$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2} $$
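
A minimal sketch of equations (1) and (2) in plain Python may help make the two measures concrete. The function names and the toy profiles `gene_a` and `gene_b` are illustrative assumptions, not part of the lecture material.

```python
import math


def euclidean_distance(x, y):
    """Equation (1): square root of the summed squared differences."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def pearson_correlation(x, y):
    """Equation (2): centered cross-product divided by both centered norms."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    norm_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return numerator / (norm_x * norm_y)


# Hypothetical expression profiles: gene_b is gene_a scaled by a factor of 2.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
d_ab = euclidean_distance(gene_a, gene_b)
r_ab = pearson_correlation(gene_a, gene_b)
```

In this toy case the two profiles are far apart in Euclidean terms but perfectly correlated, which is one reason correlation-based measures are often used when only the shape of an expression profile matters.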

Distance metrics: Un-centered correlation coefficient
- Un-centered correlation coefficient of genes x and y over n samples, or of samples x and y over n genes:
$$ r^{u}_{xy} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{3} $$
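
As a hedged illustration of equation (3), the sketch below computes the un-centered coefficient, which has the same form as the cosine of the angle between the two vectors. The function name and the toy vectors are assumptions made for this example.

```python
import math


def uncentered_correlation(x, y):
    """Equation (3): cross-product divided by both (un-centered) norms."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("coefficient is undefined for an all-zero profile")
    return numerator / (norm_x * norm_y)


# Hypothetical profiles: `shifted` is `profile` plus a constant offset of 10.
profile = [1.0, 2.0, 3.0]
shifted = [11.0, 12.0, 13.0]
r_u = uncentered_correlation(profile, shifted)
```

Because the means are not subtracted, a constant offset lowers the coefficient below 1 here, whereas the centered Pearson correlation of the same two profiles would be exactly 1.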

Clustering algorithms
Hierarchical clustering:
- Single linkage: the linking distance is the minimum pairwise distance between members of the two clusters.
- Complete linkage: the linking distance is the maximum pairwise distance between members of the two clusters.
- Average linkage/UPGMA: the linking distance is the average of all pairwise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
'Flat' clustering:
- k-means (k from 2 to 15, 3 runs)
- k-median (k-medoid)
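
The three linkage rules can be sketched as follows. This is a naive agglomerative loop written for readability, assuming Euclidean distance between points; the function names and the four toy points are illustrative and are not taken from any particular clustering tool.

```python
import math
from itertools import product


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def linkage_distance(cluster_a, cluster_b, method="average"):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [euclidean(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # minimum pairwise distance
        return min(pairwise)
    if method == "complete":    # maximum pairwise distance
        return max(pairwise)
    if method == "average":     # mean of all pairwise distances (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)


def agglomerate(points, n_clusters, method="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Four hypothetical 2-D points forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
clusters = agglomerate(points, 2, method="single")
single = linkage_distance(clusters[0], clusters[1], "single")
complete = linkage_distance(clusters[0], clusters[1], "complete")
```

The loop recomputes every cluster-to-cluster distance in each round, which is fine for a handful of points but far slower than the update schemes used in practical implementations.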

The two-sample problem: Interpretation of clusters
- Clustering introduces 'structure' into microarray datasets.
- But is there a statistical or biomedical meaning of these classes?
- Biomedical meaning has to be established in experiments.
- 'Statistical meaning' can be measured using statistical tests, by a so-called two-sample test.
- A two-sample test decides whether two samples were drawn from the same probability distribution or not.

The two-sample problem: Data diversity
- Molecular biology produces a wealth of information.
- The problem is that these data are generated on different platforms and by different protocols under different levels of noise.
- Hence data from different labs show
  - different scales
  - different ranges
  - different distributions
- Main problem: joint data analysis may detect differences in distributions, not biological phenomena!

The two-sample problem
- The two-sample problem: given two samples X and Y, were they generated by the same distribution?
- Previous approaches: two-sample tests exist for univariate and multivariate data.

The two-sample problem: t-test
- A test of the null hypothesis that the means of two normally distributed populations are equal
- unpaired/independent (versus paired)
- For equal sample sizes and equal variances, the t statistic to test whether the means are different can be calculated as follows:
$$ t = \frac{\bar{x} - \bar{y}}{\sigma_{xy} \cdot \sqrt{\frac{2}{n}}} \tag{4} $$
where $\sigma_{xy} = \sqrt{\frac{\sigma_x^2 + \sigma_y^2}{2}}$.
- The degrees of freedom for this test is 2n − 2, where n is the size of each sample.
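
Equation (4) can be worked through on a small example. The sketch below assumes that the variances in the formula are the unbiased sample variances (divisor n − 1), which the slide does not state explicitly; the function names and the data are hypothetical.

```python
import math


def sample_variance(values):
    """Unbiased sample variance (divisor n - 1); an assumption of this sketch."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def equal_size_t_statistic(x, y):
    """Equation (4) for two independent samples of equal size n."""
    if len(x) != len(y):
        raise ValueError("this form of the test requires equal sample sizes")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    pooled_sd = math.sqrt((sample_variance(x) + sample_variance(y)) / 2)
    t = (mean_x - mean_y) / (pooled_sd * math.sqrt(2 / n))
    degrees_of_freedom = 2 * n - 2
    return t, degrees_of_freedom


# Hypothetical measurements for two groups of four samples each.
group_x = [5.0, 6.0, 7.0, 8.0]
group_y = [1.0, 2.0, 3.0, 4.0]
t_value, dof = equal_size_t_statistic(group_x, group_y)
```

To turn the statistic into a decision, `t_value` would be compared against the critical value of the t distribution with `dof` degrees of freedom at the chosen significance level.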

The two-sample problem: New challenges in bioinformatics
- high-dimensional
- structured (strings and graphs)
- low sample size
- Novel distribution test: Maximum Mean Discrepancy (MMD)

MMD key idea
- Key idea: avoid density estimators, use means in feature spaces.
- Maximum Mean Discrepancy (Fortet and Mourier, 1953):
$$ D(p, q, \mathcal{F}) := \sup_{f \in \mathcal{F}} \left( \mathbf{E}_p[f(x)] - \mathbf{E}_q[f(y)] \right) $$
- Theorem: $D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = C^0(\mathcal{X})$. Follows directly, e.g. from Dudley, 1984.
- Theorem: $D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = \{ f \mid \|f\|_{\mathcal{H}} \le 1 \}$, provided that $\mathcal{H}$ is a universal RKHS (follows via Steinwart, 2001; Smola et al., 2006).

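MMD statistic
- Goal: estimate $D(p, q, \mathcal{F})$. When $\mathcal{F}$ is the unit ball of an RKHS with kernel k, the squared discrepancy can be written in terms of kernel evaluations:
$$ D^2(p, q, \mathcal{F}) = \mathbf{E}_{p,p}\, k(x, x') - 2\, \mathbf{E}_{p,q}\, k(x, y) + \mathbf{E}_{q,q}\, k(y, y') $$
- U-statistic: empirical estimate from samples $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_m\}$:
$$ D^2(X, Y, \mathcal{F}) = \frac{1}{m(m-1)} \sum_{i \neq j} \left[ k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j) \right] $$
- Theorem: $D^2(X, Y, \mathcal{F})$ is an unbiased estimator of $D^2(p, q, \mathcal{F})$.
- Test: estimate $\sigma^2$ from data. Reject the null hypothesis that $p = q$ if the empirical estimate exceeds the acceptance threshold.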
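
The U-statistic above can be sketched directly as a double loop over index pairs with i ≠ j. The Gaussian kernel, its bandwidth of 1.0, and the toy samples are assumptions chosen for illustration; the slides do not fix a particular kernel, and the acceptance threshold is not computed here.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel; the bandwidth value is an arbitrary choice."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """U-statistic estimate of the squared MMD for two samples of equal size m."""
    if len(xs) != len(ys):
        raise ValueError("this estimator assumes equal sample sizes")
    m = len(xs)
    if m < 2:
        raise ValueError("at least two points per sample are required")
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue  # the sum runs over i != j only
            total += (kernel(xs[i], xs[j]) - kernel(xs[i], ys[j])
                      - kernel(ys[i], xs[j]) + kernel(ys[i], ys[j]))
    return total / (m * (m - 1))


# Hypothetical one-dimensional samples: `near_zero` and a copy shifted by 3.
near_zero = [(0.0,), (0.1,), (-0.1,)]
near_three = [(3.0,), (3.1,), (2.9,)]
same = mmd2_unbiased(near_zero, near_zero)
different = mmd2_unbiased(near_zero, near_three)
```

Because the estimator is unbiased rather than constrained to be non-negative, it can come out slightly below zero when both samples are drawn from the same distribution. The double loop also makes the quadratic runtime in the number of data points visible.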

Attractive for bioinformatics: MMD two-sample test in terms of kernels
- Computationally attractive:
  - search infinite space of functions by evaluating one expression
  - no optimization problem has to be solved
- All thanks to kernels!

Attractive for bioinformatics: Wide applicability
- for one- and higher-dimensional vectorial data, but also for structured data!
- two-sample problems can now be tackled on
  - strings: protein and DNA sequences
  - graphs: molecules, protein interaction networks
  - time series: time series of microarray data
  - and sets, trees, ...
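
To make the point about structured data concrete, here is a hedged sketch of one possible kernel on strings: a k-mer spectrum kernel that counts shared substrings of length k. This kernel is chosen for illustration and is not named on the slides; the function names and the toy sequences are assumptions.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(s, t, k=2):
    """Inner product of the two k-mer count vectors."""
    counts_s = kmer_counts(s, k)
    counts_t = kmer_counts(t, k)
    # A Counter returns 0 for missing keys, so unshared k-mers contribute nothing.
    return sum(count * counts_t[kmer] for kmer, count in counts_s.items())


# Hypothetical DNA fragments sharing the 2-mers "AC" and "CG".
shared = spectrum_kernel("ACGT", "ACGA", k=2)
repeats = spectrum_kernel("AAAA", "AAA", k=2)
```

Any function of this kind that maps a pair of objects to a valid kernel value could in principle take the place of a vector kernel in a kernel-based two-sample statistic, which is what makes the approach applicable to sequences as well as to vectors.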

Cross-platform comparability
Data:
- microarray data from two breast cancer studies
- one on cDNA platform (Gruvberger et al., 2001)
- other on oligonucleotide microarray platform (West et al., 2001)
Task: can MMD help to find out if two sets of observations were generated by
- the same study (both from Gruvberger or both from West)?
- different studies (one Gruvberger, one West)?

Cross-platform comparability: Experiment
- sample size each: 25
- dimension of each data point: 2,116
- significance level: α = 0.05
- 100 times: 1 sample from Gruvberger, 1 from West
- 100 times: both from Gruvberger or both from West
- report percentage of correct decisions
- compare to the t-test and to the Friedman-Rafsky Wald-Wolfowitz and Smirnov tests

Kernel-based statistical test
- novel statistical test for the two-sample problem:
  - easy to implement
  - non-parametric
  - first for structured data
  - best on high-dimensional data
  - quadratic runtime w.r.t. the number of data points
  - impressive accuracy in our experiments
- kernel method for the two-sample problem:
  - all kernels recently defined in molecular biology can be re-used for data integration
  - applicable to vectors, strings, sets, trees, graphs and time series

Biclustering: Clustering in two dimensions
- alternative names: co-clustering, two-mode clustering
- A bicluster is a subset of genes that show similar activity patterns under a subset of conditions.
- Clustering in 2 dimensions: cluster patients and conditions
- Earliest work by Hartigan, 1972: divide a matrix into submatrices with minimum variance.
- Most interesting cases are NP-complete.
- Many extensions in bioinformatics (e.g. Cheng and Church, 2002)
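
The minimum-variance idea can be illustrated by scoring a candidate bicluster with the variance of the submatrix it selects. This is only a scoring sketch under that reading of the criterion, not a search procedure and not a reimplementation of any published algorithm; the function name and the toy matrix are hypothetical.

```python
def submatrix_variance(matrix, rows, cols):
    """Population variance of the entries selected by the row and column subsets."""
    values = [matrix[r][c] for r in rows for c in cols]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical expression matrix (rows: genes, columns: conditions).
expression = [
    [1.0, 1.0, 9.0],
    [1.0, 1.0, 2.0],
    [7.0, 3.0, 5.0],
]
# Genes 0 and 1 behave identically under conditions 0 and 1.
tight = submatrix_variance(expression, rows=[0, 1], cols=[0, 1])
whole = submatrix_variance(expression, rows=[0, 1, 2], cols=[0, 1, 2])
```

Scoring a single candidate is cheap; the difficulty lies in searching over all row and column subsets, which is where the hardness mentioned above comes from.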

References and further reading
[1] Gretton, Borgwardt, Rasch, Schölkopf, Smola: A kernel method for the two-sample problem. NIPS 2006.

The end
See you tomorrow! Next topic: Feature Selection in Bioinformatics
