Model-based clustering and data transformations of gene expression data
Walter L. Ruzzo, University of Washington
UW CSE Computational Biology Group
Overview
• Motivation
• Model-based clustering
• Validation
• Summary and Conclusions
Toy 2-d Clustering Example
[Figure: unlabeled 2-d points — how should they be clustered?]

K-Means
[Figure: k-means clustering of the toy data]

Hierarchical Average Link
[Figure: average-link clustering of the toy data]

Model-Based (If You Want)
[Figure: model-based clustering of the toy data]
Model-based clustering
• Gaussian mixture model:
  – Assume each cluster is generated by a multivariate normal distribution
  – Cluster k has parameters:
    • Mean vector: µ_k
    • Covariance matrix: Σ_k
Variance & Covariance
• Variance: var(x) = E((x − x̄)²) = σ_x²
• Covariance: cov(x, y) = E((x − x̄)(y − ȳ))
• Correlation: cor(x, y) = cov(x, y) / (σ_x σ_y)
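As a quick numerical check of these definitions, a short sketch; the vectors `x` and `y` are hypothetical values (not from the paper's data), with `y` constructed to be perfectly correlated with `x`:

```python
import numpy as np

# Hypothetical expression levels for two "genes" across five experiments.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # y = 2x, perfectly correlated

var_x = np.mean((x - x.mean()) ** 2)               # E((x - x̄)²)
var_y = np.mean((y - y.mean()) ** 2)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # E((x - x̄)(ȳ - ȳ))
cor_xy = cov_xy / np.sqrt(var_x * var_y)           # cov / (σ_x σ_y)

print(var_x, cov_xy, cor_xy)  # 2.0 4.0 1.0
```

Since y = 2x, the covariance is twice the variance of x, and the correlation is exactly 1.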
Gaussian Distributions
• Univariate:
  f(x) = (1 / √(2πσ²)) · e^(−(x − x̄)² / (2σ²))
• Multivariate:
  f(x) = (1 / √((2π)ⁿ |Σ|)) · e^(−(1/2)(x − x̄)ᵀ Σ⁻¹ (x − x̄))
  where Σ is the variance/covariance matrix: Σ_{i,j} = E((x_i − x̄_i)(x_j − x̄_j))
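The multivariate density formula translates directly into code; `mvn_density` is a hypothetical helper name, evaluated here at the mean of a 2-d standard normal, where the formula reduces to 1/(2π):

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """Multivariate normal density, following the slide's formula."""
    n = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    expo = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return np.exp(expo) / norm

mu = np.zeros(2)
sigma = np.eye(2)
print(mvn_density(mu, mu, sigma))  # 1/(2π) ≈ 0.1592
```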
Variance/Covariance
[Figure]
Covariance models (Banfield & Raftery 1993)
Σ_k = λ_k D_k A_k D_kᵀ   (λ_k: volume, D_k: orientation, A_k: shape)
• Equal-volume spherical model (EI): Σ_k = λI   (~k-means)
• Unequal-volume spherical model (VI): Σ_k = λ_k I
• Diagonal model: Σ_k = λ_k B_k, where B_k is diagonal and |B_k| = 1
• Elliptical model (EEE): Σ_k = λDADᵀ
• Unconstrained model (VVV): Σ_k = λ_k D_k A_k D_kᵀ
More flexible models fit better, but require more parameters.
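A sketch of how the λ·D·A·Dᵀ decomposition builds a cluster covariance; the volume, shape, and orientation values below are hypothetical. Because A has unit determinant, λ alone controls the volume (in 2-d, det Σ = λ²):

```python
import numpy as np

lam = 2.0                     # volume (λ)
theta = np.pi / 4             # orientation angle for D
D = np.array([[np.cos(theta), -np.sin(theta)],   # rotation matrix
              [np.sin(theta),  np.cos(theta)]])
A = np.diag([4.0, 0.25])      # shape; det(A) = 1 by construction

sigma = lam * D @ A @ D.T     # Σ = λ D A Dᵀ

# det(Σ) = λ² det(A) = 4.0; eigenvalues are λ·4 = 8 and λ·0.25 = 0.5
print(np.linalg.det(sigma), np.sort(np.linalg.eigvalsh(sigma)))
```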
EM algorithm
• General approach to maximum likelihood
• Iterate between E and M steps:
  – E step: compute the probability of each observation belonging to each cluster using the current parameter estimates
  – M step: estimate model parameters using the current group membership probabilities
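The two steps above can be sketched for the simplest covariance model, the equal-volume spherical (EI) mixture; `em_spherical` is an illustrative implementation under that assumption, not the software used in the paper:

```python
import numpy as np

def em_spherical(X, k, iters=50, init_mu=None, seed=0):
    """EM for an equal-volume spherical Gaussian mixture (EI model sketch):
    one shared variance for all clusters, soft cluster memberships."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = (np.array(init_mu, float) if init_mu is not None
          else X[rng.choice(n, k, replace=False)])  # init means at data points
    var, pi = X.var(), np.full(k, 1.0 / k)
    for _ in range(iters):
        # E step: membership probabilities under current parameters
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, k)
        logp = np.log(pi) - 0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var)
        r = np.exp(logp - logp.max(1, keepdims=True))
        r /= r.sum(1, keepdims=True)
        # M step: re-estimate weights, means, and the shared variance
        nk = r.sum(0)
        pi, mu = nk / n, (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * sq).sum() / (n * d)
    return pi, mu, var, r
```

With hard (0/1) memberships and fixed equal weights, the M step reduces to the k-means centroid update, which is why EI behaves like k-means.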
Advantages of model-based clustering
• Higher-quality clusters
• Flexible models
• Model selection
  – A principled way to choose the right model and the right number of clusters
  – Bayesian Information Criterion (BIC):
    • Approximates the Bayes factor: posterior odds for one model against another
    • Roughly: data likelihood, penalized for the number of parameters
  – A large BIC score indicates strong evidence for the corresponding model
Definition of the BIC score
BIC ≡ 2 log p(D | M_k) ≈ 2 log p(D | θ̂_k, M_k) − ν_k log(n)
• The integrated likelihood p(D | M_k) is hard to evaluate, where D is the data and M_k is the model
• BIC is an approximation to 2 log p(D | M_k)
• ν_k: number of parameters to be estimated in model M_k
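A minimal sketch of the BIC formula as a function; the log-likelihoods and parameter counts below are hypothetical numbers, chosen only to show how the ν·log(n) penalty can overturn a likelihood advantage:

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    """BIC as on the slide: twice the maximized log-likelihood
    minus a penalty of (number of parameters) x log(number of observations)."""
    return 2 * loglik - n_params * np.log(n_obs)

# A model with higher likelihood but many more parameters scores worse:
print(bic(-100.0, 5, 200))   # ≈ -226.49
print(bic(-95.0, 40, 200))   # ≈ -401.93
```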
Overview
• Motivation
• Model-based clustering
• Validation
  – Methodology
  – Data Sets
  – Results
• Summary and Conclusions
Validation Methodology
• Compare on data sets with external criteria (BIC scores do not require the external criteria)
• To compare clusters with an external criterion:
  – Adjusted Rand index (Hubert and Arabie 1985)
  – Adjusted Rand index = 1: perfect agreement
  – Two random partitions have an expected index of 0
• Compare quality of clusters to those from:
  – a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
  – k-means (EI)
Gene expression data sets
• Ovarian cancer data set (Michèl Schummer, Institute of Systems Biology)
  – Subset of data: 235 clones × 24 experiments (cancer/normal tissue samples)
  – The 235 clones correspond to 4 genes
• Yeast cell cycle data (Cho et al. 1998)
  – 17 time points
  – Subset of 384 genes associated with 5 phases of the cell cycle
Synthetic data sets (both based on the ovary data)
• Randomly resampled ovary data
  – For each class, randomly resample the expression levels in each experiment, independently
  – Near-diagonal covariance matrix
• Gaussian mixture
  – Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data
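The Gaussian-mixture construction can be sketched as follows; `real_class` stands in for one class of the ovary data (random placeholder values here, not the actual measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder for one class of the real data: 50 clones x 4 experiments.
real_class = rng.normal(size=(50, 4))

# Fit that class's sample mean vector and sample covariance matrix...
mu = real_class.mean(axis=0)
sigma = np.cov(real_class, rowvar=False)

# ...and draw synthetic observations from the corresponding normal.
synthetic = rng.multivariate_normal(mu, sigma, size=200)
print(synthetic.shape)  # (200, 4)
```

Repeating this per class, with class sizes matching the real data, yields a synthetic data set whose true cluster structure is known exactly.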
Results: randomly resampled ovary data
[Figures: adjusted Rand index and BIC vs. number of clusters for the EI, VI, diagonal, EEE, and VVV models and CAST]
• Diagonal model achieves the max BIC score (~expected)
• Max BIC at 4 clusters (~expected)
• Max adjusted Rand index
• Beats CAST
Results: square root ovary data
[Figures: adjusted Rand index and BIC vs. number of clusters]
• Adjusted Rand: max at EEE, 4 clusters (> CAST)
• BIC analysis:
  – EEE and diagonal models: local max at 4 clusters
  – Global max: VI at 8 clusters (8 ≈ a split of the 4)
Results: standardized yeast cell cycle data
[Figures: adjusted Rand index and BIC vs. number of clusters]
• Adjusted Rand: EI slightly > CAST at 5 clusters
• BIC: selects EEE at 5 clusters
Overview
• Motivation
• Model-based clustering
• Validation
• Importance of Data Transformation
• Summary and Conclusions
Log yeast cell cycle data
[Figure]
Standardized yeast cell cycle data
[Figure]
Summary and Conclusions
• Synthetic data sets:
  – With the correct model, model-based clustering beats a leading heuristic clustering algorithm
  – BIC selects the right model & the right number of clusters
• Real expression data sets:
  – Adjusted Rand indices comparable to CAST's
  – BIC gives a good hint as to the number of clusters
• Appropriate data transformations increase normality & cluster quality (see paper & web)
Acknowledgements
• Ka Yee Yeung (1), Chris Fraley (2,4), Alejandro Murua (4), Adrian E. Raftery (2)
• Michèl Schummer (5) – the ovary data
• Jeremy Tantrum (2) – help with MBC software (diagonal model)
• Chris Saunders (3) – CRE & noise model
(1) Computer Science & Engineering, (2) Statistics, (3) Genome Sciences, (4) Insightful Corporation, (5) Institute of Systems Biology
More info: http://www.cs.washington.edu/homes/ruzzo
Adjusted Rand Example

Contingency table (rows = true classes, columns = clusters; sizes in parentheses):

             c#1(4)  c#2(5)  c#3(7)  c#4(4)
class#1(2)     2       0       0       0
class#2(3)     0       0       0       3
class#3(5)     1       4       0       0
class#4(10)    1       1       7       1

a = C(2,2) + C(3,2) + C(4,2) + C(7,2) = 1 + 3 + 6 + 21 = 31   (pairs together in both partitions)
b = [C(4,2) + C(5,2) + C(7,2) + C(4,2)] − a = 43 − 31 = 12
c = [C(2,2) + C(3,2) + C(5,2) + C(10,2)] − a = 59 − 31 = 28
d = C(20,2) − a − b − c = 190 − 71 = 119   (pairs apart in both partitions)

Rand index: R = (a + d)/(a + b + c + d) = 150/190 ≈ 0.789
Adjusted Rand = (R − E(R))/(1 − E(R)) ≈ 0.469
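The slide's arithmetic can be verified directly from the contingency table; this sketch uses the standard pair-counting form of the adjusted Rand index (Hubert and Arabie 1985):

```python
from math import comb

# Contingency table from the slide: rows = true classes, columns = clusters.
table = [[2, 0, 0, 0],
         [0, 0, 0, 3],
         [1, 4, 0, 0],
         [1, 1, 7, 1]]

n = sum(map(sum, table))                                  # 20 observations
a = sum(comb(cell, 2) for row in table for cell in row)   # pairs together in both
rows = sum(comb(sum(row), 2) for row in table)            # same-class pairs
cols = sum(comb(s, 2) for s in map(sum, zip(*table)))     # same-cluster pairs
d = comb(n, 2) - rows - cols + a                          # pairs apart in both

rand = (a + d) / comb(n, 2)
expected = rows * cols / comb(n, 2)
ari = (a - expected) / ((rows + cols) / 2 - expected)

print(round(rand, 3), round(ari, 3))  # 0.789 0.469
```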