Classification and Clustering of RNAseq data
Verena Zuber, IMISE, University of Leipzig
5th June 2012
The presented publication
Author of the publication: Daniela Witten
Table of contents
1 Introduction
2 Statistical Framework
3 Supervised Learning: Classification
4 Unsupervised Learning: Clustering
5 Results
6 Conclusion
Biological Background: Transcriptomics I
Technologies to “measure” the transcriptome:
• Microarrays
• Next or second generation RNA sequencing (RNAseq)
Limitations of microarrays:
• High levels of background noise due to cross-hybridization
• Only transcripts for which a probe is present on the array can be measured. Therefore, it is not possible to discover novel mRNAs in a typical microarray experiment.
Biological Background: Transcriptomics II
Promises of RNAseq:
• Less noisy than microarray data, since the technology does not suffer from cross-hybridization
• Detection of novel transcripts and coding regions
• “It seems certain that RNA sequencing is on track to replace the microarray as the technology of choice for the characterization of gene expression.”
Challenges in the analysis:
• Normalization
• Count data: integer-valued and non-negative
Statistical Framework
Data structure
X: n × p matrix of sequencing data (the slide shows a diagram of the n × p matrix X)
• i ∈ {1, ..., n}: samples
• j ∈ {1, ..., p}: features or regions of interest
• s_i: sample-specific constant
• g_j: gene-specific constant
Distributions
• Poisson distribution: X_ij ∼ Poisson(N_ij), N_ij = s_i g_j
  • expectation: E(X_ij) = N_ij
  • variance: Var(X_ij) = N_ij
• Negative binomial distribution: X_ij ∼ NB(N_ij, φ_j), N_ij = s_i g_j
  • φ_j is a gene-specific over-dispersion parameter
  • expectation: E(X_ij) = N_ij
  • variance: Var(X_ij) = N_ij + N_ij^2 φ_j
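To make the two mean-variance relationships concrete, here is a minimal simulation sketch in Python/NumPy (not from the paper); the number of samples and features and the values used for s_i, g_j and φ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_counts(s, g, phi=None):
    """Simulate an n x p count matrix with mean N_ij = s_i * g_j.

    phi=None gives Poisson counts (variance = mean); a positive phi gives
    negative binomial counts with variance = mean + mean^2 * phi.
    """
    N = np.outer(s, g)                      # N_ij = s_i * g_j
    if phi is None:
        return rng.poisson(N)
    size = 1.0 / phi                        # NB parameterized via (size, prob)
    prob = size / (size + N)
    return rng.negative_binomial(size, prob)

# illustrative (assumed) sample- and gene-specific constants
s = rng.uniform(0.2, 2.2, size=20)          # 20 samples
g = rng.exponential(25.0, size=1000)        # 1000 features, mean level 25
X_pois = simulate_counts(s, g)              # Var ~ mean
X_nb = simulate_counts(s, g, phi=0.1)       # Var ~ mean + 0.1 * mean^2
```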
Distributions dependent on class k
• y_i = k ∈ {1, ..., K}: factor indicating the membership of sample i in class k
• Poisson distribution: X_ij | y_i = k ∼ Poisson(N_ij d_kj), N_ij = s_i g_j
• Negative binomial distribution: X_ij | y_i = k ∼ NB(N_ij d_kj, φ_j), N_ij = s_i g_j
• d_kj: gene-specific, class-specific factor
  • d_kj > 1 indicates that the j-th feature is over-expressed in class k relative to the baseline
  • d_kj < 1 indicates that the j-th feature is under-expressed in class k relative to the baseline
• C_k comprises all samples belonging to class k
Poisson Log Linear Model
Assumptions:
• Poisson distribution
• Independence of features
Poisson log linear model: X_ij | y_i = k ∼ Poisson(N̂_ij d̂_kj), N̂_ij = ŝ_i ĝ_j
Estimation of the gene-specific constant g_j:
• ĝ_j = Σ_{i=1}^n X_ij
Poisson Log Linear Model
Estimation of the sample-specific constant s_i (under the identifiability constraint Σ_{i=1}^n ŝ_i = 1):
• Total count (ML estimate): ŝ_i = Σ_{j=1}^p X_ij / Σ_{i'=1}^n Σ_{j=1}^p X_i'j
• Median ratio (Anders and Huber, 2010): ŝ_i = m_i / Σ_{i'=1}^n m_i', where m_i = median_j [ X_ij / (Π_{i'=1}^n X_i'j)^{1/n} ]
• Quantile (Bullard et al., 2010): ŝ_i = q_i / Σ_{i'=1}^n q_i', where q_i is the 75th percentile of the counts of sample i
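The three normalization choices can be collected in one small Python/NumPy helper. This is a sketch under the slide's definitions; the function name, the argument names and the handling of zero counts in the median-ratio branch are my own assumptions, not part of the paper's software.

```python
import numpy as np

def size_factors(X, method="total"):
    """Estimate the sample-specific constants s_i for an n x p count matrix X.

    Returns estimates normalized to sum to one.
    """
    if method == "total":              # ML estimate: total count per sample
        s = X.sum(axis=1).astype(float)
    elif method == "median":           # Anders & Huber (2010) median ratio
        with np.errstate(divide="ignore"):
            log_geo = np.log(X).mean(axis=0)    # log geometric mean per feature
        keep = np.isfinite(log_geo)             # drop features with any zero count
        s = np.median(X[:, keep] / np.exp(log_geo[keep]), axis=1)
    elif method == "quantile":         # Bullard et al. (2010): 75th percentile
        s = np.percentile(X, 75, axis=1)
    else:
        raise ValueError(f"unknown method: {method}")
    return s / s.sum()
```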
Poisson Log Linear Model
Estimation of the (gene- and) class-specific factor d_kj:
• Maximum likelihood estimate: d̂_kj = X_{C_k j} / Σ_{i ∈ C_k} N̂_ij, where X_{C_k j} = Σ_{i ∈ C_k} X_ij
• If X_{C_k j} = 0, then d̂_kj = 0. “This can pose a problem for downstream analyses, since this estimate completely precludes the possibility of a nonzero count for feature j arising from an observation in class k.”
• Bayesian estimate: a Gamma(β, β) prior on d_kj results in the posterior mean
  d̂_kj = (X_{C_k j} + β) / (Σ_{i ∈ C_k} N̂_ij + β)
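A short Python/NumPy sketch of the posterior-mean estimate, assuming the counts X, class labels y and size factors ŝ_i from the previous slides; the default β = 1 is an illustrative choice, not a value given on the slide.

```python
import numpy as np

def estimate_dkj(X, y, s_hat, beta=1.0):
    """Posterior-mean estimates of the class/gene factors d_kj.

    X: n x p count matrix, y: length-n class labels, s_hat: length-n size
    factors (summing to one), beta: Gamma(beta, beta) prior parameter.
    """
    y = np.asarray(y)
    g_hat = X.sum(axis=0)                    # gene-specific constants g_j
    N_hat = np.outer(s_hat, g_hat)           # N_ij = s_i * g_j
    classes = np.unique(y)
    d_hat = np.empty((classes.size, X.shape[1]))
    for k, cls in enumerate(classes):
        in_k = (y == cls)
        d_hat[k] = (X[in_k].sum(axis=0) + beta) / (N_hat[in_k].sum(axis=0) + beta)
    return classes, d_hat
```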
Transformation for overdispersed data
• Biological replicates of sequencing data tend to be overdispersed relative to the Poisson model (the variance is larger than the expectation)
• Power transformation X'_ij ← X_ij^α, where α ∈ (0, 1] is chosen so that
  Σ_{i=1}^n Σ_{j=1}^p (X'_ij − χ'_ij)^2 / χ'_ij ≈ (n − 1)(p − 1)
  with χ'_ij = (Σ_{j'=1}^p X'_ij')(Σ_{i'=1}^n X'_i'j) / Σ_{i'=1}^n Σ_{j'=1}^p X'_i'j' (row total × column total / grand total; a goodness-of-fit condition)
• “Though the resulting transformed data are not integer-valued, we nonetheless model them using the Poisson distribution.”
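One simple way to pick α is a grid search over (0, 1] for the value whose goodness-of-fit statistic is closest to (n − 1)(p − 1); the grid and the search strategy below are my assumptions, not necessarily how the paper solves the criterion, and the sketch assumes no all-zero rows or columns.

```python
import numpy as np

def gof_statistic(Xa):
    """Chi-square goodness-of-fit statistic of the transformed counts
    against the independence (Poisson log linear) model."""
    row = Xa.sum(axis=1, keepdims=True)
    col = Xa.sum(axis=0, keepdims=True)
    expected = row * col / Xa.sum()          # chi'_ij
    return ((Xa - expected) ** 2 / expected).sum()

def choose_alpha(X, alphas=np.linspace(0.05, 1.0, 96)):
    """Grid search for the power alpha whose statistic is closest to
    (n - 1)(p - 1)."""
    n, p = X.shape
    target = (n - 1) * (p - 1)
    stats = np.array([gof_statistic(X.astype(float) ** a) for a in alphas])
    return float(alphas[np.argmin(np.abs(stats - target))])
```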
Supervised Learning: Classification
Poisson linear discriminant analysis
• Rather a diagonal discriminant analysis (DDA), owing to the independence assumption
• Bayes’ rule gives the probability that a test observation x⋆ belongs to class k:
  prob(k | x⋆) = π_k f(x⋆ | k) / f(x⋆) ∝ π_k f(x⋆ | k)
• where f(x⋆ | k) is given by X_ij | y_i = k ∼ Poisson(N_ij d_kj), N_ij = s_i g_j
• π_k represents the a priori mixing probability for class k
Discriminant scores
• Poisson discriminant analysis:
  log prob(k | x⋆) = log { π_k f(x⋆ | k) / f(x⋆) } ∝ Σ_{j=1}^p X⋆_j log d̂_kj − ŝ⋆ Σ_{j=1}^p ĝ_j d̂_kj + log π̂_k
• For comparison, Fisher’s DDA (Gaussian distribution):
  log prob(k | x⋆) ∝ µ_k^T V^{-1} x⋆ − (1/2) µ_k^T V^{-1} µ_k + log π_k
  where µ_k is the expectation in group k, and V is the diagonal variance matrix, assumed equal in all K groups
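A minimal Python/NumPy sketch of the Poisson discriminant rule, assuming the training-set estimates ĝ_j, d̂_kj, π̂_k and the test sample's size factor ŝ⋆ are already available; the small clipping constant guarding against log(0) is my addition.

```python
import numpy as np

def plda_scores(x_star, s_star, g_hat, d_hat, priors):
    """Poisson LDA discriminant scores for a single test sample.

    x_star: length-p count vector, s_star: its size factor, g_hat: length-p
    gene constants, d_hat: K x p class/gene factors, priors: length-K pi_k.
    """
    log_d = np.log(np.clip(d_hat, 1e-12, None))   # guard against d_kj = 0
    return log_d @ x_star - s_star * (d_hat @ g_hat) + np.log(priors)

def plda_classify(x_star, s_star, g_hat, d_hat, priors):
    """Assign the sample to the class with the largest discriminant score."""
    return int(np.argmax(plda_scores(x_star, s_star, g_hat, d_hat, priors)))
```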
The sparse PLDA classifier
• The standard estimates d̂_kj are unequal to 1 for all p features
• For high-dimensional transcriptomics data, however, classifiers built on a smaller subset of features are desirable
• Soft-threshold estimate (similar to PAM): d̂_kj = 1 + S(a/b − 1, ρ/√b)
• Soft-threshold operator with penalization parameter ρ: S(a/b − 1, ρ/√b) = sign(a/b − 1) (|a/b − 1| − ρ/√b)_+
• a = X_{C_k j} + β (numerator of the Bayesian estimate d̂_kj)
• b = Σ_{i ∈ C_k} N̂_ij + β (denominator of the Bayesian estimate d̂_kj)
• Sets d̂_kj to 1 if |a/b − 1| < ρ/√b, and thus excludes feature j from the classification rule
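A sketch of the soft-thresholded estimate in Python/NumPy; ρ would in practice be chosen by cross-validation, and β = 1 is again an assumed default rather than a value from the slides.

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * (|z| - t)_+"""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_dkj(X, y, s_hat, rho, beta=1.0):
    """Soft-thresholded class/gene factors for sparse PLDA.

    Features whose factor equals 1 in every class drop out of the
    classification rule.
    """
    y = np.asarray(y)
    g_hat = X.sum(axis=0)
    N_hat = np.outer(s_hat, g_hat)
    classes = np.unique(y)
    d_hat = np.ones((classes.size, X.shape[1]))
    for k, cls in enumerate(classes):
        in_k = (y == cls)
        a = X[in_k].sum(axis=0) + beta        # numerator of the Bayes estimate
        b = N_hat[in_k].sum(axis=0) + beta    # denominator of the Bayes estimate
        d_hat[k] = 1.0 + soft_threshold(a / b - 1.0, rho / np.sqrt(b))
    return classes, d_hat
```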
Unsupervised Learning: Clustering
Poisson dissimilarity
• Aim: clustering based on an n × n dissimilarity matrix between observations
• Connection between the Euclidean distance and the log likelihood ratio statistic under a Gaussian model: X_ij ∼ N(µ_ij, σ^2), X_i'j ∼ N(µ_i'j, σ^2)
• Testing H0: µ_ij = µ_i'j against H1: “µ_ij and µ_i'j are unrestricted” results in the log likelihood ratio statistic
  Σ_{j=1}^p [ (X_ij − (X_ij + X_i'j)/2)^2 + (X_i'j − (X_ij + X_i'j)/2)^2 ] = Σ_{j=1}^p (X_ij − X_i'j)^2 / 2 ∝ || x_i − x_i' ||^2
Poisson dissimilarity
• Poisson distribution “restricted to x_i and x_i'”: X_ij ∼ Poisson(N̂_ij d̂_ij), X_i'j ∼ Poisson(N̂_i'j d̂_i'j)
• Testing H0: d_ij = d_i'j = 1 against H1: “d_ij and d_i'j are unrestricted” results in the log likelihood ratio statistic
  Σ_{j=1}^p [ (N̂_ij + N̂_i'j) − (N̂_ij d̂_ij + N̂_i'j d̂_i'j) + X_ij log d̂_ij + X_i'j log d̂_i'j ]
• This can be used as a dissimilarity between x_i and x_i'; it is nonnegative and equals zero if x_i = x_i'
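A simplified Python/NumPy sketch of the resulting dissimilarity matrix: for each pair of samples the baseline N̂ is re-estimated from the two rows alone and d̂ is the unrestricted maximum likelihood estimate. It omits shrinkage of d̂ and the power transformation, so it is an approximation of the procedure in the paper, not the paper's implementation.

```python
import numpy as np

def poisson_dissimilarity(X):
    """n x n matrix of pairwise Poisson log likelihood ratio dissimilarities."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for i2 in range(i + 1, n):
            pair = X[[i, i2]]
            g = pair.sum(axis=0)                 # gene totals within the pair
            s = pair.sum(axis=1) / pair.sum()    # size factors within the pair
            N = np.outer(s, g)                   # baseline N_hat under H0
            with np.errstate(divide="ignore", invalid="ignore"):
                d = np.where(N > 0, pair / N, 1.0)           # unrestricted MLE
                xlogd = np.where(pair > 0, pair * np.log(d), 0.0)
            D[i, i2] = D[i2, i] = (N - N * d + xlogd).sum()  # LLR statistic
    return D
```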
Results
Simulation set up
Data are generated from the negative binomial distribution X_ij | y_i = k ∼ NB(s_i g_j d_kj, φ)
• Overdispersion:
  • φ = 0.01: very slight overdispersion
  • φ = 0.1: substantial overdispersion
  • φ = 1: very high overdispersion
• s_i ∼ Unif(0.2, 2.2)
• g_j ∼ Exp(1/25)
• K = 3 classes
• p = 10,000 features, of which 30% are differentially expressed
• d_1j = d_2j = d_3j = 1: feature j is not differentially expressed
• otherwise log(d_kj) ∼ N(0, σ^2)
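The design can be reproduced with a short Python/NumPy generator. Here n, σ and the random class assignment are assumptions of this sketch, while K, p, the proportion of differentially expressed features and the distributions of s_i, g_j and log(d_kj) follow the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_nb_data(n=60, p=10_000, K=3, phi=0.1, prop_de=0.30, sigma=1.0):
    """Generate an n x p count matrix and class labels under the stated design."""
    y = rng.integers(0, K, size=n)                  # class labels
    s = rng.uniform(0.2, 2.2, size=n)               # s_i ~ Unif(0.2, 2.2)
    g = rng.exponential(scale=25.0, size=p)         # g_j ~ Exp(1/25), mean 25
    d = np.ones((K, p))
    de = rng.random(p) < prop_de                    # differentially expressed?
    d[:, de] = np.exp(rng.normal(0.0, sigma, size=(K, de.sum())))
    mu = s[:, None] * g[None, :] * d[y]             # mean = s_i * g_j * d_kj
    size = 1.0 / phi                                # NB with Var = mu + mu^2 * phi
    prob = size / (size + mu)
    return rng.negative_binomial(size, prob), y

X, y = simulate_nb_data(phi=0.1)                    # substantial overdispersion
```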