Classification and Clustering of RNAseq data

Verena Zuber, IMISE, University of Leipzig, 5th June 2012


  1. Classification and Clustering of RNAseq data. Verena Zuber, IMISE, University of Leipzig, 5th June 2012

  2. The presented publication

  3. Author of the publication: Daniela Witten

  4. Table of contents

     1 Introduction
     2 Statistical Framework
     3 Supervised Learning: Classification
     4 Unsupervised Learning: Clustering
     5 Results
     6 Conclusion

  5. Biological Background: Transcriptomics I

     Technologies to “measure” the transcriptome:
     • Microarrays
     • Next- or second-generation RNA sequencing (RNAseq)

     Limitations of microarrays:
     • High levels of background noise due to cross-hybridization
     • Only transcripts for which a probe is present on the array can be measured; it is therefore not possible to discover novel mRNAs in a typical microarray experiment.

  6. Biological Background: Transcriptomics II

     Promises of RNAseq:
     • Less noisy than microarray data, since the technology does not suffer from cross-hybridization
     • Detection of novel transcripts and coding regions
     • “It seems certain that RNA sequencing is on track to replace the microarray as the technology of choice for the characterization of gene expression.”

     Challenges in the analysis:
     • Normalization
     • Count data: integer-valued and non-negative

  7. Statistical Framework

  8. Data structure

     $X$: $n \times p$ matrix of sequencing data
     • $i \in \{1, \dots, n\}$: samples (rows)
     • $j \in \{1, \dots, p\}$: features or regions of interest (columns)
     • $s_i$: sample-specific constant
     • $g_j$: gene-specific constant

  9. Distributions

     • Poisson distribution: $X_{ij} \sim \mathrm{Poisson}(N_{ij})$ with $N_{ij} = s_i g_j$
       • expectation: $\mathrm{E}(X_{ij}) = N_{ij}$
       • variance: $\mathrm{Var}(X_{ij}) = N_{ij}$
     • Negative binomial distribution: $X_{ij} \sim \mathrm{NB}(N_{ij}, \phi_j)$ with $N_{ij} = s_i g_j$
       • $\phi_j$ is a gene-specific over-dispersion parameter
       • expectation: $\mathrm{E}(X_{ij}) = N_{ij}$
       • variance: $\mathrm{Var}(X_{ij}) = N_{ij} + N_{ij}^2 \phi_j$
     (a small simulation check of these mean-variance relations follows below)
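A minimal numpy sketch (mine, not from the slides) checking the two mean-variance relations by simulation. The mapping of $(N_{ij}, \phi_j)$ onto numpy's negative binomial parameters is an assumption: with size $r = 1/\phi$ and success probability $r/(r+N)$, draws have mean $N$ and variance $N + N^2\phi$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, phi = 50.0, 0.1                 # N_ij = s_i * g_j, gene-wise dispersion phi_j

pois = rng.poisson(N, size=100_000)
r = 1.0 / phi                      # NB "size" parameter under the assumed mapping
nb = rng.negative_binomial(r, r / (r + N), size=100_000)

print(pois.mean(), pois.var())     # both close to N = 50
print(nb.mean(), nb.var())         # mean ~ 50, variance ~ N + N^2*phi = 300
```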

  10. Distributions dependent on class k

     • $y_i = k \in \{1, \dots, K\}$: factor indicating the membership of sample $i$ in class $k$
     • Poisson distribution: $X_{ij} \mid y_i = k \sim \mathrm{Poisson}(N_{ij} d_{kj})$, $N_{ij} = s_i g_j$
     • Negative binomial distribution: $X_{ij} \mid y_i = k \sim \mathrm{NB}(N_{ij} d_{kj}, \phi_j)$, $N_{ij} = s_i g_j$
     • $d_{kj}$: gene-specific, class-specific factor
     • $d_{kj} > 1$ indicates that the $j$th feature is over-expressed in class $k$ relative to the baseline; $d_{kj} < 1$ indicates that it is under-expressed
     • $C_k$ comprises all samples belonging to class $k$

  11. Poisson Log Linear Model

     Assumptions:
     • Poisson distribution
     • Independence of features

     Poisson log linear model:
     $$X_{ij} \mid y_i = k \sim \mathrm{Poisson}(\hat{N}_{ij} \hat{d}_{kj}), \qquad \hat{N}_{ij} = \hat{s}_i \hat{g}_j$$
     Estimation of the gene-specific constant $g_j$:
     • $\hat{g}_j = \sum_{i=1}^n X_{ij}$

  12. Poisson Log Linear Model

     Estimation of the sample-specific constant $s_i$ (under the identifiability constraint $\sum_{i=1}^n \hat{s}_i = 1$):
     • Total count (ML estimate): $\hat{s}_i = \sum_{j=1}^p X_{ij} \,/\, \sum_{i'=1}^n \sum_{j=1}^p X_{i'j}$
     • Median ratio (Anders and Huber (2010)): $\hat{s}_i = m_i / \sum_{i'=1}^n m_{i'}$ with $m_i = \mathrm{median}_j \Big( X_{ij} \,/\, \big(\prod_{i'=1}^n X_{i'j}\big)^{1/n} \Big)$
     • Quantile (Bullard et al. (2010)): $\hat{s}_i = q_i / \sum_{i'=1}^n q_{i'}$, where $q_i$ is the 75th percentile of the counts for sample $i$
     (a sketch of all three estimators follows below)
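A sketch of the three size-factor estimates, assuming `X` is an n × p numpy array of counts; function names are mine. The median-ratio method needs strictly positive per-gene geometric means, so in practice genes with zero counts would be excluded first.

```python
import numpy as np

def size_factors(X, method="total"):
    """Estimate s_i, normalized so the estimates sum to one."""
    if method == "total":               # ML estimate: row sum / grand total
        s = X.sum(axis=1) / X.sum()
    elif method == "median":            # Anders & Huber (2010) median ratio
        geo = np.exp(np.log(X).mean(axis=0))   # per-gene geometric mean (needs X > 0)
        m = np.median(X / geo, axis=1)
        s = m / m.sum()
    elif method == "quantile":          # Bullard et al. (2010): 75th percentile
        q = np.percentile(X, 75, axis=1)
        s = q / q.sum()
    return s

def gene_totals(X):
    # g_j estimate from the previous slide: per-gene total count
    return X.sum(axis=0)
```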

  13. Poisson Log Linear Model

     Estimation of the (gene- and) class-specific factor $d_{kj}$, writing $X_{C_k j} = \sum_{i \in C_k} X_{ij}$:
     • Maximum likelihood estimate:
       $$\hat{d}_{kj} = \frac{X_{C_k j}}{\sum_{i \in C_k} \hat{N}_{ij}}$$
     • If $X_{C_k j} = 0$, then $\hat{d}_{kj} = 0$. “This can pose a problem for downstream analyses, since this estimate completely precludes the possibility of a nonzero count for feature j arising from an observation in class k.”
     • Bayesian estimate: a Gamma($\beta, \beta$) prior on $d_{kj}$ yields the posterior mean
       $$\hat{d}_{kj} = \frac{X_{C_k j} + \beta}{\sum_{i \in C_k} \hat{N}_{ij} + \beta}$$
     (a sketch of both estimates follows below)
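A sketch of the class-factor estimates, assuming class labels `y` are coded 0, …, K−1 and that `s` and `g` come from the estimators above; setting `beta = 0` recovers the MLE, which is zero whenever the class total $X_{C_k j}$ is zero.

```python
import numpy as np

def d_hat(X, y, s, g, beta=1.0):
    """Posterior-mean estimate of the K x p matrix of class factors d_kj."""
    K = int(y.max()) + 1
    N = np.outer(s, g)                      # N_ij = s_i * g_j
    D = np.empty((K, X.shape[1]))
    for k in range(K):
        Ck = (y == k)                       # samples in class k
        D[k] = (X[Ck].sum(axis=0) + beta) / (N[Ck].sum(axis=0) + beta)
    return D
```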

  14. Transformation for overdispersed data

     • Biological replicates of sequencing data tend to be overdispersed relative to the Poisson model (the variance exceeds the expectation)
     • Power transformation $X'_{ij} \leftarrow X_{ij}^{\alpha}$, where $\alpha \in (0, 1]$ is chosen so that
       $$\sum_{i=1}^n \sum_{j=1}^p \frac{(X'_{ij} - \chi'_{ij})^2}{\chi'_{ij}} \approx (n-1)(p-1), \qquad \chi'_{ij} = \frac{\big(\sum_{j'=1}^p X'_{ij'}\big)\big(\sum_{i'=1}^n X'_{i'j}\big)}{\sum_{i'=1}^n \sum_{j'=1}^p X'_{i'j'}}$$
       (a goodness-of-fit test statistic; see the search sketch below)
     • “Though the resulting transformed data are not integer-valued, we nonetheless model them using the Poisson distribution.”
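A grid-search sketch of the α selection, under my reading of the criterion: refit the fitted values $\chi'_{ij}$ from the row and column sums of the transformed matrix and keep the α whose statistic lands closest to $(n-1)(p-1)$. It assumes every row and column of `X` has a nonzero sum; the actual selection procedure may differ.

```python
import numpy as np

def power_transform(X, grid=np.linspace(0.01, 1.0, 100)):
    """Pick alpha in (0, 1] by the goodness-of-fit criterion; return X' and alpha."""
    n, p = X.shape
    target = (n - 1) * (p - 1)
    best = None
    for alpha in grid:
        Xp = X ** alpha
        # chi'_ij = (row sum * column sum) / grand total of the transformed matrix
        chi = np.outer(Xp.sum(axis=1), Xp.sum(axis=0)) / Xp.sum()
        stat = ((Xp - chi) ** 2 / chi).sum()
        if best is None or abs(stat - target) < abs(best[1] - target):
            best = (alpha, stat)
    return X ** best[0], best[0]
```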

  15. Supervised Learning: Classification

  16. Poisson linear discriminant analysis

     • Rather a diagonal discriminant analysis (DDA), owing to the independence assumption
     • Bayes’ rule gives the probability of belonging to class $k$ given the test data $x^\star$:
       $$\mathrm{prob}(k \mid x^\star) = \frac{\pi_k f(x^\star \mid k)}{f(x^\star)} \propto \pi_k f(x^\star \mid k)$$
     • where $f(x^\star \mid k)$ is given by $X_{ij} \mid y_i = k \sim \mathrm{Poisson}(N_{ij} d_{kj})$, $N_{ij} = s_i g_j$
     • $\pi_k$ represents the a priori mixing probability for class $k$

  17. Discriminant scores

     • Poisson discriminant analysis:
       $$\log\{\mathrm{prob}(k \mid x^\star)\} = \log\frac{\pi_k f(x^\star \mid k)}{f(x^\star)} \propto \sum_{j=1}^p X_j^\star \log \hat{d}_{kj} - \hat{s}^\star \sum_{j=1}^p \hat{g}_j \hat{d}_{kj} + \log \hat{\pi}_k$$
     • For comparison, Fisher’s DDA (Gaussian distribution):
       $$\log\{\mathrm{prob}(k \mid x^\star)\} = \log\frac{\pi_k f(x^\star \mid k)}{f(x^\star)} \propto x^{\star\top} V^{-1} \mu_k - \frac{1}{2}\, \mu_k^\top V^{-1} \mu_k + \log(\pi_k)$$
       where $\mu_k$ is the expectation in group $k$, and $V$ is the diagonal variance matrix, equal across all $K$ groups
     (a sketch of the resulting classification rule follows below)
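A sketch of the resulting classification rule, with everything (`d_hat` as a K × p array, `g_hat`, the test sample's size factor `s_star`, priors `pi`) assumed to be estimated as on the earlier slides. Using the Bayesian estimate with β > 0 keeps $\log \hat{d}_{kj}$ finite.

```python
import numpy as np

def classify(x_star, s_star, g_hat, d_hat, pi):
    """Assign the class k with the largest Poisson discriminant score."""
    scores = (x_star * np.log(d_hat)).sum(axis=1) \
             - s_star * (g_hat * d_hat).sum(axis=1) \
             + np.log(pi)
    return int(np.argmax(scores))
```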

  18. The sparse PLDA classifier

     • The standard estimates $\hat{d}_{kj}$ are unequal to 1 for all $p$ features
     • But for high-dimensional transcriptomics data, classifiers built on a smaller subset of features are desirable
     • Soft-threshold estimate (similar to PAM): $\hat{d}_{kj} = 1 + S(a/b - 1,\; \rho/\sqrt{b})$
     • Soft-threshold operator with penalization parameter $\rho$:
       $$S(a/b - 1,\; \rho/\sqrt{b}) = \mathrm{sign}(a/b - 1)\,\big(|a/b - 1| - \rho/\sqrt{b}\big)_+$$
     • $a = X_{C_k j} + \beta$ (numerator of the Bayesian estimate $\hat{d}_{kj}$)
     • $b = \sum_{i \in C_k} \hat{N}_{ij} + \beta$ (denominator of the Bayesian estimate $\hat{d}_{kj}$)
     • Shrinks $\hat{d}_{kj}$ towards 1 if $|a/b - 1| < \rho/\sqrt{b}$, and thus excludes feature $j$ from the classification rule (one-line sketch below)
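The soft-threshold step as a short sketch, with `a` and `b` the per-class numerator and denominator arrays defined above and `rho` the penalization parameter; features shrunk exactly to $\hat{d}_{kj} = 1$ drop out of the score.

```python
import numpy as np

def soft_threshold_d(a, b, rho):
    """Soft-thresholded estimate d_kj = 1 + S(a/b - 1, rho/sqrt(b))."""
    x = a / b - 1.0
    return 1.0 + np.sign(x) * np.maximum(np.abs(x) - rho / np.sqrt(b), 0.0)
```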

  19. Unsupervised Learning: Clustering

  20. Poisson dissimilarity

     • Aim: clustering based on an $n \times n$ dissimilarity matrix between observations
     • Connection between the Euclidean distance and the log likelihood ratio statistic under a Gaussian model: $X_{ij} \sim N(\mu_{ij}, \sigma^2)$, $X_{i'j} \sim N(\mu_{i'j}, \sigma^2)$
     • Testing $H_0$: $\mu_{ij} = \mu_{i'j}$ against $H_1$: “$\mu_{ij}$ and $\mu_{i'j}$ are unrestricted” results in the log likelihood ratio statistic
       $$\sum_{j=1}^p \Big(X_{ij} - \frac{X_{ij} + X_{i'j}}{2}\Big)^2 + \sum_{j=1}^p \Big(X_{i'j} - \frac{X_{ij} + X_{i'j}}{2}\Big)^2 = \sum_{j=1}^p \frac{(X_{ij} - X_{i'j})^2}{2} \propto \|x_i - x_{i'}\|^2$$

  21. Poisson dissimilarity

     • Poisson distribution “restricted to $x_i$ and $x_{i'}$”:
       $$X_{ij} \sim \mathrm{Poisson}(\hat{N}_{ij} \hat{d}_{ij}), \qquad X_{i'j} \sim \mathrm{Poisson}(\hat{N}_{i'j} \hat{d}_{i'j})$$
     • Testing $H_0$: $d_{ij} = d_{i'j} = 1$ against $H_1$: “$d_{ij}$ and $d_{i'j}$ are unrestricted” results in the log likelihood ratio statistic
       $$\sum_{j=1}^p \Big[ (\hat{N}_{ij} + \hat{N}_{i'j}) - (\hat{N}_{ij} \hat{d}_{ij} + \hat{N}_{i'j} \hat{d}_{i'j}) + (X_{ij} \log \hat{d}_{ij} + X_{i'j} \log \hat{d}_{i'j}) \Big]$$
     • Can be used as the dissimilarity between $x_i$ and $x_{i'}$; it is nonnegative and equals zero if $x_i = x_{i'}$ (a sketch of the pairwise matrix follows below)
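A sketch of the pairwise dissimilarity matrix. Refitting $\hat{N}$ within each pair and using the Bayesian estimate with a small β are my reading of the “restricted to $x_i$ and $x_{i'}$” construction, not necessarily the paper's exact procedure; with identical rows the statistic comes out exactly zero, as stated on the slide.

```python
import numpy as np

def poisson_dissimilarity(X, beta=1e-8):
    """n x n matrix of pairwise Poisson log likelihood ratio dissimilarities."""
    n, p = X.shape
    D = np.zeros((n, n))
    for i in range(n):
        for i2 in range(i + 1, n):
            pair = X[[i, i2]]                       # 2 x p submatrix
            s = pair.sum(axis=1) / pair.sum()       # size factors within the pair
            g = pair.sum(axis=0)                    # gene totals within the pair
            N = np.outer(s, g)                      # fitted N_hat, restricted to the pair
            d = (pair + beta) / (N + beta)          # per-observation d_hat
            stat = (N - N * d + pair * np.log(d)).sum()
            D[i, i2] = D[i2, i] = stat
    return D
```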

  22. Results

  23. Simulation set-up

     Data are generated from the negative binomial distribution $X_{ij} \mid y_i = k \sim \mathrm{NB}(s_i g_j d_{kj}, \phi)$
     • Overdispersion:
       • $\phi = 0.01$: very slight overdispersion
       • $\phi = 0.1$: substantial overdispersion
       • $\phi = 1$: very high overdispersion
     • $s_i \sim \mathrm{Unif}(0.2, 2.2)$
     • $g_j \sim \mathrm{Exp}(1/25)$
     • $K = 3$ classes
     • $p = 10{,}000$ features, of which 30% are differentially expressed
     • $d_{1j} = d_{2j} = d_{3j} = 1$: feature $j$ is not differentially expressed
     • otherwise $\log(d_{kj}) \sim N(0, \sigma^2)$
     (a generator sketch follows below)
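A sketch reproducing this setup. Assumptions on my part: σ is not given on the slide, so σ = 1 is a placeholder; Exp(1/25) is read as rate 1/25 (mean 25); n = 60; and the NB is parameterized via size $r = 1/\phi$ and probability $r/(r + \text{mean})$ so that the variance is mean + mean²·φ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K, phi, sigma = 60, 10_000, 3, 0.1, 1.0   # n and sigma are placeholders

s = rng.uniform(0.2, 2.2, n)          # sample-specific constants
g = rng.exponential(25.0, p)          # Exp(1/25), i.e. mean 25
y = rng.integers(0, K, n)             # class labels

d = np.ones((K, p))                   # d_kj = 1: not differentially expressed
diff = rng.random(p) < 0.30           # ~30% differentially expressed
d[:, diff] = np.exp(rng.normal(0.0, sigma, (K, diff.sum())))

mean = s[:, None] * g[None, :] * d[y]  # s_i * g_j * d_{y_i, j}
r = 1.0 / phi
X = rng.negative_binomial(r, r / (r + mean))   # n x p count matrix
```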
