Statistical analysis of RNASeq Data
Introduction to RNA-seq data analysis
dominique-laurent.couturier@cruk.cam.ac.uk [Bioinformatics core]
(Source: O. Rueda, CRUK-CI; G. Marot, INRIA)
Introduction
Grand Picture of Statistics
◮ Idea: EGF is differentially expressed (DE) in luminal (L) and basal (B) cells
◮ Statistical hypotheses: H0: $\mu_B = \mu_L$ against H1: $\mu_B \neq \mu_L$
◮ Sample data (RNASeq counts): $(x_{B,1}; x_{B,2}; \ldots; x_{B,n_B})$ and $(x_{L,1}; x_{L,2}; \ldots; x_{L,n_L})$
◮ Point estimation: $\hat{\mu}_B - \hat{\mu}_L$
◮ Inference: $T_{obs} = \dfrac{\hat{\mu}_B - \hat{\mu}_L}{s_p \sqrt{\frac{1}{n_B} + \frac{1}{n_L}}} \sim St_{n_B + n_L - 2}$
Outline
◮ 1/ Analysis of gene expression measured with Microarrays
  ⊲ 1a/ Normal distribution
  ⊲ 1b/ Test of equality of means for two samples: T-test
  ⊲ 1c/ Test of equality of means for > 2 samples: ANOVA
  ⊲ 1d/ Test of equality of means for 2 categorical predictors: ANOVA
  ⊲ 1e/ Test of equality of means for > 2 predictors: Linear model
  ⊲ 1f/ Confounding
◮ 2/ Analysis of gene expression measured by RNAseq
  ⊲ Generalisation of the linear model: Negative Binomial regression
  ⊲ 2a/ Negative Binomial distribution
  ⊲ 2b/ Nuisance parameter estimation: Shrinkage estimator
  ⊲ 2c/ Controlling for Library size: Offset
◮ 3/ Controlling for multiple testing
  ⊲ 3a/ Family-wise error rate
  ⊲ 3b/ False discovery rate
Part I: Analysis of gene expression measured with Microarrays
1a/ Normal distribution
$Y \sim N(\mu, \sigma^2)$, $f_Y(y) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}}$, $E[Y] = \mu$, $Var[Y] = \sigma^2$
[Figure: probability density function $f_Y(y \mid \mu, \sigma)$; 68.27%, 95.45% and 99.73% of the probability lies within $\mu \pm \sigma$, $\mu \pm 2\sigma$ and $\mu \pm 3\sigma$ respectively]
◮ Suitable model for many variables
[Figure: gene expression values of gene 'X' of basal cells of 33 mice]
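The coverage percentages quoted for the normal density can be checked numerically. The following is a minimal R sketch; the values of `mu` and `sigma` are arbitrary assumptions chosen only to show that the result does not depend on them.

```r
# Probability that a normal variable lies within k standard deviations of its mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
print(round(100 * coverage, 2))

# Same computation for an arbitrary (assumed) mean and standard deviation
mu <- 2.9
sigma <- 0.97
coverage_shifted <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)
print(all.equal(coverage, coverage_shifted))
```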
1b/ Test of equality of means for two samples
[Figure: intensity expression of gene 'X'; Basal (n=33), Luminal (n=43)]
We test H0: $\mu_B - \mu_L = 0$ against H1: $\mu_B - \mu_L \neq 0$.
We know:
◮ Student's t-test [assume $\sigma^2_B = \sigma^2_L$]: $T = \dfrac{\hat{\mu}_B - \hat{\mu}_L}{s_p \sqrt{\frac{1}{n_B} + \frac{1}{n_L}}} \sim t_{n_B + n_L - 2}$,
◮ $s_p = \sqrt{\dfrac{s^2_B (n_B - 1) + s^2_L (n_L - 1)}{n_B + n_L - 2}}$.
[Figure: densities of Student's t distribution with 2, 10, 50 and 100 degrees of freedom; the central 95% regions are bounded by ±4.303, ±2.228, ±2.009 and ±1.984 respectively]

        Two Sample t-test

    data:  Basal and Luminal
    t = 6.6751, df = 74, p-value = 3.941e-09
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     1.048457 1.940748
    sample estimates:
    mean of x mean of y
     2.923908  1.429305
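As an illustration, the pooled t statistic can be computed by hand and compared with R's `t.test`. This is a sketch on simulated data: the group sizes follow the slide, but the means, standard deviation and seed are assumptions, so the numbers will not reproduce the output shown for gene 'X'.

```r
set.seed(1)
# Simulated intensities (assumed means and sd; only the group sizes come from the slide)
basal   <- rnorm(33, mean = 3.0, sd = 1)
luminal <- rnorm(43, mean = 1.5, sd = 1)

n_B <- length(basal)
n_L <- length(luminal)

# Pooled standard deviation s_p
s_p <- sqrt((var(basal) * (n_B - 1) + var(luminal) * (n_L - 1)) / (n_B + n_L - 2))

# Observed t statistic, degrees of freedom and two-sided p-value
t_obs <- (mean(basal) - mean(luminal)) / (s_p * sqrt(1 / n_B + 1 / n_L))
df <- n_B + n_L - 2
p_value <- 2 * pt(-abs(t_obs), df)

# Built-in equivalent (var.equal = TRUE requests the pooled-variance test)
fit <- t.test(basal, luminal, var.equal = TRUE)
print(c(by_hand = t_obs, t_test = unname(fit$statistic)))

# 95% critical values of Student's t for 2, 10, 50 and 100 degrees of freedom
crit <- qt(0.975, df = c(2, 10, 50, 100))
print(round(crit, 3))
```

The critical values shrink towards the normal quantile 1.96 as the degrees of freedom grow, which is why small samples need a larger observed T to reject H0.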
1b/ Test of equality of means for two samples
[Figure: intensity expression of gene 'X'; Basal (n=33), Luminal (n=43)]
◮ Modelling 1:
  $Y_{i(B)} = \mu_B + \epsilon_i$
  $Y_{i(L)} = \mu_L + \epsilon_i$
◮ Modelling 2:
  $Y_i = \mu_B + \delta_L I(i \in L) + \epsilon_i = \beta_0 + \beta_1 X_1 + \epsilon_i$
where $i = 1, \ldots, n$; $\epsilon_i \sim N(0, \sigma^2)$.
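The two modellings differ only in how the group membership is coded in the design matrix $X$. A small R sketch with hypothetical toy labels (2 basal and 3 luminal samples, not the slide's data) shows both codings:

```r
# Hypothetical toy labels: 2 basal and 3 luminal samples
celltype <- factor(c("Basal", "Basal", "Luminal", "Luminal", "Luminal"))

# Modelling 1: one indicator column per group, no intercept
X1 <- model.matrix(~ celltype - 1)
print(X1)

# Modelling 2: intercept (basal mean) plus an indicator for luminal samples
X2 <- model.matrix(~ celltype)
print(X2)
```

In the second matrix the column `celltypeLuminal` plays the role of $X_1 = I(i \in L)$, so $\beta_0 = \mu_B$ and $\beta_1 = \delta_L$.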
1b/ Test of equality of means for two samples
[Figure: intensity expression of gene 'X'; Basal (n=33), Luminal (n=43)]
◮ Modelling 1:
  $Y_i = \mu_B I(i \in B) + \mu_L I(i \in L) + \epsilon_i$
  $Y = X\beta + \epsilon$
where $i = 1, \ldots, n$; $\epsilon_i \sim N(0, \sigma^2)$.

    Call:
    lm(formula = expression ~ celltype - 1, data = microarrays)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.64401 -0.58586  0.01473  0.65051  2.47771

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    celltypeBasal     2.9239     0.1684  17.361  < 2e-16 ***
    celltypeLuminal   1.4293     0.1475   9.687 8.47e-15 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.9675 on 74 degrees of freedom
    Multiple R-squared:  0.8423,    Adjusted R-squared:  0.838
    F-statistic: 197.6 on 2 and 74 DF,  p-value: < 2.2e-16
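A minimal R sketch of Modelling 1 on simulated data: the data frame below is a stand-in for `microarrays` with assumed means, standard deviation and seed, so its numbers differ from the output above. It checks that the coefficients are the two group means, and that without an intercept R reports an R-squared measured against zero rather than against the overall mean.

```r
set.seed(1)
# Simulated stand-in for the 'microarrays' data frame (values are assumptions)
microarrays <- data.frame(
  celltype   = factor(rep(c("Basal", "Luminal"), times = c(33, 43))),
  expression = c(rnorm(33, mean = 3.0, sd = 1), rnorm(43, mean = 1.5, sd = 1))
)

# Modelling 1: no intercept, one coefficient per cell type
fit1 <- lm(expression ~ celltype - 1, data = microarrays)
group_means <- tapply(microarrays$expression, microarrays$celltype, mean)

# The estimated coefficients coincide with the sample means of each group
print(cbind(coef = coef(fit1), mean = group_means))

# Without an intercept, R-squared uses the uncentred total sum of squares
y <- microarrays$expression
r2_uncentred <- 1 - sum(residuals(fit1)^2) / sum(y^2)
print(c(summary = summary(fit1)$r.squared, by_hand = r2_uncentred))
```

This explains why the R-squared and F-statistic of this parametrisation are much larger than those of the intercept model fitted to the same data: here they test whether both means are zero, not whether the means differ.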
1b/ Test of equality of means for two samples
[Figure: intensity expression of gene 'X'; Basal (n=33), Luminal (n=43)]
◮ Modelling 2:
  $Y_i = \mu_B + \delta_L I(i \in L) + \epsilon_i = \beta_0 + \beta_1 X_1 + \epsilon_i$
  $Y = X\beta + \epsilon$
where $i = 1, \ldots, n$; $\epsilon_i \sim N(0, \sigma^2)$.

    Call:
    lm(formula = expression ~ celltype, data = microarrays)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.64401 -0.58586  0.01473  0.65051  2.47771

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)       2.9239     0.1684  17.361  < 2e-16 ***
    celltypeLuminal  -1.4946     0.2239  -6.675 3.94e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.9675 on 74 degrees of freedom
    Multiple R-squared:  0.3758,    Adjusted R-squared:  0.3674
    F-statistic: 44.56 on 1 and 74 DF,  p-value: 3.941e-09
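The t value and p-value of `celltypeLuminal` match the two-sample t-test up to sign, because $\beta_1 = \delta_L = \mu_L - \mu_B$. A short R sketch on simulated data (again a stand-in for `microarrays` with assumed values) illustrates the equivalence:

```r
set.seed(1)
# Simulated stand-in for the 'microarrays' data frame (values are assumptions)
microarrays <- data.frame(
  celltype   = factor(rep(c("Basal", "Luminal"), times = c(33, 43))),
  expression = c(rnorm(33, mean = 3.0, sd = 1), rnorm(43, mean = 1.5, sd = 1))
)

# Modelling 2: intercept = basal mean, slope = luminal mean minus basal mean
fit2  <- lm(expression ~ celltype, data = microarrays)
coefs <- summary(fit2)$coefficients
delta_L <- coefs["celltypeLuminal", "Estimate"]

# Pooled-variance t-test on the same data (reports Basal minus Luminal)
tt <- t.test(expression ~ celltype, data = microarrays, var.equal = TRUE)

print(c(lm_t = coefs["celltypeLuminal", "t value"], t_test_t = unname(tt$statistic)))
print(c(lm_p = coefs["celltypeLuminal", "Pr(>|t|)"], t_test_p = tt$p.value))
```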
1c/ Test of equality of means for > 2 samples
◮ One-way ANOVA hypotheses
  ⊲ H0: $\mu_L = \mu_P = \mu_V$,
  ⊲ H1: $\mu_k \neq \mu_l$ for at least one pair $(k, l)$.
[Figure: intensity expression of gene 'X'; Virgin, Pregnant, Lactating]
◮ Modelling 1:
  $Y_{i(L)} = \mu_L + \epsilon_i$
  $Y_{i(P)} = \mu_P + \epsilon_i$
  $Y_{i(V)} = \mu_V + \epsilon_i$
  $Y_i = \mu_L I(i \in L) + \mu_P I(i \in P) + \mu_V I(i \in V) + \epsilon_i$
  $Y = X\beta + \epsilon$
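A possible R sketch of the one-way ANOVA as a linear model, on simulated data (the group sizes, means, standard deviation and seed are assumptions for illustration). The F-test of H0 compares the three-mean model with a model that has a single common mean.

```r
set.seed(1)
# Simulated data: 10 samples per group with assumed means 2.0, 2.5 and 3.0
status <- factor(rep(c("Lactating", "Pregnant", "Virgin"), each = 10))
expression <- rnorm(30, mean = c(2.0, 2.5, 3.0)[as.integer(status)], sd = 1)

# Modelling 1: one mean per group (no intercept)
fit_means <- lm(expression ~ status - 1)
print(coef(fit_means))

# H0 (one common mean) against H1 (group-specific means)
fit_null <- lm(expression ~ 1)
fit_full <- lm(expression ~ status)
cmp <- anova(fit_null, fit_full)
print(cmp)

# The classical one-way ANOVA table gives the same F statistic
tab <- anova(fit_full)
print(tab)
```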