
TCGA gene expression data Nianxiang Zhang BCB, MDACC 7/14/10 - PowerPoint PPT Presentation



  1. Assessment of batch effects in TCGA gene expression data Nianxiang Zhang BCB, MDACC 7/14/10

  2. Outline
     • TCGA data used
     • Batch effects in TCGA data
     • Identification of batch effects
       – Algorithms
         • PCA and hierarchical clustering
         • Correlation of correlations (CR)
       – Batch effects in TCGA gene expression data on GBM and ovarian cancer
     • Adjustment for batch effects
       – Methods of batch effects adjustment
       – Adjustment for batch effects in TCGA gene expression data on GBM and ovarian cancer
       – Comparison of adjustment methods
     • Implications

  3. Data: Level 3 gene expression data on GBM and ovarian cancer
     • 3 platforms
       – Affymetrix U133a
       – Agilent
       – Affymetrix Exon array
     • GBM
       – 11 batches, 372 tumor samples
     • OV
       – 13 batches, 511 samples; 30 samples excluded

  4. Batch Effects in TCGA
     • TCGA data are collected in multiple batches
     • TCGA data come from multiple platforms, analyses, and institutions
     • Batch effects can be very important for biological and clinical predictions

  5. OV Data Distribution By Batch

  6. Ovarian Cancer Data

  7. GBM Data

  8. Identification of Batch Effects
     • Standard techniques
       – Principal component analysis (PCA)
       – Clustering analysis (1-Pearson metric, Ward linkage)
     • Correlation of correlations (CR)
       – A scalar index of the similarity of batches in terms of gene-gene interactions
         • CR=1 if batches are identical
         • CR=0 if batches are uncorrelated

  9. Calculation of Correlation of Correlations (CR)
     • U_ij denotes the correlation of genes i and j in batch 1
     • V_ij denotes the correlation of genes i and j in batch 2
     • CR is the correlation between the U_ij and the V_ij, taken over all gene pairs
       (Scherf, …. Weinstein, Nature Genetics 2000; 24:236)
     • Permutation test of CR provides the statistical significance of batch effects

  10. Visualization of the Correlation of Correlations Calculation (for 4 genes and batches consisting of 4 and 3 samples)
      • Batch 1 (Genes 1 to 4): R_12 = Corr(1,2), and likewise for every other gene pair
      • Batch 2 (Genes 1 to 4): R'_12 = Corr(1,2), and likewise for every other gene pair
      • CR = Corr[(R_12, R_13, R_14, R_23, R_24, R_34), (R'_12, R'_13, R'_14, R'_23, R'_24, R'_34)]
      • That is, calculate a scalar quantity: the correlation between the vector of 6 correlation coefficients for Batch 1 and the vector of 6 correlation coefficients for Batch 2
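The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names and the toy expression values are invented here, and only the standard library is used so the sketch runs anywhere.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def gene_gene_correlations(batch):
    """batch: list of genes, each a list of expression values across samples.

    Returns the correlations of all gene pairs (i < j) in a fixed order,
    i.e. (R_12, R_13, R_14, R_23, R_24, R_34) for 4 genes.
    """
    return [pearson(batch[i], batch[j])
            for i, j in combinations(range(len(batch)), 2)]


def correlation_of_correlations(batch1, batch2):
    """CR: correlation between the two vectors of gene-gene correlations."""
    return pearson(gene_gene_correlations(batch1),
                   gene_gene_correlations(batch2))


# Made-up toy data (assumption): 4 genes, 4 samples in batch 1, 3 in batch 2.
batch1 = [[1.0, 2.0, 3.0, 4.5],
          [2.0, 1.5, 3.5, 4.0],
          [5.0, 3.0, 2.5, 1.0],
          [1.0, 3.0, 2.0, 2.5]]
batch2 = [[0.5, 1.5, 3.0],
          [1.0, 1.2, 2.8],
          [4.0, 3.5, 1.0],
          [2.0, 2.6, 2.1]]

cr = correlation_of_correlations(batch1, batch2)
print(f"{len(gene_gene_correlations(batch1))} gene pairs, CR = {cr:.3f}")
```

The two batches may have different numbers of samples, because CR compares gene-pair correlations rather than samples; only the gene list has to match.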

  11. Permutation test of CRs
      • We scramble the batch labels of the samples in two batches and calculate CR between the two permuted batches; repeating this gives the distribution of CR under H_0.
      • The actual CR between the two batches is compared with this distribution (two-sided p).
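A rough sketch of such a permutation test follows. The slide does not say how the two-sided p-value is formed, so the convention used here (twice the smaller tail, with an add-one correction) is an assumption, as are the function names, the number of permutations, and the simulated toy data.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cr(samples1, samples2):
    """CR between two batches; each batch is a list of samples,
    each sample a list of expression values (one per gene)."""
    def pair_corrs(samples):
        genes = list(zip(*samples))  # transpose to one tuple per gene
        return [pearson(genes[i], genes[j])
                for i, j in combinations(range(len(genes)), 2)]
    return pearson(pair_corrs(samples1), pair_corrs(samples2))


def permutation_test(samples1, samples2, n_perm=200, seed=0):
    """Return (observed CR, two-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = cr(samples1, samples2)
    pooled = samples1 + samples2
    n1 = len(samples1)
    null = []
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # scramble the batch labels
        null.append(cr(shuffled[:n1], shuffled[n1:]))
    # Assumed two-sided convention: twice the smaller tail, capped at 1.
    p_low = (1 + sum(v <= observed for v in null)) / (n_perm + 1)
    p_high = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, min(1.0, 2 * min(p_low, p_high))


def toy_batch(rng, n_samples, n_genes, shared):
    """Simulated samples; `shared` controls how strongly genes co-vary."""
    batch = []
    for _ in range(n_samples):
        factor = rng.gauss(0, 1)
        batch.append([shared * factor + rng.gauss(0, 1)
                      for _ in range(n_genes)])
    return batch


data_rng = random.Random(1)
batch1 = toy_batch(data_rng, 10, 5, shared=2.0)  # strongly co-varying genes
batch2 = toy_batch(data_rng, 8, 5, shared=0.0)   # independent genes
observed, p_value = permutation_test(batch1, batch2)
print(f"observed CR = {observed:.3f}, two-sided p = {p_value:.3f}")
```

The add-one correction keeps the p-value strictly positive, so its smallest possible value is set by the number of permutations.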

  12. PCA: GBM data (figure labels: Batch 16, 20; Batch 16)

  13. GBM:Affy

  14. GBM:Agilent

  15. GBM:Exon

  16. Tests for batch effects using CR:GBM

  17. Q-values for testing batch effects in GBM data

  18. PCA-Ovarian data Batch 9, 11

  19. Ovarian-Affymetrix

  20. Ovarian-Agilent

  21. Ovarian-Exon Batch 9, 11

  22. Tests for batch effects using CR:OV

  23. Q-values for batch effects in OV data

  24. Batch effects in unified OV gene expression data (panels: Unadjusted Affy U133a Data; Unified Gene Expression Data)

  25. Adjustment of Batch Effects
      • Empirical Bayes (ComBat)
        – Parametric prior (EBP)
        – Nonparametric prior (EBNP)
      • Median Polish
        – Overall (MP)
        – Within each batch (MPB)
      • ANOVA
        – Naïve ANOVA (AN)
        – With variance shrinkage (WAN)
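Of the methods listed, median polish is the simplest to illustrate. The sketch below is a generic Tukey median polish on a small matrix, using only the Python standard library. The slide does not describe how the polish was applied to the TCGA data, so the matrix layout (genes as rows, samples as columns), the function name, and the toy values are all assumptions made for illustration.

```python
from statistics import median


def median_polish(matrix, n_iter=10, tol=1e-9):
    """Tukey median polish.

    Decomposes matrix[i][j] into overall + row_eff[i] + col_eff[j] + resid[i][j]
    by alternately sweeping out row medians and column medians.
    """
    resid = [row[:] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        # Stop once a full sweep no longer changes anything.
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


# Made-up toy matrix (assumption): 3 genes (rows) by 4 samples (columns).
matrix = [[5.0, 6.0, 9.0, 5.5],
          [7.0, 8.5, 11.0, 7.2],
          [3.0, 4.1, 7.5, 3.3]]
overall, row_eff, col_eff, resid = median_polish(matrix)
print("overall:", overall)
print("column (sample) effects:", [round(c, 3) for c in col_eff])
```

Under the assumed layout, one plausible reading of the two variants is that MP polishes all samples together while MPB runs the same routine separately on each batch's columns; the slide does not confirm either detail.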

  26. Batch effect adjustment

  27. GBM:Affy

  28. GBM Agilent

  29. GBM:Exon data

  30. Effect of batch effects adjustment on gene expression

  31. Assessment of batch effects with adjustments. CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batches 9 and 11-15. [Figure: cumulative distribution function; x-axis: p-values; y-axis: cumulative probability; reference line at p-value = 0.05; curves: U, EBP, EBNP, MP, MPB, AN, WAN; annotation: Unadjusted data]

  32. Assessment of batch effects with adjustments. CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data. [Figure: cumulative distribution function; x-axis: p-values; y-axis: cumulative probability; reference line at p-value = 0.05; curves: U, EBP, EBNP, MP, MPB, AN, WAN; annotations: Unadjusted data, MPB adjusted data]

  33. Assessment of batch effects with adjustments: OV. CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batches 9 and 11 to 15. [Figure: cumulative distribution function; x-axis: p-values; y-axis: cumulative probability; reference line at p-value = 0.05; curves: U, EBP, EBNP, MP, MPB, AN, WAN; annotations: Unadjusted data, MPB adjusted data]
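The CDF plots on slides 31 to 33 summarize many permutation-test p-values at once. A minimal sketch of that summary is given below; the p-values are invented purely for illustration and are not results from the presentation, and the function name is hypothetical.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical p-values (assumption), one per batch-effect test.
p_unadjusted = [0.001, 0.004, 0.01, 0.02, 0.03, 0.2, 0.5, 0.9]
p_adjusted = [0.03, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85, 0.95]

for label, pvals in (("unadjusted", p_unadjusted), ("adjusted", p_adjusted)):
    cdf = ecdf(pvals)
    print(f"{label}: fraction of tests with p <= 0.05 = {cdf(0.05):.3f}")
```

When no batch effect is present, p-values are roughly uniform and the CDF tracks the diagonal; a curve that rises steeply near zero means many tests are significant.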

  34. Association of Clinical Outcomes with Batches. Overall survival by batch (TCGA ovarian cancer data). [Figure labels: P<0.001; P=0.018; Batch 9 (appears twice)]

  35. Implications
      • Assessments based on the correlation of correlations (CR) can be used to identify batch effects in TCGA data. This is complemented by principal component analysis and hierarchical clustering.
      • Batch effects exist in TCGA GBM and ovarian cancer data.
      • Be cautious when adjusting for batch effects.
        – The batch differences may be technical or biological
          • We do not want to correct biological differences
          • We do want to correct technical differences (bias)
        – Some methods may over-massage the data
      • The impact of batch effects on clinical predictions from the data remains to be determined.

  36. Thank you!
