Assessment of batch effects in TCGA gene expression data Nianxiang Zhang BCB, MDACC 7/14/10
Outline • TCGA data used • Batch effects in TCGA data • Identification of batch effects – Algorithms • PCA and Hierarchical clustering • Correlation of correlations (CR) – Batch effects in TCGA gene expression data on GBM and ovarian cancer • Adjustment for batch effects – Methods of batch effects adjustment – Adjustment for batch effects in TCGA gene expression data on GBM and ovarian cancer – Comparison of adjustment methods • Implications
Data Level 3 gene expression data on GBM and ovarian cancer • 3 platforms – Affymetrix U133a – Agilent – Affymetrix Exon array • GBM – 11 batches, 372 tumor samples • OV – 13 batches, 511 samples (30 samples excluded)
Batch Effects in TCGA • TCGA data are collected in multiple batches • TCGA data come from multiple platforms, analyses, and institutions • Batch effects can be very important for biological and clinical predictions
OV Data Distribution By Batch
Ovarian Cancer Data
GBM Data
Identification of Batch Effects • Standard techniques – Principal component analysis (PCA) – Clustering analysis (distance = 1 − Pearson correlation, Ward linkage) • Correlation of correlations (CR) – A scalar index of the similarity of batches in terms of gene-gene correlations • CR=1 if batches are identical • CR=0 if batches are uncorrelated
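Calculation of Correlation of Correlations (CR) • $U_{ij}$ denotes the correlation of genes $i$ and $j$ in batch 1 • $V_{ij}$ denotes the correlation of genes $i$ and $j$ in batch 2 • CR is the correlation between the $U_{ij}$ and the $V_{ij}$, taken over all gene pairs $i<j$: $\mathrm{CR} = \mathrm{Corr}\big((U_{ij})_{i<j}, (V_{ij})_{i<j}\big)$ (Scherf, …. Weinstein, Nature Genetics 2000; 24:236) • Permutation test of CR provides the statistical significance of batch effects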
Visualization of the Correlation of Correlations Calculation (for 4 genes and batches consisting of 4 and 3 samples) • Batch 1 (Genes 1 to 4): $R_{12} = \mathrm{Corr}(1,2)$, and likewise for every other gene pair • Batch 2 (Genes 1 to 4): $R'_{12} = \mathrm{Corr}(1,2)$, and likewise for every other gene pair • Then calculate a scalar quantity, the correlation between the vector of 6 correlation coefficients for Batch 1 and the vector of 6 correlation coefficients for Batch 2: $\mathrm{CR} = \mathrm{Corr}\big[(R_{12}, R_{13}, R_{14}, R_{23}, R_{24}, R_{34}),\ (R'_{12}, R'_{13}, R'_{14}, R'_{23}, R'_{24}, R'_{34})\big]$
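The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the TCGA analysis: the helper names (`pearson`, `gene_gene_correlations`, `correlation_of_correlations`) and the toy expression values are hypothetical, and the data layout (one row per gene, one column per sample) is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def gene_gene_correlations(batch):
    """batch: list of gene rows, each a list of per-sample values.

    Returns the correlations for all gene pairs i < j, in a fixed order.
    """
    return [pearson(batch[i], batch[j])
            for i, j in combinations(range(len(batch)), 2)]


def correlation_of_correlations(batch1, batch2):
    """CR between two batches measured on the same genes, in the same order."""
    return pearson(gene_gene_correlations(batch1),
                   gene_gene_correlations(batch2))


# Hypothetical toy data: 4 genes; batch 1 has 4 samples, batch 2 has 3.
batch1 = [[1.0, 2.0, 3.0, 4.5],
          [2.1, 3.9, 6.2, 8.0],
          [5.0, 3.0, 4.0, 1.0],
          [0.5, 0.7, 0.2, 0.9]]
batch2 = [[1.2, 2.5, 3.1],
          [2.0, 5.1, 6.0],
          [4.0, 4.5, 1.5],
          [0.3, 0.9, 0.4]]

pairs = gene_gene_correlations(batch1)           # 6 values for 4 genes
cr = correlation_of_correlations(batch1, batch2)
cr_self = correlation_of_correlations(batch1, batch1)  # identical batches
```

With 4 genes there are 6 gene pairs, matching the 6-element vectors on the slide. Comparing a batch with itself gives CR = 1 (up to rounding), which is the "identical batches" case.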
Permutation test of CRs • We scramble the batch labels of the samples in the two batches and calculate CR between the two permuted batches; repeating this gives the distribution of CR under $H_0$. • The actual CR between the two batches is compared with this distribution (two-sided p).
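A minimal sketch of such a permutation test follows. The slide does not say how the two-sided p-value is formed; the version here (twice the smaller tail probability, capped at 1, with a +1 correction) is an assumption, as are the function names, the number of permutations, and the randomly generated toy data.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cr(batch1, batch2):
    """Correlation of correlations for two genes-by-samples batches."""
    def pair_corrs(b):
        return [pearson(b[i], b[j])
                for i, j in combinations(range(len(b)), 2)]
    return pearson(pair_corrs(batch1), pair_corrs(batch2))


def permutation_test_cr(batch1, batch2, n_perm=200, seed=0):
    """Scramble batch labels across the pooled samples to build the
    null distribution of CR, then compare the observed CR with it."""
    rng = random.Random(seed)
    n1 = len(batch1[0])                       # samples in batch 1
    pooled = [r1 + r2 for r1, r2 in zip(batch1, batch2)]
    n_total = len(pooled[0])
    observed = cr(batch1, batch2)
    null = []
    for _ in range(n_perm):
        idx = list(range(n_total))
        rng.shuffle(idx)                      # random relabeling of samples
        a = [[row[k] for k in idx[:n1]] for row in pooled]
        b = [[row[k] for k in idx[n1:]] for row in pooled]
        null.append(cr(a, b))
    # Assumed two-sided p-value: twice the smaller tail, capped at 1.
    lower = (1 + sum(v <= observed for v in null)) / (n_perm + 1)
    upper = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, null, min(1.0, 2 * min(lower, upper))


# Hypothetical toy data: 5 genes, batches of 6 and 5 samples.
data_rng = random.Random(1)
batch1 = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(5)]
batch2 = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(5)]

observed, null, p_value = permutation_test_cr(batch1, batch2)
```

The permuted batches keep the original batch sizes, so the null distribution reflects the same sample-size imbalance as the observed comparison.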
PCA: GBM data [Figure: PCA plots; plot labels: Batch 16, 20; Batch 16]
GBM:Affy
GBM:Agilent
GBM:Exon
Tests for batch effects using CR:GBM
Q-values for testing batch effects in GBM data
PCA-Ovarian data Batch 9, 11
Ovarian-Affymetrix
Ovarian-Agilent
Ovarian-Exon Batch 9, 11
Tests for batch effects using CR:OV
Q-values for batch effects in OV data
Batch effects in unified OV gene expression data [Figure panels: Unadjusted Affy U133a Data; Unified Gene Expression Data]
Adjustment of Batch Effects • Empirical Bayes (ComBat) – Parametric prior (EBP) – Nonparametric prior (EBNP) • Median Polish – Overall (MP) – Within each batch (MPB) • ANOVA – Naïve ANOVA (AN) – With variance shrinkage (WAN)
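As an illustration of the simplest idea in this list, the sketch below removes a per-gene batch mean, which is one plausible reading of a naïve ANOVA adjustment (fit expression = grand mean + batch effect + residual, then keep grand mean + residual). It is an assumption about what AN does, not the author's procedure; the function name and the toy values are hypothetical, and no variance shrinkage (WAN) or empirical Bayes step (ComBat) is attempted.

```python
from collections import defaultdict


def center_batches(values, batches):
    """Adjust one gene's expression values by removing batch means.

    Fits the one-way model value = grand_mean + batch_effect + residual
    and returns grand_mean + residual for every sample.
    """
    grand_mean = sum(values) / len(values)
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: sum(vs) / len(vs) for b, vs in by_batch.items()}
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


# Hypothetical expression values for one gene in two batches.
values = [5.0, 5.2, 4.8, 7.0, 7.4, 6.6]
batches = ["9", "9", "9", "11", "11", "11"]

adjusted = center_batches(values, batches)
mean_b9 = sum(adjusted[:3]) / 3
mean_b11 = sum(adjusted[3:]) / 3
```

After the adjustment both batches share the same mean for this gene, while the within-batch spread is unchanged. This also shows the risk noted later in the talk: if the difference between batches is biological, this step removes it just the same.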
Batch effect adjustment
GBM:Affy
GBM Agilent
GBM:Exon data
Effect of batch effects adjustment on gene expression
Assessment of batch effects with adjustments [Figure: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batch 9 and 11-15. x-axis: p-values; y-axis: cumulative probability; reference line at p-value = 0.05; curves: U (unadjusted data), EBP, EBNP, MP, MPB, AN, WAN]
Assessment of batch effects with adjustments [Figure: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data. x-axis: p-values; y-axis: cumulative probability; reference line at p-value = 0.05; curves: U (unadjusted data), EBP, EBNP, MP, MPB, AN, WAN; the unadjusted and MPB adjusted curves are labeled]
Assessment of batch effects with adjustments:OV [Figure: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batch 9 and 11 to 15. x-axis: p-values; y-axis: cumulative probability; reference line at p-value = 0.05; curves: U (unadjusted data), EBP, EBNP, MP, MPB, AN, WAN; the unadjusted and MPB adjusted curves are labeled]
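The quantity plotted in these figures is an empirical CDF of permutation-test p-values, and the height of each curve at 0.05 is the fraction of batch comparisons that would be called significant. A minimal sketch, with a hypothetical helper name and made-up p-values that are not taken from the TCGA results:

```python
def ecdf_at(p_values, threshold):
    """Empirical CDF: fraction of p-values at or below the threshold."""
    return sum(p <= threshold for p in p_values) / len(p_values)


# Hypothetical p-values from pairwise batch comparisons (illustration only).
p_unadjusted = [0.001, 0.004, 0.02, 0.03, 0.2, 0.6]
p_adjusted = [0.04, 0.2, 0.35, 0.5, 0.7, 0.9]

frac_unadjusted = ecdf_at(p_unadjusted, 0.05)
frac_adjusted = ecdf_at(p_adjusted, 0.05)
```

A curve that rises steeply near zero indicates many significant batch differences. After a successful adjustment the curve should lie closer to the diagonal expected for p-values under the null hypothesis.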
Association of Clinical Outcomes with Batches • Overall survival by batch (TCGA ovarian cancer data) [Figure annotations: P<0.001 and P=0.018; Batch 9 is labeled in both plots]
Implications • Assessments based on the correlation of correlations (CR) can be used to identify batch effects in TCGA data, complemented by principal component analysis and hierarchical clustering. • Batch effects exist in TCGA GBM and ovarian cancer data. • Be cautious when adjusting for batch effects. – The batch differences may be technical or biological • We do not want to correct biological differences • We do want to correct technical differences (bias) – Some methods may over-massage the data • The impact of batch effects on clinical predictions from the data remains to be determined.
Thank you!