Application of Survival and Multivariate Methods to Gene Expression Data from Two Affymetrix Datasets
Linda Warnock, Statistical Sciences UK
Richard Stephens, Transcriptome Analysis UK
Jo Ann Coleman, Statistical Sciences US
Outline
• Description of the datasets
• Pre-processing and exploration of the data
• Methods
• Results
• Conclusions
• Recommendations
• References
The data
• Harvard
  - U95 Affymetrix chip type
  - 124 lung adenocarcinoma samples with clinical data
  - 71 females, 53 males
  - 76 with stage I tumour, 48 with stage II-IV tumour
  - 12,625 probe sets
• Michigan
  - HuGeneFL Affymetrix chip type
  - 86 lung adenocarcinoma samples with clinical data
  - 51 females, 35 males
  - 67 with stage I tumour, 19 with stage III tumour
  - 7,129 probe sets
Pre-processing
• Normalisation
  - .CEL files were loaded into DChip and normalised with the available algorithms using the Perfect Match (PM) data
• Quality Control
  - Poor quality chips were excluded from the analysis if:
    • the percentage of spot outliers was > 3 (DChip)
    • the 3 prime to 5 prime (3'/5') ratio was > 3 (MAS5)
    • PCA of several chip metrics (e.g. background, overall brightness) identified technical bias in data production
• Final number of samples/chips in analysis
  - Harvard: 114 (10 excluded)
  - Michigan: 70 (16 excluded)
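The two numeric exclusion rules above can be sketched as a simple filter. This is only an illustration: the `ChipQC` record, its field names and the example chips are hypothetical, and the third criterion (PCA of chip metrics) is a visual check that is not covered here.

```python
from dataclasses import dataclass


@dataclass
class ChipQC:
    """Hypothetical per-chip QC record (field names are assumptions)."""
    chip_id: str
    pct_outliers: float   # percentage of spot outliers, as reported by DChip
    ratio_3p_5p: float    # 3'/5' ratio, as reported by MAS5


def passes_qc(chip, max_pct_outliers=3.0, max_ratio=3.0):
    """Keep a chip only if neither numeric threshold from the slide is exceeded."""
    return chip.pct_outliers <= max_pct_outliers and chip.ratio_3p_5p <= max_ratio


# Made-up example chips, for illustration only.
chips = [
    ChipQC("chip_01", pct_outliers=0.8, ratio_3p_5p=1.2),
    ChipQC("chip_02", pct_outliers=4.5, ratio_3p_5p=1.1),  # too many outliers
    ChipQC("chip_03", pct_outliers=1.0, ratio_3p_5p=3.6),  # 3'/5' ratio too high
]

kept = [c.chip_id for c in chips if passes_qc(c)]
print(kept)
```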
Aim of analysis and Methods Used
• Aim
  - Combine data across chip types and identify genes associated with prolonged survival
• Methods
  - Quality control metrics
  - Affymetrix array comparison spreadsheets to match probe_set ids from Harvard with Michigan (resulted in 6,013 genes in common)
  - Survival plots
  - PCA plots
  - Cox PH regression modelling
  - Meta-analysis: Fisher's chi-squared method
  - Volcano plots
Harvard chip metric data
[Figure: PCA of chip metrics (190 chips), coloured by IVT batch, with IVT batch no. 3 labelled. Metrics: Background, GAPDH, B-actin, RawQ, Number Present, Average Signal, St.Dev. Background]
[Figure: PCA of chip metrics (190 chips), coloured by IVT batch. Metrics: Background, GAPDH, B-actin, RawQ, Number Present, Average Signal, St.Dev. Background]
Harvard Expression Data
[Figure: PCA of 190 chips, with sample groups labelled COID, Ad, NL, SCLC and SQ]
[Figure: PCA of 132 chips (adenocarcinoma samples), coloured by IVT batch, with Batch 3 labelled]
Survival Plots with strata: sex and tumour stage
[Figure: survival curves for Michigan (strata F, I; F, III; M, I; M, III) and Harvard (strata F, I; F, II+; M, I; M, II+)]
** Stage is a predictor for survival (noted on both panels)
Cox PH regression and Meta Analysis of the Clinical Data

Variable  Harvard  Harvard  Michigan  Michigan  Meta            Meta
          Chi-Sq   P-value  Chi-Sq    P-value   Critical point  P-value
Stage     15.98    <0.0001  25.15     <0.0001   48.2            <0.0001
Sex       1.63     0.2015   1.96      0.1623    6.84            0.1446
Age       0.76     0.382    2.87      0.0904    6.73            0.1508
Exploration of the gene expression data (averaged over samples)
[Figure: expression averaged over samples for each dataset]
• Michigan: mean = 3.1, SD = 0.38
• Harvard: mean = 2.4, SD = 0.50
PCA on raw expression data
[Figure: PCA score plot on all the common genes, samples classed by dataset (Michigan, Harvard); t[1] = 89% of variation, t[2] = 1% of variation]
[Figure: loadings plot for all common genes, p[1] against p[2]]
Meta Analysis - combining p-values
• Inverse Chi-Square method (Fisher, 1932)
• Under the null hypothesis, p-values have a uniform distribution
• … so $-2\log(p)$ has a chi-square distribution with 2 degrees of freedom
• … and $-2\log(p_1 p_2)$ has a chi-square distribution with 4 degrees of freedom (natural logarithm; $p_1$ and $p_2$ independent)
• A new p-value is created for every gene by combining its p-value from Harvard with its p-value from Michigan
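As a minimal sketch of the combination step, the function below computes Fisher's statistic and its p-value using only the standard library. The function name is an assumption, not taken from the original analysis, and the closed-form chi-square tail used here holds only for even degrees of freedom (2 per p-value). The example inputs are the Harvard and Michigan p-values for Sex from the clinical-data table.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's inverse chi-square method.

    Returns (statistic, combined_p), where the statistic is -2 * sum(ln p)
    and is referred to a chi-square distribution with 2k degrees of freedom.
    """
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom 2k, the chi-square upper tail is
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    combined_p = math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(k)
    )
    return stat, combined_p


# Sex: Harvard p = 0.2015, Michigan p = 0.1623 (from the clinical-data table).
stat, p = fisher_combine([0.2015, 0.1623])
print(round(stat, 2), round(p, 4))
```

With these inputs the sketch should agree with the meta-analysis column for Sex (critical point 6.84, p-value 0.1446).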
Cox Proportional Hazards Model
• Model the log hazard function against the covariates
  - $\log h(t; x) = \log h_0(t) + b^T x$, equivalently $h(t; x) = h_0(t) \exp(b^T x)$
  - where $b$ is the vector of covariate parameter estimates, $h_0(t)$ is the baseline hazard and $x$ represents the data
• The exponential of the parameter estimate for gene expression is the factor by which the hazard is multiplied for every unit increase in log10 expression, i.e. for every 10-fold increase in expression
Interpretation of the hazard parameter estimate
[Figure: hazard (0 to 1200) against gene expression (10 to 1E7, log scale), showing the increase in hazard for every 10-fold increase in gene expression]
• Parameter estimate = 5: exp(5) = 148.4, so the hazard is multiplied by about 148 for each 10-fold increase in gene expression
• Parameter estimate = 2: exp(2) = 7.4, so the hazard is multiplied by about 7 for each 10-fold increase in gene expression
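The arithmetic behind this interpretation can be checked with a short sketch. It assumes, as the slides state, that expression enters the Cox model on a log10 scale; the function name is hypothetical.

```python
import math


def hazard_ratio(beta, fold_change):
    """Factor by which the hazard is multiplied for a given fold change in
    expression, assuming the model covariate is log10(expression)."""
    return math.exp(beta * math.log10(fold_change))


# The two parameter estimates used as examples on the slide.
for beta in (5, 2):
    print(beta, round(hazard_ratio(beta, 10), 1))
```

Because the effect is multiplicative, a 100-fold increase in expression with a parameter estimate of 2 would correspond to `hazard_ratio(2, 100)`, which equals exp(4), not twice exp(2).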
Volcano plots
[Figure: volcano plot; cut-off P < 0.05; 241 genes selected]
Agreement between hazard estimates
Genes selected from Cox Analysis
• Interpretation
  - These genes show a significant association with survival after taking stage, sex and age into account
  - Their expression is associated with better or worse survival regardless of the stage of the tumour
PCA Score plot of genes selected from survival analysis
[Figure: scatter plot of PCA scores (x-axis M2.t[1]), with samples marked as survivor / non-survivor and short / long survival]
List of genes and association with Survival

probe_set    Meta     FDR      Survival     Hazard  Hazard  Gene name
             p-value  p-value  association  Harv    Mich
34777_at     <0.0001  0.0015   Neg          1.00    5.67    adrenomedullin
40507_at     <0.0001  0.0077   Neg          8.07    10.34   solute carrier family 2 (facilitated glucose transporter), member 1
1649_at      0.0001   0.0656   Neg          5.68    22.78   chromosome 20 open reading frame 16
32300_s_at   0.0002   0.1582   Neg          5.23    13.99   tyrosine hydroxylase
38544_at     0.0003   0.2286   Neg          2.10    4.99    inhibin, alpha
1269_at      0.0005   0.3053   Pos          -2.71   -7.04   phosphoinositide-3-kinase, regulatory subunit, polypeptide 1 (p85 alpha)
35693_at     0.0006   0.3195   Neg          2.48    16.87   hippocalcin-like 1
36133_at     0.0007   0.3428   Neg          0.46    8.33    desmoplakin (DPI, DPII)
32593_at     0.0009   0.3680   Pos          -0.17   -10.60  KIAA0084 protein
1904_at      0.0012   0.4366   Neg          5.36    12.27   c-myc binding protein
1212_at      0.0013   0.4673   Neg          7.37    14.03   glutathione transferase zeta 1 (maleylacetoacetate isomerase)
31488_s_at   0.0015   0.4690   Neg          2.08    6.13    Phosphoglycerate kinase {alternatively spliced} [human, phosphoglycerate kinase deficient patient with episodes of muscl, mRNA Partial Mutant, 307 nt]
1317_at      0.0016   0.4690   Neg          0.13    10.42   macrophage stimulating 1 receptor (c-met-related tyrosine kinase)
41096_at     0.0020   0.4690   Neg          0.36    1.66    S100 calcium binding protein A8 (calgranulin A)
37026_at     0.0022   0.4690   Neg          0.38    5.86    core promoter element binding protein
40657_r_at   0.0026   0.4690   Pos          -3.83   -8.19   adipose most abundant gene transcript 1

FDR = False Discovery Rate.
Papers that the genes have appeared in
• Adrenomedullin: Adrenomedullin functions as an important tumor survival factor in human carcinogenesis. Microsc Res Tech. 2002 Apr 15;57(2):110-9.
• Solute carrier family 2, member 1: Kalir T, Wang BY, Goldfischer M, Haber RS, Reder I, Demopoulos R, Cohen CJ, Burstein DE. Immunohistochemical staining of GLUT1 in benign, borderline, and malignant ovarian epithelia. Cancer. 2002 Feb 15;94(4):1078-82.
• Tyrosine hydroxylase: Lambooy LH, Gidding CE, van den Heuvel LP, Hulsbergen-van de Kaa CA, Ligtenberg M, Bokkerink JP, De Abreu RA. Real-time analysis of tyrosine hydroxylase gene expression: a sensitive and semiquantitative marker for minimal residual disease detection of neuroblastoma. Clin Cancer Res. 2003 Feb;9(2):812-9. PMID: 12576454.
• Inhibin: Loss of the expression and localization of inhibin alpha-subunit in high grade prostate cancer. J Clin Endocrinol Metab. 1998 Mar;83(3):969-75.
• PIK3R1: Evidence that phosphatidylinositol 3-kinase- and mitogen-activated protein kinase kinase-4/c-Jun NH2-terminal kinase-dependent pathways cooperate to maintain lung cancer cell survival. J Biol Chem. 2003 Jun 27;278(26):23630-8. Epub 2003 Apr 24.
• Desmoplakin: Young GD, Winokur TS, Cerfolio RJ, Van Tine BA, Chow LT, Okoh V, Garver RI Jr. Differential expression and biodistribution of cytokeratin 18 and desmoplakins in non-small cell lung carcinoma subtypes. Lung Cancer. 2002 May;36(2):133-41.
Alternative analysis approach
• Stage has a large effect on survival, with stage I having better survival prospects
• If gene expression is associated with tumour stage, the genes identified could be candidate targets for new medications
• Perform ANOVA with gene expression as the response and stage, age and sex as covariates
Volcano plots for ANOVA analysis
[Figure: volcano plot; cut-offs 1.5 fold change and P < 0.05; 43 genes selected]
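The volcano-plot selection amounts to applying both cut-offs at once. The sketch below is illustrative only: the gene labels and values are made up, fold change is taken as a plain ratio between stage groups, and treating the 1.5-fold threshold symmetrically (up or down) is an assumption rather than something stated on the slide.

```python
def select_genes(results, min_fold=1.5, alpha=0.05):
    """Return genes passing both the fold-change and p-value cut-offs.

    `results` is a list of (gene, fold_change_ratio, p_value) tuples, where
    the ratio is > 1 for higher expression and < 1 for lower expression.
    """
    selected = []
    for gene, ratio, p_value in results:
        # Treat a 2-fold decrease (ratio 0.5) the same as a 2-fold increase.
        magnitude = max(ratio, 1.0 / ratio)
        if magnitude >= min_fold and p_value < alpha:
            selected.append(gene)
    return selected


# Hypothetical results, for illustration only.
results = [
    ("gene_A", 1.8, 0.010),  # large change, significant: selected
    ("gene_B", 1.2, 0.001),  # significant but small change: rejected
    ("gene_C", 0.5, 0.030),  # 2-fold decrease, significant: selected
    ("gene_D", 2.0, 0.200),  # large change, not significant: rejected
]

selected = select_genes(results)
print(selected)
```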
Genes selected from ANOVA
• Interpretation
  - These genes show a significant difference between stage I and later stages and hence are indicators for survival
  - If gene expression is higher in stage I then the gene is positively associated with survival
  - CAVEAT: the analysis is detecting small fold-changes as being significant, so it is questionable whether any genes are truly interesting to an oncologist