Use of Microarray Data via Model-Based Classification in the Study and Prediction of Survival from Lung Cancer
Liat Jones*, Angus Ng*, Chris Ambroise**, Katrina Monico* and Geoff McLachlan*
* Institute for Molecular Bioscience, University of Queensland
** Laboratoire Heudiasyc
AIM: To link gene-expression data with survival from lung cancer
A. CLUSTER ANALYSIS: We apply a model-based clustering approach to classify tumor tissues on the basis of microarray gene expression.
B. SURVIVAL ANALYSIS: The association between the clusters so formed and patient survival (recurrence) times is established.
C. DISCRIMINANT ANALYSIS: We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.
STANFORD and ONTARIO DATASETS: cDNA microarrays were used to obtain gene expression profiles for the tumor tissue samples.
STANFORD: 918 genes
ONTARIO: 2880 genes
The Stanford Dataset contains relatively more adenocarcinoma (AC) samples, and the Ontario Dataset contains only non-small cell lung carcinomas (NSCLC).
Tumor Types in Stanford and Ontario Datasets (Number of Samples)
Tumor Type        Stanford   Ontario
Adenocarcinoma    41         19
Squamous cell     16         14
Large cell        5          4
Adenosquamous     0          1
Carcinoid         0          1
Small cell        5          0
TOTAL             67         39
Heat Map for 2880 Ontario Genes (39 Tissues) [Figure; axes: Genes, Tissues]
CLUSTERING OF ONTARIO TUMORS Using EMMIX-GENE
Steps used in the application of EMMIX-GENE:
• Select the most relevant genes from the filtered set of 2,880 genes. This reduces the set of retained genes to 766.
• Cluster these 766 genes into twenty groups. The majority of the gene groups produced were reasonably cohesive and distinct.
• Using the twenty group means (metagenes), cluster the tissue samples into two groups using a mixture of normal components/factor analyzers.
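The third step works on group means rather than individual genes. A minimal sketch of forming such metagenes is given here, assuming a metagene is simply the per-tissue average of the genes in one group. The function name `metagenes`, the dictionary layout, and the toy values are hypothetical and are not taken from EMMIX-GENE itself.

```python
from statistics import mean

def metagenes(expr, gene_groups):
    """Average the expression of the genes in each group, tissue by tissue.

    expr: dict mapping gene name -> list of log-expression values (one per tissue)
    gene_groups: dict mapping group label -> list of gene names in that group
    Returns a dict mapping group label -> list of group means (one per tissue).
    """
    out = {}
    for label, genes in gene_groups.items():
        profiles = [expr[g] for g in genes]
        # zip(*profiles) walks over tissues; each column holds one tissue's values
        out[label] = [mean(col) for col in zip(*profiles)]
    return out

# Hypothetical toy data: 4 genes, 3 tissues, 2 gene groups.
expr = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [3.0, 2.0, 1.0],
    "g3": [0.0, 0.0, 6.0],
    "g4": [2.0, 4.0, 0.0],
}
groups = {"A": ["g1", "g2"], "B": ["g3", "g4"]}
meta = metagenes(expr, groups)
```

With twenty gene groups, each tissue would be described by twenty such means, and the tissue clustering would then be run on that much smaller table.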
Heat Maps for the 20 Ontario Gene-Groups (39 Tissues) [Figure; axes: Genes, Tissues]
Tissues are ordered as: Recurrence (1-24) and Censored (25-39)
Expression Profiles for Useful Metagenes (Ontario 39 Tissues)
[Figure; panels: Gene Group 1, Gene Group 2, Gene Group 19, Gene Group 20; x-axis: Tissues, Recurrence (1-24) then Censored (25-39); y-axis: Log Expression Value; legend: Our Tissue Cluster 1, Our Tissue Cluster 2]
Expression Profiles of some Genes Identified in Ontario
Cluster A (down Rec, up Censored); Clusters B and C (up Rec, down Censored)
[Figure; panels: PNUTL1, ATM, FUS, HIF1A, Wee1, RABIF; x-axis: Tissues, Recurrence (1-24) then Censored (25-39); y-axis: Log Expression Value]
Only ZNF136 is both retained by us and identified in the Ontario study. It is found in our Group 19 (up-regulated in recurrence).
[Figure; x-axis: Tissues, Recurrence (1-24) then Censored (25-39); y-axis: Log Expression Value]
Tissue Clusters
Tumors 1-24 belong to the RECURRENCE group; tumors 25-39 are CENSORED.
CLUSTER ANALYSIS via EMMIX-GENE of 20 METAGENES yields TWO CLUSTERS:
CLUSTER 1: 1-14, 16-24 (recurrence) plus 25-29, 33, 36, 38 (censored)
CLUSTER 2: 15 (recurrence) plus 30-32, 34, 35, 37, 39 (censored)
SURVIVAL ANALYSIS: LONG-TERM SURVIVOR (LTS) MODEL
$$S(t) = \mathrm{prob}\{T > t\} = \pi_1 S_1(t) + \pi_2,$$
where $T$ is time to recurrence and $\pi_1 = 1 - \pi_2$ is the prior prob. of recurrence. Adopt a Weibull model for the survival function for recurrence $S_1(t)$.
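A short sketch of the LTS survival function may help show its key property: the curve levels off at the long-term-survivor proportion instead of dropping to zero. The Weibull form S1(t) = exp(-alpha * t^beta) used here is an assumed parameterization, and the parameter values are hypothetical, not fitted values from the study.

```python
import math

def weibull_survival(t, alpha, beta):
    """Weibull survival function, assumed here to be exp(-alpha * t**beta)."""
    return math.exp(-alpha * t ** beta)

def lts_survival(t, pi1, alpha, beta):
    """Long-term survivor model: S(t) = pi1 * S1(t) + pi2, with pi2 = 1 - pi1.

    pi1 is the prior probability of recurrence; pi2 is the proportion of
    long-term survivors, who never experience recurrence.
    """
    pi2 = 1.0 - pi1
    return pi1 * weibull_survival(t, alpha, beta) + pi2

# Hypothetical parameter values, for illustration only.
s_start = lts_survival(0.0, pi1=0.6, alpha=0.01, beta=1.5)
s_late = lts_survival(1.0e6, pi1=0.6, alpha=0.01, beta=1.5)
```

Here `s_start` is 1 (everyone is recurrence-free at time zero) and `s_late` approaches pi2 = 0.4, the plateau that a Kaplan-Meier curve with many late censored cases tends to show.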
Fitted LTS Model vs. Kaplan-Meier
PCA of Tissues Based on Metagenes [Figure; axes: First PC, Second PC]
PCA of Tissues Based on All Genes (via SVD) [Figure; axes: First PC, Second PC]
Cluster-Specific Kaplan-Meier Plots
Survival Analysis for Ontario Dataset
• Nonparametric analysis:
Cluster   No. of Tissues   No. of Censored   Mean time to Failure (± SE)
1         29               8                 665 ± 85.9
2         8                7                 1388 ± 155.7
A significant difference between Kaplan-Meier estimates for the two clusters (P = 0.027).
• Cox’s proportional hazards analysis:
Variable                      Hazard ratio (95% CI)   P-value
Cluster 1 vs. Cluster 2       6.78 (0.9 – 51.5)       0.06
Tumor stage (I vs. II&III)    1.07 (0.57 – 2.0)       0.83
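For readers less familiar with the nonparametric analysis, the following is a minimal sketch of the Kaplan-Meier product-limit estimator for one cluster. The follow-up times and event indicators are hypothetical toy values, not the Ontario data, and the function name `kaplan_meier` is illustrative.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate, returned as [(t, S(t))] at each distinct event time.

    times: follow-up times
    events: 1 if recurrence was observed at that time, 0 if the case was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # recurrences at time t
        removed = 0    # all cases leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed > 0:
            # Multiply by the conditional probability of surviving past t
            surv *= 1.0 - observed / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

# Hypothetical toy cluster: six tissues, two of them censored.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Running this once per cluster gives the two step functions that a cluster-specific Kaplan-Meier plot compares. Censored cases lower the number at risk without producing a step.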
Discriminant Analysis (Supervised Classification)
A prognosis classifier was developed to predict the class of origin of a tumor tissue with a small error rate after correction for the selection bias. A support vector machine (SVM) was adopted to identify important genes that play a key role in predicting the clinical outcome, using all the genes and also the metagenes. A cross-validation (CV) procedure was used to calculate the prediction error, after correction for the selection bias.
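Selection bias arises when genes are chosen using all tissues and the error is then cross-validated on those same tissues. One common correction, sketched here, is to repeat the gene selection inside every training fold so that held-out tissues never influence it. To keep the sketch short and dependency-free, a nearest-centroid classifier and a mean-difference ranking stand in for the SVM with recursive feature elimination, so this is not the procedure used in the study. All function names and the toy data are hypothetical.

```python
def select_genes(X, y, k):
    """Rank genes by |difference of class means| on the given tissues; keep the top k."""
    n_genes = len(X[0])
    scores = []
    for j in range(n_genes):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return sorted(range(n_genes), key=lambda j: -scores[j])[:k]

def nearest_centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class whose centroid (on the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for lab in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in genes]
        dist = sum((x[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = lab, dist
    return best_label

def external_cv_error(X, y, k, n_folds):
    """CV error with gene selection redone within each training fold.

    Assumes every training fold contains tissues from both classes.
    """
    n = len(X)
    errors = 0
    for f in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == f]
        train_idx = [i for i in range(n) if i % n_folds != f]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        genes = select_genes(X_tr, y_tr, k)  # selection sees training tissues only
        for i in test_idx:
            if nearest_centroid_predict(X_tr, y_tr, X[i], genes) != y[i]:
                errors += 1
    return errors / n

# Hypothetical toy data: 8 tissues, 3 genes; only gene 0 separates the classes.
X = [[0.1, 5.0, 2.0], [2.1, 5.1, 2.2], [0.0, 4.9, 2.1], [1.9, 5.2, 1.9],
     [0.2, 5.1, 2.0], [2.0, 4.8, 2.1], [-0.1, 5.0, 1.8], [2.2, 5.0, 2.0]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
cv_error = external_cv_error(X, y, k=1, n_folds=4)
```

The design point is the placement of `select_genes` inside the fold loop. Moving it outside the loop would reuse the held-out tissues for selection and tend to give an optimistically low error estimate.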
ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Figure; x-axis: log2 (number of genes), 0 to 12; y-axis: Error Rate (CV10E), 0 to 0.12]
Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector Machine (SVM), applied to g=2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39)
STANFORD DATA
918 genes based on 73 tissue samples from 67 patients.
Rows and columns were normalized; 451 genes were retained after the select-genes step.
20 metagenes were used to cluster the tissues.
The histological groups were retrieved.
Heat Maps for the 20 Stanford Gene-Groups (73 Tissues) [Figure; axes: Genes, Tissues]
Tissues are ordered by their histological classification: Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73)
Reduced dataset of 35 Adenocarcinoma (AC) Tissues
The full dataset had 41 AC tissues. According to our cluster analysis:
• AC tissues 5, 16, 26 are put with LCLC
• AC tissues 7, 29 are put with SCLC
• AC tissue 40 is put with SCC
Also, we did not add tissues 43 (LCLC) or 68 (SCC) (as was done in the Stanford study), as they were both assigned to the LCLC cluster. This left 35 AC tissues with 918 genes, reduced to 219 genes, which were clustered into 15 groups (metagenes).
STANFORD CLASSIFICATION:
Cluster 1: 1-19 (good prognosis)
Cluster 2: 20-26 (long-term survivors)
Cluster 3: 27-35 (poor prognosis)
Heat Maps for the 15 Stanford Gene-Groups (35 Tissues) [Figure; axes: Genes, Tissues]
Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35)
Expression Profiles for Top Metagenes (Stanford 35 AC Tissues)
[Figure; panels: Gene Group 1, Gene Group 2, Gene Group 3, Gene Group 4; x-axis: Tissues; y-axis: Log Expression Value; legend: Stanford AC group 1, Stanford AC group 2, Stanford AC group 3, Misallocated]
Which Genes make up the top 4 Metagenes?
Group 1 (22 genes) and Group 2 (12 genes) between them include: ESTs Hs.11607; ornithine decarboxylase; ataxia-telangiectasia group D-associated protein; carbonyl reductase (metabolic enzyme); solute carrier family 7, member 5 (CD98); vascular endothelial growth factor C
Annotations: Marker Genes for Group 3 (Supervised), high in group 3, low in 1 and 2 (4/10 genes); Marker Genes for Group 2 (Supervised), high in group 2, low in 3 (1/8 genes)
Group 4 (14 genes) and Group 3 (16 genes) between them include: cartilage paired-class homeoprotein; aldo-keto reductase family 1; tumor suppressor deleted in oral cancer-related 1; glutathione peroxidase; thioredoxin reductase
Annotations: Metabolic Enzymes (Unsupervised), high in group 3, also SCC (3/6 genes); Marker Genes for Group 2 (Supervised), high in group 2, low in 3 (2/8 genes)
Some other interesting Metagenes
[Figure; panels: Gene Group 7, Gene Group 9; x-axis: Tissues; y-axis: Log Expression Value]
Group 7 (19 genes) and Group 9 (22 genes) between them include: citron; ICAM-1 (CD54); surfactant A1; collagen, type IX; hepsin; thyroid transcription factor
Annotations: Marker Genes For Group 1 (Supervised), high in group 1, low in 2 (1/9 genes); Marker Genes For Group 1 (Supervised), high in group 1, low in 2 (4/9 genes); Surfactant Proteins (Unsupervised), high in groups 1 and 2, low in 3
Cluster-Specific Kaplan-Meier Plots
STANFORD DATA: TWO-COMPONENT WEIBULL MIXTURE MODEL
$$S(t) = \pi_1 S_1(t) + \pi_2 S_2(t),$$
where
$$S_i(t) = \exp(-\alpha_i t^{\beta_i}) \quad (i = 1, 2).$$
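To fit such a mixture to right-censored data, one would typically maximize a log-likelihood in which observed failures contribute the mixture density and censored cases contribute the mixture survival function. The sketch below evaluates that log-likelihood, assuming component survival functions of the form exp(-alpha_i * t^beta_i). The function names and the example values are hypothetical, and no optimizer is included.

```python
import math

def weibull_surv(t, alpha, beta):
    """Component survival function, assumed to be exp(-alpha * t**beta)."""
    return math.exp(-alpha * t ** beta)

def weibull_dens(t, alpha, beta):
    """Matching density: minus the derivative of the survival function."""
    return alpha * beta * t ** (beta - 1.0) * math.exp(-alpha * t ** beta)

def mixture_loglik(times, events, pi1, params):
    """Log-likelihood of a two-component Weibull mixture with right censoring.

    times: positive follow-up times; events: 1 = failure observed, 0 = censored
    params: ((alpha1, beta1), (alpha2, beta2)); pi2 is taken as 1 - pi1
    """
    (a1, b1), (a2, b2) = params
    pi2 = 1.0 - pi1
    ll = 0.0
    for t, d in zip(times, events):
        if d:
            # Observed failure: contributes the mixture density at t
            ll += math.log(pi1 * weibull_dens(t, a1, b1) + pi2 * weibull_dens(t, a2, b2))
        else:
            # Censored case: contributes the probability of surviving past t
            ll += math.log(pi1 * weibull_surv(t, a1, b1) + pi2 * weibull_surv(t, a2, b2))
    return ll

# Sanity check with hypothetical values: identical components with beta = 1
# reduce the mixture to a single exponential model with rate 0.5.
ll = mixture_loglik([1.0, 2.0], [1, 0], pi1=0.3, params=((0.5, 1.0), (0.5, 1.0)))
```

Comparing the maximized value of this function for one and two components is one way to judge whether the second component is needed.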
Plot of 1- and 2-component Weibull Mixture vs. Kaplan-Meier
Survival Analysis for Stanford Dataset
• Kaplan-Meier estimation:
Cluster   No. of Tissues   No. of Censored   Mean time to Failure (± SE)
1         17               10                37.5 ± 5.0
2         5                0                 5.2 ± 2.3
A significant difference in survival between clusters (P < 0.001)
• Cox’s proportional hazards analysis:
Variable                        Hazard ratio (95% CI)   P-value
Cluster 3 vs. Clusters 1&2      13.2 (2.1 – 81.1)       0.005
Grade 3 vs. grades 1 or 2       1.94 (0.5 – 8.5)        0.38
Tumor size                      0.96 (0.3 – 2.8)        0.93
No. of tumors in lymph nodes    1.65 (0.7 – 3.9)        0.25
Presence of metastases          4.41 (1.0 – 19.8)       0.05
Survival Analysis for Stanford Dataset
• Univariate Cox’s proportional hazards analysis (metagenes):
Metagene   Coefficient (SE)   P-value
1          1.37 (0.44)        0.002
2          -0.24 (0.31)       0.44
3          0.14 (0.34)        0.68
4          -1.01 (0.56)       0.07
5          0.66 (0.65)        0.31
6          -0.63 (0.50)       0.20
7          -0.68 (0.57)       0.24
8          0.75 (0.46)        0.10
9          -1.13 (0.50)       0.02
10         0.73 (0.39)        0.06
11         0.35 (0.50)        0.48
12         -0.55 (0.41)       0.18
13         -0.61 (0.48)       0.20
14         0.22 (0.36)        0.53
15         1.70 (0.92)        0.06
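The coefficients, standard errors and P-values in this table can be related through a Wald test. The sketch below assumes the reported P-values are two-sided Wald P-values, which the slides do not state explicitly. The helper names are hypothetical, and the hazard-ratio interval is the usual normal approximation on the log scale.

```python
import math

def wald_p_value(coef, se):
    """Two-sided P-value for H0: coefficient = 0, using z = coef / se."""
    z = coef / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))

def hazard_ratio_ci(coef, se, z_crit=1.959964):
    """Hazard ratio exp(coef) with an approximate 95% confidence interval."""
    return (math.exp(coef),
            math.exp(coef - z_crit * se),
            math.exp(coef + z_crit * se))

# Coefficient and SE for metagenes 1 and 9, taken from the table above.
p1 = wald_p_value(1.37, 0.44)
p9 = wald_p_value(-1.13, 0.50)
hr1, lo1, hi1 = hazard_ratio_ci(1.37, 0.44)
```

Rounded, `p1` and `p9` agree with the tabulated 0.002 and 0.02, which is consistent with the Wald assumption. A positive coefficient corresponds to a hazard ratio above 1, that is, higher metagene expression going with a higher hazard of failure.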