Neural Network Classifiers and Gene Selection Methods for Microarray Data on Human Lung Adenocarcinoma

Gaolin Zheng, School of Computer Science, Florida International University, Miami, FL
E. O. George, Department of Mathematical Sciences, University of Memphis
G. Narasimhan, School of Computer Science, Florida International University
Our Work
• Building classifiers to predict tumor stage based on gene expression data.
• Comparative study of neural network classifiers.
• Comparative study of gene selection methods.
• Exploring data integration.
Data Set 1 (Michigan)

                 Stage 1 Tumor      Stage 3 Tumor
                 Female    Male     Female    Male
  Non-smoking       9        1         2        7
  Smoking          33       24         0       10
  Normal lung samples: 10 (no gender/smoking information)

• 86 patients with adenocarcinoma, divided by:
  • tumor stage (Stage 1 vs. Stage 3),
  • gender (male vs. female),
  • smoking status (smoking vs. non-smoking).
• The data is severely unbalanced.
• 10 non-neoplastic (normal) lung samples with gene expression, but with no additional information (e.g., gender, smoking status).
• 7129 probe sets.
Data Set 2 (Boston)

                 Stage 1 Tumor      Stage 2 Tumor      Stage 3 Tumor
                 Female    Male     Female    Male     Female    Male
  Non-smoking       5        2         3        0         2        0
  Smoking          39       30        12        9         4        7
  Normal lung samples: 13 (no additional information)

• 113 patients with adenocarcinoma, divided by:
  • tumor stage (Stage 1, Stage 2, and Stage 3),
  • gender (male vs. female),
  • smoking status (smoking vs. non-smoking).
• 13 normal lung samples without any additional information.
• Over 12,600 probe sets; analyses across the two data sets used the 490 overlapping probe sets.
Gene Selection

Goal: Find genes that discriminate on the basis of tumor stage information.

Methods:
• ANOVA
• SAM (http://www-stat.stanford.edu/~tibs/SAM/)
• GS
• GS-Robust
• PCA: select the principal components contributing to >80% of the variation.
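For the PCA option, a minimal sketch of the component-selection step, assuming an expression matrix `X` with samples in rows; the function name and threshold handling are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def select_components(X, var_threshold=0.80):
    """Keep the smallest number of principal components whose
    cumulative explained variance exceeds the threshold."""
    pca = PCA()
    scores = pca.fit_transform(X)  # samples x components
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    n_keep = int(np.searchsorted(cum_var, var_threshold) + 1)
    return scores[:, :n_keep], pca
```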
ANOVA-Based Gene Selection
• For each individual data set:
  • single-factor (stage),
  • multifactor (stage, gender, smoking).
• For the integrated data set:
  • single-factor and multifactor models,
  • a mixed-effects model (stage as a fixed factor, lab as a random factor) for the 490 overlapping probe sets.
• Genes are ranked by significance using the P-value for the stage factor.
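A minimal sketch of the single-factor (stage) ranking, assuming `expr` is a genes-by-samples array and `stage` holds one label per sample; the multifactor and mixed-effects variants are not shown, and all names are illustrative:

```python
import numpy as np
from scipy.stats import f_oneway

def rank_genes_by_stage(expr, stage):
    """One-way ANOVA per gene; smaller P-value = more significant."""
    groups = [expr[:, stage == s] for s in np.unique(stage)]
    pvals = np.array([f_oneway(*(g[i] for g in groups)).pvalue
                      for i in range(expr.shape[0])])
    return np.argsort(pvals), pvals  # gene indices, most significant first
```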
Gene Selection Method: GS

A measure of the ratio of inter-group to intra-group variation:

$$GS_i = \frac{\sum_{j=1}^{k} \left(\bar{g}_{ij\cdot} - \bar{g}_{i\cdot\cdot}\right)^2 / (k-1)}{\sum_{j=1}^{k} \sum_{l=1}^{n_{ij}} \left(g_{ijl} - \bar{g}_{ij\cdot}\right)^2 / (n_{ij}-1)}$$

where $\bar{g}_{ij\cdot} = \mathrm{mean}(g_{ij})$ and $\bar{g}_{i\cdot\cdot} = \mathrm{mean}\{\mathrm{mean}(g_{ij}),\; j = 1, \ldots, k\}$, for $k$ classes with $n_{ij}$ samples of gene $i$ in class $j$.
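A minimal numpy sketch of the GS statistic for one gene, following the formula above (all names are illustrative):

```python
import numpy as np

def gs_score(g, labels):
    """GS for one gene: between-class variance over summed
    within-class variances. g: 1D expression vector; labels: class ids."""
    classes = np.unique(labels)
    k = len(classes)
    groups = [g[labels == c] for c in classes]
    class_means = np.array([grp.mean() for grp in groups])
    grand_mean = class_means.mean()
    between = ((class_means - grand_mean) ** 2).sum() / (k - 1)
    within = sum(((grp - grp.mean()) ** 2).sum() / (len(grp) - 1)
                 for grp in groups)
    return between / within
```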
GS-Robust

A robust measure of the ratio of inter-group to intra-group variation:

$$GSRobust_i = \frac{\mathrm{MAD}\left[\mathrm{median}(g_{i1}), \ldots, \mathrm{median}(g_{ik})\right]}{\sum_{j=1}^{k} \mathrm{MAD}(g_{ij})}$$

where $GSRobust_i$ is the GS-Robust value for the $i$th gene, MAD is the median absolute deviation, and $g_{ij}$ is the vector of gene expression values for the $i$th gene in the $j$th class.
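A matching numpy sketch of GS-Robust for one gene (names illustrative):

```python
import numpy as np

def mad(x):
    """Median absolute deviation."""
    return np.median(np.abs(x - np.median(x)))

def gs_robust_score(g, labels):
    """GS-Robust for one gene: MAD of the class medians over
    the summed within-class MADs."""
    groups = [g[labels == c] for c in np.unique(labels)]
    class_medians = np.array([np.median(grp) for grp in groups])
    return mad(class_medians) / sum(mad(grp) for grp in groups)
```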
Classifiers
• Feed-forward neural network (nnet() from R). Yet another machine learning classifier. Yawn!
• FNN with Bayesian learning of the network weights.
• Neural network ensembles:
  • Bagging (Breiman, 1994)
  • Boosting (Freund and Schapire, 1996)
Bayesian Neural Networks: Bayesian Learning of the Weights

Choose initial values for the hyperparameters $\alpha$ and $\beta$, with the prior $W \sim N(0, 1/\alpha)$ on the weights.

Total error = (classifier) error term + regularization term:

$$S(W) = \beta E_D + \alpha E_W$$

With $\lambda_i$ the eigenvalues of the Hessian matrix, the effective number of well-determined parameters is

$$\gamma \equiv \sum_{i=1}^{W} \frac{\lambda_i}{\lambda_i + \alpha}$$

and the hyperparameters are re-estimated as

$$\alpha^{\mathrm{new}} = \frac{\gamma}{2 E_W}, \qquad \beta^{\mathrm{new}} = \frac{N - \gamma}{2 E_D}$$
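A minimal sketch of one hyperparameter re-estimation step, mirroring the update rules above; it assumes the eigenvalues of the Hessian have already been computed, and all names are illustrative:

```python
import numpy as np

def update_hyperparameters(eigvals, alpha, E_W, E_D, N):
    """One evidence-framework re-estimation step for alpha (weight
    prior precision) and beta (noise level). eigvals: eigenvalues of
    the Hessian matrix; N: number of training targets."""
    gamma = np.sum(eigvals / (eigvals + alpha))  # well-determined parameters
    alpha_new = gamma / (2.0 * E_W)
    beta_new = (N - gamma) / (2.0 * E_D)
    return alpha_new, beta_new
```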
Bagging Classifiers

[Diagram: individual classifiers, each trained on a bootstrap sample of the training data, are combined by voting into an ensembled classifier.]
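A minimal sketch of bagging small feed-forward networks, with scikit-learn's BaggingClassifier and MLPClassifier used as stand-ins for the talk's R nnet() setup; the network size and ensemble size are illustrative choices:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

# Each ensemble member is trained on a bootstrap sample of the
# training data; predictions are combined by voting.
base_net = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000)
bagged = BaggingClassifier(estimator=base_net, n_estimators=25,
                           bootstrap=True)
# bagged.fit(X_train, y_train); bagged.predict(X_test)
```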
Boosting Classifiers

[Diagram: classifiers are trained sequentially; each round re-weights the training examples that earlier classifiers misclassified, and the final prediction is a weighted vote of the ensemble.]
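A minimal sketch in the spirit of AdaBoost.M1 (Freund and Schapire, 1996), using boosting by weighted resampling since scikit-learn's MLPClassifier does not accept per-example weights; every name and parameter here is illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def boost_networks(X, y, n_rounds=10, seed=0):
    """AdaBoost.M1-style boosting: train networks on weighted
    resamples, re-weighting misclassified examples each round."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                 # example weights
    nets, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, p=w)    # resample by weight
        net = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000)
        net.fit(X[idx], y[idx])
        miss = net.predict(X) != y
        err = w[miss].sum()                 # weighted training error
        if err == 0 or err >= 0.5:          # stop on degenerate rounds
            break
        beta = err / (1.0 - err)
        w[~miss] *= beta                    # shrink correct examples
        w /= w.sum()
        nets.append(net)
        alphas.append(np.log(1.0 / beta))   # vote weight
    return nets, alphas

def boost_predict(nets, alphas, X):
    """Weighted majority vote over the boosted networks."""
    classes = nets[0].classes_
    votes = np.zeros((len(X), len(classes)))
    for net, a in zip(nets, alphas):
        pred = net.predict(X)
        for ci, c in enumerate(classes):
            votes[pred == c, ci] += a
    return classes[votes.argmax(axis=1)]
```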
Benchmarking the Classifiers

[Figure: boxplots of 5-fold cross-validation error (N = 10 runs per classifier) on the Iris and BreastCancer benchmark data sets, for NNET, NBAG, NBOOST, BNN, BBAG, and BBOOST.]
How Different Are These Gene Selection Methods?

Number of genes shared between the top 200 selected by each pair of methods.

Michigan:
              SAM    GS    ANOVA   GS-Robust
  SAM         200
  GS          167    200
  ANOVA       179    164    200
  GS-Robust    23     28     20       200

Boston:
              SAM    GS    ANOVA   GS-Robust
  SAM         200
  GS           43    200
  ANOVA        68     35    200
  GS-Robust     8     13      6       200
Common Significant Genes

  Gene Name   UniGene Comment
  GAPD        glyceraldehyde-3-phosphate dehydrogenase
  MGP         matrix Gla protein
  RTVP1       GLI pathogenesis-related 1 (glioma)
  DDXBP1      not found
  FGR         Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
  FGFR2       fibroblast growth factor receptor 2 (bacteria-expressed kinase, keratinocyte growth factor receptor; craniofacial dysostosis 1, Crouzon, Pfeiffer, and Jackson-Weiss syndromes)
  TNNC1       troponin C, slow
  KIAA0140    KIAA0140 gene product
Neural Network Topology

                 Michigan   Boston
  Input layer       20        20
  Hidden layer       4         4
  Output layer       3         4
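In scikit-learn terms (again a stand-in for the R nnet() used in the talk), the topology above would look roughly like this; the output layer size is inferred from the class labels:

```python
from sklearn.neural_network import MLPClassifier

# 20 inputs (selected genes) -> 4 hidden units -> one output per class
# (3 classes for Michigan, 4 for Boston; sklearn sizes the output
# layer from the labels seen during fit).
net = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000)
# net.fit(X_train, y_train)   # X_train: samples x 20 selected genes
```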
Practical Issues
• Underrepresented classes
• Contradictions in mapping
• Unbalanced testing data
K-fold Cross-Validation

[Diagram: the data is split into k folds; each fold is held out once for testing while the remaining folds are used for training, giving Error1, Error2, Error3, whose average is the 3-fold cross-validation error.]
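A minimal sketch of the procedure, with scikit-learn's KFold used for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_error(model, X, y, k=3):
    """Average test error over k train/test splits."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(X):
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
    return np.mean(errors)
```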
Validation Across Data Sets

[Diagram: train on Data Set 1 and test on Data Set 2; then train on Data Set 2 and test on Data Set 1.]
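The corresponding check as a self-contained illustrative helper; it assumes the two expression matrices have been restricted to the shared probe sets and column-aligned:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def cross_study_error(X_train, y_train, X_test, y_test):
    """Train on one study, test on the other, using only the
    probe sets shared by both platforms."""
    net = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000)
    net.fit(X_train, y_train)
    return np.mean(net.predict(X_test) != y_test)
```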
Results – Data Set 1

Classification error (mean ± s.d.) for each combination of classifier and gene selection method.

  NN Type       GS-ANOVA        GS-SAM          GS              GS-Robust       GS-PCA
  nnet          0.289 ± 0.025   0.290 ± 0.022   0.296 ± 0.031   0.277 ± 0.024   0.288 ± 0.021
  nnet.bag      0.279 ± 0.004   0.277 ± 0.008   0.267 ± 0.018   0.273 ± 0.006   0.278 ± 0.000
  nnet.boost    0.292 ± 0.012   0.290 ± 0.017   0.262 ± 0.016   0.272 ± 0.012   0.282 ± 0.013
  Bayesian      0.335 ± 0.048   0.311 ± 0.046   0.315 ± 0.036   0.269 ± 0.030   0.299 ± 0.034
  bayes.bag     0.282 ± 0.008   0.273 ± 0.014   0.264 ± 0.021   0.236 ± 0.017   0.280 ± 0.009
  bayes.boost   0.282 ± 0.012   0.280 ± 0.015   0.257 ± 0.019   0.246 ± 0.015   0.277 ± 0.013
Results – Data Set 2

Classification error (mean ± s.d.) for each combination of classifier and gene selection method.

  NN Type       GS-ANOVA        GS-SAM          GS              GS-Robust       GS-PCA
  nnet          0.153 ± 0.010   0.150 ± 0.005   0.148 ± 0.006   0.149 ± 0.002   0.150 ± 0.007
  nnet.bag      0.148 ± 0.000   0.148 ± 0.000   0.148 ± 0.000   0.148 ± 0.000   0.148 ± 0.000
  nnet.boost    0.148 ± 0.000   0.149 ± 0.002   0.148 ± 0.000   0.148 ± 0.000   0.148 ± 0.000
  Bayesian      0.157 ± 0.016   0.152 ± 0.005   0.145 ± 0.006   0.154 ± 0.014   0.148 ± 0.000
  bayes.bag     0.148 ± 0.000   0.149 ± 0.003   0.147 ± 0.002   0.148 ± 0.000   0.148 ± 0.000
  bayes.boost   0.147 ± 0.003   0.149 ± 0.005   0.142 ± 0.006   0.149 ± 0.002   0.148 ± 0.000
Validation Across Different Data Sets

Classification error (mean ± s.d.); "Michigan/Boston" means trained on Michigan, tested on Boston, and vice versa.

  Training/Testing   NN Type       GS-ANOVA        GS-SAM          GS              GS-Robust       GS-PCA
  Michigan/Boston    nnet          0.090 ± 0.122   0.055 ± 0.054   0.122 ± 0.257   0.033 ± 0.000   0.142 ± 0.272
                     nnet.bag      0.033 ± 0.000   0.033 ± 0.000   0.034 ± 0.003   0.033 ± 0.000   0.035 ± 0.005
                     nnet.boost    0.036 ± 0.008   0.055 ± 0.068   0.049 ± 0.037   0.033 ± 0.000   0.054 ± 0.050
                     Bayesian      0.172 ± 0.309   0.269 ± 0.358   0.405 ± 0.466   0.099 ± 0.126   0.171 ± 0.294
                     bayes.bag     0.034 ± 0.003   0.035 ± 0.003   0.057 ± 0.059   0.033 ± 0.003   0.105 ± 0.155
                     bayes.boost   0.037 ± 0.007   0.060 ± 0.038   0.138 ± 0.188   0.033 ± 0.000   0.061 ± 0.086
  Boston/Michigan    nnet          0.391 ± 0.226   0.250 ± 0.077   0.299 ± 0.154   0.293 ± 0.178   0.221 ± 0.000
                     nnet.bag      0.221 ± 0.000   0.221 ± 0.000   0.221 ± 0.000   0.221 ± 0.000   0.221 ± 0.000
                     nnet.boost    0.219 ± 0.004   0.222 ± 0.004   0.221 ± 0.000   0.221 ± 0.000   0.221 ± 0.000
                     Bayesian      0.434 ± 0.245   0.343 ± 0.249   0.380 ± 0.201   0.510 ± 0.336   0.276 ± 0.131
                     bayes.bag     0.226 ± 0.015   0.222 ± 0.004   0.307 ± 0.149   0.280 ± 0.167   0.221 ± 0.000
                     bayes.boost   0.241 ± 0.042   0.271 ± 0.101   0.399 ± 0.286   0.337 ± 0.206   0.221 ± 0.000
Questions from Anomalous Results
• Could it be due to the different compositions of the data sets?
• Could the assignment of tumor stage by the TNM system be non-uniform? Does "Stage 1" mean the same for both data sets?
• Could there be differences in preprocessing (normalization)?
• Tumor heterogeneity?
• Differences in treatment?
• How can these questions be approached?
Conclusions
• Bagging exhibited consistently better performance.
• Boosting improved classification, but was erratic.
• Univariate Bayesian learning did not usually improve performance.
• Bagging is a faster and simpler ensemble technique than boosting.
• GS-Robust selected many unique genes and had excellent ability to select features for our classifiers.
Acknowledgements

Members of the Bioinformatics Research Group (BioRG), School of Computer Science, FIU:
• Patricia Buendia
• Daniel Cazalis
• Tom Milledge
• Xintao Wei
• Chengyong Yang
• Erliang Zeng

http://www.cs.fiu.edu/~giri/BNN/