Analysis and evaluation of classification models for disease detection using human gut metagenomic data
Elena Kochkina, Fredrik Karlsson
Chalmers University of Technology
September 10, 2015
Introduction: Colorectal cancer

◮ Colorectal cancer (CRC) is the development of a malignant tumor in the colon or rectum.
◮ 75-95 % of colon cancer occurs in people with low genetic risk.
◮ The standard test for CRC, analysis of the stool for hidden blood, is of limited practical importance for diagnosis, so better alternatives for population screening are needed.
Introduction: The gut microbiota

◮ The gut microbiota is an ecological community of the microorganisms populating our intestine.
◮ The gut microbiota is an important modulator of the immune system and an important metabolic organ.
◮ In several diseases, the taxonomic and functional composition of the microbiota is altered compared to a normal healthy microbiota.
Data: Data I

Zeller et al., 2014
Data: Data II

Group      Healthy  Adenoma (<1)  Adenoma (>1)  CRC early stage (0 / I / II)  CRC late stage (III / IV)  Country
F (N=156)  61       27            15            0 / 15 / 7                    10 / 21                    France
G (N=38)   0        0             0             25                            13                         Germany
H (N=297)  297      0             0             0                             0                          Denmark, Spain, Germany

The datasets comprise fecal metagenomes (the collective genetic material of the microbiota), together with information about the functional and taxonomic features of the bacteria populating the human gut. Taxonomic features represent the relative abundance of 1753 different bacteria. Functional features represent gene functions and are divided into KEGG modules and CAZy families.
Methodology: Classification

[Diagram: a training set (known labels) is fed to a machine learning algorithm, which produces a classification model; the model is applied to a test set (unknown labels) to produce predicted labels.]

◮ LASSO
◮ Elastic Net
◮ Support Vector Machines
◮ Random Forests
Methodology: LASSO - Logistic regression with L1 norm regularisation

The binary logistic model predicts a binary response (class) from predictors (features) by estimating the probability that an instance belongs to the 'positive' class. The probabilities are modelled with the logistic function:

\[ \sigma(q) = P(y_i = 1 \mid x) = \frac{1}{1 + e^{-q}} \]

Given a set of input measurements $x_1, x_2, \dots, x_p$ and an outcome measurement $y$ (coded as $y_i \in \{0, 1\}$ in the log-likelihood below), $q$ can be a linear function of $x$:

\[ q = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]

The LASSO constraint is defined by:

\[ \sum_{j=1}^{p} |\beta_j| \le t \]

We maximise the log-likelihood with an added penalty, where $q_i$ denotes $q$ evaluated at the $i$-th sample:

\[ \hat{\beta}^{\text{lasso}} = \arg\max_{\beta} \left\{ \sum_{i=1}^{N} \left[ y_i q_i - \log\left(1 + e^{q_i}\right) \right] - \lambda \sum_{j=1}^{p} |\beta_j| \right\} \]
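The penalised log-likelihood can be evaluated directly for a candidate coefficient vector. The following is a minimal sketch in plain Python, not the implementation used in the study; the function names are illustrative, labels are assumed to be coded 0/1, and the intercept is left unpenalised because the penalty sum starts at j = 1.

```python
import math


def sigmoid(q):
    """Logistic function: maps a linear score q to P(y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-q))


def linear_score(beta0, beta, x):
    """q = beta0 + beta_1 * x_1 + ... + beta_p * x_p for one sample x."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def lasso_objective(beta0, beta, X, y, lam):
    """Log-likelihood minus the L1 penalty (the quantity to maximise).

    X is a list of samples, y a list of 0/1 labels, lam the penalty weight.
    """
    loglik = 0.0
    for xi, yi in zip(X, y):
        q = linear_score(beta0, beta, xi)
        loglik += yi * q - math.log(1.0 + math.exp(q))
    penalty = lam * sum(abs(b) for b in beta)  # intercept is not penalised
    return loglik - penalty
```

A larger `lam` lowers the objective for any non-zero coefficient, which is what pushes some coefficients to exactly zero when the objective is maximised.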
Methodology: Elastic Net - Logistic regression with regularisation by a combination of L1 and L2 norms

The difference from LASSO is in the Elastic Net penalty:

\[ \lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right), \]

where $\alpha$ sets the compromise between Ridge ($\alpha = 1$) and LASSO ($\alpha = 0$). Therefore, the Elastic Net criterion has the following form, with $q_i$ the linear predictor for sample $i$ as in the LASSO model:

\[ \hat{\beta}^{\text{enet}} = \arg\max_{\beta} \left\{ \sum_{i=1}^{N} \left[ y_i q_i - \log\left(1 + e^{q_i}\right) \right] - \lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right) \right\} \]
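To see how the mixing parameter moves the penalty between the two extremes, here is a small sketch of the penalty term exactly as parameterised on this slide. The function name is illustrative. Note that some libraries use the opposite convention (for example, glmnet uses alpha = 1 for LASSO), so the value of alpha is not interchangeable between tools.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * sum(alpha * b^2 + (1 - alpha) * |b|) over the coefficients.

    With this slide's parameterisation, alpha = 0 gives the LASSO (L1)
    penalty and alpha = 1 gives the Ridge (squared L2) penalty.
    """
    return lam * sum(alpha * b * b + (1.0 - alpha) * abs(b) for b in beta)


beta = [1.0, -2.0]
l1_only = elastic_net_penalty(beta, lam=1.0, alpha=0.0)   # |1| + |-2|
l2_only = elastic_net_penalty(beta, lam=1.0, alpha=1.0)   # 1^2 + (-2)^2
mixed = elastic_net_penalty(beta, lam=1.0, alpha=0.5)     # halfway between
```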
Methodology: Support Vector Machines

Given training data $(x_i, y_i)$ for $i = 1, \dots, N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier $f(x)$ such that

\[ f(x_i) \begin{cases} \ge 0, & y_i = +1 \\ < 0, & y_i = -1 \end{cases} \]

i.e. $y_i f(x_i) > 0$ for a correct classification.

A linear classifier has the form

\[ f(x) = w \cdot x + b \]

[Figure: two classes of points (o and x) separated by the hyperplane $w \cdot x + b = 0$.]

The margin is given by $2 / \|w\|$, which gives the optimisation problem

\[ \max_{w} \frac{2}{\|w\|} \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 \ \text{for all } i . \]
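The quantities on this slide are easy to compute for a given hyperplane. The sketch below assumes the weights w and offset b are already known (it does not solve the optimisation problem), and it uses the sign convention f(x) = w · x + b throughout. All function names are illustrative.

```python
import math


def decision_function(w, b, x):
    """f(x) = w . x + b for one sample x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def predict(w, b, x):
    """Class +1 if f(x) >= 0, otherwise -1."""
    return 1 if decision_function(w, b, x) >= 0 else -1


def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


def satisfies_margin(w, b, X, y):
    """True if every training sample meets y_i * f(x_i) >= 1."""
    return all(yi * decision_function(w, b, xi) >= 1 for xi, yi in zip(X, y))
```

Among all (w, b) for which `satisfies_margin` holds, the SVM picks the one with the largest `margin_width`, i.e. the smallest norm of w.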
Methodology: Random Forest

The Random Forest algorithm works by constructing an ensemble of decision trees. All the trees are constructed independently, using the Gini impurity criterion to choose the partition attributes.

Classification of objects is carried out by a majority voting scheme: every tree assigns the object to one of the classes, and the class with the highest number of votes wins.

[Figure: an input x is passed to Tree 1, Tree 2, ..., Tree n, and the trees' votes are combined into the prediction y.]
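The two ingredients named above, the Gini impurity used to score a split and the majority vote used to combine trees, can be sketched in a few lines. This is an illustration with made-up labels, not the forest implementation used in the study; the function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split into two children."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def majority_vote(tree_predictions):
    """Class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

A tree prefers the split with the lowest `split_impurity`; for example, separating `["CRC", "CRC"]` from `["healthy", "healthy"]` gives 0, the best possible value.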
Methodology: Pipeline

◮ Preprocessing: filtering, log-transform, normalisation
◮ Partition of the F set into test and training sets for 10-fold cross-validation
◮ Selection of the optimal hyperparameter(s) with nested 10-fold cross-validation and fitting of the model (repeated 10 times, once per fold)
◮ Application of the fitted model to the test set of each fold and to the GH set
◮ Model interpretation and important feature extraction
◮ Performance evaluation
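The nested cross-validation step can be made concrete by looking only at how sample indices are partitioned. The sketch below is an assumption about the general scheme, not the study's code: it uses contiguous, unshuffled folds for clarity, whereas a real pipeline would typically shuffle and stratify by class. The function names are illustrative.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv_splits(n_samples, k_outer=10, k_inner=10):
    """Yield (train, test, inner_splits) for each outer fold.

    inner_splits is a list of (fit, validation) index lists drawn only
    from the outer training set, so hyperparameters are chosen without
    ever touching the outer test fold.
    """
    for test in k_fold_indices(n_samples, k_outer):
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        inner_splits = []
        for positions in k_fold_indices(len(train), k_inner):
            validation = [train[p] for p in positions]
            validation_set = set(validation)
            fit = [i for i in train if i not in validation_set]
            inner_splits.append((fit, validation))
        yield train, test, inner_splits
```

For the F set one would call `nested_cv_splits(156)`: the inner splits select the hyperparameter(s), the model is refitted on `train`, and it is then scored on `test` and on the independent GH set.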
Methodology: Performance estimation

The result of classification is a set of predicted probabilities that each element belongs to the positive class (CRC). After choosing the decision boundary, we can construct the confusion matrix:

                              Actual class
                              Positive    Negative
Predicted class   Positive    TP          FP
                  Negative    FN          TN

TP - True Positive; TN - True Negative; FP - False Positive; FN - False Negative.
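As a minimal sketch of this step, the function below turns predicted probabilities into the four confusion-matrix counts for a chosen decision boundary. The 0/1 label coding, the default threshold of 0.5 and the function name are assumptions made for illustration.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, FN, TN for 0/1 labels and predicted probabilities.

    A sample is predicted positive when its probability is >= threshold.
    """
    tp = fp = fn = tn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
```

Raising the threshold moves samples from the predicted-positive row to the predicted-negative row, trading false positives for false negatives.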
Methodology: Performance estimation II

Performance metrics that can be calculated from the confusion matrix at a fixed decision boundary:

\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \]

\[ \text{Specificity} = \frac{TN}{FP + TN} \]
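A short worked example of these four formulas, using made-up counts rather than results from the study. The function name is illustrative, and the sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and specificity from confusion counts.

    Assumes each denominator is non-zero (e.g. at least one predicted
    positive for precision); a robust version would guard these cases.
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }


# Hypothetical counts: 40 TP, 10 FP, 20 FN, 30 TN.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts, accuracy is 70/100, precision 40/50, recall 40/60 and specificity 30/40.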
Methodology: Performance estimation III

To compare the classification models we use the ROC curve (Receiver Operating Characteristic) and the Area Under the Curve (AUC).

The ROC curve shows the relation between Sensitivity (True Positive Rate) and 1 − Specificity (False Positive Rate) as the decision boundary is varied.

[Figure: ROC curve. Vertical axis: Sensitivity, from 0.0 to 1.0. Horizontal axis: Specificity, from 1.0 down to 0.0.]
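Varying the decision boundary and integrating the resulting curve can be sketched as follows. This is a plain-Python illustration with hypothetical function names, assuming 0/1 labels with at least one sample of each class; the area is computed with the trapezoidal rule, and tied scores are treated as a single threshold.

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) points from (0, 0) to (1, 1), one per distinct score."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    # Lowering the threshold admits samples in order of decreasing score.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        last_of_group = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_group:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An AUC of 1.0 means the scores separate the classes perfectly, while 0.5 corresponds to scores that carry no information about the class.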
Results: Results I

Classifier                                           AUC, F set,          AUC, F set,           AUC, F set, taxonomic
                                                     taxonomic features   functional features   and functional features
LASSO                                                0.84                 0.80                  0.87
Elastic net                                          0.83                 0.79                  0.87
Support Vector Machines (SVM), no feature selection  0.82                 0.76                  0.85
Random forest                                        0.87                 0.79                  0.85

Table: Performance of different classification models on training set F using taxonomic and functional features
Results: Results II

Classifier                                                                                     AUC (GH set)
LASSO                                                                                          0.85
Elastic net                                                                                    0.85
Support Vector Machines (SVM) with feature selection using linear correlation criterion        0.89
Random forest                                                                                  0.87

Table: Performance of different classification models on the test set GH
Results: Important features

All classifiers and filters highlight the importance of the following bacteria: Fusobacterium nucleatum vincentii, Fusobacterium nucleatum animalis and Peptostreptococcus stomatis.

These bacteria are oral pathogens. Other studies (Warren et al. (2013), Feng et al. (2014)) also identify these species as CRC-related bacteria. It is still unclear whether they are a cause or a consequence of tumor growth.
Results: Confounder assessment

[Figure, panels A, B and C: gender proportions (female, male), age and BMI, each shown for controls and cases. Test annotations on the panels: Wilcoxon test p-value = 0.0027; Wilcoxon test p-value = 0.76; Fisher test p-value = 0.86.]

Figure: Boxplots. (A) Comparison of gender proportions between CRC patients and controls of study population F. (B) Comparison of patient age as a potential confounder. (C) Comparison of body mass index (BMI) as a potential confounder.