Analysis and evaluation of classification models for disease - PowerPoint PPT Presentation

Analysis and evaluation of classification models for disease detection using human gut metagenomic data
Elena Kochkina, Fredrik Karlsson
Chalmers University of Technology
September 10, 2015

Introduction: Colorectal cancer
◮ Colorectal cancer (CRC) is the development of a malignant tumor in the colon or rectum.
◮ 75-95% of colon cancer cases occur in people with low genetic risk.
◮ The standard test for CRC, the analysis of stool for hidden blood, is of limited practical value for diagnosis, so better alternatives for population screening are needed.

Introduction: The gut microbiota
◮ The gut microbiota is the ecological community of microorganisms populating our intestine.
◮ The gut microbiota is an important modulator of the immune system and an important metabolic organ.
◮ In several diseases, the taxonomic and functional composition of the microbiota is altered compared to a normal, healthy microbiota.

Data I
Zeller et al. 2014

Data II

Group      Healthy  Adenoma (<1)  Adenoma (>1)  CRC early stage (0 / I / II)  CRC late stage (III / IV)  Country
F (N=156)  61       27            15            0 / 15 / 7                    10 / 21                    France
G (N=38)   0        0             0             25                            13                         Germany
H (N=297)  297      0             0             0                             0                          Denmark, Spain, Germany

The datasets consist of fecal metagenomes (the collective genetic material of the microbiota) and provide information about functional and taxonomic features of the bacteria populating the human gut. Taxonomic features represent the relative abundance of 1753 different bacteria. Functional features represent gene functions and are divided into KEGG modules and CAZy families.

Methodology: Classification
Training set (known labels) -> Machine learning algorithm -> Classification model
Test set (unknown labels) -> Classification model -> Predicted label
◮ LASSO
◮ Elastic Net
◮ Support Vector Machines
◮ Random Forests

Methodology: LASSO - Logistic regression with L1 norm regularisation
The binary logistic model predicts a binary response (class) from predictors (features) by estimating the probability that an instance belongs to the 'positive' class. The probabilities are modeled using the logistic function:
$$\sigma(q_i) = P(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-q_i}}$$
Given input measurements $x_{i1}, x_{i2}, \ldots, x_{ip}$ and a binary outcome $y_i \in \{0, 1\}$ for each of $N$ instances, $q_i$ can be a linear function of the inputs:
$$q_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}$$
The LASSO constraint is defined by:
$$\sum_{j=1}^{p} |\beta_j| \le t$$
We maximise the log-likelihood with an added penalty:
$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\operatorname{argmax}} \left\{ \sum_{i=1}^{N} \left[ y_i q_i - \log\left(1 + e^{q_i}\right) \right] - \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$
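To make the penalised objective more concrete, here is a minimal sketch that evaluates the L1-penalised log-likelihood for a given coefficient vector. It assumes labels coded as 0/1 and an unpenalised intercept; the function names and the toy data are hypothetical and do not come from the slides, and no optimisation is performed.

```python
import math

def sigmoid(q):
    """Logistic function sigma(q) = 1 / (1 + exp(-q))."""
    return 1.0 / (1.0 + math.exp(-q))

def linear_score(beta0, beta, x):
    """q = beta0 + sum_j beta_j * x_j for a single sample x."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def lasso_objective(beta0, beta, X, y, lam):
    """Log-likelihood minus the L1 penalty (assumption: intercept not penalised)."""
    loglik = 0.0
    for xi, yi in zip(X, y):
        q = linear_score(beta0, beta, xi)
        loglik += yi * q - math.log(1.0 + math.exp(q))
    penalty = lam * sum(abs(b) for b in beta)
    return loglik - penalty

# Toy data, made up for illustration: 4 samples, 2 features, labels in {0, 1}.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.5]]
y = [0, 0, 1, 1]

# All-zero coefficients versus a candidate that uses only the first feature.
obj_zero = lasso_objective(0.0, [0.0, 0.0], X, y, lam=0.1)
obj_fit = lasso_objective(-3.0, [2.0, 0.0], X, y, lam=0.1)
```

On this toy data the sparse candidate scores higher than the all-zero vector despite paying a penalty for its one non-zero coefficient, which is the trade-off that λ controls.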

Methodology: Elastic Net - Logistic regression with regularisation by a combination of L1 and L2 norms
The difference from LASSO is in the Elastic Net penalty:
$$\lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right),$$
where $\alpha \in [0, 1]$ sets the compromise between Ridge and LASSO: $\alpha = 1$ gives the Ridge (L2) penalty and $\alpha = 0$ gives the LASSO (L1) penalty. The Elastic Net criterion therefore has the following form:
$$\hat{\beta}^{\text{enet}} = \underset{\beta}{\operatorname{argmax}} \left\{ \sum_{i=1}^{N} \left[ y_i q_i - \log\left(1 + e^{q_i}\right) \right] - \lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right) \right\}$$
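The effect of the mixing parameter can be checked numerically. The sketch below implements only the penalty term, in the parameterisation written on this slide; the function name and the coefficient values are hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * sum_j (alpha * beta_j**2 + (1 - alpha) * |beta_j|)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return lam * sum(alpha * b * b + (1.0 - alpha) * abs(b) for b in beta)

beta = [0.5, -2.0, 0.0]  # hypothetical coefficients

lasso_like = elastic_net_penalty(beta, lam=1.0, alpha=0.0)  # pure L1 -> 2.5
ridge_like = elastic_net_penalty(beta, lam=1.0, alpha=1.0)  # pure L2 -> 4.25
mixed = elastic_net_penalty(beta, lam=1.0, alpha=0.5)       # halfway -> 3.375
```

This follows the slide's convention. Libraries such as glmnet define the mixing parameter the other way round (there alpha = 1 is the LASSO), so the convention should be checked before reusing a tuned value.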

Methodology: Support Vector Machines
Given training data $(x_i, y_i)$ for $i = 1, \ldots, N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier $f(x)$ such that
$$f(x_i) \begin{cases} \ge 0, & y_i = +1 \\ < 0, & y_i = -1 \end{cases}$$
i.e. $y_i f(x_i) > 0$ for a correct classification.
A linear classifier has the form
$$f(x) = w \cdot x + b$$
The margin is given by $\frac{2}{\|w\|}$, so the classifier is found by solving
$$\max_{w} \frac{2}{\|w\|} \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1.$$
[Figure: points of two classes (o and x) separated by the hyperplane w·x + b = 0.]
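The margin and the constraint can be illustrated with a small sketch that checks whether a given hyperplane satisfies the margin constraint for every training point, using the form f(x) = w·x + b. It does not solve the optimisation problem; the function names, the toy points and the candidate hyperplanes are all hypothetical.

```python
import math

def decision_value(w, b, x):
    """Linear classifier f(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

def satisfies_margin(w, b, X, y):
    """True if y_i * f(x_i) >= 1 holds for every training point."""
    return all(yi * decision_value(w, b, xi) >= 1.0 for xi, yi in zip(X, y))

# Toy 2-D data, made up for illustration, with labels in {-1, +1}.
X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]]
y = [-1, -1, 1, 1]

# Candidate 1: feasible, margin width 2.
w1, b1 = [1.0, 0.0], -2.0
# Candidate 2: a wider margin (4), but two points fall inside it.
w2, b2 = [0.5, 0.0], -1.0

feasible_1 = satisfies_margin(w1, b1, X, y)
feasible_2 = satisfies_margin(w2, b2, X, y)
```

The second candidate shows why the constraint matters: shrinking ||w|| widens the margin, but only hyperplanes that keep every point on or outside the margin are admissible.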

Methodology: Random Forest
The Random Forest algorithm works by constructing an ensemble of decision trees. All the trees are constructed independently, using the Gini impurity criterion to choose partition attributes. Classification is carried out by majority voting: every tree assigns the object to one of the classes, and the class that receives the highest number of votes wins.
[Figure: input x is passed to Tree 1, Tree 2, ..., Tree n, whose votes are combined into the output y.]
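The two ingredients named here, Gini impurity and majority voting, are small enough to sketch directly. This is not a full Random Forest (no tree construction or bootstrap sampling); the function names and the example labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k**2 of the class labels in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def majority_vote(tree_predictions):
    """Class predicted by the most trees (ties go to the class seen first)."""
    return Counter(tree_predictions).most_common(1)[0][0]

mixed_node = gini_impurity(["CRC", "CRC", "healthy", "healthy"])  # -> 0.5
pure_node = gini_impurity(["CRC", "CRC", "CRC"])                  # -> 0.0

# Hypothetical votes of five trees for one sample.
winner = majority_vote(["CRC", "healthy", "CRC", "CRC", "healthy"])  # -> "CRC"
```

A split is attractive when it moves the child nodes from the mixed case towards the pure case, that is, when it lowers the weighted impurity.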

Methodology: Pipeline
1. Preprocessing: filtering, log-transform, normalisation.
2. Partition of the F set into test and training sets for 10-fold cross-validation.
3. Selection of the optimal hyperparameter(s) with nested 10-fold cross-validation and fitting of the model (10×).
4. Application of the fitted model to the test set of each fold and to the GH set.
5. Model interpretation and important feature extraction.
6. Performance evaluation.
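The nested cross-validation in this pipeline can be sketched as index bookkeeping: an outer 10-fold split provides the test sets, and within each outer training set an inner 10-fold split is used for hyperparameter selection. The sketch below only generates the index sets. It assumes plain random (non-stratified) folds, which may differ from what the authors did, and the function names are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv_splits(n_samples, k_outer=10, k_inner=10, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits is a list of (inner_train, inner_val) pairs that are
    drawn only from outer_train, so the outer test fold is never used
    for hyperparameter selection.
    """
    outer_folds = k_fold_indices(n_samples, k_outer, seed)
    for i, test in enumerate(outer_folds):
        train = [j for f, fold in enumerate(outer_folds) if f != i for j in fold]
        # Inner folds are positions within `train`, mapped back to sample indices.
        inner_folds = k_fold_indices(len(train), k_inner, seed + 1 + i)
        inner_splits = []
        for val_positions in inner_folds:
            val = [train[p] for p in val_positions]
            val_set = set(val)
            inner_train = [j for j in train if j not in val_set]
            inner_splits.append((inner_train, val))
        yield train, test, inner_splits

# 156 is the size of the F set given in the data table.
splits = list(nested_cv_splits(156))
```

For each outer fold, one would fit the model on every `inner_train` for each candidate hyperparameter value, score it on `inner_val`, refit on the whole outer `train` with the best value, and only then predict on `test`.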

Methodology: Performance estimation
The result of classification is a set of predicted probabilities that an element belongs to the positive class (CRC). After choosing the decision boundary we can construct the confusion matrix:

                           Actual class: Positive   Actual class: Negative
Predicted class: Positive  TP                       FP
Predicted class: Negative  FN                       TN

TP - True Positive; TN - True Negative; FP - False Positive; FN - False Negative.

Methodology: Performance estimation II
Performance metrics that can be calculated from the confusion matrix at a fixed decision boundary:
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{FP + TN}$$
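As a worked example of these formulas, the sketch below thresholds predicted probabilities at a fixed decision boundary, counts the four confusion-matrix cells and derives the metrics. The labels, probabilities, threshold of 0.5 and function names are all made up for illustration.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (TP, FP, FN, TN) for labels in {0, 1} at a fixed threshold."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def _ratio(num, den):
    """Safe division: undefined metrics are reported as NaN."""
    return num / den if den else float("nan")

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and specificity from the four counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fn + fp + tn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, fp + tn),
    }

# Hypothetical labels (1 = CRC) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

counts = confusion_counts(y_true, y_prob)  # -> (3, 1, 1, 3)
scores = metrics(*counts)                  # every metric equals 0.75 here
```

Moving the threshold changes all four counts at once, which is why a single-threshold metric is only a partial picture of a classifier.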

Methodology: Performance estimation III
To compare the classification models we use the ROC curve (Receiver Operating Characteristic) and the Area Under the Curve (AUC). The ROC curve shows the relation between Sensitivity (True Positive Rate) and 1 − Specificity (False Positive Rate) as the decision boundary varies.
[Figure: ROC curve, Sensitivity (0.0 to 1.0) plotted against Specificity (1.0 to 0.0).]
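One way to see how the curve arises is to lower the decision boundary step by step and record the resulting (False Positive Rate, True Positive Rate) pairs, then integrate with the trapezoidal rule. The following is a minimal sketch under that approach; the function names and toy scores are hypothetical, and it assumes both classes are present in the labels.

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) points obtained by lowering the threshold step by step.

    Assumes labels in {0, 1} with at least one positive and one negative.
    """
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Samples sharing a score cross the threshold together (ties).
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels (1 = CRC) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

curve = roc_points(y_true, y_score)
area = auc(curve)  # -> 0.9375
```

The value 0.9375 equals 15/16: of the 16 possible (positive, negative) pairs in the toy data, only one is ranked in the wrong order, which is the usual ranking interpretation of the AUC.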

Results I

Classifier                                                 AUC, F set,         AUC, F set,          AUC, F set, taxonomic
                                                           taxonomic features  functional features  and functional features
LASSO                                                      0.84                0.80                 0.87
Elastic net                                                0.83                0.79                 0.87
Support Vector Machines (SVM), no feature selection        0.82                0.76                 0.85
Random forest                                              0.87                0.79                 0.85

Table: Performance of different classification models on training set F using taxonomic and functional features.

Results II

Classifier                                                                                      AUC (GH set)
LASSO                                                                                           0.85
Elastic net                                                                                     0.85
Support Vector Machines (SVM) with feature selection using linear correlation criterion         0.89
Random forest                                                                                   0.87

Table: Performance of different classification models on the test set GH.

Results: Important features
All classifiers and filters highlight the importance of the following bacteria: Fusobacterium nucleatum vincentii, Fusobacterium nucleatum animalis and Peptostreptococcus stomatis. These bacteria are oral pathogens. Other studies (Warren et al. (2013), Feng et al. (2014)) also identify these species as CRC-related bacteria. It is still unclear whether they are a cause or a consequence of tumor growth.

Results: Confounder assessment
[Figure: panels A (gender proportions, female and male), B (age) and C (BMI), each comparing controls and cases. P-values shown on the figure: Wilcoxon test p-value = 0.0027; Wilcoxon test p-value = 0.76; Fisher test p-value = 0.86.]
Figure: Boxplots. (A) Comparison of gender proportions between CRC patients and controls of study population F. (B) Comparison of patient age as a potential confounder. (C) Comparison of body mass index (BMI) as a potential confounder.
