Improving Disease Prediction Using Machine Learning



  1. Improving Disease Prediction Using Machine Learning Shelda Sajeev, Stephanie Champion, Anthony Maeder Flinders Digital Health Research Centre, College of Nursing and Health Sciences, Flinders University, Adelaide, Australia

  2. Introduction
     • Cardiovascular disease (CVD) is one of the leading causes of death worldwide (~30%) and is regarded as highly preventable (~90%) [1].
     • Primary prevention is thus a high priority and requires screening for risk factors and providing suitable interventions.
     • Clinicians need accurate and reliable disease prediction tools to identify people who are at increased risk of a cardiovascular event.

  3. Introduction (cont.)
     • Numerous CVD risk prediction models to estimate an individual's likelihood of a CVD event are available [2].
     • Conventional predictive models typically use simple regression fitting over relatively few risk factors.
     • Regression approaches are simple, but do not model any non-linearity in the contributions of the chosen factors.
     • In practice, many factors are correlated and have underlying non-linear relationships to the predicted outcome.
     • Regression models are generalised from broad-based population datasets, which can miss some subtle associations.

  4. Machine Learning
     • Learning: Overcome limitations of the conventional models.
     • Scale: Cater for a larger number of variables in the model.
     • Complexity: Address multivariate interactions and non-linear relationships.
     • Adaptivity: Support an adaptive approach for risk predictor revisions.

  5. Purpose
     The aim of the work reported here was to investigate the plausibility of using a machine learning approach, by demonstrating its ability to derive prediction models for heart disease risk. This study discusses variations that can arise in the performance of some typical linear and more sophisticated non-linear machine learning prediction methods. The effects of different underlying populations on predictive performance, and the impact of combining cohorts to mimic a more general population, are considered.

  6. Methods
     • We used two datasets from the widely known University of California, Irvine (UCI) machine learning repository [3].
     • The two datasets were the Statlog heart dataset (270 participants) and Cleveland heart disease dataset (303 participants).
     • To provide a larger sample size, the two datasets were also combined over the 13 common risk factors (with no duplicates).
     • The machine learning study was conducted on the two datasets individually and on the combined dataset.
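The combination step described on this slide could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the column names, the tiny toy frames, and the helper `combine_cohorts` are all hypothetical, and dropping exact duplicate rows is only one possible reading of "with no duplicates".

```python
import pandas as pd

# Hypothetical stand-in for the 13 common risk factors (only 3 shown here).
COMMON_FEATURES = ["age", "sex", "chol"]


def combine_cohorts(a, b, feature_cols, label_col="target"):
    """Stack two cohorts over their shared columns and drop exact duplicate rows."""
    cols = list(feature_cols) + [label_col]
    combined = pd.concat([a[cols], b[cols]], ignore_index=True)
    return combined.drop_duplicates(subset=cols).reset_index(drop=True)


# Toy data (invented for illustration, not taken from the UCI files).
statlog = pd.DataFrame(
    {"age": [63, 41, 57], "sex": [1, 0, 1], "chol": [233, 204, 192], "target": [1, 0, 1]}
)
cleveland = pd.DataFrame(
    {"age": [63, 67], "sex": [1, 1], "chol": [233, 286], "target": [1, 1], "extra": [0, 1]}
)

combined = combine_cohorts(statlog, cleveland, COMMON_FEATURES)
print(combined)
```

Columns that exist in only one cohort (here the made-up `extra` column) are discarded, so the combined frame contains only the shared features and the label.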

  7. Study Population Characteristics
     • Average age: 54 years. Substantially fewer women than men (32% women, 68% men).
     • 14% had diabetes and 52% had high cholesterol (above 240).
     • 51% exhibited an abnormality in ECG results and 31% exhibited major vessel calcification in fluoroscopy.
     • 33% experienced exercise-induced angina.
     • 257 (45%) cases of heart disease, from 567 participants. Statlog cohort had 120/270 (44%) and Cleveland had 137/297 (46%).

  8. Approach
     • Before applying machine learning algorithms, the data was normalized to zero mean and unit variance, to ensure each variable would have the same influence on the cost function in designing the classifier.
     • Data was then separated into Training (80%) and Testing (20%) subsets by random selection, independently and repeatedly for successive runs.
     [Figure: Overview of the machine learning approach]
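The normalization and repeated 80/20 splitting described on this slide might look like the sketch below. The data is synthetic, and the number of runs and the use of the run index as the random seed are assumptions made for illustration. The order (scale first, then split) follows the slide; fitting the scaler on the training subset only is a common alternative that avoids leaking test-set statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a cohort with 13 risk factors (not real patient data).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 13))
y = rng.integers(0, 2, size=100)

# Normalize every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Draw an independent random 80/20 split for each run (5 runs here for brevity).
splits = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=seed
    )
    splits.append((X_train, X_test, y_train, y_test))

print(splits[0][0].shape, splits[0][1].shape)
```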

  9. Experimental Setup
     • Four popular machine learning models were used:
       • Logistic regression (LR) [4]
       • Linear discriminant analysis (LDA) [5]
       • Support vector machine (SVM) with RBF kernel [6]
       • Random forest (RF) [7]
     • LR and LDA are simple linear classifiers; SVM and RF are more advanced machine learning models that support non-linear classification.
     • All the machine learning algorithms were implemented in Python using the Scikit-learn library.
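As a rough sketch of how the four models listed on this slide can be set up in Scikit-learn: the hyperparameters below (`max_iter`, `n_estimators`, the seeds) are assumptions, since the slides do not report the settings used, and the data is generated synthetically.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification problem with 13 features (illustrative only).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Two linear classifiers (LR, LDA) and two non-linear ones (SVM-RBF, RF).
models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "SVM-RBF": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its accuracy on the held-out 20%.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```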

  10. Results
      • A confusion matrix was used to review the performance of the classification algorithm, reporting four outcomes: True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN).
      • The performance measures extracted were sensitivity, specificity, precision and accuracy.
        • Sensitivity = TP / (TP + FN)
        • Specificity = TN / (TN + FP)
        • Precision = TP / (TP + FP)
        • Accuracy = (TP + TN) / (TP + TN + FN + FP)
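The four formulas above translate directly into code. The following sketch uses made-up labels and hypothetical helper names (`confusion_counts`, `performance_measures`), and assumes 1 denotes the presence of heart disease and 0 its absence.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = disease, 0 = no disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def performance_measures(tp, fp, tn, fn):
    """Apply the four formulas from the slide to the confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
    }


# Made-up example: 4 true cases and 6 non-cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
measures = performance_measures(*counts)
print(counts, measures)
```

In this toy case TP = 3, FP = 1, TN = 5 and FN = 1, so sensitivity and precision are both 3/4, specificity is 5/6, and accuracy is 8/10.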

  11. Prediction Accuracy – Individual Datasets
      Comparison of the performance of the four machine learning models using 13 risk factors predicting heart disease incidence for individual datasets (Statlog and Cleveland). The reported values are the average of 50 iterations.

      Statlog Heart Dataset
      | Algorithm | Sensitivity | Specificity | Precision | Accuracy | AUC |
      |---|---|---|---|---|---|
      | Logistic Regression | 0.807 | 0.859 | 0.821 | 0.836 | 0.910 |
      | Linear Discriminant Analysis | 0.798 | 0.870 | 0.830 | 0.838 | 0.909 |
      | Support Vector Machine – RBF | 0.807 | 0.849 | 0.849 | 0.830 | 0.907 |
      | Random Forest | 0.788 | 0.879 | 0.838 | 0.836 | 0.913 |

      Cleveland Heart Dataset
      | Algorithm | Sensitivity | Specificity | Precision | Accuracy | AUC |
      |---|---|---|---|---|---|
      | Logistic Regression | 0.794 | 0.869 | 0.841 | 0.834 | 0.903 |
      | Linear Discriminant Analysis | 0.789 | 0.886 | 0.858 | 0.840 | 0.904 |
      | Support Vector Machine – RBF | 0.773 | 0.867 | 0.867 | 0.828 | 0.900 |
      | Random Forest | 0.778 | 0.883 | 0.853 | 0.832 | 0.912 |
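The averaging over 50 iterations mentioned for these results could be organised along the following lines. This is a sketch under assumptions: the data is synthetic, only logistic regression and accuracy are shown, and using the run index as the random seed is an illustrative choice rather than something the slides specify.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 features (not the UCI heart datasets).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

N_RUNS = 50  # number of iterations, as stated on the slide
accuracies = []
for seed in range(N_RUNS):
    # A fresh random 80/20 split for every run.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

mean_accuracy = float(np.mean(accuracies))
print(f"mean accuracy over {N_RUNS} runs: {mean_accuracy:.3f}")
```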

  12. Prediction Accuracy – Combined Dataset
      Comparison of the performance of the four machine learning models using 13 risk factors predicting heart disease incidence for combined dataset (Statlog and Cleveland). The reported values are the average of 50 iterations.

      | Algorithm | Sensitivity | Specificity | Precision | Accuracy | AUC |
      |---|---|---|---|---|---|
      | Logistic Regression | 0.817 | 0.873 | 0.844 | 0.848 | 0.913 |
      | Linear Discriminant Analysis | 0.800 | 0.888 | 0.857 | 0.848 | 0.911 |
      | Support Vector Machine – RBF | 0.866 | 0.906 | 0.885 | 0.888 | 0.943 |
      | Random Forest | 0.890 | 0.955 | 0.943 | 0.933 | 0.963 |

  13. Results – Area Under Curve (ROC)
      ROC curves for Logistic Regression (LR), Random Forest (RF), Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM) models for UCI study participants (Statlog and Cleveland cohorts combined). The ROC curve is drawn for one of the 50 iterations.
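For a single iteration, the ROC curve and its area can be computed with Scikit-learn as sketched below. The data is synthetic, only logistic regression is shown, and the stratified split is an assumption added so that both classes appear in the test subset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 features (illustrative only).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC analysis needs a continuous score: use the predicted probability of class 1.
prob_positive = model.predict_proba(X_test)[:, 1]

# False-positive rate and true-positive rate at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, prob_positive)
auc = roc_auc_score(y_test, prob_positive)
print(f"AUC = {auc:.3f}")
```

Plotting `tpr` against `fpr` gives the curve; the AUC summarises it as a single number between 0 and 1.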

  14. Discussion
      • The results for the two individual dataset cohorts from different sources show that even for a small dataset, machine learning models can produce good results.
      • Variations in these two comparable cohorts do not affect this adversely.
      • When the cohorts are combined, the performance of the non-linear models increases substantially, while the results from the linear models remain similar.

  15. Conclusions
      • This work demonstrates there is value in considering machine learning methods for disease prediction modelling, and offers the potential for the modelling performance to improve as dataset size increases.
      • This suggests that the machine learning approach may be more effective for maintaining prediction accuracy for datasets which change over time, as well as for specialized cohorts within the overall population, for which prediction may be less accurate due to deviation from a standard model.

  16. References
      1. WHO: Prevention of cardiovascular disease: guidelines for assessment and management of total cardiovascular risk. World Health Organization (2007)
      2. Sajeev, S., Maeder, A.: Cardiovascular risk prediction models: A scoping review. In: Proceedings of the Australasian Computer Science Week Multiconference, p. 21, ACM (2019)
      3. Bache, K., Lichman, M.: UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science (2013). Available at http://archive.ics.uci.edu/ml
      4. Hosmer Jr, D.W., Lemeshow, S., Sturdivant, R.X.: Applied logistic regression, vol. 398, John Wiley & Sons (2013)
      5. Mika, S., Ratsch, G., Weston, J., Scholkopf, B., Mullers, K.R.: Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41-48, IEEE (1999)
      6. Van Gestel, T., Suykens, J.A., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G., De Moor, B., Vandewalle, J.: Benchmarking least squares support vector machine classifiers. Machine Learning 54(1), pp. 5-32 (2004)
      7. Breiman, L.: Random forests. Machine Learning 45(1), pp. 5-32 (2001)
