Seminars in Software and Services for the Information Society

  1. Dipartimento di Ingegneria Informatica Automatica e Gestionale Antonio Ruberti. Master of Science in Engineering in Computer Science (MSE-CS). Seminars in Software and Services for the Information Society (Umberto Nanni). Lara Malfatti (MD Thesis, March 2013): Data Mining for evaluating the risk of chemotherapy-associated thrombosis. Lara Malfatti - MD Thesis (Advisor: Umberto Nanni) Seminars of Software and Services for the Information Society

  2. Outline • Problem and contextualization • Data Mining methodologies • Dataset preprocessing • Attribute selection • Classification • Cost evaluation • Conclusion

  3. Venous Thrombo-Embolism (VTE) • Its incidence rises from 0.1% in the general population to 3% in cancer patients • It is the second leading cause of death in cancer patients • Its treatment is a major cost for the National Health Service (about €8,000 per patient)

  4. Data set description. The data set contains 565 instances (526 negative + 39 positive). Each entry contains 35 variables, which can be grouped into: 1. Patient risk factors, such as age, sex, laboratory analyses and comorbid conditions (e.g. obesity) 2. Cancer risk factors, such as site and stage of the tumor 3. Treatment risk factors, such as administration of chemotherapy or targeted therapy agents

  5. State of the art

  6. Terminology • Classification process: takes an instance as input and tries to predict whether it is positive or negative • Medical evaluation metrics are derived from the related confusion matrix.
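
The metrics used in the rest of the deck (accuracy, PPV, NPV) can be derived from the four cells of a confusion matrix. The sketch below is a minimal illustration, not code from the thesis; the class name `ConfusionMatrix` and the example counts are assumptions chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts of a binary classifier's outcomes (hypothetical helper)."""
    tp: int  # positive instances predicted positive
    fp: int  # negative instances predicted positive
    tn: int  # negative instances predicted negative
    fn: int  # positive instances predicted negative

    def accuracy(self) -> float:
        # Share of all instances that were classified correctly.
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def ppv(self) -> float:
        # Positive predictive value: how often a positive prediction is right.
        predicted_positive = self.tp + self.fp
        return self.tp / predicted_positive if predicted_positive else float("nan")

    def npv(self) -> float:
        # Negative predictive value: how often a negative prediction is right.
        predicted_negative = self.tn + self.fn
        return self.tn / predicted_negative if predicted_negative else float("nan")


# Illustrative counts only; they are not results from the thesis.
example = ConfusionMatrix(tp=8, fp=2, tn=85, fn=5)
print(example.accuracy(), example.ppv(), example.npv())
```

PPV and NPV return NaN when the classifier never predicts the corresponding class, which makes that degenerate case visible instead of raising an error.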

  7. Statistical approach: Khorana's score. This model uses 5 biological variables as predictors and classifies patients into three risk categories: low, intermediate and high risk. Number of patients per category: low 280, intermediate 252, high 33. Pros: • Simple and clear model • Low cost of predictive variables. Cons: • Too many patients classified as "intermediate risk" • Poor performance. Metrics: Accuracy 53%, PPV 10%, NPV 96%
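
The slide does not list the five variables or their thresholds. The sketch below fills them in from the commonly published form of the Khorana score (cancer site, platelet count, hemoglobin or use of erythropoiesis-stimulating agents, leukocyte count, BMI), so treat those cut-offs as an assumption to check against the original publication rather than as part of this deck. The function and parameter names are hypothetical.

```python
# Site groups as commonly reported for the Khorana score (assumption, not from the slides).
VERY_HIGH_RISK_SITES = {"stomach", "pancreas"}
HIGH_RISK_SITES = {"lung", "lymphoma", "gynecologic", "bladder", "testicular"}


def khorana_score(site: str, platelets_e9_per_l: float, hemoglobin_g_dl: float,
                  uses_esa: bool, leukocytes_e9_per_l: float, bmi: float) -> int:
    """Sum the points of the five predictors (pre-chemotherapy values)."""
    score = 0
    if site in VERY_HIGH_RISK_SITES:
        score += 2
    elif site in HIGH_RISK_SITES:
        score += 1
    if platelets_e9_per_l >= 350:
        score += 1
    if hemoglobin_g_dl < 10 or uses_esa:
        score += 1
    if leukocytes_e9_per_l > 11:
        score += 1
    if bmi >= 35:
        score += 1
    return score


def risk_category(score: int) -> str:
    """Map a score to the three categories named on the slide."""
    if score == 0:
        return "low"
    if score <= 2:
        return "intermediate"
    return "high"


print(risk_category(khorana_score("lung", 360, 12.5, False, 8.0, 27.0)))
```

Because any score of 1 or 2 maps to the middle category, a point-based rule of this kind tends to leave many patients there, which is the first drawback listed on the slide.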

  8. Challenge: • Is it possible to find better variable combinations able to predict thrombosis through data mining? • What is the best predictive combination in terms of cost/benefit among all the possible ones? • Are the screening costs of these combinations sustainable for the National Health Service?

  9. Outline • Problem and contextualization • Data Mining methodologies • Dataset preprocessing • Attribute selection • Classification • Cost evaluation • Conclusion

  10. Knowledge Discovery in Health Care

  11. WEKA: Waikato Environment for Knowledge Analysis • It is a free tool for data mining applications, written in Java • It implements all the steps of the KDD workflow, from data preprocessing to the visualization of discovered patterns • Attention here is focused on data preprocessing, attribute selection and the learning phase

  12. WEKA: learning phase. Training and testing data sets must be disjoint. An unbalanced data set causes: • Excessive influence of the majority class on the classification model • High global performance without correctly predicting a single instance of the minority class. The creation of balanced training and testing data sets is carried out manually during the preprocessing phase
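
The second effect can be checked with the class counts given in the data set description (526 negative, 39 positive). The sketch below is a hedged illustration: it evaluates a trivial classifier that labels every instance negative; the function name is hypothetical.

```python
def always_negative_metrics(negatives: int, positives: int) -> dict:
    """Metrics of a trivial classifier that predicts 'negative' for everyone."""
    tn, fn, tp, fp = negatives, positives, 0, 0
    total = tn + fn + tp + fp
    return {
        "accuracy": (tp + tn) / total,    # looks high on unbalanced data
        "npv": tn / (tn + fn),            # equals the share of negatives
        "sensitivity": tp / (tp + fn),    # no positive is ever detected
    }


baseline = always_negative_metrics(negatives=526, positives=39)
print({name: round(value, 3) for name, value in baseline.items()})
```

On these counts the trivial classifier reaches about 93% accuracy while detecting none of the 39 thrombotic events, which is why accuracy alone is not a sufficient criterion here.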

  13. Outline • Problem and contextualization • Data Mining methodologies • Dataset preprocessing • Attribute selection • Classification • Cost evaluation • Conclusion

  14. Data set pre-processing: cleaning (1/3). Create three balanced folds and combine the partial results • All the instances are classified exactly once • All the training sets have the same number of positive and negative instances • Training and testing data sets are disjoint. Extra cost: each experiment needs three runs
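
The slide does not say how the three folds are built, so the following is only one possible reading: every instance lands in exactly one test fold, and each training set is balanced by undersampling the negatives that remain outside that fold. The function name, the seed, and the undersampling choice are assumptions.

```python
import random


def balanced_folds(positives, negatives, k=3, seed=0):
    """Return k (train, test) pairs with balanced training sets.

    Assumes the negatives outside each fold outnumber the positives
    outside it, so that undersampling is always possible.
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Deal instances round-robin so the k parts partition each class.
    pos_parts = [pos[i::k] for i in range(k)]
    neg_parts = [neg[i::k] for i in range(k)]
    folds = []
    for i in range(k):
        test = pos_parts[i] + neg_parts[i]
        train_pos = [x for j in range(k) if j != i for x in pos_parts[j]]
        neg_pool = [x for j in range(k) if j != i for x in neg_parts[j]]
        # Undersample negatives to match the number of training positives.
        train_neg = rng.sample(neg_pool, len(train_pos))
        folds.append((train_pos + train_neg, test))
    return folds


positives = [("pos", i) for i in range(39)]
negatives = [("neg", i) for i in range(526)]
for train, test in balanced_folds(positives, negatives):
    print(len(train), len(test))
```

With 39 positives, each training set in this sketch holds 26 positive and 26 negative instances, and the three test folds together cover all 565 instances once.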

  15. Data set pre-processing: cleaning (2/3). The objective is to remove noisy instances • VTE normally occurs within 6 months from the beginning of chemotherapy • The time interval is extended to 12 months to also cover asymptomatic events. Outliers are due to: • Intrinsic probability of having a thrombotic event • Changes in anticancer treatments

  16. Data set preprocessing: improvements (3/3). Unstructured numerical data are aggregated so that they do not distort the classification model (see figure). Instances with missing values are discarded because: • Artificial values cannot correspond to real cases • They can create problems in both the training and testing data sets
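
Both steps can be sketched in a few lines. This is an illustration under assumptions: "aggregated" is read here as grouping numeric values into ranges, and the attribute names, cut points and labels are hypothetical, since the slide's figure is not available.

```python
from bisect import bisect_right


def drop_incomplete(rows):
    """Keep only the instances in which every attribute has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]


def discretize(value, cut_points, labels):
    """Map a number to the label of the range it falls in.

    cut_points must be sorted; labels needs one more entry than cut_points.
    """
    return labels[bisect_right(cut_points, value)]


# Hypothetical records; None marks a missing value.
rows = [
    {"age": 48, "platelets": 310},
    {"age": 71, "platelets": None},
    {"age": 65, "platelets": 402},
]
complete = drop_incomplete(rows)
age_groups = [discretize(r["age"], [50, 65], ["<50", "50-64", ">=65"]) for r in complete]
print(len(complete), age_groups)
```

Discarding incomplete instances avoids inventing values, at the price of a smaller data set; that trade-off matters more when the positive class has only 39 instances.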

  17. Outline • Problem and contextualization • Data Mining methodologies • Dataset preprocessing • Attribute selection • Classification • Cost evaluation • Conclusion

  18. Attribute selection (1/2). Feature selection returns meaningful subsets of the original attributes, ignoring those that provide no information. Filter methods: • They are independent of any learning algorithm and rely only on properties of the data • They can be seen as the combination of search techniques for proposing new subsets and evaluation metrics to rank them. WEKA provides many options for both

  19. Attribute selection (2/2) • GreedyStepwise: performs a greedy search through the space of attribute subsets; both directions (backward and forward) are supported, here starting from the empty set • CfsSubsetEval (correlation-based feature subset evaluation): prefers subsets whose attributes are highly correlated with the class but have low inter-correlation
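
The idea behind this pairing can be sketched without WEKA. The code below is a simplified stand-in, not WEKA's implementation: it scores a subset with the usual correlation-based merit, k·r_cf / sqrt(k + k(k-1)·r_ff), using plain Pearson correlation as an assumed correlation measure, and grows the subset greedily from the empty set while the merit improves. All function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def cfs_merit(subset, columns, target):
    """High when attributes correlate with the class but not with each other."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[a], target)) for a in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward(columns, target):
    """Add the best attribute at each step; stop when the merit stops improving."""
    selected, best = [], 0.0
    while True:
        scored = [(cfs_merit(selected + [a], columns, target), a)
                  for a in columns if a not in selected]
        if not scored:
            break
        merit, attribute = max(scored)
        if merit <= best:
            break
        selected.append(attribute)
        best = merit
    return selected, best


# Toy data: 'signal' tracks the class, 'noise' does not.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {"signal": [0, 0, 0, 1, 1, 1, 1, 1], "noise": [0, 1, 0, 1, 0, 1, 0, 1]}
print(greedy_forward(columns, target))
```

A duplicate of an already selected attribute adds nothing to the merit in this formula, which is how the evaluator penalizes inter-correlated attributes.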

  20. Outline • Problem and contextualization • Data Mining methodologies • Dataset preprocessing • Attribute selection • Classification • Cost evaluation • Conclusion

  21. Classification. Guidelines: • For each subset found in the previous step, experiments are conducted using different learning algorithms • PPV, NPV and accuracy are compared; Khorana's results are used as benchmarks • A constraint is fixed: no NPV values lower than 96% are allowed. WEKA provides a variety of learning algorithms; the ones used in the experiments are: • Bayes algorithms, decision trees, covering rules, logistic regression functions and lazy algorithms
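
The selection rule in these guidelines can be sketched as a filter followed by a comparison. The benchmark values are the Khorana metrics reported earlier in the deck (accuracy 53%, PPV 10%, NPV 96%); the candidate names and their numbers are placeholders invented for illustration, not thesis results, and requiring an improvement in both accuracy and PPV is an assumed reading of "compared".

```python
KHORANA_BENCHMARK = {"accuracy": 0.53, "ppv": 0.10, "npv": 0.96}
MIN_NPV = 0.96  # constraint from the guidelines: NPV must not drop below 96%


def admissible(results, min_npv=MIN_NPV):
    """Keep only the experiments that satisfy the NPV constraint."""
    return {name: m for name, m in results.items() if m["npv"] >= min_npv}


def beats_benchmark(metrics, benchmark=KHORANA_BENCHMARK):
    """Assumed criterion: better accuracy and better PPV than the benchmark."""
    return metrics["accuracy"] > benchmark["accuracy"] and metrics["ppv"] > benchmark["ppv"]


# Placeholder experiments; the numbers are made up for the example.
experiments = {
    "candidate_a": {"accuracy": 0.70, "ppv": 0.15, "npv": 0.97},
    "candidate_b": {"accuracy": 0.80, "ppv": 0.20, "npv": 0.94},
}
kept = admissible(experiments)
print([name for name, m in kept.items() if beats_benchmark(m)])
```

Applying the NPV constraint first means that a model with attractive accuracy is still rejected if it misses too many true events, as with the second placeholder.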

  22. Classification: Accuracy. All the predictive groups have better accuracy than Pure-KS
