

  1. Microarray Data Integration and Machine Learning Techniques For Lung Cancer Survival Prediction Daniel Berrar, Brian Sturgeon, Ian Bradbury, C. Stephen Downes, Werner Dubitzky November 14, 2003

  2. Outline • Summary of Results (1 slide) • Overview of Tasks (1 slide) • Data Integration (4 slides) • Methods (6 slides) • Results and Biological Interpretation (6 slides) • Conclusions (1 slide)

  3. Summary of Results • With respect to tasks: – Classification task: Prediction of 5-year survival is most accurate when we build a model using only patient data (age, tumor stage, …); – Regression task: Prediction of survival in months is more accurate for the model relying on expression data than on patient data, and best when the model relies on both patient and expression data; • With respect to methods: – "Best" model: decision tree

  4. Tasks • Task #1: Data integration – Integration of the Harvard and Michigan lung cancer microarray data sets and data pre-processing; • Task #2: Classification – (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data; – (b) Comparison of 6 machine learning methods; • Task #3: Regression – Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data; • Task #4: Interpretation – Biological interpretation of identified genes.

  5. Task #1: Data Integration [1/4]

  6. Task #1: Data Integration [2/4] [Diagram: the integrated data set combines target variables, patient data, and expression data for 211 patients and 3,588 genes]

  7. Task #1: Data Integration [3/4] • Data pre-processing for classification task: – Group patients into 2 classes: • LOW RISK: Survival ≥ 5 years • HIGH RISK: Survival < 5 years – Discard patients that are censored before 60 months – Remaining number of patients: 136 • Data pre-processing for regression task: – Include all 211 patients. • Data pre-processing for both tasks: – Generate learning set and test set by randomly splitting the entire data set (~70% : ~30%).
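The pre-processing rules on slide 7 can be sketched in Python (an illustrative sketch with invented toy records; the field names are assumptions, not the actual Harvard/Michigan columns):

```python
import random

# Hypothetical patient records; field names are illustrative only.
patients = [
    {"id": i, "survival_months": m, "censored": c}
    for i, (m, c) in enumerate([(72, False), (30, False), (10, True),
                                (65, True), (40, True), (90, False)])
]

# Classification task: keep only patients with a known 5-year outcome.
# A patient censored before 60 months has an unknown 5-year status.
def five_year_class(p):
    if p["survival_months"] >= 60:
        return "LOW RISK"       # observed alive at >= 5 years
    if not p["censored"]:
        return "HIGH RISK"      # died before 5 years
    return None                 # censored before 60 months: discard

labelled = [(p, five_year_class(p)) for p in patients]
classification_set = [(p, y) for p, y in labelled if y is not None]

# Regression task: all patients are kept.
regression_set = list(patients)

# Random ~70% / ~30% learning/test split, as on the slide.
random.seed(0)
shuffled = classification_set[:]
random.shuffle(shuffled)
cut = int(0.7 * len(shuffled))
learning, test = shuffled[:cut], shuffled[cut:]
```

Note that a patient censored after 60 months still has a known 5-year status (LOW RISK), so only early-censored patients need to be discarded.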

  8. Task #1: Data Integration [4/4] Each of the three data configurations (patient data, expression data, patient + expression data) is split into a learning set and a test set: – Task #2 (Classification): 96 learning / 40 test cases – Task #3 (Regression): 148 learning / 63 test cases; regression model: CART


  10. Methods – Overview • Methods used to address Classification-Task (1) k -nearest neighbour ( k -NN) (2) Decision Tree C5.0 (3) Boosted Decision Trees (4) Support Vector Machines (SVMs) (5) Artificial Neural Networks (Multilayer Perceptrons, MLPs) (6) Probabilistic Neural Networks (PNNs) • Methods used to address Regression-Task (1) Classification and Regression Tree (CART)

  11. Methods – Comparison of Principles • Consider the following 2-class problem [scatter plot of the two spiral-shaped classes defined on slide 26]

  12. Methods – Decision Tree • Recursively split the data set into decision regions and generate a rule set • Classify the test case using the rule set [Tree diagram: the root node splits on y ≤ / > split #1, and each branch is split again down to class leaves] • Boosted decision trees: aggregate decision trees into a committee by weighted voting and resampling of the data set.
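A minimal sketch of the recursive-partitioning idea on this slide (toy one-feature data, matching the split-on-y diagram; not the C5.0 implementation used in the study):

```python
# Recursively split a 1-D data set of (y, class) cases into decision
# regions, then classify by walking the resulting rule tree.
def best_split(cases):
    """Pick the threshold on y that best separates the two classes."""
    best = None
    for t in sorted({y for y, _ in cases}):
        left = [c for y, c in cases if y <= t]
        right = [c for y, c in cases if y > t]
        if not left or not right:
            continue
        # misclassifications if each side predicts its majority class
        err = (min(left.count(0), left.count(1)) +
               min(right.count(0), right.count(1)))
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

def build_rules(cases, depth=2):
    classes = {c for _, c in cases}
    if len(classes) == 1 or depth == 0:
        labels = [c for _, c in cases]
        return max(classes, key=labels.count)   # leaf: majority class
    t = best_split(cases)
    return (t,
            build_rules([(y, c) for y, c in cases if y <= t], depth - 1),
            build_rules([(y, c) for y, c in cases if y > t], depth - 1))

def classify(tree, y):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if y <= t else right
    return tree

data = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
tree = build_rules(data)
```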

  13. Methods – Support Vector Machine • Find the optimal separating hyperplane by maximizing the margin between the 2 classes • Classify the test case using the hyperplane [Plot: two classes separated by the maximum-margin hyperplane]
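For 1-D separable toy data, the maximal-margin rule on this slide reduces to placing the boundary midway between the closest opposite-class points, the support vectors (a minimal sketch, not the study's SVM):

```python
# Toy 1-D separable problem: the maximal-margin "hyperplane" is the
# midpoint between the nearest opposite-class points.
neg = [0.5, 1.0, 1.5]               # class -1
pos = [4.0, 4.5, 6.0]               # class +1

sv_neg, sv_pos = max(neg), min(pos)  # the support vectors
boundary = (sv_neg + sv_pos) / 2     # maximizes the margin
margin = (sv_pos - sv_neg) / 2       # half-width of the separating band

def classify(x):
    return +1 if x > boundary else -1
```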

  14. Methods – Strengths and Weaknesses • Most (if not all) models ultimately rely on a definition of distance between objects.* This definition is not trivial in high-dimensional space. • Distance metric as a tuning parameter? → fractional distance metrics [Aggarwal et al., ICDT, 2001] *Lee Y., Lee C.K.: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19(9), pp. 1132–1139, 2003.
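The tunable-metric idea can be illustrated with a generic Minkowski distance where the exponent p becomes the tuning parameter; for p < 1 this gives the fractional metrics of Aggarwal et al. (illustrative sketch):

```python
import random

# Minkowski distance with a tunable exponent p; p < 1 gives the
# "fractional" metrics of Aggarwal et al. (ICDT 2001), which preserve
# relative contrasts between near and far points better than the
# Euclidean distance (p = 2) in high-dimensional spaces.
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Two random points in a 1,000-dimensional unit cube.
random.seed(1)
dim = 1000
a = [random.random() for _ in range(dim)]
b = [random.random() for _ in range(dim)]

d_euclid = minkowski(a, b, 2.0)   # ordinary Euclidean distance
d_frac = minkowski(a, b, 0.5)     # fractional metric, p = 0.5
```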

  15. Results of Task #2: Classification


  17. Methods – Classification and Regression Tree • Algorithm is similar to the decision tree C5.0 • Heuristic is based on recursive partitioning of data set • Differences:
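One regression-specific difference can be sketched concretely: for a continuous target, CART chooses splits that minimize the summed squared error around each side's mean, rather than a class-purity criterion (toy 1-D sketch, not the actual CART implementation):

```python
# CART-style regression split: pick the threshold minimizing the summed
# squared error (SSE) around each side's mean.
def sse(ys):
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_regression_split(xs, ys):
    pairs = sorted(zip(xs, ys))
    best_t, best_err = None, float("inf")
    for i in range(1, len(pairs)):
        t = pairs[i - 1][0]
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        err = sse(left) + sse(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Toy target values that jump sharply after x = 3: the best split
# should separate the two regimes.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]
```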

  18. Results of Task #3: Regression [1/3] • Evaluation criteria: – How many death events are correctly identified as death events, and how many are not? (→ accuracy) – For the correctly identified death events, what is the deviance of the residuals between the real and the predicted survival time?
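The two criteria might be computed as follows (toy numbers; the pairing of actual and predicted outcomes is an illustrative assumption, not the study's data):

```python
# Each entry pairs an event type with a survival time in months.
actual = [("death", 24), ("death", 60), ("censored", 40), ("death", 12)]
predicted = [("death", 30), ("censored", 0), ("death", 35), ("death", 10)]

# Criterion 1: fraction of true death events flagged as deaths.
deaths = [i for i, (ev, _) in enumerate(actual) if ev == "death"]
hits = [i for i in deaths if predicted[i][0] == "death"]
accuracy = len(hits) / len(deaths)

# Criterion 2: residuals (real - predicted survival time) for the
# correctly identified death events only.
residuals = [actual[i][1] - predicted[i][1] for i in hits]
```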

  19. Results of Task #3: Regression [2/3]

  20. Results of Task #3: Regression [3/3] [results chart; the value 10.9 is the only legible figure]


  22. Task #4: Biological Interpretation [1/2] • How to interpret the results? → Using literature, OMIM, PubMed, … • # of features relevant for the classification task: 8, e.g. ZNF174 (zinc finger protein) • Proteins of this family probably have an impact on repression of growth factor gene expression [OMIM, 603900] • Example: the Wilms tumour suppressor WT1 encodes a zinc finger protein that downregulates the expression of various growth factor genes [OMIM, 603900] • Decision tree: overexpression of ZNF174 is associated with LOW RISK, underexpression with HIGH RISK • ZNF174: important marker in Burkitt's lymphoma cells [Li et al., PNAS, May 2003]

  23. Task #4: Biological Interpretation [2/2] • # of features relevant for the regression task: 5, e.g. NifU • Its function is not yet fully understood • It is likely to be involved in the mobilization of iron and sulfur for nitrogenase-specific iron-sulfur cluster formation • Important for breast cancer classification [Hedenfalk et al., N Engl J Med, Feb. 2001] • Decision tree: overexpression of NifU is associated with good clinical outcome for patients with early tumour stage.

  24. Conclusions • Integrating clinical and transcriptional data might improve survival outcome prediction; • "Best" model in this study: the decision tree, but… • There is no single method of choice; • No Free Lunch Theorem: "No classifier is inherently superior to any other. The type of the problem determines which classifier is most appropriate." • George Box: "Statisticians, like artists, have the bad habit of falling in love with their models."

  25. Acknowledgements • Brian Sturgeon • Ian Bradbury • C. Stephen Downes • Werner Dubitzky Supplementary information will be available at http://research.bioinformatics.ulster.ac.uk/~dberrar/camda03.html.

  26. Methods – Comparison of Principles • Consider the following 2-class problem of cases (x, y), with t ∈ {0.1, 0.2, …, 10}: • Class A: x = t cos(t), y = t sin(t) • Class B: x = t sin(t), y = t cos(t)
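Generating this toy problem is straightforward; the two classes are mirror-image spirals (a sketch using the parameterization on the slide):

```python
import math

# Class A: (x, y) = (t cos t, t sin t); Class B: (x, y) = (t sin t,
# t cos t), for t in {0.1, 0.2, ..., 10} -- two interleaved spirals.
ts = [round(0.1 * i, 1) for i in range(1, 101)]
class_a = [(t * math.cos(t), t * math.sin(t)) for t in ts]
class_b = [(t * math.sin(t), t * math.cos(t)) for t in ts]
```

Class B is class A with x and y swapped, i.e. its reflection across the line y = x.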

  27. Methods – k -Nearest Neighbour • Retrieve the nearest neighbours of the test case • Classify test case based on the class membership of the nearest neighbours

  28. Methods – k -Nearest Neighbour Learning: • For each case in the learning set, determine all neighbours and rank them with respect to similarity = 1 − distance • Determine the globally optimal number of nearest neighbours k_opt (e.g., in LOOCV) Test: • Use k_opt for classifying the test cases • Interpret normalized similarities as a measure of confidence. Example: suppose that k_opt = 3 and the nearest neighbours are:
  Case #   Similarity (= 1 − distance)   Normed similarity   Class
  27       0.0921                        0.35795             •
  29       0.0833                        0.32375             ○
  34       0.0819                        0.31831             •
  Confidence for class •: 0.35795 + 0.31831 = 0.67626; confidence for class ○: 0.32375.
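The confidence calculation on this slide can be reproduced directly (a sketch; the two class symbols are written here as "A" and "B"):

```python
# k-NN confidence as normalized similarity: similarity = 1 - distance,
# and each class's confidence is the sum of its neighbours' normalized
# similarities.
def knn_confidence(neighbours):
    """neighbours: list of (distance, class_label) for the k nearest."""
    sims = [(1.0 - d, c) for d, c in neighbours]
    total = sum(s for s, _ in sims)
    conf = {}
    for s, c in sims:
        conf[c] = conf.get(c, 0.0) + s / total
    return conf

# The slide's example: similarities 0.0921, 0.0833, 0.0819, where the
# first and third neighbours share one class.
conf = knn_confidence([(1 - 0.0921, "A"), (1 - 0.0833, "B"),
                       (1 - 0.0819, "A")])
```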

  29. Methods – Support Vector Machine • Goal: Finding the optimal decision boundary between the 2 classes • SVM heuristic for separable problems (non-overlapping classes): – Construct the optimal separating hyperplane by maximizing the margin
