

1. COMP 204: Intro to machine learning with scikit-learn (part two)
Mathieu Blanchette, based on material from Christopher J.F. Cameron and Carlos G. Oliver

2. Return to our prostate cancer prediction problem
Suppose you want to learn to predict whether a person has prostate cancer based on two easily measured variables obtained from a blood sample: Complete Blood Count (CBC) and Prostate-Specific Antigen (PSA). We have collected data from patients known to have or not have prostate cancer:

    CBC   PSA   Status
    142   67    Normal
    132   58    Normal
    178   69    Cancer
    188   46    Normal
    183   68    Cancer
    ...

Goal: train a classifier to predict the class of new patients from their CBC and PSA.

3. A perfect classifier (figure)

4. More realistic data
Here, it is impossible to cleanly separate positive and negative examples with a straight line. → We are bound to make classification errors.

5. True/false positives and negatives
True positive (TP): a positive example that is predicted to be positive
◮ A person who is predicted to have cancer and actually has cancer
False positive (FP): a negative example that is predicted to be positive
◮ A person who is predicted to have cancer but doesn't have cancer
True negative (TN): a negative example that is predicted to be negative
◮ A person who is predicted to not have cancer and actually doesn't have cancer
False negative (FN): a positive example that is predicted to be negative
◮ A person who is predicted to not have cancer but actually has cancer
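As a minimal illustration (not from the slides), the four counts can be tallied directly from paired lists of true and predicted labels; the data and the label encoding below (1 = cancer/positive, 0 = normal/negative) are assumptions made for the example.

    # Hypothetical example: count TP, FP, TN, FN from true vs predicted labels
    y_true = [1, 0, 1, 0, 1, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 1, 0]

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    print(tp, fp, tn, fn)  # 3 1 2 1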

6. More realistic data
Here: TP = 10, TN = 12, FP = 2, FN = 3.

7. Confusion matrices
Confusion matrix: a table describing the counts of TPs, FPs, TNs, and FNs.

                        Predicted positive   Predicted negative
    Actual positive     TP = 10              FN = 3
    Actual negative     FP = 2               TN = 12

In scikit-learn, we can get the confusion matrix for the SVC by:

    from sklearn import svm
    from sklearn.metrics import confusion_matrix

    clf = svm.SVC()
    clf.fit(X_train, y_train)

    preds = clf.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

8. True/false positive rates
Sensitivity: proportion of positive examples that are predicted to be positive
◮ Fraction of cancer patients who are predicted to have cancer
    Sensitivity = TP / (TP + FN) = 10 / (10 + 3) = 77%
Specificity: proportion of negative examples that are predicted to be negative
◮ Fraction of healthy patients who are predicted to be healthy
    Specificity = TN / (FP + TN) = 12 / (2 + 12) = 86%
False-positive rate (FPR): proportion of negative examples that are predicted to be positive
◮ Fraction of healthy patients who are predicted to have cancer
    FPR = FP / (FP + TN) = 1 − specificity = 2 / (2 + 12) = 14%
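As a short sketch (not from the slides), these rates can be computed directly from the four counts unpacked from the confusion matrix on the previous slide; the helper name rates is hypothetical.

    # Hypothetical helper: compute sensitivity, specificity and FPR
    # from the counts returned by confusion_matrix(...).ravel()
    def rates(tn, fp, fn, tp):
        sensitivity = tp / (tp + fn)   # true-positive rate
        specificity = tn / (tn + fp)   # true-negative rate
        fpr = fp / (fp + tn)           # equals 1 - specificity
        return sensitivity, specificity, fpr

    print(rates(tn=12, fp=2, fn=3, tp=10))  # about (0.77, 0.86, 0.14)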

9. Accuracy on training vs testing sets
To get an unbiased estimate of the accuracy of a predictor, we need to evaluate it on our test data (not used for training).

                        Predicted positive   Predicted negative
    Actual positive     TP = 9               FN = 4
    Actual negative     FP = 3               TN = 15

    Sens = TP / (TP + FN) = 9 / (9 + 4) = 69%,  FPR = FP / (FP + TN) = 3 / (3 + 15) = 17%
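The slides do not show how X_train, X_test, y_train, y_test were produced. One common way, sketched here as an assumption, is scikit-learn's train_test_split; X and y stand for the full feature matrix and label vector.

    # Minimal sketch (assumed, not shown in the slides): hold out part of the data
    # so accuracy is measured on examples that were not used for training
    from sklearn import model_selection, svm
    from sklearn.metrics import confusion_matrix

    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = svm.SVC()
    clf.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()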

10. Decision tree
Linear classifiers are limited in how well they can match the training data. Another type of classifier is called a decision tree.
http://scikit-learn.org/stable/modules/tree.html
(Figure: an example decision tree for prostate cancer risk. Internal nodes test family history, European ancestry, AR_GCC repeat copy number, and CYP3A4 haplotype; leaves assign low, medium, or high risk.)

11. Decision tree in Python
Note: requires installing graphviz by running "pip install graphviz".

    import graphviz
    from sklearn.metrics import confusion_matrix
    from sklearn import model_selection, tree

    depth = 3
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    p_train = clf.predict(X_train)
    p_test = clf.predict(X_test)

    # plot tree
    dot_data = tree.export_graphviz(clf, out_file=None)
    graph = graphviz.Source(dot_data)
    graph.render("prostate_tree_depth" + str(depth))

    # calculate training and testing error
    tn, fp, fn, tp = confusion_matrix(y_train, p_train).ravel()
    print("Training data:", tn, fp, fn, tp)
    tn, fp, fn, tp = confusion_matrix(y_test, p_test).ravel()
    print("Test data:", tn, fp, fn, tp)
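Usage note: graph.render(...) writes the tree drawing to a file (a PDF by default with the graphviz package) whose name includes the chosen depth, which makes it easy to keep and compare the pictures produced for different depths.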

12. Decision tree
Sens = TP / (TP + FN) = 12 / (12 + 1) = 92%,  FPR = FP / (FP + TN) = 0 / (0 + 17) = 0%
Great accuracy on the training set!

13. Decision tree
Sens = TP / (TP + FN) = 9 / (9 + 8) = 53%,  FPR = FP / (FP + TN) = 1 / (1 + 11) = 8%
Not so good on the test set...

14. A harder example (figure)

15. Decision tree (max depth = 3)
(Figure: the fitted tree exported with graphviz. The root splits on X[1] <= 103.074; deeper nodes split on X[1] <= 72.255, X[0] <= 154.321, and X[0] <= 70.221, each annotated with its gini impurity, sample count, and class counts.)
sens(train) = TP / (TP + FN) = 41 / (41 + 6) = 87%,  FPR(train) = FP / (FP + TN) = 9 / (9 + 39) = 19%
sens(test) = TP / (TP + FN) = 36 / (36 + 7) = 84%,  FPR(test) = FP / (FP + TN) = 8 / (8 + 44) = 15%

16. Deeper trees (max depth = 4)
(Figure: the depth-4 tree exported with graphviz; it extends the depth-3 tree with additional splits such as X[0] <= 52.888, X[1] <= 63.281, and X[0] <= 97.128, again annotated with gini impurity, sample counts, and class counts.)
sens(train) = TP / (TP + FN) = 45 / (45 + 2) = 96%,  FPR(train) = FP / (FP + TN) = 1 / (1 + 47) = 2%
sens(test) = TP / (TP + FN) = 37 / (37 + 6) = 86%,  FPR(test) = FP / (FP + TN) = 11 / (11 + 41) = 21%
Accuracy on the training data is much higher than on the testing data: overfitting! We've gone too far!
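A short sketch (not in the slides) of one way to spot this kind of overfitting: refit the tree at several depths and compare training vs test performance; the variable names follow the earlier code, and the depth range is arbitrary.

    # Sketch (assumed workflow): compare training vs test accuracy across depths
    # to see where the gap between them starts to grow (overfitting)
    from sklearn import tree

    for depth in range(1, 8):
        clf = tree.DecisionTreeClassifier(max_depth=depth)
        clf.fit(X_train, y_train)
        train_acc = clf.score(X_train, y_train)  # mean accuracy on training data
        test_acc = clf.score(X_test, y_test)     # mean accuracy on held-out data
        print(depth, round(train_acc, 2), round(test_acc, 2))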

17. ML - closing comments
Very powerful algorithms exist and are available in scikit-learn:
◮ Decision trees and decision forests
◮ Support vector machines
◮ Neural networks
◮ etc.
These algorithms can be used for classification/regression based on all kinds of data:
◮ Arrays of numerical values
◮ Images, video, sound
◮ Text
◮ etc.
Applications in life sciences:
◮ Medical diagnostics
◮ Interpretation of genetic data
◮ Drug design, optimization of medical devices
◮ Modeling of ecosystems
◮ etc.
Experiment with different approaches/problems!
