COMP 364: Computer Tools for Life Sciences Intro to machine learning with scikit-learn (part three) Christopher J.F. Cameron and Carlos G. Oliver 1 / 23
Key course information
TA office hours
◮ For this week only:
◮ Pouriya - Friday 1:30-3:00 pm TR 3104
◮ if room is locked, check TR 3090
Course evaluations
◮ available now at the following link:
◮ https://horizon.mcgill.ca/pban1/twbkwbis.P_WWWLogin?ret_code=f 2 / 23
Recap - Titanic survival problem In the last COMP 364 lecture ◮ implemented a support vector machine classifier (SVC) ◮ created a learned SVC model from training data ◮ calculated train and test mean squared error (MSE) ◮ MSE isn’t typically used for classification Let’s implement a better accuracy metric for our classifier ◮ receiver operating characteristic (ROC) 3 / 23
ROC curves
A plot that represents the predictive capability of a classifier
◮ across various discrimination thresholds
ROC curves are created by plotting the true positive rate (TPR) against the false positive rate (FPR)
◮ wait....what are those?
◮ FPR is a proportion based on false positives (FP)
◮ stop...what is an FP?
Let's start with an example ROC plot 4 / 23
Example ROC plot
A, B, C, and C′
◮ are different methods
◮ i.e., different ML models
Top left of the plot
◮ perfect classification
◮ no incorrect predictions
Dashed red line
◮ result of random chance
◮ e.g., flipping a coin 5 / 23
True/false positives and negatives True positive (TP) Positive example is predicted to be positive ◮ surviving Titanic passenger predicted to survive False positive (FP) Negative example is predicted to be positive ◮ dead Titanic passenger predicted to survive True negative (TN) Negative example is predicted to be negative False negative (FN) Positive example is predicted to be negative 6 / 23
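The four definitions above can be tallied directly from paired true/predicted label lists. A minimal pure-Python sketch (the toy labels below are made up for illustration, with 1 = survived):

```python
# Tally TP, FP, TN, FN from true and predicted labels (1 = survived).
# These label lists are toy values for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
print(tp, fp, tn, fn)  # prints: 2 1 2 1
```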
Confusion matrices
A table describing the counts of TPs, FPs, TNs, and FNs
In scikit-learn, we can get the confusion matrix for the SVC by:

from sklearn import svm
from sklearn.metrics import confusion_matrix

clf = svm.SVC()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(tn, fp, fn, tp)
# prints: 86 33 41 49

These counts are for the current threshold used by the SVC 7 / 23
TPR vs FPR
True positive rate (TPR)
The proportion of positive examples correctly classified as positive
◮ i.e., surviving passengers predicted to survive
TPR = TP / (TP + FN) = 49 / (49 + 41) ≈ 0.54
False positive rate (FPR)
The proportion of negative examples incorrectly classified as positive
◮ i.e., dead passengers predicted to survive
FPR = FP / (FP + TN) = 33 / (33 + 86) ≈ 0.28 8 / 23
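The two rates follow directly from the confusion-matrix counts on the previous slide; a quick check in plain Python:

```python
# TPR and FPR computed from the SVC's confusion-matrix counts
# (tp, fn, fp, tn taken from the confusion matrix shown earlier).
tp, fn, fp, tn = 49, 41, 33, 86

tpr = tp / (tp + fn)  # proportion of positives correctly recovered
fpr = fp / (fp + tn)  # proportion of negatives wrongly flagged
print(round(tpr, 2), round(fpr, 2))  # prints: 0.54 0.28
```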
Creating a ROC curve
To create a ROC curve we need to:
1. extract the score assigned by the SVC for each test example
2. calculate TPRs and FPRs at various thresholds
3. calculate the area under the curve (AUC) for the ROC
◮ the greater the area, the better the classifier
4. plot TPR vs. FPR 9 / 23
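Step 2 can be sketched by hand before reaching for scikit-learn: sweep a threshold over the classifier's scores and recompute TPR/FPR at each one. The scores and labels below are toy values for illustration:

```python
# Manual ROC sketch: sweep thresholds over classifier scores and
# record an (FPR, TPR) point at each one. Toy scores/labels only.
scores = [-1.1, -0.4, -0.1, 0.2, 0.6, 1.3]
labels = [0, 0, 1, 0, 1, 1]

points = []
for thr in sorted(scores, reverse=True):
    preds = [1 if s >= thr else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    points.append((fp / labels.count(0), tp / labels.count(1)))
print(points)  # starts near (0, 1/3), ends at (1.0, 1.0)
```

Lowering the threshold only ever adds positive predictions, which is why the curve moves monotonically from the bottom-left to the top-right of the plot.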
Extracting SVC scores
To extract the score assigned by the SVC for each test example:

from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
scores = clf.decision_function(X_test)
print(scores[:5])
# prints: [-0.26781241  0.1145858  -0.40117029
#          0.35895218 -1.07689094]
preds = clf.predict(X_test)
print(preds[:5])
# prints: [0 1 0 1 0]

What threshold is being used to convert scores to labels? 10 / 23
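The answer is 0: for a binary SVC, decision_function returns signed distances from the separating hyperplane, and predict() maps positive scores to the positive class. A sketch of that mapping applied to the scores printed above:

```python
# SVC's predict() thresholds decision_function scores at 0:
# score > 0 -> positive class (1), otherwise negative class (0).
scores = [-0.26781241, 0.1145858, -0.40117029, 0.35895218, -1.07689094]
preds = [1 if s > 0 else 0 for s in scores]
print(preds)  # prints: [0, 1, 0, 1, 0]
```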
Calculating FPR, TPR, and AUC
To calculate FPRs, TPRs, and AUC for the SVC's scores:

from sklearn.metrics import roc_curve, auc

scores = clf.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
print(sorted(thresholds))
# prints:
# [-1.1008432176342331, -1.0168691423153751,
#  -1.0002881313288357, -0.98665888089289866,
#  ... 0.032799334871084884,
#  0.16940940621752093, 0.32341186208816985,
#  0.61088122361422137, 2.2800666503978029] 11 / 23
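Under the hood, auc() is just the trapezoidal rule applied to the (FPR, TPR) points. A minimal pure-Python sketch on made-up points:

```python
# Trapezoidal-rule AUC over (FPR, TPR) points, as sklearn's auc()
# computes it. The points below are made up for illustration.
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 0.75, 1.0]

area = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
           for i in range(len(fpr) - 1))
print(area)  # prints: 0.75
```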
Plotting ROC curves

import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, "b-", lw=2,
         label="ROC curve (area = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], "k--", lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic for SVC")
plt.legend(loc="lower right")
plt.savefig("./roc_curve.png")
plt.close() 12 / 23
[Figure: ROC curve for the SVC, produced by the plotting code on the previous slide] 13 / 23
Can we improve upon our predictor?
Let's try applying a different ML algorithm to the data
◮ perhaps a decision tree?
◮ http://scikit-learn.org/stable/modules/tree.html
◮ remember to choose the classifier

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
# similar to .decision_function()
dt_scores = clf.predict_proba(X_test)[:, 1]
 14 / 23
Decision trees only provide class labels 15 / 23
Decision trees (DT)
Okay, why did we use a DT if it isn't useful for plotting a ROC curve?
◮ DTs do not transform the model input as heavily
◮ input examples to the SVC are transformed
◮ using a radial basis function (RBF) kernel
◮ which prevents interpretation of feature importance
In addition to making accurate predictions
◮ we would like to know which features contribute most to a model's predictions
◮ i.e., feature importance for the ML model 16 / 23
Extracting feature importance
To implement the DT in scikit-learn
◮ we are working with a classifier object
◮ which has particular attributes:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.feature_importances_)
# prints:
# [ 0.10237889  0.30678435  0.25202615
#   0.04502743  0.00862753  0.25493346
#   0.0302222 ]
 17 / 23
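The importances line up positionally with the training columns, so pairing them with feature names makes the printout readable. A sketch using the values printed above; the feature names here are hypothetical Titanic-style column names, not the actual columns from the lecture's dataset:

```python
# Pair feature names with the fitted tree's importances and rank them.
# Names are hypothetical; the values are the importances shown above.
names = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
importances = [0.10237889, 0.30678435, 0.25202615,
               0.04502743, 0.00862753, 0.25493346, 0.0302222]

ranked = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
for name, imp in ranked:
    print("%-10s %.3f" % (name, imp))
```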
[Figure: plot of the decision tree's feature importances] 18 / 23
ML - closing comments
Now that we know the important features for the DT
◮ we could try improving our predictions by:
◮ filtering out low-weighted features
◮ trying a different transformation (kernel) function in the SVC
◮ manipulating optional arguments of the DT and SVC
◮ choosing other ML algorithms to apply
◮ building our own ML algorithm?
◮ transforming input and output values
◮ and so on
But... we need to move on to our next topic in COMP 364
◮ digital image analysis/processing 19 / 23
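The first suggestion, filtering out low-weighted features, amounts to masking columns of the training matrix by importance. A minimal numpy sketch; the cutoff and the random stand-in matrix are illustrative assumptions, not values from the lecture:

```python
import numpy as np

# Keep only columns whose importance exceeds a cutoff, then retrain
# on the reduced matrix. Cutoff and data are illustrative only.
importances = np.array([0.102, 0.307, 0.252, 0.045, 0.009, 0.255, 0.030])
keep = importances > 0.05           # boolean mask of informative columns

X_train = np.random.rand(10, 7)     # stand-in for the real training matrix
X_train_reduced = X_train[:, keep]  # drop the low-importance columns
print(X_train_reduced.shape)        # prints: (10, 4)
```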
What is digital image analysis? Digital image analysis (DIA) : the extraction of useful information from images DIA is considered to consist of the following: ◮ pre-processing ◮ image enhancement ◮ classification (hmmm...sounds familiar, eh?) ◮ unsupervised ◮ supervised ◮ object-based ◮ change detection ◮ data merging 20 / 23
Think objects, not pixels Object-based Image Analysis (OBIA) involves: ◮ segmentation of images into objects 21 / 23
OBIA Then classifying objects resulting from segmentation ◮ to identify components of the image ◮ e.g., roads, grass, trees, etc. 22 / 23
Next time in COMP 364
Exploring the scikit-image module
scikit-image API
http://scikit-image.org/docs/dev/api/api.html
scikit-image tutorials
http://scikit-image.org/docs/dev/user_guide/tutorials.html 23 / 23