COMP 364: Computer Tools for Life Sciences Intro to machine learning with scikit-learn (part three) Christopher J.F. Cameron and Carlos G. Oliver 1 / 23
Key course information
TA office hours
◮ For this week only:
◮ Pouriya - Friday 1:30-3:00 pm TR 3104
◮ if room is locked, check TR 3090
Course evaluations
◮ available now at the following link:
◮ https://horizon.mcgill.ca/pban1/twbkwbis.P_WWWLogin?ret_code=f 2 / 23
Recap - Titanic survival problem In the last COMP 364 lecture ◮ implemented a support vector machine classifier (SVC) ◮ created a learned SVC model from training data ◮ calculated train and test mean squared error (MSE) ◮ MSE isn’t typically used for classification Let’s implement a better accuracy metric for our classifier ◮ receiver operating characteristic (ROC) 3 / 23
ROC curves
A plot that represents the predictive capability of a classifier
◮ across various discrimination thresholds
ROC curves are created by plotting the true positive rate (TPR) against the false positive rate (FPR)
◮ wait....what are those?
◮ FPR is a proportion based on false positives (FP)
◮ stop...what is an FP?
Let's start with an example ROC plot 4 / 23
Example ROC plot
A, B, C, and C′
◮ are different methods
◮ i.e., different ML models
Top left of the plot
◮ perfect classification
◮ no incorrect predictions
Dashed red line
◮ result of random chance
◮ e.g., flipping a coin 5 / 23
True/false positives and negatives True positive (TP) Positive example is predicted to be positive ◮ surviving Titanic passenger predicted to survive False positive (FP) Negative example is predicted to be positive ◮ dead Titanic passenger predicted to survive True negative (TN) Negative example is predicted to be negative False negative (FN) Positive example is predicted to be negative 6 / 23
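The four definitions above can be tallied directly from paired true/predicted label lists. A minimal pure-Python sketch (the toy labels below are made up for illustration, with 1 = survived):

```python
# Tally TP, FP, TN, FN from true and predicted labels (1 = survived).
# These label lists are toy values for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
print(tp, fp, tn, fn)  # prints: 2 1 2 1
```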
Confusion matrices
A table describing the counts of TPs, FPs, TNs, and FNs
In scikit-learn, we can get the confusion matrix for the SVC by:

from sklearn import svm
from sklearn.metrics import confusion_matrix

clf = svm.SVC()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(tn, fp, fn, tp)
# prints: 86 33 41 49

These counts are for the current threshold used by the SVC 7 / 23
TPR vs FPR
True positive rate (TPR)
The proportion of positive examples correctly classified as positive
◮ i.e., surviving passengers predicted to survive
TPR = TP / (TP + FN) = 49 / (49 + 41) ≈ 0.54
False positive rate (FPR)
The proportion of negative examples incorrectly classified as positive
◮ i.e., dead passengers predicted to survive
FPR = FP / (FP + TN) = 33 / (33 + 86) ≈ 0.28 8 / 23
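The two rates follow directly from the confusion-matrix counts on the previous slide; a quick check in plain Python:

```python
# TPR and FPR computed from the SVC's confusion-matrix counts
# (tp, fn, fp, tn taken from the confusion matrix shown earlier).
tp, fn, fp, tn = 49, 41, 33, 86

tpr = tp / (tp + fn)  # proportion of positives correctly recovered
fpr = fp / (fp + tn)  # proportion of negatives wrongly flagged
print(round(tpr, 2), round(fpr, 2))  # prints: 0.54 0.28
```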
Creating a ROC curve
To create a ROC curve we need to:
1. extract the score assigned by the SVC for each test example
2. calculate TPRs and FPRs at various thresholds
3. calculate the area under the curve (AUC) for the ROC
◮ the greater the area, the better the classifier
4. plot TPR vs. FPR 9 / 23
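Step 2 can be sketched by hand before reaching for scikit-learn: sweep a threshold over the classifier's scores and recompute TPR/FPR at each one. The scores and labels below are toy values for illustration:

```python
# Manual ROC sketch: sweep thresholds over classifier scores and
# record an (FPR, TPR) point at each one. Toy scores/labels only.
scores = [-1.1, -0.4, -0.1, 0.2, 0.6, 1.3]
labels = [0, 0, 1, 0, 1, 1]

points = []
for thr in sorted(scores, reverse=True):
    preds = [1 if s >= thr else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    points.append((fp / labels.count(0), tp / labels.count(1)))
print(points)  # starts near (0, 1/3), ends at (1.0, 1.0)
```

Lowering the threshold only ever adds positive predictions, which is why the curve moves monotonically from the bottom-left to the top-right of the plot.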
Extracting SVC scores
To extract the score assigned by the SVC for each test example:

from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
scores = clf.decision_function(X_test)
print(scores[:5])
# prints: [-0.26781241  0.1145858  -0.40117029
#          0.35895218 -1.07689094]
preds = clf.predict(X_test)
print(preds[:5])
# prints: [0 1 0 1 0]

What threshold is being used to convert scores to labels? 10 / 23
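The answer is 0: for a binary SVC, decision_function returns signed distances from the separating hyperplane, and predict() maps positive scores to the positive class. A sketch of that mapping applied to the scores printed above:

```python
# SVC's predict() thresholds decision_function scores at 0:
# score > 0 -> positive class (1), otherwise negative class (0).
scores = [-0.26781241, 0.1145858, -0.40117029, 0.35895218, -1.07689094]
preds = [1 if s > 0 else 0 for s in scores]
print(preds)  # prints: [0, 1, 0, 1, 0]
```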
Calculating FPR, TPR, and AUC
To calculate FPRs, TPRs, and AUC for the SVC's scores:

from sklearn.metrics import roc_curve, auc

scores = clf.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
print(sorted(thresholds))
# prints:
# [-1.1008432176342331, -1.0168691423153751,
#  -1.0002881313288357, -0.98665888089289866,
#  ... 0.032799334871084884,
#  0.16940940621752093, 0.32341186208816985,
#  0.61088122361422137, 2.2800666503978029] 11 / 23
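Under the hood, auc() is just the trapezoidal rule applied to the (FPR, TPR) points. A minimal pure-Python sketch on made-up points:

```python
# Trapezoidal-rule AUC over (FPR, TPR) points, as sklearn's auc()
# computes it. The points below are made up for illustration.
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 0.75, 1.0]

area = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
           for i in range(len(fpr) - 1))
print(area)  # prints: 0.75
```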
Plotting ROC curves

import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, "b-", lw=2,
         label="ROC curve (area = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], "k--", lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic for SVC")
plt.legend(loc="lower right")
plt.savefig("./roc_curve.png")
plt.close() 12 / 23
[Figure: ROC curve for the SVC, produced by the plotting code on the previous slide] 13 / 23
Can we improve upon our predictor?
Let's try applying a different ML algorithm to the data
◮ perhaps a decision tree?
◮ http://scikit-learn.org/stable/modules/tree.html
◮ remember to choose the classifier

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
# similar to .decision_function()
dt_scores = clf.predict_proba(X_test)[:, 1]
 14 / 23
Decision trees only provide class labels 15 / 23
Decision trees (DT)
Okay, why did we use a DT if it isn't useful for plotting a ROC curve?
◮ DTs do not transform the model input as heavily
◮ input examples to the SVC are transformed
◮ using a radial basis function (RBF) kernel
◮ which prevents interpretation of feature importance
In addition to making accurate predictions
◮ we would like to know which features contribute most to a model's predictions
◮ i.e., feature importance for the ML model 16 / 23
Extracting feature importance
To implement the DT in scikit-learn
◮ we are working with a classifier object
◮ which has particular attributes:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.feature_importances_)
# prints:
# [ 0.10237889  0.30678435  0.25202615
#   0.04502743  0.00862753  0.25493346
#   0.0302222 ]
 17 / 23
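The importances line up positionally with the training columns, so pairing them with feature names makes the printout readable. A sketch using the values printed above; the feature names here are hypothetical Titanic-style column names, not the actual columns from the lecture's dataset:

```python
# Pair feature names with the fitted tree's importances and rank them.
# Names are hypothetical; the values are the importances shown above.
names = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
importances = [0.10237889, 0.30678435, 0.25202615,
               0.04502743, 0.00862753, 0.25493346, 0.0302222]

ranked = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
for name, imp in ranked:
    print("%-10s %.3f" % (name, imp))
```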
[Figure: plot of the decision tree's feature importances] 18 / 23
ML - closing comments
Now that we know the important features for the DT
◮ we could try improving our predictions by:
◮ filtering out low-weighted features
◮ trying a different transformation (kernel) function in the SVC
◮ manipulating optional arguments of the DT and SVC
◮ choosing other ML algorithms to apply
◮ building our own ML algorithm?
◮ transforming input and output values
◮ and so on
But... we need to move on to our next topic in COMP 364
◮ digital image analysis/processing 19 / 23
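The first suggestion, filtering out low-weighted features, amounts to masking columns of the training matrix by importance. A minimal numpy sketch; the cutoff and the random stand-in matrix are illustrative assumptions, not values from the lecture:

```python
import numpy as np

# Keep only columns whose importance exceeds a cutoff, then retrain
# on the reduced matrix. Cutoff and data are illustrative only.
importances = np.array([0.102, 0.307, 0.252, 0.045, 0.009, 0.255, 0.030])
keep = importances > 0.05           # boolean mask of informative columns

X_train = np.random.rand(10, 7)     # stand-in for the real training matrix
X_train_reduced = X_train[:, keep]  # drop the low-importance columns
print(X_train_reduced.shape)        # prints: (10, 4)
```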
What is digital image analysis? Digital image analysis (DIA) : the extraction of useful information from images DIA is considered to consist of the following: ◮ pre-processing ◮ image enhancement ◮ classification (hmmm...sounds familiar, eh?) ◮ unsupervised ◮ supervised ◮ object-based ◮ change detection ◮ data merging 20 / 23
Think objects, not pixels Object-based Image Analysis (OBIA) involves: ◮ segmentation of images into objects 21 / 23
OBIA Then classifying objects resulting from segmentation ◮ to identify components of the image ◮ e.g., roads, grass, trees, etc. 22 / 23
Next time in COMP 364
Exploring the scikit-image module
scikit-image API
http://scikit-image.org/docs/dev/api/api.html
scikit-image tutorials
http://scikit-image.org/docs/dev/user_guide/tutorials.html 23 / 23