Evaluating Binary Classifiers (TPR, FPR)


  1. Tufts COMP 135: Introduction to Machine Learning (https://www.cs.tufts.edu/comp/135/2020f/). Evaluating Binary Classifiers: TPR, FPR. Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books).

  2. Today's objectives (day 08): Evaluating Binary Classifiers. 1) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, … 2) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve. 3) Evaluate probabilities / scores directly: cross entropy loss (aka log loss).

  3. What will we learn? [Diagram: the supervised learning pipeline. Training data of feature-label pairs {x_n, y_n}, n = 1…N, feeds a learning task; the learned model makes predictions, and evaluation compares predictions to labels with a performance measure. Unsupervised learning and reinforcement learning are shown for contrast.]

  4. Task: Binary Classification. y is a binary variable (red or blue). [Figure: scatter plot of examples in two feature dimensions x_1 and x_2, each point colored by its binary label; binary classification is a supervised learning task.]

  5. Example: Hotdog or Not. https://www.theverge.com/tldr/2017/5/14/15639784/hbo-silicon-valley-not-hotdog-app-download

  6. From Features to Predictions. Goal: predict the label (0 or 1) given features x. Input features: x_i = [x_{i,1}, x_{i,2}, …, x_{i,f}, …, x_{i,F}]. Score (a real number): s_i = h(x_i, θ). Applying a chosen threshold to the score gives the binary label y_i ∈ {0, 1}.

  7. From Features to Predictions via Probabilities. Goal: predict the label (0 or 1) given features x. Input features: x_i = [x_{i,1}, x_{i,2}, …, x_{i,f}, …, x_{i,F}]. Score (a real number): s_i = h(x_i, θ). Probability of the positive class (between 0.0 and 1.0): p_i = sigmoid(s_i), where sigmoid(z) = 1 / (1 + e^{-z}). Applying a chosen threshold between 0.0 and 1.0 to p_i gives the binary label y_i ∈ {0, 1}.
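
  A minimal sketch (not the course's reference code) of the pipeline on this slide: real-valued scores are mapped through the sigmoid to probabilities, then thresholded into binary labels. The scores and threshold below are made-up values for illustration.

      import numpy as np

      def sigmoid(z):
          # maps any real-valued score to a probability in (0.0, 1.0)
          return 1.0 / (1.0 + np.exp(-z))

      def predict_labels(scores, threshold=0.5):
          # scores: array of real numbers s_i = h(x_i, theta)
          probas = sigmoid(scores)                   # p_i = sigmoid(s_i)
          return (probas >= threshold).astype(int)   # y_i in {0, 1}

      scores = np.array([-2.0, 0.3, 1.5])
      print(predict_labels(scores))                  # [0 1 1] with threshold 0.5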

  8. Classifier: Evaluation Step. Goal: assess the quality of predictions. Many ways in practice: 1) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, … 2) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve. 3) Evaluate probabilities / scores directly: cross entropy loss (aka log loss), hinge loss, …

  9. Types of binary predictions: TN, true negative (predicted 0, actually 0); FP, false positive (predicted 1, actually 0); FN, false negative (predicted 0, actually 1); TP, true positive (predicted 1, actually 1).

  10. Example: Which outcome is this? [Figure: one example's true label and predicted label.] TN: true negative. FP: false positive. FN: false negative. TP: true positive.

  11. Example: Which outcome is this? Answer: True Positive (TP).

  12. Example: Which outcome is this? [Figure: another example.]

  13. Example: Which outcome is this? Answer: True Negative (TN).

  14. Example: Which outcome is this? [Figure: another example.]

  15. Example: Which outcome is this? Answer: False Negative (FN).

  16. Example: Which outcome is this? [Figure: another example.]

  17. Example: Which outcome is this? Answer: False Positive (FP).

  18. Metric: Confusion Matrix. Counting mistakes in binary predictions: #TN, num. true negative; #FP, num. false positive; #FN, num. false negative; #TP, num. true positive. Arranged as a 2x2 table (rows = true label, columns = predicted label):

               predicted 0   predicted 1
      true 0       #TN           #FP
      true 1       #FN           #TP
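
  A quick sketch of getting the four counts with scikit-learn; y_true and y_pred here are small illustrative arrays, not data from the course.

      from sklearn.metrics import confusion_matrix

      y_true = [0, 0, 1, 1, 1, 0]
      y_pred = [0, 1, 1, 0, 1, 0]

      # for labels [0, 1], sklearn arranges the matrix as [[TN, FP], [FN, TP]]
      tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
      print(tn, fp, fn, tp)   # 2 1 1 2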

  19. Metric: Accuracy. accuracy = fraction of correct predictions = (TP + TN) / (TP + TN + FN + FP). Potential problem: suppose your dataset has 1 positive example and 99 negative examples. What is the accuracy of the classifier that always predicts "negative"?

  20. Metric: Accuracy. accuracy = fraction of correct predictions = (TP + TN) / (TP + TN + FN + FP). Potential problem: suppose your dataset has 1 positive example and 99 negative examples. What is the accuracy of the classifier that always predicts "negative"? 99%!
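
  A short sketch checking the 99% answer: the always-negative classifier on a dataset with 1 positive and 99 negatives.

      import numpy as np
      from sklearn.metrics import accuracy_score

      y_true = np.array([1] + [0] * 99)      # 1 positive example, 99 negatives
      y_pred = np.zeros(100, dtype=int)      # classifier that always predicts "negative"
      print(accuracy_score(y_true, y_pred))  # 0.99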

  21. Metrics for Binary Decisions. TPR = TP / (TP + FN), also called "sensitivity" or "recall". TNR = TN / (TN + FP), also called "specificity" (equal to 1 - FPR). PPV = TP / (TP + FP), also called "precision". NPV = TN / (TN + FN). In practice, you need to emphasize the metrics appropriate for your application.
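
  A sketch computing these per-threshold metrics from the four confusion-matrix counts; the counts passed in at the end are arbitrary illustrative numbers.

      def binary_metrics(tn, fp, fn, tp):
          # standard definitions of the per-threshold metrics
          return {
              "TPR (sensitivity, recall)": tp / (tp + fn),
              "TNR (specificity)":         tn / (tn + fp),
              "FPR (= 1 - TNR)":           fp / (fp + tn),
              "PPV (precision)":           tp / (tp + fp),
              "NPV":                       tn / (tn + fn),
          }

      print(binary_metrics(tn=90, fp=10, fn=5, tp=20))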

  22. Goal: Classifier to find relevant tweets to list on the Tufts website. If a tweet is in the top 10 by predicted probability, put it on the website; if not, discard that tweet. Which metric might be most important? Could we just use accuracy?

  23. Goal: Detector for cancer based on a medical image. If called positive, the patient gets further screening; if called negative, no further attention until 5+ years later. Which metric might be most important? Could we just use accuracy?

  24. Classifier: Evaluation Step. Goal: assess the quality of predictions. Many ways in practice: 1) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, … 2) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve. 3) Evaluate probabilities / scores directly: cross entropy loss (aka log loss), hinge loss, …

  25. ROC curve. [Figure: ROC plot with FPR (1 - TNR) on the x-axis and TPR on the y-axis; a perfect classifier sits at the top-left corner, random guessing lies along the diagonal, and one marked point shows a specific threshold.] Each point represents the TPR and FPR of one specific threshold. Connecting all points (all thresholds) produces the curve.
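
  A sketch of tracing out the curve with scikit-learn: each threshold on the predicted probabilities gives one (FPR, TPR) point. The labels and probabilities are toy values.

      from sklearn.metrics import roc_curve

      y_true  = [0, 0, 1, 1]
      y_proba = [0.1, 0.4, 0.35, 0.8]   # illustrative predicted probabilities

      fpr, tpr, thresholds = roc_curve(y_true, y_proba)
      for f, t, thr in zip(fpr, tpr, thresholds):
          print("threshold=%.2f  FPR=%.2f  TPR=%.2f" % (thr, f, t))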

  26. Area under the ROC curve (aka AUROC, AUC, or "C statistic"). The area varies from 0.0 to 1.0: 0.5 is a random guess, 1.0 is perfect. Graphical view: the area under the TPR-vs-FPR curve. Probabilistic interpretation: AUROC = Pr( ŷ(x_i) > ŷ(x_j) | y_i = 1, y_j = 0 ). For a random pair of examples, one positive and one negative, what is the probability that the classifier ranks the positive one higher?
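
  A sketch connecting the two views of AUROC on toy data: the area computed by scikit-learn, and the fraction of (positive, negative) pairs the classifier ranks correctly (ties ignored here for simplicity).

      import numpy as np
      from sklearn.metrics import roc_auc_score

      y_true  = np.array([0, 0, 1, 1])
      y_proba = np.array([0.1, 0.4, 0.35, 0.8])

      auroc = roc_auc_score(y_true, y_proba)

      # probability that a random positive outranks a random negative
      pos = y_proba[y_true == 1]
      neg = y_proba[y_true == 0]
      pairwise = np.mean([p > n for p in pos for n in neg])

      print(auroc, pairwise)   # both 0.75 on this toy data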

  27. Precision-Recall Curve. [Figure: plot with recall (aka TPR) on the x-axis and precision (aka PPV) on the y-axis.]
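
  A sketch of the same idea for the precision-recall curve with scikit-learn, on the same toy labels and probabilities as above.

      from sklearn.metrics import precision_recall_curve

      y_true  = [0, 0, 1, 1]
      y_proba = [0.1, 0.4, 0.35, 0.8]

      precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
      print(precision)   # PPV at each threshold (plus the endpoint 1.0)
      print(recall)      # TPR (recall) at each threshold (plus the endpoint 0.0)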

  28. AUROC is not always the best choice. [Figure: two ROC curves, red and blue, plotted as TPR vs. FPR.] AUROC says red is better, but blue is much better for avoiding false alarms (it achieves higher TPR in the low-FPR region).

  29. Classifier: Evaluation Step. Goal: assess the quality of predictions. Many ways in practice: 1) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, … 2) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve. 3) Evaluate probabilities / scores directly: cross entropy loss (aka log loss). Not covered yet: hinge loss, many others.

  30. Measuring the quality of predicted probabilities. Use the log loss (aka "binary cross entropy"): from sklearn.metrics import log_loss. log_loss(y, p̂) = -y log(p̂) - (1 - y) log(1 - p̂). Advantages: smooth; easy to take derivatives!
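
  A sketch writing the slide's formula out by hand next to sklearn's log_loss on a few toy predictions (natural log in both cases).

      import numpy as np
      from sklearn.metrics import log_loss

      y_true  = [1, 0, 1]
      y_proba = [0.9, 0.2, 0.6]

      # average of  -y*log(p) - (1-y)*log(1-p)  over the examples
      manual = np.mean([-y * np.log(p) - (1 - y) * np.log(1 - p)
                        for y, p in zip(y_true, y_proba)])

      print(manual, log_loss(y_true, y_proba))   # both ~0.28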

  31. Why minimize log loss? The upper-bound justification. Log loss (if implemented in the correct base) is a smooth upper bound on the error rate. Why smooth matters: easy to do gradient descent. Why the upper bound matters: achieving a log loss of 0.1 (averaged over the dataset) guarantees that the error rate is no worse than 0.1 (10%).

  32. Log loss upper bounds the 0-1 error. error(y, ŷ) = 1 if y ≠ ŷ, 0 if y = ŷ. log_loss(y, p̂) = -y log(p̂) - (1 - y) log(1 - p̂). [Plot assumes: true label is 1, threshold is 0.5, log base 2.]
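
  A sketch verifying the bound under the plot's assumptions (true label 1, threshold 0.5, log base 2): for every predicted probability, the base-2 log loss is at least the 0-1 error.

      import numpy as np

      p_hat = np.linspace(0.01, 0.99, 99)           # predicted probability of class 1
      log_loss_base2 = -np.log2(p_hat)              # log loss when the true label is 1
      zero_one_error = (p_hat < 0.5).astype(float)  # error is 1 when we predict 0

      assert np.all(log_loss_base2 >= zero_one_error)   # the bound holds everywhere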

  33. Why minimize log loss? An information-theory justification.
