Evaluating Binary Classifiers (TPR, FPR)


  1. Tufts COMP 135: Introduction to Machine Learning (https://www.cs.tufts.edu/comp/135/2020f/). Evaluating Binary Classifiers: TPR, FPR. Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books).

  2. Today's objectives (day 08): Evaluating Binary Classifiers. 1) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, … 2) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve. 3) Evaluate probabilities / scores directly: cross entropy loss (aka log loss).

  3. What will we learn? [Diagram: the supervised learning pipeline. Training data of feature-label pairs {x_n, y_n}, n = 1…N, feeds a learning task; the learned model makes predictions, and evaluation compares predictions to labels with a performance measure. Unsupervised learning and reinforcement learning are shown for contrast.]

  4. Task: Binary Classification. y is a binary variable (red or blue). [Figure: scatter plot of examples in two feature dimensions x_1 and x_2, each point colored by its binary label; binary classification is a supervised learning task.]

  5. Example: Hotdog or Not. https://www.theverge.com/tldr/2017/5/14/15639784/hbo-silicon-valley-not-hotdog-app-download

  6. From Features to Predictions. Goal: predict the label (0 or 1) given features x. Input features: x_i = [x_{i,1}, x_{i,2}, …, x_{i,f}, …, x_{i,F}]. Score (a real number): s_i = h(x_i, θ). Applying a chosen threshold to the score gives the binary label y_i ∈ {0, 1}.

  7. From Features to Predictions via Probabilities. Goal: predict the label (0 or 1) given features x. Input features: x_i = [x_{i,1}, x_{i,2}, …, x_{i,f}, …, x_{i,F}]. Score (a real number): s_i = h(x_i, θ). Probability of the positive class (between 0.0 and 1.0): p_i = sigmoid(s_i), where sigmoid(z) = 1 / (1 + e^{-z}). Applying a chosen threshold between 0.0 and 1.0 to p_i gives the binary label y_i ∈ {0, 1}.
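
  A minimal sketch (not the course's reference code) of the pipeline on this slide: real-valued scores are mapped through the sigmoid to probabilities, then thresholded into binary labels. The scores and threshold below are made-up values for illustration.

      import numpy as np

      def sigmoid(z):
          # maps any real-valued score to a probability in (0.0, 1.0)
          return 1.0 / (1.0 + np.exp(-z))

      def predict_labels(scores, threshold=0.5):
          # scores: array of real numbers s_i = h(x_i, theta)
          probas = sigmoid(scores)                   # p_i = sigmoid(s_i)
          return (probas >= threshold).astype(int)   # y_i in {0, 1}

      scores = np.array([-2.0, 0.3, 1.5])
      print(predict_labels(scores))                  # [0 1 1] with threshold 0.5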

  8. Classifier: Evaluation Step. Goal: assess the quality of predictions. Many ways in practice: 1) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, … 2) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve. 3) Evaluate probabilities / scores directly: cross entropy loss (aka log loss), hinge loss, …

  9. Types of binary predictions: TN, true negative (predicted 0, actually 0); FP, false positive (predicted 1, actually 0); FN, false negative (predicted 0, actually 1); TP, true positive (predicted 1, actually 1).

  10. Example: Which outcome is this? [Figure: one example's true label and predicted label.] TN: true negative. FP: false positive. FN: false negative. TP: true positive.

  11. Example: Which outcome is this? Answer: True Positive (TP).

  12. Example: Which outcome is this? [Figure: another example.]

  13. Example: Which outcome is this? Answer: True Negative (TN).

  14. Example: Which outcome is this? [Figure: another example.]

  15. Example: Which outcome is this? Answer: False Negative (FN).

  16. Example: Which outcome is this? [Figure: another example.]

  17. Example: Which outcome is this? Answer: False Positive (FP).

  18. Metric: Confusion Matrix. Counting mistakes in binary predictions: #TN, num. true negative; #FP, num. false positive; #FN, num. false negative; #TP, num. true positive. Arranged as a 2x2 table (rows = true label, columns = predicted label):

               predicted 0   predicted 1
      true 0       #TN           #FP
      true 1       #FN           #TP
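
  A quick sketch of getting the four counts with scikit-learn; y_true and y_pred here are small illustrative arrays, not data from the course.

      from sklearn.metrics import confusion_matrix

      y_true = [0, 0, 1, 1, 1, 0]
      y_pred = [0, 1, 1, 0, 1, 0]

      # for labels [0, 1], sklearn arranges the matrix as [[TN, FP], [FN, TP]]
      tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
      print(tn, fp, fn, tp)   # 2 1 1 2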

  19. Metric: Accuracy. accuracy = fraction of correct predictions = (TP + TN) / (TP + TN + FN + FP). Potential problem: suppose your dataset has 1 positive example and 99 negative examples. What is the accuracy of the classifier that always predicts "negative"?

  20. Metric: Accuracy. accuracy = fraction of correct predictions = (TP + TN) / (TP + TN + FN + FP). Potential problem: suppose your dataset has 1 positive example and 99 negative examples. What is the accuracy of the classifier that always predicts "negative"? 99%!
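
  A short sketch checking the 99% answer: the always-negative classifier on a dataset with 1 positive and 99 negatives.

      import numpy as np
      from sklearn.metrics import accuracy_score

      y_true = np.array([1] + [0] * 99)      # 1 positive example, 99 negatives
      y_pred = np.zeros(100, dtype=int)      # classifier that always predicts "negative"
      print(accuracy_score(y_true, y_pred))  # 0.99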

  21. Metrics for Binary Decisions. TPR = TP / (TP + FN), also called "sensitivity" or "recall". TNR = TN / (TN + FP), also called "specificity" (equal to 1 - FPR). PPV = TP / (TP + FP), also called "precision". NPV = TN / (TN + FN). In practice, you need to emphasize the metrics appropriate for your application.
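
  A sketch computing these per-threshold metrics from the four confusion-matrix counts; the counts passed in at the end are arbitrary illustrative numbers.

      def binary_metrics(tn, fp, fn, tp):
          # standard definitions of the per-threshold metrics
          return {
              "TPR (sensitivity, recall)": tp / (tp + fn),
              "TNR (specificity)":         tn / (tn + fp),
              "FPR (= 1 - TNR)":           fp / (fp + tn),
              "PPV (precision)":           tp / (tp + fp),
              "NPV":                       tn / (tn + fn),
          }

      print(binary_metrics(tn=90, fp=10, fn=5, tp=20))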

  22. Goal: Classifier to find relevant tweets to list on the Tufts website. If a tweet is in the top 10 by predicted probability, put it on the website; if not, discard that tweet. Which metric might be most important? Could we just use accuracy?

  23. Goal: Detector for cancer based on a medical image. If called positive, the patient gets further screening; if called negative, no further attention until 5+ years later. Which metric might be most important? Could we just use accuracy?

  24. Classifier: Evaluation Step. Goal: assess the quality of predictions. Many ways in practice: 1) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, … 2) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve. 3) Evaluate probabilities / scores directly: cross entropy loss (aka log loss), hinge loss, …

  25. ROC curve. [Figure: ROC plot with FPR (1 - TNR) on the x-axis and TPR on the y-axis; a perfect classifier sits at the top-left corner, random guessing lies along the diagonal, and one marked point shows a specific threshold.] Each point represents the TPR and FPR of one specific threshold. Connecting all points (all thresholds) produces the curve.
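
  A sketch of tracing out the curve with scikit-learn: each threshold on the predicted probabilities gives one (FPR, TPR) point. The labels and probabilities are toy values.

      from sklearn.metrics import roc_curve

      y_true  = [0, 0, 1, 1]
      y_proba = [0.1, 0.4, 0.35, 0.8]   # illustrative predicted probabilities

      fpr, tpr, thresholds = roc_curve(y_true, y_proba)
      for f, t, thr in zip(fpr, tpr, thresholds):
          print("threshold=%.2f  FPR=%.2f  TPR=%.2f" % (thr, f, t))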

  26. Area under the ROC curve (aka AUROC, AUC, or "C statistic"). The area varies from 0.0 to 1.0: 0.5 is a random guess, 1.0 is perfect. Graphical view: the area under the TPR-vs-FPR curve. Probabilistic interpretation: AUROC = Pr( ŷ(x_i) > ŷ(x_j) | y_i = 1, y_j = 0 ). For a random pair of examples, one positive and one negative, what is the probability that the classifier ranks the positive one higher?
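
  A sketch connecting the two views of AUROC on toy data: the area computed by scikit-learn, and the fraction of (positive, negative) pairs the classifier ranks correctly (ties ignored here for simplicity).

      import numpy as np
      from sklearn.metrics import roc_auc_score

      y_true  = np.array([0, 0, 1, 1])
      y_proba = np.array([0.1, 0.4, 0.35, 0.8])

      auroc = roc_auc_score(y_true, y_proba)

      # probability that a random positive outranks a random negative
      pos = y_proba[y_true == 1]
      neg = y_proba[y_true == 0]
      pairwise = np.mean([p > n for p in pos for n in neg])

      print(auroc, pairwise)   # both 0.75 on this toy data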

  27. Precision-Recall Curve. [Figure: plot with recall (aka TPR) on the x-axis and precision (aka PPV) on the y-axis.]
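
  A sketch of the same idea for the precision-recall curve with scikit-learn, on the same toy labels and probabilities as above.

      from sklearn.metrics import precision_recall_curve

      y_true  = [0, 0, 1, 1]
      y_proba = [0.1, 0.4, 0.35, 0.8]

      precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
      print(precision)   # PPV at each threshold (plus the endpoint 1.0)
      print(recall)      # TPR (recall) at each threshold (plus the endpoint 0.0)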

  28. AUROC is not always the best choice. [Figure: two ROC curves, red and blue, plotted as TPR vs. FPR.] AUROC says red is better, but blue is much better for avoiding false alarms (it achieves higher TPR in the low-FPR region).

  29. Classifier: Evaluation Step. Goal: assess the quality of predictions. Many ways in practice: 1) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, … 2) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve. 3) Evaluate probabilities / scores directly: cross entropy loss (aka log loss). Not covered yet: hinge loss, many others.

  30. Measuring the quality of predicted probabilities. Use the log loss (aka "binary cross entropy"): from sklearn.metrics import log_loss. log_loss(y, p̂) = -y log(p̂) - (1 - y) log(1 - p̂). Advantages: smooth; easy to take derivatives!
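
  A sketch writing the slide's formula out by hand next to sklearn's log_loss on a few toy predictions (natural log in both cases).

      import numpy as np
      from sklearn.metrics import log_loss

      y_true  = [1, 0, 1]
      y_proba = [0.9, 0.2, 0.6]

      # average of  -y*log(p) - (1-y)*log(1-p)  over the examples
      manual = np.mean([-y * np.log(p) - (1 - y) * np.log(1 - p)
                        for y, p in zip(y_true, y_proba)])

      print(manual, log_loss(y_true, y_proba))   # both ~0.28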

  31. Why minimize log loss? The upper-bound justification. Log loss (if implemented in the correct base) is a smooth upper bound on the error rate. Why smooth matters: easy to do gradient descent. Why the upper bound matters: achieving a log loss of 0.1 (averaged over the dataset) guarantees that the error rate is no worse than 0.1 (10%).

  32. Log loss upper bounds the 0-1 error. error(y, ŷ) = 1 if y ≠ ŷ, 0 if y = ŷ. log_loss(y, p̂) = -y log(p̂) - (1 - y) log(1 - p̂). [Plot assumes: true label is 1, threshold is 0.5, log base 2.]
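
  A sketch verifying the bound under the plot's assumptions (true label 1, threshold 0.5, log base 2): for every predicted probability, the base-2 log loss is at least the 0-1 error.

      import numpy as np

      p_hat = np.linspace(0.01, 0.99, 99)           # predicted probability of class 1
      log_loss_base2 = -np.log2(p_hat)              # log loss when the true label is 1
      zero_one_error = (p_hat < 0.5).astype(float)  # error is 1 when we predict 0

      assert np.all(log_loss_base2 >= zero_one_error)   # the bound holds everywhere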

  33. Why minimize log loss? An information-theory justification.
