Data Mining Classification: Alternative Techniques
Imbalanced Class Problem

Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar

Class Imbalance Problem

Many classification problems have skewed classes, i.e., far more records from one class than from the other:
– Credit card fraud
– Intrusion detection
– Defective products in a manufacturing assembly line
– COVID-19 test results on a random sample
Challenges

– Evaluation measures such as accuracy are not well suited for imbalanced classes.
– Detecting the rare class is like finding a needle in a haystack.

Confusion Matrix

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      a           b
              Class=No       c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
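To make the cell definitions concrete, here is a minimal sketch (not from the book) that tallies the four cells from lists of actual and predicted labels; the function name and the "Yes"/"No" label encoding are assumptions for illustration.

```python
def confusion_cells(actual, predicted, pos="Yes"):
    """Tally a (TP), b (FN), c (FP), d (TN) for a binary problem."""
    a = b = c = d = 0
    for y, p in zip(actual, predicted):
        if y == pos:
            if p == pos: a += 1  # true positive
            else:        b += 1  # false negative
        else:
            if p == pos: c += 1  # false positive
            else:        d += 1  # true negative
    return a, b, c, d
```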
Accuracy

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes    a (TP)      b (FN)
              Class=No     c (FP)      d (TN)

Most widely used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Problem with Accuracy

Consider a 2-class problem:
– Number of Class NO examples = 990
– Number of Class YES examples = 10
For this problem, suppose a model predicts everything to be Class NO:

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      0          10
              Class=No       0         990

Accuracy is 990/1000 = 99%.
– This is misleading because the model does not detect any Class YES example.
– Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.).
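A minimal sketch reproducing the slide's numbers, assuming a trivial model that always predicts the majority class (all names here are illustrative):

```python
actual = ["Yes"] * 10 + ["No"] * 990
predicted = ["No"] * 1000  # always predict the majority class

accuracy = sum(y == p for y, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.99, yet not a single Class=Yes example is detected
```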
Which model is better?

Model A:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      0          10
              Class=No       0         990

Model B:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     10           0
              Class=No      90         900

Which model is better? (second comparison)

Model A:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      5           5
              Class=No       0         990

Model B:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     10           0
              Class=No      90         900
Alternative Measures

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      a           b
              Class=No       c           d

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

Example:

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     10           0
              Class=No      10         980

Precision (p) = 10/20 = 0.5
Recall (r) = 10/10 = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) ≈ 0.67
Accuracy = 990/1000 = 0.99
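As a sanity check on the formulas, a short sketch that computes the three measures directly from the cell counts of the matrix just shown (the helper name is illustrative):

```python
def precision_recall_f(a, b, c):
    # Denominators are assumed nonzero; guard them in real code.
    p = a / (a + c)              # precision
    r = a / (a + b)              # recall
    f = 2 * a / (2 * a + b + c)  # F-measure, equal to 2rp / (r + p)
    return p, r, f

print(precision_recall_f(10, 0, 10))  # (0.5, 1.0, 0.666...), matching the example
```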
Second example:

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes      1           9
              Class=No       0         990

Precision (p) = 1/1 = 1
Recall (r) = 1/10 = 0.1
F-measure (F) = (2 × 0.1 × 1) / (0.1 + 1) ≈ 0.18
Accuracy = 991/1000 = 0.991

Third example:

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     40          10
              Class=No      10          40

Precision (p) = 0.8
Recall (r) = 0.8
F-measure (F) = 0.8
Accuracy = 0.8
Alternative Measures (effect of class skew)

Model A:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     40          10
              Class=No      10          40

Precision (p) = 0.8, Recall (r) = 0.8, F-measure (F) = 0.8, Accuracy = 0.8

Model B:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     40          10
              Class=No    1000        4000

Precision (p) ≈ 0.04, Recall (r) = 0.8, F-measure (F) ≈ 0.07, Accuracy ≈ 0.8

Measures of Classification Performance

                         PREDICTED CLASS
                         Yes         No
ACTUAL CLASS  Yes        TP          FN
              No         FP          TN

α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).
β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).
Alternative Measures (TPR and FPR)

For model A above (TP=40, FN=10, FP=10, TN=40):
Precision (p) = 0.8, TPR = Recall (r) = 0.8, FPR = 0.2, F-measure (F) = 0.8, Accuracy = 0.8, TPR/FPR = 4

For model B above (TP=40, FN=10, FP=1000, TN=4000):
Precision (p) ≈ 0.038, TPR = Recall (r) = 0.8, FPR = 0.2, F-measure (F) ≈ 0.07, Accuracy ≈ 0.8, TPR/FPR = 4

Both models have identical TPR, FPR, and TPR/FPR; only precision (and hence F-measure) exposes the difference.

Alternative Measures (fixed precision, varying recall)

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     10          40
              Class=No      10          40

Precision (p) = 0.5, TPR = Recall (r) = 0.2, FPR = 0.2, F-measure ≈ 0.28

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     25          25
              Class=No      25          25

Precision (p) = 0.5, TPR = Recall (r) = 0.5, FPR = 0.5, F-measure = 0.5

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes     40          10
              Class=No      40          10

Precision (p) = 0.5, TPR = Recall (r) = 0.8, FPR = 0.8, F-measure ≈ 0.61
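The contrast between models A and B can be reproduced with a short sketch (cell values taken from the tables above; the rates helper is illustrative): TPR, FPR, and their ratio are unchanged by the flood of extra negatives, while precision collapses.

```python
def rates(tp, fn, fp, tn):
    tpr = tp / (tp + fn)   # true positive rate (recall)
    fpr = fp / (fp + tn)   # false positive rate
    prec = tp / (tp + fp)  # precision
    return tpr, fpr, prec

for name, cells in [("A", (40, 10, 10, 40)), ("B", (40, 10, 1000, 4000))]:
    tpr, fpr, prec = rates(*cells)
    print(f"{name}: TPR={tpr:.1f} FPR={fpr:.1f} TPR/FPR={tpr/fpr:.0f} precision={prec:.3f}")
# A: TPR=0.8 FPR=0.2 TPR/FPR=4 precision=0.800
# B: TPR=0.8 FPR=0.2 TPR/FPR=4 precision=0.038
```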
ROC (Receiver Operating Characteristic)

– A graphical approach for displaying the trade-off between detection rate and false alarm rate
– Developed in the 1950s in signal detection theory to analyze noisy signals
– An ROC curve plots TPR against FPR
  – The performance of a model is represented as a point on the ROC curve
  – Changing the threshold parameter of the classifier changes the location of the point

ROC Curve

Key points (TPR, FPR):
– (0,0): declare everything to be the negative class
– (1,1): declare everything to be the positive class
– (1,0): ideal

Diagonal line:
– Random guessing
– Below the diagonal line: prediction is opposite of the true class
ROC (Receiver Operating Characteristic)

– To draw an ROC curve, the classifier must produce a continuous-valued output
  – Outputs are used to rank test records, from the most likely positive-class record to the least likely positive-class record
– Many classifiers produce only discrete outputs (i.e., the predicted class)
  – How to get continuous-valued outputs from decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, or SVMs? (One option is sketched below.)

Example: Decision Trees
[Figure: a decision tree whose leaves provide continuous-valued outputs (class fractions)]
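As one concrete illustration (not from the slides), scikit-learn's DecisionTreeClassifier exposes predict_proba, which returns the class fraction in the leaf a record falls into; the toy data below is an assumption:

```python
from sklearn.tree import DecisionTreeClassifier  # assumes scikit-learn is installed

X_train = [[0.1], [0.35], [0.4], [0.6], [0.8], [0.9]]  # toy single-feature data
y_train = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
scores = tree.predict_proba([[0.2], [0.5], [0.7]])[:, 1]  # P(class = 1) per record
print(scores)  # rank test records by these scores to sweep out an ROC curve
```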
ROC Curve Example

[Figure: ROC curve for the example below]

– A 1-dimensional data set containing 2 classes (positive and negative)
– Any point located at x > t is classified as positive
– At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88
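A hypothetical reconstruction of this setup: the slide's class-conditional densities are not reproduced in the text, so the Gaussian parameters and the threshold below are assumptions (the resulting FPR differs from the slide's 0.12), chosen only to show how TPR and FPR fall out of a 1-D threshold:

```python
import random

random.seed(0)
neg = [random.gauss(0.0, 1.0) for _ in range(100000)]  # negative class (assumed)
pos = [random.gauss(2.0, 1.0) for _ in range(100000)]  # positive class (assumed)

t = 2.0  # classify x > t as positive
tpr = sum(x > t for x in pos) / len(pos)  # ~0.5: t sits at the positive-class mean
fpr = sum(x > t for x in neg) / len(neg)  # ~0.02 here; sliding t left raises both rates
print(tpr, fpr)
```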
Using ROC for Model Comparison

– No model consistently outperforms the other:
  – M1 is better for small FPR
  – M2 is better for large FPR
– Area Under the ROC Curve (AUC):
  – Ideal: Area = 1
  – Random guess: Area = 0.5

How to Construct an ROC Curve

• Use a classifier that produces a continuous-valued score for each instance
• The more likely it is for the instance to be in the + class, the higher the score
• Sort the instances in decreasing order according to the score
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold, then compute
  • TPR = TP / (TP + FN)
  • FPR = FP / (FP + TN)
(A sketch of this sweep, run on the table below, follows.)

Instance   Score   True Class
1          0.95    +
2          0.93    +
3          0.87    -
4          0.85    -
5          0.85    -
6          0.85    +
7          0.76    -
8          0.53    +
9          0.43    -
10         0.25    +
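A minimal sketch of the sweep described above, run on the ten instances from the table; the trapezoidal AUC at the end is a standard way to summarize the resulting curve:

```python
def roc_points(scores, labels, pos="+"):
    """(FPR, TPR) at each unique score threshold, from (0, 0) to (1, 1)."""
    pairs = sorted(zip(scores, labels), reverse=True)  # most likely positive first
    P = sum(1 for _, y in pairs if y == pos)
    N = len(pairs) - P
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (s, y) in enumerate(pairs):
        tp += y == pos
        fp += y != pos
        # Tied scores share a single threshold, hence a single point
        if i == len(pairs) - 1 or pairs[i + 1][0] != s:
            points.append((fp / N, tp / P))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]
pts = roc_points(scores, labels)
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(pts)
print(auc)  # trapezoidal area under the curve
```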