  1. Imbalanced Domain Learning
Fraud Detection Course 2019/2020, Nuno Moniz (nuno.moniz@fc.up.pt)

  2. Today
1. Beyond Standard ML
2. Imbalanced Domain Learning
   2.1 Problem Formulation
   2.2 Evaluation/Learning
3. Strategies for Imbalanced Domain Learning
4. Practical Examples

  3. Beyond Standard Machine Learning

  4. Hey Model 1, all apples are red, yellow or green.

  5. Hey Model 1, what's the colour of this apple?

  6. Famous ML Mistakes #1

  7. Famous ML Mistakes #2

  8. Machine Learning, Predictive Modelling
The goal of predictive modelling is to obtain a good approximation for an unknown function:
Y = f(x_1, x_2, ⋯)
What you need (the most basic):
- A dataset
- A target variable
- Learning algorithm(s)
- Evaluation metric(s)

  9. Modelling Wally
Your task is to find Wally among 100 people.
You train two models using the state-of-the-art in Deep Learning:
- Model 1 obtains 99% Accuracy
- Model 2 obtains 1% Accuracy
Confusion Matrix?
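To make the accuracy trap concrete, here is a minimal R sketch. The 1-Wally-in-100 setup is from the slide; the always-majority model is an illustrative assumption about what a 99%-accuracy model could be doing:

```r
# 100 people: 1 Wally, 99 others
truth <- factor(c("wally", rep("not_wally", 99)), levels = c("wally", "not_wally"))

# A model that never predicts "wally" (the majority-class guesser)
pred <- factor(rep("not_wally", 100), levels = c("wally", "not_wally"))

mean(pred == truth)                      # 0.99 accuracy, yet Wally is never found
table(Predicted = pred, Actual = truth)  # the confusion matrix tells the real story
```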

  10. Predicting Popular News
Your task is to anticipate the most popular news of the day.
You train two models using state-of-the-art Ensemble Learning techniques:
- Model 1 obtains 0.1 NMSE
- Model 2 obtains 0.5 NMSE
Normalized Mean Squared Error: NMSE = (1/n) ∑_{i=1}^{n} (ŷ_i − y_i)² / y_i²
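A small helper implementing the NMSE above, as a sketch (the popularity values are made up for illustration):

```r
# Normalized Mean Squared Error: squared errors scaled by y_i^2, averaged
nmse <- function(y, y_hat) mean((y_hat - y)^2 / y^2)

y     <- c(100, 200, 50, 400)   # true popularity scores (illustrative)
y_hat <- c(110, 180, 60, 380)   # predicted scores (illustrative)
nmse(y, y_hat)                  # 0.015625: lower is better
```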

  11. Imbalanced Domain Learning

  12. Imbalanced Domain Learning?
It's still predictive modelling, and as such... "The goal of predictive modelling is to obtain a good approximation for an unknown function": Y = f(x_1, x_2, ⋯)
Standard predictive modelling has some assumptions:
- The distribution of the target variable is balanced
- Users have uniform preferences: all errors were born equal
Assumptions in Imbalanced Domain Learning:
- The distribution of the target variable is imbalanced
- Users have non-uniform preferences: some cases are more important
- The more important/relevant cases are those which are rare or extreme

  13. Imbalanced Domain Learning - Nominal Target
[Figures: a balanced distribution vs. an imbalanced distribution]

  14. Imbalanced Domain Learning - Numerical Target
[Figures: a balanced distribution vs. an imbalanced distribution]

  15. Problems with Imbalanced Domains
There are two main problems when learning with imbalanced domains: 1. How to learn? 2. How to evaluate?
How to learn?
- Models are optimized to accurately represent the maximum of information
- When there's information imbalance, this means a model will more likely represent the majority type of information to the detriment of the minority (rare/extreme cases)
How to evaluate?
- Most of the well-known evaluation metrics are focused on assessing the average behaviour of the models
- However, there are a lot of cases where the evaluation objective is to understand if a model is capable of predicting a certain class or subset of values, i.e. imbalanced domain learning

  16. The Problem of Evaluation
This problem is different for classification and regression tasks.
Imbalanced learning has been explored in classification for over 20 years, but in regression problems it is very recent.
The main idea when evaluating this type of problem:
- Remember that not all cases are equal
- You're focused on the ability of models to predict rare cases
- Missing a prediction of a rare case is worse than missing a normal case

  17. The Problem of Evaluation
Classification:
- Standard Evaluation: Accuracy, Error Rate (its complement)
- Non-Standard Evaluation: F-Score, G-Mean, ROC Curves / AUC
Regression:
- Standard Evaluation: MSE, RMSE
- Non-Standard Evaluation: MSEϕ, RMSEϕ, Utility-Based Metrics (UBL R package)
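As a sketch of how the non-standard classification metrics behave under imbalance, the following computes F-Score and G-Mean from illustrative confusion-matrix counts (the counts are assumptions, not from the slides):

```r
# Illustrative counts for a rare positive class
TP <- 8; FN <- 2; FP <- 40; TN <- 9950

precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)   # sensitivity / true positive rate
specificity <- TN / (TN + FP)   # true negative rate

f1     <- 2 * precision * recall / (precision + recall)
g_mean <- sqrt(recall * specificity)

c(Accuracy = (TP + TN) / (TP + TN + FP + FN), F1 = f1, G_Mean = g_mean)
```

Accuracy is near 1 here even though precision on the rare class is poor; F-Score and G-Mean expose this.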

  18. The Problem of Learning
Imagine the following scenario:
- You have a dataset of 10,000 cases of credit transactions classified as Fraud or Normal
- This dataset has 9,990 cases classified as Normal, and only 10 cases classified as Fraud
Learning algorithms are not human beings: they're programmed to operate in a pre-determined way. This usually means that the problem they want to solve is: how can we accurately represent the data?
However, learning algorithms make choices - they have assumptions. The most hazardous for imbalanced domain learning are:
1. Assuming that all cases are equal
2. Internal optimization/decisions based on standard metrics
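The 9,990/10 scenario above can be reproduced in a few lines of R: a model that always answers Normal gets 99.9% accuracy while missing every fraud case.

```r
truth <- factor(c(rep("Normal", 9990), rep("Fraud", 10)))
pred  <- factor(rep("Normal", 10000), levels = levels(truth))

mean(pred == truth)                      # 0.999 accuracy
table(Predicted = pred, Actual = truth)  # all 10 Fraud cases are missed
```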

  19. The Problem of Learning Instead of "It's all about the bass", it's in fact all about the mean/mode. Remember this?

  20. Until now
1. There's more to machine learning than standard tasks
2. Learning algorithms are biased
3. Algorithms are focused on reducing the average error/representing the majority cases
4. Beware of standard evaluation metrics if your task is imbalanced domain learning

  21. Strategies for Imbalanced Domain Learning

  22. Strategies for Imbalanced Domain Learning

  23. Data Pre-Processing
Goal: change the examples' distribution before applying any learning algorithm.
Advantages: any standard learning algorithm can then be used.
Disadvantages:
- difficult to decide the optimal distribution (a perfect balance does not always provide the optimal results)
- the strategies applied may severely increase/decrease the total number of examples

  24. Special-purpose Learning Methods
Goal: change existing algorithms to provide a better fit to the imbalanced distribution.
Advantages:
- very effective in the contexts for which they were designed
- more comprehensible to the user
Disadvantages:
- a difficult task, because it requires deep knowledge of both the algorithm and the domain
- difficulty of using an already adapted method in a different learning system
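One concrete instance of a special-purpose (algorithm-level) method is cost-sensitive learning. As a sketch, rpart accepts a loss matrix that makes misclassifying a rare case more expensive; the 10x cost and the rare/common recoding of iris are illustrative assumptions:

```r
library(rpart)

# iris recoded into a rare/common problem, as in the practical examples
data(iris)
d <- iris[, c(1, 2, 5)]
d$Species <- factor(ifelse(d$Species == "setosa", "rare", "common"))

# Loss matrix: rows = true class, columns = predicted class
# (level order: common, rare); misclassifying a true "rare" case costs 10x
loss <- matrix(c(0,  1,
                 10, 0), nrow = 2, byrow = TRUE)

model <- rpart(Species ~ ., data = d, parms = list(loss = loss))
table(Predicted = predict(model, d, type = "class"), Actual = d$Species)
```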

  25. Prediction Post-Processing
Goal: change the predictions after applying any learning algorithm.
Advantages: any standard learning algorithm can be used.
Disadvantages: potential loss of model interpretability.
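A common post-processing move is threshold-moving: keep the model as-is and lower the probability cutoff for the rare class. A minimal sketch with made-up predicted probabilities:

```r
# Predicted probabilities of the rare class (illustrative values)
p_rare <- c(0.05, 0.30, 0.45, 0.60, 0.20)

ifelse(p_rare >= 0.5,  "rare", "common")   # default cutoff: 1 case flagged
ifelse(p_rare >= 0.25, "rare", "common")   # lowered cutoff: 3 cases flagged
```

Choosing the cutoff trades extra false alarms for fewer missed rare cases.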

  26. Practical Examples

  27. Practical Examples - Data Pre-Processing
Data pre-processing strategies are also known as Resampling Strategies.
These are the most common strategies to tackle imbalanced domain learning tasks.
We will look at practical examples for both classification and regression using:
- Random Undersampling
- Random Oversampling
- SMOTE

  28. Preliminaries in R
Install the package UBL from CRAN:
    install.packages("UBL")
Install UBL from GitHub:
    library(devtools)
    # stable release
    install_github("paobranco/UBL", ref = "master")
    # development release
    install_github("paobranco/UBL", ref = "develop")
After installation, the package can be used as any other R package:
    library(UBL)

  29. Data for Practical Examples - Classification Tasks
We will use the well-known dataset iris. The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
For the purpose of the practical examples, we will consider the class setosa as the rare class, and the other classes as the normal class.
    library(UBL)
    # generating an artificially imbalanced data set
    data(iris)
    data <- iris[, c(1, 2, 5)]
    data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))
    # checking the class distribution of this artificial data set
    table(data$Species)
    ##
    ## common   rare
    ##    100     50

  30. Random Undersampling
To force the models to focus on the most important and least represented class(es), this technique randomly removes examples from the most represented, and therefore less important, class(es).
As such, the modified data set obtained is smaller than the original one.
The user must always be aware that, to obtain a more balanced data set, this strategy may discard useful data.
Therefore, this strategy should be applied with caution, especially in smaller datasets.
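On the artificially imbalanced iris data, random undersampling with UBL can be sketched as follows (RandUnderClassif with C.perc = "balance" is the simplest call; the resulting class counts assume the 100/50 split built earlier):

```r
library(UBL)

# the artificially imbalanced iris data from the earlier slide
data(iris)
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))

# randomly drop "common" cases until the two classes are balanced
balanced <- RandUnderClassif(Species ~ ., data, C.perc = "balance")
table(balanced$Species)   # common: 50, rare: 50
```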
