Learning and Imbalanced Data
David Rimshnick
Data Science in the Wild, Spring 2019
January 28, 2019
What is data imbalance?
• Unequal distribution of data with respect to a certain characteristic
• Target variable
  • Classification: certain classes have a much higher share of samples
    • E.g., for a very rare disease, 99.9% of test results could be negative
  • Regression: certain ranges of the output are much more prevalent
    • E.g., almost all outputs are 0 or close to it, with very few non-zero values
• Action variable
  • One of the inputs (e.g., an action) has very low variance in the sample
  • Difficult for the model to learn the impact of changing that variable
  • We will revisit this when we discuss reinforcement learning
Why is imbalance bad?
• Discussion
What is wrong with data imbalance?
• Rare disease example: a classifier can get 99.9% accuracy by just predicting negative for everything!
  • This is also why accuracy is not the best metric
  • The loss function may need to be modified
  • Need to consider the false-negative rate as well as the true-positive rate, etc.
  • Confusion matrix, AUROC, etc. (to be discussed again in later lectures)
• The sample may not mimic the population
  • E.g., 90% of the sample is class A, but only 50% of the population is
• Overfitting: the model may 'memorize' defining characteristics of the minority class instead of learning the underlying pattern
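A minimal sketch (not from the slides) of the accuracy paradox in the rare disease example: the positive rate of 0.1% and the labels are illustrative assumptions.

```python
# A classifier that always predicts "negative" looks great on accuracy
# but is useless on the minority class.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)  # ~0.1% positives
y_pred = np.zeros_like(y_true)                       # always predict negative

print("accuracy:", accuracy_score(y_true, y_pred))                  # ~0.999
print("recall:  ", recall_score(y_true, y_pred, zero_division=0))   # 0.0
print(confusion_matrix(y_true, y_pred))              # every positive is missed
```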
How do we deal with data imbalance?
• Alter the sample; three primary methods:
  • Oversampling: for an under-represented class or part of the distribution, duplicate observations until the dataset is balanced
  • Undersampling: for an over-represented class or part of the distribution, remove observations until the dataset is balanced
  • Synthetic data creation
• Alter the cost function
Oversampling
• "Random oversampling"
  • Randomly duplicate records from the minority class(es), with replacement, until the dataset is balanced
• Downside: overfitting
  • The model may 'memorize' idiosyncratic characteristics of the duplicated records instead of learning a pattern that generalizes
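A minimal sketch of random oversampling with pandas; the DataFrame df and the column name "label" are illustrative assumptions, not from the slides. The imbalanced-learn library provides RandomOverSampler for the same purpose.

```python
# Duplicate minority-class rows (with replacement) until every class
# matches the size of the largest class.
import pandas as pd

def random_oversample(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    counts = df[label].value_counts()
    target = counts.max()                       # size of the largest class
    parts = []
    for cls, n in counts.items():
        rows = df[df[label] == cls]
        extra = target - n                      # how many duplicates are needed
        if extra > 0:
            rows = pd.concat([rows, rows.sample(extra, replace=True, random_state=seed)])
        parts.append(rows)
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle
```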
Undersampling
• "Random undersampling"
  • Randomly delete records from the majority class(es) until the dataset is balanced
• Downside: loss of data!
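A minimal sketch of random undersampling, under the same illustrative df/"label" assumptions as above: every class is downsampled to the size of the smallest class.

```python
# Randomly delete majority-class rows until all classes have the minority size.
import pandas as pd

def random_undersample(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    target = df[label].value_counts().min()     # size of the smallest class
    return (
        df.groupby(label, group_keys=False)
          .apply(lambda g: g.sample(target, random_state=seed))
          .sample(frac=1, random_state=seed)    # shuffle the result
    )
```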
'Informed' undersampling
• Several methods exist (see the referenced paper)
• Example: Edited Nearest Neighbour rule (ENN)
  • Remove instances of the majority class whose k-NN prediction differs from the majority class
  • Intuition: remove "confusing" examples of the majority class and make the decision surface smoother
• Algorithm:
  1. Obtain the l nearest neighbors of each majority instance y_j ∈ O
  2. Remove y_j if neighbors from another class predominate
  3. Repeat for every majority instance in the subset O
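A minimal sketch of the ENN rule using scikit-learn's NearestNeighbors; X, y, the majority label, and k = 3 are illustrative assumptions. Imbalanced-learn also ships a ready-made EditedNearestNeighbours.

```python
# Drop majority-class points whose nearest neighbors mostly belong to
# another class ("confusing" points near the decision boundary).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_undersample(X: np.ndarray, y: np.ndarray, majority_label, k: int = 3):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X)
    keep = np.ones(len(y), dtype=bool)
    for i in np.where(y == majority_label)[0]:
        neighbor_labels = y[idx[i, 1:]]               # skip the point itself
        if np.sum(neighbor_labels != majority_label) > k / 2:
            keep[i] = False                           # neighbors disagree: remove it
    return X[keep], y[keep]
```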
Synthetic data creation
• Instead of just resampling existing values to oversample, create artificial or synthetic data
• One of the best-known techniques: SMOTE (Synthetic Minority Over-sampling Technique)
• Algorithm:
  • For each instance y_j in the minority set, find its k nearest minority neighbors
  • Randomly select one of those neighbors
  • Create a new instance whose features are a convex combination (with some random parameter q) of the features of the original instance and the selected neighbor
• Chawla, Nitesh V., et al. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research 16 (2002): 321-357.
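A minimal sketch of SMOTE for a single minority class, following the interpolation idea in Chawla et al. (2002); X_min (minority samples only), n_new, and k are illustrative assumptions. Imbalanced-learn's SMOTE is the usual off-the-shelf choice.

```python
# Generate synthetic minority samples by interpolating between a minority
# point and one of its k nearest minority neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                  # idx[:, 0] is the point itself
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))               # pick a minority instance
        neighbor = X_min[rng.choice(idx[j, 1:])]   # pick one of its k neighbors
        q = rng.random()                           # interpolation parameter in [0, 1]
        synthetic[i] = X_min[j] + q * (neighbor - X_min[j])  # convex combination
    return synthetic
```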
Visualization of SMOTE algorithm
• Image source: Beckmann, Marcelo, Nelson F. F. Ebecken, and Beatriz S. L. Pires de Lima. "A KNN Undersampling Approach for Data Balancing." Journal of Intelligent Learning Systems and Applications 7.4 (2015): 104.
Cost function alteration
• Idea: assign greater cost to observations from the minority class
  • E.g., in the loss function, give class j the weight w_j = 1 / (q_j · C), where q_j is the sample proportion of class j and C is the number of classes
  • Downside: you have to edit the algorithm, i.e., it is no longer a black box
• More general framework: assign greater weight to observations that are mishandled by the model
  • What is this technique when done iteratively? Boosting!
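A minimal sketch of the weighting scheme above, w_j = 1 / (q_j · C); the labels array y is an illustrative assumption. Many scikit-learn estimators expose the same idea via class_weight="balanced".

```python
# Compute per-class weights inversely proportional to class frequency.
import numpy as np

def balanced_class_weights(y: np.ndarray) -> dict:
    classes, counts = np.unique(y, return_counts=True)
    q = counts / counts.sum()                      # sample proportion q_j
    C = len(classes)                               # number of classes
    return {cls: 1.0 / (q_j * C) for cls, q_j in zip(classes, q)}

# Example: y = np.array([0, 0, 0, 0, 1]) -> {0: 0.625, 1: 2.5}
```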
Attribution
• This lecture is partially based on the following paper: H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009.