ADVANCED MACHINE LEARNING
Caveats and Techniques to Deal with Imbalanced Datasets
(Adapted from H. He and E. A. Garcia, “Learning from Imbalanced Data,” IEEE Trans. Knowledge and Data Engineering, vol. 21, issue 9, pp. 1263-1284, 2009)
ML is now everywhere
❖ Increased storage capacity makes it easy to build large datasets, e.g. companies can store the activities of clients with a variety of attributes
❖ Code for ML techniques is widely available and easy to use by laymen
➢ ML is used widely in a variety of domains
❖ One hopes that ML will solve problems with which companies struggle (vast amounts of data, very noisy, high-dimensional)
ML techniques assume balanced datasets
ML algorithms assume that the data is balanced
➢ In classification: a comparable number of instances of each class
What would be the result of running SVM on the imbalanced dataset?
[Figure: balanced dataset vs. imbalanced dataset (826 datapoints: 790 positive class, 36 negative class)]
Imbalanced Data: Why and When?
Imbalanced Data: Example
Finance: number of clients who closed their account / number of clients who did not: 1%
Robotics: number of points where the two arms intersect / number of points with no intersection: 4-5%
One usually has far fewer datapoints from the adverse class. This is unfortunate, as we care a lot about not misclassifying elements of this class.
Imbalanced Data: Example
What would be the effect on SVR?
Imbalanced Data: Example
Poor interpolation with missing data
Types of Imbalance
Few data locally
Between-class datasets are balanced, but the white dataset is not representative, with more data in some regions
Imbalance and curse of dimensionality
Approaches to learning with imbalanced datasets
Learning with imbalanced datasets
Two main approaches:
❖ Act on the data: Sampling Methods
❖ Act on the cost function: Cost-Sensitive Methods
Sampling Methods
Compensate for the lack of data by:
❖ Increasing the dataset: generate new datapoints for the smallest class
❖ Decreasing the dataset: remove redundant datapoints from the largest class
Undersampling
Remove redundant datapoints.
Loses statistics: good only if there are enough datapoints in the undersampled class and for low-dimensional datasets.
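A minimal sketch of undersampling, assuming that datapoints are dropped at random from the larger class until all classes have the same size. Random removal is only a stand-in for "removing redundant datapoints"; the function name `random_undersample` and the toy data are hypothetical and use only the Python standard library.

```python
import random
from collections import Counter


def random_undersample(points, labels, seed=0):
    """Randomly keep, for every class, as many points as the smallest class has.

    Assumption: random removal approximates 'removing redundant datapoints'.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(points, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())  # size of smallest class
    new_points, new_labels = [], []
    for y, xs in by_class.items():
        new_points.extend(rng.sample(xs, target))  # sample without replacement
        new_labels.extend([y] * target)
    return new_points, new_labels


# Toy data using the class counts from the earlier slide (790 vs. 36).
points = [(float(i), 0.0) for i in range(790)] + [(float(i), 1.0) for i in range(36)]
labels = [+1] * 790 + [-1] * 36

under_points, under_labels = random_undersample(points, labels)
under_counts = Counter(under_labels)
```

After the call both classes hold 36 points, which makes the loss of statistics concrete: 754 of the 790 majority points are discarded.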
Oversampling
Pick a neighbour and create a new datapoint.
Risks overfitting, especially if one does this for points that are noise.
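The "pick a neighbour and create a new datapoint" step can be sketched as a linear interpolation between a minority point and one of its nearest minority neighbours. This is an illustrative sketch, not a reference implementation: the function name, the choice of `k`, and the toy points are assumptions.

```python
import math
import random


def oversample_by_interpolation(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Candidate neighbours: all other minority points, closest first.
        others = [p for j, p in enumerate(minority) if j != i]
        nearest = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(nearest)
        t = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority class inside the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = oversample_by_interpolation(minority, n_new=10)
```

Because every synthetic point lies between two existing minority points, a noisy minority point produces more noisy points around it, which is the overfitting risk noted above.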
SMOTE: Synthetic Minority Over-sampling Technique
Generate new samples in between existing datapoints based on their local density and their borders with the other class.
Can use cleaning techniques (undersampling) to remove redundancy at the end.
❖ Noise: no neighbors of the same class; surrounded by the other class
❖ Safe: several neighbors of the same class
❖ In danger: surrounded only on one side by the other class
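One way to make the noise / safe / in danger distinction concrete is to count how many of a minority point's k nearest neighbours belong to the other class. The sketch below assumes the thresholds used by borderline variants of SMOTE (all neighbours from the other class: noise; at least half: in danger; otherwise: safe). The thresholds, the function name, and the toy data are assumptions, not taken from the slides.

```python
import math


def categorize_minority(minority, majority, k=3):
    """Label each minority point 'noise', 'in danger' or 'safe' from the
    share of other-class points among its k nearest neighbours."""
    categories = []
    for i, p in enumerate(minority):
        # Distances to every other point, tagged with the neighbour's class.
        neighbours = [(math.dist(p, q), "same")
                      for j, q in enumerate(minority) if j != i]
        neighbours += [(math.dist(p, q), "other") for q in majority]
        nearest = sorted(neighbours)[:k]
        n_other = sum(1 for _, tag in nearest if tag == "other")
        if n_other == len(nearest):
            categories.append("noise")      # no same-class neighbour
        elif n_other >= len(nearest) / 2:
            categories.append("in danger")  # close to the class border
        else:
            categories.append("safe")       # mostly same-class neighbours
    return categories


# Hypothetical data: a tight minority cluster, one minority point next to
# the majority class, and one isolated minority point inside the majority.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (1.0, 0.0), (5.0, 5.0)]
majority = [(1.2, 0.0), (1.3, 0.1), (5.0, 5.2), (5.2, 5.0), (4.8, 5.0), (5.0, 4.8)]

categories = categorize_minority(minority, majority, k=3)
```

Under these assumptions, new samples would be generated mainly around the "in danger" points, while "noise" points would be left out to avoid amplifying them.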
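How to treat Imbalanced Datasets with SVM
SVM with asymmetric misclassification cost: vary the penalty placed on each datapoint depending on its class
$$\min \; \frac{1}{2}\|w\|^{2} + \frac{C}{M}\sum_{i=1}^{M} \gamma_i \left(\xi_i + \xi_i^{*}\right)$$
(with $\gamma_i$ the class-dependent penalty on the slack variables $\xi_i, \xi_i^{*}$)
SVM class boundary adjustment: changing $b$ changes the boundary between the two classes
$$y = \operatorname{sgn}\left(\sum_{i=1}^{M} \alpha_i\, k(x, x^{i}) + b\right)$$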
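Both ideas can be illustrated without an SVM solver. The sketch below first computes a class-dependent penalty that is inversely proportional to class frequency, then shows how shifting the offset b flips the sign of a kernel decision function. The weighting heuristic M / (number of classes × class count) is an assumption (it matches the "balanced" option of the `class_weight` parameter in scikit-learn's `SVC`); the support points, coefficients, and offsets are hypothetical.

```python
from collections import Counter


def class_weights(labels):
    """Penalty per class, inversely proportional to class frequency:
    M / (n_classes * count). Assumed heuristic, not from the slides."""
    counts = Counter(labels)
    m = len(labels)
    return {c: m / (len(counts) * n) for c, n in counts.items()}


def linear_kernel(u, v):
    """Plain dot product, used here as the simplest kernel k(u, v)."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decide(x, support, alphas, b, kernel):
    """Sign of sum_i alpha_i * k(x, x_i) + b (labels absorbed into alpha_i)."""
    score = sum(a * kernel(x, s) for a, s in zip(alphas, support)) + b
    return 1 if score >= 0 else -1


# Class counts from the earlier slide: 790 positive, 36 negative.
labels = [+1] * 790 + [-1] * 36
weights = class_weights(labels)  # the rare class gets the larger penalty

# Hypothetical two-point model; only b differs between the two predictions.
support = [(1.0, 0.0), (-1.0, 0.0)]
alphas = [1.0, -1.0]
query = (0.2, 0.0)
pred_default = decide(query, support, alphas, b=0.0, kernel=linear_kernel)
pred_shifted = decide(query, support, alphas, b=-0.5, kernel=linear_kernel)
```

With these numbers the negative class is penalised roughly 22 times more heavily than the positive class, and lowering b from 0 to -0.5 moves the query point from the positive to the negative side of the boundary.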
Taking into account imbalanced datasets in the assessment of performance
Traditional accuracy measures are sensitive to the data distribution.
The F-measure is better adapted, as it evaluates performance on one class.
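A small worked comparison shows why accuracy can mislead. The sketch treats the rare class (36 of 826 points, as in the earlier slide) as the class of interest and compares two hypothetical classifiers: A always predicts the frequent class, B finds 30 of the 36 rare points at the cost of 60 false alarms. The confusion counts for B are invented for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all datapoints that are classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn, beta=1.0):
    """F-measure on the class of interest; 0.0 when nothing is detected."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Classifier A: always predicts the frequent class (no rare point detected).
acc_a = accuracy(tp=0, fp=0, fn=36, tn=790)
f_a = f_measure(tp=0, fp=0, fn=36)

# Classifier B (hypothetical counts): 30 rare points found, 60 false alarms.
acc_b = accuracy(tp=30, fp=60, fn=6, tn=730)
f_b = f_measure(tp=30, fp=60, fn=6)
```

Classifier A reaches about 95.6% accuracy with an F-measure of 0, while B has lower accuracy (about 92.0%) but an F-measure of about 0.48, so only the F-measure reflects that B actually detects the rare class.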
TOOLBOX
https://github.com/scikit-learn-contrib/imbalanced-learn