ADVANCED MACHINE LEARNING Caveats and Techniques to Deal with Imbalanced Datasets

  1. ADVANCED MACHINE LEARNING Caveats and Techniques to Deal with Imbalanced Datasets (Adapted from H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Trans. Knowledge and Data Engineering, vol. 21, issue 9, pp. 1263-1284, 2009)

  2. ML is now everywhere. Increased storage capacity makes it easy to build large datasets, e.g. companies can store the activities of clients with a variety of attributes. Widely available code for ML techniques makes them easy to use by laypeople. As a result, ML is used widely in a variety of domains. One hopes that ML will solve problems with which companies struggle (vast amounts of data, very noisy, high-dimensional).

  3. ML techniques assume balanced datasets. ML algorithms assume that the data are balanced; in classification, this means a comparable number of instances of each class. What would be the result of running SVM on the imbalanced dataset? [Figure: a balanced dataset next to an imbalanced dataset of 826 data points: 790 in the positive class, 36 in the negative class.]
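To get a feel for why the imbalanced example is problematic, the sketch below uses only the class counts quoted on the slide (790 positive, 36 negative). It does not run an SVM. It assumes a hypothetical degenerate classifier that labels every point as positive, which is one plausible failure mode on such data, and reports its accuracy and its recall on the negative class. The function name is illustrative.

```python
# Class counts taken from the slide; the classifier itself is hypothetical.
n_pos, n_neg = 790, 36


def always_positive_scores(n_pos, n_neg):
    """Accuracy and negative-class recall of a classifier that predicts 'positive' everywhere."""
    total = n_pos + n_neg
    accuracy = n_pos / total      # every positive is right, every negative is wrong
    recall_negative = 0 / n_neg   # no negative point is ever detected
    return accuracy, recall_negative


accuracy, recall_negative = always_positive_scores(n_pos, n_neg)
print(f"accuracy = {accuracy:.3f}, recall on negative class = {recall_negative:.3f}")
```

The accuracy stays above 95% even though the rare class is never recognised, which is why accuracy alone is a poor guide on such data.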

  4. Imbalanced Data: Why and When?

  5. Imbalanced Data: Example. Finance: number of clients who closed their account / number of clients who did not: 1%. Robotics: number of points where the two arms intersect / number of points with no intersection: 4-5%. One usually has far fewer datapoints from the adverse class. This is unfortunate, as we care a lot about avoiding misclassifying elements of this class.

  6. Imbalanced Data: Example. What would be the effect on SVR?

  7. Imbalanced Data: Example. Poor interpolation with missing data.

  8. Types of Imbalance. Few data locally: the dataset is balanced between classes, but the whole dataset is not representative, with more data in some regions.

  10. Imbalance and curse of dimensionality

  11. Approaches to learning with imbalanced datasets

  12. Learning with imbalanced datasets. Two main approaches: act on the data (sampling methods) or act on the cost function (cost-sensitive methods).

  14. Sampling Methods. Compensate for the lack of data by either increasing the dataset (generate new datapoints for the smallest class) or decreasing the dataset (remove redundant datapoints from the largest class).

  15. Undersampling: remove redundant datapoints. This loses statistics; it is good only if there are enough datapoints in the undersampled class and for low-dimensional datasets.
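As a minimal sketch of undersampling, the code below drops majority points uniformly at random until a target count is reached. This is plain random undersampling, a simpler stand-in for the redundancy-based removal named on the slide; the function name, the `n_keep` parameter and the toy data are assumptions made for illustration.

```python
import random


def random_undersample(X, y, majority_label, n_keep, seed=0):
    """Keep every non-majority point and a random subset of n_keep majority points."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    other_idx = [i for i, label in enumerate(y) if label != majority_label]
    kept = rng.sample(majority_idx, min(n_keep, len(majority_idx)))
    idx = sorted(kept + other_idx)  # preserve the original ordering
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data: 20 majority points (label +1) and 4 minority points (label -1).
X = [[float(i), 0.0] for i in range(20)] + [[float(i), 1.0] for i in range(4)]
y = [+1] * 20 + [-1] * 4

X_bal, y_bal = random_undersample(X, y, majority_label=+1, n_keep=4)
print(y_bal.count(+1), y_bal.count(-1))
```

Because the discarded points are chosen blindly, informative majority points can be lost, which is the loss of statistics the slide warns about.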

  16. Oversampling: pick a neighbour and create a new datapoint. This risks overfitting, especially if one does this for points that are noise.
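The "pick a neighbour and create a new datapoint" step can be sketched as follows. The code picks a minority point, picks one of its `k` nearest minority neighbours, and places a synthetic point at a random position on the segment joining them. The choice of linear interpolation, the value of `k`, and all names are assumptions for illustration, not the exact procedure from the slides.

```python
import math
import random


def oversample_by_interpolation(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority points and their near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = oversample_by_interpolation(minority, n_new=6)
print(len(new_points))
```

If `base` happens to be a noisy point, every synthetic point generated from it spreads that noise further, which is the overfitting risk mentioned above.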

  17. SMOTE: Synthetic Minority Oversampling Technique. Generate new samples in between existing datapoints, based on their local density and their borders with the other class. Cleaning techniques (undersampling) can be used at the end to remove redundancy. Datapoints fall into three categories: no neighbours of the same class, surrounded by the other class (noise); several neighbours of the same class (safe); surrounded only on one side by the other class (in danger).
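The three categories of datapoints can be approximated by counting how many of a minority point's nearest neighbours belong to the other class. The sketch below is one possible reading: the one-half threshold for "in danger" follows the spirit of borderline variants of SMOTE and is an assumption, as are `k`, the function name and the toy data.

```python
import math


def categorise_minority_points(minority, majority, k=3):
    """Label each minority point 'noise', 'in danger' or 'safe' from its k nearest neighbours."""
    labelled = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
    categories = []
    for point in minority:
        others = [(q, c) for q, c in labelled if q is not point]
        nearest = sorted(others, key=lambda qc: math.dist(point, qc[0]))[:k]
        n_other_class = sum(1 for _, c in nearest if c == "maj")
        if n_other_class == len(nearest):
            categories.append("noise")      # no neighbour of the same class
        elif n_other_class >= len(nearest) / 2:
            categories.append("in danger")  # close to the border with the other class
        else:
            categories.append("safe")       # mostly same-class neighbours
    return categories


# Toy data: a tight minority cluster, one border point, one isolated point.
minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [1.0, 1.0], [5.0, 5.0]]
majority = [[1.1, 1.1], [1.2, 1.0], [5.0, 5.1], [5.1, 5.0], [4.9, 5.0], [5.0, 4.9]]
print(categorise_minority_points(minority, majority))
```

A sampler built on this labelling would typically skip the "noise" points and concentrate new samples around the "in danger" ones.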

  21. How to treat Imbalanced Datasets with SVM. (a) SVM with asymmetric misclassification cost: vary the penalty placed on each datapoint depending on its class, i.e. minimize $\frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M} c_i\,(\xi_i + \xi_i^*)$, where the weight $c_i$ depends on the class of datapoint $i$. (b) SVM class boundary adjustment: changing $b$ changes the boundary between the two classes, $y = \operatorname{sgn}\left(\sum_{i=1}^{M} \alpha_i\, k(x, x^i) + b\right)$.
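Both ideas can be illustrated without training an SVM. The sketch below assumes class weights inversely proportional to class frequency (one common heuristic, not necessarily the one used in the slides), uses the hinge loss as the per-point slack, and shows how shifting the offset `b` moves predictions from one class to the other. All function names and the toy scores are hypothetical.

```python
def class_weights(y):
    """Per-class penalty weights, inversely proportional to class frequency (assumed heuristic)."""
    n = len(y)
    classes = sorted(set(y))
    return {c: n / (len(classes) * y.count(c)) for c in classes}


def weighted_hinge_cost(scores, y, weights):
    """Mean class-weighted hinge loss: the data term of an asymmetric-cost objective."""
    return sum(weights[t] * max(0.0, 1.0 - t * s) for s, t in zip(scores, y)) / len(y)


def predict(scores, b=0.0):
    """Sign of the decision value after adding the offset b."""
    return [1 if s + b >= 0 else -1 for s in scores]


# Class counts from the imbalanced example: 790 positive, 36 negative.
y = [+1] * 790 + [-1] * 36
weights = class_weights(y)

# Cost of one confidently wrong prediction on each class.
cost_miss_majority = weighted_hinge_cost([-1.0], [+1], weights)
cost_miss_minority = weighted_hinge_cost([+1.0], [-1], weights)
print(cost_miss_majority < cost_miss_minority)

# Shifting b moves the class boundary for the same decision values.
scores = [0.4, 0.2, 0.1, -0.3]
print(predict(scores, b=0.0), predict(scores, b=-0.25))
```

With these weights, an error on the rare class costs roughly twenty times more than an error on the frequent class, so the optimiser can no longer ignore it.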

  22. How to treat Imbalanced Datasets with SVM

  23. Taking imbalanced datasets into account in the assessment of performance. Traditional accuracy measures are sensitive to the data distribution. The F-measure is better adapted, as it evaluates performance on one class.
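The difference between accuracy and the F-measure can be seen on a small worked example. The confusion counts below are invented for illustration (36 rare-class points treated as the class of interest, 790 points of the frequent class); only the formulas for precision, recall and F1 are standard.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the class of interest, from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical results: the rare class (36 points) is the class of interest.
tp, fn = 6, 30    # rare-class points found / missed
fp, tn = 4, 786   # frequent-class points wrongly flagged / correctly rejected

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy = {accuracy:.3f}, F1 on rare class = {f1:.3f}")
```

Accuracy is dominated by the 786 easy rejections, while F1 reflects that most rare-class points were missed.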

  24. Toolbox: https://github.com/scikit-learn-contrib/imbalanced-learn
