Classifier Inspired Scaling for Training Set Selection



  1. Classifier Inspired Scaling for Training Set Selection Walter Bennette DISTRIBUTION A: Approved for public release; distribution unlimited. 16 May 2016. Case #88ABW-2016-2511

  2. Outline · Instance-based classification · Training set selection - ENN - DROP3 - CHC · Scaling approaches - Stratified - Classifier inspired · Experimental results 2/46

  3. Instance-based classification

  4–13. Instance-based classification (figure-only slides) 4–13/46

  14. Instance-based classification What are instance-based classifiers used for? · Classification of gene expression · Content-based image retrieval · Text categorization · Load forecasting assistant for a power company 14/46
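
To ground the idea, here is a minimal sketch of k-nearest-neighbor classification, the instance-based method used throughout this deck. NumPy, the Euclidean distance, and the majority-vote tie handling are my own choices, not taken from the slides.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a majority vote of its k nearest training instances."""
    # Euclidean distance from x to every stored training instance
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest instances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                # majority class among the neighbors
```

Because every prediction scans the whole training set, the cost of storing and searching the data is what motivates the scaling question raised on the next slides.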

  15. Instance-based classification What if there is a large amount of data? 15/46

  16. Instance-based classification What if there is a huge amount of data? 16/46

  17. Instance-based classification What if there is a serious amount of data? 17/46

  18. Training set selection (TSS)

  19. Training set selection (TSS) · Instead of maintaining all of the training data · Keep only the necessary data points 19/46

  20. Edited Nearest Neighbors (ENN) Formulation: · An instance is removed from the training data if it does not agree with the majority of its k nearest neighbors Effect: · Makes decision boundaries smoother · Doesn't remove much data 20/46

  21. Edited Nearest Neighbors (ENN) 21/46
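
A compact sketch of ENN as formulated on slide 20, assuming NumPy feature arrays; the leave-one-out handling and the 3-NN default are implementation details the slides do not spell out.

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbors: drop each instance that disagrees with the
    majority label of its k nearest neighbors (leave-one-out)."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                           # exclude the instance itself
        neighbors = np.argsort(dists)[:k]
        labels, counts = np.unique(y[neighbors], return_counts=True)
        if labels[np.argmax(counts)] != y[i]:       # majority vote disagrees
            keep[i] = False
    return X[keep], y[keep]
```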

  22. DROP3 Formulation:
      DROP3(Training set TR) returns selection set S:
          Let S = TR after applying ENN.
          For each instance Xi in S:
              Find the k+1 nearest neighbors of Xi in S.
              Add Xi to each of its neighbors' lists of associates.
          For each instance Xi in S:
              Let with = # of associates of Xi classified correctly with Xi as a neighbor.
              Let without = # of associates of Xi classified correctly without Xi.
              If without ≥ with:
                  Remove Xi from S.
                  For each associate a of Xi:
                      Remove Xi from a's list of neighbors.
                      Find a new nearest neighbor for a.
                      Add a to its new neighbor's list of associates.
              Endif
          Return S. 22/46

  23. DROP3 Formulation: · Iterative procedure that removes an instance when its associates are classified at least as accurately without it as with it Effect: · Removes much more data than ENN · Maintains acceptable accuracy 23/46

  24. DROP3 24/46
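
The heart of DROP3 is the with/without comparison in the pseudocode on slide 22. The sketch below isolates just that test, leaving out the full associate and neighbor-list bookkeeping; the index handling and the 3-NN vote are my assumptions.

```python
import numpy as np

def classified_correctly(X, y, idx, pool_idx, k=3):
    """Count how many instances in `idx` a k-NN vote over `pool_idx` gets right."""
    correct = 0
    for i in idx:
        pool = pool_idx[pool_idx != i]              # an instance never votes for itself
        dists = np.linalg.norm(X[pool] - X[i], axis=1)
        nbrs = pool[np.argsort(dists)[:k]]
        labels, counts = np.unique(y[nbrs], return_counts=True)
        correct += labels[np.argmax(counts)] == y[i]
    return correct

def should_remove(X, y, xi, associates, S, k=3):
    """DROP3 test: remove xi if its associates are classified at least as
    well without xi in the selection set S as with it."""
    with_xi = classified_correctly(X, y, associates, S, k)
    without_xi = classified_correctly(X, y, associates, S[S != xi], k)
    return without_xi >= with_xi
```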

  25. Genetic algorithm (CHC) Formulation: · A chromosome is a subset of the training data · A binary gene represents each instance · Fitness = α × Accuracy + (1 − α) × Reduction Effectiveness: · Removes a large amount of data · Achieves acceptable accuracy 25/46
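
A sketch of the CHC fitness evaluation from this slide. The choice of α = 0.5, the leave-one-out k-NN accuracy estimate, and the boolean-mask encoding of the chromosome are assumptions; the slide gives only the formula.

```python
import numpy as np

def fitness(chromosome, X, y, alpha=0.5, k=3):
    """Fitness = alpha * Accuracy + (1 - alpha) * Reduction.
    `chromosome` is a boolean mask over the training data: True keeps the instance."""
    selected = np.flatnonzero(chromosome)
    if selected.size == 0:
        return 0.0
    # Accuracy: classify every training instance with a k-NN vote over the
    # selected subset only (leave the query instance out of its own vote).
    correct = 0
    for i in range(len(X)):
        pool = selected[selected != i]
        if pool.size == 0:
            continue
        dists = np.linalg.norm(X[pool] - X[i], axis=1)
        nbrs = pool[np.argsort(dists)[:k]]
        labels, counts = np.unique(y[nbrs], return_counts=True)
        correct += labels[np.argmax(counts)] == y[i]
    accuracy = correct / len(X)
    reduction = 1.0 - selected.size / len(X)        # fraction of instances removed
    return alpha * accuracy + (1 - alpha) * reduction
```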

  26. Genetic algorithm (CHC) 26/46

  27. Scaling

  28. Scaling · As datasets grow, TSS becomes more and more expensive · May be prohibitive · Most scaling methods rely on a stratified approach 28/46

  29. No scaling 29/46

  30. Stratified scaling 30/46
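
The stratified baseline is shown only as a figure, so the sketch below follows the usual recipe from the instance-selection literature: split the data into class-stratified partitions, run the TSS method on each partition independently, and pool the selected instances. Treat it as context rather than the authors' exact procedure; the partition count and random seed are placeholders.

```python
import numpy as np

def stratified_tss(X, y, tss, n_strata=5, seed=0):
    """Run a training-set-selection method `tss` (e.g. enn_filter) on each
    class-stratified partition and pool the selected instances."""
    rng = np.random.default_rng(seed)
    strata = [[] for _ in range(n_strata)]
    for label in np.unique(y):                      # keep class proportions per stratum
        idx = rng.permutation(np.flatnonzero(y == label))
        for j, chunk in enumerate(np.array_split(idx, n_strata)):
            strata[j].extend(chunk)
    selected = []
    for part in strata:
        part = np.array(part)
        Xs, ys = tss(X[part], y[part])              # select within the stratum only
        selected.append((Xs, ys))
    X_sel = np.vstack([Xs for Xs, _ in selected])
    y_sel = np.concatenate([ys for _, ys in selected])
    return X_sel, y_sel
```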

  31. Representative Data Detection (ReDD) · Lin et al. 2015 · Developed for support vector machines; did not consider data reduction 31/46

  32. Our approach

  33. Classifier inspired approach · Based heavily on ReDD · Applied to kNN, and monitors data reduction 33/46

  34. The filter The "Balance" dataset · Determine the scale position - Balanced - Leaning right - Leaning left · Attributes - Left weight - Left distance - Right weight - Right distance 34/46

  35. The filter 35/46

  36. The filter 36/46

  37. The filter 37/46

  38. Experimentation Parameters: · Learn a Random Forest for the filter · Split data into 1/3 and 2/3 partitions Design: · Perform TSS with ENN, CHC, and DROP3 using 3-NN · Compare no scaling, stratified, and classifier inspired scaling · Calculate reduction, accuracy, and computation time with 10-fold CV 38/46
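
Putting the pieces on this slide together, here is one way the classifier-inspired (ReDD-style) pipeline could look: run the expensive TSS method on a 1/3 partition, train a Random Forest "filter" on the resulting keep/discard decisions, then apply the filter to the remaining 2/3 instead of running TSS on it. The keep/discard labeling step and the final composition of the selected set are my reading of the approach, not details given in the deck.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classifier_inspired_tss(X, y, tss, seed=0):
    """Classifier-inspired scaling: TSS on 1/3 of the data, a learned filter
    predicts keep/discard for the other 2/3."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    small, rest = idx[: len(X) // 3], idx[len(X) // 3 :]

    # 1) Expensive selection only on the small partition.
    X_small, y_small = X[small], y[small]
    X_kept, y_kept = tss(X_small, y_small)

    # 2) Label each small-partition instance as kept (1) or discarded (0).
    kept_mask = np.array([any(np.array_equal(x, xk) for xk in X_kept) for x in X_small])

    # 3) Train the filter on (features -> keep/discard) and apply it to the rest.
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_small, kept_mask.astype(int))
    predicted_keep = rf.predict(X[rest]).astype(bool)

    X_sel = np.vstack([X_kept, X[rest][predicted_keep]])
    y_sel = np.concatenate([y_kept, y[rest][predicted_keep]])
    return X_sel, y_sel
```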

  39. Datasets · 10 experimental datasets from KEEL 39/46

  40. Reduction 40/46

  41. Accuracy 41/46

  42. Time 42/46

  43. Results · Maintains accuracy (mostly) · Maintains data reduction · Slower than the stratified approach, but may improve for larger datasets 43/46

  44. Future work · Repeat for many more datasets · Apply to very large datasets · Investigate whether damage can be spotted a priori 44/46

  45. Conclusion Promising candidate for scaling Training Set Selection to large datasets 45/46

  46. Questions Walter Bennette walter.bennette.1@us.af.mil 315-330-4957 46/46
