Classifier Inspired Scaling for Training Set Selection
Walter Bennette
DISTRIBUTION A: Approved for public release; distribution unlimited. 16 May 2016. Case #88ABW-2016-2511
Outline
· Instance-based classification
· Training set selection
  - ENN
  - DROP3
  - CHC
· Scaling approaches
  - Stratified
  - Classifier inspired
· Experimental results
Instance-based classification
Instance-based classification
What are they used for?
· Classification of gene expression
· Content-based image retrieval
· Text categorization
· Load forecasting assistant for a power company
Instance-based classification
What if there is a large amount of data? A huge amount? A serious amount?
Training set selection (TSS)
Training set selection (TSS)
· Instead of maintaining all of the training data
· Keep only certain necessary data points
Edited Nearest Neighbors (ENN)
Formulation:
· An instance is removed from the training data if its class does not agree with the majority of its k nearest neighbors
Effect:
· Makes decision boundaries smoother
· Doesn't remove much data
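The ENN rule can be sketched in a few lines of Python. This is a minimal brute-force illustration, not code from the talk: the function names, the Euclidean distance, and the default k = 3 are assumptions made for the example.

```python
import math
from collections import Counter

def knn_labels(X, y, i, k):
    """Labels of the k nearest neighbors of X[i], excluding X[i] itself."""
    ranked = sorted((math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i)
    return [y[j] for _, j in ranked[:k]]

def enn(X, y, k=3):
    """Return the indices of the instances that ENN keeps."""
    keep = []
    for i in range(len(X)):
        votes = Counter(knn_labels(X, y, i, k))
        majority = votes.most_common(1)[0][0]
        # Keep the instance only if its own label matches the neighbors' majority.
        if majority == y[i]:
            keep.append(i)
    return keep
```

On a toy set with one mislabeled point sitting inside the other class, only that point is dropped, which matches the "doesn't remove much data" effect noted on the slide.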
DROP3
Formulation:
DROP3(Training set TR): Selection set S.
  Let S = TR after applying ENN.
  For each instance Xi in S:
    Find the k+1 nearest neighbors of Xi in S.
    Add Xi to each of its neighbors' lists of associates.
  For each instance Xi in S:
    Let with = # of associates of Xi classified correctly with Xi as a neighbor.
    Let without = # of associates of Xi classified correctly without Xi.
    If without ≥ with:
      Remove Xi from S.
      For each associate a of Xi:
        Remove Xi from a's list of neighbors.
        Find a new nearest neighbor for a.
        Add a to its new neighbor's list of associates.
    Endif
  Return S.
DROP3
Formulation:
· Iterative procedure that compares accuracy of neighbors with and without members
Effect:
· Removes much more data than ENN
· Maintains acceptable accuracy
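A simplified sketch of the with/without comparison, written in Python. It follows the steps listed on the slide but makes several assumptions of its own: neighbors and associates are recomputed by brute force on every step instead of being maintained incrementally, associates are limited to instances still in S, instances are visited in index order, and distance is Euclidean. The published DROP3 also orders instances before the removal pass, which the slide omits and this sketch skips. All function names are hypothetical.

```python
import math
from collections import Counter

def nearest(X, i, pool, k):
    """Indices of the k members of `pool` closest to X[i], excluding i itself."""
    ranked = sorted((math.dist(X[i], X[j]), j) for j in pool if j != i)
    return [j for _, j in ranked[:k]]

def correct(X, y, i, pool, k):
    """True if a k-NN vote over `pool` assigns X[i] its own label."""
    votes = Counter(y[j] for j in nearest(X, i, pool, k))
    return bool(votes) and votes.most_common(1)[0][0] == y[i]

def drop3(X, y, k=3):
    everything = set(range(len(X)))
    # Step 1: ENN noise filter.
    S = {i for i in everything if correct(X, y, i, everything, k)}
    # Step 2: removal pass over a snapshot of S (S shrinks as we go).
    for i in sorted(S):
        # Associates: instances that count i among their k+1 nearest neighbors.
        associates = [a for a in S if a != i and i in nearest(X, a, S, k + 1)]
        with_i = sum(correct(X, y, a, S, k) for a in associates)
        without_i = sum(correct(X, y, a, S - {i}, k) for a in associates)
        # Drop i if its associates are classified at least as well without it.
        if without_i >= with_i:
            S.remove(i)
    return sorted(S)
```

The brute-force recomputation is cubic in the number of instances, so this is only suitable for small illustrations; the associate lists in the pseudocode exist precisely to avoid that cost.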
Genetic algorithm (CHC)
Formulation:
· A chromosome is a subset of the training data
· A binary gene represents each instance
· Fitness = α * Accuracy + (1 − α) * Reduction
Effectiveness:
· Removes a large amount of data
· Achieves acceptable accuracy
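The fitness function above can be written out directly. In this Python sketch, `accuracy_fn` is a hypothetical callable that scores the selected subset (for example, kNN accuracy on the training data), accuracy and reduction are both assumed to be fractions in [0, 1], and the default α = 0.5 is only an illustrative value; none of these details are stated on the slide.

```python
def fitness(chromosome, accuracy_fn, alpha=0.5):
    """Score a chromosome: a list of 0/1 genes, one per training instance."""
    selected = [i for i, gene in enumerate(chromosome) if gene]
    # Reduction: fraction of the training data that was discarded.
    reduction = 1 - len(selected) / len(chromosome)
    # Accuracy: supplied by the caller for the selected subset.
    accuracy = accuracy_fn(selected)
    return alpha * accuracy + (1 - alpha) * reduction
```

Raising α favors accuracy over reduction; lowering it pushes the search toward smaller subsets.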
Scaling
Scaling
· As datasets grow, TSS becomes more and more expensive
· The cost may become prohibitive
· The vast majority of scaling approaches rely on stratification
No scaling (figure)
Stratified scaling (figure)
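A minimal Python sketch of stratified scaling, under the usual reading of the term: split the training data into disjoint strata that preserve the class proportions, run TSS on each stratum independently, and merge the selected subsets. The `select` argument is a hypothetical callable that takes a list of instance indices and returns the indices to keep; the number of strata and the seed are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_tss(y, select, n_strata=3, seed=0):
    """Run `select` on class-balanced strata and merge the kept indices."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    # Deal each class round-robin so every stratum keeps the class mix.
    strata = [[] for _ in range(n_strata)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            strata[position % n_strata].append(i)
    kept = []
    for stratum in strata:
        kept.extend(select(stratum))
    return sorted(kept)
```

Because each call to `select` sees only a fraction of the data, the per-stratum cost drops sharply for algorithms whose cost grows faster than linearly.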
Representative Data Detection (ReDD)
· Lin et al. 2015
· Used for support vector machines; did not consider data reduction
Our approach
Classifier inspired approach
· Based heavily on ReDD
· Applied to kNN, and monitors data reduction
The filter
The "Balance" dataset
· Determine scale positions
  - Balanced
  - Leaning right
  - Leaning left
· Attributes
  - Left weight
  - Left distance
  - Right weight
  - Right distance
Experimentation
Parameters:
· Learn a Random Forest for the filter
· Split data into 1/3 and 2/3
Design:
· Perform for ENN, CHC, and DROP3 with 3-NN
· Compare no scaling, stratified, and classifier inspired
· Calculate reduction, accuracy, and computation time with 10-fold CV
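The slides that illustrate the filter are figures and their content is not in this text, so the following Python sketch is a guess at the overall flow rather than the author's procedure. It assumes that TSS is run on the 1/3 split, each instance there is labeled kept or removed, a classifier (a Random Forest in the talk) is trained on those labels, and the trained filter then decides which instances of the remaining 2/3 to keep. `select` and `fit_filter` are hypothetical callables, and using only the attributes as filter inputs and merging both kept sets are further assumptions.

```python
import random

def classifier_inspired_tss(X, select, fit_filter, seed=0):
    """Sketch: learn a keep/remove filter on 1/3 of the data, apply it to the rest."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    cut = len(indices) // 3
    small, rest = indices[:cut], indices[cut:]
    # Run the (expensive) TSS algorithm on the small split only.
    kept_small = set(select(small))
    # Label each instance of the small split: True if TSS kept it.
    targets = [i in kept_small for i in small]
    # fit_filter returns a function mapping an instance to keep (True) or drop.
    predict_keep = fit_filter([X[i] for i in small], targets)
    kept_rest = [i for i in rest if predict_keep(X[i])]
    return sorted(kept_small | set(kept_rest))
```

In practice `fit_filter` would wrap a Random Forest implementation, and `select` would wrap ENN, DROP3, or CHC; the expensive selection step then touches only a third of the data.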
Datasets
· 10 experimental datasets from KEEL
Reduction (results figure)
Accuracy (results figure)
Time (results figure)
Results
· Maintains accuracy (mostly)
· Maintains data reduction
· Slower than stratified approach, but may improve for larger datasets
Future work
· Perform for many more datasets
· Apply to very large datasets
· Investigate whether damage can be spotted a priori
Conclusion
Promising candidate for scaling Training Set Selection to large datasets
Questions
Walter Bennette
walter.bennette.1@us.af.mil
315-330-4957