

  1. On Feature Selection, Bias-Variance, and Bagging. Art Munson (Department of Computer Science, Cornell University) and Rich Caruana (Microsoft Corporation). ECML-PKDD 2009. (22 slides.)

  2–4. Task: Model Presence/Absence of Birds. Tried: SVMs, boosted decision trees, bagged decision trees, neural networks, and more. Ultimate goal: understand avian population dynamics. Ran feature selection to find the smallest feature set with excellent performance.

  5. Bagging Likes Many Noisy Features (?) [Figure: RMS error vs. number of features (0–30) for European Starling; bagging over selected feature subsets compared against using all features.]

  6. Surprised Reviewers. Reviewer A: "[I] also found that the results reported in Figure 2 [were] strange, where the majority [of] results show that classifiers built from selected features are actually inferior to the ones trained from the whole feature [set]." Reviewer B: "It is very surprising that the performance of all methods improves (or stays constant) when the number of features is increased."

  7. Purpose of this Study: Does bagging often benefit from many features? If so, why?

  8. Outline: (1) Story Behind the Paper; (2) Background; (3) Experiment 1: FS and Bias-Variance; (4) Experiment 2: Weak, Noisy Features.

  9. Review of Bagging. Bagging is a simple ensemble learning algorithm [Bre96]: draw a random sample of the training data; train a model on the sample (e.g. a decision tree); repeat N times (e.g. 25 times). The bagged prediction is the average of the N models' predictions.
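The recipe on this slide can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the paper's implementation; the 1-nearest-neighbour base learner is a hypothetical stand-in for a decision tree.

```python
import random

def bag_predict(train, test_x, base_learner, n_models=25, seed=0):
    """Bagging: train base_learner on n_models bootstrap samples of the
    training data, then average the models' predictions."""
    rng = random.Random(seed)
    all_preds = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]   # draw with replacement
        model = base_learner(sample)
        all_preds.append([model(x) for x in test_x])
    # bagged prediction = mean of the n_models individual predictions
    return [sum(p) / n_models for p in zip(*all_preds)]

def one_nn(train):
    """Hypothetical base learner: 1-nearest-neighbour regression."""
    def predict(x):
        return min(train, key=lambda xy: abs(xy[0] - x))[1]
    return predict

train = [(x / 10, (x / 10) ** 2) for x in range(20)]  # noise-free y = x^2
preds = bag_predict(train, [0.5, 1.5], one_nn)
```

Averaging over bootstrap replicates is what smooths out the high variance of an individual unstable learner, which is the effect the later slides quantify.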

  10. Facts about Bagging. It has surprisingly competitive performance and rarely overfits [BK99]. Its main benefit is reducing the variance of the constituent models [BK99]. It improves the ability to ignore irrelevant features [AP96].

  11. Review of the Bias-Variance Decomposition. The error of a learning algorithm on example x comes from 3 sources: noise (intrinsic error/uncertainty in x's true label), bias (how far, on average, the algorithm is from the optimal prediction), and variance (how much the prediction changes when the training set changes). The error decomposes as error(x) = noise(x) + bias(x) + variance(x). On real problems, bias and noise cannot be measured separately.
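Written out for squared error, the decomposition on this slide reads as follows, where y* is the optimal prediction and y_m the algorithm's average prediction; on real data the noise and bias terms can only be measured together, which is why later slides report a combined "bias/noise":

```latex
\mathrm{error}(x) \;=\;
  \underbrace{\mathbb{E}\big[(t - y^{*})^{2}\big]}_{\text{noise}(x)}
\;+\; \underbrace{\big(y^{*} - y_{m}\big)^{2}}_{\text{bias}(x)}
\;+\; \underbrace{\mathbb{E}\big[(y_{m} - y)^{2}\big]}_{\text{variance}(x)}
```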

  12–13. Measuring Bias & Variance (Squared Error). Generate an empirical distribution of the algorithm's predictions [BK99]: randomly sample 1/2 of the training data; train a model on the sample and make predictions y for the test data; repeat R times (e.g. 20 times); compute the average prediction y_m for every test example. For each test example x with true label t:

      bias(x) = (t − y_m)^2
      variance(x) = (1/R) Σ_{i=1..R} (y_m − y_i)^2

  Average over test cases to get the expected bias & variance for the algorithm.
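The measurement procedure above can be sketched as follows (stdlib Python); the half-sampling and the R repetitions follow the slide, while the data format and learner interface are illustrative assumptions.

```python
import random

def bias_variance(train, test, learner, R=20, seed=0):
    """Estimate bias and variance of `learner` in the [BK99] style: train R
    models on random halves of the training data, then compare each test
    point's average prediction y_m to the true label and to each run."""
    rng = random.Random(seed)
    runs = []
    for _ in range(R):
        half = rng.sample(train, len(train) // 2)   # random 1/2 of training data
        model = learner(half)
        runs.append([model(x) for x, _ in test])
    bias = variance = 0.0
    for j, (x, t) in enumerate(test):
        ys = [runs[i][j] for i in range(R)]
        y_m = sum(ys) / R                           # average prediction y_m
        bias += (t - y_m) ** 2                      # measured bias (includes noise)
        variance += sum((y_m - y) ** 2 for y in ys) / R
    return bias / len(test), variance / len(test)
```

A quick sanity check on such an implementation: a learner that ignores its training sample must measure exactly zero variance, since all R runs predict identically.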

  14–15. Review of Feature Selection. Forward stepwise feature selection: start from an empty selected set; evaluate the benefit of adding each non-selected feature (training a model for each choice); select the most beneficial feature; repeat the search until a stopping criterion is met. Correlation-based feature filtering: rank features by their individual correlation with the class label; choose a cutoff point (by statistical test or cross-validation); keep the features above the cutoff and discard the rest.
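The forward stepwise loop can be sketched as below. Here `evaluate` is a hypothetical scoring callback (e.g. cross-validated performance of a model trained on the candidate subset, higher is better), and stopping at the first non-improving step is one common choice of stopping criterion.

```python
def forward_stepwise(features, evaluate):
    """Greedy forward selection: repeatedly add the feature whose
    inclusion most improves evaluate(subset)."""
    selected = []
    best_score = evaluate(selected)
    while True:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        # score one candidate model per non-selected feature
        score, best_f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break        # stopping criterion: no candidate improves the score
        selected.append(best_f)
        best_score = score
    return selected

# toy score: 'a' and 'b' are the informative features, with a small size penalty
chosen = forward_stepwise(
    ['a', 'b', 'c', 'd'],
    lambda s: len(set(s) & {'a', 'b'}) - 0.01 * len(s))
```

Note the cost: each round trains one model per remaining feature, which is why the correlation filter on the same slide is preferred for the larger datasets in Experiment 1.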

  16. Experiment 1: Bias-Variance of Feature Selection. Summary: 19 datasets; order features using forward stepwise feature selection or correlation feature filtering, depending on dataset size; estimate bias & variance at multiple feature set sizes; 5-fold cross-validation. [Figure: dataset sizes, plotting # features (1 to 1e+06) against # samples (100 to 100,000).]

  17. Case 1: No Improvement from Feature Selection. [Figure: MSE (split into variance and bias/noise) vs. # features (1–54) on covtype, for a single decision tree and a bagged decision tree.]

  18. Case 2: FS Improves the Non-Bagged Model. [Figure: MSE (variance and bias/noise) vs. # features (1–63) on medis; the non-bagged model overfits with too many features.]

  19. Take-Away Points. More features ⇒ lower bias/noise but higher variance. Feature selection does not improve bagged model performance (with 1 exception). The best subset size corresponds to the best bias/variance tradeoff point, and is algorithm dependent. Relevant features may be discarded if the variance increase outweighs the extra information they carry.

  20–22. Why Does Bagging Benefit from so Many Features? [Figure, shown over three builds: MSE (variance and bias/noise) vs. # features (1–1,341) on cryst, with arrows highlighting the best operating points of the curves.]

  23. Hypothesis: Bagging improves the base learner's ability to benefit from weak, noisy features.

  24. Experiment 2: Noisy Informative Features. Summary: generate synthetic data (6 features); duplicate 1/2 of the features 20 times; corrupt X% of the values in the duplicated features; train single and bagged trees on the corrupted features plus the 3 non-duplicated features; compare to the ideal, unblemished feature set and to no noisy features (the 3 non-duplicated features only).
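The data-generation steps can be sketched as follows. The base feature distribution, the toy label rule, and corruption by random replacement are assumptions for illustration, not the paper's exact generator; only the duplicate-and-corrupt structure follows the slide.

```python
import random

def make_dataset(n=200, n_dup=20, corrupt_frac=0.3, seed=0):
    """Sketch: 6 base features; duplicate half of them (3) n_dup times;
    corrupt corrupt_frac of the values in every duplicated copy."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        base = [rng.random() for _ in range(6)]
        label = 1 if sum(base) > 3.0 else 0          # toy target rule (assumed)
        noisy = []
        for f in base[:3]:                           # duplicate first 3 features
            for _ in range(n_dup):
                corrupted = rng.random() < corrupt_frac
                noisy.append(rng.random() if corrupted else f)
        # 3 clean non-duplicated features + 3 * n_dup noisy copies
        rows.append((base[3:] + noisy, label))
    return rows

data = make_dataset()
```

Sweeping `corrupt_frac` from 0.0 to 1.0 reproduces the x-axis of the following slides: each noisy copy is individually weak, but collectively the copies still carry the signal of the dropped originals.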

  25–27. Bagging Extracts More Info from Noisy Features. [Figure, shown over three builds: MSE (variance and bias/noise) vs. fraction of feature values corrupted (0.0–1.0), for single and bagged trees; reference lines mark the 6 original features (ideal) and the 3 non-duplicated features (baseline).]
