

  1. On Feature Selection, Bias-Variance, and Bagging. Art Munson (Department of Computer Science, Cornell University) and Rich Caruana (Microsoft Corporation). ECML-PKDD 2009. (22 slides.)

  2–4. Task: Model Presence/Absence of Birds. Tried: SVMs, boosted decision trees, bagged decision trees, neural networks, and more. Ultimate goal: understand avian population dynamics. Ran feature selection to find the smallest feature set with excellent performance.

  5. Bagging Likes Many Noisy Features (?) [Figure: RMS error vs. number of features (0–30) for European Starling; bagging over selected feature subsets compared against using all features.]

  6. Surprised Reviewers. Reviewer A: "[I] also found that the results reported in Figure 2 [were] strange, where the majority [of] results show that classifiers built from selected features are actually inferior to the ones trained from the whole feature [set]." Reviewer B: "It is very surprising that the performance of all methods improves (or stays constant) when the number of features is increased."

  7. Purpose of this Study: Does bagging often benefit from many features? If so, why?

  8. Outline: (1) Story Behind the Paper; (2) Background; (3) Experiment 1: FS and Bias-Variance; (4) Experiment 2: Weak, Noisy Features.

  9. Review of Bagging. Bagging is a simple ensemble learning algorithm [Bre96]: draw a random sample of the training data; train a model on the sample (e.g. a decision tree); repeat N times (e.g. 25 times). The bagged prediction is the average of the N models' predictions.
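The recipe on this slide can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the paper's implementation; the 1-nearest-neighbour base learner is a hypothetical stand-in for a decision tree.

```python
import random

def bag_predict(train, test_x, base_learner, n_models=25, seed=0):
    """Bagging: train base_learner on n_models bootstrap samples of the
    training data, then average the models' predictions."""
    rng = random.Random(seed)
    all_preds = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]   # draw with replacement
        model = base_learner(sample)
        all_preds.append([model(x) for x in test_x])
    # bagged prediction = mean of the n_models individual predictions
    return [sum(p) / n_models for p in zip(*all_preds)]

def one_nn(train):
    """Hypothetical base learner: 1-nearest-neighbour regression."""
    def predict(x):
        return min(train, key=lambda xy: abs(xy[0] - x))[1]
    return predict

train = [(x / 10, (x / 10) ** 2) for x in range(20)]  # noise-free y = x^2
preds = bag_predict(train, [0.5, 1.5], one_nn)
```

Averaging over bootstrap replicates is what smooths out the high variance of an individual unstable learner, which is the effect the later slides quantify.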

  10. Facts about Bagging. It has surprisingly competitive performance and rarely overfits [BK99]. Its main benefit is reducing the variance of the constituent models [BK99]. It improves the ability to ignore irrelevant features [AP96].

  11. Review of the Bias-Variance Decomposition. The error of a learning algorithm on example x comes from 3 sources: noise (intrinsic error/uncertainty in x's true label), bias (how far, on average, the algorithm is from the optimal prediction), and variance (how much the prediction changes when the training set changes). The error decomposes as error(x) = noise(x) + bias(x) + variance(x). On real problems, bias and noise cannot be measured separately.
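Written out for squared error, the decomposition on this slide reads as follows, where y* is the optimal prediction and y_m the algorithm's average prediction; on real data the noise and bias terms can only be measured together, which is why later slides report a combined "bias/noise":

```latex
\mathrm{error}(x) \;=\;
  \underbrace{\mathbb{E}\big[(t - y^{*})^{2}\big]}_{\text{noise}(x)}
\;+\; \underbrace{\big(y^{*} - y_{m}\big)^{2}}_{\text{bias}(x)}
\;+\; \underbrace{\mathbb{E}\big[(y_{m} - y)^{2}\big]}_{\text{variance}(x)}
```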

  12–13. Measuring Bias & Variance (Squared Error). Generate an empirical distribution of the algorithm's predictions [BK99]: randomly sample 1/2 of the training data; train a model on the sample and make predictions y for the test data; repeat R times (e.g. 20 times); compute the average prediction y_m for every test example. For each test example x with true label t:

      bias(x) = (t − y_m)^2
      variance(x) = (1/R) Σ_{i=1..R} (y_m − y_i)^2

  Average over test cases to get the expected bias & variance for the algorithm.
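The measurement procedure above can be sketched as follows (stdlib Python); the half-sampling and the R repetitions follow the slide, while the data format and learner interface are illustrative assumptions.

```python
import random

def bias_variance(train, test, learner, R=20, seed=0):
    """Estimate bias and variance of `learner` in the [BK99] style: train R
    models on random halves of the training data, then compare each test
    point's average prediction y_m to the true label and to each run."""
    rng = random.Random(seed)
    runs = []
    for _ in range(R):
        half = rng.sample(train, len(train) // 2)   # random 1/2 of training data
        model = learner(half)
        runs.append([model(x) for x, _ in test])
    bias = variance = 0.0
    for j, (x, t) in enumerate(test):
        ys = [runs[i][j] for i in range(R)]
        y_m = sum(ys) / R                           # average prediction y_m
        bias += (t - y_m) ** 2                      # measured bias (includes noise)
        variance += sum((y_m - y) ** 2 for y in ys) / R
    return bias / len(test), variance / len(test)
```

A quick sanity check on such an implementation: a learner that ignores its training sample must measure exactly zero variance, since all R runs predict identically.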

  14–15. Review of Feature Selection. Forward stepwise feature selection: start from an empty selected set; evaluate the benefit of adding each non-selected feature (training a model for each choice); select the most beneficial feature; repeat the search until a stopping criterion is met. Correlation-based feature filtering: rank features by their individual correlation with the class label; choose a cutoff point (by statistical test or cross-validation); keep the features above the cutoff and discard the rest.
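The forward stepwise loop can be sketched as below. Here `evaluate` is a hypothetical scoring callback (e.g. cross-validated performance of a model trained on the candidate subset, higher is better), and stopping at the first non-improving step is one common choice of stopping criterion.

```python
def forward_stepwise(features, evaluate):
    """Greedy forward selection: repeatedly add the feature whose
    inclusion most improves evaluate(subset)."""
    selected = []
    best_score = evaluate(selected)
    while True:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        # score one candidate model per non-selected feature
        score, best_f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break        # stopping criterion: no candidate improves the score
        selected.append(best_f)
        best_score = score
    return selected

# toy score: 'a' and 'b' are the informative features, with a small size penalty
chosen = forward_stepwise(
    ['a', 'b', 'c', 'd'],
    lambda s: len(set(s) & {'a', 'b'}) - 0.01 * len(s))
```

Note the cost: each round trains one model per remaining feature, which is why the correlation filter on the same slide is preferred for the larger datasets in Experiment 1.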

  16. Experiment 1: Bias-Variance of Feature Selection. Summary: 19 datasets; order features using forward stepwise feature selection or correlation feature filtering, depending on dataset size; estimate bias & variance at multiple feature set sizes; 5-fold cross-validation. [Figure: dataset sizes, plotting # features (1 to 1e+06) against # samples (100 to 100,000).]

  17. Case 1: No Improvement from Feature Selection. [Figure: MSE (split into variance and bias/noise) vs. # features (1–54) on covtype, for a single decision tree and a bagged decision tree.]

  18. Case 2: FS Improves the Non-Bagged Model. [Figure: MSE (variance and bias/noise) vs. # features (1–63) on medis; the non-bagged model overfits with too many features.]

  19. Take-Away Points. More features ⇒ lower bias/noise but higher variance. Feature selection does not improve bagged model performance (with 1 exception). The best subset size corresponds to the best bias/variance tradeoff point, and is algorithm dependent. Relevant features may be discarded if the variance increase outweighs the extra information they carry.

  20–22. Why Does Bagging Benefit from so Many Features? [Figure, shown over three builds: MSE (variance and bias/noise) vs. # features (1–1,341) on cryst, with arrows highlighting the best operating points of the curves.]

  23. Hypothesis: Bagging improves the base learner's ability to benefit from weak, noisy features.

  24. Experiment 2: Noisy Informative Features. Summary: generate synthetic data (6 features); duplicate 1/2 of the features 20 times; corrupt X% of the values in the duplicated features; train single and bagged trees on the corrupted features plus the 3 non-duplicated features; compare to the ideal, unblemished feature set and to no noisy features (the 3 non-duplicated features only).
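The data-generation steps can be sketched as follows. The base feature distribution, the toy label rule, and corruption by random replacement are assumptions for illustration, not the paper's exact generator; only the duplicate-and-corrupt structure follows the slide.

```python
import random

def make_dataset(n=200, n_dup=20, corrupt_frac=0.3, seed=0):
    """Sketch: 6 base features; duplicate half of them (3) n_dup times;
    corrupt corrupt_frac of the values in every duplicated copy."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        base = [rng.random() for _ in range(6)]
        label = 1 if sum(base) > 3.0 else 0          # toy target rule (assumed)
        noisy = []
        for f in base[:3]:                           # duplicate first 3 features
            for _ in range(n_dup):
                corrupted = rng.random() < corrupt_frac
                noisy.append(rng.random() if corrupted else f)
        # 3 clean non-duplicated features + 3 * n_dup noisy copies
        rows.append((base[3:] + noisy, label))
    return rows

data = make_dataset()
```

Sweeping `corrupt_frac` from 0.0 to 1.0 reproduces the x-axis of the following slides: each noisy copy is individually weak, but collectively the copies still carry the signal of the dropped originals.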

  25–27. Bagging Extracts More Info from Noisy Features. [Figure, shown over three builds: MSE (variance and bias/noise) vs. fraction of feature values corrupted (0.0–1.0), for single and bagged trees; reference lines mark the 6 original features (ideal) and the 3 non-duplicated features (baseline).]
