Feature and model selection
Subhransu Maji
CMPSCI 689: Machine Learning
10 February 2015 / 12 February 2015
Administrivia

Homework:
➡ Homework 3 is out
➡ Homework 2 has been graded; ask your TA any questions related to grading
➡ TA office hours (currently Thursday 2:30-3:30) may move to later in the week
Start thinking about projects
Most learning methods are invariant to feature permutation.

[Figure: MNIST digits. Can you recognize the digits after we permute pixels (bag of pixels), or permute patches (bag of patches)?]
Irrelevant and redundant features

Irrelevant features carry no information about the class: E[f | C] = E[f]
➡ irrelevant features are not that unusual, e.g., in spam classification
Redundant features carry information that other features already provide
How do irrelevant features affect decision tree classifiers?

Consider adding 1 binary noisy feature for a binary classification task:
➡ assume N/2 instances with label +1 and N/2 instances with label −1
➡ the probability that the noisy feature is perfectly correlated with the labels in the dataset is 2 × 0.5^N
➡ this is tiny for large N, but the chance of a misleading split grows with many noisy features, or if we allow partial correlation
For large datasets, the decision tree learner can learn to ignore noisy features that are not correlated with the labels.
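A quick simulation makes the 2 × 0.5^N figure concrete. This sketch is not from the lecture; the setup (balanced ±1 labels, a uniformly random binary feature) is assumed for illustration.

```python
import numpy as np

# Estimate how often a random binary feature is perfectly (anti-)correlated
# with balanced binary labels; the theoretical value is 2 * 0.5^N.
rng = np.random.default_rng(0)
N = 12                                        # small N so hits are observable
labels = np.array([+1] * (N // 2) + [-1] * (N // 2))

trials, hits = 200_000, 0
for _ in range(trials):
    f = rng.integers(0, 2, size=N)            # random binary feature
    pred = np.where(f == 1, +1, -1)
    if np.all(pred == labels) or np.all(pred == -labels):
        hits += 1

print(f"empirical: {hits / trials:.5f}   theory 2 * 0.5^N: {2 * 0.5 ** N:.5f}")
```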
How do irrelevant features affect kNN classifiers?

kNN classifiers (with Euclidean distance) treat all the features equally.
➡ Noisy dimensions can dominate the distance computation.
➡ Randomly distributed points in high dimensions are all (roughly) equally far apart:

a_i ← N(0, 1), b_i ← N(0, 1)  ⟹  E[‖a − b‖] → √(2D)
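The concentration claim is easy to check numerically. A minimal sketch (not from the lecture) drawing random pairs of Gaussian points and comparing the mean pairwise distance to √(2D):

```python
import numpy as np

# For a_i ~ N(0,1), b_i ~ N(0,1), E[||a - b||] approaches sqrt(2D),
# and the relative spread of the distances shrinks as D grows.
rng = np.random.default_rng(0)
for D in (10, 100, 1000, 10000):
    a = rng.standard_normal((500, D))
    b = rng.standard_normal((500, D))
    dists = np.linalg.norm(a - b, axis=1)          # 500 random pairs
    print(f"D={D:5d}  mean={dists.mean():7.2f}  sqrt(2D)={np.sqrt(2 * D):7.2f}"
          f"  std/mean={dists.std() / dists.mean():.3f}")
```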
How do irrelevant features affect perceptron classifiers?

Perceptrons can learn a low weight on irrelevant features.
➡ Irrelevant features can affect the convergence rate.
➡ But like decision trees, if the dataset is large enough, the perceptron will eventually learn to ignore the noisy dimensions.
Effect of noise on classifiers:
➡ "3" vs "8" classification using pixel features (28×28 images = 784 features)
➡ vary the number of noisy dimensions appended to each example: x ← [x z], z_i ← N(0, 1), with the number of noisy dimensions ranging over 2⁰, …, 2¹²
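A sketch of this experiment, with synthetic linearly separable data standing in for the "3" vs "8" MNIST task (the x ← [x z] setup follows the slide; everything else is assumed):

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
N, D = 2000, 784                              # mimic 28x28 = 784 pixel features
w_true = rng.standard_normal(D)
X = rng.standard_normal((N, D))
y = np.sign(X @ w_true)                       # synthetic +/-1 labels

for k in [2**i for i in range(0, 13, 3)]:     # 1, 8, 64, 512, 4096 noisy dims
    Z = rng.standard_normal((N, k))           # z_i ~ N(0, 1)
    Xz = np.hstack([X, Z])                    # x <- [x z]
    Xtr, Xte, ytr, yte = train_test_split(Xz, y, random_state=0)
    clf = Perceptron(max_iter=50).fit(Xtr, ytr)
    print(f"noisy dims={k:5d}  test accuracy={clf.score(Xte, yte):.3f}")
```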
Feature selection: selecting a small subset of useful features. Reasons:
➡ fewer features are cheaper for learning methods to process
➡ can improve generalization (for example by increasing the margin)
Methods agnostic to the learning algorithm (filter methods):
➡ Correlation: score each feature by how strongly it correlates with the labels [Figure: scatter plot of a feature vs. the label]
➡ Mutual information: how much does knowing the feature reduce uncertainty about the label? (How do decision trees use this idea?)

entropy: H(X) = −Σ_x p(x) log p(x)

Wrapper methods use the learner itself in the loop (next slide).
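A small sketch of learner-agnostic scoring on synthetic data; the lecture only names the two criteria, so the dataset and the scikit-learn helpers here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)

# Pearson correlation of each feature with the labels
corr = np.array([np.corrcoef(X[:, d], y)[0, 1] for d in range(X.shape[1])])

# Mutual information between each feature and the labels
mi = mutual_info_classif(X, y, random_state=0)

print("top features by |correlation|:", np.argsort(-np.abs(corr))[:4])
print("top features by mutual info: ", np.argsort(-mi)[:4])
```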
Given: a learner L, a dictionary of features D to select from.

Forward selection:
➡ Start with an empty set of selected features F
➡ For every f in D: train L using the features F ∪ f and measure the validation error
➡ Pick the best feature f*
➡ F = F ∪ f*, D = D \ f*; repeat until enough features have been selected
Backward selection is similar, starting from all of D and greedily removing features.
Greedy, but can be near optimal under certain conditions.
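A minimal forward-selection sketch following the loop above; the learner, dataset, and cross-validated scoring are assumptions (any learner with a fit/score interface would do):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=15, n_informative=3,
                           random_state=0)
learner = LogisticRegression(max_iter=1000)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                                # select 3 features
    # train the learner with each candidate feature added to F
    scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)            # pick the best feature f*
    selected.append(best)                         # F = F U f*
    remaining.remove(best)                        # D = D \ f*
    print(f"picked feature {best}, cv accuracy {scores[best]:.3f}")
```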
What if the number of potential features is very large?
➡ One option is to select features at random.
➡ If done during decision tree learning, this will give you a random tree.
➡ A common fix is to train many random trees and average them (random forest).

[Viola and Jones, IJCV 01]
Even if a feature is useful, some normalization may be good.

Per-feature normalization:
➡ centering:  x_{n,d} ← x_{n,d} − μ_d,  where μ_d = (1/N) Σ_n x_{n,d}
➡ variance scaling:  x_{n,d} ← x_{n,d} / σ_d,  where σ_d = √((1/N) Σ_n (x_{n,d} − μ_d)²)
➡ absolute scaling:  x_{n,d} ← x_{n,d} / r_d,  where r_d = max_n |x_{n,d}|
➡ square-root:  x_{n,d} ← √x_{n,d}  (corrects for burstiness)

Per-example normalization: scale each example so that ‖x‖ = 1.

Caltech-101 image classification: 41.6% accuracy with linear features vs. 63.8% with square-root.
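The transformations above are one-liners in numpy. A sketch on toy data (non-negative values assumed so the square-root applies; rows of X are examples):

```python
import numpy as np

X = np.abs(np.random.default_rng(0).standard_normal((5, 3)))  # toy data, x >= 0

mu = X.mean(axis=0)                      # mu_d
sigma = X.std(axis=0)                    # sigma_d
r = np.abs(X).max(axis=0)                # r_d = max_n |x_{n,d}|

X_centered = X - mu                      # x <- x - mu_d
X_scaled = X_centered / sigma            # x <- x / sigma_d
X_ranged = X / r                         # x <- x / r_d
X_sqrt = np.sqrt(X)                      # x <- sqrt(x), needs x >= 0

X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # per-example ||x|| = 1
```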
Choice of features is really important for most learners.
Noisy features:
➡ on small datasets these are likely to correlate well with labels by chance
➡ given enough data, some learners can ignore them (e.g., perceptron and decision trees)
Feature selection:
➡ Learning-agnostic methods: correlation, mutual information
➡ Wrapper methods (use a learner in the loop): forward and backward selection
Feature normalization: per-feature and per-example.
Lots of choices when using machine learning techniques:
➡ k for the kNN classifier
➡ maximum depth of the decision tree
➡ number of iterations for the averaged perceptron training
Set aside a fraction (10%-20%) of the training data; this becomes our held-out data, used to pick hyperparameters.
Problems: we train on less data, and the estimate depends on which examples happen to be held out.

[Figure: training / held-out split]
K-fold cross-validation: split the training data into K folds; hold out each fold in turn, training on the other K − 1 folds and evaluating on the held-out fold; average the K estimates.

[Figure: training / held-out folds, 1 … K]
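A short sketch of K-fold cross-validation for one of the hyperparameters listed earlier (maximum depth of a decision tree); the dataset and scikit-learn helpers are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

# K = 5 folds; pick the maximum depth with the best mean held-out accuracy
for depth in (1, 2, 4, 8, None):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"max_depth={str(depth):>4s}  accuracy={scores.mean():.3f} "
          f"+/- {scores.std():.3f}")
```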
Leave-one-out: K-fold cross-validation with K = N (the number of training examples).

[Figure: training / held-out split, folds 1 … N]
Efficiently picking the k for a kNN classifier: sort each training point's neighbors by distance once; the leave-one-out error for every value of k can then be read off the same sorted lists.

source: CIML book (Hal Daumé III)
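A hedged sketch of this trick (toy data; CIML presents the idea algorithmically, the code below is one possible realization):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + 0.5 * rng.standard_normal(200) > 0).astype(int)

N = len(X)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
np.fill_diagonal(D, np.inf)                 # a point cannot be its own neighbor
order = np.argsort(D, axis=1)               # sort neighbors once per point

for k in (1, 3, 5, 9, 15):
    votes = y[order[:, :k]].sum(axis=1)      # positive votes among k nearest
    pred = (votes * 2 > k).astype(int)       # majority vote (ties go to 0)
    print(f"k={k:2d}  leave-one-out accuracy={(pred == y).mean():.3f}")
```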
Accuracy is not always a good metric.
Precision and recall:
➡ true positives: selected elements that are relevant
➡ false positives: selected elements that are irrelevant
➡ true negatives: non-selected elements that are irrelevant
➡ false negatives: non-selected elements that are relevant

source: wikipedia
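In these terms (standard definitions, not spelled out on the slide): precision = tp / (tp + fp), the fraction of selected elements that are relevant, and recall = tp / (tp + fn), the fraction of relevant elements that are selected.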
Classifier A achieves 7.0% error; classifier B achieves 6.9% error. Is the difference meaningful?
➡ 1000 examples: not so much (could be random luck)
➡ 1M examples: probably
We can phrase this as hypothesis testing:
➡ "Classifier A is better than classifier B" (hypothesis)
➡ "Classifier A is no better than classifier B" (null hypothesis)
The experiment provided the Lady with 8 randomly ordered cups of tea – 4 prepared by first adding milk, 4 prepared by first adding the tea. She was to select the 4 cups prepared by one method, and was fully informed of the experimental method.
The "null hypothesis" was that the Lady had no such ability (i.e., she was randomly guessing).
The Lady correctly categorized all the cups! There are (8 choose 4) = 70 possible selections, so the probability that the Lady got this by chance = 1/70 (≈1.4%).

Ronald Fisher; Fisher's exact test
http://en.wikipedia.org/wiki/Lady_tasting_tea
Suppose you have two algorithms, A with per-example errors a = a_1, a_2, …, a_N and B with per-example errors b = b_1, b_2, …, b_N, evaluated on the same N examples. Compute the paired t statistic and report the significance level of the difference:

â_n = a_n − μ_a,  b̂_n = b_n − μ_b,  t = (μ_a − μ_b) √( N(N − 1) / Σ_n (â_n − b̂_n)² )

N has to be large (>100).
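The formula above translates directly to code. A sketch with simulated 0/1 error indicators (the error rates match the earlier 7.0% vs 6.9% example; scipy's paired t-test is used as a cross-check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 200
a = rng.binomial(1, 0.070, size=N).astype(float)   # 0/1 errors of classifier A
b = rng.binomial(1, 0.069, size=N).astype(float)   # 0/1 errors of classifier B

mu_a, mu_b = a.mean(), b.mean()
a_hat, b_hat = a - mu_a, b - mu_b                  # centered errors
t = (mu_a - mu_b) * np.sqrt(N * (N - 1) / np.sum((a_hat - b_hat) ** 2))

t_scipy, p = stats.ttest_rel(a, b)                 # same statistic from scipy
print(f"t={t:.3f} (scipy {t_scipy:.3f}), two-sided p={p:.3f}")
```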
Paired t-tests cannot be applied to metrics that measure accuracy on the entire set (e.g., f-score, average precision, etc.). Fortunately, we can use cross-validation to get several estimates, e.g., for classifier A:
➡ average f-score 93.8, standard deviation 1.595
Treating these estimates as roughly Gaussian:
➡ ~70% of the prob. mass lies in [μ − σ, μ + σ]
➡ ~95% of the prob. mass lies in [μ − 2σ, μ + 2σ]
➡ ~99.5% of the prob. mass lies in [μ − 3σ, μ + 3σ]
So if classifier B's average f-score was 90.6% (more than two standard deviations below A's mean), we could be 95% certain that the better performance of A is not due to chance.
Sometimes we cannot re-train the classifier; all we have is a single test dataset of size N.
Bootstrapping: a method to generate new datasets from a single one
➡ sample N instances at random with replacement
➡ without replacement the copies will be identical to the original
Closely related to jackknife resampling, which removes each instance one by one.

http://en.wikipedia.org/wiki/Jackknife_resampling
http://en.wikipedia.org/wiki/Bootstrapping_statistics
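A small bootstrap sketch: estimate the spread of test accuracy from a single test set by resampling it with replacement (the per-example correctness indicators are simulated here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
correct = rng.binomial(1, 0.93, size=N)     # 1 if example n was classified right

boot_accs = []
for _ in range(10_000):
    idx = rng.integers(0, N, size=N)        # sample N instances with replacement
    boot_accs.append(correct[idx].mean())
boot_accs = np.array(boot_accs)

lo, hi = np.percentile(boot_accs, [2.5, 97.5])
print(f"accuracy={correct.mean():.3f}, 95% bootstrap interval=[{lo:.3f}, {hi:.3f}]")
```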
Credits: slides are adapted from the CIML book by Hal Daumé III, slides by Piyush Rai at Duke University, and Wikipedia. Digit images are from the MNIST dataset by Yann LeCun.