Data Warehousing and Machine Learning
Feature Selection
Thomas D. Nielsen
Aalborg University, Department of Computer Science
Spring 2008
Feature Selection: When features don't help

Data generated by a process described by the Bayesian network Class -> A1, Class -> A3, A1 -> A2, with conditional probability tables:

P(Class):        ⊕: 0.5                  ⊖: 0.5
P(A1 | Class):   ⊕: (0: 0.4, 1: 0.6)     ⊖: (0: 0.5, 1: 0.5)
P(A3 | Class):   ⊕: (0: 0.5, 1: 0.5)     ⊖: (0: 0.7, 1: 0.3)
P(A2 | A1):      A1=0: (0: 1.0, 1: 0.0)  A1=1: (0: 0.0, 1: 1.0)

Attribute A2 is just a duplicate of A1. Conditional class probability, for example:
P(⊕ | A1 = 1, A2 = 1, A3 = 0) = 0.461
Feature Selection

The Naive Bayes model learned from the data:

P(Class):        ⊕: 0.5                  ⊖: 0.5
P(A1 | Class):   ⊕: (0: 0.4, 1: 0.6)     ⊖: (0: 0.5, 1: 0.5)
P(A2 | Class):   ⊕: (0: 0.4, 1: 0.6)     ⊖: (0: 0.5, 1: 0.5)
P(A3 | Class):   ⊕: (0: 0.5, 1: 0.5)     ⊖: (0: 0.7, 1: 0.3)

In the Naive Bayes model: P(⊕ | A1 = 1, A2 = 1, A3 = 0) = 0.507
Intuitively: the NB model double counts the information provided by A1 and A2.
Feature Selection

The Naive Bayes model with selected features A1 and A3:

P(Class):        ⊕: 0.5                  ⊖: 0.5
P(A1 | Class):   ⊕: (0: 0.4, 1: 0.6)     ⊖: (0: 0.5, 1: 0.5)
P(A3 | Class):   ⊕: (0: 0.5, 1: 0.5)     ⊖: (0: 0.7, 1: 0.3)

In this Naive Bayes model: P(⊕ | A1 = 1, A3 = 0) = 0.461 (and all other posterior class probabilities are also the same as for the true model).
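These numbers can be reproduced directly from the tables above. The following is a minimal sketch (my own Python, not part of the slides); the same factorization covers the true model (where observing A2 = 1 adds nothing once A1 = 1 is observed), the full Naive Bayes model, and the reduced model over A1 and A3 only:

```python
# CPTs from the slides: P(A=1 | Class) for each feature; A2 duplicates A1.
p_class = {'+': 0.5, '-': 0.5}
p_a1_1 = {'+': 0.6, '-': 0.5}   # P(A1=1 | Class)
p_a3_1 = {'+': 0.5, '-': 0.3}   # P(A3=1 | Class)

def posterior_plus(a1, a3, count_a2_again):
    """P(Class=+ | evidence). With count_a2_again=True the duplicated A2
    likelihood is multiplied in a second time, as Naive Bayes does."""
    joint = {}
    for c in p_class:
        pa1 = p_a1_1[c] if a1 == 1 else 1 - p_a1_1[c]
        pa3 = p_a3_1[c] if a3 == 1 else 1 - p_a3_1[c]
        pa2 = pa1 if count_a2_again else 1.0
        joint[c] = p_class[c] * pa1 * pa2 * pa3
    return joint['+'] / (joint['+'] + joint['-'])

print(posterior_plus(1, 0, count_a2_again=False))  # ~0.4615: true model / NB on {A1, A3}
print(posterior_plus(1, 0, count_a2_again=True))   # ~0.5070: full NB double counts A1, A2
```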
Feature Selection: Decision Tree

Decision trees learned from the same data (two equivalent trees; the original figure is not reproduced here). One tree splits on A3 at the root and then on A1 or A2 in the subtrees; the other splits on A1 at the root and then on A3. The leaf class distributions ⊕/⊖ are:

A1 = 1, A3 = 1: 0.66 / 0.33
A1 = 0, A3 = 1: 0.57 / 0.43
A1 = 1, A3 = 0: 0.46 / 0.54
A1 = 0, A3 = 0: 0.36 / 0.64

A decision tree does not test two equivalent variables twice on one branch (but it might pick one or the other on different branches).
Feature Selection: Problems

• Correlated features can skew prediction.
• Irrelevant features (not correlated to the class variable) cause an unnecessary blow-up of the model space (search space).
• Irrelevant features can drown the information provided by informative features in noise (e.g. a distance function dominated by the random values of many uninformative features).
• Irrelevant features in a model reduce its explanatory value (even when predictive accuracy is not reduced).

Methods of feature selection

• Define relevance of features, and filter out irrelevant features before learning (relevance independent of the model used).
• Filter features based on model-specific criteria (e.g. eliminate highly correlated features for Naive Bayes); a small sketch of such a filter is given below.
• Wrapper approach: evaluate feature subsets by model performance.
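As an illustration of the second, model-specific option, here is a rough sketch of a redundancy filter (my own code, not from the slides; numpy is assumed and the threshold is an arbitrary choice) that drops one feature out of every near-duplicate pair before training Naive Bayes:

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Return the column indices to keep; for each highly correlated pair, keep the first."""
    keep = []
    for j in range(X.shape[1]):
        corr_with_kept = [abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) for k in keep]
        if all(c < threshold for c in corr_with_kept):
            keep.append(j)
    return keep

# Example: a column that is an exact copy of A1 is filtered out.
rng = np.random.default_rng(0)
A1 = rng.integers(0, 2, size=100)
A3 = rng.integers(0, 2, size=100)
X = np.column_stack([A1, A1, A3])   # columns: A1, A2 (= A1), A3
print(drop_correlated(X))           # [0, 2] -> keep A1 and A3
```

Applied to the running example, the duplicate A2 would be removed and Naive Bayes would be trained on A1 and A3 only.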
Feature Selection: Relevance

A possible definition: a feature A_i is irrelevant if for all a ∈ states(A_i) and c ∈ states(C):

P(C = c | A_i = a) = P(C = c).

Limitations of relevance-based filtering:

• Even if A_i is irrelevant, it may become relevant in the presence of another feature, i.e. for some A_j and a′ ∈ states(A_j):
  P(C = c | A_i = a, A_j = a′) ≠ P(C = c | A_j = a′).
• For Naive Bayes: irrelevant features neither help nor hurt.
• Irrelevance does not capture redundancy.

Generally: it is difficult to say in a data- and method-independent way which features are useful.
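The definition can be checked empirically against a data sample; the following is a minimal sketch (my own, with a tolerance parameter I introduce to absorb estimation noise):

```python
from collections import Counter

def looks_irrelevant(feature_values, class_values, tolerance=0.01):
    """True if the empirical P(C=c | Ai=a) stays within `tolerance` of P(C=c)
    for every observed value a and every class c."""
    n = len(class_values)
    p_class = {c: cnt / n for c, cnt in Counter(class_values).items()}
    for a in set(feature_values):
        matching = [c for f, c in zip(feature_values, class_values) if f == a]
        p_cond = {c: cnt / len(matching) for c, cnt in Counter(matching).items()}
        if any(abs(p_cond.get(c, 0.0) - p) > tolerance for c, p in p_class.items()):
            return False
    return True
```

Note that this per-feature test inherits exactly the first limitation above: a feature that is only informative in combination with another (an XOR-like interaction) passes the test and would be filtered out.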
Feature Selection: The Wrapper Approach [Kohavi, John 97]

• Search over possible feature subsets.
• Candidate feature subsets v are evaluated as follows:
  • Construct a model from the features in v using the given learning method (= induction algorithm).
  • Evaluate the performance of the model using cross-validation.
  • Assign the score f(v): the average predictive accuracy in cross-validation.
• The best feature subset found is used to learn the final model.
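A minimal sketch of the subset score f(v) (my own illustration; scikit-learn's API is assumed, and the choice of Naive Bayes as the induction algorithm is arbitrary):

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def wrapper_score(X, y, v, model=None, folds=10):
    """f(v): mean cross-validated accuracy of `model` using only the features in v."""
    if not v:                        # empty subset: return a trivial baseline score
        return 0.0
    model = model if model is not None else GaussianNB()
    return cross_val_score(model, X[:, list(v)], y, cv=folds).mean()
```

The search strategies on the following slides only need this scoring function; the induction algorithm itself is treated as a black box.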
Feature Selection Search

The feature subset lattice for 4 attributes (figure not reproduced): each subset is written as a bit vector, e.g. (1, 0, 1, 0) represents the feature subset {A1, A3}. With n attributes the lattice contains 2^n subsets, so the search space is too big for exhaustive search.
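A small illustration of the bit-vector encoding (my own sketch):

```python
from itertools import product

n = 4
subsets = list(product([0, 1], repeat=n))      # all 2**n = 16 bit vectors
v = (1, 0, 1, 0)                               # represents {A1, A3}
selected = {f"A{i+1}" for i, bit in enumerate(v) if bit}
print(len(subsets), selected)                  # 16 {'A1', 'A3'}

# With, say, n = 50 attributes there are 2**50 (about 1e15) subsets,
# which is why exhaustive search over the lattice is infeasible.
```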
Feature Selection: Greedy Search

Greedy (hill-climbing) search on the subset lattice, illustrated step by step on the slides (figures not reproduced; only the scores are):

• The initial subset has score f(v) = 0.5.
• Its expansions score 0.6, 0.7, 0.6 and 0.5; the best one (0.7) is selected.
• Its expansions score 0.65, 0.63 and 0.72; the best one (0.72) is selected.
• Its expansions score only 0.69 and 0.7, so no expansion improves the current score.

Search terminates when no score improvement is obtained by expansion.
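A compact sketch of this search (my own code; `score` is any subset-scoring function, for instance the wrapper score sketched earlier):

```python
def greedy_forward_selection(all_features, score):
    """Forward hill climbing over feature subsets; `score` maps a subset to f(v)."""
    current = frozenset()
    current_score = score(current)
    while True:
        expansions = [current | {f} for f in all_features - current]
        if not expansions:
            break                                   # all features already selected
        scored = [(score(v), v) for v in expansions]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= current_score:             # no expansion improves f(v): stop
            break
        current, current_score = best, best_score
    return current, current_score

# e.g.: greedy_forward_selection(frozenset(range(X.shape[1])),
#                                lambda v: wrapper_score(X, y, v))
```

On the example above this produces the path 0.5 -> 0.7 -> 0.72 and then stops, since neither remaining expansion improves on 0.72.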
Feature Selection: Best First Search

Best first search on the same lattice, illustrated step by step on the slides (figures not reproduced). Unlike greedy search, every generated but not yet expanded subset is kept on an open list, expanded subsets are moved to a closed list, and the best subset found so far is remembered separately. In the example:

• The initial subset (score f(v) = 0.5) is expanded; its successors, scoring 0.6, 0.7, 0.6 and 0.5, are placed on the open list.
• The best open subset (0.7) is expanded, yielding successors scoring 0.65, 0.63 and 0.72.
• The best open subset (0.72) is expanded, yielding only 0.69 and 0.7.
• Search does not stop here: expanding further subsets from the open list eventually yields subsets scoring 0.75 and 0.82, better than anything the greedy search found.
Feature Selection: Best First Search

Search continues until k consecutive expansions have not produced any score improvement for the feature subset. Best first search can also be used as an anytime algorithm: the search continues indefinitely, and the current best subset is always available.
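A sketch of this strategy (my own code, using the same `score` function as before; the open list is kept as a max-heap, the counter entry only breaks ties, and the stale counter is one way to implement the "k non-improving expansions" stopping rule):

```python
import heapq

def best_first_selection(all_features, score, k=5):
    start = frozenset()
    best, best_score = start, score(start)
    counter = 0                                   # tie-breaker for the heap
    open_list = [(-best_score, counter, start)]   # max-heap via negated scores
    closed = set()
    stale = 0                                     # expansions since the last improvement
    while open_list and stale < k:
        _, _, v = heapq.heappop(open_list)        # best subset on the open list
        if v in closed:
            continue
        closed.add(v)
        stale += 1
        for f in all_features - v:                # generate all single-feature expansions
            child = v | {f}
            if child in closed:
                continue
            s = score(child)
            counter += 1
            heapq.heappush(open_list, (-s, counter, child))
            if s > best_score:                    # new overall best: reset the counter
                best, best_score = child, s
                stale = 0
    return best, best_score
```

The current `best` is exactly what the anytime variant would report if the loop were simply left running.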
Experimental Results: Accuracy

Results from [Kohavi, John 97] (table not reproduced here). The table gives accuracy in 10-fold cross-validation, comparing each algorithm using all features with the same algorithm using selected features (-FSS), here with greedy search.
Experimental Results: Number of Features

Results from [Kohavi, John 97] (table not reproduced here). Average number of features selected in 10-fold cross-validation.
Experimental Results: Accuracy

Results from [Kohavi, John 97] (table not reproduced here). The table gives accuracy in 10-fold cross-validation, comparing feature subset selection using hill climbing with feature subset selection using best first search.
Experimental Results: Number of Features

Results from [Kohavi, John 97] (table not reproduced here). Average number of features selected in 10-fold cross-validation.
Feature Generation: Building new features

• Discretization of continuous attributes
• Value grouping: e.g. reduce date of sale to month of sale
• Synthesize new features: e.g. from continuous A1, A2 compute A_new := A1 / A2
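A short sketch of the three kinds of feature construction listed above (my own illustration; pandas is assumed and all column names and bin edges are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "income":    [18_000, 34_000, 52_000, 90_000],
    "sale_date": pd.to_datetime(["2008-01-15", "2008-03-02", "2008-03-20", "2008-07-09"]),
    "debt":      [9_000, 10_000, 13_000, 30_000],
})

# Discretization of a continuous attribute
df["income_band"] = pd.cut(df["income"], bins=[0, 30_000, 60_000, float("inf")],
                           labels=["low", "medium", "high"])

# Value grouping: reduce date of sale to month of sale
df["sale_month"] = df["sale_date"].dt.month

# Synthesized feature A_new := A1 / A2
df["debt_to_income"] = df["debt"] / df["income"]
print(df)
```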