

  1. Data Warehousing and Machine Learning: Feature Selection
     Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008

  2. Feature Selection: When features don't help
     Data generated by a process described by the following Bayesian network (Class → A1, Class → A3, A1 → A2), with these conditional probability tables:
       P(Class):       ⊕ 0.5   ⊖ 0.5
       P(A1 | Class):  ⊕: (0: 0.4, 1: 0.6)    ⊖: (0: 0.5, 1: 0.5)
       P(A3 | Class):  ⊕: (0: 0.5, 1: 0.5)    ⊖: (0: 0.7, 1: 0.3)
       P(A2 | A1):     A1=0: (0: 1.0, 1: 0.0)    A1=1: (0: 0.0, 1: 1.0)
     Attribute A2 is just a duplicate of A1. Conditional class probability, for example:
       P(⊕ | A1 = 1, A2 = 1, A3 = 0) = 0.461

  3. Feature Selection
     The Naive Bayes model learned from the data: Class is the parent of A1, A2 and A3, with these conditional probability tables:
       P(Class):       ⊕ 0.5   ⊖ 0.5
       P(A1 | Class):  ⊕: (0: 0.4, 1: 0.6)    ⊖: (0: 0.5, 1: 0.5)
       P(A2 | Class):  ⊕: (0: 0.4, 1: 0.6)    ⊖: (0: 0.5, 1: 0.5)
       P(A3 | Class):  ⊕: (0: 0.5, 1: 0.5)    ⊖: (0: 0.7, 1: 0.3)
     In the Naive Bayes model: P(⊕ | A1 = 1, A2 = 1, A3 = 0) = 0.507
     Intuitively: the NB model double counts the information provided by A1 and A2.
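The double counting can be checked numerically. Below is a minimal sketch (not part of the original slides) that evaluates the posterior P(⊕ | A1 = 1, A2 = 1, A3 = 0) both under the true Bayesian network of slide 2 and under the learned Naive Bayes model; the CPT values are taken directly from the slides, with '+' and '-' standing for ⊕ and ⊖.

```python
# CPTs from the slides ('+' = ⊕, '-' = ⊖)
p_class = {'+': 0.5, '-': 0.5}
p_a1 = {'+': {0: 0.4, 1: 0.6}, '-': {0: 0.5, 1: 0.5}}        # P(A1 | Class)
p_a3 = {'+': {0: 0.5, 1: 0.5}, '-': {0: 0.7, 1: 0.3}}        # P(A3 | Class)
p_a2_given_a1 = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}   # P(A2 | A1), true model
p_a2_nb = p_a1                                               # learned NB treats A2 like A1

def posterior_true(a1, a2, a3):
    # True model: A2 depends only on A1, so its factor is the same for both
    # classes and cancels in the normalization.
    joint = {c: p_class[c] * p_a1[c][a1] * p_a2_given_a1[a1][a2] * p_a3[c][a3]
             for c in p_class}
    z = sum(joint.values())
    return {c: v / z for c, v in joint.items()}

def posterior_nb(a1, a2, a3):
    # Naive Bayes: A2 is (incorrectly) treated as independent of A1 given the
    # class, so the evidence carried by A1 is counted twice.
    joint = {c: p_class[c] * p_a1[c][a1] * p_a2_nb[c][a2] * p_a3[c][a3]
             for c in p_class}
    z = sum(joint.values())
    return {c: v / z for c, v in joint.items()}

print(posterior_true(1, 1, 0)['+'])  # ≈ 0.46, the true-model posterior of slide 2
print(posterior_nb(1, 1, 0)['+'])    # ≈ 0.51, the double-counted NB posterior
```

Because the A2 factor cancels in the true model, dropping A2 altogether (slide 4) reproduces exactly the same posterior class probabilities.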

  4. Feature Selection
     The Naive Bayes model with selected features A1 and A3:
       P(Class):       ⊕ 0.5   ⊖ 0.5
       P(A1 | Class):  ⊕: (0: 0.4, 1: 0.6)    ⊖: (0: 0.5, 1: 0.5)
       P(A3 | Class):  ⊕: (0: 0.5, 1: 0.5)    ⊖: (0: 0.7, 1: 0.3)
     In this Naive Bayes model: P(⊕ | A1 = 1, A3 = 0) = 0.461
     (and all other posterior class probabilities are also the same as for the true model).

  5. Feature Selection: Decision Tree
     Two decision trees learned from the same data (leaf values give P(⊕)/P(⊖)):
       Tree rooted at A3: the A3 = 1 branch tests A2, the A3 = 0 branch tests A1;
         leaves: 0.66/0.33, 0.57/0.43, 0.46/0.54, 0.36/0.64
       Tree rooted at A1: both branches test A3;
         leaves: 0.66/0.33, 0.46/0.54, 0.57/0.43, 0.36/0.64
     A decision tree does not test two equivalent variables twice on one branch (but might pick one or the other on different branches).

  6. Feature Selection: Problems
     • Correlated features can skew prediction
     • Irrelevant features (not correlated with the class variable) cause an unnecessary blowup of the model space (search space)
     • Irrelevant features can drown the information provided by informative features in noise (e.g. a distance function dominated by the random values of many uninformative features)
     • Irrelevant features in a model reduce its explanatory value (even when predictive accuracy is not reduced)

  7. Feature Selection: Methods of feature selection
     • Define relevance of features, and filter out irrelevant features before learning (relevance independent of the model used)
     • Filter features based on model-specific criteria (e.g. eliminate highly correlated features for Naive Bayes)
     • Wrapper approach: evaluate feature subsets by model performance

  8. Feature Selection: Relevance
     A possible definition: a feature A_i is irrelevant if for all a ∈ states(A_i) and c ∈ states(C):
       P(C = c | A_i = a) = P(C = c)

  9. Feature Selection: Relevance (continued)
     Limitations of relevance-based filtering:
     • Even if A_i is irrelevant on its own, it may become relevant in the presence of another feature, i.e. for some A_j and a' ∈ states(A_j):
         P(C = c | A_i = a, A_j = a') ≠ P(C = c | A_j = a')
     • For Naive Bayes: irrelevant features neither help nor hurt
     • Irrelevance does not capture redundancy
     Generally: it is difficult to say, in a data- and method-independent way, which features are useful.
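As an illustration of the relevance definition, here is a minimal sketch (an assumption, not from the slides) of a filter that estimates P(C = c | A_i = a) and P(C = c) from data and flags a feature as irrelevant when the two never differ by more than a tolerance. In practice a statistical test or mutual information would replace the fixed tolerance.

```python
from collections import Counter

def is_irrelevant(feature_values, class_values, tol=0.01):
    """Empirical version of the definition: A_i is irrelevant if
    P(C = c | A_i = a) == P(C = c) for every value a and class c."""
    n = len(class_values)
    p_c = {c: k / n for c, k in Counter(class_values).items()}
    for a in set(feature_values):
        idx = [i for i, v in enumerate(feature_values) if v == a]
        cond = Counter(class_values[i] for i in idx)
        for c, p in p_c.items():
            if abs(cond.get(c, 0) / len(idx) - p) > tol:
                return False
    return True

# Hypothetical usage: keep only features that are not marginally irrelevant
# selected = [i for i, col in enumerate(columns) if not is_irrelevant(col, labels)]
```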

  10. Feature Selection: The Wrapper Approach [Kohavi, John 97]
      • Search over possible feature subsets
      • Candidate feature subsets v are evaluated:
        • Construct a model using the features in v with the given learning method (= induction algorithm)
        • Evaluate the performance of the model using cross-validation
        • Assign score f(v): average predictive accuracy in cross-validation
      • The best feature subset found is used to learn the final model
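A minimal sketch of the scoring step f(v), assuming scikit-learn is used as the induction algorithm (the slides do not prescribe a particular library); the model class and fold count are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def wrapper_score(X, y, subset, estimator=None, folds=10):
    """f(v): average cross-validated accuracy of a model learned from the
    selected feature columns only."""
    if not subset:
        # Empty subset: accuracy of always predicting the majority class
        # (assumes integer-encoded class labels).
        return np.bincount(y).max() / len(y)
    estimator = estimator if estimator is not None else GaussianNB()
    return cross_val_score(estimator, X[:, sorted(subset)], y, cv=folds).mean()
```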

  11. Feature Selection: Feature Selection Search
      The feature subset lattice for 4 attributes (figure not reproduced): each subset is written as a bit vector, e.g. ⟨1, 0, 1, 0⟩ represents the feature subset {A1, A3}.
      With n features there are 2^n subsets, so the search space is too big for exhaustive search!
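For illustration only (not from the slides), the lattice can be enumerated with bit vectors in exactly the ⟨1, 0, 1, 0⟩ notation above; the count makes the infeasibility of exhaustive search concrete.

```python
from itertools import product

n = 4
subsets = list(product([0, 1], repeat=n))   # all bit vectors over n features
print(len(subsets))                         # 16 = 2**4
print(subsets[10])                          # (1, 0, 1, 0), i.e. {A1, A3}
print(2 ** 100)                             # already ~1.3e30 subsets for n = 100
```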

  12.-16. Feature Selection: Greedy Search (example, shown as animation frames in the original slides)
      • Start from the empty feature subset with score f(v) = 0.5.
      • Expanding it gives candidate subsets with scores 0.6, 0.7, 0.6 and 0.5; the best child (0.7) is selected.
      • Expanding that subset gives scores 0.65, 0.63 and 0.72; the best child (0.72) is selected.
      • Expanding the current subset gives scores 0.69 and 0.7, i.e. no improvement over 0.72.
      • Search terminates when no score improvement is obtained by expansion; the result is the subset with f(v) = 0.72.
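A minimal sketch of this greedy (forward) search over the subset lattice, assuming a scoring function such as wrapper_score above; all names are illustrative and not taken from the slides.

```python
def greedy_search(X, y, score):
    """Forward greedy (hill-climbing) search: repeatedly move to the best-
    scoring child of the current subset; stop when no child improves f(v)."""
    n = X.shape[1]
    current = frozenset()
    current_score = score(X, y, current)
    while True:
        children = [current | {i} for i in range(n) if i not in current]
        if not children:
            return current, current_score
        scores = {child: score(X, y, child) for child in children}
        best = max(scores, key=scores.get)
        if scores[best] <= current_score:   # no improvement by expansion: stop
            return current, current_score
        current, current_score = best, scores[best]

# e.g. best_subset, best_f = greedy_search(X, y, wrapper_score)
```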

  17.-25. Feature Selection: Best First Search (example, shown as animation frames in the original slides)
      Best first search keeps an open list of generated but not yet expanded subsets, a closed list of expanded subsets, and the best subset found so far; in each step the highest-scoring open node is expanded.
      • Start from the empty subset with f(v) = 0.5.
      • Its children score 0.6, 0.7, 0.6 and 0.5; the best open node (0.7) is expanded to children scoring 0.65, 0.63 and 0.72.
      • Expanding the node with score 0.72 gives 0.69 and 0.7, i.e. no local improvement.
      • Unlike greedy search, best first search can continue from other nodes on the open list: later expansions produce subsets scoring 0.75 and eventually 0.82, which becomes the new best.

  26. Feature Selection: Best First Search (continued)
      Search continues until k consecutive expansions have not generated any score improvement for the best feature subset. Best first search can also be used as an anytime algorithm: the search continues indefinitely, and the current 'best' subset is always available.
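A minimal sketch of best first search with the stopping rule described above, again assuming a scoring function such as wrapper_score; the names and the default k are illustrative.

```python
def best_first_search(X, y, score, k=5):
    """Best first search over the subset lattice: always expand the highest-
    scoring subset on the open list; stop after k consecutive expansions
    without improvement of the best score (use a large k for anytime use)."""
    n = X.shape[1]
    start = frozenset()
    open_list = {start: score(X, y, start)}   # generated, not yet expanded
    closed = set()                            # already expanded
    best, best_score = start, open_list[start]
    stale = 0                                 # expansions without improvement
    while open_list and stale < k:
        node = max(open_list, key=open_list.get)
        del open_list[node]
        closed.add(node)
        improved = False
        for i in range(n):
            if i in node:
                continue
            child = node | {i}
            if child in closed or child in open_list:
                continue
            child_score = score(X, y, child)
            open_list[child] = child_score
            if child_score > best_score:
                best, best_score, improved = child, child_score, True
        stale = 0 if improved else stale + 1
    return best, best_score
```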

  27. Feature Selection: Experimental Results (Accuracy)
      Results from [Kohavi, John 97] (table not reproduced). The table gives accuracy in 10-fold cross-validation, comparing the algorithm using all features with the algorithm using selected features (-FSS), here with greedy search.

  28. Feature Selection: Experimental Results (Number of Features)
      Results from [Kohavi, John 97] (table not reproduced). Average number of features selected in 10-fold cross-validation.

  29. Feature Selection: Experimental Results (Accuracy)
      Results from [Kohavi, John 97] (table not reproduced). The table gives accuracy in 10-fold cross-validation, comparing feature subset selection using hill climbing (greedy search) and best first search.

  30. Feature Selection: Experimental Results (Number of Features)
      Results from [Kohavi, John 97] (table not reproduced). Average number of features selected in 10-fold cross-validation.

  31. Feature Generation: Building new features
      • Discretization of continuous attributes
      • Value grouping: e.g. reduce date of sale to month of sale
      • Synthesize new features: e.g. from continuous A1, A2 compute A_new := A1 / A2
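A minimal sketch of the three operations, assuming pandas and a hypothetical DataFrame with columns 'age', 'sale_date', 'a1' and 'a2' (none of these names come from the slides).

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 45, 31, 67],
    "sale_date": pd.to_datetime(["2008-01-15", "2008-03-02",
                                 "2008-03-20", "2008-07-01"]),
    "a1": [10.0, 4.0, 8.0, 6.0],
    "a2": [2.0, 4.0, 2.0, 3.0],
})

# Discretization of a continuous attribute into bins
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "old"])

# Value grouping: reduce date of sale to month of sale
df["sale_month"] = df["sale_date"].dt.month

# Synthesizing a new feature from two continuous attributes: A_new := A1 / A2
df["a_new"] = df["a1"] / df["a2"]
```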
