
Reducing Dimensionality. Steven J Zeil, Old Dominion Univ., Fall 2010.



  1. Reducing Dimensionality
     Steven J Zeil, Old Dominion Univ., Fall 2010

  2. Outline
     1. Feature Selection
     2. Feature Extraction
        - Principal Components Analysis (PCA)
        - Factor Analysis (FA)
        - Multidimensional Scaling (MDS)
        - Linear Discriminant Analysis (LDA)

  3. Motivation
     - Reduction in the complexity of prediction and training
     - Reduction in the cost of data extraction
     - Simpler models, and therefore reduced variance
     - Easier to visualize and analyze results, identify outliers, etc.

  4. Basic Approaches
     Given an input population characterized by d attributes:
     - Feature Selection: find the k < d dimensions that give the most information and discard the other d − k (subset selection).
     - Feature Extraction: find k ≤ d dimensions that are linear combinations of the original d.
        - Principal Components Analysis (unsupervised)
        - Related: Factor Analysis and Multidimensional Scaling
        - Linear Discriminant Analysis (supervised)
     - The text also mentions nonlinear methods, Isometric Feature Mapping and Locally Linear Embedding, but does not give enough information to really justify them.

  5. Subset Selection
     Assume we have a suitable error function and can evaluate it for a variety of models (e.g., by cross-validation):
     - misclassification error for classification problems
     - mean-squared error for regression
     We can't evaluate all 2^d subsets of d features, so we search greedily:
     - Forward selection: start with an empty feature set; repeatedly add the feature that reduces the error the most; stop when the decrease is insignificant (sketched below).
     - Backward selection: start with all features; repeatedly remove the feature that decreases the error the most (or increases it the least); stop when any further removal increases the error significantly.
     Both directions evaluate O(d^2) feature subsets. Both are hill climbing, so they are not guaranteed to find the global optimum.
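A minimal Python sketch of the forward-selection loop just described. The `cv_error` callable is hypothetical; it stands in for whatever cross-validated error estimate is in use (misclassification rate or mean-squared error). The `min_decrease` threshold is likewise an assumption, since the slide only says to stop when the decrease is insignificant.

```python
def forward_selection(d, cv_error, min_decrease=1e-3):
    """Greedy forward feature selection.

    d            -- total number of candidate features
    cv_error     -- hypothetical callable: cv_error(feature_subset) -> estimated error
    min_decrease -- stop when the best addition improves the error by less than this
    """
    selected = []                      # start with an empty feature set
    remaining = set(range(d))
    best_err = float("inf")

    while remaining:
        # Try adding each remaining feature in turn and keep the best candidate.
        trial_errs = {f: cv_error(selected + [f]) for f in remaining}
        f_best = min(trial_errs, key=trial_errs.get)

        if best_err - trial_errs[f_best] < min_decrease:
            break                      # insignificant improvement: stop
        selected.append(f_best)
        remaining.remove(f_best)
        best_err = trial_errs[f_best]

    return selected, best_err
```

Backward selection is the mirror image: start with all d features and repeatedly drop the feature whose removal increases the error the least. Either way, the loop evaluates O(d^2) feature subsets.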

  6. Notes
     - A variant, floating search, adds multiple features at once and then backtracks to see which features can be removed.
     - Selection is less useful in very high-dimension problems where individual features are of limited use but clusters of features are significant.

  7. Outline
     1. Feature Selection
     2. Feature Extraction
        - Principal Components Analysis (PCA)
        - Factor Analysis (FA)
        - Multidimensional Scaling (MDS)
        - Linear Discriminant Analysis (LDA)

  8. Principal Components Analysis (PCA)
     - Find a mapping $z = A x$ onto a lower-dimensional space.
     - Unsupervised method: seeks to maximize the variance of the projected points.
     - Intuitively: try to spread the points apart as far as possible.

  9. 1st Principal Component
     Assume $x \sim N(\mu, \Sigma)$. Then $w^T x \sim N(w^T \mu,\, w^T \Sigma w)$.
     Find $z_1 = w_1^T x$, with $\|w_1\| = 1$, that maximizes $\mathrm{Var}(z_1) = w_1^T \Sigma w_1$.
     Using a Lagrange multiplier, maximize $w_1^T \Sigma w_1 - \alpha (w_1^T w_1 - 1)$ with $\alpha \ge 0$.
     Solution: $\Sigma w_1 = \alpha w_1$. This is an eigenvalue problem on $\Sigma$; we want the solution (eigenvector) corresponding to the largest eigenvalue $\alpha$.

  10. 2nd Principal Component
      Next find $z_2 = w_2^T x$, with $\|w_2\| = 1$ and $w_2^T w_1 = 0$, that maximizes $\mathrm{Var}(z_2) = w_2^T \Sigma w_2$.
      Solution: $\Sigma w_2 = \alpha_2 w_2$. Choose the solution (eigenvector) corresponding to the 2nd largest eigenvalue $\alpha_2$.
      Because $\Sigma$ is symmetric, its eigenvectors are mutually orthogonal.

  11. Visualizing PCA
      The projection is $z = W^T (x - m)$, where the columns of $W$ are the leading eigenvectors of the sample covariance matrix and $m$ is the sample mean.
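A minimal numpy sketch of the computation behind these slides: form the sample covariance matrix, take its leading eigenvectors (the eigenvalue problem $\Sigma w = \alpha w$), and project the centered data as $z = W^T(x - m)$. The function name `pca_project` and the choice of k are illustrative, not from the slides.

```python
import numpy as np

def pca_project(X, k):
    """Project an n x d data matrix X onto its first k principal components.

    Returns (Z, W, m): the n x k projected data z = W^T (x - m) for each row x,
    the d x k matrix W of leading eigenvectors, and the sample mean m.
    """
    m = X.mean(axis=0)                      # sample mean
    S = np.cov(X, rowvar=False)             # sample covariance matrix (d x d)

    # eigh is for symmetric matrices; eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]       # sort descending by eigenvalue
    W = eigvecs[:, order[:k]]               # leading k eigenvectors as columns

    Z = (X - m) @ W                         # z = W^T (x - m), applied row-wise
    return Z, W, m

# Example: reduce 5-dimensional data to 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z, W, m = pca_project(X, k=2)
print(Z.shape)          # (200, 2)
print(W.T @ W)          # approximately the 2x2 identity: eigenvectors are orthonormal
```

The final print illustrates the orthogonality remark on the 2nd-principal-component slide: because the covariance matrix is symmetric, the eigenvectors returned are orthonormal.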

  12. Is Spreading the Space Enough?
      Although we can argue that spreading the points apart leads to a better-conditioned problem, what does this have to do with reducing dimensionality?
