Feature Selection and Feature Extraction: Reducing Dimensionality
Steven J. Zeil
Old Dominion University
Fall 2010
Outline

1. Feature Selection
2. Feature Extraction
   - Principal Components Analysis (PCA)
   - Factor Analysis (FA)
   - Multidimensional Scaling (MDS)
   - Linear Discriminants Analysis (LDA)
Motivation

- Reduction in the complexity of prediction and training
- Reduction in the cost of data extraction
- Simpler models, hence reduced variance
- Easier to visualize and analyze results, identify outliers, etc.
Basic Approaches

Given an input population characterized by d attributes:

- Feature Selection: find the k < d dimensions that give the most information and discard the other d − k.
  - Subset selection
- Feature Extraction: find k ≤ d dimensions that are linear combinations of the original d.
  - Principal Components Analysis (unsupervised)
  - Related: Factor Analysis and Multidimensional Scaling
  - Linear Discriminants Analysis (supervised)

The text also mentions nonlinear methods (Isometric Feature Mapping and Locally Linear Embedding), but gives too little information to justify covering them here.
Subset Selection

Assume we have a suitable error function and can evaluate it for a variety of models (e.g., by cross-validation):

- misclassification error for classification problems
- mean-squared error for regression

We cannot evaluate all 2^d subsets of d features, so we search greedily:

- Forward selection: start with an empty feature set. Repeatedly add the feature that reduces the error the most. Stop when the decrease is insignificant. (A sketch of this procedure follows below.)
- Backward selection: start with all features. Repeatedly remove the feature that decreases the error the most (or increases it the least). Stop when any further removal increases the error significantly.

Both directions require O(d²) model evaluations. Both are hill climbing, so neither is guaranteed to find the global optimum.
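As a concrete illustration of the greedy search, here is a minimal sketch of forward selection that uses cross-validated accuracy as the (negated) error estimate. The slides do not prescribe an implementation; the data arrays X and y, the k-nearest-neighbor classifier, and the stopping threshold min_gain are illustrative choices only.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, min_gain=0.01, cv=5):
    """Greedy forward feature selection on a NumPy array X (n x d) and labels y."""
    d = X.shape[1]
    selected = []            # indices of the features chosen so far
    best_score = 0.0         # best cross-validated accuracy so far
    while len(selected) < d:
        candidate_scores = {}
        for j in range(d):
            if j in selected:
                continue
            cols = selected + [j]                       # try adding feature j
            model = KNeighborsClassifier(n_neighbors=3) # placeholder classifier
            candidate_scores[j] = cross_val_score(model, X[:, cols], y, cv=cv).mean()
        j_best = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[j_best] - best_score < min_gain:
            break            # improvement is insignificant: stop
        selected.append(j_best)
        best_score = candidate_scores[j_best]
    return selected, best_score

# Example use: selected, acc = forward_selection(X_train, y_train)
```

Each pass over the remaining features costs at most d model evaluations, and at most d passes are made, giving the O(d²) cost noted above.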
Notes

- A variant, floating search, adds multiple features at once and then backtracks to see which features can be removed.
- Selection is less useful in very high-dimensional problems where individual features are of limited use but clusters of features are significant.
Feature Extraction

- Principal Components Analysis (PCA)
- Factor Analysis (FA)
- Multidimensional Scaling (MDS)
- Linear Discriminants Analysis (LDA)
Principal Components Analysis (PCA)

- Find a mapping z = A x onto a lower-dimensional space.
- Unsupervised method: seeks to maximize the variance of the projected data.
- Intuitively: try to spread the points apart as far as possible.
1st Principal Component

Assume x ∼ N(µ, Σ). Then wᵀx ∼ N(wᵀµ, wᵀΣw).

Find z₁ = w₁ᵀx, with ‖w₁‖ = 1, that maximizes Var(z₁) = w₁ᵀΣw₁.

That is, maximize w₁ᵀΣw₁ − α(w₁ᵀw₁ − 1) over w₁, with α ≥ 0.

Solution: Σw₁ = αw₁.

This is an eigenvalue problem on Σ. We want the solution (eigenvector) corresponding to the largest eigenvalue α.
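The step from the constrained maximization to the eigenvalue condition is the usual Lagrange-multiplier argument; spelled out (this derivation is implied by, not shown on, the slide):

```latex
\frac{\partial}{\partial w_1}\Bigl[\, w_1^T \Sigma w_1 - \alpha\,(w_1^T w_1 - 1) \Bigr]
  = 2\,\Sigma w_1 - 2\,\alpha\, w_1 = 0
  \quad\Longrightarrow\quad
  \Sigma w_1 = \alpha\, w_1 .
```

Substituting back, Var(z₁) = w₁ᵀΣw₁ = α w₁ᵀw₁ = α, so choosing the largest eigenvalue maximizes the variance.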
2nd Principal Component

Next find z₂ = w₂ᵀx, with ‖w₂‖ = 1 and w₂ᵀw₁ = 0, that maximizes Var(z₂) = w₂ᵀΣw₂.

Solution: Σw₂ = α₂w₂. Choose the solution (eigenvector) corresponding to the 2nd-largest eigenvalue α₂.

Because Σ is symmetric, its eigenvectors are mutually orthogonal.
Visualizing PCA

z = Wᵀ(x − m)
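A minimal NumPy sketch of this projection, using the sample mean for m and the sample covariance in place of the true Σ; the function name and arguments are illustrative, not part of the slides:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (n x d) onto the k leading principal components."""
    m = X.mean(axis=0)                        # sample mean: the 'm' in z = W^T (x - m)
    Xc = X - m                                # center the data
    Sigma = np.cov(Xc, rowvar=False)          # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh, since Sigma is symmetric
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues, largest first
    W = eigvecs[:, order[:k]]                 # d x k matrix of leading eigenvectors
    Z = Xc @ W                                # n x k projected data: z = W^T (x - m)
    return Z, W, eigvals[order]
```

Using eigh exploits the symmetry of Σ, which is also why the resulting columns of W are mutually orthogonal, as noted on the previous slide.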
Is Spreading the Space Enough?

Although we can argue that spreading the points leads to a better-conditioned problem: what does this have to do with reducing dimensionality?