  1. Introduction to Machine Learning CentraleSupélec Paris, Fall 2017 3. Dimensionality Reduction Chloé-Agathe Azencott Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

  2. Learning objectives ● Give reasons why one would wish to reduce the dimensionality of a data set. ● Explain the difference between feature selection and feature extraction. ● Implement some filter strategies. ● Implement some wrapper strategies. ● Derive the computation of principal components from a “max variance” definition. ● Implement PCA.

  3. Curse of dimensionality ● Methods / intuitions that work in low dimension may not apply to high dimensions. ● p=2: Fraction of the points within a square that fall outside of the circle inscribed in it? ?

  4. Curse of dimensionality ● Methods / intuitions that work in low dimension may not apply to high dimensions. ● p=2: Fraction of the points within a square that fall outside of the circle inscribed in it: 1 − π/4 ≈ 0.21. [Figure: circle of radius r inscribed in a square.]

  5. Curse of dimensionality ● Methods / intuitions that work in low dimension may not apply to high dimensions. ● p=3: Fraction of the points within a cube that fall outside of the sphere inscribed in it: 1 − π/6 ≈ 0.48. [Figure: sphere of radius r inscribed in a cube.]

  6. Curse of dimensionality ● Volume of a p-sphere of radius r: V_p(r) = π^(p/2) r^p / Γ(p/2 + 1). The Gamma function Γ generalizes the factorial: Γ(n) = (n−1)! ● When p ↗ the proportion of a hypercube that lies outside of its inscribed hypersphere approaches 1. ● What this means: – hyperspace is very big – all points are far apart ⇒ dimensionality reduction.
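
To make the last point concrete, here is a small numerical check (a sketch added here, not part of the original slides; the function name fraction_outside_inscribed_sphere is mine). It uses the p-sphere volume formula above to compute the fraction of a hypercube that falls outside its inscribed hypersphere for a few values of p:

```python
# Fraction of a hypercube's volume lying outside its inscribed hypersphere.
# Based on the volume formula V_p(r) = pi^(p/2) r^p / Gamma(p/2 + 1).
import math

def fraction_outside_inscribed_sphere(p):
    # A hypercube of side 2r contains a hypersphere of radius r:
    # inside fraction = V_p(r) / (2r)^p = pi^(p/2) / (2^p * Gamma(p/2 + 1)).
    inside = math.pi ** (p / 2) / (2 ** p * math.gamma(p / 2 + 1))
    return 1 - inside

for p in (2, 3, 5, 10, 20):
    print(p, round(fraction_outside_inscribed_sphere(p), 4))
# Approximately: 2 -> 0.2146, 3 -> 0.4764, 5 -> 0.8355, 10 -> 0.9975, 20 -> 1.0;
# the fraction approaches 1 as p grows.
```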

  7. More reasons to reduce dimensionality ● Computational complexity (time and space) ● Interpretability ● Simpler models are more robust (less variance) ● Data visualization ● Cost of data acquisition ● Eliminate non-relevant attributes that can make it harder for an algorithm to learn.

  8. Approaches to dimensionality reduction ● Feature selection Choose m < p features, ignore the remaining (p − m). – Filtering approaches Apply a statistical measure to assign a score to each feature (correlation, χ²-test). – Wrapper approaches Search problem: find the best set of features for a given predictive model. – Embedded approaches Simultaneously fit a model and learn which features should be included.



  11. Approaches to dimensionality reduction ● Feature selection Choose m < p features, ignore the remaining (p − m). – Filtering approaches Apply a statistical measure to assign a score to each feature (correlation, χ²-test). – Wrapper approaches Search problem: find the best set of features for a given predictive model. – Embedded approaches Simultaneously fit a model and learn which features should be included. Are those approaches supervised or unsupervised? ? All these feature selection approaches are supervised.
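
As an illustration of the filtering approach above, here is a minimal sketch (an example added here, not taken from the course material; the helper filter_select and the toy data are assumptions). It scores each feature by its absolute Pearson correlation with the target and keeps the m highest-scoring ones:

```python
# Filter-style feature selection: score features independently of any model.
import numpy as np

def filter_select(X, y, m):
    # X: (n_samples, p) feature matrix; y: (n_samples,) target.
    # Score each feature by |Pearson correlation with y|, keep the top m.
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:m]

# Toy usage: y depends on features 2 and 7 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + rng.normal(size=100)
print(filter_select(X, y, m=2))   # typically selects features 2 and 7
```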

  12. Feature selection: Overview [Diagram comparing the three routes from the full feature set to a predictor: filter approaches score features directly; wrapper approaches search over feature subsets with the predictor in the loop (subset selection: forward selection, backward selection, floating selection); embedded approaches (e.g. Lasso, Elastic Net; see Chap. 7) fit the predictor and select features jointly.]
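
The overview lists the Lasso and the Elastic Net as embedded approaches (see Chap. 7). As a quick illustration (an example added here, with assumed toy data), fitting a Lasso and keeping the features with non-zero coefficients performs selection and model fitting in a single step:

```python
# Embedded feature selection: the L1 penalty of the Lasso zeroes out
# the coefficients of irrelevant features while fitting the model.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + rng.normal(size=100)

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)   # indices of the features kept
print(selected)                          # typically [2 7]
```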

  13. Feature selection: Subset selection

  14. Approaches to dimensionality reduction ● Feature selection Choose m < p features, ignore the remaining (p − m). – Filtering approaches Apply a statistical measure to assign a score to each feature (correlation, χ²-test). – Wrapper approaches Search problem: find the best set of features for a given predictive model. – Embedded approaches Simultaneously fit a model and learn which features should be included. All these feature selection approaches are supervised.

  15. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● How many subsets of p features are there? ?

  16. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● Issue: there are 2^p such sets.

  17. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● Issue: there are 2^p such sets. ● Greedy approach: forward search. Add the “best” feature at each step. E(S): error of a predictor trained only using the features in S. – Initially: S = ∅ – New best feature: j* = argmin_{j ∉ S} E(S ∪ {j}) – Stop if E(S ∪ {j*}) ≥ E(S) – Else: S ← S ∪ {j*}

  18. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● Issue: there are 2^p such sets. ● Greedy approach: forward search. Add the “best” feature at each step. E(S): error of a predictor trained only using the features in S. – Initially: S = ∅ – New best feature: j* = argmin_{j ∉ S} E(S ∪ {j}) – Stop if E(S ∪ {j*}) ≥ E(S) – Else: S ← S ∪ {j*} What is the complexity of this algorithm? ?

  19. Subset selection ● Goal: Find the subset of features that leads to the best-performing algorithm. ● Issue: there are 2^p such sets. ● Greedy approach: forward search. Add the “best” feature at each step. E(S): error of a predictor trained only using the features in S. – Initially: S = ∅ – New best feature: j* = argmin_{j ∉ S} E(S ∪ {j}) – Stop if E(S ∪ {j*}) ≥ E(S) – Else: S ← S ∪ {j*} Complexity: O(p² × C), where C = complexity of training and evaluating the model (which may itself depend on p). Much better than O(2^p)!

  20. Subset selection ● Greedy approach: forward search. Add the “best” feature at each step. E(S): error of a predictor trained only using the features in S. – Initially: S = ∅ – New best feature: j* = argmin_{j ∉ S} E(S ∪ {j}) – Stop if E(S ∪ {j*}) ≥ E(S) – Else: S ← S ∪ {j*} Complexity: O(p²) model fits. ● Alternative strategies: – Backward search: start from {1, …, p}, eliminate features. – Floating search: alternately add q features and remove r features.
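
Below is a minimal sketch of this greedy forward search (an illustration added here, not the course's reference implementation). It uses the cross-validated mean squared error of a scikit-learn estimator as E(S); the function name forward_selection is an assumption:

```python
# Greedy forward feature selection (wrapper approach).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, estimator=None):
    estimator = estimator or LinearRegression()
    p = X.shape[1]

    def error(S):
        # E(S): cross-validated error of a model trained only on the features in S.
        return -cross_val_score(estimator, X[:, sorted(S)], y,
                                scoring="neg_mean_squared_error", cv=5).mean()

    selected, best_err = set(), np.inf
    while len(selected) < p:
        # Evaluate E(S ∪ {j}) for every feature j not yet selected.
        err, j_star = min((error(selected | {j}), j)
                          for j in range(p) if j not in selected)
        if err >= best_err:              # stop: no remaining feature improves E
            break
        selected, best_err = selected | {j_star}, err
    return sorted(selected)
```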

  21. Approaches to dimensionality reduction ● Feature extraction Project the p features onto m < p new dimensions. Linear: ● Principal Components Analysis (PCA) ● Factor Analysis (FA) ● Non-negative Matrix Factorization (NMF) ● Linear Discriminant Analysis (LDA) (supervised). Non-linear: ● Multidimensional scaling (MDS) ● Isometric feature mapping (Isomap) ● Locally Linear Embedding (LLE) ● Autoencoders. Most of these approaches are unsupervised.

  22. Feature extraction: Principal Component Analysis

  23. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected onto that space.

  24. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected onto that space. ● Unsupervised: we're only looking at the data, not at any labels.

  25. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected onto that space. ● Unsupervised: we're only looking at the data, not at any labels. In PCA, we want the variance of the projected data to be maximized.

  26. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected onto that space. ● Unsupervised: we're only looking at the data, not at any labels. In PCA, we want the variance of the projected data to be maximized. [Figure: a 2-D data cloud and its projections on x1 and on x2.]

  27. Principal Components Analysis (PCA) ● Goal: Find a low-dimensional space such that information loss is minimized when the data is projected onto that space. ● Unsupervised: we're only looking at the data, not at any labels. In PCA, we want the variance of the projected data to be maximized. Warning! This requires standardizing the features. [Figure: a 2-D data cloud and its projections on x1 and on x2.]

  28. Feature standardization ● Variance of feature j in data set D: ?

  29. Feature standardization ● Variance of feature j in data set D: Var(x_j) = (1/n) Σ_{i=1..n} (x_j^(i) − μ_j)², where μ_j = (1/n) Σ_{i=1..n} x_j^(i) is the mean of feature j over D. ● Features that take large values will have large variance. Compare [10, 20, 30, 40, 50] with [0.1, 0.2, 0.3, 0.4, 0.5]. ● Standardization: – mean centering: give each feature a mean of 0 – variance scaling: give each feature a variance of 1


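To tie standardization and PCA together, here is a minimal end-to-end sketch (an example added here, not the course's reference code): standardize the features, then take the top-m eigenvectors of the covariance matrix, i.e. the directions of maximal variance, as the principal components.

```python
# PCA from the "max variance" point of view: principal components are the
# leading eigenvectors of the covariance matrix of the standardized data.
import numpy as np

def standardize(X):
    # Mean-center each feature and scale it to unit variance.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca(X, m):
    Xs = standardize(X)
    cov = np.cov(Xs, rowvar=False)            # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]     # indices of the m largest eigenvalues
    components = eigvecs[:, order]            # p x m matrix of principal directions
    return Xs @ components, components        # projected data, loadings

# Toy usage:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z, W = pca(X, m=2)
print(Z.shape)   # (200, 2)
```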
