
Information theoretic feature selection for non-standard data



  1. Information theoretic feature selection for non-standard data Michel Verleysen Machine Learning Group Université catholique de Louvain Louvain-la-Neuve, Belgium michel.verleysen@uclouvain.be

  2. Thanks to • PhD students, post-docs and other colleagues (in and outside UCL), in particular Damien François, Catherine Krier, Amaury Lendasse, Gauthier Doquire, Fabrice Rossi, Frederico Coelho

  3. Outline • Motivation • Feature selection in a nutshell • Relevance criterion • Mutual information • Structured data • Case studies – MI with missing data – MI with mixed data – MI for multi-label data – semi-supervised feature selection

  4. Motivation HD data are everywhere • Enhanced data acquisition possibilities → many HD data! classification - clustering - regression [Figure: known information with DIM = 256; modeling predicts the alcohol concentration, to be compared (+/-) with the admissible alcohol level]

  5. Motivation HD data are everywhere • Enhanced data acquisition possibilities → many HD data! classification - clustering - regression [Figure: feature extraction and clustering, DIM = 16384; from B. Fertil & http://genstyle.imed.jussieu.fr]

  6. Motivation HD data are everywhere • Enhanced data acquisition possibilities → many HD data! classification - clustering - regression Sunspots time series: $y = f(x_{t-DIM+1}, \ldots, x_{t-1}, x_t)$, i.e. $y = ?$ is predicted from the DIM previous values $(x_{t-DIM+1}, \ldots, x_{t-1}, x_t)$
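To make the time-series formulation above concrete, here is a minimal sketch, assuming plain NumPy and an arbitrary window length DIM (12 below, a hypothetical choice): it builds the lagged regressors (x_{t-DIM+1}, …, x_{t-1}, x_t) and takes the next value of the series as the target y.

```python
import numpy as np

def delay_embedding(series, dim):
    """Build regressors (x_{t-dim+1}, ..., x_t) and targets y = x_{t+1}."""
    series = np.asarray(series, dtype=float)
    n = len(series) - dim                                   # usable (window, target) pairs
    X = np.stack([series[i:i + dim] for i in range(n)])     # shape (n, dim)
    y = series[dim:]                                        # next value after each window
    return X, y

# Toy usage: a synthetic oscillating series, DIM = 12 past values (illustrative choice).
t = np.arange(300)
series = np.sin(2 * np.pi * t / 11) + 0.1 * np.random.randn(len(t))
X, y = delay_embedding(series, dim=12)
print(X.shape, y.shape)   # (288, 12) (288,)
```

Each row of X is one observation in a DIM-dimensional input space, which is exactly how the time series becomes a high-dimensional regression problem.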

  7. Motivation Generic data analysis [Figure: a generic data matrix with rows indexed by the number of observations and columns by the number of variables or features, feeding Analysis and Models; example: song lyrics ("When I find myself in times of trouble, Mother Mary comes to me, speaking words of wisdom, let it be…") encoded as word-count features (When 1, Times 1, Trouble 1, Let 65, wisdom 1)]

  8. Motivation The big challenge • What is the problem with many features? – Computational complexity?

  9. Motivation The big challenge • What is the problem with many features? – Computational complexity? Not really

  10. Motivation The big challenge • What is the problem with many features? – Computational complexity? Not really – Models stuck in local minima?

  11. Motivation The big challenge • What is the problem with many features? – Computational complexity? Not really – Models stuck in local minima? Not so much

  12. Motivation The big challenge • What is the problem with many features? – Computational complexity? Not really – Models stuck in local minima? Not so much – Concentration of distances?

  13. Motivation The big challenge • What is the problem with many features? – Computational complexity? Not really – Models stuck in local minima? Not so much – Concentration of distances? Yes

  14. Motivation The big challenge • What is the problem with many features? – Computational complexity? Not really – Models stuck in local minima? Not so much – Concentration of distances? Yes – Poor estimations?

  15. Motivation The big challenge • What is the problem with many features? – Computational complexity? Not really – Models stuck in local minima? Not so much – Concentration of distances? Yes – Poor estimations? Yes

  16. Motivation Concentration of the Euclidean norm • Distribution of the norm of random vectors – i.i.d. components in [0, 1] – norms in [0, √d] [Figure: histograms of the norms for d = 2 and d = 50] • Norms concentrate around their expectation • They don't discriminate anymore!
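A minimal simulation of this effect, assuming NumPy and reusing the dimensions 2 and 50 from the slide: the relative spread of the Euclidean norm of vectors with i.i.d. components in [0, 1] shrinks as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 50):
    X = rng.uniform(0.0, 1.0, size=(100_000, d))   # i.i.d. components in [0, 1]
    norms = np.linalg.norm(X, axis=1)              # Euclidean norms, lying in [0, sqrt(d)]
    # Relative spread: standard deviation over mean of the norms.
    print(f"d = {d:3d}  mean = {norms.mean():.3f}  std/mean = {norms.std() / norms.mean():.4f}")

# Typical output: the relative spread drops from roughly 0.37 (d = 2)
# to roughly 0.06 (d = 50): norms concentrate and discriminate less and less.
```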

  17. Motivation Distances also concentrate [Figure: pairwise distance histograms for dimension = 2 and dimension = 100] Pairwise distances seem nearly equal for all points. Relative contrast vanishes as the dimension increases: if $\lim_{d \to \infty} \operatorname{Var}(\lVert X_d \rVert_2) / E(\lVert X_d \rVert_2) = 0$, then $(DMAX_d - DMIN_d)/DMIN_d \to_p 0$ when $d \to \infty$ [Beyer]
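A small numerical check of the vanishing relative contrast, assuming NumPy and using the slide's dimensions 2 and 100 (the sample size and the uniform data are arbitrary choices): distances from a random query point to a random data set are computed and (DMAX - DMIN)/DMIN is reported.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1_000

for d in (2, 100):
    data = rng.uniform(size=(n_points, d))     # random data set in [0, 1]^d
    query = rng.uniform(size=d)                # random query point
    dist = np.linalg.norm(data - query, axis=1)
    dmax, dmin = dist.max(), dist.min()
    print(f"d = {d:3d}  relative contrast (DMAX - DMIN)/DMIN = {(dmax - dmin) / dmin:.3f}")

# Typically the contrast is of order 10-100 in dimension 2 but falls well
# below 1 in dimension 100, in line with the [Beyer] result above.
```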

  18. Motivation The estimation problem • An example of a linear method: Principal component analysis (PCA) – based on the covariance matrix – huge (DIM x DIM) – poorly estimated with a low/finite number of data • Other methods: – Linear discriminant analysis (LDA) – Partial least squares (PLS) – … Similar problems!
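A quick sketch of the estimation problem, assuming NumPy, Gaussian data with identity covariance, and an illustrative choice of DIM = 256 features for only 50 observations: the empirical DIM x DIM covariance matrix is rank-deficient and its spectrum is far from the true one, so PCA directions derived from it are unreliable.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 256, 50                     # many features, few observations (illustrative)

X = rng.standard_normal((n, dim))    # true covariance is the identity
C = np.cov(X, rowvar=False)          # empirical DIM x DIM covariance matrix

eigvals = np.linalg.eigvalsh(C)      # eigenvalues in ascending order
print("matrix size:", C.shape)                       # (256, 256)
print("rank:", np.linalg.matrix_rank(C))             # at most n - 1 = 49
print("smallest / largest eigenvalue:", eigvals[0], eigvals[-1])
# All true eigenvalues equal 1, yet the estimated spectrum runs from 0 to
# several units: the leading "principal components" are largely artefacts.
```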

  19. Motivation Nonlinear tools Nonlinear models: $y = f(x_1, x_2, \ldots, x_d, \theta)$. If d ↗↗, size(θ) ↗↗ • θ results from the minimization of a non-convex cost function – local minima – numerical problems (flat regions, high slopes) – convergence – etc. • Ex: multi-layer perceptrons, Gaussian mixtures (RBF), kernel machines, self-organizing maps, etc.
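As a rough illustration of size(θ) growing with d, here is a tiny sketch; the one-hidden-layer multi-layer perceptron and the hidden-layer size of 50 are assumptions made for the example, not taken from the slides.

```python
def mlp_param_count(d, hidden=50):
    """Parameters of a one-hidden-layer MLP with a single output:
    hidden * (d + 1) weights and biases in the first layer, hidden + 1 in the second."""
    return hidden * (d + 1) + (hidden + 1)

for d in (10, 100, 1000, 16384):
    print(d, mlp_param_count(d))
# 10 -> 601, 100 -> 5101, 1000 -> 50101, 16384 -> 819301: size(theta) grows with d,
# and all of these parameters must be found by non-convex optimization.
```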

  20. Motivation Why reduce the dimensionality? • Not useful in theory: – More information means an easier task – Models can ignore irrelevant features (e.g. set weights to zero) • But... – Lots of inputs means lots of parameters & a large input space • Curse of dimensionality and risk of overfitting!

  21. Motivation Overfitting Model-dependent • Use regularization From: Duda et al., Pattern Classification, 2nd ed., Wiley, 2001

  22. Motivation Overfitting Model-dependent Data-dependent • D points are enough to fit the simplest (linear) model in a D-dim space (perfect fitting) → for approximation: much more than D points! • What if much fewer than D points are available? • Use regularization From: Duda et al., Pattern Classification, 2nd ed., Wiley, 2001
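A minimal numerical illustration of the data-dependent point, assuming NumPy (the dimension D = 20 is an arbitrary choice): with exactly D points in a D-dimensional space, the simplest linear model fits them perfectly even when the target is pure noise, which is why approximation requires many more points.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
X = rng.standard_normal((d, d))      # exactly D points in a D-dimensional space
y = rng.standard_normal(d)           # target is pure noise: nothing to learn

w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("training residual:", np.linalg.norm(X @ w - y))   # ~1e-14: perfect fit

# A fresh noise sample is of course not explained at all by the fitted model:
X_new, y_new = rng.standard_normal((1000, d)), rng.standard_normal(1000)
print("RMS error on new points:", np.linalg.norm(X_new @ w - y_new) / np.sqrt(1000))
```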

  23. Outline • Motivation • Feature selection in a nutshell • Relevance criterion • Mutual information • Structured data • Case studies – MI with missing data – MI with mixed data – MI for multi-label data – semi-supervised feature selection

  24. Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised: [diagram: features (x_1, x_2, …, x_N) are reduced to (x_1, x_2, …, x_M) using a redundancy criterion] – Supervised: [diagram: features (x_1, x_2, …, x_N) go through selection to (x_1, x_2, …, x_M) using a relevance criterion with the output y]

  25. Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised – supervised – Filter: [diagram: (x_1, x_2, …, x_N) go through selection to (x_1, x_2, …, x_M) using a relevance criterion with y] – Wrapper: [diagram: (x_1, x_2, …, x_N) go through selection to (x_1, x_2, …, x_M) using a (non)linear model whose prediction ŷ is compared to y]
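Since the rest of the talk uses mutual information as the relevance criterion, here is a minimal filter-style sketch; the synthetic data, scikit-learn's mutual_info_regression estimator and the number of kept features are illustrative assumptions, not the speaker's implementation. Features are ranked by their estimated MI with the output y and the best M are kept without training any model.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n, n_features, m_kept = 500, 20, 3

X = rng.standard_normal((n, n_features))
# Only features 0, 1 and 2 actually drive the output (feature 2 nonlinearly).
y = X[:, 0] + 0.5 * X[:, 1] + np.sin(3 * X[:, 2]) + 0.1 * rng.standard_normal(n)

relevance = mutual_info_regression(X, y, random_state=0)   # filter relevance criterion
selected = np.argsort(relevance)[::-1][:m_kept]            # keep the M best-ranked features
print("estimated MI per feature:", np.round(relevance, 3))
print("selected features:", sorted(selected.tolist()))      # expected: [0, 1, 2]
```

A wrapper would instead retrain a (non)linear model for each candidate subset and compare its prediction ŷ to y; this is more faithful to the final model but far more expensive than the filter above.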

  26. Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised – supervised – Filter – wrapper – Selection: [diagram: (x_1, x_2, …, x_N) go through selection to (x_1, x_2, …, x_M) using a relevance criterion with y] – Projection: [diagram: (x_1, x_2, …, x_N) go through projection to (z_1, z_2, …, z_M) using a relevance criterion with y]

  27. Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised – supervised – Filter – wrapper – Selection – Projection – Linear • Straightforward, easy • No tuning parameter • No estimation problem • But obviously doesn't capture nonlinear relationships… – Nonlinear • Less intuitive (interpretability) • Less straightforward (bounds, …) • Estimation difficulties
