Information theoretic feature selection for non-standard data Michel Verleysen Machine Learning Group Université catholique de Louvain Louvain-la-Neuve, Belgium michel.verleysen@uclouvain.be
Thanks to • PhD students, post-docs and other colleagues (in and out of UCL), in particular Damien François, Catherine Krier, Amaury Lendasse, Gauthier Doquire, Fabrice Rossi, Frederico Coelho StatLearn 2011 Michel Verleysen 2
Outline • Motivation • Feature selection in a nutshell • Relevance criterion • Mutual information • Structured data • Case studies – MI with missing data – MI with mixed data – MI for multi-label data – semi-supervised feature selection StatLearn 2011 Michel Verleysen 3
Motivation HD data are everywhere • Enhanced data acquisition possibilities → many HD data! classification - clustering - regression [Figure: a spectrum (known information, DIM = 256) fed into a model to predict the alcohol concentration, compared (±) with the admissible alcohol level] StatLearn 2011 Michel Verleysen 4
Motivation HD data are everywhere • Enhanced data acquisition possibilities → many HD data! classification - clustering - regression [Figure: genomic signature images (DIM = 16384) used for feature extraction and clustering. From B. Fertil & http://genstyle.imed.jussieu.fr] StatLearn 2011 Michel Verleysen 5
Motivation HD data are everywhere • Enhanced data acquisition possibilities → many HD data! classification - clustering - regression [Figure: sunspots time series] Prediction of the next value from the DIM previous ones: $y = f(x_{t-\mathrm{DIM}+1}, \ldots, x_{t-1}, x_t)$, i.e. find $y = \;?$ given the window $(x_{t-\mathrm{DIM}+1}, \ldots, x_{t-1}, x_t)$ StatLearn 2011 Michel Verleysen 6
Motivation Generic data analysis [Figure: data matrix with one row per observation and one column per variable or feature; e.g. the lyrics "When I find myself in times of trouble, Mother Mary comes to me, speaking words of wisdom, let it be…" turned into word-count features (when: 1, times: 1, trouble: 1, let: 65, wisdom: 1, …) feeding Analysis and Models] StatLearn 2011 Michel Verleysen 7
Motivation The big challenge • What is the problem with many features? – Computational complexity? Not really – Models stuck in local minima? Not so much – Concentration of distances? Yes – Poor estimations? Yes StatLearn 2011 Michel Verleysen 15
Motivation Concentration of the Euclidean norm • Distribution of the norm of random vectors – i.i.d. components in [0,1] – norms lie in [0, √d] [Figure: histograms of the norms for d = 2 and d = 50] • Norms concentrate around their expectation • They don't discriminate anymore! StatLearn 2011 Michel Verleysen 16
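A quick numerical illustration of this concentration effect (not from the original slides) — a minimal Python/NumPy sketch that draws vectors with i.i.d. uniform components and compares the spread of their Euclidean norms for d = 2 and d = 50:

```python
# Minimal sketch (assumption: NumPy only, toy data): norms of random vectors
# with i.i.d. components in [0, 1] concentrate around their mean as d grows.
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 50):
    X = rng.uniform(0.0, 1.0, size=(100_000, d))   # random vectors in [0, 1]^d
    norms = np.linalg.norm(X, axis=1)              # Euclidean norms, in [0, sqrt(d)]
    # The relative spread std/mean shrinks with d: the norm stops discriminating.
    print(f"d={d:3d}  mean={norms.mean():.3f}  std={norms.std():.3f}  "
          f"std/mean={norms.std() / norms.mean():.3f}")
```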
Motivation Distances also concentrate [Figure: pairwise distances for dimension 2 and dimension 100] Pairwise distances seem nearly equal for all points: the relative contrast vanishes as the dimension increases. If $\lim_{d \to \infty} \mathrm{Var}\!\left(\frac{\|X_d\|_2}{E[\|X_d\|_2]}\right) = 0$ then $\frac{DMAX_d - DMIN_d}{DMIN_d} \to_p 0$ when $d \to \infty$ [Beyer] StatLearn 2011 Michel Verleysen 17
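The vanishing relative contrast can be observed directly on simulated data. A minimal sketch (not from the talk; uniform toy data and the point counts are arbitrary assumptions):

```python
# Minimal sketch of the Beyer et al. observation: the relative contrast
# (DMAX - DMIN) / DMIN of distances from a query point shrinks with dimension.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of data points

for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))
    q = rng.uniform(size=d)                       # a query point
    dist = np.linalg.norm(X - q, axis=1)          # distances from q to all points
    dmin, dmax = dist.min(), dist.max()
    print(f"d={d:5d}  relative contrast (DMAX-DMIN)/DMIN = {(dmax - dmin) / dmin:.3f}")
```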
Motivation The estimation problem • An example of linear method: Principal Component Analysis (PCA) – based on the covariance matrix – huge (DIM × DIM) – poorly estimated with a low/finite number of data • Other methods: – Linear discriminant analysis (LDA) – Partial least squares (PLS) – … Similar problems! StatLearn 2011 Michel Verleysen 18
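To see how badly a DIM × DIM covariance matrix is estimated from few samples, here is a minimal sketch (not from the talk; Gaussian toy data with identity covariance is an assumption made for illustration):

```python
# Minimal sketch of the estimation problem behind PCA: with n samples in
# dimension d, the empirical covariance has at most n - 1 non-zero eigenvalues
# and its spectrum is badly distorted when n is small relative to d.
import numpy as np

rng = np.random.default_rng(0)
d = 256                                   # e.g. a DIM = 256 spectrum

for n in (50, 500, 50_000):
    X = rng.normal(size=(n, d))           # true covariance is the identity
    C = np.cov(X, rowvar=False)           # d x d empirical covariance
    eig = np.linalg.eigvalsh(C)
    # With an identity covariance, all eigenvalues should be close to 1.
    print(f"n={n:6d}  largest eig={eig.max():.2f}  smallest eig={eig.min():.2e}")
```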
Motivation Nonlinear tools • Nonlinear models $y = f(x_1, x_2, \ldots, x_d, \theta)$ – if d ↗↗, then size(θ) ↗↗ • θ results from the minimization of a non-convex cost function – local minima – numerical problems (flat regions, high slopes) – convergence – etc. • Ex: multi-layer perceptrons, Gaussian mixtures (RBF), kernel machines, self-organizing maps, etc. StatLearn 2011 Michel Verleysen 19
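A small illustration of how size(θ) grows with d (not from the talk; the one-hidden-layer architecture and the hidden-layer size are hypothetical choices):

```python
# Minimal sketch: parameter count of a one-hidden-layer perceptron as a
# function of the input dimension d (weights + biases of the hidden layer,
# then weights + bias of the output unit).
def mlp_parameter_count(d, hidden=20):
    return (d * hidden + hidden) + (hidden + 1)

for d in (5, 50, 500, 5000):
    print(d, mlp_parameter_count(d))
```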
Motivation Why reduce the dimensionality? • Not useful in theory: – more information means an easier task – models can ignore irrelevant features (e.g. set their weights to zero) • But… – lots of inputs means lots of parameters & a large input space • Curse of dimensionality and risk of overfitting! StatLearn 2011 Michel Verleysen 20
Motivation Overfitting Model-dependent • Use regularization From: Duda et al., Pattern Classification, 2nd ed., Wiley, 2001 StatLearn 2011 Michel Verleysen 21
Motivation Overfitting Model-dependent Data-dependent • D points suffice to (perfectly) fit the simplest (linear) model in a D-dimensional space • Going from (perfect) fitting to approximation requires much more than D points! • What if much fewer than D points are available? • Use regularization From: Duda et al., Pattern Classification, 2nd ed., Wiley, 2001 StatLearn 2011 Michel Verleysen 22
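The "D points in D dimensions" remark can be checked in a few lines (not from the talk; D = 30 and the Gaussian toy data are arbitrary assumptions):

```python
# Minimal sketch: in a D-dimensional space, D points are enough to fit a
# linear model y = w.x exactly, whatever the targets are -- (perfect) fitting
# rather than approximation.
import numpy as np

rng = np.random.default_rng(0)
D = 30
X = rng.normal(size=(D, D))        # exactly D points in D dimensions
y = rng.normal(size=D)             # arbitrary targets
w = np.linalg.solve(X, y)          # linear model interpolating every point
print("max residual:", np.max(np.abs(X @ w - y)))   # ~ 0 up to rounding error
```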
Outline • Motivation • Feature selection in a nutshell • Relevance criterion • Mutual information • Structured data • Case studies – MI with missing data – MI with mixed data – MI for multi-label data – semi-supervised feature selection StatLearn 2011 Michel Verleysen 23
Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised: features x_1 … x_N → redundancy criterion → selected features x_1 … x_M – Supervised: features x_1 … x_N and target y → relevance criterion → selection → x_1 … x_M StatLearn 2011 Michel Verleysen 24
Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised – supervised – Filter: features x_1 … x_N and target y → relevance criterion → selection → x_1 … x_M – Wrapper: features x_1 … x_N → (non)linear model → prediction ŷ compared with y → selection → x_1 … x_M StatLearn 2011 Michel Verleysen 25
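The filter/wrapper distinction can be made concrete on a toy regression problem. A minimal sketch (not from the talk; the data-generating model, the correlation criterion and the univariate least-squares wrapper are illustrative assumptions, not the speaker's method):

```python
# Minimal sketch: a filter ranks features with a relevance criterion computed
# independently of any model; a wrapper scores a feature by the validation
# error of the model actually being trained.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=n)   # only features 0 and 3 matter

# Filter: absolute correlation with the target (a simple relevance criterion).
filter_scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(d)]

# Wrapper: validation error of a univariate least-squares model per feature.
half = n // 2
wrapper_scores = []
for j in range(d):
    w = np.linalg.lstsq(X[:half, j:j + 1], y[:half], rcond=None)[0]
    err = np.mean((X[half:, j:j + 1] @ w - y[half:]) ** 2)
    wrapper_scores.append(-err)                            # higher score = lower error

print("filter ranking :", np.argsort(filter_scores)[::-1])
print("wrapper ranking:", np.argsort(wrapper_scores)[::-1])
```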
Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised – supervised – Filter – wrapper – Selection: features x_1 … x_N and target y → relevance criterion → selection → x_1 … x_M – Projection: features x_1 … x_N and target y → relevance criterion → projection → new features z_1 … z_M StatLearn 2011 Michel Verleysen 26
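Selection keeps a subset of the original (interpretable) variables, projection replaces them by combinations. A minimal sketch (not from the talk; the kept columns and the PCA projection are arbitrary examples):

```python
# Minimal sketch: selection keeps original columns; projection builds new
# coordinates as (here linear) combinations of all features, e.g. PCA.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Selection: keep original columns, hence interpretable features.
selected = X[:, [0, 3, 7]]

# Projection: new coordinates mixing all original features (first 3 PCs).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:3].T

print(selected.shape, projected.shape)
```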
Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised – supervised – Filter – wrapper – Selection – Projection – Linear: • straightforward, easy • no tuning parameter • no estimation problem • but obviously doesn't capture nonlinear relationships… – Nonlinear: • less intuitive (interpretability) • less straightforward (bounds, …) • estimation difficulties StatLearn 2011 Michel Verleysen 27
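The linear/nonlinear trade-off is easy to exhibit: a linear criterion (correlation) can miss a purely nonlinear dependency that a nonlinear criterion such as mutual information still detects. A minimal sketch (not from the talk; the crude histogram-based MI estimate is an illustrative assumption, not the estimator discussed later in the slides):

```python
# Minimal sketch: y = x^2 is uncorrelated with x (for symmetric x) yet fully
# determined by it; correlation ~ 0, but a plug-in histogram estimate of the
# mutual information is clearly positive.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2

print("correlation:", np.corrcoef(x, y)[0, 1])   # close to 0

# Crude histogram-based mutual information estimate (in nats).
pxy, _, _ = np.histogram2d(x, y, bins=30)
pxy /= pxy.sum()                                  # joint probabilities per bin
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
nz = pxy > 0
mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
print("mutual information estimate:", mi)        # clearly positive
```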