Information theoretic feature selection for non-standard data Michel Verleysen Machine Learning Group Université catholique de Louvain Louvain-la-Neuve, Belgium michel.verleysen@uclouvain.be
Thanks to • PhD students, post-docs and other colleagues (in and out of UCL), in particular Damien François, Catherine Krier, Amaury Lendasse, Gauthier Doquire, Fabrice Rossi, Frederico Coelho StatLearn 2011 Michel Verleysen 2
Outline • Motivation • Feature selection in a nutshell • Relevance criterion • Mutual information • Structured data • Case studies – MI with missing data – MI with mixed data – MI for multi-label data – semi-supervised feature selection StatLearn 2011 Michel Verleysen 3
Motivation HD data are everywhere • Enhanced data acquisition possibilities → many HD data! classification - clustering - regression [Figure: a spectrum (known information, DIM = 256) fed into a model to predict the alcohol concentration, compared (±) with the admissible alcohol level] StatLearn 2011 Michel Verleysen 4
Motivation HD data are everywhere • Enhanced data acquisition possibilities → many HD data! classification - clustering - regression [Figure: genomic signature images (DIM = 16384) used for feature extraction and clustering. From B. Fertil & http://genstyle.imed.jussieu.fr] StatLearn 2011 Michel Verleysen 5
Motivation HD data are everywhere • Enhanced data acquisition possibilities → many HD data! classification - clustering - regression [Figure: sunspots time series] Prediction of the next value from the DIM previous ones: $y = f(x_{t-\mathrm{DIM}+1}, \ldots, x_{t-1}, x_t)$, i.e. find $y = \;?$ given the window $(x_{t-\mathrm{DIM}+1}, \ldots, x_{t-1}, x_t)$ StatLearn 2011 Michel Verleysen 6
Motivation Generic data analysis [Figure: data matrix with one row per observation and one column per variable or feature; e.g. the lyrics "When I find myself in times of trouble, Mother Mary comes to me, speaking words of wisdom, let it be…" turned into word-count features (when: 1, times: 1, trouble: 1, let: 65, wisdom: 1, …) feeding Analysis and Models] StatLearn 2011 Michel Verleysen 7
Motivation The big challenge • What is the problem with many features? – Computational complexity? Not really – Models stuck in local minima? Not so much – Concentration of distances? Yes – Poor estimations? Yes StatLearn 2011 Michel Verleysen 15
Motivation Concentration of the Euclidean norm • Distribution of the norm of random vectors – i.i.d. components in [0,1] – norms lie in [0, √d] [Figure: histograms of the norms for d = 2 and d = 50] • Norms concentrate around their expectation • They don't discriminate anymore! StatLearn 2011 Michel Verleysen 16
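A quick numerical illustration of this concentration effect (not from the original slides) — a minimal Python/NumPy sketch that draws vectors with i.i.d. uniform components and compares the spread of their Euclidean norms for d = 2 and d = 50:

```python
# Minimal sketch (assumption: NumPy only, toy data): norms of random vectors
# with i.i.d. components in [0, 1] concentrate around their mean as d grows.
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 50):
    X = rng.uniform(0.0, 1.0, size=(100_000, d))   # random vectors in [0, 1]^d
    norms = np.linalg.norm(X, axis=1)              # Euclidean norms, in [0, sqrt(d)]
    # The relative spread std/mean shrinks with d: the norm stops discriminating.
    print(f"d={d:3d}  mean={norms.mean():.3f}  std={norms.std():.3f}  "
          f"std/mean={norms.std() / norms.mean():.3f}")
```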
Motivation Distances also concentrate [Figure: pairwise distances for dimension 2 and dimension 100] Pairwise distances seem nearly equal for all points: the relative contrast vanishes as the dimension increases. If $\lim_{d \to \infty} \mathrm{Var}\!\left(\frac{\|X_d\|_2}{E[\|X_d\|_2]}\right) = 0$ then $\frac{DMAX_d - DMIN_d}{DMIN_d} \to_p 0$ when $d \to \infty$ [Beyer] StatLearn 2011 Michel Verleysen 17
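The vanishing relative contrast can be observed directly on simulated data. A minimal sketch (not from the talk; uniform toy data and the point counts are arbitrary assumptions):

```python
# Minimal sketch of the Beyer et al. observation: the relative contrast
# (DMAX - DMIN) / DMIN of distances from a query point shrinks with dimension.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of data points

for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))
    q = rng.uniform(size=d)                       # a query point
    dist = np.linalg.norm(X - q, axis=1)          # distances from q to all points
    dmin, dmax = dist.min(), dist.max()
    print(f"d={d:5d}  relative contrast (DMAX-DMIN)/DMIN = {(dmax - dmin) / dmin:.3f}")
```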
Motivation The estimation problem • An example of linear method: Principal Component Analysis (PCA) – based on the covariance matrix – huge (DIM × DIM) – poorly estimated with a low/finite number of data • Other methods: – Linear discriminant analysis (LDA) – Partial least squares (PLS) – … Similar problems! StatLearn 2011 Michel Verleysen 18
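To see how badly a DIM × DIM covariance matrix is estimated from few samples, here is a minimal sketch (not from the talk; Gaussian toy data with identity covariance is an assumption made for illustration):

```python
# Minimal sketch of the estimation problem behind PCA: with n samples in
# dimension d, the empirical covariance has at most n - 1 non-zero eigenvalues
# and its spectrum is badly distorted when n is small relative to d.
import numpy as np

rng = np.random.default_rng(0)
d = 256                                   # e.g. a DIM = 256 spectrum

for n in (50, 500, 50_000):
    X = rng.normal(size=(n, d))           # true covariance is the identity
    C = np.cov(X, rowvar=False)           # d x d empirical covariance
    eig = np.linalg.eigvalsh(C)
    # With an identity covariance, all eigenvalues should be close to 1.
    print(f"n={n:6d}  largest eig={eig.max():.2f}  smallest eig={eig.min():.2e}")
```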
Motivation Nonlinear tools • Nonlinear models $y = f(x_1, x_2, \ldots, x_d, \theta)$ – if d ↗↗, then size(θ) ↗↗ • θ results from the minimization of a non-convex cost function – local minima – numerical problems (flat regions, high slopes) – convergence – etc. • Ex: multi-layer perceptrons, Gaussian mixtures (RBF), kernel machines, self-organizing maps, etc. StatLearn 2011 Michel Verleysen 19
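A small illustration of how size(θ) grows with d (not from the talk; the one-hidden-layer architecture and the hidden-layer size are hypothetical choices):

```python
# Minimal sketch: parameter count of a one-hidden-layer perceptron as a
# function of the input dimension d (weights + biases of the hidden layer,
# then weights + bias of the output unit).
def mlp_parameter_count(d, hidden=20):
    return (d * hidden + hidden) + (hidden + 1)

for d in (5, 50, 500, 5000):
    print(d, mlp_parameter_count(d))
```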
Motivation Why reduce the dimensionality? • Not useful in theory: – more information means an easier task – models can ignore irrelevant features (e.g. set their weights to zero) • But… – lots of inputs means lots of parameters & a large input space • Curse of dimensionality and risk of overfitting! StatLearn 2011 Michel Verleysen 20
Motivation Overfitting Model-dependent • Use regularization From: Duda et al., Pattern Classification, 2nd ed., Wiley, 2001 StatLearn 2011 Michel Verleysen 21
Motivation Overfitting Model-dependent Data-dependent • D points suffice to (perfectly) fit the simplest (linear) model in a D-dimensional space • Going from (perfect) fitting to approximation requires much more than D points! • What if much fewer than D points are available? • Use regularization From: Duda et al., Pattern Classification, 2nd ed., Wiley, 2001 StatLearn 2011 Michel Verleysen 22
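The "D points in D dimensions" remark can be checked in a few lines (not from the talk; D = 30 and the Gaussian toy data are arbitrary assumptions):

```python
# Minimal sketch: in a D-dimensional space, D points are enough to fit a
# linear model y = w.x exactly, whatever the targets are -- (perfect) fitting
# rather than approximation.
import numpy as np

rng = np.random.default_rng(0)
D = 30
X = rng.normal(size=(D, D))        # exactly D points in D dimensions
y = rng.normal(size=D)             # arbitrary targets
w = np.linalg.solve(X, y)          # linear model interpolating every point
print("max residual:", np.max(np.abs(X @ w - y)))   # ~ 0 up to rounding error
```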
Outline • Motivation • Feature selection in a nutshell • Relevance criterion • Mutual information • Structured data • Case studies – MI with missing data – MI with mixed data – MI for multi-label data – semi-supervised feature selection StatLearn 2011 Michel Verleysen 23
Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised: features x_1 … x_N → redundancy criterion → selected features x_1 … x_M – Supervised: features x_1 … x_N and target y → relevance criterion → selection → x_1 … x_M StatLearn 2011 Michel Verleysen 24
Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised – supervised – Filter: features x_1 … x_N and target y → relevance criterion → selection → x_1 … x_M – Wrapper: features x_1 … x_N → (non)linear model → prediction ŷ compared with y → selection → x_1 … x_M StatLearn 2011 Michel Verleysen 25
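The filter/wrapper distinction can be made concrete on a toy regression problem. A minimal sketch (not from the talk; the data-generating model, the correlation criterion and the univariate least-squares wrapper are illustrative assumptions, not the speaker's method):

```python
# Minimal sketch: a filter ranks features with a relevance criterion computed
# independently of any model; a wrapper scores a feature by the validation
# error of the model actually being trained.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=n)   # only features 0 and 3 matter

# Filter: absolute correlation with the target (a simple relevance criterion).
filter_scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(d)]

# Wrapper: validation error of a univariate least-squares model per feature.
half = n // 2
wrapper_scores = []
for j in range(d):
    w = np.linalg.lstsq(X[:half, j:j + 1], y[:half], rcond=None)[0]
    err = np.mean((X[half:, j:j + 1] @ w - y[half:]) ** 2)
    wrapper_scores.append(-err)                            # higher score = lower error

print("filter ranking :", np.argsort(filter_scores)[::-1])
print("wrapper ranking:", np.argsort(wrapper_scores)[::-1])
```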
Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised – supervised – Filter – wrapper – Selection: features x_1 … x_N and target y → relevance criterion → selection → x_1 … x_M – Projection: features x_1 … x_N and target y → relevance criterion → projection → new features z_1 … z_M StatLearn 2011 Michel Verleysen 26
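Selection keeps a subset of the original (interpretable) variables, projection replaces them by combinations. A minimal sketch (not from the talk; the kept columns and the PCA projection are arbitrary examples):

```python
# Minimal sketch: selection keeps original columns; projection builds new
# coordinates as (here linear) combinations of all features, e.g. PCA.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Selection: keep original columns, hence interpretable features.
selected = X[:, [0, 3, 7]]

# Projection: new coordinates mixing all original features (first 3 PCs).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:3].T

print(selected.shape, projected.shape)
```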
Feature selection in a nutshell • 1001 ways (and more…) to perform feature selection • The challenges: – Unsupervised – supervised – Filter – wrapper – Selection – Projection – Linear: • straightforward, easy • no tuning parameter • no estimation problem • but obviously doesn't capture nonlinear relationships… – Nonlinear: • less intuitive (interpretability) • less straightforward (bounds, …) • estimation difficulties StatLearn 2011 Michel Verleysen 27
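The linear/nonlinear trade-off is easy to exhibit: a linear criterion (correlation) can miss a purely nonlinear dependency that a nonlinear criterion such as mutual information still detects. A minimal sketch (not from the talk; the crude histogram-based MI estimate is an illustrative assumption, not the estimator discussed later in the slides):

```python
# Minimal sketch: y = x^2 is uncorrelated with x (for symmetric x) yet fully
# determined by it; correlation ~ 0, but a plug-in histogram estimate of the
# mutual information is clearly positive.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2

print("correlation:", np.corrcoef(x, y)[0, 1])   # close to 0

# Crude histogram-based mutual information estimate (in nats).
pxy, _, _ = np.histogram2d(x, y, bins=30)
pxy /= pxy.sum()                                  # joint probabilities per bin
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
nz = pxy > 0
mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
print("mutual information estimate:", mi)        # clearly positive
```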