NETTAB BBCC, November 2010, Naples

Mathematical Models for Feature Selection and their Application in Bioinformatics

Paola Bertolazzi, Giovanni Felici
Istituto di Analisi dei Sistemi ed Informatica, IASI-CNR

Paola Festa
Dipartimento di Matematica e Applicazioni "R.M. Caccioppoli", Università degli Studi di Napoli Federico II
Summary

• Logic Data Mining System: online at dmb.iasi.cnr.it
• Focus on:
  • Formulation of the Feature Selection problem
  • GRASP methods
  • Applications
The Logic Data Mining Flow

RAW DATA -> DISCRETIZATION -> FEATURE SELECTION -> LEARNING

• RAW DATA: samples from the classes; rational data in any format.
• DISCRETIZATION: identify significant thresholds for the values of the variables; generate discrete variables over the resulting intervals.
• FEATURE SELECTION: select few logic variables that appear to have a strong capability of telling one class from the other over the whole sample.
• LEARNING: build logic formulas, using the selected variables, that are able to classify correctly the training data: IF (X & Y) THEN Z.
Feature Selection

• FS is a projection of a set of multidimensional points from their original space to a space of smaller dimension, with little "loss of information" or a large "reduction of noise".
• Information and noise must be defined w.r.t. the objective of the specific application: clustering, classification, synthesis...
• In supervised learning applications, we want to preserve or enhance the relative distances between observations belonging to different groups.
FS as a Combinatorial Problem

When the projection of the points is simply a selection of a subset of the available dimensions, the FS problem has a combinatorial nature. This fact has already been pointed out and exploited in the literature:

• Garey M.R., Johnson D.S. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, 1979.
• Boros E., Hammer P.L., Ibaraki T., Kogan A., Mayoraz E., Muchnik I. An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering, 12 (2), 292-306 (2000).
• Charikar M., Guruswami V., Kumar R., Rajagopalan S., Sahai A. Combinatorial feature selection problems. In Proceedings of FOCS 2000.
• Berretta R., Mendes A., Moscato P. Integer programming models and algorithms for molecular classification of cancer from microarray data. Proceedings of the Twenty-eighth Australasian Computer Science Conference, 38, 361-370 (2005).
Notations and Definitions

We assume that n m-dimensional points are the input data for the FS problem. The points are represented in the rational matrix A; M (resp. N) is the index set of the columns of A (resp. of the rows):

    n = |N|, \quad m = |M|, \quad A \in \mathbb{R}^{n \times m}

An appropriate measure of the information contained in A is given by:

    I(A) = \sum_{i} \sum_{j > i} \sum_{k} (a_{ik} - a_{jk})^2

I(A) is the sum of the quadratic distances between the points in A, directly related to the variance expressed by A, a widely used measure in Statistics and Data Analysis.
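A minimal numpy sketch of this measure (the function name and the toy matrix are illustrative, not from the slides):

```python
import numpy as np

def information(A: np.ndarray) -> float:
    """I(A): sum over all pairs i < j of the squared distance between rows."""
    n = A.shape[0]
    return sum(float(np.sum((A[i] - A[j]) ** 2))
               for i in range(n) for j in range(i + 1, n))

# Toy data: 4 points in R^3
A = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.0, 2.0],
              [3.0, 0.0, 0.0],
              [2.5, 1.0, 0.5]])
print(information(A))
```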
A Simple Optimization Problem

Consider now the projection of A on a subset M' of its dimensions, with |M'| = \beta < m, and let

    x_k = \begin{cases} 1 & \text{if } k \in M' \\ 0 & \text{otherwise} \end{cases}

Then

    I_x(A) = \sum_{i} \sum_{j > i} \sum_{k} (a_{ik} - a_{jk})^2 x_k

represents the portion of information preserved by the projection of the points of A on the dimensions in M'. The simplest optimization problem that can be defined is then:

    \max_x \sum_{i} \sum_{j > i} \sum_{k} (a_{ik} - a_{jk})^2 x_k
    \text{s.t. } \sum_k x_k = \beta, \quad x_k \in \{0,1\} \;\; \forall k
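Note that both the objective and the cardinality constraint of this first model are separable over the columns: column k contributes w_k = \sum_{i<j} (a_{ik} - a_{jk})^2 regardless of which other columns are chosen, so the optimum simply keeps the β columns with the largest w_k. A minimal sketch of this observation (names and data are illustrative):

```python
import numpy as np

def select_columns(A: np.ndarray, beta: int) -> np.ndarray:
    """Exact solution of the simple model: since the objective is linear in x,
    rank the columns by their total pairwise squared difference w_k and keep
    the beta largest."""
    n, m = A.shape
    w = np.zeros(m)
    for i in range(n):
        for j in range(i + 1, n):
            w += (A[i] - A[j]) ** 2          # column-wise contribution of pair (i, j)
    return np.argsort(w)[::-1][:beta]        # indices of the beta best columns

A = np.array([[1.0, 0.0, 2.0, 5.0],
              [0.5, 1.0, 2.0, 5.1],
              [3.0, 0.0, 0.0, 4.9]])
print(select_columns(A, beta=2))
```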
A (Proper) Extension: Maximization of the Infimum Norm

An alternative to the average-based max approach consists in requiring a minimum level of distance γ between each pair of points, and looking for a projection that maximizes such a level:

    \max \gamma
    \text{s.t. } \sum_k (a_{ik} - a_{jk})^2 x_k \ge \gamma \quad \forall i, j, \; i \ne j
    \sum_k x_k = \beta, \quad x_k \in \{0,1\} \;\; \forall k

Relation between the two models. Let h = n(n-1) and let \mathbb{R}^h be the Euclidean space where the point \xi = (\xi_1, \ldots, \xi_h) is defined by

    \xi_l = \sum_k (a_{ik} - a_{jk})^2, \qquad l = i(n-1) + j, \;\; i \ne j

and let \xi_x be the same point computed only over the coordinates k selected by x. The two models then become:

    \max \|\xi_x\|_1   and   \max \|\xi_x\|_{\inf}
    \text{s.t. } \sum_k x_k = \beta, \quad x \in \{0,1\}^m

The first maximizes the sum of the pairwise distances, the second their minimum.
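The max-min model, by contrast, does not separate over the columns, which is what makes it combinatorially hard. A brute-force sketch that makes the model concrete on small instances (exhaustive enumeration is only an illustration, not a method proposed in the slides):

```python
import numpy as np
from itertools import combinations

def max_min_selection(A: np.ndarray, beta: int):
    """max gamma s.t. sum_k (a_ik - a_jk)^2 x_k >= gamma for all i < j and
    sum_k x_k = beta, solved by enumerating all beta-subsets of columns."""
    n, m = A.shape
    # D[p, k] = (a_ik - a_jk)^2 for the p-th pair (i, j), i < j
    D = np.array([(A[i] - A[j]) ** 2
                  for i in range(n) for j in range(i + 1, n)])
    best_gamma, best_subset = -1.0, None
    for subset in combinations(range(m), beta):
        gamma = D[:, list(subset)].sum(axis=1).min()   # distance of the worst pair
        if gamma > best_gamma:
            best_gamma, best_subset = gamma, subset
    return best_subset, best_gamma

A = np.array([[1.0, 0.0, 2.0, 5.0],
              [0.5, 1.0, 2.0, 5.1],
              [3.0, 0.0, 0.0, 4.9]])
print(max_min_selection(A, beta=2))
```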
1) Special Case: Binary Data

Let d_{ij}^k = (a_{ik} - a_{jk})^2. If the data in A are binary, a_{ij} \in \{0,1\}, then

    d_{ij}^k = \begin{cases} 1 & \text{if } a_{ik} \ne a_{jk} \\ 0 & \text{otherwise} \end{cases}

and the FS problem can be rewritten as:

    \max \gamma
    \text{s.t. } \sum_k d_{ij}^k x_k \ge \gamma \quad \forall i, j, \; i \ne j
    \sum_k x_k = \beta, \quad x_k \in \{0,1\}

2) Special Case: Supervised Learning

The row vectors of A are partitioned into two classes: A = \tilde{A} \cup \tilde{B}, with c(i) = A for a_i \in \tilde{A} and c(i) = B for a_i \in \tilde{B}; n_{\tilde{A}} = |\tilde{A}|, n_{\tilde{B}} = |\tilde{B}|. Only the distances between points of different classes are taken into account:

    \max \gamma
    \text{s.t. } \sum_k d_{ij}^k x_k \ge \gamma \quad \forall i, j : c(i) \ne c(j)
    \sum_k x_k = \beta, \quad x_k \in \{0,1\}

But the number of constraints is still very large, as it grows quadratically with n.
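In the supervised binary case the constraint matrix can be generated directly, one XOR row per cross-class pair. A minimal sketch (the function name and the toy matrices are illustrative):

```python
import numpy as np

def cross_class_constraints(XA: np.ndarray, XB: np.ndarray) -> np.ndarray:
    """One constraint row per pair (i in class A, j in class B):
    d_ij^k = 1 iff a_ik != a_jk, i.e. the XOR of the two binary rows."""
    rows = [np.bitwise_xor(a, b) for a in XA for b in XB]
    return np.array(rows)                 # shape: (n_A * n_B, m)

XA = np.array([[1, 1, 1, 0, 1]])
XB = np.array([[1, 0, 1, 1, 1],
               [0, 1, 0, 0, 1]])
D = cross_class_constraints(XA, XB)
print(D)   # each row holds the coefficients of one constraint sum_k d_ij^k x_k >= gamma
```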
An Example

                        features
    samples  class   1  2  3  4  5  6  7  8  9  10
       1       A     1  1  1  0  1  0  0  1  0   -
       2       B     1  1  0  0  0  1  0  1  1   -
       3       B     0  1  1  0  1  1  1  1  0   -

    constraint (1,2):  x_4 + x_5 + x_6 + x_{10} >= 1
    constraint (1,3):  x_1 + x_6 + x_7 >= 1

Solution with minimal size: x_6 = 1, x_i = 0 for i ≠ 6.

The number of constraints is proportional to n_A · n_B. With β = 2 the max value of γ is still 1; we need β = 3 for a solution with γ = 2.
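The values quoted for β and γ can be checked by enumeration, using only the two difference sets {4, 5, 6, 10} and {1, 6, 7} implied by the constraints (a small sketch; it assumes the example has no other constraints):

```python
from itertools import combinations

diffs = [{4, 5, 6, 10}, {1, 6, 7}]   # features on which the two cross-class pairs differ
features = range(1, 11)

def best_gamma(beta: int) -> int:
    """Largest gamma achievable with beta selected features."""
    return max(min(len(d & set(S)) for d in diffs)
               for S in combinations(features, beta))

print(best_gamma(1))   # 1 : x6 alone covers both constraints
print(best_gamma(2))   # 1 : the two sets intersect only in feature 6
print(best_gamma(3))   # 2 : e.g. {6, 4, 1} hits each set twice
```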
Variant 1) A Compact Model

Assume the case of supervised learning, and consider the subset of constraints related to a row i belonging to class A; adding them over the elements of class B:

    \sum_k d_{ij}^k x_k \ge \gamma \quad \forall j : c(j) = B
    \;\; \Longrightarrow \;\; \sum_k \Big( \sum_{j : c(j) = B} d_{ij}^k \Big) x_k \ge n_{\tilde{B}} \, \gamma

Define

    \tilde{d}_i^k = \frac{1}{n_{\tilde{B}}} \sum_{j : c(j) = B} d_{ij}^k, \qquad \tilde{d}_i^k \in [0, 1]

• \tilde{d}_i^k = 1: k separates row i perfectly from the other class;
• \tilde{d}_i^k = 0: k is useless for separation.
A Compact Model (2)

The value \tilde{d}_i^k can be adopted as a direct measure of the importance of column k for row i:

    \max \gamma
    \text{s.t. } \sum_k f_{ik} \, x_k \ge \gamma \quad \forall i
    \sum_k x_k = \beta, \quad x_k \in \{0,1\}

where

    f_{ik} = \begin{cases} 1 & \text{if } \tilde{d}_i^k \ge \alpha \\ 0 & \text{otherwise} \end{cases},
    \qquad
    \tilde{d}_i^k = \frac{1}{n_{\tilde{B}}} \sum_{j : c(j) = B} d_{ij}^k \;\; \text{for } i : c(i) = A,
    \quad
    \tilde{d}_i^k = \frac{1}{n_{\tilde{A}}} \sum_{j : c(j) = A} d_{ij}^k \;\; \text{for } i : c(i) = B

α controls the density of the constraint matrix of the IP problem:
• if α = 0.5, the coefficients of the constraints have value 1 only when the value of k for element i is different from the mode of the values of k over the elements of the other class;
• if f is not rounded, the constraints represent the maximization of the average Hamming distance between the k-th coordinate of element i and the same coordinate of all the elements belonging to the other class.
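A minimal sketch of how d̃ and the rounded coefficients f could be computed on binary data (names, and the convention that rows are samples and columns features, are assumptions):

```python
import numpy as np

def compact_coefficients(XA: np.ndarray, XB: np.ndarray, alpha: float = 0.5):
    """Average cross-class Hamming distances d~ and their alpha-rounding f."""
    dA = np.array([np.mean(XB != a, axis=0) for a in XA])   # rows of class A vs class B
    dB = np.array([np.mean(XA != b, axis=0) for b in XB])   # rows of class B vs class A
    d = np.vstack([dA, dB])          # one row per sample i, one column per feature k
    f = (d >= alpha).astype(int)     # f_ik = 1 iff d~_i^k >= alpha
    return d, f

XA = np.array([[1, 1, 1, 0, 1]])
XB = np.array([[1, 0, 1, 1, 1],
               [0, 1, 0, 0, 1]])
d, f = compact_coefficients(XA, XB)
print(d)   # with alpha = 0.5, f_ik = 1 iff a_ik differs from the mode of k in the other class
print(f)
```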
How to Solve Those Large (and Hard) IPs?

• At optimality, when the dimensions are contained; otherwise, heuristics...

RELEVANT ISSUES
• The quality of the solution depends on the chosen sample as well as on the solution algorithm.
• There are many equivalent solutions for a given problem.
• Cross-validation approach: integrate the solutions obtained on different subsets of the available data (re-sampling).
• Many instances of the same problem must be solved over different input data...

Good heuristics seem to be the right approach: their weakness w.r.t. optimal methods is balanced by data sampling.

Is it better to have MANY GOOD SOLUTIONS or FEW OPTIMAL ONES?
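Since the summary mentions GRASP methods, here is a sketch of how a GRASP heuristic could be wired to the binary max-min model: a greedy randomized construction over a restricted candidate list, followed by a single-swap local search. This illustrates the scheme only, under assumed names and parameters, and is not the authors' implementation:

```python
import random

def grasp_fs(diff_sets, m, beta, iters=100, rcl_size=3, seed=0):
    """diff_sets: for each cross-class pair, the set of features where it differs.
    Returns a beta-subset of {1, ..., m} with a large worst-pair coverage gamma."""
    rng = random.Random(seed)

    def gamma(sol):
        return min(len(d & sol) for d in diff_sets)

    best = None
    for _ in range(iters):
        # Construction: repeatedly pick at random among the rcl_size best features
        sol = set()
        while len(sol) < beta:
            cand = sorted((k for k in range(1, m + 1) if k not in sol),
                          key=lambda k: -gamma(sol | {k}))
            sol.add(rng.choice(cand[:rcl_size]))
        # Local search: apply improving single-feature swaps until none is left
        improved = True
        while improved:
            improved = False
            for k in list(sol):
                for h in range(1, m + 1):
                    if h not in sol and gamma((sol - {k}) | {h}) > gamma(sol):
                        sol = (sol - {k}) | {h}
                        improved = True
        if best is None or gamma(sol) > gamma(best):
            best = set(sol)
    return best, gamma(best)

# The two constraints of the earlier example:
print(grasp_fs([{4, 5, 6, 10}, {1, 6, 7}], m=10, beta=3))
```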