NETTAB BBCC, November 2010, Naples

Mathematical Models for Feature Selection and their Application in Bioinformatics

Paola Bertolazzi, Giovanni Felici
Istituto di Analisi dei Sistemi ed Informatica, IASI-CNR

Paola Festa
Dipartimento di Matematica e Applicazioni "R.M. Caccioppoli", Università degli Studi di Napoli Federico II
Summary

• Logic Data Mining System: online at dmb.iasi.cnr.it
• Focus on:
  • Formulation of the Feature Selection problem
  • GRASP methods
  • Applications
The Logic Data Mining Flow

RAW DATA -> DISCRETIZATION -> FEATURE SELECTION -> LEARNING

• RAW DATA: samples from the classes; rational data in any format.
• DISCRETIZATION: identify significant thresholds for the values of the variables; generate discrete variables over the resulting intervals.
• FEATURE SELECTION: select few logic variables that appear to have a strong capability of telling one class from the other over the whole sample.
• LEARNING: build logic formulas, using the selected variables, that are able to classify correctly the training data: IF (X & Y) THEN Z.
Feature Selection

• FS is a projection of a set of multidimensional points from their original space to a space of smaller dimension, with little "loss of information" or a large "reduction of noise".
• Information and noise must be defined w.r.t. the objective of the specific application: clustering, classification, synthesis...
• In supervised learning applications, we want to preserve or enhance the relative distances between observations belonging to different groups.
FS as a Combinatorial Problem

When the projection of the points is simply a selection of a subset of the available dimensions, the FS problem has a combinatorial nature. This fact has already been pointed out and exploited in the literature:

• Garey M.R., Johnson D.S. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, 1979.
• Boros E., Hammer P.L., Ibaraki T., Kogan A., Mayoraz E., Muchnik I. An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering, 12 (2), 292-306 (2000).
• Charikar M., Guruswami V., Kumar R., Rajagopalan S., Sahai A. Combinatorial feature selection problems. In Proceedings of FOCS 2000.
• Berretta R., Mendes A., Moscato P. Integer programming models and algorithms for molecular classification of cancer from microarray data. Proceedings of the Twenty-eighth Australasian Computer Science Conference, 38, 361-370 (2005).
Notations and Definitions

We assume that n m-dimensional points are the input data for the FS problem. The points are represented in the rational matrix A; M (resp. N) is the index set of the columns of A (resp. of the rows):

    n = |N|, \quad m = |M|, \quad A \in \mathbb{R}^{n \times m}

An appropriate measure of the information contained in A is given by:

    I(A) = \sum_{i} \sum_{j > i} \sum_{k} (a_{ik} - a_{jk})^2

I(A) is the sum of the quadratic distances between the points in A, directly related to the variance expressed by A, a widely used measure in Statistics and Data Analysis.
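A minimal numpy sketch of this measure (the function name and the toy matrix are illustrative, not from the slides):

```python
import numpy as np

def information(A: np.ndarray) -> float:
    """I(A): sum over all pairs i < j of the squared distance between rows."""
    n = A.shape[0]
    return sum(float(np.sum((A[i] - A[j]) ** 2))
               for i in range(n) for j in range(i + 1, n))

# Toy data: 4 points in R^3
A = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.0, 2.0],
              [3.0, 0.0, 0.0],
              [2.5, 1.0, 0.5]])
print(information(A))
```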
A Simple Optimization Problem

Consider now the projection of A on a subset M' of its dimensions, with |M'| = \beta < m, and let

    x_k = \begin{cases} 1 & \text{if } k \in M' \\ 0 & \text{otherwise} \end{cases}

Then

    I_x(A) = \sum_{i} \sum_{j > i} \sum_{k} (a_{ik} - a_{jk})^2 x_k

represents the portion of information preserved by the projection of the points of A on the dimensions in M'. The simplest optimization problem that can be defined is then:

    \max_x \sum_{i} \sum_{j > i} \sum_{k} (a_{ik} - a_{jk})^2 x_k
    \text{s.t. } \sum_k x_k = \beta, \quad x_k \in \{0,1\} \;\; \forall k
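Note that both the objective and the cardinality constraint of this first model are separable over the columns: column k contributes w_k = \sum_{i<j} (a_{ik} - a_{jk})^2 regardless of which other columns are chosen, so the optimum simply keeps the β columns with the largest w_k. A minimal sketch of this observation (names and data are illustrative):

```python
import numpy as np

def select_columns(A: np.ndarray, beta: int) -> np.ndarray:
    """Exact solution of the simple model: since the objective is linear in x,
    rank the columns by their total pairwise squared difference w_k and keep
    the beta largest."""
    n, m = A.shape
    w = np.zeros(m)
    for i in range(n):
        for j in range(i + 1, n):
            w += (A[i] - A[j]) ** 2          # column-wise contribution of pair (i, j)
    return np.argsort(w)[::-1][:beta]        # indices of the beta best columns

A = np.array([[1.0, 0.0, 2.0, 5.0],
              [0.5, 1.0, 2.0, 5.1],
              [3.0, 0.0, 0.0, 4.9]])
print(select_columns(A, beta=2))
```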
A (Proper) Extension: Maximization of the Infimum Norm

An alternative to the average-based max approach consists in requiring a minimum level of distance γ between each pair of points, and looking for a projection that maximizes such a level:

    \max \gamma
    \text{s.t. } \sum_k (a_{ik} - a_{jk})^2 x_k \ge \gamma \quad \forall i, j, \; i \ne j
    \sum_k x_k = \beta, \quad x_k \in \{0,1\} \;\; \forall k

Relation between the two models. Let h = n(n-1) and let \mathbb{R}^h be the Euclidean space where the point \xi = (\xi_1, \ldots, \xi_h) is defined by

    \xi_l = \sum_k (a_{ik} - a_{jk})^2, \qquad l = i(n-1) + j, \;\; i \ne j

and let \xi_x be the same point computed only over the coordinates k selected by x. The two models then become:

    \max \|\xi_x\|_1   and   \max \|\xi_x\|_{\inf}
    \text{s.t. } \sum_k x_k = \beta, \quad x \in \{0,1\}^m

The first maximizes the sum of the pairwise distances, the second their minimum.
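The max-min model, by contrast, does not separate over the columns, which is what makes it combinatorially hard. A brute-force sketch that makes the model concrete on small instances (exhaustive enumeration is only an illustration, not a method proposed in the slides):

```python
import numpy as np
from itertools import combinations

def max_min_selection(A: np.ndarray, beta: int):
    """max gamma s.t. sum_k (a_ik - a_jk)^2 x_k >= gamma for all i < j and
    sum_k x_k = beta, solved by enumerating all beta-subsets of columns."""
    n, m = A.shape
    # D[p, k] = (a_ik - a_jk)^2 for the p-th pair (i, j), i < j
    D = np.array([(A[i] - A[j]) ** 2
                  for i in range(n) for j in range(i + 1, n)])
    best_gamma, best_subset = -1.0, None
    for subset in combinations(range(m), beta):
        gamma = D[:, list(subset)].sum(axis=1).min()   # distance of the worst pair
        if gamma > best_gamma:
            best_gamma, best_subset = gamma, subset
    return best_subset, best_gamma

A = np.array([[1.0, 0.0, 2.0, 5.0],
              [0.5, 1.0, 2.0, 5.1],
              [3.0, 0.0, 0.0, 4.9]])
print(max_min_selection(A, beta=2))
```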
1) Special Case: Binary Data

Let d_{ij}^k = (a_{ik} - a_{jk})^2. If the data in A are binary, a_{ij} \in \{0,1\}, then

    d_{ij}^k = \begin{cases} 1 & \text{if } a_{ik} \ne a_{jk} \\ 0 & \text{otherwise} \end{cases}

and the FS problem can be rewritten as:

    \max \gamma
    \text{s.t. } \sum_k d_{ij}^k x_k \ge \gamma \quad \forall i, j, \; i \ne j
    \sum_k x_k = \beta, \quad x_k \in \{0,1\}

2) Special Case: Supervised Learning

The row vectors of A are partitioned into two classes: A = \tilde{A} \cup \tilde{B}, with c(i) = A for a_i \in \tilde{A} and c(i) = B for a_i \in \tilde{B}; n_{\tilde{A}} = |\tilde{A}|, n_{\tilde{B}} = |\tilde{B}|. Only the distances between points of different classes are taken into account:

    \max \gamma
    \text{s.t. } \sum_k d_{ij}^k x_k \ge \gamma \quad \forall i, j : c(i) \ne c(j)
    \sum_k x_k = \beta, \quad x_k \in \{0,1\}

But the number of constraints is still very large, as it grows quadratically with n.
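In the supervised binary case the constraint matrix can be generated directly, one XOR row per cross-class pair. A minimal sketch (the function name and the toy matrices are illustrative):

```python
import numpy as np

def cross_class_constraints(XA: np.ndarray, XB: np.ndarray) -> np.ndarray:
    """One constraint row per pair (i in class A, j in class B):
    d_ij^k = 1 iff a_ik != a_jk, i.e. the XOR of the two binary rows."""
    rows = [np.bitwise_xor(a, b) for a in XA for b in XB]
    return np.array(rows)                 # shape: (n_A * n_B, m)

XA = np.array([[1, 1, 1, 0, 1]])
XB = np.array([[1, 0, 1, 1, 1],
               [0, 1, 0, 0, 1]])
D = cross_class_constraints(XA, XB)
print(D)   # each row holds the coefficients of one constraint sum_k d_ij^k x_k >= gamma
```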
An Example

                        features
    samples  class   1  2  3  4  5  6  7  8  9  10
       1       A     1  1  1  0  1  0  0  1  0   -
       2       B     1  1  0  0  0  1  0  1  1   -
       3       B     0  1  1  0  1  1  1  1  0   -

    constraint (1,2):  x_4 + x_5 + x_6 + x_{10} >= 1
    constraint (1,3):  x_1 + x_6 + x_7 >= 1

Solution with minimal size: x_6 = 1, x_i = 0 for i ≠ 6.

The number of constraints is proportional to n_A · n_B. With β = 2 the max value of γ is still 1; we need β = 3 for a solution with γ = 2.
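The values quoted for β and γ can be checked by enumeration, using only the two difference sets {4, 5, 6, 10} and {1, 6, 7} implied by the constraints (a small sketch; it assumes the example has no other constraints):

```python
from itertools import combinations

diffs = [{4, 5, 6, 10}, {1, 6, 7}]   # features on which the two cross-class pairs differ
features = range(1, 11)

def best_gamma(beta: int) -> int:
    """Largest gamma achievable with beta selected features."""
    return max(min(len(d & set(S)) for d in diffs)
               for S in combinations(features, beta))

print(best_gamma(1))   # 1 : x6 alone covers both constraints
print(best_gamma(2))   # 1 : the two sets intersect only in feature 6
print(best_gamma(3))   # 2 : e.g. {6, 4, 1} hits each set twice
```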
Variant 1) A Compact Model

Assume the case of supervised learning, and consider the subset of constraints related to a row i belonging to class A; adding them over the elements of class B:

    \sum_k d_{ij}^k x_k \ge \gamma \quad \forall j : c(j) = B
    \;\; \Longrightarrow \;\; \sum_k \Big( \sum_{j : c(j) = B} d_{ij}^k \Big) x_k \ge n_{\tilde{B}} \, \gamma

Define

    \tilde{d}_i^k = \frac{1}{n_{\tilde{B}}} \sum_{j : c(j) = B} d_{ij}^k, \qquad \tilde{d}_i^k \in [0, 1]

• \tilde{d}_i^k = 1: k separates row i perfectly from the other class;
• \tilde{d}_i^k = 0: k is useless for separation.
A Compact Model (2)

The value \tilde{d}_i^k can be adopted as a direct measure of the importance of column k for row i:

    \max \gamma
    \text{s.t. } \sum_k f_{ik} \, x_k \ge \gamma \quad \forall i
    \sum_k x_k = \beta, \quad x_k \in \{0,1\}

where

    f_{ik} = \begin{cases} 1 & \text{if } \tilde{d}_i^k \ge \alpha \\ 0 & \text{otherwise} \end{cases},
    \qquad
    \tilde{d}_i^k = \frac{1}{n_{\tilde{B}}} \sum_{j : c(j) = B} d_{ij}^k \;\; \text{for } i : c(i) = A,
    \quad
    \tilde{d}_i^k = \frac{1}{n_{\tilde{A}}} \sum_{j : c(j) = A} d_{ij}^k \;\; \text{for } i : c(i) = B

α controls the density of the constraint matrix of the IP problem:
• if α = 0.5, the coefficients of the constraints have value 1 only when the value of k for element i is different from the mode of the values of k over the elements of the other class;
• if f is not rounded, the constraints represent the maximization of the average Hamming distance between the k-th coordinate of element i and the same coordinate of all the elements belonging to the other class.
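A minimal sketch of how d̃ and the rounded coefficients f could be computed on binary data (names, and the convention that rows are samples and columns features, are assumptions):

```python
import numpy as np

def compact_coefficients(XA: np.ndarray, XB: np.ndarray, alpha: float = 0.5):
    """Average cross-class Hamming distances d~ and their alpha-rounding f."""
    dA = np.array([np.mean(XB != a, axis=0) for a in XA])   # rows of class A vs class B
    dB = np.array([np.mean(XA != b, axis=0) for b in XB])   # rows of class B vs class A
    d = np.vstack([dA, dB])          # one row per sample i, one column per feature k
    f = (d >= alpha).astype(int)     # f_ik = 1 iff d~_i^k >= alpha
    return d, f

XA = np.array([[1, 1, 1, 0, 1]])
XB = np.array([[1, 0, 1, 1, 1],
               [0, 1, 0, 0, 1]])
d, f = compact_coefficients(XA, XB)
print(d)   # with alpha = 0.5, f_ik = 1 iff a_ik differs from the mode of k in the other class
print(f)
```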
How to Solve Those Large (and Hard) IPs?

• At optimality, when the dimensions are contained; otherwise, heuristics...

RELEVANT ISSUES
• The quality of the solution depends on the chosen sample as well as on the solution algorithm.
• There are many equivalent solutions for a given problem.
• Cross-validation approach: integrate the solutions obtained on different subsets of the available data (re-sampling).
• Many instances of the same problem must be solved over different input data...

Good heuristics seem to be the right approach: their weakness w.r.t. optimal methods is balanced by data sampling.

Is it better to have MANY GOOD SOLUTIONS or FEW OPTIMAL ONES?
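Since the summary mentions GRASP methods, here is a sketch of how a GRASP heuristic could be wired to the binary max-min model: a greedy randomized construction over a restricted candidate list, followed by a single-swap local search. This illustrates the scheme only, under assumed names and parameters, and is not the authors' implementation:

```python
import random

def grasp_fs(diff_sets, m, beta, iters=100, rcl_size=3, seed=0):
    """diff_sets: for each cross-class pair, the set of features where it differs.
    Returns a beta-subset of {1, ..., m} with a large worst-pair coverage gamma."""
    rng = random.Random(seed)

    def gamma(sol):
        return min(len(d & sol) for d in diff_sets)

    best = None
    for _ in range(iters):
        # Construction: repeatedly pick at random among the rcl_size best features
        sol = set()
        while len(sol) < beta:
            cand = sorted((k for k in range(1, m + 1) if k not in sol),
                          key=lambda k: -gamma(sol | {k}))
            sol.add(rng.choice(cand[:rcl_size]))
        # Local search: apply improving single-feature swaps until none is left
        improved = True
        while improved:
            improved = False
            for k in list(sol):
                for h in range(1, m + 1):
                    if h not in sol and gamma((sol - {k}) | {h}) > gamma(sol):
                        sol = (sol - {k}) | {h}
                        improved = True
        if best is None or gamma(sol) > gamma(best):
            best = set(sol)
    return best, gamma(best)

# The two constraints of the earlier example:
print(grasp_fs([{4, 5, 6, 10}, {1, 6, 7}], m=10, beta=3))
```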