  1. The Data Reference Model: A constructive approach to incremental learning
     Mario Rosario Guarracino
     High Performance Computing and Networking Institute, National Research Council, Italy
     October 12, 2006

  2. Acknowledgements
     - prof. Franco Giannessi, U. of Pisa
     - prof. Panos Pardalos, CAO UFL
     - Onur Seref, CAO UFL
     - Claudio Cifarelli, U. of Rome La Sapienza

  3. Agenda
     - Generalized eigenvalue classification
     - Purpose of incremental learning
     - Subset selection algorithm
     - Initial points selection
     - Accuracy results
     - Conclusion and future work

  4. Introduction
     - Supervised learning refers to the capability of a system to learn from examples (the training set).
     - The trained system is able to provide an answer (output) for each new question (input).
     - Supervised means that the desired output for the training set is provided by an external teacher.
     - Binary classification is among the most successful methods for supervised learning.

  5. Applications
     Many applications in biology and medicine:
     - Tissues that are prone to cancer can be detected with high accuracy.
     - New DNA sequences or proteins can be traced back to their origins.
     - Identification of new genes or isoforms of gene expression in large datasets.
     - Analysis and reduction of data dimensionality and principal characteristics for drug design.

  6. Peculiarity of the problem
     - Data produced in biomedical applications will increase exponentially in the coming years.
     - In genomic/proteomic applications, data are often updated, which poses problems for the training step.
     - Publicly available datasets contain gene expression data with tens of thousands of characteristics.
     - Current classification methods can over-fit the problem, providing models that do not generalize well.

  7. Linear discriminant planes
     - Consider a binary classification task with points in two linearly separable sets A and B.
     - There exists a plane that classifies all points in the two sets.
     [Figure: two linearly separable point clouds A and B with a separating plane]
     - There are infinitely many planes that correctly classify the training data.

  8. Best plane
     - To construct the plane farthest from both classes, we examine the convex hull of each set.
     [Figure: convex hulls of A and B; the closest hull points c and d define the bisecting plane]
     - The best plane bisects the closest points in the convex hulls.

  9. SVM classification
     - A different approach, yielding the same solution, is to maximize the margin between support planes.
     - Support planes leave all points of a class on one side.
     [Figure: support planes of A and B pushed apart until they touch the support vectors]
     - Support planes are pushed apart until they "bump" into a small set of data points (the support vectors).

  10. SVM classification
     - Support Vector Machines are the state of the art among existing classification methods.
     - Their robustness is due to the strong foundations of statistical learning theory.
     - Training relies on the optimization of a quadratic convex cost function, for which many methods are available.
       Available software includes SVMlight and LIBSVM.
     - These techniques can be extended to nonlinear discrimination by embedding the data in a nonlinear space using kernel functions.
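
As a point of reference for the software mentioned above, a linear SVM can be trained in a few lines. The sketch below uses scikit-learn's SVC (which wraps LIBSVM); the synthetic two-class data and all parameter values are illustrative and not taken from the talk.

```python
# Minimal linear-SVM sketch using scikit-learn's LIBSVM-based SVC.
# The synthetic data below is purely illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
A = rng.normal(loc=+1.0, scale=0.5, size=(50, 2))   # class +1 points
B = rng.normal(loc=-1.0, scale=0.5, size=(50, 2))   # class -1 points
X = np.vstack([A, B])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```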

  11. A different religion
     - Mangasarian (2004) showed that the binary classification problem can be formulated as a generalized eigenvalue problem (GEPSVM).
     - Find the plane x'w1 = γ1 that is closest to A and farthest from B:

         min_{w,γ≠0}  ||Aw - eγ||² / ||Bw - eγ||²

     [Figure: point clouds A and B with their proximal planes]
     Reference: O. L. Mangasarian and E. W. Wild, Multisurface Proximal Support Vector Classification via Generalized Eigenvalues, Data Mining Institute Tech. Rep. 04-03, June 2004.

  12. GEP technique
     Starting from

       min_{w,γ≠0}  ||Aw - eγ||² / ||Bw - eγ||²,

     let

       G = [A  -e]'[A  -e],   H = [B  -e]'[B  -e],   z = [w' γ]'.

     The previous problem becomes

       min_z  (z'Gz) / (z'Hz),

     the Rayleigh quotient of the generalized eigenvalue problem Gx = λHx.
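
The construction of G and H and the solution of the generalized eigenvalue problem can be written directly in NumPy/SciPy. The following is a minimal sketch of the formulation on this slide; the function name gepsvm_planes and the use of scipy.linalg.eig are the editor's choices, not part of the original material.

```python
# Minimal GEPSVM sketch: build G = [A -e]'[A -e], H = [B -e]'[B -e] and solve
# G x = lambda H x; the eigenvectors of the extreme eigenvalues give the planes.
import numpy as np
from scipy.linalg import eig

def gepsvm_planes(A, B):
    Ae = np.hstack([A, -np.ones((A.shape[0], 1))])   # [A  -e]
    Be = np.hstack([B, -np.ones((B.shape[0], 1))])   # [B  -e]
    G = Ae.T @ Ae
    H = Be.T @ Be
    vals, vecs = eig(G, H)            # generalized eigenproblem G x = lambda H x
    order = np.argsort(vals.real)     # if H is singular, see the regularization slides
    z_min = vecs[:, order[0]].real    # [w1; gamma1]: plane closest to A
    z_max = vecs[:, order[-1]].real   # [wm; gammam]: plane closest to B
    return z_min, z_max
```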

  13. GEP technique
     Conversely, to find the plane closest to B and farthest from A we solve

       min_{w,γ≠0}  ||Bw - eγ||² / ||Aw - eγ||²,

     which has the same eigenvectors as the previous problem and reciprocal eigenvalues.
     We only need to evaluate the eigenvectors related to the minimum and maximum eigenvalues of Gx = λHx.

  14. GEP technique
     Let [w1 γ1] and [wm γm] be the eigenvectors associated with the minimum and maximum eigenvalues of Gx = λHx. Then:
     - each a ∈ A is closer to x'w1 - γ1 = 0 than to x'wm - γm = 0,
     - each b ∈ B is closer to x'wm - γm = 0 than to x'w1 - γ1 = 0.
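
The assignment rule above translates into a few lines of NumPy. This sketch (editor's code) assumes z_min and z_max are the eigenvectors [w1; γ1] and [wm; γm] returned by the previous sketch, and uses the standard point-to-plane distance.

```python
# Assignment rule sketch: a point x goes to the class whose plane
# x'w - gamma = 0 it is nearer to in Euclidean distance.
import numpy as np

def classify(X, z_min, z_max):
    w1, g1 = z_min[:-1], z_min[-1]
    wm, gm = z_max[:-1], z_max[-1]
    d_A = np.abs(X @ w1 - g1) / np.linalg.norm(w1)   # distance to the plane proximal to A
    d_B = np.abs(X @ wm - gm) / np.linalg.norm(wm)   # distance to the plane proximal to B
    return np.where(d_A <= d_B, "A", "B")
```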

  15. Regularization
     - A and B can be rank-deficient.
     - G and H are then rank-deficient as well: each is the product of matrices with at most n+1 independent columns, and a common null space of G and H produces indeterminate (0/0) eigenvalues.
     - Do we need to regularize the problem to obtain a well-posed problem?

  16. A useful theorem
     Consider the GEP Gx = λHx and the transformed problem G*x = λH*x defined by

       G* = τ1·G + δ1·H,   H* = τ2·H + δ2·G,

     for each choice of scalars τ1, τ2, δ1 and δ2 such that the 2x2 matrix

       Ω = | τ1  δ1 |
           | δ2  τ2 |

     is nonsingular. Then G*x = λH*x and Gx = λHx have the same eigenvectors.
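
The theorem can be checked numerically on a toy pencil. The sketch below (editor's code) builds two symmetric positive definite matrices, transforms them with τ1 = τ2 = 1 and δ1 = δ2 = 0.3 (arbitrary values satisfying the nonsingularity condition), and verifies that an eigenvector of the original pencil is still an eigenvector of the transformed one; the toy data are purely illustrative.

```python
# Numerical illustration: an eigenvector x of G x = lambda H x is also an
# eigenvector of the transformed pencil G* = tau1*G + delta1*H,
# H* = tau2*H + delta2*G, provided the 2x2 matrix of the scalars is nonsingular.
import numpy as np
from scipy.linalg import eig

rng = np.random.default_rng(1)
M1 = rng.normal(size=(8, 4))
M2 = rng.normal(size=(9, 4))
G = M1.T @ M1 + np.eye(4)          # symmetric positive definite toy pencil
H = M2.T @ M2 + np.eye(4)

tau1 = tau2 = 1.0
delta1 = delta2 = 0.3
G_star = tau1 * G + delta1 * H
H_star = tau2 * H + delta2 * G

vals, V = eig(G, H)
x = V[:, 0].real                   # one eigenvector of the original pencil
gx, hx = G_star @ x, H_star @ x
lam_star = (gx @ hx) / (hx @ hx)   # eigenvalue of the transformed pencil
print(np.allclose(gx, lam_star * hx))   # expected: True
```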

  17. Linear case
     - In the linear case, the theorem can be applied. For τ1 = τ2 = 1 and δ1 = δ2 = δ, the transformed problem is

         min_{w,γ≠0}  (||Aw - eγ||² + δ||Bw - eγ||²) / (||Bw - eγ||² + δ||Aw - eγ||²).

     - As long as δ ≠ 1, the matrix Ω is nonsingular.
     - In practice, each class of the training set must contain a number of linearly independent points equal to the number of features:
       prob( Ker(G) ∩ Ker(H) ≠ {0} ) = 0.
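
Under the slide's choice τ1 = τ2 = 1 and δ1 = δ2 = δ, the regularized linear problem amounts to solving the pencil (G + δH, H + δG). A minimal sketch, with the function name and the default δ chosen by the editor:

```python
# Regularized linear sketch: the pencil (G + delta*H, H + delta*G) replaces
# (G, H); delta is a small user-chosen regularization parameter (delta != 1).
import numpy as np
from scipy.linalg import eig

def regec_linear(A, B, delta=1e-2):
    Ae = np.hstack([A, -np.ones((A.shape[0], 1))])   # [A  -e]
    Be = np.hstack([B, -np.ones((B.shape[0], 1))])   # [B  -e]
    G = Ae.T @ Ae
    H = Be.T @ Be
    vals, vecs = eig(G + delta * H, H + delta * G)
    order = np.argsort(vals.real)
    return vecs[:, order[0]].real, vecs[:, order[-1]].real   # planes for A and B
```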

  18. Classification accuracy: linear kernel

     Dataset          train   dim   ReGEC   GEPSVM   SVM
     NDC                300     7   87.60    86.70   89.00
     ClevelandHeart     297    13   86.05    81.80   83.60
     PimaIndians        768     8   74.91    73.60   75.70
     GalaxyBright      2462    14   98.24    98.60   98.30

     (train: number of training points; dim: number of features; accuracy in %)
     Accuracy results were obtained with ten-fold cross validation.
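
The ten-fold cross-validation protocol behind these figures can be sketched as follows. The helper is generic: train_fn and predict_fn stand in for any of the three classifiers in the table, and the function names and the scikit-learn splitter are the editor's assumptions.

```python
# Ten-fold cross-validation sketch; returns mean accuracy in percent,
# as reported in the table above.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_accuracy(X, y, train_fn, predict_fn, folds=10, seed=0):
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = []
    for tr, te in skf.split(X, y):
        model = train_fn(X[tr], y[tr])
        scores.append(np.mean(predict_fn(model, X[te]) == y[te]))
    return 100.0 * np.mean(scores)
```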

  19. Nonlinear case
     A standard technique to obtain greater separability between sets is to embed the points into a nonlinear space via kernel functions, such as the Gaussian kernel:

       K(xi, xj) = exp( -||xi - xj||² / σ ).

     Each element of the kernel matrix is

       K(A,C)ij = exp( -||Ai - Cj||² / σ ),   where C = [A' B']'.
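
Computing the rectangular kernel block K(A,C) is straightforward; the sketch below (editor's code) follows the slide's form of the Gaussian kernel, with σ left as a user-chosen parameter.

```python
# Gaussian kernel matrix sketch: K(X, C)[i, j] = exp(-||X_i - C_j||^2 / sigma),
# with C the stacked training set [A; B].
import numpy as np

def gaussian_kernel(X, C, sigma=1.0):
    sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances
    return np.exp(-sq / sigma)

# Example use:  C = np.vstack([A, B]);  K_A = gaussian_kernel(A, C);  K_B = gaussian_kernel(B, C)
```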

  20. Nonlinear case
     Using a Gaussian kernel the problem becomes

       min_{u,γ≠0}  ||K(A,C)u - eγ||² / ||K(B,C)u - eγ||²,

     which produces the proximal surfaces

       K(x,C)u1 - γ1 = 0,   K(x,C)u2 - γ2 = 0.

     The associated GEP involves matrices of the order of the training set and rank at most the number of features.
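
A new point is then assigned to the class whose proximal surface it is closer to. This sketch reuses gaussian_kernel from the previous block and assumes z_A = [u1; γ1] and z_B = [u2; γ2] are the eigenvectors obtained from the kernelized problem; the simplified in-kernel point-to-surface distance is the editor's choice.

```python
# Classify new points with the two proximal surfaces K(x,C)u - gamma = 0.
# gaussian_kernel(...) is the sketch defined after the previous slide.
import numpy as np

def classify_kernel(X_new, C, z_A, z_B, sigma=1.0):
    K = gaussian_kernel(X_new, C, sigma)         # one row K(x, C) per new point
    uA, gA = z_A[:-1], z_A[-1]                   # z_A = [u1; gamma1]
    uB, gB = z_B[:-1], z_B[-1]                   # z_B = [u2; gamma2]
    d_A = np.abs(K @ uA - gA) / np.linalg.norm(uA)
    d_B = np.abs(K @ uB - gB) / np.linalg.norm(uB)
    return np.where(d_A <= d_B, "A", "B")
```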

  21. ReGEC
     - The matrices are deeply rank-deficient and the problem is ill-posed.
     - We propose to generate the two proximal surfaces

         K(x,C)u1 - γ1 = 0,   K(x,C)u2 - γ2 = 0

       by solving the regularized problem

         min_{u,γ≠0}  (||K(A,C)u - eγ||² + δ||K̃_B u - eγ||²) / (||K(B,C)u - eγ||² + δ||K̃_A u - eγ||²),

       where K̃_A and K̃_B are diagonal matrices built from the main diagonals of K(A,C) and K(B,C).
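
Once the kernel blocks and the diagonal perturbation matrices K̃_A and K̃_B are available, assembling and solving the regularized pencil mirrors the linear case. In this sketch the K̃ matrices are taken as inputs, since the slide only states that they are built from the main diagonals of K(A,C) and K(B,C); all names are illustrative.

```python
# ReGEC solve sketch for the regularized quotient on this slide:
# numerator  = ||K(A,C)u - e*gamma||^2 + delta*||KB_tilde u - e*gamma||^2
# denominator= ||K(B,C)u - e*gamma||^2 + delta*||KA_tilde u - e*gamma||^2
import numpy as np
from scipy.linalg import eig

def regec_solve(KA, KB, KA_tilde, KB_tilde, delta=1e-2):
    def quad(M):                       # builds [M -e]'[M -e], so z'quad(M)z = ||Mu - e*gamma||^2
        Me = np.hstack([M, -np.ones((M.shape[0], 1))])
        return Me.T @ Me
    G = quad(KA) + delta * quad(KB_tilde)   # numerator of the regularized quotient
    H = quad(KB) + delta * quad(KA_tilde)   # denominator
    vals, vecs = eig(G, H)
    order = np.argsort(vals.real)
    return vecs[:, order[0]].real, vecs[:, order[-1]].real   # [u; gamma] for the two surfaces
```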
