Order parameters and model selection in Machine Learning: model characterization and feature selection
Romaric Gaudel
Advisor: Michèle Sebag; Co-advisor: Antoine Cornuéjols
PhD defense, December 14, 2010
Supervised Machine Learning: Background

Unknown distribution P(x, y) on X × Y.

Objective: find h* minimizing the generalization error
    Err(h) = E_{(x,y)~P}[ ℓ(h(x), y) ]
where ℓ(h(x), y) is the cost of an error on example x.

Given: training examples L = {(x_1, y_1), ..., (x_n, y_n)}, with (x_i, y_i) ~ P(x, y), i = 1, ..., n.

(Figure: decision regions of h*, with h*(x) > 0, h*(x) = 0 and h*(x) < 0.)
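Not part of the original slides: a minimal sketch of the empirical error Err_n(h) used below as an estimate of Err(h). The hypothesis, the 0/1 loss and the toy data are placeholders chosen for illustration only.

```python
import numpy as np

# Minimal sketch (not from the slides): empirical error Err_n(h) with 0/1 loss,
# as an estimate of the generalization error Err(h).
def empirical_error(h, X, y):
    """Err_n(h) = (1/n) * sum_i loss(h(x_i), y_i), here with 0/1 loss."""
    predictions = np.sign([h(x) for x in X])
    return np.mean(predictions != y)

# Toy placeholder hypothesis and data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0])                 # true concept: sign of the first feature
h = lambda x: x[0] + 0.1 * x[1]      # an imperfect linear hypothesis
print(empirical_error(h, X, y))
```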
Supervised Machine Learning (2) (Vapnik-Chervonenkis; Bottou & Bousquet, 08)

Approximation error (a.k.a. bias)
  - The learned hypothesis belongs to H:  h*_H = argmin_{h ∈ H} Err(h)

Estimation error (a.k.a. variance)
  - Err is estimated by the empirical error Err_n(h) = (1/n) Σ_i ℓ(h(x_i), y_i):  h_n = argmin_{h ∈ H} Err_n(h)

Optimization error
  - The learned hypothesis is returned by an optimization algorithm A:  ĥ_n = A(L)

(Figure: h*, h*_H, h_n and ĥ_n, with the approximation, estimation and optimization gaps.)
Focus of the thesis: combinatorial optimization problems hidden in Machine Learning

Relational representation ⇒ combinatorial optimization problem
  - Example: Mutagenesis database

Feature Selection ⇒ combinatorial optimization problem
  - Example: microarray data
Outline
1. Relational Kernels
2. Feature Selection
Relational Learning / Inductive Logic Programming: Position

Relational database
  - X: keys in the database
  - Background knowledge

H: set of logical formulas
  - Expressive language
  - Actual covering test: Constraint Satisfaction Problem (CSP)
CSP within Inductive Logic Programming: consequences of the Phase Transition

Complexity
  - Worst case: NP-hard
  - Average case: "easy" except in the Phase Transition region (Cheeseman et al., 91)

Phase Transition in Inductive Logic Programming
  - Existence (Giordana & Saitta, 00)
  - Impact: learners fail in the Phase Transition region (Botta et al., 03)
Multiple Instance Problems: the missing link between Relational and Propositional Learning

Multiple Instance Problems (MIP) (Dietterich et al., 89)
  - An example: a set of instances
  - An instance: a vector of features
  - Target concept: there exists an instance satisfying a predicate P
        pos(x) ⟺ ∃ I ∈ x, P(I)

Example of MIP: key rings and a locked door
  - A positive key ring contains a key which can unlock the door; a negative key ring does not.
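Not part of the original slides: a minimal Python sketch of the MIP labelling rule pos(x) ⟺ ∃ I ∈ x, P(I), using the key-ring example; the key representation and the predicate are made up for illustration.

```python
# Minimal sketch (hypothetical example, not from the slides) of the MIP rule:
# a key ring (example) is positive iff at least one of its keys (instances)
# satisfies the predicate P ("opens the door").
def positive_example(key_ring, opens_door):
    """pos(x) <=> there exists I in x such that P(I)."""
    return any(opens_door(key) for key in key_ring)

# Toy predicate: a key is identified by its bitting code.
DOOR_CODE = (3, 1, 4, 1)
opens_door = lambda key: key == DOOR_CODE

print(positive_example([(2, 2, 2, 2), (3, 1, 4, 1)], opens_door))  # True
print(positive_example([(2, 2, 2, 2), (5, 5, 5, 5)], opens_door))  # False
```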
Support Vector Machines

A convex optimization problem (dual form):
    max_{α ∈ R^n}  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j ⟨x_i, x_j⟩
    s.t.  Σ_{i=1}^n α_i y_i = 0,   0 ≤ α_i ≤ C, i = 1, ..., n

Kernel trick: replace ⟨x_i, x_j⟩ by K(x_i, x_j).

Kernel-based propositionalization (differs from the RKHS framework):
    given L = {(x_1, y_1), ..., (x_n, y_n)} and a kernel K,
    Φ: x ↦ (K(x_1, x), ..., K(x_n, x))

(Figure: soft-margin separator ĥ_n, with slack values ξ_i = 0, 0 < ξ_i < 1, ξ_i > 1 and level lines ĥ_n(x) = −1, 0, 1.)
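Not part of the original slides: a minimal sketch of the kernel-based propositionalization map Φ described above; the RBF kernel and the data are placeholders.

```python
import numpy as np

# Sketch (placeholder kernel and data) of kernel-based propositionalization:
# each example x is re-described by its kernel values against the training set.
def propositionalize(X_train, X, kernel):
    """Phi(x) = (K(x_1, x), ..., K(x_n, x)) for every x in X."""
    return np.array([[kernel(xi, x) for xi in X_train] for x in X])

# Placeholder RBF kernel on vectors (for illustration only).
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 3))
X_new = rng.normal(size=(2, 3))
print(propositionalize(X_train, X_new, rbf).shape)   # (2, 5): n_new x n_train
```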
SVM and MIP: the averaging kernel for MIP (Gärtner et al., 02)

Given a kernel k on instances,
    K(x, x') = ( Σ_{x_i ∈ x} Σ_{x_j ∈ x'} k(x_i, x_j) ) / ( norm(x) · norm(x') )

Question
  - MIP target concept: an existential property
  - Averaging kernel: an average property
  - Do averaging kernels sidestep the limitations of Relational Learning?
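Not part of the original slides: a short sketch of the averaging kernel on bags of instances. The instance kernel k and the normalization choice norm(x) = sqrt(Σ_{i,j ∈ x} k(x_i, x_j)) are assumptions made here for illustration.

```python
import numpy as np

# Sketch of the averaging (set) kernel: sum of pairwise instance-kernel values,
# divided by a per-bag normalization term.  The linear instance kernel and the
# normalization norm(x) = sqrt(sum_{i,j in x} k(x_i, x_j)) are assumptions.
def instance_kernel(a, b):
    return float(np.dot(a, b))                   # placeholder linear kernel

def raw_set_kernel(bag1, bag2):
    return sum(instance_kernel(a, b) for a in bag1 for b in bag2)

def averaging_kernel(bag1, bag2):
    norm1 = np.sqrt(raw_set_kernel(bag1, bag1))
    norm2 = np.sqrt(raw_set_kernel(bag2, bag2))
    return raw_set_kernel(bag1, bag2) / (norm1 * norm2)

x  = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]   # a bag of two instances
xp = [np.array([0.0, 1.0])]                         # a bag of one instance
print(averaging_kernel(x, xp))
```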
Methodology, inspired from Phase Transition studies

Usual Phase Transition framework
  - Generate data according to control parameters
  - Observe the results
  - Draw a phase diagram: results w.r.t. the order parameters

This study
  - Generalized Multiple Instance Problems
  - Experimental results of averaging-kernel-based propositionalization
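Not part of the original slides: the skeleton of such a phase-diagram study, sweeping two order parameters, generating data for each setting, learning and recording the result. The generator, learner and evaluation are placeholders.

```python
import itertools
import numpy as np

# Skeleton (placeholders only) of a phase-transition study: sweep two order
# parameters, generate data for each setting, learn, and record the error.
def phase_diagram(param1_values, param2_values, generate, learn, evaluate):
    grid = np.zeros((len(param1_values), len(param2_values)))
    for (i, p1), (j, p2) in itertools.product(
            enumerate(param1_values), enumerate(param2_values)):
        train, test = generate(p1, p2)       # data generated from the parameters
        model = learn(train)
        grid[i, j] = evaluate(model, test)   # e.g. test error
    return grid                              # the phase diagram

# Toy placeholders just to make the sketch executable.
toy_generate = lambda p1, p2: (None, None)
toy_learn = lambda train: None
toy_evaluate = lambda model, test: 0.0
print(phase_diagram([1, 2], [1, 2], toy_generate, toy_learn, toy_evaluate))
```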
Outline
1. Relational Kernels
   - Theoretical failure region
   - Lower bound on the generalization error
   - Empirical failure region
2. Feature Selection
Generalized Multiple Instance Problems (Weidmann et al., 03)

  - An example: a set of instances
  - An instance: a vector of features
  - Target concept: a conjunction of predicates P_1, ..., P_m
        pos(x) ⟺ ∃ I_1, ..., I_m ∈ x, ∧_{i=1}^m P_i(I_i)

Example of Generalized MIP
  - A molecule: a set of sub-graphs
  - Bioactivity: involves several sub-graphs

(Figure: a molecule and its decomposition into sub-graphs.)
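Not part of the original slides: a minimal sketch of the generalized-MIP labelling rule, checking that a bag contains instances I_1, ..., I_m satisfying all predicates; the predicates and instances are placeholders.

```python
# Sketch (placeholder predicates/instances) of the generalized MIP rule:
# pos(x) <=> there exist instances I_1, ..., I_m in x with P_i(I_i) for all i.
def positive_example(bag, predicates):
    # Each predicate must be satisfied by at least one instance of the bag
    # (the I_i need not be distinct in this reading of the formula).
    return all(any(P(instance) for instance in bag) for P in predicates)

# Toy predicates on 2-dimensional instances (made up for illustration).
P1 = lambda z: z[0] > 0.8          # "contains sub-structure 1"
P2 = lambda z: z[1] < 0.2          # "contains sub-structure 2"

print(positive_example([(0.9, 0.5), (0.3, 0.1)], [P1, P2]))   # True
print(positive_example([(0.9, 0.5), (0.3, 0.9)], [P1, P2]))   # False
```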
Control Parameters

Instances
  |Σ|   Size of the alphabet Σ, a ∈ Σ
  d     Number of numerical features, I = (a, z), z ∈ [0, 1]^d

Examples
  M+    Number of instances per positive example
  M−    Number of instances per negative example
  m+    Number of instances in a predicate, for a positive example
  m−    Number of instances in a predicate, for a negative example

Concept
  P_m   Number of predicates "missed" by each negative example
  P     Number of predicates
  ε     Radius of each predicate (ε-ball)
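Not part of the original slides: a sketch of an artificial-data generator driven by these control parameters. The concrete sampling choices (uniform numerical features, random predicate centers, how instances are planted in the ε-balls) are assumptions for illustration, not the generator actually used in the thesis.

```python
import numpy as np

# Sketch of a generator driven by the control parameters (the categorical
# letter a in Sigma is ignored here).  Sampling choices are assumptions.
rng = np.random.default_rng(0)

def make_concept(P, d, eps):
    """P predicates, each an eps-ball around a random center in [0, 1]^d."""
    return [(center, eps) for center in rng.uniform(size=(P, d))]

def satisfies(instance, predicate):
    center, eps = predicate
    return np.linalg.norm(instance - center) <= eps

def positive_bag(concept, M_pos, m_pos, d):
    """M_pos instances; for each predicate, m_pos of them lie in its eps-ball."""
    planted = []
    for center, eps in concept:
        offsets = rng.uniform(-eps, eps, size=(m_pos, d)) / np.sqrt(d)
        planted.extend(center + offsets)            # stays inside the eps-ball
    n_random = max(M_pos - len(planted), 0)
    return planted + list(rng.uniform(size=(n_random, d)))

concept = make_concept(P=3, d=5, eps=0.3)
bag = positive_bag(concept, M_pos=10, m_pos=2, d=5)
print(sum(satisfies(I, concept[0]) for I in bag))   # at least m_pos = 2
```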