Many Features, Few Samples: From Cheminformatics to Bioinformatics
Kristin P. Bennett
Department of Mathematical Sciences, Rensselaer Polytechnic Institute
and RPI DDASSL Project Members: C. Breneman, M. Embrechts, J. Bi, M. Momma, N. Sukumar, M. Song
Interface 2004, 5/04
The Cheminformatics Problem
Given for each molecule i:
• Descriptor vector x_i
• Bioresponse y_i
Construct a function f(x_i) ≈ y_i to predict the bioresponse.
The catch:
• many descriptors/attributes (600-1000+)
• very few data points (30-200)
• descriptors are highly correlated
Electron Density-Derived TAE-Wavelet Descriptors
1) Surface properties are encoded on the 0.002 e/au³ isosurface.
   Breneman, C.M. and Rhem, M. [1997] J. Comp. Chem., Vol. 18(2), pp. 182-197.
2) Histogram or wavelet encodings of the surface properties give the TAE property descriptors.
[Figure: histograms and wavelet coefficients of PIP (Local Ionization Potential) on the molecular surface.]
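As a rough illustration of the histogram encoding step (the wavelet encoding is analogous), the sketch below bins a surface property sampled at isosurface points into a fixed-length descriptor. The function name, bin count, and the synthetic data are assumptions for illustration, not the TAE implementation.

```python
import numpy as np

def histogram_descriptor(surface_values, n_bins=16, value_range=None):
    """Encode a surface property (e.g., PIP sampled on the 0.002 e/au^3
    isosurface) as a fixed-length histogram descriptor.

    surface_values : 1-D array of the property evaluated at surface points.
    value_range    : (min, max) shared across molecules so bins align.
    """
    counts, _ = np.histogram(surface_values, bins=n_bins, range=value_range)
    # Normalize so molecules with different surface sampling densities
    # remain comparable.
    return counts / counts.sum()

# Hypothetical usage: one descriptor block per surface property.
pip_values = np.random.default_rng(0).normal(10.0, 2.0, size=5000)
descriptor = histogram_descriptor(pip_values, n_bins=16, value_range=(0.0, 20.0))
print(descriptor.shape)  # (16,)
```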
PEST Hybrid Property/Shape Descriptors
• Surface properties and shape information are encoded into alignment-free descriptors.
• 9 different surface properties are used.
[Figure: PIP vs. segment length.]
Many Features / Little Data: Issues
• Overfitting
• Feature selection
• Difficult validation
• Model/parameter selection
• High model variance
• Not confident in any one model
DDASSL Learning Methodology: One Method with Three Engines
Method:
• Regularized kernel learning engines
• Bagged feature selection/visualization
• Bagged final models
Learning engines (linear and kernel):
• Support Vector Machine (SVM)
• Partial Least Squares (PLS)
• Boosted Latent Analysis (BLA)
Minimize Regularized Loss
Minimize the training error plus a capacity penalty:
    min_f  Σ_i Loss(f(x_i), y_i) + P(f)
• Overfitting is likely with high-capacity functions.
• Capacity control makes good generalization possible even in very high-dimensional input spaces.
[Figure: a low-capacity fit f_1(x) vs. a high-capacity, overfit f_2(x).]
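A minimal sketch of the regularized-loss idea, using squared loss with an ℓ2 penalty (ridge regression) as the capacity term P(f); the function name and the λ value are illustrative choices, not one of the DDASSL engines.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimize sum_i (w.x_i + b - y_i)^2 + lam * ||w||^2 in closed form."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])        # absorb the bias b
    P = lam * np.eye(d + 1)
    P[-1, -1] = 0.0                             # do not penalize the bias
    coef = np.linalg.solve(Xb.T @ Xb + P, Xb.T @ y)
    return coef[:-1], coef[-1]                  # w, b

# More regularization (larger lam) -> lower capacity, less overfitting,
# even with far more features than samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 700))                  # few samples, many features
y = X[:, 0] + 0.1 * rng.normal(size=30)
w, b = ridge_fit(X, y, lam=10.0)
```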
Support Vector Regression (SVR)
Minimize the regularized empirical error (training error + model complexity):
    min_{w, b, ξ, ξ*}  C Σ_{i=1..l} (ξ_i + ξ_i*) + (1/2)||w||²
with the ε-insensitive loss function
    L_ε(y − f(x)) := max(0, |y − f(x)| − ε)
• Overfitting is avoided by controlling the model complexity ||w||.
• Add kernels to create nonlinear functions.
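A hedged sketch of the same objective: the ε-insensitive loss written out in NumPy, with scikit-learn's SVR (which solves this regularized problem, with kernels) used as an off-the-shelf solver. The data and parameter values are assumptions.

```python
import numpy as np
from sklearn.svm import SVR

def eps_insensitive_loss(y, f, eps=0.1):
    """L_eps(y - f(x)) = max(0, |y - f(x)| - eps): errors inside the
    eps-tube cost nothing."""
    return np.maximum(0.0, np.abs(y - f) - eps)

# Hypothetical data: C trades training error against ||w|| (model
# complexity); the RBF kernel supplies the nonlinearity mentioned above.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=50)
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(eps_insensitive_loss(y, model.predict(X)).mean())
```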
Feature Selection via Sparse SVM/LP
Construct a linear ν-SVM using a 1-norm LP:
    min_{w, b, z, z*, ε}  ||w||_1 + C ( Σ_{i=1..l} (z_i + z_i*) + ν ε )
    s.t.  (x_i·w + b) − y_i + z_i ≥ −ε
          (x_i·w + b) − y_i − z_i* ≤ ε
          z_i, z_i* ≥ 0, i = 1,..,l,  ε ≥ 0
• Pick the best C for the SVM.
• Keep the descriptors with nonzero coefficients, |w_i| > 0.
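A sketch of the 1-norm LP above solved with scipy.optimize.linprog, splitting w into positive and negative parts so that ||w||_1 becomes linear; the function name, tolerance, and default C, ν values are assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import linprog

def sparse_lp_svr(X, y, C=1.0, nu=0.5):
    """Sparse (1-norm) nu-SVR as a linear program; descriptors with nonzero
    weight are kept.  Variables: [w+, w-, b, z, z*, eps]."""
    l, d = X.shape
    n_var = 2 * d + 1 + 2 * l + 1
    c = np.concatenate([np.ones(2 * d), [0.0], C * np.ones(2 * l), [C * nu]])

    A = np.zeros((2 * l, n_var))
    b_ub = np.zeros(2 * l)
    for i in range(l):
        # (x_i.w + b) - y_i <= eps + z_i*
        A[i, :d], A[i, d:2 * d], A[i, 2 * d] = X[i], -X[i], 1.0
        A[i, 2 * d + 1 + l + i] = -1.0      # -z_i*
        A[i, -1] = -1.0                     # -eps
        b_ub[i] = y[i]
        # y_i - (x_i.w + b) <= eps + z_i
        j = l + i
        A[j, :d], A[j, d:2 * d], A[j, 2 * d] = -X[i], X[i], -1.0
        A[j, 2 * d + 1 + i] = -1.0          # -z_i
        A[j, -1] = -1.0                     # -eps
        b_ub[j] = -y[i]

    bounds = [(0, None)] * n_var
    bounds[2 * d] = (None, None)            # the bias b is free
    res = linprog(c, A_ub=A, b_ub=b_ub, bounds=bounds, method="highs")
    w = res.x[:d] - res.x[d:2 * d]
    return w, np.flatnonzero(np.abs(w) > 1e-8)

# Tune C (e.g. by validation), then keep the descriptors whose index is
# returned in the second output.
```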
Bagged Variable Selection
[Flowchart] Split the dataset into training and test sets. For each bootstrap sample of the training set: draw a random subset of variables, fit a sparse linear SVM, and validate. Descriptors that survive across the bootstraps form the reduced data, on which tuned nonlinear SVMs produce the predictions.
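A minimal sketch of the bagging loop, with scikit-learn's Lasso standing in for the sparse linear SVM of the previous slide; selection frequency across bootstraps drives which descriptors are kept. Names and thresholds are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso   # stand-in sparse linear learner

def bagged_selection(X, y, n_bags=25, alpha=0.1, seed=0):
    """Fit a sparse linear model on bootstrap samples and count how often
    each descriptor survives; frequently selected descriptors are kept."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros(d)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)          # bootstrap sample
        model = Lasso(alpha=alpha, max_iter=5000).fit(X[idx], y[idx])
        counts += np.abs(model.coef_) > 1e-8
    return counts / n_bags                        # selection frequency

# Hypothetical usage: keep descriptors selected in, say, >50% of the bags,
# then fit nonlinear (kernel) SVMs on the reduced data.
```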
Final Bagged Predictive Model
To achieve better generalization performance:
• construct a series of nonlinear SVM models on bootstrap samples
• use the average of all models as the final prediction, to reduce variance
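A sketch of the final bagged model, assuming scikit-learn's RBF SVR as the nonlinear learner: one model per bootstrap sample of the feature-reduced data, predictions averaged.

```python
import numpy as np
from sklearn.svm import SVR

def bagged_svr_predict(X_train, y_train, X_test, n_bags=25, seed=0, **svr_kw):
    """Train one RBF SVR per bootstrap sample and average the test-set
    predictions to reduce variance."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)
        model = SVR(kernel="rbf", **svr_kw).fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)
```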
CACO-2 Data
• Human intestinal cell line; predicts drug absorption
• 27 molecules with tested permeability
• 718 descriptors generated:
  • Electronic (TAE)
  • Shape/Property (PEST)
  • Traditional (MOE)
Molecular Surface Properties
Electronic properties:
• Electrostatic Potential: EP(r) = Σ_α Z_α / |r − R_α| − ∫ ρ(r′) dr′ / |r − r′|
• Electronic Kinetic Energy Density: K(r) = −(ψ*∇²ψ + ψ∇²ψ*),  G(r) = −∇ψ*·∇ψ
• Electron Density Gradients: ∇ρ·N
• Laplacian of the Electron Density: L(r) = −∇²ρ(r) = K(r) − G(r)
• Local Average Ionization Potential: PIP(r) = Σ_i ρ_i(r)|ε_i| / ρ(r)
• Bare Nuclear Potential (BNP)
• Fukui function: F⁺(r) = ρ_HOMO(r)
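For instance, the local average ionization potential reduces to a weighted sum over occupied orbitals once the per-orbital densities are available at the surface points; the sketch below assumes those densities and the orbital energies are given as arrays (a hypothetical interface, not the TAE code).

```python
import numpy as np

def local_ionization_potential(orbital_densities, orbital_energies):
    """PIP(r) = sum_i rho_i(r)*|eps_i| / rho(r) at each surface point.

    orbital_densities : array (n_orbitals, n_points), rho_i at each point.
    orbital_energies  : array (n_orbitals,), occupied-orbital energies eps_i
                        (their magnitudes are used, so PIP is positive).
    """
    rho = orbital_densities.sum(axis=0)                    # total density
    weighted = (np.abs(orbital_energies)[:, None] * orbital_densities).sum(axis=0)
    return weighted / rho
```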
Visualization of Feature Selection Results
Goal: investigate the relative importance of the selected descriptors and their consistency across bootstraps.
Caco-2 – 14 Features (SVM)
Star plot of the selected descriptors:
• Each star represents a descriptor; each ray is a separate bootstrap.
• The area of a star represents the relative importance of that descriptor.
• Descriptors shaded cyan have a negative effect; unshaded ones have a positive effect.
Selected descriptors: PEOE.VSA.FNEG, a.don, DRNB10, BNPB31, ABSDRN6, ABSKMIN, KB54, FUKB14, PEOE.VSA.FPPOS, SIKIA, SlogP.VSA0, SMR.VSA2, DRNB00, ANGLEB45.
Interpretation:
• Hydrophobicity: a.don
• Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45 (large is bad, flat is bad, globular is good)
• Polarity: PEOE.VSA...: negative partial charge is good
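A rough matplotlib sketch of one such star plot: one polar subplot per descriptor, one ray per bootstrap replicate, cyan fill marking a negative effect. The descriptor names reused below are from the slide; the importance values and sign flags are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

def descriptor_star(ax, importances, negative_effect=False, name=""):
    """Draw one 'star': a polar polygon with one ray per bootstrap replicate;
    ray length (hence star area) reflects that descriptor's importance in
    that replicate.  Cyan fill marks a negative effect on the response."""
    n = len(importances)
    angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
    theta = np.concatenate([angles, angles[:1]])     # close the polygon
    r = np.concatenate([importances, importances[:1]])
    ax.fill(theta, r, facecolor="cyan" if negative_effect else "white",
            edgecolor="black")
    ax.set_title(name, fontsize=8)
    ax.set_xticks([]); ax.set_yticks([])

# Hypothetical usage: one subplot per selected descriptor (signs illustrative).
rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 3, subplot_kw={"projection": "polar"})
for i, (ax, name) in enumerate(zip(axes, ["a.don", "ABSDRN6", "SMR.VSA2"])):
    descriptor_star(ax, rng.uniform(0.2, 1.0, size=10),
                    negative_effect=(i == 1), name=name)
plt.show()
```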
Bagged SVM (RBF) on Caco-2
[Figure: predicted vs. observed permeability scatter plot.]
• Train R²_cv = 0.93
• Blind test R² = 0.83
• Before feature selection: R² = 0.66
New Learning Engine: BLA (Boosted Latent Analysis)
Construct orthogonal latent features and the corresponding predictive model for (sub)differentiable loss functions.
• Orthogonal boosting of linear functions
• For least-squares loss, equivalent to PLS
• Easy to tune
• Easy-to-implement algorithm: small changes for different loss functions
• Feature selection for linear models
• Kernelizable for nonlinear models
Review of AnyBoost (Mason et al. 1999)
[Flowchart] Steepest-descent view of boosting:
• Initialize F = t_0.
• Pseudo response: u = negative gradient of the loss with respect to the current predictions.
• Weak learning algorithm: find a weak hypothesis t with t′u > 0.
• Collect the weak hypotheses: T = [T  t_i].
• Compute the stepsize c_i (or back-fit all of c).
• Predictive model: F = t_0 + Σ_{i=1..L} c_i t_i.
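A minimal sketch of the AnyBoost loop for squared loss with linear weak hypotheses (so the pseudo response is simply the residual y − F); the names and the choice w = X′u are illustrative, not Mason et al.'s pseudocode.

```python
import numpy as np

def anyboost_least_squares(X, y, n_iters=10):
    """AnyBoost-style gradient boosting for squared loss.  The pseudo
    response u is the negative gradient of the loss with respect to the
    current predictions F; the weak hypothesis is t = X w with w = X'u,
    which guarantees t'u >= 0."""
    n, d = X.shape
    F = np.full(n, y.mean())                   # initial constant model t0
    scores, coefs = [], []
    for _ in range(n_iters):
        u = y - F                              # -dLoss/dF for squared loss
        w = X.T @ u                            # weak linear hypothesis
        t = X @ w
        if np.allclose(t, 0):
            break
        c = (t @ u) / (t @ t)                  # stepsize by 1-D least squares
        F = F + c * t
        scores.append(t); coefs.append(c)
    return F, scores, coefs
```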
Orthogonal AnyBoost
Same loop as AnyBoost, but the weak learner is constrained to be orthogonal to the previous hypotheses:
• Find t with t′u > 0 and t′t_j = 0 for j = 1, ..., i−1.
• This gives a subspace or conjugate-gradient-style algorithm.
• Predictive model: F = t_0 + Σ_{i=1..L} c_i t_i.
Boosted Latent Analysis (Momma and Bennett 2004)
[Flowchart] Orthogonal boosting with linear weak learners:
• Initialize F = t_0 = e.
• Pseudo response: u = −∇Loss; for squared loss Loss = ||y − Tc||², u = y − Tc. (A generic loss function can be plugged in.)
• Weak learner (linear): maximize t′u ⇒ w = X′u, t = Xw.
• Deflate X: X ← (I − tt′)X, which keeps the new scores orthogonal.
• Collect the scores T = [T  t_i]; compute the stepsize c_i or back-fit c.
• Predictive model: F = Σ_{i=1..L} c_i t_i, expressible in the original features as F(x) = x′w + γ.
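A sketch of the BLA loop for squared loss, assuming the same setup as the AnyBoost sketch above plus the deflation step X ← (I − tt′)X that keeps each new score orthogonal to its predecessors; with this loss the scores match PLS latent features up to scaling. This is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def bla_least_squares(X, y, n_components=3):
    """Boosted Latent Analysis sketch for squared loss: the AnyBoost loop
    with unit-norm scores and deflation of X after every step."""
    n, d = X.shape
    F = np.full(n, y.mean())                   # initial constant model t0
    Xd = X.copy()
    scores, coefs = [], []
    for _ in range(n_components):
        u = y - F                              # pseudo response (negative gradient)
        w = Xd.T @ u                           # weak linear hypothesis
        t = Xd @ w
        if np.linalg.norm(t) < 1e-12:
            break
        t /= np.linalg.norm(t)                 # unit-norm latent score
        c = t @ u                              # stepsize (least squares)
        F = F + c * t
        Xd = Xd - np.outer(t, t @ Xd)          # deflate: X <- (I - tt')X
        scores.append(t); coefs.append(c)
    return F, np.array(scores).T, np.array(coefs)
```

Because the scores are kept orthogonal, the per-step stepsizes are already jointly optimal for squared loss, so separate back-fitting of c is unnecessary in this case.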