Many Features, Few Samples: From Cheminformatics to Bioinformatics


  1. Many Features, Few Samples: From cheminformatics to bioinformatics. Kristin P. Bennett, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, and RPI DDASSL Project Members: C. Breneman, M. Embrechts, J. Bi, M. Momma, N. Sukumar, M. Song. Interface 2004, 5/04.

  2. Cheminformatics Problem. Given, for each molecule $i$: a descriptor vector $x_i$ and a bioresponse $y_i$. Construct a function $f$ with $f(x_i) \approx y_i$ to predict the bioresponse. Catch: many descriptors/attributes (600-1000+), very few data points (30-200), and the descriptors are highly correlated.

  3. Electron Density-Derived TAE-Wavelet Descriptors. 1) Surface properties are encoded on the 0.002 e/au³ surface (Breneman, C.M. and Rhem, M. [1997] J. Comp. Chem., Vol. 18(2), pp. 182-197). 2) Histogram or wavelet encodings of the surface properties give the TAE property descriptors: histograms of PIP (local ionization potential) and wavelet coefficients.

  4. PEST Hybrid Property/Shape Descriptors. Surface properties and shape information are encoded into alignment-free descriptors (e.g., PIP vs. segment length); 9 different surface properties are used.

  5. Many Features / Little Data: Issues
  - Overfitting
  - Feature selection
  - Difficult validation
  - Model/parameter selection
  - High model variance
  - Not confident in any one model

  6. DDASSL Learning Methodology: One Method with Three Engines
  Method:
  - Regularized kernel learning engines
  - Bagged feature selection/visualization
  - Bagged final models
  Learning engines (linear and kernel):
  - Support Vector Machine (SVM)
  - Partial Least Squares (PLS)
  - Boosted Latent Analysis (BLA)

  7. Minimize Regularized Loss. Minimize the training error and capacity:
  $\min_f \; \sum_i \mathrm{Loss}(f(x_i), y_i) + P(f)$
  [Figure: two candidate fits, $f_1(x)$ and $f_2(x)$.]
  Overfitting is likely with high-capacity functions; capacity control makes good generalization possible even in very high-dimensional input spaces.
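
A minimal sketch of this idea (not the speaker's code), assuming a linear model $f(x) = w \cdot x$ and a quadratic penalty $P(f) = \lambda \|w\|^2$; the ridge-style closed form shows how capacity control keeps a 700-descriptor model well posed with only 50 samples:

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Squared training error plus a capacity penalty on the weights."""
    residual = X @ w - y
    return np.sum(residual ** 2) + lam * np.sum(w ** 2)

def fit_ridge(X, y, lam):
    """Closed-form minimizer of the regularized loss (normal equations)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy "many features, few samples" setting: 50 molecules, 700 descriptors.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 700))
y = rng.standard_normal(50)
w = fit_ridge(X, y, lam=10.0)   # larger lam => lower capacity, less overfitting
print(regularized_loss(w, X, y, lam=10.0))
```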

  8. Support Vector Regression (SVR). Minimize the regularized empirical error (training error + model complexity):
  $\min_{w, b, \xi, \xi^*} \; C \sum_{i=1}^{l} (\xi_i + \xi_i^*) + \tfrac{1}{2}\|w\|^2$
  with the $\varepsilon$-insensitive loss function $L_\varepsilon(y - f(x)) := \max(0,\, |y - f(x)| - \varepsilon)$.
  Overfitting is avoided by controlling the model complexity $\|w\|$; add kernels to create nonlinear functions.
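
For orientation only, the same ε-insensitive formulation is available off the shelf; a hedged example using scikit-learn's SVR, where the C, epsilon, and kernel settings are placeholders rather than the values used in the study:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 200))            # few samples, many descriptors
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(60)

# C trades training error against ||w||; epsilon sets the insensitive tube.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X, y)
print(model.predict(X[:5]))
```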

  9. Feature Selection via Sparse SVM/LP. Construct a linear $\nu$-SVM using a 1-norm LP:
  $\min_{w, b, z, z^*, \varepsilon} \; \|w\|_1 + C\Big(\sum_{i=1}^{l}(z_i + z_i^*) + \nu\varepsilon\Big)$
  s.t. $(x_i \cdot w + b) - y_i \ge -\varepsilon - z_i$, $\;(x_i \cdot w + b) - y_i \le \varepsilon + z_i^*$, $\;z_i, z_i^*, \varepsilon \ge 0$, $\;i = 1, \ldots, l$.
  Pick the best C for the SVM; keep the descriptors with nonzero coefficients, $|w_i| > 0$.
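
One way such a 1-norm LP could be assembled with scipy's linprog, assuming the formulation above (w and b are split into nonnegative parts); an illustrative sketch, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linprog

def sparse_nu_svr_lp(X, y, C=1.0, nu=0.2):
    """Sketch of the 1-norm LP: min ||w||_1 + C*(sum(z + z*) + nu*eps)."""
    n, d = X.shape
    # Variable layout: [w+ (d), w- (d), b+, b-, z (n), z* (n), eps]
    n_var = 2 * d + 2 + 2 * n + 1
    c = np.concatenate([np.ones(2 * d), [0.0, 0.0],
                        C * np.ones(2 * n), [C * nu]])
    A = np.zeros((2 * n, n_var))
    b_ub = np.zeros(2 * n)
    # y_i - (x_i.w + b) <= eps + z_i
    A[:n, :d] = -X;  A[:n, d:2 * d] = X
    A[:n, 2 * d] = -1.0;  A[:n, 2 * d + 1] = 1.0
    A[:n, 2 * d + 2:2 * d + 2 + n] = -np.eye(n)
    A[:n, -1] = -1.0
    b_ub[:n] = -y
    # (x_i.w + b) - y_i <= eps + z*_i
    A[n:, :d] = X;  A[n:, d:2 * d] = -X
    A[n:, 2 * d] = 1.0;  A[n:, 2 * d + 1] = -1.0
    A[n:, 2 * d + 2 + n:2 * d + 2 + 2 * n] = -np.eye(n)
    A[n:, -1] = -1.0
    b_ub[n:] = y
    res = linprog(c, A_ub=A, b_ub=b_ub, bounds=[(0, None)] * n_var,
                  method="highs")
    return res.x[:d] - res.x[d:2 * d]          # recover w = w+ - w-

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 100))
y = 2 * X[:, 3] - X[:, 7]
w = sparse_nu_svr_lp(X, y, C=10.0, nu=0.2)
print("kept descriptors:", np.flatnonzero(np.abs(w) > 1e-6))
```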

  10. Bagged Variable Selection. [Flow chart: the DATASET is split into a training set and a test set; for each bootstrap sample k, random variables are drawn and a sparse linear SVM is fit on a training/validation split to select descriptors; the reduced data feed a tuned nonlinear SVM, and the resulting predictive models supply the predictions.]
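
A condensed sketch of the bagged-selection loop; here scikit-learn's Lasso stands in for the sparse linear SVM of the previous slide, and the bag count and frequency threshold are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

def bagged_feature_selection(X, y, n_bags=25, alpha=0.05, min_freq=0.5):
    """Count how often each descriptor gets a nonzero weight across bootstraps."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    counts = np.zeros(d)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)                    # bootstrap sample
        model = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx])
        counts += np.abs(model.coef_) > 1e-8
    return np.flatnonzero(counts / n_bags >= min_freq)      # consistently selected

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 300))
y = X[:, 5] - 0.8 * X[:, 42] + 0.1 * rng.standard_normal(60)
print("kept descriptors:", bagged_feature_selection(X, y))
```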

  11. Final Bagged Predictive Model. To achieve better generalization performance: construct a series of nonlinear SVM models and use the average of all models as the final prediction, reducing variance.
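
A minimal sketch of that averaging step, assuming RBF SVRs fit on bootstrap resamples (hyperparameters illustrative):

```python
import numpy as np
from sklearn.svm import SVR

def bagged_svr_predict(X_train, y_train, X_test, n_bags=25, C=10.0):
    """Average the predictions of RBF SVRs fit on bootstrap resamples."""
    rng = np.random.default_rng(0)
    n = X_train.shape[0]
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)
        model = SVR(kernel="rbf", C=C, epsilon=0.1).fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)               # averaging reduces model variance

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 14))               # e.g. 14 selected descriptors
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(80)
print(bagged_svr_predict(X[:60], y[:60], X[60:])[:5])
```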

  12. CACO-2 Data
  - Human intestinal cell line; predicts drug absorption
  - 27 molecules with tested permeability
  - 718 descriptors generated: electronic (TAE), shape/property (PEST), and traditional (MOE)

  13. Molecular Surface Properties. Electronic properties:
  - Electrostatic potential: $EP(r) = \sum_\alpha \frac{Z_\alpha}{|r - R_\alpha|} - \int \frac{\rho(r')\, dr'}{|r - r'|}$
  - Electronic kinetic energy density: $K(r) = -(\psi^* \nabla^2 \psi + \psi \nabla^2 \psi^*)$ and $G(r) = -\nabla\psi^* \cdot \nabla\psi$
  - Electron density gradients: $\nabla\rho$, $\nabla\rho \cdot N$
  - Laplacian of the electron density: $L(r) = -\nabla^2 \rho(r) = K(r) - G(r)$
  - Local average ionization potential: $PIP(r) = \sum_i \rho_i(r)\, \varepsilon_i / \rho(r)$
  - Bare nuclear potential (BNP)
  - Fukui function: $F^+(r) = \rho_{HOMO}(r)$

  14. Visualization of Feature Selection Results. Goal: investigate the relative importance of the selected descriptors and their consistency.

  15. Caco-2: 14 Features (SVM)
  - Each star represents a descriptor; each ray is a separate bootstrap.
  - The area of a star represents the relative importance of that descriptor.
  - Descriptors shaded cyan have a negative effect; unshaded ones have a positive effect.
  [Star plot of the 14 selected descriptors: PEOE.VSA.FNEG, a.don, DRNB10, BNPB31, ABSDRN6, ABSKMIN, KB54, FUKB14, PEOE.VSA.FPPOS, SIKIA, SlogP.VSA0, SMR.VSA2, DRNB00, ANGLEB45]
  - Hydrophobicity: a.don
  - Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45. Large is bad; flat is bad; globular is good.
  - Polarity: PEOE.VSA descriptors; negative partial charge is good.

  16. Bagged SVM (RBF), Caco-2. [Scatter plot of predicted vs. observed values.] Train $R^2_{cv}$ = 0.93; blind test $R^2$ = 0.83. Before feature selection: $R^2$ = 0.66.

  17. New Learning Engine: BLA (Boosted Latent Analysis). Constructs orthogonal latent features and a corresponding predictive model for (sub)differentiable loss functions.
  - Orthogonal boosting of linear functions
  - For least-squares loss, equivalent to PLS
  - Easy to tune
  - Easy-to-implement algorithm: small changes for different loss functions
  - Feature selection for linear models
  - Kernelizable for nonlinear models

  18. Review of AnyBoost (Mason et al. 1999). [Flow chart of the boosting loop:] start from the initial $F = t_0$; at each round form the pseudo-response $u$ (the negative gradient of the loss), use the weak learning algorithm to find a weak hypothesis $t$ with $t'u > 0$, append it to $T = [T \; t_i]$, compute a step size $c_i$ (or back-fit $c$), and update the model by steepest descent: $F = t_0 + \sum_{i=1}^{L} c_i t_i$.
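
To make the schema concrete, a small sketch of an AnyBoost-style iteration assuming squared loss and a single-descriptor weak learner; this illustrates the loop above, not the authors' implementation:

```python
import numpy as np

def anyboost_least_squares(X, y, n_rounds=10):
    """AnyBoost-style loop for squared loss: the weak learner returns the
    single descriptor column most correlated with the pseudo-response."""
    n, d = X.shape
    F = np.zeros(n)                        # start from F = t0 = 0
    picked, steps = [], []
    for _ in range(n_rounds):
        u = y - F                          # negative gradient of 0.5*||y - F||^2
        j = int(np.argmax(np.abs(X.T @ u)))
        t = X[:, j] if X[:, j] @ u > 0 else -X[:, j]   # ensure t'u > 0
        c = (t @ u) / (t @ t)              # exact line search for squared loss
        F = F + c * t                      # steepest-descent update
        picked.append(j); steps.append(c)
    return F, picked, steps

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 120))
y = X[:, 2] - 0.5 * X[:, 9]
F, picked, steps = anyboost_least_squares(X, y, n_rounds=5)
print(picked)
```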

  19. Orthogonal AnyBoost. The same loop, with the weak hypotheses constrained to be orthogonal to the previous ones: find $t_i$ with $t_i' u > 0$ and $t_j' t_i = 0$ for $j = 1, \ldots, i-1$; the model is still $F = t_0 + \sum_{i=1}^{L} c_i t_i$. This yields a subspace or conjugate gradient algorithm.

  20. Boosted Latent Analysis (Momma and Bennett 2004). [Flow chart instantiating orthogonal boosting:] start from the initial $F = t_0$ with $t_0 = e$. Pseudo-response: $u = -\nabla \mathrm{Loss}$; for the squared loss $\mathrm{Loss} = \|y - Tc\|^2$, $u = y - Tc$. Weak learner (linear): maximize $t'u$ via $w = X'u$, $t = Xw$, then deflate $X \leftarrow (I - tt')X$. Collect $T = [T \; t_i]$, compute the step size $c_i$ or back-fit $c$. Predictive model: $F = t_0 + \sum_{i=1}^{L} c_i t_i$, equivalently a linear model $F(x) = x'w + \gamma$.
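
A least-squares instance of this loop (the PLS-equivalent case), written as a rough sketch; the normalization and back-fitting details are one plausible variant rather than the published algorithm:

```python
import numpy as np

def bla_least_squares(X, y, n_latent=3):
    """BLA sketch for squared loss: build latent features t, deflate X so the
    t's stay orthogonal, and back-fit the coefficients c at every step."""
    n, d = X.shape
    Xk = X.copy()
    T = np.zeros((n, 0))
    c = np.zeros(0)
    for _ in range(n_latent):
        u = y - T @ c                      # pseudo-response (negative gradient)
        w = Xk.T @ u                       # linear weak learner: maximize t'u
        t = Xk @ w
        t = t / np.linalg.norm(t)
        T = np.column_stack([T, t])
        c = np.linalg.lstsq(T, y, rcond=None)[0]   # back-fitting step
        Xk = Xk - np.outer(t, t @ Xk)      # deflation: X <- (I - tt')X
    return T, c

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 250))
y = X[:, 1] + 0.5 * X[:, 3]
T, c = bla_least_squares(X, y, n_latent=3)
print(np.round(c, 3))
```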
