Improved Applicability Domain Determination of QSAR Models Using Local Mapping: RDN
Natalia Aniceto, Alex Freitas, Andreas Bender, Taravat Ghafourian
PhD student ● University of Kent
BACKGROUND: “Applicability Domain” (AD)
• Establishing boundaries for prediction reliability is arguably as important as demonstrating good predictive performance.
• Unreliable predictions are the cumulative result of data noise and sparseness. [Figure labels: dense vs. sparse regions]
• AD methods proposed so far typically address the data as a whole, often focusing on a single aspect of the data (e.g. similarity to training set, descriptor span, data density).
• Combining different measures that address data locally has shown improved AD characterization: Sahigara et al. 1 and Sheridan 2 (STD + Similarity + Predictions).
Hypothesis: in order to successfully characterize a model’s AD, AD = f ( local density & local reliability )
1. Sahigara et al. Journal of Cheminformatics 2013, 5:27
2. Sheridan. J. Chem. Inf. Model. 2012, 52, 814–823
METHODS: AD = f ( ❶ local density & ❷ local reliability )
[Figure: ideal scenario, Accuracy (AD output measure) vs. chemical space coverage]
❶ Local Density: Density Neighbourhood (dk-NN), Sahigara et al. 1
Rationale:
• Each training instance provides coverage of chemical space according to how densely populated its local vicinity is.
• A radius of coverage D_i is placed on each instance, computed as the average Euclidean Distance (ED) to its neighbours within the average ED to the k-th NN.
• ↑ k : ↑ span of coverage, which allows a scan through the chemical space.
• However, high local density does not imply high reliability.
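The radius rule on this slide can be read in more than one way; the following is a minimal NumPy sketch of one reading. The function name `local_density_radii` is hypothetical, and giving a radius of 0 to an instance that has no neighbour inside the reference distance is an assumption, not something stated on the slide.

```python
import numpy as np

def local_density_radii(X_train, k):
    """Sketch of a dk-NN style radius D_i for each training instance.

    Assumed reading of the slide: the reference distance is the average,
    over all training instances, of the Euclidean distance to the k-th
    nearest neighbour; D_i is the mean distance from instance i to the
    neighbours that fall within that reference distance.
    """
    X = np.asarray(X_train, dtype=float)
    # Pairwise Euclidean distances between training instances.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbour

    # Average ED to the k-th nearest neighbour (reference distance).
    ref = np.sort(dist, axis=1)[:, k - 1].mean()

    within = dist <= ref
    counts = within.sum(axis=1)
    sums = np.where(within, dist, 0.0).sum(axis=1)
    # Assumption: instances with no neighbour inside `ref` get radius 0.
    return np.divide(sums, counts, out=np.zeros(len(X)), where=counts > 0)

radii = local_density_radii([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]], k=1)
```

In this toy example the isolated third point receives no coverage, while the two close points cover each other; increasing `k` raises the reference distance and therefore the span of coverage.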
METHODS: AD = f ( ❶ local density & ❷ local reliability )
❷ Local Reliability: precision & bias
• Precision: deviation within an ensemble of models (STD) (Tetko et al). ↑ 1 – STD : ↑ W
• Bias: agreement with observed response. ↑ agreement : ↑ W
• ↑ W : D_i is less penalized. Higher reliability = larger coverage.
D_i* = D_i × W_i
Tetko et al. J. Chem. Inf. Model. 2008, 48, 1733–1746
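The slide gives D_i* = D_i × W_i but not the formula for W_i, only that W rises with agreement and with 1 – STD. The sketch below therefore uses a hypothetical weight, W_i = agreement_i × (1 – STD_i), with both inputs assumed to lie in [0, 1]. It also assumes that a new instance is inside the AD when it falls within the corrected radius of at least one training instance. The function names are illustrative only.

```python
import numpy as np

def corrected_radii(radii, agreement, std):
    """D_i* = D_i x W_i, with an ASSUMED W_i = agreement_i * (1 - STD_i)."""
    radii = np.asarray(radii, dtype=float)
    w = np.asarray(agreement, dtype=float) * (1.0 - np.asarray(std, dtype=float))
    return radii * w

def in_domain(X_train, radii_star, X_new):
    """True for each new instance lying inside at least one training radius."""
    Xt = np.asarray(X_train, dtype=float)
    Xn = np.asarray(X_new, dtype=float)
    r = np.asarray(radii_star, dtype=float)
    dist = np.sqrt(((Xn[:, None, :] - Xt[None, :, :]) ** 2).sum(axis=-1))
    return (dist <= r[None, :]).any(axis=1)

# Toy example: the second training instance is less reliable, so its
# radius shrinks from 1.0 to 0.4 and covers less of its neighbourhood.
d_star = corrected_radii([2.0, 1.0], agreement=[1.0, 0.5], std=[0.0, 0.2])
inside = in_domain([[0.0, 0.0], [5.0, 0.0]], d_star,
                   [[1.5, 0.0], [5.5, 0.0], [5.3, 0.0]])
```

Any other monotone combination of agreement and 1 – STD would fit the slide equally well; the product is used here only because it is the simplest choice that penalizes D_i less as reliability increases.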
METHODS: Scan through the chemical space
[Figure: training set and external set; instances falling in and out of the AD at k = 1, k = 3, k = 5, k = 7]
By keeping track of new instances entering the AD, and updating the in-AD accuracy, we build a map of reliability across the model’s space.
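The bookkeeping behind this scan can be sketched as follows, assuming that AD membership of the external instances has already been computed for each value of k. The function name `reliability_map` and the input layout (a dictionary of boolean masks keyed by k) are illustrative assumptions.

```python
import numpy as np

def reliability_map(masks_by_k, correct):
    """For each k, report the fraction of data in the AD and the in-AD accuracy.

    masks_by_k: {k: boolean mask, True where an external instance is in the AD}
    correct:    boolean array, True where the model prediction was right
    """
    correct = np.asarray(correct, dtype=bool)
    rows = []
    for k in sorted(masks_by_k):
        inside = np.asarray(masks_by_k[k], dtype=bool)
        n_in = int(inside.sum())
        # Accuracy is undefined while the AD is still empty.
        acc = float(correct[inside].mean()) if n_in else float("nan")
        rows.append((k, n_in / correct.size, acc))
    return rows

masks = {
    1: [True, False, False, False],
    3: [True, True, False, False],
    5: [True, True, True, True],
}
rows = reliability_map(masks, correct=[True, True, False, True])
```

Each row pairs a coverage level with the accuracy observed inside the AD at that level, which is the kind of accuracy vs. % data in AD curve shown in the results slides.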
METHODS
Part #1 ( ❶ “working” dataset )
• Explore the capabilities of the Reliability-Density Neighbourhood (RDN) algorithm
• Characterize mispredictions
Part #2
• Test the RDN algorithm on ❷ benchmark datasets: Ames mutagenicity test and CYP450 inhibition test (1A2), Class = { + , - }
Part #3
• Compare RDN vs ❸ other AD methods
METHODS Part #1
❶ Explore the capabilities of RDN
Test Hypothesis: AD = f ( local density & local reliability )
P-gp dataset, Class = {S, NS}
• Training set N = 659
• Test set N = 195
• Validation set N = 194
Build a decision tree model. Ensemble model (10-fold bagging):
• Predictions
• STD
• Agreement
Build RDN, then Test RDN.
[Figure: Accuracy (AD output measure) vs. chemical space coverage]
❷ Characterizing Mispredictions
• Global Density? Kernel Density Estimation ( KDE )
• Descriptor span? Decision tree descriptor span
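The three ensemble outputs listed on this slide can be sketched from a matrix of per-model predicted probabilities. This is a hedged illustration: the slides do not define agreement numerically, so it is assumed here to be the fraction of ensemble members whose predicted class matches the observed class. The function name and the 0.5 threshold are also assumptions.

```python
import numpy as np

def ensemble_summary(member_probs, y_observed, threshold=0.5):
    """Summarize a bagged ensemble per instance.

    member_probs: array of shape (n_models, n_instances) with each member's
                  predicted probability of the positive class
    y_observed:   observed class labels (0 or 1), one per instance
    """
    P = np.asarray(member_probs, dtype=float)
    y = np.asarray(y_observed, dtype=int)
    mean_prob = P.mean(axis=0)   # ensemble prediction
    std = P.std(axis=0)          # precision: spread across members
    votes = (P >= threshold).astype(int)
    # ASSUMED definition of agreement: share of members matching the label.
    agreement = (votes == y[None, :]).mean(axis=0)
    return mean_prob, std, agreement

mean_prob, std, agreement = ensemble_summary(
    [[0.9, 0.2], [0.7, 0.6]], y_observed=[1, 0]
)
```

With 10-fold bagging, `member_probs` would have 10 rows, one per bagged decision tree.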
RESULTS Part #1 (P-gp)
❶ Explore the capabilities of RDN
[Figure: Accuracy in the AD vs. % data in the AD; panels: RDN, Original dk-NN]
Shrink distances to 1/3 in the beginning (figure labels: ½ D_i, D_i)
[Figure: Accuracy inside AD vs. % data in AD; series: IV set, TE set]
RESULTS Part #1 (P-gp)
Role of bias-precision correction
[Figures: Agreement (to observed) vs. STD, and Accuracy vs. STD; series: TR set, TE set, IV set]
RESULTS Part #1 (P-gp)
❷ Diagnosing mispredictions: Descriptor span?
• IN: N = 133 ( 68% ), Acc = 67.7%
• OUT: N = 62 ( 32% ), Acc = 72.6%
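A descriptor-span check of this kind is commonly implemented as a bounding box over the training descriptors. The sketch below assumes that rule (an instance is IN when every descriptor lies within the training minimum and maximum); the slides do not state the exact rule used, and the function name is hypothetical.

```python
import numpy as np

def within_descriptor_span(X_train, X_new):
    """True where every descriptor of a new instance lies in the training range."""
    Xt = np.asarray(X_train, dtype=float)
    Xn = np.asarray(X_new, dtype=float)
    lo, hi = Xt.min(axis=0), Xt.max(axis=0)
    return ((Xn >= lo) & (Xn <= hi)).all(axis=1)

span_flags = within_descriptor_span(
    [[0.0, 0.0], [2.0, 4.0]],
    [[1.0, 1.0], [3.0, 1.0], [2.0, 4.0]],
)
```

Because this check looks only at the global range of each descriptor, it cannot tell a densely populated region from an empty one inside the box, which is consistent with the IN and OUT groups above showing similar accuracy.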
RESULTS Part #1 (P-gp)
❷ Diagnosing mispredictions: Density in feature space? Kernel Density Estimation (KDE)
[Figure: Accuracy inside the AD vs. % covered data; series: TE set, IV set; axis label: PC1]
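As an illustration of a global density measure, the following is a minimal Gaussian KDE in NumPy. The isotropic kernel, the fixed `bandwidth` parameter and the function name are assumptions; the slides do not say which kernel, bandwidth or feature space (for example principal components) was used.

```python
import numpy as np

def gaussian_kde_scores(X_train, X_new, bandwidth=1.0):
    """Density of each new instance under an isotropic Gaussian KDE."""
    Xt = np.asarray(X_train, dtype=float)
    Xn = np.asarray(X_new, dtype=float)
    d = Xt.shape[1]
    # Squared Euclidean distance from every new point to every training point.
    sq = ((Xn[:, None, :] - Xt[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (d / 2.0)
    return np.exp(-sq / (2.0 * bandwidth ** 2)).mean(axis=1) / norm

scores = gaussian_kde_scores([[0.0]], [[0.0], [3.0]], bandwidth=1.0)
```

Ranking external instances by decreasing score and adding them to the AD in that order gives an accuracy vs. % covered data curve comparable to the one on this slide.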
METHODS Part #2
Test the RDN algorithm on ❷ benchmark datasets
Test Hypothesis: AD = f ( local density & local reliability )
• P-gp dataset, Class = {S, NS}: Training set N = 689, Test set N = 195, Validation set N = 194
• Ames dataset: training 4358; Test #1 and Test #2, N = 1089 + 1090
• CYP450 dataset: training 3743; Test #1 and Test #2, N = 1870 + 1870
Predictions, STD & Agreement from OChem
Build RDN, then Test RDN.
RESULTS Part #2
Test the RDN algorithm on ❷ benchmark datasets
[Figure: Accuracy in AD vs. % data in AD; panels: CYP450 model (Test #1, Test #2), Pgp model (IV set, TE set), Ames model (Test #1, Test #2)]
METHODS Part #3
Compare Reliability-Density Neighbourhood (RDN) vs ❸ other AD methods
Test Hypothesis: AD = f ( local density & local reliability )
• Benchmark datasets: Ames dataset, CYP450 dataset
• AD techniques: RDN vs STD, KDE, dk-NN
RESULTS Part #3
Accuracy vs. % coverage (implicitly, distance-to-model ), Test #1 and Test #2
[Figure: Ames model; panels: RDN, STD, KDE, dk-NN; x-axes: % coverage or # nearest neighbours]
[Figure: CYP450 model; panels: KDE, dk-NN, STD, RDN; x-axes: % coverage or # nearest neighbours]
CONCLUSIONS
• Local density corrected for local reliability (Precision + Bias) is able to successfully sort new instances according to their predictive performance, through the definition of a map that identifies regions according to their probability of containing mispredictions.
• The RDN method performs robustly on new, unseen data (two external datasets show similar profiles of accuracy across chemical space).
• For the optimal establishment of the AD of a QSAR model: RDN + STD, with case-by-case selection of the best candidate.
THANK YOU FOR YOUR ATTENTION!