(Ramsundar, Kearnes, Riley, …, Google, VSP) How well does this work?

Massively Multitask Networks for Drug Discovery
Bharath Ramsundar*,†,§ (rbharath@stanford.edu), Steven Kearnes*,† (kearnes@stanford.edu), Patrick Riley§ (pfr@google.com), Dale Webster§ (drw@google.com), David Konerding§ (dek@google.com), Vijay Pande† (pande@stanford.edu)
(*Equal contribution, †Stanford University, §Google Inc.) — arXiv:1502.02072 — http://DeepChem.io

Abstract: Massively multitask neural architectures provide a learning framework for drug discovery that synthesizes information from many distinct biological sources. To train these architectures at scale, we gather large amounts of data from public sources to create a dataset of nearly 40 million measurements across more than 200 biological targets. We investigate several aspects of the multitask framework by performing a series of empirical studies and obtain some interesting results: (1) massively multitask networks obtain predictive accuracies significantly better than single-task methods, (2) the predictive power of multitask networks improves as additional tasks and data are added, (3) the total amount of data and the total number of tasks both contribute significantly to multitask improvement, and (4) multitask networks afford limited transferability to tasks not in the training set. Our results underscore the need for greater data sharing and further algorithmic innovation to accelerate the drug discovery process.

From the introduction: After a suitable target has been identified, the first step in the drug discovery process is "hit finding." Given some druggable target, pharmaceutical companies will screen millions of drug-like compounds in an effort to find a few attractive molecules for further optimization. These screens are often automated via robots, but are expensive to perform. Virtual screening attempts to replace or augment the high-throughput screening process by the use of computational methods (Shoichet, 2004). Machine learning methods have frequently been applied to virtual screening by training supervised classifiers to predict interactions between targets and small molecules. There are a variety of challenges that must be overcome to achieve effective virtual screening. Low hit rates in experimental screens (often only 1–2% of screened compounds are active against a given target) result in imbalanced datasets that require special handling for effective learning. For instance, care must be taken to guard against unrealistic divisions between active and inactive compounds ("artificial enrichment") and against information leakage due to strong similarity between active compounds ("analog bias") (Rohrer & Baumann, 2009). Furthermore, the paucity of experimental data means that overfitting is a perennial thorn.
Deep Learning Approaches to Drug Design
Deep Learning Approaches to Drug Design

• Why Deep Learning?
  • Merck Kaggle contest: multi-task deep neural networks that combine small datasets together increase the effective amount of training data
  • Great at extracting features from rich, "natural" data sources (images, video, speech)
• Outstanding questions
  • Can we devise rich, natural featurizations of molecules that can be fed to deep networks?
  • What architectures will provide the best performance?
Multi-task Learning is important

Diagram: output heads for Task 1 … Task n sit on top of a shared representation learned by a deep network (a DBN in the figure) over the input.
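A minimal PyTorch sketch of this idea (module names, sizes, and the NaN-for-missing-label convention are my assumptions, not from the slide): one shared trunk feeds a head per task, and each task contributes loss only for the molecules actually measured in that assay — this is how many small datasets pool into one large training set.

```python
import torch
import torch.nn as nn

n_tasks = 10
trunk = nn.Sequential(nn.Linear(1024, 512), nn.ReLU())      # shared representation
heads = nn.ModuleList([nn.Linear(512, 1) for _ in range(n_tasks)])
bce = nn.BCEWithLogitsLoss()

def multitask_loss(x, y):
    """x: (batch, 1024) fingerprints; y: (batch, n_tasks) labels, NaN = not measured."""
    h = trunk(x)                                  # every task sees every molecule
    loss = torch.zeros(())
    for t, head in enumerate(heads):
        mask = ~torch.isnan(y[:, t])              # only labeled molecules for task t
        if mask.any():
            loss = loss + bce(head(h[mask]).squeeze(-1), y[mask, t])
    return loss
```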
(Ramsundar, Kearnes, Riley, …, Google, VSP) Questions to address

1. Do massively multitask networks provide a performance boost over simple machine learning methods?
2. How does the performance of a multitask network depend on the number of tasks?
3. Do massively multitask networks extract generalizable information about chemical space?
(Ramsundar, Kearnes, Riley, …, Google, VSP) Protocol

• Output layer: softmax nodes, one per dataset (task)
• Hidden layers: 1–4 layers with 50–3000 nodes each, fully connected to the layer below, rectified linear activation
• Input layer: 1024 binary nodes (molecular fingerprint)
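Read literally, the protocol maps onto something like the PyTorch sketch below. The (2000, 100) pyramidal sizes, 1024-bit input, and 0.75 dropout come from these slides; the dropout placement and everything else are assumptions of mine.

```python
import torch.nn as nn

class PyramidalMTNN(nn.Module):
    """1024-bit fingerprint in, pyramidal (2000, 100) ReLU body,
    one two-class softmax head per dataset."""
    def __init__(self, n_tasks: int, dropout: float = 0.75):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(1024, 2000), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(2000, 100), nn.ReLU(), nn.Dropout(dropout),
        )
        self.heads = nn.ModuleList([nn.Linear(100, 2) for _ in range(n_tasks)])

    def forward(self, x):
        h = self.body(x)
        # Per-task logits; apply softmax (or CrossEntropyLoss) per head.
        return [head(h) for head in self.heads]
```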
(Ramsundar, Kearnes, Riley, …, Google, VSP) Results: AUC for various models

Model                                                   | PCBA (n=128) | MUV (n=17) | Tox21 (n=12)
Logistic Regression (LR)                                | .801         | .752       | .738
Random Forest (RF)                                      | .800         | .774       | .790
Single-Task Neural Net (STNN)                           | .795         | .732       | .714
Max{LR, RF, STNN}                                       | .821         | .781       | .790
1-Hidden (1200) Layer Multitask Neural Net              | .852         | .816       | .789
4-Hidden (1000) Layer Multitask Neural Net              | .858         | .836       | .810
Pyramidal (2000, 100) Multitask Neural Net, .75 Dropout | .837         | .802       | .872
Pyramidal (2000, 100) Multitask Neural Net              | .860         | .862       | .824

Table 2. Median 5-fold-average AUCs for various models. The original table's final column (omitted here) reports a sign test vs. the Pyramidal (2000, 100) network (last row) on the 5-fold-average AUCs for all datasets except those in the DUD-E group (we remove DUD-E datasets for reasons discussed in the text). For each model, the sign test estimates the fraction of datasets for which that model is superior to the Pyramidal (2000, 100) network; we use the Wilson score interval to derive a 95% confidence interval for this fraction. Non-neural network methods were trained using scikit-learn (Pedregosa et al., 2011) implementations and basic hyperparameter optimization. We also include results for a hypothetical "best" single-task model (Max{LR, RF, STNN}) to provide a stronger baseline. Details for our cross-validation and training procedures are given in the Appendix.
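The caption's sign-test summary is easy to reproduce. Below is a small NumPy sketch of the Wilson score interval it refers to; the function name and the example counts are mine, not from the paper.

```python
import numpy as np

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion, e.g.
    successes = datasets on which model A beats model B, n = datasets compared."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# e.g. model A wins on 30 of 40 datasets:
lo, hi = wilson_interval(30, 40)   # ≈ (0.60, 0.86)
```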
(Ramsundar, Kearnes, Riley, …, Google, VSP) Multitask DNN does better across the board (Table 2 above)
(Ramsundar, Kearnes, Riley, …, Google, VSP) Room to grow: more data will help
Challenges of interpretation

• Going beyond the Deep Neural Net as a black box
• How can we systematically interpret the features learned by the network?
• Can neurons be matched to functional groups like carboxylates or amines?
• If so, can we argue that the network "thinks" like an organic chemist?
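One concrete way to start on the neuron/functional-group question, sketched with RDKit. Everything here is hypothetical scaffolding: the SMARTS is an approximate carboxylate/carboxylic-acid pattern, and `activations` and `smiles` are assumed to come from your trained network and dataset.

```python
import numpy as np
from rdkit import Chem

carboxylate = Chem.MolFromSmarts("C(=O)[O-,OH]")   # approximate pattern

def has_group(smi: str, pattern) -> bool:
    mol = Chem.MolFromSmiles(smi)
    return mol is not None and mol.HasSubstructMatch(pattern)

def neuron_group_correlation(activations, smiles, pattern):
    """Correlate one hidden unit's activation (per molecule) with a
    binary functional-group indicator; high values suggest the unit
    acts as a detector for that group."""
    flags = np.array([has_group(s, pattern) for s in smiles], dtype=float)
    return np.corrcoef(np.asarray(activations, dtype=float), flags)[0, 1]
```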
(Ramsundar, Kearnes, Riley, …, Google, VSP) Questions to address + some answers

1. Do massively multitask networks provide a performance boost over simple machine learning methods? Yes: a significant boost, with room to grow.
2. How does the performance of a multitask network depend on the number of tasks? Performance has not yet saturated; we are working to gather much more data.
3. Do massively multitask networks extract generalizable information about chemical space? This appears possible: simpler ML models trained on DNN features work well.
But how does this do in the “real world”?
(Kearnes, Goldman, VSP) Steven Kearnes goes to Vertex to test this

Modeling Industrial ADMET Data with Multitask Networks
Steven Kearnes (Stanford University, kearnes@stanford.edu), Brian Goldman (Vertex Pharmaceuticals Inc., brian_goldman@vrtx.com), Vijay Pande (Stanford University, pande@stanford.edu) — arXiv:1606.08793

Abstract: Deep learning methods such as multitask neural networks have recently been applied to ligand-based virtual screening and other drug discovery applications. Using a set of industrial ADMET datasets, we compare neural networks to standard baseline models and analyze multitask learning effects with both random cross-validation and a more relevant temporal validation scheme. We confirm that multitask learning can provide modest benefits over single-task models and show that smaller datasets tend to benefit more than larger datasets from multitask learning. Additionally, we find that adding massive amounts of side information is not guaranteed to improve performance relative to simpler multitask learning. Our results emphasize that multitask effects are highly dataset-dependent, suggesting the use of dataset-specific models to maximize overall performance.

Figure 1: Abstract neural network architecture. The input vector is a binary molecular fingerprint with 1024 bits. All connections between layers are dense, meaning that every unit in layer n is connected to every unit in layer n+1. Each output block is a task-specific two-class softmax layer; dashed lines indicate that models can be either single-task or multitask.
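The paper's key methodological contrast is random cross-validation vs. temporal validation. A minimal pandas sketch of the two schemes (column names like `experiment_date` are hypothetical; the paper's exact split procedure may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def random_split(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    """Standard random holdout: molecules shuffled regardless of when measured."""
    return train_test_split(df, test_size=test_frac, random_state=seed)

def temporal_split(df: pd.DataFrame, test_frac: float = 0.2):
    """Train on older measurements, test on newer ones — closer to how a
    model is actually used prospectively in a pharma project."""
    df = df.sort_values("experiment_date")
    cut = int(len(df) * (1 - test_frac))
    return df.iloc[:cut], df.iloc[cut:]
```

Temporal splits typically give lower (but more honest) performance estimates, since test compounds come from chemical series the model has not seen.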
(Kearnes, Goldman, VSP) How did we do?

Table 1: Proprietary datasets used for model evaluation. Each data point is associated with an experiment date used for temporal validation.

Dataset | Actives | Inactives | Total
A       | 20 247  | 9652      | 29 899
B       | 32 806  | 23 936    | 56 742
C       | 40 136  | 27 703    | 67 839
D       | 24 379  | 2374      | 26 753
E       | 21 722  | 2746      | 24 468
F       | 25 202  | 2034      | 27 236
G       | 2003    | 3226      | 5229
H       | 500     | 526       | 1026
I       | 669     | 344       | 1013
J       | 883     | 399       | 1282
K       | 845     | 357       | 1202
L       | 489     | 164       | 653
M       | 820     | 357       | 1177
N       | 1420    | 740       | 2160
O       | 670     | 1417      | 2087
P       | 3861    | 4107      | 7968
Q       | 1056    | 2658      | 3714
R       | 215     | 2760      | 2975
S       | 987     | 582       | 1569
T       | 1454    | 5935      | 7389
U       | 3998    | 2790      | 6788
V       | 2795    | 896       | 3691
Total   | 187 157 | 95 703    | 282 860
(Kearnes, Goldman, VSP) How did we do?

Figure 2: Box plots showing ΔAUC values between multitask (MTNN or W-MTNN) and STNN models with the same core architecture. Each box plot summarizes 10 ΔAUC values, one for each combination of model architecture (e.g. (2000, 1000)) and task weighting strategy (MTNN or W-MTNN).
(Kearnes, Goldman, VSP) How did we do?

Table 2: Median test set AUC values for random forest, logistic regression, single-task neural network (STNN), and multitask neural network (MTNN) models. W-MTNN models are task-weighted, meaning that the cost for each task is weighted in inverse proportion to the amount of training data for that task. We also report median ΔAUC values and sign test 95% confidence intervals for comparisons between each model and random forest or logistic regression (see Section 2.3). Bold values in the original indicate confidence intervals that do not include 0.5.

Model                           | Median AUC | ΔAUC vs. RF | 95% CI vs. RF | ΔAUC vs. LR | 95% CI vs. LR
Random Forest                   | 0.719      |             |               | −0.016      | (0.20, 0.57)
Logistic Regression             | 0.758      | 0.016       | (0.43, 0.80)  |             |
STNN (1000)                     | 0.748      | 0.043       | (0.47, 0.84)  | 0.007       | (0.39, 0.77)
STNN (4000)                     | 0.761      | 0.052       | (0.52, 0.87)  | 0.015       | (0.52, 0.87)
STNN (2000, 100)                | 0.749      | 0.039       | (0.47, 0.84)  | 0.007       | (0.35, 0.73)
STNN (2000, 1000)               | 0.759      | 0.038       | (0.47, 0.84)  | 0.008       | (0.35, 0.73)
STNN (4000, 2000, 1000, 1000)   | 0.736      | 0.041       | (0.43, 0.80)  | −0.011      | (0.27, 0.65)
MTNN (1000)                     | 0.792      | 0.049       | (0.67, 0.95)  | 0.029       | (0.52, 0.87)
MTNN (4000)                     | 0.768      | 0.057       | (0.61, 0.93)  | 0.031       | (0.57, 0.90)
MTNN (2000, 100)                | 0.797      | 0.044       | (0.61, 0.93)  | 0.023       | (0.43, 0.80)
MTNN (2000, 1000)               | 0.800      | 0.071       | (0.67, 0.95)  | 0.040       | (0.52, 0.87)
MTNN (4000, 2000, 1000, 1000)   | 0.809      | 0.059       | (0.72, 0.97)  | 0.024       | (0.43, 0.80)
W-MTNN (1000)                   | 0.793      | 0.059       | (0.78, 0.99)  | 0.040       | (0.67, 0.95)
W-MTNN (4000)                   | 0.773      | 0.055       | (0.72, 0.97)  | 0.036       | (0.67, 0.95)
W-MTNN (2000, 100)              | 0.769      | 0.050       | (0.61, 0.93)  | 0.022       | (0.43, 0.80)
W-MTNN (2000, 1000)             | 0.821      | 0.077       | (0.78, 0.99)  | 0.041       | (0.67, 0.95)
W-MTNN (4000, 2000, 1000, 1000) | 0.800      | 0.071       | (0.61, 0.93)  | 0.035       | (0.47, 0.84)
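The W-MTNN rows weight each task's cost inversely to its training-set size, so small assays are not drowned out by large ones. A sketch of one way to realize that (the normalization to mean weight 1 is my choice, not necessarily the paper's):

```python
import torch

def task_weights(n_train):
    """n_train: list of training-example counts per task."""
    w = 1.0 / torch.as_tensor(n_train, dtype=torch.float32)
    return w / w.sum() * len(n_train)        # normalize to mean weight 1

def weighted_loss(per_task_losses, weights):
    """per_task_losses: list of scalar loss tensors, one per task."""
    return (torch.stack(per_task_losses) * weights).sum()

# e.g. tasks with 60000, 5000, and 1000 examples:
w = task_weights([60000, 5000, 1000])        # smallest task gets the largest weight
```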
(Kearnes, Goldman, VSP) How did we do?

Table 3: Comparisons between neural network models. Differences between STNN, MTNN, and W-MTNN models with the same core (hidden layer) architecture are reported as median ΔAUC values and sign test 95% confidence intervals. Bold values in the original indicate confidence intervals that do not include 0.5.

Model                           | ΔAUC vs. STNN | 95% CI vs. STNN | ΔAUC vs. MTNN | 95% CI vs. MTNN
MTNN (1000)                     | 0.010         | (0.43, 0.80)    |               |
MTNN (4000)                     | 0.012         | (0.43, 0.80)    |               |
MTNN (2000, 100)                | 0.015         | (0.39, 0.77)    |               |
MTNN (2000, 1000)               | 0.026         | (0.47, 0.84)    |               |
MTNN (4000, 2000, 1000, 1000)   | 0.023         | (0.43, 0.80)    |               |
W-MTNN (1000)                   | 0.017         | (0.52, 0.87)    | 0.002         | (0.37, 0.76)
W-MTNN (4000)                   | 0.007         | (0.47, 0.84)    | 0.002         | (0.35, 0.73)
W-MTNN (2000, 100)              | 0.004         | (0.39, 0.77)    | −0.002        | (0.28, 0.68)
W-MTNN (2000, 1000)             | 0.032         | (0.57, 0.90)    | 0.005         | (0.43, 0.80)
W-MTNN (4000, 2000, 1000, 1000) | 0.033         | (0.43, 0.80)    | 0.004         | (0.43, 0.80)
(Subramanian, Ramsundar, VSP, Denny) Bharath Ramsundar goes to Pfizer (virtually)

Computational Modeling of β-Secretase 1 (BACE-1) Inhibitors Using Ligand-Based Approaches
Govindan Subramanian (VMRD Global Discovery, Zoetis, Kalamazoo, Michigan), Bharath Ramsundar (Department of Computer Science, Stanford University), Vijay Pande (Department of Chemistry, Stanford University), and Rajiah Aldrin Denny (Worldwide Medicinal Chemistry, Pfizer Inc., Cambridge, Massachusetts)

Abstract: The binding affinities (IC50) reported for diverse structural and chemical classes of human β-secretase 1 (BACE-1) inhibitors in the literature were modeled using multiple in silico ligand-based modeling approaches and statistical techniques. The descriptor space encompasses simple binary molecular fingerprints; one- and two-dimensional constitutional, physicochemical, and topological descriptors; and sophisticated three-dimensional molecular fields that require appropriate structural alignments of varied chemical scaffolds in one universal chemical space. The affinities were modeled using qualitative classification or quantitative regression schemes involving linear, nonlinear, and deep neural network (DNN) machine-learning methods used in the scientific literature for quantitative structure–activity relationships (QSAR). In a departure from tradition, ~20% of the chemically diverse data set (205 compounds) was used to train the model, with the remaining ~80% of the structural and chemical analogs used as an external validation set (1273 compounds) and a prospective test set (69 compounds) to ascertain model performance. The machine-learning methods investigated herein performed well in both the qualitative classification (~70% accuracy) and quantitative IC50 predictions (RMSE ~1 log). The success of the 2D descriptor based machine learning approach when compared against the 3D field based technique pursued for hBACE-1 inhibitors provides a strong impetus for systematically applying such methods during the lead identification and optimization efforts for other protein families as well.
(Subramanian, Ramsundar, VSP, Denny) The target: BACE-1

Scheme 1. Depiction of the BACE-1 binding site (left) using the ligand from PDB code 3UQP, along with the protein–ligand interactions (right).
(Subramanian, Ramsundar, VSP, Denny) Workflow

Scheme 2. Workflow for the training, test, and validation set compound alignment used for 3D field-based approaches.
(Subramanian, Ramsundar, VSP, Denny) Results

Table 1. Statistical measures for the various classification models developed in this work. Footnotes:
a. Training set (205): experimentally active (102) with IC50 ≤ 100 nM; experimentally inactive (103).
b. Validation set (1273): experimentally active (551) with IC50 ≤ 100 nM; experimentally inactive (722).
c. Fingerprint and descriptors as implemented within the Canvas modeling suite from Schrödinger.
d. Accuracy = (TP + TN)/total no. of molecules, where TP and TN correspond to true positives and true negatives.
e. Sensitivity = TP/(TP + FN), where FN corresponds to false negatives.
f. Specificity = TN/(TN + FP), where FP corresponds to false positives.
g. Matthews correlation coefficient, MCC = (TP·TN − FP·FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
h. Model developed using the Bayesian approach as implemented within the Canvas modeling suite from Schrödinger.
i. Constitutional, physicochemical, and topological descriptors as implemented within the Canvas modeling suite from Schrödinger.
j. Model developed using recursive partitioning (RP) in the Canvas modeling suite from Schrödinger.
k. Random forest (RF) model developed using the DeepChem package.
l. Deep neural net (DNN) model developed using the DeepChem package.
m. Reverse split (yellow highlight in the original table). Training set (1180): experimentally active (521) with IC50 ≤ 100 nM; experimentally inactive (659). Validation set (295): experimentally active (130) with IC50 ≤ 100 nM; experimentally inactive (165).
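The footnote formulas translate directly into code; a small self-contained sketch (function name is mine):

```python
import math

def classification_stats(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, sensitivity, specificity, and MCC as defined in the footnotes."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)            # true-positive rate
    specificity = tn / (tn + fp)            # true-negative rate
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return accuracy, sensitivity, specificity, mcc
```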
(Subramanian, Ramsundar, VSP, Denny) Results, part 2

Table 2. Statistical parameters for the various quantitative models developed in this work. Footnotes:
a. Training set (205): experimentally active (102) with IC50 ≤ 100 nM; experimentally inactive (103).
b. Validation set (1273): experimentally active (551) with IC50 ≤ 100 nM; experimentally inactive (722).
c. Statistical technique employed. See the Abbreviations section for definitions.
d. Coefficient of the fit of a linear regression.
e. Root-mean-square error.
f. Mean absolute error.
g. Standard error.
h. 1D and 2D constitutional, physicochemical, and topological descriptors as implemented within the Canvas modeling suite from Schrödinger.
i. 3D grid-based field descriptors utilizing hydrophobic, H-bond donor, and acceptor probes as implemented within the Schrödinger and Sybyl modeling packages.
j. Reverse split (yellow highlight in the original table). Training set (1180): experimentally active (521) with IC50 ≤ 100 nM; experimentally inactive (659). Validation set (295): experimentally active (130) with IC50 ≤ 100 nM; experimentally inactive (165).
“One shot” to get it right
Siamese neural network for one-shot learning

Siamese Neural Networks for One-shot Image Recognition
Gregory Koch (gkoch@cs.toronto.edu), Richard Zemel (zemel@cs.toronto.edu), Ruslan Salakhutdinov (rsalakhu@cs.toronto.edu) — Department of Computer Science, University of Toronto, Toronto, Ontario, Canada

Abstract: The process of learning good features for machine learning applications can be very computationally expensive and may prove difficult in cases where little data is available. A prototypical example of this is the one-shot learning setting, in which we must correctly make predictions given only a single example of each new class. In this paper, we explore a method for learning siamese neural networks which employ a unique structure to naturally rank similarity between inputs. Once a network has been tuned, we can then capitalize on powerful discriminative features to generalize the predictive power of the network not just to new data, but to entirely new classes from unknown distributions. Using a convolutional architecture, we are able to achieve strong results which exceed those of other deep learning models with near state-of-the-art performance on one-shot classification tasks.

Figure 1: Example of a 20-way one-shot classification task using the Omniglot dataset. The lone test image is shown above the grid of 20 images representing the possible unseen classes that we can choose for the test image. These 20 images are our only known examples of each of those classes.
Siamese neural network for one-shot learning

Figure 2: Our general strategy. 1) Train a model to discriminate between a collection of same/different pairs. 2) Generalize to evaluate new categories based on learned feature mappings for verification. (Koch, Zemel & Salakhutdinov, ICML 2015)
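A minimal PyTorch sketch of this two-step strategy, loosely following Koch et al.'s design (a shared encoder plus a learned weighting of the componentwise L1 distance); the layer sizes and fingerprint-style input are my assumptions:

```python
import torch
import torch.nn as nn

class Siamese(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(),
                                     nn.Linear(256, 64))       # shared twin encoder
        self.out = nn.Linear(64, 1)   # learned weights on per-dimension distances

    def forward(self, x1, x2):
        h1, h2 = self.encoder(x1), self.encoder(x2)
        return self.out(torch.abs(h1 - h2))   # logit for "same class"

# Step 1: train with nn.BCEWithLogitsLoss on same/different pairs.
# Step 2 (one-shot): score a query against each class's lone example and
# predict the class with the highest "same" logit.
```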
(Kearnes, McCloskey, Berndl, VSP, Riley) Beyond molecular fingerprints: conv nets on graphs

Molecular Graph Convolutions: Moving Beyond Fingerprints
Steven Kearnes (Stanford University, kearnes@stanford.edu), Kevin McCloskey (Google Inc., mccloskey@google.com), Marc Berndl (Google Inc., marcberndl@google.com), Vijay Pande (Stanford University, pande@stanford.edu), Patrick Riley (Google Inc., pfr@google.com)

Abstract: Molecular "fingerprints" encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular graph convolutions, a machine learning architecture for learning from undirected graphs, specifically small molecules. Graph convolutions use a simple encoding of the molecular graph — atoms, bonds, distances, etc. — which allows the model to take greater advantage of information in the graph structure. Although graph convolutions do not outperform all fingerprint-based methods, they (along with other graph-based methods) represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.

Figure 1: Molecular graph for ibuprofen. Unmarked vertices represent carbon atoms, and bond order is indicated by the number of lines used for each edge.

Convolutional Networks on Graphs for Learning Molecular Fingerprints
David Duvenaud†, Dougal Maclaurin†, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams — Harvard University (†equal contribution)

Abstract: We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.

Neural graph fingerprints offer several advantages over fixed fingerprints:
• Predictive performance. By adapting to the data and task at hand, machine-optimized fingerprints can provide substantially better predictive performance than fixed fingerprints; neural graph fingerprints match or beat standard fingerprints on solubility, drug efficacy, and organic photovoltaic efficiency datasets.
• Parsimony. Fixed fingerprints must be extremely large to encode all possible substructures without overlap (e.g., a fingerprint vector of size 43,000 even after removing rarely-occurring features). Differentiable fingerprints can be optimized to encode only relevant features, reducing downstream computation and regularization requirements.
• Interpretability. Standard fingerprints encode each possible fragment completely distinctly, with no notion of similarity between fragments. In contrast, each feature of a neural graph fingerprint can be activated by similar but distinct molecular fragments, making the feature representation more meaningful.
Siamese neural network for one-shot learning

Low Data Drug Discovery with One-shot Learning
Han Altae-Tran (Department of Biological Engineering, MIT), Bharath Ramsundar (Department of Computer Science, Stanford University), Aneesh S. Pappu (Department of Computer Science, Stanford University), and Vijay Pande (Department of Chemistry, Stanford University; pande@stanford.edu) — arXiv:1611.03199. Altae-Tran and Ramsundar contributed equally.

Abstract: Recent advances in machine learning have made significant contributions to drug discovery. Deep neural networks in particular have been demonstrated to provide significant boosts in predictive power when inferring the properties and activities of small-molecule compounds. However, the applicability of these techniques has been limited by the requirement for large amounts of training data. In this work, we demonstrate how one-shot learning can be used to significantly lower the amounts of data required to make meaningful predictions in drug discovery applications. We introduce a new architecture, the residual LSTM embedding, that, when combined with graph convolutional neural networks, significantly improves the ability to learn meaningful distance metrics over small molecules. We open source all models introduced in this work as part of DeepChem, an open-source framework for deep learning in drug discovery.
(Altae-Tran, Ramsundar, Pappu, VSP) A new architecture: Residual LSTM — conv net → graph conv net

Figure: a support set of labeled compounds (e.g. lithium ion, ethanol, caffeine, tosylate, dopamine, styrene oxide) is embedded alongside a new compound; a learned similarity between the query and support embeddings, combined as a weighted sum, produces the prediction for the new compound.
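The figure's prediction rule — label a new compound by attention over the labeled support set — can be sketched as follows (the cosine kernel and embedding handling are assumptions; in the paper the embeddings come from graph convolutional networks):

```python
import torch
import torch.nn.functional as F

def one_shot_predict(query_emb, support_embs, support_labels):
    """query_emb: (d,); support_embs: (n, d); support_labels: (n,) in {0, 1}.
    Returns a soft label in [0, 1] for the new compound."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_embs, dim=1)
    attn = F.softmax(sims, dim=0)           # attention over the support set
    return (attn * support_labels).sum()    # similarity-weighted sum of labels
```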
(Altae-Tran, Ramsundar, Pappu, VSP) One-step iterative refinement of embeddings

Query embedding f(x):                  Support embedding g(S):
Initialize  r = g0(S),  δz = 0         δz = 0
Repeat:
  e = k(f0(x) + δz, r)                 e = k(r + δz, g0(S))        (similarity measures)
  a_j = e_j / Σ_{j=1}^{m} e_j          A_ij = e_ij / Σ_{j=1}^{m} e_ij   (attention mechanism)
  r = aᵀ r                             r = A · g0(S)               (expected feature map)
  δz = LSTM([δz, r])                   δz = LSTM([δz, r])          (generate updates)
Return  f(x) = f0(x) + δz              g(S) = g0(S) + δz           (evolve embeddings)
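A hedged PyTorch sketch of this dual update loop. The embedding size, the cosine kernel k, and the number of refinement steps are my choices; in the paper, f0 and g0 are trained graph convolutional encoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64                          # embedding size (assumption)
lstm_q = nn.LSTMCell(2 * d, d)  # generates query updates δz from [δz, r]
lstm_s = nn.LSTMCell(2 * d, d)  # generates support updates

def iter_ref(f0x, g0S, steps=3):
    """f0x: (1, d) initial query embedding; g0S: (n, d) initial support embeddings."""
    n = g0S.size(0)
    dz_q, hq, cq = torch.zeros(1, d), torch.zeros(1, d), torch.zeros(1, d)
    dz_s, hs, cs = torch.zeros(n, d), torch.zeros(n, d), torch.zeros(n, d)
    for _ in range(steps):
        # similarity measures (cosine kernel) -> attention mechanism
        a = F.softmax(F.cosine_similarity(f0x + dz_q, g0S), dim=0)      # (n,)
        A = F.softmax(F.normalize(g0S + dz_s, dim=1)
                      @ F.normalize(g0S, dim=1).t(), dim=1)             # (n, n)
        # expected feature maps
        r_q = (a.unsqueeze(1) * g0S).sum(0, keepdim=True)               # (1, d)
        r_s = A @ g0S                                                   # (n, d)
        # generate updates
        hq, cq = lstm_q(torch.cat([dz_q, r_q], dim=1), (hq, cq)); dz_q = hq
        hs, cs = lstm_s(torch.cat([dz_s, r_s], dim=1), (hs, cs)); dz_s = hs
    return f0x + dz_q, g0S + dz_s   # evolved embeddings f(x), g(S)
```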
(Altae-Tran, Ramsundar, Pappu, VSP) The major graph operations

• Graph Convolution: for each node v, pick the neighbors u with (u, v) ∈ E, transform the features of each (W·u + b), sum over the N = deg(v) neighbors and v itself, and apply a nonlinearity σ(·) — new features for v from its local topology.
• Graph Pool: new features for v are the max over v and its neighbors — local topology, feature size unchanged.
• Graph Gather: sum over all nodes to produce a single molecular featurization — global topology.

(In the original figure, the nodes being operated on are shown in blue. Each operation is drawn for a single node v, but is performed on all nodes of the graph simultaneously.)
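Plain-tensor sketches of the three operations (simplified assumptions: X is an (n_atoms, d) feature matrix, `neighbors` is a list of index lists from the molecular graph, and W, b are learned parameters):

```python
import torch

def graph_conv(X, neighbors, W, b):
    """New features for each atom v: transform the features of v and its
    neighbors with (W, b), sum over them, apply a nonlinearity (local topology)."""
    out = []
    for v in range(X.size(0)):
        nbrs = X[neighbors[v] + [v]]                  # neighbors and self
        out.append(torch.relu((nbrs @ W + b).sum(0)))
    return torch.stack(out)

def graph_pool(X, neighbors):
    """Max over each atom and its neighbors; feature size unchanged."""
    return torch.stack([X[neighbors[v] + [v]].max(0).values
                        for v in range(X.size(0))])

def graph_gather(X):
    """Sum over all atoms: one fixed-size vector per molecule (global topology)."""
    return X.sum(0)
```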
(Altae-Tran, Ramsundar, Pappu, VSP) Tests of this method

• Tox21: the goal of the Tox21 challenge is to "crowdsource" data analysis by independent researchers.
• SIDER: contains information on marketed medicines and their recorded adverse drug reactions.
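Both benchmarks are available through DeepChem (mentioned earlier via DeepChem.io), where this work's models were open-sourced. A sketch of the MoleculeNet loaders — treat this as approximate, since function signatures and return values can vary between DeepChem releases:

```python
import deepchem as dc

# Each loader returns the task names, the (train, valid, test) splits,
# and any transformers applied during featurization.
tox21_tasks, tox21_datasets, _ = dc.molnet.load_tox21()
sider_tasks, sider_datasets, _ = dc.molnet.load_sider()

train, valid, test = tox21_datasets
print(len(tox21_tasks), train.X.shape)   # 12 Tox21 tasks, featurized molecules
```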
(Altae-Tran, Ramsundar, Pappu, VSP) Significant improvement in AUC over RF

Table 1: Accuracies of models on held-out tasks for Tox21. Numbers reported are medians over test tasks; numbers for each task are averaged over 20 random choices of support sets.

Tox21          | RF (50 trees) | RF (100 trees) | Siamese | AttnLSTM | ResLSTM
10 pos, 10 neg | 0.537         | 0.563          | 0.831   | 0.834    | 0.840
5 pos, 10 neg  | 0.537         | 0.579          | 0.790   | 0.820    | 0.837
1 pos, 10 neg  | 0.537         | 0.584          | 0.710   | 0.687    | 0.757
1 pos, 5 neg   | 0.571         | 0.572          | 0.689   | 0.595    | 0.815
1 pos, 1 neg   | 0.536         | 0.542          | 0.668   | 0.652    | 0.784

Table 2: Accuracies of models on held-out tasks for SIDER. Numbers reported are medians over test tasks; numbers for each task are averaged over 20 random choices of support sets.

SIDER          | RF (50 trees) | RF (100 trees) | Siamese | AttnLSTM | ResLSTM
10 pos, 10 neg | 0.551         | 0.546          | 0.660   | 0.671    | 0.752
5 pos, 10 neg  | 0.534         | 0.541          | 0.674   | 0.671    | 0.750
1 pos, 10 neg  | 0.537         | 0.533          | 0.542   | 0.543    | 0.602
1 pos, 5 neg   | 0.536         | 0.535          | 0.544   | 0.539    | 0.639
1 pos, 1 neg   | 0.504         | 0.501          | 0.506   | 0.505    | 0.623