Open Source Drug Discovery (OSDD)
Connecting Minds & Machines
A CSIR-led Team India consortium with global partnership for affordable healthcare for all
Anshu Bhardwaj, Scientist & Community Builder, OSDD, CSIR India
National Knowledge Network “First Annual Workshop”: “The e-Infrastructure of India”, 31st Oct – 1st Nov 2012
OSDD Focus: Tropical Neglected Diseases
First Disease Target: Tuberculosis; now extended to Malaria
• Tuberculosis (TB) is one of the leading causes of fatality, ranking second only to HIV as the killer infectious disease of adults worldwide
• At least one person in the world is newly infected with TB bacilli every second
• Over 1000 deaths a day or 3 deaths every 2 mins
• No new TB drugs in the past 50 years
[Map: New TB cases 2010] Source: http://www.globalhealthfacts.org/data/topic/map.aspx?ind=12
Research Spending Per New Drug
Company                     Drugs approved   R&D spending per drug ($Mil)   Total R&D spending 1997-2011 ($Mil)
AstraZeneca                 5                11,790.93                      58,955
GlaxoSmithKline             10               8,170.81                       81,708
Sanofi                      8                7,909.26                       63,274
Roche Holding AG            11               7,803.77                       85,841
Pfizer Inc.                 14               7,727.03                       108,178
Johnson & Johnson           15               5,885.65                       88,285
Eli Lilly & Co.             11               4,577.04                       50,347
Abbott Laboratories         8                4,496.21                       35,970
Merck & Co Inc              16               4,209.99                       67,360
Bristol-Myers Squibb Co.    11               4,152.26                       45,675
Novartis AG                 21               3,983.13                       83,646
Amgen Inc.                  9                3,692.14                       33,229
Slate’s Bad Math: $55 million on each new drug
Source: http://www.forbes.com/sites/matthewherper/2012/02/10/the-truly-staggering-cost-of-inventing-new-drugs/
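The per-drug column in the table above is the company's total 1997-2011 R&D spending divided by its number of approved drugs. A minimal sketch that recomputes it for three rows copied from the table; the function name is illustrative, and small differences from the printed column come from rounding in the totals:

```python
# Rows copied from the table: (company, drugs approved, total R&D 1997-2011 in $ million)
ROWS = [
    ("AstraZeneca", 5, 58_955),
    ("Pfizer Inc.", 14, 108_178),
    ("Novartis AG", 21, 83_646),
]


def spending_per_drug(total_musd: float, drugs_approved: int) -> float:
    """Total R&D spending divided by the number of approved drugs ($ million)."""
    if drugs_approved <= 0:
        raise ValueError("drugs_approved must be positive")
    return total_musd / drugs_approved


per_drug = {name: spending_per_drug(total, n) for name, n, total in ROWS}

for name, value in per_drug.items():
    print(f"{name}: {value:,.0f} $Mil per approved drug")
```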
Drug discovery is a long, risky process with a low probability of success
http://www.bayerpharma.com/en/research-and-development/processes/index.php
Prediction of non-toxic targets & inhibitors
• Efficacy: Inhibitor should target the right protein in the pathogen (Mycobacterium tuberculosis)
• Toxicity: Inhibitor should not target any crucial protein in the host (Human)
Biology is complex!!
• From a mathematical point of view, creating an accurate model of a single mammalian cell may require generating and then solving somewhere between 100,000 and one million equations
• The human brain can only process seven pieces of data at a time!!!
• Need automation & new technology to address the complexity
http://news.vanderbilt.edu/2011/10/robot-biologist/
Predictive Science in the Drug Discovery (DD) Process
• Systems Level Models for DD
  – Target Identification
  – Pharmacomodeling
  – Off-target binding predictions
• Virtual Screening for selected targets & Models for predicting antiTB and mutagenic properties
• Predicting toxicity and metabolism of drugs
• Systems Biology for predicting Drug-targets MOA
• HPC for OSDD Community by Garuda/CMMACS
Prediction tools and models to prioritize candidate molecules
Why Open Source Drug Discovery?
• Many eyeballs make the bug shallow!
• Lack of market incentive for TB
Successful Open Source Models
• Human Genome Sequencing Initiative
• Open Source Software Initiative (e.g., Linux OS)
• Android
• The WWW
Real Innovation lies in “Innovating how we innovate”… “We cannot solve our problems with the same thinking we used when we created them .” Albert Einstein
Open TB Drug Discovery Platform
Informatics to Experimental Validation to Clinical Trials
• Systems Biology
• Cheminformatics
• Target Validation of in silico targets
• OSDD Assay Development and Screening Facility
• Mtb Strain and Clone Repository
• Chem Directed Synthesis
• Target Identification for Leads
• Lead Identification
• Lead Optimization
• Safety Pharmacology
• In vivo efficacy
• DMPK
• Pre-Clinical Candidate
• Pharmacogenomics
• Phase I-III
Unconventional Collaborative Network
• OSDD portal
• Pharmacogenomics expert
• Disease experts
• Computer Scientists
• Administrator: manages server
• Data upload
• Virtual Screening
• Virtual Lab
• Gene/Protein Expression Analysis
• Mathematical modeling
Shaping Science 2.0 OSDD Semantic Web Architecture
OSDD Platform Released: April 2010
System Architecture
“Collaborative tools to accelerate neglected diseases research” in the book “Collaborative Computational Technologies for Biomedical Research”. Wiley and Sons. May 2011
Scientific Workflow Management Systems
• Experimental data from biology and chemistry need to be managed and analyzed systematically
• Large datasets and compute-intensive analyses need compute infrastructure
http://galaxyproject.org/
http://www.taverna.org.uk/
http://www.tavaxy.org/
https://kepler-project.org/
Weka Workflow
a. Convert CSV to test and train files
b. Convert both CSVs to arff files: output_file1 is always the train file and output_file2 is the test file
c. Select the two input files for the Classifier. Change the parameters in the right-side panel for each tool
d. Evaluate the model file: Classifier will be Misc -> SerializedClassifier
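Steps a and b of the workflow (split a CSV into train and test sets, then write both as ARFF) can be sketched in plain Python. This is an illustration only, not the converter used in the OSDD Galaxy tools: the column names and values are toy data, the function names are hypothetical, and it assumes numeric descriptor columns with a nominal class label in the last column.

```python
import csv
import io
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def to_arff(relation, header, rows, class_values):
    """Render rows as ARFF text: numeric attributes, nominal class in the last column."""
    lines = [f"@relation {relation}", ""]
    for name in header[:-1]:
        lines.append(f"@attribute {name} numeric")
    lines.append(f"@attribute {header[-1]} {{{','.join(class_values)}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Toy input (made-up descriptors and labels, for illustration only).
CSV_TEXT = """mol_weight,logp,active
180.2,1.2,yes
250.3,3.1,no
310.4,2.2,yes
122.1,0.4,no
410.5,4.8,no
"""

reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)
rows = [row for row in reader if row]

# Both files must declare the same class values, so collect them before splitting.
class_values = sorted({row[-1] for row in rows})
train_rows, test_rows = split_rows(rows, test_fraction=0.2)

train_arff = to_arff("train", header, train_rows, class_values)  # output_file1
test_arff = to_arff("test", header, test_rows, class_values)     # output_file2
print(train_arff)
```

Declaring the class values once and reusing them for both files keeps the train and test headers identical, which Weka expects when a saved model is evaluated on a separate test file.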
Customized workflow with grid infrastructure & applications
http://sysborg2.osdd.net
• Electronic lab note books
  – APIs to submit workflow method to lab note book
  – APIs to extract files from lab note books
  – APIs to submit results to lab note book
• Customized Galaxy
  – Jobs are invoked from Customized Galaxy and submitted to Gridway
  – Input file + parameters
  – Gridway runner, Job template
• Gridway meta scheduler
  – Job status may be checked using DRMAA API
• LRM: PBS, Torque
• Clusters
• Programs: more than 250 applications integrated
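The "Input file + parameters" to "Job template" step can be sketched as a small function that renders a template as KEY = value lines. This is a hypothetical illustration, not the OSDD Gridway runner: the key names (EXECUTABLE, ARGUMENTS, INPUT_FILES, OUTPUT_FILES, STDOUT_FILE, STDERR_FILE) are assumed to match the GridWay job template format and should be checked against the GridWay documentation, and the script and file names are made up.

```python
def render_job_template(executable, arguments, input_files, output_files):
    """Build the text of a GridWay-style job template (KEY = value lines).

    The key names are an assumption about the GridWay template format.
    """
    fields = {
        "EXECUTABLE": executable,
        "ARGUMENTS": " ".join(arguments),
        "INPUT_FILES": ", ".join(input_files),
        "OUTPUT_FILES": ", ".join(output_files),
        "STDOUT_FILE": "stdout.${JOB_ID}",
        "STDERR_FILE": "stderr.${JOB_ID}",
    }
    # Skip empty fields so the template only lists what the job actually uses.
    return "\n".join(f"{key} = {value}" for key, value in fields.items() if value) + "\n"


# Hypothetical job: a wrapper script that trains and tests a classifier.
template = render_job_template(
    executable="weka_classify.sh",
    arguments=["-t", "train.arff", "-T", "test.arff"],
    input_files=["train.arff", "test.arff"],
    output_files=["model.out"],
)
print(template)
```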
Custom APIs for importing input files from OSDD’s open lab note book into Galaxy
• Get data customized for extracting files from the open lab note book
Custom APIs for exporting results to OSDD’s open lab note book
• Workflows and the results of the workflows are stored as separate lab note books
• Lab note book has details of the experiments performed
• Results of one experiment may be invoked for analysis in another experiment
• All versions of the workflow and the results are stored
• Flexibility to execute nested workflows
List of >250 modules integrated as web services by OSDD Community
S. No   Resources                                                                             Clients
1       KEGG: Kyoto Encyclopedia of Genes and Genomes                                         60
2       GetEntry: DDBJ sequence search by accession ID                                        43
3       GPSR: tools                                                                           33
4       PDB: Protein Data Bank                                                                30
5       BioModel: mathematical models of biological DB                                        25
6       Gtps: Gene Trek in Prokaryote Space                                                   8
7       WSDbfetch: retrieve entries from biological dbs using entry identifiers or accession no.   7
8       Gibv: Genome Information Broker for Viruses                                           7
9       DDBJ: DNA Data Bank of Japan                                                          7
10      Mafft: a multiple sequence alignment program                                          4
11      Fasta: DDBJ database                                                                  4
12      Ensembl: maintains automatic annotation                                               4
13      VecScreen: vector contamination                                                       4
14      OMIM: Online Mendelian Inheritance in Man                                             4
15      Gtop: Gene-product Informatics                                                        3
16      GO: Gene Ontology                                                                     3
17      SPS: Splicing Profile based Score                                                     2
18      GIBIS: Genome Information Broker for Insertion Sequence                               1
19      RefSeq: database of sequence                                                          1
20      GIB: Genome Information Broker                                                        1
21      GIBEnv: DDBJ database                                                                 1
22      TxSearch: Database indexing & searching                                               1
Ongoing: Cheminformatics Community of About 400
• Sources: PubChem, ChEMBL, DrugBank
• Curated molecule datasets
• Data Mining and Analysis
• Cheminformatics Models
• HT Virtual screening
• Experimental Assays
Other Active Communities:
• OSDD Women Scientists Forum
• OSDD Junior Scientists Forum
Background and Premise
Why are we doing this?
Crowd-Sourcing Large-Scale Data-Driven Cheminformatics Analysis
• Bioassay Datasets
• People
• Computational Tools and Resources
• Machine Learning based Computational Models
• Standard re-usable models / Publications
Data amplification in Cheminformatics
• PubChem Bioassay data (approx. 1 lakh molecules/dataset)
• 6000 descriptors/molecule
• Successful Models → Screen PubChem (30 million) → Potential Hits
o Downsizing and random validation require multiple calculations for validation of results
o Cross-validation up to 50+ times for each experiment
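A rough back-of-the-envelope sketch of the amplification, using only the figures on this slide. Three things are assumptions rather than statements from the slide: that each cross-validation run passes over the full descriptor matrix once, that screening PubChem uses all 6000 descriptors per molecule, and that each value is stored as an 8-byte float.

```python
MOLECULES_PER_DATASET = 100_000    # "approx. 1 lakh molecules/dataset"
DESCRIPTORS_PER_MOLECULE = 6_000   # "6000 descriptors/molecule"
CROSS_VALIDATION_RUNS = 50         # "up to 50+ times for each experiment"
PUBCHEM_MOLECULES = 30_000_000     # "PubChem (30 million)"
BYTES_PER_VALUE = 8                # assumption: 64-bit floating-point values

# Size of one descriptor matrix (molecules x descriptors).
values_per_dataset = MOLECULES_PER_DATASET * DESCRIPTORS_PER_MOLECULE

# Values processed if every cross-validation run touches the whole matrix (assumption).
values_across_cv = values_per_dataset * CROSS_VALIDATION_RUNS

# Values needed to screen all of PubChem with the full descriptor set (assumption).
values_full_screen = PUBCHEM_MOLECULES * DESCRIPTORS_PER_MOLECULE


def gigabytes(n_values: int) -> float:
    """Storage for n_values at BYTES_PER_VALUE each, in decimal gigabytes."""
    return n_values * BYTES_PER_VALUE / 1e9


print(f"one dataset:        {values_per_dataset:.1e} values, {gigabytes(values_per_dataset):.1f} GB")
print(f"50 CV runs:         {values_across_cv:.1e} values processed")
print(f"full PubChem screen: {values_full_screen:.1e} values, {gigabytes(values_full_screen):.0f} GB")
```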
The Problem
C-DAC’s Garuda Grid – Indian Grid Computing Initiative
• C-DAC is an R&D organization under the Ministry of Communication & Information Technology, India
• C-DAC’s Garuda Grid is targeted at providing a facility that enables the scientific community to seamlessly access distributed resources
• Compute power of GARUDA: ~70 TFs (6000 CPUs)
• Currently there are 55 Garuda partners
• Has NKN (National Knowledge Network) connectivity at 10 Gbps
OSDD-Garuda Interface
[Diagram labels: Internet/NKN, Results, NKN]
Weka in Galaxy
OSDD – Garuda Activities
• Created the OSDD Virtual Organization (VO); 70 users are registered under this VO
• Garuda Portal customized to support OSDD requirements
• Galaxy, a biology workbench, has been customized as per OSDD requirements
• JNU head node was set up for hosting Galaxy
• Common data has been uploaded to Data Location for accessibility through Galaxy and Portal by all OSDD users
• Three cluster resources have been provided for OSDD activities
  – Hyderabad Cluster with 320 CPUs
  – Chennai Cluster with 304 CPUs
  – Param Yuva at Pune with 4368 CPUs
• Hand-holding users from the community & resolving their queries