HPC Asia 2004 BioGrid Workshop, July 21, 2004
Development of a Database System for Drug Discovery by Employing Grid Technology
Masato Kitajima (1, 2), Yukako Tohsato (1), Takahiro Kosaka (1), Kazuto Yamazaki (3), Reiji Teramoto (3), Susumu Date (1), Shinji Shimojo (4), Hideo Matsuda (1)
(1) Graduate School of Information Science and Technology, Osaka University. (2) Fujitsu Kyushu System Engineering Limited. (3) Research Division, Sumitomo Pharmaceuticals Co., Ltd. (4) Cybermedia Center, Osaka University.
Databases in the Life Sciences
The amount of data and the number of databases in the life sciences have increased dramatically in just a few years.
[Figure: number of databases (0 to 600) by year, 1996 to 2004. Source: Nucleic Acids Research Database Issue.]
Amount of updates in two months of a DNA database
[Figure: updated bases (0 to 140,000,000) by date, 2004/4/21 to 2004/6/16.]
Common Database Problems in the Life Sciences
・ The growing amount of data places a heavy load on the administrator who updates the database
・ A slight change in the schema of one database requires a complete rebuild of the whole system
・ A considerable amount of time and resources is spent just on updating the database
Different Ways of Integrating Distributed Databases
• Hyperlinked Database
– Most commonly used for linking databases
– Hyperlinks cannot carry special meanings
• Integrated Database (e.g., NCBI’s Entrez)
– The user only needs to access a single database
– A change in the schema of one database forces a rebuild of the whole database system
• Heterogeneous Database (e.g., Stanford University’s TSIMMIS)
– Builds a “wrapper” on each database, through which a mediator accesses it (a change in the schema of one database requires only a change in the wrapper for that database)
– Databases that use authentication, or functionality specific to the life sciences (such as homology searching and similarity searching), are difficult to integrate
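The wrapper and mediator arrangement described for heterogeneous databases can be illustrated with a minimal sketch. It is not TSIMMIS code: the two sources, their native field names, and the common record shape (`id`, `name`, `source`) are all hypothetical, chosen only to show why a schema change stays local to one wrapper.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

def protein_source_wrapper(query: str) -> List[Record]:
    """Wrapper for a hypothetical source whose native rows use 'acc' and 'desc'."""
    native_rows = [{"acc": "P1", "desc": "example receptor"}]
    return [
        {"id": row["acc"], "name": row["desc"], "source": "protein_source"}
        for row in native_rows
        if query.lower() in row["desc"].lower()
    ]

def compound_source_wrapper(query: str) -> List[Record]:
    """Wrapper for a hypothetical source whose native rows are (code, label) tuples."""
    native_rows = [("C1", "example receptor agonist")]
    return [
        {"id": code, "name": label, "source": "compound_source"}
        for code, label in native_rows
        if query.lower() in label.lower()
    ]

class Mediator:
    """Sends one query to every wrapper and merges the common-shape records."""

    def __init__(self, wrappers: List[Callable[[str], List[Record]]]) -> None:
        self.wrappers = wrappers

    def search(self, query: str) -> List[Record]:
        results: List[Record] = []
        for wrapper in self.wrappers:
            results.extend(wrapper(query))
        return results

mediator = Mediator([protein_source_wrapper, compound_source_wrapper])
hits = mediator.search("receptor")
```

If the protein source renamed `acc` to something else, only `protein_source_wrapper` would change; the mediator and the other wrapper would be untouched.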
Common Problems in Linking the Databases
- Unorganized structure of information
- Data stored as unformatted text
- Inconsistent use of terms across databases
- Relationships between databases can only be built manually
Proposal of a New Database System
Use of grid technology and introduction of the concept of metadata
These greatly helped in building mutual data relationships between the databases in a distributed system.
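Overview of OGSA-DAI
OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
[Figure: an analysis client communicates over SOAP/HTTP with the Registry (GDSR), the Factory (GDSF), and the Grid Data Service (GDS); the GDS accesses the DBMS (RDB, XML DB). Arrow legend: service creation, API interactions.]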
Genome-based Drug Discovery Process
Application to the drug discovery process
• Compounds (drugs) are activated by binding to proteins in a cell.
• The drug discovery process is to find chemical compounds that have good effects on their target proteins.
• The process is time-consuming (10~15 years) and expensive (5~10 million $).
[Figure: a compound (drug) binding to a protein in a cell; pipeline stages Target Identification, Target Validation, Lead Identification, Lead Optimization, Pre-Clinical, Clinical, Drug; number-of-compounds labels 10,000, 200, and 1.]
Databases Needed in Genome-based Drug Discovery
[Figure: the drug discovery workflow, from Basic Genomic Research through Pre-clinical, Clinical, and Market, with the searches used along it (genome mapping and gene finding, known protein search by sequence similarity, protein interaction search, compound search by structure similarity, disease modeling) and the databases they need: Genome DB (gene location, SNP), Gene/Protein Database, Interaction DB, Compound DB, Disease DB.]
Semantic Gap Exists Between Databases and Their Corresponding Disciplines
[Figure: the same workflow, searches, and databases as on the previous slide (Genome DB, Gene/Protein Database, Interaction DB, Compound DB, Disease DB), with the semantic gap and the database relationships marked between them.]
Linking Databases in Different Disciplines
Unification of different disciplines through metadata → supports the drug discovery process
[Figure: Disease DB (Medicine), Compound DB (Chemistry), and Genome DB and Protein DB (Life Science), connected by metadata for Lead Identification and for Gene-Disease Mapping.]
Linking Databases in Different Disciplines
Linking eleven databases involved in genomic drug discovery
[Figure: Disease DB (Medicine), Compound DB (Chemistry), and Genome DB and Protein DB (Life Science), connected by metadata for Lead Identification and for Gene-Disease Mapping. Databases named, with providers where shown: MDL Drug Data Report (MDL), Medical Encyclopedia (NLM), LITDB (Protein Research Foundation), DDBJ (DNA Data Bank of Japan), SwissProt, PIR, PDB, ENZYME, GPCR-DB, NucleaRDB, LGIC-DB.]
Two-Level Implementation of the Metadata
[Figure: Protein-Compound Interaction Metadata sits above Protein Metadata and Compound Metadata, which in turn sit above the underlying databases (PDB, PIR, MDDR, Compound DB). Caption: the relationship between groups in each category level of Protein Metadata and Compound Metadata.]
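One way to read the two levels is sketched below. The slide does not give a schema, so everything here is an assumption made for illustration: the lower level assigns database entries to groups, the upper level relates protein groups to compound groups, and all identifiers and group names are invented.

```python
# Lower level (assumed): per-discipline metadata assigns entries to groups.
protein_metadata = {
    "PROT_1": "receptor_family_A",
    "PROT_2": "receptor_family_A",
    "PROT_3": "enzyme_family_B",
}
compound_metadata = {
    "CMPD_1": "ligand_class_X",
    "CMPD_2": "ligand_class_Y",
}

# Upper level (assumed): interaction metadata relates groups, not single entries.
interaction_metadata = {("receptor_family_A", "ligand_class_X")}

def compounds_for_protein(protein_id: str) -> list:
    """Follow protein -> protein group -> compound group -> compounds."""
    protein_group = protein_metadata.get(protein_id)
    compound_groups = {
        c_group for p_group, c_group in interaction_metadata if p_group == protein_group
    }
    return sorted(
        compound for compound, group in compound_metadata.items()
        if group in compound_groups
    )
```

Under this reading, a protein with no recorded ligand of its own (here `PROT_2`) still reaches `CMPD_1` through the group it shares with other proteins.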
Metadata as Implemented on the Drug Discovery Workflow
[Figure: the drug discovery workflow, from Basic Genomic Research through Pre-Clinical, Clinical, and Market, overlaid with Disease Metadata, Protein/Compound Interaction Metadata, and Drug Metabolism Metadata. Example relations: Disease / Relation / Target (DiseaseA, Active, ReceptorA); Drug / Relation / Enzyme (DrugA, Substrate, Enzyme Ⅰ); Protein / Relation / Ligand (ReceptorA, Agonist, Compound 1). Underlying DB servers: Gene-Protein DB, Compound DB, Drug Metabolism DB, Disease DB.]
Database System for Protein-Compound Interaction Search
[Figure: layered architecture. The user's web browser connects over HTTPS to the Search Portal (Tomcat), where the Database Search Service (Servlet) runs the search process. The servlet calls, over SOAP, the factories of five services: Protein-Compound Interaction Metadata Service, Protein Sequence Homology Search Service (BLAST, with a BLAST DB), Compound Structure Similarity Search Service (Tanimoto index, with structure keys for searching compound substructures), Compound Metadata Service, and Protein Metadata Service. Each service reaches its data through a Grid Data Service (OGSA-DAI: GDSF and GDS) running as a Grid Service on Globus Toolkit 3. Databases: Protein DB (PDB, SwissProt, PIR), Interaction DB, Compound DB (MDDR), and Protein DB (Enzyme, GPCR-DB, NucleaRDB, LGIC-DB).]
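The Tanimoto index named for the Compound Structure Similarity Search Service is the standard ratio of shared to total structure keys. A minimal sketch follows, assuming each compound is represented as the set of structure-key indices it contains; the key values are invented, and returning 0.0 for two empty key sets is a convention chosen here, not something the slides specify.

```python
def tanimoto(keys_a: set, keys_b: set) -> float:
    """Tanimoto index: shared keys divided by the keys present in either set."""
    if not keys_a and not keys_b:
        return 0.0  # convention assumed here for two empty key sets
    shared = len(keys_a & keys_b)
    return shared / (len(keys_a) + len(keys_b) - shared)

# Invented structure-key sets for two compounds.
query_keys = {1, 2, 3, 4}
library_keys = {3, 4, 5}
score = tanimoto(query_keys, library_keys)  # 2 shared keys out of 5 distinct keys
```

The index ranges from 0 (no keys in common) to 1 (identical key sets), so a similarity search can rank library compounds by score or keep those above a chosen cutoff.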
Strategy Used in Protein-Compound Interaction Search
[Figure: a ligand ontology* relates each compound, through an interaction, to a protein or protein family. Data extracted from MDDR (MDL): protein name, protein family class. Protein databases shown: ENZYME, GPCR-DB, NucleaRDB, LGIC-DB.]
* Schuffenhauer A, Zimmermann J, Stoop R, van der Vyver JJ, Lecchini S, Jacoby E. “An ontology for pharmaceutical ligands and its application for in silico screening and library design,” J Chem Inf Comput Sci. 2002 Jul-Aug;42(4):947-55.
Process Flow in Protein-Compound Interaction Search
[Figure: New Target Protein → Homology Search (BLAST, etc.) against the Protein DB → Homologous Target Protein with known ligands → Interactions Search → Large Reference Set of Known Ligands of the Homologous Target Protein → Structure Similarity Search (ISIS SS, etc.) against the Compound Library using compound descriptors → Candidate Ligands of the New Target Protein.]
Schuffenhauer A, Floersheim P, Acklin P, Jacoby E., “Similarity metrics for ligands reflecting the similarity of the target proteins”, J Chem Inf Comput Sci. 2003 Mar-Apr;43(2):391-405.
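The flow on this slide can be traced end to end with toy data. The sketch below is an illustration only: the lookup tables stand in for the BLAST homology search and the interaction search, the structure-key sets are invented, the Tanimoto index stands in for whichever structure similarity measure is used, and the 0.5 cutoff is an arbitrary assumption.

```python
from typing import Dict, FrozenSet, List, Tuple

# Stand-in for homology search results (a real system would run BLAST).
HOMOLOGS: Dict[str, List[str]] = {"NEW_TARGET": ["KNOWN_PROTEIN_1"]}

# Stand-in for the interaction search: known ligands per protein.
KNOWN_LIGANDS: Dict[str, List[str]] = {"KNOWN_PROTEIN_1": ["LIGAND_A"]}

# Invented structure-key sets used as compound descriptors.
DESCRIPTORS: Dict[str, FrozenSet[int]] = {
    "LIGAND_A": frozenset({1, 2, 3, 4}),
    "CANDIDATE_1": frozenset({1, 2, 3, 5}),
    "CANDIDATE_2": frozenset({7, 8}),
}
COMPOUND_LIBRARY: List[str] = ["CANDIDATE_1", "CANDIDATE_2"]

def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    shared = len(a & b)
    total = len(a) + len(b) - shared
    return shared / total if total else 0.0

def candidate_ligands(target: str, cutoff: float = 0.5) -> List[Tuple[str, float]]:
    """Homologs -> their known ligands -> structurally similar library compounds."""
    reference = [
        ligand
        for homolog in HOMOLOGS.get(target, [])
        for ligand in KNOWN_LIGANDS.get(homolog, [])
    ]
    hits = []
    for compound in COMPOUND_LIBRARY:
        best = max(
            (tanimoto(DESCRIPTORS[compound], DESCRIPTORS[ref]) for ref in reference),
            default=0.0,
        )
        if best >= cutoff:
            hits.append((compound, best))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

ranked = candidate_ligands("NEW_TARGET")
```

With these toy tables only `CANDIDATE_1` passes the cutoff, since it shares three of five distinct keys with `LIGAND_A`, while `CANDIDATE_2` shares none.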
Example of Protein-Compound Interaction Search
[Figure: protein PPARgamma (binding domains Zf-C4 137-211 and Hormone_rec 318-501) with its agonist compound rosiglitazone; protein PPARalpha (binding domains Zf-C4 100-174 and Hormone_rec 281-464) with its agonist compound fenofibrate; and the dual agonist compound ragaglitazar. Arrows are labeled Homology (between proteins), Activity (between compound and protein), and similarity (between compounds).]
Protein-Compound Interaction Search System Website
Applications Available to the User
• Protein Sequence Search: Retrieve the target protein’s sequence by specifying its Protein ID.
• Homology Search: Search for proteins homologous to the target in the Protein DB.
• Protein-Compound Interaction Search: Extract ligands that bind to the homologous proteins.
• Compound Search: Search for new compounds that may interact with the target protein, by structural similarity to the extracted ligands.
Flow of User Access and Grid Service Execution
[Figure: user access from a web browser passes protein sequence, homology search, protein-compound interaction search, and compound structure search information to the Search Portal (Servlet). The portal runs the Sequence Search, Homology Search, Interaction Search, and Structure Similarity Search by calling the Protein Metadata Service, the Protein Sequence Homology Search Service (BLAST), the Protein-Compound Interaction Metadata Service, the Compound Metadata Service, and the Compound Structure Similarity Search Service (Tanimoto index). These Grid Services (Globus Toolkit 3) reach the Protein DB, Interaction DB, and Compound DB through Grid Data Services (GDS, OGSA-DAI).]