HPC Asia 2004 BioGrid workshop Development of a Database System for - PowerPoint PPT Presentation

HPC Asia 2004 BioGrid workshop
Development of a Database System for Drug Discovery by Employing Grid Technology
July 21, 2004
Masato Kitajima 1,2, Yukako Tohsato 1, Takahiro Kosaka 1, Kazuto Yamazaki 3, Reiji Teramoto 3, Susumu Date 1, Shinji Shimojo 4, Hideo Matsuda 1
1 Graduate School of Information Science and Technology, Osaka University
2 Fujitsu Kyushu System Engineering Limited
3 Research Division, Sumitomo Pharmaceuticals Co., Ltd.
4 Cybermedia Center, Osaka University

Databases in the Life Sciences
The amount of data and the number of databases in the life sciences have increased dramatically in just a few years.
[Figure: number of databases (axis 0 to 600) by year, 1996 to 2004. Source: Nucleic Acids Research DB Issue]

Amount of updates in two months of a DNA database
[Figure: bases updated (axis 0 to 140,000,000) by date, weekly from 2004/4/21 to 2004/6/16]

Common Database Problems in the Life Sciences
• The growing amount of data places a heavy load on the administrator who updates the database.
• A slight change in the schema of one database requires a complete rebuild of the whole system.
• A considerable amount of time and resources is spent just updating the database.

Different Ways of Integrating Distributed Databases
• Hyperlinked database
- Most commonly used way of linking databases
- Hyperlinks cannot carry special meanings
• Integrated database (ex. NCBI’s Entrez)
- The user only needs to access a single database
- A change in the schema of one database forces a rebuild of the whole database system
• Heterogeneous database (ex. Stanford Univ.’s TSIMMIS)
- Builds a “wrapper” on each database, which a mediator accesses (a change in the schema of one database requires only a change in the wrapper for that database)
- Databases that use authentication, or functionality specific to the life sciences (such as homology searching and similarity searching), are a problem to integrate
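The wrapper and mediator idea in the heterogeneous approach can be sketched as follows. This is a minimal illustration, not TSIMMIS itself: the two sources, their field names (`acc`, `desc`), and the record shape are hypothetical. Each wrapper hides one source's schema, so a schema change in a source touches only its wrapper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol, Tuple


@dataclass(frozen=True)
class ProteinRecord:
    """Common record shape the mediator exposes (hypothetical)."""
    accession: str
    name: str
    source: str


class Wrapper(Protocol):
    def find_by_name(self, keyword: str) -> Iterable[ProteinRecord]: ...


class SourceAWrapper:
    """Hypothetical source whose rows are dicts with 'acc' and 'desc' keys."""

    def __init__(self, rows: List[Dict[str, str]]) -> None:
        self._rows = rows

    def find_by_name(self, keyword: str) -> Iterable[ProteinRecord]:
        for row in self._rows:
            if keyword.lower() in row["desc"].lower():
                yield ProteinRecord(row["acc"], row["desc"], "source_a")


class SourceBWrapper:
    """Hypothetical source whose rows are (id, protein_name) tuples."""

    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        self._rows = rows

    def find_by_name(self, keyword: str) -> Iterable[ProteinRecord]:
        for ident, protein_name in self._rows:
            if keyword.lower() in protein_name.lower():
                yield ProteinRecord(ident, protein_name, "source_b")


class Mediator:
    """Sends one query to every wrapper and merges the answers."""

    def __init__(self, wrappers: Iterable[Wrapper]) -> None:
        self._wrappers = list(wrappers)

    def find_by_name(self, keyword: str) -> List[ProteinRecord]:
        results: List[ProteinRecord] = []
        for wrapper in self._wrappers:
            results.extend(wrapper.find_by_name(keyword))
        return results


mediator = Mediator([
    SourceAWrapper([
        {"acc": "A001", "desc": "Example receptor alpha"},
        {"acc": "A002", "desc": "Example kinase"},
    ]),
    SourceBWrapper([("B100", "Example receptor gamma")]),
])
hits = mediator.find_by_name("receptor")
```

Here `hits` holds one record from each source in the common shape, although the two sources store their rows differently.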

Common Problems in Linking the Databases
- Unorganized structure of information
- Data in unformatted text
- Inconsistent use of terms across databases
- Relationships between the databases can only be built manually

Proposal of a New Database System
Use grid technology and introduce the concept of metadata.
This greatly helps in building mutual data relationships between databases in a distributed system.

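Overview of OGSA-DAI
OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
[Figure: an Analysis client talks over SOAP/HTTP to a Registry (GDSR), a Factory (GDSF), and a Grid Data Service (GDS); the factory handles service creation, and the GDS reaches the DBMS (RDB, XML DB) through API interactions]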
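The registry, factory, and service roles on this slide can be sketched as the interaction pattern below. This is an in-process illustration only and does not use the OGSA-DAI API: the class names, the `protein-db` key, and the demo table are assumptions, and an in-memory SQLite database stands in for the DBMS.

```python
import sqlite3
from typing import Callable, Dict, List, Sequence, Tuple


class GridDataService:
    """Stand-in for a GDS: runs queries against one database connection."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection

    def perform(self, sql: str, params: Sequence = ()) -> List[Tuple]:
        return self._connection.execute(sql, params).fetchall()


class GridDataServiceFactory:
    """Stand-in for a GDSF: creates a GDS instance on request."""

    def __init__(self, connect: Callable[[], sqlite3.Connection]) -> None:
        self._connect = connect

    def create_service(self) -> GridDataService:
        return GridDataService(self._connect())


class Registry:
    """Stand-in for a GDSR: lets clients find factories by name."""

    def __init__(self) -> None:
        self._factories: Dict[str, GridDataServiceFactory] = {}

    def register(self, name: str, factory: GridDataServiceFactory) -> None:
        self._factories[name] = factory

    def lookup(self, name: str) -> GridDataServiceFactory:
        return self._factories[name]


def open_demo_protein_db() -> sqlite3.Connection:
    """Hypothetical protein table used only for this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [("P1", "example receptor"), ("P2", "example enzyme")],
    )
    return conn


# Client side: look up the factory, create a service, run a query.
registry = Registry()
registry.register("protein-db", GridDataServiceFactory(open_demo_protein_db))
service = registry.lookup("protein-db").create_service()
rows = service.perform(
    "SELECT id, name FROM protein WHERE name LIKE ?", ("%receptor%",)
)
```

In the real system each of these steps is a SOAP/HTTP call to a remote grid service; the sketch only shows the order of the calls.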

Genome-based Drug Discovery Process
Application to the drug discovery process
• Compounds (drugs) are activated by binding to proteins in a cell.
• The drug discovery process is to find chemical compounds that have good effects on their target proteins.
• The process is time-consuming and expensive.
[Figure: a compound (drug) binding to a protein in a cell, and the pipeline Target Identification, Target Validation, Lead Identification, Lead Optimization, Pre-Clinical, Clinical, Drug, labeled "5~10 million $ (10~15 years)"; number of compounds labeled 10,000, 200, 1]

Databases Needed in Genome-based Drug Discovery
[Figure: workflow stages Basic Genomic Research, Gene Finding, Gene Function Analysis, Target Validation, Lead Identification, Lead Optimization, Pre-clinical, Clinical, Market. Analyses shown under the stages: genome mapping / gene finding, known proteins search (sequence similarity search), proteins・compound interaction search, compound search (structure similarity search), disease modeling. Databases shown under the analyses: Genome DB (gene location, SNP), Gene/Protein Database, Interaction DB, Compound DB, Disease DB]

Semantic Gap Exists Between Databases and Their Corresponding Disciplines
[Figure: the same stages, analyses, and databases as on the previous slide (Genome DB (gene location, SNP), Gene/Protein Database, Interaction DB, Compound DB, Disease DB), with the links marked "Database relationship" and "Semantic Gap"]

Linking Databases in Different Disciplines
Unification of different disciplines through metadata → supports the drug discovery process
[Figure: Disease DB (Medicine), Compound DB (Chemistry), and Genome DB and Protein DB (Life Science) linked through metadata; the links are labeled Lead Identification and Gene-Disease Mapping]

Linking Databases in Different Disciplines
Linking eleven databases involved in genomic drug discovery
• Compound DB (Chemistry)
- MDL: MDL Drug Data Report
• Disease DB (Medicine)
- NLM: Medical Encyclopedia
• Protein DB (Life Science)
- ENZYME, GPCR-DB, NucleaRDB, LGIC-DB
- Protein Research Foundation: LITDB
- SwissProt, PIR, PDB
• Genome DB (Life Science)
- DNA Databank of Japan: DDBJ
[Figure: the four database groups linked through metadata; the links are labeled Lead Identification and Gene-Disease Mapping]

Two-Level Implementation of the Metadata
[Figure: Protein-Compound Interaction Metadata sits above Protein Metadata and Compound Metadata; the databases shown beneath are PDB, PIR, and the Compound DB (MDDR)]
The relationship between groups in each category level of Protein Metadata and Compound Metadata

Metadata as Implemented on the Drug Discovery Workflow
[Figure: workflow stages Basic Genomic Research, Gene Finding, Gene Function Analysis, Target Validation, Lead Identification, Lead Optimization, Pre-Clinical, Clinical, Market. Above the DB servers sit Disease Metadata, Protein/Compound Interaction Metadata, and Drug Metabolism Metadata. DB servers: Gene-Protein DB, Compound DB, Drug Metabolism DB, Disease DB]
Example metadata relations shown on the slide:
• Disease / Relation / Target: DiseaseA / Active / ReceptorA
• Drug / Relation / Enzyme: DrugA / Substrate / Enzyme Ⅰ
• Protein / Relation / Ligand: ReceptorA / Agonist / Compound 1
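One way to read the example relations on this slide is as subject, relation, object triples that can be chained across databases. The sketch below assumes that reading; the `Relation` type and the function name are hypothetical, and the values are the slide's placeholder names (DiseaseA, ReceptorA, and so on), not real data.

```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    """One metadata entry linking a record in one DB to a record in another."""
    subject: str
    relation: str
    obj: str


# Placeholder entries taken from the slide's example tables.
DISEASE_METADATA = [Relation("DiseaseA", "Active", "ReceptorA")]
INTERACTION_METADATA = [Relation("ReceptorA", "Agonist", "Compound1")]
METABOLISM_METADATA = [Relation("DrugA", "Substrate", "EnzymeI")]


def ligands_for_disease(disease: str) -> List[Relation]:
    """Follow disease -> target protein -> ligand across two metadata sets."""
    targets = {r.obj for r in DISEASE_METADATA if r.subject == disease}
    return [r for r in INTERACTION_METADATA if r.subject in targets]


found = ligands_for_disease("DiseaseA")
```

Because the link lives in the metadata rather than in the databases, the disease and compound databases can keep their own schemas.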

Database System for Protein-Compound Interaction Search
[Figure: system layers, top to bottom]
• User: Web Browser, connected over HTTPS
• Search Portal (Tomcat): Database Search Service (Servlet) runs the search process and calls the services over SOAP
• Services, each created through a Factory:
- Protein-Compound Interaction Metadata Service
- Protein Metadata Service
- Compound Metadata Service
- Protein Sequence Homology Search Service (BLAST)
- Compound Structure Similarity Search Service (Tanimoto index)
• Grid Data Service (OGSA-DAI): a GDSF and GDS for each data source, including Structure Keys (substructures) and the BLAST Search DB
• Grid Service (Globus Toolkit 3)
• Databases: Protein DB, Interaction DB, Compound DB; sources named on the slide are PDB, SwissProt, PIR, MDDR, and (Enzyme, GPCR-DB, NucleaRDB, LGIC-DB)
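The Tanimoto index named for the compound structure similarity search is the size of the intersection of two key sets divided by the size of their union. A minimal sketch, assuming each compound is represented as a set of integer structure-key identifiers; the compound names, key values, and the 0.5 threshold are made up for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto index: |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


def similar_compounds(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, keys)) for name, keys in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical library: each compound is the set of structure keys it contains.
library = {
    "cmpd_x": frozenset({1, 2, 3, 4}),
    "cmpd_y": frozenset({1, 2, 9}),
    "cmpd_z": frozenset({7, 8}),
}
result = similar_compounds(frozenset({1, 2, 3}), library)
# result == [("cmpd_x", 0.75), ("cmpd_y", 0.5)]
```

The query shares three of four keys with `cmpd_x` and two of four with `cmpd_y`, and nothing with `cmpd_z`, which therefore falls below the threshold.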

Strategy Used in Protein-Compound Interaction Search
[Figure: the Ligand Ontology* links Compound to Protein or Protein Family through Interaction. Interaction data are extracted from MDDR (MDL); Protein Name, Protein Family, and Class come from ENZYME, GPCR-DB, NucleaRDB, and LGIC-DB]
* Schuffenhauer A, Zimmermann J, Stoop R, van der Vyver JJ, Lecchini S, Jacoby E. “An ontology for pharmaceutical ligands and its application for in silico screening and library design,” J Chem Inf Comput Sci. 2002 Jul-Aug;42(4):947-55.

Process Flow in Protein-Compound Interaction Search
[Figure: New Target Protein → Homology Search (BLAST, etc.) against the Protein DB → Homologous Target Protein with known ligands → Interactions Search → Large Reference Set of Known Ligands of Homologous Target Protein → Structure Similarity Search (ISIS SS, etc.) over Compound Descriptors of the Compound Library → Candidate Ligands of New Target Protein]
Schuffenhauer A, Floersheim P, Acklin P, Jacoby E., “Similarity metrics for ligands reflecting the similarity of the target proteins”, J Chem Inf Comput Sci. 2003 Mar-Apr;43(2):391-405.
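The three steps of this flow can be sketched end to end as below. This is a toy version under stated assumptions: a position-by-position identity score stands in for BLAST, sets of integer keys stand in for compound descriptors, and all sequences, names, and thresholds are invented for the example.

```python
from typing import Dict, FrozenSet, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; a toy stand-in for a BLAST score."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto index over two key sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def candidate_ligands(
    target_seq: str,
    protein_db: Dict[str, str],
    known_ligands: Dict[str, List[FrozenSet[int]]],
    compound_library: Dict[str, FrozenSet[int]],
    min_identity: float = 0.6,
    min_tanimoto: float = 0.5,
) -> List[Tuple[str, float]]:
    # Step 1: homology search for proteins similar to the new target.
    homologs = [
        pid for pid, seq in protein_db.items()
        if identity(target_seq, seq) >= min_identity
    ]
    # Step 2: interaction search collects known ligands of the homologs.
    reference = [lig for pid in homologs for lig in known_ligands.get(pid, [])]
    # Step 3: structure similarity search over the compound library.
    candidates = {}
    for name, keys in compound_library.items():
        best = max((tanimoto(keys, ref) for ref in reference), default=0.0)
        if best >= min_tanimoto:
            candidates[name] = best
    return sorted(candidates.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical data for the sketch.
protein_db = {"prot_1": "MKTAYIAKQR", "prot_2": "GGGGGGGGGG"}
known_ligands = {
    "prot_1": [frozenset({1, 2, 3, 4})],
    "prot_2": [frozenset({7, 8})],
}
compound_library = {"lib_a": frozenset({1, 2, 3}), "lib_b": frozenset({7, 8, 9})}
ranked = candidate_ligands("MKTAYLAKQR", protein_db, known_ligands, compound_library)
```

Only `prot_1` passes the homology step, so only its ligand enters the reference set, and `lib_b` is not proposed even though it resembles a ligand of the unrelated `prot_2`.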

Example of Protein-Compound Interaction Search
• Protein (ex.) PPARgamma: binding domains Zf-C4 137-211 and Hormone_rec 318-501; agonist compound (ex.) rosiglitazone
• Protein (ex.) PPARalpha: binding domains Zf-C4 100-174 and Hormone_rec 281-464; agonist compound (ex.) fenofibrate
• Dual agonist compound (ex.) ragaglitazar
[Figure: the two proteins are linked by homology similarity, and the compounds are linked by activity similarity]

Protein-Compound Interaction Search System Website

Applications Available to the User
• Protein Sequence Search: retrieve the target protein’s sequence by specifying its Protein ID.
• Homology Search: search the Protein DB for proteins homologous to the target.
• Protein-Compound Interaction Search: extract ligands that bind to the homologous proteins.
• Compound Search: search for new compounds that may interact with the target protein, by structural similarity to the extracted ligands.

Flow of User Access and Grid Service Execution
[Figure: layers, top to bottom]
• User Access (Web Browser): receives Protein Sequence Information, Homology Search Information, Protein-Compound Interaction Search Information, and Compound Structure Search Information
• Search Portal (Servlet): Sequence Search, Homology Search, Interaction Search, Structure Similarity Search
• Services: Protein Metadata Service, Protein Sequence Homology Search (BLAST), Protein-Compound Interaction Metadata Service, Compound Metadata Service, Compound Structure Similarity Search (Tanimoto Index)
• Grid Data Service (OGSA-DAI): one GDS per data source, running on Grid Service (Globus Toolkit 3)
• Databases: Protein DB, Interaction DB, Compound DB
