Leveraging Open Chemogenomics Data and Tools with KNIME

  1. Leveraging Open Chemogenomics Data and Tools with KNIME
     George Papadatos, ChEMBL Group, georgep@ebi.ac.uk

  2. What is EMBL-EBI?
     • Europe’s home for biological data, services, research and training
     • A trusted data provider for the life sciences
     • Part of the European Molecular Biology Laboratory, an intergovernmental research organisation
     • International: 570 members of staff from 57 nations

  3. Data resources at EMBL-EBI (cross-domain resources)
     • Genes, genomes & variation: European Nucleotide Archive, Ensembl, GWAS Catalog, European Variation Archive, Ensembl Genomes, Metagenomics portal
     • Gene, protein & metabolite expression: RNA Central, ArrayExpress, MetaboLights, Expression Atlas, PRIDE
     • Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology
     • Protein sequences, families & motifs: InterPro, Pfam, UniProt
     • Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
     • Chemical biology: ChEMBL, SureChEMBL, ChEBI
     • Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
     • Systems: BioModels, Enzyme Portal, BioSamples

  4. ChEMBL: Data for drug discovery
     1. Scientific facts
     2. Organization, integration, curation and standardization of pharmacology data
     3. Insight, tools and resources for translational drug discovery
     Figure: a compound, an assay/target (the thrombin protein sequence shown in FASTA form) and bioactivity data, with example values Ki = 4.5 nM and APTT = 11 min.

  5. KNIME at the EBI
     • KNIME nodes to access ChEBI and ChEMBL databases
     • Trusted community nodes
     • Workflows on Examples server
     • Method development and use cases
     • Provide KNIME training to scientists and researchers
     • Wellcome Trust drug discovery courses, EMBL courses
     • CDK community nodes support
     https://tech.knime.org/book/embl-ebi-nodes-for-knime-trusted-extension

  6. KNIME and ChEMBL
     • ChEMBL Web Services: 14M bioactivities, 1.5M structures
     • Virtual Machine: local access to ChEMBL data and services
     • UniChem Web Services: access ~110M structures from 27 sources
     • Example workflow: knime://EXAMPLES/099_Community/08_ChEMBL_WebServices

  7. KNIME and ChEMBL
     • ChEMBL Web Services: 14M bioactivities, 1.5M structures
     • Virtual Machine: local access to ChEMBL data and services
     • UniChem Web Services: access ~110M structures from 27 sources
     • Patent Annotations: 4M patent documents, 14M structures, 260M annotations

  8. Why look at patent documents?
     • Patent filing and searching
     • Legal, financial and commercial incentives & interests
     • Prior art, novelty, freedom-to-operate searches
     • Competitive intelligence
     • Unprecedented wealth of knowledge
     • Most of the knowledge will never be disclosed anywhere else
     • Compounds, scaffolds, reactions
     • Biological targets, diseases, indications
     • Average lag of 2-4 years between disclosure in a patent document and in a journal publication for chemistry, 4-5 years for biological targets

  9. SureChEMBL data processing
     Diagram labels:
     • Patent Offices: WO; EP (Applications & Granted); US (Applications & granted); JP (Abstracts)
     • SureChEMBL System (SureChem IP): Patent PDFs (service), OCR, Entity Recognition, Name to Structure (five methods), Image to Structure (one method), Attachments, Processed patents (service), Chemistry Database, Database, Application, API
     • Users
     • Example chemical name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
     www.surechembl.org

  10. SureChEMBL data processing v2
     Diagram labels:
     • Patent Offices: WO; EP (Applications & Granted); US (Applications & granted); JP (Abstracts)
     • SureChEMBL System (SureChem IP): Patent PDFs (service), OCR, Entity Recognition, Bio-Entity Recognition, Name to Structure (five methods), Image to Structure (one method), Processed patents (service), Chemistry Database, Database, Application Server
     • Users
     www.surechembl.org

  11. SureChEMBL bioannotation
     • SciBite’s Termite text-mining engine run on 4M life-science patents from the SureChEMBL corpus
     • Genes (identified by HGNC symbols) and diseases (identified by MeSH IDs) annotated
     • Section/frequency information annotated (e.g., in title, abstract, claims, total frequency)
     • Relevance score (0-3) to flag important chemical and biological entities and remove noise

  12. Relevance scoring – genes/diseases
     • Various features used:
        • Term frequency
        • Position (title, abstract, figure, caption, table)
        • Frequency distribution
     • Scores range from 0 – 3
        • 3 – most important entities in the patent
        • 2 – important entities in the patent
        • 1 – mentioned entities in the patent
        • 0 – ambiguous entity/likely annotation error

  13. Relevance scoring – compounds
     • Main assumptions for relevance:
        1. Very frequent compounds are irrelevant (but if drug-like then that’s OK)
        2. Compounds with busy chemical space around them are interesting
     • Use distribution of close analogues (nearest neighbours, NNs) among compounds found in the same patent family
     • Scores range from 0 – 3
        • 3 – highest number of NNs: most important entities in the patent
        • 2 – important entities in the patent
        • 1 – few NNs: mentioned entities in the patent
        • 0 – singletons or trivial entities, most likely errors or reagents, solvents, substituents
     Hattori, K., Wakabayashi, H., & Tamaki, K. (2008). JCIM, 48(1), 135–142. doi:10.1021/ci7002686
     Tyrchan, C., Boström, J., Giordanetto, F., Winter, J., & Muresan, S. (2012). JCIM, 52(6), 1480–1489. doi:10.1021/ci3001293
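
The nearest-neighbour idea on the slide above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SureChEMBL implementation: fingerprints are modelled as plain sets of on-bits, the Tanimoto threshold of 0.7 and the 75% / 25% cut-offs that map neighbour counts to scores are invented for the example, and the first assumption (down-weighting very frequent compounds) is left out.

```python
from typing import Dict, FrozenSet


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def count_neighbours(
    fps: Dict[str, FrozenSet[int]], threshold: float = 0.7
) -> Dict[str, int]:
    """Count, per compound, the other compounds in the same patent family
    whose similarity is at or above the threshold (assumed value)."""
    counts = {name: 0 for name in fps}
    names = list(fps)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if tanimoto(fps[first], fps[second]) >= threshold:
                counts[first] += 1
                counts[second] += 1
    return counts


def relevance_score(n_neighbours: int, max_neighbours: int) -> int:
    """Map a neighbour count to a 0-3 score (cut-offs are illustrative)."""
    if n_neighbours == 0:
        return 0  # singleton: likely a reagent, solvent or extraction error
    if n_neighbours >= 0.75 * max_neighbours:
        return 3  # busiest chemical space in the family
    if n_neighbours >= 0.25 * max_neighbours:
        return 2
    return 1  # only a few close analogues


if __name__ == "__main__":
    # Toy fingerprints with hypothetical compound names.
    family = {
        "cmpd_A": frozenset({1, 2, 3, 4}),
        "cmpd_B": frozenset({1, 2, 3, 5}),
        "cmpd_C": frozenset({1, 2, 3, 4, 5}),
        "cmpd_D": frozenset({10, 11}),
    }
    neighbours = count_neighbours(family)
    top = max(neighbours.values())
    for name, n in neighbours.items():
        print(name, n, relevance_score(n, top))
```

In a real workflow the fingerprints would come from a cheminformatics toolkit, and the cut-offs would be tuned against the neighbour-count distribution of each patent family.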

  14. Gotchas & out of scope
     • No Markush extraction
     • No natural language processing (e.g., ‘compound x is an inhibitor of target y’)
     • No extraction of bioactivities
     • No chemistry search (yet)
     • Patent coverage stops in April 2015
     • Incremental updates TBD
     • Patent calls still in dev
     • Old scripts / workflows may break

  15. Open PHACTS Architecture Drug Discovery Today 2012, 17:21 (doi:10.1016/j.drudis.2012.05.016)

  16. The Open PHACTS node executable API call to KREST nodes

  17. Open PHACTS Patent API https://dev.openphacts.org/docs/develop

  18. Open PHACTS Patent API
     Diagram: extracted links connect Patent to Compound, Disease and Target.

  19. Open PHACTS Patent API
     Diagram: extracted links connect Patent to Compound, Disease and Target; inferred links are added to the diagram.

  20. Use case #1: Patent to Entities
     1. From a patent get compounds, genes and diseases
     2. Filter to remove noise
        • Frequency and relevance score
     3. Process and visualise
     Diagram: Patent → Compound (?), Disease (?), Target (?)

  21. US-7718693-B2

  22. Use case #1: Patent to Entities
     • Patent URI: http://rdf.ebi.ac.uk/resource/surechembl/patent/US-7718693-B2
     • API call:
     • 586 entities back
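
The patent URI on this slide follows a simple pattern: a fixed SureChEMBL RDF prefix followed by the patent identifier. A minimal Python sketch of building it is given here. The transcript does not capture the API endpoint or its parameter names, so they are not reproduced; the `as_query_value` helper only shows the percent-encoding that would be needed if the URI were passed as a query-string value, which is an assumption about how the call is made. Both function names are hypothetical.

```python
from urllib.parse import quote

# Prefix taken from the patent URI shown on the slide.
SURECHEMBL_PATENT_BASE = "http://rdf.ebi.ac.uk/resource/surechembl/patent/"


def patent_uri(patent_id: str) -> str:
    """Build the SureChEMBL RDF URI for a patent identifier."""
    return SURECHEMBL_PATENT_BASE + patent_id.strip()


def as_query_value(uri: str) -> str:
    """Percent-encode a URI so it can be embedded as a query-string value
    (assumption: the API accepts the URI as a query parameter)."""
    return quote(uri, safe="")


if __name__ == "__main__":
    uri = patent_uri("US-7718693-B2")
    print(uri)
    print(as_query_value(uri))
```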

  24. 1) Look at target and disease entities
     • 178 target and disease entities
     • Filter: Relevance score >= 2 → 23 remain
     • Visualise in tag cloud by frequency
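
The filter-then-visualise step above can be sketched outside KNIME as well. In this Python illustration the record fields (`label`, `relevance`, `frequency`) and the entity labels are hypothetical placeholders, not the field names returned by the Open PHACTS API; only the threshold of 2 comes from the slide.

```python
from collections import Counter
from typing import Dict, Iterable, List


def filter_by_relevance(entities: Iterable[Dict], min_score: int) -> List[Dict]:
    """Keep annotations whose relevance score is at least min_score."""
    return [e for e in entities if e["relevance"] >= min_score]


def tag_cloud_weights(entities: Iterable[Dict]) -> Counter:
    """Sum the mention frequency per entity label; the totals can be used
    as font-size weights in a tag cloud."""
    weights: Counter = Counter()
    for e in entities:
        weights[e["label"]] += e["frequency"]
    return weights


if __name__ == "__main__":
    # Placeholder annotations; real ones would come from the patent API.
    annotations = [
        {"label": "GENE_A", "relevance": 3, "frequency": 12},
        {"label": "DISEASE_B", "relevance": 2, "frequency": 5},
        {"label": "GENE_C", "relevance": 1, "frequency": 40},
    ]
    kept = filter_by_relevance(annotations, min_score=2)
    for label, weight in tag_cloud_weights(kept).most_common():
        print(label, weight)
```

Filtering before counting matters: a frequently mentioned but low-relevance term would otherwise dominate the cloud.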

  26. Does it make sense?

  27. Does it make sense? US-7718693-B2

  28. 2) Look at compound entities
     • 408 compound entities
     • Filter: Relevance score >= 1 → 201 remain
     • Calculate properties

  30. Does it make sense?
     • Calculate the maximum common substructure (MCS)

  31. Does it make sense?
     • Calculate the maximum common substructure (MCS)
     US-7718693-B2

  32. Use case #2: Drug targets & indications for compound
     1. Search patents for a compound (approved drug)
     2. Filter to remove noise
        • Frequency, relevance score and classification code
     3. For remaining patents, get disease and target entities
     4. Filter to remove noise
        • Frequency and relevance score
     5. Visualise results
     Diagram: Compound → Patent → Disease, Target
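
Steps 2 to 4 of this use case can be sketched as two small filters over in-memory records. Everything specific in this Python example is an assumption made for illustration: the record layout, the field names, the classification-code prefixes `A61` and `C07`, and the score thresholds. The slide names the filter criteria but not their values.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def keep_patent(
    patent: Dict,
    min_score: int = 1,
    code_prefixes: Sequence[str] = ("A61", "C07"),  # assumed prefixes
) -> bool:
    """Step 2: keep a patent if the compound is relevant enough in it and
    at least one classification code starts with an accepted prefix."""
    relevant = patent["compound_relevance"] >= min_score
    classified = any(
        code.startswith(tuple(code_prefixes))
        for code in patent["classification_codes"]
    )
    return relevant and classified


def entity_support(patents: Iterable[Dict], min_entity_score: int = 2) -> Counter:
    """Steps 3-4: count in how many patents each sufficiently relevant
    disease or target entity appears."""
    support: Counter = Counter()
    for patent in patents:
        labels = {
            e["label"]
            for e in patent["entities"]
            if e["relevance"] >= min_entity_score
        }
        support.update(labels)  # each patent counts once per entity
    return support


if __name__ == "__main__":
    # Hypothetical records standing in for API results.
    patents = [
        {"compound_relevance": 2, "classification_codes": ["A61K"],
         "entities": [{"label": "TARGET_X", "relevance": 3}]},
        {"compound_relevance": 0, "classification_codes": ["A61K"],
         "entities": [{"label": "DISEASE_Y", "relevance": 3}]},
    ]
    kept = [p for p in patents if keep_patent(p)]
    print(entity_support(kept).most_common())
```

Counting patents per entity, rather than raw mentions, keeps one verbose patent from dominating the final visualisation.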

  33. Eluxadoline (JNJ-27018966, VIBERZI) CHEMBL2159122 FDA Approval: 2015

  34. 1) Get patents for Eluxadoline
     • UniChem call → SCHEMBL12971682
     • Compound URI: http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL12971682
     • API call:
     • Relevance score >= 1 → 17 patents (patentome):
