Chemical Names: Terminological Resources and Corpora Annotation
Corinna Kolářik, Roman Klinger, C. M. Friedrich, M. Hofmann-Apitius, J. Fluck
Workshop BERBTM ’08 at LREC ’08, Marrakech, Morocco, 26 May 2008

Outline

1. Introduction
2. Terminological Resources
3. Test Corpus
4. Machine Learning based Recognition
5. Conclusion & Summary

Roman Klinger – Chemical Names: Terminological Resources and Corpora Annotation

Introduction

Most efforts in Named Entity Recognition have been spent on genes and proteins (well-established methods available, BioCreative I & II):
- Corpora, comparable systems, standard dictionary sources

Chemical named entities are important for:
- Medical applications, drug development, pharmaceutical research, analysis of biochemical pathways, ...

⇒ Need for terminological resources and annotated corpora

Introduction – Examples

Novel nonnarcotic analgesics with an improved therapeutic ratio. Structure-activity relationships of 8-(methylthio)- and 8-(acylthio)-1,2,3,4,5,6-hexahydro-2,6-methano-3-benzazocines.

Conversion of the 8-phenolic 1,2,3,4,5,6-hexahydro-2,6-methano-3-benzazocines to the corresponding 8-thiophenolic analogues was achieved by three different routes. Diazotization of 8-amino-2,6-methano-3-benzazocine (2) followed by the reaction with CH3SNa afforded 8-(methylthio)-1,2,3,4,5,6-hexahydro-2,6-methano-3-benzazocine (3). Another route using Grewe cyclization was also examined for the synthesis of 3. As the most effective route, Newman-Kwart rearrangement of benzazocines was selected and closely investigated. 8-(N,N-Dimethylthiocarbamoyl)oxy derivatives (6a-e) rearranged to 8-(N,N-dimethylcarbamoyl)thio derivatives (7a-e) in good yields. Reductive cleavage of 7a-e and subsequent methylation or acylations gave the title compounds (3, 8-24). Although analgesic activities of sulfur-containing benzazocines decreased compared to the corresponding hydroxy compounds, the N-methyl derivative (S-metazocine, 8) showed potent analgesic activity.

PMID 2999399: Hori M, et al. J Med Chem. 1985 Nov;28(11):1656-61.

Introduction – Examples

Different nomenclatures for Aspirin:
- Trade name: Aspirin
- Formula: C9H8O4
- IUPAC: 2-acetyloxybenzoic acid
- SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
- InChI: InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H
- Other synonyms: Acetylsalicylate, Enterosarein, Acenterine, Acylpyrin, Acetosal, Colfarit, Acetylsalicylic Acid, Acetosalic acid, Enterosarine

Normalization: mapping to a unique structure or identifier

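The normalization step on this slide can be illustrated with a small lookup table. This is a minimal sketch, not the authors' system: the dictionary layout, the function name `normalize`, and the case-insensitive matching are assumptions made for illustration; the synonyms and the InChI string are taken from the slide.

```python
# Identifier copied from the slide; used here as the normalization target.
ASPIRIN_INCHI = (
    "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12"
    "/h2-5H,1H3,(H,11,12)/f/h11H"
)

# Names listed on the slide for the same compound.
SYNONYMS = [
    "Aspirin", "2-acetyloxybenzoic acid", "Acetylsalicylate",
    "Enterosarein", "Acenterine", "Acylpyrin", "Acetosal", "Colfarit",
    "Acetylsalicylic Acid", "Acetosalic acid", "Enterosarine",
]

# Assumption: names are matched case-insensitively after trimming whitespace.
LOOKUP = {name.strip().lower(): ASPIRIN_INCHI for name in SYNONYMS}


def normalize(name):
    """Return the identifier for a chemical name, or None if it is unknown."""
    return LOOKUP.get(name.strip().lower())


print(normalize("acetosal") == ASPIRIN_INCHI)  # True
print(normalize("caffeine"))                   # None
```

A real dictionary would hold many compounds and would need a policy for ambiguous names; the sketch only shows the many-names-to-one-identifier idea.
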
Introduction – Examples

Normalization: mapping to a unique structure or identifier

Additional available information:
- Molecular Weight: 180.15742 g/mol
- Heavy Atom Count: 13
- Structure Search
- Classes: Benzoic acid family, Cyclooxygenase inhibitors
- Therapeutic Indications: Fever, Inflammation, Pain
- ...

Where to find such information with structure and synonyms?

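The molecular weight and heavy atom count on this slide follow from the Aspirin formula C9H8O4 given on the previous slide. A minimal sketch of that arithmetic, assuming the atomic weights C = 12.0107, H = 1.00794, and O = 15.9994 g/mol (these values are not stated in the slides) and a simple formula parser written only for this illustration:

```python
import re

# Assumed standard atomic weights in g/mol (not stated in the slides).
ATOMIC_WEIGHT = {"C": 12.0107, "H": 1.00794, "O": 15.9994}


def parse_formula(formula):
    """Parse a flat sum formula such as 'C9H8O4' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def molecular_weight(formula):
    """Sum the atomic weights of all atoms in the formula."""
    counts = parse_formula(formula)
    return sum(ATOMIC_WEIGHT[el] * n for el, n in counts.items())


def heavy_atom_count(formula):
    """Count all atoms other than hydrogen."""
    return sum(n for el, n in parse_formula(formula).items() if el != "H")


aspirin = "C9H8O4"
print(round(molecular_weight(aspirin), 5))  # 180.15742
print(heavy_atom_count(aspirin))            # 13
```

The parser handles only flat formulas without brackets or charges, which is enough for this example.
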
Terminological Resources – Commercial Databases

CrossFire Beilstein Database
- 10 million organic compounds
- Information on bioactivity and physical properties
- Literature references

CAS Registry℠
- 35 million organic and inorganic substances
- Unique IDs (CAS Registry Numbers) assigned ⇒ established IDs

The World Drug Index
- 80,000 marketed and development drugs
- Drug names, synonyms, trade names, trivial names
- Drug activity, treatment, manufacturer, medical information

Terminological Resources – Freely Available Resources

PubChem (http://pubchem.ncbi.nlm.nih.gov/)
- Compound: 18.4 million compounds; structure information, SMILES, InChI, IUPAC; no synonyms
- Substance: 36.8 million entries (substances and proteins, mixtures, extracts, complexes); no SMILES, few InChI

Chemical Entities of Biological Interest (ChEBI) (http://www.ebi.ac.uk/chebi/)
- 15,562 entries
- Ontological classification
- Synonyms and structure information

Terminological Resources – Freely Available Resources

MeSH
- Medical Subject Headings (referred to as MeSH T)
  - Thesaurus used for indexing MEDLINE, hierarchically organized
  - Chemical category contains 8,612 entries
  - No structure information
- Supplementary Concept Records (formerly Supplementary Chemical Records), provided by the National Library of Medicine (referred to as MeSH C)
  - 175,136 entries
  - Synonyms; no structure, but CAS identifier

Terminological Resources – Freely Available Resources

Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/)
- KEGG Compound (15,033 entries)
- KEGG Drug (6,834 entries)
- Synonyms and structures

DrugBank (http://www.drugbank.ca/)
- 4,764 entries
- Synonyms and structure

Human Metabolome Database (HMDB) (http://www.hmdb.ca/)
- 3,000 entries
- Many synonyms and structural information

Terminological Resources – Analysis

[Figure: Number of entries in extracted dictionaries. Bar chart with a logarithmic y-axis (Number, 1000 to 1e+07) for PubChem, MeSH_T, ChEBI, DrugBank, HMDB, MeSH_C, and KEGG.]

Terminological Resources – Analysis

Are all synonyms included in PubChem?

[Figure: Percentage of overlap with PubChem. Bar chart with a y-axis from 0 to 100 for PubChem, MeSH_T, ChEBI, DrugBank, HMDB, MeSH_C, and KEGG.]

Combining all analyzed dictionaries, 69 % of the synonyms are not from PubChem but from the other resources.

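One way to compute an overlap of this kind is a set intersection over normalized synonyms. The sketch below is an assumption about the procedure, not the authors' implementation; the toy dictionaries, the lowercase normalization, and the function name `overlap_percentage` are hypothetical.

```python
def normalize_term(term):
    # Assumption: synonyms are compared case-insensitively, ignoring
    # leading and trailing whitespace.
    return term.strip().lower()


def overlap_percentage(dictionary, reference):
    """Percentage of terms in `dictionary` that also occur in `reference`."""
    terms = {normalize_term(t) for t in dictionary}
    ref = {normalize_term(t) for t in reference}
    if not terms:
        return 0.0
    return 100.0 * len(terms & ref) / len(terms)


# Toy data for illustration only; not taken from the real resources.
pubchem = ["Aspirin", "Acetylsalicylic acid", "Testosterone"]
other = ["aspirin", "Acetosal", "Colfarit", "testosterone"]

print(overlap_percentage(other, pubchem))  # 50.0
```

With real dictionaries the choice of normalization (case, hyphens, whitespace) can change the measured overlap noticeably, so it should be reported alongside the numbers.
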
Annotation of a Test Corpus

How to analyze the usability of the resources for text mining?
⇒ Test corpus
⇒ Assumption: Some classes are easier to find in text than others.
⇒ Different classes, easy to annotate

Annotation of a Test Corpus – Classes

TRIVIAL
- Single-word terms (even if they are in fact IUPAC names)
- Examples: aspirin, estragon, testosterone, Acetylsalicylate

IUPAC(-like)
- Multi-word systematic names
- Examples: N-substituted-pyridino[2,3-f]indole-4,9-dione, 1-hexoxy-4-methyl-hexane, elaidic acid, 1,4-dihydronaphthoquinones

PART
- Partial chemical names (e.g. in enumerations)
- Examples: 8-(methylthio)- and ..., 17beta-

Annotation of a Test Corpus – Classes

ABBREVIATION
- Abbreviations of names; not tagged separately when part of an IUPAC name
- Examples: TPA, AMPA

SUM
- Sum formulas
- Examples: CH3SNa, KOH

FAMILY
- Chemical families, not pharmacological/functional families (such as anti-inflammatory drug, chelator)
- Examples: disaccharide, pyrimidine, hydrazides

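The six annotation classes could be represented in an annotation or evaluation script roughly as follows. This is a sketch under assumptions: the `Annotation` record with character offsets is a hypothetical standoff format, not the format used for the corpus, and labeling the systematic name below as IUPAC is an illustrative choice. The class names, the SUM example, and the sentence fragment come from the slides.

```python
from dataclasses import dataclass
from enum import Enum


class ChemClass(Enum):
    TRIVIAL = "TRIVIAL"
    IUPAC = "IUPAC"
    PART = "PART"
    ABBREVIATION = "ABBREVIATION"
    SUM = "SUM"
    FAMILY = "FAMILY"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical standoff annotation: character offsets plus a class."""
    start: int
    end: int
    label: ChemClass


def annotate(text, mention, label):
    """Annotate the first occurrence of `mention` in `text`."""
    start = text.index(mention)
    return Annotation(start, start + len(mention), label)


text = ("Diazotization of 8-amino-2,6-methano-3-benzazocine followed by "
        "the reaction with CH3SNa")

annotations = [
    annotate(text, "8-amino-2,6-methano-3-benzazocine", ChemClass.IUPAC),
    annotate(text, "CH3SNa", ChemClass.SUM),
]

for a in annotations:
    print(a.label.value, text[a.start:a.end])
```

Keeping offsets rather than token indices leaves the annotation independent of any particular tokenizer, which matters for names containing brackets, commas, and hyphens.
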
Annotation of a Test Corpus

Problems that occurred during annotation:
- Labels (H3-[chemical])
- Mixtures of abbreviations in long names: (2R,10S)-N(1)-cyclopropylmethyl-2,10-dihydroxy-N(11)-ethylnorspermine abbreviated as (2R,10S)-(HO)(2)CPMENSPM
- Differentiation of family and trivial names