Large Scale Knowledge Representation of Distributed Biomedical Information
Volker Stümpflen, Thorsten Barnickel, Karamfilka Nenova
MIPS / Institute for Bioinformatics
GSF – National Research Center for Environment and Health
TMRA 07

Understanding Complex Biological Systems
[Figure labels: Data, Knowledge]

Systems Biology

Questions
• Different knowledge domains?
• Ontologies for semantic structuring?
• Semantic structures from free text?
• Knowledge representation from distributed resources?

Merging Knowledge from Different Domains

Semantic Structuring: Demands for Ontologies
• Life sciences have a long tradition in classification …
• … various ontologies are available and in use
• Ontologies (in the broadest sense):
  • Controlled vocabularies
  • Taxonomies
  • Frames
  • …
• Examples for ontologies:
  • MeSH terms, Gene Ontology (GO), FunCat, …
  • Many more from e.g. Open Biomedical Ontologies (http://obofoundry.org/)

Example: Extending the Functional Context of Proteins

Semantic Structuring and Knowledge Representation
[Figure labels: Knowledge Portal; Topic Map Generation; Textmining; Distributed access system; Web Service]
• several hundred biomedical resources
• distributed
• > 1-2 PetaByte

Knowledge in Free Text
Free text: "… of pathogen response genes that prevent disease progression. The expression of ERF1 can be activated rapidly by ethylene or jasmonate and can be activated synergistically by both hormones. In addition, both signalling …"
[Figure labels: Free text; Topic Map]

REBIMET
• Relation Extraction from Biomedical Texts

Entity Recognition
• Identification of relevant biological entities:
  • Based on synonym lists created from terms in taxonomies, gene names, …
  • Realized with Apache Lucene
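
The synonym-list idea on this slide can be illustrated with a small dictionary matcher. This is only a sketch under stated assumptions: the slides say the real system is built on Apache Lucene, whereas the code below swaps in a plain regular-expression lookup, and the synonym entries and identifiers such as `gene:ERF1` are hypothetical.

```python
import re

# Hypothetical synonym list: surface form -> canonical entity identifier.
SYNONYMS = {
    "ERF1": "gene:ERF1",
    "ethylene response factor 1": "gene:ERF1",
    "ethylene": "compound:ethylene",
    "jasmonate": "compound:jasmonate",
}


def build_matcher(synonyms):
    # Longest terms first, so a multi-word synonym wins over a shorter one
    # that starts at the same position.
    terms = sorted(synonyms, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def recognize(text, synonyms=SYNONYMS):
    """Return (offset, matched text, canonical identifier) for each hit."""
    lookup = {term.lower(): ident for term, ident in synonyms.items()}
    matcher = build_matcher(synonyms)
    return [
        (m.start(), m.group(0), lookup[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]
```

Calling `recognize` on a sentence returns one tuple per recognized entity mention, which a later step could use to attach entities to extracted relations.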

Information Extraction with Semantic Role Labeling and Cooccurrence
1. Semantic Role Labeling: ASSERT tool (Pradhan S. et al., 2005)
   1.1 SPA structure for verb a)
   1.2 SPA structure for verb b)
2. Information Extraction:
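
One way to picture the second step is to combine a role-labeled verb structure with the recognized entities: whenever entities cooccur in different arguments of the same verb, a candidate relation is emitted. The sketch below is an assumption about how such a step could look, not the REBIMET implementation. The `LabeledPredicate` class, the PropBank-style role names, and the substring matching are all hypothetical simplifications.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict


@dataclass
class LabeledPredicate:
    # Hypothetical stand-in for one role-labeled verb structure.
    verb: str
    arguments: Dict[str, str]  # role label -> argument text


def extract_relations(structures, entity_terms):
    """Emit (entity, verb, entity) triples for entities in different roles."""
    relations = []
    for structure in structures:
        found = {}  # role label -> entity terms found in that argument
        for role, text in structure.arguments.items():
            for term in entity_terms:
                if term.lower() in text.lower():
                    found.setdefault(role, []).append(term)
        # Pair entities across roles; sorted order puts ARG0 before ARG1.
        for first, second in combinations(sorted(found), 2):
            for left in found[first]:
                for right in found[second]:
                    relations.append((left, structure.verb, right))
    return relations
```

For the example sentence about ERF1, a structure with verb `activate`, `ARG0` "ethylene or jasmonate" and `ARG1` "The expression of ERF1" would yield two candidate relations, one per hormone.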

Simplified TM Representation
• Generation of Topic Map fragments
• Connection to evidence in text by reification
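
The two bullets can be sketched as a small data model: an extracted relation becomes an association between two topics, and the association is reified by a third topic that carries the supporting sentence as an occurrence. The class names, identifier scheme, and role names (`agent`, `target`) are assumptions made for illustration and are not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Topic:
    identifier: str
    name: str
    occurrences: Dict[str, str] = field(default_factory=dict)


@dataclass
class Association:
    type: str
    roles: Dict[str, str]          # role type -> topic identifier
    reifier: Optional[str] = None  # topic that reifies this association


@dataclass
class Fragment:
    topics: List[Topic]
    associations: List[Association]


def make_fragment(agent, verb, target, sentence):
    """Build a fragment whose association points to its textual evidence."""
    agent_topic = Topic(f"entity:{agent}", agent)
    target_topic = Topic(f"entity:{target}", target)
    # The reifying topic makes the association itself a subject, so the
    # evidence sentence can be attached to it as an occurrence.
    evidence = Topic(
        f"evidence:{agent}-{verb}-{target}",
        f"Evidence for {agent} {verb} {target}",
        occurrences={"sentence": sentence},
    )
    association = Association(
        verb,
        {"agent": agent_topic.identifier, "target": target_topic.identifier},
        reifier=evidence.identifier,
    )
    return Fragment([agent_topic, target_topic, evidence], [association])
```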

Screenshot Portal
• PSI based merging of textmining model with genome model
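
PSI based merging follows the Topic Maps rule that two topics sharing a subject identifier represent the same subject and are merged. A minimal sketch of that rule is shown below. It assumes topics are plain dictionaries with a set of PSIs and a set of names, and the example URIs in the usage note are hypothetical.

```python
def merge_by_psi(topics):
    """Merge topics that share at least one subject identifier (PSI)."""
    merged = []  # invariant: entries have pairwise disjoint PSI sets
    for topic in topics:
        current = {"psis": set(topic["psis"]), "names": set(topic["names"])}
        remaining = []
        for existing in merged:
            if existing["psis"] & current["psis"]:
                # Same subject: absorb identifiers and names.
                current["psis"] |= existing["psis"]
                current["names"] |= existing["names"]
            else:
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return merged
```

If a topic from the textmining model and a topic from the genome model both carry a PSI such as `http://example.org/psi/gene/ERF1`, they end up as one topic holding the names from both models.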

Large Scale Integration and Knowledge Representation
[Figure labels: Topic Map Generation; Textmining; Distributed access system; Web Service]

GeKnow: Integration of PEDANT, SIMAP, NCBI data, NCBI PubMed
• PEDANT 3, ~600 GB
  • contains 450 genomes, each stored in a single MySQL database
  • no possibility of simultaneous cross-genome comparison
• SIMAP, ~540 GB compressed
  • contains over 7 million unique protein sequences
• NCBI
  • Taxonomy information (some thousands)
• Textmining from PubMed
  • 16 million abstracts, 65 million hits, 15 million sentences, 13 million SPA structures
• Integration of these data on the fly
  • Semantic linking of PEDANT databases with SIMAP and NCBI Taxonomy
  • No redundant data

How To Generate the Topic Maps?
Generation of TM fragments
• Problems with generation of one large TM:
  • Very large data collections (storage problems)
  • Distributed
  • Update problems

System Architecture (GeKnow)
• Extension of our n-Tier J2EE based component and service oriented architecture (EJBs and Web Services)
• Simply by adding some semantic components ..
• .. and one semantic Tier

Concept:
• Independent semantic layer on top of arbitrary data sources
[Figure labels: Semantic level; Semantic manager (merging, fragments); TM; Resource manager; Configuration; Integration level; Web Service]

Integration Tier
• Resource:
  • Aware of mapping between topic / association types and methods from data source
• Handler:
  • Proxy
  • Manages connections
  • Executes query methods
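
The division of labor between Resource and Handler might look roughly like the sketch below. It is not the GeKnow code, which the slides describe as J2EE based. The class layout, the lazy connection handling, and the `FakeGenomeDb` stand-in with its method names are all assumptions chosen to keep the example self-contained.

```python
class Handler:
    """Proxy for one data source: manages the connection, runs query methods."""

    def __init__(self, connect):
        self._connect = connect   # callable that opens a connection
        self._connection = None

    def execute(self, method, *args):
        if self._connection is None:
            self._connection = self._connect()  # open on first use
        return getattr(self._connection, method)(*args)


class Resource:
    """Knows which data source method serves which topic or association type."""

    def __init__(self, handler, type_to_method):
        self.handler = handler
        self.type_to_method = type_to_method

    def query(self, tm_type, *args):
        return self.handler.execute(self.type_to_method[tm_type], *args)


class FakeGenomeDb:
    # Hypothetical data source used only to make the sketch runnable.
    def find_protein(self, protein_id):
        return {"id": protein_id, "description": "placeholder record"}

    def find_homologs(self, protein_id):
        return []


genome = Resource(
    Handler(FakeGenomeDb),
    {"protein": "find_protein", "is-homolog-of": "find_homologs"},
)
```

With this layout, supporting a new topic or association type only requires a new entry in the mapping, while connection handling stays inside the handler.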

Syntax Tier – Topic Types
• Converts resource specific format into TM fragments
• May access multiple resources (handled by Resource Manager)

Syntax Tier – Association Types
• Converts resource specific format into TM fragments
• May access multiple resources (handled by Resource Manager)

Semantic Tier
• Responsible for
  • fragment generation
  • Merging
• No programming required (only configuration)
[Figure label: Configuration]
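
A configuration-driven semantic tier could be sketched as follows: a configuration table names the resources that contribute to each topic type, and a generic function collects and merges their partial fragments. The configuration format, the converter callables, and the field names are hypothetical, since the slides do not show the actual configuration.

```python
# Hypothetical configuration: topic type -> resources that contribute to it.
CONFIG = {"protein": ["genome_db", "textmining"]}


def generate_fragment(topic_type, key, config, converters):
    """Collect partial fragments from all configured resources and merge them."""
    merged = {"id": f"{topic_type}:{key}", "names": set(), "occurrences": []}
    for resource_name in config[topic_type]:
        part = converters[resource_name](key)
        merged["names"] |= set(part.get("names", []))
        merged["occurrences"] += part.get("occurrences", [])
    return merged


# Stand-in converters; real ones would query the underlying resources.
CONVERTERS = {
    "genome_db": lambda key: {"names": [key]},
    "textmining": lambda key: {"occurrences": [f"sentence mentioning {key}"]},
}
```

Adding another resource for a topic type then means editing `CONFIG` and registering a converter, with no change to `generate_fragment` itself.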

Portal / Portlets (JSR-168)

Portal
• Currently JSF based
  • Caused several problems
• Migration to more generic portlets (XSLT based)