
Linked Life Data, Vassil Momtchev, 19/04/2011 - PowerPoint PPT Presentation


  1. Linked Life Data, Vassil Momtchev, 19/04/2011

  2. Outline
     • Semantic Data Integration
     • Linked Life Data concept
     • Integrated datasets
     • Behind the scenes

  3. Interlinking Text and Data

  4. Semantic Technologies vs. AI
     “If It Works, It's Not AI: A Commercial Look at Artificial Intelligence Startups”, Eve M. Phillips, M.Sc. thesis, MIT, 1999
     One can think of “Semantic Technologies” as AI made less abstract and more robust, predictable and manageable

  5. Semantic Technologies
     • “Semantic technologies” (ST) is a general term for any software that involves some kind and level of understanding of the meaning of the information it deals with
     • Examples:
       – A search engine that can match a query for “bird” with a document mentioning “eagle”
       – A database that will return Ivan as a result of a query for “?x relativeOf Maria”, when the fact asserted was “Maria motherOf Ivan”
       – A navigation system that is more intelligent than what we are already used to
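
The database example on this slide can be sketched as a tiny rule-based reasoner. This is only an illustration, not how any particular RDF store works: it assumes that motherOf is a sub-property of relativeOf and that relativeOf is symmetric, neither of which the slide states explicitly.

```python
# Toy forward-chaining reasoner over (subject, predicate, object) triples.
# The two schema statements below are assumptions made for illustration.
SUB_PROPERTY_OF = {"motherOf": "relativeOf"}  # assumed: every mother is a relative
SYMMETRIC = {"relativeOf"}                    # assumed: relativeOf holds both ways


def infer(facts):
    """Return the asserted facts plus everything the two rules entail."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(closure):
            derived = set()
            if p in SUB_PROPERTY_OF:
                derived.add((s, SUB_PROPERTY_OF[p], o))
            if p in SYMMETRIC:
                derived.add((o, p, s))
            new = derived - closure
            if new:
                closure |= new
                changed = True
    return closure


def query(closure, predicate, obj):
    """Answer '?x predicate obj' against the inferred closure."""
    return sorted(s for s, p, o in closure if p == predicate and o == obj)


facts = {("Maria", "motherOf", "Ivan")}
answers = query(infer(facts), "relativeOf", "Maria")
print(answers)  # ['Ivan']
```

Only one fact is asserted; the answer comes from applying the sub-property rule first and the symmetry rule second.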

  6. Ontotext Positioning
     • Leading semantic technology provider
       – Top-5 core semantic technology developer
       – Supplying engines and components to vendors and solution developers
     • Unique technology portfolio:
       – Semantic Databases: high-performance RDF DBMS, scalable reasoning
       – Semantic Search: text mining (IE), Information Retrieval (IR)
       – Web Mining: focused crawling, screen scraping, data fusion
     • Good recognition in the SemTech community
       – Ontotext pages are ranked #1 for “semantic annotation” and “semantic repository” at GYM

  7. Time to Guess It?

  8. Massive Data Integration Problem
     • Extreme amounts of data with inconsistent syntax, structure and semantics
     • Data is supported by different organizations
     • Information is highly distributed and redundant
     • Knowledge is locked in vast data silos
     • Isolated communities that cannot reach cross-domain understanding
     Increase the abstraction level of the data!

  9. Data representation: RDBMS vs. RDF

     Relational Tables

     Person
     ID   Name       Gender
     1    Maria P.   F
     2    Ivan Jr.   M
     3    …

     Parent
     ParID   ChiID
     1       2

     Spouse
     S1ID   S2ID   From   To
     1      3      …      …

     RDF Representation

     Statement
     Subject      Predicate    Object
     myo:Person   rdf:type     rdfs:Class
     myo:gender   rdf:type     rdf:Property
     myo:parent   rdfs:range   myo:Person
     myo:spouse   rdfs:range   myo:Person
     myd:Maria    rdf:type     myo:Person
     myd:Maria    rdfs:label   “Maria P.”
     myd:Maria    myo:gender   “F”
     myd:Ivan     rdfs:label   “Ivan Jr.”
     myd:Ivan     myo:gender   “M”
     myd:Maria    myo:parent   myd:Ivan
     myd:Maria    myo:spouse   myd:John
     …
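
A minimal sketch of how the Person and Parent rows could be turned into the statements shown on the RDF side. The `local_names` lookup is a hypothetical helper (the slide does not say how resource names are minted), the elided row 3 is left out, and labels use `rdfs:label`, the standard RDFS labelling property.

```python
MYO = "myo:"  # ontology prefix used on the slide
MYD = "myd:"  # data prefix used on the slide

# Rows copied from the Person and Parent tables.
person_rows = [(1, "Maria P.", "F"), (2, "Ivan Jr.", "M")]
parent_rows = [(1, 2)]  # (ParID, ChiID)

# Assumption: resource names come from a hand-written lookup keyed by row ID.
local_names = {1: "Maria", 2: "Ivan"}


def person_triples(rows):
    """Emit one type, one label and one gender statement per Person row."""
    for person_id, name, gender in rows:
        subject = MYD + local_names[person_id]
        yield (subject, "rdf:type", MYO + "Person")
        yield (subject, "rdfs:label", name)
        yield (subject, MYO + "gender", gender)


def parent_triples(rows):
    """Emit one myo:parent statement per Parent row, parent as subject."""
    for parent_id, child_id in rows:
        yield (
            MYD + local_names[parent_id],
            MYO + "parent",
            MYD + local_names[child_id],
        )


triples = list(person_triples(person_rows)) + list(parent_triples(parent_rows))
for triple in triples:
    print(*triple)
```

Each table column becomes a predicate and each foreign key becomes a link between two resources, so the join tables disappear in the RDF form.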

  10. Data representation: XML vs. RDF

      XML Documents

          <document>
            <person>
              <name>Maria</name>
              <gender>F</gender>
              <relList>
                <rel type="child">Ivan</rel>
              </relList>
            </person>
          </document>

      • No agreement over the structure and the vocabulary
      • Cannot be semantically compared by machine

      RDF Representation
      [Figure: RDF graph with nodes ptop:Person, ptop:Male, ptop:Woman, myData:Ivan and myData:Maria, connected by rdf:type and ptop:childOf edges]

  11. RDF Graph
      [Figure: RDF graph. Nodes: owl:SymmetricProperty, ptop:parentOf, owl:relativeOf, ptop:childOf, ptop:Agent, ptop:Person, ptop:Woman, myData:Ivan, myData:Maria. Edge labels: rdf:type, owl:inverseOf, rdfs:subPropertyOf, rdfs:range. Some edges are marked as inferred.]

  12. Linked Data Design Principles
      • Unambiguous identifiers for objects (resources)
        – Use URIs as names for things
      • Use the structure of the web
        – Use HTTP URIs so that people can look up the names
      • Make it easy to discover information about an object (resource)
        – When someone looks up a URI, provide useful information
      • Link the object (resource) to related objects
        – Include links to other URIs
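
As a rough illustration of "looking up" an HTTP URI, the sketch below prepares a GET request that asks for an RDF serialization through the Accept header. The URI is a made-up example.org address and the `text/turtle` media type is an assumption about what a server might offer; the request is only built, not sent.

```python
from urllib.request import Request


def build_lookup_request(uri, media_type="text/turtle"):
    """Prepare (but do not send) an HTTP GET that asks for RDF about `uri`."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


# Hypothetical resource URI, used only to show the shape of the request.
request = build_lookup_request("http://example.org/resource/P29965")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup against a live server.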

  13. PWC on Semantic Technologies
      “Spring of the data Web”, Technology Forecast, a quarterly journal, Spring 2009, http://www.pwc.com/techforecast/

  14. There is Nothing You Can Do …
      There is nothing you can do with ontologies that cannot be done without them.
      The same holds for language technology: given unlimited resources, all methods will deliver comparable results for any text analysis task (Y. Wilks).
      BTW, there is also nothing you can do in Java that cannot be done in Assembler.

  15. LINKED LIFE DATA: Conceptual idea

  16. Semantic Data Integration
      Current
      • A lot of biomedical data available on the web and internally
      • Very hard to locate the information and put it into context
      • Scientists unable to utilize existing information well
      • Difficult to automatically combine public domain knowledge with private company expertise
      Desired
      • Single integration model based on linked data technology and open standards
      • Computerized support to interpret the information
      • Assists scientists to combine internal data from experiments with external knowledge

  17. The Original Idea
      [Figure labels: Molecular Interaction, Disease, Gene, Target, Patient, Protein, Drugs]

  18. Data Integration Levels
      Semantics
      • Generalization/specialization (Nexium vs. Esomeprazole)
      • Homonyms, synonyms
      • Different metric units
      Structure
      • Aggregation (full name with initials vs. full name)
      • Schema mismatch and internal path discrepancy
      Syntax
      • File format (CSV, XML, flat file)
      • Character encoding (ASCII, UTF-8, UTF-16)

  19. System Levels in the Knowledge Driven Process
      Level labels on the figure, top to bottom: Scientific Intelligence, Linked Life Data, Knowledge, Operational applications, Transactional
      • Advanced visualization and statistical analyses
      • Information extraction
      • Schema alignment
      • Shared identifiers
      • Data silos
      • Databases
      • File system

  20. Syntax and Structure Ambiguity
      • RDF data model resolves all syntax level ambiguities
      • It helps you express all data in a common data model

      Flat-file protein record (Swiss-Prot style):

          ID   GRAA_HUMAN STANDARD; PRT; 262 AA.
          AC   P12544;
          DT   01-OCT-1989 (Rel. 12, Created)
          DT   01-OCT-1989 (Rel. 12, Last sequence update)
          DT   15-JUN-2002 (Rel. 41, Last annotation update)
          DE   Granzyme A precursor (EC 3.4.21.78) (Cytotoxic T-lymphocyte proteinase
          DE   1) (Hanukkah factor) (H factor) (HF) (Granzyme 1) (CTL tryptase)
          DE   (Fragmentin 1).
          GN   GZMA OR CTLA3 OR HFSP.
          OS   Homo sapiens (Human).

      PubMed XML record (excerpt):

          <PubmedArticle>
            <MedlineCitation Owner="NLM" Status="In-Process">
              <PMID Version="1">21500419</PMID>
              <DateCreated>
                <Year>2011</Year>
                <Month>04</Month>
                <Day>15</Day>
              </DateCreated>
              <Article PubModel="Print">
                <Journal>
                  <ISSN IssnType="Electronic">1520-6882</ISSN>
                  <JournalIssue CitedMedium="Internet">
                    <Volume>82</Volume>
                    <Issue>20</Issue>
                    <PubDate>
                      <Year>2010</Year>
                      <Month>Oct</Month>
                      <Day>15</Day>
                    </PubDate>
                  </JournalIssue>
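
To make the "common data model" point concrete, here is a small sketch that reads a few lines of each format and emits plain (subject, predicate, object) triples. The `ex:` prefix and the predicate names derived from line codes and attribute names are invented for this example; they are not the vocabulary used by Linked Life Data.

```python
import xml.etree.ElementTree as ET

# A few lines taken from the flat-file record on the slide.
FLAT_RECORD = """\
ID   GRAA_HUMAN STANDARD; PRT; 262 AA.
AC   P12544;
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).
"""

# A well-formed fragment cut down from the XML record on the slide.
XML_RECORD = (
    '<MedlineCitation Owner="NLM" Status="In-Process">'
    '<PMID Version="1">21500419</PMID>'
    "</MedlineCitation>"
)


def flat_to_triples(text):
    """Use the AC line as the subject and every other line code as a predicate."""
    rows = [line.split(None, 1) for line in text.splitlines() if line.strip()]
    accession = next(v for code, v in rows if code == "AC").strip().rstrip(";")
    subject = "ex:protein/" + accession  # hypothetical URI scheme
    return [(subject, "ex:" + code, v.strip()) for code, v in rows if code != "AC"]


def xml_to_triples(text):
    """Use the PMID as the subject and every root attribute as a predicate."""
    root = ET.fromstring(text)
    subject = "ex:article/" + root.findtext("PMID")  # hypothetical URI scheme
    return [(subject, "ex:" + key, value) for key, value in root.attrib.items()]


triples = flat_to_triples(FLAT_RECORD) + xml_to_triples(XML_RECORD)
for triple in triples:
    print(triple)
```

Once both records are triples, the same query and storage machinery applies to them regardless of the original file format.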

  21. Linked Data Mapping
      • How well interlinked is the linked data cloud?
        – Many interesting queries are difficult to express in SPARQL
        – String functions cannot be indexed
        – Identifiers are often misplaced
      [Figure labels: CD40L_HUMAN, cpath:CPATH-94138, CD40L_HUMAN, UNIPROT, TNF5_HUMAN, P29965, uniprot:P29965, CD4L_HUMAN, cpath:CPATH-LOCAL-8467065, cpath:CPATH-LOCAL-8749236]

  22. Linked Data Mappings
      • Identified 6 linked data integration patterns
      • Define meta-rules to connect resources with various predicates
      • Manually controlled process
      [Figure: the six patterns are Namespace mapping, Reference node, Mismatched identifiers, Value dereference, Transitive link and Semantic Annotations. Node labels as they appear in the transcript: X, Y, ns-x:id, ns-y:id, db:id, id, accession, term, db:accession, text with X Y, X name, Y name.]
      The blue lines and the blue text of the captions (used either as part of the URI or literals) designate the criteria for linking the information
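
One possible reading of the "Namespace mapping" pattern is sketched below: two resources are linked when they carry the same local identifier under two different namespaces. The slide only names the pattern, so the rule itself, the `ex:mappedTo` predicate and the sample identifiers other than P29965 are assumptions.

```python
def split_uri(uri):
    """Split 'prefix:local' into (prefix, local)."""
    prefix, _, local_id = uri.partition(":")
    return prefix, local_id


def namespace_mapping(resources, ns_x, ns_y, predicate="ex:mappedTo"):
    """Link each ns_x resource to the ns_y resource with the same local id."""
    ids_y = {split_uri(r)[1] for r in resources if split_uri(r)[0] == ns_y}
    return sorted(
        (r, predicate, f"{ns_y}:{split_uri(r)[1]}")
        for r in resources
        if split_uri(r)[0] == ns_x and split_uri(r)[1] in ids_y
    )


# Q00001 and Z99999 are made-up identifiers with no counterpart.
resources = ["ns-x:P29965", "ns-y:P29965", "ns-x:Q00001", "ns-y:Z99999"]
links = namespace_mapping(resources, "ns-x", "ns-y")
print(links)  # [('ns-x:P29965', 'ex:mappedTo', 'ns-y:P29965')]
```

A meta-rule of this kind is cheap to evaluate because it compares identifiers exactly instead of relying on string functions inside a query.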

  23. Instance Level Identity Alignment

      Relationship     Semantics
      Exact match      Transitive equivalence
      Close match      Equivalent only for search purposes
      Broader match    Generalization of a concept
      Narrower match   Specialization of a concept; inverse of broader match
      Related          Unspecified relation (no real semantics)

      (The Example column of the slide is not recoverable from the transcript.)
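
Because exact match is described as a transitive equivalence, pairwise exact-match links can be collapsed into groups of identifiers that all denote the same thing. The sketch below does this with a small union-find structure; the technique is a common choice rather than anything the slides prescribe, and the identifiers are placeholders.

```python
def equivalence_classes(exact_matches):
    """Group identifiers connected by exact-match links (union-find)."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in exact_matches:
        parent[find(left)] = find(right)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Placeholder identifiers: a:1 = b:1 and b:1 = c:1 imply a:1 = c:1.
pairs = [("a:1", "b:1"), ("b:1", "c:1"), ("a:2", "b:2")]
classes = equivalence_classes(pairs)
print(classes)  # [['a:1', 'b:1', 'c:1'], ['a:2', 'b:2']]
```

Close, broader, narrower and related matches would not be merged this way, since the table gives them weaker semantics than equivalence.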

  24. Quick Facts!
      • Public and free RDF warehouse service
      • Integrates more than 25 popular data sources
      • Applies text mining technology to link the text with entities
      • Computer-friendly API to access the information

  25. INTEGRATED DATASET: Types of possible questions, analysis and interpretation

  26. Linked Life Data Datasets

  27. [Figure labels: SPARQL endpoint, Co-Occurrence, Relation Finder, REST API]
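
For readers unfamiliar with SPARQL endpoints, the sketch below builds the GET URL that the SPARQL Protocol defines for sending a query, using the `query` parameter. The endpoint address is a placeholder and the query is a generic one; neither is taken from the slides.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/sparql"  # placeholder, not the service's real address

QUERY = """\
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_query_url(endpoint, sparql):
    """Encode the query text into the 'query' parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": sparql})


url = build_query_url(ENDPOINT, QUERY)
print(url)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

The resulting URL can be fetched with any HTTP client; an Accept header is typically used to choose the result format.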
