Automatically Annotating Text with Linked Open Data



  1. Automatically Annotating Text with Linked Open Data. Delia Rusu, Blaž Fortuna, Dunja Mladenić. Jožef Stefan Institute. ailab.ijs.si

  2. Motivation: Annotating Text with LOD (DBpedia, OpenCyc, WordNet)

  3. Overview
     • Related work
     • Algorithms for annotating with LOD: PageRank, ContextSimilarity
     • Evaluation (datasets: WordNet, OpenCyc, DBpedia)
     • Conclusions and future work

  4. Related Work
     • Supervised approaches:
         • Parallel corpora (Chan et al., SemEval 2007)
     • Knowledge-based approaches:
         • WordNet::Similarity package (Pedersen et al., 2004)
         • Usage of context-free grammars to validate semantic interconnections (Navigli and Velardi, 2005)
         • Formal document structure description, hypothesis building, trying to reason using Cyc (Curtis et al., 2006)
         • Disambiguating Wikipedia articles into Cyc concepts (Medelyan and Legg, 2008)
         • Adapted versions of PageRank (Mihalcea et al., 2004; Agirre and Soroa, 2009)
     • Simple knowledge-based approaches compete with state-of-the-art supervised approaches when using a high-quality knowledge base (Ponzetto and Navigli, 2010).

  5. LOD Dataset Representation: WordNet (VUA)

  6. LOD Dataset Representation
     Properties used: rdf:resource, rdf:type, rdfs:subClassOf, …
     Example:
       rdf:resource="http://purl.org/vocabularies/princeton/wn30/wordsense-values-noun-1"
       rdf:type rdf:resource="http://purl.org/vocabularies/princeton/wn30/synset-belief-noun-1"
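One way to picture how such statements become a graph that the later algorithms can walk is sketched below. This is only an illustration under assumptions: the slide's example is read as a single (subject, predicate, object) triple, the second triple uses a made-up `example.org` URI purely to give the graph a second edge, links are treated as undirected, and `build_graph` is a hypothetical helper name, not something from the presentation.

```python
from collections import defaultdict

WN = "http://purl.org/vocabularies/princeton/wn30/"

# First triple: the slide's example, read as subject / predicate / object.
# Second triple: hypothetical placeholder URI, present only for illustration.
triples = [
    (WN + "wordsense-values-noun-1", "rdf:type", WN + "synset-belief-noun-1"),
    (WN + "synset-belief-noun-1", "rdfs:subClassOf", "http://example.org/hypothetical-parent"),
]


def build_graph(triples):
    """Return an undirected adjacency map: resource -> set of linked resources."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        # Assumption: the predicate label is ignored and every link counts both ways.
        graph[subject].add(obj)
        graph[obj].add(subject)
    return dict(graph)


graph = build_graph(triples)
```

Ignoring the predicate keeps the sketch short; a fuller version might filter or weight relation types.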

  7. Algorithms: PageRank
     Candidate resources are identified for each word belonging to the text fragment.
     Example for the word values (human-readable description of each candidate resource):
     • beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservatives values"
     • (an ideal accepted by some individual or group) "he has old-fashioned values"
     • ((music) the relative duration of a musical note)

  8. Algorithms: PageRank
     Algorithm steps:
     • set each graph vertex to 0 if the vertex does not represent a candidate resource, or to 1/R otherwise, with R being the total number of candidate resources
     • compute the PageRank value PR[V_i] for each vertex i [formula not preserved in this transcript]
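The slide's PageRank formula did not survive in this transcript, so the sketch below assumes the standard update rule, PR(V_i) = (1 - d) + d * Σ PR(V_j) / |Out(V_j)| over the vertices V_j linking to V_i, which may differ from the variant the authors used. The damping factor `d = 0.85`, the fixed iteration count, the undirected graph and the toy vertex names are all assumptions made for illustration.

```python
def pagerank(graph, candidates, damping=0.85, iterations=30):
    """Iterate the standard PageRank update on an undirected graph.

    graph: dict mapping each vertex to the set of its neighbours (symmetric).
    candidates: vertices that represent candidate resources.
    """
    r = len(candidates)
    # Initialisation from the slide: 1/R for candidate resources, 0 otherwise.
    scores = {v: (1.0 / r if v in candidates else 0.0) for v in graph}
    for _ in range(iterations):
        new_scores = {}
        for v in graph:
            # Each neighbour u passes on its score split evenly over its own links.
            incoming = sum(scores[u] / len(graph[u]) for u in graph[v])
            new_scores[v] = (1.0 - damping) + damping * incoming
        scores = new_scores
    return scores


# Toy graph (hypothetical): candidate c1 has two links, candidate c2 has one.
toy_graph = {
    "c1": {"a", "b"},
    "c2": {"b"},
    "a": {"c1"},
    "b": {"c1", "c2"},
}
toy_candidates = {"c1", "c2"}
toy_scores = pagerank(toy_graph, toy_candidates)
best = max(toy_candidates, key=toy_scores.get)
```

In this toy graph the better-connected candidate `c1` ends up with the higher score, which is the behaviour a structure-based annotator relies on when picking one candidate per word.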

  9. Algorithms: ContextSimilarity
     Context: "In a country as diverse and complex as India, it is not surprising to find that people here reflect the rich glories of the past, the culture, traditions and values relative to geographic locations and the numerous distinctive manners, habits and food that will always remain truly Indian."
     Candidate resource descriptions for the word values, each paired with its neighborhood resource description:
     • Candidate: beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservatives values"
       Neighborhood: belief: (any cognitive content held as true)
     • Candidate: (an ideal accepted by some individual or group) "he has old-fashioned values"
       Neighborhood: ideal: (the idea of something that is perfect; something that one hopes to attain)
     • Candidate: ((music) the relative duration of a musical note)
       Neighborhood: duration, continuance: (the period of time during which something continues)

  10. Algorithms: ContextSimilarity
     Candidate resource descriptions for the word values:
     • beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservatives values"
     • (an ideal accepted by some individual or group) "he has old-fashioned values"
     • ((music) the relative duration of a musical note)
     Pseudocode:
       ContextSimilarity(resource, w_a) returns Similarity
         Similarity = 0
         NR = GetNeighborhoodResources(resource)
         CW = GetContext(w_a)
         for i = 1 to Size(NR) do
           CS = sim_cos(NR[i], CW)
           Similarity = Similarity + CS
         end for
         return Similarity
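The pseudocode can be sketched in Python as follows. This is a minimal version under assumptions the slides do not state: descriptions and context are compared as lowercase bag-of-words count vectors, no stop-word removal or weighting is applied, and the neighborhood is passed in directly as a list of description strings instead of being fetched from a dataset. The helper names `bag_of_words` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_similarity(neighborhood_descriptions, context_text):
    """Sum the cosine similarity of each neighborhood description with the context."""
    context = bag_of_words(context_text)
    similarity = 0.0
    for description in neighborhood_descriptions:
        similarity += cosine(bag_of_words(description), context)
    return similarity


# Usage with text taken from the slides: the music sense of "values".
context = "the culture, traditions and values relative to geographic locations"
music_sense = [
    "(music) the relative duration of a musical note",
    "the period of time during which something continues",
]
score = context_similarity(music_sense, context)
```

Because raw counts are used, frequent words such as "the" contribute to the score; a practical version would probably filter or down-weight them before comparing candidates.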

  11. Evaluation Datasets
     Expert annotators:
     • WordNet: SemEval 2007 word sense disambiguation Task 7, Coarse-Grained English All Words; 2269 annotated words, of which 1591 are polysemous (WordNet)
     • OpenCyc: a subset of 50 words from the SemEval 2007 Task 7 corpus, each with more than one candidate resource, manually annotated by 2 annotators
     • DBpedia: a subset of 56 words from the SemEval 2007 Task 7 corpus, each with more than one candidate resource, manually annotated by 1 annotator
     Crowdsourcing annotators:
     • WordNet and OpenCyc: a subset of 325 words for WordNet and 177 for OpenCyc, from the SemEval 2007 Task 7 corpus

  12. Evaluation Results
     Expert annotators:
       Algorithm   WordNet   OpenCyc   DBpedia
       CS          75.24     28.00     17.86
       PR          73.44     40.00     21.43
       Random      52.43     24.00     14.28
     Crowdsourcing annotators:
       Algorithm   WordNet   OpenCyc
       CS          63.24     37.55

  13. Conclusions
     • We investigated the applicability of two common approaches, taken from the word sense disambiguation community, for annotating text with LOD datasets:
         • relying on the dataset relationship structure (PageRank)
         • taking advantage of the human-readable description of a resource as well as the neighborhood relationships defined for that resource (ContextSimilarity)
     • Three datasets: WordNet, DBpedia and OpenCyc.
     • Experiments revealed the shortcomings of the current state-of-the-art word sense disambiguation methods when applied to different LOD datasets.

  14. Conclusions
     Purpose for which the dataset was developed:
     • WordNet:
         • dictionary-based taxonomy
         • highest ratio of covered words
         • candidate resources correspond directly with the possible word meanings
     • OpenCyc:
         • common-sense knowledge base primarily developed for modeling and reasoning
         • distinctions between resources depend on the reasoning task
         • contains concepts which are created to support specific tasks (reasoning, paraphrasing, etc.)
     • DBpedia:
         • an effort to extract structured information from Wikipedia
         • rich set of instances (named entities: places, people, and organizations)
         • few common words covered
         • named entities which have common words (e.g. "Talk" is a song by the British alternative rock band Coldplay)

  15. Conclusions
     Human-readable descriptions:
     • WordNet: written similar to dictionary entries; also contain examples; easier to understand by the general public
     • OpenCyc: documentation to the ontology engineer using it to model some world phenomena; hard to understand by the general public
     • DBpedia: written like encyclopedia entries; very short for some resources
     Relations between resources:
     • WordNet: most relation types are defined between the same parts of speech; useful relations between different parts of speech are missing
     • DBpedia: contain infrastructure relationships (e.g. wikiPageUsesTemplate is the most common relation in DBpedia infobox triplets), which should be disregarded as they introduce noise

  16. Future Work
     • Further develop text annotation methods which can offer better performance on datasets such as OpenCyc and DBpedia, and can be transferred to other datasets
     • Investigate the potential for combining resources from different datasets in the same task
     • Include elements of active learning, having users in the loop provide a few annotations, to:
         • enhance the discovery of hard-to-disambiguate text fragments
         • acquire labeled data for performing algorithm optimization

  17. Thank You for Your Attention!
