
Linked Open Treebanks: Latin treebanks in the LiLa Knowledge Base



  1. Linked Open Treebanks: Latin treebanks in the LiLa Knowledge Base
     Francesco Mambrini and Marco Passarotti, {francesco.mambrini, marco.passarotti}@unicatt.it
     SyntaxFest – TLT 2019 | Paris | August 29, 2019
     This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme - Grant Agreement No. 769994.

  2. Table of Contents
     ◮ Introduction
     ◮ Latin treebanks
     ◮ The LiLa Knowledge Base
     ◮ Populating LiLa
     ◮ Lemmas
     ◮ Treebanks
     ◮ Potential use cases
     ◮ Conclusions
     F. Mambrini & M. Passarotti | LiLa – Linking Latin

  3. 4 treebanks of Latin
     ◮ Latin Dependency Treebank (2006-): Classical Latin, prose and poetry, about 50k tokens;
     ◮ Index Thomisticus Treebank (2006-): Medieval Latin, a single author (Thomas Aquinas), about 400k tokens;
     ◮ PROIEL (2008): Late and Classical prose, including a translation of the New Testament (Jerome’s Vulgate, 4th century CE) plus other prose, about 250k tokens;
     ◮ Late Latin Charter Treebank (2011-): 8th-9th century notarial documents (charters) from Central Italy, about 250k tokens.


  6. Aims
     ◮ Create a Knowledge Base of linguistic resources for Latin
        ◮ corpora
        ◮ lexicons
        ◮ NLP tools
     ◮ Create common vocabularies to describe them
     ◮ Use the Linked Open Data (LOD) paradigm

  7. The lemma: a gateway to Latin linguistic resources
     [Figure: Lemmas as the hub, connected to Lexical Entries, Tokens and NLP Output.]
     ◮ Lexical Resources: Latin Wordnet, Valency Lexicon, Dictionaries...
     ◮ Textual Resources: Digital libraries, Treebanks, Textual corpora...
     ◮ NLP Tools: Tokenizers, Taggers/parsers, Lemmatizers...

  8. LEMLAT: the foundation stone (http://www.lemlat3.eu/)
     ◮ 43,432 lemmas from Georges, 1913-1918; OLD and Gradenwitz, 1904;
     ◮ 82,556 lemmas from Du Cange, 1883-1887;
     ◮ 26,250 lemmas from Forcellini, 1940.

  9. Towards an ontology of Latin lemmas
     [Figure: RDF graph. Labels recoverable from the slide: Lemma rdfs:subClassOf ontolex:Form; lemma:2012 a Lemma and a olia:Verb; ontolex:writtenRep amo; vocab:lemmario_upostag VERB.]
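As a rough illustration of the lemma description on this slide, the same statements can be written as plain Python tuples. This is a minimal sketch, not the LiLa data model itself: the reading of the diagram (which label attaches to which node), the use of `rdf:type` for the `a` shorthand, and the helper `objects` are assumptions made for the example.

```python
# Each statement is a (subject, predicate, object) tuple.
# Prefixed names are kept as plain strings; no RDF library is involved.
LEMMA_TRIPLES = [
    ("Lemma", "rdfs:subClassOf", "ontolex:Form"),
    ("lemma:2012", "rdf:type", "Lemma"),        # "a" in the diagram
    ("lemma:2012", "rdf:type", "olia:Verb"),
    ("lemma:2012", "ontolex:writtenRep", "amo"),
    ("lemma:2012", "vocab:lemmario_upostag", "VERB"),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]


WRITTEN_REPS = objects(LEMMA_TRIPLES, "lemma:2012", "ontolex:writtenRep")
LEMMA_TYPES = objects(LEMMA_TRIPLES, "lemma:2012", "rdf:type")
```

The written representation is the string that token lemmas can later be compared against, which is why it is stored as its own statement rather than folded into the lemma identifier.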

  10. Workflow
      ◮ Start from a shallow conversion from the treebank (TB) format to RDF triples
      ◮ Compare the string of the lemmatized token with the written representation(s) of a LEMLAT lemma
      ◮ Link the token to the lemma via the hasLemma property
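The three workflow steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token identifiers (`token:1` and so on), the tiny lemma bank, and the function names are hypothetical, and linking a token to every lemma whose written representation matches is a simplification chosen for the example, not necessarily how LiLa resolves homographs.

```python
from collections import defaultdict

# Hypothetical lemma bank: lemma identifier -> written representation(s).
LEMMA_BANK = {
    "lemma:2012": ["amo"],
    "lemma:20369": ["infernus"],
}


def build_index(lemma_bank):
    """Map each written representation to the lemmas that carry it."""
    index = defaultdict(list)
    for lemma_iri, written_reps in lemma_bank.items():
        for rep in written_reps:
            index[rep].append(lemma_iri)
    return index


def link_tokens(tokens, index):
    """Compare each token's lemma string with the index.

    Returns (hasLemma triples, identifiers of tokens with no match).
    """
    triples, unmatched = [], []
    for token_iri, lemma_string in tokens:
        candidates = index.get(lemma_string, [])
        if not candidates:
            unmatched.append(token_iri)
        for lemma_iri in candidates:
            triples.append((token_iri, "hasLemma", lemma_iri))
    return triples, unmatched


# (token identifier, lemma string from the treebank annotation)
TOKENS = [("token:1", "infernus"), ("token:2", "amo"), ("token:3", "xyzzy")]
TRIPLES, UNMATCHED = link_tokens(TOKENS, build_index(LEMMA_BANK))
```

Returning the unmatched tokens separately keeps the cases that need manual review or a richer matching strategy visible instead of silently dropping them.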

  11. Linking corpora and lemmas
      [Figure: RDF graph linking a PROIEL token to its LiLa lemma. Labels recoverable from the slide:]
      ◮ sentences: proiel:s17835_0 and proiel:s17836_0 (each a nif:Sentence), with a nif:nextSentence link;
      ◮ token (a nif:Word, with an isPartOfSent link): conll:WORD inferni, conll:LEMMA infernus, conll:UPOS NOUN, conll:EDGE conj, conll:MISC ref=REV_1.18, plus a conll:HEAD link; token identifiers shown: proiel:17835_4 and proiel:s17835_6;
      ◮ lemma: lemma:20369 (a Lemma, a olia:CommonNoun), ontolex:writtenRep infernus, reached from the token via hasLemma;
      ◮ derivation: hasBase links to base:639 and hasSuffix links to suffix:7 (a Suffix, rdfs:label -n), involving lemma:20369, lemma:arcanus, lemma:infernalis and lemma:inferiae.
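Once lemmas carry hasBase and hasSuffix links, lemmas that share a base or a suffix can be found by a simple traversal. The sketch below assumes one particular reading of the slide's graph (which lemma points to base:639 and which to suffix:7); the function name `related_by` is hypothetical.

```python
# Assumed reading of the diagram: three lemmas share base:639,
# two lemmas share suffix:7. The slide itself does not spell this out in text.
DERIVATION_TRIPLES = [
    ("lemma:20369", "hasBase", "base:639"),       # infernus
    ("lemma:infernalis", "hasBase", "base:639"),
    ("lemma:inferiae", "hasBase", "base:639"),
    ("lemma:20369", "hasSuffix", "suffix:7"),
    ("lemma:arcanus", "hasSuffix", "suffix:7"),
]


def related_by(triples, lemma, predicate):
    """Return other lemmas that share an object of `predicate` with `lemma`."""
    shared = {o for s, p, o in triples if s == lemma and p == predicate}
    return sorted(
        {s for s, p, o in triples if p == predicate and o in shared and s != lemma}
    )


SAME_BASE = related_by(DERIVATION_TRIPLES, "lemma:20369", "hasBase")
SAME_SUFFIX = related_by(DERIVATION_TRIPLES, "lemma:20369", "hasSuffix")
```

Because every token of a treebank points to a lemma, the same traversal extends from lemmas to all corpus occurrences of a derivational family.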


  14. A wealth of interlinked information that can be queried!
      [Figure, built up over five slides: a graph with nodes Token 1 to Token 4, Lemma 1 to Lemma 3, Synset and Morph.]

  19. Querying with SPARQL: all verbs that govern subjects formed with the affix “-(t)or”
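The query text itself is not preserved in this transcript. As a stand-in, the following Python sketch walks the same path over a toy set of triples: from the suffix label to lemmas, from lemmas to subject tokens, and from those tokens to their verbal heads. Everything specific here is an assumption for illustration: the identifiers (`tok:1`, `lemma:gladiator`, `suffix:tor`), the dependency analysis of the toy sentence, and the use of `nsubj` as the subject relation label.

```python
# Toy graph for "gladiatores audio pugnare"; the analysis is assumed.
TRIPLES = [
    ("tok:1", "conll:WORD", "gladiatores"),
    ("tok:1", "conll:EDGE", "nsubj"),
    ("tok:1", "conll:HEAD", "tok:3"),
    ("tok:1", "hasLemma", "lemma:gladiator"),
    ("tok:2", "conll:WORD", "audio"),
    ("tok:2", "conll:UPOS", "VERB"),
    ("tok:3", "conll:WORD", "pugnare"),
    ("tok:3", "conll:UPOS", "VERB"),
    ("tok:3", "conll:HEAD", "tok:2"),
    ("lemma:gladiator", "hasSuffix", "suffix:tor"),
    ("suffix:tor", "rdfs:label", "-(t)or"),
]


def verbs_governing_suffixed_subjects(triples, suffix_label, subject_relation="nsubj"):
    """Return word forms of verbs whose subject's lemma has the given suffix."""
    facts = set(triples)
    suffixes = {s for s, p, o in facts if p == "rdfs:label" and o == suffix_label}
    lemmas = {s for s, p, o in facts if p == "hasSuffix" and o in suffixes}
    # Tokens linked to one of those lemmas and annotated as subjects.
    subjects = {
        s
        for s, p, o in facts
        if p == "hasLemma" and o in lemmas
        and (s, "conll:EDGE", subject_relation) in facts
    }
    heads = {o for s, p, o in facts if p == "conll:HEAD" and s in subjects}
    verbs = {h for h in heads if (h, "conll:UPOS", "VERB") in facts}
    return sorted(o for s, p, o in facts if p == "conll:WORD" and s in verbs)


RESULT = verbs_governing_suffixed_subjects(TRIPLES, "-(t)or")
```

The point of the sketch is that the query crosses two annotation layers: syntax (head and relation) comes from the treebank, while the suffix comes from the lemma bank, and the hasLemma link is what joins them.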

  20. Sample of results from PROIEL: Cicero’s Letters to Atticus
      (1) gladiatores        audio     pugnare    mirifice
          gladiators.ACC.PL  hear.1SG  fight.INF  superbly
          “I hear that your gladiators fight superbly.” (Cic. Att. 4.4a.2)

  21. Wordcloud of results from the Index Thomisticus
      [Figure: word cloud.] “the Interpreter (of Aristotle, i.e. Averroes) says...”

  22. Conclusions and future work
      ◮ Language is complex! Morpho-syntactic description is not enough to capture all its complexities
      ◮ LOD provides a way to link treebank annotation with information on other levels (semantics, derivational morphology...)
      ◮ A lexically based approach (using lemmas as hub nodes) is one way to do it!
      ◮ But (future work)...
         ◮ we need to harmonize the tagsets (ontologies)
         ◮ we need to find sustainable, scalable solutions together with the projects that own and maintain the resources

  26. Thanks! Get in touch
      The LiLa Team, Università Cattolica del Sacro Cuore, CIRCSE Research Centre
      Largo Gemelli 1, 20123 Milan, Italy
      info@lila-erc.eu | https://github.com/CIRCSE | https://lila-erc.eu | @ERC_LiLa
      This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme - Grant Agreement No. 769994.
