The PROIEL corpora. Dag Trygve Truslew Haug. Milan, 4 June 2019

  1. The PROIEL corpora Dag Trygve Truslew Haug Milan, 4 June 2019 Dag Haug PROIEL Milan, 4 June 2019 1 / 23

  4. The background. A corpus for linguists: focus on making the most of a limited data set for linguistic research. Pragmatic Resources in Old Indo-European Languages (PROIEL, 2008-2012): word order; anaphoric expressions; definiteness; participles (background events); discourse particles. The corpus should help this research, but also be useful for others. Annotation continues (with fewer resources).

  6. Texts. New Testament (NT) and translations (Greek, Latin, Gothic, Armenian, Old Church Slavonic). Additional Latin texts: extracts from the Gallic War, Letters to Atticus, De officiis; Peregrinatio Aetheriae; Palladius’ Opus Agriculturae. 225,064 tokens. Petronius’ Satyricon (32,020 tokens) annotated but not reviewed.

  7. Cooperation with other projects. Many projects have used our platform for annotation of other texts: Poetic Edda (Greinir skáldskapar); Old Norwegian (Mediaeval Nordic Text Archive); Old Swedish (Språkbanken in Gothenburg); Mediaeval English and Romance (ISWOC); Old Slavic texts (TOROT); Rigveda, just starting (Erica Biagetti). Adds up to a sizeable knowledge base on ancient and medieval Indo-European languages!

  8. The PROIEL annotation. Many-layered annotation: morphological annotation; syntactic annotation (dependency/LFG-based); semantic and other customised annotation (e.g. animacy); annotation of information structure and anaphoric links; token alignments.

  9. Workflow for annotation. International team of student annotators; manual disambiguation of morphology and lemmatization; syntactic annotation; review by project members; advanced annotation by project members.

  10. Morphology. All standard categories of Latin annotated. Fully lemmatized. No known compatibility issues with other tagsets. Preprocessed with statistical taggers and checked manually. Interesting non-standard forms in the postclassical texts.

  15. Dependency Grammar 1. Dependencies are asymmetric relations between words. We label these dependencies with the function of the dependent. The dependencies form a tree under an abstract root. No explicit constituency.
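
The properties listed on this slide can be made concrete with a small data structure. The following Python sketch is not PROIEL's own code: the `Token` class, the `is_tree` helper, the example sentence and its labels (`pred`, `sub`, `obj`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    id: int
    form: str
    head: Optional[int]  # id of the head word; None = attached to the abstract root
    relation: str        # label naming the function of the dependent


def is_tree(tokens):
    """Check that every token reaches the abstract root by following heads."""
    by_id = {t.id: t for t in tokens}
    for start in tokens:
        seen = set()
        current = start
        while current.head is not None:
            if current.id in seen or current.head not in by_id:
                return False  # cycle, or a head that does not exist
            seen.add(current.id)
            current = by_id[current.head]
    return True


# Illustrative analysis of "Caesar Galliam vicit" (labels are assumptions).
sentence = [
    Token(1, "Caesar", head=3, relation="sub"),
    Token(2, "Galliam", head=3, relation="obj"),
    Token(3, "vicit", head=None, relation="pred"),
]

# A broken analysis: tokens 1 and 2 point at each other, so neither reaches the root.
cyclic = [
    Token(1, "a", head=2, relation="atr"),
    Token(2, "b", head=1, relation="atr"),
]
```

Because each token stores a single `head`, the unique-head restriction is built into the representation, and there is no node type for phrases, which matches "no explicit constituency".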

  18. Dependency grammar 2. Inherent limitations: unique head and overt tokens. Other Latin treebanks live with this, but for our purposes we could not ⇒ introduction of structure sharing (from LFG) ⇒ explicit empty nodes. In itself a monotonic increase of information compared to LDT/ITTB, but it led to some different annotation decisions which make conversion non-trivial (but possible!). Also, more fine-grained syntactic relations (easily removable in one direction).
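
One way to picture structure sharing is as extra labelled edges on top of the unique primary head. The Python sketch below is an illustration under assumptions, not the PROIEL implementation: the `secondary` field, the `functions_of` helper, the control example and the labels `xobj` and `xsub` are all chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Token:
    id: int
    form: str
    head: Optional[int]  # primary head; None = abstract root
    relation: str
    # Secondary edges as (target id, label): links beyond the unique head.
    secondary: List[Tuple[int, str]] = field(default_factory=list)


# Illustrative control example "Caesar ire vult" ("Caesar wants to go").
# The labels are assumptions made for this sketch.
sentence = [
    Token(1, "Caesar", head=3, relation="sub"),
    Token(2, "ire", head=3, relation="xobj", secondary=[(1, "xsub")]),
    Token(3, "vult", head=None, relation="pred"),
]


def functions_of(token_id, tokens):
    """List every (governor form, label) pair in which a token takes part."""
    by_id = {t.id: t for t in tokens}
    target = by_id[token_id]
    result = []
    if target.head is not None:
        result.append((by_id[target.head].form, target.relation))
    for t in tokens:
        for sec_target, label in t.secondary:
            if sec_target == token_id:
                result.append((t.form, label))
    return result
```

Here `functions_of(1, sentence)` reports that "Caesar" is the subject of "vult" through its primary edge and also fills the subject of "ire" through the secondary edge, which a plain single-head tree cannot express.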

  19. Empty nodes. Empty nodes appear in the analysis of some very frequent phenomena in Latin: null conjunctions for asyndetic parataxis; null verbs for null copulas and elided verbs.
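
A null-copula clause shows what an explicit empty node buys. This Python sketch assumes a representation in which an empty node is a token without a form; the `Node` class, the sort codes `"V"` and `"C"`, and the labels in the analysis of "omnia praeclara rara" ("all excellent things are rare") are assumptions for illustration, not a copy of the corpus annotation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    id: int
    form: Optional[str]  # None marks an empty node with no overt word
    head: Optional[int]  # None = attached to the abstract root
    relation: str
    empty_sort: Optional[str] = None  # assumed codes: "V" null verb, "C" null conjunction


# The null verb (node 4) heads the clause, so subject and predicative
# complement both have something to attach to.
sentence = [
    Node(1, "omnia", head=4, relation="sub"),
    Node(2, "praeclara", head=1, relation="atr"),
    Node(3, "rara", head=4, relation="xobj"),
    Node(4, None, head=None, relation="pred", empty_sort="V"),
]


def surface_text(nodes):
    """Return the overt words only, in order."""
    return " ".join(n.form for n in nodes if n.form is not None)


def empty_nodes(nodes):
    """Return the nodes that have no overt form."""
    return [n for n in nodes if n.form is None]
```

Without node 4, one of the two overt words would have to be promoted to head of the clause, which is the kind of decision that differs between treebanks.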

  20. Empty nodes

  22. Human processing

  24. Secondary dependencies 1: Control

  25. Secondary dependencies 2: Shared arguments

  26. Secondary dependencies 3: Ellipsis

  29. Too few empty nodes. Syntactic structure cannot actually be reduced to relations between words. In retrospect we erred on the side of conservatism in not using more empty nodes. Makes LiLa’s life easier!

  30. Semantic annotation – animacy. Tags: HUMAN, ORG, ANIMAL, VEH, CONC, PLACE, NONCONC, TIME. 1745 Latin nominal lemmata tagged (out of 6612 total). Mainly the Biblical language.

  32. Givenness. Givenness tags are based on which context the hearer uses to establish reference: discourse (anaphora) → OLD; situation (deixis) → ACC-sit; scenarios (inferences) → ACC-inf; encyclopedic knowledge → ACC-gen; no context (no extra-NP information) → NEW. In Latin, this is available for Peregrinatio, and parts of the Letters to Atticus and the Gallic War.
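
The tag scheme on this slide is a one-to-one mapping from context type to tag, which can be written down directly. In this Python sketch the tag strings come from the slide, while the `Context` enum and the helper names are assumptions introduced for illustration.

```python
from collections import Counter
from enum import Enum


class Context(Enum):
    """Which context the hearer uses to establish reference."""
    DISCOURSE = "anaphora"
    SITUATION = "deixis"
    SCENARIO = "inference"
    ENCYCLOPEDIC = "encyclopedic knowledge"
    NONE = "no extra-NP information"


GIVENNESS_TAG = {
    Context.DISCOURSE: "OLD",
    Context.SITUATION: "ACC-sit",
    Context.SCENARIO: "ACC-inf",
    Context.ENCYCLOPEDIC: "ACC-gen",
    Context.NONE: "NEW",
}


def tag_for(context):
    """Return the givenness tag for a context type."""
    return GIVENNESS_TAG[context]


def tag_counts(contexts):
    """Count how often each tag occurs for a sequence of referents."""
    return Counter(tag_for(c) for c in contexts)
```

A call such as `tag_counts([Context.DISCOURSE, Context.DISCOURSE, Context.NONE])` gives a quick profile of how many referents in a passage are old versus new.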

  34. Watch the null node!

  36. Alignment: translating participles in the Vulgate. Our NT translations are aligned with the original Greek. Automatic alignment of high quality, and hand-corrected for many languages (not Latin!). A useful tool to explore syntax and translation strategies. Examples: translation with participle; translation with imperative.
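
Token alignment turns a question like "how are Greek participles rendered?" into a simple count. The Python sketch below works on two hand-picked verse fragments rather than on corpus data, and they are not necessarily the examples shown on the slide; the tuple layout, the mood strings and the `translation_strategies` helper are assumptions for illustration.

```python
from collections import Counter

# (Greek form, Greek mood, aligned Latin form, Latin mood).
# The pairs are entered by hand for this sketch, not exported from the corpus.
aligned_pairs = [
    ("πορευθέντες", "participle", "euntes", "participle"),  # Matt 28:19
    ("ἐγερθεὶς", "participle", "surge", "imperative"),      # Matt 2:13
]


def translation_strategies(pairs):
    """Count the Latin moods used to render Greek participles."""
    return Counter(
        latin_mood
        for _, greek_mood, _, latin_mood in pairs
        if greek_mood == "participle"
    )
```

On real aligned data the same count, split by book or by syntactic function of the participle, would show where the translator keeps the participle and where a finite form is preferred.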

  38. Accessibility. Data: XML exports containing morphology, syntax, information status and alignment are available at https://proiel.github.io/ (semantic annotation so far only on request). Tools: simple query interface at http://foni.uio.no:3000; syntactic query interface at INESS, Bergen; command line interface for working with the files. Little for computational illiterates, but http://syntacticus.org is a start.
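
For readers who want to work with the XML exports directly, a standard XML parser is enough to get started. The Python sketch below reads a tiny hand-written sample, not a real export: the element names (`sentence`, `token`) and attribute names (`form`, `lemma`, `relation`, `head-id`) are assumptions about the export format and should be checked against the schema shipped with the data. If the real files declare an XML namespace, the tag names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of an export file.
SAMPLE = """\
<proiel>
  <source id="sample" language="lat">
    <div>
      <sentence id="1">
        <token id="1" form="Caesar" lemma="Caesar" relation="sub" head-id="3"/>
        <token id="2" form="Galliam" lemma="Gallia" relation="obj" head-id="3"/>
        <token id="3" form="vicit" lemma="vinco" relation="pred"/>
      </sentence>
    </div>
  </source>
</proiel>
"""


def read_tokens(xml_text):
    """Flatten the XML into one dictionary per token."""
    root = ET.fromstring(xml_text)
    rows = []
    for sentence in root.iter("sentence"):
        for token in sentence.iter("token"):
            rows.append({
                "sentence": sentence.get("id"),
                "id": token.get("id"),
                "form": token.get("form"),
                "lemma": token.get("lemma"),
                "relation": token.get("relation"),
                "head": token.get("head-id"),  # None when the attribute is absent
            })
    return rows
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring`, and the resulting rows can be loaded into a spreadsheet or data frame for counting.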

  40. Interoperability? All the Latin treebanks are converted to Universal Dependencies (UD). Useful for comparing annotations, but a lot of harmonization needed. And the conversion is very lossy (especially for PROIEL).
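
What "lossy" means can be shown with a toy projection. The Python sketch below is not the actual UD converter: it assumes a hypothetical target format that keeps only primary edges, and the `project` function, the dictionary layout and the labels are invented for illustration.

```python
def project(tokens):
    """Keep only each token's primary edge; secondary edges are discarded."""
    return [(t["id"], t["head"], t["relation"]) for t in tokens]


# Two toy annotations that differ only in one secondary edge.
with_sharing = [
    {"id": 1, "head": 3, "relation": "sub", "secondary": []},
    {"id": 2, "head": 3, "relation": "xobj", "secondary": [(1, "xsub")]},
    {"id": 3, "head": None, "relation": "pred", "secondary": []},
]
without_sharing = [dict(t, secondary=[]) for t in with_sharing]

# The inputs differ, but their projections are identical, so the
# original annotation cannot be recovered from the projected one.
differ_before = with_sharing != without_sharing
same_after_projection = project(with_sharing) == project(without_sharing)
```

Any conversion under which distinct source annotations yield the same output is lossy in this sense, and a richer source annotation has more to lose.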

  41. Future plans. Pending funding... add texts (finish Satyricon, Plautus); add languages: ideally get good coverage of the various IE branches; develop Syntacticus as a reading portal.
