
Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data


  1. Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data  Caroline T. Schroeder, University of the Pacific, cschroeder@pacific.edu  Amir Zeldes, Humboldt-Universität zu Berlin, amir.zeldes@rz.hu-berlin.de  Berlin Digital Classicist Seminar, 14.1.2014

  2. Plan  Introduction  Coptic data  Annotations so far: normalizing, tokenizing and tagging  Search architecture  Searching through multiple segmentations: ANNIS  Dealing with corpus formats: TEI, SaltNPepper  Visualization  Dedicated visualizations  A reusable generic approach  Conclusion and outlook Schroeder & Zeldes / Towards Digital Coptic Berlin, 14.1.2014 1/37

  3. Who are these people?  Prof. Caroline T. Schroeder – Religious and Classical Studies / Humanities Center Director, University of the Pacific  Dr. Amir Zeldes – Korpuslinguistik / SFB 632 Information Structure (from March: eHumanities group KOMeT), Humboldt-Universität zu Berlin  Cooperation: Coptic SCRIPTORIUM, established at the 2012 NEH summer institute on "Text in a Digital Age" (Tufts): http://coptic.pacific.edu/

  4. Why Coptic?  Last stage of the Ancient Egyptian language (starting in the 2nd century)  Mediterranean in the 1st millennium  Hellenistic period  Unique language  Longest continuous documentation  Contact language (with Greek)  Religious significance  Early Christianity  Rise of monasticism  Gnosticism  ...  Figure: Coptic dialects

  5. The data  Lots of material (thanks to the Egyptian desert)  Relatively little online, nothing like Greek and Latin (Perseus)  Lots of things you may want are not available:  New Testament (online, but not normalized/lemmatized/annotated)  Old Testament  The Rule of St. Pachomius  Works of Shenoute of Atripe  Apophthegmata Patrum  ...  But some have been digitized at some point!

  6. A word about the texts in this talk  So far we've concentrated on:  Shenoute's sermon Abraham our Father  "As for us, brethren, let us live by the truth so that we are upstanding in all our works, and so that the prophets, apostles and all the saints might dwell among us, ..."  Apophthegmata Patrum (sayings of the desert fathers)  "They said about the blessed Sarah the virgin that she spent sixty years living at the top of the river and she never set foot outside to see the river."  New Testament, esp. Gospel of Mark  See http://coptic.pacific.edu/ for corpora and tools

  7. Getting from raw text to annotated corpora  Making the data searchable starts with:  Encoding manuscripts (EpiDoc TEI)  Segmentation of "word forms"  Normalization  Segmentation of morphemes  Part-of-speech tagging  More annotations...  Brief recap only: detailed talk in Leipzig last month (slides on my page)

  8. Normalization  Automatic normalization, manual correction  Handling of known diacritics, abbreviations  Closed, growing list of known variants
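The normalization step above can be illustrated with a minimal sketch. It assumes that diacritics are encoded as Unicode combining marks and that known variants live in a simple lookup table; the function names and the `VARIANTS` entry are hypothetical placeholders, not the project's actual tool or data.

```python
import unicodedata


def strip_diacritics(form: str) -> str:
    """Remove combining marks (dots, superlinear strokes) from a word form."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(form: str, variants: dict) -> str:
    """Strip diacritics, then map a known variant to its normalized spelling."""
    bare = strip_diacritics(form)
    return variants.get(bare, bare)


# Hypothetical variant list: the entry is a placeholder, not real project data.
VARIANTS = {"variant-spelling": "normal-spelling"}

# A diplomatic form with a combining dot above (U+0307) on the first letter.
diplomatic = "\u2c89\u0307\u2c83\u2c9f\u2c97"
normalized = normalize(diplomatic, VARIANTS)
```

Anything the table does not cover would still go to manual correction, as the slide describes.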

  9. Tokenization  Identifying morphemes is non-trivial (agglutinative language, different conventions; we follow Layton 2004)  ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ 'Since I became a monk' (since-that-PAST-1sg-do-monk)  ⲉⲛⲧⲁϥⲧⲣⲉⲛⲣⲡϣⲁ 'he who made us keep the ceremony' (REL-PAST-3sgM-CAUS-1pl-do-the-observance)  Word-level segmentation: manual (no scriptio continua)  Morph segmentation: automatic (accuracy: 84%-94%)  Example: ⲛ̄ⲟⲩϣⲏⲣⲉ ⲛ̄ⲁⲃⲣⲁϩⲁⲙ is segmented as ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ (of-a-son of-Abraham), 'of a son of Abraham'

  10. Part-of-speech tagging  POS tagging using TreeTagger (Schmid 1994) and a lexicon from the CMCL project (courtesy of Prof. Tito Orlandi)  Two tag sets: fine-grained (45 tags) and coarse (22 tags) (see http://coptic.pacific.edu/ for documentation)  Inter-annotator agreement: 94.19%, kappa = 93.67 (kappa takes chance agreement into account, cf. Artstein & Poesio 2008)  Accuracy:  In domain, 10-fold cross-validation: 94.04% (fine)  Out of domain (test with papyri.info): 79.6% (fine) / 87.7% (coarse)  Main difficulties: open classes (N/V), disambiguating homonyms (ⲉ can have 6 different tags!)
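To make the difference between raw agreement and kappa concrete, here is a small sketch. It assumes the two-annotator Cohen's kappa (the slide does not say which variant was used), and the tag sequences are toy data, not the project's annotations.

```python
from collections import Counter


def agreement_and_kappa(tags_a, tags_b):
    """Return (observed agreement, Cohen's kappa) for two annotators."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty tag sequences of equal length")
    n = len(tags_a)
    # Share of items on which both annotators chose the same tag.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Agreement expected by chance, from each annotator's tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both annotators used one and the same tag throughout
    return observed, (observed - expected) / (1 - expected)


# Toy example: the annotators disagree on the last of four items.
annotator_a = ["N", "N", "V", "V"]
annotator_b = ["N", "N", "V", "N"]
observed, kappa = agreement_and_kappa(annotator_a, annotator_b)
```

On this toy data, observed agreement is 0.75 while kappa is 0.5, because half of the agreement would be expected by chance alone.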

  11. Further annotations  Many other layers are done manually:  Translation  Language of origin  Coreference  Entity tagging (people, places...)  Parallel alignment (with Greek)  Syntax trees (very preliminary tests)

  12. Representing data – how to look at all this stuff?  We now have a lot of data to represent:  Diplomatic transcriptions (including character rendering!)  Normalization  Segmentation into words, morphemes, sometimes letters  Annotations  How do we encode this data for search and visualization?

  13. The first challenge: minimal units  Minimal units, or tokens, are critical for searching:  Find all words preceding the word "God"  Give me any mentions of Saint Paphnutius, ±10 words  Search for the glosses father and son within 20 words  Two problems:  The concept of words is complex in Coptic  Annotations overlap parts of words: individual letters, line breaks...  So tokens are smaller than words!  Example: ⲡⲉϪⲁϥ ϫⲉ ⲉⲓ̇ⲥ ϣ ⲙⲟⲩⲛ ⲛ̇ⲣⲟⲙⲡⲉ ⲻ Ⲡⲉϫⲉ ⲡ̇ϩⲗ̇ⲗⲟ ⲛⲁϥ  he sAid "it's been e ight years" – The old man told him

  14. Solution: segmentation layers in ANNIS  We use the open source ANNIS platform as a search interface (Zeldes et al. 2009)  Any annotation layer can be defined as a segmentation, defining alternative views on:  Adjacency (in words, morphemes, etc.)  Proximity (in words, morphemes, etc.)  Context size (in words, morphemes, etc.)  But which segmentation layer do you want to see?  Remember, diplomatic and normalized layers don't match  Any segmentation layer is usable as "base text"
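The idea of segmentation layers over minimal tokens can be sketched as a small data model. This is an illustration only, not ANNIS's internal representation: the `Span` class and the helper functions are hypothetical, and the tokens are the English glosses from the tokenization example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first minimal token covered
    end: int    # index one past the last minimal token covered
    value: str


# Minimal tokens (here: one per morph), glossed in English.
TOKENS = ["of", "a", "son", "of", "Abraham"]

# Two segmentation layers defined over the same tokens.
MORPHS = [Span(i, i + 1, tok) for i, tok in enumerate(TOKENS)]
WORDS = [Span(0, 3, "of-a-son"), Span(3, 5, "of-Abraham")]


def unit_index(layer, token_index):
    """Position of the layer unit that covers the given token."""
    for position, span in enumerate(layer):
        if span.start <= token_index < span.end:
            return position
    raise ValueError(f"token {token_index} is not covered by this layer")


def distance(layer, token_a, token_b):
    """Distance from token_a to token_b, counted in units of the layer."""
    return unit_index(layer, token_b) - unit_index(layer, token_a)


# "of" and "Abraham" are 4 morphs apart, but they sit in adjacent words.
morph_distance = distance(MORPHS, 0, 4)
word_distance = distance(WORDS, 0, 4)
```

Adjacency and proximity then reduce to comparing this distance against 1 or against a range, with the answer depending on which layer is chosen.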

  15. Switching segmentations in ANNIS

  16. Different contexts  Example search: entity="person"  Hit: Abba Antonius  Source passage: Ⲁ ⲩϭⲱⲗⲡ̇ 5 ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇  Some options:  ±5 words, diplomatic (fewer than 5 words of left context found, since the hit is near the start of the text): Ⲁⲩϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉⲟⲩⲛⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇ ⲙ̇ⲙⲟⲕ  ±10 morphs, normalized: ⲁ ⲩ ϭⲱⲗⲡ ⲉⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁⲛⲧⲱⲛⲓⲟⲥ ϩⲓ ⲡ ϫⲁⲓⲉ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ ⲉ ϥ ⲉⲓⲛⲉ ⲙⲙⲟ ⲕ  ±5 tokens: Ⲁ ⲩ ϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ ϫⲁⲓ̇ⲉ̇ · ϫⲉ
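How the same hit yields different context windows depending on the segmentation can be sketched as follows. The function and the placeholder tokens `t0` to `t9` are hypothetical; the sketch only mirrors the behaviour described on the slide, including clipping at the start of the text.

```python
def context(tokens, layer, hit, n):
    """Tokens covering the hit plus up to n units of `layer` on each side.

    `layer` is a list of (start, end) token ranges; `hit` is one such range.
    """
    hit_start, hit_end = hit
    covering = [i for i, (start, end) in enumerate(layer)
                if start < hit_end and end > hit_start]
    if not covering:
        raise ValueError("hit is not covered by this layer")
    first = max(covering[0] - n, 0)                # clip at the start of the text
    last = min(covering[-1] + n, len(layer) - 1)   # clip at the end of the text
    return tokens[layer[first][0]:layer[last][1]]


TOKENS = [f"t{i}" for i in range(10)]
TOKEN_LAYER = [(i, i + 1) for i in range(10)]         # one unit per token
WORD_LAYER = [(0, 2), (2, 5), (5, 6), (6, 9), (9, 10)]  # words group tokens

hit = (5, 6)
by_token = context(TOKENS, TOKEN_LAYER, hit, 1)   # t4 .. t6
by_word = context(TOKENS, WORD_LAYER, hit, 1)     # t2 .. t8
clipped = context(TOKENS, WORD_LAYER, hit, 5)     # whole text, fewer than 5 words per side
```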

  17. Searching with AQL (see http://www.sfb632.uni-potsdam.de/annis/)  Basic principle of ANNIS Query Language (AQL):  search for some annotations (#1, #2, #3...)  stipulate relationships between them (operators)  Example: verbs of Greek origin: pos="V" & source_lang="Greek" & #1 _=_ #2 (where _=_ is the identical coverage operator)  Example hits: "The head bandit repented", "I have faith in God"
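What the query on this slide asks for can be illustrated outside ANNIS with a short sketch: find a `pos="V"` annotation and a `source_lang="Greek"` annotation that cover exactly the same tokens. The `Annotation` type, the function name and the toy annotations are hypothetical; this is not how ANNIS evaluates queries internally.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    name: str
    value: str
    start: int  # first token covered
    end: int    # one past the last token covered


def identical_coverage(annotations, first, second):
    """Pairs (a, b): a matches `first`, b matches `second`, same token span."""
    lefts = [a for a in annotations if (a.name, a.value) == first]
    rights = [a for a in annotations if (a.name, a.value) == second]
    return [(a, b) for a in lefts for b in rights
            if (a.start, a.end) == (b.start, b.end)]


# Toy annotations over token positions (not real corpus data).
ANNOTATIONS = [
    Annotation("pos", "V", 2, 3),
    Annotation("source_lang", "Greek", 2, 3),   # a verb of Greek origin
    Annotation("pos", "N", 5, 6),
    Annotation("source_lang", "Greek", 5, 6),   # Greek origin, but a noun
    Annotation("pos", "V", 8, 9),               # a verb with no source_lang
]

matches = identical_coverage(ANNOTATIONS, ("pos", "V"), ("source_lang", "Greek"))
```

Only the annotations at position 2 satisfy both conditions, which corresponds to `#1 _=_ #2` in the query.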

  18. Referencing segmentations  There are many operators:  . (adjacent), _i_ (inclusion), _o_ (overlap), _l_ (left aligned)...  > (dominance), -> (pointing relation), >@l (left child)...  ...  It is possible to use segmentations in queries:  #1 . #2 - one followed by two  #1 .word #2 - two is the next word after one  #1 .norm,1,10 #2 - within 1 to 10 norm units  ...

  19. Adding metadata  Metadata is like any other constraint, with the meta:: prefix  Can use regular expressions and negation: pos!="V" & source_lang="Greek" & #1 _=_ #2 & meta::msName=/MONB.*/  For metadata names and values we use TEI/EpiDoc as a guideline  More information on AQL: http://www.sfb632.uni-potsdam.de/annis/
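The effect of a metadata constraint such as `meta::msName=/MONB.*/` can be sketched as a document filter. The sketch assumes that the regular expression has to match the whole metadata value; the document names and `msName` values below are made up for illustration.

```python
import re


def filter_documents(documents, key, pattern, negate=False):
    """Names of documents whose metadata value for `key` matches `pattern`.

    Assumption: the pattern must match the entire value (full match).
    With negate=True, documents that do not match are returned instead,
    including documents that lack the key altogether.
    """
    regex = re.compile(pattern)
    selected = []
    for name, meta in documents.items():
        matched = regex.fullmatch(meta.get(key, "")) is not None
        if matched != negate:
            selected.append(name)
    return selected


# Hypothetical documents and metadata values.
DOCUMENTS = {
    "doc_a": {"msName": "MONB.example1"},
    "doc_b": {"msName": "OTHER.example"},
}

monb_docs = filter_documents(DOCUMENTS, "msName", r"MONB.*")
other_docs = filter_documents(DOCUMENTS, "msName", r"MONB.*", negate=True)
```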

  20. Architecture and formats  Different formats are suitable for different parts of the data  TEI is ideal for manuscript structure and metadata  Linguistic formats for computational corpus linguistics: tagging, parsing, coreference  Convert and merge data using SaltNPepper (Zipser & Romary 2010)
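As a rough illustration of the TEI side of this pipeline, the sketch below pulls two metadata fields out of a minimal TEI header using Python's standard library. It does not use SaltNPepper, and the sample document, its values and the chosen fields are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Minimal, made-up TEI document; real project files will be richer.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>Example document</title></titleStmt>
    <sourceDesc><msDesc><msIdentifier>
      <msName>MONB.example</msName>
    </msIdentifier></msDesc></sourceDesc>
  </fileDesc></teiHeader>
</TEI>"""


def extract_metadata(tei_xml):
    """Collect selected header fields into a flat name/value dictionary."""
    root = ET.fromstring(tei_xml)
    fields = {
        "title": ".//tei:titleStmt/tei:title",
        "msName": ".//tei:msIdentifier/tei:msName",
    }
    meta = {}
    for key, path in fields.items():
        node = root.find(path, TEI_NS)
        if node is not None and node.text:
            meta[key] = node.text.strip()
    return meta


metadata = extract_metadata(SAMPLE)
```

A flat dictionary like this is the kind of name/value metadata that a search constraint such as `meta::msName` could then be checked against.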
