Towards Digital Coptic Searching and Visualizing Coptic Manuscript Data Caroline T. Schroeder, University of the Pacific cschroeder@pacific.edu Amir Zeldes, Humboldt-Universität zu Berlin amir.zeldes@rz.hu-berlin.de Berlin Digital Classicist Seminar , 14.1.2014
Plan Introduction Coptic data Annotations so far: normalizing, tokenizing and tagging Search architecture Searching through multiple segmentations: ANNIS Dealing with corpus formats: TEI, SaltNPepper Visualization Dedicated visualizations A reusable generic approach Conclusion and outlook Schroeder & Zeldes / Towards Digital Coptic Berlin, 14.1.2014 1/37
Who are these people? Prof. Caroline T. Schroeder – Religious and Classical Studies / Humanities Center Director, University of the Pacific. Dr. Amir Zeldes – Korpuslinguistik / SFB 632 Information Structure (from March: eHumanities group KOMeT), Humboldt-Universität zu Berlin. Cooperation: Coptic SCRIPTORIUM, established at the 2012 NEH summer institute on "Text in a Digital Age" (Tufts): http://coptic.pacific.edu/
Why Coptic? Last stage of the Ancient Egyptian language (starting 2nd century); Mediterranean in the 1st millennium; Hellenistic period. Unique language: longest continuous documentation; contact language (with Greek). Religious significance: Early Christianity, rise of monasticism, Gnosticism ... Coptic dialects
The data: Lots of material (thanks to the Egyptian desert). Relatively little online, nothing like Greek and Latin (Perseus). Lots of things you may want are not available: New Testament (online, not normalized/lemmatized/annotated); Old Testament; The Rule of St. Pachomius; Works of Shenoute of Atripe; Apophthegmata patrum ... But some have been digitized at some point!
A word about the texts in this talk. So far we've concentrated on: Shenoute's sermon Abraham our Father ("As for us, brethren, let us live by the truth so that we are upstanding in all our works, and so that the prophets, apostles and all the saints might dwell among us, ..."); Apophthegmata Patrum (sayings of the desert fathers) ("They said about the blessed Sarah the virgin that she spent sixty years living at the top of the river and she never set foot outside to see the river."); New Testament, esp. Gospel of Mark. See http://coptic.pacific.edu/ for corpora and tools.
Getting from raw text to annotated corpora. Making the data searchable starts with: encoding manuscripts (EpiDoc TEI); segmentation of "word forms"; normalization; segmentation of morphemes; part-of-speech tagging; more annotations... Brief recap: detailed talk in Leipzig last month (slides on my page).
Normalization: automatic normalization with manual correction; handling of known diacritics and abbreviations; closed, growing list of known variants.
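The two automatic steps named on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's normalizer: diacritics are treated as Unicode combining marks, the names `strip_diacritics`, `normalize` and `KNOWN_VARIANTS` are hypothetical, and the single table entry is illustrative rather than taken from the project's variant list.

```python
import unicodedata

# Illustrative lookup table (assumption): maps a diacritic-free form to its
# normalized spelling. The one entry expands a common abbreviation.
KNOWN_VARIANTS = {
    "ⲓⲥ": "ⲓⲏⲥⲟⲩⲥ",
}


def strip_diacritics(form: str) -> str:
    """Drop combining marks (supralinear strokes, dots above, ...)."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(form: str) -> str:
    """Strip diacritics, then replace the form if it is a known variant."""
    bare = strip_diacritics(form)
    return KNOWN_VARIANTS.get(bare, bare)
```

Anything the table does not cover passes through with only its diacritics removed, which is where the manual correction step on the slide would take over.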
Tokenization: Identifying morphemes is non-trivial (agglutinative language, different conventions; we follow Layton 2004). ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ 'Since I became a monk' (since-that-PAST-1sg-do-monk); ⲉⲛⲧⲁϥⲧⲣⲉⲛⲣⲡϣⲁ 'he who made us keep the ceremony' (REL-PAST-3sgM-CAUS-1pl-do-the-observance). Word level segmentation: manual (no scriptio continua). Morph segmentation: automatic (accuracy: 84% - 94%). Example: words ⲛ̄ⲟⲩϣⲏⲣⲉ ⲛ̄ⲁⲃⲣⲁϩⲁⲙ (of-a-son of-Abraham); morphs ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ (of a son of Abraham).
Part-of-speech tagging: POS tagging using TreeTagger (Schmid 1994) and a lexicon from the CMCL project (courtesy of Prof. Tito Orlandi). Two tag sets: fine grained (45 tags) and coarse (22 tags) (see http://coptic.pacific.edu/ for documentation). Interannotator agreement: 94.19% agreement, kappa = 93.67 (considers chance agreement, cf. Artstein & Poesio 2008). Accuracy: in domain, 10-fold cross-validation: 94.04% (fine); out of domain (test with papyri.info): 79.6% (fine) / 87.7% (coarse). Main difficulties: open classes (N/V), disambiguating homonyms (ⲉ can have 6 different tags!)
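To make "considers chance agreement" concrete, here is a minimal sketch of Cohen's kappa for two annotators. The slide does not say which coefficient was computed, so Cohen's kappa is an assumption; the tag sequences are invented toy data (the tag name PREP is hypothetical) and have nothing to do with the figures reported above.

```python
from collections import Counter


def cohens_kappa(tags_a, tags_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty tag sequences of equal length")
    n = len(tags_a)
    # Observed agreement: share of items given the same tag by both.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Expected agreement: chance of both picking the same tag, estimated
    # from each annotator's own tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same tag throughout
    return (observed - expected) / (1 - expected)


# Toy data (invented): the annotators disagree on one item out of six.
annotator_a = ["N", "V", "N", "PREP", "V", "N"]
annotator_b = ["N", "V", "V", "PREP", "V", "N"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

With few tags and skewed distributions, kappa can fall well below raw percentage agreement, which is why both numbers are usually reported together.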
Further annotations. Many other layers are done manually: translation; language of origin; coreference; entity tagging (people, places...); parallel alignment (with Greek); syntax trees (very preliminary tests).
Representing data – how to look at all this stuff? We now have a lot of data to represent: diplomatic transcriptions (including character rendering!); normalization; segmentation into words, morphemes, sometimes letters; annotations. How do we encode this data for search and visualization?
The first challenge: minimal units. Minimal units, or tokens, are critical for searching: Find all words preceding the word "God"; Give me any mentions of Saint Paphnutius, ±10 words; Search for the glosses father and son within 20 words. Two problems: (1) The concept of words is complex in Coptic. (2) Annotations overlap parts of words: individual letters, line breaks... tokens are smaller than words! Example: ⲡⲉϪⲁϥ ϫⲉ ⲉⲓ̇ⲥ ϣ ⲙⲟⲩⲛ ⲛ̇ⲣⲟⲙⲡⲉ ⲻ Ⲡⲉϫⲉ ⲡ̇ϩⲗ̇ⲗⲟ ⲛⲁϥ 'he sAid "it's been e ight years" – The old man told him'
Solution: segmentation layers in ANNIS. We use the open source ANNIS platform as a search interface (Zeldes et al. 2009). Any annotation layer can be defined as a segmentation, defining alternative views on: adjacency (in words, morphemes, etc.); proximity (in words, morphemes, etc.); context size (in words, morphemes, etc.). But which segmentation layer do you want to see? Remember, diplomatic and normalized layers don't match. Any segmentation layer is usable as "base text".
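The idea of segmentation layers over shared minimal units can be illustrated with a small Python model. It is a sketch under stated assumptions, not ANNIS's data model: the names `Unit`, `make_layer` and `context` are hypothetical, and the minimal tokens are simply the five morphs of the tokenization example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    start: int  # index of the first minimal token covered
    end: int    # index one past the last minimal token covered
    text: str


# Minimal tokens: the morphs of the tokenization example.
TOKENS = ["ⲛ", "ⲟⲩ", "ϣⲏⲣⲉ", "ⲛ", "ⲁⲃⲣⲁϩⲁⲙ"]


def make_layer(boundaries):
    """Build a segmentation layer from (start, end) token ranges."""
    return [Unit(s, e, "".join(TOKENS[s:e])) for s, e in boundaries]


MORPHS = make_layer([(i, i + 1) for i in range(len(TOKENS))])
WORDS = make_layer([(0, 3), (3, 5)])


def context(layer, hit_start, hit_end, n):
    """Units of `layer` overlapping the hit, plus up to n units per side."""
    covering = [i for i, u in enumerate(layer)
                if u.start < hit_end and hit_start < u.end]
    first = max(covering[0] - n, 0)
    last = min(covering[-1] + n, len(layer) - 1)
    return [u.text for u in layer[first:last + 1]]


hit = (2, 3)  # the minimal token ϣⲏⲣⲉ
morph_view = context(MORPHS, *hit, n=1)  # one morph on each side
word_view = context(WORDS, *hit, n=1)    # one word on each side
```

The same hit with the same context size yields three morphs in one view and two whole words in the other, which is the sense in which a segmentation defines what "context size" means.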
Switching segmentations in ANNIS
Different contexts. Example search: entity="person". Hit: Abba Antonius, in the manuscript text Ⲁ ⲩϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇. Some options: ±5 words, diplomatic (less than -5 found, since start of text): Ⲁⲩϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉⲟⲩⲛⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇ ⲙ̇ⲙⲟⲕ; ±10 morphs, normalized: ⲁ ⲩ ϭⲱⲗⲡ ⲉⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁⲛⲧⲱⲛⲓⲟⲥ ϩⲓ ⲡ ϫⲁⲓⲉ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ ⲉ ϥ ⲉⲓⲛⲉ ⲙⲙⲟ ⲕ; ±5 tokens: Ⲁ ⲩ ϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ ϫⲁⲓ̇ⲉ̇ · ϫⲉ
Searching with AQL (see http://www.sfb632.uni-potsdam.de/annis/). Basic principle of ANNIS Query Language (AQL): search for some annotations (#1, #2, #3...), then stipulate relationships between them (operators). Example, verbs of Greek origin: pos="V" & source_lang="Greek" & #1 _=_ #2, where _=_ is the identical coverage operator. Example hits: "The head bandit repented", "I have faith in God".
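What the example query asks for can be mimicked over a toy annotation list in Python. This is a simplified model for illustration only, not how ANNIS evaluates AQL: the names `Annotation`, `select` and `identical_coverage` are hypothetical, and the annotations are invented.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Annotation:
    name: str
    value: str
    start: int  # first minimal token covered
    end: int    # one past the last minimal token covered


# Invented toy annotations over three tokens.
ANNOTATIONS = [
    Annotation("pos", "V", 0, 1),
    Annotation("source_lang", "Greek", 0, 1),
    Annotation("pos", "N", 1, 2),
    Annotation("source_lang", "Greek", 1, 2),
    Annotation("pos", "V", 2, 3),
]


def select(name, value):
    """Step 1: search for annotations with a given name and value."""
    return [a for a in ANNOTATIONS if a.name == name and a.value == value]


def identical_coverage(a, b):
    """Step 2: the relationship, here 'both cover exactly the same tokens'."""
    return (a.start, a.end) == (b.start, b.end)


# pos="V" & source_lang="Greek" & #1 _=_ #2
matches = [
    (verb, greek)
    for verb, greek in product(select("pos", "V"), select("source_lang", "Greek"))
    if identical_coverage(verb, greek)
]
```

Only the first token qualifies: the second is Greek but a noun, and the third is a verb without a Greek source annotation.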
Referencing segmentations. There are many operators: . (adjacent), _i_ (inclusion), _o_ (overlap), _l_ (left aligned)...; > (dominance), -> (pointing relation), >@l (left child)... Possible to use segmentations in queries: #1 . #2 - one followed by two; #1 .word #2 - two is the next word after one; #1 .norm,1,10 #2 - two follows one within 1 to 10 norm units ...
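The effect of naming a segmentation in a precedence operator can be shown with a small Python model that counts distance in units of the chosen layer. This is a simplified sketch, not ANNIS's implementation: the function names are hypothetical, and the layers `MORPH` and `WORD` are the two segmentations of the five-morph tokenization example, given as token ranges.

```python
def unit_index(layer, token):
    """Index of the unit in `layer` that covers minimal token `token`."""
    for i, (start, end) in enumerate(layer):
        if start <= token < end:
            return i
    raise ValueError(f"token {token} is not covered by this layer")


def follows(layer, tok_a, tok_b, min_dist=1, max_dist=1):
    """True if tok_b's unit comes min_dist..max_dist units after tok_a's."""
    distance = unit_index(layer, tok_b) - unit_index(layer, tok_a)
    return min_dist <= distance <= max_dist


# Two segmentations of the same five minimal tokens, as (start, end) ranges.
MORPH = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
WORD = [(0, 3), (3, 5)]

# Tokens 2 and 4 are two morphs apart but sit in adjacent words.
adjacent_morphs = follows(MORPH, 2, 4)       # analogous to #1 .morph #2
near_morphs = follows(MORPH, 2, 4, 1, 10)    # analogous to #1 .morph,1,10 #2
adjacent_words = follows(WORD, 2, 4)         # analogous to #1 .word #2
```

The same pair of tokens fails the direct-adjacency test on one layer and passes it on the other, so the choice of segmentation changes the result set, not just the display.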
Adding metadata. Metadata is like any other constraint, with the meta:: prefix. Regular expressions and negation can be used: pos!="V" & source_lang="Greek" & #1 _=_ #2 & meta::msName=/MONB.*/ For metadata names and values we use TEI/EpiDoc as a guideline. More information on AQL: http://www.sfb632.uni-potsdam.de/annis/
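A metadata constraint narrows the set of documents before any annotation matching happens. The Python sketch below shows that filtering step on its own, assuming that a regular expression in a constraint has to match the whole metadata value (hence `re.fullmatch`). The document records, their names and the msName values are hypothetical placeholders.

```python
import re

# Hypothetical document records; names and msName values are placeholders.
DOCUMENTS = [
    {"name": "doc_a", "msName": "MONB.XX"},
    {"name": "doc_b", "msName": "OTHER.01"},
    {"name": "doc_c"},  # no msName recorded
]


def meta_matches(doc, key, pattern):
    """True if the document has `key` and its whole value matches `pattern`."""
    value = doc.get(key)
    return value is not None and re.fullmatch(pattern, value) is not None


# Counterpart of the constraint meta::msName=/MONB.*/
selected = [d["name"] for d in DOCUMENTS if meta_matches(d, "msName", r"MONB.*")]
```

Documents lacking the metadata field are excluded rather than treated as matches, which is the behaviour this sketch assumes.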
Architecture and formats. Different formats are suitable for different parts of the data: TEI is ideal for manuscript structure and metadata; linguistic formats suit computational corpus linguistics (tagging, parsing, coreference). Convert and merge data using SaltNPepper (Zipser & Romary 2010).