Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data
Caroline T. Schroeder, University of the Pacific, cschroeder@pacific.edu
Amir Zeldes, Humboldt-Universität zu Berlin, amir.zeldes@rz.hu-berlin.de
Berlin Digital Classicist Seminar, 14.1.2014

Plan
- Introduction
  - Coptic data
  - Annotations so far: normalizing, tokenizing and tagging
- Search architecture
  - Searching through multiple segmentations: ANNIS
  - Dealing with corpus formats: TEI, SaltNPepper
- Visualization
  - Dedicated visualizations
  - A reusable generic approach
- Conclusion and outlook
Schroeder & Zeldes / Towards Digital Coptic Berlin, 14.1.2014 1/37

Who are these people?
- Prof. Caroline T. Schroeder: Religious and Classical Studies / Humanities Center Director, University of the Pacific
- Dr. Amir Zeldes: Korpuslinguistik / SFB 632 Information Structure (from March: eHumanities group KOMeT), Humboldt-Universität zu Berlin
- Cooperation: Coptic SCRIPTORIUM, established at the 2012 NEH summer institute on "Text in a Digital Age" (Tufts): http://coptic.pacific.edu/

Why Coptic?
- Last stage of the Ancient Egyptian language (starting in the 2nd century)
- Mediterranean in the 1st millennium
- Hellenistic period
- Unique language
  - Longest continuous documentation
  - Contact language (with Greek)
- Religious significance
  - Early Christianity
  - Rise of monasticism
  - Gnosticism
  - ...
[Figure: Coptic dialects]

The data
- Lots of material (thanks to the Egyptian desert)
- Relatively little online, nothing like Greek and Latin (Perseus)
- Lots of things you may want are not available:
  - New Testament (online, but not normalized, lemmatized or annotated)
  - Old Testament
  - The Rule of St. Pachomius
  - Works of Shenoute of Atripe
  - Apophthegmata Patrum
  - ...
- But some have been digitized at some point!

A word about the texts in this talk
- So far we've concentrated on Shenoute's sermon Abraham Our Father
  "As for us, brethren, let us live by the truth so that we are upstanding in all our works, and so that the prophets, apostles and all the saints might dwell among us, ..."
- Apophthegmata Patrum (Sayings of the Desert Fathers)
  "They said about the blessed Sarah the virgin that she spent sixty years living at the top of the river and she never set foot outside to see the river."
- New Testament, esp. the Gospel of Mark
- See http://coptic.pacific.edu/ for corpora and tools

Getting from raw text to annotated corpora
- Making the data searchable starts with:
  - Encoding manuscripts (EpiDoc TEI)
  - Segmentation of "word forms"
  - Normalization
  - Segmentation of morphemes
  - Part-of-speech tagging
  - More annotations...
- Brief recap only; a detailed talk was given in Leipzig last month (slides on my page)

Normalization
- Automatic normalization, manual correction
  - Handling of known diacritics and abbreviations
  - Closed, growing list of known variants
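
The slide names the ingredients of normalization (diacritics, abbreviations, a list of known variants) but not the implementation. The following is a minimal sketch under our own assumptions: combining marks are stripped via Unicode decomposition, case is folded, and a caller-supplied `variants` table (hypothetical, empty by default) maps known variant spellings to a standard form. It is not the project's actual normalizer.

```python
import unicodedata


def normalize(word, variants=None):
    """Strip combining diacritics, lowercase, then apply a variant table.

    `variants` is a hypothetical dict mapping a stripped, lowercased
    variant spelling to its normalized form.
    """
    # NFD separates base letters from combining marks where possible.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop every combining character (dots, supralinear strokes, ...).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.lower()
    return (variants or {}).get(lowered, lowered)


# Diplomatic form taken from the "Different contexts" slide.
print(normalize("Ⲁⲩϭⲱⲗⲡ̇"))
```

Automatic output of this kind would still need the manual correction pass mentioned on the slide, since stripping marks can conflate forms that differ only in their diacritics.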

Tokenization
- Identifying morphemes is non-trivial (agglutinative language, different conventions; we follow Layton 2004)
  - ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ 'Since I became a monk' (since-that-PAST-1sg-do-monk)
  - ⲉⲛⲧⲁϥⲧⲣⲉⲛⲣⲡϣⲁ 'he who made us keep the ceremony' (REL-PAST-3sgM-CAUS-1pl-do-the-observance)
- Word-level segmentation: manual (no scriptio continua)
- Morph segmentation: automatic (accuracy: 84%-94%)
  - ⲛ̄ⲟⲩϣⲏⲣⲉ` ⲛ̄ⲁⲃⲣⲁϩⲁⲙ` → ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ (of-a-son of-Abraham, 'of a son of Abraham')
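
The slide does not say how the automatic morph segmentation is implemented. Purely as an illustration of the task, here is a sketch of one simple technique, lexicon-driven longest-match segmentation with backtracking. The four-entry `LEXICON` is a toy list built from the slide's own example; the input is assumed to be already normalized (no diacritics).

```python
# Toy lexicon of morphs, taken from the slide's example; hypothetical and
# far too small for real use.
LEXICON = {"ⲛ", "ⲟⲩ", "ϣⲏⲣⲉ", "ⲁⲃⲣⲁϩⲁⲙ"}


def segment(word, lexicon=LEXICON):
    """Split `word` into lexicon morphs, preferring longer prefixes.

    Returns a list of morphs, or None if no full segmentation exists.
    """
    if not word:
        return []
    # Try the longest candidate prefix first, then back off.
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None


for form in ["ⲛⲟⲩϣⲏⲣⲉ", "ⲛⲁⲃⲣⲁϩⲁⲙ"]:
    print(form, "->", segment(form))
```

A purely lexical approach like this cannot choose between competing valid splits, which is one reason a real segmenter would need more context than a word list.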

Part-of-speech tagging
- POS tagging using TreeTagger (Schmid 1994) and a lexicon from the CMCL project (courtesy of Prof. Tito Orlandi)
- Two tag sets: fine-grained (45 tags) and coarse (22 tags) (see http://coptic.pacific.edu/ for documentation)
- Inter-annotator agreement: 94.19%, kappa = 93.67 (kappa corrects for chance agreement, cf. Artstein & Poesio 2008)
- Accuracy:
  - In domain, 10-fold cross-validation: 94.04% (fine)
  - Out of domain (test with papyri.info): 79.6% (fine) / 87.7% (coarse)
- Main difficulties: open classes (N/V), disambiguating homonyms (ⲉ can have 6 different tags!)
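
To make the difference between raw agreement and kappa concrete, here is a small sketch that computes both for two annotators. It assumes Cohen's kappa (the slide does not name the variant), and the tag sequences in the usage lines are invented toy data, not the project's annotations.

```python
from collections import Counter


def agreement_and_kappa(tags_a, tags_b):
    """Return (observed agreement, Cohen's kappa) for two tag sequences."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(tags_a)
    # Observed agreement: share of items with identical tags.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Expected chance agreement from each annotator's tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both annotators used one identical tag only
    return observed, (observed - expected) / (1 - expected)


# Invented toy annotations for four tokens.
annotator_1 = ["N", "V", "N", "PREP"]
annotator_2 = ["N", "V", "V", "PREP"]
print(agreement_and_kappa(annotator_1, annotator_2))
```

Kappa is lower than raw agreement whenever some agreement is expected by chance, which is why the two figures on the slide differ.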

Further annotations
- Many other layers are done manually:
  - Translation
  - Language of origin
  - Coreference
  - Entity tagging (people, places...)
  - Parallel alignment (with Greek)
  - Syntax trees (very preliminary tests)

Representing data: how to look at all this stuff?
- We now have a lot of data to represent:
  - Diplomatic transcriptions (including character rendering!)
  - Normalization
  - Segmentation into words, morphemes, sometimes letters
  - Annotations
- How do we encode this data for search and visualization?

The first challenge: minimal units
- Minimal units, or tokens, are critical for searching:
  - Find all words preceding the word "God"
  - Give me any mentions of Saint Paphnutius, ±10 words
  - Search for the glosses father and son within 20 words
- Two problems:
  - The concept of words is complex in Coptic
  - Annotations overlap parts of words: individual letters, line breaks...
- Consequence: tokens are smaller than words!
- Example (manuscript text as laid out on the slide, with its gloss):
  ⲡⲉϪⲁϥ ϫⲉ ⲉⲓ̇ⲥ ϣ
  ⲙⲟⲩⲛ ⲛ̇ⲣⲟⲙⲡⲉ ⲻ Ⲡⲉϫⲉ ⲡ̇ϩⲗ̇ⲗⲟ ⲛⲁϥ
  he sAid "it's been e ight years" – The old man told him

Solution: segmentation layers in ANNIS
- We use the open-source ANNIS platform as a search interface (Zeldes et al. 2009)
- Any annotation layer can be defined as a segmentation, defining alternative views on:
  - Adjacency (in words, morphemes, etc.)
  - Proximity (in words, morphemes, etc.)
  - Context size (in words, morphemes, etc.)
- But which segmentation layer do you want to see?
  - Remember, diplomatic and normalized layers don't match
  - Any segmentation layer is usable as "base text"

Switching segmentations in ANNIS

Different contexts
- Example search: entity="person"
- Hit: Abba Antonius
- Source text (as laid out on the slide):
  Ⲁ ⲩϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛ
  ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇
- Some options:
  - ±5 words, diplomatic (fewer than 5 words of left context are found, since the hit is near the start of the text):
    Ⲁⲩϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉⲟⲩⲛⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇ ⲙ̇ⲙⲟⲕ
  - ±10 morphs, normalized:
    ⲁ ⲩ ϭⲱⲗⲡ ⲉⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁⲛⲧⲱⲛⲓⲟⲥ ϩⲓ ⲡ ϫⲁⲓⲉ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ ⲉ ϥ ⲉⲓⲛⲉ ⲙⲙⲟ ⲕ
  - ±5 tokens:
    Ⲁ ⲩ ϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ ϫⲁⲓ̇ⲉ̇ · ϫⲉ
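
The idea that context size depends on the chosen segmentation can be sketched with a toy data model. This is not how ANNIS stores data internally; it is an assumption-laden illustration in which minimal tokens are numbered 0 to 7 (the first eight tokens of the Abba Antonius example, diacritics omitted) and each segmentation layer is a list of half-open token ranges.

```python
# Minimal tokens for the start of the example, simplified (no diacritics).
TOKENS = ["ⲁ", "ⲩ", "ϭⲱⲗⲡ", "ⲉⲃⲟⲗ", "ⲛ", "ⲁⲡⲁ", "ⲁⲛ", "ⲧⲱⲛⲓⲟⲥ"]

# Each layer groups tokens into units given as (start, end) token ranges.
LAYERS = {
    "word": [(0, 3), (3, 4), (4, 6), (6, 8)],
    "norm": [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 8)],
}


def context(layer, start, end, k):
    """Token range covering a hit plus up to k units of `layer` per side."""
    spans = LAYERS[layer]
    # Units of this layer that overlap the hit's token range.
    hit = [i for i, (s, e) in enumerate(spans) if s < end and start < e]
    lo = max(hit[0] - k, 0)
    hi = min(hit[-1] + k, len(spans) - 1)
    return spans[lo][0], spans[hi][1]


def render(layer, start, end):
    """Join tokens into the units of `layer` that lie inside the range."""
    return " ".join("".join(TOKENS[s:e])
                    for s, e in LAYERS[layer] if start <= s and e <= end)


# The hit "Abba Antonius" covers tokens 5..7, i.e. the range (5, 8).
for layer in ("word", "norm"):
    lo, hi = context(layer, 5, 8, 1)
    print(layer, (lo, hi), render(layer, lo, hi))
```

With one unit of left context, the word layer reaches back to token 3 while the norm layer only reaches token 4, because word units are larger than norm units here.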

Searching with AQL (see http://www.sfb632.uni-potsdam.de/annis/)
- Basic principle of the ANNIS Query Language (AQL):
  - Search for some annotations (#1, #2, #3...)
  - Stipulate relationships between them (operators)
- Example: verbs of Greek origin
    pos="V" & source_lang="Greek" & #1 _=_ #2
  - _=_ is the identical coverage operator
  - Example sentences shown on the slide (in translation): "The head bandit repented", "I have faith in God"

Referencing segmentations
- There are many operators:
  - . (adjacent), _i_ (inclusion), _o_ (overlap), _l_ (left aligned)...
  - > (dominance), -> (pointing relation), >@l (left child)...
  - ...
- Segmentations can be used in queries:
  - #1 . #2 - node 1 is directly followed by node 2
  - #1 .word #2 - node 2 is the next word after node 1
  - #1 .norm,1,10 #2 - node 2 follows node 1 within 1 to 10 norm units
  - ...
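
The effect of naming a segmentation in a precedence operator can be illustrated with a toy model. This is a sketch of the idea only, not ANNIS's implementation: nodes and layer units are half-open ranges over minimal tokens, and the helper names (`unit_index`, `follows_within`) and the example ranges are our own assumptions.

```python
# Two segmentation layers over the same eight minimal tokens (toy data).
WORD = [(0, 3), (3, 4), (4, 6), (6, 8)]
NORM = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 8)]


def unit_index(spans, token):
    """Index of the layer unit that contains the given token position."""
    for i, (start, end) in enumerate(spans):
        if start <= token < end:
            return i
    raise ValueError(f"token {token} is not covered by this layer")


def follows_within(spans, a, b, min_dist=1, max_dist=1):
    """True if node b starts min_dist..max_dist layer units after node a ends.

    A distance of 1 means b begins in the unit right after the one in
    which a ends, mirroring direct precedence.
    """
    dist = unit_index(spans, b[0]) - unit_index(spans, a[1] - 1)
    return min_dist <= dist <= max_dist


a, b = (3, 4), (6, 8)  # two nodes that are units in both layers
print(follows_within(WORD, a, b, 1, 2))   # measured in word units
print(follows_within(NORM, a, b, 1, 2))   # measured in norm units
print(follows_within(NORM, a, b, 1, 10))  # wider window in norm units
```

The same pair of nodes is two units apart when counted in words but three units apart when counted in norm units, so a window of 1 to 2 matches in one layer and not in the other.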

Adding metadata
- Metadata is like any other constraint, with the meta:: prefix
- Regular expressions and negation can be used:
    pos!="V" & source_lang="Greek" & #1 _=_ #2 & meta::msName=/MONB.*/
- For metadata names and values we use TEI/EpiDoc as a guideline
- More information on AQL: http://www.sfb632.uni-potsdam.de/annis/
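
As a rough illustration of what the query above asks for, the sketch below applies the same three conditions to invented toy data in plain Python. Everything here is hypothetical: the document names, the transliterated word forms and the `search` helper are placeholders, and the pattern is assumed to match the whole metadata value. Identical coverage is modeled simply by storing both annotations on the same token.

```python
import re

# Invented toy corpus; names and forms are placeholders, not real data.
DOCS = [
    {"meta": {"msName": "MONB.example"},
     "tokens": [
         {"form": "pisteue", "pos": "V", "source_lang": "Greek"},
         {"form": "monakhos", "pos": "N", "source_lang": "Greek"},
         {"form": "rome", "pos": "N", "source_lang": None},
     ]},
    {"meta": {"msName": "OTHER.example"},
     "tokens": [
         {"form": "angelos", "pos": "N", "source_lang": "Greek"},
     ]},
]


def search(docs, ms_pattern):
    """Forms that are not verbs, are of Greek origin, in matching documents."""
    hits = []
    for doc in docs:
        # Metadata constraint: the regex must match the whole value.
        if not re.fullmatch(ms_pattern, doc["meta"]["msName"]):
            continue
        for tok in doc["tokens"]:
            # Both annotations sit on the same token (identical coverage).
            if tok["pos"] != "V" and tok["source_lang"] == "Greek":
                hits.append(tok["form"])
    return hits


print(search(DOCS, r"MONB.*"))
```

The metadata test is applied once per document and the annotation tests once per token, which reflects the slide's point that metadata behaves like any other constraint in the query.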

Architecture and formats
- Different formats are suitable for different parts of the data
  - TEI: ideal for manuscript structure, metadata
  - Linguistic formats for computational corpus linguistics: tagging, parsing, coreference
- Convert and merge data using SaltNPepper (Zipser & Romary 2010)
