The Marburg Agreement Project Corpus, annotation and preliminary results
Magnus Breder Birkenes, Stephanie Leser-Cronau (University of Marburg)
“Parallel text analysis in diachronic research”, Marburg, 02/22/2018
Background
• “Diachrony of Agreement Systems: Breton – Welsh – German (and other Germanic languages)” (current research project at the University of Marburg)
• funding: Deutsche Forschungsgemeinschaft (DFG)
• principal investigators: Jürg Fleischer, Paul Widmer, Erich Poppe
• research assistants: Magnus Breder Birkenes (programming, North Germanic), Stephanie Leser-Cronau (German / West Germanic), Kerstin Plein (Welsh), Ricarda Scherschel (Breton)
• student assistants: Katja Daube, Canan Elif Sertkaya (German), Lara Geinitz, Julia Vogelsang (Brittonic)
• method: annotation and comparison of parallel texts to keep text type and content constant, ensuring maximum comparability
Terminology (adapted from: Corbett 2006, p. 5)
Two examples
(1) The committee has/have agreed (cf. Corbett 2006, p. 2)
(2) Das Mädchen legt seinen_n / ihren_f Mantel ab. Sie_f / es_n trägt ein rotes Kleid.
    ‘The girl takes her coat off. She is wearing a red dress.’ (Köpcke and Zubin 2009, p. 142)
• prone to linguistic leveling?
Motivation
• agreement sensitive to genre and style, as shown by Corbett (2006, 271–273) for Russian and Levin (2001) for English
• Birkenes and Sommer (2015): collective nouns: plural agreement in the finite verb common in religious (translational) texts
• Fleischer (2012) and Leser-Cronau (2017): semantic agreement in oral language, syntactic agreement becoming the norm in written German
• language change or genre effects?
Project languages
• Germanic languages: entire family (with a special focus on German)
  • East Germanic: Gothic
  • West Germanic: High and Low German, Dutch, Afrikaans, West / East / North Frisian, English
  • North Germanic: Icelandic, Faroese, Norwegian, Swedish, Danish
• Brittonic languages: Breton and Welsh
Research questions
1. Pervasiveness of agreement: How does agreement develop diachronically in the Germanic and Brittonic languages?
2. The role of mismatches: How common are agreement mismatches?
Structure of the talk
1. Introduction: Corpus and technical infrastructure (Magnus Breder Birkenes)
2. Preliminary results from the pervasiveness study (Magnus Breder Birkenes)
3. Exploring the results (Paul Widmer)
4. Case study I: Mismatches in the history of German (Stephanie Leser-Cronau)
5. Case study II: Agreement-relevant Initial Consonant Mutations (ICMs) in Welsh (Kerstin Plein)
6. Conclusions
7. Discussion
Corpus and annotation
Bible corpus
• the Bible as a “massively parallel” text
• parallel texts widely used in translation studies, computational linguistics and typological research (e.g. Cysouw and Wälchli 2007)
• less used in diachronic studies. Prominent exceptions: “Pragmatic Resources in Old Indo-European Languages” (PROIEL, see Haug and Jøhndal 2008) and Biblia Medieval ( http://www.bibliamedieval.es/ )
• biblical and religious texts well-attested in the transmission of the project languages (in some better than in others)
• pros: widely available, allows for (exact) comparison, prose text
• cons: translation syntax, archaic structures, different translation methods, style
Bible corpus (Germanic/Brittonic): 34 texts
[Figure: timeline of the corpus texts by language; time axis from 400 to 2000]
• Latin: Vulgata
• Gothic: Wulfila
• High German: Tatian, Beheim, .., Luther 1545, Luther 2017
• Low German: Bugenhagen, Jessen
• West Frisian: Wumkes
• East Frisian: Saterlandic
• North Frisian: Sylt
• Dutch: Statenvertaling 1637, Statenvertaling 1977
• Afrikaans: Bible
• English: Wycliffe, King James
• Icelandic: Gammelnorsk homiliebok, Guðbrandsbiblia, Biblian
• Faroese: Biblian
• Norwegian: Gammelnorsk homiliebok, Reformationsbibelen, Bibelen (bokmål/nynorsk)
• Danish: Reformationsbibelen, Autoriseret
• Welsh: Salesbury, BWM, BWM, BWM, Llafar
• Breton: Le Gonidec, L .., Oliéro
Annotation
1. pervasiveness study: defined text portion: “The Birth of Jesus”
   • larger portion (German, Welsh, Breton): Luke 1:5–2:35 (111 verses)
     • in total: 16 texts of ca. 2,000 tokens each (reference: Luther 2017)
   • smaller portion (all other Germanic languages): Luke 2:1–2:20 (20 verses)
     • in total: 18 texts of ca. 300 tokens each (reference: Luther 2017)
2. mismatch study: all relevant controllers
Data (as of February 2018)
• German: 13,132 relations
• Welsh: 5,048 relations
• Breton: 4,007 relations
• Other West Germanic languages: 1,658 relations
• North Germanic languages: 1,724 relations
= 25,569 agreement relations
Annotation
• tagging of all potential agreement forms, including those that no longer show any agreement:
  • verbs (finite verbs, participles), adjectives, determiners and pronouns
  • Standard Average European bias: e.g. no object agreement (not relevant for Germanic)
• annotation:
  • controller, target: gender, number, person
  • domain: attributive, predicative, relative, anaphoric
Example: Luke 2:8
ENG: And there were.pl_2 in the_1 same_1 country_1 shepherds_2 abiding_2 in the_3 field_3, keeping_2 watch over their.3pl_2,4 flock_4 by night.
(green = controller, blue = target, red = potential target, indices = controller id)
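The annotation scheme (controller, target, features, domain) can be pictured as a small record type. The following is only a minimal sketch, not the project's actual data model: the class names `Token` and `AgreementRelation`, the field names, and the `overt` property are hypothetical. The id `42002008` for Luke 2:8 assumes an eight-digit verse id of the book/chapter/verse kind, and the domain labels assigned to the three relations are illustrative guesses rather than the project's annotation decisions.

```python
from dataclasses import dataclass, field

# Feature and domain inventories as listed on the annotation slide.
FEATURES = ("gender", "number", "person")
DOMAINS = ("attributive", "predicative", "relative", "anaphoric")


@dataclass(frozen=True)
class Token:
    alignment_id: str  # verse-level id (assumed: 42002008 = Luke 2:8)
    position: int      # 0-based token index within the verse
    form: str


@dataclass
class AgreementRelation:
    controller: Token
    target: Token
    domain: str
    # Features overtly marked on the target. An empty dict stands for a
    # potential target that shows no overt agreement.
    features: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        unknown = set(self.features) - set(FEATURES)
        if unknown:
            raise ValueError(f"unknown features: {sorted(unknown)}")

    @property
    def overt(self) -> bool:
        return bool(self.features)


verse = ("And there were in the same country shepherds abiding in the field , "
         "keeping watch over their flock by night .")
tokens = [Token("42002008", i, form) for i, form in enumerate(verse.split())]

shepherds, country = tokens[7], tokens[6]
relations = [
    # controller 2 (shepherds) -> were.pl
    AgreementRelation(shepherds, tokens[2], "predicative", {"number": "pl"}),
    # controller 2 (shepherds) -> their.3pl
    AgreementRelation(shepherds, tokens[16], "anaphoric",
                      {"number": "pl", "person": "3"}),
    # controller 1 (country) -> the: potential target, no overt agreement
    AgreementRelation(country, tokens[4], "attributive"),
]

for r in relations:
    print(r.controller.form, "->", r.target.form, r.domain, r.features, r.overt)
```

Keeping potential targets as relations with an empty feature set means that overt and covert forms can later be counted from the same list.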
Database and web application
Solution
• web application for annotating parallel texts
• the parallel texts aligned on verse level using an alignment id (cf. the system of Mayer and Cysouw 2014, p. 3161):
    40001001 TAB The book of the generations of Jesus Chris ... LF
    40001002 TAB The son of Abraham was Isaac ; and the so ... LF
    40001003 TAB And the sons of Judah were Perez and Zerah ... LF
    ...
• texts and annotations stored in a SQL database
• queries via the webapp
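To illustrate how verse-level alignment ids make texts comparable, here is a minimal sketch in Python. It assumes that the eight digits encode a two-digit book number, a three-digit chapter and a three-digit verse (so that 40001001 would be book 40, chapter 1, verse 1) and that each line has the shape `id TAB text`. The function names and the sample verse texts are hypothetical placeholders, not part of the project's code or corpus.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class VerseRef(NamedTuple):
    book: int
    chapter: int
    verse: int


def parse_alignment_id(alignment_id: str) -> VerseRef:
    """Split an id such as '40001001' into book, chapter and verse.

    Assumed layout: 2 digits book + 3 digits chapter + 3 digits verse.
    """
    if not (len(alignment_id) == 8 and alignment_id.isascii()
            and alignment_id.isdigit()):
        raise ValueError(f"not an 8-digit alignment id: {alignment_id!r}")
    return VerseRef(int(alignment_id[:2]),
                    int(alignment_id[2:5]),
                    int(alignment_id[5:]))


def read_verses(lines: Iterable[str]) -> Dict[str, str]:
    """Read 'id TAB text' lines into a dict keyed by alignment id."""
    verses = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        alignment_id, text = line.split("\t", 1)
        parse_alignment_id(alignment_id)  # validate the id
        verses[alignment_id] = text
    return verses


def align(a: Dict[str, str], b: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Pair up the verses that occur in both texts, in id order."""
    return [(vid, a[vid], b[vid]) for vid in sorted(a.keys() & b.keys())]


# Placeholder content only; real files would hold the verse texts.
text_a = read_verses(["40001001\tverse 1 in text A\n",
                      "40001002\tverse 2 in text A\n"])
text_b = read_verses(["40001002\tverse 2 in text B\n",
                      "40001003\tverse 3 in text B\n"])

print(parse_alignment_id("40001001"))
print(align(text_a, text_b))
```

Because the id is zero-padded and of fixed width, sorting the ids as strings already yields canonical verse order, and verses missing from one translation simply drop out of the pairing.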
Web application
• built in Python/Flask, using PostgreSQL as database, programmed by me
• hosted on a virtual server at the university computer center in Marburg, maintained by me
• allows for stand-off annotation of parallelized texts using HTML and JavaScript
• user-friendly: point-and-click annotation
• repetitive tasks can be automated
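Stand-off annotation means that the text itself is stored unchanged and the annotations only point to it. The sketch below shows one way this could look in SQL. It is an assumption-laden illustration, not the project's schema: the table and column names are invented, and Python's built-in `sqlite3` module stands in for PostgreSQL so that the example runs without a server. The text name follows the labels used in the result charts, and `42002008` is an assumed id for Luke 2:8.

```python
import sqlite3

# Hypothetical schema: tokens are stored once; relations refer to them by id.
SCHEMA = """
CREATE TABLE token (
    id           INTEGER PRIMARY KEY,
    text_name    TEXT NOT NULL,
    alignment_id TEXT NOT NULL,
    position     INTEGER NOT NULL,
    form         TEXT NOT NULL,
    UNIQUE (text_name, alignment_id, position)
);
CREATE TABLE relation (
    id            INTEGER PRIMARY KEY,
    controller_id INTEGER NOT NULL REFERENCES token(id),
    target_id     INTEGER NOT NULL REFERENCES token(id),
    domain        TEXT NOT NULL,
    gender        TEXT,
    number        TEXT,
    person        TEXT
);
"""


def add_verse(conn, text_name, alignment_id, forms):
    """Insert one verse token by token and return the new token ids."""
    ids = []
    for position, form in enumerate(forms):
        cur = conn.execute(
            "INSERT INTO token (text_name, alignment_id, position, form) "
            "VALUES (?, ?, ?, ?)",
            (text_name, alignment_id, position, form),
        )
        ids.append(cur.lastrowid)
    return ids


def add_relation(conn, controller_id, target_id, domain,
                 gender=None, number=None, person=None):
    """Store one agreement relation; unmarked features stay NULL."""
    conn.execute(
        "INSERT INTO relation "
        "(controller_id, target_id, domain, gender, number, person) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (controller_id, target_id, domain, gender, number, person),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

verse = ("And there were in the same country shepherds abiding in the field , "
         "keeping watch over their flock by night .")
ids = add_verse(conn, "eng-kingjames-1611", "42002008", verse.split())

# shepherds (token 7) controls 'were' (token 2) and 'their' (token 16);
# the domain labels are illustrative.
add_relation(conn, ids[7], ids[2], "predicative", number="pl")
add_relation(conn, ids[7], ids[16], "anaphoric", number="pl", person="3")

rows = conn.execute(
    """
    SELECT c.form, t.form, r.domain
    FROM relation AS r
    JOIN token AS c ON c.id = r.controller_id
    JOIN token AS t ON t.id = r.target_id
    ORDER BY t.position
    """
).fetchall()
print(rows)
```

Since annotations never alter the `token` rows, the same verse can carry any number of relations, and an annotation can be revised or deleted without touching the text.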
Preliminary results: Development and pervasiveness of agreement
Global pervasiveness
• the overall frequency of agreement forms (absolute frequency)
• the proportion of agreement forms with overt morphology among all potential agreement forms, whether their morphology is overt or covert (relative frequency)
• agreement defined as covariation between a controller and a target in terms of:
  • at least one feature (rough)
  • one, two and three features (fine-grained)
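The two measures can be made concrete with a short calculation. The sketch below is one possible reading of the definitions above, not the project's implementation: each potential agreement form is represented as a dict of the features it marks overtly, the function names are hypothetical, and the four sample forms are invented for illustration and are not corpus data.

```python
from collections import Counter

FEATURES = ("gender", "number", "person")


def overt_feature_count(form: dict) -> int:
    """Number of features (0 to 3) in which the form overtly covaries."""
    return sum(1 for feature in FEATURES if form.get(feature) is not None)


def pervasiveness(forms, min_features=1):
    """Return (absolute frequency, total, relative frequency).

    A form counts as agreeing if it marks at least `min_features`
    features; min_features=1 corresponds to the rough measure.
    """
    total = len(forms)
    agreeing = sum(1 for f in forms if overt_feature_count(f) >= min_features)
    share = agreeing / total if total else 0.0
    return agreeing, total, share


def fine_grained(forms) -> Counter:
    """Distribution of forms over the number of overtly marked features."""
    return Counter(overt_feature_count(f) for f in forms)


# Invented sample: four potential agreement forms, one without agreement.
sample = [
    {"number": "pl"},
    {},
    {"number": "pl", "person": "3"},
    {"gender": "f", "number": "sg", "person": "3"},
]

print(pervasiveness(sample))
print(fine_grained(sample))
```

For the sample, three of the four potential forms show agreement in at least one feature, so the absolute frequency is 3 and the relative frequency is 0.75; the fine-grained view splits those three into one form each with one, two and three features.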
Number of potential agreement forms
[Bar chart: total frequency of potential agreement forms per text; x-axis: frequency (0 to 200); y-axis: textname, grouped by language]
textname               total
lat-vulgate-400        131
got-wulfila-500        137
deu-tatian-830         161
deu-beheim-1343        203
deu-mentel-1466        217
deu-luther-1545        197
deu-luther-1912        191
deu-luther-2017        195
nds-bugenhagen-1533    197
nds-jessen-1937        210
nld-staten-1637        201
nld-staten-1977        199
fry-wumkes-1943        189
frs-sater-2000         192
frr-sylt-2004          191
eng-wycliffe-1388      195
eng-kingjames-1611     189
afr-bible-1953         213
non-homiliebok-1200    124
isl-guðbrand-1584      174
isl-biblian-2007       160
fao-biblian-1949       177
dan-reformation-1550   190
dan-autoriseret-1992   188
nob-bibelen-2011       192
nno-bibelen-2011       201
cym-ytn-1567           185
cym-bwm-1588           181
cym-bwm1620-1955       185
cym-bwm1620-2004       173
cym-llafar-2013        188
bre-legonidec-1827     175
bre-lecoat-1893        188
bre-oliero-1913        187
bre-koad21-2010        183
Global pervasiveness: absolute and relative
[Stacked bar chart: agreement vs. no agreement per text; bar length: frequency (0 to 200); bar labels: percentages; y-axis: textname, grouped by language]
textname               agreement   no agreement
lat-vulgate-400        100 %       -
got-wulfila-500        99 %        1 %
deu-tatian-830         91 %        9 %
deu-beheim-1343        83 %        17 %
deu-mentel-1466        84 %        16 %
deu-luther-1545        92 %        8 %
deu-luther-1912        92 %        8 %
deu-luther-2017        93 %        7 %
nds-bugenhagen-1533    88 %        12 %
nds-jessen-1937        82 %        18 %
nld-staten-1637        80 %        20 %
nld-staten-1977        79 %        21 %
fry-wumkes-1943        81 %        19 %
frs-sater-2000         88 %        12 %
frr-sylt-2004          46 %        54 %
eng-wycliffe-1388      54 %        46 %
eng-kingjames-1611     42 %        58 %
afr-bible-1953         22 %        78 %
non-homiliebok-1200    91 %        9 %
isl-guðbrand-1584      87 %        13 %
isl-biblian-2007       85 %        15 %
fao-biblian-1949       84 %        16 %
dan-reformation-1550   61 %        39 %
dan-autoriseret-1992   48 %        52 %
nob-bibelen-2011       48 %        52 %
nno-bibelen-2011       54 %        46 %
cym-ytn-1567           71 %        29 %
cym-bwm-1588           70 %        30 %
cym-bwm1620-1955       72 %        28 %
cym-bwm1620-2004       65 %        35 %
cym-llafar-2013        70 %        30 %
bre-legonidec-1827     63 %        37 %
bre-lecoat-1893        66 %        34 %
bre-oliero-1913        62 %        38 %
bre-koad21-2010        62 %        38 %