Rusyn as a language between state borders – a statistical approach to variation (for small sample sizes)
Albert-Ludwig University of Freiburg, Germany, Department of Slavonic Studies
Prof. Dr. Achim Rabus & M. Zaidan Lahjouji
Project: Russinisch als eine Staatsgrenzen überschreitende Minderheitensprache: Quantitative Perspektiven (Rusyn as a minority language transcending state borders: quantitative perspectives)
Topics:
• The Rusyn project
• Interests and aims
• Border effects
• The Corpus of Spoken Rusyn
• Quantitative approaches to spoken data
• Pitfalls, possible solutions and limitations
• Example dataset & analysis
The Rusyn Project
• Interests and aims:
• Status / condition of the Carpatho-Rusyn language
• Documentation of spoken Carpatho-Rusyn (corpus)
• Documentation of the Rusyn varieties (corpus)
• Language contact with several "roofing languages" (Slavic and non-Slavic)
• Contact-induced changes?
• Language perception
• Border effects (Woolhiser 2005)
• Quantitative / statistical approaches to spoken language data (R-Studio)
Magocsi, P. R.: Národ znikadiaľ : ilustrovaná história karpatských Rusínov. Prešov : Rusín a Ľudové noviny, 2007, p. 34.
Rusyns as National Minority
Sociolinguistic Factors • Status? Age? Sex? Education? Mobility? Religion? Which factors determine how people speak?
Border Effects as Hypothesis
Border effects (Woolhiser 2005) are detectable within the Rusyn vernacular.
[Map labels: Poland, Ukraine, Slovakia, Hungary, Romania]
Example: A(j) Conjugation
Pugh, S. M. (2009). The Rusyn language: A grammar of the literary standard of Slovakia with reference to Lemko and Subcarpathian Rusyn. München, p. 117.
Corpus of Spoken Rusyn
CQP query search: [word="ма|має|мат*|зна|знає|знат*|позна|познає|познат*"%cd]
Example
Variation within conjugation types AJ and A(j) (Pugh 2009: 116-20)
• Our dataset contains:
• Threefold variation in the 3rd person singular present of мати (ма, має, мат(ь)) and of (по-)знати ((по-)зна, (по-)знає, (по-)знат(ь))
• Several utterances by the same speakers: the data are biased and the independence assumption is violated. Bad!
• Context
• Metadata of speakers
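To make the threefold variation concrete, the following minimal sketch shows how corpus hits could be coded into the three variants. It is an illustration only, not the project's actual workflow: the function name `classify` and the variant labels `zero`, `je` and `t` are hypothetical.

```python
import re

# Stem (ма, зна, or prefixed позна) plus an optional ending: є, т, or ть.
FORM = re.compile(r"(ма|(?:по)?зна)(є|ть?)?")

# Hypothetical labels for the three competing endings.
VARIANT = {None: "zero", "є": "je", "т": "t", "ть": "t"}
LEMMA = {"ма": "мати", "зна": "знати", "позна": "познати"}


def classify(token):
    """Return (lemma, variant) for a 3sg present form, or None if no match."""
    match = FORM.fullmatch(token.lower())
    if match is None:
        return None
    stem, ending = match.groups()
    return LEMMA[stem], VARIANT[ending]
```

Because the whole token must match, the infinitive мати and unrelated words such as мама are rejected rather than miscounted as variants.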
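Coefficients in Multinomial Logistic Regression Model
\[ \ln\frac{P(\mathrm{verbForm} = \mathrm{ma})}{P(\mathrm{verbForm} = \mathrm{mae})} = b_{10} + b_{11}(\mathrm{variety} = \mathrm{Slo}) + b_{12}(\mathrm{variety} = \mathrm{Lem}) + b_{13}\,\mathrm{Age} + b_{14}(\mathrm{sex} = \mathrm{m}) \]
\[ \ln\frac{P(\mathrm{verbForm} = \mathrm{mat})}{P(\mathrm{verbForm} = \mathrm{mae})} = b_{20} + b_{21}(\mathrm{variety} = \mathrm{Slo}) + b_{22}(\mathrm{variety} = \mathrm{Lem}) + b_{23}\,\mathrm{Age} + b_{24}(\mathrm{sex} = \mathrm{m}) \]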
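The two log-odds equations can be turned back into probabilities for the three verb forms, with mae as the reference category. The sketch below assumes the predictors are variety (Slo, Lem, or a reference variety), age and sex, as read from the slide; the function names and any coefficient values passed in are hypothetical and are not estimates from the project.

```python
import math


def linear_predictor(b, variety, age, sex):
    """Evaluate b0 + b1*(variety=Slo) + b2*(variety=Lem) + b3*age + b4*(sex=m)."""
    b0, b_slo, b_lem, b_age, b_male = b
    return (b0
            + b_slo * (variety == "Slo")
            + b_lem * (variety == "Lem")
            + b_age * age
            + b_male * (sex == "m"))


def verb_form_probabilities(eta_ma, eta_mat):
    """Convert log-odds against the reference form 'mae' into probabilities.

    eta_ma  = ln(P(ma)  / P(mae))
    eta_mat = ln(P(mat) / P(mae))
    """
    denominator = 1.0 + math.exp(eta_ma) + math.exp(eta_mat)
    return {
        "mae": 1.0 / denominator,
        "ma": math.exp(eta_ma) / denominator,
        "mat": math.exp(eta_mat) / denominator,
    }
```

When both log-odds are zero, all three forms are equally likely; a positive coefficient for a variety raises the probability of the corresponding form relative to mae for speakers of that variety.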
Problems
X Data set is rather small
X Biased data set
X Dependent variable (verb_form) is categorical
X Threefold variation
X Independent variables are predominantly categorical
X Violation of assumptions (independence)
X We have collected precious data, so we don't want to give up
Bootstrapping Regression
[Diagram: the original sample (Sample0) is resampled into Sample1, Sample2, …, Sample500; a regression is run on each resample, and the results are combined into a robust estimation]
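The resampling idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis from the slides: it resamples whole speakers (an assumption, chosen because several utterances come from the same speaker) and, instead of refitting a regression, uses the share of one verb form as the statistic. The toy rows are invented for demonstration, and all names are hypothetical.

```python
import random
from collections import defaultdict

# Invented toy rows for illustration only; not project data.
TOY_ROWS = [
    {"speaker": "s1", "verb_form": "mae"},
    {"speaker": "s1", "verb_form": "mae"},
    {"speaker": "s1", "verb_form": "ma"},
    {"speaker": "s2", "verb_form": "mat"},
    {"speaker": "s2", "verb_form": "mat"},
    {"speaker": "s3", "verb_form": "mae"},
    {"speaker": "s4", "verb_form": "ma"},
    {"speaker": "s4", "verb_form": "ma"},
    {"speaker": "s4", "verb_form": "mae"},
]


def share_mae(rows):
    """Statistic of interest: proportion of utterances with the form 'mae'."""
    return sum(row["verb_form"] == "mae" for row in rows) / len(rows)


def speaker_bootstrap(rows, statistic, n_resamples=500, seed=0):
    """Draw speakers with replacement and recompute the statistic each time."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["speaker"]].append(row)
    speakers = sorted(by_speaker)
    estimates = []
    for _ in range(n_resamples):
        drawn = rng.choices(speakers, k=len(speakers))
        resample = [row for speaker in drawn for row in by_speaker[speaker]]
        estimates.append(statistic(resample))
    return estimates


def percentile_interval(estimates, level=0.95):
    """Simple percentile interval; indices are rounded down."""
    ordered = sorted(estimates)
    alpha = (1.0 - level) / 2.0
    lower = ordered[int(alpha * (len(ordered) - 1))]
    upper = ordered[int((1.0 - alpha) * (len(ordered) - 1))]
    return lower, upper
```

Calling `percentile_interval(speaker_bootstrap(TOY_ROWS, share_mae))` gives an interval for the share of має-type forms. To follow the slides more closely, the `statistic` argument could be replaced by a function that fits the regression and returns a coefficient.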
Conclusion
• Bootstrapping provides us with a robust estimation of the values of interest, even when assumptions are not met or the data set is small and/or biased.
• Even after bootstrapping, we can still see clear tendencies: settlement area (variety) seems to be the most significant factor.
Conclusion
• Statistical methods are useful for several aspects of our research.
• Our possibilities are rather limited.
• Assumptions are often violated when applying state-of-the-art methods.
• Nevertheless, robust methods help us to obtain less biased estimations.
• Robust estimations should always be reported.
Файно Вам дякуєме за Вашу увагу! Thank you very much for your attention! Contact: zaidan.lahjouji@slavistik.uni-freiburg.de achim.rabus@slavistik.uni-freiburg.de www.russinisch.de
Literature
• Christ, Oliver (1994). A modular and flexible architecture for an integrated corpus query system. In: Proceedings of COMPLEX'94: 3rd Conference on Computational Lexicography and Text Research, 23–32.
• Evert, S. and Hardie, A. (2011). Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In: Proceedings of the Corpus Linguistics 2011 Conference, Birmingham, UK. University of Birmingham.
• Hinneburg, Alexander, Heikki Mannila, Samuli Kaislaniemi, Terttu Nevalainen & Helena Raumolin-Brunberg (2007). "How to handle small samples: Bootstrap and Bayesian methods in the analysis of linguistic change". Literary and Linguistic Computing 22(2): 137–150.
• Mueller, T., Schmid, H. & Schütze, H. (2013). Efficient higher-order CRFs for morphological tagging. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 322–332, Seattle, Washington, USA, October. Association for Computational Linguistics.
• Rabus, A. & A. Šymon (2015): Na novŷch putjach isslidovanja rusyns'kŷch dialektu. Korpus rozhovornoho rusyns'koho jazŷka. In: Koporová, Kvetoslava (ed.): Rusyn'skŷj literaturnŷj jazŷk na Slovakiji. 20 rokiv kodifikaciji. Prešov, 40-54.
• Rabus, Achim (2015): Current Developments in Carpatho-Rusyn Speech - Preliminary Observations. In: Krafcik P. & V. Padjak (eds.): Juvilejnyj zbirnyk na čest' profesora Pavla-Roberta Magočija. Užhorod, 489-496.
• Rabus, A. & Scherrer, Y. (2017): Lexicon Induction for Spoken Rusyn - Challenges and Results. In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, 27-32.
• Rabus, A., Savić, S., Waldenfels, R. v. (2012). Towards an electronic corpus of the Velikie Minei Čet'i. In: Rediscovery: Bulgarian Codex Suprasliensis of the 10th century. Sofia: Iztok Zapad.
• Scherrer, Y. & Rabus, A. (2017): Multi-source morphosyntactic tagging for spoken Rusyn. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 84–92.
• Schimon, A. & A. Rabus (2016): Wahrnehmungsdialektologische Untersuchungen zum Russinischen in Zakarpattja am Beispiel der Region Chust. In: Zeitschrift für Slawistik 61(3), 401-432.
• Šymon, A. & A. Rabus (2016): Ysslidovanja rusyns'koho jazŷka yz pohljada vospryymatel'noji dialektologiji. In: Dynamické procesy v súčasnej slavistike, pp. 71-88. (Reprinted in Rusyn 5/2016 and 6/2016)
• v. Waldenfels, R.; Woźniak, M. (2017). SpoCo – a simple and adaptable web interface for dialect corpora. In: Journal for Language Technology and Computational Linguistics, 31(1), 145–160.
• v. Waldenfels, R.; Daniel, M., Dobrushina, N. (2014): Why Standard Orthography? Building the Ustya River Basin Corpus, an online corpus of a Russian dialect. Komp'juternaja lingvistika i intellektual'nye technologii: Po materialam ežegodnoj Meždunarodnoj konferencii «Dialog» (Bekasovo, 4–8 ijunja 2014 g.). Vyp. 13 (20). M.: Izd-vo RGGU, 2014.
• Woolhiser, C. (2005). Political borders and dialect divergence/convergence in Europe. In: Peter Auer, Frans Hinskens, and Paul Kerswill (eds.), Dialect change, 236–262. Cambridge: Cambridge Univ. Press.