Overabundance as hybrid inflection: Quantitative evidence from Czech
Matías Guzmán Naranjo and Olivier Bonami
09-11.11.2016, Mannheim

1. Overabundance
2. The Czech system
3. Materials
4. Methodology
5. Results
   - Singular locative
   - Overabundance as hybrid inflection
   - Instrumental plural as sociolinguistic variation

Overabundance

Defining overabundance

Overabundance: two different words in free variation fill the same cell in an inflectional paradigm.
Example: Spanish sbjv.imp.3sg canta-ra vs. canta-se

Not to be confused with:
1. Extended (multiple) exponence: two separate exponents realizing the same features within the same word.
   Example: French fut.3pl chant-er-ont
2. Heteroclisis: one lexeme uses a paradigm that is a mix of two inflection classes.
   Example: Czech neuter nouns

           ‘town’    ‘chicken’   ‘sea’
   nom.sg  měst-o    kuř-e       moř-e
   nom.pl  měst-a    kuřat-a     moř-e

Overabundance and morphological theory

The phenomenon was mostly ignored by morphologists until the pioneering work of Thornton (2011, 2012).
There have been few efforts to date to accommodate overabundance within morphological theory (see Bonami and Stump, in press, for a sketchy proposal).
The conceptual characterization of overabundance is still unclear. In particular:
1. Do overabundant lexemes belong to discrete classes, contrasting with non-overabundant inflection classes? Or is morphological realization inherently variable (Aronoff and Lindsay, 2016)?
2. How are competing inflection strategies distributed?
   - Given that a lexeme is overabundant, are there linguistic or extralinguistic factors governing the distribution of its alternate forms?
   - Do overabundant lexemes differ in their preference for one or the other realization?
   - If so, are a lexeme’s preferences predictable from its form and/or meaning?

Our project

Our goals:
1. Show that the answers to these questions are not uniform: there are different kinds of overabundance, calling for different kinds of analyses.
2. Show that, in some cases, overabundance amounts to hybridization of inflection classes: a group of lexemes forms a class that is a hybrid between two other inflection classes in that it simultaneously allows inflection strategies from both.

The method:
We use statistical modeling to explore the distribution of inflection strategies in a large corpus.

We focus on Czech declension for opportunistic reasons:
1. High prevalence of overabundance
2. Good documentation of the phenomenon (Bermel and Knittl, 2012a,b; Bermel, Knittl, and Russell, 2015; Cvrček et al., 2010)
3. Availability of large corpora with high-quality annotation through the Czech National Corpus

The data set

We examine all nouns from the SYN2015 corpus (Křen et al., 2015), a 120M token balanced corpus of written, edited Czech documenting usage between 2010 and 2014.
Lemmatization and tagging are provided with the corpus.
Semi-automatic identification of case-number exponents:

   nom.sg   loc.sg
   dub      dubu     ‘oak tree’
   zebu     zebu     ‘zebu’
   zima     zimě     ‘cold’
   sestra   sestře   ‘sister’
   kniha    knize    ‘book’

We estimate whether a lexeme is overabundant over the larger (2200M token) SYN v3 collection of corpora (Hnátková et al., 2014).
This diminishes the proportion of incorrect classification as non-overabundant due to data sparsity.
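
As an illustration of the classification step described above, here is a minimal Python sketch of how a lexeme could be flagged as overabundant in a given paradigm cell from corpus counts. It is not the authors' pipeline: the input format (triples of lemma, cell, exponent), the function names, the `min_tokens` threshold and the toy counts are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def exponent_counts(tokens):
    """Count attested exponents per (lemma, cell) from (lemma, cell, exponent) triples."""
    counts = defaultdict(Counter)
    for lemma, cell, exponent in tokens:
        counts[(lemma, cell)][exponent] += 1
    return counts

def is_overabundant(counter, min_tokens=1):
    """Treat a cell as overabundant if at least two exponents occur min_tokens times or more.

    min_tokens is a hypothetical threshold, not a value taken from the slides.
    """
    attested = [e for e, n in counter.items() if n >= min_tokens]
    return len(attested) >= 2

def proportion(counter, exponent):
    """Share of tokens in the cell that use the given exponent."""
    total = sum(counter.values())
    return counter[exponent] / total if total else 0.0

# Toy data with invented counts, for illustration only.
tokens = [
    ("dub", "loc.sg", "u"), ("dub", "loc.sg", "u"),
    ("dům", "loc.sg", "ě"), ("dům", "loc.sg", "ě"),
    ("dům", "loc.sg", "ě"), ("dům", "loc.sg", "u"),
]
counts = exponent_counts(tokens)
for (lemma, cell), counter in counts.items():
    print(lemma, cell, is_overabundant(counter), proportion(counter, "u"))
```

Estimating overabundance on a larger corpus, as the slide describes, amounts to feeding more tokens into the same counts, which makes it less likely that a rarer variant is missed.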

Overall distribution of overabundance

Almost all paradigm cells give rise to some amount of overabundance in the corpus.
Some nonsystematic instances involve:
- Spelling variation, e.g. analýza ins.sg: analýzou vs. analyzou
- Semi-undeclinables, e.g. whisky ins.sg: whisky vs. whiskou

[Table: overabundance per paradigm cell, by number (sg, pl) and case (nom, gen, dat, acc, voc, loc, ins). The fourteen cell values lie between 0 and 0.0313: 0.0127, 0.0206, 0.0088, 0, 0.0104, 0.0046, 0.0129, 0.0313, 0.0097, 0.0111, 0.0045, 0.0219, 0.0135, 0.0179.]

Example 1: the gen.sg of masculine animate nouns

Masculine animate nouns ending with a consonant-final nom.sg have two possibilities in the gen.sg:
1. ‘hard nouns’: -a, cf. pán ‘sir’: pána
2. ‘soft nouns’: -e, cf. muž ‘man’: muže

‘Hard’ or ‘soft’ status is predictable from the phonological and morphological makeup of the stem.
However, our corpus shows a handful of overabundant nouns (8 out of 1400), all proper names ending in /s/.

   Lexeme     Prop. -a
   Columbus   0.25
   Smith      0.21
   Paris      0.25
   Keith      0.38
   Jacques    0.31

The remaining three lexemes are Johannes, Julius and Los, with proportions of 0.58, 0.76 and 0.98 (in no particular order).

This we call erratic overabundance.

Example 2: locative singular of hard inanimate nouns

Masculine inanimate nouns ending in a so-called hard consonant may use two different endings in the loc.sg: -u or -ě.
   dub ‘oak tree’, loc.sg dubu
   dům ‘house’, loc.sg domě

Many of these are overabundant. In our corpus:

   -ě only   both   -u only
   363       1820   7146

[Figure: histogram of the number of lexemes (y-axis) by proportion of -u in overabundant lexemes (x-axis, 0 to 1)]

Overabundant nouns tend to have strong preferences, but some nouns exhibit a balanced distribution.
This is a good candidate for hybridization: overabundant nouns form a class of their own.

Example 3: the instrumental plural

All Czech nouns may occur in two forms in the instrumental plural, one of which involves the sequence -ma.
   muž ‘man’: muži ∼ mužema
   žena ‘woman’: ženami ∼ ženama
   město ‘town’: městy ∼ městama

   Only -ma   Both   Only non-ma
   0          551    439

[Figure: histogram of the number of lexemes (y-axis) by proportion of -ma in overabundant lexemes (x-axis, 0 to 0.10)]

Sociolinguistic conditioning: the -ma form is informal. In particular, it is unexpected in writing.
The distribution of overabundant forms in our corpus is as expected, given its stylistic makeup.

Materials

Our goal is twofold:
1. modelling the general Czech inflectional system as a proof of concept,
2. modelling the last two particular cases (-u vs. -ě in the loc.sg, -ma vs. other forms in the ins.pl) to confirm how they contrast: grammatical vs. sociolinguistic conditioning.

Our model was fitted using the nnet package (Venables and Ripley, 2002) in R, with a softmax link function and 10 hidden nodes.
We performed ten-fold cross-validation on all of our models.
The set of predictors that best fitted the data was:

   final_segment + penultimate_segment + antepenultimate_segment + length_in_letters + number_vowels + frequency

We did not find any improvements from adding additional factors, interactions, or hidden nodes.
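
To make the predictor set concrete, the sketch below derives the six predictors from a lemma and builds a ten-fold split. It is an illustration in Python, not the authors' R/nnet code. Several points are assumptions: segments are approximated by letters (so the digraph "ch" would count as two), the vowel inventory is a plain list of Czech vowel letters, frequency is passed through unchanged, and the function names are hypothetical.

```python
import random
import unicodedata

# Assumed vowel inventory (letters, not phonemes).
CZECH_VOWELS = set(unicodedata.normalize("NFC", "aeiouyáéíóúůýě"))

def predictors(lemma, frequency):
    """Return the six predictors named on the slide for one lemma."""
    letters = unicodedata.normalize("NFC", lemma.lower())

    def segment_from_end(i):
        # i-th letter counting from the end, or "" if the word is too short
        return letters[-i] if len(letters) >= i else ""

    return {
        "final_segment": segment_from_end(1),
        "penultimate_segment": segment_from_end(2),
        "antepenultimate_segment": segment_from_end(3),
        "length_in_letters": len(letters),
        "number_vowels": sum(ch in CZECH_VOWELS for ch in letters),
        "frequency": frequency,  # assumed to be supplied by the caller
    }

def ten_fold_indices(n_items, seed=0):
    """Shuffle item indices and deal them into ten folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[k::10] for k in range(10)]

print(predictors("město", 100))
folds = ten_fold_indices(25)
print([len(fold) for fold in folds])
```

In a cross-validation run, each fold would serve once as the held-out test set while a model is fitted on the other nine.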

Methodology

Confusion matrices and accuracy measures

We make use of two basic tools for evaluating the analogical systems: confusion matrices and accuracy measures.

Suppose we have two groups, A and B, and the following words:
   A: lama, lara, lado, laso, lerr, liz
   B: pama, ra, dal, kar, olor, gin, grip, wek

We can postulate two models:
   Model 1: all words starting with an ‘l’ belong to group A, all others to group B.
   Model 2: all words with an ‘a’ as first vowel belong to group A, all others to group B.

Model 1, a perfectly predictive model, produces the following results:

   A: lama, lara, lado, laso, lerr, liz
   B: pama, ra, dal, kar, olor, gin, grip, wek

               Reference
   Prediction    A    B
            A    6    0
            B    0    8

   Accuracy: 1
   95% CI: (0.7684, 1)
   No Information Rate: 0.5714
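
The two summary figures reported with the accuracy can be reproduced by hand. The sketch below assumes that the 95% CI is an exact (Clopper-Pearson) binomial interval on the number of correct predictions, and that the No Information Rate is the share of the largest reference class. Both are assumptions about how the slide's numbers were obtained, although the lower bound 0.7684 for 14 correct out of 14 is what an exact interval gives. The function names are hypothetical.

```python
from collections import Counter
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n, found by bisection."""
    def solve(f, target):
        # f is increasing in p on [0, 1]; find p such that f(p) = target
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2
    upper = 1.0 if k == n else solve(lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper

def no_information_rate(reference_labels):
    """Accuracy obtained by always guessing the most frequent reference class."""
    counts = Counter(reference_labels)
    return max(counts.values()) / len(reference_labels)

reference = ["A"] * 6 + ["B"] * 8
print(exact_ci(14, 14))                 # interval for 14 correct out of 14
print(no_information_rate(reference))   # 8 / 14
```

A model is only informative if its accuracy clearly exceeds the No Information Rate, which is why the two figures are reported together.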

Model 2, a completely unpredictive model, produces the following results:

   A: lama, lara, lado, laso, pama, ra, dal, kar
   B: lerr, liz, olor, gin, grip, wek

               Reference
   Prediction    A    B
            A    4    4
            B    2    4

   Accuracy: 0.5714
   95% CI: (0.2886, 0.8234)
   No Information Rate: 0.5714
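
The toy example can be checked mechanically. The following Python sketch encodes the two models as functions and tabulates predictions against the reference groups. The word lists come from the slides; the function names and the dictionary layout of the matrix are choices made for this illustration.

```python
GROUP_A = ["lama", "lara", "lado", "laso", "lerr", "liz"]
GROUP_B = ["pama", "ra", "dal", "kar", "olor", "gin", "grip", "wek"]

def model_1(word):
    """Words starting with 'l' go to group A, all others to group B."""
    return "A" if word.startswith("l") else "B"

def model_2(word):
    """Words whose first vowel is 'a' go to group A, all others to group B."""
    first_vowel = next((ch for ch in word if ch in "aeiou"), None)
    return "A" if first_vowel == "a" else "B"

def confusion_matrix(model):
    """Return counts keyed by (prediction, reference)."""
    matrix = {(pred, ref): 0 for pred in "AB" for ref in "AB"}
    for ref, words in (("A", GROUP_A), ("B", GROUP_B)):
        for word in words:
            matrix[(model(word), ref)] += 1
    return matrix

def accuracy(matrix):
    """Share of items on the diagonal (prediction equals reference)."""
    correct = matrix[("A", "A")] + matrix[("B", "B")]
    return correct / sum(matrix.values())

for name, model in (("Model 1", model_1), ("Model 2", model_2)):
    m = confusion_matrix(model)
    print(name, m, round(accuracy(m), 4))
```

Model 2 gets 8 of 14 items right, which is exactly what always guessing the larger group B would achieve; this is the sense in which it is unpredictive.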

Results: Singular locative

1. We first present the results of our model in the complete system for each individual cell of the paradigm.
2. The point of this initial step is to provide some evidence that inflectional class in Czech nouns is strongly correlated with the phonological shape of nouns.
3. This is not just a property of overabundant classes.

Singular locative

[Table: confusion matrix (Prediction by Reference) for the singular locative, over the exponent classes é, ě, i, u, 0-u, ě-u, i-u, ovi-u, i-ovi, ovi, ém, m, ti, tu, ý, é-ý]