30+ years of corpus-based language variation studies. Experiences, challenges and inspirations

Václav Cvrček, Slovko 2019, Bratislava, October 24


  1. 30+ years of corpus-based language variation studies. Experiences, challenges and inspirations
     Václav Cvrček
     Slovko 2019, Bratislava, October 24

  3. Variation in language
     Absence of 1:1 correspondence between form and function
     ▶ synonymy (more forms for one function)
       ▶ splendid – smashing, strong – powerful
       ▶ robiť – drieť (make, labour)
       ▶ lidma – lidmi (people, Inst. pl.)
     ▶ homonymy/polysemy (more functions of one form)
       ▶ stavení (building: {N, G, D, A, V, L} sg., {N, G, A, V} pl.)
       ▶ left (leave, not right)
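
The two directions of the form–function mismatch can be made concrete with a small lookup. This is only an illustrative sketch: the pairs reuse the slide's examples, while the function labels (such as `people.INS.PL`) and the helper name `many_to_many` are assumptions introduced here.

```python
from collections import defaultdict

# (form, function) pairs based on the slide's examples; the labels are illustrative only.
PAIRS = [
    ("lidma", "people.INS.PL"),
    ("lidmi", "people.INS.PL"),
    ("left", "leave.PAST"),
    ("left", "not right"),
    ("stavení", "building.NOM.SG"),
    ("stavení", "building.GEN.SG"),
]

def many_to_many(pairs):
    """Return (forms with several functions, functions with several forms)."""
    by_form = defaultdict(set)
    by_function = defaultdict(set)
    for form, function in pairs:
        by_form[form].add(function)
        by_function[function].add(form)
    # Homonymy/polysemy: one form, more than one function.
    ambiguous_forms = {f: sorted(v) for f, v in by_form.items() if len(v) > 1}
    # Synonymy/variants: one function, more than one form.
    variant_sets = {f: sorted(v) for f, v in by_function.items() if len(v) > 1}
    return ambiguous_forms, variant_sets

ambiguous_forms, variant_sets = many_to_many(PAIRS)
print(ambiguous_forms)
print(variant_sets)
```

A strict 1:1 mapping would leave both dictionaries empty.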

  4. Variants of variation
     Language levels
     ▶ phonology, morphematics – phonemes, morphemes
     ▶ morphology, derivation – indicators of variety
     ▶ lexicon, syntax – meaning/function
     ▶ text – register/style, sociolect
     Perspectives
     ▶ synchronic (sociolinguistic, register)
     ▶ diachronic (dialectal)

  5. Variation and linguistics

  10. Variation and linguistics
      … isn’t linguistics all about variability?
      How do we cope with variation…
      ▶ …by describing it – range & principles of variation (H. Kučera)
      ▶ …by searching for “invariant” (and ignoring variation) – langue × parole, corpus annotation (?)
      ▶ …by denying/fighting it – prescriptive tendencies
      ▶ but N.B.: variation is natural & all-pervasive in human language (Ferguson 1983: 154, cit. Biber & Conrad 2009: 23)
      ▶ …by studying it – variability on lower levels is used on higher ones (emphasises hierarchical nature of language)

  11. Variation as a pointer
      ▶ “free variation” does not exist (in the long run)
      ▶ alternative forms → functional (or semantic) differentiation
      ▶ alternative meanings → formal (or contextual) differentiation
      ▶ if there is variability ⇒ language will employ it
      ▶ variation is a pointer to a (hidden) function (usually on a higher level)

  12. Variation and corpora
      Corpus-based approaches to variation
      ▶ (annotation – lemmatization, tagging – as a way of coping with variability)
      ▶ variation is an empirical phenomenon par excellence – most of the variation cannot be captured by intuition
      ▶ finding an invariant is parallel with searching for a pattern (← a very CL concept)
      ▶ ⇒ frequency is crucial in describing variation (SyD, Word at a Glance)
      ▶ corpora are necessary for identifying areas of variation as well as for describing their principles, range and inventory
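
Why frequency matters for describing variation can be shown with a minimal sketch that compares how two competing variants share one function in two samples. The sample sentences are invented for illustration and are not corpus data; `variant_shares` and `ipm` are hypothetical helper names, not functions of SyD or Word at a Glance.

```python
from collections import Counter

VARIANTS = ("lidma", "lidmi")  # competing forms of 'people', Inst. pl.

def variant_shares(tokens, variants):
    """Share of each competing variant among all occurrences of the variable."""
    counts = Counter(t for t in tokens if t in variants)
    total = sum(counts.values())
    if total == 0:
        return {v: 0.0 for v in variants}
    return {v: counts[v] / total for v in variants}

def ipm(count, corpus_size):
    """Instances per million tokens, for comparing corpora of different sizes."""
    return count * 1_000_000 / corpus_size

# Invented mini-samples (NOT real corpus data).
written = "s lidmi mluvil s lidmi a s lidmi".split()
spoken = "s lidma mluvil s lidma a s lidmi".split()

print(variant_shares(written, VARIANTS))
print(variant_shares(spoken, VARIANTS))
print(ipm(3, 2_000_000))
```

Shares describe the competition between variants inside one variable, while instances per million describe how prominent the variable is in the corpus as a whole.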

  13. 30+ years of corpus-based…
      Douglas Biber (1988): Variation across speech and writing. Cambridge: Cambridge University Press.

  14. Variation in texts

  16. Variability of texts
      Invariant: information/message
      Traditionally described by stylistics
      ▶ qualitative (what is general and what is specific?)
      ▶ absence of scaling (what is dominant and what is marginal?)

  17. Two perspectives
      Emphasised in CL approaches to text variation
      ▶ intratextual – dough – register (linguistic properties)
      ▶ extratextual – cake – genre (conventional categorization)

  18. Multi-dimensional analysis (MDA)

  19. Principles of MDA
      Multi-dimensional analysis (Biber 1988; Biber & Conrad 2009)
      ▶ systemic & functional variability
      ▶ motivated by context & situation
      ▶ registers (∼ intratextual) perspective
      ▶ assumption: text production involves interrelated choices → groups of features → dimensions of variation
      ▶ what is used, how often and together with what (bottom-up empirical approach)
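
The idea of "together with what" can be approximated by correlating feature rates across texts: features that rise and fall together are candidates for one group. The sketch below is only an illustration under stated assumptions: the rates are invented, the three feature names are borrowed from the feature list of this talk, and plain Pearson correlation stands in for the first step before any factor analysis.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long lists (assumes non-constant input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical rates per 1,000 words in five texts (invented numbers).
RATES = {
    "intensifiers": [52.0, 48.0, 9.0, 7.0, 30.0],
    "fillers": [21.0, 18.0, 1.0, 0.0, 9.0],
    "verbal_nouns": [3.0, 5.0, 24.0, 27.0, 12.0],
}

def correlation_matrix(rates):
    """Correlation of every feature with every other feature, keyed by name pair."""
    names = list(rates)
    return {(a, b): pearson(rates[a], rates[b]) for a in names for b in names}

corr = correlation_matrix(RATES)
for (a, b), r in corr.items():
    if a < b:
        print(f"{a} ~ {b}: {r:+.2f}")
```

In this toy data, intensifiers and fillers pattern together and both run against verbal nouns, which is the kind of co-occurrence structure a dimension of variation summarises.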

  25. Methodology of MDA
      1. corpus compilation
      2. list of features
      3. operationalization
      4. statistical evaluation (factor analysis)
      5. interpretation → dimensions of variation, registers…
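
Steps 3 to 5 can be sketched in a few lines, assuming the factor analysis has already been run elsewhere. The counts below are invented, and the split into positively and negatively loading features is an assumption made for the example, not a result from the talk. Scoring a text by summing standardised rates of the salient features is a common convention in MDA work; the population standard deviation is used here as a simplification.

```python
from statistics import mean, pstdev

def rate_per_1000(count, n_words):
    """Operationalization: raw count normalised to a rate per 1,000 words."""
    return 1000 * count / n_words

def zscores(values):
    """Standardise rates across texts (assumes the feature is not constant)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def dimension_scores(rates, positive, negative):
    """Per text: sum of z-scores of positive features minus those of negative ones."""
    z = {name: zscores(vals) for name, vals in rates.items()}
    n_texts = len(next(iter(rates.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_texts)
    ]

# Invented counts for three texts (NOT real data).
TEXTS = [
    {"words": 2000, "fillers": 40, "intensifiers": 30, "verbal_nouns": 6},
    {"words": 2500, "fillers": 5, "intensifiers": 10, "verbal_nouns": 60},
    {"words": 3000, "fillers": 30, "intensifiers": 27, "verbal_nouns": 30},
]
FEATURES = ["fillers", "intensifiers", "verbal_nouns"]

rates = {f: [rate_per_1000(t[f], t["words"]) for t in TEXTS] for f in FEATURES}
# Assumed loadings: the first two features positive, the last one negative.
scores = dimension_scores(
    rates, positive=["fillers", "intensifiers"], negative=["verbal_nouns"]
)
print([round(s, 2) for s in scores])
```

Texts with high scores lean towards the positive pole of the hypothetical dimension; interpreting what that pole means functionally remains the analyst's task.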

  26. MDA of Czech

  28. MDA of Czech
      Expected challenges / highlights of MDA…
      ▶ …in Czech – situation bordering on diglossia (Bermel 2014): Literary × Common Czech
      ▶ …in Slavic languages – specific morphology, inflection, free word order
      ▶ …in 21st century – how to include the web data (Biber & Egbert 2016; Sharoff 2018)
      Results published in:
      ▶ Cvrček, V. et al. (2018a): From Extra- to Intratextual Characteristics: Charting the Space of Variation in Czech through MDA. Corpus Linguistics and Linguistic Theory.
      ▶ Cvrček, V. et al. (2018b): Variabilita češtiny: multidimenzionální analýza [Variability of Czech: a multidimensional analysis]. Slovo a slovesnost 79, 293–321.

  29. Data: Corpus Koditex
      Corpus size (# by category):
      Text chunks: 3 334
      Lemmata (types): 204 K
      Words (excl. punct.): 9 M
      Tokens: 10.8 M
      ▶ guiding principles: diverse, contemporary, text length control
      ▶ “diversified” stratified sampling
      ▶ after 1990, majority from 2007–2014
      ▶ text excerpts = chunks (not whole texts)
      ▶ annotation: lemmas, tags, multi-word unit & named-entity recognition
      ▶ tools: KonText, MorphoDiTa, NameTag
      ▶ 3 modes – wri, spo, web
      ▶ 8 divisions, 45 classes, ≈ 200,000 words per class
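
Stratified sampling with a fixed word budget per class could be sketched as follows. This is not the actual Koditex procedure, which is not described on the slide: the class names, chunk lengths and the small demo budget are all invented, and `stratified_sample` is a hypothetical helper.

```python
import random

def sample_class(chunks, target_words, rng):
    """Pick chunks in random order until the class reaches its word budget."""
    pool = list(chunks)
    rng.shuffle(pool)
    picked, total = [], 0
    for chunk_id, n_words in pool:
        if total >= target_words:
            break
        picked.append(chunk_id)
        total += n_words
    return picked, total

def stratified_sample(chunks_by_class, target_words, seed=0):
    """Apply the same word budget to every class so that no class dominates."""
    rng = random.Random(seed)
    return {
        cls: sample_class(chunks, target_words, rng)
        for cls, chunks in chunks_by_class.items()
    }

# Hypothetical classes with (chunk id, length in words); lengths are invented.
CHUNKS = {
    "wri/news": [(f"news-{i}", 2000 + 100 * (i % 5)) for i in range(20)],
    "spo/chat": [(f"chat-{i}", 2000 + 100 * (i % 5)) for i in range(20)],
    "web/forum": [(f"forum-{i}", 2000 + 100 * (i % 5)) for i in range(20)],
}

# Demo budget of 10,000 words; the slide reports roughly 200,000 words per class.
sample = stratified_sample(CHUNKS, target_words=10_000)
for cls, (ids, total) in sample.items():
    print(cls, len(ids), total)
```

Because chunks have controlled length, each class overshoots its budget by less than one chunk, which keeps the classes comparable in size.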

  30. Features and their operationalization
      Originally 140+ features, final list 122, e.g.:
      ▶ phonetics – narrowing é > í, diphthongization ý > ej, average word length…
      ▶ morphology – freq. of cases, numbers, moods, tenses…
      ▶ derivation – adjectives denoting similarity, verbal nouns, diminutives…
      ▶ lexicon – indefinite pronouns, reporting verbs, verbs of thinking, semantically bleached nouns…
      ▶ pragmatics – contact expressions, fillers, intensifiers, downtoners…
      ▶ syntax – types of attributes, clusters of POS, types of dependent clauses…
      ▶ text/discourse – questions, phraseology, word repetition…
      Type-based features – inventories of pronouns, prepositions, conjunctions (relativized using zTTR, Cvrček & Chlumská 2015)
      Lexical richness – Yule’s K, thematic concentration (Popescu et al. 2007), unigrams & bigrams (zTTR)
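
Of the lexical richness measures listed, Yule's K is simple enough to sketch directly. The code uses the standard formula K = 10^4 · (Σ i²·V_i − N) / N², where N is the number of tokens and V_i the number of types occurring exactly i times. Tokenisation by whitespace and the sample strings are assumptions made for the example; zTTR is not shown because it needs reference values from a corpus.

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K: higher values mean more repetition, i.e. lower lexical richness."""
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot compute Yule's K for an empty text")
    type_freqs = Counter(tokens)
    # Frequency spectrum: i -> number of types that occur exactly i times.
    spectrum = Counter(type_freqs.values())
    s2 = sum(i * i * v_i for i, v_i in spectrum.items())
    return 10_000 * (s2 - n) / (n * n)

# Invented toy strings, tokenised naively on whitespace.
varied = "the quick brown fox jumps over a lazy dog".split()
repetitive = "the dog and the dog and the dog".split()

print(yules_k(varied))
print(yules_k(repetitive))
```

A text in which every token is a different type gets K = 0, and the value grows as a few types account for more of the tokens.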
