

SFB 1102: Information Density and Linguistic Encoding. The Empirical Basis of Slavic Intercomprehension. Tania Avgustinova, Andrea Fischer, Klara Jagrova, Dietrich Klakow, Roland Marti, Irina Stenger


  1. SFB 1102: Information Density and Linguistic Encoding. The Empirical Basis of Slavic Intercomprehension. Tania Avgustinova, Andrea Fischer, Klara Jagrova, Dietrich Klakow, Roland Marti, Irina Stenger. REMU International Conference, 28–29 May 2015, Joensuu, Finland. INCOMSLAV: Avgustinova / Fischer / Jagrova / Klakow / Marti / Stenger

  2. Background (e.g. Czech and Polish)

     Aligned text sample (English | Czech | Polish):
     “The basic mission/task | Základním posláním | Podstawowym zadaniem
     of the Czech-Polish Forum | Česko-polského fóra | Forum Polsko-Czeskiego
     is to support | je podpora rozvoje | jest wspieranie działalności
     both current and | stávajících a vzniku | istniejących oraz powstania,
     new common initiatives | nových společných iniciativ | nowych, wspólnych inicjatyw
     within the civil societies | nevládních subjektů | wśród społeczeństw obywatelskich
     of both countries.” | obou zemí. | obydwu państw.

     Intelligibility scale used on the slide: fully understandable, still intelligible, unintelligible.

     Well-known factors determining similarity of written texts in closely related languages:
     - Orthographic distance (orthographic correspondences in cognate sets)
     - Morphological distance (similarity of forms; correspondences in grammar)
     - Lexical distance (cognates: positive, partial, negative; similarity of closed word classes)
     - Syntactic distance (aggregate linguistic measure: linear order, complexity of constructions)
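One simple way to put a number on orthographic distance between cognates is a length-normalised Levenshtein (edit) distance. The sketch below is an illustration only: the slides do not say which measure the project uses, and the function names `levenshtein` and `normalized_distance` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word (0.0 = identical)."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


# Cognate pair taken from the aligned sample: Czech "iniciativ", Polish "inicjatyw".
print(normalized_distance("iniciativ", "inicjatyw"))
```

A plain edit distance treats every character mismatch alike; regular orthographic correspondences such as Czech `v` and Polish `w` would need a weighted cost to be counted as cheaper.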

  3. Approaching intercomprehension (an information-theoretic view) … as processing “noisy code”

     Consider a blended text sample constructed by using information chunks in Czech and Polish interchangeably:

     Základním posláním Forum Polsko-Czeskiego je podpora rozvoje istniejących oraz powstania nových společných iniciativ wśród społeczeństw obywatelskich obou zemí.

     “The basic mission/task of the Czech-Polish Forum is to support both current and new common initiatives within the civil societies of both countries.”

     The blended sample is expected to be intelligible to speakers of both languages, even though it does not conform to either language's encoding system.
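The blended sample can be reproduced by alternating aligned chunks, starting with Czech. The sketch below assumes the chunk segmentation of the aligned Czech and Polish sample on the background slide; the function name `blend` is hypothetical, strict alternation is simply what this one example shows, and the trailing comma after "powstania" is dropped to match the blended sentence.

```python
# Aligned information chunks (segmentation assumed from the background slide).
CZECH = ["Základním posláním", "Česko-polského fóra", "je podpora rozvoje",
         "stávajících a vzniku", "nových společných iniciativ",
         "nevládních subjektů", "obou zemí."]
POLISH = ["Podstawowym zadaniem", "Forum Polsko-Czeskiego",
          "jest wspieranie działalności", "istniejących oraz powstania",
          "nowych, wspólnych inicjatyw", "wśród społeczeństw obywatelskich",
          "obydwu państw."]


def blend(first: list[str], second: list[str]) -> str:
    """Take even-numbered chunks from `first` and odd-numbered chunks from `second`."""
    if len(first) != len(second):
        raise ValueError("chunk lists must be aligned one-to-one")
    picked = [a if i % 2 == 0 else b
              for i, (a, b) in enumerate(zip(first, second))]
    return " ".join(picked)


print(blend(CZECH, POLISH))
```

Calling `blend(POLISH, CZECH)` gives the complementary blend, which starts with a Polish chunk.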

  4. A newly established interdisciplinary Collaborative Research Centre

     - Language use: language offers a wide range of options for how to encode a message.
     - Linguistic variation: variation is an inherent property of the linguistic system.
     - Central hypothesis: language processing relies on predictability in context (in a broader sense). Contextually determined predictability is appropriately indexed by Shannon's notion of information.
     - Information density (surprisal): $\mathrm{Surprisal}(\mathit{unit}) = \log_2 \frac{1}{P(\mathit{unit} \mid \mathit{Context})} = -\log_2 P(\mathit{unit} \mid \mathit{Context})$
     - Long-term research programme: information theory for linguistic inquiry.
     - Project: Mutual intelligibility and surprisal in Slavic intercomprehension (INCOMSLAV).
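As a small worked illustration of the surprisal definition, the sketch below converts a conditional probability into bits. The function name `surprisal` is hypothetical, and the probabilities are made-up inputs rather than values from the project.

```python
import math


def surprisal(probability: float) -> float:
    """Surprisal in bits of a unit with conditional probability P(unit | Context)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return math.log2(1.0 / probability)


# A unit that is certain in context carries no information;
# halving the probability adds one bit.
for p in (1.0, 0.5, 0.125):
    print(p, surprisal(p))
```

The less predictable a unit is in its context, the higher its surprisal.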

  5. Research rationale

     The reading intercomprehension scenario reveals
     - inter-lingual tolerance to unfamiliar linguistic encoding
     - asymmetries with regard to intelligibility (depending on the language pair)

     Goal: identify mechanisms by which languages encode and decode information
     - (the degree of) similarity between Slavic languages provides the basis for (varying) expectations about the linguistic encoding
     - find statistical evidence of mutual intelligibility

     With meaningful units of language we expect
     - diminished intelligibility through missing units
     - confusion through misrecognition of units

     General idea: surprisal of language models correlates with intelligibility
     - adapt N-gram LMs for cross-language use via latent space and similarity
     - analyse information-theoretical results with linguistic knowledge
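The general idea can be sketched with a very small baseline: train a character bigram model on text in the reader's language, then measure the mean surprisal of text in another language. This is not the latent-space adaptation named on the slide. It is an add-one-smoothed bigram model chosen for brevity, and the names `train_bigrams` and `mean_surprisal` are hypothetical.

```python
import math
from collections import Counter


def train_bigrams(text: str):
    """Count character contexts and bigrams; '^' marks the start of the text."""
    padded = "^" + text
    contexts = Counter(padded[:-1])
    bigrams = Counter(zip(padded[:-1], padded[1:]))
    return contexts, bigrams


def mean_surprisal(model, text: str, vocab_size: int) -> float:
    """Mean per-character surprisal (bits) of `text` under an add-one-smoothed model."""
    if not text:
        raise ValueError("text must not be empty")
    contexts, bigrams = model
    padded = "^" + text
    total = 0.0
    for prev, curr in zip(padded[:-1], padded[1:]):
        p = (bigrams[(prev, curr)] + 1) / (contexts[prev] + vocab_size)
        total += -math.log2(p)
    return total / len(text)


# Usage: a model of the reader's language (Czech) scores a Polish stimulus.
czech = "základním posláním česko-polského fóra je podpora rozvoje"
polish = "podstawowym zadaniem forum polsko-czeskiego jest wspieranie"
vocab = len(set(czech) | set(polish))
model = train_bigrams(czech)
print(mean_surprisal(model, polish, vocab))
```

With realistic training corpora, a value like this could be compared against intelligibility scores from experiments; the two short strings here only show the mechanics.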

  6. Project overview diagram

     Encoding: linguistic phenomena; meaningful units of language; intelligible information chunks (cognates, paraphrases, fragments); shared grammar.

     Diagram labels: Slavic Intercomprehension Matrix; text selection; observation & annotation; linguistic hypotheses; linguistic determinants of intercomprehension; quantitative models (of intelligibility, surprisal); surprisal measure; intelligibility validation.

     Experiments: variably close language pairs; synchronic and diachronic perspective.
     Modelling: linguistic and statistical models of surprisal; large-scale corpus studies.

  7. Project overview diagram, repeated from slide 6.

  8. Slavic intercomprehension matrix

     Groups (sub-groups): East Slavic (Russ, Ruth); West Slavic (Sorb, Lech, Cz-Slk); West South Slavic (SCB, Slv); East South Slavic. Rows and columns are numbered 1 to 14; the diagonal cell of each row holds the ISO code. Filled cells per row:

     East Slavic
        1. Russian (rus): 1(2), 1(3)
        2. Ukrainian (ukr): 2(1), 2(3)
        3. Belarusian (bel): 3(1), 3(2)
     West Slavic
        4. Upper Sorbian (hsb): 4(5), 4(6), 4(7), 4(8)
        5. Lower Sorbian (dsb): 5(4), 5(6), 5(7), 5(8)
        6. Polish (pol): 6(4), 6(5), 6(7), 6(8)
        7. Czech (ces): 7(4), 7(5), 7(6), 7(8)
        8. Slovak (slk): 8(4), 8(5), 8(6), 8(7)
     West South Slavic
        9. Bosnian (bos): 9(10), 9(11), 9(12)
        10. Croatian (hrv): 10(9), 10(11), 10(12)
        11. Serbian (srp): 11(9), 11(10), 11(12)
        12. Slovene (slv): 12(9), 12(10), 12(11)
     East South Slavic
        13. Macedonian (mkd): 13(14)
        14. Bulgarian (bul): 14(13)
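The filled cells of this matrix are exactly the ordered pairs of distinct languages within one group. The sketch below regenerates them from the group membership, reading a cell i(j) as "language j processed on the basis of language i", which is how slide 10 glosses the notation. The names `GROUPS` and `matrix_cells` are hypothetical.

```python
# Group membership and ordering as shown in the matrix (ISO codes from its diagonal).
GROUPS = {
    "East Slavic": ["rus", "ukr", "bel"],
    "West Slavic": ["hsb", "dsb", "pol", "ces", "slk"],
    "West South Slavic": ["bos", "hrv", "srp", "slv"],
    "East South Slavic": ["mkd", "bul"],
}


def matrix_cells(groups: dict[str, list[str]]) -> list[str]:
    """Return every within-group cell 'i(j)' for ordered pairs of distinct languages."""
    index = {}
    for langs in groups.values():
        for code in langs:
            index[code] = len(index) + 1  # number the languages 1..14 in matrix order
    cells = []
    for langs in groups.values():
        for known in langs:
            for target in langs:
                if known != target:
                    cells.append(f"{index[known]}({index[target]})")
    return cells


cells = matrix_cells(GROUPS)
print(len(cells), cells[:4])
```

Cross-group cells such as 1(14) do not come out of this function; they would have to be added explicitly.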

  9. Slavic intercomprehension matrix: the matrix of slide 8 with two cross-group cells added, 1(14) in the Russian row and 14(1) in the Bulgarian row. Callouts: “How can a Russian understand Bulgarian?” and “How can a Bulgarian understand Russian?” The callouts name Polish, Czech, Serbian and Croatian as languages “through” which this may work.

  10. Slavic intercomprehension matrix: 1+6+14 (7). The matrix of slide 8 with the cells 1(7) and 14(7) added; 6(7) is already present. Callout: processing Czech, based on knowledge of Russian, Polish and Bulgarian.

  11. Project overview diagram, repeated from slide 6.
