
Computational dialectology with machine translation techniques

  1. Computational dialectology with machine translation techniques. Yves Scherrer, Department of Digital Humanities, University of Helsinki. Linguistics Research Seminar, University of Gothenburg, 12 November 2019.

  2. A brief history of my career as a machine translation researcher interested in dialectology (or the other way round…): rule-based MT (RBMT, 2007), statistical MT (SMT, 2012–2013), neural MT (NMT, 2017–2018). Illustration: http://vas3k.com/blog/machine_translation/

  4. Object of study: Swiss German dialects

  5. Rule-based machine translation: Standard German → Swiss German

  6. Language variation in rule-based machine translation
     Generative dialectology (Veith 1970, 1982):
     • Transformation rules derive a multitude of dialect systems D_i from a single reference system B:
     • # Töpfer #_B → # Häfner #_D 33333−46999
     My proposal:
     • B: Modern High German (“Standard German”, StdG)
     • Most practical, but not historically correct
     • D: Swiss German dialects
     • Dialects are not represented as discrete numbered entities, but as probability maps
     • immer (StdG) → geng (with an associated probability map)

  8. Example rule: Lemma change
     {immer} → {immer} | {geng} | {gäng} | {all} | …
     • Probability maps extracted from digitized SDS (Sprachatlas der deutschen Schweiz) maps
     • Rules implemented with the XFST finite-state toolkit
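
The lemma-change rule can be read as a lookup in a probability map. The following is a minimal Python sketch of that idea, not the XFST implementation from the talk. The site names and probabilities are invented for illustration, and `variants` and `sample_variant` are hypothetical helper names.

```python
import random

# Hypothetical probability map for the Standard German lemma "immer".
# Site names and probabilities are invented for illustration; they are
# NOT taken from the SDS atlas.
IMMER_MAP = {
    "site_A": {"geng": 0.7, "gäng": 0.3},
    "site_B": {"immer": 0.8, "all": 0.2},
}


def variants(lemma_map, site):
    """Return (variant, probability) pairs for a site, most probable first."""
    dist = lemma_map[site]
    return sorted(dist.items(), key=lambda item: item[1], reverse=True)


def sample_variant(lemma_map, site, rng):
    """Draw one dialectal variant according to the map probabilities."""
    dist = lemma_map[site]
    forms = list(dist)
    weights = [dist[form] for form in forms]
    return rng.choices(forms, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(variants(IMMER_MAP, "site_A"))
    print(sample_variant(IMMER_MAP, "site_B", random.Random(0)))
```

Keeping the map as plain data and the rule as a lookup mirrors the observation later in the talk that the digitized maps are useful independently of the rule set.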

  10. Example: morphological inflection
     Rule: ADJA [Nom | Acc] Sg Gender Degree Weak → 0 | i
     Example: schwarz ADJA Nom Sg Fem Pos Weak → schwarz | schwarzi
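
As a rough illustration of the inflection rule, the sketch below matches a morphological tag set and returns the stem with each candidate ending (zero or `i`). The dictionary keys (`pos`, `case`, `number`, `declension`) and the function names are assumptions made for this example; the original rule is written in XFST notation.

```python
def rule_applies(tags):
    """True for weak attributive adjectives (ADJA) in Nom/Acc singular.

    Gender and degree are unconstrained, as in the rule on the slide.
    """
    return (
        tags.get("pos") == "ADJA"
        and tags.get("case") in {"Nom", "Acc"}
        and tags.get("number") == "Sg"
        and tags.get("declension") == "Weak"
    )


def inflect(stem, tags, endings=("", "i")):
    """Return all candidate dialect forms; the empty ending stands for 0."""
    if not rule_applies(tags):
        return [stem]
    return [stem + ending for ending in endings]


if __name__ == "__main__":
    tags = {"pos": "ADJA", "case": "Nom", "number": "Sg",
            "gender": "Fem", "degree": "Pos", "declension": "Weak"}
    print(inflect("schwarz", tags))
```

Which of the returned forms is appropriate at a given place would then be decided by the corresponding probability map.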

  11. Example: phonological adaptation
     Rule: n d → n d | n n | n g | n (in the context Vowel _ Vowel)
     Example: gestanden → gschtande | gschtanne | gschtange | gschtane
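
The intervocalic context of this rule can be approximated with a regular expression. This is a sketch under assumptions: the vowel inventory in `VOWELS` is a simplification chosen for the example, and the input is the intermediate form `gschtande`, i.e. after the other rules that turn `gestanden` into a dialect form have already applied.

```python
import re

# Simplified vowel inventory (an assumption for this example).
VOWELS = "aeiouäöü"

# "nd" only when it stands between two vowels.
ND_BETWEEN_VOWELS = re.compile(rf"(?<=[{VOWELS}])nd(?=[{VOWELS}])")


def nd_variants(word, outputs=("nd", "nn", "ng", "n")):
    """Return one candidate form per possible realization of intervocalic nd."""
    return [ND_BETWEEN_VOWELS.sub(output, word) for output in outputs]


if __name__ == "__main__":
    print(nd_variants("gschtande"))
```

Words without an intervocalic `nd` pass through unchanged, so the rule can safely be applied to every token.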

  12. Implementation
     Finite-state toolkits do not provide functionality for direct integration of probability maps. We simulate this ability with flag diacritics.
     Rule: ADJA [Nom | Acc] Sg Gender Degree Weak → 0 | i
     XFST definition:
     define adj-2-fl [ ADJA [Nom | Acc] Sg Gender Degree Weak -> [ 0 "@U.3-254.null@" | i "@U.3-254.i@" ]];
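
To make the flag-diacritic trick more concrete, here is a small Python sketch of how unification flags of the form `@U.feature.value@` behave on a single path, and how the collected settings could be scored against probability maps afterwards. It assumes that the feature name (here `3-254`) identifies a map and the value identifies a variant on that map; that reading, the scoring by multiplication, and the names `unify` and `path_probability` are assumptions for illustration, not a description of the original system.

```python
import re

# Unification flag: @U.<feature>.<value>@
FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def unify(path):
    """Apply unification-flag semantics to one path of symbols.

    Returns (surface_string, {feature: value}), or None when two flags
    assign different values to the same feature (the path is blocked).
    """
    settings = {}
    surface = []
    for symbol in path:
        match = FLAG.fullmatch(symbol)
        if match is None:
            if symbol != "0":  # "0" is the empty string in XFST notation
                surface.append(symbol)
            continue
        feature, value = match.groups()
        if settings.setdefault(feature, value) != value:
            return None
    return "".join(surface), settings


def path_probability(settings, maps, site):
    """Multiply the map probabilities of all variants chosen on a path."""
    probability = 1.0
    for feature, value in settings.items():
        probability *= maps[feature][site].get(value, 0.0)
    return probability


if __name__ == "__main__":
    # Invented probabilities for a hypothetical site.
    maps = {"3-254": {"site_A": {"i": 0.9, "null": 0.1}}}
    result = unify(["schwarz", "i", "@U.3-254.i@"])
    print(result, path_probability(result[1], maps, "site_A"))
```

Because a unification flag blocks any path that sets the same feature to two different values, every rule that refers to the same map is forced to pick the same variant within one word.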

  13. Conclusions
     • Difficult to achieve good coverage
       • Dialectologically interesting features vs. relevant features for practical usage
     • Difficult to evaluate on “real” data due to lack of unified writing conventions
     • Veith’s claim that the ordering of rules mirrors their order of historical appearance could not be verified in practice
     • The digitized maps turned out to be more useful than the rule set
       • Dialectometrical analyses
       • Online map viewer

  17. References (Rule-based machine translation: Standard German → Swiss German)
     • K. R. Beesley / L. Karttunen (2003): Finite State Morphology. CSLI Publications.
     • R. Hotzenköcherle / R. Schläpfer / R. Trüb / P. Zinsli (eds.) (1962–1997): Sprachatlas der deutschen Schweiz. 8 vols. Bern: Francke.
     • Y. Scherrer (2011): Morphology generation for Swiss German dialects. In: C. Mahlow / M. Piotrowski (eds.): Systems and Frameworks for Computational Morphology - Proceedings of the Second International Workshop (SFCM 2011). Berlin: Springer, 130–140.
     • Y. Scherrer (2014): Computerlinguistische Experimente für die schweizerdeutsche Dialektlandschaft – Maschinelle Übersetzung und Dialektometrie. In: D. Huck (ed.): Alemannische Dialektologie: Dialekte im Kontakt. (ZDL Beihefte 155). Stuttgart: Steiner, 261–278.
     • W. H. Veith (1970): -Explikative +Applikative +Komputative Dialektkartographie. (Germanistische Linguistik 4). Hildesheim: Olms.
     • W. H. Veith (1982): Theorieansätze einer generativen Dialektologie. In: W. Besch / U. Knoop / W. Putschke / H. E. Wiegand (eds.): Dialektologie – Ein Handbuch zur deutschen und allgemeinen Dialektforschung. Berlin, New York: De Gruyter, 277–295.
     • Digitized SDS maps: http://www.dialektkarten.ch

  18. Character-level statistical machine translation: Normalization

  21. The data: The ArchiMob corpus
     ArchiMob was an oral history project collecting testimonials of the Second World War period in Switzerland. 555 informants from all linguistic regions, of both genders and with different backgrounds, were interviewed (1999–2001). 43 Swiss German interviews were transcribed at the University of Zurich (2006–2018) for dialectological research.

  22. The task: Normalization
     There is a lot of variation in the transcriptions:
     • Transcription inconsistencies: different transcribers, transcription tools and changing guidelines
     • Dialectal variation: different origins of informants
     • Intra-speaker variation
     Goals:
     • Create an additional annotation layer to establish identities between forms that are felt to be “the same word”.
     • Enable dialect-independent corpus search
     • Facilitate further annotation (e.g. part-of-speech tagging)
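
The goal of dialect-independent corpus search can be pictured as an index from normalized forms to the transcribed variants they cover. The sketch below assumes the corpus is available as (transcribed form, normalized form) pairs; the pairs are illustrative, and `build_index` and `search` are hypothetical names, not part of any ArchiMob tooling.

```python
from collections import defaultdict


def build_index(token_pairs):
    """Map each normalized form to the set of transcribed variants."""
    index = defaultdict(set)
    for transcribed, normalized in token_pairs:
        index[normalized].add(transcribed)
    return index


def search(index, normalized_query):
    """Return all transcribed variants that share the queried normalization."""
    return sorted(index.get(normalized_query, set()))


if __name__ == "__main__":
    # Illustrative pairs; "ja" as a transcribed variant is hypothetical.
    pairs = [("jaa", "ja"), ("ja", "ja"), ("gluegt", "gelugt"), ("tänkt", "gedacht")]
    index = build_index(pairs)
    print(search(index, "ja"))
```

A query on the normalization layer thus retrieves every spelling of "the same word", regardless of transcriber or dialect.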

  24. The task: Normalization
     Normalization of historical texts (modernization) [French]:
     Ce ſeroit une marque de la force de voſtre merite pluſtoſt que de ma facilité.
     → Ce serait une marque de la force de votre mérite plutôt que de ma facilité.
     Normalization of user-generated content [Dutch]:
     schaaaat, je et em nii nodig wie jou laat gaan is gwn DOM :p Iloveyouuuu
     → schat, je hebt hem niet nodig wie jou laat gaan is gewoon dom :p I love you
     Normalization of dialectal texts [German]:
     jaa de het me no gluegt tänkt dasch ez de genneraal
     → ja dann hat man noch gelugt gedacht das ist jetzt der general
     • Our normalization language is similar but not identical to Standard German
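
Treating normalization as character-level translation usually means presenting each word to the translation system as a sequence of single characters. The sketch below shows one common way to prepare and undo that representation; the underscore as word-boundary marker and the function names are assumptions for this example, and the input text is assumed not to contain the marker itself.

```python
def to_char_sequence(utterance, boundary="_"):
    """Split an utterance into space-separated characters.

    Word boundaries are kept as an explicit marker so they can be restored.
    """
    words = utterance.split()
    return f" {boundary} ".join(" ".join(word) for word in words)


def from_char_sequence(sequence, boundary="_"):
    """Invert to_char_sequence: glue characters back into words."""
    chunks = sequence.split(boundary)
    return " ".join("".join(chunk.split()) for chunk in chunks)


if __name__ == "__main__":
    source = to_char_sequence("jaa de het")
    print(source)
    print(from_char_sequence(source))
```

With this representation, a standard phrase-based system learns correspondences between character sequences rather than between words, which suits pairs such as `gluegt` and `gelugt` that differ by only a few letters.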
