
Parallel corpora in translation and contrastive studies
Lucie Chlumská, Faculty of Arts, Charles University in Prague


  1. Parallel corpora in translation and contrastive studies. Lucie Chlumská, Faculty of Arts, Charles University in Prague

  2. OUTLINE
     1. corpus classification and terminology in TS/CS
     2. parallel corpora: objectives and issues
     3. InterCorp 9: corpus design
     4. languages in contrast based on the parallel corpus

  3. Corpora in TS/CS: terminology. See Granger, S., Lerot, J. & Petch-Tyson, S. (2003). Corpus-based Approaches to Contrastive Linguistics and Translation Studies. Amsterdam: Rodopi.

  4. PARALLEL CORPORA

  5. Objectives and issues
     • to include originals and their translations
     • segment/sentence alignment, word-to-word alignment?
     • to provide a basis for research in TS/CS
     • main resource of data for machine translation
     • representativeness – genres/text types matter
     • obvious issue in CL: what texts to include? > what translations to include...?
     • directionality – small languages vs. big languages
     • different amount of texts translated and available
     • highbrow literature and classics vs. virtually anything is available

  6. PCA: fiction vs. non-fiction

  7. Bidirectional parallel corpus
     • same size in both directions > "reciprocal" (Zanettin 2011)
     • both a parallel and comparable corpus (e.g. ENPC) > perfect for the analysis of translation universals (s-universals, t-universals)
     [Diagram labels: source language / target language; originals > translations in one direction, translations < originals in the other]

  8. Directionality matters
     • usually, there is no symmetry in translation equivalence
     • ALWAYS DEPENDS ON THE CONTEXT
     example:
     • EN shout > CS křičet > EN scream, shout, yell (EN scream > CS křičet, řvát, ječet)
     • EN come > CS jít > EN go, come
     • CS hned > DE gleich > CS stejný, hned, stejně
     [Diagram labels: SOURCE WORD A, B, C and TARGET WORD A, B, C]
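The asymmetry on this slide can be made concrete with a small round-trip check. The following is only a minimal sketch: the two dictionaries hold the equivalents quoted on the slide (not real corpus frequencies), and the names `EN_TO_CS`, `CS_TO_EN`, `round_trip` and `is_symmetric` are hypothetical, introduced here for illustration.

```python
# Toy equivalent lists copied from the slide examples; these are NOT corpus counts.
EN_TO_CS = {
    "shout": ["křičet"],
    "scream": ["křičet", "řvát", "ječet"],
    "come": ["jít"],
}
CS_TO_EN = {
    "křičet": ["scream", "shout", "yell"],
    "jít": ["go", "come"],
}


def round_trip(word, forward, backward):
    """Return the sorted set of words reached by translating `word`
    forward and then translating each result back."""
    reached = set()
    for target in forward.get(word, []):
        reached.update(backward.get(target, []))
    return sorted(reached)


def is_symmetric(word, forward, backward):
    """True only if the round trip leads back to exactly the starting word."""
    return round_trip(word, forward, backward) == [word]


print(round_trip("shout", EN_TO_CS, CS_TO_EN))
print(round_trip("come", EN_TO_CS, CS_TO_EN))
print(is_symmetric("shout", EN_TO_CS, CS_TO_EN))
```

In this toy data neither "shout" nor "come" returns only to itself, which is the slide's point. In a real study the equivalents would come from context-dependent corpus hits rather than fixed lists.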

  9. INTERCORP v.9

  10. Basic information
     • multilingual parallel corpus focused on Czech (pivot)
     • Czech as pivot, sentence/segment alignment
     • word-to-word alignment > used in Treq (treq.korpus.cz)
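To illustrate how word-to-word alignment can feed a list of translation equivalents, here is a rough sketch that turns aligned word pairs into relative frequencies per source word. It is not a description of how Treq is implemented: the function name `equivalent_shares` is hypothetical, and the aligned pairs are invented toy data that reuse words from the slides.

```python
from collections import Counter, defaultdict


def equivalent_shares(aligned_pairs):
    """Map each source word to {target word: relative frequency},
    computed from (source, target) word-alignment pairs."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1

    shares = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        shares[source] = {t: n / total for t, n in targets.items()}
    return shares


# Hypothetical EN > CS aligned pairs; the counts are invented for illustration.
pairs = [
    ("scream", "křičet"),
    ("scream", "křičet"),
    ("scream", "řvát"),
    ("scream", "ječet"),
]

shares = equivalent_shares(pairs)
print(shares["scream"])
```

Running the same function on pairs extracted in the opposite direction (CS > EN) would give a different table, which is why the direction of translation has to be recorded with the data.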

  11. InterCorp 9: design
     • currently 39 languages
     • in different proportions, not all are lemmatized and/or tagged
     • design: core and collections (incl. subtitles)
     • fiction, manual alignment
     • journalism:
       • Project Syndicate: http://www.project-syndicate.org/
       • PressEurop: http://www.presseurop.eu
     • legal texts in the EU languages:
       • Acquis Communautaire: http://langtech.jrc.ec.europa.eu/JRC-Acquis.html
     • EP (verbatim 2007-2011):
       • Europarl: http://www.statmt.org/europarl/
     • Open Subtitles
       • www.opensubtitles.org

  12. Core

  13. Collections

  14. Tags in different languages

  15. Where to find the tagset description?
     • in the Wiki: http://bit.ly/1bv3ll4
     • in the KonText interface:

  16. LANGUAGES IN CONTRAST

  17. Examples of use: word-formation
     1. EN: -ridden, -laden
        > translations? possible equivalents in analytical languages?
     2. EN: Hey, ain't you that demon-fighting-son-of-a-bitch? / stared up at it with a the-bigger-they-are-the-harder-they-fall expression
        > length? translations?
     3. CS: diminutives ending in -eček, -ička
        > meaning? combinations? text types? translations?

  18. Examples of use: grammar
     4. EN: present perfect and its counterparts in other languages
        he has never given me a present before vs. he's got(ta), I've been divorced...
        have/has/'s/'ve + any word (been) + past participle (been, got(ta))
        > tense? > aspect? > markers?
     5. EN: -ing clauses – clauses with participle constructions
        Having published a draft of this Regulation, ...
        > transgressives? finite clauses?
     6. EN: syntactical feature – disjunct
        Sadly, he came late. Honestly, I didn't do it.
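The present-perfect pattern in example 4 (have/has/'s/'ve, optionally one intervening word, then a past participle) can be sketched as a simple search over tagged tokens. This is an illustrative sketch only: it assumes Penn Treebank-style tags with `VBN` for past participles, which may not match the tagset used for English in InterCorp, and the names `AUXILIARIES` and `find_present_perfect` are hypothetical. The tagged sentences are the slide's own examples, tagged by hand.

```python
# Auxiliary forms listed on the slide.
AUXILIARIES = {"have", "has", "'s", "'ve"}


def find_present_perfect(tokens, participle_tag="VBN"):
    """Find auxiliary + (optional single word) + past participle.

    tokens: list of (word, tag) pairs.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    hits = []
    for i, (word, _tag) in enumerate(tokens):
        if word.lower() not in AUXILIARIES:
            continue
        # Accept a participle right after the auxiliary (gap 0)
        # or after exactly one intervening word (gap 1).
        for gap in (0, 1):
            j = i + 1 + gap
            if j < len(tokens) and tokens[j][1] == participle_tag:
                hits.append((i, j + 1))
                break
    return hits


# Slide examples, tagged by hand (the tags are an assumption).
sentence_a = [
    ("he", "PRP"), ("has", "VBZ"), ("never", "RB"), ("given", "VBN"),
    ("me", "PRP"), ("a", "DT"), ("present", "NN"), ("before", "RB"),
]
sentence_b = [
    ("I", "PRP"), ("'ve", "VBP"), ("been", "VBN"), ("divorced", "VBN"),
]

print(find_present_perfect(sentence_a))
print(find_present_perfect(sentence_b))
```

The pattern over-generates, as the slide's "vs." suggests: "'s" can also stand for "is" or a possessive, and "he's got(ta)" matches formally without being a present perfect in meaning, so hits still need manual checking before they are compared with their counterparts in other languages.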
