CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447

Lecture 22: Statistical Machine Translation
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Projects and Literature Reviews

First report due Nov 26 (PDF written in LaTeX; no length restrictions; submission through Compass).

Purpose of this first report: a check-in to make sure that you're on track (or, if not, that we can spot problems).

Rubrics for the final reports (due on Reading Day):
https://courses.engr.illinois.edu/CS447/LiteratureReviewRubric.pdf
https://courses.engr.illinois.edu/CS447/FinalProjectRubric.pdf
Projects and Literature Reviews

Guidelines for the first Project Report:
- What is your project about?
- What are the relevant papers you are building on?
- What data are you using?
- What evaluation metric will you be using?
- What models will you implement/evaluate?
- What is your to-do list?

Guidelines for the first Literature Review Report:
- What is your literature review about? (What task or what kind of models? Do you have any specific questions or focus?)
- What are the papers you will review? (If you already have a list, give a brief summary of each.)
- What is your to-do list?
Statistical Machine Translation
Statistical Machine Translation

We want the best (most likely) [English] translation for the [Chinese] input:

argmax_English P(English | Chinese)

We can either model this probability directly, or we can apply Bayes' rule. Using Bayes' rule leads to the "noisy channel" model. As with sequence labeling, Bayes' rule simplifies the modeling task, so this was the first approach used for statistical MT.
The noisy channel model

Translating from Chinese to English:

argmax_Eng P(Eng | Chin) = argmax_Eng P(Chin | Eng) × P(Eng)
                                       [Translation Model]   [Language Model]

[Figure: the noisy channel. An English input I passes through a noisy channel P(O | I) to produce the foreign output O; the decoder recovers a guess of the original English input as Î = argmax_I P(O | I) P(I).]
The noisy channel model

This is really just an application of Bayes' rule:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)
             [Translation Model]   [Language Model]

The translation model P(F | E) is intended to capture the faithfulness of the translation. It needs to be trained on a parallel corpus.

The language model P(E) is intended to capture the fluency of the translation. It can be trained on a (very large) monolingual corpus.
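To make the decomposition concrete, here is a minimal sketch (not from the slides) of noisy-channel scoring: given a handful of candidate English translations, pick the one that maximizes log P(F | E) + log P(E). The candidate strings and all log-probabilities below are toy values invented for illustration.

```python
# A minimal sketch of noisy-channel scoring (toy numbers, not a trained system):
# choose the candidate English translation e that maximizes
#     log P(f | e) + log P(e)
# log_tm and log_lm are stand-in dictionaries; a real system would use a
# trained translation model and a large n-gram language model.

def best_translation(candidates, log_tm, log_lm):
    """candidates: list of English strings; log_tm / log_lm map string -> log-prob."""
    return max(candidates, key=lambda e: log_tm[e] + log_lm[e])

candidates = ["Mary did not slap the green witch",
              "Mary not slapped the witch green"]
log_tm = {candidates[0]: -4.1, candidates[1]: -3.8}   # faithfulness (toy values)
log_lm = {candidates[0]: -9.5, candidates[1]: -14.2}  # fluency (toy values)

print(best_translation(candidates, log_tm, log_lm))
# -> the first candidate: its much better language-model (fluency) score
#    outweighs its slightly worse translation-model score.
```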
Statistical MT with the noisy channel model

[Figure: the components of a noisy-channel MT system. A parallel corpus (here, Hong Kong Hansard proceedings: "Good morning, Honourable Members. We will now start the meeting. …" paired with its Cantonese original) is used to train the translation model, e.g. P_tr(早晨 | morning). A large monolingual corpus is used to train the language model, e.g. P_lm(honorable | good morning). A decoding algorithm combines the two models to translate the input "主席:各位議員,早晨。" into "President: Good morning, Honourable Members."]
n-gram language models for MT

With training on data from the web and clever parallel processing (MapReduce/Bloom filters), n can be quite large:
- Google (2007) uses 5-grams to 7-grams.
- This results in huge models, but the effect on translation quality levels off quickly.

[Figure: plots of the size of the models and the effect on translation quality as n increases.]
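As a small illustration (not from the slides), here is a sketch of MLE trigram scoring in log space. The toy corpus, the crude probability floor for unseen trigrams, and the test sentences are assumptions for illustration; real MT systems use heavily smoothed 5- to 7-gram models trained on web-scale data.

```python
import math
from collections import defaultdict

def train_trigram_counts(sentences):
    """Collect trigram and bigram-history counts from whitespace-tokenized sentences."""
    tri, bi = defaultdict(int), defaultdict(int)
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi

def trigram_logprob(sentence, tri, bi, floor=1e-6):
    """Sum of log P(w_i | w_{i-2}, w_{i-1}); unseen events get a crude floor
    instead of proper smoothing."""
    toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    lp = 0.0
    for i in range(2, len(toks)):
        num = tri[tuple(toks[i - 2:i + 1])]
        den = bi[tuple(toks[i - 2:i])]
        lp += math.log(num / den) if num else math.log(floor)
    return lp

tri, bi = train_trigram_counts(["good morning honourable members",
                                "we will now start the meeting"])
print(trigram_logprob("good morning honourable members", tri, bi))  # seen: 0.0
print(trigram_logprob("members honourable morning good", tri, bi))  # unseen: very low
```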
Translation probability P(fp_i | ep_i)

Phrase translation probabilities can be obtained from a phrase table:

EP            FP            count
green witch   grüne Hexe    …
at home       zuhause       10534
at home       daheim        9890
is            ist           598012
this week     diese Woche   …

This requires phrase alignment on a parallel corpus.
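As an illustration of how such a table's counts turn into probabilities, here is a minimal relative-frequency sketch, P(fp | ep) = count(ep, fp) / Σ_fp' count(ep, fp'). Only the counted entries from the table above are used; everything else in a real phrase table is omitted.

```python
from collections import defaultdict

# A minimal sketch of relative-frequency phrase translation probabilities,
# assuming phrase-pair counts have already been extracted from a
# word-aligned parallel corpus (the counts are the toy values from the table).

phrase_counts = {
    ("at home", "zuhause"): 10534,
    ("at home", "daheim"):   9890,
    ("is", "ist"):         598012,
}

ep_totals = defaultdict(int)
for (ep, fp), c in phrase_counts.items():
    ep_totals[ep] += c

def p_fp_given_ep(fp, ep):
    # P(fp | ep) = count(ep, fp) / sum over fp' of count(ep, fp')
    return phrase_counts.get((ep, fp), 0) / ep_totals[ep]

print(p_fp_given_ep("zuhause", "at home"))  # ≈ 0.516
print(p_fp_given_ep("daheim",  "at home"))  # ≈ 0.484
```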
Getting translation probabilities

A parallel corpus consists of the same text in two (or more) languages. Examples: parliamentary debates (the Canadian Hansards, the Hong Kong Hansards, Europarl) and movie subtitles (OpenSubtitles).

In order to train translation models, we first need to align the sentences (Gale & Church '93). We can then learn word and phrase alignments from these aligned sentences.
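Sentence alignment is usually done with a length-based dynamic program. The following is a toy sketch in that spirit (only 1-1, 1-0 and 0-1 beads, and a crude character-length-difference cost standing in for Gale & Church's statistical length model); it is not the actual algorithm from the paper.

```python
def align_sentences(src, tgt, skip_cost=10.0):
    """Toy length-based sentence alignment: src and tgt are lists of sentences."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:   # 1-1 bead: align src[i] with tgt[j]
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j])) / 5.0
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "1-1")
            if i < n and cost[i][j] + skip_cost < cost[i + 1][j]:   # 1-0 bead
                cost[i + 1][j], back[i + 1][j] = cost[i][j] + skip_cost, (i, j, "1-0")
            if j < m and cost[i][j] + skip_cost < cost[i][j + 1]:   # 0-1 bead
                cost[i][j + 1], back[i][j + 1] = cost[i][j] + skip_cost, (i, j, "0-1")
    beads, i, j = [], n, m
    while (i, j) != (0, 0):                      # trace back the best path
        pi, pj, bead = back[i][j]
        beads.append((bead, src[pi] if bead != "0-1" else None,
                            tgt[pj] if bead != "1-0" else None))
        i, j = pi, pj
    return list(reversed(beads))

print(align_sentences(["Guten Morgen .", "Wir beginnen die Sitzung ."],
                      ["Good morning .", "We will now start the meeting ."]))
```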
IBM models

The first statistical MT models, based on the noisy channel: translate from source f to target e via a translation model P(f | e) and a language model P(e).

The translation model goes from target e to source f via word alignments a:

P(f | e) = ∑_a P(f, a | e)

Original purpose: word-based translation models.
Today: used to obtain word alignments, which are then used to obtain phrase alignments for phrase-based translation models.

The IBM models are a sequence of 5 translation models. Model 1 is too simple to be used by itself, but can be trained very easily on parallel data.
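One reason Model 1 is so easy to train (not stated on the slide, but standard for Model 1) is that its independence assumptions let the sum over all alignments be pushed inside the product over source positions:

```latex
% Under Model 1 (uniform alignment probability 1/(l+1), where l = |e| and
% m = |f|, and a constant length probability \epsilon), the sum over all
% (l+1)^m alignments collapses into a product of word-level sums:
\begin{align*}
P(f \mid e)
  &= \sum_{a} P(f, a \mid e)
   = \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l}
     \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \\
  &= \frac{\epsilon}{(l+1)^m}
     \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
\end{align*}
```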
IBM translation models: assumptions

The model "generates" the 'foreign' source sentence f conditioned on the 'English' target sentence e by the following stochastic process:
1. Generate the length of the source f with probability p = ...
2. Generate the alignment of the source f to the target e with probability p = ...
3. Generate the words of the source f with probability p = ...
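The slide leaves the three distributions unspecified (p = ...). As a purely illustrative sketch, here is what the generative story looks like if we plug in Model 1-style choices: a simple length distribution, uniform alignments including the NULL word, and a toy translation table. All of these values are assumptions, not actual model parameters.

```python
import random

# A minimal sketch of the IBM-style generative story under Model 1-like
# assumptions:
#   1. pick the source length m            (here: uniform over a small range)
#   2. pick an alignment a_1..a_m          (uniform over 0..l, where 0 = NULL)
#   3. pick each source word f_j ~ t(f | e_{a_j})   (toy translation table)

t_table = {  # t(f | e): toy values, not trained
    "Mary": {"Marie": 0.9, "Maria": 0.1},
    "swam": {"nage": 0.6, "traversé": 0.4},
    "lake": {"lac": 1.0},
    "NULL": {"à": 0.5, "la": 0.5},
}

def generate_source(e_words):
    e = ["NULL"] + e_words                                 # position 0 is the NULL word
    m = random.randint(len(e_words), len(e_words) + 2)     # 1. length
    a = [random.randrange(len(e)) for _ in range(m)]       # 2. alignment
    f = []
    for j in range(m):                                     # 3. words
        options = t_table.get(e[a[j]], {"<unk>": 1.0})
        words, probs = zip(*options.items())
        f.append(random.choices(words, weights=probs)[0])
    return f, a

print(generate_source(["Mary", "swam", "lake"]))
```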
Word alignments in the IBM models
Word alignment

[Figure: two word-alignment examples. "John loves Mary." aligns word by word to French "Jean aime Marie."; "… that John loves Mary." aligns to German "… dass John Maria liebt.", where the words appear in a different order.]
Word alignment

[Figure: word alignment between "Maria no dió una bofetada a la bruja verde" and "Mary did not slap the green witch".]
Word alignment

[Figure: word alignment between "Marie a traversé le lac à la nage" and "Mary swam across the lake".]
Word alignment

[Figure: alignment between the source "Marie a traversé le lac à la nage" and the target "Mary swam across the lake".]

One target word can be aligned to many source words.
Word alignment

[Figure: alignment between the source "Marie a traversé le lac à la nage" and the target "Mary swam across the lake".]

One target word can be aligned to many source words. But each source word can only be aligned to one target word. This allows us to model P(source | target).
Word alignment

[Figure: alignment between the source "Marie a traversé le lac à la nage" and the target "Mary swam across the lake".]

Some source words may not align to any target words.
Word alignment

[Figure: alignment between the source "Marie a traversé le lac à la nage" and the target "NULL Mary swam across the lake".]

Some source words may not align to any target words. To handle this we assume a NULL word in the target sentence.
Representing word alignments

Position    1      2   3          4   5    6   7    8
Foreign     Marie  a   traversé   le  lac  à   la   nage
Alignment   1      3   3          4   5    0   0    2

(Target words: 0 NULL, 1 Mary, 2 swam, 3 across, 4 the, 5 lake.)

Every source word f[i] is aligned to one target word e[j] (incl. NULL). We represent alignments as a vector a (of the same length as the source) with a[i] = j.
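As an illustration (with toy, untrained translation probabilities), here is how that alignment vector can be used to score P(f, a | e) under a Model 1-style factorization, where the alignment probability is a uniform 1/(l+1) and the length term is treated as a constant and dropped:

```python
import math

# A minimal sketch: score a fixed alignment with a Model 1-style factorization
#   P(f, a | e) ∝ prod_j  1/(l+1) * t(f_j | e_{a_j})
# The t-table values are toy numbers; a real table comes from EM training.

e = ["NULL", "Mary", "swam", "across", "the", "lake"]   # e[0] = NULL
f = ["Marie", "a", "traversé", "le", "lac", "à", "la", "nage"]
a = [1, 3, 3, 4, 5, 0, 0, 2]                            # a[j] indexes into e

t = {("Marie", "Mary"): 0.9, ("a", "across"): 0.1, ("traversé", "across"): 0.4,
     ("le", "the"): 0.5, ("lac", "lake"): 0.8, ("à", "NULL"): 0.2,
     ("la", "NULL"): 0.2, ("nage", "swam"): 0.3}

l, m = len(e) - 1, len(f)
log_p = 0.0                      # ignoring P(m | e), assumed constant here
for j in range(m):
    log_p += math.log(1.0 / (l + 1)) + math.log(t[(f[j], e[a[j]])])
print(log_p)
```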
The IBM alignment models
The IBM models

Use the noisy channel (Bayes' rule) to get the best (most likely) target translation e for source sentence f:

argmax_e P(e | f) = argmax_e P(f | e) P(e)

The translation model P(f | e) requires alignments a. Marginalize (= sum) over all alignments a:

P(f | e) = ∑_{a ∈ A(e,f)} P(f, a | e)

Generate f and the alignment a with P(f, a | e):

P(f, a | e) = P(m | e) ∏_{j=1..m} P(a_j | a_{1..j-1}, f_{1..j-1}, m, e) × P(f_j | a_{1..j}, f_{1..j-1}, e, m)

where P(m | e) is the length probability (m = #words in f), P(a_j | …) is the alignment probability of a_j, and P(f_j | …) is the translation probability of word f_j.
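The slide gives the general factorization; Model 1 fixes the alignment term to a uniform distribution, so its translation probabilities t(f | e) can be trained with EM. Below is a minimal sketch of that EM procedure (standard Model 1 without the NULL word or smoothing, on a tiny made-up corpus); it is not the full IBM training pipeline.

```python
from collections import defaultdict

# A minimal sketch of EM training for IBM Model 1 translation probabilities.

def train_model1(corpus, iterations=10):
    """corpus: list of (f_sentence, e_sentence) pairs, each a list of words."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))    # uniform initialization of t(f | e)

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for fs, es in corpus:
            for f in fs:
                # E-step: posterior over which target word e generated this f
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: re-estimate t(f | e) from the expected counts
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

corpus = [("das Haus".split(), "the house".split()),
          ("das Buch".split(), "the book".split()),
          ("ein Buch".split(), "a book".split())]
t = train_model1(corpus)
print(round(t[("Haus", "house")], 3), round(t[("das", "the")], 3))
```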