CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447

Lecture 14: Statistical Machine Translation

Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 14: Machine Translation II
Part 1: Word alignment and the IBM models
Statistical Machine Translation

Given a Chinese input sentence (the source)...
主席:各位議員,早晨。
...find the best English translation (the target):
President: Good morning, Honourable Members.

We can formalize this as T* = argmax_T P(T | S).

Using Bayes' rule simplifies the modeling task, so this was the first approach for statistical MT (the so-called "noisy-channel model"):
T* = argmax_T P(T | S) = argmax_T P(S | T) P(T)
where P(S | T) is the translation model and P(T) is the language model.
The noisy channel model

This is really just an application of Bayes' rule:
T* = argmax_T P(T | S) = argmax_T P(S | T) P(T)
with P(S | T) the translation model and P(T) the language model.

The translation model P(S | T) is intended to capture the faithfulness of the translation. [This is the noisy channel.]
Since we only need P(S | T) to score S, and don't need it to generate a grammatical S, it can be a relatively simple model.
P(S | T) needs to be trained on a parallel corpus.

The language model P(T) is intended to capture the fluency of the translation.
P(T) can be trained on a (very large) monolingual corpus.
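To make the decomposition concrete, here is a minimal Python sketch of noisy-channel reranking: given a set of candidate translations, score each with log P(S | T) + log P(T) and keep the best. The tm_logprob and lm_logprob functions are hypothetical stand-ins for a trained translation model and language model, not part of the lecture.

def noisy_channel_best(source, candidates, tm_logprob, lm_logprob):
    """Return the candidate T maximizing log P(S|T) + log P(T).

    tm_logprob(source, t) and lm_logprob(t) are hypothetical
    placeholders for a trained translation model and language model.
    """
    return max(candidates,
               key=lambda t: tm_logprob(source, t) + lm_logprob(t))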
IBM models

First statistical MT models, based on the noisy channel:
Translate from (French/foreign) source f to (English) target e via a translation model P(f | e) and a language model P(e).
The translation model goes from target e to source f via word alignments a (see the brute-force sketch below):
P(f | e) = Σ_a P(f, a | e)

Original purpose: word-based translation models.
Later: were used to obtain word alignments, which are then used to obtain phrase alignments for phrase-based translation models.

A sequence of 5 translation models:
Model 1 is too simple to be used by itself, but can be trained very easily on parallel data.
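As a sketch of what this marginalization means, one can enumerate every alignment by brute force (exponential, so only viable for tiny sentences; p_joint is a placeholder for any joint model P(f, a | e), e.g. the Model 1 formula derived later in this lecture):

from itertools import product

def marginal_prob(source, target, p_joint):
    """P(f | e) = sum over all alignments a of P(f, a | e).

    Each of the m source positions can align to any of the n+1 target
    positions (incl. NULL), so there are (n+1)^m alignments in total.
    """
    m, n = len(source), len(target)
    return sum(p_joint(source, list(a), target)
               for a in product(range(n + 1), repeat=m))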
IBM translation models: assumptions

The model "generates" the 'foreign' source sentence f conditioned on the 'English' target sentence e by the following stochastic process:
1. Generate the length of the source f with probability p = ...
2. Generate the alignment of the source f to the target e with probability p = ...
3. Generate the words of the source f with probability p = ...
Word alignment

John loves Mary. ↔ Jean aime Marie.
... that John loves Mary. ↔ ... dass John Maria liebt.

[Figure: word-aligned sentence pairs. The English-French alignment is monotone (John-Jean, loves-aime, Mary-Marie); the English-German alignment crosses (that-dass, John-John, Mary-Maria, loves-liebt).]
Word alignment

Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
Word alignment

Marie a traversé le lac à la nage
Mary swam across the lake
Word alignment

Source: Marie a traversé le lac à la nage
Target: Mary swam across the lake

One target word can be aligned to many source words.
Word alignment

Source: Marie a traversé le lac à la nage
Target: Mary swam across the lake

One target word can be aligned to many source words.
But each source word can only be aligned to one target word.
This allows us to model P(source | target).
Word alignment

Source: Marie a traversé le lac à la nage
Target: Mary swam across the lake

Some source words may not align to any target words.
Word alignment

Source: Marie a traversé le lac à la nage
Target: NULL Mary swam across the lake

Some source words may not align to any target words.
To handle this, we assume a NULL word in the target sentence.
Representing word alignments

Target: 0=NULL  1=Mary  2=swam  3=across  4=the  5=lake

Position   1      2   3          4    5    6   7    8
Source     Marie  a   traversé   le   lac  à   la   nage
Alignment  1      3   3          4    5    0   0    2

Every source word f[i] is aligned to one target word e[j] (incl. NULL).
We represent alignments as a vector a (of the same length as the source) with a[i] = j.
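A minimal Python sketch of this representation, using the example sentence pair and alignment vector from the table above (all data comes from this slide):

# a[i] = j means source word i is aligned to target word j
# (position 0 of the target is the NULL word).
target = ["NULL", "Mary", "swam", "across", "the", "lake"]
source = ["Marie", "a", "traversé", "le", "lac", "à", "la", "nage"]
a = [1, 3, 3, 4, 5, 0, 0, 2]

for f_word, j in zip(source, a):
    print(f"{f_word} -> {target[j]}")   # e.g. "à -> NULL", "nage -> swam"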
Lecture 14: Machine Translation II
Part 2: The IBM alignment models
The IBM models

Use the noisy channel (Bayes' rule) to get the best (most likely) target translation e for source sentence f:
argmax_e P(e | f) = argmax_e P(f | e) P(e)

The translation model P(f | e) requires alignments a; marginalize (= sum) over all alignments a:
P(f | e) = Σ_{a ∈ A(e,f)} P(f, a | e)

Generate f and the alignment a with P(f, a | e):
P(f, a | e) = P(m | e) · Π_{j=1}^{m} P(a_j | a_1..j-1, f_1..j-1, m, e) · P(f_j | a_1..j, f_1..j-1, e, m)
where P(m | e) is the length probability (|f| = m, the number of words in f), P(a_j | ...) is the word alignment probability of a_j, and P(f_j | ...) is the translation probability of word f_j.
IBM model 1: Generative process

For each target sentence e = e_1..e_n of length n
(e.g. 0=NULL 1=Mary 2=swam 3=across 4=the 5=lake):

1. Choose a length m for the source sentence (e.g. m = 8).
2. Choose an alignment a = a_1...a_m for the source sentence.
   Each a_j corresponds to a word e_i in e: 0 ≤ a_j ≤ n
   (e.g. a = 1 3 3 4 5 0 0 2 for positions 1-8).
3. Translate each target word e_{a_j} into the source language
   (e.g. Marie a traversé le lac à la nage).
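A minimal Python sketch of this generative story, assuming a made-up translation table t_table where t_table[e_word] maps each source word f to its probability P(f | e_word); the table and max_len cap are illustrative assumptions, not part of the model definition:

import random

def generate_source(target, t_table, max_len=10):
    e = ["NULL"] + target                          # NULL word at position 0
    n = len(target)
    m = random.randint(1, max_len)                 # 1. choose source length m
    a = [random.randint(0, n) for _ in range(m)]   # 2. uniform alignment
    source = []
    for a_j in a:                                  # 3. translate each aligned target word
        dist = t_table[e[a_j]]
        f_j = random.choices(list(dist.keys()), weights=dist.values())[0]
        source.append(f_j)
    return source, a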
Model parameters

Length probability P(m | n): what's the probability of generating a source sentence of length m given a target sentence of length n?
Count in the training data, or use a constant.

Alignment probability P(a | m, n): Model 1 assumes all alignments have the same probability:
for each position a_1...a_m, pick one of the n+1 target positions uniformly at random.

Translation probability, e.g. P(f_j = lac | a_j = i, e_i = lake):
in Model 1, these are the only parameters we have to learn.
IBM model 1: details

The length probability is constant: P(m | e) = ε
The alignment probability is uniform (n = length of the target string): P(a_i | e) = 1/(n+1)
The translation probability depends only on e_{a_i} (the corresponding target word): P(f_i | e_{a_i})

P(f, a | e) = P(m | e) · Π_{j=1}^{m} P(a_j | a_1..j-1, f_1..j-1, m, e) · P(f_j | a_1..j, f_1..j-1, e, m)
            = ε · Π_{j=1}^{m} [1/(n+1)] · P(f_j | e_{a_j})
              (all alignments have the same probability; translation depends only on the aligned English word)
            = ε/(n+1)^m · Π_{j=1}^{m} P(f_j | e_{a_j})
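This closed form is easy to compute directly. A sketch in log space to avoid underflow, where t_table is again a hypothetical dictionary of learned translation probabilities t(f | e):

import math

def model1_joint_logprob(source, alignment, target, t_table, eps=1.0):
    """log P(f, a | e) = log( eps / (n+1)^m * prod_j t(f_j | e_{a_j}) )."""
    e = ["NULL"] + target
    n, m = len(target), len(source)
    logp = math.log(eps) - m * math.log(n + 1)
    for f_j, a_j in zip(source, alignment):
        logp += math.log(t_table[e[a_j]][f_j])
    return logp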
Finding the best alignment

How do we find the best alignment between e and f?

â = argmax_a P(f, a | e)
  = argmax_a ε/(n+1)^m · Π_{j=1}^{m} P(f_j | e_{a_j})
  = argmax_a Π_{j=1}^{m} P(f_j | e_{a_j})
⟹ â_j = argmax_{a_j} P(f_j | e_{a_j})
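Because the product factorizes over source positions, each source word's best target position can be chosen independently — a short sketch using the same hypothetical t_table as before:

def best_alignment(source, target, t_table):
    e = ["NULL"] + target
    # For each source word f_j, pick the target position (incl. NULL at 0)
    # with the highest translation probability t(f_j | e_i).
    return [max(range(len(e)), key=lambda i: t_table[e[i]].get(f_j, 0.0))
            for f_j in source]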