

  1. SFU NatLangLab CMPT 825: Natural Language Processing Machine Translation Spring 2020 Adapted from slides from Chris Manning, Abigail See, Matthew Lamm, Danqi Chen and Karthik Narasimhan

  2. Translation • One of the “holy grail” problems in artificial intelligence • Practical use case: facilitating communication between people around the world • Extremely challenging (especially for low-resource languages)

  3. Easy and not so easy translations • Easy: • I like apples ↔ ich mag Äpfel (German) • Not so easy: • I like apples ↔ J'aime les pommes (French) • I like red apples ↔ J'aime les pommes rouges (French) • les ↔ the, but les pommes ↔ apples

  4. MT basics • Goal: translate a sentence w(s) in a source language (input) to a sentence w(t) in the target language (output) • Can be formulated as an optimization problem: ŵ(t) = argmax_{w(t)} ψ(w(s), w(t)) • where ψ is a scoring function over source and target sentences • Requires two components: • Learning algorithm to compute parameters of ψ • Decoding algorithm for computing the best translation ŵ(t)

  5. Why is MT challenging? • Single words may be replaced with multi-word phrases • I like apples ↔ J'aime les pommes • Reordering of phrases • I like red apples ↔ J'aime les pommes rouges • Contextual dependence • les ↔ the, but les pommes ↔ apples • Extremely large output space ⟹ decoding is NP-hard

  6. Vauquois Pyramid • Hierarchy of concepts and distances between them in different languages • Lowest level: individual words/characters • Higher levels: syntax, semantics • Interlingua: generic language-agnostic representation of meaning

  7. Evaluating translation quality • Two main criteria: • Adequacy: translation w(t) should adequately reflect the linguistic content of w(s) • Fluency: translation w(t) should be fluent text in the target language • Example: different translations of “A Vinay le gusta Python” (Spanish for “Vinay likes Python”)

  8. Evaluation metrics • Manual evaluation is most accurate, but expensive • Automated evaluation metrics: • Compare system hypothesis with reference translations • BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002): • Modified n-gram precision

  9. BLEU • BLEU = exp( (1/N) Σ_{n=1}^{N} log p_n ), where p_n is the modified n-gram precision • Two modifications: • To avoid log 0, all precisions are smoothed • Each n-gram in the reference can be used at most once • Ex.: hypothesis “to to to to to” vs. reference “to be or not to be” should not get a unigram precision of 1 • Precision-based metrics favor short translations • Solution: multiply the score by a brevity penalty e^{1 − r/h} for translations shorter than the reference (r = reference length, h = hypothesis length), as in the sketch below
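A minimal Python sketch of the computation above (illustrative only, not the reference BLEU implementation; smoothing here is a simple probability floor to avoid log 0):

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(hypothesis, reference, N=4, eps=1e-9):
        log_precisions = []
        for n in range(1, N + 1):
            hyp_counts = Counter(ngrams(hypothesis, n))
            ref_counts = Counter(ngrams(reference, n))
            # Modified precision: each reference n-gram is matched at most once
            clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total = max(sum(hyp_counts.values()), 1)
            log_precisions.append(math.log(max(clipped / total, eps)))
        score = math.exp(sum(log_precisions) / N)
        # Brevity penalty for hypotheses shorter than the reference
        r, h = len(reference), len(hypothesis)
        bp = math.exp(1 - r / h) if h < r else 1.0
        return bp * score

    # The degenerate hypothesis no longer gets unigram precision 1:
    print(bleu("to to to to to".split(), "to be or not to be".split(), N=1))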

  10. BLEU • Correlates somewhat well with human judgements (G. Doddington, NIST)

  11. BLEU scores • Sample BLEU scores for various system outputs • Issues? • Alternatives have been proposed: • METEOR: weighted F-measure • Translation Error Rate (TER): edit distance between hypothesis and reference (see the edit-distance sketch below)
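TER's core quantity is token-level edit distance; a minimal sketch (omitting TER's phrase-shift operation) could look like this:

    # Classic dynamic program over substitutions, insertions, and deletions
    def edit_distance(hyp, ref):
        d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(ref) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(hyp)][len(ref)]

    hyp = "the cat sat on mat".split()
    ref = "the cat sat on the mat".split()
    print(edit_distance(hyp, ref) / len(ref))  # TER-style: edits / reference length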

  12. Machine Translation (MT): the task of translating a sentence from one language (the source language) to a sentence in another language (the target language)

  13. History • Started in the 1950s: rule-based, tightly linked to formal linguistics theories • Russian → English (motivated by the Cold War!) • Systems were mostly rule-based, using a bilingual dictionary to map Russian words to their English counterparts • 1980s: Statistical MT • 2000s-2015: Statistical Phrase-Based MT • 2015-Present: Neural Machine Translation • 1-minute video showing 1954 MT: https://youtu.be/K-HfpsHPmvw

  14. History • Started in the 1950s: rule-based, tightly linked to formal linguistics theories • (late) 1980s to 2000s: Statistical MT • 2000s-2014: Statistical Phrase-Based MT • 2014-Present: Neural Machine Translation

  15. History • Started in the 1950s: rule-based, tightly linked to formal linguistics theories • 1980s: Statistical MT • 2000s-2015: Statistical Phrase-Based MT • 2015-Present: Neural Machine Translation

  16. Statistical MT • Key idea: learn a probabilistic model from data • To find the best English sentence y, given French sentence x: ŷ = argmax_y P(y | x) • Decompose using Bayes' rule: ŷ = argmax_y P(x | y) P(y) • P(x | y): translation/alignment model, which models how words and phrases should be translated (adequacy/fidelity); learned from parallel data • P(y): language model, which models how to write good English (fluency); learned from monolingual data

  17. Noisy channel model • ŵ(t) = argmax_{w(t)} ψ(w(s), w(t)) • (Diagram: an input w(t), generated under p_T, passes through a noisy channel p_{S|T} to produce w(s); a decoder recovers ŵ(t)) • ψ(w(s), w(t)) = ψ_A(w(s), w(t)) + ψ_F(w(t)) • log P_{S,T}(w(s), w(t)) = log P_{S|T}(w(s) | w(t)) + log P_T(w(t)) • Generative process for the source sentence • Use Bayes' rule to recover the w(t) that is maximally likely under the conditional distribution p_{T|S} (which is what we want)
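A toy sketch of noisy-channel scoring in Python; only the decomposition itself comes from the slide, while the probability tables and candidate translations below are invented for illustration:

    import math

    # Channel model log P(S|T) and language model log P(T), as toy tables
    log_p_s_given_t = {("J'aime les pommes", "I like apples"): math.log(0.7),
                       ("J'aime les pommes", "I love the apples"): math.log(0.8)}
    log_p_t = {"I like apples": math.log(0.010),
               "I love the apples": math.log(0.002)}

    def score(source, target):
        # psi(w(s), w(t)) = log P(S|T)(w(s) | w(t)) + log P(T)(w(t))
        return log_p_s_given_t[(source, target)] + log_p_t[target]

    source = "J'aime les pommes"
    candidates = ["I like apples", "I love the apples"]
    print(max(candidates, key=lambda t: score(source, t)))
    # The language model term favors the fluent "I like apples"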

  18. Data • Statistical MT requires lots of parallel corpora (e.g. Europarl; Koehn, 2005) • Not available for many low-resource languages in the world

  19. How to define the translation model? • Introduce a latent variable A modeling the alignment (word-level correspondence) between the source sentence x and the target sentence y: P(x | y) = Σ_A P(x, A | y)

  20. What is alignment?

  21. Alignment is complex • Alignment can be many-to-one

  22. Alignment is complex • Alignment can be one-to-many

  23. Alignment is complex • Some words are very fertile!

  24. Alignment is complex • Alignment can be many-to-many (phrase-level)

  25. How to define the translation model? • Given the alignment, how do we incorporate it into our model? P(x | y) = Σ_A P(x, A | y)

  26. Incorporating alignments • The joint probability of alignment and translation can be defined as: p(w(s), A | w(t)) • M(s), M(t) are the number of words in the source and target sentences • a_m is the alignment of the m-th word in the source sentence, i.e. it specifies that the m-th source word is aligned to the a_m-th word in the target • Is this sufficient?

  27. Incorporating alignments • (Figure: a source sentence aligned to a target sentence, with a_1 = 2, a_2 = 3, a_3 = 4, ...) • Multiple source words may align to the same target word!

  28. Reordering and word insertion • Assume an extra NULL token (Slide credit: Brendan O'Connor)

  29. Independence assumptions • Two independence assumptions: • Alignment probability factors across tokens: p(A | M(s), M(t)) = Π_{m=1}^{M(s)} p(a_m | m, M(s), M(t)) • Translation probability factors across tokens: p(w(s) | A, w(t)) = Π_{m=1}^{M(s)} p(w_m(s) | w_{a_m}(t)) • (see the sketch below)
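A minimal sketch of these two factorizations; the helper names and arguments are mine, with p_align and p_trans standing in for the per-token distributions:

    import math

    def log_alignment_prob(alignment, M_s, M_t, p_align):
        # p(A | M(s), M(t)) = prod_m p(a_m | m, M(s), M(t))
        return sum(math.log(p_align(a_m, m, M_s, M_t))
                   for m, a_m in enumerate(alignment))

    def log_translation_prob(src, tgt, alignment, p_trans):
        # p(w(s) | A, w(t)) = prod_m p(w_m(s) | w_{a_m}(t))
        return sum(math.log(p_trans(src[m], tgt[a_m]))
                   for m, a_m in enumerate(alignment))

    # For IBM Model 1 (next slides), p_align is simply
    # lambda a, m, M_s, M_t: 1.0 / M_t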

  30. How do we translate? • We want: ŵ(t) = argmax_{w(t)} p(w(t) | w(s)) = argmax_{w(t)} p(w(s), w(t)) / p(w(s)) = argmax_{w(t)} p(w(s), w(t)) • Sum over all possible alignments: p(w(s), w(t)) = Σ_A p(w(s), A, w(t)) • Alternatively, take the max over alignments

  31. Alignments • Key question: how should we align words in the source to words in the target? • (Figure: examples of a good and a bad alignment)

  32. IBM Model 1 • Assume p(a_m | m, M(s), M(t)) = 1 / M(t) • Is this a good assumption? Every alignment is equally likely!

  33. IBM Model 1 • Each source word is aligned to at most one target word • Further, assume p(a_m | m, M(s), M(t)) = 1 / M(t) • We then have: p(w(s), w(t)) = p(w(t)) (1/M(t))^{M(s)} Σ_A Π_{m=1}^{M(s)} p(w_m(s) | w_{a_m}(t)) • How do we estimate p(w(s) = v | w(t) = u)? • (see the sketch below)
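The sum over alignments looks exponential, but under Model 1's assumptions it factorizes into per-word sums, so no enumeration of alignments is needed. A sketch, assuming a translation table t[(v, u)] = p(source word v | target word u):

    import math

    def model1_log_likelihood(src, tgt, t):
        # sum_A prod_m t(w_m(s) | w_{a_m}(t)) = prod_m sum_a t(w_m(s) | w_a(t))
        M_s, M_t = len(src), len(tgt)
        logp = M_s * math.log(1.0 / M_t)  # the (1/M(t))^{M(s)} alignment term
        for v in src:
            logp += math.log(sum(t.get((v, u), 1e-12) for u in tgt))
        return logp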

  34. IBM Model 1 • If we had word-to-word alignments, we could compute the probabilities using the MLE: p(v | u) = count(u, v) / count(u) • where count(u, v) = # of instances where word u was aligned to word v in the training set (see the sketch below) • However, word-to-word alignments are often hard to come by. What can we do?
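A small sketch of this MLE, assuming gold alignments are given as (target word u, source word v) links; the names are illustrative:

    from collections import Counter

    def mle_translation_table(aligned_pairs):
        pair_counts = Counter(aligned_pairs)             # count(u, v)
        u_counts = Counter(u for u, _ in aligned_pairs)  # count(u)
        return {(u, v): pair_counts[(u, v)] / u_counts[u]
                for (u, v) in pair_counts}

    table = mle_translation_table([("apples", "pommes"), ("apples", "pommes"),
                                   ("apples", "pomme")])
    print(table[("apples", "pommes")])  # p(pommes | apples) = 2/3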

  35. EM for Model 1 • (E-step) If we had an accurate translation model, we could estimate the likelihood of each alignment as: q(a_m = n | w(s), w(t)) ∝ p(w_m(s) | w_n(t)) • (M-step) Use expected counts to re-estimate the translation parameters: p(v | u) = E_q[count(u, v)] / count(u) • (sketched below)
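A compact sketch of the full EM loop for Model 1 (uniform initialization, no NULL token, toy corpus; all names are mine):

    from collections import defaultdict

    def train_model1(corpus, iterations=10):
        src_vocab = {v for s, _ in corpus for v in s}
        # Uniform initialization over co-occurring word pairs
        t = {(v, u): 1.0 / len(src_vocab)
             for s, tgt in corpus for v in s for u in tgt}
        for _ in range(iterations):
            counts = defaultdict(float)   # E_q[count(u, v)]
            totals = defaultdict(float)   # E_q[count(u)]
            for src, tgt in corpus:
                for v in src:
                    # E-step: posterior over which target word generated v
                    norm = sum(t[(v, u)] for u in tgt)
                    for u in tgt:
                        q = t[(v, u)] / norm
                        counts[(u, v)] += q
                        totals[u] += q
            # M-step: re-estimate p(v | u) from expected counts
            t = {(v, u): counts[(u, v)] / totals[u] for (u, v) in counts}
        return t

    corpus = [("das Haus".split(), "the house".split()),
              ("das Buch".split(), "the book".split()),
              ("ein Buch".split(), "a book".split())]
    t = train_model1(corpus)
    print(sorted(t.items(), key=lambda kv: -kv[1])[:3])  # strongest pairs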

  36. Independence assumptions allow for Viterbi decoding • In general, use greedy or beam decoding

  37. Model 1: Decoding • Pick target sentence length M(t) • Decode: ŵ(t) = argmax_{w(t)} p(w(t) | w(s)) = argmax_{w(t)} p(w(s), w(t)) • p(w(s), w(t)) = p(w(t)) (1/M(t))^{M(s)} Σ_A Π_{m=1}^{M(s)} p(w_m(s) | w_{a_m}(t))

  38. Model 1: Decoding • (Figure: target and source sentences) • At every step m, pick the target word w_m(t) to maximize the product of: 1. Language model: p_LM(w_m(t) | w_{<m}(t)) 2. Translation model: p(w_{b_m}(s) | w_m(t)) • where b_m is the inverse alignment from target to source, as in the sketch below
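A toy sketch of this greedy step, assuming a bigram language model p_lm(word, prev), a translation table p_tm(source word, target word), and a given inverse alignment b (all names and arguments are placeholders):

    def greedy_decode(src, tgt_vocab, p_lm, p_tm, b):
        tgt = []
        for m in range(len(b)):
            prev = tgt[-1] if tgt else "<s>"
            # Pick the target word maximizing LM prob x translation prob
            best = max(tgt_vocab,
                       key=lambda w: p_lm(w, prev) * p_tm(src[b[m]], w))
            tgt.append(best)
        return tgt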

  39. IBM Model 2 • Slightly relaxed assumption: • p(a_m | m, M(s), M(t)) is also estimated, not set to a constant • Original independence assumptions still required: • Alignment probability factors across tokens • Translation probability factors across tokens

  40. Other IBM models • Models 3-6 make successively weaker assumptions • But get progressively harder to optimize • Simpler models are often used to 'initialize' complex ones • e.g. train Model 1 and use it to initialize Model 2 parameters

  41. Phrase-based MT • Word-by-word translation is not sufficient in many cases • (Figure: literal vs. actual translation) • Solution: build alignments and translation tables between multi-word spans or “phrases”

  42. Phrase-based MT • Solution: build alignments and translation tables between multi-word spans or “phrases” • Translations condition on multi-word units and assign probabilities to multi-word units • Alignments map from spans to spans

  43. Syntactic MT (Slide credit: Greg Durrett)
