CSEP 517 Natural Language Processing Machine Translation Luke Zettlemoyer (Slides adapted from Karthik Narasimhan, Chris Manning, Dan Jurafsky)
Translation • One of the “holy grail” problems in artificial intelligence • Practical use case: Facilitate communication between people in the world • Extremely challenging (especially for low-resource languages)
Easy and not so easy translations • Easy: • I like apples ↔ ich mag Äpfel (German) • Not so easy: • I like apples ↔ J'aime les pommes (French) • I like red apples ↔ J'aime les pommes rouges (French) • les ↔ the, but les pommes ↔ apples
MT basics • Goal: Translate a sentence w^(s) in a source language (input) to a sentence w^(t) in the target language (output) • Can be formulated as an optimization problem: ŵ^(t) = argmax_{w^(t)} ψ(w^(s), w^(t)) • where ψ is a scoring function over source and target sentences • Requires two components: • Learning algorithm to compute the parameters of ψ • Decoding algorithm for computing the best translation ŵ^(t)
Why is MT challenging? • Single words may be replaced with multi-word phrases • I like apples ↔ J'aime les pommes • Reordering of phrases • I like red apples ↔ J'aime les pommes rouges • Contextual dependence • les ↔ the, but les pommes ↔ apples • Extremely large output space ⟹ Decoding is NP-hard
Vauquois Pyramid • Hierarchy of concepts and distances between them in different languages • Lowest level: individual words/characters • Higher levels: syntax, semantics • Interlingua: Generic language-agnostic representation of meaning
Evaluating translation quality • Two main criteria: • Adequacy: Translation w^(t) should adequately reflect the linguistic content of w^(s) • Fluency: Translation w^(t) should be fluent text in the target language • (Figure: different translations of "A Vinay le gusta Python")
Evaluation metrics • Manual evaluation is most accurate, but expensive • Automated evaluation metrics: • Compare system hypothesis with reference translations • BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002): • Modified n-gram precision
BLEU • BLEU = exp( (1/N) Σ_{n=1}^{N} log p_n ) • Two modifications: • To avoid log 0, all p_n are smoothed • Each n-gram in the reference can be used at most once • Ex. Hypothesis: "to to to to to" vs. Reference: "to be or not to be" should not get a unigram precision of 1 • Precision-based metrics favor short translations • Solution: Multiply the score with a brevity penalty e^(1 − r/h) for translations shorter than the reference
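To make the two modifications concrete, here is a minimal Python sketch of modified n-gram precision and the brevity penalty; the whitespace tokenization and the eps smoothing constant are simplifying assumptions, not the exact recipe from Papineni et al. (2002).

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    """Clip each hypothesis n-gram count by its count in the reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = max(sum(hyp_counts.values()), 1)
    return clipped, total

def bleu(hyp, ref, max_n=4, eps=1e-9):
    """Geometric mean of (smoothed) modified n-gram precisions, times brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        clipped, total = modified_precision(hyp, ref, n)
        p_n = max(clipped / total, eps)       # crude smoothing to avoid log(0)
        log_prec += math.log(p_n) / max_n
    r, h = len(ref), len(hyp)
    bp = 1.0 if h > r else math.exp(1 - r / h)  # brevity penalty e^(1 - r/h)
    return bp * math.exp(log_prec)

# The slide's example: clipping gives unigram precision 2/5, not 1.
print(bleu("to to to to to".split(), "to be or not to be".split()))
```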
BLEU • Correlates somewhat well with human judgements (G. Doddington, NIST)
BLEU scores • (Table: sample BLEU scores for various system outputs) • Issues? • Alternatives have been proposed: • METEOR: weighted F-measure • Translation Error Rate (TER): edit distance between hypothesis and reference
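TER additionally counts block shifts, which are omitted here; as a rough illustration, this sketch computes only the word-level edit-distance core, normalized by reference length (the example sentences are made up).

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens (insertions, deletions, substitutions)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

hyp = "Vinay likes the Python".split()
ref = "Vinay likes Python".split()
print(word_edit_distance(hyp, ref) / len(ref))     # edits per reference word
```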
Data • Statistical MT requires parallel corpora (e.g., Europarl; Koehn, 2005) • And lots of it! • Not available for many low-resource languages in the world
Statistical MT • ŵ^(t) = argmax_{w^(t)} ψ(w^(s), w^(t)) • The scoring function ψ can be broken down as follows: ψ(w^(s), w^(t)) = ψ_A(w^(s), w^(t)) + ψ_F(w^(t)), where ψ_A measures adequacy and ψ_F measures fluency • Allows us to estimate the parameters of ψ on separate data: • ψ_A from aligned corpora • ψ_F from monolingual corpora
Noisy channel model • (Diagram: a target sentence is drawn from p_T, then a source sentence is drawn from p_{S|T}) • Generative process for the source sentence • Use Bayes rule to recover the w^(t) that is maximally likely under the conditional distribution p_{T|S} (which is what we want)
Noisy channel model • (Diagram: a target sentence is drawn from p_T, then a source sentence is drawn from p_{S|T}) • Generative process for the source sentence • Use Bayes rule to recover the w^(t) that is maximally likely under the conditional distribution p_{T|S} (which is what we want) • Allows us to use a language model p_T to improve fluency
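As an illustration of the noisy channel idea, the toy sketch below re-scores candidate translations with log p(source | target) + log p(target); the probability tables are invented numbers, not learned models.

```python
import math

# Toy translation model p(source | target) and language model p(target).
# All numbers below are made up purely for illustration.
translation_model = {
    ("J'aime les pommes", "I like apples"): 0.6,
    ("J'aime les pommes", "I like the apples"): 0.7,
}
language_model = {
    "I like apples": 0.010,
    "I like the apples": 0.002,
}

def noisy_channel_score(source, target):
    """log p(source | target) + log p(target), proportional to log p(target | source)."""
    return math.log(translation_model[(source, target)]) + math.log(language_model[target])

source = "J'aime les pommes"
candidates = ["I like apples", "I like the apples"]
# The fluent candidate wins even though its channel probability is slightly lower.
print(max(candidates, key=lambda t: noisy_channel_score(source, t)))
```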
IBM Models • Early approaches to statistical MT • How can we define the translation model p_{S|T}? • How can we estimate the parameters of the translation model from parallel training examples? • Make use of the idea of alignments
Alignments • Key question: How should we align words in the source to words in the target? • (Figure: examples of a good and a bad alignment)
Incorporating alignments • The joint probability of an alignment and a translation can be defined in terms of alignment variables a_1, ..., a_{M^(s)} • M^(s), M^(t) are the number of words in the source and target sentences • a_m is the alignment of the m-th word in the source sentence, i.e., it specifies that the m-th source word is aligned to the a_m-th word in the target • Is this sufficient?
Incorporating alignments • (Figure: example source-target alignment with a_1 = 2, a_2 = 3, a_3 = 4, ...) • Multiple source words may align to the same target word!
Reordering and word insertion • Assume an extra NULL token on the target side, so source words with no counterpart can align to it (Slide credit: Brendan O'Connor)
Independence assumptions • Two independence assumptions: • Alignment probability factors across tokens • Translation probability factors across tokens
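The two factorizations appeared as equations on the original slide; a plausible reconstruction in the notation used here is:

```latex
% Alignment probability factors across source tokens:
P(A \mid w^{(s)}, w^{(t)}) = \prod_{m=1}^{M^{(s)}} p(a_m \mid m, M^{(s)}, M^{(t)})

% Translation probability factors across source tokens, given the alignment:
P(w^{(s)} \mid w^{(t)}, A) = \prod_{m=1}^{M^{(s)}} p\left(w^{(s)}_m \mid w^{(t)}_{a_m}\right)
```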
How do we translate? • We want: argmax_{w^(t)} p(w^(t) | w^(s)) = argmax_{w^(t)} p(w^(s), w^(t)) / p(w^(s)) • Sum over all possible alignments (marginalizing out A) • Alternatively, take the max over alignments • Decoding: greedy/beam search
IBM Model 1 • Assume p(a_m | m, M^(s), M^(t)) = 1/M^(t) • Is this a good assumption? Every alignment is equally likely!
IBM Model 1 • Each source word is aligned to at most one target word • Further, assume p(a_m | m, M^(s), M^(t)) = 1/M^(t) • We then have: p(w^(s), w^(t)) = p(w^(t)) · (1/M^(t))^{M^(s)} · Σ_A p(w^(s) | w^(t), A) • How do we estimate p(w^(s) = v | w^(t) = u)?
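Because the alignment prior is uniform, the sum over alignments factorizes into a per-word sum over target positions. The sketch below computes the Model 1 likelihood using that identity; the translation table t and the choice to include the NULL token are illustrative assumptions.

```python
def model1_likelihood(source, target, t, include_null=True):
    """
    IBM Model 1: p(source | target) = prod_m (1/M) * sum_n t[(source[m], target[n])],
    since the uniform alignment prior lets the sum over alignments factorize.
    """
    positions = (["<NULL>"] + target) if include_null else target
    prob = 1.0
    for s_word in source:
        # Each source word sums its translation probability over all target positions.
        prob *= sum(t.get((s_word, t_word), 0.0) for t_word in positions) / len(positions)
    return prob

# Toy translation table: t[(source_word, target_word)] = p(source_word | target_word)
t = {
    ("J'aime", "like"): 0.8, ("J'aime", "I"): 0.1,
    ("les", "apples"): 0.3, ("les", "<NULL>"): 0.5,
    ("pommes", "apples"): 0.9,
}
print(model1_likelihood("J'aime les pommes".split(), "I like apples".split(), t))
```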
IBM Model 1 • If we had word-to-word alignments, we could compute the probabilities using the MLE: p(v | u) = count(u, v) / count(u) • where count(u, v) = #instances where word u was aligned to word v in the training set • However, word-to-word alignments are often hard to come by. What can we do?
EM for Model 1* (advanced topic) • (E-step) If we had an accurate translation model, we can estimate the likelihood of each alignment • (M-step) Use the expected counts to re-estimate the translation parameters: p(v | u) = E_q[count(u, v)] / count(u)
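A minimal EM sketch for Model 1 on a toy parallel corpus (the German-English pairs echo the Koehn worked example referenced on the next slide); the uniform initialization and the omission of the NULL token are simplifications.

```python
from collections import defaultdict

# Toy parallel corpus: (source sentence, target sentence) pairs.
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]

# Uniform initialization of t(s | u) over co-occurring word pairs.
src_vocab = {s for src, _ in corpus for s in src}
t = defaultdict(float)
for src, tgt in corpus:
    for s in src:
        for u in tgt:
            t[(s, u)] = 1.0 / len(src_vocab)

for _ in range(10):
    count = defaultdict(float)   # expected counts E_q[count(u, v)]
    total = defaultdict(float)   # expected counts of target word u
    # E-step: the posterior over alignments factorizes per source word.
    for src, tgt in corpus:
        for s in src:
            norm = sum(t[(s, u)] for u in tgt)
            for u in tgt:
                q = t[(s, u)] / norm          # q(a_m = n) proportional to t(s | u)
                count[(s, u)] += q
                total[u] += q
    # M-step: re-estimate translation probabilities from expected counts.
    for (s, u) in count:
        t[(s, u)] = count[(s, u)] / total[u]

print(round(t[("Haus", "house")], 3), round(t[("das", "the")], 3))
```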
IBM Model 1 - EM intuition • (Figure: worked example over EM steps 1 through N; example from Philipp Koehn)
IBM Model 2 • Slightly relaxed assumption: p(a_m | m, M^(s), M^(t)) is also estimated, not set to a constant • Original independence assumptions still required: • Alignment probability factors across tokens • Translation probability factors across tokens
Other IBM models • Models 3-6 make successively weaker assumptions • But get progressively harder to optimize • Simpler models are often used to 'initialize' complex ones • e.g., train Model 1 and use it to initialize Model 2 parameters
Phrase-based MT • Word-by-word translation is not sufficient in many cases • (Figure: literal vs. actual translation) • Solution: build alignments and translation tables between multi-word spans or "phrases"
Phrase-based MT • Solution: build alignments and translation tables between multiword spans or “phrases” • Translations condition on multi-word units and assign probabilities to multi-word units • Alignments map from spans to spans
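A toy sketch of how a phrase table assigns probabilities to multi-word units; the table entries are invented, and a real phrase-based decoder (e.g., Moses) would also search over segmentations and reorderings and add a language model score.

```python
import math

# Toy phrase table: source phrase -> list of (target phrase, log-probability).
# All entries are made-up numbers for illustration.
phrase_table = {
    "J'aime": [("I like", math.log(0.7)), ("I love", math.log(0.3))],
    "les pommes rouges": [("red apples", math.log(0.8))],
    "les pommes": [("apples", math.log(0.6)), ("the apples", math.log(0.4))],
}

def score_segmentation(segments):
    """Score one fixed segmentation: sum of log phrase-translation probabilities."""
    total, output = 0.0, []
    for src_phrase in segments:
        tgt_phrase, logp = max(phrase_table[src_phrase], key=lambda x: x[1])
        total += logp
        output.append(tgt_phrase)
    return " ".join(output), total

# Multi-word spans translate as units, capturing reordering inside the phrase.
print(score_segmentation(["J'aime", "les pommes rouges"]))
```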
Phrase lattices are big! Slide credit: Dan Klein
Vauquois Pyramid • Hierarchy of concepts and distances between them in different languages • Lowest level: individual words/characters • Higher levels: syntax, semantics • Interlingua: Generic language-agnostic representation of meaning
Syntactic MT (Slide credit: Greg Durrett)
Syntactic MT Next time: Neural machine translation