Neural Machine Translation: directions for improvement CMSC 470 Marine Carpuat
How can we improve on state-of-the-art machine translation approaches? • Model • Training • Data • Objective • Algorithm
Addressing domain mismatch Slides adapted from Kevin Duh [Domain Adaptation in Machine Translation, MTMA 2019]
Supervised training data is not always in the domain we want to translate!
Domain adaptation is an important practical problem in machine translation • It may be expensive to obtain training sets that are both large and relevant to the test domain • So we often have to work with whatever data we can get!
Possible strategies: "continued training" or "fine-tuning" • Continue training a general-domain model on in-domain data (see the sketch below) • Requires small in-domain parallel data [Luong and Manning 2016]
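A minimal continued-training sketch in PyTorch, not the exact recipe of Luong and Manning (2016): `model`, `compute_loss`, and `in_domain_loader` are hypothetical placeholders for whatever NMT toolkit is used. The point is simply to resume optimization on the small in-domain corpus, typically with a reduced learning rate.

```python
import torch

def fine_tune(model, in_domain_loader, compute_loss, epochs=3, lr=1e-5):
    """Continue training a converged general-domain NMT model on a small
    in-domain parallel corpus (model, loader, and loss fn are placeholders)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for src_batch, tgt_batch in in_domain_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, src_batch, tgt_batch)  # per-batch NLL
            loss.backward()
            optimizer.step()
    return model
```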
Possible strategies: back-translation • Translate in-domain target-language monolingual text into the source language with a reverse model, then mix the resulting synthetic parallel data into training (see the sketch below)
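A sketch of the back-translation recipe; the `translate` and `train_on` methods are hypothetical stand-ins for whatever toolkit is used, and only the overall idea (synthesize source sides for in-domain target monolingual data, then retrain on real plus synthetic pairs) comes from the slide.

```python
def back_translate(reverse_model, target_monolingual):
    """Produce synthetic (source, target) pairs by translating in-domain
    target-language sentences back into the source language."""
    return [(reverse_model.translate(tgt), tgt) for tgt in target_monolingual]

def adapt_with_back_translation(forward_model, reverse_model,
                                parallel_data, target_monolingual):
    synthetic = back_translate(reverse_model, target_monolingual)
    # Mix real and synthetic parallel data and (re)train the forward model.
    forward_model.train_on(parallel_data + synthetic)
    return forward_model
```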
Possible strategies: data selection • Train a language model on data representative of the test domain • N-gram count based model [Moore & Lewis 2010] • Neural model [Duh et al. 2013] • Neural MT model [Junczys-Dowmunt 2018] • Use the perplexity of the LM on new data to measure distance from the test domain (see the selection sketch below)
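A self-contained sketch of Moore & Lewis (2010)-style selection, using a toy add-one-smoothed unigram LM as a stand-in for the n-gram or neural models cited above: sentences are ranked by the cross-entropy difference H_in-domain(s) − H_general(s), and the most in-domain-looking fraction is kept.

```python
import math
from collections import Counter

def unigram_ce(sentence, counts, total, vocab):
    """Per-token cross-entropy under an add-one-smoothed unigram LM."""
    toks = sentence.split()
    ce = -sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks)
    return ce / max(len(toks), 1)

def lm_stats(corpus):
    """Unigram counts, total token count, and vocab size (+1 for unseen tokens)."""
    counts = Counter(tok for line in corpus for tok in line.split())
    return counts, sum(counts.values()), len(counts) + 1

def select_in_domain(candidates, in_domain_corpus, general_corpus, keep_fraction=0.2):
    """Rank candidate sentences by H_in(s) - H_gen(s); keep the lowest-scoring fraction."""
    in_c, in_n, in_v = lm_stats(in_domain_corpus)
    gen_c, gen_n, gen_v = lm_stats(general_corpus)
    scored = sorted(candidates,
                    key=lambda s: unigram_ce(s, in_c, in_n, in_v)
                                  - unigram_ce(s, gen_c, gen_n, gen_v))
    return scored[: int(len(scored) * keep_fraction)]
```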
Possible strategies: different weights for different training samples • Corpus-level weights • Instance-level weights • Based on a classifier that measures the similarity of samples to in-domain data (see the weighted-loss sketch below)
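A minimal PyTorch sketch of instance-level weighting: each sentence pair's negative log-likelihood is scaled by a weight, e.g. produced by a domain-similarity classifier (assumed to exist, not shown). Corpus-level weighting is the special case where all sentences from the same corpus share one weight.

```python
import torch

def weighted_nll(log_probs, targets, sample_weights):
    """log_probs: (batch, seq, vocab) log-softmax outputs;
    targets: (batch, seq) reference token ids;
    sample_weights: (batch,) per-sentence domain-similarity weights."""
    token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (batch, seq)
    sentence_nll = token_nll.sum(dim=-1)                                   # (batch,)
    return (sample_weights * sentence_nll).mean()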
How can we improve on state-of-the-art machine translation approaches? • Model • Training • Data • Objective • Algorithm
Beyond Maximum Likelihood Training
How can we improve NMT training? • Assumption: references can substitute for predicted translations during training • Our hypothesis: modeling divergences between references and predictions improves NMT • Based on a paper by Weijia Xu [NAACL 2019]
Exposure Bias: Gap Between Training and Inference • Source: 我们做了晚餐 • Inference: the model scores P(y | x) = ∏_{t=1}^{T} p(y_t | y_{<t}, x), conditioning each step on its own previous predictions (model translation: "<s> We will …") • Maximum likelihood training: the objective Σ_{t=1}^{T} log p(y_t | y_{<t}, x) conditions each step on the reference prefix ("<s> We made dinner")
How to Address Exposure Bias? • Because of exposure bias, models don't learn to recover from their errors, leading to cascading errors at test time • Solution: expose models to their own predictions during training • But how to compute the loss when the partial translation diverges from the reference?
Existing Method: Scheduled Sampling [Bengio et al., NeurIPS 2015] • Source: 我们做了晚餐 • Reference: <s> We made dinner </s> • At each decoding step, choose randomly whether to feed the previous reference token or the model's own prediction
Existing Method: Scheduled Sampling • Step 1: feed <s>; the model predicts "We", which happens to match the reference
Existing Method: Scheduled Sampling • Step 2: feed "We"; the model predicts "will" instead of the reference "made", and the prediction is chosen as the next input, so the partial translation becomes "<s> We will"
Existing Method: Scheduled Sampling • Loss at step 1: J = log p("We" | "<s>", source)
Existing Method: Scheduled Sampling • Loss at step 2: J = log p("made" | "<s> We", source)
Existing Method: Scheduled Sampling • Loss at step 3: J = log p("dinner" | "<s> We will", source) • Problem: this scores "dinner" against the incorrect synthetic reference "We will dinner"
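A minimal sketch of one scheduled-sampling training sequence, assuming a hypothetical step-wise `decoder(prev_token_id, state)` that returns next-token logits and an updated state; the coin flip between reference and prediction and the reference-based loss follow the walkthrough above.

```python
import random
import torch
import torch.nn.functional as F

def scheduled_sampling_loss(decoder, init_state, reference_ids, epsilon):
    """reference_ids: 1-D LongTensor of reference token ids, starting with <s>.
    epsilon: probability of feeding the reference token (annealed in practice)."""
    state = init_state
    prev = reference_ids[0]                         # always start from <s>
    loss = torch.tensor(0.0)
    for t in range(1, len(reference_ids)):
        logits, state = decoder(prev, state)        # (vocab,) scores for the next token
        loss = loss + F.cross_entropy(logits.unsqueeze(0),
                                      reference_ids[t].unsqueeze(0))
        pred = logits.argmax(dim=-1)                # model's own greedy prediction
        # Coin flip: keep the reference token with probability epsilon,
        # otherwise feed the model's prediction as the next input.
        prev = reference_ids[t] if random.random() < epsilon else pred
    return loss
```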
Our Solution: Align Reference with Partial Translations • Reference: <s> We made dinner </s> • Sampled prefix: <s> We will make dinner, with decoder states h_1 … h_4 • Score each reference word against every sampled prefix, weighted by a soft alignment
Our Solution: Align Reference with Partial Translations • Soft alignment weights: a_i ∝ exp(Embed("dinner") · h_i) • Contribution of "dinner" to the loss: a_2 log p("dinner" | "<s>", source) + a_3 log p("dinner" | "<s> We", source) + a_4 log p("dinner" | "<s> We will", source) + a_5 log p("dinner" | "<s> We will make", source)
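A minimal PyTorch sketch of the soft-alignment idea: align each reference token with every sampled-prefix decoder state via a_tj ∝ exp(Embed(y_t) · h_j) and weight the corresponding log-probabilities. Tensor shapes and names are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def soft_alignment_objective(ref_embeds, prefix_states, step_log_probs, ref_ids):
    """
    ref_embeds:     (T, d)   embeddings of the reference tokens y_t
    prefix_states:  (T', d)  decoder states h_j after each sampled prefix
    step_log_probs: (T', V)  log p(. | sampled prefix of length j, source)
    ref_ids:        (T,)     vocabulary ids of the reference tokens
    """
    scores = ref_embeds @ prefix_states.T        # (T, T') dot products Embed(y_t) . h_j
    align = F.softmax(scores, dim=-1)            # a_tj, normalized over prefixes j
    ref_logp = step_log_probs[:, ref_ids].T      # (T, T') log p(y_t | prefix_<j, source)
    return (align * ref_logp).sum()              # soft-alignment term for one sentence pair
```

Maximizing this quantity, together with the usual maximum likelihood term, gives the combined objective on the next slide.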
Training Objective • Scheduled Sampling: hard alignment by time index t: J_SS = Σ_{(x,y)∈D} Σ_{t=1}^{T} log p(y_t | ỹ_{<t}, x) • Ours: soft alignment between reference token y_t and sampled prefix ỹ_{<j}: J_SA = Σ_{(x,y)∈D} Σ_{t=1}^{T} Σ_{j=1}^{T'} a_{tj} log p(y_t | ỹ_{<j}, x) • Combined with maximum likelihood: J = J_SA + J_ML
Experiments • Model: Bi-LSTM encoder, LSTM decoder, multilayer perceptron attention • Differentiable sampling with Straight-Through Gumbel-Softmax • Based on AWS Sockeye • Data: IWSLT14 de-en, IWSLT15 vi-en
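A minimal sketch of differentiable sampling with the Straight-Through Gumbel-Softmax, using PyTorch's built-in `F.gumbel_softmax`: the forward pass feeds a hard one-hot sample of the next token, while gradients flow through the soft relaxation. The `embedding` argument and the greedy id extraction are illustrative assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_next_input(logits, embedding: nn.Embedding, tau=1.0):
    """logits: (batch, vocab) scores for the next token.
    Returns a differentiable embedding of the sampled token plus its ids."""
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)   # one-hot forward, soft backward
    next_embed = one_hot @ embedding.weight                  # (batch, d) sampled-token embedding
    return next_embed, one_hot.argmax(dim=-1)                # ids of the sampled tokens
```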
Our Method Outperforms Maximum Likelihood and Scheduled Sampling [bar chart: BLEU on de-en, en-de, and vi-en for the baseline, scheduled sampling, differentiable scheduled sampling, and our method]
Our Method Needs No Annealing • Scheduled sampling: BLEU drops when used without annealing! [bar chart: BLEU on de-en, en-de, and vi-en for the baseline, scheduled sampling with annealing, scheduled sampling without annealing, and our method (no annealing)]
Summary • Introduced a new training objective: 1. Generate translation prefixes via differentiable sampling 2. Learn to align the reference words with sampled prefixes • Better BLEU than maximum likelihood and scheduled sampling (de-en, en-de, vi-en) • Simple to train: no annealing schedule required
What you should know • Lots of things can be done to improve neural MT even without changing the model architecture • The domain of the training data matters • Simple techniques can be used to measure distance from the test domain, and to adapt the model to the domain of interest • The standard maximum likelihood objective is suboptimal • It does not directly measure translation quality • It is based on reference translations only, so the model is not exposed to its own errors during training • Developing reliable alternatives is an active area of research