  1. Neural Machine Translation: directions for improvement CMSC 470 Marine Carpuat

  2. How can we improve on state-of-the-art machine translation approaches? • Model • Training • Data • Objective • Algorithm

  3. Addressing domain mismatch Slides adapted from Kevin Duh [Domain Adaptation in Machine Translation, MTMA 2019]

  4. Supervised training data is not always in the domain we want to translate!

  5. Domain adaptation is an important practical problem in machine translation • It may be expensive to obtain training sets that are both large and relevant to the test domain • So we often have to work with whatever data is available!

  6. Possible strategies: “continued training” or “fine-tuning” • Requires a small amount of in-domain parallel data [Luong and Manning 2016]
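
To make this concrete, here is a minimal PyTorch-style sketch of continued training. It assumes a hypothetical pretrained `model` whose forward pass returns the cross-entropy loss for a (source, target) batch, and an `in_domain_loader` over the small in-domain parallel corpus; it is a sketch, not the exact recipe from the cited work.

```python
import torch

def continued_training(model, in_domain_loader, epochs=3, lr=1e-5):
    """Continue training a pretrained NMT model on a small in-domain corpus.
    Assumption: model(src_batch, tgt_batch) returns the token-level NLL loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # small LR limits forgetting
    model.train()
    for _ in range(epochs):
        for src_batch, tgt_batch in in_domain_loader:
            optimizer.zero_grad()
            loss = model(src_batch, tgt_batch)  # hypothetical interface
            loss.backward()
            optimizer.step()
    return model
```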

  7. Possible strategies: back-translation
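
Back-translation creates synthetic parallel data from target-side monolingual text in the domain of interest: a reverse (target-to-source) model translates each monolingual target sentence into a synthetic source sentence, and the resulting pairs are added to the training data. A minimal sketch, where `reverse_model.translate` is a hypothetical target-to-source decoder:

```python
def back_translate(reverse_model, monolingual_targets):
    """Create synthetic parallel data from target-side monolingual text.
    Assumption: reverse_model.translate(sentence) decodes target -> source."""
    synthetic_pairs = []
    for tgt in monolingual_targets:
        synthetic_src = reverse_model.translate(tgt)   # synthetic source sentence
        synthetic_pairs.append((synthetic_src, tgt))   # the target side is real text
    return synthetic_pairs

# Usage sketch: mix the synthetic pairs with the existing parallel corpus and
# train the forward (source-to-target) model on the combined data.
```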

  8. Possible strategies: data selection • Train a language model on data representative of test domain • N-gram count based model [Moore & Lewis 2010] • Neural model [Duh et al. 2013] • Neural MT model [Junczys-Dowmunt 2018] • Use perplexity of LM on new data to measure distance from test domain
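
A minimal sketch of Moore-Lewis-style selection using the cross-entropy difference between two language models, assuming hypothetical LM objects that expose a per-token `cross_entropy(sentence)` method (one LM trained on in-domain text, one on general text):

```python
def select_in_domain(candidates, in_domain_lm, general_lm, top_k):
    """Rank candidate sentences by cross-entropy difference: sentences that the
    in-domain LM finds easy but the general LM finds hard score lowest and are
    selected first (Moore & Lewis 2010 style)."""
    scored = []
    for sentence in candidates:
        score = in_domain_lm.cross_entropy(sentence) - general_lm.cross_entropy(sentence)
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0])   # lowest score = closest to the test domain
    return [sentence for _, sentence in scored[:top_k]]
```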

  9. Possible strategies: different weights for different training samples • Corpus-level weights • Instance-level weights • Based on a classifier that measures the similarity of training samples to in-domain data
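
A minimal sketch of instance-level weighting, assuming a hypothetical `domain_classifier.score` that returns a similarity to the in-domain data in [0, 1] and a hypothetical `model.nll` that returns a per-sentence negative log-likelihood:

```python
def weighted_nll(model, batch, domain_classifier):
    """Instance-level weighting: scale each sentence pair's loss by a
    domain-similarity score from a (hypothetical) domain classifier."""
    total = 0.0
    for src, tgt in batch:
        weight = domain_classifier.score(src, tgt)   # similarity to in-domain data
        total += weight * model.nll(src, tgt)        # per-sentence negative log-likelihood
    return total / len(batch)
```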

  10. How can we improve on state-of-the-art machine translation approaches? • Model • Training • Data • Objective • Algorithm

  11. Beyond Maximum Likelihood Training

  12. How can we improve NMT training? • Assumption: References can substitute for predicted translations during training • Our hypothesis: Modeling divergences between references and predictions improves NMT • Based on a paper by Weijia Xu [NAACL 2019]

  13. Exposure Bias: Gap Between Training and Inference. [Figure: decoder translating the source 我们做了晚餐] At inference, the model conditions on its own translation so far (e.g. "<s> We will ..."): $P(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)$. In maximum likelihood training, it conditions on the reference prefix (e.g. "<s> We made ..."): $\mathrm{Loss} = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x)$.
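
The gap can be seen by contrasting how the prefix is built in the two regimes. A minimal sketch, assuming a hypothetical `decoder.step(prefix, source_encoding)` that returns a dictionary mapping each vocabulary token to its log-probability:

```python
def teacher_forcing_loss(decoder, reference, source_encoding):
    """Maximum likelihood training: the decoder always conditions on the
    reference prefix ("<s> We made ..."), never on its own predictions."""
    loss, prefix = 0.0, ["<s>"]
    for gold_token in reference:
        log_probs = decoder.step(prefix, source_encoding)   # token -> log-probability
        loss -= log_probs[gold_token]
        prefix.append(gold_token)                           # feed the gold token
    return loss

def greedy_decode(decoder, source_encoding, max_len=100):
    """Inference: the decoder conditions on its own previous outputs
    ("<s> We will ..."), so one early mistake shifts every later context."""
    prefix = ["<s>"]
    for _ in range(max_len):
        log_probs = decoder.step(prefix, source_encoding)
        next_token = max(log_probs, key=log_probs.get)      # argmax over the vocabulary
        if next_token == "</s>":
            break
        prefix.append(next_token)
    return prefix[1:]
```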

  14. How to Address Exposure Bias? • Because of exposure bias • Models don't learn to recover from their errors • Cascading errors at test time • Solution: • Expose models to their own predictions during training • But how to compute the loss when the partial translation diverges from the reference?

  15. Existing Method: Scheduled Sampling. Reference: <s> We made dinner </s>. [Figure: decoder over the source 我们做了晚餐] At each step, randomly choose whether to feed the reference token or the model's prediction; at step 1 both are "We". [Bengio et al., NeurIPS 2015]

  16. Existing Method: Scheduled Sampling. Reference: <s> We made dinner </s>. At step 2, the random choice feeds the model's prediction "will" instead of the reference "made". [Bengio et al., NeurIPS 2015]

  17. Existing Method: Scheduled Sampling. Reference: <s> We made dinner </s>. The decoder now conditions on the mixed prefix "<s> We will". [Bengio et al., NeurIPS 2015]

  18. Existing Method: Scheduled Sampling. Reference: <s> We made dinner </s>. The loss is still computed against the reference word at each time index: J = log p("We" | "<s>", source) [Bengio et al., NeurIPS 2015]

  19. Existing Method: Scheduled Sampling. Reference: <s> We made dinner </s>. J = log p("made" | "<s> We", source) [Bengio et al., NeurIPS 2015]

  20. Existing Method: Scheduled Sampling. Reference: <s> We made dinner </s>. J = log p("dinner" | "<s> We will", source) — i.e. the model is trained on the incorrect synthetic reference "We will dinner". [Bengio et al., NeurIPS 2015]
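
A minimal sketch of how scheduled sampling builds the decoder's conditioning prefix (not the authors' exact implementation): at each position a coin flip decides whether to feed the reference token or the model's own previous prediction, while the loss is still computed against the reference word at the same time index.

```python
import random

def scheduled_sampling_prefix(reference_tokens, model_predictions, p_reference):
    """Build the decoder's conditioning prefix under scheduled sampling.
    reference_tokens:  e.g. ["We", "made", "dinner", "</s>"]
    model_predictions: the model's prediction at each step, same length
    p_reference:       probability of feeding the gold token (annealed over training)
    The loss is still log p(reference_tokens[t] | prefix[:t+1], source), which is
    how incorrect synthetic references like "We will dinner" arise."""
    prefix = ["<s>"]
    for t in range(len(reference_tokens) - 1):
        if random.random() < p_reference:
            prefix.append(reference_tokens[t])      # feed the gold token
        else:
            prefix.append(model_predictions[t])     # feed the model's own prediction
    return prefix
```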

  21. Our Solution: Align Reference with Partial Translations. Reference: <s> We made dinner </s>. [Figure: sampled partial translation "<s> We will make dinner" with decoder states h_1 ... h_4 over the source 我们做了晚餐] Each reference word (here "dinner") is scored against every sampled prefix, weighted by a soft alignment: a_2 log p("dinner" | "<s>", source)

  22. Our Solution: Align Reference with Partial Translations (continued): a_2 log p("dinner" | "<s>", source) + a_3 log p("dinner" | "<s> We", source)

  23. Our Solution: Align Reference with Partial Translations (continued): a_2 log p("dinner" | "<s>", source) + a_3 log p("dinner" | "<s> We", source) + a_4 log p("dinner" | "<s> We will", source)

  24. Our Solution: Align Reference with Partial Translations (continued): a_2 log p("dinner" | "<s>", source) + a_3 log p("dinner" | "<s> We", source) + a_4 log p("dinner" | "<s> We will", source) + a_5 log p("dinner" | "<s> We will make", source)

  25. Our Solution: Align Reference with Partial Translations. Soft alignment weights: $a_j \propto \exp(\mathrm{Embed}(\text{dinner}) \cdot h_j)$. Full term for "dinner": a_2 log p("dinner" | "<s>", source) + a_3 log p("dinner" | "<s> We", source) + a_4 log p("dinner" | "<s> We will", source) + a_5 log p("dinner" | "<s> We will make", source)

  26. Our Solution: Align Reference with Partial Translations. Soft alignment weights: $a_j \propto \exp(\mathrm{Embed}(\text{dinner}) \cdot h_j)$. Full term for "dinner": a_2 log p("dinner" | "<s>", source) + a_3 log p("dinner" | "<s> We", source) + a_4 log p("dinner" | "<s> We will", source) + a_5 log p("dinner" | "<s> We will make", source)
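
A minimal sketch of the soft alignment weights, assuming `reference_embedding` is the embedding of one reference word (e.g. "dinner") and `decoder_states` stacks the decoder states h_j of the sampled partial translation:

```python
import torch
import torch.nn.functional as F

def soft_alignment_weights(reference_embedding, decoder_states):
    """Soft alignment a_j proportional to exp(Embed(reference word) . h_j).
    reference_embedding: tensor of shape (d,), e.g. Embed("dinner")
    decoder_states:      tensor of shape (num_prefixes, d), the h_j of the
                         sampled partial translation "<s> We will make ..."
    Returns a distribution over prefixes that sums to 1."""
    scores = decoder_states @ reference_embedding   # one dot product per h_j
    return F.softmax(scores, dim=0)
```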

  27. Training Objective. Scheduled Sampling uses a hard alignment by time index t: $J_{SS} = \sum_{(x,y) \in D} \sum_{t=1}^{T} \log p(y_t \mid \tilde{y}_{<t}, x)$. Ours uses a soft alignment between the reference word $y_t$ and each sampled prefix $\tilde{y}_{<j}$: $J_{SA} = \sum_{(x,y) \in D} \sum_{t=1}^{T} \log \sum_{j=1}^{T'} a_{tj}\, p(y_t \mid \tilde{y}_{<j}, x)$, where $\tilde{y}$ is the sampled prefix and $D$ the training data.

  28. Training Objective. Scheduled Sampling (hard alignment by time index t): $J_{SS} = \sum_{(x,y) \in D} \sum_{t=1}^{T} \log p(y_t \mid \tilde{y}_{<t}, x)$. Ours (soft alignment between $y_t$ and $\tilde{y}_{<j}$): $J_{SA} = \sum_{(x,y) \in D} \sum_{t=1}^{T} \log \sum_{j=1}^{T'} a_{tj}\, p(y_t \mid \tilde{y}_{<j}, x)$.

  29. Training Objective. Scheduled Sampling (hard alignment by time index t): $J_{SS} = \sum_{(x,y) \in D} \sum_{t=1}^{T} \log p(y_t \mid \tilde{y}_{<t}, x)$. Ours (soft alignment between $y_t$ and $\tilde{y}_{<j}$): $J_{SA} = \sum_{(x,y) \in D} \sum_{t=1}^{T} \log \sum_{j=1}^{T'} a_{tj}\, p(y_t \mid \tilde{y}_{<j}, x)$. Combined with maximum likelihood: $J = J_{SA} + J_{ML}$.
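
A minimal sketch of one term of the soft-alignment objective for a single reference word, assuming 1-D tensors `log_probs` (log p(y_t | sampled prefix of length j, x) for each j) and `align_weights` (the a_{tj} from the soft alignment above):

```python
import torch

def soft_aligned_log_prob(log_probs, align_weights):
    """One term of J_SA for a single reference word y_t:
        log sum_j a_{tj} * p(y_t | sampled prefix y~_{<j}, x)
    log_probs:     tensor (num_prefixes,), log p(y_t | y~_{<j}, x) for each j
    align_weights: tensor (num_prefixes,), the soft alignment a_{tj} (sums to 1)
    Computed in log space for numerical stability."""
    return torch.logsumexp(torch.log(align_weights) + log_probs, dim=0)

# Summing this term over reference positions t and training sentences gives J_SA;
# the full objective combines it with maximum likelihood: J = J_SA + J_ML.
```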

  30. Experiments • Model: Bi-LSTM encoder, LSTM decoder, multilayer perceptron attention; differentiable sampling with Straight-Through Gumbel Softmax; implementation based on AWS Sockeye • Data: IWSLT14 de-en, IWSLT15 vi-en
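
For reference, a minimal PyTorch sketch of Straight-Through Gumbel-Softmax sampling, the trick that makes sampling the prefix differentiable: the forward pass uses a hard one-hot sample while gradients flow through the softmax relaxation. This is a generic sketch, not the exact implementation used in the experiments.

```python
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits, tau=1.0):
    """Straight-Through Gumbel-Softmax over a vocabulary.
    logits: unnormalized scores, shape (..., vocab_size)."""
    # Gumbel(0, 1) noise: g = -log(-log(U)), U ~ Uniform(0, 1)
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # Differentiable relaxation of argmax
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)
    # Hard one-hot sample used in the forward pass
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(logits).scatter_(-1, index, 1.0)
    # Straight-through estimator: forward = y_hard, backward = gradient of y_soft
    return y_hard + y_soft - y_soft.detach()
```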

  31. Our Method Outperforms Maximum Likelihood and Scheduled Sampling. [Figure: BLEU on de-en, en-de, and vi-en for the Baseline, Scheduled Sampling, Differentiable Scheduled Sampling, and Our Method]

  32. Our Method Needs No Annealing. Scheduled sampling: BLEU drops when used without annealing! [Figure: BLEU on de-en, en-de, and vi-en for the Baseline, Scheduled Sampling w/ annealing, Scheduled Sampling w/o annealing, and Our Method (no annealing)]

  33. Summary Introduced a new training objective: 1. Generate translation prefixes via differentiable sampling 2. Learn to align the reference words with sampled prefixes. Better BLEU than maximum likelihood and scheduled sampling (de-en, en-de, vi-en). Simple to train, no annealing schedule required.

  34. What you should know • Lots of things can be done to improve neural MT even without changing the model architecture • The domain of the training data matters • Simple techniques can be used to measure distance from the test domain • And to adapt the model to the domain of interest • The standard maximum likelihood objective is suboptimal • It does not directly measure translation quality • It is based on reference translations only, so the model is not exposed to its own errors during training • Developing reliable alternatives is an active area of research
