Target Conditioned Sampling: Optimizing Data Selection for Multilingual NMT
Xinyi Wang, Graham Neubig
Language Technologies Institute, Carnegie Mellon University
Multilingual NMT
(Figure: one English target, "A morning that I will never forget.", with parallel sources in several languages)
  glg: A mañá que eu nunca vou ...
  spa: Una mañana que nunca olvidaré.
  por: Uma manhã que nunca vou esquecer.
  ita: Una mattina che non dimenticherò mai.
  jpn: その日の朝のことは決して忘れることはないでしょう
• Particularly useful for low-resource languages (LRLs), such as Galician (glg)
Multilingual Training Paradigms
• Multilingual training (Dong et al. 2015, Firat et al. 2016)
• Train on a related high-resource language (HRL), then tune towards the LRL (Zoph et al. 2016)
• Train on multilingual data, then tune towards the LRL (Neubig and Hu 2018, Gu et al. 2018)
• Our proposal: can we select data more intelligently, in a less heuristic way?
Multilingual Objective for LRL NMT
• The LRL S has true data distribution P_s(X, Y); the training pool is the union of the LRL corpus S and the auxiliary corpora S_1, ..., S_n.
• Goal: a sampling distribution over the pool with Q(X, Y) ≈ P_s(X, Y)
• How to construct Q(X, Y)?
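One way to read this slide (a sketch, not necessarily the paper's exact formulation): training pairs are drawn from Q, and Q is chosen so that optimizing the multilingual objective approximates training on the LRL's own distribution.

```latex
\max_{\theta}\; \mathbb{E}_{(x,y)\sim Q(X,Y)}\big[\log P(y \mid x;\theta)\big],
\qquad Q(X,Y) \approx P_s(X,Y)
```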
Target Conditioned Sampling
(Figure: sampling pipeline)
• Q(Y): sample a target sentence y from the union of English targets (e.g. "A morning that I will never forget.", "When I was 11, I usually stay with ...")
• Q(X | y): sample a source sentence for y from its multilingual candidates (spa: "Una mañana...", por: "Uma manhã...", ita: "Una mattina...", jpn: "その日の朝...")
• The sampled (x, y) pairs form the training data
Choosing the Distributions
• Q(Y):
  • assume each language's data comes from the same domain
  • uniformly sampling from all targets y can then match P_s(Y)
• Q(X | y):
  • P_s(X = x | y) measures how likely x is in language s
  • approximate it with a heuristic similarity measure sim(x, s), normalized over all multilingual candidates x_i for a given target y (see the sketch below)
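A minimal sketch of the normalization implied by the last bullet (the exact form in the paper may differ): for a target y with multilingual source candidates x_1, ..., x_k,

```latex
Q(x_i \mid y) \;=\; \frac{\operatorname{sim}(x_i, s)}{\sum_{j=1}^{k} \operatorname{sim}(x_j, s)}
```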
Estimating sim(x, s)
• Language level:
  • Vocab overlap: character n-gram overlap between S and each language
  • Language model: score the document of each language
• Sentence level:
  • Vocab overlap: character n-gram overlap between S and each sentence
  • Language model: use an LM trained on S to score each sentence
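A small illustration of the sentence-level vocab-overlap heuristic (function names and the exact overlap formula are illustrative, not taken from the paper):

```python
from typing import Iterable, Set

def char_ngrams(text: str, n: int = 4) -> Set[str]:
    """Set of character n-grams occurring in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def vocab_overlap_sim(candidate: str, lrl_corpus: Iterable[str], n: int = 4) -> float:
    """Fraction of the candidate sentence's character n-grams that also
    occur anywhere in the LRL corpus S (a simple vocab-overlap proxy)."""
    lrl_ngrams: Set[str] = set()
    for sent in lrl_corpus:
        lrl_ngrams |= char_ngrams(sent, n)
    cand = char_ngrams(candidate, n)
    return len(cand & lrl_ngrams) / len(cand) if cand else 0.0
```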
Algorithms
• First sample y based on Q(Y), then sample (x_i, y) based on Q(X | y)
• Stochastic (TCS-S):
  • dynamically re-sample each mini-batch
• Deterministic (TCS-D):
  • select x′ = argmax_x Q(x | y), fixed during training
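A hedged sketch of the two variants, assuming `candidates` maps each target y to its multilingual source candidates and `sim` is the heuristic from the previous slide (names and data layout are assumptions for illustration):

```python
import random
from typing import Callable, Dict, List, Tuple

def tcs_deterministic(candidates: Dict[str, List[str]],
                      sim: Callable[[str], float]) -> List[Tuple[str, str]]:
    """TCS-D: for each target y, keep only x' = argmax_x sim(x, s);
    the selected pairs stay fixed for the whole of training."""
    return [(max(xs, key=sim), y) for y, xs in candidates.items()]

def tcs_stochastic_batch(candidates: Dict[str, List[str]],
                         sim: Callable[[str], float],
                         batch_size: int) -> List[Tuple[str, str]]:
    """TCS-S: build each mini-batch by drawing y ~ Q(Y) (uniform over targets),
    then x ~ Q(X | y) with probability proportional to sim(x, s)."""
    targets = list(candidates)
    batch = []
    for y in random.choices(targets, k=batch_size):
        xs = candidates[y]
        x = random.choices(xs, weights=[sim(c) for c in xs], k=1)[0]
        batch.append((x, y))
    return batch
```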
Experiment
• Dataset:
  • 58-language-to-English TED dataset (Qi et al., 2018)
  • 4 test languages: Azerbaijani (aze), Belarusian (bel), Galician (glg), Slovak (slk)
• Baselines:
  • Bi: each LRL paired with one related HRL (Neubig & Hu 2018)
  • All: train on all 59 languages
  • Copied: use the union of English sentences as monolingual data by copying them to the source (Currey et al. 2017)
TCS vs. Baselines
(Bar chart: relative difference from Bi for All, Copied, and TCS-S on aze, bel, glg, slk)
TCS-D vs. TCS-S
(Bar chart comparing TCS-D and TCS-S on aze, bel, glg, slk)
• TCS-D already brings gains; TCS-S generally performs better
LM vs. Vocab
(Bar chart: relative difference from Bi for LM and Vocab on aze, bel, glg, slk)
• The simple vocab overlap heuristic is already competitive
• LM performs better for slk, which has the largest amount of data
Sent vs. Lang
(Bar chart: relative difference from Bi for sentence-level and language-level heuristics on aze, bel, glg, slk)
• The language-level heuristic is generally better
Conclusion
• TCS is a simple method for better multilingual data selection
• Brings significant improvements with little training overhead
• Simple heuristics work well to estimate language similarity for LRLs
https://github.com/cindyxinyiwang/TCS
Thank You! Questions?
Extra Slides
Relationship with Back-Translation
(Bar chart comparing back-translation and TCS-S on aze, bel, glg, slk)
• TCS approximates the back-translation probability P_s(X | y)
• For LRLs, the heuristics perform better than a back-translation model
Effect on SDE
(Bar chart: relative difference for All, Copied, and TCS-S with SDE on aze, bel, glg, slk)
• SDE: a better word encoding designed for multilingual data (Wang et al. 2019)
• TCS still brings significant gains on top of SDE