Target Conditioned Sampling: Optimizing Data Selection for Multilingual NMT
Xinyi Wang, Graham Neubig
Language Technologies Institute, Carnegie Mellon University
Multilingual NMT
(Figure: one English target, "A morning that I will never forget.", with parallel sources in several languages)
  glg: A mañá que eu nunca vou ...
  spa: Una mañana que nunca olvidaré.
  por: Uma manhã que nunca vou esquecer.
  ita: Una mattina che non dimenticherò mai.
  jpn: その日の朝のことは決して忘れることはないでしょう
• Particularly useful for low-resource languages (LRLs), such as Galician (glg)
Multilingual Training Paradigms
• Multilingual training (Dong et al. 2015, Firat et al. 2016)
• Train on a related high-resource language (HRL), then tune towards the LRL (Zoph et al. 2016)
• Train on multilingual data, then tune towards the LRL (Neubig and Hu 2018, Gu et al. 2018)
• Our proposal: can we select data more intelligently, in a less heuristic way?
Multilingual Objective for LRL NMT
• The LRL S has true data distribution P_s(X, Y); the training pool is the union of the LRL corpus S and the auxiliary corpora S_1, ..., S_n.
• Goal: a sampling distribution over the pool with Q(X, Y) ≈ P_s(X, Y)
• How to construct Q(X, Y)?
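One way to read this slide (a sketch, not necessarily the paper's exact formulation): training pairs are drawn from Q, and Q is chosen so that optimizing the multilingual objective approximates training on the LRL's own distribution.

```latex
\max_{\theta}\; \mathbb{E}_{(x,y)\sim Q(X,Y)}\big[\log P(y \mid x;\theta)\big],
\qquad Q(X,Y) \approx P_s(X,Y)
```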
Target Conditioned Sampling
(Figure: sampling pipeline)
• Q(Y): sample a target sentence y from the union of English targets (e.g. "A morning that I will never forget.", "When I was 11, I usually stay with ...")
• Q(X | y): sample a source sentence for y from its multilingual candidates (spa: "Una mañana...", por: "Uma manhã...", ita: "Una mattina...", jpn: "その日の朝...")
• The sampled (x, y) pairs form the training data
Choosing the Distributions
• Q(Y):
  • assume each language's data comes from the same domain
  • uniformly sampling from all targets y can then match P_s(Y)
• Q(X | y):
  • P_s(X = x | y) measures how likely x is in language s
  • approximate it with a heuristic similarity measure sim(x, s), normalized over all multilingual candidates x_i for a given target y (see the sketch below)
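A minimal sketch of the normalization implied by the last bullet (the exact form in the paper may differ): for a target y with multilingual source candidates x_1, ..., x_k,

```latex
Q(x_i \mid y) \;=\; \frac{\operatorname{sim}(x_i, s)}{\sum_{j=1}^{k} \operatorname{sim}(x_j, s)}
```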
Estimating sim(x, s)
• Language level:
  • Vocab overlap: character n-gram overlap between S and each language
  • Language model: score the document of each language
• Sentence level:
  • Vocab overlap: character n-gram overlap between S and each sentence
  • Language model: use an LM trained on S to score each sentence
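A small illustration of the sentence-level vocab-overlap heuristic (function names and the exact overlap formula are illustrative, not taken from the paper):

```python
from typing import Iterable, Set

def char_ngrams(text: str, n: int = 4) -> Set[str]:
    """Set of character n-grams occurring in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def vocab_overlap_sim(candidate: str, lrl_corpus: Iterable[str], n: int = 4) -> float:
    """Fraction of the candidate sentence's character n-grams that also
    occur anywhere in the LRL corpus S (a simple vocab-overlap proxy)."""
    lrl_ngrams: Set[str] = set()
    for sent in lrl_corpus:
        lrl_ngrams |= char_ngrams(sent, n)
    cand = char_ngrams(candidate, n)
    return len(cand & lrl_ngrams) / len(cand) if cand else 0.0
```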
Algorithms
• First sample y based on Q(Y), then sample (x_i, y) based on Q(X | y)
• Stochastic (TCS-S):
  • dynamically re-sample each mini-batch
• Deterministic (TCS-D):
  • select x′ = argmax_x Q(x | y), fixed during training
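A hedged sketch of the two variants, assuming `candidates` maps each target y to its multilingual source candidates and `sim` is the heuristic from the previous slide (names and data layout are assumptions for illustration):

```python
import random
from typing import Callable, Dict, List, Tuple

def tcs_deterministic(candidates: Dict[str, List[str]],
                      sim: Callable[[str], float]) -> List[Tuple[str, str]]:
    """TCS-D: for each target y, keep only x' = argmax_x sim(x, s);
    the selected pairs stay fixed for the whole of training."""
    return [(max(xs, key=sim), y) for y, xs in candidates.items()]

def tcs_stochastic_batch(candidates: Dict[str, List[str]],
                         sim: Callable[[str], float],
                         batch_size: int) -> List[Tuple[str, str]]:
    """TCS-S: build each mini-batch by drawing y ~ Q(Y) (uniform over targets),
    then x ~ Q(X | y) with probability proportional to sim(x, s)."""
    targets = list(candidates)
    batch = []
    for y in random.choices(targets, k=batch_size):
        xs = candidates[y]
        x = random.choices(xs, weights=[sim(c) for c in xs], k=1)[0]
        batch.append((x, y))
    return batch
```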
Experiment
• Dataset:
  • 58-language-to-English TED dataset (Qi et al., 2018)
  • 4 test languages: Azerbaijani (aze), Belarusian (bel), Galician (glg), Slovak (slk)
• Baselines:
  • Bi: each LRL paired with one related HRL (Neubig & Hu 2018)
  • All: train on all 59 languages
  • Copied: use the union of English sentences as monolingual data by copying them to the source (Currey et al. 2017)
TCS vs. Baselines
(Bar chart: relative difference from Bi for All, Copied, and TCS-S on aze, bel, glg, slk)
TCS-D vs. TCS-S
(Bar chart comparing TCS-D and TCS-S on aze, bel, glg, slk)
• TCS-D already brings gains; TCS-S generally performs better
LM vs. Vocab
(Bar chart: relative difference from Bi for LM and Vocab on aze, bel, glg, slk)
• The simple vocab overlap heuristic is already competitive
• LM performs better for slk, which has the largest amount of data
Sent vs. Lang
(Bar chart: relative difference from Bi for sentence-level and language-level heuristics on aze, bel, glg, slk)
• The language-level heuristic is generally better
Conclusion
• TCS is a simple method for better multilingual data selection
• Brings significant improvements with little training overhead
• Simple heuristics work well to estimate language similarity for LRLs
https://github.com/cindyxinyiwang/TCS
Thank You! Questions?
Extra Slides
Relationship with Back-Translation
(Bar chart comparing back-translation and TCS-S on aze, bel, glg, slk)
• TCS approximates the back-translation probability P_s(X | y)
• For LRLs, the heuristics perform better than a back-translation model
Effect on SDE
(Bar chart: relative difference for All, Copied, and TCS-S with SDE on aze, bel, glg, slk)
• SDE: a better word encoding designed for multilingual data (Wang et al. 2019)
• TCS still brings significant gains on top of SDE