Low Resource Machine Translation
Marc’Aurelio Ranzato, Facebook AI Research - NYC (ranzato@fb.com)
Stanford - CS224N, 10 March 2020
Machine Translation (English → French)
Ingredients to train an NMT system:
• parallel training data
• seq2seq with attention
• SGD
Ingredient to test an NMT system:
• beam search
Example: "life is beautiful" → "la vie est belle"
(A minimal sketch of such a system follows below.)
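To make these ingredients concrete, here is a minimal sketch of a seq2seq-with-attention model trained with SGD on cross-entropy. This is a toy illustration under stated assumptions (a small LSTM encoder/decoder with dot-product attention, placeholder vocabulary sizes, and random batches), not the actual system from the lecture; beam search decoding is omitted for brevity.

```python
# Minimal sketch of seq2seq with attention for NMT (toy PyTorch model;
# vocabulary sizes, dimensions, and batches are placeholders).
import torch
import torch.nn as nn

class Seq2SeqAttn(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(2 * dim, tgt_vocab)

    def forward(self, src, tgt_in):
        enc_out, state = self.encoder(self.src_emb(src))        # (B, S, D)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)  # (B, T, D)
        # Dot-product attention: each decoder step attends over encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))    # (B, T, S)
        ctx = torch.bmm(scores.softmax(dim=-1), enc_out)        # (B, T, D)
        return self.out(torch.cat([dec_out, ctx], dim=-1))      # (B, T, V)

# Training step: SGD on cross-entropy over a (toy) parallel batch,
# with teacher forcing. Decoding (e.g., beam search) is not shown.
model = Seq2SeqAttn(src_vocab=1000, tgt_vocab=1000)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
src = torch.randint(0, 1000, (8, 12))   # random stand-in for source ids
tgt = torch.randint(0, 1000, (8, 10))   # random stand-in for target ids
logits = model(src, tgt[:, :-1])        # feed gold prefix (teacher forcing)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 1000), tgt[:, 1:].reshape(-1))
loss.backward()
opt.step()
```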
Some Stats
• 6000+ languages in the world.
• 80% of the world population does not speak English.
• Less than 5% of the people in the world are native English speakers.
The Long Tail of Languages
• The top 10 languages are spoken by less than 50% of the people.
• The remaining ~6500 are spoken by the rest!
• More than 2000 languages are spoken by less than 1000 people.
source: https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
[Figure: translation quality across the long tail of languages (X to English).]
source: https://ai.googleblog.com/2019/10/exploring-massively-multilingual.html
Machine Translation in Practice (English → Nepali)
Nepali is spoken by ~25M people.
Parallel training data (a collection of sentences with corresponding translations) is small!
Machine Translation in Practice (English → Nepali)
Let’s represent data with rectangles; the color indicates the language.
Machine Translation in Practice (English → Nepali)
[Diagram: in each domain (e.g., Bible, Parliamentary), some rectangles are sentences originating in English with corresponding Nepali translations, others are sentences originating in Nepali with corresponding English translations.]
Let’s represent (human) translations with empty rectangles.
• Some parallel data originates in the source language, some in the target language.
• Source and target domains may not match.
Machine Translation in Practice (English → Nepali)
[Diagram: a TEST set appears in the News domain; monolingual data ("mono") appears in several domains.]
• Test data might be in another domain.
• There might exist source-side in-domain monolingual data.
Machine Translation in Practice (English → Nepali, plus Hindi)
[Diagram: Hindi parallel and monolingual data appear, in yet other domains (e.g., Books).]
• There might be parallel and monolingual data in a high resource language close to the low resource language of interest. This data may belong to a different domain.
[Diagram: English, Nepali, Hindi, Sinhala, Bengali, Spanish, Tamil, Gujarati, … with parallel, monolingual, and TEST data scattered across domains.]
… the Mondrian-like learning setting!
Low Resource Machine Translation
Loose definition: a language pair can be considered low resource when the number of parallel sentences is on the order of 10,000 or less.
Note: modern NMT systems have several hundred million parameters nowadays!
Challenges:
- data
  - sourcing data to train on
  - evaluation datasets
- modeling
  - unclear learning paradigm
  - domain adaptation
  - generalization
Why Is Low Resource MT Interesting?
• It is about learning with less labeled data.
• It is about modeling structured outputs and compositional learning.
• It is a real problem to solve.
Outline: the life of a researcher
DATA
• “The FLoRes evaluation datasets for low resource MT: …” Guzmán, Chen et al., EMNLP 2019
MODEL
• “Phrase-based & Neural Unsupervised MT” Lample et al., EMNLP 2018
• “FBAI WAT’19 My-En translation task submission” Chen et al., WAT@EMNLP 2019
• “Investigating Multilingual NMT Representations at Scale” Kudugunta et al., EMNLP 2019
• “Multilingual Denoising Pre-training for NMT” Liu et al., arXiv 2001.08210, 2020
ANALYSIS
• “Analyzing uncertainty in NMT” Ott et al., ICML 2018
• “On the evaluation of MT systems trained with back-translation” Edunov et al., ACL 2020
• “The source-target domain mismatch problem in MT” Shen et al., arXiv 1909.13151, 2019
A Big “Small-Data” Challenge
http://opus.nlpl.eu/
Case Study: En-Ne
[Diagram: TEST in the Wikipedia domain; parallel data from the Bible, JW300, GNOME, Ubuntu, etc.; monolingual data from Wikipedia and Common Crawl on both sides.]
• In-domain data: no parallel data, little monolingual data.
• Out-of-domain data: little parallel data, quite a bit of monolingual data.
• No translations originating from Nepali.
A Case Study: En-Ne
• Parallel training data: versions of the Bible and the Ubuntu handbook (<1M sentences).
• Nepali monolingual data: Wikipedia (90K sentences), Common Crawl (a few million).
• English monolingual data: almost unlimited.
• Test data: ???
FLoRes Evaluation Benchmark
• Validation, test, and hidden test sets, each with 3000 sentences, in English-Nepali and English-Sinhala.
• Sentences taken from Wikipedia documents.
Data Collection Process:
• Very expensive and slow.
• Very hard to produce high-quality translations: automatic checks (language model filtering, transliteration filtering, length filtering, language id filtering, etc.) plus human assessment. (A sketch of two such checks follows below.)
Guzmán, Chen et al., “The FLoRes evaluation datasets for low resource MT…” EMNLP 2019
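To make the automatic checks concrete, here is a minimal sketch of two of them: length-ratio filtering and language-id filtering. It assumes fastText’s publicly available lid.176.bin language-identification model has been downloaded; the thresholds and the helper name keep_pair are illustrative choices, not the ones used to build FLoRes.

```python
# Sketch of length-ratio and language-id filtering for parallel sentence
# pairs. Assumption: fastText's lid.176.bin model is in the working dir;
# thresholds are illustrative only.
import fasttext

lid = fasttext.load_model("lid.176.bin")  # fastText language-id model

def keep_pair(src, tgt, src_lang="en", tgt_lang="ne",
              max_len_ratio=3.0, min_lid_prob=0.5):
    # Length filtering: drop pairs whose token-length ratio is implausible.
    ratio = max(len(src.split()), 1) / max(len(tgt.split()), 1)
    if ratio > max_len_ratio or ratio < 1.0 / max_len_ratio:
        return False
    # Language-id filtering: both sides must be identified as the
    # expected language with sufficient confidence.
    for text, lang in ((src, src_lang), (tgt, tgt_lang)):
        labels, probs = lid.predict(text.replace("\n", " "))
        if labels[0] != f"__label__{lang}" or probs[0] < min_lid_prob:
            return False
    return True

# Toy usage: keep or drop a single candidate pair.
print(keep_pair("life is beautiful", "जीवन सुन्दर छ"))
```

In a real pipeline these checks would be combined with the other filters mentioned above (language model scoring, transliteration checks) and followed by human assessment.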
Examples (Si-En and En-Si, original vs. translation)
Wikipedia originating in Sinhala has different topics than Wikipedia originating in English.
Guzmán, Chen et al., “The FLoRes evaluation datasets for low resource MT…” EMNLP 2019
Examples (Ne-En and En-Ne, original vs. translation)
Guzmán, Chen et al., “The FLoRes evaluation datasets for low resource MT…” EMNLP 2019
• Useful to evaluate truly low resource language pairs.
• Used in the WMT 2019 and WMT 2020 shared filtering tasks.
• Several publications.
• Sustained effort, more to come…
Data & baseline models: https://github.com/facebookresearch/flores
What Did We Learn?
• Data is often as important as, or more important than, designing a model.
• Collecting data is not trivial.
• Look at the data!!
Outline: the life of a researcher (continued)
DATA
• “The FLoRes evaluation datasets for low resource MT: …” Guzmán, Chen et al., EMNLP 2019
MODEL
• “Phrase-based & Neural Unsupervised MT” Lample et al., EMNLP 2018
• “FBAI WAT’19 My-En translation task submission” Chen et al., WAT@EMNLP 2019
• “Massively Multilingual NMT” Aharoni et al., ACL 2019
• “Multilingual Denoising Pre-training for NMT” Liu et al., arXiv 2001.08210, 2020
ANALYSIS
• “Analyzing uncertainty in NMT” Ott et al., ICML 2018
• “On the evaluation of MT systems trained with back-translation” Edunov et al., ACL 2020
• “The source-target domain mismatch problem in MT” Shen et al., arXiv 1909.13151, 2019
[Diagram: back to the Mondrian-like setting: English, Nepali, Hindi, Sinhala, Bengali, Spanish, Tamil, Gujarati, … with TEST data scattered across domains.]