Using Monolingual Source-Side In-Domain Data
Jen Drexler, Pamela Shapiro, Xuan Zhang
SCALE Readout, August 9, 2018
Continued Training
[Diagram: in-domain parallel data + general-domain NMT model → continued-trained NMT model]
Monolingual Source-Side In-Domain Data
[Diagram: monolingual source-side in-domain data + general-domain NMT model → ??? → domain-adapted NMT model]
Monolingual → Parallel
• Forward Translation
  • Use the general-domain MT model to translate the monolingual in-domain data
  • Continued training on the resulting "synthetic" parallel in-domain data (see the sketch below)
• Data Selection
  • Start from large corpora of parallel data covering a wide range of domains (web crawl)
  • Use the monolingual in-domain data to find the parallel sentences closest to the desired domain
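The forward-translation step can be sketched as below. This is a minimal illustration, not the actual SCALE pipeline: `translate_batch` is a hypothetical stand-in for whatever general-domain MT system is used (e.g. a Sockeye model), and the file paths are placeholders.

```python
# Minimal sketch of building "synthetic" in-domain bitext by forward translation.

def translate_batch(source_sentences):
    """Hypothetical: run the general-domain MT model over a batch of source sentences."""
    raise NotImplementedError("plug in your general-domain MT system here")

def build_synthetic_bitext(mono_src_path, out_src_path, out_tgt_path, batch_size=64):
    with open(mono_src_path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]

    with open(out_src_path, "w", encoding="utf-8") as src_out, \
         open(out_tgt_path, "w", encoding="utf-8") as tgt_out:
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            for src, hyp in zip(batch, translate_batch(batch)):
                src_out.write(src + "\n")   # real in-domain source side
                tgt_out.write(hyp + "\n")   # machine-translated "synthetic" target side
```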
Forward Translation
Using "synthetic" in-domain data for machine translation
Setup
• Data
  • Same general-domain data and the same in-domain validation and test sets as in the continued-training experiments
  • Use only the source side of the in-domain training data
• Forward Translation Models
  • Baseline models trained on general-domain data
  • Neural machine translation (NMT) or statistical machine translation (SMT)
• Continued Training (CT)
  • "Synthetic" in-domain data only
  • General-domain + synthetic in-domain data
TED Results
[Bar chart: in-domain BLEU for Arabic, German, Farsi, Korean, Russian, and Chinese under five conditions: general-domain NMT model; CT with NMT in-domain data; CT with SMT in-domain data; CT with general domain + SMT in-domain data; CT with general domain + NMT in-domain data. Gains over the general-domain baseline range from roughly -0.3 to +1.7 BLEU.]
Patent Results
[Bar chart: in-domain BLEU for German, Korean, Russian, and Chinese under the same five conditions. Gains over the general-domain baseline range from roughly -1.6 to +4.6 BLEU.]
Remaining Questions
• How much synthetic in-domain vs. general-domain data should be used for continued training?
  • The amount of available general-domain data varies widely by language and domain
• Should general-domain and synthetic in-domain data be treated differently during continued training?
  • Exploiting Source-side Monolingual Data in Neural Machine Translation (Zhang and Zong, EMNLP 2016)
  • The synthetic target side of the in-domain data is low-quality, so the decoder should not be trained to produce it
Continued Training Updates
• Alternate training on general-domain and in-domain data
  • Future work: experiment with the ratio of general-domain to in-domain data
• Optionally treat the synthetic data differently (see the sketch below)
  • Freeze decoder parameters
    • Future work: experiment with the choice of frozen parameters
  • Multi-task training
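A minimal sketch of the decoder-freezing option, written in PyTorch for illustration only (the experiments here used Sockeye, and the parameter-name prefix below is an assumption):

```python
import torch

def freeze_decoder(model: torch.nn.Module, frozen_prefix: str = "decoder"):
    """Disable gradient updates for all parameters whose name starts with
    `frozen_prefix`, so continued training on synthetic in-domain data only
    updates the encoder: the synthetic target side is low-quality, so we
    avoid teaching the decoder to produce it."""
    for name, param in model.named_parameters():
        if name.startswith(frozen_prefix):
            param.requires_grad = False

# During continued training, pass only the trainable parameters to the optimizer:
#   optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```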
Multi-Task Training (Zhang and Zong, EMNLP 2016)
[Diagram: two decoders share one encoder. In Zhang and Zong's setup, one decoder is trained on aligned bilingual data (translation) and the other on source-side monolingual data (predicting the reordered source sentence). In our adaptation, the shared encoder feeds a baseline model for the general-domain translation task and an auxiliary model for the in-domain translation task. A sketch follows.]
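A minimal, self-contained sketch of the shared-encoder / two-decoder idea in PyTorch (not the modified Sockeye recipe used in these experiments); vocabulary sizes, dimensions, and the random toy data are placeholders:

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """One shared encoder, two decoders: a main decoder for the general-domain
    translation task and an auxiliary decoder for the in-domain task. Both
    losses backpropagate into the shared encoder."""

    def __init__(self, vocab_src, vocab_tgt, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_src, d)
        self.encoder = nn.GRU(d, d, batch_first=True)    # shared encoder
        self.tgt_emb = nn.Embedding(vocab_tgt, d)
        self.dec_main = nn.GRU(d, d, batch_first=True)   # general-domain decoder
        self.dec_aux = nn.GRU(d, d, batch_first=True)    # in-domain decoder
        self.out_main = nn.Linear(d, vocab_tgt)
        self.out_aux = nn.Linear(d, vocab_tgt)

    def forward(self, src, tgt_in, task):
        _, h = self.encoder(self.src_emb(src))           # final encoder state
        dec, out = (self.dec_main, self.out_main) if task == "main" \
                   else (self.dec_aux, self.out_aux)
        states, _ = dec(self.tgt_emb(tgt_in), h)         # teacher forcing
        return out(states)                               # logits per target position

# Toy usage with random data, alternating tasks per minibatch:
model = SharedEncoderMultiTask(vocab_src=1000, vocab_tgt=1000)
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()
for step in range(4):
    task = "main" if step % 2 == 0 else "aux"
    src = torch.randint(0, 1000, (8, 12))                # batch of source token ids
    tgt = torch.randint(0, 1000, (8, 10))                # batch of target token ids
    logits = model(src, tgt[:, :-1], task)
    loss = loss_fn(logits.reshape(-1, 1000), tgt[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

Because both losses flow through the same encoder, in-domain batches still shape the shared representation even when the main decoder is left untouched.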
Chinese TED Results
[Bar chart: in-domain BLEU with NMT vs. SMT synthetic in-domain data for four continued-training strategies: synthetic in-domain concatenated with general-domain data, alternating, alternating + freezing, and alternating + multi-task. Visible scores range from 16.6 to 17.1 BLEU.]
Chinese WIPO Results
[Bar chart: in-domain BLEU with NMT vs. SMT synthetic in-domain data for the same four continued-training strategies. Visible scores range from 11.0 to 18.6 BLEU.]
Summary
• Continued training with synthetic in-domain data produces consistent in-domain BLEU improvements
  • NMT-generated synthetic data is more consistent than SMT-generated data
  • SMT outperforms NMT for some languages and domains
• Many possible future research directions
• Modified Sockeye recipes enable alternating domains and multi-task training
Mining Web-Crawled Data
Moore-Lewis Selection on ParaCrawl Data
MT Training
[Diagram sequence: a general-domain NMT model is trained on general-domain data; ParaCrawl data is then introduced, and in-domain source data is used to select from it for continued training, producing a continued-trained NMT model.]
ParaCrawl Data
• Pipeline for crawling parallel data from the web and cleaning it
  • Crawling → document alignment → sentence alignment → cleaning → domain selection
  • Documents are aligned based on language ID and URLs, then sentence-aligned and given a cleanliness score
• Returns cleanliness scores; the threshold controls the clean-vs-size tradeoff
In-Domain Data Selection
• Classic approach (Moore and Lewis 2010)
• Train source-side language models on IN (the in-domain data) and on a random sample of CRAWL
• Score each source-side CRAWL sentence by P_IN(sentence) / P_CRAWL(sentence)
• A strong method for SMT; we're investigating it for NMT (see the sketch below)
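A minimal sketch of Moore-Lewis scoring, assuming two KenLM language models trained beforehand on the in-domain text and on a random CRAWL sample (the .arpa file names are placeholders). The probability ratio is computed in log space as a length-normalized difference:

```python
import kenlm  # KenLM Python bindings

# Placeholder paths: LMs trained on in-domain text and on a random CRAWL sample.
lm_in = kenlm.Model("in_domain.arpa")
lm_crawl = kenlm.Model("crawl_sample.arpa")

def moore_lewis_score(sentence: str) -> float:
    """log P_IN(s) - log P_CRAWL(s), normalized by sentence length.
    Higher = more in-domain (TED-like); lower = less in-domain."""
    n = max(len(sentence.split()), 1)
    return (lm_in.score(sentence, bos=True, eos=True)
            - lm_crawl.score(sentence, bos=True, eos=True)) / n

def rank_crawl_sentences(path: str):
    """Return CRAWL source sentences sorted from most to least in-domain."""
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    return sorted(sentences, key=moore_lewis_score, reverse=True)
```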
Most TED-Like Sentence
"it changes the way we think ; it changes the way we walk in the world ; it changes our responses ; it changes our attitudes towards our current situations ; it changes the way we dress ; it changes the way we do things ; it changes the way we interact with people ."
Least TED-Like Sentence
"sunday , july 10 , 2016 in riverton the weather forecast would be : during the night the air temperature drops to + 21 ... + 24ºc , dew point : -1,6ºc ; precipitation is not expected , light breeze wind blowing from the south at a speed of 7-11 km / h , clear sky ; in the morning the air temperature drops to + 20 ... + 22ºc , dew point : + 1,24ºc ; precipitation is not expected , gentle breeze wind blowing from the west at a speed of 11-14 km / h , in the sky , sometimes there are small clouds ; in the afternoon the air temperature will be + 20 ... + 22ºc , dew point : + 4,06ºc ; ratio of temperature , wind speed and humidity : a bit dry for some ; precipitation is not expected , moderate breeze wind blowing from the north at a speed of 14-29 km / h , in the sky , sometimes there are small clouds ; in the evening the air temperature drops to + 15 ... + 19ºc , dew point : -0,12ºc ; precipitation is not expected , moderate breeze wind blowing from the north at a speed of 14-32 km / h , clear sky ; day length is 14:52"
Perplexity-Based Selection
Rank sentences, then select the amount of data based on perplexity on the in-domain data.
[Plot: perplexity (y-axis, 0-400) vs. number of selected sentences (x-axis, 0-4.5M).]
German TED
[Bar chart: BLEU for the general-domain model (28M sentences), continued training on a random 1M-sentence sample (28+1M), continued training on 1M Moore-Lewis-selected ParaCrawl sentences (28+1M, +1.19 BLEU), and continued training on the TED bitext.]
TED Results
[Bar chart: BLEU for German, Korean, and Russian with the general-domain model, continued training on random data, continued training on selected ParaCrawl data, and continued training on the TED bitext. Gains of +1.19 and +0.34 BLEU from ParaCrawl selection are visible.]
Settings Where Moore-Lewis Performs Poorly
Too much noise:
• Korean web-crawl data is very noisy
• In preliminary experiments, the threshold for data cleaning mattered significantly
Domain specificity:
• For German patent data, there may simply not be enough relevant web-crawled data
Analysis
Long sentences:
• A quirk of the Moore-Lewis method
• Smaller selections of Moore-Lewis-ranked data → high average sentence length
• NMT has a maximum sentence length, but sentences near that limit may still cause problems
• However, at the thresholds we tried for TED (600K-1M sentences), results were positive overall
Conclusions
• Domain-based data selection from web-crawled data can help domain adaptation for NMT when only source-side in-domain data is available
• Cleanliness of the web-crawled data matters
• Whether good in-domain data can be found on the web depends on the domain
Curriculum Learning (for Continued-Training)
Curriculum Learning
• How to take advantage of Moore-Lewis scores?
  - Filter ParaCrawl data by a threshold
  - Arrange the data-processing order by Moore-Lewis score ranking
• Curriculum Learning (CL) [Curriculum Learning, Bengio et al. 2009]:
  Process samples in a certain order (easy ➠ difficult) to train better machine learning models faster.
CL for Continued-Training
• Curriculum Learning (CL): process samples in a certain order (easy ➠ difficult)
• CL for Continued-Training (CT): reorganize the TED (bitext) + ParaCrawl data
  • Ranking criterion: Moore-Lewis score (easy: TED-like ➠ difficult: TED-unlike); see the ordering sketch below
• Compared to Pamela's approach: ParaCrawl threshold + random sampling (Pamela) vs. ParaCrawl threshold + TED bitext + ordering (Xuan)
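A minimal sketch of the resulting ordering, assuming Moore-Lewis scores for the ParaCrawl pairs have already been computed (for instance with the scoring sketch above):

```python
def build_curriculum(ted_bitext, paracrawl_bitext, paracrawl_scores):
    """Order the continued-training data from easy to difficult:
    TED bitext first (most in-domain), then ParaCrawl sentence pairs sorted
    from most TED-like to least TED-like by Moore-Lewis score."""
    ranked_crawl = [pair for pair, _ in sorted(zip(paracrawl_bitext, paracrawl_scores),
                                               key=lambda x: x[1], reverse=True)]
    return list(ted_bitext) + ranked_crawl
```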
Methods
• Data is split into shards 0-4: TED bitext in shard 0, ParaCrawl data in shards 1-4
• Shards are ordered from high Moore-Lewis score (TED-like, clean) to low (TED-unlike, noisy)
• ParaCrawl shard boundaries are chosen with the Jenks natural breaks classification algorithm (maximize variance between classes, reduce variance within classes); see the sketch below
• ParaCrawl samples per class: 15163, 3066, 13519, 32179, 59708 (12.26%, 2.48%, 10.93%, 26.03%, 48.29%)
[Plot: density of Moore-Lewis scores for ParaCrawl vs. TED data]
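A sketch of the shard assignment. The actual break points come from the Jenks natural breaks algorithm; here a simple 1-D k-means stands in for it (both minimize within-class variance of the scores), and the number of ParaCrawl shards is a parameter:

```python
import numpy as np

def natural_breaks(scores, n_classes=4, n_iter=50):
    """Simple 1-D k-means as a stand-in for Jenks natural breaks."""
    scores = np.asarray(scores, dtype=float)
    # Initialize centers at evenly spaced quantiles of the score distribution.
    centers = np.quantile(scores, np.linspace(0, 1, n_classes + 2)[1:-1])
    for _ in range(n_iter):
        labels = np.argmin(np.abs(scores[:, None] - centers[None, :]), axis=1)
        centers = np.array([scores[labels == k].mean() if np.any(labels == k)
                            else centers[k] for k in range(n_classes)])
    return labels, centers

def assign_shards(paracrawl_scores, n_paracrawl_shards=4):
    """Shard 0 is reserved for the TED bitext; ParaCrawl pairs go to shards
    1..n, ordered from highest (TED-like) to lowest (TED-unlike) scores."""
    labels, centers = natural_breaks(paracrawl_scores, n_paracrawl_shards)
    order = np.argsort(-centers)                   # highest-score class first
    rank_of = {cls: i + 1 for i, cls in enumerate(order)}
    return np.array([rank_of[c] for c in labels])  # shard id per ParaCrawl pair
```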
Training Strategy
After training the NMT model on general-domain data, continued training proceeds on TED + ParaCrawl as follows:
[Diagram: training proceeds in curriculum phases of 1000 minibatches each; at each curriculum update point a new shard becomes visible, and data is shuffled among and within the visible shards; training continues until convergence. A sketch of the schedule follows.]
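A minimal sketch of that schedule, assuming shards built as above; minibatch construction and the convergence check are left to the caller:

```python
import random

def curriculum_batches(shards, phase_len=1000, batch_size=64):
    """Yield minibatches for curriculum-based continued training.

    `shards` is a list of lists of sentence pairs ordered easy -> difficult
    (shard 0 = TED bitext, shards 1..n = ParaCrawl, most to least TED-like).
    Each curriculum phase lasts `phase_len` minibatches; at each curriculum
    update point one more shard becomes visible, and the visible data is
    reshuffled (among and within shards). Once all shards are visible, keep
    cycling over the full data until the caller stops (convergence)."""
    n_visible = 1
    while True:
        visible = [pair for shard in shards[:n_visible] for pair in shard]
        random.shuffle(visible)
        idx = 0
        for _ in range(phase_len):
            if idx + batch_size > len(visible):
                random.shuffle(visible)
                idx = 0
            yield visible[idx:idx + batch_size]
            idx += batch_size
        n_visible = min(n_visible + 1, len(shards))
```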