Statistical Machine Translation: Rapid Development with Limited Resources

George Foster, Simona Gandrabur, Philippe Langlais, Pierre Plamondon, Graham Russell and Michel Simard
RALI-DIRO, Université de Montréal
CP. 6128, succursale Centre-Ville, Montréal (Québec) Canada, H3C 3J7
www-rali.iro.umontreal.ca

MT-Summit IX — New Orleans
Motivation

What progress can a small team of developers expect to achieve in creating a statistical MT system for an unfamiliar language, using only data and technology readily available in-house, or at short notice from external sources?

• Work conducted within the NIST 2003 MT evaluation task
  http://www.nist.gov/speech/tests/mt/
• Chinese-to-English task
• Computing resources: Pentium-4 class PCs with a maximum of 1 GB RAM
We had a plan...

Rescoring approach built on top of a roughly state-of-the-art translation model such as IBM Model 4

• Extensively used in automatic speech recognition on n-best lists or word graphs (Ortmanns et al., 1997; Rose and Riccardi, 1999)
• More recently proposed for use in SMT (Och and Ney, 2002; Soricut et al., 2002; Ueffing et al., 2002)
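To make the rescoring idea concrete, here is a minimal sketch of log-linear rescoring of an n-best list. The feature names and weights are illustrative assumptions, not the features actually used in this system.

def rescore_nbest(nbest, weights):
    """Rerank an n-best list with a log-linear combination of feature scores.

    nbest   -- list of (hypothesis, features) pairs, where features maps
               feature names to log-domain scores (e.g. translation model,
               language model, length penalty)
    weights -- dict of feature weights, typically tuned on held-out data
    """
    def combined(features):
        return sum(weights[name] * score for name, score in features.items())
    return sorted(nbest, key=lambda pair: combined(pair[1]), reverse=True)

# Hypothetical example: two candidate translations with made-up scores.
nbest = [
    ("the house is small", {"tm": -12.3, "lm": -8.1, "len": 4}),
    ("small the house",    {"tm": -11.9, "lm": -14.2, "len": 3}),
]
weights = {"tm": 1.0, "lm": 0.8, "len": 0.1}
print(rescore_nbest(nbest, weights)[0][0])  # the better LM score wins here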
Step 1 – Preparing with Canadian Hansards (Alegrías)

• Install necessary packages
• Train translation and language models
• Write IBM model 2 & 4 decoders

→ IBM4 models trained with GIZA++ and mkcls
  www-i6.informatik.rwth-aachen.de/Colleagues/och
→ Language and IBM model 2 trained with in-house packages
→ Multiple search strategy (Nießen et al., 1998; Germann et al., 2001)
Step 1 – 3-4 weeks later... (Seguiriyas)

We've got it! Our first English-French GIZA++ model was ready to use.

• Establishing the limits of the package (maximum input size, etc.)
  → 2-3 days of computation to train on a corpus of 1 million sentence pairs
  → can't train with more data (memory problems)
• Running mkcls
  → around 10 hours of computation to cluster a vocabulary into 50 classes
• Writing wrappers for the model data structures
Step 2 – Corpus Preprocessing

SMT is not exactly language blind...

The Linguistic Data Consortium (LDC) distributed the training data for the NIST task (at least partially).
http://www.ldc.upenn.edu/

→ A surprising variety of formats (sounds nice but is not)
→ Word boundaries inserted by means of a revised version of the mansegment program supplied by the LDC
→ Doubt: is our sentence aligner supposed to work for Chinese/English corpora?

One person-month of effort for a judicious mixture of automatic and semi-automatic approaches
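The doubt is reasonable: sentence aligners of that era typically leaned on sentence-length statistics, often supplemented with cognate cues that are useless across scripts, and the classic length-based cost of Gale and Church (1993) was calibrated on European language pairs. A minimal sketch of that cost; the parameter defaults are the published English/French values, shown only to illustrate what would need re-estimating for Chinese/English.

import math

def gale_church_cost(len_src, len_tgt, c=1.0, s2=6.8):
    """Length-based match cost for a candidate 1-1 sentence pair, after
    Gale and Church (1993): how surprising is the target length given
    the source length, under a Gaussian model of character lengths?

    c  -- expected ratio of target to source length (in characters)
    s2 -- variance of that ratio; c=1.0 and s2=6.8 are the published
          English/French values and would need re-estimating here
    """
    if len_src == 0:
        return float("inf") if len_tgt > 0 else 0.0
    delta = (len_tgt - len_src * c) / math.sqrt(len_src * s2)
    # Gaussian negative log-likelihood, up to an additive constant
    return delta * delta / 2.0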
Step 2 – Corpus Preprocessing
Take one

• For the NIST exercise, only pre-aligned texts were used.
  → Some regions acknowledged by the supplier to be potentially unreliable were omitted.
• Instead of recompiling GIZA++ to handle sentences longer than 40 words, we devised a knowledge-poor splitter relying heavily on punctuation (see the sketch below).
  → In cases where no suitable punctuation existed, sentences were split at an arbitrary token boundary.
  → Mostly a mix of un and hansard was used to train language and translation models
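A minimal sketch of such a splitter, assuming whitespace-tokenized input. The 40-word window matches the GIZA++ limit mentioned above; the punctuation set is an illustrative assumption (the Chinese side would need the corresponding fullwidth punctuation).

def split_long_sentence(tokens, max_len=40, punct=frozenset(",;:!?.")):
    """Split a token sequence into chunks of at most max_len tokens,
    preferring to break just after a punctuation token; if the window
    contains none, fall back to an arbitrary token boundary."""
    chunks = []
    while len(tokens) > max_len:
        cut = max_len  # fallback: arbitrary boundary at the window edge
        for i in range(max_len - 1, 0, -1):  # rightmost punctuation wins
            if tokens[i] in punct:
                cut = i + 1  # keep the punctuation with the left chunk
                break
        chunks.append(tokens[:cut])
        tokens = tokens[cut:]
    if tokens:
        chunks.append(tokens)
    return chunks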
Step 3 – Decoders
The joy of diversity

Three different decoders, all previously described in the statistical MT literature, were implemented. It sounds odd to do that under time pressure, but we found possible advantages:

• Detection of certain bugs (actually useful)
• Competition between coders: "mine is better than yours" (did work too)
• Could be fruitful in a rescoring strategy (details later)

Detail: explicit enumeration of the candidate translations (n-best lists)
Step 3 – Decoders
Greedy decoder (Germann et al., 2001)

• ISI ReWrite Decoder available (at least to Canadian residents) at:
  http://www.isi.edu/licensed-sw/rewrite-decoder
• Requires the language model to be trained with the CMU-Cambridge Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997)
  → We found it easier to rewrite the ReWrite Decoder

Hypotheses generated by the hill-climbing search were collected into an n-best list.
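A minimal sketch of the hill-climbing loop, collecting every hypothesis visited during the climb into an n-best pool. The score and neighbours callbacks stand in for the model score (IBM4 plus language model) and the mutation operators (word substitution, reordering, etc.); both are assumptions, not the ReWrite internals.

import heapq

def greedy_decode(initial, neighbours, score, nbest_size=100):
    """Hill-climb from an initial hypothesis (e.g. a word-for-word gloss),
    moving to the best-scoring neighbour until no neighbour improves the
    score; hypotheses must be hashable (e.g. tuples of words)."""
    current, current_score = initial, score(initial)
    visited = {initial: current_score}
    while True:
        best, best_score = None, current_score
        for h in neighbours(current):
            s = visited.get(h)
            if s is None:
                s = score(h)
                visited[h] = s
            if s > best_score:
                best, best_score = h, s
        if best is None:  # local maximum: no neighbour improves the score
            break
        current, current_score = best, best_score
    # n-best list: the highest-scoring hypotheses seen during the climb
    return heapq.nlargest(nbest_size, visited.items(), key=lambda kv: kv[1])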
Step 3 – Decoders
Inverted Alignment Decoder (Nießen et al., 1998)

Shame on us: we also tested the performance of a DP decoder designed for IBM model 2.

for all target positions i = 1, 2, ..., I_max do
    prune(i - 1)
    for all live hypotheses h_i do
        for all words w in the active vocabulary do
            for all fertilities f ∈ {1, 2, 3} do
                for all uncovered source positions j, ..., j+f-1 do
                    consider h', the extension of h_i with w (at target position i) aligned with j, ..., j+f-1
                    if score(h') > Score(i, j, f, c), the best score recorded so far for that state, then
                        keep h' and record back-track information

Best live hypotheses are kept in an n-best list
Step 3 – Decoders
Stack-based Decoder (FST)

• loop until time limit reached:
  – pop best node from the stack
  – if final, add hypothesis to n-best list
  – else
    * expand "exhaustively"
    * add resulting hypotheses to graph and stack
• main properties:
  – all nodes retained in graph
  – fast output of initial hypotheses, with successive refinement
  – precise time control
  – no heuristic function on suffixes
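A minimal sketch of this loop with a single priority queue; the real decoder keeps all nodes in a graph and uses multiple stacks (next slide), and the expand, score and is_final callbacks are assumptions standing in for the model-specific parts.

import heapq
import time

def stack_decode(initial, expand, score, is_final, nbest_size=100, time_limit=180.0):
    """Best-first search over partial translation hypotheses: pop the
    best-scoring node, emit it if it is a complete translation, otherwise
    expand it exhaustively and push the extensions back. Running until
    the time limit means early complete hypotheses are successively
    refined by later, better-scoring ones."""
    counter = 0  # tie-breaker so heapq never has to compare hypotheses
    stack = [(-score(initial), counter, initial)]
    nbest = []
    deadline = time.monotonic() + time_limit
    while stack and time.monotonic() < deadline:
        neg_score, _, h = heapq.heappop(stack)
        if is_final(h):
            nbest.append((-neg_score, h))
            if len(nbest) >= nbest_size:
                break
            continue
        for h2 in expand(h):
            counter += 1
            heapq.heappush(stack, (-score(h2), counter, h2))
    return nbest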
Step 3 – Decoders
Stack-based Decoder (FST), continued

• graph properties:
  – 30M nodes in ≈ 1 GB
  – nodes retain trigram state and source alignments
  – retroactive score correction
• prefix heuristics:
  – multiple stacks to correct for prefix-score bias:
    * number of source and target words
    * unigram logprob
  – pop depends on stack and gain over parent
• timing: max 3 minutes per source sentence; more time gives better model scores but worse NIST scores
Step 3 – Decoders
The cost of diversity

decoder   coding   tuning
greedy       2        1
FST          3        3
IBM2         2        3
total        7        7

Approximate number of person-weeks of development

But it was worth it! (the price of a decent IBM-4 stack-based decoder)
Step 3 – Decoders
Finally

• Two decoders for IBM model 4, one for IBM model 2
• Existential questions: "is it good or bad?", "why?", etc.
• Tuning the compromise between speed and quality is difficult
• Incremental improvements

→ We compared the decoders by translating 100 sentences (of at most 20 words):

• greedy (results within 10 minutes or so)
• fst (results within half an hour or so)
• ibm2-fast (results within a few seconds)
• ibm2-slow (results within half an hour or so)
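The tables that follow report word error rate (WER) and NIST scores. WER is the word-level edit distance between hypothesis and reference, normalized by reference length; a minimal sketch of the standard dynamic program (not the exact evaluation script behind these numbers):

def wer(hyp, ref):
    """Word error rate: Levenshtein distance over words between a
    hypothesis and a reference, as a percentage of reference length."""
    hyp, ref = hyp.split(), ref.split()
    # dist[j] = edit distance between the current hyp prefix and ref[:j]
    dist = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev_diag, dist[0] = dist[0], i
        for j, r in enumerate(ref, 1):
            prev_diag, dist[j] = dist[j], min(
                dist[j] + 1,              # delete h from the hypothesis
                dist[j - 1] + 1,          # insert r into the hypothesis
                prev_diag + (h != r),     # substitute (free if words match)
            )
    return 100.0 * dist[len(ref)] / len(ref)

print(wer("the house small", "the house is small"))  # 25.0 (one missing word)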
The bigger the better? Main factor is decoder type

                      1-best                    100-best
decoder        wer    nist     nist%     wer    nist     nist%
hansard
  greedy      68.93  2.41448   24.20    61.71  3.68806   37.00
  ibm2-fast   65.87  3.22954   32.30    59.22  4.42125   44.20
  ibm2-slow   63.85  3.85769   38.50    53.03  5.28764   52.80
  fst         62.86  4.19043   41.90    55.24  5.10464   51.00
un
  greedy      70.35  2.76181   26.10    62.97  3.98415   37.70
  ibm2-fast   69.80  3.19254   30.20    63.04  4.38660   41.50
  ibm2-slow   68.77  4.39036   41.50    58.65  5.77882   54.60
  fst         65.57  4.56739   43.20    57.18  5.80536   54.90
sinorama
  greedy      86.89  0.79860    7.80    82.16  1.37465   13.40
  ibm2-fast   87.55  1.09399   10.30    82.45  1.68875   15.80
  ibm2-slow   87.56  1.46096   13.70    81.55  2.44893   23.00
  fst         88.97  1.72001   16.10    85.40  2.35273   22.00
xinhua
  greedy      89.64  1.30970   12.70    85.10  2.00496   19.40
  ibm2-fast   91.09  1.08899   10.30    85.90  1.86932   17.70
  ibm2-slow   89.13  1.34132   12.70    83.86  2.29718   21.80
  fst         90.82  1.08510   10.30    87.98  1.56167   14.80
The bigger the better? Search space is also important

(same table as on the previous slide; compare ibm2-fast with ibm2-slow: the same model with a wider search scores better on every corpus)