Statistical Machine Translation: Rapid Development with Limited Resources

George Foster, Simona Gandrabur, Philippe Langlais, Pierre Plamondon, Graham Russell and Michel Simard
RALI-DIRO, Université de Montréal
CP. 6128, succursale Centre-Ville, Montréal (Québec) Canada, H3C 3J7
www-rali.iro.umontreal.ca

MT-Summit IX — New Orleans
Motivation

What progress can a small team of developers expect to achieve in creating a statistical MT system for an unfamiliar language, using only data and technology readily available in-house, or at short notice from external sources?

• Work conducted within the NIST 2003 MT evaluation task
  http://www.nist.gov/speech/tests/mt/
• Chinese-to-English task
• Computing resources: Pentium-4 class PCs with a maximum of 1 GB RAM
We had a plan...

Rescoring approach built on top of a roughly state-of-the-art translation model such as IBM Model 4

• Extensively used in automatic speech recognition on n-best lists or word graphs (Ortmanns et al., 1997; Rose and Riccardi, 1999)
• More recently proposed for use in SMT (Och and Ney, 2002; Soricut et al., 2002; Ueffing et al., 2002)
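To make the rescoring idea concrete, here is a minimal sketch of log-linear rescoring of an n-best list. The feature names and weights are illustrative assumptions, not the features actually used in this system.

def rescore_nbest(nbest, weights):
    """Rerank an n-best list with a log-linear combination of feature scores.

    nbest   -- list of (hypothesis, features) pairs, where features maps
               feature names to log-domain scores (e.g. translation model,
               language model, length penalty)
    weights -- dict of feature weights, typically tuned on held-out data
    """
    def combined(features):
        return sum(weights[name] * score for name, score in features.items())
    return sorted(nbest, key=lambda pair: combined(pair[1]), reverse=True)

# Hypothetical example: two candidate translations with made-up scores.
nbest = [
    ("the house is small", {"tm": -12.3, "lm": -8.1, "len": 4}),
    ("small the house",    {"tm": -11.9, "lm": -14.2, "len": 3}),
]
weights = {"tm": 1.0, "lm": 0.8, "len": 0.1}
print(rescore_nbest(nbest, weights)[0][0])  # the better LM score wins here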
Step 1 – Preparing with Canadian Hansards (Alegrías)

• Install necessary packages
• Train translation and language models
• Write IBM model 2 & 4 decoders

→ IBM4 models trained with GIZA++ and mkcls
  www-i6.informatik.rwth-aachen.de/Colleagues/och
→ Language and IBM model 2 trained with in-house packages
→ Multiple search strategy (Nießen et al., 1998; Germann et al., 2001)
Step 1 – 3-4 weeks later... (Seguiriyas)

We've got it! Our first English-French GIZA++ model was ready to use.

• Establishing the limits of the package (maximum input size, etc.)
  → 2-3 days of computation to train on a corpus of 1 million sentence pairs
  → can't train with more data (memory problems)
• Running mkcls
  → around 10 hours of computation to cluster a vocabulary into 50 classes
• Writing wrappers for the model data structures
Step 2 – Corpus Preprocessing

SMT is not exactly language blind...

The Linguistic Data Consortium (LDC) distributed the training data for the NIST task (at least partially).
http://www.ldc.upenn.edu/

→ A surprising variety of formats (sounds nice but is not)
→ Word boundaries inserted by means of a revised version of the mansegment program supplied by the LDC
→ Doubt: is our sentence aligner supposed to work for Chinese/English corpora?

One person-month of effort for a judicious mixture of automatic and semi-automatic approaches
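The doubt is reasonable: sentence aligners of that era typically leaned on sentence-length statistics, often supplemented with cognate cues that are useless across scripts, and the classic length-based cost of Gale and Church (1993) was calibrated on European language pairs. A minimal sketch of that cost; the parameter defaults are the published English/French values, shown only to illustrate what would need re-estimating for Chinese/English.

import math

def gale_church_cost(len_src, len_tgt, c=1.0, s2=6.8):
    """Length-based match cost for a candidate 1-1 sentence pair, after
    Gale and Church (1993): how surprising is the target length given
    the source length, under a Gaussian model of character lengths?

    c  -- expected ratio of target to source length (in characters)
    s2 -- variance of that ratio; c=1.0 and s2=6.8 are the published
          English/French values and would need re-estimating here
    """
    if len_src == 0:
        return float("inf") if len_tgt > 0 else 0.0
    delta = (len_tgt - len_src * c) / math.sqrt(len_src * s2)
    # Gaussian negative log-likelihood, up to an additive constant
    return delta * delta / 2.0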
Step 2 – Corpus Preprocessing
Take one

• For the NIST exercise, only pre-aligned texts were used.
  → Some regions acknowledged by the supplier to be potentially unreliable were omitted.
• Instead of recompiling GIZA++ to handle sentences longer than 40 words, we devised a knowledge-poor splitter relying heavily on punctuation (see the sketch below).
  → In cases where no suitable punctuation existed, sentences were split at an arbitrary token boundary.
  → Mostly a mix of un and hansard was used to train language and translation models
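A minimal sketch of such a splitter, assuming whitespace-tokenized input. The 40-word window matches the GIZA++ limit mentioned above; the punctuation set is an illustrative assumption (the Chinese side would need the corresponding fullwidth punctuation).

def split_long_sentence(tokens, max_len=40, punct=frozenset(",;:!?.")):
    """Split a token sequence into chunks of at most max_len tokens,
    preferring to break just after a punctuation token; if the window
    contains none, fall back to an arbitrary token boundary."""
    chunks = []
    while len(tokens) > max_len:
        cut = max_len  # fallback: arbitrary boundary at the window edge
        for i in range(max_len - 1, 0, -1):  # rightmost punctuation wins
            if tokens[i] in punct:
                cut = i + 1  # keep the punctuation with the left chunk
                break
        chunks.append(tokens[:cut])
        tokens = tokens[cut:]
    if tokens:
        chunks.append(tokens)
    return chunks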
Step 3 – Decoders
The joy of diversity

Three different decoders, all previously described in the statistical MT literature, were implemented. It sounds odd to do that under time pressure, but we found possible advantages:

• Detection of certain bugs (actually useful)
• Competition between coders: "mine is better than yours" (did work too)
• Could be fruitful in a rescoring strategy (details later)

Detail: explicit enumeration of the candidate translations (n-best lists)
Step 3 – Decoders
Greedy decoder (Germann et al., 2001)

• ISI ReWrite Decoder available (at least to Canadian residents) at:
  http://www.isi.edu/licensed-sw/rewrite-decoder
• Requires the language model to be trained with the CMU-Cambridge Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997)
  → We found it easier to rewrite the ReWrite Decoder

Hypotheses generated by the hill-climbing search were collected into an n-best list.
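A minimal sketch of the hill-climbing loop, collecting every hypothesis visited during the climb into an n-best pool. The score and neighbours callbacks stand in for the model score (IBM4 plus language model) and the mutation operators (word substitution, reordering, etc.); both are assumptions, not the ReWrite internals.

import heapq

def greedy_decode(initial, neighbours, score, nbest_size=100):
    """Hill-climb from an initial hypothesis (e.g. a word-for-word gloss),
    moving to the best-scoring neighbour until no neighbour improves the
    score; hypotheses must be hashable (e.g. tuples of words)."""
    current, current_score = initial, score(initial)
    visited = {initial: current_score}
    while True:
        best, best_score = None, current_score
        for h in neighbours(current):
            s = visited.get(h)
            if s is None:
                s = score(h)
                visited[h] = s
            if s > best_score:
                best, best_score = h, s
        if best is None:  # local maximum: no neighbour improves the score
            break
        current, current_score = best, best_score
    # n-best list: the highest-scoring hypotheses seen during the climb
    return heapq.nlargest(nbest_size, visited.items(), key=lambda kv: kv[1])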
Step 3 – Decoders
Inverted Alignment Decoder (Nießen et al., 1998)

Shame on us: we also tested the performance of a DP decoder designed for IBM model 2.

for all target positions i = 1, 2, ..., I_max do
    prune(i - 1)
    for all live hypotheses h_i do
        for all words w in the active vocabulary do
            for all fertilities f ∈ {1, 2, 3} do
                for all uncovered source positions j, ..., j+f-1 do
                    consider h', the extension of h_i with w (at target position i) aligned with j, ..., j+f-1
                    if score(h') > Score(i, j, f, c), the best score recorded so far for that state, then
                        keep h' and record back-track information

Best live hypotheses are kept in an n-best list
Step 3 – Decoders
Stack-based Decoder (FST)

• loop until time limit reached:
  – pop best node from the stack
  – if final, add hypothesis to n-best list
  – else
    * expand "exhaustively"
    * add resulting hypotheses to graph and stack
• main properties:
  – all nodes retained in graph
  – fast output of initial hypotheses, with successive refinement
  – precise time control
  – no heuristic function on suffixes
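A minimal sketch of this loop with a single priority queue; the real decoder keeps all nodes in a graph and uses multiple stacks (next slide), and the expand, score and is_final callbacks are assumptions standing in for the model-specific parts.

import heapq
import time

def stack_decode(initial, expand, score, is_final, nbest_size=100, time_limit=180.0):
    """Best-first search over partial translation hypotheses: pop the
    best-scoring node, emit it if it is a complete translation, otherwise
    expand it exhaustively and push the extensions back. Running until
    the time limit means early complete hypotheses are successively
    refined by later, better-scoring ones."""
    counter = 0  # tie-breaker so heapq never has to compare hypotheses
    stack = [(-score(initial), counter, initial)]
    nbest = []
    deadline = time.monotonic() + time_limit
    while stack and time.monotonic() < deadline:
        neg_score, _, h = heapq.heappop(stack)
        if is_final(h):
            nbest.append((-neg_score, h))
            if len(nbest) >= nbest_size:
                break
            continue
        for h2 in expand(h):
            counter += 1
            heapq.heappush(stack, (-score(h2), counter, h2))
    return nbest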
Step 3 – Decoders
Stack-based Decoder (FST), continued

• graph properties:
  – 30M nodes in ≈ 1 GB
  – nodes retain trigram state and source alignments
  – retroactive score correction
• prefix heuristics:
  – multiple stacks to correct for prefix-score bias:
    * number of source and target words
    * unigram logprob
  – pop depends on stack and gain over parent
• timing: max 3 minutes per source sentence; more time gives better model scores but worse NIST scores
Step 3 – Decoders
The cost of diversity

decoder   coding   tuning
greedy       2        1
FST          3        3
IBM2         2        3
total        7        7

Approximate number of person-weeks of development

But it was worth it! (the price of a decent IBM-4 stack-based decoder)
Step 3 – Decoders
Finally

• Two decoders for IBM model 4, one for IBM model 2
• Existential questions: "is it good or bad?", "why?", etc.
• Tuning the compromise between speed and quality is difficult
• Incremental improvements

→ We compared the decoders by translating 100 sentences (of at most 20 words):

• greedy (results within 10 minutes or so)
• fst (results within half an hour or so)
• ibm2-fast (results within a few seconds)
• ibm2-slow (results within half an hour or so)
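The tables that follow report word error rate (WER) and NIST scores. WER is the word-level edit distance between hypothesis and reference, normalized by reference length; a minimal sketch of the standard dynamic program (not the exact evaluation script behind these numbers):

def wer(hyp, ref):
    """Word error rate: Levenshtein distance over words between a
    hypothesis and a reference, as a percentage of reference length."""
    hyp, ref = hyp.split(), ref.split()
    # dist[j] = edit distance between the current hyp prefix and ref[:j]
    dist = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev_diag, dist[0] = dist[0], i
        for j, r in enumerate(ref, 1):
            prev_diag, dist[j] = dist[j], min(
                dist[j] + 1,              # delete h from the hypothesis
                dist[j - 1] + 1,          # insert r into the hypothesis
                prev_diag + (h != r),     # substitute (free if words match)
            )
    return 100.0 * dist[len(ref)] / len(ref)

print(wer("the house small", "the house is small"))  # 25.0 (one missing word)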
The bigger the better? Main factor is decoder type

                      1-best                    100-best
decoder        wer    nist     nist%     wer    nist     nist%
hansard
  greedy      68.93  2.41448   24.20    61.71  3.68806   37.00
  ibm2-fast   65.87  3.22954   32.30    59.22  4.42125   44.20
  ibm2-slow   63.85  3.85769   38.50    53.03  5.28764   52.80
  fst         62.86  4.19043   41.90    55.24  5.10464   51.00
un
  greedy      70.35  2.76181   26.10    62.97  3.98415   37.70
  ibm2-fast   69.80  3.19254   30.20    63.04  4.38660   41.50
  ibm2-slow   68.77  4.39036   41.50    58.65  5.77882   54.60
  fst         65.57  4.56739   43.20    57.18  5.80536   54.90
sinorama
  greedy      86.89  0.79860    7.80    82.16  1.37465   13.40
  ibm2-fast   87.55  1.09399   10.30    82.45  1.68875   15.80
  ibm2-slow   87.56  1.46096   13.70    81.55  2.44893   23.00
  fst         88.97  1.72001   16.10    85.40  2.35273   22.00
xinhua
  greedy      89.64  1.30970   12.70    85.10  2.00496   19.40
  ibm2-fast   91.09  1.08899   10.30    85.90  1.86932   17.70
  ibm2-slow   89.13  1.34132   12.70    83.86  2.29718   21.80
  fst         90.82  1.08510   10.30    87.98  1.56167   14.80
The bigger the better? Search space is also important

(same table as on the previous slide; compare ibm2-fast with ibm2-slow: the same model with a wider search scores better on every corpus)