1. Statistical machine translation in a few slides
Mikel L. Forcada 1,2
1 Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)
2 Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)
April 14-16, 2009: Free/open-source MT tutorial at the CNGL

2. Translation as probability/1
◮ Instead of saying that
  ◮ a source-language (SL) sentence s in an SL text
  ◮ and a target-language (TL) sentence t as found in an SL–TL bitext
  are or are not a translation of each other,
◮ in SMT one says that they are a translation of each other with a probability p(s, t) = p(t, s) (a joint probability).
◮ We’ll assume we have such a probability model available, or at least a reasonable estimate of it.

3. Translation as probability/2
◮ According to basic probability laws, we can write:
  p(s, t) = p(t, s) = p(s|t) p(t) = p(t|s) p(s)   (1)
  where p(x|y) is the conditional probability of x given y.
◮ We are interested in translating from SL to TL. That is, we want to find the most likely translation given the SL sentence s:
  t⋆ = arg max_t p(t|s)   (2)

4. The “canonical” model
◮ We can rewrite eq. (1) as
  p(t|s) = p(s|t) p(t) / p(s)   (3)
◮ and then combine it with (2) to get
  t⋆ = arg max_t p(s|t) p(t)   (4)
  (p(s) is constant for a given s, so it does not affect the arg max.)
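As a toy illustration of eq. (4), one can score an explicit list of candidate translations with invented model probabilities (a real decoder never enumerates candidates like this; all sentences and values below are made up):

```python
import math

# Toy reverse translation model p(s|t) and target-language model p(t).
# All sentences and probabilities are invented for illustration.
def reverse_tm(s, t):
    table = {
        ("la casa blanca", "the white house"): 0.6,
        ("la casa blanca", "the house white"): 0.3,
        ("la casa blanca", "white house the"): 0.1,
    }
    return table.get((s, t), 1e-9)

def lm(t):
    table = {
        "the white house": 0.5,
        "the house white": 0.05,
        "white house the": 0.001,
    }
    return table.get(t, 1e-9)

def best_translation(s, candidates):
    # Eq. (4): t* = arg max_t p(s|t) p(t), computed in log space for stability.
    return max(candidates, key=lambda t: math.log(reverse_tm(s, t)) + math.log(lm(t)))

print(best_translation("la casa blanca",
                       ["the white house", "the house white", "white house the"]))
# -> "the white house"
```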

5. “Decoding”/1
  t⋆ = arg max_t p(s|t) p(t)
◮ We have a product of two probability models:
  ◮ a reverse translation model p(s|t), which tells us how likely it is that the SL sentence s is a translation of the candidate TL sentence t, and
  ◮ a target-language model p(t), which tells us how likely the sentence t is in the TL side of bitexts.
◮ These may be related (respectively) to the usual notions of
  ◮ [reverse] adequacy: how much of the meaning of t is conveyed by s;
  ◮ fluency: how fluent the candidate TL sentence is.
◮ The arg max strikes a balance between the two.

6. “Decoding”/2
◮ In SMT parlance, the process of finding t⋆ is called decoding.¹
◮ Obviously, the decoder does not explore all possible translations t in the search space: there are infinitely many.
◮ The search space is pruned.
◮ Therefore, one just gets a reasonable approximation to t⋆ instead of the ideal t⋆.
◮ Pruning and search strategies are a very active research topic. Free/open-source software: Moses.
¹ Reading SMT articles usually entails deciphering jargon which may be very obscure to outsiders or newcomers.
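A minimal sketch of the pruned-search idea, assuming a tiny invented phrase table, a stand-in language-model score, and monotone (left-to-right, no reordering) decoding; real decoders such as Moses handle reordering, richer models and much larger search spaces:

```python
from heapq import nlargest

# Toy phrase table: SL phrase -> list of (TL phrase, log p(SL phrase | TL phrase)).
# All entries and scores are invented for illustration.
PHRASE_TABLE = {
    ("la",): [("the", -0.1)],
    ("casa",): [("house", -0.2), ("home", -0.9)],
    ("casa", "blanca"): [("white house", -0.3)],
    ("blanca",): [("white", -0.2)],
}

def lm_logprob(words):
    # Stand-in for a real target-language model score (e.g. from irstlm).
    return -0.5 * len(words)

def decode(src_words, beam_size=3):
    n = len(src_words)
    # stacks[i] holds hypotheses covering exactly the first i source words:
    # (log score, tuple of TL words produced so far).
    stacks = [[] for _ in range(n + 1)]
    stacks[0] = [(0.0, ())]
    for covered in range(n):
        # Pruning: only the `beam_size` best hypotheses in a stack are expanded.
        for score, out in nlargest(beam_size, stacks[covered]):
            for end in range(covered + 1, n + 1):
                src_phrase = tuple(src_words[covered:end])
                for tgt, tm_logp in PHRASE_TABLE.get(src_phrase, []):
                    new_words = tuple(tgt.split())
                    # Add the translation-model score and the LM score of the
                    # newly produced TL words.
                    new_score = score + tm_logp + lm_logprob(new_words)
                    stacks[end].append((new_score, out + new_words))
    if not stacks[n]:
        return None  # untranslatable under this toy phrase table
    return " ".join(max(stacks[n])[1])

print(decode("la casa blanca".split()))  # -> "the white house"
```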

7. Training/1
◮ So where do these probabilities come from?
◮ p(t) may easily be estimated from a large monolingual TL corpus (free/open-source software: irstlm).
◮ The estimation of p(s|t) is more complex. It is usually made of:
  ◮ a lexical model describing the probability that the translation of a certain TL word or sequence of words (“phrase”²) is a certain SL word or sequence of words;
  ◮ an alignment model describing the reordering of words or “phrases”.
² A very unfortunate choice in SMT jargon.
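A minimal sketch of how p(t) can be estimated, assuming a tiny invented TL corpus and a bigram model with add-one smoothing; toolkits such as irstlm use far larger corpora and better smoothing schemes:

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    # Count unigrams and bigrams over a monolingual TL corpus,
    # with sentence-boundary markers.
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    vocab_size = len({w for s in sentences for w in s.split()}) + 1  # + </s>

    def logprob(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        # Add-one smoothing; real toolkits use better smoothing.
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
                   for a, b in zip(words[:-1], words[1:]))
    return logprob

lm = train_bigram_lm(["the white house", "the house is white", "the house is big"])
print(lm("the white house"), lm("house the white"))  # the fluent order scores higher
```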

8. Training/2
◮ The lexical model and the alignment model are estimated from a large sentence-aligned bilingual corpus through a complex iterative process.
◮ An initial set of lexical probabilities is obtained by assuming, for instance, that any word in the TL sentence aligns with any word in its SL counterpart. And then:
  ◮ alignment probabilities are computed in accordance with the lexical probabilities;
  ◮ lexical probabilities are re-estimated in accordance with the alignment probabilities.
  This process (“expectation maximization”) is repeated a fixed number of times or until some convergence is observed (free/open-source software: Giza++).
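A minimal sketch of this expectation-maximization idea in the style of IBM Model 1, on an invented three-sentence bitext; Giza++ implements much richer alignment models over millions of sentence pairs:

```python
from collections import defaultdict

# Toy sentence-aligned bitext, invented for illustration.
BITEXT = [
    ("la casa".split(), "the house".split()),
    ("la casa blanca".split(), "the white house".split()),
    ("casa blanca".split(), "white house".split()),
]

def train_model1(bitext, iterations=10):
    # Lexical probabilities lex[(s_word, t_word)] ~ p(s_word | t_word),
    # initialised uniformly: any TL word may translate into any SL word.
    sl_vocab = {w for s_sent, _ in bitext for w in s_sent}
    lex = defaultdict(lambda: 1.0 / len(sl_vocab))
    for _ in range(iterations):
        counts = defaultdict(float)   # expected counts c(s_word, t_word)
        totals = defaultdict(float)   # expected counts c(t_word)
        # E step: distribute each SL word's alignment mass over the TL words
        # of the paired sentence, according to the current lexical model.
        for s_sent, t_sent in bitext:
            for s_w in s_sent:
                norm = sum(lex[(s_w, t_w)] for t_w in t_sent)
                for t_w in t_sent:
                    frac = lex[(s_w, t_w)] / norm
                    counts[(s_w, t_w)] += frac
                    totals[t_w] += frac
        # M step: re-estimate the lexical probabilities from the expected counts.
        for (s_w, t_w), c in counts.items():
            lex[(s_w, t_w)] = c / totals[t_w]
    return lex

lex = train_model1(BITEXT)
# "casa" should come out far more probable given "house" than given "white".
print(round(lex[("casa", "house")], 2), round(lex[("casa", "white")], 2))
```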

9. Training/3
◮ In “phrase-based” SMT, alignments may be used to extract
  ◮ (SL-phrase, TL-phrase) pairs of phrases
  ◮ and their corresponding probabilities
  for easier decoding and to avoid “word salad”.
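A minimal sketch of extracting consistent phrase pairs from one word-aligned sentence pair, with an invented alignment; the Moses training scripts implement the full procedure and also attach probabilities to the extracted pairs:

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            for j1 in range(len(tgt)):
                for j2 in range(j1, min(j1 + max_len, len(tgt))):
                    links_inside = [(i, j) for i, j in alignment
                                    if i1 <= i <= i2 and j1 <= j <= j2]
                    # Consistency: at least one alignment link inside the box,
                    # and no link leaving the box on either side.
                    consistent = links_inside and all(
                        (i1 <= i <= i2) == (j1 <= j <= j2) for i, j in alignment)
                    if consistent:
                        pairs.add((" ".join(src[i1:i2 + 1]),
                                   " ".join(tgt[j1:j2 + 1])))
    return pairs

src = "la casa blanca".split()          # SL sentence (invented example)
tgt = "the white house".split()         # TL sentence
alignment = [(0, 0), (1, 2), (2, 1)]    # (SL index, TL index) word alignment links
for pair in sorted(extract_phrases(src, tgt, alignment)):
    print(pair)
```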

10. “Log-linear”/1
◮ More SMT jargon!
◮ It’s short for a linear combination of logarithms of probabilities.
◮ And, sometimes, even of features that aren’t logarithms or probabilities of any kind.
◮ OK, let’s take a look at the maths.

11. “Log-linear”/2
◮ One can write a more general formula:
  p(t|s) = exp( Σ_{k=1..n_F} λ_k f_k(t, s) ) / Z   (5)
  with n_F feature functions f_k(t, s), which can depend on s, t or both.
◮ Setting n_F = 2, f_1(t, s) = log p(s|t), f_2(t, s) = log p(t), λ_1 = λ_2 = 1, and Z = p(s), one recovers the canonical formula (3).
◮ The best translation is then
  t⋆ = arg max_t Σ_{k=1..n_F} λ_k f_k(t, s)   (6)
  Most of the f_k(t, s) are logarithms, hence “log-linear”.
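A toy illustration of eq. (6), with invented feature values and weights: two log-probability features plus a length penalty that is not the logarithm of anything:

```python
import math

# Toy feature functions; the probability tables and weights are invented.
def features(t, s):
    tm = {"the white house": 0.6, "the house white": 0.3}   # stand-in for p(s|t)
    lm = {"the white house": 0.5, "the house white": 0.05}  # stand-in for p(t)
    return [
        math.log(tm.get(t, 1e-9)),              # f_1 = log p(s|t)
        math.log(lm.get(t, 1e-9)),              # f_2 = log p(t)
        -abs(len(t.split()) - len(s.split())),  # f_3 = length penalty (not a log prob)
    ]

LAMBDAS = [1.0, 0.8, 0.3]  # λ_k, normally tuned on a held-out tuning set

def loglinear_best(s, candidates):
    # Eq. (6): t* = arg max_t Σ_k λ_k f_k(t, s)
    return max(candidates,
               key=lambda t: sum(l * f for l, f in zip(LAMBDAS, features(t, s))))

print(loglinear_best("la casa blanca", ["the white house", "the house white"]))
```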

12. “Log-linear”/3
◮ “Feature selection is a very open problem in SMT” (Lopez 2008).
◮ Other possible feature functions include length penalties (discouraging unreasonably short or long translations), “inverted” versions of p(s|t), etc.
◮ Where do we get the λ_k’s from?
◮ They are usually tuned so as to optimize the results on a tuning set, according to a certain objective function that
  ◮ is taken to be an indicator that correlates with translation quality, and
  ◮ may be automatically computed from the output of the SMT system and the reference translation in the corpus.
  This is sometimes called MERT (minimum error rate training) (free/open-source software: the Moses suite).
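This is not actual MERT, but a hedged sketch of the tuning idea: try a small grid of λ vectors and keep the one whose output scores best on the tuning set. Here decode_with_weights and quality_score are assumed helpers standing in for the decoder and for an automatic metric such as BLEU:

```python
from itertools import product

def tune(tuning_src, tuning_ref, decode_with_weights, quality_score,
         grid=(0.2, 0.5, 1.0), n_features=3):
    # Brute-force stand-in for MERT: evaluate every weight vector on the grid
    # and keep the one whose translations score best against the references.
    best_lambdas, best_score = None, float("-inf")
    for lambdas in product(grid, repeat=n_features):
        outputs = [decode_with_weights(s, lambdas) for s in tuning_src]
        score = quality_score(outputs, tuning_ref)
        if score > best_score:
            best_lambdas, best_score = lambdas, score
    return best_lambdas
```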

13. Ain’t got nothin’ but the BLEUs?
◮ The most famous “quality indicator” is called BLEU, but there are many others.
◮ BLEU counts which fraction of the 1-word, 2-word, ..., n-word sequences in the output match the reference translation.
◮ Correlation with subjective assessments of quality is still an open question.
◮ A lot of SMT research is currently BLEU-driven and makes little contact with real applications of MT.
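A minimal sketch of the counting idea behind BLEU, restricted to the modified n-gram precision of a single sentence; full BLEU combines the precisions for n = 1..4 geometrically and adds a brevity penalty:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Each candidate n-gram is "clipped" to the number of times it appears
    # in the reference.
    matches = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return matches / max(sum(cand_ngrams.values()), 1)

print(ngram_precision("the white house", "the white house", 2))  # 1.0
print(ngram_precision("white the house", "the white house", 2))  # 0.0
```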

14. The SMT lifecycle
Development:
  Training: monolingual and sentence-aligned bilingual corpora are used to estimate probability models (features).
  Tuning: a held-out portion of the sentence-aligned bilingual corpus is used to tune the coefficients λ_k.
Decoding: sentences s are fed into the SMT system and “decoded” into their translations t.
Evaluation: the system is evaluated against a reference corpus.

15. License
This work may be distributed under the terms of
◮ the Creative Commons Attribution–Share Alike license: http://creativecommons.org/licenses/by-sa/3.0/
◮ the GNU GPL v. 3.0 license: http://www.gnu.org/licenses/gpl.html
Dual license! E-mail me to get the sources: mlf@ua.es
