Machine Translation 2: Statistical MT: Phrase-Based and Neural
Ondřej Bojar (bojar@ufal.mff.cuni.cz)
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics, Charles University, Prague
December 2018
Outline of Lectures on MT
1. Introduction.
• Why is MT difficult.
• MT evaluation.
• Approaches to MT.
• First peek into phrase-based MT.
• Document, sentence and word alignment.
2. Statistical Machine Translation.
• Phrase-based: Assumptions, beam search, key issues.
• Neural MT: Sequence-to-sequence, attention, self-attentive.
3. Advanced Topics.
• Linguistic Features in SMT and NMT.
• Multilinguality, Multi-Task, Learned Representations.
Outline of MT Lecture 2
1. What makes MT statistical.
• Brute-force statistical MT.
• Noisy channel model.
• Log-linear model.
2. Phrase-based translation model.
• Phrase extraction.
• Decoding (gradual construction of hypotheses).
• Minimum error-rate training (weight optimization).
3. Neural machine translation (NMT).
• Sequence-to-sequence, with attention.
Quotes
Warren Weaver (1949): I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.
Noam Chomsky (1969): . . . the notion "probability of a sentence" is an entirely useless one, under any known interpretation of this term.
Frederick Jelinek (80's; IBM; later JHU and sometimes ÚFAL): Every time I fire a linguist, the accuracy goes up.
Hermann Ney (RWTH Aachen University): MT = Linguistic Modelling + Statistical Decision Theory
The Statistical Approach
(Statistical = Information-theoretic.)
• Specify a probabilistic model. = How is the probability mass distributed among possible outputs given observed inputs.
• Specify the training criterion and procedure. = How to learn free parameters from training data.
Notice:
• Linguistics helpful when designing the models:
– How to divide input into smaller units.
– Which bits of observations are more informative.
Statistical MT
Given a source (foreign) language sentence $f_1^J = f_1 \ldots f_j \ldots f_J$, produce a target language (English) sentence $e_1^I = e_1 \ldots e_i \ldots e_I$.
Among all possible target language sentences, choose the sentence with the highest probability:
$$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} p(e_1^I \mid f_1^J) \qquad (1)$$
We stick to the $e_1^I$, $f_1^J$ notation despite translating from English to Czech.
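To make the decision rule in Equation 1 concrete, here is a minimal Python sketch. It assumes we already have some conditional model and a small explicit candidate list; both are hypothetical stand-ins, since a real system searches a huge implicit space rather than enumerating candidates.

```python
# Decision rule of Eq. (1): pick the candidate with the highest model probability.
# `p_e_given_f` and the candidate list are hypothetical stand-ins for a real model/search.

def translate(f, candidates, p_e_given_f):
    """Return the candidate e maximizing p(e | f)."""
    return max(candidates, key=lambda e: p_e_given_f(e, f))

# Toy usage with a made-up model:
toy_model = lambda e, f: {"Dobré ráno.": 0.7, "Dobrý den.": 0.2, "Ahoj.": 0.1}.get(e, 0.0)
print(translate("Good morning.", ["Dobré ráno.", "Dobrý den.", "Ahoj."], toy_model))
```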
Brute-Force MT (1/2)
Translate only sentences listed in a "translation memory" (TM):
Good morning. = Dobré ráno.
How are you? = Jak se máš?
How are you? = Jak se máte?
$$p(e_1^I \mid f_1^J) = \begin{cases} 1 & \text{if } e_1^I = f_1^J \text{ seen in the TM} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
Any problems with the definition?
• Not a probability. There may be $f_1^J$ s.t. $\sum_{e_1^I} p(e_1^I \mid f_1^J) > 1$.
⇒ Have to normalize: use $\frac{\mathrm{count}(e_1^I, f_1^J)}{\mathrm{count}(f_1^J)}$ instead of 1.
• Not "smooth", no generalization:
Good morning. ⇒ Dobré ráno.
Brute-Force MT (2/2)
Translate only sentences listed in a "translation memory" (TM):
Good morning. = Dobré ráno.
How are you? = Jak se máš?
How are you? = Jak se máte?
$$p(e_1^I \mid f_1^J) = \begin{cases} 1 & \text{if } e_1^I = f_1^J \text{ seen in the TM} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
• Not a probability. There may be $f_1^J$ s.t. $\sum_{e_1^I} p(e_1^I \mid f_1^J) > 1$.
⇒ Have to normalize: use $\frac{\mathrm{count}(e_1^I, f_1^J)}{\mathrm{count}(f_1^J)}$ instead of 1.
• Not "smooth", no generalization:
Good morning. ⇒ Dobré ráno.
Good evening. ⇒ ∅
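A small sketch of the normalized brute-force model, assuming a toy translation memory: the relative-frequency estimate count(e, f)/count(f) fixes the "not a probability" issue, but anything unseen still gets probability zero.

```python
from collections import Counter

# Toy translation memory of (source, target) pairs, as on the slide.
tm = [
    ("Good morning.", "Dobré ráno."),
    ("How are you?", "Jak se máš?"),
    ("How are you?", "Jak se máte?"),
]

pair_count = Counter(tm)                  # count(e, f)
src_count = Counter(f for f, e in tm)     # count(f)

def p_brute_force(e, f):
    """p(e | f) = count(e, f) / count(f); 0 for any unseen source sentence."""
    return pair_count[(f, e)] / src_count[f] if src_count[f] else 0.0

print(p_brute_force("Jak se máš?", "How are you?"))    # 0.5
print(p_brute_force("Dobrý večer.", "Good evening."))  # 0.0 -- no generalization
```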
Bayes' Law
Bayes' law for conditional probabilities: $p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$
So in our case:
$$\begin{aligned}
\hat{e}_1^{\hat{I}} &= \operatorname*{argmax}_{I,\, e_1^I} p(e_1^I \mid f_1^J) && \text{apply Bayes' law} \\
&= \operatorname*{argmax}_{I,\, e_1^I} \frac{p(f_1^J \mid e_1^I)\, p(e_1^I)}{p(f_1^J)} && p(f_1^J) \text{ constant} \Rightarrow \text{irrelevant in maximization} \\
&= \operatorname*{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I)
\end{aligned}$$
Also called the "Noisy Channel" model.
Motivation for Noisy Channel
$$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I) \qquad (4)$$
Bayes' law divided the model into components:
$p(f_1^J \mid e_1^I)$ . . . Translation model ("reversed", $e_1^I \to f_1^J$): is it a likely translation?
$p(e_1^I)$ . . . Language model (LM): is the output a likely sentence of the target language?
• The components can be trained on different sources. There are far more monolingual data ⇒ language model more reliable.
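The following sketch illustrates the noisy-channel decision rule of Equation 4 on made-up numbers: the reversed translation model and the language model stay separate components, and candidates are ranked by their product, computed in log space. All tables and values are invented for illustration.

```python
import math

# Toy component models (probabilities). In practice the TM is trained on parallel
# data and the LM on much larger monolingual data; these dictionaries are made up.
TM = {("Good morning.", "Dobré ráno."): 0.6, ("Good morning.", "Dobré jitro."): 0.6}
LM = {"Dobré ráno.": 0.05, "Dobré jitro.": 0.005}

def log_p_f_given_e(f, e):
    return math.log(TM.get((f, e), 1e-9))

def log_p_e(e):
    return math.log(LM.get(e, 1e-9))

def noisy_channel_best(f, candidates):
    """argmax_e p(f | e) * p(e), computed in log space."""
    return max(candidates, key=lambda e: log_p_f_given_e(f, e) + log_p_e(e))

# The LM breaks the tie between two equally likely reversed translations:
print(noisy_channel_best("Good morning.", ["Dobré ráno.", "Dobré jitro."]))
```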
Without Equations
[Diagram: Parallel Texts train the Translation Model, Monolingual Texts train the Language Model; a Global Search uses both to map the Input to the Output, i.e. the sentence with the highest probability.]
Summary of Language Models
• $p(e_1^I)$ should report how "good" sentence $e_1^I$ is.
• We surely want p(The the the.) < p(Hello.)
• How about p(The cat was black.) < p(Hello.)? . . . We don't really care in MT. We hope to compare synonymous sentences.
LM is usually a 3-gram language model:
p(⟨s⟩ ⟨s⟩ The cat was black . ⟨/s⟩ ⟨/s⟩) = p(The | ⟨s⟩ ⟨s⟩) · p(cat | ⟨s⟩ The) · p(was | The cat) · p(black | cat was) · p(. | was black) · p(⟨/s⟩ | black .) · p(⟨/s⟩ | . ⟨/s⟩)
Formally, with n = 3:
$$p_{LM}(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1}) \qquad (5)$$
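A sketch of Equation 5 for n = 3, assuming a hypothetical trigram table: the sentence is padded with two start symbols and, for simplicity, a single end symbol, and scored as a product of trigram probabilities.

```python
# Trigram LM of Eq. (5): p(e_1^I) = prod_i p(e_i | e_{i-2}, e_{i-1}),
# with <s> padding at the start and </s> appended at the end.

def trigram_lm_prob(sentence, p):
    """`sentence` is a list of tokens; `p[(u, v, w)]` is p(w | u v)."""
    tokens = ["<s>", "<s>"] + sentence + ["</s>"]
    prob = 1.0
    for i in range(2, len(tokens)):
        prob *= p.get((tokens[i - 2], tokens[i - 1], tokens[i]), 0.0)
    return prob

# Hypothetical trigram table covering the example from the slide:
p = {
    ("<s>", "<s>", "The"): 0.1, ("<s>", "The", "cat"): 0.02,
    ("The", "cat", "was"): 0.3, ("cat", "was", "black"): 0.05,
    ("was", "black", "."): 0.4, ("black", ".", "</s>"): 0.9,
}
print(trigram_lm_prob(["The", "cat", "was", "black", "."], p))
```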
Estimating and Smoothing LM
$p(w_1) = \frac{\mathrm{count}(w_1)}{\text{total words observed}}$ . . . Unigram probabilities.
$p(w_2 \mid w_1) = \frac{\mathrm{count}(w_1 w_2)}{\mathrm{count}(w_1)}$ . . . Bigram probabilities.
$p(w_3 \mid w_2, w_1) = \frac{\mathrm{count}(w_1 w_2 w_3)}{\mathrm{count}(w_1 w_2)}$ . . . Trigram probabilities.
Unseen n-grams ($p(\mathrm{ngram}) = 0$) are a big problem; they invalidate the whole sentence:
$p_{LM}(e_1^I) = \cdots \cdot 0 \cdot \cdots = 0$
⇒ Back off to shorter n-grams:
$$p_{LM}(e_1^I) = \prod_{i=1}^{I} \big( 0.8 \cdot p(e_i \mid e_{i-1}, e_{i-2}) + 0.15 \cdot p(e_i \mid e_{i-1}) + 0.049 \cdot p(e_i) + 0.001 \big) \neq 0 \qquad (6)$$
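A sketch of the interpolation in Equation 6: each word receives a weighted mix of trigram, bigram and unigram estimates plus a small constant, so an unseen trigram no longer zeroes out the whole sentence. The weights 0.8/0.15/0.049/0.001 are copied from the slide; the probability tables passed in are hypothetical.

```python
def smoothed_word_prob(w, u, v, tri, bi, uni):
    """0.8*p(w|u,v) + 0.15*p(w|v) + 0.049*p(w) + 0.001, with p's looked up in the tables."""
    p3 = tri.get((u, v, w), 0.0)
    p2 = bi.get((v, w), 0.0)
    p1 = uni.get(w, 0.0)
    return 0.8 * p3 + 0.15 * p2 + 0.049 * p1 + 0.001

def smoothed_lm_prob(sentence, tri, bi, uni):
    tokens = ["<s>", "<s>"] + sentence + ["</s>"]
    prob = 1.0
    for i in range(2, len(tokens)):
        prob *= smoothed_word_prob(tokens[i], tokens[i - 2], tokens[i - 1], tri, bi, uni)
    return prob  # never exactly 0, even for completely unseen n-grams

# Even with empty tables the sentence keeps a tiny non-zero probability:
print(smoothed_lm_prob(["Good", "evening", "."], {}, {}, {}))
```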
From Bayes to Log-Linear Model
Och (2002) discusses some problems of the noisy-channel model (Equation 4):
• Models estimated unreliably ⇒ maybe the LM is more important:
$$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I) \left( p(e_1^I) \right)^2 \qquad (7)$$
• In practice, a "direct" translation model is equally good:
$$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} p(e_1^I \mid f_1^J)\, p(e_1^I) \qquad (8)$$
• Complicated to correctly introduce other dependencies.
⇒ Use a log-linear model instead.
Log-Linear Model (1)
• $p(e_1^I \mid f_1^J)$ is modelled as a weighted combination of models, called "feature functions": $h_1(\cdot,\cdot) \ldots h_M(\cdot,\cdot)$
$$p(e_1^I \mid f_1^J) = \frac{\exp\left(\sum_{m=1}^M \lambda_m h_m(e_1^I, f_1^J)\right)}{\sum_{I',\, {e'}_1^{I'}} \exp\left(\sum_{m=1}^M \lambda_m h_m({e'}_1^{I'}, f_1^J)\right)} \qquad (9)$$
• Each feature function $h_m(e, f)$ relates source $f$ to target $e$. E.g. the feature for the $n$-gram language model:
$$h_{LM}(f_1^J, e_1^I) = \log \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1}) \qquad (10)$$
• Model weights $\lambda_1^M$ specify the relative importance of features.
Log-Linear Model (2)
As before, the constant denominator not needed in maximization:
$$\begin{aligned}
\hat{e}_1^{\hat{I}} &= \operatorname*{argmax}_{I,\, e_1^I} \frac{\exp\left(\sum_{m=1}^M \lambda_m h_m(e_1^I, f_1^J)\right)}{\sum_{I',\, {e'}_1^{I'}} \exp\left(\sum_{m=1}^M \lambda_m h_m({e'}_1^{I'}, f_1^J)\right)} \\
&= \operatorname*{argmax}_{I,\, e_1^I} \exp\left(\sum_{m=1}^M \lambda_m h_m(e_1^I, f_1^J)\right)
\end{aligned} \qquad (11)$$
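A sketch of ranking candidates with the log-linear model: per Equation 11 the normalization (and the monotone exp) can be dropped, so each candidate is scored simply by the weighted sum of its feature values. The feature values and weights below are invented; in practice the weights would be tuned, e.g. by minimum error-rate training.

```python
# Illustrative feature values for two candidates (made-up log-probabilities):
# h_tm stands in for log p(f|e), h_lm for log p(e).
candidates = {
    "Dobré ráno.":  {"h_tm": -1.2, "h_lm": -3.0},
    "Dobré jitro.": {"h_tm": -1.0, "h_lm": -5.5},
}
weights = {"h_tm": 1.0, "h_lm": 0.6}   # lambda_m, tuned e.g. by MERT in practice

def loglinear_score(feats):
    """sum_m lambda_m * h_m(e, f); enough for the argmax, no normalization needed."""
    return sum(weights[name] * value for name, value in feats.items())

best = max(candidates, key=lambda e: loglinear_score(candidates[e]))
print(best, loglinear_score(candidates[best]))
```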
Relation to Noisy Channel
With equal weights and only two features:
• $h_{TM}(e_1^I, f_1^J) = \log p(f_1^J \mid e_1^I)$ for the translation model,
• $h_{LM}(e_1^I, f_1^J) = \log p(e_1^I)$ for the language model,
the log-linear model reduces to the Noisy Channel:
$$\begin{aligned}
\hat{e}_1^{\hat{I}} &= \operatorname*{argmax}_{I,\, e_1^I} \exp\left(\sum_{m=1}^M \lambda_m h_m(e_1^I, f_1^J)\right) \\
&= \operatorname*{argmax}_{I,\, e_1^I} \exp\left(h_{TM}(e_1^I, f_1^J) + h_{LM}(e_1^I, f_1^J)\right) \\
&= \operatorname*{argmax}_{I,\, e_1^I} \exp\left(\log p(f_1^J \mid e_1^I) + \log p(e_1^I)\right) \\
&= \operatorname*{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I)
\end{aligned} \qquad (12)$$
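A tiny numeric check of the reduction in Equation 12, with made-up probabilities: with unit weights and the two log-probability features, the log-linear score equals the log of the noisy-channel product, so the rankings coincide.

```python
import math

# Made-up probabilities for one candidate pair:
p_f_given_e, p_e = 0.3, 0.02

h_tm = math.log(p_f_given_e)   # h_TM(e, f) = log p(f | e)
h_lm = math.log(p_e)           # h_LM(e, f) = log p(e)

loglinear_score = 1.0 * h_tm + 1.0 * h_lm          # lambda_TM = lambda_LM = 1
noisy_channel_score = math.log(p_f_given_e * p_e)  # log of the noisy-channel product

assert abs(loglinear_score - noisy_channel_score) < 1e-12
```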
Phrase-Based MT Overview
This time around = Nyní
they 're moving = zareagovaly
even faster = dokonce ještě
. . . = . . .
This time around , they 're moving = Nyní zareagovaly
even faster = dokonce ještě rychleji
. . . = . . .
[Figure: phrase alignment matrix between the English words "This time around , they 're moving even faster ." and the Czech words "Nyní zareagovaly dokonce ještě rychleji ."]
Phrase-based MT: choose such segmentation of the input string and such phrase "replacements" as to make the output sequence "coherent" (3-grams most probable).
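A heavily reduced sketch of this idea, under strong simplifying assumptions: a tiny hand-made phrase table, monotone phrase order only, exhaustive enumeration of segmentations, and a crude stand-in for the language model score. Real phrase-based decoders, discussed later in the lecture, use beam search, reordering and a true n-gram LM.

```python
import math

# Hypothetical phrase table: source phrase -> list of (target phrase, log p(target | source)).
PT = {
    ("this", "time", "around"): [(("nyní",), -0.4)],
    ("they", "'re", "moving"): [(("zareagovaly",), -0.7)],
    ("even", "faster"): [(("dokonce", "ještě", "rychleji"), -0.5)],
    ("faster",): [(("rychleji",), -0.3)],
    ("even",): [(("dokonce",), -0.6), (("ještě",), -0.9)],
}

def lm_logprob(words):
    return -0.2 * len(words)  # crude stand-in for a real n-gram LM score

def decode(src, start=0):
    """Return (best log score, best output) for src[start:], monotone segmentation only."""
    if start == len(src):
        return 0.0, ()
    best = (-math.inf, ())
    for end in range(start + 1, len(src) + 1):
        for tgt, tm_score in PT.get(tuple(src[start:end]), []):
            rest_score, rest_out = decode(src, end)
            score = tm_score + lm_logprob(tgt) + rest_score
            if score > best[0]:
                best = (score, tgt + rest_out)
    return best

score, output = decode("this time around they 're moving even faster".split())
print(" ".join(output), score)
```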