Guy Dar, Machine Translation Seminar, Tel Aviv University, 2014
} Problems:
◦ Poor grammar.
◦ The distortion model is local. (An instance of the former.)
} Solution (?): an unsupervised syntax-based translation model.
} Which means: no predefined linguistic rules.
} The system learns from a bilingual corpus.
Mandarin (Chinese):
Aozhou    shi  yu    Bei Han      you   bangjiao              de    shaoshu  guojia     zhiyi
Australia is   with  North Korea  have  diplomatic relations  that  few      countries  one of
Correct translation: Australia is one of the few countries that have diplomatic relations with North Korea.
Note: the correct translation requires reversing 5 elements.
} Idea: translate 'linguistic' structures ("templates") into templates, not phrases into phrases.
} How? Rules! For example:
◦ [1] de [2] → the [2] that [1]
◦ [1] zhiyi → one of [1]
◦ yu [1] you [2] → have [2] with [1]
} We can apply rules recursively.
} This way we can derive the correct translation.
} Formal construction:
◦ Each rule has the following form: X → <α, γ, ~>, where X is a non-terminal (variable), α is a string in the source language, and γ is a string in the target language. Both strings consist of non-terminals and terminals, and ~ is a one-to-one correspondence between the non-terminals of α and the non-terminals of γ.
} In our model, we will use only two non-terminals: S and X.
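} To make the representation concrete, here is a minimal Python sketch; the Rule class and the tuple encoding of indexed non-terminals are my own illustration, not the paper's:

from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """A synchronous CFG rule X -> <alpha, gamma, ~>.

    Non-terminals are encoded as ("X", k) tuples; terminals are plain
    strings. The correspondence ~ is implicit: ("X", k) on the source
    side corresponds to ("X", k) on the target side.
    """
    lhs: str     # "S" or "X"
    src: tuple   # alpha: terminals and indexed non-terminals
    tgt: tuple   # gamma: the same non-terminal indices, possibly reordered

# The three example rules from the previous slide:
rules = [
    Rule("X", (("X", 1), "de", ("X", 2)), ("the", ("X", 2), "that", ("X", 1))),
    Rule("X", (("X", 1), "zhiyi"), ("one", "of", ("X", 1))),
    Rule("X", ("yu", ("X", 1), "you", ("X", 2)), ("have", ("X", 2), "with", ("X", 1))),
]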
} Our system will learn rules from the bilingual corpus only.
} The only rules we add manually are two glue rules:
◦ S → <S [1] X [2] , S [1] X [2]>
◦ S → <X [1] , X [1]>
<S [1] , S [1]>   (initial pair)
→ <S [2] X [3] , S [2] X [3]>   using S → <S [1] X [2] , S [1] X [2]>
→ <S [4] X [5] X [3] , S [4] X [5] X [3]>   using S → <S [1] X [2] , S [1] X [2]>
→ <X [6] X [5] X [3] , X [6] X [5] X [3]>   using S → <X [1] , X [1]>
→ <Aozhou X [5] X [3] , Australia X [5] X [3]>   using X → <Aozhou , Australia>
→ <Aozhou shi X [3] , Australia is X [3]>   using X → <shi , is>
→ <Aozhou shi X [7] zhiyi , Australia is one of X [7]>   using X → <X [1] zhiyi , one of X [1]>
→ <Aozhou shi X [8] de X [9] zhiyi , Australia is one of the X [9] that X [8]>   using X → <X [1] de X [2] , the X [2] that X [1]>
→ <Aozhou shi yu X [1] you X [2] de X [9] zhiyi , Australia is one of the X [9] that have X [2] with X [1]>   using X → <yu X [1] you X [2] , have X [2] with X [1]>
} Let us now return to our system.
} Every rule gets a weight (log-linear model):
w(X → <α, γ>) = ∏_i φ_i(X → <α, γ>)^λ_i
} The φ_i are called the features.
} The λ_i are the feature weights.
} In our design, we have the following features:
◦ P(γ | α) – the probability that α is translated as γ.
◦ P(α | γ) – the other way around.
◦ P_w(α | γ), P_w(γ | α) – lexical weights, which estimate how well the individual words are translated (based on the word alignment).
◦ Phrase penalty – a constant e = exp(1); we use it to penalize (or encourage?) long derivations.
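} As a sketch, the rule weight is computed from these features like so; the feature values and λs below are hypothetical placeholders, not the paper's (in the real system they come from corpus estimates and MERT tuning):

import math

features = {
    "p_tgt_given_src": 0.4,    # P(gamma | alpha)
    "p_src_given_tgt": 0.3,    # P(alpha | gamma)
    "lex_tgt_given_src": 0.2,  # P_w(gamma | alpha)
    "lex_src_given_tgt": 0.25, # P_w(alpha | gamma)
    "phrase_penalty": math.e,  # the constant exp(1)
}

lambdas = {
    "p_tgt_given_src": 1.0,
    "p_src_given_tgt": 1.0,
    "lex_tgt_given_src": 0.5,
    "lex_src_given_tgt": 0.5,
    "phrase_penalty": -0.5,    # a negative weight penalizes long derivations
}

def rule_weight(features, lambdas):
    # Log-linear model: w(r) = prod_i phi_i(r) ** lambda_i
    return math.prod(features[f] ** lambdas[f] for f in features)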
} Two special rules:
◦ w(S → <S [1] X [2] , S [1] X [2]>) = exp(-λ_g)
◦ w(S → <X [1] , X [1]>) = 1
} We also give weights to derivations (sequences of rules). For every derivation D:
w(D) = ∏_{r ∈ D} w(r) · p_lm(e)^λ_lm · exp(-λ_wp |e|)
where the product is over all rules used in D, p_lm is the language model, and exp(-λ_wp |e|) is the word penalty, which discourages the use of too many words (as opposed to the phrase penalty).
} Note: for things to go right, we must integrate the extra factors into the rule weights.
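} A minimal sketch of the derivation score, reusing rule_weight() and lambdas from the previous sketch; lm_prob and the λ values here are stand-ins of mine:

import math

def derivation_weight(rules_used, english_words, lm_prob,
                      lam_lm=1.0, lam_wp=0.5):
    # Product of the individual rule weights.
    w = math.prod(rule_weight(feats, lambdas) for feats in rules_used)
    # Language-model factor p_lm(e)^lambda_lm.
    w *= lm_prob(english_words) ** lam_lm
    # Word penalty exp(-lambda_wp * |e|): discourages overly long outputs.
    w *= math.exp(-lam_wp * len(english_words))
    return w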
} Input: a word-aligned bilingual corpus (many-to-many alignments).
} Objective: learn hierarchical rules.
} We are given a pair of word-aligned sentences <f, e, ~> (f for French, e for English, ~ is the word alignment).
} Big picture: first we extract initial phrase pairs, then we refine them into more "sophisticated" rules.
} An initial phrase pair is a pair <f', e'> such that:
◦ f' is a substring of f and e' is a substring of e (a substring must be contiguous, of the form str[i:j]; no 'holes' are allowed);
◦ all words in f' are aligned only to words in e';
◦ and vice versa: no word outside f' is aligned to a word in e'.
} Sound familiar? Philipp Koehn, http://www.statmt.org/book/slides/05-phrase-based-models.pdf
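} This is the standard phrase-pair consistency check; a Python sketch, assuming the alignment is a set of (i, j) links from f[i] to e[j] (0-based indices):

def is_initial_phrase_pair(alignment, f_start, f_end, e_start, e_end):
    """True iff f[f_start:f_end] and e[e_start:e_end] are consistent."""
    has_link = False
    for (i, j) in alignment:
        inside_f = f_start <= i < f_end
        inside_e = e_start <= j < e_end
        if inside_f != inside_e:  # a link crossing the phrase boundary
            return False
        if inside_f and inside_e:
            has_link = True
    return has_link               # at least one link must fall inside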
} Every initial phrase pair gives us a rule X → <f', e'>.
} Now, we construct new rules from existing ones:
◦ if X → <α, γ> is a rule,
◦ and there is an initial phrase pair <f', e'> such that α = α₁ f' α₂ and γ = γ₁ e' γ₂,
◦ then add the rule X → <α₁ X[k] α₂ , γ₁ X[k] γ₂> (see the sketch below).
} Practically, we use additional heuristics to make this procedure more efficient and less ambiguous.
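} A sketch of this 'subtraction' step, assuming rules are stored as tuples of tokens and the sub-phrase is replaced by an indexed non-terminal ("X", k) as in the earlier sketch:

def subtract(rule_src, rule_tgt, f_sub, e_sub, k):
    """Replace one occurrence of <f_sub, e_sub> inside <rule_src, rule_tgt>
    with the non-terminal ("X", k); returns None if either side is absent."""
    def find(seq, sub):
        for start in range(len(seq) - len(sub) + 1):
            if tuple(seq[start:start + len(sub)]) == tuple(sub):
                return start
        return None

    i = find(rule_src, f_sub)
    j = find(rule_tgt, e_sub)
    if i is None or j is None:
        return None
    new_src = rule_src[:i] + (("X", k),) + rule_src[i + len(f_sub):]
    new_tgt = rule_tgt[:j] + (("X", k),) + rule_tgt[j + len(e_sub):]
    return new_src, new_tgt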
} Our estimate will distribute weight equally among all initial phrase pairs;
} then, every initial phrase pair distributes its weight equally among all rules extracted from it.
} Now, we use this estimate to determine P(α | γ) and P(γ | α).
} Notice that we have yet to assign values to our feature weights.
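} A sketch of this heuristic count distribution, assuming `extracted` maps each initial phrase pair to the list of rules extracted from it (my own framing of the slide's description):

from collections import defaultdict

def estimate_counts(extracted):
    counts = defaultdict(float)
    n_pairs = len(extracted)
    for rules in extracted.values():
        weight = 1.0 / n_pairs                   # equal weight per initial phrase pair
        for rule in rules:
            counts[rule] += weight / len(rules)  # split equally among its rules
    return counts

# P(gamma | alpha) is then a rule's count divided by the total count of
# rules sharing the same source side alpha (and symmetrically for P(alpha | gamma)).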
} We are given a sentence f in the foreign language.
} We try to find the derivation with the best score whose French side is f:
argmax_D w(D)  s.t.  f(D) = f
◦ The English side of this derivation will be our translation of f.
} Our algorithm is basically a CKY parser:
◦ an algorithm that checks whether a string belongs to the language of a CFG;
◦ there is a CKY variant for weighted CFGs.
} Since we cannot try all options, we use pruning techniques. (Similar to what we saw in Koehn's chapter on decoding: http://www.statmt.org/book/slides/06-decoding.pdf)
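} For intuition, a minimal sketch of weighted CKY recognition for a monolingual CFG in Chomsky normal form; the real decoder parses with the synchronous grammar and intersects with the language model, which is considerably more involved:

from collections import defaultdict

def cky(words, lexical, binary, start="S"):
    """lexical: {(A, word): weight};  binary: {(A, B, C): weight} for A -> B C.
    Returns the best weight of a `start` parse covering the whole sentence."""
    n = len(words)
    chart = defaultdict(dict)            # chart[(i, j)][A] = best weight
    for i, w in enumerate(words):
        for (A, word), wt in lexical.items():
            if word == w and wt > chart[(i, i + 1)].get(A, 0.0):
                chart[(i, i + 1)][A] = wt
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):    # split point
                for (A, B, C), wt in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        cand = wt * chart[(i, k)][B] * chart[(k, j)][C]
                        if cand > chart[(i, j)].get(A, 0.0):
                            chart[(i, j)][A] = cand
    return chart[(0, n)].get(start, 0.0)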
} Constituent (linguistics) – a single unit within a hierarchical structure.
} We can factor a constituent feature into the weight of a derivation D:
c(i,j) = 1 if f[i:j] is a constituent, 0 otherwise
} For every rule r, f[i:j] is the slice of the French side that r is 'responsible for' (the [leaves of] the subtree derived from r).
} c(i,j) was learnt from the Penn Chinese Treebank (ver. 3).
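} One way such an indicator feature can multiply into a derivation's weight, sketched below; `constituents` (the set of spans a treebank-trained parser marks as constituents) and λ_c are assumptions of mine, not the paper's exact formulation:

import math

def constituent_factor(spans_used, constituents, lam_c=1.0):
    """spans_used: the (i, j) span of f that each rule in D covers."""
    c_total = sum(1 for span in spans_used if span in constituents)
    return math.exp(lam_c * c_total)  # rewards derivations that respect constituents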
} Languages: Mandarin to English.
} Models compared:
◦ Pharaoh (baseline)
◦ hierarchical model
◦ hierarchical model + constituent feature
} Training set:
◦ translation model – FBIS corpus (7.2M + 9.2M words)
◦ language model – English newswire text (155M words)
} Development set:
◦ 2002 NIST MT evaluation test set
} Test set:
◦ 2003 NIST MT evaluation test set
} Evaluation:
◦ BLEU
} Feature weights were tuned by running Minimum Error-Rate Training (MERT) on the development set.
} Tuning results
} The difference between the baseline and the hierarchical model is statistically significant.
} The new system improved on the state-of-the-art results (in 2005).
} The constituent feature improves results only slightly. (Statistically insignificant.)
} Further study suggests that increasing the maximum initial phrase length from 10 to 15 improves accuracy.
} David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation, http://www.aclweb.org/anthology/P05-1033
} Philipp Koehn, Statistical Machine Translation, http://www.statmt.org/book/
} Wikipedia:
◦ CYK algorithm [last modified Dec. 16, 2014], http://en.wikipedia.org/wiki/CYK_algorithm
◦ Constituent (linguistics) [last modified Nov. 17, 2014], http://en.wikipedia.org/wiki/Constituent_%28linguistics%29