Variational Decoding for Statistical Machine Translation
Zhifei Li, Jason Eisner, and Sanjeev Khudanpur
Center for Language and Speech Processing, Computer Science Department, Johns Hopkins University
Monday, August 17, 2009
Spurious Ambiguity
• Statistical models in MT exhibit spurious ambiguity: many different derivations (e.g., trees or segmentations) generate the same translation string
• Regular phrase-based MT systems: phrase segmentation ambiguity
• Tree-based MT systems: derivation tree ambiguity
Spurious Ambiguity in Phrase Segmentations
• Source: 机器 翻译 软件
• Same output, "machine translation software", is produced by three different phrase segmentations
(figure: three segmentation diagrams of 机器 翻译 软件, each aligned to "machine translation software")
Spurious Ambiguity in Derivation Trees
• Source: 机器 翻译 软件
• Same output, "machine translation software", is produced by three different derivation trees
• The trees combine rules such as S → (S0 S1, S0 S1), S → (机器, machine), S → (翻译, translation), S → (软件, software), and S → (S0 翻译 S1, S0 translation S1)
(figure: three derivation trees)
Maximum a Posteriori (MAP) Decoding

translation string     derivation probabilities     MAP
red translation        0.16, 0.12                   0.28
blue translation       0.14, 0.14                   0.28
green translation      0.13, 0.11, 0.10, 0.10       0.44

• Exact MAP decoding: y* = argmax_{y ∈ Trans(x)} p(y | x) = argmax_{y ∈ Trans(x)} Σ_{d ∈ D(x,y)} p(y, d | x)
• x: foreign sentence; y: English sentence; d: derivation
• MAP sums over all derivations of each string: here "green translation" wins with probability 0.44
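The MAP rule above can be sketched in a few lines, using the slide's running toy example (the three "red/blue/green" strings and eight derivation probabilities stand in for full translations):

```python
from collections import defaultdict

# Toy derivation distribution from the slides: each derivation carries a
# probability, and several derivations yield the same translation string.
derivations = [
    ("red translation",   0.16), ("blue translation",  0.14),
    ("blue translation",  0.14), ("green translation", 0.13),
    ("red translation",   0.12), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

def map_decode(derivations):
    """Exact MAP: sum p(y, d | x) over all derivations of each string,
    then return the string with the highest total probability."""
    totals = defaultdict(float)
    for string, prob in derivations:
        totals[string] += prob
    return max(totals, key=totals.get)

# "green translation" wins (0.13 + 0.11 + 0.10 + 0.10 = 0.44), even though
# no single green derivation is the most probable one.
```

In practice the sum ranges over exponentially many derivations encoded in a hypergraph, which is exactly why exact MAP is intractable; this enumeration works only for a toy list.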
Hypergraph as a Search Space
• A hypergraph is a compact structure to encode exponentially many trees
• Example source: dianzi0 shang1 de2 mao3, with rules such as:
  S → ⟨X0, X0⟩
  X → ⟨X0 de X1, X1 on X0⟩
  X → ⟨X0 de X1, X0 's X1⟩
  X → ⟨X0 de X1, X1 of X0⟩
  X → ⟨X0 de X1, X0 X1⟩
  X → ⟨dianzi shang, the mat⟩
  X → ⟨mao, a cat⟩
(figure: hypergraph with nodes S0,4; X0,4 "a ... mat" / "the ... cat"; X0,2 "the ... mat"; X3,4 "a ... cat")
Hypergraph as a Search Space
• The probabilistic hypergraph defines a probability distribution over derivation trees, i.e. p(y, d | x), and also an implicit distribution over strings, i.e. p(y | x)
• Exact MAP decoding: y* = argmax_{y ∈ HG(x)} p(y | x) = argmax_{y ∈ HG(x)} Σ_{d ∈ D(x,y)} p(y, d | x)
• The sum ranges over an exponential number of derivations, and exact MAP decoding is NP-hard (Sima'an 1996)
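A minimal sketch of such a hypergraph, loosely modeled on the "dianzi shang de mao" example (the node names, edge templates, and probabilities here are illustrative assumptions, not the paper's actual model). Enumerating derivations shows how two distinct derivations can yield the same string:

```python
from itertools import product

# Each node maps to a list of incoming hyperedges; a hyperedge is
# (template, tail_nodes, probability), where the template holds the edge's
# output words with {i} slots for the tail nodes' yields.
hypergraph = {
    "X02": [("the mat", [], 1.0)],
    "X34": [("a cat", [], 1.0)],
    "X04": [("{1} on {0}", ["X02", "X34"], 0.3),
            ("{0} 's {1}", ["X02", "X34"], 0.25),
            ("{1} of {0}", ["X02", "X34"], 0.25),
            # A second, "spurious" way to derive the same string:
            ("a cat on the mat", [], 0.2)],
    "S04": [("{0}", ["X04"], 1.0)],
}

def derivations(node):
    """Enumerate (string, probability) pairs for every derivation of a node.
    The number of derivations grows exponentially with hypergraph depth."""
    for template, tails, prob in hypergraph[node]:
        for parts in product(*(derivations(t) for t in tails)):
            string = template.format(*(s for s, _ in parts))
            p = prob
            for _, sub_p in parts:
                p *= sub_p
            yield string, p

# Four derivations, but only three distinct strings: "a cat on the mat"
# is derived twice, with total probability 0.3 + 0.2 = 0.5.
```

This brute-force enumeration is only for illustration; the whole point of the hypergraph is that quantities like expected n-gram counts can instead be computed by dynamic programming without ever listing the derivations.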
Decoding with spurious ambiguity?
• Maximum a posteriori (MAP) decoding
• Viterbi approximation
• N-best approximation, or "crunching" (May and Knight 2006)
Viterbi Approximation

translation string     derivation probabilities     MAP     Viterbi
red translation        0.16, 0.12                   0.28    0.16
blue translation       0.14, 0.14                   0.28    0.14
green translation      0.13, 0.11, 0.10, 0.10       0.44    0.13

• Viterbi approximation: y* = argmax_{y ∈ Trans(x)} max_{d ∈ D(x,y)} p(y, d | x) = Y(argmax_{d ∈ D(x)} p(y, d | x))
• The sum over derivations is replaced by a max, so only the single best derivation counts: here "red translation" wins with 0.16
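On the same toy derivation list, the Viterbi approximation is just an argmax over individual derivations:

```python
# Same toy derivation list as the MAP slide: (translation string, probability).
derivations = [
    ("red translation",   0.16), ("blue translation",  0.14),
    ("blue translation",  0.14), ("green translation", 0.13),
    ("red translation",   0.12), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

def viterbi_decode(derivations):
    """Viterbi approximation: Y(argmax_d p(y, d | x)) -- keep only the single
    best derivation and output its string, ignoring all other derivations."""
    best_string, _ = max(derivations, key=lambda sd: sd[1])
    return best_string

# "red translation" wins with 0.16, even though its total probability (0.28)
# is lower than green's (0.44): the other derivations are simply ignored.
```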
N-best Approximation (Crunching)

translation string     derivation probabilities     MAP     Viterbi     4-best crunching
red translation        0.16, 0.12                   0.28    0.16        0.16
blue translation       0.14, 0.14                   0.28    0.14        0.28
green translation      0.13, 0.11, 0.10, 0.10       0.44    0.13        0.13

• N-best approximation, or crunching (May and Knight 2006): y* = argmax_{y ∈ Trans(x)} Σ_{d ∈ D(x,y) ∩ ND(x)} p(y, d | x), where ND(x) is the set of the N best derivations
• Only the top-N derivations count: here, with N = 4, "blue translation" wins with 0.14 + 0.14 = 0.28
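Crunching sits between the two: it restricts the MAP sum to the N best derivations. A sketch on the same toy list:

```python
from collections import defaultdict

# Same toy derivation list as the MAP slide: (translation string, probability).
derivations = [
    ("red translation",   0.16), ("blue translation",  0.14),
    ("blue translation",  0.14), ("green translation", 0.13),
    ("red translation",   0.12), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

def crunch_decode(derivations, n=4):
    """N-best crunching (May and Knight 2006): sum p(y, d | x) over only the
    n highest-probability derivations, grouped by string."""
    top_n = sorted(derivations, key=lambda sd: sd[1], reverse=True)[:n]
    totals = defaultdict(float)
    for string, prob in top_n:
        totals[string] += prob
    return max(totals, key=totals.get)

# With n = 4 the surviving derivations are 0.16 (red), 0.14 + 0.14 (blue),
# and 0.13 (green), so "blue translation" wins with 0.28 -- a different
# answer from both exact MAP (green) and Viterbi (red).
```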
MAP vs. Approximations

translation string     MAP     Viterbi     4-best crunching
red translation        0.28    0.16        0.16
blue translation       0.28    0.14        0.28
green translation      0.44    0.13        0.13

• Exact MAP decoding under spurious ambiguity is intractable
• Viterbi and crunching are efficient, but ignore most derivations
• Our goal: develop an approximation that considers all the derivations but still allows tractable decoding
Variational Decoding
• Decoding using a variational approximation, i.e. a sentence-specific approximate distribution
Variational Decoding for MT: an Overview
• Sentence-specific decoding: MAP decoding under p is intractable, so we proceed in three steps
• Step 1: Translate the foreign sentence x with the SMT system to generate a hypergraph, which encodes p(y, d | x) and, implicitly, p(y | x)
• Step 2: Estimate a model q*(y | x) over output strings from the hypergraph, so that q*(y | x) ≈ Σ_{d ∈ D(x,y)} p(y, d | x); q* is an n-gram model
• Step 3: Decode using q* on the hypergraph
Variational Inference
• We want to do inference under p, but it is intractable: y* = argmax_y p(y | x)
• Instead, we derive a simpler distribution q* from an approximating family Q: q* = argmin_{q ∈ Q} KL(p || q)
• Then we use q* as a surrogate for p in inference: y* = argmax_y q*(y | x)
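When Q is a family of n-gram models, minimizing KL(p || q) reduces to matching the n-gram counts expected under p (relative-frequency estimation on fractional counts). The sketch below illustrates this with a made-up three-string distribution and a bigram family; in the paper, p is given only implicitly by the hypergraph and the expected counts are computed by dynamic programming, not by enumeration:

```python
from collections import defaultdict

# Illustrative (assumed) string distribution p(y | x), with strings as word
# tuples; the probabilities here are invented for the sketch.
p = {
    ("a", "cat", "on", "the", "mat"): 0.5,
    ("the", "mat", "'s", "a", "cat"): 0.3,
    ("a", "cat", "of", "the", "mat"): 0.2,
}

BOS, EOS = "<s>", "</s>"

def fit_bigram_q(p):
    """The KL(p || q)-minimizing bigram model: expected bigram counts under p,
    normalized by expected context counts."""
    pair, context = defaultdict(float), defaultdict(float)
    for y, prob in p.items():
        words = [BOS] + list(y) + [EOS]
        for w1, w2 in zip(words, words[1:]):
            pair[w1, w2] += prob
            context[w1] += prob
    return {bg: c / context[bg[0]] for bg, c in pair.items()}

def q_score(q, y):
    """Score a string under the bigram model q: product of q(w_i | w_{i-1})."""
    words = [BOS] + list(y) + [EOS]
    score = 1.0
    for bg in zip(words, words[1:]):
        score *= q.get(bg, 0.0)
    return score

# Variational decoding: use q* as a surrogate for p.
q = fit_bigram_q(p)
best = max(p, key=lambda y: q_score(q, y))
```

Note that q* deliberately leaks probability across strings (it can score strings p never generates), which is exactly what lets it aggregate evidence from all derivations while remaining tractable to decode with.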