Variational Decoding for Statistical Machine Translation
Zhifei Li, Jason Eisner, and Sanjeev Khudanpur
Center for Language and Speech Processing, Computer Science Department, Johns Hopkins University
Monday, August 17, 2009
Spurious Ambiguity
• Statistical models in MT exhibit spurious ambiguity: many different derivations (e.g., trees or segmentations) generate the same translation string
• Regular phrase-based MT systems: phrase segmentation ambiguity
• Tree-based MT systems: derivation tree ambiguity
Spurious Ambiguity in Phrase Segmentations
• Source: 机器 翻译 软件
• Same output, "machine translation software", is produced by three different phrase segmentations
(figure: three segmentation diagrams of 机器 翻译 软件, each aligned to "machine translation software")
Spurious Ambiguity in Derivation Trees
• Source: 机器 翻译 软件
• Same output, "machine translation software", is produced by three different derivation trees
• The trees combine rules such as S → (S0 S1, S0 S1), S → (机器, machine), S → (翻译, translation), S → (软件, software), and S → (S0 翻译 S1, S0 translation S1)
(figure: three derivation trees)
Maximum a Posteriori (MAP) Decoding

translation string     derivation probabilities     MAP
red translation        0.16, 0.12                   0.28
blue translation       0.14, 0.14                   0.28
green translation      0.13, 0.11, 0.10, 0.10       0.44

• Exact MAP decoding: y* = argmax_{y ∈ Trans(x)} p(y | x) = argmax_{y ∈ Trans(x)} Σ_{d ∈ D(x,y)} p(y, d | x)
• x: foreign sentence; y: English sentence; d: derivation
• MAP sums over all derivations of each string: here "green translation" wins with probability 0.44
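The MAP rule above can be sketched in a few lines, using the slide's running toy example (the three "red/blue/green" strings and eight derivation probabilities stand in for full translations):

```python
from collections import defaultdict

# Toy derivation distribution from the slides: each derivation carries a
# probability, and several derivations yield the same translation string.
derivations = [
    ("red translation",   0.16), ("blue translation",  0.14),
    ("blue translation",  0.14), ("green translation", 0.13),
    ("red translation",   0.12), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

def map_decode(derivations):
    """Exact MAP: sum p(y, d | x) over all derivations of each string,
    then return the string with the highest total probability."""
    totals = defaultdict(float)
    for string, prob in derivations:
        totals[string] += prob
    return max(totals, key=totals.get)

# "green translation" wins (0.13 + 0.11 + 0.10 + 0.10 = 0.44), even though
# no single green derivation is the most probable one.
```

In practice the sum ranges over exponentially many derivations encoded in a hypergraph, which is exactly why exact MAP is intractable; this enumeration works only for a toy list.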
Hypergraph as a Search Space
• A hypergraph is a compact structure to encode exponentially many trees
• Example source: dianzi0 shang1 de2 mao3, with rules such as:
  S → ⟨X0, X0⟩
  X → ⟨X0 de X1, X1 on X0⟩
  X → ⟨X0 de X1, X0 's X1⟩
  X → ⟨X0 de X1, X1 of X0⟩
  X → ⟨X0 de X1, X0 X1⟩
  X → ⟨dianzi shang, the mat⟩
  X → ⟨mao, a cat⟩
(figure: hypergraph with nodes S0,4; X0,4 "a ... mat" / "the ... cat"; X0,2 "the ... mat"; X3,4 "a ... cat")
Hypergraph as a Search Space
• The probabilistic hypergraph defines a probability distribution over derivation trees, i.e. p(y, d | x), and also an implicit distribution over strings, i.e. p(y | x)
• Exact MAP decoding: y* = argmax_{y ∈ HG(x)} p(y | x) = argmax_{y ∈ HG(x)} Σ_{d ∈ D(x,y)} p(y, d | x)
• The sum ranges over an exponential number of derivations, and exact MAP decoding is NP-hard (Sima'an 1996)
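A minimal sketch of such a hypergraph, loosely modeled on the "dianzi shang de mao" example (the node names, edge templates, and probabilities here are illustrative assumptions, not the paper's actual model). Enumerating derivations shows how two distinct derivations can yield the same string:

```python
from itertools import product

# Each node maps to a list of incoming hyperedges; a hyperedge is
# (template, tail_nodes, probability), where the template holds the edge's
# output words with {i} slots for the tail nodes' yields.
hypergraph = {
    "X02": [("the mat", [], 1.0)],
    "X34": [("a cat", [], 1.0)],
    "X04": [("{1} on {0}", ["X02", "X34"], 0.3),
            ("{0} 's {1}", ["X02", "X34"], 0.25),
            ("{1} of {0}", ["X02", "X34"], 0.25),
            # A second, "spurious" way to derive the same string:
            ("a cat on the mat", [], 0.2)],
    "S04": [("{0}", ["X04"], 1.0)],
}

def derivations(node):
    """Enumerate (string, probability) pairs for every derivation of a node.
    The number of derivations grows exponentially with hypergraph depth."""
    for template, tails, prob in hypergraph[node]:
        for parts in product(*(derivations(t) for t in tails)):
            string = template.format(*(s for s, _ in parts))
            p = prob
            for _, sub_p in parts:
                p *= sub_p
            yield string, p

# Four derivations, but only three distinct strings: "a cat on the mat"
# is derived twice, with total probability 0.3 + 0.2 = 0.5.
```

This brute-force enumeration is only for illustration; the whole point of the hypergraph is that quantities like expected n-gram counts can instead be computed by dynamic programming without ever listing the derivations.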
Decoding with spurious ambiguity?
• Maximum a posteriori (MAP) decoding
• Viterbi approximation
• N-best approximation, or "crunching" (May and Knight 2006)
Viterbi Approximation

translation string     derivation probabilities     MAP     Viterbi
red translation        0.16, 0.12                   0.28    0.16
blue translation       0.14, 0.14                   0.28    0.14
green translation      0.13, 0.11, 0.10, 0.10       0.44    0.13

• Viterbi approximation: y* = argmax_{y ∈ Trans(x)} max_{d ∈ D(x,y)} p(y, d | x) = Y(argmax_{d ∈ D(x)} p(y, d | x))
• The sum over derivations is replaced by a max, so only the single best derivation counts: here "red translation" wins with 0.16
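On the same toy derivation list, the Viterbi approximation is just an argmax over individual derivations:

```python
# Same toy derivation list as the MAP slide: (translation string, probability).
derivations = [
    ("red translation",   0.16), ("blue translation",  0.14),
    ("blue translation",  0.14), ("green translation", 0.13),
    ("red translation",   0.12), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

def viterbi_decode(derivations):
    """Viterbi approximation: Y(argmax_d p(y, d | x)) -- keep only the single
    best derivation and output its string, ignoring all other derivations."""
    best_string, _ = max(derivations, key=lambda sd: sd[1])
    return best_string

# "red translation" wins with 0.16, even though its total probability (0.28)
# is lower than green's (0.44): the other derivations are simply ignored.
```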
N-best Approximation (Crunching)

translation string     derivation probabilities     MAP     Viterbi     4-best crunching
red translation        0.16, 0.12                   0.28    0.16        0.16
blue translation       0.14, 0.14                   0.28    0.14        0.28
green translation      0.13, 0.11, 0.10, 0.10       0.44    0.13        0.13

• N-best approximation, or crunching (May and Knight 2006): y* = argmax_{y ∈ Trans(x)} Σ_{d ∈ D(x,y) ∩ ND(x)} p(y, d | x), where ND(x) is the set of the N best derivations
• Only the top-N derivations count: here, with N = 4, "blue translation" wins with 0.14 + 0.14 = 0.28
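Crunching sits between the two: it restricts the MAP sum to the N best derivations. A sketch on the same toy list:

```python
from collections import defaultdict

# Same toy derivation list as the MAP slide: (translation string, probability).
derivations = [
    ("red translation",   0.16), ("blue translation",  0.14),
    ("blue translation",  0.14), ("green translation", 0.13),
    ("red translation",   0.12), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

def crunch_decode(derivations, n=4):
    """N-best crunching (May and Knight 2006): sum p(y, d | x) over only the
    n highest-probability derivations, grouped by string."""
    top_n = sorted(derivations, key=lambda sd: sd[1], reverse=True)[:n]
    totals = defaultdict(float)
    for string, prob in top_n:
        totals[string] += prob
    return max(totals, key=totals.get)

# With n = 4 the surviving derivations are 0.16 (red), 0.14 + 0.14 (blue),
# and 0.13 (green), so "blue translation" wins with 0.28 -- a different
# answer from both exact MAP (green) and Viterbi (red).
```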
MAP vs. Approximations

translation string     MAP     Viterbi     4-best crunching
red translation        0.28    0.16        0.16
blue translation       0.28    0.14        0.28
green translation      0.44    0.13        0.13

• Exact MAP decoding under spurious ambiguity is intractable
• Viterbi and crunching are efficient, but ignore most derivations
• Our goal: develop an approximation that considers all the derivations but still allows tractable decoding
Variational Decoding
• Decoding using a variational approximation, i.e. a sentence-specific approximate distribution
Variational Decoding for MT: an Overview
• Sentence-specific decoding: MAP decoding under p is intractable, so we proceed in three steps
• Step 1: Translate the foreign sentence x with the SMT system to generate a hypergraph, which encodes p(y, d | x) and, implicitly, p(y | x)
• Step 2: Estimate a model q*(y | x) over output strings from the hypergraph, so that q*(y | x) ≈ Σ_{d ∈ D(x,y)} p(y, d | x); q* is an n-gram model
• Step 3: Decode using q* on the hypergraph
Variational Inference
• We want to do inference under p, but it is intractable: y* = argmax_y p(y | x)
• Instead, we derive a simpler distribution q* from an approximating family Q: q* = argmin_{q ∈ Q} KL(p || q)
• Then we use q* as a surrogate for p in inference: y* = argmax_y q*(y | x)
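When Q is a family of n-gram models, minimizing KL(p || q) reduces to matching the n-gram counts expected under p (relative-frequency estimation on fractional counts). The sketch below illustrates this with a made-up three-string distribution and a bigram family; in the paper, p is given only implicitly by the hypergraph and the expected counts are computed by dynamic programming, not by enumeration:

```python
from collections import defaultdict

# Illustrative (assumed) string distribution p(y | x), with strings as word
# tuples; the probabilities here are invented for the sketch.
p = {
    ("a", "cat", "on", "the", "mat"): 0.5,
    ("the", "mat", "'s", "a", "cat"): 0.3,
    ("a", "cat", "of", "the", "mat"): 0.2,
}

BOS, EOS = "<s>", "</s>"

def fit_bigram_q(p):
    """The KL(p || q)-minimizing bigram model: expected bigram counts under p,
    normalized by expected context counts."""
    pair, context = defaultdict(float), defaultdict(float)
    for y, prob in p.items():
        words = [BOS] + list(y) + [EOS]
        for w1, w2 in zip(words, words[1:]):
            pair[w1, w2] += prob
            context[w1] += prob
    return {bg: c / context[bg[0]] for bg, c in pair.items()}

def q_score(q, y):
    """Score a string under the bigram model q: product of q(w_i | w_{i-1})."""
    words = [BOS] + list(y) + [EOS]
    score = 1.0
    for bg in zip(words, words[1:]):
        score *= q.get(bg, 0.0)
    return score

# Variational decoding: use q* as a surrogate for p.
q = fit_bigram_q(p)
best = max(p, key=lambda y: q_score(q, y))
```

Note that q* deliberately leaks probability across strings (it can score strings p never generates), which is exactly what lets it aggregate evidence from all derivations while remaining tractable to decode with.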