Online Versus Offline NMT Quality: An In-depth Analysis on English-German and German-English

Maha Elbayad (1,2), Michael Ustaszewski (3), Emmanuelle Esperança-Rodier (1), Francis Brunet-Manquat (1), Jakob Verbeek (4), Laurent Besacier (1)

(1) (2) (3) (4)
Outline
1 Introduction to online translation
2 Neural architectures for online NMT
  a Transformer (Vaswani et al. 2017)
  b Pervasive Attention (Elbayad et al. 2018)
3 Automatic evaluation
4 Human evaluation
5 Conclusion
Online Neural Machine Translation

[Figure: two source-target grids over source tokens x_1..x_7 and target tokens y_1..y_8, contrasting offline translation (the full source is read before any target token is produced) with online translation (target tokens are emitted while the source is still being read).]
Wait-k Decoders for Online Translation

With a wait-k decoder, the number of source tokens read before writing target token y_t is

    z_t = min(k + t − 1, |x|),   ∀ t ∈ [1..|y|]

[Figure: source-target grids showing the decoding paths of Wait-1, Wait-3 and Wait-∞ policies over source tokens x_1..x_5 and target tokens y_1..y_5.]

Wait-k or prefix-to-prefix decoding (Dalvi et al. 2018; Ma et al. 2019; Elbayad et al. 2020).
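As a quick illustration (not taken from the slides), the schedule above can be computed directly; the function name and example values are my own:

    def wait_k_schedule(k, src_len, tgt_len):
        """Number of source tokens z_t available before emitting target token y_t,
        i.e. z_t = min(k + t - 1, |x|) for t = 1..|y| (1-indexed, as on the slide)."""
        return [min(k + t - 1, src_len) for t in range(1, tgt_len + 1)]

    # Wait-3 on a 5-token source and 5-token target -> [3, 4, 5, 5, 5];
    # wait-inf reads the whole source first and degenerates to offline decoding.
    print(wait_k_schedule(3, src_len=5, tgt_len=5))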
Online Transformer

◮ Unidirectional encoder (Elbayad et al. 2020).
  [Figure: encoder states over source tokens x_1..x_6; each state depends only on the current and previous source tokens, so the states of the already-read prefix do not change when the context grows from z_t = 4 to z_{t+1} = 5.]
◮ Masked decoder: the encoder-decoder attention energies are masked w.r.t. z_t, so the decoder state h_{t−1} attends only to encoder states s_1..s_{z_t} (here z_t = 4 out of s_1..s_6).
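To make the masking concrete, here is a hedged sketch (my own simplification, not the authors' implementation) that hides attention energies beyond z_t before the softmax; array shapes and names are assumptions:

    import numpy as np

    def masked_cross_attention(energies, z_t):
        """Mask encoder-decoder attention energies w.r.t. the online context z_t.
        energies: shape (src_len,), raw scores of one decoder step over encoder states.
        z_t:      number of source tokens read so far; positions beyond z_t are hidden."""
        masked = np.where(np.arange(len(energies)) < z_t, energies, -np.inf)
        weights = np.exp(masked - masked.max())   # softmax restricted to visible states
        return weights / weights.sum()

    # With z_t = 4, only encoder states s_1..s_4 receive non-zero attention weight.
    print(masked_cross_attention(np.array([0.2, 1.5, -0.3, 0.7, 2.0, 0.1]), z_t=4))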
The Pervasive Attention Architecture (Elbayad et al. 2018)

[Figure: source and target embeddings are concatenated into a 2D source-target grid; a stack of convolutional feature maps H_0, H_1, ..., H_N is computed over this grid, then aggregated (H_conv, H_out) along the source dimension to produce p(y_1 | y_<1, x), ..., p(y_|y| | y_<|y|, x).]
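A minimal sketch of the concatenated source-target grid that the convolutions operate on; the tensor layout is my own assumption, not the authors' code:

    import numpy as np

    def source_target_grid(src_emb, tgt_emb):
        """Build the 2D joint grid of Pervasive Attention.
        src_emb: (src_len, d) source embeddings; tgt_emb: (tgt_len, d) target embeddings.
        Cell (t, j) of the result concatenates target embedding t and source embedding j."""
        tgt_len, src_len = tgt_emb.shape[0], src_emb.shape[0]
        src_tiled = np.repeat(src_emb[None, :, :], tgt_len, axis=0)  # (tgt_len, src_len, d)
        tgt_tiled = np.repeat(tgt_emb[:, None, :], src_len, axis=1)  # (tgt_len, src_len, d)
        return np.concatenate([src_tiled, tgt_tiled], axis=-1)       # (tgt_len, src_len, 2d)

    grid = source_target_grid(np.random.randn(6, 8), np.random.randn(5, 8))
    print(grid.shape)  # (5, 6, 16)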
Online Pervasive Attention

[Figure: the source-target grid with a 2D causal convolution window W around the current cell (source position z_t, previous target token y_{t−1}), followed by feature aggregation over the source axis to predict y_t.]

+ Masking the future source for unidirectional encoding.
+ The appropriate context size z_t is controlled during aggregation.
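A hedged sketch of the second point (my own simplification, using max-pooling over the source axis; not necessarily the paper's exact aggregation):

    import numpy as np

    def masked_source_aggregation(features, z_t):
        """Aggregate 2D-convolution features over the source axis for one target step.
        features: (src_len, d) features of the current target row.
        z_t:      number of source tokens read; columns beyond z_t are excluded,
                  so the prediction only depends on the available source prefix."""
        visible = features[:z_t]
        return visible.max(axis=0)   # pool over the visible source positions only

    pooled = masked_source_aggregation(np.random.randn(6, 16), z_t=4)
    print(pooled.shape)  # (16,)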
Training and Evaluation Setup

Data
◮ IWSLT'14 De-En and En-De (Cettolo et al. 2014).
◮ Sentences longer than 175 words and pairs with a length ratio above 1.5 are removed (see the filtering sketch after this slide).
◮ The data is tokenized but not lowercased.
◮ The sequences are BPE-segmented (Sennrich et al. 2016) → 32K vocabulary.
◮ Training = 160K, development = 7.3K and test = 6.7K sentence pairs.

Models
◮ For each direction and each architecture, one online and one offline model.
◮ Pervasive Attention (PA) with 14 layers and 7×7 filters (effectively 4×4).
◮ Transformer (TF) small.
◮ Online models are trained with k_train = 7 and evaluated with k_eval = 3.
◮ Greedy decoding for all models.
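A hedged sketch of the length and length-ratio filtering listed above; the thresholds come from the slide, while the function itself is illustrative rather than the authors' preprocessing script:

    def keep_pair(src_tokens, tgt_tokens, max_len=175, max_ratio=1.5):
        """True if a sentence pair survives the filtering described on the slide:
        neither side longer than 175 words, length ratio between sides at most 1.5."""
        len_s, len_t = len(src_tokens), len(tgt_tokens)
        if len_s == 0 or len_t == 0 or max(len_s, len_t) > max_len:
            return False
        return max(len_s, len_t) / min(len_s, len_t) <= max_ratio

    # A 10-word sentence paired with a 16-word one is dropped (ratio 1.6 > 1.5).
    print(keep_pair(["w"] * 10, ["w"] * 16))  # False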