Multimodal Machine Translation with Embedding Prediction Tosho Hirasawa, Hayahide Yamagishi, Yukio Matsumura, Mamoru Komachi hirasawa-tosho@ed.tmu.ac.jp Tokyo Metropolitan University NAACL SRW 2019
Multimodal Machine Translation
• Practical application of machine translation
• Translate a source sentence along with related nonlinguistic information
  • Visual information
two young girls are sitting on the street eating corn .
deux jeunes filles sont assises dans la rue , mangeant du maïs .
1 6/11/19 NAACL SRW 2019, Minneapolis
Issue of MMT
• Multi30k [Elliott et al., 2016] has only a small amount of data
• Statistics of training data

            Sentences   Tokens    Types
  English   29,000      377,534   10,210
  French    29,000      409,845   11,219

• Hard to learn rare word translations
• Tends to output synonyms guided by the language model

  Source     deux jeunes filles sont assises dans la rue , mangeant du maïs .
  Reference  two young girls are sitting on the street eating corn .
  NMT        two young girls are sitting on the street eating food .
Previous Solutions
• Parallel corpus without images [Elliott and Kádár, 2017; Grönroos et al., 2018]
  • Out-of-domain data
  • Pseudo in-domain data by filtering general domain data
• Pseudo-parallel corpus [Sennrich et al., 2016; Helcl et al., 2018]
  • Back-translation of caption/monolingual data
• Monolingual data
  • Pretrained Word Embedding
  • Rarely studied
Motivation
• Introduce pretrained word embedding to MMT
  • Improve rare word translation in MMT
• Pretrained word embeddings with conventional MMT?
  • See our paper at MT Summit 2019 (https://arxiv.org/abs/1905.10464)!
• Pretrained Word Embedding in text-only NMT
  • Initialize embedding layers in encoder/decoder [Qi et al., 2018]
    ✓ Improves overall performance in low-resource domain
  • Search-based decoder with continuous output [Kumar and Tsvetkov, 2019]
    ✓ Improves rare word translation
1. Multimodal Machine Translation
2. MMT with Embedding Prediction
3. Pretrained Word Embedding
4. Result & Conclusion
Baseline: IMAGINATION [Elliott and Kádár, 2017]
• MT model: Bahdanau et al., 2015
• Multitask learning: train both the MT task and the shared space learning task to improve the shared encoder
[Figure: model architecture, with the components used while training and those used while validating and testing marked]
MMT with Embedding Prediction
1. Use embedding prediction in decoder
2. Initialize embedding layers in encoder/decoder with pretrained word embeddings
3. Shift visual features so that their mean vector is zero
[Figure: model architecture, with the components used while training and those used while validating and testing marked]
Embedding Prediction (Continuous Output)
• Also known as continuous output [Kumar and Tsvetkov, 2019]
• Predict a word embedding and search for the nearest word
  1. Predict a word embedding of the next word.
  2. Compute cosine similarities with each word in the pretrained word embedding.
  3. Find and output the most similar word as system output.
• Keep unchanged: the pretrained word embedding will NOT be updated during training.
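
The three search steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy 3-dimensional embeddings, the `predicted_vector`, and the helper names `cosine` and `nearest_word` are all hypothetical, and a real system would search a 300-dimensional FastText table with vectorized operations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_word(predicted, embeddings):
    """Return the vocabulary word whose embedding is most similar to `predicted`."""
    return max(embeddings, key=lambda word: cosine(predicted, embeddings[word]))

# Toy "pretrained" embeddings with made-up values (assumption, for illustration only).
toy_embeddings = {
    "corn": [0.9, 0.1, 0.0],
    "food": [0.6, 0.6, 0.1],
    "street": [0.0, 0.2, 0.9],
}

# Step 1: the decoder predicts an embedding (hypothetical output vector).
predicted_vector = [0.8, 0.2, 0.1]

# Steps 2 and 3: compare against every word and output the most similar one.
print(nearest_word(predicted_vector, toy_embeddings))
```

Because the embedding table is only read, never written, this search is compatible with keeping the pretrained embeddings unchanged during training.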
Embedding Layer Initialization [Qi et al., 2018]
• Initialize embedding layer with pretrained word embedding
• Fine-tune the embedding layer in encoder
• DO NOT update the embedding layer in decoder
Loss Function
• Model loss: Interpolation of each loss [Elliott and Kádár, 2017]
• MT task: Max-margin with negative sampling [Lazaridou et al., 2015]
  • negative sampling
• Shared space learning task: Max-margin [Elliott and Kádár, 2017]
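
A rough sketch of how a max-margin loss with negative sampling and a loss interpolation can be written. Several things here are assumptions rather than the talk's exact formulation: the hinge form `max(0, margin + sim(pred, negative) - sim(pred, gold))`, the choice of the most similar non-gold word as the negative sample, the `margin` value, and placing the weight λ on the MT loss (λ = 0.99 is listed in the Hyper Parameters slide). The shared space loss is taken as an input and not implemented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def margin_loss(predicted, gold_word, embeddings, margin=0.5):
    """Hinge loss for one target position (assumed form, hypothetical margin)."""
    gold_sim = cosine(predicted, embeddings[gold_word])
    # Negative sample (assumption): the non-gold word most similar to the prediction.
    neg_sim = max(
        cosine(predicted, vec)
        for word, vec in embeddings.items()
        if word != gold_word
    )
    return max(0.0, margin + neg_sim - gold_sim)

def interpolate(mt_loss, shared_space_loss, lam=0.99):
    """Weighted sum of the two task losses (assumed: lam weights the MT task)."""
    return lam * mt_loss + (1.0 - lam) * shared_space_loss

# Toy embeddings and prediction with made-up values.
toy_embeddings = {
    "corn": [0.9, 0.1, 0.0],
    "food": [0.6, 0.6, 0.1],
    "street": [0.0, 0.2, 0.9],
}
mt = margin_loss([0.8, 0.2, 0.1], "corn", toy_embeddings)
print(interpolate(mt, shared_space_loss=0.2))
```

The loss is zero once the gold word beats the hardest negative by at least the margin, so training effort concentrates on positions where a competing word is still too close.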
Hubness Problem [Lazaridou et al., 2015]
• Certain words (hubs) appear frequently in the neighbors of other words
  • Even of words that have no relationship with the hubs
• Prevents the embedding prediction model from finding the correct output words
  • The model incorrectly outputs the hub word
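
One common way to make hubness visible is to count how often each word appears among the k nearest neighbors of the other words. The sketch below does this on toy data; the word names, the vectors, and the helper `k_occurrence` are hypothetical and only illustrate the counting, not any measurement from the talk.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def k_occurrence(embeddings, k=1):
    """Count how many times each word is among the k nearest neighbors of another word."""
    counts = {word: 0 for word in embeddings}
    for query, query_vec in embeddings.items():
        others = [word for word in embeddings if word != query]
        others.sort(key=lambda word: cosine(query_vec, embeddings[word]), reverse=True)
        for neighbor in others[:k]:
            counts[neighbor] += 1
    return counts

# Toy vectors (made-up): "hub" lies between three mutually orthogonal words.
toy_embeddings = {
    "hub": [1.0, 0.9, 0.8],
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
}
print(k_occurrence(toy_embeddings, k=1))
```

A word with a count far above the others is a hub: a nearest-neighbor search will tend to return it even when the predicted embedding was aimed at a different word.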
All-but-the-Top [Mu and Viswanath, 2018]
• Addresses the hubness problem in other NLP tasks
• Debias a pretrained word embedding based on its global bias
  1. Shift all word embeddings so that their mean vector is the zero vector
  2. Subtract the top 5 PCA components from each shifted word embedding
• Applied to pretrained word embeddings for encoder/decoder
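
The two debiasing steps can be sketched with NumPy as follows. This is a minimal sketch, assuming the embeddings are stored as a matrix with one row per word and that "subtracting the top PCA components" means removing each vector's projection onto the top principal directions; the function name `all_but_the_top` and the random toy matrix are illustrative only.

```python
import numpy as np

def all_but_the_top(embeddings, n_components=5):
    """Center the rows, then remove their projections onto the top principal directions."""
    # Step 1: shift so that the mean vector becomes the zero vector.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered matrix,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]  # shape: (n_components, dim)
    # Step 2: subtract each row's projection onto the top directions.
    return centered - centered @ top.T @ top

# Toy data (made-up): 50 "words" in 8 dimensions with a shared offset.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 8)) + 3.0
debiased = all_but_the_top(toy, n_components=2)
print(debiased.shape)
```

The default of 5 components follows the slide; the toy call uses 2 only because the toy vectors have 8 dimensions.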
Implementation & Dataset
• Implementation
  • Based on nmtpytorch v3.0.0 [Caglayan et al., 2017]
• Dataset
  • Multi30k (French to English)
  • Pretrained ResNet50 for visual encoder
• Pretrained Word Embedding
  • FastText
  • Trained on Common Crawl and Wikipedia
  • https://fasttext.cc/docs/en/crawl-vectors.html
Our code is here: https://github.com/toshohirasawa/nmtpytorch-emb-pred
Hyper Parameters
• Model
  • dimension of hidden state: 256
  • RNN type: GRU
  • dimension of word embedding: 300
  • dimension of shared space: 2048
  • Vocabulary size (French, English): 10,000
• Training
  • λ = 0.99
  • Optimizer: Adam
  • Learning rate: 0.0004
  • Dropout rate: 0.3
Word-level F1-score
[Figure: bar chart of the word-level F-score (y-axis, 0 to 80) by word frequency in training data (x-axis buckets: 1, 2, 3, 4, 5-9, 10-99, 100+) for three systems: Bahdanau et al., 2015, IMAGINATION, and Ours. The low-frequency buckets are marked "Rare words".]
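
A word-level F1-score bucketed by training frequency can be computed along the following lines. This is a sketch under assumptions: the slide does not say how the metric was computed, so the clipped per-sentence matching, the skipping of words unseen in training, and all names and frequencies below are hypothetical. Only the bucket boundaries are taken from the chart's x-axis.

```python
from collections import Counter

def bucket_of(freq):
    """Map a training-set frequency (>= 1) to the bucket labels on the chart's x-axis."""
    if freq <= 4:
        return str(freq)
    if freq <= 9:
        return "5-9"
    if freq <= 99:
        return "10-99"
    return "100+"

def word_f1_by_bucket(references, hypotheses, train_freq):
    """Per-bucket F1 over word occurrences, matching words within each sentence pair."""
    match, ref_total, hyp_total = Counter(), Counter(), Counter()
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref.split()), Counter(hyp.split())
        for word in set(ref_counts) | set(hyp_counts):
            freq = train_freq.get(word, 0)
            if freq == 0:
                continue  # assumption: words unseen in training are not bucketed
            bucket = bucket_of(freq)
            match[bucket] += min(ref_counts[word], hyp_counts[word])
            ref_total[bucket] += ref_counts[word]
            hyp_total[bucket] += hyp_counts[word]
    scores = {}
    for bucket in ref_total.keys() | hyp_total.keys():
        precision = match[bucket] / hyp_total[bucket] if hyp_total[bucket] else 0.0
        recall = match[bucket] / ref_total[bucket] if ref_total[bucket] else 0.0
        total = precision + recall
        scores[bucket] = 2 * precision * recall / total if total else 0.0
    return scores

# Made-up training frequencies and a single sentence pair.
toy_freq = {"two": 500, "girls": 120, "eating": 50, "corn": 2, "food": 30}
print(word_f1_by_bucket(["two girls eating corn"], ["two girls eating food"], toy_freq))
```

In this toy pair the rare word "corn" is replaced by the more frequent "food", so the rare bucket scores zero while the frequent buckets stay high, which is the kind of gap the chart is meant to expose.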
Ablation w.r.t. Embedding Layers

  Encoder    Decoder    Fixed   BLEU    METEOR
  FastText   FastText   Yes     53.49   43.89
  random     FastText   Yes     53.22   43.83
  FastText   random     No      51.53   43.07
  random     random     No      51.42   42.77
  FastText   FastText   No      51.42   42.88
  random     FastText   No      50.72   42.52

Encoder/Decoder: Initialize the embedding layer with random values or FastText word embedding.
Fixed (Yes/No): Whether the embedding layer in the decoder is kept fixed or fine-tuned during training.
• Fixing the embedding layer in decoder is essential
  • Keeps word embeddings in input/output layers consistent