Attention Strategies for Multi-Source Sequence-to-Sequence Learning


  1. Attention Strategies for Multi-Source Sequence-to-Sequence Learning
     Jindřich Libovický, Jindřich Helcl
     Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University
     August 2, 2017

  2. Motivation
     • There is no universal method that explicitly models the importance of each input.
     • Attention over multiple source sequences is relatively unexplored.
     • This work proposes two techniques:
       • flat attention combination
       • hierarchical attention combination
     • Both are applied to the tasks of multimodal translation and automatic post-editing.

  3. Multi-Source Sequence-to-Sequence Learning
     Any number of input sequences, possibly of different modalities.
     [Figure 1: multimodal translation example.]
     Examples: multimodal translation, automatic post-editing, multi-source machine translation, ...

  4. Attentive Sequence Learning
     In each decoder step i, the decoder:
     • computes a distribution over the encoder states given the decoder state:
       e_{ij} = v_a^\top \tanh(W_a s_i + U_a h_j)
       \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
     • gets a context vector to decide about its output:
       c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
     What about multiple inputs?

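To make the computation concrete, here is a minimal NumPy sketch of the single-source attention above. The shapes and random weights are illustrative placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

T_x, dim_h, dim_s, dim_a = 6, 8, 8, 5   # encoder length, state sizes, attention size
H = rng.normal(size=(T_x, dim_h))       # encoder states h_1 .. h_{T_x}
s_i = rng.normal(size=dim_s)            # decoder state at step i

W_a = rng.normal(size=(dim_a, dim_s))   # projection of the decoder state
U_a = rng.normal(size=(dim_a, dim_h))   # projection of the encoder states
v_a = rng.normal(size=dim_a)

# e_ij = v_a^T tanh(W_a s_i + U_a h_j), computed for all positions j at once
e = np.tanh(W_a @ s_i + H @ U_a.T) @ v_a

# alpha_ij: softmax over the encoder positions
alpha = np.exp(e) / np.exp(e).sum()

# c_i = sum_j alpha_ij h_j -- the context vector passed to the decoder
c_i = alpha @ H
```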

  5. Context Vector Concatenation
     • A widely used technique [Firat et al., 2016, Zoph and Knight, 2016].
     • Attention over the input sequences is computed independently.
     • The combination is resolved later on in the network.
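As a rough illustration of this strategy (a sketch, not the cited systems' implementation), the snippet below runs the attention from the previous slide independently over two sources with separate parameters and concatenates the resulting context vectors; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def attend(s_i, H, W_a, U_a, v_a):
    """Single-source attention; returns the context vector."""
    e = np.tanh(W_a @ s_i + H @ U_a.T) @ v_a
    alpha = np.exp(e) / np.exp(e).sum()
    return alpha @ H

dim_s, dim_a = 8, 5
s_i = rng.normal(size=dim_s)             # decoder state
sources = [rng.normal(size=(6, 8)),      # e.g. source-text encoder states
           rng.normal(size=(4, 10))]     # e.g. image-region features

contexts = []
for H in sources:                        # independent attention per source
    W_a = rng.normal(size=(dim_a, dim_s))
    U_a = rng.normal(size=(dim_a, H.shape[1]))
    v_a = rng.normal(size=dim_a)
    contexts.append(attend(s_i, H, W_a, U_a, v_a))

# The combination is deferred: the decoder sees the plain concatenation.
c_i = np.concatenate(contexts)           # shape (8 + 10,)
```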

  6. Flat Attention Combination
     The importance of the different inputs is reflected in the joint attention distribution.
     [Figure: flat attention combination diagram.]

  7. Flat Attention Combination
     Going from one source to N sources, the encoder states are projected to a common space:
     e_{ij} = v_a^\top \tanh(W_a s_i + U_a h_j)
       → e_{ij}^{(k)} = v_a^\top \tanh(W_a s_i + U_a^{(k)} h_j^{(k)})
     \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
       → \alpha_{ij}^{(k)} = \frac{\exp(e_{ij}^{(k)})}{\sum_{n=1}^{N} \sum_{m=1}^{T^{(n)}} \exp(e_{im}^{(n)})}
     c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
       → c_i = \sum_{k=1}^{N} \sum_{j=1}^{T^{(k)}} \alpha_{ij}^{(k)} U_c^{(k)} h_j^{(k)}
     • Question: should U_a^{(k)} = U_c^{(k)}, i.e. should the projection parameters be shared?
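A minimal sketch of the flat combination under the equations above: the per-source energies share one joint softmax, and each source's states are projected by a per-source U_c^{(k)} into a common space before the weighted sum. Weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

dim_s, dim_a, dim_c = 8, 5, 8
s_i = rng.normal(size=dim_s)
sources = [rng.normal(size=(6, 8)), rng.normal(size=(4, 10))]

W_a = rng.normal(size=(dim_a, dim_s))           # shared decoder-state projection
v_a = rng.normal(size=dim_a)

energies, projected = [], []
for H in sources:
    U_a = rng.normal(size=(dim_a, H.shape[1]))  # per-source U_a^(k)
    U_c = rng.normal(size=(dim_c, H.shape[1]))  # per-source U_c^(k)
    energies.append(np.tanh(W_a @ s_i + H @ U_a.T) @ v_a)
    projected.append(H @ U_c.T)                 # states mapped to the common space

# A single joint softmax over all positions of all sources.
e_all = np.concatenate(energies)
alpha = np.exp(e_all) / np.exp(e_all).sum()

# c_i: sum over sources and positions of alpha times the projected state.
c_i = np.concatenate(projected, axis=0).T @ alpha   # shape (dim_c,)
```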

  8. Hierarchical Attention Combination
     The attention distribution is factored by input.
     [Figure: hierarchical attention combination diagram.]

  9. Hierarchical Attention Combination
     1. Compute the context vector for each of the N inputs using the vanilla attention:
        c_i^{(k)} = \sum_{j=1}^{T^{(k)}} \alpha_{ij}^{(k)} h_j^{(k)}, \quad k = 1 \ldots N
     2. Compute another attention distribution over the intermediate context vectors c_i^{(k)} and get the resulting context vector c_i:
        e_i^{(k)} = v_b^\top \tanh(W_b s_i + U_b^{(k)} c_i^{(k)})
        \beta_i^{(k)} = \frac{\exp(e_i^{(k)})}{\sum_{n=1}^{N} \exp(e_i^{(n)})}
        c_i = \sum_{k=1}^{N} \beta_i^{(k)} U_c^{(k)} c_i^{(k)}
     • As in the flat scenario, the context vectors have to be projected to a shared space.
     • The same question arises: should U_b^{(k)} = U_c^{(k)}?
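A minimal sketch of the two-level computation above, assuming the same toy setup as in the earlier snippets: vanilla attention within each source, then a second attention (W_b, U_b^{(k)}, v_b) over the per-source context vectors. Weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

def attend(s_i, H, W_a, U_a, v_a):
    """Single-source attention; returns the context vector."""
    e = np.tanh(W_a @ s_i + H @ U_a.T) @ v_a
    alpha = np.exp(e) / np.exp(e).sum()
    return alpha @ H

dim_s, dim_a, dim_b, dim_c = 8, 5, 5, 8
s_i = rng.normal(size=dim_s)
sources = [rng.normal(size=(6, 8)), rng.normal(size=(4, 10))]

# Step 1: a context vector c_i^(k) per source via vanilla attention.
ctx = []
for H in sources:
    W_a = rng.normal(size=(dim_a, dim_s))
    U_a = rng.normal(size=(dim_a, H.shape[1]))
    v_a = rng.normal(size=dim_a)
    ctx.append(attend(s_i, H, W_a, U_a, v_a))

# Step 2: a second attention over the intermediate context vectors.
W_b = rng.normal(size=(dim_b, dim_s))
v_b = rng.normal(size=dim_b)
e, proj = [], []
for c_k in ctx:
    U_b = rng.normal(size=(dim_b, c_k.shape[0]))  # per-source U_b^(k)
    U_c = rng.normal(size=(dim_c, c_k.shape[0]))  # per-source U_c^(k)
    e.append(v_b @ np.tanh(W_b @ s_i + U_b @ c_k))
    proj.append(U_c @ c_k)                        # projected to the shared space

e = np.asarray(e)
beta = np.exp(e) / np.exp(e).sum()                # distribution over the N sources
c_i = sum(b * p for b, p in zip(beta, proj))      # final context vector, shape (dim_c,)
```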

  10. Experiments and Results
      • Experiments were conducted on multimodal translation (MMT) and automatic post-editing (APE).
      • In both the flat and hierarchical scenarios, we tried both sharing and not sharing the projection matrices.
      • Additionally, we tried using the sentinel gate [Lu et al., 2016], which enables the decoder to decide whether or not to attend to any encoder (a sketch follows below).
      • Experiments were conducted using Neural Monkey; code available at https://github.com/ufal/neuralmonkey.
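The snippet below is only a simplified illustration of the sentinel idea, not Lu et al.'s exact formulation: one extra learned energy competes in the softmax, so the decoder can route attention mass to a "null" sentinel instead of any encoder position. The projection names W_s and v_s are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

e_sources = rng.normal(size=10)        # attention energies over all encoder positions
s_i = rng.normal(size=8)               # decoder state
W_s = rng.normal(size=(5, 8))          # hypothetical sentinel projection
v_s = rng.normal(size=5)

# One extra energy for the sentinel competes in the same softmax.
e_sentinel = v_s @ np.tanh(W_s @ s_i)
e_all = np.append(e_sources, e_sentinel)
alpha = np.exp(e_all) / np.exp(e_all).sum()

# alpha[-1] is the mass the decoder chose NOT to spend on any encoder.
encoder_mass = alpha[:-1].sum()
```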

  11. Experiments and Results
      Results on the Multi30k dataset (MMT) and the APE dataset. The column "share" denotes whether the projection matrix is shared for the energies and the context vector computation; "sent." indicates whether the sentinel vector has been used.

                     share  sent. | MMT BLEU    MMT METEOR | APE BLEU    APE HTER
      concat.          -      -   | 31.4 ± .8   48.0 ± .7  | 62.3 ± .5   24.4 ± .4
      flat             ×      ×   | 30.2 ± .8   46.5 ± .7  | 62.6 ± .5   24.2 ± .4
      flat             ×      ✓   | 29.3 ± .8   45.4 ± .7  | 62.3 ± .5   24.3 ± .4
      flat             ✓      ×   | 30.9 ± .8   47.1 ± .7  | 62.4 ± .6   24.4 ± .4
      flat             ✓      ✓   | 29.4 ± .8   46.9 ± .7  | 62.5 ± .6   24.2 ± .4
      hierarchical     ×      ×   | 32.1 ± .8   49.1 ± .7  | 62.3 ± .5   24.1 ± .4
      hierarchical     ×      ✓   | 28.1 ± .8   45.5 ± .7  | 62.6 ± .6   24.1 ± .4
      hierarchical     ✓      ×   | 26.1 ± .7   42.4 ± .7  | 62.4 ± .5   24.3 ± .4
      hierarchical     ✓      ✓   | 22.0 ± .7   38.5 ± .6  | 62.5 ± .5   24.1 ± .4

  12. Example
      Source: "ein Mann schläft in einem grünen Raum auf einem Sofa ."
      Reference: "A man sleeping in a green room on a couch."
      [Figure: attention visualization of the output over (1) the source, (2) the image, and (3) the sentinel.]

  13. Conclusions
      • The results show that both methods achieve results comparable to the existing approach (concatenation of the context vectors).
      • Hierarchical attention combination achieved the best results on MMT and is faster to train.
      • Both methods provide a trivial way to inspect the attention distribution w.r.t. the individual inputs.
      Thank you for your attention!


  14. References
      Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, CA, USA. Association for Computational Linguistics. http://www.aclweb.org/anthology/N16-1101.
      Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2016. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. CoRR abs/1612.01887. http://arxiv.org/abs/1612.01887.
      Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 30–34, San Diego, CA, USA. Association for Computational Linguistics. http://www.aclweb.org/anthology/N16-1004.
