Attention Strategies for Multi-Source Sequence-to-Sequence Learning


  1. Attention Strategies for Multi-Source Sequence-to-Sequence Learning
     Jindřich Libovický, Jindřich Helcl
     Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University
     August 2, 2017

  2. Motivation
     • There is no universal method that explicitly models the importance of each input.
     • Attention over multiple source sequences is relatively unexplored.
     • This work proposes two techniques:
       • flat attention combination
       • hierarchical attention combination
     • Both are applied to the tasks of multimodal translation and automatic post-editing.

  3. Multi-Source Sequence-to-Sequence Learning
     Any number of input sequences, possibly of different modalities.
     [Figure 1: multimodal translation example.]
     Examples: multimodal translation, automatic post-editing, multi-source machine translation, ...

  4. Attentive Sequence Learning
     In each decoder step i, the decoder:
     • computes a distribution over the encoder states given the decoder state:
       e_{ij} = v_a^\top \tanh(W_a s_i + U_a h_j)
       \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
     • gets a context vector to decide about its output:
       c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
     What about multiple inputs?

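To make the computation concrete, here is a minimal NumPy sketch of the single-source attention above. The shapes and random weights are illustrative placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

T_x, dim_h, dim_s, dim_a = 6, 8, 8, 5   # encoder length, state sizes, attention size
H = rng.normal(size=(T_x, dim_h))       # encoder states h_1 .. h_{T_x}
s_i = rng.normal(size=dim_s)            # decoder state at step i

W_a = rng.normal(size=(dim_a, dim_s))   # projection of the decoder state
U_a = rng.normal(size=(dim_a, dim_h))   # projection of the encoder states
v_a = rng.normal(size=dim_a)

# e_ij = v_a^T tanh(W_a s_i + U_a h_j), computed for all positions j at once
e = np.tanh(W_a @ s_i + H @ U_a.T) @ v_a

# alpha_ij: softmax over the encoder positions
alpha = np.exp(e) / np.exp(e).sum()

# c_i = sum_j alpha_ij h_j -- the context vector passed to the decoder
c_i = alpha @ H
```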

  5. Context Vector Concatenation
     • A widely used technique [Firat et al., 2016, Zoph and Knight, 2016].
     • Attention over the input sequences is computed independently.
     • The combination is resolved later on in the network.
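As a rough illustration of this strategy (a sketch, not the cited systems' implementation), the snippet below runs the attention from the previous slide independently over two sources with separate parameters and concatenates the resulting context vectors; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def attend(s_i, H, W_a, U_a, v_a):
    """Single-source attention; returns the context vector."""
    e = np.tanh(W_a @ s_i + H @ U_a.T) @ v_a
    alpha = np.exp(e) / np.exp(e).sum()
    return alpha @ H

dim_s, dim_a = 8, 5
s_i = rng.normal(size=dim_s)             # decoder state
sources = [rng.normal(size=(6, 8)),      # e.g. source-text encoder states
           rng.normal(size=(4, 10))]     # e.g. image-region features

contexts = []
for H in sources:                        # independent attention per source
    W_a = rng.normal(size=(dim_a, dim_s))
    U_a = rng.normal(size=(dim_a, H.shape[1]))
    v_a = rng.normal(size=dim_a)
    contexts.append(attend(s_i, H, W_a, U_a, v_a))

# The combination is deferred: the decoder sees the plain concatenation.
c_i = np.concatenate(contexts)           # shape (8 + 10,)
```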

  6. Flat Attention Combination
     The importance of the different inputs is reflected in the joint attention distribution.
     [Figure: flat attention combination diagram.]

  7. Flat Attention Combination
     Going from one source to N sources, the encoder states are projected to a common space:
     e_{ij} = v_a^\top \tanh(W_a s_i + U_a h_j)
       → e_{ij}^{(k)} = v_a^\top \tanh(W_a s_i + U_a^{(k)} h_j^{(k)})
     \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
       → \alpha_{ij}^{(k)} = \frac{\exp(e_{ij}^{(k)})}{\sum_{n=1}^{N} \sum_{m=1}^{T^{(n)}} \exp(e_{im}^{(n)})}
     c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
       → c_i = \sum_{k=1}^{N} \sum_{j=1}^{T^{(k)}} \alpha_{ij}^{(k)} U_c^{(k)} h_j^{(k)}
     • Question: should U_a^{(k)} = U_c^{(k)}, i.e. should the projection parameters be shared?
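A minimal sketch of the flat combination under the equations above: the per-source energies share one joint softmax, and each source's states are projected by a per-source U_c^{(k)} into a common space before the weighted sum. Weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

dim_s, dim_a, dim_c = 8, 5, 8
s_i = rng.normal(size=dim_s)
sources = [rng.normal(size=(6, 8)), rng.normal(size=(4, 10))]

W_a = rng.normal(size=(dim_a, dim_s))           # shared decoder-state projection
v_a = rng.normal(size=dim_a)

energies, projected = [], []
for H in sources:
    U_a = rng.normal(size=(dim_a, H.shape[1]))  # per-source U_a^(k)
    U_c = rng.normal(size=(dim_c, H.shape[1]))  # per-source U_c^(k)
    energies.append(np.tanh(W_a @ s_i + H @ U_a.T) @ v_a)
    projected.append(H @ U_c.T)                 # states mapped to the common space

# A single joint softmax over all positions of all sources.
e_all = np.concatenate(energies)
alpha = np.exp(e_all) / np.exp(e_all).sum()

# c_i: sum over sources and positions of alpha times the projected state.
c_i = np.concatenate(projected, axis=0).T @ alpha   # shape (dim_c,)
```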

  8. Hierarchical Attention Combination
     The attention distribution is factored by input.
     [Figure: hierarchical attention combination diagram.]

  9. Hierarchical Attention Combination
     1. Compute the context vector for each of the N inputs using the vanilla attention:
        c_i^{(k)} = \sum_{j=1}^{T^{(k)}} \alpha_{ij}^{(k)} h_j^{(k)}, \quad k = 1 \ldots N
     2. Compute another attention distribution over the intermediate context vectors c_i^{(k)} and get the resulting context vector c_i:
        e_i^{(k)} = v_b^\top \tanh(W_b s_i + U_b^{(k)} c_i^{(k)})
        \beta_i^{(k)} = \frac{\exp(e_i^{(k)})}{\sum_{n=1}^{N} \exp(e_i^{(n)})}
        c_i = \sum_{k=1}^{N} \beta_i^{(k)} U_c^{(k)} c_i^{(k)}
     • As in the flat scenario, the context vectors have to be projected to a shared space.
     • The same question arises: should U_b^{(k)} = U_c^{(k)}?
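A minimal sketch of the two-level computation above, assuming the same toy setup as in the earlier snippets: vanilla attention within each source, then a second attention (W_b, U_b^{(k)}, v_b) over the per-source context vectors. Weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

def attend(s_i, H, W_a, U_a, v_a):
    """Single-source attention; returns the context vector."""
    e = np.tanh(W_a @ s_i + H @ U_a.T) @ v_a
    alpha = np.exp(e) / np.exp(e).sum()
    return alpha @ H

dim_s, dim_a, dim_b, dim_c = 8, 5, 5, 8
s_i = rng.normal(size=dim_s)
sources = [rng.normal(size=(6, 8)), rng.normal(size=(4, 10))]

# Step 1: a context vector c_i^(k) per source via vanilla attention.
ctx = []
for H in sources:
    W_a = rng.normal(size=(dim_a, dim_s))
    U_a = rng.normal(size=(dim_a, H.shape[1]))
    v_a = rng.normal(size=dim_a)
    ctx.append(attend(s_i, H, W_a, U_a, v_a))

# Step 2: a second attention over the intermediate context vectors.
W_b = rng.normal(size=(dim_b, dim_s))
v_b = rng.normal(size=dim_b)
e, proj = [], []
for c_k in ctx:
    U_b = rng.normal(size=(dim_b, c_k.shape[0]))  # per-source U_b^(k)
    U_c = rng.normal(size=(dim_c, c_k.shape[0]))  # per-source U_c^(k)
    e.append(v_b @ np.tanh(W_b @ s_i + U_b @ c_k))
    proj.append(U_c @ c_k)                        # projected to the shared space

e = np.asarray(e)
beta = np.exp(e) / np.exp(e).sum()                # distribution over the N sources
c_i = sum(b * p for b, p in zip(beta, proj))      # final context vector, shape (dim_c,)
```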

  10. Experiments and Results
      • Experiments were conducted on multimodal translation (MMT) and automatic post-editing (APE).
      • In both the flat and hierarchical scenarios, we tried both sharing and not sharing the projection matrices.
      • Additionally, we tried using the sentinel gate [Lu et al., 2016], which enables the decoder to decide whether or not to attend to any encoder (a sketch follows below).
      • Experiments were conducted using Neural Monkey; code available at https://github.com/ufal/neuralmonkey.
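The snippet below is only a simplified illustration of the sentinel idea, not Lu et al.'s exact formulation: one extra learned energy competes in the softmax, so the decoder can route attention mass to a "null" sentinel instead of any encoder position. The projection names W_s and v_s are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

e_sources = rng.normal(size=10)        # attention energies over all encoder positions
s_i = rng.normal(size=8)               # decoder state
W_s = rng.normal(size=(5, 8))          # hypothetical sentinel projection
v_s = rng.normal(size=5)

# One extra energy for the sentinel competes in the same softmax.
e_sentinel = v_s @ np.tanh(W_s @ s_i)
e_all = np.append(e_sources, e_sentinel)
alpha = np.exp(e_all) / np.exp(e_all).sum()

# alpha[-1] is the mass the decoder chose NOT to spend on any encoder.
encoder_mass = alpha[:-1].sum()
```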

  11. Experiments and Results
      Results on the Multi30k dataset (MMT) and the APE dataset. The column "share" denotes whether the projection matrix is shared for the energies and the context vector computation; "sent." indicates whether the sentinel vector has been used.

                     share  sent. | MMT BLEU    MMT METEOR | APE BLEU    APE HTER
      concat.          -      -   | 31.4 ± .8   48.0 ± .7  | 62.3 ± .5   24.4 ± .4
      flat             ×      ×   | 30.2 ± .8   46.5 ± .7  | 62.6 ± .5   24.2 ± .4
      flat             ×      ✓   | 29.3 ± .8   45.4 ± .7  | 62.3 ± .5   24.3 ± .4
      flat             ✓      ×   | 30.9 ± .8   47.1 ± .7  | 62.4 ± .6   24.4 ± .4
      flat             ✓      ✓   | 29.4 ± .8   46.9 ± .7  | 62.5 ± .6   24.2 ± .4
      hierarchical     ×      ×   | 32.1 ± .8   49.1 ± .7  | 62.3 ± .5   24.1 ± .4
      hierarchical     ×      ✓   | 28.1 ± .8   45.5 ± .7  | 62.6 ± .6   24.1 ± .4
      hierarchical     ✓      ×   | 26.1 ± .7   42.4 ± .7  | 62.4 ± .5   24.3 ± .4
      hierarchical     ✓      ✓   | 22.0 ± .7   38.5 ± .6  | 62.5 ± .5   24.1 ± .4

  12. Example
      Source: "ein Mann schläft in einem grünen Raum auf einem Sofa ."
      Reference: "A man sleeping in a green room on a couch."
      [Figure: attention visualization of the output over (1) the source, (2) the image, and (3) the sentinel.]

  13. Conclusions
      • The results show that both methods achieve results comparable to the existing approach (concatenation of the context vectors).
      • Hierarchical attention combination achieved the best results on MMT and is faster to train.
      • Both methods provide a trivial way to inspect the attention distribution w.r.t. the individual inputs.
      Thank you for your attention!


  14. References
      Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, CA, USA. Association for Computational Linguistics. http://www.aclweb.org/anthology/N16-1101.
      Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2016. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. CoRR abs/1612.01887. http://arxiv.org/abs/1612.01887.
      Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 30–34, San Diego, CA, USA. Association for Computational Linguistics. http://www.aclweb.org/anthology/N16-1004.
