Effective Approaches to Attention-based Neural Machine Translation
Thang Luong, Hieu Pham, and Chris Manning (EMNLP 2015)
Presented by: Yunan Zhang
Neural Machine Translation (Sutskever et al., 2014) + Attention Mechanism (Bahdanau et al., 2015)
[Figure: encoder-decoder translating "I am a student" → "Je suis étudiant"]
Attention is a recent innovation in deep learning:
• Control problem (Mnih et al., '14)
• Speech recognition (Chorowski et al., '14)
• Image captioning (Xu et al., '15)
It is also the new approach behind recent SOTA results in NMT:
• English-French (Luong et al., '15)
• English-German (Jean et al., '15)
Our work:
• Propose a new and better attention mechanism.
• Examine other variants of attention models.
• Achieve new SOTA results on WMT English-German.
Neural Machine Translation (NMT)
[Figure: encoder-decoder RNN translating "I am a student" → "Je suis étudiant"]
• Big RNNs trained end-to-end: encoder-decoder.
  – Generalize well to long sequences.
  – Small memory footprint.
  – Simple decoder.
Attention Mechanism
[Figure: attention layer forming a context vector over "I am a student", with example weights 0.6, 0.2, 0.1, 0.1]
• Maintain a memory of source hidden states.
• The context vector is a weighted average of these hidden states.
• Each weight is determined by comparing the current target hidden state with the corresponding source hidden state.
• Benefit: able to translate long sentences.
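A minimal sketch of the step just described, assuming plain NumPy, a dot-product score, and toy dimensions; this illustrates the idea rather than the paper's exact implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(target_h, source_hs):
    """target_h: (d,) current target hidden state.
    source_hs: (S, d) memory of source hidden states."""
    scores = source_hs @ target_h      # compare target state with each source state
    weights = softmax(scores)          # e.g. [0.6, 0.2, 0.1, 0.1]
    context = weights @ source_hs      # weighted average of the source states
    return context, weights

# Toy usage: 4 source positions ("I am a student"), 5-dim hidden states.
rng = np.random.default_rng(0)
src_states = rng.normal(size=(4, 5))
tgt_state = rng.normal(size=5)
ctx, w = attention_context(tgt_state, src_states)
print(w.round(2), ctx.shape)
```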
Motivation
• A new attention mechanism: local attention.
  – Use a subset of source states each time.
  – Better results with focused attention!
• Global attention: use all source states.
  – Other variants of (Bahdanau et al., '15).
Global Attention
• Alignment weight vector a_t: compare the current target hidden state h_t with every source hidden state h̄_s and normalize the scores with a softmax (Bahdanau et al., '15).
• Context vector c_t: weighted average of the source states under a_t.
• Attentional vector: combine context and target state, h̃_t = tanh(W_c [c_t ; h_t]).
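A hedged NumPy sketch of the three global-attention steps above. The "general" score h_tᵀ W_a h̄_s is one of the paper's scoring functions, but the matrices W_a, W_c and all sizes here are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def global_attention(h_t, src_hs, W_a, W_c):
    """h_t: (d,) target hidden state; src_hs: (S, d) all source hidden states."""
    scores = src_hs @ (W_a @ h_t)                         # score(h_t, h_s) for every source position
    a_t = softmax(scores)                                 # alignment weight vector
    c_t = a_t @ src_hs                                    # context vector: weighted average of source states
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # attentional vector
    return h_tilde, a_t

d, S = 8, 4
rng = np.random.default_rng(1)
h_t, src_hs = rng.normal(size=d), rng.normal(size=(S, d))
W_a, W_c = rng.normal(size=(d, d)), rng.normal(size=(d, 2 * d))
h_tilde, a_t = global_attention(h_t, src_hs, W_a, W_c)
```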
Local Attention
• How do we choose aligned positions? An aligned position p_t defines a focused window of source states.
• A blend between soft & hard attention (Xu et al., '15).
Local Attention (2)
• Predict aligned positions: p_t = S · sigmoid(v_pᵀ tanh(W_p h_t)), a real value in [0, S], where S is the source sentence length.
• How do we learn the position parameters?
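A small NumPy sketch of the position-prediction formula above; the shapes of W_p and v_p and the toy source length are assumptions:

```python
import numpy as np

def predict_position(h_t, W_p, v_p, S):
    """Return a real-valued aligned source position p_t in [0, S]."""
    return S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))   # S * sigmoid(v_p^T tanh(W_p h_t))

rng = np.random.default_rng(2)
d, S = 8, 20                                # hidden size and source length (toy values)
h_t = rng.normal(size=d)
W_p, v_p = rng.normal(size=(d, d)), rng.normal(size=d)
p_t = predict_position(h_t, W_p, v_p, S)    # the focused window is then centred at p_t
```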
Local Attention (3)
[Figure: alignment weights over source positions s, reweighted by a truncated Gaussian centred at p_t = 5.5, producing a new peak near the centre]
• Like the global model: compute an alignment weight for each integer position s in the window around p_t.
• Multiply by a truncated Gaussian centred at p_t to favor points close to the center.
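A hedged sketch combining the two steps on this slide: in-window alignment weights (dot-product score assumed) reweighted by a truncated Gaussian with σ = D/2. The window half-width D and all sizes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def local_attention_weights(h_t, src_hs, p_t, D=2):
    """Alignment weights over the window [p_t - D, p_t + D] only."""
    S = len(src_hs)
    lo = max(0, int(np.floor(p_t - D)))
    hi = min(S - 1, int(np.ceil(p_t + D)))
    window = np.arange(lo, hi + 1)
    scores = src_hs[window] @ h_t                                # like the global model, restricted to the window
    a = softmax(scores)
    sigma = D / 2.0
    a = a * np.exp(-((window - p_t) ** 2) / (2 * sigma ** 2))    # truncated Gaussian: favor points near p_t
    return window, a

rng = np.random.default_rng(3)
src_hs = rng.normal(size=(10, 8))
h_t = rng.normal(size=8)
positions, weights = local_attention_weights(h_t, src_hs, p_t=5.5, D=2)
```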
Experiments
• WMT English ⇄ German (4.5M sentence pairs).
• Setup follows (Sutskever et al., '14; Luong et al., '15):
  – 4-layer stacking LSTMs, 1000-dim cells/embeddings.
  – Vocabulary: 50K most frequent English & German words.
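A hypothetical sketch of this setup in PyTorch (the paper's implementation is not PyTorch; the layer, dimension, and vocabulary sizes come from the slide, everything else is an assumption):

```python
import torch.nn as nn

VOCAB = 50_000          # 50K most frequent words per language
DIM = 1000              # 1000-dim cells / embeddings
LAYERS = 4              # 4-layer stacking LSTM

src_embed = nn.Embedding(VOCAB, DIM)
tgt_embed = nn.Embedding(VOCAB, DIM)
encoder = nn.LSTM(input_size=DIM, hidden_size=DIM, num_layers=LAYERS)
decoder = nn.LSTM(input_size=DIM, hidden_size=DIM, num_layers=LAYERS)
generator = nn.Linear(DIM, VOCAB)   # projects decoder states to target-vocabulary logits
```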
English-German WMT'14 Results
Systems                                                        Ppl    BLEU
Winning system – phrase-based + large LM (Buck et al., 2014)          20.7
Our NMT systems
  Base                                                         10.6   11.3
  Base + reverse                                                9.9   12.6 (+1.3)
  Base + reverse + dropout                                      8.1   14.0 (+1.4)
  Base + reverse + dropout + global attn                        7.3   16.8 (+2.8)
  Base + reverse + dropout + global attn + feed input           6.4   18.1 (+1.3)
• Large progressive gains: attention +2.8 BLEU, feed input +1.3 BLEU.
• BLEU & perplexity correlate (Luong et al., '15).
English-German WMT'14 Results (cont.)
Systems                                                        Ppl    BLEU
Winning sys – phrase-based + large LM (Buck et al., 2014)             20.7
Existing NMT systems (Jean et al., 2015)
  RNNsearch                                                           16.5
  RNNsearch + unk repl. + large vocab + ensemble 8 models             21.6
Our NMT systems
  Global attention                                              7.3   16.8 (+2.8)
  Global attention + feed input                                 6.4   18.1 (+1.3)
  Local attention + feed input                                  5.9   19.0 (+0.9)
  Local attention + feed input + unk replace                    5.9   20.9 (+1.9)
  Ensemble 8 models + unk replace                                     23.0 (+2.1)  New SOTA!
• Local-predictive attention: +0.9 BLEU gain.
• Unknown-word replacement: +1.9 BLEU (Luong et al., '15; Jean et al., '15).
WMT'15 English-German Results
Systems                                                 BLEU
Winning system – NMT + 5-gram LM reranker (Montreal)    24.9
Our ensemble 8 models + unk replace                     25.9   New SOTA!
• WMT'15 German-English: similar gains.
  – Attention: +2.7 BLEU; feed input: +1.0 BLEU.
Analysis
• Learning curves
• Long sentences
• Alignment quality
• Sample translations
Learning Curves
[Figure: learning curves comparing models with and without attention]
Translate Long Sentences
[Figure: translation quality by sentence length, attention vs. no attention]
Alignment Quality
Models                AER
Berkeley aligner      0.32
Our NMT systems
  Global attention    0.39
  Local attention     0.36
  Ensemble            0.34
• RWTH gold alignment data: 508 English-German Europarl sentences.
• Force decode our models. Competitive AERs!
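For reference, a sketch of how such alignments could be extracted and scored: take each target word's argmax attention weight as its alignment and evaluate with the standard alignment error rate (sure links S, possible links P). The argmax extraction heuristic is an assumption, not necessarily the paper's exact procedure:

```python
def aer(hypothesis, sure, possible):
    """hypothesis, sure, possible: sets of (source_index, target_index) links; sure ⊆ possible."""
    matched_sure = len(hypothesis & sure)
    matched_possible = len(hypothesis & possible)
    return 1.0 - (matched_sure + matched_possible) / (len(hypothesis) + len(sure))

def extract_alignment(attn_matrix):
    """attn_matrix[t][s]: attention weight of target word t on source word s."""
    return {(max(range(len(row)), key=row.__getitem__), t)
            for t, row in enumerate(attn_matrix)}

# Toy usage with a 2-target-word, 3-source-word attention matrix.
hyp = extract_alignment([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(aer(hyp, sure={(0, 0), (1, 1)}, possible={(0, 0), (1, 1), (2, 1)}))
```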
Sample English-German translations
src:  "We 're pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security ," said Roger Dow , CEO of the U.S. Travel Association .
ref:  „Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht" , sagte Roger Dow , CEO der U.S. Travel Association .
best: "Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist" , sagte Roger Dow , CEO der US - die .
base: "Wir freuen uns über die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit Sicherheit und Sicherheit" , sagte Roger Cameron , CEO der US - <unk> .
• Our best model translates the doubly-negated phrase correctly.
• It fails to translate "passenger experience".