NPFL116 Compendium of Neural Machine Translation
Attentive Sequence-to-Sequence Learning
March 6, 2018
Jindřich Helcl, Jindřich Libovický
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Neural Network Language Models
RNN Language Model
• Defines a distribution over sentences
• Train the RNN as a classifier for the next word (unlimited history)
• Can be used to estimate sentence probability / perplexity
• We can sample from the distribution
(Figures: an RNN unrolled over w_1 … w_4, <s>, predicting p(w_1) … p(w_5); sampling ~w_1 … ~w_5 starting from <s>.)
Two views on RNN LM
• RNN is a for loop (a functional map) over sequential data
• All outputs are conditional distributions → a probabilistic model: a distribution over sequences of words

P(w_1, …, w_n) = ∏_{i=1}^{n} P(w_i | w_{i−1}, …, w_1)
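The two views can be made concrete in a few lines of Python. The sketch below is illustrative only (not part of the course code); the parameter names E, W_h, W_x, W_out and the vocab mapping are assumptions. It runs the for loop of an elementary tanh RNN and accumulates the chain-rule product as a sum of log-probabilities.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sentence_log_prob(words, vocab, E, W_h, W_x, b_h, W_out, b_out):
    """log P(w_1..w_n) = sum_i log P(w_i | w_{i-1}, ..., w_1) in one RNN pass."""
    state = np.zeros(W_h.shape[0])             # h_0 = 0
    log_prob = 0.0
    prev = "<s>"
    for w in list(words) + ["</s>"]:
        # procedural view: one elementary tanh RNN step per token
        state = np.tanh(W_h @ state + W_x @ E[vocab[prev]] + b_h)
        p = softmax(W_out @ state + b_out)     # conditional distribution over the next word
        log_prob += np.log(p[vocab[w]])        # probabilistic view: the chain rule
        prev = w
    return log_prob
```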
Vanilla Sequence-to-Sequence Model
Encoder-Decoder NMT
• Exploits the conditional LM scheme
• Two networks:
  1. A network processing the input sentence into a single vector representation (encoder)
  2. A neural language model initialized with the output of the encoder (decoder)
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in Neural Information Processing Systems. 2014.
Encoder-Decoder – Image
Source-language input + target-language LM
(Figure: the encoder reads x_1 … x_4 and <s>; the decoder then generates ~y_1 … ~y_5, starting from <s>.)
Encoder-Decoder Model – Code

import numpy as np

def greedy_translate(input_words):
    # encoder: fold the source sentence into a single state vector
    state = np.zeros(emb_size)
    for w in input_words:
        input_embedding = source_embeddings[w]
        state, _ = enc_cell(state, input_embedding)

    # decoder: a conditional LM initialized with the final encoder state
    last_w = "<s>"
    while last_w != "</s>":
        last_w_embedding = target_embeddings[last_w]
        state, dec_output = dec_cell(state, last_w_embedding)
        logits = output_projection(dec_output)
        last_w = target_vocabulary[np.argmax(logits)]  # map the argmax index back to a word
        yield last_w
Encoder-Decoder Model – Formal Notation

Data:
• source sentence x = (x_1, …, x_{T_x}), input embeddings (source language)
• target sentence y = (y_1, …, y_{T_y}), output embeddings (target language)

Encoder:
• initial state: h_0 ≡ 0
• j-th state: h_j = RNN_enc(h_{j−1}, x_j)

Decoder:
• initial state: s_0 = h_{T_x} (the encoder's final state)
• i-th decoder state: s_i = RNN_dec(s_{i−1}, ŷ_i)
• i-th word score (output or multi-layer projection): t_{i+1} = U_o s_i + V_o E y_i + b_o
• ŷ_{i+1} = arg max t_{i+1}
Encoder-Decoder: Training Objective

For the output word y_i we have:
• the estimated conditional distribution p̂_i = exp t_i / ∑ exp t_i (the softmax function)
• the unknown true distribution p_i, for which we lay p_i ≡ 1[y_i] (one-hot on the reference word)

Cross entropy ≈ distance of p̂ and p:

L = H(p̂, p) = E_p(−log p̂) = −log p̂(y_i)

…and computing ∂L / ∂t_i is super simple.
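As a concrete illustration of why ∂L / ∂t_i is so simple, here is a minimal numpy sketch (assumed names, not from the slides' code): the gradient of the cross-entropy loss with respect to the scores is just the softmax output minus the one-hot true distribution.

```python
import numpy as np

def xent_loss_and_grad(t, gold_index):
    """Cross-entropy of the softmax of scores t against a one-hot reference."""
    p_hat = np.exp(t - t.max())
    p_hat /= p_hat.sum()                 # estimated distribution: softmax over the scores
    loss = -np.log(p_hat[gold_index])    # L = -log p_hat(y_i) because the true p_i is one-hot
    grad = p_hat.copy()
    grad[gold_index] -= 1.0              # dL/dt_i = p_hat - 1[y_i]
    return loss, grad

# tiny usage example with made-up scores over a 4-word vocabulary
loss, grad = xent_loss_and_grad(np.array([2.0, 0.5, -1.0, 0.1]), gold_index=0)
```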
Implementation: Runtime vs. Training
• training: the decoder is fed the ground-truth words y_j
• runtime: the decoder is fed its own decoded words ŷ_j
(Figure: the same unrolled network; in training the ground truth y_1 … y_4 enters the decoder and the loss is computed against it, while at runtime the predictions ~y_1 … ~y_5 are fed back.)
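A hedged numpy sketch of the two modes, reusing the naming of the earlier code slide (dec_cell, output_projection, target_embeddings); the vocab / rev_vocab mappings and the function boundaries are assumptions. Training feeds the ground truth back in and sums the loss; runtime feeds back the argmax prediction.

```python
import numpy as np

def train_step(state, target_words, target_embeddings, dec_cell,
               output_projection, vocab):
    """Training mode: feed the ground-truth y_j, accumulate cross-entropy."""
    loss, prev = 0.0, "<s>"
    for w in list(target_words) + ["</s>"]:
        state, dec_output = dec_cell(state, target_embeddings[prev])
        logits = output_projection(dec_output)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        loss += -np.log(p[vocab[w]])     # loss measured against the ground-truth word
        prev = w                         # the ground truth goes back in, not the argmax
    return loss

def greedy_decode(state, target_embeddings, dec_cell, output_projection,
                  rev_vocab, max_len=50):
    """Runtime mode: feed back the decoder's own predictions until </s>."""
    output, last_w = [], "<s>"
    while last_w != "</s>" and len(output) < max_len:
        state, dec_output = dec_cell(state, target_embeddings[last_w])
        logits = output_projection(dec_output)
        last_w = rev_vocab[int(np.argmax(logits))]   # the prediction is fed back in
        output.append(last_w)
    return output
```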
Sutskever et al.
• Reverse the input sequence
• Impressive empirical results – made researchers believe NMT is the way to go

Evaluation on the WMT14 EN → FR test set:

method                           BLEU score
vanilla SMT                      33.0
tuned SMT                        37.0
Sutskever et al.: reversed       30.6
–”–: ensemble + beam search      34.8
–”–: vanilla SMT rescoring       36.5
Bahdanau's attention             28.5

Why does the better model of Bahdanau et al. score worse?
Sutskever et al. × Bahdanau et al.

                     Sutskever et al.          Bahdanau et al.
vocabulary           160k enc, 80k dec         30k both
encoder              4 × LSTM, 1,000 units     bidi GRU, 2,000 units
decoder              4 × LSTM, 1,000 units     GRU, 1,000 units
word embeddings      1,000 dimensions          620 dimensions
training time        7.5 epochs                5 epochs

With Bahdanau's model size:

method               BLEU score
encoder-decoder      13.9
attention model      28.5
Attentive Sequence-to-Sequence Learning
Main Idea
• Same as reversing the input: do not force the network to catch long-distance dependencies in the source sentence
• The RNN can serve as an LM – it can store the language context in its hidden states
• Use the decoder state only for target-sentence dependencies and as a query for the source words
Inspiration: Neural Turing Machine
• General architecture for learning algorithmic tasks, a finite imitation of a Turing Machine
• Needs to address memory somehow – either by position or by content
• In fact it does not work well – it hardly manages simple algorithmic tasks
• Content-based addressing → attention
Attention Model
(Figure: encoder states h_0 … h_4 over the inputs x_1 … x_4, <s>; attention weights α_0 … α_4 scale the encoder states, which are summed into a context vector used by the decoder states s_{i−1}, s_i, s_{i+1} to produce ~y_i, ~y_{i+1}.)
Attention Model in Equations (1)

Inputs:
• decoder state s_i
• encoder states h_j = [→h_j; ←h_j], ∀ j = 1 … T_x

Attention energies:   e_ij = v_a⊤ tanh(W_a s_{i−1} + U_a h_j + b_a)

Attention distribution:   α_ij = exp(e_ij) / ∑_{k=1}^{T_x} exp(e_ik)

Context vector:   c_i = ∑_{j=1}^{T_x} α_ij h_j
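A minimal numpy sketch of these three equations for a single decoder step; the parameter names mirror the formulas (W_a, U_a, v_a, b_a), and the shapes are illustrative assumptions rather than the sizes used in the paper.

```python
import numpy as np

def attention(s_prev, H, W_a, U_a, v_a, b_a):
    """One attention step: energies, distribution and context vector.

    s_prev : previous decoder state s_{i-1}, shape (d_s,)
    H      : encoder states h_1 .. h_Tx stacked as rows, shape (Tx, d_h)
    """
    # energies e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j + b_a), one per source position
    e = np.tanh(W_a @ s_prev + H @ U_a.T + b_a) @ v_a      # shape (Tx,)
    # attention distribution: softmax over the source positions
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # context vector c_i = sum_j alpha_ij h_j
    c = alpha @ H                                          # shape (d_h,)
    return alpha, c
```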
Attention Model in Equations (2)

Output projection:   t_i = MLP(U_o s_{i−1} + V_o E y_{i−1} + C_o c_i + b_o)
…the context vector is mixed with the hidden state

Output distribution:   p(y_i = w | s_i, y_{i−1}, c_i) ∝ exp((W_o t_i)_w + b_w)
Attention Visualization
Attention vs. Alignment
Differences between the attention model and the word alignment used for phrase table generation:

attention (NMT)      alignment (SMT)
probabilistic        discrete
declarative          imperative
LM generates         LM discriminates
Image Captioning
Attention over a CNN trained for image classification:
Source: Xu, Kelvin, et al. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” ICML. Vol. 14. 2015.
Reading for the Next Week
Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.
http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Questions:
• The model uses scaled dot-product attention, which is a non-parametric variant of the attention mechanism. Why do you think it is sufficient in this setup? Do you think it would work in the recurrent model as well?
• The way the model processes the sequence is fundamentally different from RNNs or CNNs. Does it agree with your intuition of how language should be processed?
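To anchor the first question, a small numpy sketch (an illustration, not code from the paper) of scaled dot-product attention: unlike the additive attention above, it has no learned parameters of its own and compares queries with keys directly.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Non-parametric attention: the weights come only from query-key dot products."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # scaled energies, shape (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of the values
```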