NPFL116 Compendium of Neural Machine Translation
Attentive Sequence-to-Sequence Learning
March 6, 2018
Jindřich Helcl, Jindřich Libovický
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Neural Network Language Models
RNN Language Model
• Defines a distribution over sentences
• Train the RNN as a classifier for the next word (unlimited history)
• Can be used to estimate sentence probability / perplexity
• We can sample from the distribution
(Figures: an RNN unrolled over w_1 … w_4, <s>, predicting p(w_1) … p(w_5); sampling ~w_1 … ~w_5 starting from <s>.)
Two views on RNN LM
• RNN is a for loop (a functional map) over sequential data
• All outputs are conditional distributions → a probabilistic model: a distribution over sequences of words

P(w_1, …, w_n) = ∏_{i=1}^{n} P(w_i | w_{i−1}, …, w_1)
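The two views can be made concrete in a few lines of Python. The sketch below is illustrative only (not part of the course code); the parameter names E, W_h, W_x, W_out and the vocab mapping are assumptions. It runs the for loop of an elementary tanh RNN and accumulates the chain-rule product as a sum of log-probabilities.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sentence_log_prob(words, vocab, E, W_h, W_x, b_h, W_out, b_out):
    """log P(w_1..w_n) = sum_i log P(w_i | w_{i-1}, ..., w_1) in one RNN pass."""
    state = np.zeros(W_h.shape[0])             # h_0 = 0
    log_prob = 0.0
    prev = "<s>"
    for w in list(words) + ["</s>"]:
        # procedural view: one elementary tanh RNN step per token
        state = np.tanh(W_h @ state + W_x @ E[vocab[prev]] + b_h)
        p = softmax(W_out @ state + b_out)     # conditional distribution over the next word
        log_prob += np.log(p[vocab[w]])        # probabilistic view: the chain rule
        prev = w
    return log_prob
```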
Vanilla Sequence-to-Sequence Model
Encoder-Decoder NMT
• Exploits the conditional LM scheme
• Two networks:
  1. A network processing the input sentence into a single vector representation (encoder)
  2. A neural language model initialized with the output of the encoder (decoder)
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in Neural Information Processing Systems. 2014.
Encoder-Decoder – Image
Source-language input + target-language LM
(Figure: the encoder reads x_1 … x_4 and <s>; the decoder then generates ~y_1 … ~y_5, starting from <s>.)
Encoder-Decoder Model – Code

import numpy as np

def greedy_translate(input_words):
    # encoder: fold the source sentence into a single state vector
    state = np.zeros(emb_size)
    for w in input_words:
        input_embedding = source_embeddings[w]
        state, _ = enc_cell(state, input_embedding)

    # decoder: a conditional LM initialized with the final encoder state
    last_w = "<s>"
    while last_w != "</s>":
        last_w_embedding = target_embeddings[last_w]
        state, dec_output = dec_cell(state, last_w_embedding)
        logits = output_projection(dec_output)
        last_w = target_vocabulary[np.argmax(logits)]  # map the argmax index back to a word
        yield last_w
Encoder-Decoder Model – Formal Notation

Data:
• source sentence x = (x_1, …, x_{T_x}), input embeddings (source language)
• target sentence y = (y_1, …, y_{T_y}), output embeddings (target language)

Encoder:
• initial state: h_0 ≡ 0
• j-th state: h_j = RNN_enc(h_{j−1}, x_j)

Decoder:
• initial state: s_0 = h_{T_x} (the encoder's final state)
• i-th decoder state: s_i = RNN_dec(s_{i−1}, ŷ_i)
• i-th word score (output or multi-layer projection): t_{i+1} = U_o s_i + V_o E y_i + b_o
• ŷ_{i+1} = arg max t_{i+1}
Encoder-Decoder: Training Objective

For the output word y_i we have:
• the estimated conditional distribution p̂_i = exp t_i / ∑ exp t_i (the softmax function)
• the unknown true distribution p_i, for which we lay p_i ≡ 1[y_i] (one-hot on the reference word)

Cross entropy ≈ distance of p̂ and p:

L = H(p̂, p) = E_p(−log p̂) = −log p̂(y_i)

…and computing ∂L / ∂t_i is super simple.
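As a concrete illustration of why ∂L / ∂t_i is so simple, here is a minimal numpy sketch (assumed names, not from the slides' code): the gradient of the cross-entropy loss with respect to the scores is just the softmax output minus the one-hot true distribution.

```python
import numpy as np

def xent_loss_and_grad(t, gold_index):
    """Cross-entropy of the softmax of scores t against a one-hot reference."""
    p_hat = np.exp(t - t.max())
    p_hat /= p_hat.sum()                 # estimated distribution: softmax over the scores
    loss = -np.log(p_hat[gold_index])    # L = -log p_hat(y_i) because the true p_i is one-hot
    grad = p_hat.copy()
    grad[gold_index] -= 1.0              # dL/dt_i = p_hat - 1[y_i]
    return loss, grad

# tiny usage example with made-up scores over a 4-word vocabulary
loss, grad = xent_loss_and_grad(np.array([2.0, 0.5, -1.0, 0.1]), gold_index=0)
```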
Implementation: Runtime vs. Training
• training: the decoder is fed the ground-truth words y_j
• runtime: the decoder is fed its own decoded words ŷ_j
(Figure: the same unrolled network; in training the ground truth y_1 … y_4 enters the decoder and the loss is computed against it, while at runtime the predictions ~y_1 … ~y_5 are fed back.)
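A hedged numpy sketch of the two modes, reusing the naming of the earlier code slide (dec_cell, output_projection, target_embeddings); the vocab / rev_vocab mappings and the function boundaries are assumptions. Training feeds the ground truth back in and sums the loss; runtime feeds back the argmax prediction.

```python
import numpy as np

def train_step(state, target_words, target_embeddings, dec_cell,
               output_projection, vocab):
    """Training mode: feed the ground-truth y_j, accumulate cross-entropy."""
    loss, prev = 0.0, "<s>"
    for w in list(target_words) + ["</s>"]:
        state, dec_output = dec_cell(state, target_embeddings[prev])
        logits = output_projection(dec_output)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        loss += -np.log(p[vocab[w]])     # loss measured against the ground-truth word
        prev = w                         # the ground truth goes back in, not the argmax
    return loss

def greedy_decode(state, target_embeddings, dec_cell, output_projection,
                  rev_vocab, max_len=50):
    """Runtime mode: feed back the decoder's own predictions until </s>."""
    output, last_w = [], "<s>"
    while last_w != "</s>" and len(output) < max_len:
        state, dec_output = dec_cell(state, target_embeddings[last_w])
        logits = output_projection(dec_output)
        last_w = rev_vocab[int(np.argmax(logits))]   # the prediction is fed back in
        output.append(last_w)
    return output
```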
Sutskever et al.
• Reverse the input sequence
• Impressive empirical results – made researchers believe NMT is the way to go

Evaluation on the WMT14 EN → FR test set:

method                           BLEU score
vanilla SMT                      33.0
tuned SMT                        37.0
Sutskever et al.: reversed       30.6
–”–: ensemble + beam search      34.8
–”–: vanilla SMT rescoring       36.5
Bahdanau's attention             28.5

Why does the better model of Bahdanau et al. score worse?
Sutskever et al. × Bahdanau et al.

                     Sutskever et al.          Bahdanau et al.
vocabulary           160k enc, 80k dec         30k both
encoder              4 × LSTM, 1,000 units     bidi GRU, 2,000 units
decoder              4 × LSTM, 1,000 units     GRU, 1,000 units
word embeddings      1,000 dimensions          620 dimensions
training time        7.5 epochs                5 epochs

With Bahdanau's model size:

method               BLEU score
encoder-decoder      13.9
attention model      28.5
Attentive Sequence-to-Sequence Learning
Main Idea
• Same as reversing the input: do not force the network to catch long-distance dependencies in the source sentence
• The RNN can serve as an LM – it can store the language context in its hidden states
• Use the decoder state only for target-sentence dependencies and as a query for the source words
Inspiration: Neural Turing Machine
• General architecture for learning algorithmic tasks, a finite imitation of a Turing Machine
• Needs to address memory somehow – either by position or by content
• In fact it does not work well – it hardly manages simple algorithmic tasks
• Content-based addressing → attention
Attention Model
(Figure: encoder states h_0 … h_4 over the inputs x_1 … x_4, <s>; attention weights α_0 … α_4 scale the encoder states, which are summed into a context vector used by the decoder states s_{i−1}, s_i, s_{i+1} to produce ~y_i, ~y_{i+1}.)
Attention Model in Equations (1)

Inputs:
• decoder state s_i
• encoder states h_j = [→h_j; ←h_j], ∀ j = 1 … T_x

Attention energies:   e_ij = v_a⊤ tanh(W_a s_{i−1} + U_a h_j + b_a)

Attention distribution:   α_ij = exp(e_ij) / ∑_{k=1}^{T_x} exp(e_ik)

Context vector:   c_i = ∑_{j=1}^{T_x} α_ij h_j
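A minimal numpy sketch of these three equations for a single decoder step; the parameter names mirror the formulas (W_a, U_a, v_a, b_a), and the shapes are illustrative assumptions rather than the sizes used in the paper.

```python
import numpy as np

def attention(s_prev, H, W_a, U_a, v_a, b_a):
    """One attention step: energies, distribution and context vector.

    s_prev : previous decoder state s_{i-1}, shape (d_s,)
    H      : encoder states h_1 .. h_Tx stacked as rows, shape (Tx, d_h)
    """
    # energies e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j + b_a), one per source position
    e = np.tanh(W_a @ s_prev + H @ U_a.T + b_a) @ v_a      # shape (Tx,)
    # attention distribution: softmax over the source positions
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # context vector c_i = sum_j alpha_ij h_j
    c = alpha @ H                                          # shape (d_h,)
    return alpha, c
```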
Attention Model in Equations (2)

Output projection:   t_i = MLP(U_o s_{i−1} + V_o E y_{i−1} + C_o c_i + b_o)
…the context vector is mixed with the hidden state

Output distribution:   p(y_i = w | s_i, y_{i−1}, c_i) ∝ exp((W_o t_i)_w + b_w)
Attention Visualization
Attention vs. Alignment
Differences between the attention model and the word alignment used for phrase table generation:

attention (NMT)      alignment (SMT)
probabilistic        discrete
declarative          imperative
LM generates         LM discriminates
Image Captioning
Attention over a CNN trained for image classification:
Source: Xu, Kelvin, et al. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” ICML. Vol. 14. 2015.
Reading for the Next Week
Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.
http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Questions:
• The model uses scaled dot-product attention, which is a non-parametric variant of the attention mechanism. Why do you think it is sufficient in this setup? Do you think it would work in the recurrent model as well?
• The way the model processes the sequence is fundamentally different from RNNs or CNNs. Does it agree with your intuition of how language should be processed?
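To anchor the first question, a small numpy sketch (an illustration, not code from the paper) of scaled dot-product attention: unlike the additive attention above, it has no learned parameters of its own and compares queries with keys directly.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Non-parametric attention: the weights come only from query-key dot products."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # scaled energies, shape (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of the values
```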