  1. Neural Machine Translation. Philipp Koehn, 6 October 2020.

  2. Language Models
     • Modeling variants
       – feed-forward neural network
       – recurrent neural network
       – long short-term memory neural network
     • May include input context

  3. Feed-Forward Neural Language Model
     [Diagram: the history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} are embedded, combined in a feed-forward hidden layer h, and a softmax output layer predicts the word w_i.]
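A minimal code sketch of this feed-forward language model (not part of the original slides): it is written in PyTorch, the vocabulary, embedding, and hidden sizes are arbitrary placeholders, and tanh is an assumed choice of activation.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Predict word w_i from the embeddings of the previous four words w_{i-4} .. w_{i-1}."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, history=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)           # Embed
        self.hidden = nn.Linear(history * embed_dim, hidden_dim)   # FF hidden layer h
        self.out = nn.Linear(hidden_dim, vocab_size)               # scores fed to the softmax

    def forward(self, history_ids):
        # history_ids: (batch, 4) word ids of the history w_{i-4} .. w_{i-1}
        e = self.embed(history_ids)                    # (batch, 4, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))      # concatenate embeddings, apply hidden layer
        return torch.log_softmax(self.out(h), dim=-1)  # log-probabilities of the output word w_i
```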

  4. Recurrent Neural Language Model
     [Diagram: the start symbol <s> is embedded, fed through a recurrent state (RNN), and a softmax layer predicts the output word "the".]
     • Predict the first word of a sentence

  5. Recurrent Neural Language Model
     [Diagram: the same network unrolled over two steps; input "<s> the", predicted output "the house".]
     • Predict the second word of a sentence
     • Re-use the hidden state from the first word prediction

  6. Recurrent Neural Language Model
     [Diagram: unrolled over three steps; input "<s> the house", predicted output "the house is".]
     • Predict the third word of a sentence ... and so on

  7. Recurrent Neural Language Model
     [Diagram: unrolled over the full sentence; input "<s> the house is big .", predicted output "the house is big . </s>".]
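A corresponding sketch of the unrolled recurrent language model, again in PyTorch and not taken from the slides; a GRU is an assumed stand-in for the generic RNN cell, and all sizes are placeholders.

```python
import torch
import torch.nn as nn

class RecurrentLM(nn.Module):
    """Predict each next word from the current word and the hidden state of the previous step."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # Embed
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # recurrent state
        self.out = nn.Linear(hidden_dim, vocab_size)                 # scores fed to the softmax

    def forward(self, input_ids):
        # input_ids: (batch, seq) word ids, e.g. "<s> the house is big ."
        h, _ = self.rnn(self.embed(input_ids))           # hidden state re-used at every step
        return torch.log_softmax(self.out(h), dim=-1)    # predictions, e.g. "the house is big . </s>"
```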

  8. Recurrent Neural Translation Model
     • We predicted the words of a sentence
     • Why not also predict their translations?

  9. Encoder-Decoder Model
     [Diagram: a single recurrent language model run over the concatenated sequence; input "<s> the house is big . </s> das Haus ist groß .", predicted output "the house is big . </s> das Haus ist groß . </s>".]
     • Obviously madness
     • Proposed by Google (Sutskever et al. 2014)
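To illustrate the idea on this slide, a hedged sketch of greedy decoding with the RecurrentLM class from the earlier sketch: run the language model over the English sentence and keep predicting words until a second end-of-sentence symbol appears. The word ids are hypothetical, and an untrained model would of course not actually produce the German sentence.

```python
import torch

model = RecurrentLM()                        # recurrent LM sketch from above
source = [4, 5, 6, 7, 8, 2]                  # "the house is big . </s>" (placeholder ids; 2 = </s>)
tokens = [1] + source                        # 1 = <s>
for _ in range(20):                          # greedily continue the sequence into the translation
    log_probs = model(torch.tensor([tokens]))
    next_id = int(log_probs[0, -1].argmax()) # most probable next word given everything so far
    tokens.append(next_id)
    if next_id == 2:                         # stop at the second </s>
        break
```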

  10. What is Missing?
     • Alignment of input words to output words
       ⇒ Solution: attention mechanism

  11. Neural Translation Model with Attention

  12. Input Encoding
     [Diagram: the recurrent neural language model from before, run over the input sentence "<s> the house is big .".]
     • Inspiration: recurrent neural network language model on the input side

  13. Hidden Language Model States
     • This gives us the hidden states (one RNN state per input word)
     • These encode left context for each word
     • Same process in reverse: right context for each word

  14. Input Encoder
     [Diagram: a right-to-left and a left-to-right encoder RNN run over the embedded input words "<s> the house is big . </s>".]
     • Input encoder: concatenate bidirectional RNN states
     • Each word representation includes full left and right sentence context

  15. Encoder: Math
     [Diagram: the bidirectional input encoder from the previous slide.]
     • Input is a sequence of words $x_j$, mapped into embedding space $\bar{E} x_j$
     • Bidirectional recurrent neural networks
       $\overleftarrow{h}_j = f(\overleftarrow{h}_{j+1}, \bar{E} x_j)$
       $\overrightarrow{h}_j = f(\overrightarrow{h}_{j-1}, \bar{E} x_j)$
     • Various choices for the function $f()$: feed-forward layer, GRU, LSTM, ...
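A sketch of this bidirectional encoder in the same hedged PyTorch style; bidirectional=True runs a left-to-right and a right-to-left GRU and concatenates their states, corresponding to pairing $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$ on the slide.

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Concatenate left-to-right and right-to-left RNN states to represent each input word."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # embedding \bar{E} x_j
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, input_ids):
        # input_ids: (batch, src_len) word ids of the input sentence
        h, _ = self.rnn(self.embed(input_ids))
        return h   # (batch, src_len, 2*hidden_dim): h_j = (forward state, backward state)
```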

  16. Decoder
     [Diagram: a chain of decoder RNN states, each followed by a softmax prediction of an output word.]
     • We want to have a recurrent neural network predicting output words

  17. Decoder
     [Diagram: as before, with each predicted output word embedded and fed back into the next decoder state.]
     • We want to have a recurrent neural network predicting output words
     • We feed decisions on output words back into the decoder state

  18. Decoder
     [Diagram: as before, with an input context c_i also feeding into each decoder state.]
     • We want to have a recurrent neural network predicting output words
     • We feed decisions on output words back into the decoder state
     • Decoder state is also informed by the input context

  19. More Detail
     [Diagram: one decoder step with previous output word "das", decoder state $s_i$, and input context $c_i$.]
     • Decoder is also a recurrent neural network over a sequence of hidden states $s_i$
       $s_i = f(s_{i-1}, E y_{i-1}, c_i)$
     • Again, various choices for the function $f()$: feed-forward layer, GRU, LSTM, ...
     • Output word $y_i$ is selected by computing a vector $t_i$ (same size as vocabulary)
       $t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)$
       then finding the highest value in vector $t_i$
     • If we normalize $t_i$, we can view it as a probability distribution over words
     • $E y_i$ is the embedding of the output word $y_i$
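A hedged sketch of one such decoder step; a GRU cell is an assumed choice for $f()$, and $W$, $U$, $V$, $C$ are plain linear layers with placeholder dimensions.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """s_i = f(s_{i-1}, E y_{i-1}, c_i);  t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, ctx_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # E
        self.f = nn.GRUCell(embed_dim + ctx_dim, hidden_dim)    # f(): here a GRU cell
        self.U = nn.Linear(hidden_dim, hidden_dim)
        self.V = nn.Linear(embed_dim, hidden_dim)
        self.C = nn.Linear(ctx_dim, hidden_dim)
        self.W = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_state, prev_word, context):
        # prev_state s_{i-1}: (batch, hidden); prev_word y_{i-1}: (batch,); context c_i: (batch, ctx)
        e = self.embed(prev_word)                                      # E y_{i-1}
        s = self.f(torch.cat([e, context], dim=-1), prev_state)       # new decoder state s_i
        t = self.W(self.U(prev_state) + self.V(e) + self.C(context))  # t_i, one score per vocab word
        return s, torch.softmax(t, dim=-1)   # normalized t_i: a distribution over output words
```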

  20. Attention
     [Diagram: the decoder state attends over the bidirectional encoder states via attention weights α_ij, producing an input context.]
     • Given what we have generated so far (decoder hidden state) ...
     • ... which words in the input should we pay attention to (encoder states)?

  21. Attention
     [Diagram: as before.]
     • Given:
       – the previous hidden state of the decoder $s_{i-1}$
       – the representation of input words $h_j = (\overleftarrow{h}_j, \overrightarrow{h}_j)$
     • Predict an alignment probability $a(s_{i-1}, h_j)$ for each input word $j$ (modeled with a feed-forward neural network layer)

  22. Attention
     [Diagram: as before.]
     • Normalize attention (softmax)
       $\alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_k \exp(a(s_{i-1}, h_k))}$

  23. Attention
     [Diagram: the attention weights form a weighted sum of encoder states, giving the input context c_i.]
     • Relevant input context: weigh input words according to attention
       $c_i = \sum_j \alpha_{ij} h_j$
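The three attention slides above reduce to a few lines of code; a hedged sketch in the same style, where a small feed-forward network is an assumed form of the scoring function $a(s_{i-1}, h_j)$.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Score each input word, normalize with a softmax, and return the weighted sum c_i."""
    def __init__(self, hidden_dim=256, ctx_dim=512):
        super().__init__()
        self.score = nn.Sequential(                    # feed-forward layer for a(s_{i-1}, h_j)
            nn.Linear(hidden_dim + ctx_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, prev_state, encoder_states):
        # prev_state s_{i-1}: (batch, hidden); encoder_states h_j: (batch, src_len, ctx_dim)
        src_len = encoder_states.size(1)
        s = prev_state.unsqueeze(1).expand(-1, src_len, -1)      # pair s_{i-1} with every h_j
        a = self.score(torch.cat([s, encoder_states], dim=-1))   # alignment scores (batch, src_len, 1)
        alpha = torch.softmax(a, dim=1)                          # normalize over input positions j
        c = (alpha * encoder_states).sum(dim=1)                  # weighted sum: input context c_i
        return c, alpha.squeeze(-1)
```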

  24. Attention
     [Diagram: the input context c_i feeds into the next decoder state and the output word prediction.]
     • Use context to predict next hidden state and output word
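Putting the hedged sketches together, one full decoding step looks roughly like this (all class names and word ids are from the illustrative code above, not from the slides):

```python
import torch

encoder, attention, decoder = BiRNNEncoder(), Attention(), DecoderStep()
src = torch.tensor([[1, 4, 5, 6, 7, 8, 2]])   # "<s> the house is big . </s>" (placeholder ids)
h = encoder(src)                              # encoder states h_j
s = torch.zeros(1, 256)                       # initial decoder state
y = torch.tensor([1])                         # previously produced word, here <s>
c, alpha = attention(s, h)                    # attention weights and input context c_i
s, prob = decoder(s, y, c)                    # next decoder state s_i and word distribution
y = prob.argmax(dim=-1)                       # pick the output word y_i
```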

  25. Training
