Neural Machine Translation

Philipp Koehn

6 October 2020
Language Models

• Modeling variants
  – feed-forward neural network
  – recurrent neural network
  – long short-term memory neural network
• May include input context
Feed-Forward Neural Language Model

[Figure: feed-forward language model — the embeddings of the history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} feed a feed-forward hidden layer h, followed by a softmax that predicts the output word w_i]
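The model in the figure can be written down in a few lines. Below is a minimal NumPy sketch, not the lecture's implementation: the four-word history and the embed → hidden → softmax structure follow the figure, while the layer sizes, weight names, and word indices are illustrative assumptions.

```python
# Minimal sketch of a feed-forward neural language model.
# All sizes, weight names, and word indices are illustrative assumptions.
import numpy as np

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
rng = np.random.default_rng(0)

E  = rng.normal(size=(vocab_size, embed_dim))      # embedding matrix
W1 = rng.normal(size=(4 * embed_dim, hidden_dim))  # concatenated history -> hidden
W2 = rng.normal(size=(hidden_dim, vocab_size))     # hidden -> output scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_next(history_ids):
    """history_ids: indices of w_{i-4} ... w_{i-1}."""
    h = np.tanh(np.concatenate([E[w] for w in history_ids]) @ W1)
    return softmax(h @ W2)              # distribution over the next word w_i

p = predict_next([4, 17, 3, 256])       # hypothetical word indices
```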
Recurrent Neural Language Model

[Figure: recurrent language model — the embedding of the input word <s> feeds an RNN state h_j, followed by a softmax that predicts the output word "the"]

Predict the first word of a sentence
Recurrent Neural Language Model

[Figure: the same recurrent language model with input words <s>, the and output words the, house]

Predict the second word of a sentence
Re-use hidden state from first word prediction
Recurrent Neural Language Model

[Figure: the same recurrent language model with input words <s>, the, house and output words the, house, is]

Predict the third word of a sentence
... and so on
Recurrent Neural Language Model

[Figure: the recurrent language model unrolled over the full sentence — input words <s>, the, house, is, big, .; output words the, house, is, big, ., </s>; at each position the input word is embedded, the RNN state is updated, and a softmax predicts the next word]
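As a companion to the figures, here is a minimal NumPy sketch of such a recurrent language model. A plain tanh recurrence stands in for the RNN cell, and the sizes and word indices are illustrative assumptions; the key point is that the same cell is re-used at every position and the hidden state is carried forward.

```python
# Minimal sketch of a recurrent neural language model.
# A tanh layer stands in for the RNN cell; sizes and indices are illustrative.
import numpy as np

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
rng = np.random.default_rng(1)

E  = rng.normal(size=(vocab_size, embed_dim))
Wx = rng.normal(size=(embed_dim, hidden_dim))    # input embedding -> state
Wh = rng.normal(size=(hidden_dim, hidden_dim))   # previous state -> state
Wo = rng.normal(size=(hidden_dim, vocab_size))   # state -> output scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(word_ids):
    """Yield a distribution over the next word after each prefix."""
    h = np.zeros(hidden_dim)                     # state before <s>
    for w in word_ids:                           # <s>, the, house, is, big, .
        h = np.tanh(E[w] @ Wx + h @ Wh)          # re-use state from last step
        yield softmax(h @ Wo)

predictions = list(rnn_lm([0, 4, 17, 9, 42, 7]))   # hypothetical word indices
```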
Recurrent Neural Translation Model

• We predicted the words of a sentence
• Why not also predict their translations?
Encoder-Decoder Model

[Figure: one recurrent language model run over the English sentence "the house is big . </s>" followed immediately by its German translation "das Haus ist groß . </s>" — the same embed / RNN / softmax stack at every position]

• Obviously madness
• Proposed by Google (Sutskever et al. 2014)
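The idea can be stated very compactly: concatenate the source and target sentences into one sequence and let a single recurrent language model (for instance the rnn_lm sketch above) predict it word by word, so translation amounts to continuing the sequence after the source. The snippet below is only a toy illustration of that setup; the token handling is an assumption and the vocabulary mapping is omitted.

```python
# Toy illustration of the Sutskever et al. (2014) setup: one long sequence,
# one recurrent language model over all of it (e.g. rnn_lm from the sketch
# above, after mapping words to indices). Vocabulary handling is omitted.
source = "<s> the house is big . </s>".split()
target = "das Haus ist groß . </s>".split()
sequence = source + target   # the model is trained to predict this word by word
```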
What is Missing?

• Alignment of input words to output words

⇒ Solution: attention mechanism
neural translation model with attention
Input Encoding

[Figure: the recurrent language model from before, run over the input sentence <s> the house is big . — embeddings, RNN states, and softmax predictions at every position]

• Inspiration: recurrent neural network language model on the input side
Hidden Language Model States

• This gives us the hidden states

[Figure: a left-to-right chain of RNN states, one per input word]

• These encode left context for each word
• Same process in reverse: right context for each word

[Figure: a right-to-left chain of RNN states, one per input word]
Input Encoder

[Figure: the input encoder — each input word x_j is embedded, a left-to-right RNN and a right-to-left RNN run over the embeddings, and their states are stacked per word]

• Input encoder: concatenate bidirectional RNN states
• Each word representation includes full left and right sentence context
Encoder: Math

[Figure: the same bidirectional encoder — embeddings E x_j, a left-to-right RNN and a right-to-left RNN over the input <s> the house is big . </s>]

• Input is a sequence of words $x_j$, mapped into embedding space $\bar{E} x_j$
• Bidirectional recurrent neural networks

  $\overleftarrow{h}_j = f(\overleftarrow{h}_{j+1}, \bar{E} x_j)$
  $\overrightarrow{h}_j = f(\overrightarrow{h}_{j-1}, \bar{E} x_j)$

• Various choices for the function $f()$: feed-forward layer, GRU, LSTM, ...
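A minimal sketch of these two recurrences, assuming a plain tanh layer for $f()$ (in practice a GRU or LSTM) and illustrative matrix names and sizes:

```python
# Minimal sketch of the bidirectional encoder: a forward and a backward pass
# with a shared tanh cell standing in for f(); sizes and names are illustrative.
import numpy as np

embed_dim, hidden_dim = 32, 64
rng = np.random.default_rng(2)

W_in  = rng.normal(size=(embed_dim, hidden_dim))
W_rec = rng.normal(size=(hidden_dim, hidden_dim))

def f(prev_state, embedded_word):
    return np.tanh(embedded_word @ W_in + prev_state @ W_rec)

def encode(embedded_words):
    """embedded_words: list of E x_j vectors; returns one h_j per word."""
    fwd, bwd = [], []
    h = np.zeros(hidden_dim)
    for e in embedded_words:                 # left-to-right pass
        h = f(h, e)
        fwd.append(h)
    h = np.zeros(hidden_dim)
    for e in reversed(embedded_words):       # right-to-left pass
        h = f(h, e)
        bwd.append(h)
    bwd.reverse()
    # concatenate both directions: full left and right context for each word
    return [np.concatenate([fw, bw]) for fw, bw in zip(fwd, bwd)]

hs = encode([rng.normal(size=embed_dim) for _ in range(7)])   # 7 input words
```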
Decoder

• We want to have a recurrent neural network predicting output words

[Figure: a chain of decoder states s_i (RNN), each followed by a softmax predicting an output word t_i]
Decoder

• We want to have a recurrent neural network predicting output words

[Figure: the same decoder, now with the embedding E y_i of each predicted output word fed back into the next RNN state]

• We feed decisions on output words back into the decoder state
Decoder

• We want to have a recurrent neural network predicting output words

[Figure: the same decoder, now also receiving an input context c_i at each state]

• We feed decisions on output words back into the decoder state
• Decoder state is also informed by the input context
More Detail

[Figure: one decoder step — the embedding of the previous output word E y_{i-1}, the previous decoder state s_{i-1}, and the input context c_i feed the new state s_i and the softmax prediction t_i]

• Decoder is also a recurrent neural network over a sequence of hidden states $s_i$

  $s_i = f(s_{i-1}, E y_{i-1}, c_i)$

• Again, various choices for the function $f()$: feed-forward layer, GRU, LSTM, ...
• Output word $y_i$ is selected by computing a vector $t_i$ (same size as the vocabulary)

  $t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)$

  then finding the highest value in vector $t_i$
• If we normalize $t_i$, we can view it as a probability distribution over words
• $E y_i$ is the embedding of the output word $y_i$
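Putting the two equations into code, a single decoder step might look as follows. A tanh layer again stands in for $f()$; the matrix names beyond $U$, $V$, $C$, $W$ as well as all sizes and word indices are illustrative assumptions.

```python
# Minimal sketch of one decoder step:
#   s_i = f(s_{i-1}, E y_{i-1}, c_i)
#   t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)
# tanh stands in for f(); Ws, Wy, Wc and all sizes are illustrative assumptions.
import numpy as np

vocab_size, embed_dim, state_dim, ctx_dim = 1000, 32, 64, 128
rng = np.random.default_rng(3)

E = rng.normal(size=(vocab_size, embed_dim))    # output word embeddings
U = rng.normal(size=(state_dim, state_dim))
V = rng.normal(size=(embed_dim, state_dim))
C = rng.normal(size=(ctx_dim, state_dim))
W = rng.normal(size=(state_dim, vocab_size))
Ws = rng.normal(size=(state_dim, state_dim))    # weights inside f()
Wy = rng.normal(size=(embed_dim, state_dim))
Wc = rng.normal(size=(ctx_dim, state_dim))

def decoder_step(s_prev, y_prev, c_i):
    """s_prev: previous state; y_prev: previous output word id; c_i: input context."""
    s_i = np.tanh(s_prev @ Ws + E[y_prev] @ Wy + c_i @ Wc)   # f(...)
    t_i = (s_prev @ U + E[y_prev] @ V + c_i @ C) @ W         # raw word scores
    y_i = int(np.argmax(t_i))        # pick the highest-scoring word (greedy)
    return s_i, y_i

s, y = np.zeros(state_dim), 0        # start state and <s> token id (hypothetical)
s, y = decoder_step(s, y, rng.normal(size=ctx_dim))
```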
Attention

[Figure: decoder states s_i on top, attention weights α_ij in the middle, and the bidirectional encoder states h_j below]

• Given what we have generated so far (decoder hidden state)
• ... which words in the input should we pay attention to (encoder states)?
Attention

[Figure: the same setup — decoder states s_i, attention α_ij, bidirectional encoder states h_j]

• Given:
  – the previous hidden state of the decoder $s_{i-1}$
  – the representation of input words $h_j = (\overleftarrow{h}_j, \overrightarrow{h}_j)$
• Predict an alignment probability $a(s_{i-1}, h_j)$ for each input word $j$
  (modeled with a feed-forward neural network layer)
Attention

[Figure: decoder states s_i, attention α_ij, bidirectional encoder states h_j]

• Normalize attention (softmax)

  $\alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_k \exp(a(s_{i-1}, h_k))}$
Attention

[Figure: the attention weights α_ij combined in a weighted sum, giving the input context c_i that feeds the decoder state s_i]

• Relevant input context: weigh input words according to attention

  $c_i = \sum_j \alpha_{ij} h_j$
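The three attention steps — scoring, normalizing, summing — fit into one short function. The sketch below assumes the common additive (Bahdanau-style) form for the feed-forward scoring layer $a(s_{i-1}, h_j)$; the weight names and sizes are illustrative.

```python
# Minimal sketch of the attention mechanism:
#   score a(s_{i-1}, h_j), softmax into alpha_ij, weighted sum into c_i.
# Additive scoring layer assumed; names and sizes are illustrative.
import numpy as np

state_dim, enc_dim, att_dim = 64, 128, 32
rng = np.random.default_rng(4)

Wa = rng.normal(size=(state_dim, att_dim))   # decoder state -> attention space
Ua = rng.normal(size=(enc_dim, att_dim))     # encoder state -> attention space
va = rng.normal(size=(att_dim,))             # attention space -> scalar score

def attention(s_prev, H):
    """s_prev: previous decoder state; H: matrix with one encoder state h_j per row."""
    scores = np.tanh(s_prev @ Wa + H @ Ua) @ va      # a(s_{i-1}, h_j) for all j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # softmax over input positions
    c_i = alpha @ H                                  # weighted sum of the h_j
    return c_i, alpha

c, weights = attention(rng.normal(size=state_dim),
                       rng.normal(size=(7, enc_dim)))   # 7 encoder states
```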
Attention

[Figure: the input context c_i feeds the next decoder state, which predicts the next output word]

• Use context to predict next hidden state and output word
training