

  1. CSEP 517 Natural Language Processing Neural Machine Translation Luke Zettlemoyer (Slides adapted from Karthik Narasimhan, Greg Durrett, Chris Manning, Dan Jurafsky)

  2. Last time • Statistical MT • Word-based • Phrase-based • Syntactic

  3. NMT: the biggest success story of NLP Deep Learning • Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016 • 2014: First seq2seq paper published • 2016: Google Translate switches from SMT to NMT • This is amazing! • SMT systems, built by hundreds of engineers over many years, outperformed by NMT systems trained by a handful of engineers in a few months

  4. Neural Machine Translation ‣ A single neural network is used to translate from source to target ‣ Architecture: Encoder-Decoder ‣ Two main components: ‣ Encoder: Convert source sentence (input) into a vector/matrix ‣ Decoder: Convert encoding into a sentence in target language (output)
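A minimal sketch of this encoder-decoder split in PyTorch. The GRU cell, hidden size, and all class and variable names here are illustrative assumptions, not the exact architecture from the slides (which describe RNNs generically):

```python
# Minimal encoder-decoder skeleton (illustrative; GRU, sizes, and names are
# assumptions, not the slides' exact setup).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, src_vocab, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, src_ids):                  # src_ids: (batch, src_len)
        embedded = self.embed(src_ids)
        outputs, h_n = self.rnn(embedded)        # outputs: all encoder states
        return outputs, h_n                      # h_n initializes the decoder

class Decoder(nn.Module):
    def __init__(self, tgt_vocab, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)  # scores over target vocabulary

    def forward(self, tgt_ids, h_0):             # tgt_ids: (batch, tgt_len)
        embedded = self.embed(tgt_ids)
        outputs, h_n = self.rnn(embedded, h_0)
        return self.out(outputs), h_n            # logits: (batch, tgt_len, vocab)
```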

  5. Recall: RNNs: h_t = g(W h_{t−1} + U x_t + b) ∈ ℝ^d
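A one-step NumPy sketch of this recurrence; g = tanh and the dimensions are illustrative assumptions:

```python
# One step of the vanilla RNN recurrence h_t = g(W h_{t-1} + U x_t + b),
# with g = tanh and illustrative sizes (d = 4 hidden units, 3 input dims).
import numpy as np

d, x_dim = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))      # hidden-to-hidden weights
U = rng.normal(size=(d, x_dim))  # input-to-hidden weights
b = np.zeros(d)

def rnn_step(h_prev, x_t):
    return np.tanh(W @ h_prev + U @ x_t + b)   # h_t in R^d

h = np.zeros(d)
for x_t in rng.normal(size=(5, x_dim)):        # run over a length-5 input
    h = rnn_step(h, x_t)
```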

  6. Sequence to Sequence learning (Seq2seq) • Encode entire input sequence into a single vector (using an RNN) • Decode one word at a time (again, using an RNN!) • Beam search for better inference • Learning is not trivial! (vanishing/exploding gradients) (Sutskever et al., 2014)

  7. Neural Machine Translation (NMT) ‣ Encoder RNN produces an encoding of the source sentence "les pauvres sont démunis" (input); this encoding provides the initial hidden state for the Decoder RNN ‣ Decoder RNN is a Language Model that generates the target sentence "the poor don't have any money <END>" (output) conditioned on the encoding, taking the argmax over the vocabulary at each step ‣ Note: This diagram shows test-time behavior: the decoder output is fed in as the next step's input

  8. Seq2seq training ‣ Similar to training a language model! ‣ Minimize cross-entropy loss: Σ_{t=1}^{T} −log P(y_t | y_1, ..., y_{t−1}, x_1, ..., x_n) ‣ Back-propagate gradients through both decoder and encoder ‣ Need a really big corpus (36M sentence pairs) ‣ Russian: Машинный перевод - это круто! English: Machine translation is cool!
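A hedged sketch of one training step with teacher forcing, reusing the hypothetical Encoder/Decoder modules sketched after slide 4. F.cross_entropy computes the per-step −log P(y_t | ...) terms (averaged over target words); the optimizer and padding id are assumptions:

```python
# One seq2seq training step with teacher forcing (illustrative shapes and ids).
import torch
import torch.nn.functional as F

def train_step(encoder, decoder, optimizer, src_ids, tgt_ids, pad_id=0):
    # tgt_ids contains <START> ... <END>; feed all but the last token,
    # and predict all but the first (the standard shift by one).
    optimizer.zero_grad()
    _, h_n = encoder(src_ids)
    logits, _ = decoder(tgt_ids[:, :-1], h_n)          # (batch, T, vocab)
    loss = F.cross_entropy(                            # mean of -log P(y_t | ...)
        logits.reshape(-1, logits.size(-1)),
        tgt_ids[:, 1:].reshape(-1),
        ignore_index=pad_id,                           # don't score padding
    )
    loss.backward()                                    # end-to-end backprop
    optimizer.step()
    return loss.item()
```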

  9. Training a Neural Machine Translation system ‣ Loss J = (1/T) Σ_{t=1}^{T} J_t = J_1 + J_2 + ... + J_7, where J_t is the negative log probability of the correct next word (e.g. J_1 = negative log prob of "the", J_4 = negative log prob of "have", J_7 = negative log prob of <END>) ‣ Encoder RNN reads the source sentence "les pauvres sont démunis" (from corpus); Decoder RNN predicts ŷ_1, ..., ŷ_7 given <START> and the target sentence "the poor don't have any money" (from corpus) ‣ Seq2seq is optimized as a single system. Backpropagation operates "end to end".

  10. Greedy decoding ‣ Compute the argmax at every step of the decoder to generate the next word ‣ What's wrong?
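A minimal greedy-decoding loop over the same hypothetical Encoder/Decoder modules, to make "argmax at every step" concrete (token ids and limits are assumptions):

```python
# Greedy decoding sketch: take the argmax at every step and feed it back in.
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, src_ids, start_id, end_id, max_len=50):
    _, h = encoder(src_ids)                       # src_ids: (1, src_len)
    y = torch.tensor([[start_id]])
    output = []
    for _ in range(max_len):
        logits, h = decoder(y, h)                 # one step at a time
        y = logits[:, -1].argmax(dim=-1, keepdim=True)   # the argmax word
        if y.item() == end_id:
            break
        output.append(y.item())
    return output
# Problem: a bad argmax early on can never be undone -- motivating beam search.
```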

  11. Exhaustive search? ‣ Find arg max_{y_1, ..., y_T} P(y_1, ..., y_T | x_1, ..., x_n) ‣ Requires computing all possible sequences ‣ O(V^T) complexity! ‣ Too expensive
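A back-of-the-envelope illustration with assumed numbers (a 50,000-word vocabulary and a 20-word translation; these figures are not from the slides):

```python
# Rough scale of exhaustive search with illustrative numbers.
V, T = 50_000, 20
print(f"{V**T:.2e} candidate sequences")   # ~9.5e+93 -- hopelessly expensive
```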

  12. A middle ground: Beam search ‣ Key idea: At every step, keep track of the k most probable partial translations (hypotheses) ‣ Score of each hypothesis = log probability: Σ_{t=1}^{j} log P(y_t | y_1, ..., y_{t−1}, x_1, ..., x_n) ‣ Not guaranteed to be optimal ‣ More efficient than exhaustive search

  13.–15. Beam decoding (animation; slide credit: Abigail See)

  16. Beam decoding ‣ Different hypotheses may produce the end token ⟨e⟩ at different time steps ‣ When a hypothesis produces ⟨e⟩, stop expanding it and place it aside ‣ Continue beam search until: all k hypotheses produce ⟨e⟩, OR we hit the max decoding limit T ‣ Select the top hypotheses using the normalized likelihood score: (1/T) Σ_{t=1}^{T} log P(y_t | y_1, ..., y_{t−1}, x_1, ..., x_n) ‣ Otherwise shorter hypotheses have higher scores
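A compact beam-search sketch over the hypothetical decoder above, including the set-aside-at-⟨e⟩ bookkeeping and the length-normalized final score (all names, ids, and the beam size are illustrative assumptions):

```python
# Beam search sketch: keep the k best partial hypotheses, set finished ones
# aside at the end token, pick the winner by length-normalized log probability.
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_decode(encoder, decoder, src_ids, start_id, end_id, k=5, max_len=50):
    _, h0 = encoder(src_ids)
    beams = [([start_id], 0.0, h0)]          # (tokens, sum of log probs, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, h in beams:
            y = torch.tensor([[tokens[-1]]])
            logits, h_new = decoder(y, h)
            log_probs = F.log_softmax(logits[0, -1], dim=-1)
            top_lp, top_id = log_probs.topk(k)
            for lp, idx in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((tokens + [idx], score + lp, h_new))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score, h in candidates[:k]:
            if tokens[-1] == end_id:
                finished.append((tokens, score))   # set aside, stop expanding
            else:
                beams.append((tokens, score, h))
        if not beams:                              # all k hypotheses produced <END>
            break
    finished += [(t, s) for t, s, _ in beams]      # hit the max decoding limit
    # Normalize by length so shorter hypotheses don't win by default.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```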

  17. NMT vs SMT ‣ Pros: better performance, fluency, longer context, single NN optimized end-to-end, less engineering, works out of the box for many language pairs ‣ Cons: requires more data and compute, less interpretable, hard to debug, uncontrollable, heavily dependent on data (could lead to unwanted biases), more parameters

  18. How seq2seq changed the MT landscape

  19. MT Progress (source: Rico Sennrich)

  20. Versatile seq2seq ‣ Seq2seq finds applications in many other tasks! ‣ Any task where inputs and outputs are sequences of words/characters ‣ Summarization (input text → summary) ‣ Dialogue (previous utterance → reply) ‣ Parsing (sentence → parse tree in sequence form) ‣ Question answering (context + question → answer)

  21. Issues with vanilla seq2seq ‣ Bottleneck: a single encoding vector, h^enc, needs to capture all the information about the source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting

  22. Remember alignments?

  23. Attention ‣ The neural MT equivalent of alignment models ‣ Key idea: At each time step during decoding, focus on a particular part of the source sentence ‣ This depends on the decoder's current hidden state (i.e. notion of what you are trying to decode) ‣ Usually implemented as a probability distribution over the hidden states of the encoder (h_i^enc)

  24.–27. Sequence-to-sequence with attention (animation): dot-product attention scores are computed between the current decoder hidden state and each encoder hidden state. Encoder RNN reads the source sentence "les pauvres sont démunis" (input); Decoder RNN starts from <START>.

  28. Sequence-to-sequence with attention ‣ Take softmax to turn the attention scores into a probability distribution (the attention distribution) ‣ On this decoder timestep, we're mostly focusing on the first encoder hidden state ("les")

  29. Sequence-to-sequence with attention ‣ Use the attention distribution to take a weighted sum of the encoder hidden states (the attention output) ‣ The attention output mostly contains information from the hidden states that received high attention

  30. Sequence-to-sequence with attention ‣ Concatenate the attention output with the decoder hidden state, then use it to compute ŷ_1 as before ‣ The first output word is "the"

  31.–35. Sequence-to-sequence with attention (animation continues): the same attention step is repeated at each decoder timestep, producing ŷ_2, ..., ŷ_6 and the output words "poor", "don't", "have", "any", "money"; each generated word is fed back as the next decoder input.

  36. Computing attention ‣ Encoder hidden states: h_1^enc, ..., h_n^enc ‣ Decoder hidden state at time t: h_t^dec ‣ First, get attention scores for this time step (we will see what g is soon!): e_t = [g(h_1^enc, h_t^dec), ..., g(h_n^enc, h_t^dec)] ‣ Obtain the attention distribution using softmax: α_t = softmax(e_t) ∈ ℝ^n ‣ Compute weighted sum of encoder hidden states: a_t = Σ_{i=1}^{n} α_{t,i} h_i^enc ∈ ℝ^h ‣ Finally, concatenate with decoder state and pass on to output layer: [a_t ; h_t^dec] ∈ ℝ^{2h}
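A small sketch of this computation for a single decoder step, assuming g is the plain dot product used in the earlier attention diagrams (shapes and sizes are illustrative):

```python
# Dot-product attention for one decoder step, following the slide's recipe.
import torch
import torch.nn.functional as F

def attention_step(enc_hiddens, dec_hidden):
    # enc_hiddens: (n, h) encoder states h_1^enc .. h_n^enc
    # dec_hidden:  (h,)   decoder state h_t^dec
    scores = enc_hiddens @ dec_hidden            # e_t = [g(h_i^enc, h_t^dec)]_i
    alpha = F.softmax(scores, dim=-1)            # attention distribution, in R^n
    a_t = alpha @ enc_hiddens                    # weighted sum, in R^h
    return torch.cat([a_t, dec_hidden], dim=-1)  # [a_t ; h_t^dec], in R^{2h}

# Example with illustrative sizes: n = 4 source words, hidden size h = 8
enc = torch.randn(4, 8)
dec = torch.randn(8)
out = attention_step(enc, dec)                   # shape: (16,)
```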
