

  1. SFU NatLangLab CMPT 825: Natural Language Processing Neural Machine Translation Spring 2020 2020-03-12 Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)

  2. Course Logistics ‣ Project proposal is due today ‣ What problem are you addressing? Why is it interesting? ‣ What specific aspects will your project be on? ‣ Re-implement a paper? Compare different methods? ‣ What data do you plan to use? ‣ What is your method? ‣ How do you plan to evaluate? What metrics?

  3. Last time • Statistical MT • Word-based • Phrase-based • Syntactic

  4. Neural Machine Translation ‣ A single neural network is used to translate from source to target ‣ Architecture: Encoder-Decoder ‣ Two main components: ‣ Encoder: Convert the source sentence (input) into a vector/matrix ‣ Decoder: Convert the encoding into a sentence in the target language (output)

  5. Sequence to Sequence learning (Seq2seq) • Encode entire input sequence into a single vector (using an RNN) • Decode one word at a time (again, using an RNN!) • Beam search for better inference • Learning is not trivial! (vanishing/exploding gradients) (Sutskever et al., 2014)

  6. Encoder Sentence: This cat is cute [Figure: an RNN encoder unrolled over the input; each word embedding x_t feeds a hidden state h_t, which also depends on the previous state h_{t−1}]

  7. Encoder Sentence: This cat is cute [Figure: the same encoder, with the first states now labelled h_0, h_1 and the first embedding x_1]

  8. Encoder Sentence: This cat is cute [Figure: the same encoder, progressively labelled with hidden states h_0, ..., h_4 over word embeddings x_1, ..., x_4 for "This", "cat", "is", "cute"]

  9. Encoder (encoded representation) Sentence: This cat is cute [Figure: the final hidden state h_4 is taken as the encoded representation h_enc]
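
The encoder in the figure can be sketched in a few lines of numpy. This is only a minimal illustration, not the course's reference implementation: it assumes a vanilla (Elman) RNN cell, and the toy vocabulary, dimensions, and weight names (E, W_xh, W_hh) are invented for the example.

```python
import numpy as np

# Toy vocabulary and randomly initialized parameters (illustrative only).
vocab = {"This": 0, "cat": 1, "is": 2, "cute": 3}
d_emb, d_hid = 8, 16
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(len(vocab), d_emb))   # word embedding matrix
W_xh = rng.normal(scale=0.1, size=(d_hid, d_emb))     # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))     # hidden-to-hidden weights
b_h = np.zeros(d_hid)

def encode(tokens):
    """Run a vanilla RNN over the source tokens and return all hidden states."""
    h = np.zeros(d_hid)                                # h_0
    states = []
    for tok in tokens:
        x = E[vocab[tok]]                              # word embedding x_t
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)         # h_t depends on x_t and h_{t-1}
        states.append(h)
    return np.stack(states)                            # shape (n, d_hid)

enc_states = encode(["This", "cat", "is", "cute"])
h_enc = enc_states[-1]                                 # final state = encoded representation
```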

  10. Decoder [Figure: an RNN decoder conditioned on h_enc; input word embeddings x'_1, ..., x'_5 for "<s> ce chat est mignon" feed hidden states z_1, ..., z_5, each producing an output o, yielding "ce chat est mignon <e>"]

  11. Decoder [Figure: the same decoder, with the first generated output y_1 fed back as the next input]

  12. Decoder [Figure: the same decoder, with generated outputs y_1 and y_2 fed back as inputs]

  13. Decoder • A conditioned language model [Figure: the decoder now conditions every step on its own previous outputs y_1, ..., y_5]
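
The decoder can be sketched the same way. This continues the encoder sketch above (it reuses rng, d_emb, d_hid, and h_enc) and runs with teacher forcing, i.e. the gold previous word is fed at each step; the target-side names are again invented for the example.

```python
# Continuing the encoder sketch: a decoder RNN whose first state is conditioned
# on h_enc, run with teacher forcing (gold previous words as inputs).
tgt_vocab = {"<s>": 0, "ce": 1, "chat": 2, "est": 3, "mignon": 4, "<e>": 5}
E_t = rng.normal(scale=0.1, size=(len(tgt_vocab), d_emb))
U_xh = rng.normal(scale=0.1, size=(d_hid, d_emb))
U_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_out = rng.normal(scale=0.1, size=(len(tgt_vocab), d_hid))

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def decode_teacher_forced(h_enc, gold):
    """Return P(y_t | y_<t, x) at every target position."""
    z = h_enc                                          # condition on the source encoding
    probs = []
    for tok in ["<s>"] + gold[:-1]:                    # inputs are the shifted gold words
        x = E_t[tgt_vocab[tok]]
        z = np.tanh(U_xh @ x + U_hh @ z)               # z_t
        probs.append(softmax(W_out @ z))               # distribution over the target vocab
    return np.stack(probs)

P = decode_teacher_forced(h_enc, ["ce", "chat", "est", "mignon", "<e>"])
```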

  14. Seq2seq training ‣ Similar to training a language model! ‣ Minimize the cross-entropy loss: ∑_{t=1}^{T} −log P(y_t | y_1, ..., y_{t−1}, x_1, ..., x_n) ‣ Back-propagate gradients through both decoder and encoder ‣ Need a really big corpus (e.g., 36M sentence pairs) Russian: Машинный перевод - это круто! English: Machine translation is cool!
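
Given the teacher-forced probabilities P from the decoder sketch above, the loss on one sentence pair is just the sum of per-step negative log-likelihoods. This is a toy continuation of the earlier sketch, not a full training loop (no batching, no optimizer).

```python
# Cross-entropy loss for one sentence pair, using P from the decoder sketch.
gold = ["ce", "chat", "est", "mignon", "<e>"]
gold_ids = [tgt_vocab[w] for w in gold]
loss = -sum(np.log(P[t, gold_ids[t]]) for t in range(len(gold)))
# In a real system, gradients of this loss flow through both the decoder and
# the encoder parameters, and the loss is averaged over a large parallel corpus.
```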

  15. Seq2seq training (slide credit: Abigail See)

  16. Remember masking Use masking to help compute the loss for batched sequences, e.g. the mask matrix
  1 1 1 1 0 0
  1 0 0 0 0 0
  1 1 1 1 1 1
  1 1 1 0 0 0
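
One way to apply the mask above when computing a batched loss (a sketch; the per-token log-probabilities here are fake placeholders that would come from the decoder in practice):

```python
import numpy as np

# Fake per-token log P(gold word), shape (batch=4, time=6).
log_probs = np.log(np.full((4, 6), 0.25))
mask = np.array([[1, 1, 1, 1, 0, 0],
                 [1, 0, 0, 0, 0, 0],
                 [1, 1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0, 0]])
# Padding positions (mask == 0) contribute nothing; average over real tokens only.
loss = -(log_probs * mask).sum() / mask.sum()
```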

  17. Scheduled Sampling Possible decay schedules (the probability of feeding the true y decays over training) (figure credit: Bengio et al, 2015)
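
The decay schedules in the figure can be written out directly. These are sketches of the linear, exponential, and inverse-sigmoid schedules described by Bengio et al. (2015), with made-up constants: at training step i, the gold previous word is fed with probability eps_i and the model's own prediction otherwise.

```python
import numpy as np

def linear_decay(i, eps_min=0.1, c=1e-4):
    # eps_i decreases linearly with the training step, floored at eps_min
    return max(eps_min, 1.0 - c * i)

def exponential_decay(i, k=0.9999):
    # eps_i = k^i for some k < 1
    return k ** i

def inverse_sigmoid_decay(i, k=1000.0):
    # eps_i = k / (k + exp(i / k)) for some k >= 1
    return k / (k + np.exp(i / k))
```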

  18. How seq2seq changed the MT landscape

  19. MT Progress (source: Rico Sennrich)

  20. (Wu et al., 2016)

  21. NMT vs SMT ‣ Pros: better performance, fluency, longer context, a single NN optimized end-to-end, less engineering, works out of the box for many language pairs ‣ Cons: requires more data and compute, less interpretable, hard to debug, uncontrollable, heavily dependent on data (could lead to unwanted biases), more parameters

  22. Seq2Seq for more than NMT (Task/Application: Input → Output) ‣ Machine Translation: French → English ‣ Summarization: Document → Short Summary ‣ Dialogue: Utterance → Response ‣ Parsing: Sentence (as sequence) → Parse tree ‣ Question Answering: Context + Question → Answer

  23. Cross-Modal Seq2Seq (Task/Application: Input → Output) ‣ Speech Recognition: Speech Signal → Transcript ‣ Image Captioning: Image → Text ‣ Video Captioning: Video → Text

  24. Issues with vanilla seq2seq Bottleneck ‣ A single encoding vector, h enc , needs to capture all the information about source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting

  25. Issues with vanilla seq2seq Bottleneck ‣ A single encoding vector, h enc , needs to capture all the information about source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting

  26. Remember alignments?

  27. Attention ‣ The neural MT equivalent of alignment models ‣ Key idea: At each time step during decoding, focus on a particular part of the source sentence ‣ This depends on the decoder’s current hidden state (i.e. notion of what you are trying to decode) ‣ Usually implemented as a probability distribution over the hidden states of the encoder (h^enc_i)

  28. Seq2seq with attention (slide credit: Abigail See)

  29. Seq2seq with attention (slide credit: Abigail See)

  30. Seq2seq with attention (slide credit: Abigail See)

  31. Seq2seq with attention Can also use ŷ_1 as input for the next time step (slide credit: Abigail See)

  32. Seq2seq with attention (slide credit: Abigail See)

  33. Computing attention ‣ Encoder hidden states: h^enc_1, ..., h^enc_n ‣ Decoder hidden state at time t: h^dec_t ‣ First, get attention scores for this time step (we will see what g is soon!): e^t = [g(h^enc_1, h^dec_t), ..., g(h^enc_n, h^dec_t)] ‣ Obtain the attention distribution using softmax: α^t = softmax(e^t) ∈ ℝ^n ‣ Compute the weighted sum of encoder hidden states: a_t = ∑_{i=1}^{n} α^t_i h^enc_i ∈ ℝ^h ‣ Finally, concatenate with the decoder state and pass on to the output layer: [a_t; h^dec_t] ∈ ℝ^{2h}
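
A minimal sketch of this computation with dot-product scores (the general scoring function g is discussed on the next slide). Here enc_states holds the encoder hidden states as rows and z is the current decoder state; all names are illustrative.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def attend(enc_states, z):
    scores = enc_states @ z                  # e^t: one score g(h_i, z) per source position
    alpha = softmax(scores)                  # attention distribution over the source
    a = alpha @ enc_states                   # weighted sum of encoder states
    return np.concatenate([a, z]), alpha     # [a_t; z] is passed on to the output layer

rng = np.random.default_rng(0)
context, alpha = attend(rng.normal(size=(5, 16)), rng.normal(size=16))
```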

  34. Types of attention ‣ Assume encoder hidden states h_1, h_2, ..., h_n and decoder hidden state z 1. Dot-product attention (assumes h_i and z have equal dimensions): e_i = g(h_i, z) = z^T h_i ∈ ℝ 2. Multiplicative attention: g(h_i, z) = z^T W h_i ∈ ℝ, where W is a weight matrix 3. Additive attention: g(h_i, z) = v^T tanh(W_1 h_i + W_2 z) ∈ ℝ, where W_1, W_2 are weight matrices and v is a weight vector
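
The three scoring functions as standalone helpers, to make the shapes explicit (a sketch; d_h is the encoder dimension, d_z the decoder dimension, d_a the attention dimension, and the weight names are illustrative):

```python
import numpy as np

def dot_score(h_i, z):
    return h_i @ z                           # requires d_h == d_z

def multiplicative_score(h_i, z, W):
    return z @ W @ h_i                       # W has shape (d_z, d_h)

def additive_score(h_i, z, W1, W2, v):
    return v @ np.tanh(W1 @ h_i + W2 @ z)    # W1: (d_a, d_h), W2: (d_a, d_z), v: (d_a,)
```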

  35. Issues with vanilla seq2seq Bottleneck ‣ A single encoding vector, h enc , needs to capture all the information about source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting

  36. Dropout ‣ Form of regularization for RNNs (and any NN in general) ‣ Idea: “Handicap” NN by removing hidden units stochastically ‣ set each hidden unit in a layer to 0 with probability p during training (p = 0.5 usually works well) ‣ scale outputs by 1/(1 − p) ‣ hidden units forced to learn more general patterns ‣ Test time: Use all activations (no need to rescale)
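
A sketch of inverted dropout as described on the slide; the function name and defaults are illustrative.

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng()):
    if not train:
        return h                             # test time: use all activations as-is
    keep = rng.random(h.shape) >= p          # drop each unit with probability p
    return h * keep / (1.0 - p)              # scale the survivors by 1/(1 - p)
```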

  37. Handling large vocabularies ‣ Softmax can be expensive for large vocabularies: P(y_i) = exp(w_i · h + b_i) / ∑_{j=1}^{|V|} exp(w_j · h + b_j), where the denominator is expensive to compute ‣ English vocabulary size: 10K to 100K
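
Written out, the cost is visible in the normalizer: every probability needs a dot product with every row of the output weight matrix (a sketch with illustrative names).

```python
import numpy as np

def full_softmax(h, W, b):
    logits = W @ h + b                       # one dot product per vocabulary word
    logits -= logits.max()                   # numerical stability
    e = np.exp(logits)
    return e / e.sum()                       # the denominator touches all |V| words
```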

  38. Approximate Softmax ‣ Negative Sampling ‣ Structured softmax ‣ Embedding prediction

  39. Negative Sampling • Softmax is expensive when vocabulary size is large (figure credit: Graham Neubig)

  40. Negative Sampling • Sample just a subset of the vocabulary as negative examples • Saw simple negative sampling in word2vec (Mikolov 2013) • Other ways to sample: Importance Sampling (Bengio and Senecal 2003), Noise Contrastive Estimation (Mnih & Teh 2012) (figure credit: Graham Neubig)
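
A rough negative-sampling-style objective in the spirit of word2vec: score the gold word against K sampled negatives instead of normalizing over the full vocabulary. This is a sketch with uniform negatives and no corrections; importance sampling and NCE modify the objective to keep the estimate (approximately) unbiased.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(h, W, gold_id, K=10, rng=np.random.default_rng()):
    # Uniform negatives for simplicity; real systems exclude the gold word and
    # often sample from a (smoothed) unigram distribution instead.
    neg_ids = rng.integers(0, W.shape[0], size=K)
    pos = np.log(sigmoid(W[gold_id] @ h))            # push the gold word's score up
    neg = np.log(sigmoid(-W[neg_ids] @ h)).sum()     # push the sampled words' scores down
    return -(pos + neg)
```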

  41. Hierarchical softmax (Morin and Bengio 2005) (figure credit: Quora)

  42. Class based softmax ‣ Two-layer: cluster words into classes, predict the class and then predict the word within it. ‣ Clusters can be based on frequency, random assignment, or word contexts. (Goodman 2001, Mikolov et al 2011) (figure credit: Graham Neubig)
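
A sketch of the two-step factorization P(word) = P(class | h) · P(word | class, h); the class assignment and weight structures here are invented for illustration.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def class_factored_prob(h, W_class, W_out, word_class, class_words, word_id):
    """P(word) = P(class | h) * P(word | class, h); both softmaxes are small."""
    c = word_class[word_id]                          # which class the word belongs to
    p_class = softmax(W_class @ h)[c]                # softmax over the number of classes
    members = class_words[c]                         # list of word ids in class c
    p_word = softmax(W_out[members] @ h)[members.index(word_id)]
    return p_class * p_word
```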

  43. Embedding prediction ‣ Directly predict embeddings of the outputs themselves (Kumar and Tsvetkov 2019) ‣ What loss to use? L2? Cosine? ‣ von Mises-Fisher distribution loss: make embeddings close on the unit ball (slide credit: Graham Neubig)

  44. Generation How can we use our model (decoder) to generate sentences? • Sampling: Try to generate a random sentence according to the probability distribution • Argmax: Try to generate the best sentence, i.e. the sentence with the highest probability

  45. Decoding Strategies ‣ Ancestral sampling ‣ Greedy decoding ‣ Exhaustive search ‣ Beam search

  46. Ancestral Sampling • Randomly sample words one by one • Provides diverse output (high variance) (figure credit: Luong, Cho, and Manning)
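
A sketch of ancestral sampling; `step` stands for any function (hypothetical here) that takes the previous word id and decoder state and returns the next-word distribution and the updated state.

```python
import numpy as np

def sample_decode(step, state, bos_id, eos_id, max_len=50, rng=np.random.default_rng()):
    out, y = [], bos_id
    for _ in range(max_len):
        probs, state = step(y, state)        # P(y_t | y_<t, x)
        y = rng.choice(len(probs), p=probs)  # random draw -> diverse, high-variance output
        if y == eos_id:
            break
        out.append(y)
    return out
```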

  47. Greedy decoding ‣ Compute argmax at every step of decoder to generate word ‣ What’s wrong?
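
Greedy decoding differs only in the selection rule (same hypothetical `step` interface as above): it commits to the locally best word at every step, so an early mistake can never be undone, which is what motivates beam search in the list of strategies above.

```python
import numpy as np

def greedy_decode(step, state, bos_id, eos_id, max_len=50):
    out, y = [], bos_id
    for _ in range(max_len):
        probs, state = step(y, state)
        y = int(np.argmax(probs))            # best word *at this step* only
        if y == eos_id:
            break
        out.append(y)
    return out
```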
