
  1. Neural Machine Translation Gongbo Tang 8 October 2018

  2. Outline: 1. Neural Machine Translation; 2. Advances and Challenges

  3. Neural Machine Translation. Figure – Recurrent neural network based NMT model, from Thang Luong's thesis on Neural Machine Translation.

  4. Neural Machine Translation. Figure – NMT model with attention mechanism; example from Rico Sennrich's EACL 2017 NMT talk.

  5. Modelling Translation. Suppose that we have a source sentence S of length m, (x_1, ..., x_m), and a target sentence T of length n, (y_1, ..., y_n). We can express translation as a probabilistic model:
     T^* = \arg\max_T p(T \mid S)
     Expanding using the chain rule gives
     p(T \mid S) = p(y_1, \ldots, y_n \mid x_1, \ldots, x_m) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)
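     A minimal Python sketch of this factorisation: the probability of a target sentence is the product of the per-token conditional probabilities, usually accumulated in log space. The numbers are the per-step probabilities from the greedy-search decoding example later in the slides.

         import math

         # per-token probabilities p(y_i | y_1..y_{i-1}, x_1..x_m) for "hello world ! <eos>"
         step_probs = [0.946, 0.957, 0.928, 0.999]

         log_p = sum(math.log(p) for p in step_probs)   # log p(T | S)
         print(math.exp(log_p))                         # p(T | S) ≈ 0.84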

  6. Modelling Translation.
     Target-side language model: p(T) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1})
     Translation model: p(T \mid S) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)
     We could just treat the sentence pair as one long sequence, but: we do not care about p(S), and we may want a different vocabulary and network architecture for the source text.

  7. Attentional Encoder-Decoder: Maths. Simplifications of the model by [Bahdanau et al., 2015] (for illustration): a plain RNN instead of a GRU, a simpler output layer, and bias terms are not shown; the decoder follows a Look, Update, Generate strategy [Sennrich et al., 2017]. Details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb
     Notation: W, U, E, C, V are weight matrices (of different dimensionality):
     E: one-hot to embedding (e.g. 50000 × 512)
     W: embedding to hidden (e.g. 512 × 1024)
     U: hidden to hidden (e.g. 1024 × 1024)
     C: context (2× hidden) to hidden (e.g. 2048 × 1024)
     V_o: hidden to one-hot (e.g. 1024 × 50000)
     There are separate weight matrices for the encoder and the decoder (e.g. E_x and E_y); the input X has length T_x and the output Y has length T_y. Example from Rico Sennrich's EACL 2017 NMT talk.
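     A sketch of these parameter shapes in numpy, assuming the example sizes on the slide (vocabulary 50000, embeddings of size 512, hidden states of size 1024); the Python variable names are mine:

         import numpy as np

         vocab, emb, hid = 50000, 512, 1024
         rng = np.random.default_rng(0)

         E_x = rng.normal(size=(vocab, emb))      # one-hot -> embedding (source)
         E_y = rng.normal(size=(vocab, emb))      # one-hot -> embedding (target)
         W   = rng.normal(size=(emb, hid))        # embedding -> hidden
         U   = rng.normal(size=(hid, hid))        # hidden -> hidden
         C   = rng.normal(size=(2 * hid, hid))    # context (2x hidden) -> hidden
         V_o = rng.normal(size=(hid, vocab))      # hidden -> one-hot (output logits)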

  8. Encoder. Figure – NMT model with attention mechanism; example from Rico Sennrich's EACL 2017 NMT talk.

  9. Encoder. Encoder with bidirectional Recurrent Neural Networks: input word embeddings, a left-to-right recurrent NN and a right-to-left recurrent NN.
     \overrightarrow{h}_j = \begin{cases} 0 & \text{if } j = 0 \\ \tanh(\overrightarrow{W}_x E_x x_j + \overrightarrow{U}_x \overrightarrow{h}_{j-1}) & \text{if } j > 0 \end{cases}
     \overleftarrow{h}_j = \begin{cases} 0 & \text{if } j = T_x + 1 \\ \tanh(\overleftarrow{W}_x E_x x_j + \overleftarrow{U}_x \overleftarrow{h}_{j+1}) & \text{if } j \leq T_x \end{cases}
     h_j = (\overrightarrow{h}_j, \overleftarrow{h}_j)
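     A minimal numpy sketch of this bidirectional encoder (plain RNN, no bias terms, as in the simplified model; the matrix names follow the notation slide and the shapes are only illustrative):

         import numpy as np

         def encode(x_ids, E_x, W_fwd, U_fwd, W_bwd, U_bwd):
             # x_ids: list of source token ids; returns the annotations h_1 .. h_Tx
             T_x, hid = len(x_ids), U_fwd.shape[0]
             fwd = np.zeros((T_x + 1, hid))   # fwd[0] is the initial zero state
             bwd = np.zeros((T_x + 2, hid))   # bwd[T_x + 1] is the initial zero state

             for j in range(1, T_x + 1):      # left-to-right RNN
                 fwd[j] = np.tanh(E_x[x_ids[j - 1]] @ W_fwd + fwd[j - 1] @ U_fwd)
             for j in range(T_x, 0, -1):      # right-to-left RNN
                 bwd[j] = np.tanh(E_x[x_ids[j - 1]] @ W_bwd + bwd[j + 1] @ U_bwd)

             # h_j concatenates both directions
             return [np.concatenate([fwd[j], bwd[j]]) for j in range(1, T_x + 1)]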

  10. Decoder. Figure – NMT model with attention mechanism; example from Rico Sennrich's EACL 2017 NMT talk.

  11. Decoder. Figure – decoder unrolled over two time steps: context vectors c_{i-1}, c_i; states s_{i-1}, s_i; word predictions t_{i-1}, t_i; selected words y_{i-1}, y_i; word embeddings E_y y_{i-1}, E_y y_i.
     s_i = \begin{cases} \tanh(W_s \overleftarrow{h}_1) & \text{if } i = 0 \\ \tanh(W_y E_y y_{i-1} + U_y s_{i-1} + C c_i) & \text{if } i > 0 \end{cases}
     t_i = \tanh(U_o s_i + W_o E_y y_{i-1} + C_o c_i)
     y_i = \mathrm{softmax}(V_o t_i)
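     A numpy sketch of one decoder step following these equations (Look, Update, Generate). The `attention` argument is an assumed stand-in that returns the context vector c_i, as in the attention slide further down; all weight names follow the notation slide:

         import numpy as np

         def softmax(z):
             e = np.exp(z - z.max())
             return e / e.sum()

         def decoder_step(y_prev_id, s_prev, E_y, W_y, U_y, C, U_o, W_o, C_o, V_o, attention, annotations):
             c_i = attention(s_prev, annotations)                           # Look
             s_i = np.tanh(E_y[y_prev_id] @ W_y + s_prev @ U_y + c_i @ C)   # Update
             t_i = np.tanh(s_i @ U_o + E_y[y_prev_id] @ W_o + c_i @ C_o)    # Generate
             p_i = softmax(t_i @ V_o)        # probability distribution over the vocabulary
             return p_i, s_i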

  12. Decoder.
     Training: y_i is known; the training objective is to assign a higher probability to the correct output word. One possible cost function is the negative log of the probability given to the correct word translation: cost = -\log t_i[y_i]
     Inference: y_i is unknown; we compute the probability distribution over the whole vocabulary.
     Greedy search: select the word with the highest probability.
     Beam search: keep the top k most likely word choices.
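     A small sketch of this per-position training cost, assuming `p_i` is the softmax output of the decoder step sketched above (the probability vector the slide indexes as t_i[y_i]) and `y_i` is the index of the correct reference word:

         import numpy as np

         def step_cost(p_i, y_i):
             # negative log-probability assigned to the correct word y_i
             return -np.log(p_i[y_i])

         # the sentence-level cost sums over all target positions:
         # cost = sum(step_cost(p, y) for p, y in zip(step_distributions, reference_ids))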

  13. Decoding. Figure – greedy search: the output "hello world ! <eos>" is generated one word at a time with per-step probabilities 0.946, 0.957, 0.928 and 0.999, i.e. cumulative negative log-probabilities 0.056, 0.100, 0.175 and 0.175.
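     A greedy-decoding sketch matching this figure. `decode_step(prefix)` is an assumed stand-in for one run of the decoder, returning the probability distribution over the vocabulary given the target words generated so far:

         import numpy as np

         def greedy_search(decode_step, eos_id, max_len=50):
             prefix = []
             for _ in range(max_len):
                 probs = decode_step(prefix)        # distribution over the vocabulary
                 word_id = int(np.argmax(probs))    # keep only the single most probable word
                 prefix.append(word_id)
                 if word_id == eos_id:
                     break
             return prefix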

  14. Decoding. Figure – greedy search vs. beam search with K = 3: beam search keeps the three most likely partial hypotheses at each step (e.g. "hello" 0.946, "HI" 0.007 and "Hey" 0.006 at the first step) and expands each of them, ranking hypotheses by cumulative negative log-probability; here the greedy path "hello world ! <eos>" (score 0.175) also remains the best beam hypothesis.
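     A minimal beam-search sketch under the same `decode_step` assumption as the greedy sketch above; hypotheses are ranked by their cumulative negative log-probability, as in the figure:

         import math

         def beam_search(decode_step, eos_id, k=3, max_len=50):
             beams = [(0.0, [])]          # (cumulative negative log-probability, token ids)
             finished = []
             for _ in range(max_len):
                 candidates = []
                 for score, prefix in beams:
                     probs = decode_step(prefix)          # distribution over the vocabulary
                     for word_id, p in enumerate(probs):
                         candidates.append((score - math.log(max(p, 1e-12)), prefix + [word_id]))
                 candidates.sort(key=lambda c: c[0])      # lowest cost first
                 beams = []
                 for score, prefix in candidates[:k]:     # keep the top k hypotheses
                     (finished if prefix[-1] == eos_id else beams).append((score, prefix))
                 if not beams:
                     break
             return min(finished + beams, key=lambda c: c[0])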

  15. Attention. Figure – NMT model with attention mechanism; example from Rico Sennrich's EACL 2017 NMT talk.

  16. Attention. Figure – attention network: the previous decoder state s_{i-1} and the encoder hidden states h are fed to the attention network, a softmax produces the weights α_i, and their weighted sum gives the context vector c_i used for the predictions.
     e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)
     \alpha_{ij} = \mathrm{softmax}(e_{ij})
     c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
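     A numpy sketch of this attention mechanism; it could serve as the `attention` function assumed in the decoder-step sketch above (parameter names follow the slide, shapes are only illustrative):

         import numpy as np

         def make_attention(W_a, U_a, v_a):
             def attention(s_prev, annotations):
                 # one alignment score e_ij per source position j
                 e = np.array([v_a @ np.tanh(s_prev @ W_a + h_j @ U_a) for h_j in annotations])
                 alpha = np.exp(e - e.max())
                 alpha /= alpha.sum()                 # softmax over the source positions
                 # context vector c_i: attention-weighted sum of the annotations
                 return sum(a * h_j for a, h_j in zip(alpha, annotations))
             return attention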

  17. Overview of NMT.
     Pros: more fluent translation; fewer lexical errors; fewer word order errors; fewer morphology errors.
     Cons: expensive computation; over-translation and under-translation (adequacy); bad at translating long sentences; needs more data; a black box.

  18. Advances and Challenges: attention mechanism; model architectures; monolingual data; models at different levels; linguistic features; modelling coverage; domain adaptation; transfer learning; what is NMT not good at?

  19. Attention Mechanism and Alignment. From "What does Attention in Neural Machine Translation Pay Attention to?"

  20. Attention Mechanism and Alignment. Table 8 – Distribution of attention probability mass (in %) over alignment points and the rest of the words for each POS tag, from "What does Attention in Neural Machine Translation Pay Attention to?":
     POS tag    attention to alignment points (%)    attention to other words (%)
     NUM             73                                   27
     NOUN            68                                   32
     ADJ             66                                   34
     PUNC            55                                   45
     ADV             50                                   50
     CONJ            50                                   50
     VERB            49                                   51
     ADP             47                                   53
     DET             45                                   55
     PRON            45                                   55
     PRT             36                                   64
     Overall         54                                   46
