Neural Machine Translation (ACL 2016 Tutorial)



  1. Neural Machine Translation. Thang Luong, Kyunghyun Cho, Christopher Manning (@lmthang · @kchonyc · @chrmanning). ACL 2016 tutorial: https://sites.google.com/site/acl16nmt/

  2. [Chart] IWSLT 2015, TED talk MT, English-German: systems compared on cased BLEU and on human evaluation (HTER, %).

  3. [Chart] Progress in Machine Translation (Edinburgh En-De, WMT newstest2013, cased BLEU; NMT 2015 from U. Montréal): phrase-based SMT, syntax-based SMT, and neural MT, 2013-2016. From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]

  4. [Diagram] Neural encoder-decoder architectures: input text → encoder → fixed-size vector representation → decoder → translated text.

  5. [Diagram] NMT system for translating a single word

  6. [Diagram] NMT system for translating a single word (continued)

  7. [Diagram] NMT system for translating a single word (continued)

  8. Softmax function: the standard map from $\mathbb{R}^{|V|}$ to a probability distribution. Exponentiate to make every entry positive, then normalize to give probabilities: $p_i = \exp(u_i) / \sum_{j=1}^{|V|} \exp(u_j)$.
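
A minimal NumPy sketch of the softmax map described above (the function name and the max-subtraction trick for numerical stability are our additions):

```python
import numpy as np

def softmax(u):
    """Map a score vector in R^|V| to a probability distribution over V."""
    u = u - np.max(u)              # shift for stability; does not change the result
    exp_u = np.exp(u)              # exponentiate to make every entry positive
    return exp_u / exp_u.sum()     # normalize so the entries sum to one

scores = np.array([2.0, 1.0, -0.5])
print(softmax(scores))             # roughly [0.69, 0.25, 0.06]
```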

  9. The three big wins of Neural MT:
     1. End-to-end training: all parameters are simultaneously optimized to minimize a loss function on the network's output.
     2. Distributed representations share strength: better exploitation of word and phrase similarities.
     3. Better exploitation of context: NMT can use a much bigger context (both source and partial target text) to translate more accurately.

  10. A Non-Markovian Language Model. Can we directly model the true conditional probability?
      $p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$
      Can we build a neural language model for this?
      1. Feature extraction: $h_t = f(x_1, x_2, \ldots, x_t)$
      2. Prediction: $p(x_{t+1} \mid x_1, \ldots, x_t) = g(h_t)$
      How can $f$ take a variable-length input?

  11. A Non-Markovian Language Model. Can we directly model the true conditional probability?
      $p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$
      Recursive construction of $f$:
      1. Initialization: $h_0 = 0$
      2. Recursion: $h_t = f(x_t, h_{t-1})$
      We call $h_t$ a hidden state or memory; $h_t$ summarizes the history $(x_1, \ldots, x_t)$.

  12. A Non-Markovian Language Model. Example: $p(\text{the, cat, is, eating})$
      (1) Initialization: $h_0 = 0$
      (2) Recursion with prediction:
          $h_1 = f(h_0, \langle\text{bos}\rangle) \rightarrow p(\text{the}) = g(h_1)$
          $h_2 = f(h_1, \text{the}) \rightarrow p(\text{cat} \mid \text{the}) = g(h_2)$
          $h_3 = f(h_2, \text{cat}) \rightarrow p(\text{is} \mid \text{the, cat}) = g(h_3)$
          $h_4 = f(h_3, \text{is}) \rightarrow p(\text{eating} \mid \text{the, cat, is}) = g(h_4)$
      (3) Combination: $p(\text{the, cat, is, eating}) = g(h_1)\, g(h_2)\, g(h_3)\, g(h_4)$
      Read, Update and Predict.

  13. [Diagram] A Recurrent Neural Network Language Model solves the second problem! Example: $p(\text{the, cat, is, eating})$. Read, Update and Predict.

  14. Building a Recurrent Language Model. Transition function: $h_t = f(h_{t-1}, x_t)$
      Inputs:
        i. Current word $x_t \in \{1, 2, \ldots, |V|\}$
        ii. Previous state $h_{t-1} \in \mathbb{R}^d$
      Parameters:
        i. Input weight matrix $W \in \mathbb{R}^{|V| \times d}$
        ii. Transition weight matrix $U \in \mathbb{R}^{d \times d}$
        iii. Bias vector $b \in \mathbb{R}^d$

  15. Building a Recurrent Language Model. Transition function: $h_t = f(h_{t-1}, x_t)$
      Naive transition function:
      $f(h_{t-1}, x_t) = \tanh(W[x_t] + U h_{t-1} + b)$
      where $W[x_t]$ is a trainable word vector (the $x_t$-th row of $W$), $U h_{t-1}$ is a linear transformation of the previous state, and $\tanh$ is an element-wise nonlinear transformation.
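
A minimal NumPy sketch of this naive transition function (the vocabulary size, state dimension, and initialization scale are placeholder choices of ours):

```python
import numpy as np

V, d = 10_000, 128                      # assumed vocabulary size and state dimension
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, size=(V, d))    # input weight matrix: one trainable word vector per row
U = rng.normal(0, 0.01, size=(d, d))    # transition weight matrix
b = np.zeros(d)                         # bias vector

def transition(h_prev, x_t):
    """Naive RNN transition: h_t = tanh(W[x_t] + U h_{t-1} + b)."""
    return np.tanh(W[x_t] + U @ h_prev + b)

h = np.zeros(d)                         # h_0 = 0
h = transition(h, x_t=42)               # read one word (index 42) and update the state
```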

  16. Building a Recurrent Language Model. Prediction function: $p(x_{t+1} = w \mid x_{\le t}) = g_w(h_t)$
      Inputs:
        i. Current state $h_t \in \mathbb{R}^d$
      Parameters:
        i. Softmax matrix $R \in \mathbb{R}^{|V| \times d}$
        ii. Bias vector $c \in \mathbb{R}^{|V|}$

  17. Building a Recurrent Language Model. Prediction function:
      $p(x_{t+1} = w \mid x_{\le t}) = g_w(h_t) = \frac{\exp(R[w]^\top h_t + c_w)}{\sum_{i=1}^{|V|} \exp(R[i]^\top h_t + c_i)}$
      The dot product $R[w]^\top h_t$ measures the compatibility between the trainable word vector $R[w]$ and the hidden state; exponentiate, then normalize.
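
A matching NumPy sketch of the prediction function (the shapes and initialization are again placeholder choices of ours):

```python
import numpy as np

V, d = 10_000, 128                      # assumed vocabulary size and state dimension
rng = np.random.default_rng(0)
R = rng.normal(0, 0.01, size=(V, d))    # softmax matrix: one output word vector per row
c = np.zeros(V)                         # output bias vector

def predict(h_t):
    """p(x_{t+1} = w | x_{<=t}) = g_w(h_t): a distribution over the whole vocabulary."""
    scores = R @ h_t + c                # compatibility R[w]^T h_t + c_w for every word w
    scores -= scores.max()              # subtract the max for numerical stability
    probs = np.exp(scores)              # exponentiate
    return probs / probs.sum()          # normalize

h_t = np.zeros(d)                       # some hidden state (here h_0 = 0, for illustration)
p_next = predict(h_t)                   # p_next[w] is the probability of word w coming next
```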

  18. Training a recurrent language model. Having determined the model form, we:
      1. Initialize all parameters of the model, including the word representations, with small random numbers.
      2. Define a loss function: how badly we predict the actual next words (log loss, i.e. cross-entropy loss).
      3. Repeatedly attempt to predict each next word.
      4. Backpropagate the loss to update all parameters.
      5. Just doing this learns good word representations and good prediction functions: it's almost magic.

  19. Recurrent Language Model. Example: $p(\text{the, cat, is, eating})$. Read, Update and Predict.
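
Putting the transition and prediction functions together, a toy end-to-end sketch of the Read, Update and Predict loop for this example (the tiny vocabulary, indices, and random parameters are ours, purely for illustration):

```python
import numpy as np

vocab = {"<bos>": 0, "the": 1, "cat": 2, "is": 3, "eating": 4}
V, d = len(vocab), 8
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(V, d))      # input word vectors
U = rng.normal(0, 0.1, size=(d, d))      # transition weights
b = np.zeros(d)
R = rng.normal(0, 0.1, size=(V, d))      # softmax matrix
c = np.zeros(V)

def f(h_prev, x):                        # read + update
    return np.tanh(W[x] + U @ h_prev + b)

def g(h):                                # predict: distribution over the vocabulary
    s = R @ h + c
    s -= s.max()
    p = np.exp(s)
    return p / p.sum()

sentence = ["the", "cat", "is", "eating"]
h = np.zeros(d)                          # (1) initialization: h_0 = 0
log_prob = 0.0
prev = "<bos>"
for word in sentence:                    # (2) recursion with prediction
    h = f(h, vocab[prev])                # read the previous word, update the state
    log_prob += np.log(g(h)[vocab[word]])  # accumulate log p(word | history)
    prev = word
print("log p(the, cat, is, eating) =", log_prob)   # (3) combination
```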

  20. Training a Recurrent Language Model.
      Log-probability of one training sentence:
      $\log p(x_1^n, x_2^n, \ldots, x_{T_n}^n) = \sum_{t=1}^{T_n} \log p(x_t^n \mid x_1^n, \ldots, x_{t-1}^n)$
      Training set: $D = \{X^1, X^2, \ldots, X^N\}$
      Log-likelihood functional:
      $\mathcal{L}(\theta, D) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p(x_t^n \mid x_1^n, \ldots, x_{t-1}^n)$
      Minimize $-\mathcal{L}(\theta, D)$!
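
A sketch of this objective in code, assuming a `sentence_log_prob(sentence, theta)` helper like the loop in the previous snippet (the helper name and signature are ours):

```python
import numpy as np

def corpus_log_likelihood(corpus, sentence_log_prob, theta):
    """L(theta, D) = (1/N) * sum_n sum_t log p(x_t^n | x_1^n, ..., x_{t-1}^n).

    `corpus` is a list of N sentences; `sentence_log_prob` returns the summed
    per-step log-probabilities of one sentence under parameters `theta`.
    """
    return np.mean([sentence_log_prob(sent, theta) for sent in corpus])

# Training minimizes the negative log-likelihood:
# loss = -corpus_log_likelihood(corpus, sentence_log_prob, theta)
```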

  21. Gradient Descent. Move slowly in the steepest descent direction:
      $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta, D)$
      Computational cost of a single update: $O(N)$. Not suitable for a large corpus.

  22. Stochastic Gradient Descent. Estimate the steepest direction with a minibatch:
      $\nabla \mathcal{L}(\theta, D) \approx \nabla \mathcal{L}(\theta, \{X^1, \ldots, X^n\})$
      Repeat until convergence (with respect to a validation set):
      $|\mathcal{L}(\theta, D_{\text{val}}) - \mathcal{L}(\theta - \eta \nabla \mathcal{L}(\theta, D), D_{\text{val}})| \le \epsilon$
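
A minimal sketch of such an SGD loop with validation-based stopping (all function and variable names here are placeholders of ours, not from the tutorial):

```python
import random

def sgd(params, grad_fn, loss_fn, train_data, val_data,
        lr=0.1, batch_size=32, eps=1e-4, max_epochs=100):
    """Minibatch SGD: theta <- theta - lr * gradient estimated on a minibatch."""
    prev_val = loss_fn(params, val_data)
    for _ in range(max_epochs):
        random.shuffle(train_data)
        for i in range(0, len(train_data), batch_size):
            batch = train_data[i:i + batch_size]           # {X^1, ..., X^n}
            grads = grad_fn(params, batch)                 # stochastic gradient estimate
            for name in params:
                params[name] -= lr * grads[name]           # descent step
        val = loss_fn(params, val_data)
        if abs(prev_val - val) <= eps:                     # convergence w.r.t. validation set
            break
        prev_val = val
    return params
```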

  23. Stochastic Gradient Descent. It is not trivial to build a minibatch from sentences of different lengths.
      1. Padding and masking: pad each sentence with 0's up to the longest one in the minibatch. Suitable for GPUs, but wasteful: the computation spent on the padded positions is wasted.

  24. Stochastic Gradient Descent.
      1. Padding and masking: suitable for GPUs, but wasteful (computation on the padded 0's is wasted).
      2. Smarter padding and masking: minimize the waste. Ensure that the length differences within a minibatch are minimal: sort the sentences (by length) and sequentially build each minibatch.
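
A sketch of both strategies, assuming sentences are lists of integer word indices and 0 is the padding index (these conventions are our assumptions):

```python
import numpy as np

def pad_batch(sentences, pad_id=0):
    """Naive padding and masking: pad every sentence to the longest one in the batch."""
    max_len = max(len(s) for s in sentences)
    batch = np.full((len(sentences), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(sentences), max_len), dtype=np.float32)
    for i, s in enumerate(sentences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = 1.0          # 1 on real tokens, 0 on wasted (padded) positions
    return batch, mask

def length_sorted_minibatches(sentences, batch_size):
    """Smarter padding: sort by length so each minibatch wastes as little as possible."""
    by_length = sorted(sentences, key=len)
    for i in range(0, len(by_length), batch_size):
        yield pad_batch(by_length[i:i + batch_size])
```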

  25. Backpropagation Through Time. How do we compute $\nabla \mathcal{L}(\theta, D)$?
      The cost is a sum of per-sample cost functions:
      $\nabla \mathcal{L}(\theta, D) = \sum_{X \in D} \nabla \mathcal{L}(\theta, X)$
      The per-sample cost is a sum of per-step cost functions $\log p(x_t \mid x_{<t})$:
      $\nabla \mathcal{L}(\theta, X) = \sum_{t=1}^{T} \nabla \log p(x_t \mid x_{<t}, \theta)$

  26. Backpropagation Through Time. How do we compute $\nabla \log p(x_t \mid x_{<t}, \theta)$?
      Compute the per-step cost function, starting from time $t = T$ and working backwards:
      1. Cost derivative: $\partial \log p(x_t \mid x_{<t}) / \partial g$
      2. Gradient w.r.t. $R$: chain with $\partial g / \partial R$
      3. Gradient w.r.t. $h_t$: chain with $\partial g / \partial h_t$, plus the contribution $\partial h_{t+1} / \partial h_t$ from the future
      4. Gradient w.r.t. $U$: chain with $\partial h_t / \partial U$
      5. Gradient w.r.t. $W$ and $b$: chain with $\partial h_t / \partial W$ and $\partial h_t / \partial b$
      6. Accumulate the gradients and set $t \leftarrow t - 1$
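
A sketch of this backward sweep for the naive RNN language model above (the function signature and index conventions are ours; the loss is the summed negative log-probability of the targets):

```python
import numpy as np

def bptt(xs, ys, W, U, b, R, c):
    """Backpropagation through time for the naive RNN LM.

    xs: input word indices (x_1..x_T); ys: target indices (the next words).
    Returns the total negative log-likelihood and gradients for all parameters.
    """
    d = U.shape[0]
    hs, ps = {-1: np.zeros(d)}, {}
    loss = 0.0
    # Forward pass: read, update, predict at every step.
    for t, x in enumerate(xs):
        hs[t] = np.tanh(W[x] + U @ hs[t - 1] + b)
        s = R @ hs[t] + c
        s -= s.max()
        ps[t] = np.exp(s) / np.exp(s).sum()
        loss -= np.log(ps[t][ys[t]])
    # Backward pass: accumulate gradients from t = T down to t = 1.
    dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
    dR, dc = np.zeros_like(R), np.zeros_like(c)
    dh_next = np.zeros(d)                       # dJ/dh_{t+1} flowing backwards in time
    for t in reversed(range(len(xs))):
        ds = ps[t].copy()
        ds[ys[t]] -= 1.0                        # 1. cost derivative w.r.t. the softmax scores
        dR += np.outer(ds, hs[t]); dc += ds     # 2. gradient w.r.t. R (and c)
        dh = R.T @ ds + dh_next                 # 3. gradient w.r.t. h_t (present + future)
        da = (1.0 - hs[t] ** 2) * dh            #    through the tanh nonlinearity
        dU += np.outer(da, hs[t - 1])           # 4. gradient w.r.t. U
        dW[xs[t]] += da; db += da               # 5. gradient w.r.t. W and b
        dh_next = U.T @ da                      # 6. pass the gradient one step further back
    return loss, (dW, dU, db, dR, dc)
```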

  27. Backpropagation Through Time. Intuitively, what's happening here?
      1. Measure the influence of the past on the future:
      $\frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial h_t} = \frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial g} \frac{\partial g}{\partial h_{t+n}} \frac{\partial h_{t+n}}{\partial h_{t+n-1}} \cdots \frac{\partial h_{t+1}}{\partial h_t}$
      2. How does a perturbation $\epsilon$ of $x_t$ at time $t$ affect $p(x_{t+n} \mid x_{<t+n})$?

  28. Backpropagation Through Time. Intuitively, what's happening here?
      1. Measure the influence of the past on the future:
      $\frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial h_t} = \frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial g} \frac{\partial g}{\partial h_{t+n}} \frac{\partial h_{t+n}}{\partial h_{t+n-1}} \cdots \frac{\partial h_{t+1}}{\partial h_t}$
      2. How does a perturbation $\epsilon$ of $x_t$ at time $t$ affect $p(x_{t+n} \mid x_{<t+n})$?
      3. Change the parameters to maximize $p(x_{t+n} \mid x_{<t+n})$.

  29. Backpropagation Through Time. Intuitively, what's happening here?
      1. Measure the influence of the past on the future:
      $\frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial h_t} = \frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial g} \frac{\partial g}{\partial h_{t+n}} \frac{\partial h_{t+n}}{\partial h_{t+n-1}} \cdots \frac{\partial h_{t+1}}{\partial h_t}$
      2. With the naive transition function $f(h_{t-1}, x_{t-1}) = \tanh(W[x_{t-1}] + U h_{t-1} + b)$, we get
      $\frac{\partial J_{t+N}}{\partial h_t} = \frac{\partial J_{t+N}}{\partial g} \frac{\partial g}{\partial h_{t+N}} \prod_{n=1}^{N} U^\top \mathrm{diag}\!\left(\frac{\partial \tanh(a_{t+n})}{\partial a_{t+n}}\right)$
      The repeated product of $U^\top$ factors is problematic! [Bengio, IEEE 1994]

  30. Backpropagation Through Time. What happens? The gradient either vanishes or explodes.
      $\frac{\partial J_{t+N}}{\partial h_t} = \frac{\partial J_{t+N}}{\partial g} \frac{\partial g}{\partial h_{t+N}} \prod_{n=1}^{N} U^\top \mathrm{diag}\!\left(\frac{\partial \tanh(a_{t+n})}{\partial a_{t+n}}\right)$
      Let $e_{\max}$ be the largest eigenvalue of $U$. Since $\max_x \tanh'(x) = 1$:
      1. The gradient likely explodes if $e_{\max} \ge 1 / \max_x \tanh'(x) = 1$
      2. The gradient likely vanishes if $e_{\max} < 1 / \max_x \tanh'(x) = 1$
      [Bengio, Simard, Frasconi, TNN 1994; Hochreiter, Bengio, Frasconi, Schmidhuber, 2001]
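
A small numeric illustration of this effect (the random matrices and the rescaling to a chosen spectral radius are our own construction, not from the tutorial): the norm of the product of per-step Jacobians tends to shrink or grow exponentially depending on the largest eigenvalue of $U$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 32, 50

def jacobian_product_norm(spectral_radius):
    """Norm of prod_n U^T diag(tanh'(a_n)) after `steps` time steps."""
    U = rng.normal(size=(d, d))
    U *= spectral_radius / np.max(np.abs(np.linalg.eigvals(U)))  # rescale to the target e_max
    J = np.eye(d)
    for _ in range(steps):
        h = np.tanh(rng.normal(size=d))          # some hidden state
        J = U.T @ np.diag(1.0 - h ** 2) @ J      # one factor U^T diag(tanh'(a))
    return np.linalg.norm(J)

print(jacobian_product_norm(0.5))   # e_max well below 1: norm typically collapses towards 0
print(jacobian_product_norm(3.0))   # e_max well above 1: norm typically blows up
```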

  31. Backpropagation Through Time. Addressing the exploding gradient:
      "when gradients explode so does the curvature along v, leading to a wall in the error surface"
      Gradient clipping:
      1. Norm clipping: $\tilde{r} \leftarrow \frac{c}{\|r\|} r$ if $\|r\| \ge c$, and $\tilde{r} \leftarrow r$ otherwise
      2. Element-wise clipping: $r_i \leftarrow \min(c, |r_i|)\, \mathrm{sgn}(r_i)$ for all $i \in \{1, \ldots, \dim r\}$
      [Pascanu, Mikolov, Bengio, ICML 2013]
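
Both clipping rules in a short NumPy sketch (the function names are ours):

```python
import numpy as np

def clip_by_norm(r, c):
    """Norm clipping: rescale r to norm c if ||r|| >= c, otherwise leave it unchanged."""
    norm = np.linalg.norm(r)
    return (c / norm) * r if norm >= c else r

def clip_elementwise(r, c):
    """Element-wise clipping: r_i <- min(c, |r_i|) * sgn(r_i)."""
    return np.minimum(c, np.abs(r)) * np.sign(r)

g = np.array([3.0, -4.0])            # ||g|| = 5
print(clip_by_norm(g, 1.0))          # rescaled to norm 1: [0.6, -0.8]
print(clip_elementwise(g, 1.0))      # clipped per coordinate: [1.0, -1.0]
```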
