Neural Machine Translation (ACL 2016 Tutorial)



  1. Neural Machine Translation. Thang Luong, Kyunghyun Cho, Christopher Manning (@lmthang · @kchonyc · @chrmanning). ACL 2016 tutorial: https://sites.google.com/site/acl16nmt/

  2. [Chart] IWSLT 2015, TED talk MT, English-German: systems compared on cased BLEU and on human evaluation (HTER, %).

  3. [Chart] Progress in Machine Translation (Edinburgh En-De, WMT newstest2013, cased BLEU; NMT 2015 from U. Montréal): phrase-based SMT, syntax-based SMT, and neural MT, 2013-2016. From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]

  4. [Diagram] Neural encoder-decoder architectures: input text → encoder → fixed-size vector representation → decoder → translated text.

  5. [Diagram] NMT system for translating a single word

  6. [Diagram] NMT system for translating a single word (continued)

  7. [Diagram] NMT system for translating a single word (continued)

  8. Softmax function: the standard map from $\mathbb{R}^{|V|}$ to a probability distribution. Exponentiate to make every entry positive, then normalize to give probabilities: $p_i = \exp(u_i) / \sum_{j=1}^{|V|} \exp(u_j)$.
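
A minimal NumPy sketch of the softmax map described above (the function name and the max-subtraction trick for numerical stability are our additions):

```python
import numpy as np

def softmax(u):
    """Map a score vector in R^|V| to a probability distribution over V."""
    u = u - np.max(u)              # shift for stability; does not change the result
    exp_u = np.exp(u)              # exponentiate to make every entry positive
    return exp_u / exp_u.sum()     # normalize so the entries sum to one

scores = np.array([2.0, 1.0, -0.5])
print(softmax(scores))             # roughly [0.69, 0.25, 0.06]
```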

  9. The three big wins of Neural MT:
     1. End-to-end training: all parameters are simultaneously optimized to minimize a loss function on the network's output.
     2. Distributed representations share strength: better exploitation of word and phrase similarities.
     3. Better exploitation of context: NMT can use a much bigger context (both source and partial target text) to translate more accurately.

  10. A Non-Markovian Language Model. Can we directly model the true conditional probability?
      $p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$
      Can we build a neural language model for this?
      1. Feature extraction: $h_t = f(x_1, x_2, \ldots, x_t)$
      2. Prediction: $p(x_{t+1} \mid x_1, \ldots, x_t) = g(h_t)$
      How can $f$ take a variable-length input?

  11. A Non-Markovian Language Model. Can we directly model the true conditional probability?
      $p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$
      Recursive construction of $f$:
      1. Initialization: $h_0 = 0$
      2. Recursion: $h_t = f(x_t, h_{t-1})$
      We call $h_t$ a hidden state or memory; $h_t$ summarizes the history $(x_1, \ldots, x_t)$.

  12. A Non-Markovian Language Model. Example: $p(\text{the, cat, is, eating})$
      (1) Initialization: $h_0 = 0$
      (2) Recursion with prediction:
          $h_1 = f(h_0, \langle\text{bos}\rangle) \rightarrow p(\text{the}) = g(h_1)$
          $h_2 = f(h_1, \text{the}) \rightarrow p(\text{cat} \mid \text{the}) = g(h_2)$
          $h_3 = f(h_2, \text{cat}) \rightarrow p(\text{is} \mid \text{the, cat}) = g(h_3)$
          $h_4 = f(h_3, \text{is}) \rightarrow p(\text{eating} \mid \text{the, cat, is}) = g(h_4)$
      (3) Combination: $p(\text{the, cat, is, eating}) = g(h_1)\, g(h_2)\, g(h_3)\, g(h_4)$
      Read, Update and Predict.

  13. [Diagram] A Recurrent Neural Network Language Model solves the second problem! Example: $p(\text{the, cat, is, eating})$. Read, Update and Predict.

  14. Building a Recurrent Language Model. Transition function: $h_t = f(h_{t-1}, x_t)$
      Inputs:
        i. Current word $x_t \in \{1, 2, \ldots, |V|\}$
        ii. Previous state $h_{t-1} \in \mathbb{R}^d$
      Parameters:
        i. Input weight matrix $W \in \mathbb{R}^{|V| \times d}$
        ii. Transition weight matrix $U \in \mathbb{R}^{d \times d}$
        iii. Bias vector $b \in \mathbb{R}^d$

  15. Building a Recurrent Language Model. Transition function: $h_t = f(h_{t-1}, x_t)$
      Naive transition function:
      $f(h_{t-1}, x_t) = \tanh(W[x_t] + U h_{t-1} + b)$
      where $W[x_t]$ is a trainable word vector (the $x_t$-th row of $W$), $U h_{t-1}$ is a linear transformation of the previous state, and $\tanh$ is an element-wise nonlinear transformation.
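
A minimal NumPy sketch of this naive transition function (the vocabulary size, state dimension, and initialization scale are placeholder choices of ours):

```python
import numpy as np

V, d = 10_000, 128                      # assumed vocabulary size and state dimension
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, size=(V, d))    # input weight matrix: one trainable word vector per row
U = rng.normal(0, 0.01, size=(d, d))    # transition weight matrix
b = np.zeros(d)                         # bias vector

def transition(h_prev, x_t):
    """Naive RNN transition: h_t = tanh(W[x_t] + U h_{t-1} + b)."""
    return np.tanh(W[x_t] + U @ h_prev + b)

h = np.zeros(d)                         # h_0 = 0
h = transition(h, x_t=42)               # read one word (index 42) and update the state
```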

  16. Building a Recurrent Language Model. Prediction function: $p(x_{t+1} = w \mid x_{\le t}) = g_w(h_t)$
      Inputs:
        i. Current state $h_t \in \mathbb{R}^d$
      Parameters:
        i. Softmax matrix $R \in \mathbb{R}^{|V| \times d}$
        ii. Bias vector $c \in \mathbb{R}^{|V|}$

  17. Building a Recurrent Language Model. Prediction function:
      $p(x_{t+1} = w \mid x_{\le t}) = g_w(h_t) = \frac{\exp(R[w]^\top h_t + c_w)}{\sum_{i=1}^{|V|} \exp(R[i]^\top h_t + c_i)}$
      The dot product $R[w]^\top h_t$ measures the compatibility between the trainable word vector $R[w]$ and the hidden state; exponentiate, then normalize.
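
A matching NumPy sketch of the prediction function (the shapes and initialization are again placeholder choices of ours):

```python
import numpy as np

V, d = 10_000, 128                      # assumed vocabulary size and state dimension
rng = np.random.default_rng(0)
R = rng.normal(0, 0.01, size=(V, d))    # softmax matrix: one output word vector per row
c = np.zeros(V)                         # output bias vector

def predict(h_t):
    """p(x_{t+1} = w | x_{<=t}) = g_w(h_t): a distribution over the whole vocabulary."""
    scores = R @ h_t + c                # compatibility R[w]^T h_t + c_w for every word w
    scores -= scores.max()              # subtract the max for numerical stability
    probs = np.exp(scores)              # exponentiate
    return probs / probs.sum()          # normalize

h_t = np.zeros(d)                       # some hidden state (here h_0 = 0, for illustration)
p_next = predict(h_t)                   # p_next[w] is the probability of word w coming next
```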

  18. Training a recurrent language model. Having determined the model form, we:
      1. Initialize all parameters of the model, including the word representations, with small random numbers.
      2. Define a loss function: how badly we predict the actual next words (log loss, i.e. cross-entropy loss).
      3. Repeatedly attempt to predict each next word.
      4. Backpropagate the loss to update all parameters.
      5. Just doing this learns good word representations and good prediction functions: it's almost magic.

  19. Recurrent Language Model. Example: $p(\text{the, cat, is, eating})$. Read, Update and Predict.
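
Putting the transition and prediction functions together, a toy end-to-end sketch of the Read, Update and Predict loop for this example (the tiny vocabulary, indices, and random parameters are ours, purely for illustration):

```python
import numpy as np

vocab = {"<bos>": 0, "the": 1, "cat": 2, "is": 3, "eating": 4}
V, d = len(vocab), 8
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(V, d))      # input word vectors
U = rng.normal(0, 0.1, size=(d, d))      # transition weights
b = np.zeros(d)
R = rng.normal(0, 0.1, size=(V, d))      # softmax matrix
c = np.zeros(V)

def f(h_prev, x):                        # read + update
    return np.tanh(W[x] + U @ h_prev + b)

def g(h):                                # predict: distribution over the vocabulary
    s = R @ h + c
    s -= s.max()
    p = np.exp(s)
    return p / p.sum()

sentence = ["the", "cat", "is", "eating"]
h = np.zeros(d)                          # (1) initialization: h_0 = 0
log_prob = 0.0
prev = "<bos>"
for word in sentence:                    # (2) recursion with prediction
    h = f(h, vocab[prev])                # read the previous word, update the state
    log_prob += np.log(g(h)[vocab[word]])  # accumulate log p(word | history)
    prev = word
print("log p(the, cat, is, eating) =", log_prob)   # (3) combination
```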

  20. Training a Recurrent Language Model.
      Log-probability of one training sentence:
      $\log p(x_1^n, x_2^n, \ldots, x_{T_n}^n) = \sum_{t=1}^{T_n} \log p(x_t^n \mid x_1^n, \ldots, x_{t-1}^n)$
      Training set: $D = \{X^1, X^2, \ldots, X^N\}$
      Log-likelihood functional:
      $\mathcal{L}(\theta, D) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p(x_t^n \mid x_1^n, \ldots, x_{t-1}^n)$
      Minimize $-\mathcal{L}(\theta, D)$!
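
A sketch of this objective in code, assuming a `sentence_log_prob(sentence, theta)` helper like the loop in the previous snippet (the helper name and signature are ours):

```python
import numpy as np

def corpus_log_likelihood(corpus, sentence_log_prob, theta):
    """L(theta, D) = (1/N) * sum_n sum_t log p(x_t^n | x_1^n, ..., x_{t-1}^n).

    `corpus` is a list of N sentences; `sentence_log_prob` returns the summed
    per-step log-probabilities of one sentence under parameters `theta`.
    """
    return np.mean([sentence_log_prob(sent, theta) for sent in corpus])

# Training minimizes the negative log-likelihood:
# loss = -corpus_log_likelihood(corpus, sentence_log_prob, theta)
```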

  21. Gradient Descent. Move slowly in the steepest descent direction:
      $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta, D)$
      Computational cost of a single update: $O(N)$. Not suitable for a large corpus.

  22. Stochastic Gradient Descent. Estimate the steepest direction with a minibatch:
      $\nabla \mathcal{L}(\theta, D) \approx \nabla \mathcal{L}(\theta, \{X^1, \ldots, X^n\})$
      Repeat until convergence (with respect to a validation set):
      $|\mathcal{L}(\theta, D_{\text{val}}) - \mathcal{L}(\theta - \eta \nabla \mathcal{L}(\theta, D), D_{\text{val}})| \le \epsilon$
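
A minimal sketch of such an SGD loop with validation-based stopping (all function and variable names here are placeholders of ours, not from the tutorial):

```python
import random

def sgd(params, grad_fn, loss_fn, train_data, val_data,
        lr=0.1, batch_size=32, eps=1e-4, max_epochs=100):
    """Minibatch SGD: theta <- theta - lr * gradient estimated on a minibatch."""
    prev_val = loss_fn(params, val_data)
    for _ in range(max_epochs):
        random.shuffle(train_data)
        for i in range(0, len(train_data), batch_size):
            batch = train_data[i:i + batch_size]           # {X^1, ..., X^n}
            grads = grad_fn(params, batch)                 # stochastic gradient estimate
            for name in params:
                params[name] -= lr * grads[name]           # descent step
        val = loss_fn(params, val_data)
        if abs(prev_val - val) <= eps:                     # convergence w.r.t. validation set
            break
        prev_val = val
    return params
```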

  23. Stochastic Gradient Descent. It is not trivial to build a minibatch from sentences of different lengths.
      1. Padding and masking: pad each sentence with 0's up to the longest one in the minibatch. Suitable for GPUs, but wasteful: the computation spent on the padded positions is wasted.

  24. Stochastic Gradient Descent.
      1. Padding and masking: suitable for GPUs, but wasteful (computation on the padded 0's is wasted).
      2. Smarter padding and masking: minimize the waste. Ensure that the length differences within a minibatch are minimal: sort the sentences (by length) and sequentially build each minibatch.
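
A sketch of both strategies, assuming sentences are lists of integer word indices and 0 is the padding index (these conventions are our assumptions):

```python
import numpy as np

def pad_batch(sentences, pad_id=0):
    """Naive padding and masking: pad every sentence to the longest one in the batch."""
    max_len = max(len(s) for s in sentences)
    batch = np.full((len(sentences), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(sentences), max_len), dtype=np.float32)
    for i, s in enumerate(sentences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = 1.0          # 1 on real tokens, 0 on wasted (padded) positions
    return batch, mask

def length_sorted_minibatches(sentences, batch_size):
    """Smarter padding: sort by length so each minibatch wastes as little as possible."""
    by_length = sorted(sentences, key=len)
    for i in range(0, len(by_length), batch_size):
        yield pad_batch(by_length[i:i + batch_size])
```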

  25. Backpropagation Through Time. How do we compute $\nabla \mathcal{L}(\theta, D)$?
      The cost is a sum of per-sample cost functions:
      $\nabla \mathcal{L}(\theta, D) = \sum_{X \in D} \nabla \mathcal{L}(\theta, X)$
      The per-sample cost is a sum of per-step cost functions $\log p(x_t \mid x_{<t})$:
      $\nabla \mathcal{L}(\theta, X) = \sum_{t=1}^{T} \nabla \log p(x_t \mid x_{<t}, \theta)$

  26. Backpropagation Through Time. How do we compute $\nabla \log p(x_t \mid x_{<t}, \theta)$?
      Compute the per-step cost function, starting from time $t = T$ and working backwards:
      1. Cost derivative: $\partial \log p(x_t \mid x_{<t}) / \partial g$
      2. Gradient w.r.t. $R$: chain with $\partial g / \partial R$
      3. Gradient w.r.t. $h_t$: chain with $\partial g / \partial h_t$, plus the contribution $\partial h_{t+1} / \partial h_t$ from the future
      4. Gradient w.r.t. $U$: chain with $\partial h_t / \partial U$
      5. Gradient w.r.t. $W$ and $b$: chain with $\partial h_t / \partial W$ and $\partial h_t / \partial b$
      6. Accumulate the gradients and set $t \leftarrow t - 1$
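
A sketch of this backward sweep for the naive RNN language model above (the function signature and index conventions are ours; the loss is the summed negative log-probability of the targets):

```python
import numpy as np

def bptt(xs, ys, W, U, b, R, c):
    """Backpropagation through time for the naive RNN LM.

    xs: input word indices (x_1..x_T); ys: target indices (the next words).
    Returns the total negative log-likelihood and gradients for all parameters.
    """
    d = U.shape[0]
    hs, ps = {-1: np.zeros(d)}, {}
    loss = 0.0
    # Forward pass: read, update, predict at every step.
    for t, x in enumerate(xs):
        hs[t] = np.tanh(W[x] + U @ hs[t - 1] + b)
        s = R @ hs[t] + c
        s -= s.max()
        ps[t] = np.exp(s) / np.exp(s).sum()
        loss -= np.log(ps[t][ys[t]])
    # Backward pass: accumulate gradients from t = T down to t = 1.
    dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
    dR, dc = np.zeros_like(R), np.zeros_like(c)
    dh_next = np.zeros(d)                       # dJ/dh_{t+1} flowing backwards in time
    for t in reversed(range(len(xs))):
        ds = ps[t].copy()
        ds[ys[t]] -= 1.0                        # 1. cost derivative w.r.t. the softmax scores
        dR += np.outer(ds, hs[t]); dc += ds     # 2. gradient w.r.t. R (and c)
        dh = R.T @ ds + dh_next                 # 3. gradient w.r.t. h_t (present + future)
        da = (1.0 - hs[t] ** 2) * dh            #    through the tanh nonlinearity
        dU += np.outer(da, hs[t - 1])           # 4. gradient w.r.t. U
        dW[xs[t]] += da; db += da               # 5. gradient w.r.t. W and b
        dh_next = U.T @ da                      # 6. pass the gradient one step further back
    return loss, (dW, dU, db, dR, dc)
```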

  27. Backpropagation Through Time. Intuitively, what's happening here?
      1. Measure the influence of the past on the future:
      $\frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial h_t} = \frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial g} \frac{\partial g}{\partial h_{t+n}} \frac{\partial h_{t+n}}{\partial h_{t+n-1}} \cdots \frac{\partial h_{t+1}}{\partial h_t}$
      2. How does a perturbation $\epsilon$ of $x_t$ at time $t$ affect $p(x_{t+n} \mid x_{<t+n})$?

  28. Backpropagation Through Time. Intuitively, what's happening here?
      1. Measure the influence of the past on the future:
      $\frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial h_t} = \frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial g} \frac{\partial g}{\partial h_{t+n}} \frac{\partial h_{t+n}}{\partial h_{t+n-1}} \cdots \frac{\partial h_{t+1}}{\partial h_t}$
      2. How does a perturbation $\epsilon$ of $x_t$ at time $t$ affect $p(x_{t+n} \mid x_{<t+n})$?
      3. Change the parameters to maximize $p(x_{t+n} \mid x_{<t+n})$.

  29. Backpropagation Through Time. Intuitively, what's happening here?
      1. Measure the influence of the past on the future:
      $\frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial h_t} = \frac{\partial \log p(x_{t+n} \mid x_{<t+n})}{\partial g} \frac{\partial g}{\partial h_{t+n}} \frac{\partial h_{t+n}}{\partial h_{t+n-1}} \cdots \frac{\partial h_{t+1}}{\partial h_t}$
      2. With the naive transition function $f(h_{t-1}, x_{t-1}) = \tanh(W[x_{t-1}] + U h_{t-1} + b)$, we get
      $\frac{\partial J_{t+N}}{\partial h_t} = \frac{\partial J_{t+N}}{\partial g} \frac{\partial g}{\partial h_{t+N}} \prod_{n=1}^{N} U^\top \mathrm{diag}\!\left(\frac{\partial \tanh(a_{t+n})}{\partial a_{t+n}}\right)$
      The repeated product of $U^\top$ factors is problematic! [Bengio, IEEE 1994]

  30. Backpropagation Through Time. What happens? The gradient either vanishes or explodes.
      $\frac{\partial J_{t+N}}{\partial h_t} = \frac{\partial J_{t+N}}{\partial g} \frac{\partial g}{\partial h_{t+N}} \prod_{n=1}^{N} U^\top \mathrm{diag}\!\left(\frac{\partial \tanh(a_{t+n})}{\partial a_{t+n}}\right)$
      Let $e_{\max}$ be the largest eigenvalue of $U$. Since $\max_x \tanh'(x) = 1$:
      1. The gradient likely explodes if $e_{\max} \ge 1 / \max_x \tanh'(x) = 1$
      2. The gradient likely vanishes if $e_{\max} < 1 / \max_x \tanh'(x) = 1$
      [Bengio, Simard, Frasconi, TNN 1994; Hochreiter, Bengio, Frasconi, Schmidhuber, 2001]
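
A small numeric illustration of this effect (the random matrices and the rescaling to a chosen spectral radius are our own construction, not from the tutorial): the norm of the product of per-step Jacobians tends to shrink or grow exponentially depending on the largest eigenvalue of $U$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 32, 50

def jacobian_product_norm(spectral_radius):
    """Norm of prod_n U^T diag(tanh'(a_n)) after `steps` time steps."""
    U = rng.normal(size=(d, d))
    U *= spectral_radius / np.max(np.abs(np.linalg.eigvals(U)))  # rescale to the target e_max
    J = np.eye(d)
    for _ in range(steps):
        h = np.tanh(rng.normal(size=d))          # some hidden state
        J = U.T @ np.diag(1.0 - h ** 2) @ J      # one factor U^T diag(tanh'(a))
    return np.linalg.norm(J)

print(jacobian_product_norm(0.5))   # e_max well below 1: norm typically collapses towards 0
print(jacobian_product_norm(3.0))   # e_max well above 1: norm typically blows up
```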

  31. Backpropagation Through Time. Addressing the exploding gradient:
      "when gradients explode so does the curvature along v, leading to a wall in the error surface"
      Gradient clipping:
      1. Norm clipping: $\tilde{r} \leftarrow \frac{c}{\|r\|} r$ if $\|r\| \ge c$, and $\tilde{r} \leftarrow r$ otherwise
      2. Element-wise clipping: $r_i \leftarrow \min(c, |r_i|)\, \mathrm{sgn}(r_i)$ for all $i \in \{1, \ldots, \dim r\}$
      [Pascanu, Mikolov, Bengio, ICML 2013]
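
Both clipping rules in a short NumPy sketch (the function names are ours):

```python
import numpy as np

def clip_by_norm(r, c):
    """Norm clipping: rescale r to norm c if ||r|| >= c, otherwise leave it unchanged."""
    norm = np.linalg.norm(r)
    return (c / norm) * r if norm >= c else r

def clip_elementwise(r, c):
    """Element-wise clipping: r_i <- min(c, |r_i|) * sgn(r_i)."""
    return np.minimum(c, np.abs(r)) * np.sign(r)

g = np.array([3.0, -4.0])            # ||g|| = 5
print(clip_by_norm(g, 1.0))          # rescaled to norm 1: [0.6, -0.8]
print(clip_elementwise(g, 1.0))      # clipped per coordinate: [1.0, -1.0]
```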
