Introduction to Neural Machine Translation
Gongbo Tang
16 September 2019
Outline
1. Why Neural Machine Translation?
2. Introduction to Neural Networks
3. Neural Language Models
A Review of SMT
[Figure: An overview of SMT - language model, reordering model, translation model, training, plus morphology-related components (syntactic trees, factored SMT, compound splitting, pre-reordering)]
The problems of SMT
- Many different sub-models
- More and more complicated
- Performance bottleneck
- Limited context window size
The Background of Neural Networks
Why now?
- More data
- More powerful machines (GPUs)
- Advanced neural networks and algorithms
Using neural networks to improve SMT
- Replace the word alignment model
- Replace the translation model using word embeddings
- Replace n-gram language models with neural language models
- Replace the reordering model
Pure Neural Machine Translation Models
[Figure: input text → Encoder → intermediate vector representation (−0.2, −0.1, 0.1, 0.4, −0.3, 1.1 in the figure) → Decoder → translated text. From Luong et al. ACL 2016 NMT tutorial]
- One single model, end-to-end
- Considers the entire sentence, rather than a local context
- Smaller model size
SMT vs. NMT
NMT models have replaced SMT models in many online translation engines (Google, Baidu, Bing, Sogou, ...).
[Figure: "Comparison of machine translation methods" (机器翻译方法对比)]
"Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions" (Junczys-Dowmunt et al., 2016)
SMT vs. NMT
[Figure from Tie-Yan Liu's NMT slides]
Neural Networks
What is a neural network?
- It is built from simpler units (neurons, nodes, ...)
- It maps input vectors (matrices) to output vectors (matrices)
- Each neuron has a non-linear activation function
- Each activation function can be viewed as a feature detector
- Non-linear functions are expressive
Neural Networks
Typical activation functions in neural networks
- Hyperbolic tangent: tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x)); output ranges from −1 to +1
- Logistic function: sigmoid(x) = 1 / (1 + e^(−x)); output ranges from 0 to +1
- Rectified linear unit: relu(x) = max(0, x); output ranges from 0 to ∞
[Figure 13.3: Typical activation functions in neural networks. From Philipp Koehn's NMT chapter]
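A minimal NumPy sketch of these three activation functions (the input values below are arbitrary, only for illustration):

    import numpy as np

    def tanh(x):
        # (e^x - e^-x) / (e^x + e^-x); np.tanh computes exactly this
        return np.tanh(x)

    def sigmoid(x):
        # 1 / (1 + e^-x), output in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        # max(0, x), output in [0, inf)
        return np.maximum(0.0, x)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(tanh(x))     # values between -1 and +1
    print(sigmoid(x))  # values between 0 and +1
    print(relu(x))     # values between 0 and inf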
A Simple Neural Network Classifier
[Figure: inputs x_1, x_2, x_3, ..., x_n feed into a single unit computing y = g(w · x + b); the class is decided by the output (y > 0 vs. y <= 0)]
- x is a vector input, y is a scalar output
- w and b are the parameters (b is a bias term)
- g is a non-linear activation function
Example from Rico Sennrich's EACL 2017 NMT talk
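A small sketch of this classifier in NumPy, using the sigmoid from the previous slide; the weights and the input below are made-up numbers, only to show the shapes involved:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # made-up parameters: w is a weight vector, b a scalar bias
    w = np.array([0.5, -1.0, 0.25])
    b = 0.1

    x = np.array([1.0, 0.0, 2.0])    # one input vector
    y = sigmoid(np.dot(w, x) + b)    # scalar output y = g(w . x + b)
    print(y)                         # with sigmoid, y > 0.5 corresponds to w . x + b > 0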
Neural Networks
[Figure 13.2: A neural network with a hidden layer.]
Backpropagation Algorithm
Training neural networks
We use the backpropagation (BP) algorithm to update the network weights so as to minimize the loss.
- Step 1: forward pass (computation)
- Step 2: calculate the total error
- Step 3: backward pass (use the gradients to update the weights)
- Repeat steps 1-3 until convergence
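A toy illustration of these three steps, training the single sigmoid unit from the earlier slide with a squared-error loss; the data and the learning rate are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # toy data: 4 examples with 2 features, binary targets (made up)
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    t = np.array([0.0, 1.0, 1.0, 1.0])

    w = np.zeros(2)
    b = 0.0
    lr = 0.5

    for step in range(1000):
        # step 1: forward pass
        y = sigmoid(X @ w + b)
        # step 2: total error (mean squared error)
        loss = 0.5 * np.mean((y - t) ** 2)
        # step 3: backward pass - gradient of the loss w.r.t. w and b
        dz = (y - t) * y * (1.0 - y) / len(t)
        w -= lr * (X.T @ dz)
        b -= lr * dz.sum()

    print(loss, w, b)   # loss shrinks as the weights are updated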
Backpropagation Algorithm
[Figures (two slides) from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/]
Neural Networks
Training progress over time
[Figure: training error and validation error plotted against training progress; the point of minimum validation error is marked]
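This validation minimum is what early stopping (introduced a few slides later) exploits. A self-contained sketch of the stopping rule; the error curves here are purely synthetic stand-ins for the numbers a real model and development set would give:

    import numpy as np

    rng = np.random.default_rng(0)

    def epoch_errors(epoch):
        # synthetic curves: training error keeps decreasing,
        # validation error has a minimum near epoch 6 and then rises
        train = 1.0 / (epoch + 1)
        valid = 0.3 + 0.01 * (epoch - 6) ** 2 + rng.normal(0, 0.01)
        return train, valid

    best_val, best_epoch = float("inf"), -1
    patience, bad = 3, 0

    for epoch in range(50):
        train_err, val_err = epoch_errors(epoch)
        if val_err < best_val:
            best_val, best_epoch, bad = val_err, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break   # validation error stopped improving: stop training

    print(best_epoch, best_val)   # best epoch is near the validation minimum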
Neural Networks
Learning rate
[Figure: three error(λ) curves illustrating a too-high learning rate, a bad initialization, and a local optimum vs. the global optimum]
More advanced optimization methods (with adaptive learning rates): Adagrad, Adadelta, Adam.
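A minimal sketch of one such adaptive-learning-rate update, Adagrad, in NumPy; the loss here is a toy quadratic, chosen only to have something to optimize:

    import numpy as np

    def grad(w):
        # gradient of a toy quadratic loss 0.5 * ||w - target||^2
        target = np.array([3.0, -1.0])
        return w - target

    w = np.zeros(2)
    cache = np.zeros(2)          # running sum of squared gradients
    lr, eps = 0.5, 1e-8

    for step in range(200):
        g = grad(w)
        cache += g ** 2                        # accumulate squared gradients
        w -= lr * g / (np.sqrt(cache) + eps)   # per-parameter adapted step size

    print(w)   # moves toward the target [3, -1]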
Neural Networks
Dropout
- It can help avoid local optima.
- It reduces overfitting and makes the model more robust.
[Figure: (a) standard neural net, (b) after applying dropout. From "Dropout: A Simple Way to Prevent Neural Networks from Overfitting"]
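A sketch of (inverted) dropout applied to one layer's activations in NumPy; the dropout rate and the activation values are arbitrary:

    import numpy as np

    rng = np.random.default_rng(42)

    def dropout(h, p_drop=0.5, training=True):
        # randomly zero out units during training; rescale the survivors
        # so the expected activation stays the same (inverted dropout)
        if not training:
            return h
        mask = rng.random(h.shape) >= p_drop
        return h * mask / (1.0 - p_drop)

    h = np.array([0.2, 1.5, -0.7, 0.9, 0.0, 2.1])
    print(dropout(h))                    # some units zeroed, the rest scaled up
    print(dropout(h, training=False))    # unchanged at test time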
Neural Networks
Mini-batch training
- Online learning: update the model with each training example.
- Mini-batch training: update the weights over batches of examples (processed in parallel), which can speed up training.
1. Padding and masking: suitable for GPUs, but wasteful
   - Wasted computation on the padded positions
   [Figure: sentences of different lengths padded with 0's up to the longest one]
2. Smarter padding and masking: minimize the waste
   - Ensure that the length differences within a batch are minimal.
   - Sort the sentences by length and build each mini-batch sequentially.
   [Figure: length-sorted sentences need much less padding]
Figure from Luong et al. ACL 2016 NMT tutorial
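A small sketch of the second strategy: sort the sentences by length, build mini-batches sequentially, and pad each batch only up to its own longest sentence. The token IDs are made up, and 0 is assumed to be the padding ID:

    def make_batches(sentences, batch_size, pad_id=0):
        # sort by length so each batch contains similarly long sentences
        ordered = sorted(sentences, key=len)
        batches = []
        for i in range(0, len(ordered), batch_size):
            batch = ordered[i:i + batch_size]
            max_len = max(len(s) for s in batch)
            padded = [s + [pad_id] * (max_len - len(s)) for s in batch]
            mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in batch]
            batches.append((padded, mask))
        return batches

    # made-up sentences as lists of token IDs
    sents = [[5, 8, 2], [7, 1], [4, 4, 9, 3, 6], [2, 2, 2, 2], [9], [3, 5, 7, 1]]
    for padded, mask in make_batches(sents, batch_size=2):
        print(padded, mask)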
Neural Networks
Layer normalization
- Very large or very small values at a layer may cause exploding or vanishing gradients
- Normalize the values on a per-layer basis to mitigate this
Early stopping
- Stop training when we get the best result on the development set.
Ensembling
- Combine multiple models together.
Random seed
- Used for reproducibility. Different seeds lead to different results.
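A minimal NumPy sketch of layer normalization over one layer's activations; the gain/bias parameters and the input values are arbitrary:

    import numpy as np

    def layer_norm(x, gain, bias, eps=1e-6):
        # normalize across the features of a single layer,
        # then rescale and shift with learned parameters
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return gain * (x - mean) / (std + eps) + bias

    x = np.array([[100.0, 0.01, -50.0, 3.0]])   # very large and very small values
    gain = np.ones(4)
    bias = np.zeros(4)
    print(layer_norm(x, gain, bias))   # roughly zero mean, unit variance per layer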
Neural Networks
Some practical concepts
- Tensors: scalars, vectors, and matrices
- Epoch: one pass of parameter updates over the whole training set
- Batch size: the number of sentence pairs in a batch
- Step: one parameter update over a single batch
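How these concepts relate, as a schematic loop; the dataset size and batch size are arbitrary numbers, and the actual parameter update is omitted:

    # assume 10,000 sentence pairs and a batch size of 64
    n_examples = 10_000
    batch_size = 64
    steps_per_epoch = (n_examples + batch_size - 1) // batch_size   # = 157

    n_epochs = 3
    step = 0
    for epoch in range(n_epochs):          # one epoch = one pass over the training set
        for b in range(steps_per_epoch):   # one step = one update on one batch
            step += 1
    print(steps_per_epoch, step)           # 157 updates per epoch, 471 in total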
Computation Graph
[Figure 13.2: A neural network with a hidden layer.]
- The descriptive language of deep learning models
- Uses simple functions to form complex models
- A functional description of the required computation
Computation Graph
h = sigmoid(W1 x + b1)
y = sigmoid(W2 h + b2)
[Figure 13.8: Two-layer feed-forward neural network as a computation graph, consisting of the input value x, weight parameters W1, W2, b1, b2, and computation nodes (product, sum, sigmoid). To the right of each parameter node, its value is shown. To the left of input and computation nodes, we show how the input (1, 0)^T is processed by the graph. Values in the figure: W1 = [[3, 4], [2, 3]], b1 = (−2, −4)^T, W2 = (5, −5), b2 = −2; for x = (1, 0)^T this gives h = (0.731, 0.119)^T and y = 0.743.]
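A NumPy sketch of this computation graph's forward pass, plugging in the values shown in Figure 13.8; the results match the figure up to rounding:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    W1 = np.array([[3.0, 4.0],
                   [2.0, 3.0]])
    b1 = np.array([-2.0, -4.0])
    W2 = np.array([5.0, -5.0])
    b2 = -2.0

    x = np.array([1.0, 0.0])      # input (1, 0)^T
    h = sigmoid(W1 @ x + b1)      # product, sum, sigmoid -> [0.731, 0.119]
    y = sigmoid(W2 @ h + b2)      # product, sum, sigmoid -> 0.743
    print(h, y)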