
Natural Language Processing - Anoop Sarkar



  1. SFU NatLangLab. Natural Language Processing. Anoop Sarkar, anoopsarkar.github.io/nlp-class. Simon Fraser University, October 25, 2018.

  2. Natural Language Processing. Anoop Sarkar, anoopsarkar.github.io/nlp-class, Simon Fraser University.

     The following slides are taken from Tomas Mikolov’s presentation at Google circa 2010.

     Part 1: Neural Language Models

  3. Model description - recurrent NNLM

     Figure: recurrent NNLM with layers $w(t)$, $s(t)$, $s(t-1)$, $y(t)$ and weight matrices $U$, $V$, $W$.

     - Input layer $w$ and output layer $y$ have the same dimensionality as the vocabulary (10K - 200K).
     - Hidden layer $s$ is orders of magnitude smaller (50 - 1000 neurons).
     - $U$ is the matrix of weights between the input and hidden layer; $V$ is the matrix of weights between the hidden and output layer.
     - Without the recurrent weights $W$, this model would be a bigram neural network language model.

  4. Model Description - Recurrent NNLM

     The output values from neurons in the hidden and output layers are computed as follows:

     $$s(t) = f\big(U w(t) + W s(t-1)\big) \tag{1}$$

     $$y(t) = g\big(V s(t)\big) \tag{2}$$

     where $f(z)$ and $g(z)$ are the sigmoid and softmax activation functions. The softmax function in the output layer is used to ensure that the outputs form a valid probability distribution, i.e. all outputs are greater than 0 and their sum is 1:

     $$f(z) = \frac{1}{1 + e^{-z}}, \qquad g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}} \tag{3}$$
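
As a minimal sketch of equations (1)-(3), assuming NumPy and toy dimensions, one forward step might look like the following. The names `rnn_step`, `vocab_size` and `hidden_size`, the random initialization, and the choice to store $U$ with one column per vocabulary word are assumptions of this sketch, not part of the slides. Because $w(t)$ is a 1-of-V vector, the product $U w(t)$ reduces to selecting one column of $U$.

```python
import numpy as np

def sigmoid(z):
    """Equation (3), left: element-wise logistic function f."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Equation (3), right: g. Shifting by max(z) leaves the result unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(U, W, V, word_id, s_prev):
    """One forward step: returns the new hidden state s(t) and output y(t)."""
    # U @ w(t) with a one-hot w(t) is simply column `word_id` of U.
    s = sigmoid(U[:, word_id] + W @ s_prev)   # equation (1)
    y = softmax(V @ s)                        # equation (2)
    return s, y

# Toy sizes (hypothetical; the slides quote 10K-200K words and 50-1000 neurons).
rng = np.random.default_rng(0)
vocab_size, hidden_size = 20, 5
U = rng.normal(0.0, 0.1, (hidden_size, vocab_size))
W = rng.normal(0.0, 0.1, (hidden_size, hidden_size))
V = rng.normal(0.0, 0.1, (vocab_size, hidden_size))

s, y = rnn_step(U, W, V, word_id=3, s_prev=np.zeros(hidden_size))
print(y.sum())  # the softmax outputs form a probability distribution
```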

  5. Training of RNNLM

     - The training is performed using Stochastic Gradient Descent (SGD).
     - We go through all the training data iteratively, and update the weight matrices $U$, $V$ and $W$ online (after processing every word).
     - Training is performed in several epochs (usually 5-10).

  6. Training of RNNLM

     The gradient of the error vector in the output layer, $e_o(t)$, is computed using a cross-entropy criterion:

     $$e_o(t) = d(t) - y(t) \tag{4}$$

     where $d(t)$ is a target vector that represents the word $w(t+1)$ (encoded as a 1-of-V vector).

  7. Training of RNNLM

     Weights $V$ between the hidden layer $s(t)$ and the output layer $y(t)$ are updated as

     $$V(t+1) = V(t) + s(t)\, e_o(t)^T \alpha \tag{5}$$

     where $\alpha$ is the learning rate.
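
The output-layer error (4) and the update (5) can be sketched as follows, assuming NumPy. Equation (5) writes the update as $s(t) e_o(t)^T$; this sketch stores $V$ with one row per vocabulary word so that $y = g(Vs)$ is a valid product, and therefore uses the transposed outer product. That storage layout, the function name `update_output_weights`, and the toy sizes are assumptions of the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def update_output_weights(V, s, y, target_id, alpha):
    """Return the updated V and the output error e_o(t) = d(t) - y(t)."""
    d = np.zeros_like(y)
    d[target_id] = 1.0                 # 1-of-V target for the next word
    e_o = d - y                        # equation (4)
    # Equation (5), arranged to match V's (vocab, hidden) layout.
    return V + alpha * np.outer(e_o, s), e_o

rng = np.random.default_rng(0)
vocab_size, hidden_size, target_id = 6, 4, 2
V = rng.normal(0.0, 0.1, (vocab_size, hidden_size))
s = rng.uniform(0.1, 0.9, hidden_size)      # stand-in hidden state
y = softmax(V @ s)

V_new, e_o = update_output_weights(V, s, y, target_id, alpha=0.1)
y_new = softmax(V_new @ s)
print(y[target_id], y_new[target_id])       # target probability before and after
```

After one update with the same hidden state, the probability assigned to the target word goes up, because the target's score increases while every other score decreases.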

  8. Training of RNNLM

     Next, gradients of errors are propagated from the output layer to the hidden layer:

     $$e_h(t) = d_h\big(e_o(t)^T V,\ t\big) \tag{6}$$

     where the error vector is obtained using the function $d_h()$ that is applied element-wise:

     $$d_{hj}(x, t) = x\, s_j(t)\,\big(1 - s_j(t)\big) \tag{7}$$
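
Equations (6) and (7) amount to a vector-matrix product followed by an element-wise multiplication with the sigmoid derivative $s_j(t)(1 - s_j(t))$. A small worked sketch, assuming NumPy and a $V$ stored with one row per output word (the function name `hidden_error` and the numbers are illustrative only):

```python
import numpy as np

def hidden_error(e_o, V, s):
    """Equations (6)-(7): propagate the output error back to the hidden layer."""
    x = e_o @ V                    # e_o(t)^T V, one value per hidden neuron
    return x * s * (1.0 - s)       # d_h applied element-wise

e_o = np.array([0.5, -0.5])        # output error for a 2-word toy vocabulary
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])         # shape (vocab, hidden)
s = np.array([0.5, 0.25])          # hidden activations s(t)

e_h = hidden_error(e_o, V, s)
print(e_h)
```

Here $e_o^T V = (-1, -1)$ and $s(1-s) = (0.25, 0.1875)$, so the hidden error is $(-0.25, -0.1875)$.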

  9. Training of RNNLM

     Weights $U$ between the input layer $w(t)$ and the hidden layer $s(t)$ are then updated as

     $$U(t+1) = U(t) + w(t)\, e_h(t)^T \alpha \tag{8}$$

     Note that only one neuron is active at a given time in the input vector $w(t)$. As can be seen from equation (8), the weight change for neurons with zero activation is zero, so the computation can be sped up by updating only the weights that correspond to the active input neuron.
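
The speed-up described here can be checked with a short sketch, assuming NumPy and a $U$ stored with one column per vocabulary word (so the outer product of equation (8) appears transposed). Both function names and the toy sizes are hypothetical. The dense version touches every weight; the sparse version touches only the column of the active word, and the two give the same result.

```python
import numpy as np

def update_input_weights_dense(U, w, e_h, alpha):
    """Equation (8) with an explicit 1-of-V input vector w(t)."""
    return U + alpha * np.outer(e_h, w)

def update_input_weights_sparse(U, word_id, e_h, alpha):
    """Same update, touching only the column of the active input neuron."""
    U_new = U.copy()
    U_new[:, word_id] += alpha * e_h
    return U_new

rng = np.random.default_rng(0)
vocab_size, hidden_size, word_id = 8, 3, 5
U = rng.normal(0.0, 0.1, (hidden_size, vocab_size))
e_h = rng.normal(0.0, 1.0, hidden_size)

w = np.zeros(vocab_size)
w[word_id] = 1.0                   # only one input neuron is active

dense = update_input_weights_dense(U, w, e_h, alpha=0.1)
sparse = update_input_weights_sparse(U, word_id, e_h, alpha=0.1)
print(np.allclose(dense, sparse))
```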

  10. Training of RNNLM - Backpropagation Through Time

      The recurrent weights $W$ are updated by unfolding them in time and training the network as a deep feedforward neural network. The process of propagating errors back through the recurrent weights is called Backpropagation Through Time (BPTT).

  11. Training of RNNLM - Backpropagation Through Time

      Figure: Recurrent neural network unfolded as a deep feedforward network, here for 3 time steps back in time. The unfolded copies take inputs $w(t)$, $w(t-1)$, $w(t-2)$ and hidden states $s(t)$ through $s(t-3)$, sharing the weights $U$ and $W$.

  12. Training of RNNLM - Backpropagation Through Time

      Error propagation is done recursively as follows (note that the algorithm requires the states of the hidden layer from the previous time steps to be stored):

      $$e_h(t-\tau-1) = d_h\big(e_h(t-\tau)^T W,\ t-\tau-1\big) \tag{9}$$

      The unfolding can be applied for as many time steps as there are training examples already seen. However, the error gradients quickly vanish as they are backpropagated in time (in rare cases the errors can explode), so several steps of unfolding are sufficient (this is sometimes referred to as truncated BPTT).
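
The recursion in equation (9) can be sketched as a short loop over stored hidden states, assuming NumPy. The function name `bptt_errors`, the list layout `states[k] = s(t-k)`, and the random toy values are assumptions of this sketch. Printing the error norms illustrates the vanishing effect for small recurrent weights, since each step multiplies by $W$ and by a sigmoid derivative that is at most 0.25.

```python
import numpy as np

def bptt_errors(e_h_t, W, states, steps):
    """Truncated BPTT: return [e_h(t), e_h(t-1), ..., e_h(t-steps)].

    states[k] holds the stored hidden state s(t-k).
    """
    errors = [e_h_t]
    for tau in range(steps):
        s_prev = states[tau + 1]               # s(t - tau - 1)
        x = errors[-1] @ W                     # e_h(t - tau)^T W
        errors.append(x * s_prev * (1.0 - s_prev))   # d_h, equation (9)
    return errors

rng = np.random.default_rng(0)
hidden_size, steps = 5, 4
W = rng.normal(0.0, 0.1, (hidden_size, hidden_size))
states = [rng.uniform(0.1, 0.9, hidden_size) for _ in range(steps + 1)]
e_h_t = rng.normal(0.0, 1.0, hidden_size)

errors = bptt_errors(e_h_t, W, states, steps)
norms = [float(np.linalg.norm(e)) for e in errors]
print(norms)   # the norm shrinks at every step back in time
```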

  13. Training of RNNLM - Backpropagation Through Time

      The recurrent weights $W$ are updated as

      $$W(t+1) = W(t) + \sum_{z=0}^{T} s(t-z-1)\, e_h(t-z)^T \alpha \tag{10}$$

      Note that the matrix $W$ is changed in a single update, not incrementally during the backpropagation of errors.

      It is more computationally efficient to unfold the network after processing several training examples, so that the training complexity does not increase linearly with the number of time steps $T$ for which the network is unfolded in time.
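
A minimal sketch of equation (10), assuming NumPy: the contributions from all unfolded steps are accumulated first and $W$ is changed once at the end. The function name `update_recurrent_weights` and the list layouts `errors[z] = e_h(t-z)` and `states[z] = s(t-z)` are assumptions. As with the other weight matrices, the outer product is arranged so that row $i$ of $W$ feeds hidden neuron $i$, matching $s(t) = f(Uw(t) + Ws(t-1))$.

```python
import numpy as np

def update_recurrent_weights(W, errors, states, alpha):
    """Equation (10): accumulate over unfolded steps, then apply one update.

    errors[z] = e_h(t-z) and states[z] = s(t-z); states needs one more
    entry than errors because step z pairs e_h(t-z) with s(t-z-1).
    """
    delta = np.zeros_like(W)
    for z, e_h in enumerate(errors):
        delta += np.outer(e_h, states[z + 1])
    return W + alpha * delta       # W itself is only changed here

W = np.zeros((2, 2))
errors = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
states = [np.array([0.9, 0.9]),    # s(t), not used by the update
          np.array([0.5, 0.25]),   # s(t-1)
          np.array([0.1, 0.2])]    # s(t-2)

W_new = update_recurrent_weights(W, errors, states, alpha=0.1)
print(W_new)
```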

  14. Training of RNNLM - Backpropagation Through Time

      Figure: Example of batch mode training. Red arrows indicate how the gradients are propagated through the unfolded recurrent neural network. Each unfolded copy has its own output $y(t)$, $y(t-1)$, $y(t-2)$, $y(t-3)$ connected through $V$.

  15. Extensions: Classes

      Computing the full probability distribution over all words in the vocabulary $V$ can be computationally very expensive, as $V$ can easily contain more than 100K words. We can instead do the following:

      - Assign each word from $V$ to a single class.
      - Compute the probability distribution over all classes.
      - Compute the probability distribution over the words that belong to the specific class.

      Assignment of words to classes can be trivial: we can use frequency binning.
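
The three steps above can be sketched as follows, assuming NumPy. The slides only say "frequency binning"; the particular scheme here (sort words by count and cut the cumulative frequency mass into equal slices) is an assumption, as are the names `frequency_bins`, `word_probability`, `V_class` and `V_word` and the toy counts. The word probability is factorized as the class probability times the within-class word probability.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def frequency_bins(counts, num_classes):
    """Assign each word id to one class by slicing cumulative frequency mass."""
    order = np.argsort(-counts)                    # most frequent first
    share = counts[order] / counts.sum()
    mass_before = np.cumsum(share) - share         # mass of more frequent words
    raw = np.minimum((mass_before * num_classes).astype(int), num_classes - 1)
    word_class = np.empty(len(counts), dtype=int)
    word_class[order] = raw
    # Renumber so that class ids are consecutive even if a slice stayed empty.
    _, word_class = np.unique(word_class, return_inverse=True)
    return word_class

def word_probability(s, V_class, V_word, word_class, word_id):
    """P(word | s) = P(class | s) * P(word | class, s)."""
    c = word_class[word_id]
    p_class = softmax(V_class @ s)[c]
    members = np.flatnonzero(word_class == c)      # sorted word ids in class c
    p_words = softmax(V_word[members] @ s)         # scores for this class only
    return p_class * p_words[np.searchsorted(members, word_id)]

counts = np.array([50, 30, 20, 10, 8, 6, 4, 3, 2, 1], dtype=float)
word_class = frequency_bins(counts, num_classes=3)

rng = np.random.default_rng(0)
hidden_size, vocab_size = 4, len(counts)
num_classes = int(word_class.max()) + 1
V_class = rng.normal(0.0, 0.1, (num_classes, hidden_size))
V_word = rng.normal(0.0, 0.1, (vocab_size, hidden_size))
s = rng.uniform(0.1, 0.9, hidden_size)

total = sum(word_probability(s, V_class, V_word, word_class, w)
            for w in range(vocab_size))
print(word_class, total)
```

For a given word, only the class scores and the scores of that word's class members are computed, rather than one score for every word in the vocabulary. The factorized probabilities still sum to 1 over the vocabulary.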

  16. Extensions: Classes

      Figure: Factorization of the output layer; $c(t)$ is the class layer.

  17. Extensions: Classes

      - By using simple classes, we can achieve speedups of more than 100 times on large data sets.
      - We lose a bit of model accuracy (usually 5-10% in perplexity).

  18. Empirical Results

      - Penn Treebank
        - Comparison of advanced language modeling techniques
        - Combination
      - Wall Street Journal
        - JHU setup
        - Kaldi setup
      - NIST RT04 Broadcast News speech recognition
      - Additional experiments: machine translation, text compression

  19. Penn Treebank

      We have used the Penn Treebank Corpus, with the same vocabulary and data division as other researchers:

      - Sections 0-20: training data, 930K tokens
      - Sections 21-22: validation data, 74K tokens
      - Sections 23-24: test data, 82K tokens
      - Vocabulary size: 10K

  20. Penn Treebank - Comparison

      Perplexity is reported for each model on its own (individual) and in combination with the KN5 and KN5+cache baselines. Entropy reduction is reported relative to the KN5 and KN5+cache baselines.

      | Model | Perplexity (individual) | Perplexity (+KN5) | Perplexity (+KN5+cache) | Entropy reduction over KN5 | Entropy reduction over KN5+cache |
      |---|---|---|---|---|---|
      | 3-gram, Good-Turing smoothing (GT3) | 165.2 | - | - | - | - |
      | 5-gram, Good-Turing smoothing (GT5) | 162.3 | - | - | - | - |
      | 3-gram, Kneser-Ney smoothing (KN3) | 148.3 | - | - | - | - |
      | 5-gram, Kneser-Ney smoothing (KN5) | 141.2 | - | - | - | - |
      | 5-gram, Kneser-Ney smoothing + cache | 125.7 | - | - | - | - |
      | PAQ8o10t | 131.1 | - | - | - | - |
      | Maximum entropy 5-gram model | 142.1 | 138.7 | 124.5 | 0.4% | 0.2% |
      | Random clusterings LM | 170.1 | 126.3 | 115.6 | 2.3% | 1.7% |
      | Random forest LM | 131.9 | 131.3 | 117.5 | 1.5% | 1.4% |
      | Structured LM | 146.1 | 125.5 | 114.4 | 2.4% | 1.9% |
      | Within and across sentence boundary LM | 116.6 | 110.0 | 108.7 | 5.0% | 3.0% |
      | Log-bilinear LM | 144.5 | 115.2 | 105.8 | 4.1% | 3.6% |
      | Feedforward neural network LM | 140.2 | 116.7 | 106.6 | 3.8% | 3.4% |
      | Syntactical neural network LM | 131.3 | 110.0 | 101.5 | 5.0% | 4.4% |
      | Recurrent neural network LM | 124.7 | 105.7 | 97.5 | 5.8% | 5.3% |
      | Dynamically evaluated RNNLM | 123.2 | 102.7 | 98.0 | 6.4% | 5.1% |
      | Combination of static RNNLMs | 102.1 | 95.5 | 89.4 | 7.9% | 7.0% |
      | Combination of dynamic RNNLMs | 101.0 | 92.9 | 90.0 | 8.5% | 6.9% |
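
The slide does not define how the entropy reduction columns are computed. One reading, offered here as an assumption, is that per-word entropy is the logarithm of perplexity, so the relative reduction is $1 - \ln(\mathrm{PPL}_{\text{model}}) / \ln(\mathrm{PPL}_{\text{baseline}})$ (the ratio does not depend on the base of the logarithm). The sketch below applies that reading to the recurrent neural network LM row, using the KN5 (141.2) and KN5+cache (125.7) perplexities as baselines; the function name `entropy_reduction` is hypothetical.

```python
import math

def entropy_reduction(ppl_baseline, ppl_model):
    """Relative reduction in per-word entropy, taking entropy = log(perplexity)."""
    return 1.0 - math.log(ppl_model) / math.log(ppl_baseline)

# Recurrent neural network LM combined with each baseline (values from the table).
rnn_vs_kn5 = entropy_reduction(141.2, 105.7)
rnn_vs_kn5_cache = entropy_reduction(125.7, 97.5)

print(f"{rnn_vs_kn5:.1%}  {rnn_vs_kn5_cache:.1%}")
```

Under this reading the computed values agree with the table's 5.8% and 5.3% entries for that row to within rounding.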

