Recurrent Neural Networks for Language Modeling


  1. Recurrent Neural Networks for Language Modeling (CSE392, Spring 2019, Special Topic in CS)

  2. Tasks ● Recurrent Neural Networks and Sequence Models ≈ capture a hidden representation of sentences. ● Language Modeling: generate the next word or sentence. How?

  3. Language Modeling. Task: Estimate P(w_n | w_1, w_2, …, w_{n-1}), the probability of the next word given the history. P(fork | He ate the cake with the) = ?
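To make "estimate P(w_n | w_1, …, w_{n-1})" concrete before any neural machinery, here is a toy, purely illustrative count-based estimate over a made-up two-sentence corpus (our example, not the slides' approach; the rest of the deck builds a neural estimate instead):

    from collections import Counter, defaultdict

    # Toy corpus and a 2-word-history count model (illustration only).
    corpus = [["he", "ate", "the", "cake", "with", "the", "fork"],
              ["she", "ate", "the", "cake", "with", "the", "spoon"]]

    counts = defaultdict(Counter)
    for sentence in corpus:
        for i in range(1, len(sentence)):
            history = tuple(sentence[max(0, i - 2):i])   # keep only the last 2 words
            counts[history][sentence[i]] += 1

    def p_next(word, history):
        """Estimate P(word | history) by the relative frequency of what followed that history."""
        history = tuple(history[-2:])
        total = sum(counts[history].values())
        return counts[history][word] / total if total else 0.0

    print(p_next("fork", ["with", "the"]))   # 0.5 on this toy corpus (fork vs. spoon)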

  4. Language Modeling. Task: Estimate P(w_n | w_1, w_2, …, w_{n-1}), the probability of the next word given the history. P(fork | He ate the cake with the) = ? [Diagram: a Training Corpus is used for training (fit, learn) a Language Model; the trained Language Model is asked "What is the next word in the sequence?" given the History (He, ate, the, cake, with, the); candidate answers include icing, the, fork, carrots, cheese, spoon.]

  5. Neural Networks: Graphs of Operations (excluding the optimization nodes). Output: y(t) = f(h(t) W), where f is an activation function. Hidden layer: h(t) = g(h(t-1) U + x(t) V). (Jurafsky, 2019)
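To make these two equations concrete, below is a minimal NumPy sketch of the same forward computation, treating h and x as row vectors, taking g = tanh and f = softmax, and assuming shapes U: (hidden, hidden), V: (input, hidden), W: (hidden, vocab); the function and variable names here are ours:

    import numpy as np

    def rnn_forward(x_seq, U, V, W):
        """Simple RNN: h(t) = tanh(h(t-1) U + x(t) V);  y(t) = softmax(h(t) W)."""
        h = np.zeros(U.shape[0])                 # h(0) = 0
        outputs = []
        for x_t in x_seq:                        # x_seq: sequence of input vectors x(t)
            h = np.tanh(h @ U + x_t @ V)         # update the hidden state
            scores = h @ W                       # unnormalized output scores
            e = np.exp(scores - scores.max())    # numerically stable softmax
            outputs.append(e / e.sum())          # y(t): distribution over next words
        return outputs, h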

  6. Language Modeling. Task: Estimate P(w_n | w_1, w_2, …, w_{n-1}), the probability of the next word given the history. P(fork | He ate the cake with the) = ? [Diagram, repeated: Training Corpus → training (fit, learn) → trained Language Model, which answers "What is the next word in the sequence?" given the History (He, ate, the, cake, with, the); candidates: icing, the, fork, carrots, cheese, spoon.]

  7. Language Modeling. Task: Estimate P(w_n | w_1, w_2, …, w_{n-1}), the probability of the next word given the history. P(fork | He ate the cake with the) = ? [Diagram, continued: the input label "History" is being replaced by "Last word"; otherwise as before, with candidates icing, the, fork, carrots, cheese, spoon.]

  8. Language Modeling. Task: Estimate P(w_n | w_1, w_2, …, w_{n-1}), the probability of the next word given the history. P(fork | He ate the cake with the) = ? [Diagram, continued: the trained Language Model now takes only the last word ("the") as input, together with h_t, a vector that we hope "stores" relevant history from the previous inputs (He, ate, the, cake, with); candidates: icing, the, fork, carrots, cheese, spoon.]

  9. Optimization: Backward Propagation (computing the cost)

    import tensorflow as tf
    ...
    # define forward pass graph:
    h[0] = 0
    for i in range(1, len(x)):
        h[i] = tf.tanh(tf.matmul(U, h[i-1]) + tf.matmul(W, x[i]))   # update hidden state
        y[i] = tf.nn.softmax(tf.matmul(V, h[i]))                    # update output
    ...
    cost = tf.reduce_mean(-tf.reduce_sum(y * tf.math.log(y_pred)))  # cross-entropy cost

  10. Optimization: Backward Propagation. To find the gradient for the overall graph, we use backpropagation, which essentially chains together the gradients for each node (function) in the graph. With many recursions, the gradients can vanish or explode (become too small or too large for floating-point operations).

    # define forward pass graph:
    h[0] = 0
    for i in range(1, len(x)):
        h[i] = tf.tanh(tf.matmul(U, h[i-1]) + tf.matmul(W, x[i]))   # update hidden state
        y[i] = tf.nn.softmax(tf.matmul(V, h[i]))                    # update output
    ...
    cost = tf.reduce_mean(-tf.reduce_sum(y * tf.math.log(y_pred)))  # cross-entropy cost
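As a purely illustrative aside (ours, not from the slides), the exponential vanishing/exploding behavior can be seen by repeatedly multiplying a gradient vector by the same recurrent weight matrix, which is roughly what backpropagation through time does across many steps:

    import numpy as np

    g = np.ones(8)                      # stand-in for a gradient flowing backward in time
    U_small = 0.5 * np.eye(8)           # recurrent weights with largest singular value < 1
    U_large = 1.5 * np.eye(8)           # recurrent weights with largest singular value > 1
    for name, U in [("vanishing", U_small), ("exploding", U_large)]:
        gt = g.copy()
        for _ in range(50):             # 50 time steps of backpropagation through time
            gt = U.T @ gt               # chain in one more step's factor
        print(name, np.linalg.norm(gt)) # tiny (~1e-15) vs. enormous (~1e9) norm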

  11. Optimization: Backward Propagation. [Figure: the computation graph with its cost node (Geron, 2017)]

  12. How to address exploding and vanishing gradients? Ad hoc approaches: e.g., stop backpropagation iterations very early; "clip" gradients when they grow too large.
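A minimal sketch of the "clip" idea, by global norm (an assumed illustration, not the course's code; TensorFlow offers tf.clip_by_global_norm for the same purpose):

    import numpy as np

    def clip_by_global_norm(grads, max_norm=5.0):
        """Rescale a list of gradient arrays so their combined norm is at most max_norm."""
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            grads = [g * (max_norm / total_norm) for g in grads]
        return grads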

  13. How to address exploding and vanishing gradients? Dominant approach: use Long Short-Term Memory (LSTM) networks. [Figure: RNN model, "unrolled" depiction (Geron, 2017)]

  14. How to address exploding and vanishing gradients? The LSTM Cell. [Figure: (Geron, 2017)]

  15. How to address exploding and vanishing gradients? The LSTM Cell, with a "long term state" and a "short term state". [Figure: unrolled depiction (Geron, 2017)]

  16. How to address exploding and vanishing gradients? The LSTM Cell, with a "long term state" and a "short term state". [Figure: unrolled depiction (Geron, 2017)]

  17. How to address exploding and vanishing gradients? The LSTM Cell: "long term state" and "short term state". [Figure]

  18. How to address exploding and vanishing gradients? The LSTM Cell: "long term state", "short term state", and bias term. [Figure]
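The cell diagrams referenced above are not reproduced here; as a rough stand-in, below is a minimal NumPy sketch of one step of a standard LSTM cell (an assumed textbook formulation with forget, input, and output gates; parameter names are ours, not from the slides). The long-term state is c and the short-term state is h:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, p):
        """One LSTM step; p holds weight matrices W*, U* and bias vectors b*."""
        f = sigmoid(x @ p["Wf"] + h_prev @ p["Uf"] + p["bf"])   # forget gate
        i = sigmoid(x @ p["Wi"] + h_prev @ p["Ui"] + p["bi"])   # input gate
        o = sigmoid(x @ p["Wo"] + h_prev @ p["Uo"] + p["bo"])   # output gate
        g = np.tanh(x @ p["Wg"] + h_prev @ p["Ug"] + p["bg"])   # candidate values
        c = f * c_prev + i * g        # long-term state: keep some old, add some new
        h = o * np.tanh(c)            # short-term state (also the cell's output)
        return h, c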

  19. Common Activation Functions, for z = h(t) W. Logistic: σ(z) = 1 / (1 + e^(-z)). Hyperbolic tangent: tanh(z) = 2σ(2z) - 1 = (e^(2z) - 1) / (e^(2z) + 1). Rectified linear unit (ReLU): ReLU(z) = max(0, z).
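The same three functions in NumPy, for reference (a small sketch; np.tanh is used directly, and the identity from the slide can be checked numerically):

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))          # sigma(z) = 1 / (1 + e^-z)

    def relu(z):
        return np.maximum(0.0, z)                # max(0, z), element-wise

    z = np.linspace(-3, 3, 7)
    print(np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1))   # True: tanh(z) = 2*sigma(2z) - 1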

  20. LSTM: The LSTM Cell, "long term state" and "short term state". [Figure]

  21. LSTM: The LSTM Cell, "long term state" and "short term state". [Figure]

  22. LSTM: The LSTM Cell, "long term state" and "short term state". [Figure]

  23. Input to LSTM ?

  24. Input to LSTM ? ● One-hot encoding? ● Word Embedding

  25. Input to LSTM. [Figure: an example word-embedding vector, e.g. (-0.5, 3.5, 3.21, -1.3, 1.6)]

  26. Input to LSTM. [Figure: several word-embedding vectors shown as columns of numbers]

  27. Input to LSTM. [Figure: the embedding vectors again, with repeated occurrences of a word marked "same"]
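To contrast the two input options from slide 24 concretely, here is a small illustrative sketch (the toy vocabulary and dimensions are ours) of a one-hot encoding versus an embedding lookup; presumably the "same" label in the figure indicates that the same word always maps to the same embedding vector:

    import numpy as np

    vocab = ["he", "ate", "the", "cake", "with", "fork"]      # toy vocabulary
    word_to_idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """Sparse |V|-dimensional input: a single 1 at the word's index."""
        v = np.zeros(len(vocab))
        v[word_to_idx[word]] = 1.0
        return v

    d = 5                                                     # embedding dimension
    E = np.random.default_rng(0).normal(size=(len(vocab), d)) # one learned row per word

    def embed(word):
        """Dense embedding lookup; equivalent to one_hot(word) @ E but cheaper."""
        return E[word_to_idx[word]]

    print(one_hot("cake"))   # [0. 0. 0. 1. 0. 0.]
    print(embed("cake"))     # the same 5-dimensional vector every time "cake" appears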

  28. The GRU Gated Recurrent Unit (Geron, 2017)

  29. The GRU (Gated Recurrent Unit), with an update gate and a relevance gate. [Figure: (Geron, 2017)]

  30. The GRU (Gated Recurrent Unit): the update gate and relevance gate control a candidate for updating h, sometimes called h~. [Figure: (Geron, 2017)]
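Below is a minimal NumPy sketch of one GRU step with an update gate z and a relevance (reset) gate r (parameter names are ours; note that which term the update gate multiplies differs between presentations, so this may be mirrored relative to Geron's figure):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x, h_prev, p):
        """One GRU step; p holds weight matrices W*, U* and bias vectors b*."""
        z = sigmoid(x @ p["Wz"] + h_prev @ p["Uz"] + p["bz"])              # update gate
        r = sigmoid(x @ p["Wr"] + h_prev @ p["Ur"] + p["br"])              # relevance gate
        h_tilde = np.tanh(x @ p["Wh"] + (r * h_prev) @ p["Uh"] + p["bh"])  # candidate h~
        return (1.0 - z) * h_prev + z * h_tilde   # mix old state and candidate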

  31. The GRU Gated Recurrent Unit The cake, which contained candles, was eaten.

  32. What about the gradient? The gates (i.e., multiplications based on a logistic) often end up keeping the hidden state exactly (or nearly exactly) as it was. Thus, for most dimensions of h, h(t) ≈ h(t-1). ("The cake, which contained candles, was eaten.")

  33. What about the gradient? The gates (i.e., multiplications based on a logistic) often end up keeping the hidden state exactly (or nearly exactly) as it was. Thus, for most dimensions of h, h(t) ≈ h(t-1). This tends to keep the gradient from vanishing, since the same values persist across multiple time steps of backpropagation through time. (The same idea applies to LSTMs but is easier to see here.) ("The cake, which contained candles, was eaten.")
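To spell this argument out slightly (our gloss, not from the slides), write the GRU update as h(t) = (1 - z(t)) ⊙ h(t-1) + z(t) ⊙ h~(t), one common convention. Ignoring how z(t) and h~(t) themselves depend on h(t-1), the per-step Jacobian is approximately

    ∂h(t) / ∂h(t-1) ≈ diag(1 - z(t))

so for dimensions where the gate stays nearly closed (z(t) ≈ 0), each per-step factor is close to 1, and multiplying many such factors during backpropagation through time neither drives the gradient toward zero nor blows it up.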

  34. How to train an LSTM-style RNN. Cost function ("cross-entropy error"): cost = tf.reduce_mean(-tf.reduce_sum(y * tf.math.log(y_pred)))

  35. How to train an LSTM-style RNN. Cost function ("cross-entropy error"): cost = tf.reduce_mean(-tf.reduce_sum(y * tf.math.log(y_pred)))

  36. How to train an LSTM-style RNN. Cost function ("cross-entropy error"): cost = tf.reduce_mean(-tf.reduce_sum(y * tf.math.log(y_pred))). Optimization method: stochastic gradient descent.
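Putting the pieces together, here is a rough sketch (ours; the shapes and hyperparameters are made up, and this is not the course's code) of training an LSTM language model with a cross-entropy cost and mini-batch SGD using the Keras API:

    import tensorflow as tf

    vocab_size, embed_dim, hidden_dim = 10000, 64, 128             # assumed sizes

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim),          # word embeddings in
        tf.keras.layers.LSTM(hidden_dim, return_sequences=True),   # LSTM hidden states
        tf.keras.layers.Dense(vocab_size, activation="softmax"),   # P(next word) out
    ])

    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),      # mini-batch SGD
        loss="sparse_categorical_crossentropy",                    # cross-entropy cost
    )

    # x: (batch, seq_len) word indices; y: the same sequences shifted left by one word.
    # model.fit(x, y, batch_size=32, epochs=5)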

  37. RNN-Based Language Models: Take-Aways
  ● Simple RNNs are difficult to train: exploding and vanishing gradients.
  ● LSTM and GRU cells address this:
    ○ Hidden states passed from one time-step to the next allow for long-distance dependencies.
    ○ Gates are used to keep hidden states from changing rapidly (and thus keep gradients under control).
    ○ LSTM and GRU are complex, but simply a series of functions:
      ■ logistic(w·x)
      ■ tanh(w·x)
      ■ element-wise multiplication and addition
    ○ To train: mini-batch stochastic gradient descent over a cross-entropy cost.

  38. [Figure: an example word-embedding vector (0.53, 1.5, 3.21, -2.3, 0.76)]
