Recurrent Neural Networks for Language Modeling CSE392 - Spring 2019 Special Topic in CS
Tasks
● Recurrent Neural Networks and Sequence Models: how? ≈ capture a hidden representation of sentences.
● Language Modeling: generate the next word, sentence.
Language Modeling
Task: Estimate P(w_n | w_1, w_2, ..., w_{n-1}): the probability of the next word given its history.
P(fork | He ate the cake with the) = ?
Language Modeling
Task: Estimate P(w_n | w_1, w_2, ..., w_{n-1}): the probability of the next word given its history.
P(fork | He ate the cake with the) = ?
A Training Corpus is used to train (fit, learn) a Language Model. Given the history (He, ate, the, cake, with, the), the Trained Language Model answers "What is the next word in the sequence?", scoring candidates such as: icing, the, fork, carrots, cheese, spoon.
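To make the estimation task concrete, here is a minimal count-based sketch (not the RNN approach the following slides build toward): it estimates P(next word | previous word) from bigram counts over a toy corpus. The corpus and resulting probabilities are made up for illustration.

from collections import Counter, defaultdict

# Toy training corpus (whitespace-tokenized)
corpus = "he ate the cake with the fork . she ate the cake with the spoon .".split()

# Count which words follow each history word (here, a one-word history)
nexts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    nexts[prev][cur] += 1

history = "the"
counts = nexts[history]
total = sum(counts.values())
for w, c in counts.most_common():
    print(w, c / total)   # e.g. P(cake | the), P(fork | the), P(spoon | the)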
Neural Networks: Graphs of Operations (excluding the optimization nodes)
y^(t) = f(h^(t) W)
h^(t) = g(h^(t-1) U + x^(t) V)   "hidden layer", with g an activation function
(Jurafsky, 2019)
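A minimal NumPy sketch of this recurrence, written with column vectors (so the weight matrices appear on the left); the dimensions and random weights are made up for illustration.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_in, d_h, d_out = 4, 3, 5
rng = np.random.default_rng(0)
U = rng.normal(size=(d_h, d_h))    # hidden-to-hidden weights
V = rng.normal(size=(d_h, d_in))   # input-to-hidden weights
W = rng.normal(size=(d_out, d_h))  # hidden-to-output weights

h = np.zeros(d_h)
for x_t in rng.normal(size=(6, d_in)):   # a toy input sequence
    h = np.tanh(U @ h + V @ x_t)         # h(t) = g(h(t-1)U + x(t)V), g = tanh
    y = softmax(W @ h)                   # y(t) = f(h(t)W), f = softmax
print(y)                                 # distribution over the output vocabulary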
Language Modeling
Task: Estimate P(w_n | w_1, w_2, ..., w_{n-1}). P(fork | He ate the cake with the) = ?
With an RNN language model, the input at each step is only the last word (the); h_t is a vector that we hope "stores" the relevant history from the previous inputs: He, ate, the, cake, with. The Trained Language Model again scores candidates such as: icing, the, fork, carrots, cheese, spoon.
Optimization: Backward Propagation
...
# define the forward-pass graph:
h[0] = 0
for i in range(1, len(x)):
    h[i] = tf.tanh(tf.matmul(U, h[i-1]) + tf.matmul(W, x[i]))   # update hidden state
    y_pred[i] = tf.nn.softmax(tf.matmul(V, h[i]))               # update output
...
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_pred)))       # y: true next words
Optimization: Backward Propagation
To find the gradient for the overall graph, we use backpropagation, which essentially chains together the gradients for each node (function) in the graph.
With many recursions, the gradients can vanish or explode (become too small or too large for floating-point operations).
(Same forward-pass graph and cross-entropy cost as the previous slide.)
Optimization: Backward Propagation
[figure: computation graph with a cost node] (Geron, 2017)
How to address exploding and vanishing gradients?
Ad hoc approaches: e.g., stop backpropagation iterations very early; "clip" gradients when they grow too large.
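A minimal sketch of the "clip gradients when too high" idea, using global-norm clipping; the clip threshold and toy gradients are arbitrary. (TensorFlow provides this as tf.clip_by_global_norm.)

import numpy as np

def clip_by_global_norm(grads, clip_norm=5.0):
    # Rescale all gradients together if their combined L2 norm exceeds clip_norm.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > clip_norm:
        grads = [g * (clip_norm / global_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # toy "exploding" gradients (norm 13)
print(clip_by_global_norm(grads, clip_norm=5.0))   # rescaled to norm 5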
How to address exploding and vanishing gradients?
Dominant approach: use Long Short-Term Memory (LSTM) networks.
[figure: RNN model, "unrolled" depiction] (Geron, 2017)
How to address exploding and vanishing gradients?
The LSTM Cell. [figure: RNN model, "unrolled" depiction] (Geron, 2017)
How to address exploding and vanishing gradients?
The LSTM Cell: "long-term state" and "short-term state".
[figure: RNN model, "unrolled" depiction] (Geron, 2017)
How to address exploding and vanishing gradients?
The LSTM Cell: "long-term state", "short-term state"; each gate computation includes a bias term.
Common Activation Functions (z = h^(t) W)
Logistic (sigmoid): σ(z) = 1 / (1 + e^(-z))
Hyperbolic tangent: tanh(z) = 2σ(2z) - 1 = (e^(2z) - 1) / (e^(2z) + 1)
Rectified linear unit (ReLU): ReLU(z) = max(0, z)
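The same three functions in NumPy, for reference; note that 2σ(2z) - 1 is exactly tanh(z), as in the formula above.

import numpy as np

def logistic(z):                 # sigma(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                     # tanh(z) = 2*sigma(2z) - 1
    return 2.0 * logistic(2.0 * z) - 1.0

def relu(z):                     # ReLU(z) = max(0, z)
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(logistic(z), tanh(z), relu(z))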
LSTM The LSTM Cell “long term state” “short term state”
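A minimal NumPy sketch of one LSTM cell step in the standard formulation (forget, input, and output gates); the weight shapes and random initialization are made up, and a real implementation would learn these parameters.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    # One LSTM cell step; each gate is a logistic function of [x, h] plus a bias term.
    Wf, Wi, Wo, Wg, bf, bi, bo, bg = params
    z = np.concatenate([x, h])
    f = sigmoid(Wf @ z + bf)     # forget gate: what to keep in the long-term state c
    i = sigmoid(Wi @ z + bi)     # input gate: what new content to write into c
    o = sigmoid(Wo @ z + bo)     # output gate: how much of c to expose as h
    g = np.tanh(Wg @ z + bg)     # candidate content
    c = f * c + i * g            # "long-term state"
    h = o * np.tanh(c)           # "short-term state"
    return h, c

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
params = [rng.normal(size=(d_h, d_in + d_h)) for _ in range(4)] + [np.zeros(d_h) for _ in range(4)]
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # a toy input sequence
    h, c = lstm_step(x_t, h, c, params)
print(h, c)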
Input to LSTM ?
Input to LSTM ? ● One-hot encoding? ● Word Embedding
Input to LSTM -0.5 3.5 3.21 -1.3 1.6
1.53 12 1.5 Input to LSTM 0.15 -3.2 2.3 1.1 10 -0.7 -2.0 -5.4 5.5 -0.3 -1.1 6.3 1.53 1.5 0.53 -3.2 2.5 2.3 3 10 -2.3 0.76 -0.5 3.5 3.21 -1.3 1.6
1.53 12 1.5 Input to LSTM 0.15 -3.2 2.3 1.1 10 -0.7 -2.0 -5.4 5.5 same -0.3 -1.1 6.3 1.53 1.5 0.53 -3.2 2.5 2.3 3 10 -2.3 0.76 -0.5 3.5 3.21 -1.3 1.6
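A small sketch of the two input options; the vocabulary, embedding size, and random embedding matrix are made up (in practice the embeddings are learned or pre-trained). It also shows the point of the figure: multiplying a one-hot vector by the embedding matrix is just a row lookup, so the same word always gets the same vector.

import numpy as np

vocab = ["he", "ate", "the", "cake", "with", "fork"]
word_to_id = {w: i for i, w in enumerate(vocab)}

d_emb = 5
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_emb))   # embedding matrix

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(one_hot("the") @ E)          # one-hot input times E ...
print(E[word_to_id["the"]])        # ... is the same as looking up the word's row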
The GRU Gated Recurrent Unit (Geron, 2017)
The GRU Gated Recurrent Unit update gate relevance gate (Geron, 2017)
The GRU: Gated Recurrent Unit
update gate; relevance gate; a candidate for updating h, sometimes called h~
(Geron, 2017)
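A minimal NumPy sketch of one GRU step, following the Cho et al. convention h(t) = (1 - z)·h(t-1) + z·h~ (some textbook presentations swap the roles of z and 1 - z); the shapes and random weights are made up.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, params):
    Wz, Wr, Wh, bz, br, bh = params
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh + bz)                                 # update gate
    r = sigmoid(Wr @ xh + br)                                 # relevance (reset) gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]) + bh)   # candidate h~
    return (1.0 - z) * h + z * h_tilde                        # mix old state and candidate

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
params = [rng.normal(size=(d_h, d_in + d_h)) for _ in range(3)] + [np.zeros(d_h) for _ in range(3)]
h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # a toy input sequence
    h = gru_step(x_t, h, params)
print(h)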
The GRU Gated Recurrent Unit The cake, which contained candles, was eaten.
What about the gradient?
The gates (i.e., multiplications based on a logistic) often end up keeping the hidden state exactly (or nearly exactly) as it was. Thus, for most dimensions of h, h^(t) ≈ h^(t-1).
This tends to keep the gradient from vanishing, since the same values will be present across multiple time steps in backpropagation through time. (The same idea applies to LSTMs but is easier to see here.)
The cake, which contained candles, was eaten.
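A toy numeric illustration of why this matters: backpropagation through T steps multiplies a per-step gradient factor T times, so factors well below 1 vanish while factors kept near 1 by the gates survive. The specific factors here are made up.

T = 50
print(0.5 ** T)    # ~8.9e-16: per-step factor 0.5, gradient effectively vanishes
print(0.99 ** T)   # ~0.61:    per-step factor kept near 1, gradient signal survives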
How to train an LSTM-style RNN
Cost function: "cross-entropy error"
    cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_pred)))
Optimization: stochastic gradient descent, a method that updates the weights from one (mini-)batch of examples at a time.
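A minimal, generic sketch of the training recipe (cross-entropy cost plus stochastic gradient descent), here for a plain softmax output layer on toy data rather than the full unrolled LSTM; all names and sizes are made up.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # stand-in for hidden states h
y = rng.integers(0, 4, size=20)         # stand-in for true next-word ids
W = np.zeros((4, 3))                    # output weights

lr = 0.1
for epoch in range(50):
    for i in rng.permutation(len(X)):                  # stochastic: one example at a time
        p = softmax(W @ X[i])                          # predicted distribution y_pred
        cost = -np.log(p[y[i]])                        # cross entropy: -sum(y * log(y_pred))
        grad = np.outer(p - np.eye(4)[y[i]], X[i])     # gradient of the cost w.r.t. W
        W -= lr * grad                                 # gradient descent step

print(np.mean([-np.log(softmax(W @ x)[t]) for x, t in zip(X, y)]))   # final mean cost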
RNN-Based Language Models: Take-Aways
● Simple RNNs are difficult to train: exploding and vanishing gradients.
● LSTM and GRU cells address this:
○ Hidden states passed from one time-step to the next allow for long-distance dependencies.
○ Gates are used to keep hidden states from changing rapidly (and thus keep gradients under control).
○ LSTM and GRU cells look complex, but are simply a series of functions:
■ logistic(w·x)
■ tanh(w·x)
■ element-wise multiplication and addition
○ To train: mini-batch stochastic gradient descent over a cross-entropy cost.