Machine Learning for NLP: Sequential NN models. Aurélie Herbelot, 2019. Centre for Mind/Brain Sciences, University of Trento. 1
The unreasonable effectiveness... 2
Karpathy (2015) • In 2015, Andrej Karpathy wrote a blog post which became famous: The Unreasonable Effectiveness of Recurrent Neural Networks. • It shows how a simple model can be unbelievably effective. https://karpathy.github.io/2015/05/21/rnn-effectiveness/ 3
Recurrence • Feedforward NNs which take a vector as input and produce a vector as output are limited. • By putting recurrence into our model, we can process sequences of vectors at each layer of the network. 4
Architectures What might these architectures be used for? https://karpathy.github.io/2015/05/21/rnn-effectiveness/ 5
Is this a recurrent architecture? https://github.com/avisingh599/visual-qa 6
Is this a recurrent architecture? https://github.com/avisingh599/visual-qa 7
Is this a recurrent architecture? Venugopalan et al (2016) 8
Reminder: language modeling A language model (LM) is a model that computes the probability of a sequence of words, given some previously observed data. LMs are used widely, for instance in predictive text on your smartphone: Today, I am in (bed|heaven|Rovereto|Ulaanbaatar). 9
The Markov assumption • Let’s assume the following sentence: I am in Rovereto. • We are going to use the chain rule for calculating its probability: P(A_n, ..., A_1) = P(A_n | A_{n-1}, ..., A_1) · P(A_{n-1}, ..., A_1) • For our example: P(I, am, in, Rovereto) = P(Rovereto | in, am, I) · P(in | am, I) · P(am | I) · P(I) 10
The Markov assumption • The problem is, we cannot easily estimate the probability of a word in a long sequence. • There are too many possible sequences that are not observable in our data or have very low frequency: P(Rovereto | in, am, I, today, but, yesterday, there, ...) • So we make a simplifying Markov assumption: P(Rovereto | in, am, I) ≈ P(Rovereto | in) (bigram) or P(Rovereto | in, am, I) ≈ P(Rovereto | in, am) (trigram) 11
The Markov assumption • Coming back to our example: P(I, am, in, Rovereto) = P(Rovereto | in, am, I) · P(in | am, I) · P(am | I) · P(I) • A bigram model simplifies this to: P(I, am, in, Rovereto) ≈ P(Rovereto | in) · P(in | am) · P(am | I) · P(I) • That is, we are not taking long-distance dependencies in language into account. • There is a trade-off between the accuracy of the model and its trainability. 12
LMs as generative models • On your smartphone, the LM does not just calculate a sentence probability: it suggests the next word as you write. • Given the sequence I am in, the LM can calculate P(w | in, am, I) for each word w in the vocabulary. • The word with the highest probability is returned. 13
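A minimal sketch of these two uses of an n-gram LM. All counts and the tiny vocabulary are made up for illustration: the sketch estimates bigram probabilities from counts, scores the example sentence under the bigram factorisation, and suggests the most probable next word, as a predictive keyboard would.

```python
from collections import Counter

# Made-up bigram and unigram counts for a toy corpus.
bigram_counts = Counter({('I', 'am'): 50, ('am', 'in'): 20,
                         ('in', 'bed'): 12, ('in', 'Rovereto'): 3,
                         ('in', 'heaven'): 2})
unigram_counts = Counter({'I': 80, 'am': 60, 'in': 40,
                          'bed': 15, 'Rovereto': 3, 'heaven': 2})

def p_bigram(w2, w1):
    """P(w2 | w1) estimated as count(w1, w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

# P(I, am, in, Rovereto) under the bigram factorisation above;
# P(I) is approximated here as count(I) / total token count.
total = sum(unigram_counts.values())
p_sentence = (unigram_counts['I'] / total) * p_bigram('am', 'I') \
             * p_bigram('in', 'am') * p_bigram('Rovereto', 'in')

# Generative use: suggest the word w maximising P(w | in).
candidates = ['bed', 'heaven', 'Rovereto', 'Ulaanbaatar']
best = max(candidates, key=lambda w: p_bigram(w, 'in'))
print(p_sentence, best)   # 'bed' wins with these invented counts
```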
Language modeling with RNNs • The sequence given to the RNN is equivalent to the n-gram of a language model. • Given a word or character, it has to predict the next one. https://karpathy.github.io/2015/05/21/rnn-effectiveness/ 14
Example: rewriting Harry Potter http://www.botnik.org/content/harry-potter.html 15
Example: writing code https://karpathy.github.io/2015/05/21/rnn-effectiveness/ 16
Sequences for non-sequential input Check animation at https://karpathy.github.io/2015/05/21/rnn-effectiveness/ 17
Types of recurrent NNs • RNNs (Recurrent Neural Networks): the original version. Simple architecture but does not have much memory. • LSTMs (Long Short-Term Memory Networks): an RNN able to remember and forget selectively. • GRUs (Gated Recurrent Units): a variation on LSTMs. 18
Recurrent Neural Networks 19
Recurrent Neural Networks (RNNs) • Traditional neural networks do not have persistence : when presented with a new input, they forget the previous one. • RNNs solve this problem by having loops: the network acts like several copies of the same NN, each passing a message to the next instance. https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 20
Recurrent Neural Networks (RNNs) https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 21
The step functions • A simple RNN has a single step function which: • updates the hidden layer of the unit; • computes the output. • The hidden layer at time t is updated as: h_t = a_h(W_hh · h_{t−1} + W_xh · x_t) • The output is then given by: y_t = a_o(W_hy · h_t) 22
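As a concrete illustration, here is a minimal numpy sketch of this step function, following the slide's column-vector convention. The activation choices (tanh for a_h, softmax for a_o) and the random weights are assumptions: the slide leaves both activations generic.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One RNN step: update the hidden state, then compute the output.
    a_h = tanh and a_o = softmax are assumed choices."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # h_t = a_h(W_hh . h_{t-1} + W_xh . x_t)
    scores = W_hy @ h_t                          # pre-activation output
    y_t = np.exp(scores - scores.max())
    y_t /= y_t.sum()                             # y_t = a_o(W_hy . h_t), softmax here
    return h_t, y_t

# Usage with input/output dimension 4 and hidden dimension 3
# (column-vector convention, so W_xh is 3x4, W_hh is 3x3, W_hy is 4x3).
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(3, 3))
W_hy = rng.normal(size=(4, 3))
h, y = rnn_step(np.array([0., 1., 0., 0.]), np.zeros(3), W_xh, W_hh, W_hy)
```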
The state space • A recurrent network is a dynamical system described by the two equations in the step function (see previous slide). • The state of the system is the summary of its past behaviour , i.e. the set of hidden unit activations h t . • In addition to the input and output spaces, we have a state space which has the dimensionality of the hidden layer. 23
Backpropagation through time (BPTT) • Imagine doing backprop over an unfolded RNN. • Let us have a network training sequence from time t_0 to time t_k. • The cost function E(t_0, t_k) is the sum of the errors E(t) over time: E(t_0, t_k) = Σ_{t=t_0}^{t_k} E(t) • Similarly, the gradient descent update has contributions from all time steps: θ_j := θ_j − α · ∂E(t_0, t_k)/∂θ_j = θ_j − α · Σ_{t=t_0}^{t_k} ∂E(t)/∂θ_j 24
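A tiny sketch of the update rule above: the gradient applied to each parameter is the sum of its per-time-step contributions. The per-step gradients here are hypothetical placeholders; computing them is exactly what the backward pass on the later slides does.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
theta = rng.normal(size=(3, 3))                    # some parameter matrix, e.g. W_hh
per_step_grads = [rng.normal(size=(3, 3)) * 0.01   # placeholder dE(t)/dtheta, t = t_1..t_4
                  for _ in range(4)]

theta -= alpha * sum(per_step_grads)               # theta_j := theta_j - alpha * sum_t dE(t)/dtheta_j
```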
An RNN, step by step • Let us see what happens in an RNN with a simple example of forward and backpropagation. • Let’s assume a character-based language modeling task. The model has to predict the next character given a sequence. • We will set the vocabulary to four letters: e, h, l, o . • We will express each element in the input sequence as a 4-dimensional one-hot vector: • 1 0 0 0 = e • 0 1 0 0 = h • 0 0 1 0 = l • 0 0 0 1 = o 25
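A minimal sketch of this one-hot encoding in Python; the index order (e, h, l, o) follows the slide.

```python
import numpy as np

vocab = ['e', 'h', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(char):
    """Return the 4-dimensional one-hot vector for a character."""
    v = np.zeros(len(vocab))
    v[char_to_idx[char]] = 1.0
    return v

print(one_hot('h'))   # [0. 1. 0. 0.]
```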
An RNN, step by step • We will have sequences of length 4, e.g. ‘lloo’ or ‘oleh’. • We will have an RNN with a hidden layer of dimension 3. https://karpathy.github.io/2015/05/21/rnn-effectiveness/ 26
An RNN, step by step • Let’s imagine we give the following training example to the RNN. We input hell and we want to get the sequence ello . • Let’s have: • x = [[0100], [1000], [0010], [0010]] • y = [[1000], [0010], [0010], [0001]] • Each vector in x and y corresponds to a time step, so x_{t_2} = [1000]. • ŷ_t will be the prediction made by the model at time t. • ŷ will be the entire sequence predicted by the model. 27
An RNN, step by step We do a forward pass over the input sequence. It will mean calculating each state of the hidden layer and the resulting output:
h_{t_1} = a_h(x_{t_1} W_xh + h_{t_0} W_hh)   ŷ_{t_1} = a_o(h_{t_1} W_hy)
h_{t_2} = a_h(x_{t_2} W_xh + h_{t_1} W_hh)   ŷ_{t_2} = a_o(h_{t_2} W_hy)
h_{t_3} = a_h(x_{t_3} W_xh + h_{t_2} W_hh)   ŷ_{t_3} = a_o(h_{t_3} W_hy)
h_{t_4} = a_h(x_{t_4} W_xh + h_{t_3} W_hh)   ŷ_{t_4} = a_o(h_{t_4} W_hy) 28
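The same forward pass as a numpy sketch. The weight shapes (4×3, 3×3, 3×4) match the parameter-count slide below; the random initialisation and the choices a_h = tanh and a_o = identity are assumptions, since the slides leave both activations unspecified.

```python
import numpy as np

vocab = ['e', 'h', 'l', 'o']
one_hot = lambda c: np.eye(4)[vocab.index(c)]

# Row-vector convention: W_xh is 4x3, W_hh is 3x3, W_hy is 3x4.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 3))
W_hh = rng.normal(scale=0.1, size=(3, 3))
W_hy = rng.normal(scale=0.1, size=(3, 4))

x = [one_hot(c) for c in 'hell']      # input sequence, one vector per time step
h = np.zeros(3)                        # h_{t_0}: the initial hidden state
y_hat = []
for x_t in x:
    h = np.tanh(x_t @ W_xh + h @ W_hh)   # h_t = a_h(x_t W_xh + h_{t-1} W_hh)
    y_hat.append(h @ W_hy)               # y_hat_t = a_o(h_t W_hy), with a_o = identity

predicted = ''.join(vocab[int(np.argmax(y_t))] for y_t in y_hat)
print(predicted)   # some arbitrary 4-character string for an untrained network
```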
An RNN, step by step • Let’s now assume that the network did not do very well and predicted lole instead of ello , so the predicted sequence is ŷ = [[0010], [0001], [0010], [1000]]. • We now want to apply our update: θ_j := θ_j − α · Σ_{t=t_0}^{t_k} ∂E(t)/∂θ_j • This requires calculating the derivative of the error at each time step, for each parameter θ_j in the RNN: ∂E(t)/∂θ_j 29
An RNN, step by step • Our error E(t) at each time step is some function of ŷ_t − y_t over all our training instances, as normal. For instance, MSE: E(t) = (1/2N) Σ_{i=1}^{N} (ŷ_t^i − y_t^i)^2 • The entire error is the sum of those errors (see slide 24): E = Σ_{t=t_0}^{t_k} E(t) • NB: t_0 is the input, there is no error on it! 30
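A sketch of this error for our single training instance (N = 1), treating the squared term as a squared norm over the four output dimensions and using the hard one-hot prediction ŷ (lole) against the target y (ello):

```python
import numpy as np

y_hat = np.array([[0,0,1,0], [0,0,0,1], [0,0,1,0], [1,0,0,0]], dtype=float)  # 'lole'
y     = np.array([[1,0,0,0], [0,0,1,0], [0,0,1,0], [0,0,0,1]], dtype=float)  # 'ello'

E_t = 0.5 * np.sum((y_hat - y) ** 2, axis=1)   # error at each time step
E   = E_t.sum()                                 # total error, summed over time
print(E_t)   # [1. 1. 0. 1.]  (only the third prediction, 'l', was correct)
print(E)     # 3.0
```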
An RNN, step by step • Now we backpropagate through time. • Note that the error gradients flow not only through the layers of the network but also backwards across the time steps. 31
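Putting the pieces together, a minimal end-to-end sketch of one BPTT pass on the hell → ello example. It assumes a_h = tanh and an identity output activation with the MSE of the previous slides; note that ŷ_t here is the network's raw output rather than a hard one-hot, since an argmax cannot be backpropagated through. The random weights and learning rate are made up for illustration.

```python
import numpy as np

vocab = ['e', 'h', 'l', 'o']
one_hot = lambda c: np.eye(4)[vocab.index(c)]

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 3))
W_hh = rng.normal(scale=0.1, size=(3, 3))
W_hy = rng.normal(scale=0.1, size=(3, 4))

x = [one_hot(c) for c in 'hell']   # input sequence
y = [one_hot(c) for c in 'ello']   # target sequence

# Forward pass, keeping every hidden state for the backward pass.
hs, y_hat = [np.zeros(3)], []
for x_t in x:
    hs.append(np.tanh(x_t @ W_xh + hs[-1] @ W_hh))   # h_t
    y_hat.append(hs[-1] @ W_hy)                       # y_hat_t, identity output

# Backward pass through time: accumulate each parameter's gradient over all steps.
dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
dh_next = np.zeros(3)                  # gradient flowing back from time step t+1
for t in reversed(range(len(x))):
    dy = y_hat[t] - y[t]               # dE(t)/dy_hat_t for E(t) = 0.5*||y_hat_t - y_t||^2
    dW_hy += np.outer(hs[t + 1], dy)
    dh = dy @ W_hy.T + dh_next         # error from the output AND from step t+1
    dz = dh * (1 - hs[t + 1] ** 2)     # back through the tanh non-linearity
    dW_xh += np.outer(x[t], dz)
    dW_hh += np.outer(hs[t], dz)
    dh_next = dz @ W_hh.T              # hand the gradient back to step t-1

# One gradient descent update, with contributions summed over all time steps.
alpha = 0.1
W_xh -= alpha * dW_xh
W_hh -= alpha * dW_hh
W_hy -= alpha * dW_hy
```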
An RNN, step by step • How many parameters do we have in the network? • 4 × 3 for W xh • 3 × 3 for W hh • 3 × 4 for W hy • That is 33 parameters, plus associated biases (not shown). • A real network will have many more. So RNNs are expensive to train when backpropagating through the whole sequence. 32
RNNs and memory • RNNs are known not to have much memory: they cannot process long-distance dependencies. • Consider the following sentences: 1) Harry had not revised for the exams, having spent time fighting dementors, [insert long list of monsters], so he got a bad mark. 2) Hermione revised course material the whole time while fighting dementors, [insert long list of monsters], so she got a good mark. • When modeling this text, the RNN must remember the gender of the proper noun to correctly predict the pronoun. 33
RNNs and vanishing/exploding gradients • Reminder: at the points where an activation function is very steep and/or very flat, its gradient will be very large (exploding) or very small (vanishing). • For instance, the sigmoid function has a vanishing gradient for low and high values of x. 34
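A quick numerical illustration: the sigmoid's derivative σ'(x) = σ(x)(1 − σ(x)) peaks at 0.25 and is nearly zero for large |x|, so repeatedly multiplying such factors across time steps shrinks the gradient.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    s = sigmoid(x)
    print(f"x = {x:6.1f}   sigmoid'(x) = {s * (1 - s):.6f}")
# The derivative is 0.25 at x = 0 but about 0.000045 at x = +/-10: multiplying
# many such small factors across time steps makes the gradient vanish.
```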