  1. Recurrent Neural Networks. LING572: Advanced Statistical Methods for NLP. March 5, 2020

  2. Outline ● Word representations and MLPs for NLP tasks ● Recurrent neural networks for sequences ● Fancier RNNs ● Vanishing/exploding gradients ● LSTMs (Long Short-Term Memory) ● Variants ● Seq2seq architecture ● Attention 2

  3. MLPs for text classification 3

  4. Word Representations ● Traditionally: words are discrete features ● e.g. curWord=“class” ● As vectors: one-hot encoding ● Each vector is |V|-dimensional, where V is the vocabulary ● Each dimension corresponds to one word of the vocabulary ● A 1 for the current word; 0 everywhere else ● e.g. $w_1 = [1\ 0\ 0\ \cdots\ 0]$, $w_3 = [0\ 0\ 1\ \cdots\ 0]$

  5. Word Embeddings ● Problem 1: every word is equally different from every other ● All words are orthogonal to each other ● Problem 2: very high dimensionality ● Solution: move words into a dense, lower-dimensional space ● Grouping similar words close to each other ● These denser representations are called embeddings

  6. Word Embeddings ● Formally, a d-dimensional embedding is a matrix E with shape (|V|, d) ● Each row is the vector for one word in the vocabulary ● Multiplying a one-hot vector by E returns the corresponding row, i.e. the right word vector ● Trained on prediction tasks (see LING571 slides) ● Continuous bag of words ● Skip-gram ● … ● Can be trained on a specific task, or downloaded pre-trained (e.g. GloVe, fastText) ● Fancier versions now deal with OOV: sub-word units (e.g. BPE), character CNN/LSTM
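
To make the lookup concrete, here is a minimal numpy sketch (toy vocabulary size and embedding dimension, not from the slides) showing that multiplying a one-hot vector by an embedding matrix E of shape (|V|, d) selects the corresponding row:

```python
import numpy as np

V, d = 5, 3                      # toy vocabulary size and embedding dimension
E = np.random.randn(V, d)        # embedding matrix: one d-dimensional row per word

w3 = np.zeros(V)
w3[2] = 1.0                      # one-hot vector for the third word in the vocabulary

# The product selects row 2 of E, i.e. that word's embedding.
assert np.allclose(w3 @ E, E[2])
```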

  7. Relationships via Offsets. [Figure: word vectors in which the offset from MAN to WOMAN parallels UNCLE to AUNT and KING to QUEEN.] Mikolov et al. 2013b

  8. Relationships via Offsets. [Figure: the same plot with a second, number offset, e.g. KING to KINGS and QUEEN to QUEENS.] Mikolov et al. 2013b
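
A toy illustration of the offset idea, with hand-picked 2-d vectors standing in for trained embeddings (the values are made up for this example; real vectors would come from a model such as word2vec or GloVe):

```python
import numpy as np

# Hand-picked 2-d vectors sharing a "gender" offset (illustration only).
vecs = {
    "king":  np.array([0.9, 0.1]),
    "man":   np.array([0.5, 0.1]),
    "woman": np.array([0.5, 0.8]),
    "queen": np.array([0.9, 0.8]),
}

# king - man + woman should land near queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
nearest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - target))
print(nearest)  # -> queen
```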

  9. One More Example Mikolov et al 2013c 8

  10. One More Example 9

  11. Caveat Emptor Linzen 2016, a.o. 10

  12. Example MLP for Language Modeling Bengio et al 2003 11

  13. Example MLP for Language Modeling (Bengio et al. 2003). $w_t$: one-hot vector

  14. Example MLP for Language Modeling (Bengio et al. 2003). $w_t$: one-hot vector; $\text{embeddings} = \text{concat}(C w_{t-1}, C w_{t-2}, \ldots, C w_{t-(n-1)})$

  15. Example MLP for Language Modeling (Bengio et al. 2003). $w_t$: one-hot vector; $\text{embeddings} = \text{concat}(C w_{t-1}, C w_{t-2}, \ldots, C w_{t-(n-1)})$; $\text{hidden} = \tanh(W_1\,\text{embeddings} + b_1)$

  16. Example MLP for Language Modeling (Bengio et al. 2003). $w_t$: one-hot vector; $\text{embeddings} = \text{concat}(C w_{t-1}, C w_{t-2}, \ldots, C w_{t-(n-1)})$; $\text{hidden} = \tanh(W_1\,\text{embeddings} + b_1)$; $\text{probabilities} = \text{softmax}(W_2\,\text{hidden} + b_2)$
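
Putting slides 13 through 16 together, a minimal numpy sketch of the forward pass; the sizes, the softmax helper, and the function name mlp_lm_probs are illustrative assumptions rather than the course's reference code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

V, d, n, h = 100, 16, 3, 32           # toy sizes: vocab, embedding dim, n-gram order, hidden dim
C  = np.random.randn(V, d)            # embedding matrix (rows are word vectors)
W1 = np.random.randn(h, (n - 1) * d)
b1 = np.zeros(h)
W2 = np.random.randn(V, h)
b2 = np.zeros(V)

def mlp_lm_probs(context_ids):
    """P(w_t | previous n-1 words), given the ids of those n-1 words."""
    embeddings = np.concatenate([C[i] for i in context_ids])   # concat(C w_{t-1}, ..., C w_{t-(n-1)})
    hidden = np.tanh(W1 @ embeddings + b1)
    return softmax(W2 @ hidden + b2)

probs = mlp_lm_probs([4, 17])         # two context word ids for n = 3
assert probs.shape == (V,) and abs(probs.sum() - 1.0) < 1e-6
```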

  17. Example MLP for sentiment classification ● Issue: texts of different lengths ● One solution: average (or sum, or …) all the embeddings, which all have the same dimension ● Results on IMDB:
      Model                                      | IMDB accuracy
      Deep averaging network (Iyyer et al. 2015) | 89.4
      NB-SVM (Wang and Manning 2012)             | 91.2
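
A hedged numpy sketch of the averaging idea (a simplified stand-in, not the exact deep averaging network of Iyyer et al. 2015): embed each word, average into one fixed-size vector, then classify with an MLP. All sizes and weights are toy values.

```python
import numpy as np

V, d, h = 100, 16, 32
E  = np.random.randn(V, d)                     # word embeddings
W1 = np.random.randn(h, d); b1 = np.zeros(h)
W2 = np.random.randn(2, h); b2 = np.zeros(2)   # 2 classes: negative / positive

def classify(word_ids):
    """Average the embeddings of a variable-length text, then apply an MLP."""
    avg = E[word_ids].mean(axis=0)             # fixed-size vector regardless of text length
    hidden = np.tanh(W1 @ avg + b1)
    logits = W2 @ hidden + b2
    return int(logits.argmax())

print(classify([3, 14, 15, 9, 2, 6]))          # works for any text length
```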

  18. Recurrent Neural Networks 13

  19. RNNs: high-level 14

  20. RNNs: high-level ● Feed-forward networks: fixed-size input, fixed-size output ● Previous classifier: average embeddings of words ● Other solutions: n-gram assumption (i.e. fixed-size context of word embeddings)

  21. RNNs: high-level ● Feed-forward networks: fixed-size input, fixed-size output ● Previous classifier: average embeddings of words ● Other solutions: n-gram assumption (i.e. fixed-size context of word embeddings) ● RNNs process sequences of vectors ● Maintaining “hidden” state ● Applying the same operation at each step

  22. RNNs: high-level ● Feed-forward networks: fixed-size input, fixed-size output ● Previous classifier: average embeddings of words ● Other solutions: n-gram assumption (i.e. fixed-size context of word embeddings) ● RNNs process sequences of vectors ● Maintaining “hidden” state ● Applying the same operation at each step ● Different RNNs: ● Different operations at each step ● Operation also called “recurrent cell” ● Other architectural considerations (e.g. depth, bidirectionality)

  23. RNNs Steinert-Threlkeld and Szymanik 2019; Olah 2015 15

  24. RNNs. $h_t = f(x_t, h_{t-1})$. Steinert-Threlkeld and Szymanik 2019; Olah 2015

  25. RNNs. $h_t = f(x_t, h_{t-1})$. Simple/“Vanilla” RNN: $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$. Steinert-Threlkeld and Szymanik 2019; Olah 2015

  26. RNNs. [Figure: the RNN unrolled over the input “This class … interesting”.] $h_t = f(x_t, h_{t-1})$. Simple/“Vanilla” RNN: $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$. Steinert-Threlkeld and Szymanik 2019; Olah 2015

  27. RNNs. [Figure: a Linear + softmax layer applied to each hidden state to produce a per-step output, over the input “This class … interesting”.] $h_t = f(x_t, h_{t-1})$. Simple/“Vanilla” RNN: $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$. Steinert-Threlkeld and Szymanik 2019; Olah 2015
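
A minimal numpy sketch of the vanilla recurrence $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$, applying the same cell at every time step; the sizes and the 0.1 scaling are illustrative assumptions:

```python
import numpy as np

d, m = 16, 32                                # input (embedding) and hidden sizes
W_x = np.random.randn(m, d) * 0.1
W_h = np.random.randn(m, m) * 0.1
b   = np.zeros(m)

def run_rnn(inputs):
    """Apply the same cell at every step, carrying the hidden state forward."""
    h = np.zeros(m)                          # initial hidden state
    states = []
    for x_t in inputs:                       # one vector per token
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return states                            # one hidden state per time step

states = run_rnn([np.random.randn(d) for _ in range(5)])   # a 5-token toy "sentence"
```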

  28. Using RNNs. [Figure: three ways to use the hidden states: a prediction at every step (e.g. POS tagging); the final hidden state fed to an MLP (e.g. text classification); an encoder-decoder / seq2seq setup (later).]
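
A sketch of the two non-seq2seq uses in the figure, with random vectors standing in for the RNN's hidden states (illustrative shapes only): a prediction at every step for tagging, and a single prediction from the final state for classification.

```python
import numpy as np

m, n_tags, n_classes = 32, 10, 2
# Random vectors standing in for the hidden states over a 5-token sentence
# (in practice these would come from an RNN like the sketch above).
states = [np.random.randn(m) for _ in range(5)]

W_tag = np.random.randn(n_tags, m)           # per-step classifier, e.g. POS tagging
W_cls = np.random.randn(n_classes, m)        # whole-sequence classifier

tag_ids  = [int((W_tag @ h).argmax()) for h in states]   # one label per token
class_id = int((W_cls @ states[-1]).argmax())            # one label for the whole text
```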

  29. Training: BPTT ● “Unroll” the network across time-steps ● Apply backprop to the “wide” network ● Each cell has the same parameters ● When updating parameters using the gradients, take the average across the time steps 17
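
A hedged PyTorch sketch of the idea (toy sizes, not the course's code): unroll the recurrence, backpropagate through the unrolled graph, and note that the shared parameters end up with a single gradient that combines the contributions from all time steps.

```python
import torch

d, m, T = 4, 8, 5
W_x = torch.randn(m, d, requires_grad=True)
W_h = torch.randn(m, m, requires_grad=True)
b   = torch.zeros(m, requires_grad=True)

xs = [torch.randn(d) for _ in range(T)]      # a toy input sequence
h  = torch.zeros(m)
for x_t in xs:                               # "unroll": the same parameters at every step
    h = torch.tanh(W_x @ x_t + W_h @ h + b)

loss = h.sum()                               # toy loss on the final state
loss.backward()                              # backprop through the unrolled ("wide") graph

# W_h.grad now holds one gradient that combines the contributions from all T steps,
# which is what a single parameter update uses.
print(W_h.grad.shape)                        # torch.Size([8, 8])
```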

  30. Fancier RNNs 18

  31. Vanishing/Exploding Gradients Problem ● BPTT with vanilla RNNs faces a major problem: ● The gradients can vanish (approach 0) across time ● This makes it hard/impossible to learn long-distance dependencies, which are rampant in natural language

  32. Vanishing Gradients. [Figure (source): the gradient flowing backward from t=4 to t=1 through repeated per-step factors.] If these factors are small (this depends on W), the gradient signal reaching t=1 from the loss at t=4 will be very small.
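
One way to make the picture concrete (a standard chain-rule decomposition, not copied from the slide): the gradient reaching t=1 from the loss at t=4 is a product of per-step Jacobians, so if those factors are small the product shrinks toward zero, and if they are large it can explode.

```latex
\frac{\partial L_4}{\partial h_1}
  = \frac{\partial L_4}{\partial h_4}\,
    \frac{\partial h_4}{\partial h_3}\,
    \frac{\partial h_3}{\partial h_2}\,
    \frac{\partial h_2}{\partial h_1},
\qquad
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\left(1 - h_t^{\,2}\right) W_h
  \quad\text{for the vanilla RNN } h_t = \tanh(W_x x_t + W_h h_{t-1} + b).
```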

  33. Vanishing Gradient Problem source 21

  34. Vanishing Gradient Problem Graves 2012 22

  35. Vanishing Gradient Problem ● Gradient measures the effect of the past on the future ● If it vanishes between t and t+n, can’t tell if: ● There’s no dependency in fact ● The weights in our network just haven’t yet captured the dependency 23

  36. The need for long-distance dependencies ● Language modeling (fill-in-the-blank) ● The keys ____ ● The keys on the table ____ ● The keys next to the book on top of the table ____ ● To get the number on the verb, need to look at the subject, which can be very far away ● And number can disagree with linearly-close nouns ● Need models that can capture long-range dependencies like this. Vanishing gradients means vanilla RNNs will have difficulty. 24

  37. Long Short-Term Memory (LSTM) 25

  38. LSTMs ● Long Short-Term Memory (Hochreiter and Schmidhuber 1997) ● The gold standard / default RNN ● If someone says “RNN” now, they almost always mean “LSTM” ● Originally: to solve the vanishing/exploding gradient problem for RNNs ● Vanilla: re-writes the entire hidden state at every time-step ● LSTM: separate hidden state and memory ● Read, write to/from memory; can preserve long-term information 26

  39. LSTMs. $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$; $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$; $\hat{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$; $c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t$; $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$; $h_t = o_t \odot \tanh(c_t)$

  40. LSTMs 🤕🤕🤸🤯. $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$; $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$; $\hat{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$; $c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t$; $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$; $h_t = o_t \odot \tanh(c_t)$

  41. LSTMs ● Key innovation: $c_t, h_t = f(x_t, c_{t-1}, h_{t-1})$ ● $c_t$: a memory cell ● Reading/writing (smooth) controlled by gates ● $f_t$: forget gate ● $i_t$: input gate ● $o_t$: output gate ● Equations: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$; $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$; $\hat{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$; $c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t$; $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$; $h_t = o_t \odot \tanh(c_t)$
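
A minimal numpy sketch of one step of these equations; each weight matrix acts on the concatenation $[h_{t-1}, x_t]$, and the sizes and 0.1 scaling are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, m = 16, 32                                    # input and hidden/memory sizes
W_f, W_i, W_c, W_o = (np.random.randn(m, m + d) * 0.1 for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(m) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    """One LSTM step: the gates read [h_{t-1}, x_t]; the memory c_t is updated smoothly."""
    hx    = np.concatenate([h_prev, x_t])
    f_t   = sigmoid(W_f @ hx + b_f)              # forget gate: which cells to erase
    i_t   = sigmoid(W_i @ hx + b_i)              # input gate: which cells to write to
    c_hat = np.tanh(W_c @ hx + b_c)              # candidate / new values
    c_t   = f_t * c_prev + i_t * c_hat           # keep some old memory, add some new
    o_t   = sigmoid(W_o @ hx + b_o)              # output gate: which cells to expose
    h_t   = o_t * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(m), np.zeros(m)
for x_t in [np.random.randn(d) for _ in range(5)]:   # run over a 5-token toy sequence
    h, c = lstm_step(x_t, h, c)
```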

  42. LSTMs 28 Steinert-Threlkeld and Szymanik 2019; Olah 2015

  43. LSTMs. $f_t \in [0,1]^m$: which cells to forget. Steinert-Threlkeld and Szymanik 2019; Olah 2015

  44. LSTMs. $f_t \in [0,1]^m$: which cells to forget. Element-wise multiplication: 0 = erase, 1 = retain. Steinert-Threlkeld and Szymanik 2019; Olah 2015

  45. LSTMs. $f_t \in [0,1]^m$: which cells to forget. Element-wise multiplication: 0 = erase, 1 = retain. $i_t \in [0,1]^m$: which cells to write to. Steinert-Threlkeld and Szymanik 2019; Olah 2015

  46. LSTMs. $f_t \in [0,1]^m$: which cells to forget. Element-wise multiplication: 0 = erase, 1 = retain. $i_t \in [0,1]^m$: which cells to write to. $\hat{c}_t$: “candidate” / new values. Steinert-Threlkeld and Szymanik 2019; Olah 2015

  47. LSTMs. $f_t \in [0,1]^m$: which cells to forget. Element-wise multiplication: 0 = erase, 1 = retain. $i_t \in [0,1]^m$: which cells to write to. $\hat{c}_t$: “candidate” / new values. Add new values to memory. Steinert-Threlkeld and Szymanik 2019; Olah 2015

  48. LSTMs. $f_t \in [0,1]^m$: which cells to forget. Element-wise multiplication: 0 = erase, 1 = retain. $i_t \in [0,1]^m$: which cells to write to. $\hat{c}_t$: “candidate” / new values. Add new values to memory: $c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t$. Steinert-Threlkeld and Szymanik 2019; Olah 2015

  49. LSTMs. $f_t \in [0,1]^m$: which cells to forget. Element-wise multiplication: 0 = erase, 1 = retain. $i_t \in [0,1]^m$: which cells to write to. $\hat{c}_t$: “candidate” / new values. Add new values to memory: $c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t$. $o_t \in [0,1]^m$: which cells to output. Steinert-Threlkeld and Szymanik 2019; Olah 2015
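
In practice the cell is rarely written by hand; a hedged PyTorch example (toy sizes) using the built-in nn.LSTM, which returns the per-step hidden states along with the final $(h_T, c_T)$:

```python
import torch
import torch.nn as nn

d, m, T = 16, 32, 5
lstm = nn.LSTM(input_size=d, hidden_size=m, batch_first=True)

x = torch.randn(1, T, d)                 # one toy sequence of T input vectors
outputs, (h_T, c_T) = lstm(x)            # outputs: h_t for every step; (h_T, c_T): final state

print(outputs.shape)                     # torch.Size([1, 5, 32])
print(h_T.shape, c_T.shape)              # torch.Size([1, 1, 32]) each
```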
