
Recurrent Neural Networks - M. Soleymani, Sharif University of Technology - PowerPoint PPT Presentation



  1. Recurrent Neural Networks. M. Soleymani, Sharif University of Technology, Fall 2017. Most slides have been adapted from Fei-Fei Li and colleagues' lectures (cs231n, Stanford, 2017) and some from Socher's lectures (cs224d, Stanford, 2017).

  2. “Vanilla” Neural Networks (“Vanilla” NN)

  3. Recurrent Neural Networks: Process Sequences. “Vanilla” NN; e.g. Image Captioning: image -> seq of words; e.g. Sentiment Classification: seq of words -> sentiment; e.g. Machine Translation: seq of words -> seq of words; e.g. Video classification on frame level.

  4. Recurrent Neural Network: usually want to predict a vector y at some time steps. (Diagram: input x -> RNN -> output y)

  5. Recurrent Neural Network: We can process a sequence of vectors x by applying a recurrence formula at every time step: $h_t = f_W(h_{t-1}, x_t)$, where $h_t$ is the new state, $h_{t-1}$ is the old state, $x_t$ is the input vector at some time step, and $f_W$ is some function with parameters W.

  6. Recurrent Neural Network: We can process a sequence of vectors x by applying a recurrence formula at every time step. Notice: the same function and the same set of parameters are used at every time step.

  7. (Vanilla) Recurrent Neural Network: The state consists of a single “hidden” vector h: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, $y_t = W_{hy} h_t$.
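A minimal sketch of this recurrence in plain NumPy (parameter names and shapes are illustrative, not the lecture's code):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One vanilla RNN step: update the hidden state and produce an output."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)   # new hidden state
    y_t = W_hy @ h_t                            # output (e.g. unnormalized class scores)
    return h_t, y_t

# Toy usage with random parameters (illustrative only).
rng = np.random.default_rng(0)
D, H, V = 4, 8, 4                               # input, hidden, and output dimensions
W_xh = rng.normal(0, 0.01, (H, D))
W_hh = rng.normal(0, 0.01, (H, H))
W_hy = rng.normal(0, 0.01, (V, H))
h = np.zeros(H)
h, y = rnn_step(rng.normal(size=D), h, W_xh, W_hh, W_hy)
```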

  8. RNN: Computational Graph

  9. RNN: Computational Graph Re-use the same weight matrix at every time-step
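A short sketch of what re-using the same weight matrix means when the graph is unrolled: one loop over time steps, with the same parameters applied at every step (this builds on the hypothetical rnn_step above):

```python
def rnn_forward(xs, h0, W_xh, W_hh, W_hy):
    """Unroll the RNN over a sequence; W_xh, W_hh, W_hy are shared across all time steps."""
    h, hs, ys = h0, [], []
    for x_t in xs:                  # xs: list of input vectors, one per time step
        h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)
        hs.append(h)                # hidden states
        ys.append(y)                # per-step outputs (keep only ys[-1] for many-to-one)
    return hs, ys
```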

  10. RNN: Computational Graph: Many to One

  11. RNN: Computational Graph: Many to Many

  12. RNN: Computational Graph: Many to Many

  13. RNN: Computational Graph: Many to Many

  14. RNN: Computational Graph: One to Many

  15. Sequence to Sequence: Many-to-one + one-to-many

  16. Character-level language model example. Vocabulary: [h,e,l,o]. Example training sequence: “hello”

  17. Character-level language model example. Vocabulary: [h,e,l,o]. Example training sequence: “hello”

  18. Character-level language model example. Vocabulary: [h,e,l,o]. Example training sequence: “hello”

  19. Character-level language model example. Vocabulary: [h,e,l,o]. Example training sequence: “hello”

  20. Example: Character-level Language Model Sampling • Vocabulary: [h,e,l,o] • At test-time sample characters one at a time, feed back to model

  21. Example: Character-level Language Model Sampling • Vocabulary: [h,e,l,o] • At test-time sample characters one at a time, feed back to model

  22. Example: Character-level Language Model Sampling • Vocabulary: [h,e,l,o] • At test-time sample characters one at a time, feed back to model

  23. Example: Character-level Language Model Sampling • Vocabulary: [h,e,l,o] • At test-time sample characters one at a time, feed back to model
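A sketch of the test-time sampling loop described on these slides, in the spirit of min-char-rnn (the weight matrices are assumed to be already trained; all names are illustrative):

```python
import numpy as np

chars = ['h', 'e', 'l', 'o']                      # vocabulary from the example
char_to_ix = {c: i for i, c in enumerate(chars)}

def sample(h, seed_char, n, W_xh, W_hh, W_hy):
    """Sample n characters one at a time, feeding each sample back in as the next input."""
    x = np.zeros(len(chars)); x[char_to_ix[seed_char]] = 1.0     # one-hot seed character
    out = []
    for _ in range(n):
        h = np.tanh(W_xh @ x + W_hh @ h)                         # vanilla RNN step
        scores = W_hy @ h
        p = np.exp(scores - scores.max()); p /= p.sum()          # softmax over the vocabulary
        ix = np.random.choice(len(chars), p=p)                   # sample the next character
        out.append(chars[ix])
        x = np.zeros(len(chars)); x[ix] = 1.0                    # feed the sample back in
    return ''.join(out)
```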

  24. min-char-rnn.py gist: 112 lines of Python (https://gist.github.com/karpathy/d4dee566867f8291f086)

  25. Language Modeling: Example. (Diagram: input x -> RNN -> output y)

  26. Generated samples at different stages of training: at first; train more; train more; train more.

  27. Open source textbook on algebraic geometry (LaTeX source)

  28. Generated C code

  29. Searching for interpretable cells

  30. Searching for interpretable cells

  31. Searching for interpretable cells quote detection cell

  32. Searching for interpretable cells line length tracking cell

  33. Searching for interpretable cells if statement cell

  34. Searching for interpretable cells quote/comment cell if statement cell

  35. Searching for interpretable cells code depth cell

  36. Backpropagation through time Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient

  37. Truncated Backpropagation through time Run forward and backward through chunks of the sequence instead of whole sequence

  38. Truncated Backpropagation through time Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
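A sketch of truncated backpropagation through time using PyTorch (hypothetical model and hyperparameters, not the lecture's code): the hidden state is carried across chunks, but detach() cuts the graph so gradients only flow within each chunk.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, chunk_len = 65, 128, 25
embed = nn.Embedding(vocab_size, hidden_size)
rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

data = torch.randint(0, vocab_size, (1, 10000))        # placeholder token stream (batch of 1)

h = None
for start in range(0, data.size(1) - chunk_len - 1, chunk_len):
    x = data[:, start:start + chunk_len]               # inputs for this chunk
    y = data[:, start + 1:start + chunk_len + 1]       # next-token targets
    out, h = rnn(embed(x), h)                          # forward through the chunk
    loss = loss_fn(head(out).reshape(-1, vocab_size), y.reshape(-1))
    opt.zero_grad()
    loss.backward()                                    # backprop only within this chunk
    opt.step()
    h = h.detach()                                     # carry hidden state forward, gradients stop here
```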

  39. Truncated Backpropagation through time

  40. Example: Language Models • A language model computes a probability for a sequence of words: $P(w_1, \ldots, w_T)$ • Useful for machine translation, spelling correction, and … – Word ordering: p(the cat is small) > p(small the is cat) – Word choice: p(walking home after school) > p(walking house after school)
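The factorization behind this probability is the standard chain rule (implicit in the slide rather than written on it); an RNN language model approximates each conditional with its output distribution at step t:

```latex
P(w_1, \ldots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})
```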

  41. Example: RNN language model • Given a list of word vectors: $x_1, x_2, \ldots, x_T$ • At a single time step: $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1})$ • Output: $\hat{y}_t = \mathrm{softmax}(W_{hy} h_t)$, with $P(y_t = v_k \mid x_1, \ldots, x_t) \approx \hat{y}_{t,k}$

  42. Example: RNN language model • Given a list of word vectors: $x_1, x_2, \ldots, x_T$ • At a single time step: $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1})$ • Output: $\hat{y}_t = \mathrm{softmax}(W_{hy} h_t)$, with $P(y_t = v_k \mid x_1, \ldots, x_t) \approx \hat{y}_{t,k}$. $h_0$ is some initialization vector for the hidden layer at time step 0; $x_t$ is the column vector at time step t.

  43. Example: RNN language model loss • $\hat{y}_t \in \mathbb{R}^{|V|}$ is a probability distribution over the vocabulary • Cross-entropy loss function at location t of the sequence: $E_t = -\sum_{k=1}^{|V|} y_{t,k} \log \hat{y}_{t,k}$, where $y_{t,k} = 1$ when $w_t$ must be word $k$ of the vocabulary • Cost function over the entire sequence: $E = -\frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{|V|} y_{t,k} \log \hat{y}_{t,k}$
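A small NumPy sketch combining the forward pass of slides 41-42 with this loss (names and shapes are illustrative; it reuses the W_xh / W_hh / W_hy convention from the earlier sketches):

```python
import numpy as np

def rnn_lm_loss(xs, targets, h0, W_xh, W_hh, W_hy):
    """Average cross-entropy of an RNN language model over a sequence.

    xs:      list of input word vectors, one per time step
    targets: list of correct next-word indices, one per time step
    """
    h, total = h0, 0.0
    for x_t, k in zip(xs, targets):
        h = np.tanh(W_xh @ x_t + W_hh @ h)                 # hidden state update
        scores = W_hy @ h
        p = np.exp(scores - scores.max()); p /= p.sum()    # softmax -> predicted distribution
        total += -np.log(p[k])                             # cross-entropy with a one-hot target
    return total / len(xs)                                 # E = (1/T) * sum_t E_t
```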

  44. Training RNN

  45. Training RNN

  46. Training RNN: $h_j = W_{hh} f(h_{j-1}) + W_{xh} x_j$, so $\frac{\partial h_j}{\partial h_{j-1}} = W_{hh}^T \, \mathrm{diag}\!\left[f'(h_{j-1})\right]$ and $\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \le \left\| W_{hh}^T \right\| \left\| \mathrm{diag}\!\left[f'(h_{j-1})\right] \right\| \le \gamma_W \gamma_h$. Therefore $\left\| \frac{\partial h_t}{\partial h_k} \right\| = \left\| \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right\| \le (\gamma_W \gamma_h)^{t-k}$. • This can become very small or very large quickly (vanishing/exploding gradients) [Bengio et al. 1994].
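A tiny numeric illustration of why the repeated factor matters (illustrative, not from the slides): the bound grows or shrinks roughly like the largest singular value of the recurrent matrix raised to the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)
H, steps = 32, 50
for scale in (0.05, 0.2):                             # std of the random recurrent weights
    W_hh = rng.normal(0, scale, (H, H))
    sigma = np.linalg.svd(W_hh, compute_uv=False)[0]  # largest singular value
    print(f"largest singular value {sigma:.2f} -> sigma**{steps} = {sigma ** steps:.3g}")
```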

  47. Training RNNs is hard • Multiply the same matrix at each time step during forward prop • Ideally inputs from many time steps ago can modify output y

  48. The vanishing gradient problem: Example • In the case of language modeling words from time steps far away are not taken into consideration when training to predict the next word • Example: Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ____

  49. Vanilla RNN Gradient Flow

  50. Vanilla RNN Gradient Flow

  51. Vanilla RNN Gradient Flow Computing gradient of h0 involves many factors of W (and repeated tanh) Largest singular value > 1: Exploding gradients Largest singular value < 1: Vanishing gradients

  52. Trick for exploding gradient: clipping trick • The solution first introduced by Mikolov is to clip gradients to a maximum value. • Makes a big difference in RNNs.
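A minimal sketch of norm clipping (the threshold of 5.0 is a placeholder, not a value from the slides):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient if its norm exceeds max_norm; otherwise leave it unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```

Deep learning frameworks provide an equivalent utility, e.g. torch.nn.utils.clip_grad_norm_ in PyTorch.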

  53. Gradient clipping intuition • Error surface of a single hidden unit RNN – High curvature walls • Solid lines: standard gradient descent trajectories • Dashed lines: gradients rescaled to a fixed size

  54. Vanilla RNN Gradient Flow: Computing the gradient of h0 involves many factors of W (and repeated tanh). Largest singular value > 1: Exploding gradients -> Gradient clipping: scale the gradient if its norm is too big. Largest singular value < 1: Vanishing gradients -> Change RNN architecture.

  55. For vanishing gradients: Initialization + ReLUs! • Initialize the Ws to the identity matrix I and use ReLU activations • New experiments with recurrent neural nets: Le et al., A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, 2015.
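A sketch of this initialization idea (the 0.001 scale for the input weights is an illustrative choice, not taken from the slides or the paper):

```python
import numpy as np

def irnn_init(hidden_size, input_size):
    """Identity initialization of the recurrent weights, plus small random input weights."""
    W_hh = np.eye(hidden_size)                                     # recurrent weights = I
    W_xh = np.random.normal(0, 0.001, (hidden_size, input_size))   # small random input weights
    return W_hh, W_xh

def irnn_step(x_t, h_prev, W_xh, W_hh):
    return np.maximum(0.0, W_xh @ x_t + W_hh @ h_prev)             # ReLU instead of tanh
```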

  56. Better units for recurrent models • More complex hidden unit computation in recurrence! – $h_t = \mathrm{LSTM}(x_t, h_{t-1})$ – $h_t = \mathrm{GRU}(x_t, h_{t-1})$ • Main ideas: – keep around memories to capture long distance dependencies – allow error messages to flow at different strengths depending on the inputs

  57. Long Short Term Memory (LSTM)

  58. Long short-term memories (LSTMs) • Input gate (current cell matters): $i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$ • Forget (gate 0, forget past): $f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$ • Output (how much cell is exposed): $o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$ • New memory cell: $g_t = \tanh(W_g [h_{t-1}; x_t] + b_g)$ • Final memory cell: $c_t = i_t \circ g_t + f_t \circ c_{t-1}$ • Final hidden state: $h_t = o_t \circ \tanh(c_t)$
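A NumPy sketch of one LSTM step following these equations, with each gate acting on the concatenation [h_{t-1}; x_t] (parameter names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, b_i, W_f, b_f, W_o, b_o, W_g, b_g):
    """One LSTM step: gates, candidate memory, cell update, and hidden state."""
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}; x_t]
    i = sigmoid(W_i @ z + b_i)          # input gate
    f = sigmoid(W_f @ z + b_f)          # forget gate
    o = sigmoid(W_o @ z + b_o)          # output gate
    g = np.tanh(W_g @ z + b_g)          # new memory cell candidate
    c = i * g + f * c_prev              # final memory cell
    h = o * np.tanh(c)                  # final hidden state
    return h, c
```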

  59. Some visualization by Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  60. LSTM Gates • Gates are ways to let information through (or not): – Forget gate: look at previous cell state and current input, and decide which information to throw away. – Input gate: see which information in the current state we want to update. – Output: Filter cell state and output the filtered result. – Gate or update gate: propose new values for the cell state. • For instance: store gender of subject until another subject is seen.
