  1. Sequence-to-Sequence Learning using Recurrent Neural Networks
     Jindřich Helcl, Jindřich Libovický
     March 4, 2020
     NPFL116 Compendium of Neural Machine Translation
     Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

  2. Outline
     Symbol Embeddings
     Recurrent Networks
     Neural Network Language Models
     Vanilla Sequence-to-Sequence Model
     Attentive Sequence-to-Sequence Learning
     Reading Assignment
     Sequence-to-Sequence Learning using Recurrent Neural Networks 1/41

  3. Symbol Embeddings

  4. Discrete symbol vs. continuous representation
     Simple task: predict the next word given the three previous words.
     Source: Bengio, Yoshua, et al. “A Neural Probabilistic Language Model.” Journal of Machine Learning Research 3 (2003): 1137–1155. http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

  5. Embeddings
     • Natural solution: one-hot vector (a vector of vocabulary length with exactly one 1)
     • It would mean multiplying by a huge matrix every time a symbol is on the input
     • Rather factorize this matrix and share the first part ⇒ embeddings
     • “Embeddings” because they embed discrete symbols into a continuous space
     Think of training-related problems when using word embeddings: embeddings get updated only rarely – only when their symbol appears.
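The factorization above can be sketched in a few lines of NumPy: multiplying a one-hot vector by the shared matrix is the same as selecting one of its rows, which is why embedding lookup is implemented as plain indexing. Sizes and values below are illustrative only.

```python
import numpy as np

vocab_size, emb_dim = 10, 4  # toy sizes, for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, emb_dim))  # embedding matrix (the shared factor)

symbol_id = 7
one_hot = np.zeros(vocab_size)
one_hot[symbol_id] = 1.0

# Multiplying a one-hot vector by E selects exactly one row of E,
# so the expensive matmul reduces to an index lookup.
via_matmul = one_hot @ E
via_lookup = E[symbol_id]
assert np.allclose(via_matmul, via_lookup)
```

This is also why only the rows of symbols that actually occur in a batch receive gradient updates.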

  8. Properties of embeddings
     Source: https://blogs.mathworks.com/loren/2017/09/21/math-with-words-word-embeddings-with-matlab-and-text-analytics-toolbox/
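A common illustration of these properties is vector arithmetic such as king − man + woman ≈ queen. A minimal sketch with hand-made toy 2D vectors (hypothetical values chosen only to show the arithmetic; real analogies require embeddings trained on large corpora):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy, hand-made "embeddings" (hypothetical values, for illustration only).
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.2, 0.8]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.2, 0.1]),
}

# Analogy by vector arithmetic: nearest neighbour of king - man + woman.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w != "king"), key=lambda w: cosine(target, emb[w]))
# With these toy vectors, the nearest remaining word is "queen".
```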

  9. Recurrent Networks

  10. Why RNNs
     • for-loops over sequential data
     • the most frequently used type of network in NLP

  11. General Formulation
     • inputs: x₁, …, x_T
     • initial state h₀ can be:
       • a zero vector,
       • the result of a previous computation,
       • a trainable parameter
     • recurrent computation: h_t = A(h_{t−1}, x_t)

  12. RNN as Imperative Code

     def rnn(initial_state, inputs):
         prev_state = initial_state
         for x in inputs:
             new_state, output = rnn_cell(x, prev_state)
             prev_state = new_state
             yield output
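The loop above is generic in `rnn_cell`. A minimal runnable sketch that plugs in a vanilla tanh cell (weight shapes and the random initialization are illustrative assumptions, not the course's reference implementation):

```python
import numpy as np

hidden, inp = 3, 2  # toy sizes, for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden, hidden + inp))  # recurrent and input weights, concatenated
b = np.zeros(hidden)

def rnn_cell(x, prev_state):
    # Vanilla RNN: new state from concatenated [previous state; input].
    new_state = np.tanh(W @ np.concatenate([prev_state, x]) + b)
    return new_state, new_state  # the output is simply the new state

def rnn(initial_state, inputs):
    prev_state = initial_state
    for x in inputs:
        new_state, output = rnn_cell(x, prev_state)
        prev_state = new_state
        yield output

outputs = list(rnn(np.zeros(hidden), [rng.normal(size=inp) for _ in range(5)]))
```

One output per input step; every output lies in (−1, 1) because of the tanh.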

  13. RNN as a Fancy Image

  14. Vanilla RNN
     h_t = tanh(W[h_{t−1}; x_t] + b)
     • cannot propagate long-distance relations
     • vanishing gradient problem

  15. Vanishing Gradient Problem (1)
     tanh x = (1 − e^{−2x}) / (1 + e^{−2x})
     d tanh x / dx = 1 − tanh² x ∈ (0, 1]
     [plots of tanh x and its derivative]
     Weights are initialized ∼ 𝒩(0, 1) to keep gradients further from zero.
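Because every step of backpropagation through time multiplies by a tanh derivative in (0, 1], the gradient contribution of distant time steps shrinks roughly geometrically. A small NumPy sketch of this effect (the pre-activations are drawn randomly, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
steps = 50
activations = rng.normal(size=steps)       # hypothetical pre-activations a_t
factors = 1.0 - np.tanh(activations) ** 2  # tanh'(a_t), each in (0, 1]
gradient_scale = np.prod(factors)          # product accumulated over 50 steps

# Each factor is at most 1, so the product can only shrink;
# after 50 steps the contribution of the first step is tiny.
assert np.all(factors <= 1.0)
assert gradient_scale < 1e-3
```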

  17. Vanishing Gradient Problem (2)
     ∂E_{t+1}/∂b = ∂E_{t+1}/∂h_{t+1} · ∂h_{t+1}/∂b   (chain rule)

  20. Vanishing Gradient Problem (3)
     ∂h_t/∂b = ∂ tanh(a_t)/∂b, where a_t = W_h h_{t−1} + W_x x_t + b is the activation (tanh′ is the derivative of tanh)
     = tanh′(a_t) · ( ∂(W_h h_{t−1})/∂b + ∂(W_x x_t)/∂b + ∂b/∂b )
       with ∂(W_x x_t)/∂b = 0 and ∂b/∂b = 1
     = W_h · tanh′(a_t) · ∂h_{t−1}/∂b + tanh′(a_t)
       where W_h ∼ 𝒩(0, 1) and tanh′(a_t) ∈ (0, 1]

  24. LSTMs
     LSTM = Long short-term memory
     Control the gradient flow by explicitly gating:
     • what to use from the input,
     • what to use from the hidden state,
     • what to put on the output

  27. Hidden State
     • two types of hidden states:
       • h_t, the “public” hidden state, used as the output
       • c_t, the “private” memory, with no non-linearities on the way
     • direct flow of gradients (without multiplying by derivatives ≤ 1)
     • only vectors guaranteed to live in the same space are manipulated
     • information-highway metaphor

  28. Forget Gate
     f_t = σ(W_f [h_{t−1}; x_t] + b_f)
     • based on the input and the previous state, decide what to forget from the memory

  29. Input Gate
     i_t = σ(W_i · [h_{t−1}; x_t] + b_i)
     C̃_t = tanh(W_C · [h_{t−1}; x_t] + b_C)
     • i_t decides how much of the information we want to store
     • C̃_t is a candidate for what we may want to add to the memory

  30. Cell State Update
     C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
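The gates from the last three slides combine into one runnable step. A NumPy sketch following these equations (weight shapes and the random initialization are illustrative assumptions; the output gate, which the deck has not introduced at this point, is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, inp = 3, 2  # toy sizes, for illustration
rng = np.random.default_rng(0)
Wf, Wi, Wc = (rng.normal(size=(hidden, hidden + inp)) for _ in range(3))
bf = bi = bc = np.zeros(hidden)

def lstm_cell_state(h_prev, c_prev, x):
    """One cell-state update: C_t = f_t * C_{t-1} + i_t * C~_t (elementwise)."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)        # forget gate: what to keep from old memory
    i = sigmoid(Wi @ z + bi)        # input gate: how much of the candidate to store
    c_tilde = np.tanh(Wc @ z + bc)  # candidate memory content
    return f * c_prev + i * c_tilde

c_new = lstm_cell_state(np.zeros(hidden), np.ones(hidden), rng.normal(size=inp))
```

Note that the old memory `c_prev` enters the new state only through an elementwise product, with no non-linearity on the path, which is exactly the direct gradient flow the "Hidden State" slide describes.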
