Neural Architectures for NLP

  1. Neural Architectures for NLP
  Jindřich Helcl, Jindřich Libovický
  February 26, 2020
  NPFL116 Compendium of Neural Machine Translation
  Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

  2. Outline
  • Symbol Embeddings
  • Recurrent Networks
  • Convolutional Networks
  • Self-attentive Networks
  • Reading Assignment

  3. Symbol Embeddings

  4. Discrete symbol vs. continuous representation
  Simple task: predict the next word given the three previous words.
  Source: Bengio, Yoshua, et al. "A neural probabilistic language model." Journal of Machine Learning Research 3 (2003): 1137-1155. http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
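  A minimal forward-pass sketch of a Bengio-style feed-forward language model of this kind, assuming numpy; the vocabulary size, dimensions, and initialization below are illustrative assumptions, not values from the slides (the original model also has direct embedding-to-output connections, omitted here):

      import numpy as np

      V, d, h = 10000, 64, 128                 # assumed vocabulary size, embedding dim, hidden dim
      rng = np.random.default_rng(0)
      E = rng.normal(size=(V, d))              # embedding matrix, shared across positions
      W = rng.normal(size=(3 * d, h)) * 0.01   # hidden layer over the 3 concatenated embeddings
      U = rng.normal(size=(h, V)) * 0.01       # projection to vocabulary logits

      def predict_next(word_ids):
          """word_ids: indices of the three previous words (hypothetical values below)."""
          x = np.concatenate([E[i] for i in word_ids])   # look up and concatenate embeddings
          hidden = np.tanh(x @ W)
          logits = hidden @ U
          probs = np.exp(logits - logits.max())
          return probs / probs.sum()                     # distribution over the next word

      print(predict_next([12, 7, 42]).shape)             # (10000,)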

  5. Embeddings
  • Natural solution: one-hot vector (a vector of vocabulary length with exactly one 1)
  • It would mean a huge matrix every time a symbol is on the input
  • Rather factorize this matrix and share the first part ⇒ embeddings
  • “Embeddings” because they embed discrete symbols into a continuous space
  What is the biggest problem during training? Embeddings get updated only rarely, only when a symbol appears.

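  To illustrate the factorization above: multiplying a one-hot vector by the shared first matrix is just a row lookup, which is how embedding layers are implemented in practice, and it is also why only the row of the symbol that actually appears receives an update. A small numpy sketch with assumed sizes:

      import numpy as np

      V, d = 10000, 64                                   # assumed vocabulary size and embedding dim
      E = np.random.default_rng(1).normal(size=(V, d))   # the shared first factor (embedding matrix)

      symbol = 42                                        # a hypothetical symbol index
      one_hot = np.zeros(V)
      one_hot[symbol] = 1.0

      # The one-hot matrix product selects a single row of E.
      assert np.allclose(one_hot @ E, E[symbol])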

  8. Properties of embeddings
  Source: https://blogs.mathworks.com/loren/2017/09/21/math-with-words-word-embeddings-with-matlab-and-text-analytics-toolbox/

  9. Recurrent Networks

  10. Why RNNs
  • for loops over sequential data
  • the most frequently used type of network in NLP

  11. General Formulation
  • inputs: x_1, …, x_T
  • initial state h_0 = 0, a result of a previous computation, or a trainable parameter
  • recurrent computation: h_t = A(h_{t−1}, x_t)

  12. RNN as Imperative Code

      def rnn(initial_state, inputs):
          prev_state = initial_state
          for x in inputs:
              new_state, output = rnn_cell(x, prev_state)
              prev_state = new_state
              yield output

  13. RNN as a Fancy Image

  14. Vanilla RNN
  h_t = tanh(W [h_{t−1}; x_t] + b)
  • cannot propagate long-distance relations
  • vanishing gradient problem
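  For concreteness, a minimal numpy sketch of the vanilla cell above, written in the shape of the rnn_cell used by the generator on slide 12; the dimensions and random initialization are illustrative assumptions:

      import numpy as np

      d_x, d_h = 32, 64                        # assumed input and hidden-state sizes
      rng = np.random.default_rng(2)
      W = rng.normal(size=(d_h, d_h + d_x))    # acts on the concatenation [h_{t-1}; x_t]
      b = np.zeros(d_h)

      def rnn_cell(x, prev_state):
          # h_t = tanh(W [h_{t-1}; x_t] + b); a vanilla cell outputs its new state
          new_state = np.tanh(W @ np.concatenate([prev_state, x]) + b)
          return new_state, new_state

      # the same loop the rnn() generator above performs, unrolled explicitly
      state = np.zeros(d_h)
      for x in [rng.normal(size=d_x) for _ in range(5)]:
          state, output = rnn_cell(x, state)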

  15. Vanishing Gradient Problem (1)
  tanh x = (1 − e^(−2x)) / (1 + e^(−2x))
  tanh′ x = 1 − tanh² x ∈ (0, 1]
  Weights are initialized ∼ N(0, 1) to have gradients further from zero.
  (Plots: tanh x and its derivative tanh′ x for x ∈ [−6, 6].)

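  A quick numerical check of the bound above (numpy assumed):

      import numpy as np

      x = np.linspace(-6, 6, 13)
      dtanh = 1 - np.tanh(x) ** 2       # tanh'(x) = 1 - tanh^2(x)
      print(dtanh.max())                # 1.0, reached only at x = 0
      print(dtanh[0], dtanh[-1])        # about 2.5e-5 at x = ±6: the gradient nearly vanishes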

  17. Vanishing Gradient Problem (2)
  ∂E_{t+1} / ∂b = (∂E_{t+1} / ∂h_{t+1}) ⋅ (∂h_{t+1} / ∂b)    (chain rule)
  where E_{t+1} is the loss at time step t+1.

  20. Vanishing Gradient Problem (3)
  ∂h_t / ∂b = ∂ tanh(W_h h_{t−1} + W_x x_t + b) / ∂b
            = tanh′(a_t) ⋅ ( ∂(W_h h_{t−1})/∂b + ∂(W_x x_t)/∂b + ∂b/∂b )
            = tanh′(a_t) ⋅ W_h ⋅ ∂h_{t−1}/∂b + tanh′(a_t)
  where a_t = W_h h_{t−1} + W_x x_t + b is the activation, tanh′ is the derivative of tanh,
  ∂(W_x x_t)/∂b = 0, ∂b/∂b = 1, tanh′(a_t) ∈ (0, 1], and W_h ∼ N(0, 1).

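  The recursion above multiplies the gradient by a factor tanh′(a_t) ⋅ W_h at every step, so the contribution of distant time steps shrinks roughly geometrically. A scalar toy demonstration, with all weights and inputs as illustrative random values (numpy assumed):

      import numpy as np

      rng = np.random.default_rng(3)
      w_h, w_x, b = rng.normal(), rng.normal(), 0.1
      h = 0.0
      grad_from_step_1 = 1.0                    # contribution flowing from the first step
      for t in range(50):
          a = w_h * h + w_x * rng.normal() + b  # scalar activation a_t
          h = np.tanh(a)
          grad_from_step_1 *= (1 - np.tanh(a) ** 2) * w_h   # one factor tanh'(a_t) * w_h per step
      print(grad_from_step_1)                   # typically close to 0 after a few dozen steps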

  24. LSTMs
  LSTM = Long short-term memory
  Control the gradient flow by explicitly gating:
  • what to use from the input,
  • what to use from the hidden state,
  • what to put on the output.

  27. Hidden State
  • two types of hidden states:
  • h_t is the “public” hidden state, used as the output
  • c_t is the “private” memory, with no non-linearities on the way
  • direct flow of gradients (without multiplying by derivatives ≤ 1)
  • only vectors guaranteed to live in the same space are manipulated
  • information highway metaphor

  28. Forget Gate
  f_t = σ(W_f [h_{t−1}; x_t] + b_f)
  • based on the input and the previous state, decide what to forget from the memory

  29. Input Gate
  i_t = σ(W_i ⋅ [h_{t−1}; x_t] + b_i)
  C̃_t = tanh(W_C ⋅ [h_{t−1}; x_t] + b_C)
  • i_t decides how much of the information we want to store
  • C̃_t is a candidate for what we may want to add to the memory

  30. Cell State Update
  C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

  31. Output Gate
  o_t = σ(W_o ⋅ [h_{t−1}; x_t] + b_o)
  h_t = o_t ⊙ tanh C_t

  32. Here we are!
  f_t = σ(W_f [h_{t−1}; x_t] + b_f)
  i_t = σ(W_i ⋅ [h_{t−1}; x_t] + b_i)
  o_t = σ(W_o ⋅ [h_{t−1}; x_t] + b_o)
  C̃_t = tanh(W_C ⋅ [h_{t−1}; x_t] + b_C)
  C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
  h_t = o_t ⊙ tanh C_t
  How would you implement it efficiently? Compute all gates in a single matrix multiplication.
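  A numpy sketch of that idea: the four gate pre-activations are produced by one stacked weight matrix and one matrix multiplication, then split and combined exactly as in the equations above. Dimensions, names, and initialization are illustrative assumptions:

      import numpy as np

      d_x, d_h = 32, 64                                 # assumed input and state sizes
      rng = np.random.default_rng(4)
      W = rng.normal(size=(4 * d_h, d_h + d_x)) * 0.1   # all four gates stacked along the rows
      b = np.zeros(4 * d_h)

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def lstm_cell(x, h_prev, c_prev):
          z = W @ np.concatenate([h_prev, x]) + b       # the single matrix multiplication
          f, i, o, g = np.split(z, 4)                   # slice out the four pre-activations
          f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
          c_tilde = np.tanh(g)                          # candidate memory C̃_t
          c = f * c_prev + i * c_tilde                  # cell state update
          h = o * np.tanh(c)                            # “public” hidden state
          return h, c

      h, c = np.zeros(d_h), np.zeros(d_h)
      for x in [rng.normal(size=d_x) for _ in range(5)]:
          h, c = lstm_cell(x, h, c)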
