Natural Language Processing with Deep Learning
LSTM, GRU, and applications in summarization and contextualized word embeddings
Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of Computational Perception
Agenda
• Vanishing/Exploding gradient
• RNNs with Gates: LSTM, GRU
• Contextualized word embeddings with RNNs
• Extractive summarization with RNNs
Some slides are adapted from http://web.stanford.edu/class/cs224n/
Element-wise Multiplication
§ a ⊙ b — dimensions: (1 × d) ⊙ (1 × d) = (1 × d)
  Example: [1  2  3] ⊙ [3  0  −2] = [3  0  −6]
§ A ⊙ B — dimensions: (l × m) ⊙ (l × m) = (l × m)
  Example (l = 3, m = 2):
  [[2, 3], [0, 1], [1, −1]] ⊙ [[−1, 0], [0, 2], [0.5, −1]] = [[−2, 0], [0, 2], [0.5, 1]]
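For reference, a minimal NumPy sketch of the same operation — element-wise (Hadamard) multiplication is the `*` operator on arrays of identical shape; the values are taken from the examples above:

```python
import numpy as np

# Element-wise product of two vectors of identical shape (1 x d)
a = np.array([1, 2, 3])
b = np.array([3, 0, -2])
print(a * b)          # [ 3  0 -6]

# The same rule applies to matrices of identical shape (l x m)
A = np.array([[2, 3], [0, 1], [1, -1]])
B = np.array([[-1, 0], [0, 2], [0.5, -1]])
print(A * B)          # [[-2.  0. ] [ 0.  2. ] [ 0.5 1. ]]
```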
Agenda
• Vanishing/Exploding gradient
• RNNs with Gates: LSTM, GRU
• Contextualized word embeddings with RNNs
• Extractive summarization with RNNs
Recap: Backpropagation Through Time (BPTT)
§ Unrolling the RNN (simplified)
[Figure: the RNN unrolled over time — hidden states h^(1), …, h^(t−2), h^(t−1), h^(t), each step reusing the same recurrent weights W_h, with a loss L^(t) computed at time step t]
∂L^(t)/∂W_h = ?
Recap: Backpropagation Through Time (BPTT)
[Figure: a concrete unrolled example — hidden states h^(1), h^(2), h^(3), h^(4), shared recurrent weights W_h, and the loss L^(4) at the last time step]
∂L^(4)/∂W_h = ?
Recap: Backpropagation Through Time (BPTT)
[Figure: the unrolled RNN with hidden states h^(1), …, h^(4), shared weights W_h, and loss L^(4) at the last step]
The total gradient sums the contributions of all time steps:
  ∂L^(4)/∂W_h = Σ_{j=1}^{4} ∂L^(4)/∂W_h |_(j)
with the individual contributions given by the chain rule:
  ∂L^(4)/∂W_h |_(4) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂W_h
  ∂L^(4)/∂W_h |_(3) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂W_h
  ∂L^(4)/∂W_h |_(2) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂W_h
  ∂L^(4)/∂W_h |_(1) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂h^(1) · ∂h^(1)/∂W_h
In general:
  ∂L^(t)/∂W_h |_(j) = ∂L^(t)/∂h^(t) · ∂h^(t)/∂h^(t−1) ⋯ ∂h^(j+1)/∂h^(j) · ∂h^(j)/∂W_h
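A minimal PyTorch sketch (not from the slides) illustrating BPTT: autograd accumulates exactly this sum over time steps when a loss at the last step is backpropagated through an unrolled RNN with a shared W_h. The toy inputs, sigmoid activation, and the loss used here are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
d = 4
W_h = torch.randn(d, d, requires_grad=True)     # shared recurrent weights
W_x = torch.randn(d, d)
xs = [torch.randn(1, d) for _ in range(4)]      # toy inputs x^(1)..x^(4)

h = torch.zeros(1, d)
for x in xs:                                    # unroll: h^(t) = sigmoid(h^(t-1) W_h + x^(t) W_x)
    h = torch.sigmoid(h @ W_h + x @ W_x)

loss = h.sum()                                  # a toy loss L^(4) at the last time step
loss.backward()                                 # BPTT: accumulates the contribution of every step j
print(W_h.grad.norm())                          # one gradient, summed over all time steps
```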
Vanishing/Exploding gradient
[Figure: the unrolled RNN, highlighting the chain of factors ∂h^(t)/∂h^(t−1) that the gradient passes through on its way back to early time steps]
§ In practice, the gradient with respect to each time step becomes smaller and smaller as it goes back in time → Vanishing gradient
§ Though less common, the opposite can also happen: the gradient with respect to faraway time steps becomes larger and larger → Exploding gradient
Vanishing/Exploding gradient – why?
[Figure: the unrolled RNN with the factors ∂h^(t)/∂h^(t−1) marked along the backward path]
If these gradients are small, their product gets smaller and smaller. The further back we go, the more of these factors the final gradient contains! For example, the contribution of time step 2:
  ∂L^(4)/∂W_h |_(2) = ∂L^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂W_h
Vanishing/Exploding gradient – why?
§ What is ∂h^(t)/∂h^(t−1)?
§ Recall the definition of the RNN:
  h^(t) = σ(h^(t−1) W_h + x^(t) W_x + b)
§ Let's replace the sigmoid (σ) with a simple linear activation (y = x):
  h^(t) = h^(t−1) W_h + x^(t) W_x + b
§ In this case:
  ∂h^(t)/∂h^(t−1) = W_h
Vanishing/Exploding gradient – why?
§ Recall the BPTT formula (for the simplified case):
  ∂L^(t)/∂W_h |_(j) = ∂L^(t)/∂h^(t) · ∂h^(t)/∂h^(t−1) ⋯ ∂h^(j+1)/∂h^(j) · ∂h^(j)/∂W_h
§ Given m = t − j, the BPTT formula can be rewritten as:
  ∂L^(t)/∂W_h |_(j) = ∂L^(t)/∂h^(t) · (W_h)^m · ∂h^(j)/∂W_h
  If the weights in W_h are small (i.e. the eigenvalues of W_h are smaller than 1), this term gets exponentially smaller as m grows.
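A small NumPy sketch (not from the slides) of this effect: multiplying an upstream gradient by (W_h)^m, where W_h has spectral radius below 1, shrinks it exponentially with m. The random matrix, the rescaling to spectral radius 0.9, and the toy gradient are assumptions for illustration:

```python
import numpy as np

np.random.seed(0)
d = 8
W = np.random.randn(d, d)
W_h = 0.9 * W / np.abs(np.linalg.eigvals(W)).max()   # rescale so the largest |eigenvalue| is 0.9

grad = np.random.randn(1, d)          # some upstream gradient, e.g. dL/dh at the last step
for m in range(1, 21):
    grad = grad @ W_h                 # one more factor per step further back in time
    if m % 5 == 0:
        print(m, np.linalg.norm(grad))   # the norm decays roughly like 0.9**m
```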
Why is vanishing/exploding gradient a problem?
§ Vanishing gradient
  - The gradient signal from far away "fades away" and becomes insignificant in comparison with the gradient signal from close by
  - Long-term dependencies are not captured, since model weights are updated only with respect to near effects
  → one approach to address it: RNNs with gates – LSTM, GRU
§ Exploding gradient
  - Gradients become too big → SGD update steps become too large
  - This causes (large loss values and) large parameter updates, and eventually unstable training
  → main approach to address it: Gradient clipping
Gradient clipping
§ Gradient clipping: if the norm of the gradient is greater than some threshold, scale the gradient down
§ Intuition: take a step in the same direction, but a smaller step
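A minimal sketch of gradient clipping in a training step, using PyTorch's built-in norm clipping; the model, data, loss, and the threshold of 1.0 are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 8, 32)               # toy batch: (seq_len, batch, input_size)
output, _ = model(x)
loss = output.pow(2).mean()              # toy loss

optimizer.zero_grad()
loss.backward()
# If the global gradient norm exceeds max_norm, rescale all gradients down to that norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```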
Problem with vanilla RNN – summary
§ It is too difficult for the hidden state of a vanilla RNN to learn and preserve information over several time steps
  - In particular, as new content is constantly added to the hidden state in every step
  h^(t) = σ(h^(t−1) W_h + x^(t) W_x + b)
  In every step, the input vector "adds" new content to the hidden state
Agenda • Vanishing & Exploding gradient • RNNs with Gates: LSTM, GRU • Contextualized word embeddings with LSTM • Extractive summarization with LSTM
Gate vector
§ Gate vector:
  - A vector with values between 0 and 1
  - A gate vector acts as a "gate-keeper": it controls the content flow of another vector
§ Gate vectors are typically defined using the sigmoid:
  g = σ(some vector)
  … and are applied to a vector v with element-wise multiplication to control its contents: g ⊙ v
§ For each element (feature) j of the vectors:
  - If g_j is 1 → v_j remains the same; everything passes; open gate!
  - If g_j is 0 → v_j becomes 0; nothing passes; closed gate!
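A tiny sketch of the gating idea in PyTorch; the random "some vector" and content vector are placeholders for quantities that, in an actual model, would be computed from the input and hidden state:

```python
import torch

torch.manual_seed(0)
scores = torch.randn(6)        # "some vector" (in a model: a function of the input and hidden state)
g = torch.sigmoid(scores)      # gate vector: every entry lies in (0, 1)
v = torch.randn(6)             # content vector to be controlled

print(g)                       # entries near 1 -> open gate, near 0 -> closed gate
print(g * v)                   # element-wise gating of v
```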
Long Short-Term Memory (LSTM)
§ Proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradient problem
§ LSTM exploits a new vector, the cell state c^(t), to carry the memory of previous states
  - The cell state stores long-term information
  - As in the vanilla RNN, the hidden state h^(t) is used as the output vector
§ LSTM controls the process of reading, writing, and erasing information in/from the memory states
  - These controls are done using gate vectors
  - Gates are dynamic and defined based on the input vector and the hidden state
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation (1997)
LSTM – unrolled
[Figure: the LSTM unrolled over time — at each step t, the LSTM cell takes the input x^(t) together with the previous hidden state h^(t−1) and previous cell state c^(t−1), and produces h^(t) and c^(t)]
LSTM definition – gates
§ Gates are functions of the input vector x^(t) and the previous hidden state h^(t−1):
  i^(t) = function(h^(t−1), x^(t))
  i^(t) = σ(h^(t−1) W_hi + x^(t) W_xi + b_i)
  input gate: controls what parts of the new cell content are written to the cell

  f^(t) = function(h^(t−1), x^(t))
  f^(t) = σ(h^(t−1) W_hf + x^(t) W_xf + b_f)
  forget gate: controls what is kept vs. forgotten from the previous cell state

  o^(t) = function(h^(t−1), x^(t))
  o^(t) = σ(h^(t−1) W_ho + x^(t) W_xo + b_o)
  output gate: controls what parts of the cell are output to the hidden state

Model parameters (weights) are shown in red
LSTM definition – states
  c̃^(t) = function(h^(t−1), x^(t))
  c̃^(t) = tanh(h^(t−1) W_hc + x^(t) W_xc + b_c)
  new cell content: the new content to be used for the cell and hidden (output) state

  c^(t) = f^(t) ⊙ c^(t−1) + i^(t) ⊙ c̃^(t)
  cell state: erases ("forgets") some content from the last cell state, and writes ("inputs") some new cell content

  h^(t) = o^(t) ⊙ tanh(c^(t))
  hidden state: reads ("outputs") some content from the current cell state

Model parameters (weights) are shown in red
LSTM definition – all together
  i^(t) = σ(h^(t−1) W_hi + x^(t) W_xi + b_i)      input gate: controls what parts of the new cell content are written to the cell
  f^(t) = σ(h^(t−1) W_hf + x^(t) W_xf + b_f)      forget gate: controls what is kept vs. forgotten from the previous cell state
  o^(t) = σ(h^(t−1) W_ho + x^(t) W_xo + b_o)      output gate: controls what parts of the cell are output to the hidden state
  c̃^(t) = tanh(h^(t−1) W_hc + x^(t) W_xc + b_c)   new cell content: the new content to be used for the cell and hidden (output) state
  c^(t) = f^(t) ⊙ c^(t−1) + i^(t) ⊙ c̃^(t)         cell state: erases ("forgets") some content from the last cell state, and writes ("inputs") some new cell content
  h^(t) = o^(t) ⊙ tanh(c^(t))                      hidden state: reads ("outputs") some content from the current cell state
Model parameters (weights) are shown in red
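A minimal NumPy sketch of a single LSTM step following the equations above (row-vector convention, h^(t−1) W + x^(t) W + b). The class name, dimensions, and small random initialization are illustrative assumptions, not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LSTMCellSketch:
    """One LSTM step: gates i, f, o; cell candidate c~; cell state c; hidden state h."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: 0.1 * rng.standard_normal(shape)
        # one (W_h*, W_x*, b_*) triple per gate / cell candidate
        self.W_hi, self.W_xi, self.b_i = init(hidden_dim, hidden_dim), init(input_dim, hidden_dim), np.zeros(hidden_dim)
        self.W_hf, self.W_xf, self.b_f = init(hidden_dim, hidden_dim), init(input_dim, hidden_dim), np.zeros(hidden_dim)
        self.W_ho, self.W_xo, self.b_o = init(hidden_dim, hidden_dim), init(input_dim, hidden_dim), np.zeros(hidden_dim)
        self.W_hc, self.W_xc, self.b_c = init(hidden_dim, hidden_dim), init(input_dim, hidden_dim), np.zeros(hidden_dim)

    def step(self, x_t, h_prev, c_prev):
        i_t = sigmoid(h_prev @ self.W_hi + x_t @ self.W_xi + self.b_i)      # input gate
        f_t = sigmoid(h_prev @ self.W_hf + x_t @ self.W_xf + self.b_f)      # forget gate
        o_t = sigmoid(h_prev @ self.W_ho + x_t @ self.W_xo + self.b_o)      # output gate
        c_tilde = np.tanh(h_prev @ self.W_hc + x_t @ self.W_xc + self.b_c)  # new cell content
        c_t = f_t * c_prev + i_t * c_tilde    # cell state: forget some old content, write some new
        h_t = o_t * np.tanh(c_t)              # hidden state: output part of the current cell
        return h_t, c_t

# Unrolling the cell over a toy sequence x^(1)..x^(4)
cell = LSTMCellSketch(input_dim=3, hidden_dim=5)
h, c = np.zeros(5), np.zeros(5)
for x in np.random.default_rng(1).standard_normal((4, 3)):
    h, c = cell.step(x, h, c)
print(h.shape, c.shape)   # (5,) (5,)
```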
LSTM definition – visually!
http://colah.github.io/posts/2015-08-Understanding-LSTMs/