
  1. CS546: Machine Learning in NLP (Spring 2020) http://courses.engr.illinois.edu/cs546/
     Lecture 6: RNN wrap-up
     Julia Hockenmaier, juliahmr@illinois.edu, 3324 Siebel Center
     Office hours: Monday, 11am–12:30pm

  2. Today’s class: RNN architectures
     RNNs are among the workhorses of neural NLP:
     — Basic RNNs are rarely used
     — LSTMs and GRUs are commonly used.
     What’s the difference between these variants?
     RNN odds and ends:
     — Character RNNs
     — Attention mechanisms (LSTMs/GRUs)

  3. Character RNNs and BPE
     Character RNNs:
     — Each input element is one character: ‘t’, ‘h’, ‘e’, …
     — Can be used to replace word embeddings, or to compute embeddings for rare/unknown words (in languages with an alphabet, like English…)
       see e.g. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
       (In Chinese, RNNs can be used directly on characters without word segmentation; the equivalent of “character RNNs” might be models that decompose characters into radicals/strokes.)
     Byte Pair Encoding (BPE):
     — Learn which character sequences are common in the language (‘ing’, ‘pre’, ‘at’, …)
     — Split the input into these sequences and learn embeddings for these sequences
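     The BPE merge-learning step can be made concrete in a few lines. Below is a minimal sketch (a toy example of my own, not any particular toolkit's implementation): it repeatedly merges the most frequent adjacent symbol pair in a small word list. Real BPE implementations additionally use corpus word frequencies and end-of-word markers.

```python
# Minimal BPE merge-learning sketch (illustrative only): repeatedly merge the
# most frequent adjacent symbol pair in a toy list of words.
from collections import Counter

def learn_bpe_merges(words, num_merges=10):
    # Each word is represented as a tuple of symbols (initially characters).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent pair, e.g. ('i', 'n')
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe_merges(["lowing", "going", "preset", "prelude"], num_merges=5))
```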

  4. Attention mechanisms
     Compute a probability distribution $\alpha = (\alpha_{t1}, \dots, \alpha_{tS})$ over the encoder’s hidden states $h^{(s)}$ that depends on the decoder’s current hidden state $h^{(t)}$:
         $\alpha_{ts} = \dfrac{\exp(s(h^{(t)}, h^{(s)}))}{\sum_{s'} \exp(s(h^{(t)}, h^{(s')}))}$
     Compute a weighted avg. $c^{(t)}$ of the encoder’s hidden states:
         $c^{(t)} = \sum_{s=1..S} \alpha_{ts}\, h^{(s)}$
     which then gets used together with $h^{(t)}$, e.g. in
         $o^{(t)} = \tanh(W_1 h^{(t)} + W_2 c^{(t)})$
     — Hard attention (degenerate case, non-differentiable): $\alpha$ is a one-hot vector
     — Soft attention (general case): $\alpha$ is not a one-hot vector
     Scoring functions $s(h^{(t)}, h^{(s)})$:
     — $s(h^{(t)}, h^{(s)}) = h^{(t)} \cdot h^{(s)}$: dot product (no learned parameters)
     — $s(h^{(t)}, h^{(s)}) = (h^{(t)})^T W h^{(s)}$: learn a bilinear matrix $W$
     — $s(h^{(t)}, h^{(s)}) = v^T \tanh(W_1 h^{(t)} + W_2 h^{(s)})$: concatenate hidden states (learn $W_1$, $W_2$, $v$)
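     To make the soft-attention computation concrete, here is a small NumPy sketch using the dot-product score; the function and variable names are my own and only illustrate the equations above.

```python
# Soft dot-product attention over encoder states (illustrative sketch).
import numpy as np

def soft_attention(h_dec, H_enc, W1, W2):
    """h_dec: (h,) decoder state; H_enc: (S, h) encoder states."""
    scores = H_enc @ h_dec                   # s(h_t, h_s) = dot product, shape (S,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # softmax over encoder positions
    c = alpha @ H_enc                        # weighted avg of encoder states, shape (h,)
    o = np.tanh(W1 @ h_dec + W2 @ c)         # combine context with decoder state
    return o, alpha

rng = np.random.default_rng(0)
h, S = 4, 6
o, alpha = soft_attention(rng.normal(size=h), rng.normal(size=(S, h)),
                          rng.normal(size=(h, h)), rng.normal(size=(h, h)))
print(alpha.sum())  # 1.0
```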

  5. Activation functions

  6. Recap: Activation functions
     [Plot: sigmoid 1/(1+exp(-x)), tanh(x), and max(0,x) over the range x = -3 … 3]
     Sigmoid (logistic function):
         $\sigma(x) = 1/(1 + e^{-x})$
         Returns values bound above and below, in the range $[0, 1]$
     Hyperbolic tangent:
         $\tanh(x) = (e^{2x} - 1)/(e^{2x} + 1)$
         Returns values bound above and below, in the range $[-1, +1]$
     Rectified Linear Unit:
         $\mathrm{ReLU}(x) = \max(0, x)$
         Returns values bound below, in the range $[0, +\infty)$
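     For reference, the three activations can be checked numerically in a few lines (a quick sketch, not course code):

```python
# The three activation functions from the slide, written out in NumPy.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                    # bounded in [0, 1]

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # bounded in [-1, +1]

def relu(x):
    return np.maximum(0.0, x)                          # bounded below by 0

x = np.linspace(-3, 3, 7)
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```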

  7. From RNNs to LSTMs

  8. From RNNs to LSTMs
     In vanilla (Elman) RNNs, the current hidden state $h^{(t)}$ is a nonlinear function of the previous hidden state $h^{(t-1)}$ and the current input $x^{(t)}$:
         $h^{(t)} = g(W_h[h^{(t-1)}, x^{(t)}] + b_h)$
     With $g = \tanh$ (the original definition):
     ⇒ Models suffer from the vanishing gradient problem:
       they can’t be trained effectively on long sequences.
     With $g = \mathrm{ReLU}$:
     ⇒ Models suffer from the exploding gradient problem:
       they can’t be trained effectively on long sequences.

  9. From RNNs to LSTMs
     LSTMs (Long Short-Term Memory networks) were introduced by Hochreiter and Schmidhuber (1997) to overcome this problem.
     — They introduce an additional cell state that also gets passed through the network and updated at each time step.
     — LSTMs define three different gates that read in the previous hidden state and current input to decide how much of the past hidden and cell states to keep.
     — This gating mechanism mitigates the vanishing/exploding gradient problems of traditional RNNs.

  10. Gating mechanisms
      Gates are trainable layers with a sigmoid activation function, often determined by the current input $x^{(t)}$ and the (last) hidden state $h^{(t-1)}$, e.g.:
          $g_k^{(t)} = \sigma(W_k x^{(t)} + U_k h^{(t-1)} + b_k)$
      $g^{(t)}$ is a vector of (Bernoulli) probabilities ($\forall i: 0 \le g_i \le 1$).
      Unlike traditional (0,1) gates, neural gates are differentiable (we can train them).
      A gate $g$ is combined with another vector $u$ (of the same dimensionality) by element-wise multiplication (Hadamard product): $v = g \otimes u$
      — If $g_i \approx 0$, $v_i \approx 0$, and if $g_i \approx 1$, $v_i \approx u_i$
      — Each $g_i$ is associated with its own set of trainable parameters and determines how much of $u_i$ to keep or forget
      Gates are used to form linear combinations of vectors $u$, $v$:
      — Linear interpolation (coupled gates): $w = g \otimes u + (1 - g) \otimes v$
      — Addition of two gates: $w = g_1 \otimes u + g_2 \otimes v$
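      A small numerical sketch of these gating operations (names and dimensions are my own):

```python
# A sigmoid gate g in (0,1)^n scales a vector element-wise; a coupled gate
# interpolates between two vectors.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, d = 5, 3
W_k, U_k, b_k = rng.normal(size=(n, d)), rng.normal(size=(n, n)), np.zeros(n)
x_t, h_prev = rng.normal(size=d), rng.normal(size=n)

g = sigmoid(W_k @ x_t + U_k @ h_prev + b_k)   # gate: each g_i is in (0, 1)
u, v = rng.normal(size=n), rng.normal(size=n)

print(g * u)                   # v = g ⊗ u : element-wise scaling of u
print(g * u + (1 - g) * v)     # coupled gate: linear interpolation of u and v
```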

  11. Long Short-Term Memory Networks (LSTMs)
      [Diagram of an LSTM cell; figure from https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
      At time $t$, the LSTM cell reads in:
      — a $c$-dimensional previous cell state vector $c^{(t-1)}$
      — an $h$-dimensional previous hidden state vector $h^{(t-1)}$
      — a $d$-dimensional current input vector $x^{(t)}$
      At time $t$, the LSTM cell returns:
      — a $c$-dimensional new cell state vector $c^{(t)}$
      — an $h$-dimensional new hidden state vector $h^{(t)}$ (which may also be passed to an output layer)

  12. LSTM operations
      Based on the previous cell state $c^{(t-1)}$ and hidden state $h^{(t-1)}$ and the current input $x^{(t)}$, the LSTM computes:
      1) A new intermediate cell state $\tilde{c}^{(t)}$ that depends on $h^{(t-1)}$ and $x^{(t)}$:
         $\tilde{c}^{(t)} = \tanh(W_c[h^{(t-1)}, x^{(t)}] + b_c)$
      2) Three gates (which each depend on $h^{(t-1)}$ and $x^{(t)}$):
         a) The forget gate $f^{(t)} = \sigma(W_f[h^{(t-1)}, x^{(t)}] + b_f)$ decides how much of the last cell state $c^{(t-1)}$ to remember in the new cell state: $f^{(t)} \otimes c^{(t-1)}$
         b) The input gate $i^{(t)} = \sigma(W_i[h^{(t-1)}, x^{(t)}] + b_i)$ decides how much of the intermediate cell state $\tilde{c}^{(t)}$ to use in the new cell state: $i^{(t)} \otimes \tilde{c}^{(t)}$
         c) The output gate $o^{(t)} = \sigma(W_o[h^{(t-1)}, x^{(t)}] + b_o)$ decides how much of the new cell state $c^{(t)}$ to use in the new hidden state: $h^{(t)} = o^{(t)} \otimes \tanh(c^{(t)})$
      3) The new cell state $c^{(t)} = f^{(t)} \otimes c^{(t-1)} + i^{(t)} \otimes \tilde{c}^{(t)}$ is a linear combination of the cell states $c^{(t-1)}$ and $\tilde{c}^{(t)}$ that depends on the forget gate $f^{(t)}$ and the input gate $i^{(t)}$
      4) The new hidden state $h^{(t)} = o^{(t)} \otimes \tanh(c^{(t)})$
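      One LSTM step, written out directly from the equations above as a NumPy sketch (toy sizes and random weights of my own, not a reference implementation):

```python
# One LSTM step in the concatenated-input formulation used on the slide.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(c_prev, h_prev, x_t, W_c, b_c, W_f, b_f, W_i, b_i, W_o, b_o):
    hx = np.concatenate([h_prev, x_t])          # [h^(t-1), x^(t)]
    c_tilde = np.tanh(W_c @ hx + b_c)           # intermediate cell state
    f = sigmoid(W_f @ hx + b_f)                 # forget gate
    i = sigmoid(W_i @ hx + b_i)                 # input gate
    o = sigmoid(W_o @ hx + b_o)                 # output gate
    c = f * c_prev + i * c_tilde                # new cell state
    h = o * np.tanh(c)                          # new hidden state
    return c, h

rng = np.random.default_rng(0)
n, d = 6, 4                                     # hidden/cell size, input size
params = [rng.normal(scale=0.1, size=(n, n + d)) if k % 2 == 0 else np.zeros(n)
          for k in range(8)]                    # W_c, b_c, W_f, b_f, W_i, b_i, W_o, b_o
c, h = lstm_step(np.zeros(n), np.zeros(n), rng.normal(size=d), *params)
print(c.shape, h.shape)                         # (6,) (6,)
```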

  13. LSTM summary
      Based on $c^{(t-1)}$, $h^{(t-1)}$, and $x^{(t)}$, the LSTM computes:
      — Intermediate cell state: $\tilde{c}^{(t)} = \tanh(W_c[h^{(t-1)}, x^{(t)}] + b_c)$
      — Forget gate: $f^{(t)} = \sigma(W_f[h^{(t-1)}, x^{(t)}] + b_f)$
      — Input gate: $i^{(t)} = \sigma(W_i[h^{(t-1)}, x^{(t)}] + b_i)$
      — New (final) cell state: $c^{(t)} = f^{(t)} \otimes c^{(t-1)} + i^{(t)} \otimes \tilde{c}^{(t)}$
      — Output gate: $o^{(t)} = \sigma(W_o[h^{(t-1)}, x^{(t)}] + b_o)$
      — New hidden state: $h^{(t)} = o^{(t)} \otimes \tanh(c^{(t)})$
      $c^{(t)}$ and $h^{(t)}$ are passed on to the next time step.

  14. Gated Recurrent Units (GRUs)
      Cho et al. (2014), Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. https://arxiv.org/pdf/1406.1078.pdf

  15. GRU definition
      Based on $h^{(t-1)}$ and $x^{(t)}$, the GRU computes:
      — a reset gate $r^{(t)}$ to determine how much of $h^{(t-1)}$ to keep in $\tilde{h}^{(t)}$:
        $r^{(t)} = \sigma(W_r x^{(t)} + U_r h^{(t-1)} + b_r)$
      — an intermediate hidden state $\tilde{h}^{(t)}$ that depends on $r^{(t)} \otimes h^{(t-1)}$ and $x^{(t)}$:
        $\tilde{h}^{(t)} = \phi(W_h x^{(t)} + U_h(r^{(t)} \otimes h^{(t-1)}) + b_h)$
      — an update gate $z^{(t)}$ to determine how much of $h^{(t-1)}$ to keep in $h^{(t)}$:
        $z^{(t)} = \sigma(W_z x^{(t)} + U_z h^{(t-1)} + b_z)$
      — a new hidden state $h^{(t)}$ as a linear interpolation of $h^{(t-1)}$ and $\tilde{h}^{(t)}$, with weights determined by the update gate $z^{(t)}$:
        $h^{(t)} = z^{(t)} \otimes h^{(t-1)} + (1 - z^{(t)}) \otimes \tilde{h}^{(t)}$
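      The corresponding GRU step as a NumPy sketch (assuming $\phi = \tanh$; toy sizes and random weights are my own choices):

```python
# One GRU step, following the equations above with phi = tanh.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, W_r, U_r, b_r, W_h, U_h, b_h, W_z, U_z, b_z):
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev) + b_h)  # intermediate hidden state
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate
    return z * h_prev + (1 - z) * h_tilde                    # interpolate old and new state

rng = np.random.default_rng(0)
n, d = 6, 4
W = lambda rows, cols: rng.normal(scale=0.1, size=(rows, cols))
h = gru_step(np.zeros(n), rng.normal(size=d),
             W(n, d), W(n, n), np.zeros(n),
             W(n, d), W(n, n), np.zeros(n),
             W(n, d), W(n, n), np.zeros(n))
print(h.shape)  # (6,)
```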

  16. Expressive power of RNN, LSTM, GRU
      Weiss, Goldberg, Yahav (2018), On the Practical Computational Power of Finite Precision RNNs for Language Recognition. https://www.aclweb.org/anthology/P18-2117.pdf

  17. Models
      Basic RNNs:
      — Simple (Elman) SRNN: $h^{(t)} = \tanh(W x^{(t)} + U h^{(t-1)} + b)$
      — IRNN: $h^{(t)} = \mathrm{ReLU}(W x^{(t)} + U h^{(t-1)} + b)$
      Gated RNNs (GRUs and LSTMs):
      — Gates $g_k^{(t)} = \sigma(W_k x^{(t)} + U_k h^{(t-1)} + b_k)$: each element is a probability
        NB: a gate can return 0 or 1 by setting its matrices to 0 and b=0 or b=1
      GRU, with gates $r^{(t)}, z^{(t)}$:
      — hidden state: $\tilde{h}^{(t)} = \tanh(W_h x^{(t)} + U_h(r^{(t)} \otimes h^{(t-1)}) + b_h)$,
        $h^{(t)} = z^{(t)} \otimes h^{(t-1)} + (1 - z^{(t)}) \otimes \tilde{h}^{(t)}$
        NB: the GRU reduces to the SRNN with $r = 1$, $z = 0$
      LSTM, with gates $f^{(t)}, i^{(t)}, o^{(t)}$:
      — memory cell: $\tilde{c}^{(t)} = \tanh(W_c x^{(t)} + U_c h^{(t-1)} + b_c)$,
        $c^{(t)} = f^{(t)} \otimes c^{(t-1)} + i^{(t)} \otimes \tilde{c}^{(t)}$
      — hidden state: $h^{(t)} = o^{(t)} \otimes \phi(c^{(t)})$ for $\phi$ = identity or tanh
        NB: the LSTM reduces to the SRNN with $f = 0$, $i = 1$, $o = 1$
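      The LSTM reduction noted above can be checked numerically: with the gates pinned to $f = 0$, $i = 1$, $o = 1$ and $\phi$ = identity, one LSTM step collapses to the Elman update $\tanh(W_c x^{(t)} + U_c h^{(t-1)} + b_c)$. A toy check (sketch of my own, arbitrary sizes and weights):

```python
# Verify that an LSTM step with fixed gates f=0, i=1, o=1 and phi=identity
# equals one SRNN (Elman) step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
W_c, U_c, b_c = rng.normal(size=(n, d)), rng.normal(size=(n, n)), rng.normal(size=n)
h_prev, c_prev, x_t = rng.normal(size=n), rng.normal(size=n), rng.normal(size=d)

f, i, o = np.zeros(n), np.ones(n), np.ones(n)           # fixed gate values
c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)
c = f * c_prev + i * c_tilde                            # = c_tilde
h_lstm = o * c                                          # phi = identity
h_srnn = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)        # Elman SRNN step

print(np.allclose(h_lstm, h_srnn))  # True
```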
