
Lecture 16: Language Model
CS109B Data Science 2
Pavlos Protopapas, Mark Glickman, and Chris Tanner

Outline: Language Modelling; RNNs/LSTMs + ELMo; Seq2Seq + Attention; Transformers + BERT; Conclusions


  1. Language Modelling: neural networks
IDEA: Let's use a neural network! First, each word is represented by a word embedding (e.g., a vector of length 200).
• Each circle is a specific floating-point scalar
• Words that are more semantically similar to one another will have embeddings that are proportionally similar, too
• We can use pre-existing word embeddings that have been trained on gigantic corpora
(Figure: example embedding vectors for "man", "woman", and "table")

  2. Language Modelling: neural networks
These word embeddings are so rich that you get nice properties:
king − man + woman ≈ queen
Word2vec: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
GloVe: https://www.aclweb.org/anthology/D14-1162.pdf
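A rough illustration of the analogy property (not from the slides; the tiny 4-dimensional vectors below are invented for the example, whereas real Word2vec/GloVe embeddings have hundreds of dimensions and are learned from corpora):

```python
import numpy as np

# Toy embeddings (made up for illustration; real ones come from Word2vec/GloVe).
emb = {
    "king":  np.array([0.8, 0.7, 0.1, 0.9]),
    "man":   np.array([0.9, 0.1, 0.1, 0.8]),
    "woman": np.array([0.9, 0.1, 0.9, 0.1]),
    "queen": np.array([0.8, 0.7, 0.9, 0.2]),
    "table": np.array([0.1, 0.9, 0.2, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # expected: 'queen'
```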

  3. Language Modelling: neural networks
How can we use these embeddings to build a LM? Remember, we only need a system that can estimate the probability of the next word given the previous words:
P(x_{t+1} | x_t, x_{t−1}, …, x_1)
Example input sentence: "She went to ___" (next word: "class")

  4. Language Modelling: Feed-forward Neural Net
Neural Approach #1: Feed-forward Neural Net
General Idea: using windows of words, predict the next word.
(Diagram: the input window "She went to" feeds a hidden layer, which feeds an output layer that should predict "class?")

  5. Language Modelling: Feed-forward Neural Net
Neural Approach #1: Feed-forward Neural Net
General Idea: using windows of words, predict the next word.
Output layer: ŷ = softmax(U·h + b_2) ∈ ℝ^{|vocab|}
Hidden layer: h = f(W·x + b_1)
Input: x = [x_1, x_2, x_3] (concatenated word embeddings)
Example input window: "She went to" → predicted next word: "class?"

  6. Language Modelling: Feed-forward Neural Net
Same model (equations as on slide 5), with the window slid by one word: input window "went to class" → predicted next word: "after"

  7. Language Modelling: Feed-forward Neural Net
Same model, window slid again: input window "to class after" → predicted next word: "visiting"

  8. Language Modelling: Feed-forward Neural Net
Same model, window slid again: input window "class after visiting" → predicted next word: "her"

  9. Language Modelling: Feed-forward Neural Net
Same model, window slid again: input window "after visiting her" → predicted next word: "grandma"
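A minimal NumPy sketch of the fixed-window model from slides 5–9 (the vocabulary size, embedding dimension, hidden size, and random initialization below are made up for illustration, and no training is shown): the three context embeddings are concatenated, pushed through one hidden layer, and a softmax over the vocabulary gives the next-word distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 10_000, 200, 500           # vocab size, embedding dim, hidden dim (assumed)

E  = rng.normal(scale=0.01, size=(V, d))      # word-embedding lookup table
W  = rng.normal(scale=0.01, size=(H, 3 * d))  # input -> hidden (window of 3 words)
b1 = np.zeros(H)
U  = rng.normal(scale=0.01, size=(V, H))      # hidden -> output
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(window_ids):
    """window_ids: indices of the 3 context words, e.g. 'She went to'."""
    x = np.concatenate([E[i] for i in window_ids])  # concatenated embeddings
    h = np.tanh(W @ x + b1)                         # hidden layer
    return softmax(U @ h + b2)                      # distribution over the whole vocab

probs = next_word_distribution([17, 42, 7])   # arbitrary word ids, for illustration
print(probs.shape, probs.sum())               # (10000,) 1.0
```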

  10. Language Modelling: Feed-forward Neural Net
FFNN STRENGTHS? FFNN ISSUES?

  11. Language Modelling: Feed-forward Neural Net
FFNN STRENGTHS?
• No sparsity issues (it's okay if we've never seen a segment of words)
• No storage issues (we never store counts)
FFNN ISSUES?
• A fixed window size can never be big enough; we need more context
• Increasing the window size adds many more weights
• The weights awkwardly handle word position
• No concept of time
• Requires inputting the entire context just to predict one word

  12. Language Modelling
We especially need a system that:
• Has an "infinite" concept of the past, not just a fixed window
• For each new input, outputs the most likely next event (e.g., word)

  13. Outline: Language Modelling; RNNs/LSTMs + ELMo; Seq2Seq + Attention; Transformers + BERT; Conclusions


  15. Language Modelling
IDEA: for every individual input, output a prediction. Let's use the previous hidden state, too.
Output layer: ŷ = softmax(U·h + b_2) ∈ ℝ^{|vocab|}
Hidden layer: h = f(W·x + b_1)
Input: x = x_1 (a single word embedding)
Example: input word "She" → predicted next word: "went"

  16. Language Modelling: RNNs
Neural Approach #2: Recurrent Neural Network (RNN)
(Diagram: inputs x_1 … x_4 feed the hidden layer through W; the hidden layer feeds outputs ŷ_1 … ŷ_4 through U; recurrent connections V link each hidden state to the next)

  17. Language Modelling: RNNs
We have seen this abstract view in Lecture 15: the recurrent loop V conveys that the current hidden layer is influenced by the hidden layer from the previous time step.
(Diagram: input x_t → hidden layer with recurrent loop V → output ŷ_t)
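A minimal NumPy sketch of one recurrent step (the sizes, initialization, and tanh nonlinearity are assumptions; no training is shown): the same W, V, and U are reused at every time step, and the previous hidden state is mixed back in through the recurrent matrix V.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, H = 10_000, 200, 500                  # assumed vocab, embedding, hidden sizes

W  = rng.normal(scale=0.01, size=(H, d))    # input  -> hidden
Vr = rng.normal(scale=0.01, size=(H, H))    # hidden -> hidden (the recurrent loop)
U  = rng.normal(scale=0.01, size=(V, H))    # hidden -> output
b, c = np.zeros(H), np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev):
    """One time step: returns (next-word distribution, new hidden state)."""
    h_t = np.tanh(W @ x_t + Vr @ h_prev + b)
    y_hat = softmax(U @ h_t + c)
    return y_hat, h_t

h = np.zeros(H)                             # initial hidden state
for x_t in rng.normal(size=(4, d)):         # 4 fake embeddings: "She went to class"
    y_hat, h = rnn_step(x_t, h)
```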

  18. RNN (review): Training Process
The error at each time step is the cross-entropy between the true next word and the predicted distribution:
CE(y^(t), ŷ^(t)) = − Σ_{w ∈ vocab} y_w^(t) · log(ŷ_w^(t))
(Diagram: inputs "She went to class" produce predictions ŷ_1 … ŷ_4, each scored against the true next word)

  19. RNN (review): Training Process
During training, regardless of our output predictions, we feed in the correct inputs at every time step.

  20. RNN (review): Training Process
Our total loss is simply the average loss across all T time steps.
(Diagram: the per-step predictions "went? after? class? over?" are each compared against the true next word)
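A small illustration of that loss (the predicted distributions and the 5-word toy vocabulary below are fabricated for the example): each time step contributes one cross-entropy term, and the total loss is their average.

```python
import numpy as np

def cross_entropy(y_true_id, y_hat):
    """CE(y, y_hat) = -sum_w y_w log(y_hat_w); y is one-hot, so only one term survives."""
    return -np.log(y_hat[y_true_id] + 1e-12)

# Fake predictions for 4 time steps over a 5-word toy vocabulary.
y_hats = np.array([[0.1, 0.6, 0.1, 0.1, 0.1],   # step 1: should predict word 1 ("went")
                   [0.2, 0.2, 0.4, 0.1, 0.1],   # step 2: should predict word 2 ("to")
                   [0.1, 0.1, 0.1, 0.6, 0.1],   # step 3: should predict word 3 ("class")
                   [0.3, 0.2, 0.2, 0.2, 0.1]])  # step 4: should predict word 0 ("<EOS>")
targets = [1, 2, 3, 0]                          # the correct next words

total_loss = np.mean([cross_entropy(t, y) for t, y in zip(targets, y_hats)])
print(total_loss)   # average loss across all T time steps
```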

  21. RNN (review): Training Process
To update our weights (e.g., the recurrent matrix V), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., ∂L/∂V). Using the chain rule, we trace the derivative all the way back to the beginning, summing the results.

  22. RNN (review): Training Process
(Diagram: the same unrolled RNN over "She went to class", with the cross-entropy loss shown at each time step)

  23. RNN (review): Training Process
∂L/∂V: the chain rule first reaches the most recent copy of the recurrent matrix (V_3).

  24. RNN (review): Training Process
∂L/∂V: the gradient then flows back through the next copy (V_2).

  25. RNN (review): Training Process
∂L/∂V: and finally back through V_1, all the way to the first time step.

  26. RNN (review)
• This backpropagation through time (BPTT) process is expensive
• Instead of updating after every time step, we tend to do so every T steps (e.g., every sentence or paragraph)
• This isn't equivalent to using only a window of size T (a la n-grams) because we still have "infinite memory"
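A hedged PyTorch sketch of this "update every T steps" idea, commonly called truncated BPTT (the model, the fake data, and the chunk length below are all assumptions, not from the slides): gradients are backpropagated only through the current chunk, but the hidden state is carried, detached, into the next chunk, so the model still conditions on the whole history.

```python
import torch
import torch.nn as nn

vocab, d, H, chunk = 100, 32, 64, 20                # assumed sizes; chunk = "T steps"
rnn = nn.RNN(d, H, batch_first=True)
head = nn.Linear(H, vocab)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

seq = torch.randn(1, 1000, d)                       # one long fake embedded sequence
targets = torch.randint(vocab, (1, 1000))           # fake next-word ids

h = torch.zeros(1, 1, H)
for s in range(0, 1000, chunk):
    out, h = rnn(seq[:, s:s + chunk], h)            # forward over one chunk
    loss = loss_fn(head(out).reshape(-1, vocab),
                   targets[:, s:s + chunk].reshape(-1))
    opt.zero_grad()
    loss.backward()                                 # backprop only through this chunk
    opt.step()                                      # one update per chunk
    h = h.detach()                                  # keep the memory, cut the graph
```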

  27. RNN: Generation
We can generate the most likely next event (e.g., word) by sampling from ŷ.
Continue until we generate the <EOS> symbol.

  28. RNN: Generation
We can generate the most likely next event (e.g., word) by sampling from ŷ. Continue until we generate the <EOS> symbol.
(Diagram: input <START> → hidden layer → sampled output: "Sorry")

  29. RNN: Generation
(Diagram: each sampled word is fed back in as the next input: <START> → "Sorry" → "Harry" → "shouted," → "panicking")

  30. RNN: Generation
NOTE: the same input (e.g., "Harry") can easily yield different outputs, depending on the context (unlike FFNNs and n-grams).
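A schematic of the sampling loop (everything here, the toy vocabulary, the untrained random weights, and the <START> handling, is a placeholder for illustration): feed the previously sampled word back in, sample the next word from ŷ, and stop at <EOS>.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["<EOS>", "Sorry", "Harry", "shouted", "panicking"]   # toy vocabulary
V, d, H = len(vocab), 8, 16
E = rng.normal(size=(V, d))                        # toy embeddings
W, Vr, U = (rng.normal(scale=0.5, size=s) for s in [(H, d), (H, H), (V, H)])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x, h):
    h = np.tanh(W @ x + Vr @ h)
    return softmax(U @ h), h

def generate(max_len=20):
    h, word_id, out = np.zeros(H), rng.integers(V), []   # crude stand-in for <START>
    for _ in range(max_len):
        y_hat, h = rnn_step(E[word_id], h)
        word_id = rng.choice(V, p=y_hat)           # sample the next word from y_hat
        if vocab[word_id] == "<EOS>":
            break
        out.append(vocab[word_id])
    return " ".join(out)

print(generate())   # gibberish here; trained weights would give Harry-Potter-like text
```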

  31. RNN: Generation
When trained on Harry Potter text, it generates:
Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

  32. RNN: Generation
When trained on recipes, it generates:
Source: https://gist.github.com/nylki/1efbaa36635956d35bcc

  33. RNNs: Overview
RNN STRENGTHS?
• Can handle infinite-length sequences (not just a fixed window)
• Has a "memory" of the context (thanks to the hidden layer's recurrent loop)
• The same weights are used for all inputs, so word order isn't handled awkwardly (unlike the FFNN)
RNN ISSUES?
• Slow to train (BPTT)
• Due to the "infinite sequence", gradients can easily vanish or explode
• Has trouble actually making use of long-range context

  34. RNNs: Overview (repeat of slide 33)

  35. RNNs: Vanishing and Exploding Gradients (review)
Question: what is ∂L_4/∂h_1, the gradient of the loss at the last time step with respect to the earliest hidden state?
(Diagram: unrolled RNN over "She went to class", with the loss CE(y_4, ŷ_4) at the final step and recurrent connections between the four hidden states)

  36. RNNs: Vanishing and Exploding Gradients (review)
Apply the chain rule one step at a time: ∂L_4/∂h_1 = (∂L_4/∂h_3) · …

  37. RNNs: Vanishing and Exploding Gradients (review)
∂L_4/∂h_1 = (∂L_4/∂h_3) · (∂h_3/∂h_2) · …

  38. RNNs: Vanishing and Exploding Gradients (review)
∂L_4/∂h_1 = (∂L_4/∂h_3) · (∂h_3/∂h_2) · (∂h_2/∂h_1): a product with one factor per time step we walk back through.
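This product structure is what makes long-range gradients misbehave. A rough numerical illustration (not from the slides; the scaled identity matrices below are toy stand-ins for the per-step Jacobians): when the repeated factor is contractive the gradient norm collapses toward zero, and when it is expansive the norm blows up.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=50)                 # some upstream gradient vector

for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    g = grad.copy()
    J = scale * np.eye(50)                 # stand-in for a repeated recurrent Jacobian
    for _ in range(20):                    # 20 time steps back through the chain rule
        g = J @ g
    print(label, np.linalg.norm(g))        # ~1e-6 * ||grad|| vs ~3e3 * ||grad||
```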

  39. RNNs: Vanishing and Exploding Gradients (review)
To address RNNs' finicky nature with long-range context, we turned to an RNN variant called the LSTM (long short-term memory).
But first, let's recap what we've learned so far.

  40. Sequential Modelling (so far)
n-grams, e.g., P(went | She) = count(She went) / count(She):
• Basic counts; fast (in theory)
• Fixed window size
• Sparsity and storage issues
• Not robust
FFNN:
• Robust to rare words
• Fixed window size
• Weirdly handles context positions
• No "memory" of the past
RNN:
• Handles infinite context
• Kind of robust… almost
• Slow
• Difficulty with long context

  41. Outline: Language Modelling; RNNs/LSTMs + ELMo; Seq2Seq + Attention; Transformers + BERT; Conclusions


  43. Long short-term memory (LSTM)
• A type of RNN that is designed to better handle long-range dependencies
• In "vanilla" RNNs, the hidden state is perpetually being rewritten
• In addition to a traditional hidden state h, let's have a dedicated memory cell c for long-term events. More power to relay sequence info.

  44. Inside an LSTM Hidden Layer
(Diagram: the cell state c_{t−1} → c_t → c_{t+1} runs alongside the hidden state h_{t−1} → h_t → h_{t+1})
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  45. Inside an LSTM Hidden Layer
Some old memories are "forgotten".
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  46. Inside an LSTM Hidden Layer
Some old memories are "forgotten"; some new memories are made.
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  47. Inside an LSTM Hidden Layer
Some old memories are "forgotten"; some new memories are made; a nonlinear, weighted version of the long-term memory becomes our short-term memory.
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

  48. Inside an LSTM Hidden Layer
Some old memories are "forgotten"; some new memories are made; a nonlinear, weighted version of the long-term memory becomes our short-term memory. Memory is written, erased, and read by three gates, which are influenced by x and h.
Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
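A NumPy sketch of one LSTM step following the standard gate equations (the sizes and initialization are made up; f, i, and o mirror the forget, input, and output gates described above, and no training is shown):

```python
import numpy as np

rng = np.random.default_rng(2)
d, H = 200, 500                                   # assumed embedding and hidden sizes
def mat(): return rng.normal(scale=0.01, size=(H, d + H))
Wf, Wi, Wo, Wc = mat(), mat(), mat(), mat()
bf, bi, bo, bc = (np.zeros(H) for _ in range(4))

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])             # gates see the input and old hidden state
    f = sigmoid(Wf @ z + bf)                      # forget gate: which old memories to keep
    i = sigmoid(Wi @ z + bi)                      # input gate: which new memories to write
    o = sigmoid(Wo @ z + bo)                      # output gate: what to expose as h_t
    c_tilde = np.tanh(Wc @ z + bc)                # candidate new memory
    c_t = f * c_prev + i * c_tilde                # long-term memory cell
    h_t = o * np.tanh(c_t)                        # short-term / working memory
    return h_t, c_t

h = c = np.zeros(H)
for x_t in rng.normal(size=(4, d)):               # fake embeddings for 4 words
    h, c = lstm_step(x_t, h, c)
```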

  49. Inside an LSTM Hidden Layer
It's still possible for LSTMs to suffer from vanishing/exploding gradients, but it's way less likely than with vanilla RNNs:
• If RNNs wish to preserve info over long contexts, they must delicately find a recurrent weight matrix W_h that isn't too large or small
• However, LSTMs have 3 separate mechanisms that adjust the flow of information (e.g., the forget gate, if turned off, will preserve all info)

  50. Long short-term memory (LSTM)
LSTM STRENGTHS?
• Almost always outperforms vanilla RNNs
• Captures long-range dependencies shockingly well
LSTM ISSUES?
• Has more weights to learn than vanilla RNNs; thus,
• Requires a moderate amount of training data (otherwise, vanilla RNNs are better)
• Can still suffer from vanishing/exploding gradients

  51. Sequential Modelling
(Diagram: the four approaches side by side: n-grams, e.g., P(went | She) = count(She went) / count(She); FFNN; RNN; LSTM)

  52. Sequential Modelling
IMPORTANT: If your goal isn't to predict the next item in a sequence, but rather to do some other classification or regression task using the sequence, then you can:
• Train an aforementioned model (e.g., an LSTM) as a language model
• Use the hidden layers that correspond to each item in your sequence

  53. Sequential Modelling
1. Train a LM to learn hidden-layer embeddings
2. Use those hidden-layer embeddings for other tasks (e.g., feed them to a separate model that outputs a sentiment score)

  54. Sequential Modelling
Or jointly learn the hidden embeddings toward a particular task (end-to-end): the input layer, the recurrent hidden layer, a second hidden layer, and the output (sentiment score) are all trained together.
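A hedged PyTorch sketch of the end-to-end variant (the layer sizes, the choice to use the final hidden state, and the sigmoid sentiment head are assumptions, not the slides' exact architecture): the recurrent encoder and the task head are trained jointly. For the two-stage variant on slide 53, you would instead pretrain the LSTM as a language model, freeze it, and feed its per-token hidden states to a separate classifier.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size=10_000, d=200, H=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTM(d, H, batch_first=True)
        self.head = nn.Linear(H, 1)               # sentiment score from the hidden state

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)                # h_n: final hidden state, (1, batch, H)
        return torch.sigmoid(self.head(h_n[-1]))  # one score per sequence

model = SentimentLSTM()
fake_batch = torch.randint(10_000, (4, 12))       # 4 fake sentences of 12 tokens
print(model(fake_batch).shape)                    # torch.Size([4, 1])
```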

  55. You now have the foundation for modelling sequential data. Most state-of-the-art advances are based on those core RNN/LSTM ideas. But, with tens of thousands of researchers and hackers exploring deep learning, there are many tweaks that have proven useful. (This is where things get crazy.)

  56. Bi-directional (review)
(Diagram: a bi-directional RNN combines a previous state from the left-to-right pass with a previous state from the right-to-left pass; a compact symbol for a BRNN is introduced)

  57. Bi-directional (review)
RNNs/LSTMs use the left-to-right context and sequentially process data.
If you have full access to the data at testing time, why not make use of the flow of information from right-to-left, also?

  58. RNN Extensions: Bi-directional LSTMs (review)
For brevity, let's use the following schematic to represent an RNN.
(Diagram: inputs x_1 … x_4 → hidden states h_1 … h_4)

  59. RNN Extensions: Bi-directional LSTMs (review)
(Diagram: two RNNs over the same inputs x_1 … x_4: one producing forward hidden states left-to-right, the other producing backward hidden states right-to-left)

  60. RNN Extensions: Bi-directional LSTMs (review)
Concatenate the hidden layers: at each time step, the forward and backward hidden states are concatenated.

  61. RNN Extensions: Bi-directional LSTMs (review)
(Diagram: the concatenated forward and backward hidden states at each time step feed the output layer, producing ŷ_1 … ŷ_4)
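In a framework this concatenation is usually a single flag; a PyTorch sketch (sizes arbitrary, for illustration): with bidirectional=True the forward and backward hidden states are concatenated at each time step, so the per-step representation has size 2H.

```python
import torch
import torch.nn as nn

d, H = 32, 64
bilstm = nn.LSTM(input_size=d, hidden_size=H, batch_first=True, bidirectional=True)
head = nn.Linear(2 * H, 10_000)                 # per-step prediction over a toy vocab

x = torch.randn(1, 4, d)                        # one fake 4-word sequence
out, _ = bilstm(x)                              # out: (1, 4, 2H) = [forward; backward]
print(out.shape, head(out).shape)               # (1, 4, 128) (1, 4, 10000)
```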

  62. RNN Extensions: Bi-directional LSTMs (review)
BI-LSTM STRENGTHS?
• Usually performs at least as well as uni-directional RNNs/LSTMs
BI-LSTM ISSUES?
• Slower to train
• Only possible if access to the full data is allowed

  63. Deep RNN (review)
LSTM units can be arranged in layers, so that the output of each unit is the input to the units in the next layer. This is called a deep RNN, where the adjective "deep" refers to these multiple layers.
• Each layer feeds the LSTM on the next layer
• The first time step of a feature is fed to the first LSTM, which processes that data and produces an output (and a new state for itself)
• That output is fed to the next LSTM, which does the same thing, and the next, and so on
• Then the second time step arrives at the first LSTM, and the process repeats
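A PyTorch sketch of this stacking (depth and sizes are arbitrary, for illustration): num_layers=3 builds a deep RNN in which each layer's per-step outputs are the next layer's inputs.

```python
import torch
import torch.nn as nn

deep_lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=3, batch_first=True)
x = torch.randn(8, 20, 32)                      # batch of 8 sequences, 20 steps each
out, (h_n, c_n) = deep_lstm(x)
print(out.shape)   # (8, 20, 64)  outputs of the top layer at every time step
print(h_n.shape)   # (3, 8, 64)   final hidden state of each of the 3 layers
```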
