

  1. CS5242 Neural Networks and Deep Learning Lecture 09: RNN Applications II Wei WANG TA: Yao SHU, Juncheng LIU, Qingpeng Cai cs5242@comp.nus.edu.sg

  2. Recap
  • Language modelling
    • Training
      • Model the joint probability
      • Model the conditional probability
      • Train RNN for predicting the next word
    • Inference
      • Greedy search vs. beam search
  • Image caption generation
    • CNN --- image feature
    • RNN --- word generation

  3. Agenda • Machine translation • Attention modelling • Transformer * • Question answering • Colab tutorial

  4. RNN Architectures
  • Example tasks: sentiment analysis, language modelling, image caption, machine translation, question answering
  Image source: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

  5. Machine translation [19]
  • Given a sentence in one language, e.g. English: Singapore MRT is not stable
  • Return a sentence in another language, e.g. Chinese: 新加坡地铁不稳定
  • Training: $\max_{\Theta} \sum_{\langle x, y \rangle} \log P(y_1, y_2, \dots, y_m \mid x_1, x_2, \dots, x_n; \Theta)$
  • Inference: $\max_{\langle y_1, y_2, \dots, y_m \rangle} \log P(y_1, y_2, \dots, y_m \mid x_1, x_2, \dots, x_n)$, i.e. search for the most probable output sequence
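
To make the training and inference formulas above concrete, here is a minimal, hedged NumPy sketch (not from the lecture): `cond_prob` is a hypothetical stand-in for whatever model produces $P(y_t \mid y_{<t}, x)$, such as the RNN decoders discussed later, and all sizes and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5                                   # toy target vocabulary (assumed); id 0 = <END>
EMB = rng.normal(size=(VOCAB, 4))           # toy word embeddings (assumed)
W_OUT = rng.normal(size=(4, VOCAB))

def cond_prob(x_ids, y_prefix):
    """Hypothetical stand-in for an RNN decoder: P(next word | y_<t, x)."""
    state = EMB[list(x_ids) + list(y_prefix)].sum(axis=0)   # crude "summary" of x and y_<t
    logits = state @ W_OUT
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Training objective for one pair <x, y>: log P(y_1..y_m | x_1..x_n) as a sum of
# conditional log-probabilities; maximise its sum over all training pairs w.r.t. Theta.
def sentence_log_prob(x_ids, y_ids):
    return sum(np.log(cond_prob(x_ids, y_ids[:t])[y_ids[t]]) for t in range(len(y_ids)))

# Greedy approximation of the inference max: pick the most probable next word each step.
def greedy_decode(x_ids, max_len=10, end_id=0):
    y = []
    for _ in range(max_len):
        nxt = int(np.argmax(cond_prob(x_ids, y)))
        if nxt == end_id:
            break
        y.append(nxt)
    return y

print(sentence_log_prob([3, 1, 4], [2, 2, 0]))
print(greedy_decode([3, 1, 4]))
```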

  6. Sequence to sequence model
  • Seq2Seq: $P(y_1, y_2, \dots, y_m \mid x_1, x_2, \dots, x_n) = P(y_1, y_2, \dots, y_m \mid S)$
  • S is a summary of the input
  [Figure: the encoder RNN reads the input A B C <END> and produces the summary S; the decoder RNN starts from S and (START) and emits W X Y Z <END>, feeding each output back as the next input.]

  7. Sequence to Sequence [19]
  • Seq2Seq: $P(y_1, y_2, \dots, y_m \mid x_1, x_2, \dots, x_n) = P(y_1, y_2, \dots, y_m \mid S)$
    • S is a summary of the input
  • Encoder and Decoder are two RNN networks
    • They have their own parameters
    • End-to-end training
  • Reverse the input sequence: $x_1, x_2, \dots, x_n \rightarrow x_n, x_{n-1}, \dots, x_1$
    • Why an overall better output sequence? Because $x_1$ is near $y_1$, $x_2$ is near $y_2$, …
  • Multiple stacks of RNN → better output sequence
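
A minimal NumPy sketch of the encoder-decoder structure above; the toy dimensions, vocabulary, and randomly initialised (untrained) weights are assumptions, so this only illustrates how the summary $S$ links two RNNs that have their own parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, vocab = 4, 8, 6          # toy sizes (assumptions)

# Separate parameter sets for the encoder and the decoder, trained end-to-end in practice.
U_enc, W_enc = rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h))
U_dec, W_dec = rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h))
V_out = rng.normal(size=(d_h, vocab))
embed = rng.normal(size=(vocab, d_in))

def encode(xs):
    """Run the encoder RNN over the input; the final hidden state is the summary S."""
    h = np.zeros(d_h)
    for x in xs:                       # optionally iterate over reversed(xs), as on the slide
        h = np.tanh(x @ U_enc + h @ W_enc)
    return h

def decode(S, start_id=0, end_id=1, max_len=10):
    """Greedy decoding from the summary: P(y_1..y_m | x_1..x_n) = P(y_1..y_m | S)."""
    h, y, out = S, start_id, []
    for _ in range(max_len):
        h = np.tanh(embed[y] @ U_dec + h @ W_dec)
        p = np.exp(h @ V_out)
        p /= p.sum()                   # softmax over the target vocabulary
        y = int(np.argmax(p))
        if y == end_id:
            break
        out.append(y)
    return out

xs = rng.normal(size=(5, d_in))        # a 5-word input sentence (random word vectors)
print(decode(encode(xs)))
```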

  8. Attention modelling [20]
  • Problem of the Seq2Seq model
    • Each output word depends on the same single output of the encoder (the summary)
    • The contribution of some input words should be larger than others
      • Singapore MRT is not stable --- 新加坡 (Singapore)
      • Singapore MRT is not stable --- 地铁 (MRT)
  [Figure: word alignment between 新加坡 地铁 不 稳定 and "Singapore MRT is not stable".]

  9. Attention modelling
  • Differentiate the words from the encoder
  • Let some words have more contribution
  [Figure: a decoder RNN emitting 新加坡 地铁 不 attends over the encoder RNN states for "Singapore MRT is not".]
  Image from: https://distill.pub/2016/augmented-rnns/

  10. Attention modelling (demo)
  [Figure: attention weights aligning the outputs 新加坡 地铁 不 with the inputs "Singapore MRT is not".]
  Image from: https://distill.pub/2016/augmented-rnns/

  11. Attention modelling [20]
  • Example implementation: an extended GRU for the decoder
  • Consider the related words from the encoder during decoding
    • Weighted combination of the hidden states from the encoder → context vector $c_t$
  • $s_t = (1 - z_t) \circ s_{t-1} + z_t \circ \tilde{s}_t$
  • $\tilde{s}_t = \tanh([r_t \circ s_{t-1},\; y_{t-1} W_e,\; c_t]\, W)$
  • $r_t = \sigma([s_{t-1},\; y_{t-1} W_e,\; c_t]\, W_r)$
  • $z_t = \sigma([s_{t-1},\; y_{t-1} W_e,\; c_t]\, W_z)$
  • $c_t = \sum_{j=1}^{n} \alpha_{tj} h_j$, where $\alpha_{tj}$ is the attention weight
  • $\alpha_{tj} = \mathrm{softmax}(e_{tj})$, $j = 1 \dots n$
  • $e_{tj} = v_a^\top \tanh(s_{t-1} W_a + h_j U_a)$
  • Larger weight for more related words from the encoder
  [Figure: encoder RNN states $h_1, h_2, \dots$ feed an attention module that produces $c_t$ for the decoder states $s_{t-1}, s_t$ and the output $y_t$.]
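
The update rules above can be sketched in a few lines of NumPy. This is a hedged illustration of one decoder step in the Bahdanau-style GRU with attention; the dimensions, vocabulary size, and random parameter values (`v_a`, `W_a`, `U_a`, `W`, `W_r`, `W_z`, `W_e`) are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_e = 4, 3                      # hidden size, word-embedding size (assumed)
d_in = d_h + d_e + d_h               # [s_{t-1}, y_{t-1} W_e, c_t] concatenated

# Parameters {v_a, W_a, U_a, W, W_r, W_z, W_e} (made-up values)
v_a = rng.normal(size=d_h)
W_a = rng.normal(size=(d_h, d_h)); U_a = rng.normal(size=(d_h, d_h))
W   = rng.normal(size=(d_in, d_h))
W_r = rng.normal(size=(d_in, d_h)); W_z = rng.normal(size=(d_in, d_h))
W_e = rng.normal(size=(8, d_e))      # vocabulary of 8 words (assumed)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def attention_context(s_prev, H):
    """c_t = sum_j alpha_tj h_j with e_tj = v_a^T tanh(s_{t-1} W_a + h_j U_a)."""
    e = np.array([v_a @ np.tanh(s_prev @ W_a + h @ U_a) for h in H])
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over encoder positions
    return alpha @ H, alpha

def decoder_step(s_prev, y_prev_id, H):
    """One extended-GRU step: returns the new decoder state s_t and the weights."""
    c_t, alpha = attention_context(s_prev, H)
    inp = np.concatenate([s_prev, W_e[y_prev_id], c_t])
    r_t = sigmoid(inp @ W_r)                            # reset gate
    z_t = sigmoid(inp @ W_z)                            # update gate
    inp_tilde = np.concatenate([r_t * s_prev, W_e[y_prev_id], c_t])
    s_tilde = np.tanh(inp_tilde @ W)                    # candidate state
    return (1 - z_t) * s_prev + z_t * s_tilde, alpha

H = rng.normal(size=(5, d_h))        # 5 encoder hidden states h_1..h_5
s1, alpha = decoder_step(np.zeros(d_h), 2, H)
print(np.round(alpha, 3), s1.shape)
```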

  12. Attention modelling
  • Encoder RNN
    • Input word vectors: $a = [0.1, 0, 1]$, $b = [1, 0.1, 0]$, $c = [0.2, 0.3, 1]$
    • Hidden representations (vectors): $h_1, h_2, h_3$
      • $h_1 = \tanh(aU + h_0 W)$
      • $h_2 = \tanh(bU + h_1 W)$
      • $h_3 = \tanh(cU + h_2 W)$
  • Parameters: $\{U, W\}$

  13. Attention modelling
  • Decoder RNN, given the initial hidden state vector $s_0$
  • To compute the weights of $h_1, h_2, h_3$ for computing $s_1$:
    • $e_{11} = a(s_0, h_1)$, $e_{12} = a(s_0, h_2)$, $e_{13} = a(s_0, h_3)$
      • $a(s_0, h_1) = v^\top \tanh(s_0 W_a + h_1 U_a)$
      • $a(s_0, h_2) = v^\top \tanh(s_0 W_a + h_2 U_a)$
      • $a(s_0, h_3) = v^\top \tanh(s_0 W_a + h_3 U_a)$
    • $\alpha_{11} = \exp(e_{11}) / (\exp(e_{11}) + \exp(e_{12}) + \exp(e_{13}))$
    • $\alpha_{12} = \exp(e_{12}) / (\exp(e_{11}) + \exp(e_{12}) + \exp(e_{13}))$
    • $\alpha_{13} = \exp(e_{13}) / (\exp(e_{11}) + \exp(e_{12}) + \exp(e_{13}))$
    • $c_1 = \alpha_{11} h_1 + \alpha_{12} h_2 + \alpha_{13} h_3$
  • Then update the decoder state as on slide 11:
    • $s_t = (1 - z_t) \circ s_{t-1} + z_t \circ \tilde{s}_t$
    • $\tilde{s}_t = \tanh([r_t \circ s_{t-1},\; y_{t-1} W_e,\; c_t]\, W)$
    • $r_t = \sigma([s_{t-1},\; y_{t-1} W_e,\; c_t]\, W_r)$
    • $z_t = \sigma([s_{t-1},\; y_{t-1} W_e,\; c_t]\, W_z)$
  • Parameters: $\{v, W_a, U_a, W, W_r, W_z, W_e\}$
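
The worked example on slides 12 and 13 can be traced end to end with a short NumPy script. Only the input vectors $a, b, c$ come from the slide; the parameter matrices $U, W, W_a, U_a$ and the vector $v$ are made-up values, so the printed numbers are illustrative rather than the lecture's.

```python
import numpy as np

# Input word vectors from slide 12
a = np.array([0.1, 0.0, 1.0])
b = np.array([1.0, 0.1, 0.0])
c = np.array([0.2, 0.3, 1.0])

rng = np.random.default_rng(42)
U, W = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))      # encoder parameters {U, W} (made up)
W_a, U_a = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))  # attention parameters (made up)
v = rng.normal(size=3)

# Encoder: h_i = tanh(x_i U + h_{i-1} W), with h_0 = 0
h0 = np.zeros(3)
h1 = np.tanh(a @ U + h0 @ W)
h2 = np.tanh(b @ U + h1 @ W)
h3 = np.tanh(c @ U + h2 @ W)

# Decoder, step 1: score each encoder state against s_0
s0 = np.zeros(3)
e = np.array([v @ np.tanh(s0 @ W_a + h @ U_a) for h in (h1, h2, h3)])  # e_11, e_12, e_13
alpha = np.exp(e) / np.exp(e).sum()                                    # softmax -> alpha_11..alpha_13
c1 = alpha[0] * h1 + alpha[1] * h2 + alpha[2] * h3                     # context vector c_1

print("alpha:", np.round(alpha, 3))
print("c_1:  ", np.round(c1, 3))
```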

  14. Transformer: repeat self-attention modelling to get better word embeddings. Image source: https://jalammar.github.io/illustrated-transformer/

  15. Encoder and Decoder Image source: https://jalammar.github.io/illustrated-transformer/

  16. Encoder Image source: https://jalammar.github.io/illustrated-transformer/

  17. Self-attention: represent a word by considering the words in its context. Image source: https://jalammar.github.io/illustrated-transformer/

  18. Self-attention
  • Attention modelling within one sentence
    • Find the attention weights: compare each word's query against every word's key
    • Do the weighted summation: over the words' values
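
A minimal sketch of single-head, scaled dot-product self-attention, assuming a 4-word sentence with random 8-dimensional embeddings and randomly initialised projection matrices; in a trained Transformer these projections are learned.

```python
import numpy as np

rng = np.random.default_rng(7)
n_words, d_model, d_k = 4, 8, 8      # 4 words in the sentence; sizes are assumed

X = rng.normal(size=(n_words, d_model))          # one row per word (embedding + position)
W_Q = rng.normal(size=(d_model, d_k))            # query projection (learned in practice)
W_K = rng.normal(size=(d_model, d_k))            # key projection
W_V = rng.normal(size=(d_model, d_k))            # value projection

def self_attention(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # queries, keys, values: one row per word
    scores = Q @ K.T / np.sqrt(d_k)              # query-vs-key similarity (attention logits)
    scores -= scores.max(axis=-1, keepdims=True) # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)           # softmax: attention weights per word
    return A @ V                                 # weighted summation of the values

Z = self_attention(X)
print(Z.shape)                                   # (4, 8): a context-aware vector per word
```

Repeating this operation across stacked layers (together with feed-forward layers and multiple heads) is what the Transformer encoder does to refine the word representations.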

  19. Self-attention Image source: https://jalammar.github.io/illustrated-transformer/

  20. Multi-headed self-attention (one row per word) Image source: https://jalammar.github.io/illustrated-transformer/
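
Extending the single-head sketch above, a hedged illustration of the multi-head idea: several attention heads run independently and their outputs are concatenated and projected back to the model dimension. The head count, sizes, and output projection `W_O` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_words, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads

X = rng.normal(size=(n_words, d_model))                       # one row per word

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def one_head(X, W_Q, W_K, W_V):
    """Scaled dot-product attention for a single head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V                # (n_words, d_k)

# Each head has its own projections; W_O mixes the concatenated heads back to d_model.
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))

Z = np.concatenate([one_head(X, *h) for h in heads], axis=-1) @ W_O
print(Z.shape)                                                # (4, 8): still one row per word
```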

  21. Encoder Image source: https://jalammar.github.io/illustrated-transformer/

  22. Transformer Image source: https://jalammar.github.io/illustrated-transformer/

  23. Transformer Image source: https://jalammar.github.io/illustrated-transformer/

  24. Question answering
  • Given a context and a question
  • Output the answer, which may be
    • a word
    • a substring of the context
    • a new sentence

  25. Example Source from [21]

  26. Example Source from [22]

  27. Example Source from [23]

  28. Solution [24]
  • Find the answer word from the context: $\max_w P(a = w \mid c, q)$
  • Steps
    1. Extract representations of the question and the passage (context)
    2. Combine the question and context representations
      • Concatenation
      • Addition
    3. Generate the prediction
      • Match the combined feature with each candidate word feature, e.g. by inner product
      • Use the similarities as input to a softmax
  [Figure: question and passage/context each encoded by RNNs, combined, then a softmax over the candidate words.]
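
A toy NumPy sketch of steps 2 and 3 above, using addition for the combination and an inner product for the matching; the feature vectors and the candidate word list are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6                                             # feature size (assumed)

q_feat = rng.normal(size=d)                       # question representation (e.g. final RNN state)
c_feat = rng.normal(size=d)                       # passage/context representation

candidates = ["Singapore", "MRT", "stable", "train"]   # candidate answer words (made up)
cand_feats = rng.normal(size=(len(candidates), d))     # one feature vector per candidate word

combined = q_feat + c_feat                        # step 2: combine (addition; concatenation also possible)

scores = cand_feats @ combined                    # step 3: inner product with each candidate
probs = np.exp(scores - scores.max())
probs /= probs.sum()                              # softmax over candidates: P(a = w | c, q)

print(candidates[int(np.argmax(probs))], np.round(probs, 3))
```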

  29. Summary • Seq2seq model for machine translation • Attention modelling • Transformer model • Question answering

