

  1. CS5242 Neural Networks and Deep Learning Lecture 09: RNN Applications II Wei WANG TA: Yao SHU, Juncheng LIU, Qingpeng Cai cs5242@comp.nus.edu.sg

  2. Recap
  • Language modelling
    • Training
      • Model the joint probability
      • Model the conditional probability
      • Train RNN for predicting the next word
    • Inference
      • Greedy search vs. beam search
  • Image caption generation
    • CNN --- image feature
    • RNN --- word generation

  3. Agenda • Machine translation • Attention modelling • Transformer * • Question answering • Colab tutorial

  4. RNN Architectures
  • Example tasks: sentiment analysis, language modelling, image caption, machine translation, question answering
  Image source: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

  5. Machine translation [19]
  • Given a sentence in one language, e.g. English: Singapore MRT is not stable
  • Return a sentence in another language, e.g. Chinese: 新加坡地铁不稳定
  • Training: $\max_{\Theta} \sum_{\langle x, y \rangle} \log P(y_1, y_2, \dots, y_m \mid x_1, x_2, \dots, x_n; \Theta)$
  • Inference: $\max_{\langle y_1, y_2, \dots, y_m \rangle} \log P(y_1, y_2, \dots, y_m \mid x_1, x_2, \dots, x_n)$, i.e. search for the most probable output sequence
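
To make the training and inference formulas above concrete, here is a minimal, hedged NumPy sketch (not from the lecture): `cond_prob` is a hypothetical stand-in for whatever model produces $P(y_t \mid y_{<t}, x)$, such as the RNN decoders discussed later, and all sizes and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5                                   # toy target vocabulary (assumed); id 0 = <END>
EMB = rng.normal(size=(VOCAB, 4))           # toy word embeddings (assumed)
W_OUT = rng.normal(size=(4, VOCAB))

def cond_prob(x_ids, y_prefix):
    """Hypothetical stand-in for an RNN decoder: P(next word | y_<t, x)."""
    state = EMB[list(x_ids) + list(y_prefix)].sum(axis=0)   # crude "summary" of x and y_<t
    logits = state @ W_OUT
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Training objective for one pair <x, y>: log P(y_1..y_m | x_1..x_n) as a sum of
# conditional log-probabilities; maximise its sum over all training pairs w.r.t. Theta.
def sentence_log_prob(x_ids, y_ids):
    return sum(np.log(cond_prob(x_ids, y_ids[:t])[y_ids[t]]) for t in range(len(y_ids)))

# Greedy approximation of the inference max: pick the most probable next word each step.
def greedy_decode(x_ids, max_len=10, end_id=0):
    y = []
    for _ in range(max_len):
        nxt = int(np.argmax(cond_prob(x_ids, y)))
        if nxt == end_id:
            break
        y.append(nxt)
    return y

print(sentence_log_prob([3, 1, 4], [2, 2, 0]))
print(greedy_decode([3, 1, 4]))
```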

  6. Sequence to sequence model
  • Seq2Seq: $P(y_1, y_2, \dots, y_m \mid x_1, x_2, \dots, x_n) = P(y_1, y_2, \dots, y_m \mid S)$
  • S is a summary of the input
  [Figure: the encoder RNN reads the input A B C <END> and produces the summary S; the decoder RNN starts from S and (START) and emits W X Y Z <END>, feeding each output back as the next input.]

  7. Sequence to Sequence [19]
  • Seq2Seq: $P(y_1, y_2, \dots, y_m \mid x_1, x_2, \dots, x_n) = P(y_1, y_2, \dots, y_m \mid S)$
    • S is a summary of the input
  • Encoder and Decoder are two RNN networks
    • They have their own parameters
    • End-to-end training
  • Reverse the input sequence: $x_1, x_2, \dots, x_n \rightarrow x_n, x_{n-1}, \dots, x_1$
    • Why an overall better output sequence? Because $x_1$ is near $y_1$, $x_2$ is near $y_2$, …
  • Multiple stacks of RNN → better output sequence
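
A minimal NumPy sketch of the encoder-decoder structure above; the toy dimensions, vocabulary, and randomly initialised (untrained) weights are assumptions, so this only illustrates how the summary $S$ links two RNNs that have their own parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, vocab = 4, 8, 6          # toy sizes (assumptions)

# Separate parameter sets for the encoder and the decoder, trained end-to-end in practice.
U_enc, W_enc = rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h))
U_dec, W_dec = rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h))
V_out = rng.normal(size=(d_h, vocab))
embed = rng.normal(size=(vocab, d_in))

def encode(xs):
    """Run the encoder RNN over the input; the final hidden state is the summary S."""
    h = np.zeros(d_h)
    for x in xs:                       # optionally iterate over reversed(xs), as on the slide
        h = np.tanh(x @ U_enc + h @ W_enc)
    return h

def decode(S, start_id=0, end_id=1, max_len=10):
    """Greedy decoding from the summary: P(y_1..y_m | x_1..x_n) = P(y_1..y_m | S)."""
    h, y, out = S, start_id, []
    for _ in range(max_len):
        h = np.tanh(embed[y] @ U_dec + h @ W_dec)
        p = np.exp(h @ V_out)
        p /= p.sum()                   # softmax over the target vocabulary
        y = int(np.argmax(p))
        if y == end_id:
            break
        out.append(y)
    return out

xs = rng.normal(size=(5, d_in))        # a 5-word input sentence (random word vectors)
print(decode(encode(xs)))
```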

  8. Attention modelling [20]
  • Problem of the Seq2Seq model
    • Each output word depends on the same single output of the encoder (the summary)
    • The contribution of some input words should be larger than others
      • Singapore MRT is not stable --- 新加坡 (Singapore)
      • Singapore MRT is not stable --- 地铁 (MRT)
  [Figure: word alignment between 新加坡 地铁 不 稳定 and "Singapore MRT is not stable".]

  9. Attention modelling
  • Differentiate the words from the encoder
  • Let some words have more contribution
  [Figure: a decoder RNN emitting 新加坡 地铁 不 attends over the encoder RNN states for "Singapore MRT is not".]
  Image from: https://distill.pub/2016/augmented-rnns/

  10. Attention modelling (demo)
  [Figure: attention weights aligning the outputs 新加坡 地铁 不 with the inputs "Singapore MRT is not".]
  Image from: https://distill.pub/2016/augmented-rnns/

  11. Attention modelling [20]
  • Example implementation: an extended GRU for the decoder
  • Consider the related words from the encoder during decoding
    • Weighted combination of the hidden states from the encoder → context vector $c_t$
  • $s_t = (1 - z_t) \circ s_{t-1} + z_t \circ \tilde{s}_t$
  • $\tilde{s}_t = \tanh([r_t \circ s_{t-1},\; y_{t-1} W_e,\; c_t]\, W)$
  • $r_t = \sigma([s_{t-1},\; y_{t-1} W_e,\; c_t]\, W_r)$
  • $z_t = \sigma([s_{t-1},\; y_{t-1} W_e,\; c_t]\, W_z)$
  • $c_t = \sum_{j=1}^{n} \alpha_{tj} h_j$, where $\alpha_{tj}$ is the attention weight
  • $\alpha_{tj} = \mathrm{softmax}(e_{tj})$, $j = 1 \dots n$
  • $e_{tj} = v_a^\top \tanh(s_{t-1} W_a + h_j U_a)$
  • Larger weight for more related words from the encoder
  [Figure: encoder RNN states $h_1, h_2, \dots$ feed an attention module that produces $c_t$ for the decoder states $s_{t-1}, s_t$ and the output $y_t$.]
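
The update rules above can be sketched in a few lines of NumPy. This is a hedged illustration of one decoder step in the Bahdanau-style GRU with attention; the dimensions, vocabulary size, and random parameter values (`v_a`, `W_a`, `U_a`, `W`, `W_r`, `W_z`, `W_e`) are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_e = 4, 3                      # hidden size, word-embedding size (assumed)
d_in = d_h + d_e + d_h               # [s_{t-1}, y_{t-1} W_e, c_t] concatenated

# Parameters {v_a, W_a, U_a, W, W_r, W_z, W_e} (made-up values)
v_a = rng.normal(size=d_h)
W_a = rng.normal(size=(d_h, d_h)); U_a = rng.normal(size=(d_h, d_h))
W   = rng.normal(size=(d_in, d_h))
W_r = rng.normal(size=(d_in, d_h)); W_z = rng.normal(size=(d_in, d_h))
W_e = rng.normal(size=(8, d_e))      # vocabulary of 8 words (assumed)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def attention_context(s_prev, H):
    """c_t = sum_j alpha_tj h_j with e_tj = v_a^T tanh(s_{t-1} W_a + h_j U_a)."""
    e = np.array([v_a @ np.tanh(s_prev @ W_a + h @ U_a) for h in H])
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over encoder positions
    return alpha @ H, alpha

def decoder_step(s_prev, y_prev_id, H):
    """One extended-GRU step: returns the new decoder state s_t and the weights."""
    c_t, alpha = attention_context(s_prev, H)
    inp = np.concatenate([s_prev, W_e[y_prev_id], c_t])
    r_t = sigmoid(inp @ W_r)                            # reset gate
    z_t = sigmoid(inp @ W_z)                            # update gate
    inp_tilde = np.concatenate([r_t * s_prev, W_e[y_prev_id], c_t])
    s_tilde = np.tanh(inp_tilde @ W)                    # candidate state
    return (1 - z_t) * s_prev + z_t * s_tilde, alpha

H = rng.normal(size=(5, d_h))        # 5 encoder hidden states h_1..h_5
s1, alpha = decoder_step(np.zeros(d_h), 2, H)
print(np.round(alpha, 3), s1.shape)
```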

  12. Attention modelling
  • Encoder RNN
    • Input word vectors: $a = [0.1, 0, 1]$, $b = [1, 0.1, 0]$, $c = [0.2, 0.3, 1]$
    • Hidden representations (vectors): $h_1, h_2, h_3$
      • $h_1 = \tanh(aU + h_0 W)$
      • $h_2 = \tanh(bU + h_1 W)$
      • $h_3 = \tanh(cU + h_2 W)$
  • Parameters: $\{U, W\}$

  13. Attention modelling
  • Decoder RNN, given the initial hidden state vector $s_0$
  • To compute the weights of $h_1, h_2, h_3$ for computing $s_1$:
    • $e_{11} = a(s_0, h_1)$, $e_{12} = a(s_0, h_2)$, $e_{13} = a(s_0, h_3)$
      • $a(s_0, h_1) = v^\top \tanh(s_0 W_a + h_1 U_a)$
      • $a(s_0, h_2) = v^\top \tanh(s_0 W_a + h_2 U_a)$
      • $a(s_0, h_3) = v^\top \tanh(s_0 W_a + h_3 U_a)$
    • $\alpha_{11} = \exp(e_{11}) / (\exp(e_{11}) + \exp(e_{12}) + \exp(e_{13}))$
    • $\alpha_{12} = \exp(e_{12}) / (\exp(e_{11}) + \exp(e_{12}) + \exp(e_{13}))$
    • $\alpha_{13} = \exp(e_{13}) / (\exp(e_{11}) + \exp(e_{12}) + \exp(e_{13}))$
    • $c_1 = \alpha_{11} h_1 + \alpha_{12} h_2 + \alpha_{13} h_3$
  • Then update the decoder state as on slide 11:
    • $s_t = (1 - z_t) \circ s_{t-1} + z_t \circ \tilde{s}_t$
    • $\tilde{s}_t = \tanh([r_t \circ s_{t-1},\; y_{t-1} W_e,\; c_t]\, W)$
    • $r_t = \sigma([s_{t-1},\; y_{t-1} W_e,\; c_t]\, W_r)$
    • $z_t = \sigma([s_{t-1},\; y_{t-1} W_e,\; c_t]\, W_z)$
  • Parameters: $\{v, W_a, U_a, W, W_r, W_z, W_e\}$
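
The worked example on slides 12 and 13 can be traced end to end with a short NumPy script. Only the input vectors $a, b, c$ come from the slide; the parameter matrices $U, W, W_a, U_a$ and the vector $v$ are made-up values, so the printed numbers are illustrative rather than the lecture's.

```python
import numpy as np

# Input word vectors from slide 12
a = np.array([0.1, 0.0, 1.0])
b = np.array([1.0, 0.1, 0.0])
c = np.array([0.2, 0.3, 1.0])

rng = np.random.default_rng(42)
U, W = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))      # encoder parameters {U, W} (made up)
W_a, U_a = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))  # attention parameters (made up)
v = rng.normal(size=3)

# Encoder: h_i = tanh(x_i U + h_{i-1} W), with h_0 = 0
h0 = np.zeros(3)
h1 = np.tanh(a @ U + h0 @ W)
h2 = np.tanh(b @ U + h1 @ W)
h3 = np.tanh(c @ U + h2 @ W)

# Decoder, step 1: score each encoder state against s_0
s0 = np.zeros(3)
e = np.array([v @ np.tanh(s0 @ W_a + h @ U_a) for h in (h1, h2, h3)])  # e_11, e_12, e_13
alpha = np.exp(e) / np.exp(e).sum()                                    # softmax -> alpha_11..alpha_13
c1 = alpha[0] * h1 + alpha[1] * h2 + alpha[2] * h3                     # context vector c_1

print("alpha:", np.round(alpha, 3))
print("c_1:  ", np.round(c1, 3))
```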

  14. Transformer: repeat self-attention modelling to get better word embeddings. Image source: https://jalammar.github.io/illustrated-transformer/

  15. Encoder and Decoder Image source: https://jalammar.github.io/illustrated-transformer/

  16. Encoder Image source: https://jalammar.github.io/illustrated-transformer/

  17. Self-attention: represent a word by considering the words in its context. Image source: https://jalammar.github.io/illustrated-transformer/

  18. Self-attention
  • Attention modelling within one sentence
    • Find the attention weights: compare each word's query against every word's key
    • Do the weighted summation: over the words' values
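
A minimal sketch of single-head, scaled dot-product self-attention, assuming a 4-word sentence with random 8-dimensional embeddings and randomly initialised projection matrices; in a trained Transformer these projections are learned.

```python
import numpy as np

rng = np.random.default_rng(7)
n_words, d_model, d_k = 4, 8, 8      # 4 words in the sentence; sizes are assumed

X = rng.normal(size=(n_words, d_model))          # one row per word (embedding + position)
W_Q = rng.normal(size=(d_model, d_k))            # query projection (learned in practice)
W_K = rng.normal(size=(d_model, d_k))            # key projection
W_V = rng.normal(size=(d_model, d_k))            # value projection

def self_attention(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # queries, keys, values: one row per word
    scores = Q @ K.T / np.sqrt(d_k)              # query-vs-key similarity (attention logits)
    scores -= scores.max(axis=-1, keepdims=True) # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)           # softmax: attention weights per word
    return A @ V                                 # weighted summation of the values

Z = self_attention(X)
print(Z.shape)                                   # (4, 8): a context-aware vector per word
```

Repeating this operation across stacked layers (together with feed-forward layers and multiple heads) is what the Transformer encoder does to refine the word representations.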

  19. Self-attention Image source: https://jalammar.github.io/illustrated-transformer/

  20. Multi-headed self-attention (one row per word) Image source: https://jalammar.github.io/illustrated-transformer/
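
Extending the single-head sketch above, a hedged illustration of the multi-head idea: several attention heads run independently and their outputs are concatenated and projected back to the model dimension. The head count, sizes, and output projection `W_O` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_words, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads

X = rng.normal(size=(n_words, d_model))                       # one row per word

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def one_head(X, W_Q, W_K, W_V):
    """Scaled dot-product attention for a single head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V                # (n_words, d_k)

# Each head has its own projections; W_O mixes the concatenated heads back to d_model.
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))

Z = np.concatenate([one_head(X, *h) for h in heads], axis=-1) @ W_O
print(Z.shape)                                                # (4, 8): still one row per word
```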

  21. Encoder Image source: https://jalammar.github.io/illustrated-transformer/

  22. Transformer Image source: https://jalammar.github.io/illustrated-transformer/

  23. Transformer Image source: https://jalammar.github.io/illustrated-transformer/

  24. Question answering
  • Given a context and a question
  • Output the answer, which may be
    • a word
    • a substring of the context
    • a new sentence

  25. Example Source from [21]

  26. Example Source from [22]

  27. Example Source from [23]

  28. Solution [24]
  • Find the answer word from the context: $\max_w P(a = w \mid c, q)$
  • Steps
    1. Extract representations of the question and the passage (context)
    2. Combine the question and context representations
      • Concatenation
      • Addition
    3. Generate the prediction
      • Match the combined feature with each candidate word feature, e.g. by inner product
      • Use the similarities as input to a softmax
  [Figure: question and passage/context each encoded by RNNs, combined, then a softmax over the candidate words.]
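
A toy NumPy sketch of steps 2 and 3 above, using addition for the combination and an inner product for the matching; the feature vectors and the candidate word list are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6                                             # feature size (assumed)

q_feat = rng.normal(size=d)                       # question representation (e.g. final RNN state)
c_feat = rng.normal(size=d)                       # passage/context representation

candidates = ["Singapore", "MRT", "stable", "train"]   # candidate answer words (made up)
cand_feats = rng.normal(size=(len(candidates), d))     # one feature vector per candidate word

combined = q_feat + c_feat                        # step 2: combine (addition; concatenation also possible)

scores = cand_feats @ combined                    # step 3: inner product with each candidate
probs = np.exp(scores - scores.max())
probs /= probs.sum()                              # softmax over candidates: P(a = w | c, q)

print(candidates[int(np.argmax(probs))], np.round(probs, 3))
```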

  29. Summary • Seq2seq model for machine translation • Attention modelling • Transformer model • Question answering

