

slide-1
SLIDE 1

Recurrent Neural Networks and Natural Language Processing: 순환 신경망 & 자연언어처리

2019.10.8 Seung-Hoon Na Chonbuk National University

slide-2
SLIDE 2

Contents

  • Recurrent neural networks

– Classical RNN – LSTM – Recurrent language model – Sequence labelling – Neural encoder-decoder – Memory augmented neural networks

  • Deep learning for NLP

– Introduction to NLP – Word embedding – ELMo, BERT: Contextualized word embedding

  • Summary
slide-3
SLIDE 3

Neural Network: Two types

  • Feedforward neural networks (FNN)

– = Deep feedforward networks = multilayer perceptrons (MLP) – No feedback connections

  • information flows: x → f(x) → y

– Represented by a directed acyclic graph

  • Recurrent neural networks (RNN)

– Feedback connections are included – Long short term memory (LSTM) – Recently, RNNs using explicit memories like Neural Turing machine (NTM) are extensively studied – Represented by a cyclic graph


slide-4
SLIDE 4

FNN: Notation

  • For simplicity, assume a network with a single hidden layer only

– y_k: k-th output unit, h_j: j-th hidden unit, x_i: i-th input – u_jk: weight b/w the j-th hidden and the k-th output unit – w_ij: weight b/w the i-th input and the j-th hidden unit

  • Bias terms are also contained in the weights

[Figure: input layer x_1 … x_o, hidden layer h_1 … h_n, output layer y_1 … y_K, with weights w_ij and u_jk]

slide-5
SLIDE 5

FNN: Matrix Notation

[Figure: the same network in matrix form, with input x, hidden vector, output y, and weight matrices W (input→hidden) and U (hidden→output)]

y = f(U g(W x))

y = f(U g(W x + b) + d) for explicit bias terms

slide-6
SLIDE 6

Typical Setting for Classification

– K: the number of labels – Input layer: input values (raw features) – Output layer: scores of the labels – Softmax layer: normalization of the output values

  • Scores are transformed to probabilities of the K labels:

ỹ_k = exp(y_k) / Σ_{k′} exp(y_{k′})

[Figure: input layer → hidden layer → output layer y_1 … y_K → softmax layer ỹ_1 … ỹ_K]
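The softmax normalization above can be sketched in a few lines of numpy; the layer sizes, weight names, and function names here are illustrative assumptions, not from the slides.

```python
import numpy as np

def softmax(scores):
    """Transform output-layer scores into label probabilities."""
    scores = scores - scores.max()          # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def fnn_classify(x, W, U):
    """Single-hidden-layer FNN: scores y = U g(W x), probabilities via softmax."""
    h = np.tanh(W @ x)                      # hidden layer
    y = U @ h                               # output layer: scores of K labels
    return softmax(y)                       # softmax layer
```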

slide-7
SLIDE 7

Recurrent neural networks

  • A family of neural networks for processing

sequential data

  • Specialized for processing a sequence of values

– x(1), x(2), ⋯, x(τ)

  • Use parameter sharing across time steps

– “I went to Nepal in 2009” – “In 2009, I went to Nepal”

Traditional nets need to learn all of the rules of the language separately at each position in the sentence

slide-8
SLIDE 8

RNN as a Dynamical System

  • The classical form of a dynamical system:

– s(t): the state of the system

  • Unfolding the equation → a directed acyclic

computational graph

– s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

s(t) = f(s(t−1); θ)

slide-9
SLIDE 9

RNN as a Dynamical System

  • An RNN can be considered as a dynamical system that

takes an external signal x(t) at time t

  • Using the recurrence, an RNN maps an arbitrary-

length sequence (x(t), x(t−1), x(t−2), ⋯, x(2), x(1)) to a fixed-length vector h(t)

h(t) = f(h(t−1), x(t); θ)

slide-10
SLIDE 10

Recurrent Neural Networks

[Figure: feedforward NN (input x, hidden h, output o) vs. recurrent NN, which adds hidden-to-hidden recurrent weights]

Feedforward NN: h = g(U x)

Recurrent neural network: h(t) = g(W h(t−1) + U x(t))

Parameter sharing: the same weights are reused across all time steps

slide-11
SLIDE 11

Classical RNN: Update Formula

h(t) = f(x(t), h(t−1))

  • h(t) = tanh(W x(t) + U h(t−1))
  • o(t) = V h(t)

Using explicit bias terms:

  • h(t) = tanh(W x(t) + U h(t−1) + b)
  • o(t) = V h(t) + c

[Figure: input x → hidden h (input weights W, recurrent weights U) → output o (weights V)]
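A minimal numpy sketch of this update; the weight shapes, argument names, and helper functions are assumptions for illustration only.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b, V, c):
    """Classical RNN update with explicit bias terms:
    h(t) = tanh(W x(t) + U h(t-1) + b),  o(t) = V h(t) + c
    """
    h_t = np.tanh(W @ x_t + U @ h_prev + b)
    o_t = V @ h_t + c
    return h_t, o_t

def rnn_forward(xs, h0, W, U, b, V, c):
    """The same W, U, V, b, c are reused at every time step (parameter sharing)."""
    h, outputs = h0, []
    for x_t in xs:
        h, o_t = rnn_step(x_t, h, W, U, b, V, c)
        outputs.append(o_t)
    return h, outputs
```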

slide-12
SLIDE 12

Computational Graph of RNN

  • Unfolding: The process that maps a circuit-

style graph to a computational graph with repeated units

  • Unfolded graph has a size that depends on the

sequence length

[Figure: an RNN with no outputs, drawn as a circuit diagram (left) and unfolded into a computational graph (right); the black square indicates a delay of 1 time step]
slide-13
SLIDE 13

RNNs with Classical Setting

  • RNNs that produce an output at each time step and

have recurrent connections between hidden units

Loss L: measures how far each output o is from the corresponding training target y

slide-14
SLIDE 14

Classical RNNs: Computational Power

  • Classical RNNs are universal in the sense that any

function computable by a Turing machine can be computed by an RNN [Siegelmann ’91, ’95], where the

update formula is given as

– a(t) = b + W h(t−1) + U x(t) – h(t) = tanh(a(t)) – o(t) = c + V h(t) – ŷ(t) = softmax(o(t))

slide-15
SLIDE 15

Classical RNNs: Computational Power

  • Theorems:
  • Classical rational-weighted RNNs are

computationally equivalent to Turing machines

  • Classical real-weighted RNNs are strictly more

powerful than rational-weighted RNNs and Turing machines → super-Turing machines

slide-16
SLIDE 16

Classical RNNs: Loss function

  • The total loss for a given sequence of x values

paired with a sequence of y values:

– the sum of the losses over all the time steps

  • L(t): the negative log-likelihood of y(t) given

x(1), ⋯, x(t)

L(x(1), ⋯, x(τ), y(1), ⋯, y(τ)) = Σ_t L(t) = −Σ_t log p_model(y(t) | x(1), ⋯, x(t))
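A hedged sketch of this loss, computing the per-step negative log-likelihood from output-layer scores; the function and variable names are mine, not the slides'.

```python
import numpy as np

def sequence_nll(step_scores, targets):
    """Total loss = sum over time steps of -log p_model(y(t) | x(1..t)).
    step_scores[t]: output-layer score vector o(t); targets[t]: gold label index.
    """
    total = 0.0
    for o_t, y_t in zip(step_scores, targets):
        # log softmax = o - logsumexp(o), computed stably
        log_probs = o_t - o_t.max() - np.log(np.sum(np.exp(o_t - o_t.max())))
        total += -log_probs[y_t]
    return total
```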

slide-17
SLIDE 17

Backpropagation through Time (BPTT)

forward propagation backward propagation through time (BPTT)

slide-18
SLIDE 18

RNN with Output Recurrence

  • Lack hidden-to-hidden connections

– Less powerful than classical RNNs – This type of RNN cannot simulate a universal TM

slide-19
SLIDE 19

RNN with Single Output

  • At the end of the sequence, network obtains a

representation for entire input sequence and produces a single output

slide-20
SLIDE 20

RNN with Output Dependency

  • The output layer of RNN takes a directed

graphical model that contains edges from outputs y(i) in the past to the current output

– This model is able to perform CRF-style tagging

[Figure: inputs x(1)…x(t), hidden states h(1)…h(t), outputs y(1)…y(t), with edges between consecutive outputs]

slide-21
SLIDE 21

Recurrent Language Model: RNN as Directed Graphical Models

  • Introducing the state variable in the graphical model of

the RNN

slide-22
SLIDE 22

Recurrent Language Model:

Teacher Forcing

  • At training time, the teacher forcing feeds

the correct output 𝒛(𝑢) from the training set.

  • At test time, because the true output is not

available, the correct output is approximated by the model’s output

At training time, the hidden representation at step t is given the gold output from step t−1 as input; at test time, the predicted output from step t−1 is fed instead.

slide-23
SLIDE 23

Modeling Sequences Conditioned on Context with RNNs

  • Generating sequences given a fixed vector x

– Context: a fixed vector x – Take only a single vector x as input and generate the y sequence

  • Some common ways

– 1. as an extra input at each time step, or – 2. as the initial state 𝒊(0), or – 3. both.

slide-24
SLIDE 24

Modeling Sequences Conditioned on Context with RNNs

  • Maps a fixed-length vector x into a

distribution over sequences Y

  • E.g.) image captioning
slide-25
SLIDE 25

Modeling Sequences Conditioned on Context with RNNs

  • Input: a sequence of vectors x(t)
  • Output: a sequence with the same length as the input

P(y(1), ⋯, y(τ) | x(1), ⋯, x(τ)) ≈ Π_t P(y(t) | x(1), ⋯, x(t), y(1), ⋯, y(t−1))

slide-26
SLIDE 26

Bidirectional RNN

  • Combine two RNNs

– Forward RNN: an RNN that moves forward, beginning from the start of the sequence

– Backward RNN: an RNN that moves backward, beginning from the end of the sequence

– It can make a prediction of y(t) that depends on the

whole input sequence.

Forward RNN Backward RNN

slide-27
SLIDE 27

Encoder-Decoder Sequence-to-Sequence

  • Input: sequence
  • Output: sequence (but possibly with

a different length) → machine translation

Generate an output sequence (y(1), ⋯, y(n_y)) given an input sequence (x(1), ⋯, x(n_x))

An RNN is used as the encoder and a recurrent language model as the decoder.

slide-28
SLIDE 28

RNN: Extensions (1/3)

  • Classical RNN

– Suffers from the challenge of long-term dependencies

  • LSTM (Long short term memory)

– Gated units, dealing with vanishing gradients – Dealing with the challenge of long-term dependencies

  • Bidirectional LSTM

– forward & backward RNNs

  • Bidirectional LSTM CRF

– Output dependency with linear-chain CRF

  • Recurrent language model

– RNN for sequence generation – Predicting the next word conditioned on all the previous words

  • Recursive neural network & Tree LSTM

– Generalized RNN for representation of tree structure

slide-29
SLIDE 29

RNN: Extensions (2/3)

  • Neural encoder-decoder

– Conditional recurrent language model – Encoder: RNN for encoding a source sentence – Decoder: RNN for generating a target sentence

  • Neural machine translation

– Neural encoder-decoder with attention mechanism – Attention-based decoder: Selectively conditioning source words, when generating a target word

  • Pointer network

– Attention as generation: Output vocabulary is the set of given source words

slide-30
SLIDE 30

RNN: Extensions (3/3)

  • Stack LSTM

– An LSTM for representing stack structure

  • Extends the standard LSTM with a stack pointer
  • Previously, only the push() operation was allowed
  • Now, the pop() operation is also supported
  • Memory-augmented LSTMs

– Neural Turing machine – Differentiable neural computer – C.f. ) Neural encoder-decoder, Stack LSTM: Special cases of MALSTM

  • RNN architecture search with reinforcement learning

– Training neural architectures that maximize the expected accuracy on a specific task

slide-31
SLIDE 31

The Challenge of Long-Term Dependencies

  • Example: a very simple recurrent network
  • No nonlinear activation function, no inputs

– h(t) = Wᵀ h(t−1) – h(t) = (Wᵗ)ᵀ h(0)

W = Q Λ Qᵀ  ⟹  h(t) = Qᵀ Λᵗ Q h(0)

Ill-posed form: the eigenvalues are raised to the power t, so components vanish if |λ| < 1 and explode if |λ| > 1

slide-32
SLIDE 32

The Challenge of Long-Term Dependencies

  • Gradients vanish or explode in deep models
  • BPTT for recurrent neural networks is a typical example

[Figure: unfolded RNN with inputs x(1)…x(t), hidden states h(1)…h(t), shared weights W, and the error signal δ_t propagated backward]

Error signal: the delta at earlier steps is obtained by repeated multiplication by Wᵀ, e.g.

δ_{t−1} = h′(a_t) ∘ Wᵀ δ_t,  δ_{t−2} = h′(a_{t−1}) ∘ h′(a_t) ∘ (Wᵀ)² δ_t, ⋯

With W = Q diag(λ) Q⁻¹ and Wᵏ = Q diag(λ)ᵏ Q⁻¹:

Explode if |λ_i| > 1, vanish if |λ_i| < 1

slide-33
SLIDE 33

Exploding and vanishing gradients [Bengio ‘94; Pascanu ‘13]

  • Let:

– ‖diag(h′(a_{j−1}))‖ ≤ γ

  • for a bounded nonlinear function h′(x)

– λ_1: the largest singular value of W

  • Sufficient condition for the vanishing gradient problem

– λ_1 < 1/γ

  • Necessary condition for the exploding gradient problem

– λ_1 > 1/γ

δ_{t−1} = diag(h′(a_{t−1})) Wᵀ δ_t,   δ_l = Π_{l<j≤t} diag(h′(a_{j−1})) Wᵀ δ_t

⟹ ‖diag(h′(a_{j−1})) Wᵀ‖ ≤ ‖diag(h′(a_{j−1}))‖ ‖Wᵀ‖ < (1/γ)·γ = 1

⟹ the exploding condition is obtained by just inverting the condition for the vanishing gradient problem

slide-34
SLIDE 34

Gradient clipping [Pascanu’ 13]

  • Deals with exploding gradients
  • Clip the norm ‖g‖ of the gradient g just before

the parameter update:  if ‖g‖ > v:  g ← g·v / ‖g‖

[Figure: error surface trajectories without gradient clipping vs. with clipping]
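A minimal sketch of the norm clipping rule above; the threshold name `v` follows the slide, everything else is an assumption.

```python
import numpy as np

def clip_gradient(g, v):
    """If ||g|| > v, rescale g to have norm v (applied just before the update)."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)
    return g
```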

slide-35
SLIDE 35

Long Short Term Memory (LSTM)

  • LSTM: makes it easier for RNNs to capture long-

term dependencies  Using gated units

– Basic LSTM [Hochreiter and Schmidhuber ’97]

  • Cell state unit c(t): an internal memory
  • Introduces input gate & output gate
  • Problem: The output is close to zero as long as the output

gate is closed.

– Modern LSTM: Uses forget gate [Gers et al ’00] – Variants of LSTM

  • Add peephole connections [Gers et al ’02]

– Allow all gates to inspect the current cell state even when the output gate is closed.
slide-36
SLIDE 36

Long Short Term Memory (LSTM)

[Figure: vanilla RNN (input x, hidden state h) vs. LSTM, which adds a memory cell c (cell state unit), controlled by a gate, alongside the hidden state h]

slide-37
SLIDE 37

Long Short Term Memory (LSTM)

  • Memory cell c: a gated unit

– Controlled by the input / output / forget gates

Gated flow:  y = v ∘ x  (elementwise gating of x by the gate values v)

f: forget gate,  i: input gate,  o: output gate

slide-38
SLIDE 38

Long Short Term Memory (LSTM)

[Figure: LSTM with input x, memory cell c (cell state unit), hidden state h, candidate c̃, and gates i, o, f]

Computing gate values:

f(t) = g_f(x(t), h(t−1))  (forget gate)
i(t) = g_i(x(t), h(t−1))  (input gate)
o(t) = g_o(x(t), h(t−1))  (output gate)
c̃(t) = tanh(W^c x(t) + U^c h(t−1))  (new memory cell)

c(t) = i(t) ∘ c̃(t) + f(t) ∘ c(t−1)
h(t) = o(t) ∘ tanh(c(t))

slide-39
SLIDE 39

Long Short Term Memory (LSTM)

[Figure: the same LSTM as on the previous slide, highlighting how the gate values i, o, f control the flow]

Computing gate values:

f(t) = g_f(x(t), h(t−1)),  i(t) = g_i(x(t), h(t−1)),  o(t) = g_o(x(t), h(t−1))
c̃(t) = tanh(W^c x(t) + U^c h(t−1))  (new memory cell)

Controlling by gate values:

c(t) = i(t) ∘ c̃(t) + f(t) ∘ c(t−1)
h(t) = o(t) ∘ tanh(c(t))

slide-40
SLIDE 40

Long Short Term Memory (LSTM): Cell Unit Notation (Simplified)

[Figure: simplified cell-unit notation with x, c, h and the gates i, o, f]

c(t) = i(t) ∘ c̃(t) + f(t) ∘ c(t−1),  h(t) = o(t) ∘ tanh(c(t)),  c̃(t) = tanh(W^c x(t) + U^c h(t−1))

slide-41
SLIDE 41

Long Short Term Memory (LSTM): Long-term dependencies

[Figure: four unrolled LSTM cells with inputs x(1)…x(4), memory cells c(1)…c(4), hidden states h(1)…h(4), and gates i, o, f]

x(1) → h(4): early inputs can be preserved in the memory cell

over long spans of time steps by the gating mechanism

slide-42
SLIDE 42

LSTM: Update Formula

  • i(t) = σ(W^i x(t) + U^i h(t−1))  (input gate)
  • f(t) = σ(W^f x(t) + U^f h(t−1))  (forget gate)
  • o(t) = σ(W^o x(t) + U^o h(t−1))  (output/exposure gate)
  • c̃(t) = tanh(W^c x(t) + U^c h(t−1))  (new memory cell)
  • c(t) = f(t) ∘ c(t−1) + i(t) ∘ c̃(t)  (final memory cell)
  • h(t) = o(t) ∘ tanh(c(t))

Overall, h(t) is a function of x(t) and h(t−1).
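A numpy sketch of one LSTM step following these equations; bias terms are omitted as in the slide, and the weight-dictionary layout (`W`, `U` keyed by gate) is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """W and U are dicts of input-to-hidden / hidden-to-hidden weights per gate."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev)       # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev)       # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev)       # output/exposure gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev)   # new memory cell
    c_t = f_t * c_prev + i_t * c_tilde                   # final memory cell
    h_t = o_t * np.tanh(c_t)                             # hidden state
    return h_t, c_t
```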

slide-43
SLIDE 43

LSTM: Memory Cell

c(t) = f(t) ∘ c(t−1) + i(t) ∘ c̃(t),  h(t) = o(t) ∘ tanh(c(t))

[Figure: the gates i(t), f(t), o(t) and the candidate c̃(t) are each computed from x(t) and h(t−1)]

slide-44
SLIDE 44

LSTM: Memory Cell

  • c(t): behaves like a memory = MEMORY

– c(t) = f(t) ∘ c(t−1) + i(t) ∘ c̃(t)

  • M(t) = FORGET * M(t-1) + INPUT * NEW_INPUT
  • H(t) = OUTPUT * M(t)
  • FORGET: Erase operation (or memory reset)
  • INPUT: Write operation
  • OUTPUT: Read operation
slide-45
SLIDE 45

Memory Cell - Example

[Worked example with small numbers: the new memory M[t+1] = forget ∘ M[t] + input ∘ (new input), and the output H[t] = output ∘ tanh(M[t+1]), all computed elementwise]

slide-46
SLIDE 46

Long Short Term Memory (LSTM): Backpropagation

  • Error signal in gated flow

y = v ∘ x = diag(v) x   ⟹   ε_x = diag(v)ᵀ ε_y = v ∘ ε_y

(The error signal ε_y arriving at the gated output is passed through the gate values v.)

slide-47
SLIDE 47

Long Short Term Memory (LSTM): Backpropagation

Incoming error signal ε_{c_t} at the memory cell

a_t = W^c x_t + U^c h_{t−1}
c_t = i_t ∘ tanh(a_t) + f_t ∘ c_{t−1}
h_t = o_t ∘ tanh(c_t)

[Figure: flow between h_{t−1}, c_{t−1} and c_t through the gates i_t, f_t, o_{t−1}]

slide-48
SLIDE 48

Long Short Term Memory (LSTM): Backpropagation

Incoming error signal ε_{c_t} at the memory cell

a_t = W^c x_t + U^c h_{t−1}
c_t = i_t ∘ tanh(a_t) + f_t ∘ c_{t−1}
h_t = o_t ∘ tanh(c_t)

ε_{a_t} = tanh′(a_t) ∘ i_t ∘ ε_{c_t}

ε_{h_{t−1}} = (U^c)ᵀ ε_{a_t} = (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t})

slide-49
SLIDE 49

Long Short Term Memory (LSTM): Backpropagation

a_t = W^c x_t + U^c h_{t−1},  c_t = i_t ∘ tanh(a_t) + f_t ∘ c_{t−1},  h_t = o_t ∘ tanh(c_t)

ε_{h_{t−1}} = (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t})

ε_{c_{t−1}} = tanh′(c_{t−1}) ∘ o_{t−1} ∘ ε_{h_{t−1}} + f_t ∘ ε_{c_t}
           = tanh′(c_{t−1}) ∘ o_{t−1} ∘ (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t}) + f_t ∘ ε_{c_t}

slide-50
SLIDE 50

Long Short Term Memory (LSTM): Backpropagation

a_t = W^c x_t + U^c h_{t−1},  c_t = i_t ∘ tanh(a_t) + f_t ∘ c_{t−1},  h_t = o_t ∘ tanh(c_t)

ε_{h_{t−1}} = (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t})

ε_{c_{t−1}} = tanh′(c_{t−1}) ∘ o_{t−1} ∘ ε_{h_{t−1}} + f_t ∘ ε_{c_t}
           = tanh′(c_{t−1}) ∘ o_{t−1} ∘ (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t}) + f_t ∘ ε_{c_t}

where tanh′(x) = 1 − tanh²(x)

slide-51
SLIDE 51

LSTM vs. Vanilla RNN: Backpropagation

Vanilla RNN:  a_t = W h_{t−1} + U x_t,  h_t = tanh(a_t)

ε_{h_{t−1}} = h′(a_t) ∘ Wᵀ ε_{h_t}

LSTM:  ε_{c_{t−1}} = tanh′(c_{t−1}) ∘ o_{t−1} ∘ (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t}) + f_t ∘ ε_{c_t}

where h(x) = tanh(x). The additive term f_t ∘ ε_{c_t} is the key to dealing with vanishing gradient problems.

Vanilla RNN vs. LSTM

slide-52
SLIDE 52

Exercise: Backpropagation for LSTM

[Exercise figure: h_{t−1}, x_t, gates i_t, f_t, o_t, candidate c̃_t (new input), memory cell c_{t−1} → c_t, hidden state h_t, output y_t]

Complete the flow graph & derive the weight update formulas.

slide-53
SLIDE 53

Gated Recurrent Units [Cho et al ’14]

  • Alternative architecture to handle long-term

dependencies

h(t) = f(x(t), h(t−1))

  • z(t) = σ(W^z x(t) + U^z h(t−1))  (update gate)
  • r(t) = σ(W^r x(t) + U^r h(t−1))  (reset gate)
  • h̃(t) = tanh(r(t) ∘ U h(t−1) + W x(t))  (new memory)
  • h(t) = (1 − z(t)) ∘ h̃(t) + z(t) ∘ h(t−1)  (hidden state)
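A numpy sketch of one GRU step following these equations; bias terms are omitted and the weight-dictionary layout is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U):
    """W and U are dicts of weights for the update gate, reset gate, and candidate."""
    z_t = sigmoid(W['z'] @ x_t + U['z'] @ h_prev)          # update gate
    r_t = sigmoid(W['r'] @ x_t + U['r'] @ h_prev)          # reset gate
    h_tilde = np.tanh(W['h'] @ x_t + r_t * (U['h'] @ h_prev))  # new memory
    h_t = (1.0 - z_t) * h_tilde + z_t * h_prev              # hidden state
    return h_t
```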

slide-54
SLIDE 54
  • The output layer of RNN takes a directed

graphical model that contains edges from outputs y(i) in the past to the current output

– This model is able to perform CRF-style tagging

LSTM CRF: RNN with Output Dependency

[Figure: inputs x(1)…x(t), hidden states h(1)…h(t), outputs y(1)…y(t), with edges between consecutive outputs]

slide-55
SLIDE 55

Recurrent Language Model

  • Introducing the state variable in the graphical model of

the RNN

[Figure: graphical model of the RNN with the state variable h(t−1) introduced]

slide-56
SLIDE 56

Bidirectional RNN

  • Combine two RNNs

– Forward RNN: an RNN that moves forward, beginning from the start of the sequence

– Backward RNN: an RNN that moves backward, beginning from the end of the sequence

– It can make a prediction of y(t) that depends on the

whole input sequence.

Forward RNN Backward RNN

slide-57
SLIDE 57

Bidirectional LSTM CRF [Huang ‘15]

  • One of the state-of-the art models for

sequence labelling tasks

BI-LSTM-CRF model applied to named entity tasks

slide-58
SLIDE 58

Bidirectional LSTM CRF [Huang ‘15]

Comparison of tagging performance on POS, chunking and NER tasks for various models [Huang et al. 15]

slide-59
SLIDE 59

Neural Machine Translation

  • RNN encoder-decoder

– Neural encoder-decoder: Conditional recurrent language model

  • Neural machine translation with attention

mechanism

– Encoder: Bidirectional LSTM – Decoder: Attention Mechanism [Bahdanau et al ’15]

  • Character based NMT

– Hierarchical RNN Encoder-Decoder [Ling ‘16] – Subword-level Neural MT [Sennrich ’15] – Hybrid NMT [Luong & Manning ‘16] – Google’s NMT [Wu et al ‘16]

slide-60
SLIDE 60

Neural Encoder-Decoder

Input text Translated text

Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

slide-61
SLIDE 61

Neural Encoder-Decoder: Conditional Recurrent Language Model

Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

slide-62
SLIDE 62

Neural Encoder-Decoder [Cho et al ’14]

Encoder: RNN

Decoder: Recurrent language model

  • Computing the log of the translation probability log P(y|x) by two RNNs
slide-63
SLIDE 63

Decoder: Recurrent Language Model

Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

slide-64
SLIDE 64

Neural Encoder-Decoder with Attention Mechanism [Bahdanau et al ’15]

  • Decoder with attention mechanism

– Apply attention first to the encoded representations before generating a next target word – Attention: find aligned source words for a target word

  • Considered as implicit alignment process

– Context vector c:

  • Previously, the last hidden state from RNN encoder[Cho et al ’14]
  • Now, content-sensitively chosen with a mixture of hidden states of

input sentence at generating each target word

[Figure: at each decoding step, attention over the encoder states conditions the sampling of the next target word]

slide-65
SLIDE 65

Decoder with Attention Mechanism

  • Attention: softmax(score(h_{t−1}, h̄_s))

score(h_{t−1}, h̄_s) = wᵀ tanh(W h_{t−1} + V h̄_s)  (attention scoring function)

Directly computes a soft alignment:

a_t(s) = exp(score(h_{t−1}, h̄_s)) / Σ_{s′} exp(score(h_{t−1}, h̄_{s′}))

Expected annotation: the context vector is the a_t-weighted sum of the encoded representations H̄ = [h̄_1, ⋯, h̄_n]

h̄_s: a source hidden state
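A sketch of the additive (Bahdanau-style) scoring and soft alignment above; all array names and shapes are illustrative assumptions.

```python
import numpy as np

def attention(h_dec_prev, H_src, W, V, w):
    """score(h_{t-1}, h_s) = w^T tanh(W h_{t-1} + V h_s); a_t = softmax(scores).
    H_src: (n, d) matrix whose rows are the source hidden states h_s.
    Returns the soft alignment a_t and the expected annotation (context vector).
    """
    scores = np.array([w @ np.tanh(W @ h_dec_prev + V @ h_s) for h_s in H_src])
    scores = scores - scores.max()
    a_t = np.exp(scores) / np.exp(scores).sum()   # soft alignment over source words
    context = a_t @ H_src                          # expected annotation
    return a_t, context
```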

slide-66
SLIDE 66
  • Original scoring function [Bahdanau et al ’15]
  • Extension of scoring functions [Luong et al ‘15]

Decoder with Attention Mechanism

Original scoring function:  score(h_{t−1}, h̄_s) = wᵀ tanh(W h_{t−1} + V h̄_s)

Extension: bilinear function  score(h_t, h̄_s) = h_tᵀ W h̄_s

slide-67
SLIDE 67
  • Attention scoring function

http://aclweb.org/anthology/D15-1166

  • Computation path: h_t → a_t → c_t → h̃_t

  • Previously, h_{t−1} → a_t → c_t → h_t

Neural Encoder-Decoder with Attention Mechanism [Luong et al ‘15]

slide-68
SLIDE 68

Neural Encoder-Decoder with Attention Mechanism [Luong et al ‘15]

  • Input-feeding approach

Attentional vectors h̃_t are fed as inputs to the next time steps to inform the model about past alignment decisions

(At step t, the attentional vector is concatenated with the next input vector to form the input at step t+1.)

slide-69
SLIDE 69

GNMT: Google’s Neural Machine Translation [Wu et al ‘16]

Deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder.

Trained by Google’s Tensor Processing Unit (TPU)

slide-70
SLIDE 70

GNMT: Google’s Neural Machine Translation [Wu et al ‘16]

Mean of side-by-side scores on production data

Reduces translation errors by an average of 60% compared to Google’s phrase-based production system.

slide-71
SLIDE 71

Pointer Network

Neural encoder-decoder Pointer network

  • Attention as a pointer to select a member of the

input sequence as the output.

Attention as output

slide-72
SLIDE 72

Neural Conversational Model

[Vinyals and Le ’ 15]

  • Using neural encoder-decoder for conversations

– Response generation

http://arxiv.org/pdf/1506.05869.pdf

slide-73
SLIDE 73

BIDAF for Machine Reading Comprehension [Seo ‘17]

Bidirectional attention flow

slide-74
SLIDE 74

Memory Augmented Neural Networks

– Extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes

  • Writing & Reading mechanisms are added
  • Examples
  • Neural Turing Machine
  • Differentiable Neural

Computer

  • Memory networks
slide-75
SLIDE 75

Neural Turing Machine [Graves ‘14]

  • Two basic components: A

neural network controller and a memory bank.

  • The controller network

receives inputs from an external environment and emits outputs in response.

– It also reads from and writes to a memory matrix via a set of parallel read and write heads.

slide-76
SLIDE 76

Memory

  • Memory M_t

– The contents of the N × M memory matrix at time t

slide-77
SLIDE 77

Read/Write Operations for Memory

  • Read from memory (“blurry”)

– w_t: a vector of weightings over the N locations emitted by a read head at time t (Σ_i w_t(i) = 1) – r_t: the length-M read vector

  • Write to memory (“blurry”)

– Each write: an erase followed by an add – e_t: erase vector, a_t: add vector
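A sketch of the blurry read and write described above, with the memory M as an N×M matrix; the function names and the exact variable layout are assumptions following the notation reconstructed here.

```python
import numpy as np

def ntm_read(M, w):
    """Read vector r_t = sum_i w_t(i) M_t(i): a weighted mix of memory rows."""
    return w @ M

def ntm_write(M, w, e, a):
    """Each write is an erase followed by an add:
    M~(i) = M(i) * (1 - w(i) e),  M'(i) = M~(i) + w(i) a
    """
    M_erased = M * (1.0 - np.outer(w, e))   # erase, weighted per location
    return M_erased + np.outer(w, a)        # add, weighted per location
```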

slide-78
SLIDE 78

Addressing by Content

  • Based on Attention mechanism

– Focuses attention on locations based on the similarity b/w the current values and values emitted by the controller – k_t: the length-M key vector – β_t: a key strength, which can amplify or attenuate the precision of the focus – K[u,v]: similarity measure → cosine similarity

slide-79
SLIDE 79

Addressing

  • Interpolating content-based weights with

previous weights

– which results in the gated weighting

  • A scalar interpolation gate g_t

– Blends between the weighting w_{t−1} produced by the head at the previous time step and the weighting w_t^c produced by the content system at the current time step

slide-80
SLIDE 80

Addressing by Location

  • Based on Shifting

– s_t: a shift weighting that defines a normalized distribution over the allowed integer shifts

  • E.g.) The simplest way: use a softmax layer
  • Scalar-based: if the shift scalar is 6.7, then s_t(6) = 0.3,

s_t(7) = 0.7, and the rest of s_t is zero

– γ_t: an additional scalar which sharpens the final weighting

slide-81
SLIDE 81

Addressing: Architecture

slide-82
SLIDE 82

Controller

Controller

Input: the read vector r_t ∈ R^M

The network for the controller: FNN or RNN

Outputs for the read head:

  • k_t^R ∈ R^M (key vector)
  • s_t^R ∈ [0,1]^N (shift weighting)
  • β_t^R ∈ R_+ (key strength)
  • γ_t^R ∈ R_≥1 (sharpening scalar)
  • g_t^R ∈ (0,1) (interpolation gate)

Outputs for the write head:

  • e_t, a_t, k_t^W ∈ R^M (erase, add, key vectors)
  • s_t^W ∈ [0,1]^N,  β_t^W ∈ R_+,  γ_t^W ∈ R_≥1,  g_t^W ∈ (0,1)

External output

slide-83
SLIDE 83

NTM vs. LSTM: Copy task

  • Task: Copy sequences of eight bit random vectors, where

sequence lengths were randomised b/w 1 and 20

slide-84
SLIDE 84

NTM vs. LSTM: Mult copy

slide-85
SLIDE 85

Differentiable Neural Computers

  • Extension of NTM by advancing Memory addressing
  • Memory addressing are defined by three main

attention mechanisms

– Content (also used in NTM) – memory allocation – Temporal order

  • The controller interpolates among these

mechanisms using scalar gates

Credit: http://people.idsia.ch/~rupesh/rnnsymposium2016/slides/graves.pdf

slide-86
SLIDE 86

DNC: Overall architecture

slide-87
SLIDE 87

DNC: bAbI Results

  • Each story is treated as a separate sequence and

presented to the network in the form of word vectors, one word at a time.

mary journeyed to the kitchen. mary moved to the bedroom. john went back to the hallway. john picked up the milk there. what is john carrying ? - john travelled to the garden. john journeyed to the bedroom. what is john carrying ? - mary travelled to the bathroom. john took the apple there. what is john carrying ? - -

The answers required at the ‘−’ symbols, grouped by question into braces, are {milk}, {milk}, {milk apple} The network was trained to minimize the cross-entropy of the softmax outputs with respect to the target words

slide-88
SLIDE 88

DNC: bAbI Results

http://www.nature.com/nature/journal/v538/n7626/full/nature20101.html

slide-89
SLIDE 89

Deep learning for Natural language processing

  • Short intro to NLP
  • Word embedding
  • Deep learning for NLP
slide-90
SLIDE 90

Natural Language Processing

  • What is NLP?

– The automatic processing of human language

  • Give computers the ability to process human language

– Its goal is to enable computers to achieve human-like comprehension of texts/languages
  • Tasks

– Text processing

  • POS Tagging / Parsing / Discourse analysis

– Information extraction – Question answering – Dialog system / Chatbot – Machine translation

slide-91
SLIDE 91

Linguistics and NLP

  • Many NLP tasks correspond to structural

subfields of linguistics

Subfields of linguistics: Phonetics, Phonology, Morphology, Syntax, Semantics, Pragmatics

Corresponding NLP tasks: Speech recognition, Word segmentation, POS tagging, Parsing, Word sense disambiguation, Semantic role labeling, Semantic parsing, Named entity recognition/disambiguation, Reading comprehension

slide-92
SLIDE 92

Information Extraction

According to Robert Callahan, president of Eastern’s flight attendants union, the past practice of Eastern’s parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier’s terms.

Entity extraction:

According to <Per> Robert Callahan </Per>, president of <Org> Eastern’s </Org> flight attendants union, the past practice of <Org> Eastern’s </Org> parent, <Loc> Houston </Loc>-based <Org> Texas Air Corp. </Org>, has involved ultimatums to unions to accept the carrier’s terms

Relation extraction:

Robert Callahan – Eastern’s: <Employee_Of>
Texas Air Corp – Houston: <Located_In>

slide-93
SLIDE 93

POS Tagging

  • Input:

Plays well with others

  • Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
  • Output:

Plays/VBZ well/RB with/IN others/NNS

slide-94
SLIDE 94

Parsing

  • Sentence: “John ate the apple”
  • Parse tree (PSG tree)

S NP VP N John V NP ate DET N the apple

S → NP VP NP → N NP → DET N VP → V NP N → John V → ate DET → the N → apple

slide-95
SLIDE 95

Dependency Parsing

S NP VP N John V NP ate DET N the apple John ate the apple PSG tree John ate the apple SUBJ MOD OBJ Dependency tree

slide-96
SLIDE 96

Semantic Role Labeling

[Agent Jim] gave [Patient the book] [Goal to the professor.] Jim gave the book to the professor

Semantic roles and descriptions:

– Agent: initiator of action, capable of volition
– Patient: affected by action, undergoes change of state
– Theme: entity moving, or being “located”
– Experiencer: perceives action but not in control
– Other roles: Beneficiary, Instrument, Location, Source, Goal
slide-97
SLIDE 97

Sentiment analysis

(1) I bought a Samsung camera and my friends brought a Canon camera yesterday. (2) In the past week, we both used the cameras a lot. (3) The photos from my Samy are not that great, and the battery life is short too. (4) My friend was very happy with his camera and loves its picture quality. (5) I want a camera that can take good photos. (6) I am going to return it tomorrow.

Posted by: big John

Extracted opinion tuples: (Samsung, picture_quality, negative, big John) (Samsung, battery_life, negative, big John) (Canon, GENERAL, positive, big John’s_friend) (Canon, picture_quality, positive, big John’s_friend)

slide-98
SLIDE 98

Coreference Resolution

[A man named Lionel Gaedi]1 went to [the Port-au-Prince morgue]2 in search of [[his]1 brother]3, [ Josef ]3, but was unable to find [[his]3 body]4 among [the piles of corpses that had been left [there]2 ]5. [A man named Lionel Gaedi] went to [the Port-au-Prince morgue]2 in search of [[his] brother], [Josef], but was unable to find [[his] body] among [the piles of corpses that had been left [there] ].

slide-99
SLIDE 99

Question Answering

  • One of the oldest NLP tasks
  • Modern QA systems

– IBM’s Watson, Apple’s Siri, etc.

  • Examples of Factoid questions

Q: Where is the Louvre Museum located? — A: In Paris, France
Q: What’s the abbreviation for limited partnership? — A: L.P.
Q: What are the names of Odin’s ravens? — A: Huginn and Muninn
Q: What currency is used in China? — A: The yuan

slide-100
SLIDE 100

Example: IBM Watson System

  • Open-domain question answering system (DeepQA)

– In 2011, Watson defeated Brad Rutter and Ken Jennings in the Jeopardy! Challenge

slide-101
SLIDE 101

Machine Reading Comprehension

  • SQuAD / KorQuAD
slide-102
SLIDE 102

Chatbot

slide-103
SLIDE 103

Conversational Question Answering

slide-104
SLIDE 104

Word embedding: Distributed representation

– Distributed representation

  • n-dimensional latent vector for a word
  • Semantically similar words are closely located in vector space
slide-105
SLIDE 105

Word embedding matrix: Lookup Table

A word is first mapped to its one-hot vector e (a |V|-dimensional vector with a single 1).

L ∈ R^{d×|V|}: the word embedding matrix (lookup table); each column (e.g. for the, cat, mat) is a d-dimensional word vector.

The word vector x is obtained from the one-hot vector e by referring to the lookup table:

x = L e
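The lookup x = L e reduces to selecting a column of L; a tiny sketch with an assumed toy vocabulary follows.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "mat": 2}   # assumed toy vocabulary
d, V = 4, len(vocab)
L = np.random.randn(d, V)                # word embedding matrix (lookup table)

def embed(word):
    e = np.zeros(V)
    e[vocab[word]] = 1.0                 # one-hot vector
    return L @ e                         # x = L e (equivalently L[:, vocab[word]])

x_cat = embed("cat")
```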

slide-106
SLIDE 106

Word embedding matrix: context  input layer

  • Word sequence x_1 ⋯ x_n → input layer

The n context words x_{t−n+1}, ⋯, x_{t−2}, x_{t−1} are each mapped through the lookup table (L e_x, where e_x is the one-hot vector of word x) to a d-dimensional vector; the concatenation of these d-dimensional vectors forms the n·d-dimensional input vector.

slide-107
SLIDE 107

Natural Language Processing using Word Embedding

Unsupervised step: learn the word embedding matrix from a raw corpus → use it to initialize the lookup table.

Supervised step: train an application-specific neural network on an annotated corpus → the lookup table is further fine-tuned.

slide-108
SLIDE 108

Language Model

  • Defines a probability distribution over sequences of

tokens in a natural language

  • N-gram model

– An n-gram is a sequence of n tokens – Defines the conditional probability of the n-th token given the preceding n−1 tokens – To avoid zero probabilities, smoothing needs to be employed

  • Back-off methods
  • Interpolation method
  • Class-based language models

P(x_1, ⋯, x_T) = P(x_1, ⋯, x_{n−1}) Π_{t=n}^{T} P(x_t | x_{t−n+1}, ⋯, x_{t−1})

slide-109
SLIDE 109

Neural Language Model [Bengio ’03]

– Instead of an original raw symbol, a distributed representation of words is used – Word embedding: Raw symbols are projected to a low- dimensional vector space – Unlike class-based n-gram models, it can recognize that two words are semantically similar and also encode each word as distinct from each other

  • Method

– Estimate P(x_t | x_{t−(n−1)}, ⋯, x_{t−1}) by an FNN for classification – x: concatenated input features (input layer) – y = U tanh(d + H x) – y = W x + b + U tanh(d + H x)  (with direct input-to-output connections)
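A sketch of the forward pass y = W x + b + U tanh(d + H x) over concatenated context embeddings; the shapes, argument names, and the lookup-table layout are assumptions.

```python
import numpy as np

def nlm_scores(context_words, L, H, d, U, W, b):
    """Score every vocabulary word given the n-1 context word indices.
    L: (dim, |V|) lookup table; x: concatenation of the context word vectors.
    """
    x = np.concatenate([L[:, w] for w in context_words])
    hidden = np.tanh(d + H @ x)
    y = W @ x + b + U @ hidden            # direct + nonlinear connections
    return y                              # followed by a softmax over |V| words
```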

slide-110
SLIDE 110

Neural Language Model [Bengio ‘03]

  • Words are projected by a linear operation on the projection layer
  • Softmax function is used at the output layer to ensure that 0 <= p<= 1
slide-111
SLIDE 111

Neural Language Model

Image Credit: Tomas Mikolov

[Figure: the context words x_{t−n+1}, ⋯, x_{t−1} are looked up, concatenated into the input layer x, passed through the hidden layer, and scored at the output layer y_1 ⋯ y_{|V|}; a softmax layer normalizes the scores]

y = W x + b + U tanh(d + H x)

P(x_t | x_{t−n+1}, ⋯, x_{t−1}) = exp(y_{x_t}) / Σ_u exp(y_u)

slide-112
SLIDE 112

Experiments: Neural Language Model [Bengio ’03]

NLM

slide-113
SLIDE 113

Neural Language Model: Discussion

  • Limitation: Computational complexity

– Softmax layer requires computing scores over all vocabulary words

  • Vocabulary size is very large
  • Using a short list [Schwenk ‘02]

– The vocabulary V is split into a shortlist L of the most frequent words and a tail T of rarer words

P(y = i | C) = 1_{i∈L} · P(y = i | C, i ∈ L) · (1 − P(i ∈ T | C)) + 1_{i∈T} · P(y = i | C, i ∈ T) · P(i ∈ T | C)

The shortlist part uses the neural language model; the tail part uses an n-gram model.

slide-114
SLIDE 114

Hierarchical Softmax [Morin and Bengio ‘05]

  • Requires O(log|V|) computation, instead of O(|V|)
  • The next-word conditional probability is computed by

P(w | x_{t−1}, ⋯, x_{t−n+1}) = Π_{j=1}^{m} P(b_j(w) | b_1(w), ⋯, b_{j−1}(w), x_{t−1}, ⋯, x_{t−n+1})

where b_1(w), ⋯, b_m(w) is the sequence of decisions on the path leading to word w in a tree over the vocabulary
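A minimal sketch of the path-product idea: each word is assigned a path of binary decisions in a tree over the vocabulary, and its probability is the product of the per-node decision probabilities. The per-node parametrization below (one vector per inner node, sigmoid decisions) is an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(context_vec, path_nodes, path_bits, node_vecs):
    """P(w | context) = prod_j P(b_j(w) | b_1..b_{j-1}, context).
    path_nodes: indices of the O(log|V|) inner nodes on the path to w
    path_bits:  the left/right decision (0 or 1) taken at each node
    node_vecs:  one parameter vector per inner node
    """
    prob = 1.0
    for node, bit in zip(path_nodes, path_bits):
        p_right = sigmoid(node_vecs[node] @ context_vec)
        prob *= p_right if bit == 1 else (1.0 - p_right)
    return prob
```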

slide-115
SLIDE 115

Importance Sampling [Bengio ‘03; Jean ’15]

∂ log p(x|C) / ∂θ = ∂y_x/∂θ − Σ_{x′∈V} P(x′|C) ∂y_{x′}/∂θ

y_x: score given before applying the softmax

The expected gradient (the sum over the full vocabulary) is approximated by importance sampling over a sampled subset V′ drawn from a proposal distribution Q:

Σ_{x′∈V′} (ω_{x′} / Σ_{x″∈V′} ω_{x″}) ∂y_{x′}/∂θ,   with ω_x = exp{y_x − log Q(x)}

slide-116
SLIDE 116

Ranking Loss [Collobert & Weston ’08]

  • Sampling a negative example

– s: a given sequence of words (in the training data) – s′: a negative example in which the last word is replaced with another word – f(s): score of the sequence s – Goal: make the score difference (f(s) − f(s′)) large – Various loss functions are possible

  • Hinge loss: max(0, 1 − f(s) + f(s′))
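A sketch of the negative-sampling hinge loss above; the scoring network f is left abstract, and the function and parameter names are mine.

```python
import numpy as np

def hinge_ranking_loss(score_pos, score_neg):
    """max(0, 1 - f(s) + f(s')) for a training window s and its perturbed copy s'."""
    return max(0.0, 1.0 - score_pos + score_neg)

def corrupt_last_word(window, vocab_size, rng=np.random):
    """Negative example: replace the last word of the window with a random word."""
    neg = list(window)
    neg[-1] = rng.randint(vocab_size)
    return neg
```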
slide-117
SLIDE 117

Ranking Loss [Collobert & Weston ’08]

[Figure: the same network scores the original window x (score y) and a perturbed window x′ whose last word is replaced (score y′)]

Loss: max(0, (y′ + 1) − y)

slide-118
SLIDE 118

Recurrent Neural Language Model [Mikolov ‘00]

  • The hidden layer of the

previous time step connects to the hidden layer of the next word

  • No need to specify the

context length

– NLM: only (n-1) previous words are conditioned – RLM: all previous words are conditioned

slide-119
SLIDE 119

Recurrent Neural Language Model

  • Unfolded flow graph

http://arxiv.org/abs/1511.07916

slide-120
SLIDE 120

Word2Vec [Mikolov ‘13]

p(w_O | w_I) = exp(v′_{w_O}ᵀ v_{w_I}) / Σ_w exp(v′_wᵀ v_{w_I})

http://arxiv.org/pdf/1301.3781.pdf
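The softmax form of the skip-gram probability, sketched directly; `V_in` and `V_out` stand for the input and output embedding matrices and are assumed names.

```python
import numpy as np

def skipgram_prob(w_input, w_output, V_in, V_out):
    """p(w_O | w_I) = exp(v'_{w_O} . v_{w_I}) / sum_w exp(v'_w . v_{w_I})"""
    scores = V_out @ V_in[w_input]          # one score per vocabulary word
    scores = scores - scores.max()          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_output]
```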

slide-121
SLIDE 121

Word Vectors have linear relationships

Mikolov ‘13

slide-122
SLIDE 122

ELMo: Contextualized Word Embedding

  • Word embeddings are differentiated according to the surrounding context
slide-123
SLIDE 123

ELMo: Contextualized Word Embedding

  • The ELMo architecture is based on an LSTM language model

– The next-word-prediction task is set up in both the forward and the backward direction → BiLM (bidirectional language model)

slide-124
SLIDE 124

ELMo: Contextualized Word Embedding

  • The contextualized word embedding for the k-th word:

– after ELMo encodes the sentence, a linear combination of that word’s representations from all layers

slide-125
SLIDE 125

ELMo: Contextualized Word Embedding

  • The BiLM (ELMo) is pre-trained on a raw corpus

– The pre-trained model is then used for other NLP tasks

slide-126
SLIDE 126

ELMo: Contextualized Word Embedding

  • Experimental results

– Improved the then state-of-the-art performance on each NLP task

slide-127
SLIDE 127

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • Transformer encoder [Vaswani et al ‘17]
  • Multi-headed self attention
  • Layer norm and residuals
  • Positional embeddings
slide-128
SLIDE 128

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

slide-129
SLIDE 129

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • Experimental results

– Pre-training on a large corpus → fine-tuning

  • Achieved state-of-the-art performance on each NLP task
slide-130
SLIDE 130

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

slide-131
SLIDE 131

Summary

  • Recurrent neural networks

– Classical RNN, LSTM, RNN encoder-decoder, attention mechanism, Neural Turing machine

  • Deep learning for NLP

– Word embedding – ELMo, BERT: Contextualized word embedding – Applications

  • POS tagging, Named entity recognition, Semantic role labelling
  • Information extraction, Parsing, Sentiment analysis, Neural

machine translation, Question answering, Response generation, Sentence completion, Reading comprehension, Information retrieval, Sentence retrieval, Knowledge-base completion