

slide-1
SLIDE 1

Recurrent Neural Networks and Natural Language Processing: 순환 신경망 & 자연언어처리

2019.10.8 Seung-Hoon Na Chonbuk National University

slide-2
SLIDE 2

Contents

  • Recurrent neural networks

– Classical RNN – LSTM – Recurrent language model – Sequence labelling – Neural encoder-decoder – Memory augmented neural networks

  • Deep learning for NLP

– Introduction to NLP – Word embedding – ELMo, BERT: Contextualized word embedding

  • Summary
slide-3
SLIDE 3

Neural Network: Two types

  • Feedforward neural networks (FNN)

– = Deep feedforward networks = multilayer perceptrons (MLP) – No feedback connections

  • information flows: x → f(x) → y

– Represented by a directed acyclic graph

  • Recurrent neural networks (RNN)

– Feedback connections are included – Long short term memory (LSTM) – Recently, RNNs using explicit memories like Neural Turing machine (NTM) are extensively studied – Represented by a cyclic graph


slide-4
SLIDE 4

FNN: Notation

  • For simplicity, assume a network with a single hidden layer only

– y_k: k-th output unit, h_j: j-th hidden unit, x_i: i-th input – u_jk: weight b/w the j-th hidden and the k-th output unit – w_ij: weight b/w the i-th input and the j-th hidden unit

  • Bias terms are also contained in the weights

[Figure: input layer x_1 … x_o, hidden layer h_1 … h_n, output layer y_1 … y_K, with weights w_ij and u_jk]

slide-5
SLIDE 5

FNN: Matrix Notation

[Figure: the same network in matrix form, with input x, hidden vector, output y, and weight matrices W (input→hidden) and U (hidden→output)]

y = f(U g(W x))

y = f(U g(W x + b) + d) for explicit bias terms

slide-6
SLIDE 6

Typical Setting for Classification

– K: the number of labels – Input layer: input values (raw features) – Output layer: scores of the labels – Softmax layer: normalization of the output values

  • Scores are transformed to probabilities of the K labels:

ỹ_k = exp(y_k) / Σ_{k′} exp(y_{k′})

[Figure: input layer → hidden layer → output layer y_1 … y_K → softmax layer ỹ_1 … ỹ_K]
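The softmax normalization above can be sketched in a few lines of numpy; the layer sizes, weight names, and function names here are illustrative assumptions, not from the slides.

```python
import numpy as np

def softmax(scores):
    """Transform output-layer scores into label probabilities."""
    scores = scores - scores.max()          # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def fnn_classify(x, W, U):
    """Single-hidden-layer FNN: scores y = U g(W x), probabilities via softmax."""
    h = np.tanh(W @ x)                      # hidden layer
    y = U @ h                               # output layer: scores of K labels
    return softmax(y)                       # softmax layer
```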

slide-7
SLIDE 7

Recurrent neural networks

  • A family of neural networks for processing

sequential data

  • Specialized for processing a sequence of values

– x(1), x(2), ⋯, x(τ)

  • Use parameter sharing across time steps

– “I went to Nepal in 2009” – “In 2009, I went to Nepal”

Traditional nets need to learn all of the rules of the language separately at each position in the sentence

slide-8
SLIDE 8

RNN as a Dynamical System

  • The classical form of a dynamical system:

– s(t): the state of the system

  • Unfolding the equation → a directed acyclic

computational graph

– s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

s(t) = f(s(t−1); θ)

slide-9
SLIDE 9

RNN as a Dynamical System

  • An RNN can be considered as a dynamical system that

takes an external signal x(t) at time t

  • Using the recurrence, an RNN maps an arbitrary-

length sequence (x(t), x(t−1), x(t−2), ⋯, x(2), x(1)) to a fixed-length vector h(t)

h(t) = f(h(t−1), x(t); θ)

slide-10
SLIDE 10

Recurrent Neural Networks

[Figure: feedforward NN (input x, hidden h, output o) vs. recurrent NN, which adds hidden-to-hidden recurrent weights]

Feedforward NN: h = g(U x)

Recurrent neural network: h(t) = g(W h(t−1) + U x(t))

Parameter sharing: the same weights are reused across all time steps

slide-11
SLIDE 11

Classical RNN: Update Formula

h(t) = f(x(t), h(t−1))

  • h(t) = tanh(W x(t) + U h(t−1))
  • o(t) = V h(t)

Using explicit bias terms:

  • h(t) = tanh(W x(t) + U h(t−1) + b)
  • o(t) = V h(t) + c

[Figure: input x → hidden h (input weights W, recurrent weights U) → output o (weights V)]
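A minimal numpy sketch of this update; the weight shapes, argument names, and helper functions are assumptions for illustration only.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b, V, c):
    """Classical RNN update with explicit bias terms:
    h(t) = tanh(W x(t) + U h(t-1) + b),  o(t) = V h(t) + c
    """
    h_t = np.tanh(W @ x_t + U @ h_prev + b)
    o_t = V @ h_t + c
    return h_t, o_t

def rnn_forward(xs, h0, W, U, b, V, c):
    """The same W, U, V, b, c are reused at every time step (parameter sharing)."""
    h, outputs = h0, []
    for x_t in xs:
        h, o_t = rnn_step(x_t, h, W, U, b, V, c)
        outputs.append(o_t)
    return h, outputs
```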

slide-12
SLIDE 12

Computational Graph of RNN

  • Unfolding: The process that maps a circuit-

style graph to a computational graph with repeated units

  • Unfolded graph has a size that depends on the

sequence length

[Figure: an RNN with no outputs, drawn as a circuit diagram (left) and unfolded into a computational graph (right); the black square indicates a delay of 1 time step]
slide-13
SLIDE 13

RNNs with Classical Setting

  • RNNs that produce an output at each time step and

have recurrent connections between hidden units

Loss L: measures how far each output o is from the corresponding training target y

slide-14
SLIDE 14

Classical RNNs: Computational Power

  • Classical RNNs are universal in the sense that any

function computable by a Turing machine can be computed by an RNN [Siegelmann ’91, ’95], where the

update formula is given as

– a(t) = b + W h(t−1) + U x(t) – h(t) = tanh(a(t)) – o(t) = c + V h(t) – ŷ(t) = softmax(o(t))

slide-15
SLIDE 15

Classical RNNs: Computational Power

  • Theorems:
  • Classical rational-weighted RNNs are

computationally equivalent to Turing machines

  • Classical real-weighted RNNs are strictly more

powerful than rational-weighted RNNs and Turing machines → super-Turing machines

slide-16
SLIDE 16

Classical RNNs: Loss function

  • The total loss for a given sequence of x values

paired with a sequence of y values:

– the sum of the losses over all the time steps

  • L(t): the negative log-likelihood of y(t) given

x(1), ⋯, x(t)

L(x(1), ⋯, x(τ), y(1), ⋯, y(τ)) = Σ_t L(t) = −Σ_t log p_model(y(t) | x(1), ⋯, x(t))
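A hedged sketch of this loss, computing the per-step negative log-likelihood from output-layer scores; the function and variable names are mine, not the slides'.

```python
import numpy as np

def sequence_nll(step_scores, targets):
    """Total loss = sum over time steps of -log p_model(y(t) | x(1..t)).
    step_scores[t]: output-layer score vector o(t); targets[t]: gold label index.
    """
    total = 0.0
    for o_t, y_t in zip(step_scores, targets):
        # log softmax = o - logsumexp(o), computed stably
        log_probs = o_t - o_t.max() - np.log(np.sum(np.exp(o_t - o_t.max())))
        total += -log_probs[y_t]
    return total
```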

slide-17
SLIDE 17

Backpropagation through Time (BPTT)

forward propagation backward propagation through time (BPTT)

slide-18
SLIDE 18

RNN with Output Recurrence

  • Lack hidden-to-hidden connections

– Less powerful than classical RNNs – This type of RNN cannot simulate a universal TM

slide-19
SLIDE 19

RNN with Single Output

  • At the end of the sequence, network obtains a

representation for entire input sequence and produces a single output

slide-20
SLIDE 20

RNN with Output Dependency

  • The output layer of RNN takes a directed

graphical model that contains edges from outputs y(i) in the past to the current output

– This model is able to perform CRF-style tagging

[Figure: inputs x(1)…x(t), hidden states h(1)…h(t), outputs y(1)…y(t), with edges between consecutive outputs]

slide-21
SLIDE 21

Recurrent Language Model: RNN as Directed Graphical Models

  • Introducing the state variable in the graphical model of

the RNN

slide-22
SLIDE 22

Recurrent Language Model:

Teacher Forcing

  • At training time, the teacher forcing feeds

the correct output 𝒛(𝑢) from the training set.

  • At test time, because the true output is not

available, the correct output is approximated by the model’s output

At training time, the hidden representation at step t is given the gold output from step t−1 as input; at test time, the predicted output from step t−1 is fed instead.

slide-23
SLIDE 23

Modeling Sequences Conditioned on Context with RNNs

  • Generating sequences given a fixed vector x

– Context: a fixed vector x – Take only a single vector x as input and generate the y sequence

  • Some common ways

– 1. as an extra input at each time step, or – 2. as the initial state 𝒊(0), or – 3. both.

slide-24
SLIDE 24

Modeling Sequences Conditioned on Context with RNNs

  • Maps a fixed-length vector x into a

distribution over sequences Y

  • E.g.) image captioning
slide-25
SLIDE 25

Modeling Sequences Conditioned on Context with RNNs

  • Input: a sequence of vectors x(t)
  • Output: a sequence with the same length as the input

P(y(1), ⋯, y(τ) | x(1), ⋯, x(τ)) ≈ Π_t P(y(t) | x(1), ⋯, x(t), y(1), ⋯, y(t−1))

slide-26
SLIDE 26

Bidirectional RNN

  • Combine two RNNs

– Forward RNN: an RNN that moves forward, beginning from the start of the sequence

– Backward RNN: an RNN that moves backward, beginning from the end of the sequence

– It can make a prediction of y(t) that depends on the

whole input sequence.

Forward RNN Backward RNN

slide-27
SLIDE 27

Encoder-Decoder Sequence-to-Sequence

  • Input: sequence
  • Output: sequence (but possibly with

a different length) → machine translation

Generate an output sequence (y(1), ⋯, y(n_y)) given an input sequence (x(1), ⋯, x(n_x))

An RNN is used as the encoder and a recurrent language model as the decoder.

slide-28
SLIDE 28

RNN: Extensions (1/3)

  • Classical RNN

– Suffers from the challenge of long-term dependencies

  • LSTM (Long short term memory)

– Gated units, dealing with vanishing gradients – Dealing with the challenge of long-term dependencies

  • Bidirectional LSTM

– forward & backward RNNs

  • Bidirectional LSTM CRF

– Output dependency with linear-chain CRF

  • Recurrent language model

– RNN for sequence generation – Predicting the next word conditioned on all the previous words

  • Recursive neural network & Tree LSTM

– Generalized RNN for representation of tree structure

slide-29
SLIDE 29

RNN: Extensions (2/3)

  • Neural encoder-decoder

– Conditional recurrent language model – Encoder: RNN for encoding a source sentence – Decoder: RNN for generating a target sentence

  • Neural machine translation

– Neural encoder-decoder with attention mechanism – Attention-based decoder: Selectively conditioning source words, when generating a target word

  • Pointer network

– Attention as generation: Output vocabulary is the set of given source words

slide-30
SLIDE 30

RNN: Extensions (3/3)

  • Stack LSTM

– An LSTM for representing stack structure

  • Extends the standard LSTM with a stack pointer
  • Previously, only the push() operation was allowed
  • Now, the pop() operation is also supported
  • Memory-augmented LSTMs

– Neural Turing machine – Differentiable neural computer – C.f. ) Neural encoder-decoder, Stack LSTM: Special cases of MALSTM

  • RNN architecture search with reinforcement learning

– Training neural architectures that maximize the expected accuracy on a specific task

slide-31
SLIDE 31

The Challenge of Long-Term Dependencies

  • Example: a very simple recurrent network
  • No nonlinear activation function, no inputs

– h(t) = Wᵀ h(t−1) – h(t) = (Wᵗ)ᵀ h(0)

W = Q Λ Qᵀ  ⟹  h(t) = Qᵀ Λᵗ Q h(0)

Ill-posed form: the eigenvalues are raised to the power t, so components vanish if |λ| < 1 and explode if |λ| > 1

slide-32
SLIDE 32

The Challenge of Long-Term Dependencies

  • Gradients vanish or explode in deep models
  • BPTT for recurrent neural networks is a typical example

[Figure: unfolded RNN with inputs x(1)…x(t), hidden states h(1)…h(t), shared weights W, and the error signal δ_t propagated backward]

Error signal: the delta at earlier steps is obtained by repeated multiplication by Wᵀ, e.g.

δ_{t−1} = h′(a_t) ∘ Wᵀ δ_t,  δ_{t−2} = h′(a_{t−1}) ∘ h′(a_t) ∘ (Wᵀ)² δ_t, ⋯

With W = Q diag(λ) Q⁻¹ and Wᵏ = Q diag(λ)ᵏ Q⁻¹:

Explode if |λ_i| > 1, vanish if |λ_i| < 1

slide-33
SLIDE 33

Exploding and vanishing gradients [Bengio ‘94; Pascanu ‘13]

  • Let:

– ‖diag(h′(a_{j−1}))‖ ≤ γ

  • for a bounded nonlinear function h′(x)

– λ_1: the largest singular value of W

  • Sufficient condition for the vanishing gradient problem

– λ_1 < 1/γ

  • Necessary condition for the exploding gradient problem

– λ_1 > 1/γ

δ_{t−1} = diag(h′(a_{t−1})) Wᵀ δ_t,   δ_l = Π_{l<j≤t} diag(h′(a_{j−1})) Wᵀ δ_t

⟹ ‖diag(h′(a_{j−1})) Wᵀ‖ ≤ ‖diag(h′(a_{j−1}))‖ ‖Wᵀ‖ < (1/γ)·γ = 1

⟹ the exploding condition is obtained by just inverting the condition for the vanishing gradient problem

slide-34
SLIDE 34

Gradient clipping [Pascanu’ 13]

  • Deals with exploding gradients
  • Clip the norm ‖g‖ of the gradient g just before

the parameter update:  if ‖g‖ > v:  g ← g·v / ‖g‖

[Figure: error surface trajectories without gradient clipping vs. with clipping]
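A minimal sketch of the norm clipping rule above; the threshold name `v` follows the slide, everything else is an assumption.

```python
import numpy as np

def clip_gradient(g, v):
    """If ||g|| > v, rescale g to have norm v (applied just before the update)."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)
    return g
```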

slide-35
SLIDE 35

Long Short Term Memory (LSTM)

  • LSTM: makes it easier for RNNs to capture long-

term dependencies  Using gated units

– Basic LSTM [Hochreiter and Schmidhuber ’97]

  • Cell state unit c(t): an internal memory
  • Introduces input gate & output gate
  • Problem: The output is close to zero as long as the output

gate is closed.

– Modern LSTM: Uses forget gate [Gers et al ’00] – Variants of LSTM

  • Add peephole connections [Gers et al ’02]

– Allow all gates to inspect the current cell state even when the output gate is closed.
slide-36
SLIDE 36

Long Short Term Memory (LSTM)

[Figure: vanilla RNN (input x, hidden state h) vs. LSTM, which adds a memory cell c (cell state unit), controlled by a gate, alongside the hidden state h]

slide-37
SLIDE 37

Long Short Term Memory (LSTM)

  • Memory cell c: a gated unit

– Controlled by the input / output / forget gates

Gated flow:  y = v ∘ x  (elementwise gating of x by the gate values v)

f: forget gate,  i: input gate,  o: output gate

slide-38
SLIDE 38

Long Short Term Memory (LSTM)

[Figure: LSTM with input x, memory cell c (cell state unit), hidden state h, candidate c̃, and gates i, o, f]

Computing gate values:

f(t) = g_f(x(t), h(t−1))  (forget gate)
i(t) = g_i(x(t), h(t−1))  (input gate)
o(t) = g_o(x(t), h(t−1))  (output gate)
c̃(t) = tanh(W^c x(t) + U^c h(t−1))  (new memory cell)

c(t) = i(t) ∘ c̃(t) + f(t) ∘ c(t−1)
h(t) = o(t) ∘ tanh(c(t))

slide-39
SLIDE 39

Long Short Term Memory (LSTM)

[Figure: the same LSTM as on the previous slide, highlighting how the gate values i, o, f control the flow]

Computing gate values:

f(t) = g_f(x(t), h(t−1)),  i(t) = g_i(x(t), h(t−1)),  o(t) = g_o(x(t), h(t−1))
c̃(t) = tanh(W^c x(t) + U^c h(t−1))  (new memory cell)

Controlling by gate values:

c(t) = i(t) ∘ c̃(t) + f(t) ∘ c(t−1)
h(t) = o(t) ∘ tanh(c(t))

slide-40
SLIDE 40

Long Short Term Memory (LSTM): Cell Unit Notation (Simplified)

[Figure: simplified cell-unit notation with x, c, h and the gates i, o, f]

c(t) = i(t) ∘ c̃(t) + f(t) ∘ c(t−1),  h(t) = o(t) ∘ tanh(c(t)),  c̃(t) = tanh(W^c x(t) + U^c h(t−1))

slide-41
SLIDE 41

Long Short Term Memory (LSTM): Long-term dependencies

[Figure: four unrolled LSTM cells with inputs x(1)…x(4), memory cells c(1)…c(4), hidden states h(1)…h(4), and gates i, o, f]

x(1) → h(4): early inputs can be preserved in the memory cell

over long spans of time steps by the gating mechanism

slide-42
SLIDE 42

LSTM: Update Formula

  • i(t) = σ(W^i x(t) + U^i h(t−1))  (input gate)
  • f(t) = σ(W^f x(t) + U^f h(t−1))  (forget gate)
  • o(t) = σ(W^o x(t) + U^o h(t−1))  (output/exposure gate)
  • c̃(t) = tanh(W^c x(t) + U^c h(t−1))  (new memory cell)
  • c(t) = f(t) ∘ c(t−1) + i(t) ∘ c̃(t)  (final memory cell)
  • h(t) = o(t) ∘ tanh(c(t))

Overall, h(t) is a function of x(t) and h(t−1).
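A numpy sketch of one LSTM step following these equations; bias terms are omitted as in the slide, and the weight-dictionary layout (`W`, `U` keyed by gate) is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """W and U are dicts of input-to-hidden / hidden-to-hidden weights per gate."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev)       # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev)       # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev)       # output/exposure gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev)   # new memory cell
    c_t = f_t * c_prev + i_t * c_tilde                   # final memory cell
    h_t = o_t * np.tanh(c_t)                             # hidden state
    return h_t, c_t
```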

slide-43
SLIDE 43

LSTM: Memory Cell

c(t) = f(t) ∘ c(t−1) + i(t) ∘ c̃(t),  h(t) = o(t) ∘ tanh(c(t))

[Figure: the gates i(t), f(t), o(t) and the candidate c̃(t) are each computed from x(t) and h(t−1)]

slide-44
SLIDE 44

LSTM: Memory Cell

  • c(t): behaves like a memory = MEMORY

– c(t) = f(t) ∘ c(t−1) + i(t) ∘ c̃(t)

  • M(t) = FORGET * M(t-1) + INPUT * NEW_INPUT
  • H(t) = OUTPUT * M(t)
  • FORGET: Erase operation (or memory reset)
  • INPUT: Write operation
  • OUTPUT: Read operation
slide-45
SLIDE 45

Memory Cell - Example

[Worked example with small numbers: the new memory M[t+1] = forget ∘ M[t] + input ∘ (new input), and the output H[t] = output ∘ tanh(M[t+1]), all computed elementwise]

slide-46
SLIDE 46

Long Short Term Memory (LSTM): Backpropagation

  • Error signal in gated flow

y = v ∘ x = diag(v) x   ⟹   ε_x = diag(v)ᵀ ε_y = v ∘ ε_y

(The error signal ε_y arriving at the gated output is passed through the gate values v.)

slide-47
SLIDE 47

Long Short Term Memory (LSTM): Backpropagation

Incoming error signal ε_{c_t} at the memory cell

a_t = W^c x_t + U^c h_{t−1}
c_t = i_t ∘ tanh(a_t) + f_t ∘ c_{t−1}
h_t = o_t ∘ tanh(c_t)

[Figure: flow between h_{t−1}, c_{t−1} and c_t through the gates i_t, f_t, o_{t−1}]

slide-48
SLIDE 48

Long Short Term Memory (LSTM): Backpropagation

Incoming error signal ε_{c_t} at the memory cell

a_t = W^c x_t + U^c h_{t−1}
c_t = i_t ∘ tanh(a_t) + f_t ∘ c_{t−1}
h_t = o_t ∘ tanh(c_t)

ε_{a_t} = tanh′(a_t) ∘ i_t ∘ ε_{c_t}

ε_{h_{t−1}} = (U^c)ᵀ ε_{a_t} = (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t})

slide-49
SLIDE 49

Long Short Term Memory (LSTM): Backpropagation

a_t = W^c x_t + U^c h_{t−1},  c_t = i_t ∘ tanh(a_t) + f_t ∘ c_{t−1},  h_t = o_t ∘ tanh(c_t)

ε_{h_{t−1}} = (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t})

ε_{c_{t−1}} = tanh′(c_{t−1}) ∘ o_{t−1} ∘ ε_{h_{t−1}} + f_t ∘ ε_{c_t}
           = tanh′(c_{t−1}) ∘ o_{t−1} ∘ (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t}) + f_t ∘ ε_{c_t}

slide-50
SLIDE 50

Long Short Term Memory (LSTM): Backpropagation

a_t = W^c x_t + U^c h_{t−1},  c_t = i_t ∘ tanh(a_t) + f_t ∘ c_{t−1},  h_t = o_t ∘ tanh(c_t)

ε_{h_{t−1}} = (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t})

ε_{c_{t−1}} = tanh′(c_{t−1}) ∘ o_{t−1} ∘ ε_{h_{t−1}} + f_t ∘ ε_{c_t}
           = tanh′(c_{t−1}) ∘ o_{t−1} ∘ (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t}) + f_t ∘ ε_{c_t}

where tanh′(x) = 1 − tanh²(x)

slide-51
SLIDE 51

LSTM vs. Vanilla RNN: Backpropagation

Vanilla RNN:  a_t = W h_{t−1} + U x_t,  h_t = tanh(a_t)

ε_{h_{t−1}} = h′(a_t) ∘ Wᵀ ε_{h_t}

LSTM:  ε_{c_{t−1}} = tanh′(c_{t−1}) ∘ o_{t−1} ∘ (U^c)ᵀ (tanh′(a_t) ∘ i_t ∘ ε_{c_t}) + f_t ∘ ε_{c_t}

where h(x) = tanh(x). The additive term f_t ∘ ε_{c_t} is the key to dealing with vanishing gradient problems.

Vanilla RNN vs. LSTM

slide-52
SLIDE 52

Exercise: Backpropagation for LSTM

[Exercise figure: h_{t−1}, x_t, gates i_t, f_t, o_t, candidate c̃_t (new input), memory cell c_{t−1} → c_t, hidden state h_t, output y_t]

Complete the flow graph & derive the weight update formulas.

slide-53
SLIDE 53

Gated Recurrent Units [Cho et al ’14]

  • Alternative architecture to handle long-term

dependencies

h(t) = f(x(t), h(t−1))

  • z(t) = σ(W^z x(t) + U^z h(t−1))  (update gate)
  • r(t) = σ(W^r x(t) + U^r h(t−1))  (reset gate)
  • h̃(t) = tanh(r(t) ∘ U h(t−1) + W x(t))  (new memory)
  • h(t) = (1 − z(t)) ∘ h̃(t) + z(t) ∘ h(t−1)  (hidden state)
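A numpy sketch of one GRU step following these equations; bias terms are omitted and the weight-dictionary layout is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U):
    """W and U are dicts of weights for the update gate, reset gate, and candidate."""
    z_t = sigmoid(W['z'] @ x_t + U['z'] @ h_prev)          # update gate
    r_t = sigmoid(W['r'] @ x_t + U['r'] @ h_prev)          # reset gate
    h_tilde = np.tanh(W['h'] @ x_t + r_t * (U['h'] @ h_prev))  # new memory
    h_t = (1.0 - z_t) * h_tilde + z_t * h_prev              # hidden state
    return h_t
```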

slide-54
SLIDE 54
  • The output layer of RNN takes a directed

graphical model that contains edges from outputs y(i) in the past to the current output

– This model is able to perform CRF-style tagging

LSTM CRF: RNN with Output Dependency

[Figure: inputs x(1)…x(t), hidden states h(1)…h(t), outputs y(1)…y(t), with edges between consecutive outputs]

slide-55
SLIDE 55

Recurrent Language Model

  • Introducing the state variable in the graphical model of

the RNN

[Figure: graphical model of the RNN with the state variable h(t−1) introduced]

slide-56
SLIDE 56

Bidirectional RNN

  • Combine two RNNs

– Forward RNN: an RNN that moves forward, beginning from the start of the sequence

– Backward RNN: an RNN that moves backward, beginning from the end of the sequence

– It can make a prediction of y(t) that depends on the

whole input sequence.

Forward RNN Backward RNN

slide-57
SLIDE 57

Bidirectional LSTM CRF [Huang ‘15]

  • One of the state-of-the art models for

sequence labelling tasks

BI-LSTM-CRF model applied to named entity tasks

slide-58
SLIDE 58

Bidirectional LSTM CRF [Huang ‘15]

Comparison of tagging performance on POS, chunking and NER tasks for various models [Huang et al. 15]

slide-59
SLIDE 59

Neural Machine Translation

  • RNN encoder-decoder

– Neural encoder-decoder: Conditional recurrent language model

  • Neural machine translation with attention

mechanism

– Encoder: Bidirectional LSTM – Decoder: Attention Mechanism [Bahdanau et al ’15]

  • Character based NMT

– Hierarchical RNN Encoder-Decoder [Ling ‘16] – Subword-level Neural MT [Sennrich ’15] – Hybrid NMT [Luong & Manning ‘16] – Google’s NMT [Wu et al ‘16]

slide-60
SLIDE 60

Neural Encoder-Decoder

Input text Translated text

Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

slide-61
SLIDE 61

Neural Encoder-Decoder: Conditional Recurrent Language Model

Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

slide-62
SLIDE 62

Neural Encoder-Decoder [Cho et al ’14]

Encoder: RNN

Decoder: Recurrent language model

  • Computing the log of the translation probability log P(y|x) by two RNNs
slide-63
SLIDE 63

Decoder: Recurrent Language Model

Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

slide-64
SLIDE 64

Neural Encoder-Decoder with Attention Mechanism [Bahdanau et al ’15]

  • Decoder with attention mechanism

– Apply attention first to the encoded representations before generating a next target word – Attention: find aligned source words for a target word

  • Considered as implicit alignment process

– Context vector c:

  • Previously, the last hidden state from RNN encoder[Cho et al ’14]
  • Now, content-sensitively chosen with a mixture of hidden states of

input sentence at generating each target word

[Figure: at each decoding step, attention over the encoder states conditions the sampling of the next target word]

slide-65
SLIDE 65

Decoder with Attention Mechanism

  • Attention: softmax(score(h_{t−1}, h̄_s))

score(h_{t−1}, h̄_s) = wᵀ tanh(W h_{t−1} + V h̄_s)  (attention scoring function)

Directly computes a soft alignment:

a_t(s) = exp(score(h_{t−1}, h̄_s)) / Σ_{s′} exp(score(h_{t−1}, h̄_{s′}))

Expected annotation: the context vector is the a_t-weighted sum of the encoded representations H̄ = [h̄_1, ⋯, h̄_n]

h̄_s: a source hidden state
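A sketch of the additive (Bahdanau-style) scoring and soft alignment above; all array names and shapes are illustrative assumptions.

```python
import numpy as np

def attention(h_dec_prev, H_src, W, V, w):
    """score(h_{t-1}, h_s) = w^T tanh(W h_{t-1} + V h_s); a_t = softmax(scores).
    H_src: (n, d) matrix whose rows are the source hidden states h_s.
    Returns the soft alignment a_t and the expected annotation (context vector).
    """
    scores = np.array([w @ np.tanh(W @ h_dec_prev + V @ h_s) for h_s in H_src])
    scores = scores - scores.max()
    a_t = np.exp(scores) / np.exp(scores).sum()   # soft alignment over source words
    context = a_t @ H_src                          # expected annotation
    return a_t, context
```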

slide-66
SLIDE 66
  • Original scoring function [Bahdanau et al ’15]
  • Extension of scoring functions [Luong et al ‘15]

Decoder with Attention Mechanism

Original scoring function:  score(h_{t−1}, h̄_s) = wᵀ tanh(W h_{t−1} + V h̄_s)

Extension: bilinear function  score(h_t, h̄_s) = h_tᵀ W h̄_s

slide-67
SLIDE 67
  • Attention scoring function

http://aclweb.org/anthology/D15-1166

  • Computation path: h_t → a_t → c_t → h̃_t

  • Previously, h_{t−1} → a_t → c_t → h_t

Neural Encoder-Decoder with Attention Mechanism [Luong et al ‘15]

slide-68
SLIDE 68

Neural Encoder-Decoder with Attention Mechanism [Luong et al ‘15]

  • Input-feeding approach

Attentional vectors h̃_t are fed as inputs to the next time steps to inform the model about past alignment decisions

(At step t, the attentional vector is concatenated with the next input vector to form the input at step t+1.)

slide-69
SLIDE 69

GNMT: Google’s Neural Machine Translation [Wu et al ‘16]

Deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder.

Trained by Google’s Tensor Processing Unit (TPU)

slide-70
SLIDE 70

GNMT: Google’s Neural Machine Translation [Wu et al ‘16]

Mean of side-by-side scores on production data

Reduces translation errors by an average of 60% compared to Google’s phrase-based production system.

slide-71
SLIDE 71

Pointer Network

Neural encoder-decoder Pointer network

  • Attention as a pointer to select a member of the

input sequence as the output.

Attention as output

slide-72
SLIDE 72

Neural Conversational Model

[Vinyals and Le ’ 15]

  • Using neural encoder-decoder for conversations

– Response generation

http://arxiv.org/pdf/1506.05869.pdf

slide-73
SLIDE 73

BIDAF for Machine Reading Comprehension [Seo ‘17]

Bidirectional attention flow

slide-74
SLIDE 74

Memory Augmented Neural Networks

– Extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes

  • Writing & Reading mechanisms are added
  • Examples
  • Neural Turing Machine
  • Differentiable Neural

Computer

  • Memory networks
slide-75
SLIDE 75

Neural Turing Machine [Graves ‘14]

  • Two basic components: A

neural network controller and a memory bank.

  • The controller network

receives inputs from an external environment and emits outputs in response.

– It also reads from and writes to a memory matrix via a set of parallel read and write heads.

slide-76
SLIDE 76

Memory

  • Memory M_t

– The contents of the N × M memory matrix at time t

slide-77
SLIDE 77

Read/Write Operations for Memory

  • Read from memory (“blurry”)

– w_t: a vector of weightings over the N locations emitted by a read head at time t (Σ_i w_t(i) = 1) – r_t: the length-M read vector

  • Write to memory (“blurry”)

– Each write: an erase followed by an add – e_t: erase vector, a_t: add vector
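A sketch of the blurry read and write described above, with the memory M as an N×M matrix; the function names and the exact variable layout are assumptions following the notation reconstructed here.

```python
import numpy as np

def ntm_read(M, w):
    """Read vector r_t = sum_i w_t(i) M_t(i): a weighted mix of memory rows."""
    return w @ M

def ntm_write(M, w, e, a):
    """Each write is an erase followed by an add:
    M~(i) = M(i) * (1 - w(i) e),  M'(i) = M~(i) + w(i) a
    """
    M_erased = M * (1.0 - np.outer(w, e))   # erase, weighted per location
    return M_erased + np.outer(w, a)        # add, weighted per location
```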

slide-78
SLIDE 78

Addressing by Content

  • Based on Attention mechanism

– Focuses attention on locations based on the similarity b/w the current values and values emitted by the controller – k_t: the length-M key vector – β_t: a key strength, which can amplify or attenuate the precision of the focus – K[u,v]: similarity measure → cosine similarity

slide-79
SLIDE 79

Addressing

  • Interpolating content-based weights with

previous weights

– which results in the gated weighting

  • A scalar interpolation gate g_t

– Blends between the weighting w_{t−1} produced by the head at the previous time step and the weighting w_t^c produced by the content system at the current time step

slide-80
SLIDE 80

Addressing by Location

  • Based on Shifting

– s_t: a shift weighting that defines a normalized distribution over the allowed integer shifts

  • E.g.) The simplest way: use a softmax layer
  • Scalar-based: if the shift scalar is 6.7, then s_t(6) = 0.3,

s_t(7) = 0.7, and the rest of s_t is zero

– γ_t: an additional scalar which sharpens the final weighting

slide-81
SLIDE 81

Addressing: Architecture

slide-82
SLIDE 82

Controller

Controller

Input: the read vector r_t ∈ R^M

The network for the controller: FNN or RNN

Outputs for the read head:

  • k_t^R ∈ R^M (key vector)
  • s_t^R ∈ [0,1]^N (shift weighting)
  • β_t^R ∈ R_+ (key strength)
  • γ_t^R ∈ R_≥1 (sharpening scalar)
  • g_t^R ∈ (0,1) (interpolation gate)

Outputs for the write head:

  • e_t, a_t, k_t^W ∈ R^M (erase, add, key vectors)
  • s_t^W ∈ [0,1]^N,  β_t^W ∈ R_+,  γ_t^W ∈ R_≥1,  g_t^W ∈ (0,1)

External output

slide-83
SLIDE 83

NTM vs. LSTM: Copy task

  • Task: Copy sequences of eight bit random vectors, where

sequence lengths were randomised b/w 1 and 20

slide-84
SLIDE 84

NTM vs. LSTM: Mult copy

slide-85
SLIDE 85

Differentiable Neural Computers

  • Extension of NTM by advancing Memory addressing
  • Memory addressing are defined by three main

attention mechanisms

– Content (also used in NTM) – memory allocation – Temporal order

  • The controller interpolates among these

mechanisms using scalar gates

Credit: http://people.idsia.ch/~rupesh/rnnsymposium2016/slides/graves.pdf

slide-86
SLIDE 86

DNC: Overall architecture

slide-87
SLIDE 87

DNC: bAbI Results

  • Each story is treated as a separate sequence and

presented to the network in the form of word vectors, one word at a time.

mary journeyed to the kitchen. mary moved to the bedroom. john went back to the hallway. john picked up the milk there. what is john carrying ? - john travelled to the garden. john journeyed to the bedroom. what is john carrying ? - mary travelled to the bathroom. john took the apple there. what is john carrying ? - -

The answers required at the ‘−’ symbols, grouped by question into braces, are {milk}, {milk}, {milk apple} The network was trained to minimize the cross-entropy of the softmax outputs with respect to the target words

slide-88
SLIDE 88

DNC: bAbI Results

http://www.nature.com/nature/journal/v538/n7626/full/nature20101.html

slide-89
SLIDE 89

Deep learning for Natural language processing

  • Short intro to NLP
  • Word embedding
  • Deep learning for NLP
slide-90
SLIDE 90

Natural Language Processing

  • What is NLP?

– The automatic processing of human language

  • Give computers the ability to process human language

– Its goal is to enable computers to achieve human-like comprehension of texts/languages
  • Tasks

– Text processing

  • POS Tagging / Parsing / Discourse analysis

– Information extraction – Question answering – Dialog system / Chatbot – Machine translation

slide-91
SLIDE 91

Linguistics and NLP

  • Many NLP tasks correspond to structural

subfields of linguistics

Subfields of linguistics: Phonetics, Phonology, Morphology, Syntax, Semantics, Pragmatics

Corresponding NLP tasks: Speech recognition, Word segmentation, POS tagging, Parsing, Word sense disambiguation, Semantic role labeling, Semantic parsing, Named entity recognition/disambiguation, Reading comprehension

slide-92
SLIDE 92

Information Extraction

According to Robert Callahan, president of Eastern’s flight attendants union, the past practice of Eastern’s parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier’s terms.

Entity extraction:

According to <Per> Robert Callahan </Per>, president of <Org> Eastern’s </Org> flight attendants union, the past practice of <Org> Eastern’s </Org> parent, <Loc> Houston </Loc>-based <Org> Texas Air Corp. </Org>, has involved ultimatums to unions to accept the carrier’s terms

Relation extraction:

Robert Callahan – Eastern’s: <Employee_Of>
Texas Air Corp – Houston: <Located_In>

slide-93
SLIDE 93

POS Tagging

  • Input:

Plays well with others

  • Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
  • Output:

Plays/VBZ well/RB with/IN others/NNS

slide-94
SLIDE 94

Parsing

  • Sentence: “John ate the apple”
  • Parse tree (PSG tree)

S NP VP N John V NP ate DET N the apple

S → NP VP NP → N NP → DET N VP → V NP N → John V → ate DET → the N → apple

slide-95
SLIDE 95

Dependency Parsing

S NP VP N John V NP ate DET N the apple John ate the apple PSG tree John ate the apple SUBJ MOD OBJ Dependency tree

slide-96
SLIDE 96

Semantic Role Labeling

[Agent Jim] gave [Patient the book] [Goal to the professor.] Jim gave the book to the professor

Semantic roles and descriptions:

– Agent: initiator of action, capable of volition
– Patient: affected by action, undergoes change of state
– Theme: entity moving, or being “located”
– Experiencer: perceives action but not in control
– Other roles: Beneficiary, Instrument, Location, Source, Goal
slide-97
SLIDE 97

Sentiment analysis

(1) I bought a Samsung camera and my friends brought a Canon camera yesterday. (2) In the past week, we both used the cameras a lot. (3) The photos from my Samy are not that great, and the battery life is short too. (4) My friend was very happy with his camera and loves its picture quality. (5) I want a camera that can take good photos. (6) I am going to return it tomorrow.

Posted by: big John

Extracted opinion tuples: (Samsung, picture_quality, negative, big John) (Samsung, battery_life, negative, big John) (Canon, GENERAL, positive, big John’s_friend) (Canon, picture_quality, positive, big John’s_friend)

slide-98
SLIDE 98

Coreference Resolution

[A man named Lionel Gaedi]1 went to [the Port-au-Prince morgue]2 in search of [[his]1 brother]3, [ Josef ]3, but was unable to find [[his]3 body]4 among [the piles of corpses that had been left [there]2 ]5. [A man named Lionel Gaedi] went to [the Port-au-Prince morgue]2 in search of [[his] brother], [Josef], but was unable to find [[his] body] among [the piles of corpses that had been left [there] ].

slide-99
SLIDE 99

Question Answering

  • One of the oldest NLP tasks
  • Modern QA systems

– IBM’s Watson, Apple’s Siri, etc.

  • Examples of Factoid questions

Q: Where is the Louvre Museum located? — A: In Paris, France
Q: What’s the abbreviation for limited partnership? — A: L.P.
Q: What are the names of Odin’s ravens? — A: Huginn and Muninn
Q: What currency is used in China? — A: The yuan

slide-100
SLIDE 100

Example: IBM Watson System

  • Open-domain question answering system (DeepQA)

– In 2011, Watson defeated Brad Rutter and Ken Jennings in the Jeopardy! Challenge

slide-101
SLIDE 101

Machine Reading Comprehension

  • SQuAD / KorQuAD
slide-102
SLIDE 102

Chatbot

slide-103
SLIDE 103

Conversational Question Answering

slide-104
SLIDE 104

Word embedding: Distributed representation

– Distributed representation

  • n-dimensional latent vector for a word
  • Semantically similar words are closely located in vector space
slide-105
SLIDE 105

Word embedding matrix: Lookup Table

A word is first mapped to its one-hot vector e (a |V|-dimensional vector with a single 1).

L ∈ R^{d×|V|}: the word embedding matrix (lookup table); each column (e.g. for the, cat, mat) is a d-dimensional word vector.

The word vector x is obtained from the one-hot vector e by referring to the lookup table:

x = L e
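The lookup x = L e reduces to selecting a column of L; a tiny sketch with an assumed toy vocabulary follows.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "mat": 2}   # assumed toy vocabulary
d, V = 4, len(vocab)
L = np.random.randn(d, V)                # word embedding matrix (lookup table)

def embed(word):
    e = np.zeros(V)
    e[vocab[word]] = 1.0                 # one-hot vector
    return L @ e                         # x = L e (equivalently L[:, vocab[word]])

x_cat = embed("cat")
```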

slide-106
SLIDE 106

Word embedding matrix: context  input layer

  • Word sequence x_1 ⋯ x_n → input layer

The n context words x_{t−n+1}, ⋯, x_{t−2}, x_{t−1} are each mapped through the lookup table (L e_x, where e_x is the one-hot vector of word x) to a d-dimensional vector; the concatenation of these d-dimensional vectors forms the n·d-dimensional input vector.

slide-107
SLIDE 107

Natural Language Processing using Word Embedding

Unsupervised step: learn the word embedding matrix from a raw corpus → use it to initialize the lookup table.

Supervised step: train an application-specific neural network on an annotated corpus → the lookup table is further fine-tuned.

slide-108
SLIDE 108

Language Model

  • Defines a probability distribution over sequences of

tokens in a natural language

  • N-gram model

– An n-gram is a sequence of n tokens – Defines the conditional probability of the n-th token given the preceding n−1 tokens – To avoid zero probabilities, smoothing needs to be employed

  • Back-off methods
  • Interpolation method
  • Class-based language models

P(x_1, ⋯, x_T) = P(x_1, ⋯, x_{n−1}) Π_{t=n}^{T} P(x_t | x_{t−n+1}, ⋯, x_{t−1})

slide-109
SLIDE 109

Neural Language Model [Bengio ’03]

– Instead of an original raw symbol, a distributed representation of words is used – Word embedding: Raw symbols are projected to a low- dimensional vector space – Unlike class-based n-gram models, it can recognize that two words are semantically similar and also encode each word as distinct from each other

  • Method

– Estimate P(x_t | x_{t−(n−1)}, ⋯, x_{t−1}) by an FNN for classification – x: concatenated input features (input layer) – y = U tanh(d + H x) – y = W x + b + U tanh(d + H x)  (with direct input-to-output connections)
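A sketch of the forward pass y = W x + b + U tanh(d + H x) over concatenated context embeddings; the shapes, argument names, and the lookup-table layout are assumptions.

```python
import numpy as np

def nlm_scores(context_words, L, H, d, U, W, b):
    """Score every vocabulary word given the n-1 context word indices.
    L: (dim, |V|) lookup table; x: concatenation of the context word vectors.
    """
    x = np.concatenate([L[:, w] for w in context_words])
    hidden = np.tanh(d + H @ x)
    y = W @ x + b + U @ hidden            # direct + nonlinear connections
    return y                              # followed by a softmax over |V| words
```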

slide-110
SLIDE 110

Neural Language Model [Bengio ‘03]

  • Words are projected by a linear operation on the projection layer
  • Softmax function is used at the output layer to ensure that 0 <= p<= 1
slide-111
SLIDE 111

Neural Language Model

Image Credit: Tomas Mikolov

[Figure: the context words x_{t−n+1}, ⋯, x_{t−1} are looked up, concatenated into the input layer x, passed through the hidden layer, and scored at the output layer y_1 ⋯ y_{|V|}; a softmax layer normalizes the scores]

y = W x + b + U tanh(d + H x)

P(x_t | x_{t−n+1}, ⋯, x_{t−1}) = exp(y_{x_t}) / Σ_u exp(y_u)

slide-112
SLIDE 112

Experiments: Neural Language Model [Bengio ’03]

NLM

slide-113
SLIDE 113

Neural Language Model: Discussion

  • Limitation: Computational complexity

– Softmax layer requires computing scores over all vocabulary words

  • Vocabulary size is very large
  • Using a short list [Schwenk ‘02]

– The vocabulary V is split into a shortlist L of the most frequent words and a tail T of rarer words

P(y = i | C) = 1_{i∈L} · P(y = i | C, i ∈ L) · (1 − P(i ∈ T | C)) + 1_{i∈T} · P(y = i | C, i ∈ T) · P(i ∈ T | C)

The shortlist part uses the neural language model; the tail part uses an n-gram model.

slide-114
SLIDE 114

Hierarchical Softmax [Morin and Bengio ‘05]

  • Requires O(log|V|) computation, instead of O(|V|)
  • The next-word conditional probability is computed by

P(w | x_{t−1}, ⋯, x_{t−n+1}) = Π_{j=1}^{m} P(b_j(w) | b_1(w), ⋯, b_{j−1}(w), x_{t−1}, ⋯, x_{t−n+1})

where b_1(w), ⋯, b_m(w) is the sequence of decisions on the path leading to word w in a tree over the vocabulary
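A minimal sketch of the path-product idea: each word is assigned a path of binary decisions in a tree over the vocabulary, and its probability is the product of the per-node decision probabilities. The per-node parametrization below (one vector per inner node, sigmoid decisions) is an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(context_vec, path_nodes, path_bits, node_vecs):
    """P(w | context) = prod_j P(b_j(w) | b_1..b_{j-1}, context).
    path_nodes: indices of the O(log|V|) inner nodes on the path to w
    path_bits:  the left/right decision (0 or 1) taken at each node
    node_vecs:  one parameter vector per inner node
    """
    prob = 1.0
    for node, bit in zip(path_nodes, path_bits):
        p_right = sigmoid(node_vecs[node] @ context_vec)
        prob *= p_right if bit == 1 else (1.0 - p_right)
    return prob
```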

slide-115
SLIDE 115

Importance Sampling [Bengio ‘03; Jean ’15]

∂ log p(x|C) / ∂θ = ∂y_x/∂θ − Σ_{x′∈V} P(x′|C) ∂y_{x′}/∂θ

y_x: score given before applying the softmax

The expected gradient (the sum over the full vocabulary) is approximated by importance sampling over a sampled subset V′ drawn from a proposal distribution Q:

Σ_{x′∈V′} (ω_{x′} / Σ_{x″∈V′} ω_{x″}) ∂y_{x′}/∂θ,   with ω_x = exp{y_x − log Q(x)}

slide-116
SLIDE 116

Ranking Loss [Collobert & Weston ’08]

  • Sampling a negative example

– s: a given sequence of words (in the training data) – s′: a negative example in which the last word is replaced with another word – f(s): score of the sequence s – Goal: make the score difference (f(s) − f(s′)) large – Various loss functions are possible

  • Hinge loss: max(0, 1 − f(s) + f(s′))
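A sketch of the negative-sampling hinge loss above; the scoring network f is left abstract, and the function and parameter names are mine.

```python
import numpy as np

def hinge_ranking_loss(score_pos, score_neg):
    """max(0, 1 - f(s) + f(s')) for a training window s and its perturbed copy s'."""
    return max(0.0, 1.0 - score_pos + score_neg)

def corrupt_last_word(window, vocab_size, rng=np.random):
    """Negative example: replace the last word of the window with a random word."""
    neg = list(window)
    neg[-1] = rng.randint(vocab_size)
    return neg
```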
slide-117
SLIDE 117

Ranking Loss [Collobert & Weston ’08]

[Figure: the same network scores the original window x (score y) and a perturbed window x′ whose last word is replaced (score y′)]

Loss: max(0, (y′ + 1) − y)

slide-118
SLIDE 118

Recurrent Neural Language Model [Mikolov ‘00]

  • The hidden layer of the

previous time step connects to the hidden layer of the next word

  • No need to specify the

context length

– NLM: only (n-1) previous words are conditioned – RLM: all previous words are conditioned

slide-119
SLIDE 119

Recurrent Neural Language Model

  • Unfolded flow graph

http://arxiv.org/abs/1511.07916

slide-120
SLIDE 120

Word2Vec [Mikolov ‘13]

p(w_O | w_I) = exp(v′_{w_O}ᵀ v_{w_I}) / Σ_w exp(v′_wᵀ v_{w_I})

http://arxiv.org/pdf/1301.3781.pdf
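The softmax form of the skip-gram probability, sketched directly; `V_in` and `V_out` stand for the input and output embedding matrices and are assumed names.

```python
import numpy as np

def skipgram_prob(w_input, w_output, V_in, V_out):
    """p(w_O | w_I) = exp(v'_{w_O} . v_{w_I}) / sum_w exp(v'_w . v_{w_I})"""
    scores = V_out @ V_in[w_input]          # one score per vocabulary word
    scores = scores - scores.max()          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_output]
```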

slide-121
SLIDE 121

Word Vectors have linear relationships

Mikolov ‘13

slide-122
SLIDE 122

ELMo: Contextualized Word Embedding

  • Word embeddings are differentiated according to the surrounding context
slide-123
SLIDE 123

ELMo: Contextualized Word Embedding

  • The ELMo architecture is based on an LSTM language model

– The next-word-prediction task is set up in both the forward and the backward direction → BiLM (bidirectional language model)

slide-124
SLIDE 124

ELMo: Contextualized Word Embedding

  • The contextualized word embedding for the k-th word:

– after ELMo encodes the sentence, a linear combination of that word’s representations from all layers

slide-125
SLIDE 125

ELMo: Contextualized Word Embedding

  • The BiLM (ELMo) is pre-trained on a raw corpus

– The pre-trained model is then used for other NLP tasks

slide-126
SLIDE 126

ELMo: Contextualized Word Embedding

  • Experimental results

– Improved the then state-of-the-art performance on each NLP task

slide-127
SLIDE 127

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • Transformer encoder [Vaswani et al ‘17]
  • Multi-headed self attention
  • Layer norm and residuals
  • Positional embeddings
slide-128
SLIDE 128

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

slide-129
SLIDE 129

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • Experimental results

– Pre-training on a large corpus → fine-tuning

  • Achieved state-of-the-art performance on each NLP task
slide-130
SLIDE 130

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

slide-131
SLIDE 131

Summary

  • Recurrent neural networks

– Classical RNN, LSTM, RNN encoder-decoder, attention mechanism, Neural Turing machine

  • Deep learning for NLP

– Word embedding – ELMo, BERT: Contextualized word embedding – Applications

  • POS tagging, Named entity recognition, Semantic role labelling
  • Information extraction, Parsing, Sentiment analysis, Neural

machine translation, Question answering, Response generation, Sentence completion, Reading comprehension, Information retrieval, Sentence retrieval, Knowledge-base completion