SLIDE 1

Deep Learning for Natural Language Processing

Jindřich Libovický

March 1, 2017

Introduction to Natural Language Processing

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

slide-2
SLIDE 2

Outline

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 3

Deep Learning in NLP

  • NLP tasks are learned end-to-end using deep learning, the number-one approach in current research
  • State of the art in POS tagging, parsing, named-entity recognition, machine translation, …
  • Good news: training with almost no linguistic insight
  • Bad news: requires an enormous amount of training data and a lot of computational power

SLIDE 4

What is deep learning?

  • Buzzword for machine learning using neural networks with many layers, trained by back-propagation
  • Learning a real-valued function with millions of parameters that solves a particular problem
  • Learning more and more abstract representations of the input data until we reach a representation suitable for our problem

SLIDE 5

Neural Networks Basics

SLIDE 6

Neural Networks Basics

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 7

Single Neuron

[Figure: a single neuron with inputs x, weights w, and an activation function]

  • output: y = f(∑_i w_i · x_i), here with the activation "is the sum > 0?"
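A minimal NumPy sketch of such a neuron (the function and names are illustrative, not the lecture's code):

import numpy as np

def neuron(x, w, activation=lambda s: float(s > 0)):
    # weighted sum of the inputs followed by an activation function
    return activation(np.dot(w, x))

# example: two inputs, threshold activation "is the sum > 0?"
print(neuron(np.array([1.0, -0.5]), np.array([0.3, 0.8])))  # 0.0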

SLIDE 8

Neural Network

h_1 = g(W_1 x + b_1)
h_2 = g(W_2 h_1 + b_2)
⋮
h_n = g(W_n h_{n−1} + b_n)
o = h(W_o h_n + b_o)

F = f(o, y)   (the loss compares the output o with the target y)

Back-propagation via the chain rule, e.g.:

∂F/∂W_o = ∂F/∂o · ∂o/∂W_o

SLIDE 9

Implementation

Logistic regression: ŷ = σ(Wx + b)   (1)

Computation graph:

[Figure: forward graph (x and W enter ×, b enters +, σ gives h, then the loss against the target y*) and the corresponding backward graph (σ′, +, × producing the gradients of b and W)]

SLIDE 10

Frameworks for Deep Learning

research and prototyping in Python

TensorFlow:

  • graph statically constructed, symbolic computation
  • computation happens in a session
  • allows graph export and running as a binary

PyTorch:

  • computations written dynamically as normal procedural code
  • easy debugging: inspecting variables at any time of the computation

SLIDE 11

Representing Words

SLIDE 12

Representing Words

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 13

Language Modeling

  • estimate the probability of the next word in a text

P(x_j | x_{j−1}, x_{j−2}, …, x_1)

  • standard approach: n-gram models with the Markov assumption, interpolating relative frequencies

P(x_j | x_{j−1}, …, x_1) ≈ P(x_j | x_{j−1}, …, x_{j−n}) ≈ ∑_{k=0}^{n} λ_k · c(x_{j−k}, …, x_{j−1}, x_j) / c(x_{j−k}, …, x_{j−1})

  • Let's simulate it with a neural network:

… ≈ F(x_{j−1}, …, x_{j−n}; θ)   where θ is a set of trainable parameters.

SLIDE 14

Simple Neural Language Model

[Figure: one-hot vectors 1_{x_{n−3}}, 1_{x_{n−2}}, 1_{x_{n−1}} are multiplied by a shared embedding matrix, projected by weight matrices W_1, W_2, W_3, summed with a bias b_h, passed through tanh, and a final softmax layer (· W + b) outputs]

P(x_n | x_{n−1}, x_{n−2}, x_{n−3})

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3 (Feb):1137–1155, 2003. ISSN 1532-4435

SLIDE 15

Neural LM: Word Representation

  • limited vocabulary (hundreds of thousands of words): an indexed set of words
  • words are initially represented as one-hot vectors 1_x = (0, …, 0, 1, 0, …, 0)
  • the projection 1_x · W corresponds to selecting one row from the matrix W
  • W is a table of learned word vector representations, the so-called word embeddings
  • dimension typically 100–300

The first hidden layer is then: h_1 = W_{x_{j−n}} ⊕ W_{x_{j−n+1}} ⊕ … ⊕ W_{x_{j−1}}. The matrix W is shared for all words.

SLIDE 16

Neural LM: Next Word Estimation

  • optionally add an extra hidden layer:

h_2 = g(h_1 W_1 + b_1)

  • last layer: probability distribution over the vocabulary

ŷ = softmax(h_2 W_2 + b_2) = exp(h_2 W_2 + b_2) / ∑ exp(h_2 W_2 + b_2)

  • training objective: cross-entropy between the true (i.e., one-hot) distribution and the estimated distribution

F = − ∑_j q_true(x_j) log ŷ(x_j) = ∑_j − log ŷ(x_j)

  • learned by error back-propagation

SLIDE 17

Learned Representations

  • word embeddings from LMs have interesting properties
  • cluster according to POS & meaning similarity

Table taken from Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011. ISSN 1533-7928

  • in IR: query expansion by nearest neighbors
  • in deep learning models: embedding initialization speeds up training / allows a more complex model with less data

SLIDE 18

Implementation in PyTorch I

import torch
import torch.nn as nn


class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden_layer = nn.Linear(3 * embedding_dim, hidden_dim)
        self.output_layer = nn.Linear(hidden_dim, vocab_size)
        self.loss_function = nn.CrossEntropyLoss()

    def forward(self, word_1, word_2, word_3, target=None):
        embedded_1 = self.embedding(word_1)
        embedded_2 = self.embedding(word_2)
        embedded_3 = self.embedding(word_3)

SLIDE 19

Implementation in PyTorch II

        hidden = torch.tanh(self.hidden_layer(
            torch.cat([embedded_1, embedded_2, embedded_3], dim=1)))
        logits = self.output_layer(hidden)
        loss = None
        if target is not None:
            loss = self.loss_function(logits, target)
        return logits, loss

SLIDE 20

Implementation in TensorFlow I

import tensorflow as tf

input_words = [tf.placeholder(tf.int32, shape=[None]) for _ in range(3)]
target_word = tf.placeholder(tf.int32, shape=[None])

embeddings = tf.get_variable("embeddings", shape=[vocab_size, emb_dim], dtype=tf.float32)
embedded_words = tf.concat(
    [tf.nn.embedding_lookup(embeddings, w) for w in input_words], axis=1)

hidden_layer = tf.layers.dense(embedded_words, hidden_size, activation=tf.tanh)
output_layer = tf.layers.dense(hidden_layer, vocab_size, activation=None)
output_probabilities = tf.nn.softmax(output_layer)

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target_word, logits=output_layer)

optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss)

SLIDE 21

Implementation in TensorFlow II

session = tf.Session()
session.run(tf.global_variables_initializer())  # initialize variables

Training given batch

_, loss_value = session.run([train_op, loss], feed_dict={ input_words[0]: ..., input_words[1]: ..., input_words[2]: ..., target_word: ... })

Inference given batch

probs = session.run(output_probabilities, feed_dict={ input_words[0]: ..., input_words[1]: ..., input_words[2]: ..., })

SLIDE 22

Representing Sequences

SLIDE 23

Representing Sequences

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 24

Representing Sequences

Recurrent Networks

SLIDE 25

Recurrent Networks (RNNs)

…the default choice for sequence labeling

  • inputs: x_1, …, x_T
  • initial state h_0: a zero vector, the result of a previous computation, or a trainable parameter
  • recurrent computation: h_t = A(h_{t−1}, x_t)

SLIDE 26

RNN as Imperative Code

def rnn(initial_state, inputs):
    prev_state = initial_state
    for x in inputs:
        new_state, output = rnn_cell(x, prev_state)
        prev_state = new_state
        yield output

SLIDE 27

RNN as a Fancy Image

SLIDE 28

Vanilla RNN

h_t = tanh (W [h_{t−1}; x_t] + b)

  • cannot propagate long-distance relations
  • vanishing gradient problem

SLIDE 29

Vanishing Gradient Problem (1)

tanh x = (1 − e^{−2x}) / (1 + e^{−2x})

[Plot: tanh x for x from −6 to 6; values stay in (−1, 1)]

d tanh x / dx = 1 − tanh² x ∈ (0, 1]

[Plot: the derivative of tanh for x from −6 to 6; values stay in (0, 1]]

Weights are initialized ∼ N(0, 1) to keep the gradients further from zero.

SLIDE 30

Vanishing Gradient Problem (2)

∂F_{t+1} / ∂b = ∂F_{t+1} / ∂h_{t+1} · ∂h_{t+1} / ∂b   (chain rule)

SLIDE 31

Vanishing Gradient Problem (3)

∂h_t / ∂b = ∂ tanh(W_h h_{t−1} + W_x x_t + b) / ∂b   (denote the activation a_t = W_h h_{t−1} + W_x x_t + b)

= tanh′(a_t) · ( ∂(W_h h_{t−1}) / ∂b + ∂(W_x x_t) / ∂b + ∂b / ∂b )   where ∂(W_x x_t) / ∂b = 0 and ∂b / ∂b = 1

= W_h tanh′(a_t) · ∂h_{t−1} / ∂b + tanh′(a_t)

with W_h ∼ N(0, 1) and tanh′(a_t) ∈ (0, 1], so the contribution of earlier steps keeps shrinking.
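A small NumPy experiment (an illustrative sketch, not from the slides) that shows how the repeated factor W_h · tanh′(a_t) shrinks the influence of early steps in a scalar RNN:

import numpy as np

np.random.seed(0)
w_h = np.random.randn()   # recurrent weight drawn ~ N(0, 1)
h = 0.0
factor = 1.0              # product of w_h * tanh'(a_t) over the time steps

for t in range(1, 51):
    a = w_h * h + np.random.randn()      # activation with a random input term
    h = np.tanh(a)
    factor *= w_h * (1 - h ** 2)         # tanh'(a) is in (0, 1]
    if t % 10 == 0:
        print(f"step {t}: influence of the first step ~ {factor:.2e}")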

SLIDE 32

Long Short-Term Memory Networks

LSTM = Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. ISSN 0899-7667

Control the gradient flow by explicit gating:

  • what to use from the input,
  • what to use from the hidden state,
  • what to put on the output

SLIDE 33

LSTM: Hidden State

  • two types of hidden states
  • h_t – the "public" hidden state, used as the output
  • C_t – the "private" memory cell, with no non-linearities on the way
  • direct flow of gradients (without multiplying by derivatives ≤ 1)

SLIDE 34

LSTM: Forget Gate

f_t = σ (W_f [h_{t−1}; x_t] + b_f)

  • based on input and previous state, decide what to forget from the memory

SLIDE 35

LSTM: Input Gate

i_t = σ (W_i · [h_{t−1}; x_t] + b_i)
C̃_t = tanh (W_C · [h_{t−1}; x_t] + b_C)

  • C̃_t – a candidate for what we may want to add to the memory
  • i_t – decides how much of that information we want to store

SLIDE 36

LSTM: Cell State Update

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

SLIDE 37

LSTM: Output Gate

o_t = σ (W_o · [h_{t−1}; x_t] + b_o)
h_t = o_t ⊙ tanh C_t

SLIDE 38

Here we are, LSTM!

f_t = σ (W_f [h_{t−1}; x_t] + b_f)
i_t = σ (W_i [h_{t−1}; x_t] + b_i)
o_t = σ (W_o [h_{t−1}; x_t] + b_o)
C̃_t = tanh (W_C [h_{t−1}; x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh C_t

Question: How would you implement it efficiently? Compute all gates in a single matrix multiplication.
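A minimal NumPy sketch of one LSTM step in the spirit of that answer: all four gates come out of a single matrix multiplication and are then split (names and shapes are illustrative, not the deck's code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    # W has shape [4 * dim, dim + dim_x], b has shape [4 * dim]
    z = W @ np.concatenate([h_prev, x]) + b        # one matmul for all gates
    f, i, o, c_cand = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    c = f * c_prev + i * np.tanh(c_cand)           # cell state update
    h = o * np.tanh(c)                             # public hidden state
    return h, c

dim, dim_x = 8, 4
W = np.random.randn(4 * dim, dim + dim_x) * 0.1
b = np.zeros(4 * dim)
h, c = lstm_step(np.ones(dim_x), np.zeros(dim), np.zeros(dim), W, b)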

SLIDE 39

Gated Recurrent Units

update gate:            z_t = σ(x_t W_z + h_{t−1} U_z + b_z) ∈ (0, 1)
remember gate:          r_t = σ(x_t W_r + h_{t−1} U_r + b_r) ∈ (0, 1)
candidate hidden state: h̃_t = tanh (x_t W_h + (r_t ⊙ h_{t−1}) U_h) ∈ (−1, 1)
hidden state:           h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
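For comparison, a sketch of one GRU step following the equations above (a hypothetical minimal implementation with illustrative parameter names):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    # p is a dict with input matrices W_*, recurrent matrices U_* and biases b_*
    z = sigmoid(x @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])   # update gate
    r = sigmoid(x @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])   # remember gate
    h_cand = np.tanh(x @ p["W_h"] + (r * h_prev) @ p["U_h"])   # candidate state
    return (1 - z) * h_prev + z * h_cand

dim_x, dim_h = 4, 8
rng = np.random.default_rng(0)
shapes = [("W_z", (dim_x, dim_h)), ("U_z", (dim_h, dim_h)), ("b_z", (dim_h,)),
          ("W_r", (dim_x, dim_h)), ("U_r", (dim_h, dim_h)), ("b_r", (dim_h,)),
          ("W_h", (dim_x, dim_h)), ("U_h", (dim_h, dim_h))]
params = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes}
h = gru_step(np.ones(dim_x), np.zeros(dim_h), params)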

SLIDE 40

LSTM vs. GRU

  • GRU is smaller and therefore faster
  • performance similar, task dependent
  • theoretical limitation: GRU accepts regular languages, LSTM can simulate a counter machine

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ISSN 2331-8422; Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P18-2117

SLIDE 41

RNN in PyTorch

rnn = nn.LSTM(input_dim, hidden_size=512, num_layers=1,
              bidirectional=True, dropout=0.8)
output, (hidden, cell) = rnn(x)

https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM

SLIDE 42

RNN in TensorFlow

inputs = ...   # float tf.Tensor of shape [batch, length, dim]
lengths = ...  # int tf.Tensor of shape [batch]

# Cell objects are templates
fw_cell = tf.nn.rnn_cell.LSTMCell(512, name="fw_cell")
bw_cell = tf.nn.rnn_cell.LSTMCell(512, name="bw_cell")

outputs, states = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, inputs, sequence_length=lengths, dtype=tf.float32)

https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn

SLIDE 43

Bidirectional Networks

  • simple trick to improve performance
  • run one RNN forward, second one backward and concatenate outputs

Image from: http://colah.github.io/posts/2015-09-NN-Types-FP/

  • state of the art in tagging, crucial for neural machine translation

SLIDE 44

Representing Sequences

Convolutional Networks

SLIDE 45

1-D Convolution

≈ a sliding window over the sequence

embeddings x = (x_1, …, x_N), padded with zero vectors x_0 = 0⃗, x_{N+1} = 0⃗ if we want to keep the sequence length

h_1 = f (W [x_0; x_1; x_2] + b)
h_i = f (W [x_{i−1}; x_i; x_{i+1}] + b)

SLIDE 46

1-D Convolution: Pseudocode

xs = ...          # input sequence, shape [batch, length, dim]
kernel_size = 3   # window size
filters = 300     # output dimension
strides = 1       # step size

W = trained_parameter(xs.shape[2] * kernel_size, filters)
b = trained_parameter(filters)

window = kernel_size // 2
outputs = []
for i in range(window, xs.shape[1] - window):
    # flatten the window around position i and project it
    h = np.dot(xs[:, i - window:i + window + 1].reshape(xs.shape[0], -1), W) + b
    outputs.append(h)
return np.array(outputs)

SLIDE 47

1-D Convolution: Frameworks

TensorFlow

h = tf.layers.conv1d(x, filters=300, kernel_size=3,
                     strides=1, padding='same')

https://www.tensorflow.org/api_docs/python/tf/layers/conv1d

PyTorch

conv = nn.Conv1d(in_channels, out_channels=300, kernel_size=3, stride=1, padding=0, dilation=1, groups=1, bias=True) h = conv(x)

https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d

SLIDE 48

Rectified Linear Units

ReLU: ReLU(x) = max(0, x)

[Plot: ReLU for x from −6 to 6]

Derivative of ReLU: 0 for x < 0, 1 for x > 0

[Plot: the derivative of ReLU for x from −6 to 6]

faster, suffers less from the vanishing gradient

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010

SLIDE 49

Residual Connections

embeddings x = (x_1, …, x_N), padded with zero vectors x_0 = 0⃗, x_{N+1} = 0⃗

[Figure: a stack of convolutional layers with ⊕ skip connections that add each layer's input to its output]

h_i = f (W [x_{i−1}; x_i; x_{i+1}] + b) + x_i

Allows training deeper networks. Why does it help? Better gradient flow, the same as in RNNs.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. IEEE Computer Society

SLIDE 50

Residual Connections: Numerical Stability

Numerically unstable; we need the activations to be on a similar scale ⇒ layer normalization.

The activation before the non-linearity is normalized:

a′_i = (g_i / σ) (a_i − μ)

…g is a trainable parameter; μ and σ are estimated from the data:

μ = (1/H) ∑_{i=1}^{H} a_i          σ = √( (1/H) ∑_{i=1}^{H} (a_i − μ)² )

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. ISSN 2331-8422
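A small NumPy sketch of layer normalization as defined above (the gain g is the trainable parameter; the eps term is an assumption added only for numerical safety):

import numpy as np

def layer_norm(a, g, eps=1e-6):
    # normalize a vector of activations a with a trainable gain g
    mu = a.mean()
    sigma = np.sqrt(((a - mu) ** 2).mean())
    return g / (sigma + eps) * (a - mu)

a = np.array([1.0, 2.0, 7.0, -3.0])
print(layer_norm(a, g=np.ones_like(a)))   # zero mean, unit scale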

SLIDE 51

Receptive Field

[Figure: stacked convolutions over the embeddings x = (x_1, …, x_N); each additional layer sees a slightly wider window of the input]

The receptive field can be enlarged by dilated convolutions.

SLIDE 52

Convolutional architectures

+ extremely computationally efficient

− limited context
− by default not aware of n-gram order

  • max-pooling over the hidden states = element-wise maximum over the sequence
  • can be understood as an ∃ operator over the feature extractors

SLIDE 53

Representing Sequences

Self-attentive Networks

SLIDE 54

Self-attentive Networks

  • In some layers: states are a linear combination of the previous layer's states
  • Originally introduced for the Transformer model for machine translation
  • similarity matrix between all pairs of states
  • O(n²) memory, O(1) time (when parallelized)
  • next layer: sum by rows

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc

SLIDE 55

Multi-headed scaled dot-product attention

Single-head setup:

Attn(Q, K, V) = softmax (Q K⊤ / √d) V

h^(l+1) = softmax (h^(l) h^(l)⊤ / √d) h^(l)

Multi-head setup:

Multihead(Q, V) = (head_1 ⊕ ⋯ ⊕ head_h) W_O
head_i = Attn(Q W_i^Q, V W_i^K, V W_i^V)

[Figure: queries and keys & values pass through linear projections, are split into heads, each head computes scaled dot-product attention, and the heads' outputs are concatenated]

SLIDE 56

Dot-Product Attention in PyTorch

import math

import torch
import torch.nn.functional as F


def attention(query, key, value, mask=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn

SLIDE 57

Dot-Product Attention in TensorFlow

def scaled_dot_product(self, queries, keys, values):
    o1 = tf.matmul(queries, keys, transpose_b=True)
    o2 = o1 / (dim ** 0.5)
    o3 = tf.nn.softmax(o2)
    return tf.matmul(o3, values)

SLIDE 58

Position Encoding

The model cannot otherwise be aware of the position in the sequence, so a position encoding is added:

pos(t, i) = sin (t / 10000^(i/d))       if i mod 2 = 0
            cos (t / 10000^((i−1)/d))   otherwise

[Figure: heatmap of the sinusoidal position encoding over text length and dimension]
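A sketch of the sinusoidal position encoding following the formula above (the function name is illustrative):

import numpy as np

def position_encoding(length, dim):
    # sin on even dimensions, cos on odd ones
    pos = np.arange(length)[:, None]   # positions t
    i = np.arange(dim)[None, :]        # dimension indices i
    angles = pos / np.power(10000.0, (i - i % 2) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))   # [length, dim]

print(position_encoding(4, 6).round(2))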

SLIDE 59

Stacking self-attentive Layers

[Figure: one encoder layer: input embeddings ⊕ position encoding enter a self-attentive sublayer (multi-head attention over keys & values and queries, ⊕ residual connection, layer normalization) followed by a feed-forward sublayer (non-linear layer, linear layer, ⊕ residual connection, layer normalization); the whole layer is stacked N×]

  • several layers (6 in the original paper)
  • each layer has 2 sub-layers: self-attention and a feed-forward layer
  • everything inter-connected with residual connections

SLIDE 60

Architectures Comparison

                  computation     sequential operations   memory
Recurrent         O(n · d²)       O(n)                    O(n · d)
Convolutional     O(k · n · d²)   O(1)                    O(n · d)
Self-attentive    O(n² · d)       O(1)                    O(n² · d)

d = model dimension, n = sequence length, k = convolutional kernel size

SLIDE 61

Classification and Labeling

SLIDE 62

Classification and Labeling

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 63

Sequence Classification

  • tasks like sentiment analysis or genre classification
  • need to get one vector from the sequence → average or max pooling
  • optionally hidden layers, and at the end a softmax for a probability distribution over classes

SLIDE 64

Softmax & Cross-Entropy

Output layer with softmax (with parameters W, b):

P(y = k | x) = softmax(x)_k = exp(x⊤W + b)_k / ∑_j exp(x⊤W + b)_j

Network error = cross-entropy between the estimated distribution and the one-hot ground-truth distribution T = 1(y*):

L(P, y*) = H(T, P) = −E_{j∼T} log P(j) = −∑_j T(j) log P(j) = −log P(y*)

SLIDE 65

Derivative of Cross-Entropy

Let l = x⊤W + b be the logits, with l_{y*} corresponding to the correct class.

∂L(P, y*) / ∂l = −∂/∂l log ( exp l_{y*} / ∑_k exp l_k )
               = −∂/∂l ( l_{y*} − log ∑_k exp l_k )
               = −1_{y*} + exp l / ∑_k exp l_k
               = P − 1_{y*}

Interpretation: reinforce the correct logit, suppress the rest.
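A quick numerical check of this result (an illustrative sketch; the finite-difference step is arbitrary):

import numpy as np

def softmax(l):
    e = np.exp(l - l.max())
    return e / e.sum()

def loss(l, y_true):
    return -np.log(softmax(l)[y_true])

l, y_true = np.array([2.0, -1.0, 0.5]), 0
analytic = softmax(l) - np.eye(3)[y_true]          # P - 1_{y*}
numeric = np.array([(loss(l + eps, y_true) - loss(l, y_true)) / 1e-6
                    for eps in np.eye(3) * 1e-6])  # finite differences
print(np.allclose(analytic, numeric, atol=1e-4))   # True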

SLIDE 66

Sequence Labeling

  • assign value / probability distribution to every token in a sequence
  • morphological tagging, named-entity recognition, LM with unlimited history, answer span selection
  • every state is classified independently with a classifier
  • during training, the error back-propagates from all the classifiers

Lab next time: i/y spelling as sequence labeling

SLIDE 67

Generating Sequences

SLIDE 68

Sequence-to-sequence Learning

  • the target sequence has a different length than the source
  • non-trivial (= not monotonic) correspondence between source and target
  • tasks like machine translation, text summarization, image captioning

SLIDE 69

Neural Language Model

[Figure: at each step the input symbol (a one-hot vector) goes through an embedding lookup, an RNN cell (possibly more layers), and a softmax normalization that gives the distribution for the next symbol: <s> → embed → RNN → h_0 → softmax → P(x_1 | <s>), then x_1 → … → P(x_2 | …), and so on]

  • estimate the probability of a sentence using the chain rule
  • the output distributions can be used for sampling

SLIDE 70

Sampling from a LM

[Figure: greedy decoding: at each step the argmax of the softmax distribution is embedded and fed back as the next input, starting from <s>]

when conditioned on input → autoregressive decoder

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Canada, December 2014. Curran Associates, Inc

SLIDE 71

Autoregressive Decoding: Pseudocode

last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state, dec_output = dec_cell(state, last_w_embedding)
    logits = output_projection(dec_output)
    last_w = np.argmax(logits)
    yield last_w

SLIDE 72

Architectures in the Decoder

  • RNN – original sequence-to-sequence learning (2015)
  • principle known since 2014 (University of Montreal)
  • made usable in 2016 (University of Edinburgh)
  • CNN – convolutional sequence-to-sequence by Facebook (2017)
  • Self-attention (so called Transformer) by Google (2017)

More on the topic in the MT class.

SLIDE 73

Implementation: Runtime vs. training

runtime: the decoded symbols ŷ_k are fed back as the next inputs

training: the ground-truth symbols y_k are fed as inputs and the loss is computed between the decoder outputs and the ground truth

[Figure: the decoder unrolled at runtime (feeding back its own predictions) vs. at training time (ground-truth inputs, outputs entering the loss)]

SLIDE 74

Attention Model

[Figure: encoder states h_0 … h_4 over the inputs <s>, x_1 … x_4 are weighted by attention weights α_0 … α_4 and summed; the result is combined with the decoder states s_{i−1}, s_i, s_{i+1} to produce the outputs ỹ_i, ỹ_{i+1}]

SLIDE 75

Attention Model in Equations (1)

Inputs: decoder state s_i, encoder states h_j = [h→_j; h←_j] for j = 1 … T_x

Attention energies: e_ij = v_a⊤ tanh (W_a s_{i−1} + U_a h_j + b_a)

Attention distribution: α_ij = exp(e_ij) / ∑_{k=1}^{T_x} exp(e_ik)

Context vector: c_i = ∑_{j=1}^{T_x} α_ij h_j

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. ISSN 2331-8422
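A NumPy sketch of one such attention step (matrix names mirror the formulas above; everything else, including the shapes, is illustrative):

import numpy as np

def attention_step(s_prev, H, W_a, U_a, b_a, v_a):
    # attention energies e_ij for all encoder positions j
    energies = np.tanh(s_prev @ W_a + H @ U_a + b_a) @ v_a
    alphas = np.exp(energies) / np.exp(energies).sum()   # attention distribution
    return alphas @ H                                    # context vector c_i

T_x, d_h, d_s, d_a = 5, 6, 4, 8
H = np.random.randn(T_x, d_h)      # encoder states h_j
s_prev = np.random.randn(d_s)      # previous decoder state
c = attention_step(s_prev, H,
                   np.random.randn(d_s, d_a), np.random.randn(d_h, d_a),
                   np.zeros(d_a), np.random.randn(d_a))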

SLIDE 76

Attention Model in Equations (2)

Output projection: t_i = MLP (U_o s_{i−1} + V_o E y_{i−1} + C_o c_i + b_o)

…the attention is mixed with the hidden state

Output distribution: p (y_i = k | s_i, y_{i−1}, c_i) ∝ exp ((W_o t_i)_k + b_k)

SLIDE 77

Transformer Decoder

[Figure: one decoder layer: input embeddings ⊕ position encoding enter a self-attentive sublayer, then a cross-attention sublayer whose keys & values come from the encoder and whose queries come from the decoder, then a feed-forward sublayer; each sublayer is followed by ⊕ residual connection and layer normalization; the stack of N layers is followed by a linear layer and a softmax]

  • output symbol probabilities
  • similar to the encoder, with an additional layer that attends to the encoder
  • at every step, self-attention over the complete history ⇒ O(n²) complexity

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc

SLIDE 78

Transformer Decoder: Non-autoregressive Training

[Figure: the self-attention matrix between queries q_1 … q_N and values v_1 … v_M, with the cells above the diagonal set to −∞ so that no position can attend to the future]

  • analogous to the encoder
  • the target is known at training time: no need to wait until it is generated
  • self-attention can be parallelized via matrix multiplication
  • attending to the future is prevented by a mask (see the sketch below)

Question 1: What if the matrix was diagonal?
Question 2: What would such a matrix look like for a convolutional architecture?
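A sketch of such a masked self-attention matrix (illustrative; −∞ is realized as a large negative number before the softmax):

import numpy as np

def causal_attention_weights(Q, K):
    # scaled dot-product scores with the upper triangle masked out
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)   # positions after the query
    scores = np.where(future, -1e9, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q = K = np.random.randn(5, 8)
print(causal_attention_weights(Q, K).round(2))   # lower-triangular rows summing to 1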

SLIDE 79

Pre-training Representations

SLIDE 80

Pre-training Representations

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 81

Pre-trained Representations

  • representations that emerge in the models seem to carry a lot of information about the language
  • representations pre-trained on large data can be re-used on tasks with smaller training data

SLIDE 82

Pre-training Representations

Word2Vec

SLIDE 83

Word2Vec

  • a way to learn word embeddings without training the complete LM

[Figure: CBOW predicts the middle word x_3 from the sum of the surrounding words x_1, x_2, x_4, x_5; skip-gram predicts the surrounding words from the middle word]

  • CBOW: minimize the cross-entropy of the middle word of a sliding window
  • skip-gram: minimize the cross-entropy of a bag of words around a word (LM the other way round)

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics

SLIDE 84

Word2Vec: sampling

1. All human beings are born free and equal in dignity … → (All, human), (All, beings)
2. All human beings are born free and equal in dignity … → (human, All), (human, beings), (human, are)
3. All human beings are born free and equal in dignity … → (beings, All), (beings, human), (beings, are), (beings, born)
4. All human beings are born free and equal in dignity … → (are, human), (are, beings), (are, born), (are, free)
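A small sketch of how such training pairs can be generated with a sliding window (function name and window size are illustrative):

def skipgram_pairs(tokens, window=2):
    # yield (center, context) pairs for every word and its neighbours in the window
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "All human beings are born free and equal in dignity".split()
print(list(skipgram_pairs(sentence))[:5])
# [('All', 'human'), ('All', 'beings'), ('human', 'All'), ('human', 'beings'), ('human', 'are')]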

SLIDE 85

Word2Vec: Formulas

  • Training objective:

(1/T) ∑_{t=1}^{T} ∑_{j ∼ (−c, c)} log p(x_{t+j} | x_t)

  • Probability estimation:

p(x_O | x_I) = exp (W′_{x_O}⊤ W_{x_I}) / ∑_x exp (W′_x⊤ W_{x_I})

where W is the input (embedding) matrix and W′ the output matrix

Equations 1, 2. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics

SLIDE 86

Word2Vec: Training using Negative Sampling

The summation in the denominator is slow; use noise contrastive estimation instead:

log σ (W′_{x_O}⊤ W_{x_I}) + ∑_{i=1}^{k} E_{x_i ∼ P_n(x)} [ log σ (−W′_{x_i}⊤ W_{x_I}) ]

Main idea: classify independently, by logistic regression, the positive example and a few sampled negative examples.

Equations 1, 3. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics
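A NumPy sketch of this objective for one positive pair (a hypothetical minimal version; W and W_out stand for the input and output embedding matrices from the previous slide):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(W, W_out, center, context, negatives):
    # negative of the objective: push the true pair's score up,
    # push the scores of a few sampled negative words down
    pos = np.log(sigmoid(W_out[context] @ W[center]))
    neg = sum(np.log(sigmoid(-W_out[n] @ W[center])) for n in negatives)
    return -(pos + neg)

vocab, dim = 1000, 100
W = np.random.randn(vocab, dim) * 0.01
W_out = np.random.randn(vocab, dim) * 0.01
print(neg_sampling_loss(W, W_out, center=3, context=17, negatives=[42, 7, 512]))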

SLIDE 87

Word2Vec: Vector Arithmetics

[Figure: word-vector offsets connect the pairs man–woman, uncle–aunt, king–queen, and the singular–plural pair kings–queens]

Image originally from Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics

SLIDE 88

Few More Notes on Embeddings

  • many methods for pre-trained word embeddings (the most popular being GloVe)
  • embeddings capturing character-level properties
  • multilingual embeddings

SLIDE 89

Training models

FastText – Word2Vec model implementation by Facebook https://github.com/facebookresearch/fastText

./fasttext skipgram -input data.txt -output model

SLIDE 90

Pre-training Representations

ELMo

SLIDE 91

What is ELMo?

  • pre-trained large language model
  • “nothing special” – combines all known tricks, trained on extremely large data

  • improves almost all NLP tasks
  • published in June 2018

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N18-1202

SLIDE 92

ELMo Architecture: Input

character embeddings of size 16

1D-convolution to 2,048 dimensions + max-pooling:

window width   filters
1              32
2              32
3              64
4              128
5              256
6              512
7              1024

2× highway layer (2,048 dimensions), linear projection to 512 dimensions

  • input is tokenized, but treated on the character level
  • 2,048 n-gram filters + max-pooling (∼ a soft search for learned n-grams)
  • 2 highway layers:

g_{l+1} = σ (W_g h_l + b_g)
h_{l+1} = (1 − g_{l+1}) ⊙ h_l + g_{l+1} ⊙ ReLU (W h_l + b)

contain gates that control whether a projection is needed
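A minimal NumPy sketch of one such highway layer (illustrative names, following the two equations above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(h, W_g, b_g, W, b):
    # gated mix of the unchanged input and its ReLU projection
    g = sigmoid(W_g @ h + b_g)
    return (1 - g) * h + g * np.maximum(0, W @ h + b)

dim = 16
h = np.random.randn(dim)
out = highway_layer(h, np.random.randn(dim, dim), np.zeros(dim),
                    np.random.randn(dim, dim), np.zeros(dim))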

SLIDE 93

ELMo Architecture: Language Models

  • the token representations are the input for 2 language models: forward and backward
  • both LMs have 2 layers with 4,096 dimensions, with layer normalization and residual connections
  • the output classifier is shared (only used in training, does not have to be good)

Learned layer combination for downstream tasks:

ELMo_k^task = γ^task ∑_{layers l} s_l^task h_{k,l}

γ^task, s_l^task are trainable parameters.
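A sketch of this task-specific layer combination for one token (illustrative names; the per-layer weights s^task are typically softmax-normalized):

import numpy as np

def elmo_combination(layer_states, s_task, gamma_task):
    # layer_states: [num_layers, dim]; s_task: raw per-layer weights
    weights = np.exp(s_task) / np.exp(s_task).sum()              # normalized s^task
    return gamma_task * (weights[:, None] * layer_states).sum(axis=0)

layers = np.random.randn(3, 1024)   # e.g. token layer + 2 LM layers
print(elmo_combination(layers, s_task=np.zeros(3), gamma_task=1.0).shape)  # (1024,)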

SLIDE 94

Tasks where ELMo helps

Answer Span Selection: find an answer to a question in an unstructured text.
Semantic Role Labeling: detect who did what to whom in sentences.
Natural Language Inference: decide whether two sentences are in agreement, contradict each other, or have nothing to do with each other.
Named Entity Recognition: detect and classify names of people, locations, organizations, numbers with units, email addresses, URLs, phone numbers, …
Coreference Resolution: detect what entities pronouns refer to.
Semantic Similarity: measure how similar in meaning two sentences are. (Think of clustering similar questions on StackOverflow or detecting plagiarism.)

SLIDE 95

Improvements by ELMo

SLIDE 96

How to use it

  • implemented in the AllenNLP framework (uses PyTorch)
  • pre-trained English models available

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = ...
weight_file = ...
elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)

https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md

SLIDE 97

Pre-training Representations

BERT

SLIDE 98

What is BERT

  • another way of pre-training sentence representations
  • uses the Transformer architecture and a slightly different training objective
  • even better than ELMo
  • done by Google, published in November 2018

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints, October 2018

SLIDE 99

Architecture Comparison

SLIDE 100

Masked Language Model

Example: All human beings are born free and equal in dignity and rights → the sampled word free may stay as free, be replaced with MASK, or be replaced with a random word such as hairy.

1. Randomly sample a word → free
2. With 80% chance, replace it with the special MASK token.
3. With 10% chance, replace it with a random token → hairy
4. With 10% chance, keep it as it is → free

Then a classifier should predict the missing/replaced word: free
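A sketch of this corruption step for one sampled position (probabilities as listed above; the vocabulary and helper name are illustrative):

import random

def corrupt_token(token, vocab, p_mask=0.8, p_random=0.1):
    # return the input token for a position that was selected for prediction
    r = random.random()
    if r < p_mask:
        return "[MASK]"                  # 80%: replace with the MASK token
    if r < p_mask + p_random:
        return random.choice(vocab)      # 10%: replace with a random token
    return token                         # 10%: keep the token as it is

vocab = ["free", "hairy", "born", "equal", "dignity"]
print([corrupt_token("free", vocab) for _ in range(5)])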

SLIDE 101

Additional Objective: Next Sentence Prediction

  • trained in a multi-task learning setup
  • secondary objective: next sentence prediction
  • decide for a pair of consecutive sentences whether they follow each other

SLIDE 102

Performance of BERT

Tables 1 and 2. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints, October 2018

SLIDE 103

Deep Learning for Natural Language Processing

Summary

1. Discrete symbols → continuous representations with trained embeddings
2. Architectures to get a suitable representation: recurrent, convolutional, self-attentive
3. Output: classification, sequence labeling, autoregressive decoding
4. Representations pre-trained on large data help on downstream tasks

http://ufal.mff.cuni.cz/~zabokrtsky/fel