Deep Learning for Natural Language Processing
Jindřich Libovický
March 1, 2017
Introduction to Natural Language Processing
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Neural Networks Basics
Representing Words
Representing Sequences: Recurrent Networks, Convolutional Networks, Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations: Word2Vec, ELMo, BERT
(Diagram: instead of hand-designed features, deep learning learns a suitable representation for our problem directly, trained end-to-end with back-propagation.)
(Diagram of a single neuron: inputs $x$ are multiplied by weights $w$, summed, and passed through an activation function, here testing whether $\sum_i w_i x_i > 0$.)
Multi-layer network, forward pass:
$$h_1 = g(W_1 x + b_1)$$
$$h_2 = g(W_2 h_1 + b_2)$$
$$\vdots$$
$$h_n = g(W_n h_{n-1} + b_n)$$
$$o = W_o h_n + b_o$$
Loss $L = f(o, y)$; the backward pass follows the chain rule, e.g.:
$$\frac{\partial L}{\partial W_o} = \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial W_o}$$
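To make the forward and backward pass concrete, here is a minimal numpy sketch (the shapes, the squared-error loss, and the variable names are illustrative assumptions, not from the slides):

import numpy as np

# forward pass of a small two-layer network
x = np.random.randn(4)                        # input
W1, b1 = np.random.randn(8, 4), np.zeros(8)   # first layer
Wo, bo = np.random.randn(3, 8), np.zeros(3)   # output layer

h1 = np.tanh(W1 @ x + b1)
o = Wo @ h1 + bo

# backward pass for the output layer by the chain rule,
# assuming L = 0.5 * ||o - y||^2
y = np.array([1.0, 0.0, 0.0])
dL_do = o - y                  # dL/do
dL_dWo = np.outer(dL_do, h1)   # dL/dWo = dL/do . do/dWo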
Logistic regression: $z = \sigma(Wx + b)$ (1)
Computation graph: the forward graph computes $z$ and the loss; the backward graph runs through the mirrored operations ($\sigma'$, $+$, $\times$) to produce the gradients of $W$ and $b$.
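In PyTorch, for instance, the backward graph is built automatically by autograd; a small sketch of the same logistic regression (names and dimensions are assumptions):

import torch

W = torch.randn(5, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
x = torch.randn(5)
target = torch.tensor([1.0])

z = torch.sigmoid(W @ x + b)        # forward graph
loss = torch.nn.functional.binary_cross_entropy(z, target)
loss.backward()                     # backward graph: gradients for W and b
print(W.grad, b.grad)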
Frameworks for research and prototyping in Python:
TensorFlow: symbolic computation; the graph is declared ahead of time and compiled, so it can be shipped as a binary.
PyTorch: normal procedural code; intermediate values can be inspected at any time of the computation.
Language modeling: estimate
$$P(x_t \mid x_{t-1}, x_{t-2}, \dots, x_1) \approx P(x_t \mid x_{t-1}, \dots, x_{t-n}) \approx \sum_{k=0}^{n} \lambda_k \hat{P}(x_t \mid x_{t-1}, \dots, x_{t-k}) \approx \dots \approx G(x_{t-1}, \dots, x_{t-n} \mid \theta)$$
where $\theta$ is a set of trainable parameters.
(Diagram of the feed-forward language model: the three previous words are embedded with a shared matrix, concatenated, passed through a tanh hidden layer, and a softmax output layer estimates $P(x_t \mid x_{t-1}, x_{t-2}, x_{t-3})$.)
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3 (Feb):1137–1155, 2003. ISSN 1532-4435
so-called word embeddings
The first hidden layer is then the concatenation of the embedded words: $h_1 = E x_{t-n} \oplus E x_{t-n+1} \oplus \dots \oplus E x_{t-1}$. The embedding matrix $E$ is shared for all words.
$$h_2 = g(h_1 W_1 + b_1)$$
$$z = \mathrm{softmax}(h_2 W_2 + b_2) = \frac{\exp(h_2 W_2 + b_2)}{\sum \exp(h_2 W_2 + b_2)}$$
The loss is the cross-entropy between the estimated distribution $z$ and the true one:
$$L = -\sum_t q_{\text{true}}(x_t) \log z(x_t) = \sum_t -\log z(x_t)$$
Table taken from Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011. ISSN 1533-7928
model with less data
import torch
import torch.nn as nn

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden_layer = nn.Linear(3 * embedding_dim, hidden_dim)
        self.output_layer = nn.Linear(hidden_dim, vocab_size)
        self.loss_function = nn.CrossEntropyLoss()

    def forward(self, word_1, word_2, word_3, target=None):
        # look up the embeddings of the three context words
        embedded_1 = self.embedding(word_1)
        embedded_2 = self.embedding(word_2)
        embedded_3 = self.embedding(word_3)
        # concatenate them and apply the non-linear hidden layer
        hidden = torch.tanh(self.hidden_layer(
            torch.cat([embedded_1, embedded_2, embedded_3], dim=1)))
        logits = self.output_layer(hidden)
        loss = None
        if target is not None:
            loss = self.loss_function(logits, target)
        return logits, loss
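A possible training step with this model (the hyperparameters, optimizer choice, and the random batch are illustrative assumptions):

model = LanguageModel(vocab_size=10000, embedding_dim=100, hidden_dim=256)
optimizer = torch.optim.Adam(model.parameters())

# a random batch of word indices, just for illustration
w1, w2, w3, tgt = (torch.randint(0, 10000, (32,)) for _ in range(4))

logits, loss = model(w1, w2, w3, target=tgt)
optimizer.zero_grad()
loss.backward()    # back-propagation
optimizer.step()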
import tensorflow as tf

input_words = [tf.placeholder(tf.int32, shape=[None])
               for _ in range(3)]
target_word = tf.placeholder(tf.int32, shape=[None])

embeddings = tf.get_variable(
    "embeddings", shape=[vocab_size, emb_dim], dtype=tf.float32)
embedded_words = tf.concat(
    [tf.nn.embedding_lookup(embeddings, w) for w in input_words], axis=1)
hidden_layer = tf.layers.dense(embedded_words, hidden_size,
                               activation=tf.tanh)
output_layer = tf.layers.dense(hidden_layer, vocab_size)  # missing on the slide

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target_word, logits=output_layer)
optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss)
session = tf.Session()
session.run(tf.global_variables_initializer())  # initialize variables

Training given a batch:

_, loss_value = session.run(
    [train_op, loss],
    feed_dict={input_words[0]: ...,
               input_words[1]: ...,
               input_words[2]: ...,
               target_word: ...})

Inference given a batch:

# output_probabilities = tf.nn.softmax(output_layer)
probs = session.run(
    output_probabilities,
    feed_dict={input_words[0]: ...,
               input_words[1]: ...,
               input_words[2]: ...})
Representing Sequences
…the default choice for sequence labeling
(Diagram: the same computation with the same trainable parameters is applied at every position of the sequence.)
def rnn(initial_state, inputs):
    prev_state = initial_state
    for x in inputs:
        # the same cell (same parameters) is applied at every step
        new_state, output = rnn_cell(x, prev_state)
        prev_state = new_state
        yield output
$$h_t = \tanh\left(W [h_{t-1}; x_t] + b\right)$$
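The rnn_cell used in the pseudocode above could then look as follows (a numpy sketch of exactly this equation; passing W and b explicitly is an assumption of the sketch):

import numpy as np

def rnn_cell(x, prev_state, W, b):
    # h_t = tanh(W [h_{t-1}; x_t] + b)
    new_state = np.tanh(W @ np.concatenate([prev_state, x]) + b)
    return new_state, new_state   # the output of a vanilla RNN is its state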
$$\tanh x = \frac{1 - e^{-2x}}{1 + e^{-2x}}$$
(plot of $\tanh$)
$$\frac{\mathrm{d} \tanh x}{\mathrm{d} x} = 1 - \tanh^2 x \in (0, 1]$$
(plot of the derivative)
Weights initialized $\sim \mathcal{N}(0, 1)$ to have gradients further from zero.
$$\frac{\partial L_{t+1}}{\partial b} = \frac{\partial L_{t+1}}{\partial h_{t+1}} \cdot \frac{\partial h_{t+1}}{\partial b} \quad \text{(chain rule)}$$
$$\frac{\partial h_t}{\partial b} = \frac{\partial \tanh\big(\overbrace{W_h h_{t-1} + W_x x_t + b}^{=A_t \text{ (activation)}}\big)}{\partial b} = \tanh'(A_t) \cdot \Bigg( \frac{\partial W_h h_{t-1}}{\partial b} + \underbrace{\frac{\partial W_x x_t}{\partial b}}_{=0} + \underbrace{\frac{\partial b}{\partial b}}_{=1} \Bigg) = \underbrace{W_h}_{\sim \mathcal{N}(0,1)} \underbrace{\tanh'(A_t)}_{\in (0, 1]} \frac{\partial h_{t-1}}{\partial b} + \tanh'(A_t)$$
($\tanh'$ is the derivative of $\tanh$.) Unrolling the recurrence multiplies the gradient by a factor in $(0, 1]$ at every step, so it vanishes over long sequences.
LSTM = Long short-term memory
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. ISSN 0899-7667
Control the gradient flow by explicit gating:
Forget gate: $f_t = \sigma\left(W_f [h_{t-1}; x_t] + b_f\right)$
Input gate and candidate memory:
$$i_t = \sigma\left(W_i [h_{t-1}; x_t] + b_i\right)$$
$$\tilde{C}_t = \tanh\left(W_C [h_{t-1}; x_t] + b_C\right)$$
$\tilde{C}_t$ is a candidate for what we may want to add to the memory.
Memory cell update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
Output gate and hidden state:
$$o_t = \sigma\left(W_o [h_{t-1}; x_t] + b_o\right)$$
$$h_t = o_t \odot \tanh C_t$$
$$f_t = \sigma\left(W_f [h_{t-1}; x_t] + b_f\right)$$
$$i_t = \sigma\left(W_i [h_{t-1}; x_t] + b_i\right)$$
$$o_t = \sigma\left(W_o [h_{t-1}; x_t] + b_o\right)$$
$$\tilde{C}_t = \tanh\left(W_C [h_{t-1}; x_t] + b_C\right)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh C_t$$
Question: How would you implement it efficiently? Compute all gates in a single matrix multiplication, as in the sketch below.
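A minimal sketch of that trick (the shapes and the ordering of the gates inside the big matrix are assumptions of the sketch):

import torch

def lstm_cell(x_t, h_prev, c_prev, W, b):
    # W: [hidden + input, 4 * hidden]; one matrix multiplication yields the
    # pre-activations of all three gates and the candidate at once
    z = torch.cat([h_prev, x_t], dim=1) @ W + b
    f, i, o, c_tilde = z.chunk(4, dim=1)
    c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(c_tilde)
    h_t = torch.sigmoid(o) * torch.tanh(c_t)
    return h_t, c_t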
update gate: $z_t = \sigma(x_t W_z + h_{t-1} U_z + b_z) \in (0, 1)$
remember gate: $r_t = \sigma(x_t W_r + h_{t-1} U_r + b_r) \in (0, 1)$
candidate hidden state: $\tilde{h}_t = \tanh\left(x_t W_h + (r_t \odot h_{t-1}) U_h\right) \in (-1, 1)$
hidden state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ISSN 2331-8422
Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P18-2117
rnn = nn.LSTM(input_size=input_dim, hidden_size=512,
              num_layers=1, bidirectional=True, dropout=0.8)
https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM
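Calling the module might then look like this (shapes are assumptions; note that PyTorch's dropout argument only takes effect between layers, i.e., when num_layers > 1):

seq_len, batch = 20, 8
inputs = torch.randn(seq_len, batch, input_dim)   # time-major by default
outputs, (h_n, c_n) = rnn(inputs)
# outputs: [seq_len, batch, 2 * 512], both directions concatenated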
inputs = ...   # float tf.Tensor of shape [batch, length, dim]
lengths = ...  # int tf.Tensor of shape [batch]

# Cell objects are templates
fw_cell = tf.nn.rnn_cell.LSTMCell(512, name="fw_cell")
bw_cell = tf.nn.rnn_cell.LSTMCell(512, name="bw_cell")

outputs, states = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, inputs, sequence_length=lengths,
    dtype=tf.float32)
https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn
Image from: http://colah.github.io/posts/2015-09-NN-Types-FP/
Representing Sequences
≈ a sliding window over the sequence
Embeddings $\mathbf{x} = (x_1, \dots, x_N)$, padded with $x_0 = x_{N+1} = \vec{0}$:
$$h_1 = g\left(W [x_0; x_1; x_2] + b\right)$$
$$h_j = g\left(W [x_{j-1}; x_j; x_{j+1}] + b\right)$$
Pad with 0s if we want to keep the sequence length.
import numpy as np

def conv1d(xs):           # xs: input sequence, shape [length, dim]
    kernel_size = 3       # window size
    filters = 300         # output dimensions
    stride = 1            # step size
    W = trained_parameter(xs.shape[1] * kernel_size, filters)
    b = trained_parameter(filters)
    window = kernel_size // 2

    outputs = []
    for i in range(window, xs.shape[0] - window, stride):
        # flatten the window and apply the shared projection
        x_window = xs[i - window:i + window + 1].reshape(-1)
        outputs.append(x_window @ W + b)
    return np.array(outputs)
TensorFlow
h = tf.layers.conv1d(x, filters=300, kernel_size=3,
                     strides=1, padding='same')
https://www.tensorflow.org/api_docs/python/tf/layers/conv1d
PyTorch
conv = nn.Conv1d(in_channels, out_channels=300, kernel_size=3,
                 stride=1, padding=0, dilation=1, groups=1, bias=True)
h = conv(x)  # x has shape [batch, channels, length]
https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d
ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
(plot of ReLU)
Derivative of ReLU: $1$ for $x > 0$, $0$ otherwise
(plot of the derivative)
Faster, suffers less from vanishing gradients.
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010
Embeddings $\mathbf{x} = (x_1, \dots, x_N)$, padded with $x_0 = x_{N+1} = \vec{0}$
(Diagram: each convolution output is summed ⊕ with its input, a residual connection.)
$$h_j = g\left(W [x_{j-1}; x_j; x_{j+1}] + b\right) + x_j$$
Allows training deeper networks. Why do you think it helps? Better gradient flow, the same as in RNNs.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. IEEE Computer Society
Numerically unstable; we need activations to stay on a similar scale ⇒ layer normalization. The activation before the non-linearity is normalized:
$$\bar{a}_j = \frac{g_j}{\sigma}\left(a_j - \mu\right)$$
where the gain $g$ is a trainable parameter and $\mu$, $\sigma$ are estimated from the data:
$$\mu = \frac{1}{H} \sum_{i=1}^{H} a_i \qquad \sigma = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (a_i - \mu)^2}$$
Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. ISSN 2331-8422
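A minimal sketch of the normalization itself (normalizing over the last dimension; the epsilon for numerical stability is an added assumption):

import torch

def layer_norm(a, gain, eps=1e-6):
    mu = a.mean(dim=-1, keepdim=True)
    sigma = a.std(dim=-1, unbiased=False, keepdim=True)
    return gain * (a - mu) / (sigma + eps)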
(Diagram: embeddings $\mathbf{x} = (x_1, \dots, x_N)$ with zero padding; stacking convolutional layers grows the receptive field.) It can be enlarged further by dilated convolutions.
Representing Sequences
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc
Single-head setup:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$
In self-attention, queries, keys, and values all come from the same states:
$$h^{(l+1)} = \mathrm{softmax}\left(\frac{h^{(l)} {h^{(l)}}^\top}{\sqrt{d}}\right) h^{(l)}$$
Multi-head setup:
$$\mathrm{Multihead}(Q, V) = (H_1 \oplus \dots \oplus H_h) W^O \qquad H_i = \mathrm{Attn}\left(Q W_i^Q, V W_i^K, V W_i^V\right)$$
(Diagram: queries and keys & values pass through per-head linear projections, are split into heads, each head runs scaled dot-product attention, and the results are concatenated.)
import math

import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
        / math.sqrt(d_k)
    if mask is not None:
        # the mask parameter was unused on the slide; masked positions
        # get a large negative score before the softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn
def scaled_dot_product(self, queries, keys, values):
    # the body was elided on the slide; a sketch of the missing steps:
    o1 = tf.matmul(queries, keys, transpose_b=True)
    o2 = o1 / tf.sqrt(tf.cast(tf.shape(keys)[-1], tf.float32))
    o3 = tf.nn.softmax(o2)
    return tf.matmul(o3, values)
The model cannot otherwise be aware of the position in the sequence, so a position encoding is added:
$$\mathrm{pos}(t)_i = \begin{cases} \sin\left(t \, / \, 10000^{i/d}\right) & \text{if } i \bmod 2 = 0 \\ \cos\left(t \, / \, 10000^{(i-1)/d}\right) & \text{otherwise} \end{cases}$$
(Heatmap of the encodings: text length × dimension, values in $[-1, 1]$.)
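The encoding can be computed in a few lines of numpy (a sketch following the formula above):

import numpy as np

def positional_encoding(length, dim):
    t = np.arange(length)[:, None]    # positions
    i = np.arange(dim)[None, :]       # dimensions
    angle = t / np.power(10000, (i - i % 2) / dim)
    # even dimensions use sine, odd dimensions use cosine
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))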
Input embeddings ⊕ position encoding, followed by $N\times$ stacked blocks:
• self-attentive sublayer: multihead attention (queries, keys & values from the previous layer), residual connection ⊕, layer normalization
• feed-forward sublayer: non-linear layer, linear layer, residual connection ⊕, layer normalization
                computation   sequential operations   memory
Recurrent       O(n⋅d²)       O(n)                    O(n⋅d)
Convolutional   O(k⋅n⋅d²)     O(1)                    O(n⋅d)
Self-attentive  O(n²⋅d)       O(1)                    O(n²⋅d)

d = model dimension, n = sequence length, k = convolutional kernel size
Output layer with softmax (with parameters $W$, $b$):
$$P(y = k \mid \mathbf{x}) = \mathrm{softmax}(\mathbf{x})_k = \frac{\exp(\mathbf{x}^\top W + b)_k}{\sum_j \exp(\mathbf{x}^\top W + b)_j}$$
Network error = cross-entropy between the estimated distribution $P$ and the one-hot ground-truth distribution $T = \mathbb{1}(y^*)$:
$$L(P, y^*) = H(P, T) = -\mathbb{E}_{i \sim T} \log P(i) = -\sum_i T(i) \log P(i) = -\log P(y^*)$$
Let $l = \mathbf{x}^\top W + b$ be the logits, with $l_{y^*}$ corresponding to the correct class.
$$\frac{\partial L(P, y^*)}{\partial l} = -\frac{\partial}{\partial l} \log \frac{\exp l_{y^*}}{\sum_k \exp l_k} = -\frac{\partial}{\partial l}\Big(l_{y^*} - \log \sum \exp l\Big) = -\mathbb{1}_{y^*} + \frac{\exp l}{\sum \exp l} = P - \mathbb{1}_{y^*}$$
Interpretation: gradient descent reinforces the correct logit and suppresses the rest.
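The result $P - \mathbb{1}_{y^*}$ can be checked numerically with autograd (an illustrative sketch):

import torch
import torch.nn.functional as F

logits = torch.randn(5, requires_grad=True)
target = torch.tensor(2)
F.cross_entropy(logits[None], target[None]).backward()

expected = torch.softmax(logits.detach(), dim=0) - F.one_hot(target, 5).float()
print(torch.allclose(logits.grad, expected))   # True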
span selection
Lab next time: i/y spelling as sequence labeling
(Diagram: at every step, the input symbol goes through an embedding lookup, an RNN cell (possibly more layers) updates the RNN state, and a softmax normalization yields the distribution for the next symbol: <s> → embed → RNN → $h_0$ → softmax → $P(x_1 \mid \text{<s>})$; $x_1$ → embed → RNN → $h_1$ → softmax → $P(x_2 \mid \dots)$; …)
(Diagram: greedy decoding, where the argmax of each softmax distribution is fed back as the next input symbol until </s> is produced.)
when conditioned on input → autoregressive decoder
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Canada, December 2014. Curran Associates, Inc
last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state, dec_output = dec_cell(state, last_w_embedding)
    logits = output_projection(dec_output)
    last_w = np.argmax(logits)
    yield last_w
More on the topic in the MT class.
Runtime: $\hat{y}_j$ (decoded) × training: $y_j$ (ground truth).
(Diagram: during training the decoder is fed the ground-truth symbols $y_1, \dots, y_4$ rather than its own outputs $\tilde{y}_1, \dots, \tilde{y}_5$, and the loss compares its predictions with the ground truth.)
(Diagram: encoder states $h_0, \dots, h_4$ over inputs $x_1, \dots, x_4$ are weighted by attention weights $\alpha_0, \dots, \alpha_4$ and summed into a context vector that enters the decoder states $s_{i-1}, s_i, s_{i+1}$ when generating $\tilde{y}_i, \tilde{y}_{i+1}$.)
Inputs: decoder state $s_i$, encoder states $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$, $\forall j = 1 \dots T_x$.
Attention energies:
$$e_{ij} = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j + b_a\right)$$
Attention distribution:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$
Context vector:
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. ISSN 2331-8422
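The three equations fit into a few lines of numpy (a sketch; the parameter shapes are assumptions):

import numpy as np

def bahdanau_attention(s_prev, enc_states, v_a, W_a, U_a, b_a):
    # energies: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j + b_a)
    e = np.tanh(s_prev @ W_a.T + enc_states @ U_a.T + b_a) @ v_a
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()         # attention distribution
    return alpha @ enc_states    # context vector c_i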
Output projection:
$$t_i = \mathrm{MLP}\left(U_o s_{i-1} + V_o E y_{i-1} + C_o c_i + b_o\right)$$
…the attention is mixed with the hidden state. Output distribution:
$$p\left(y_i = k \mid s_i, y_{i-1}, c_i\right) \propto \exp\left(W_o t_i\right)_k + b_k$$
Input embeddings ⊕ position encoding, followed by $N\times$ stacked blocks:
• self-attentive sublayer: multihead attention (queries, keys & values from the decoder), ⊕, layer normalization
• cross-attention sublayer: multihead attention with keys & values from the encoder and queries from the decoder, ⊕, layer normalization
• feed-forward sublayer: non-linear layer, linear layer, ⊕, layer normalization
followed by a linear layer and softmax.
The decoder attends to its complete history ⇒ $O(n^2)$ complexity.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc
(Diagram: the attention matrix between queries $q_1, \dots, q_N$ and values $v_1, \dots, v_M$; entries for positions that are not yet generated are set to $-\infty$ by the mask before the matrix multiplication, i.e., the model must wait until they are generated.)
Question 1: What if the matrix was diagonal?
Question 2: What would such a matrix look like for a convolutional architecture?
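Such a mask is easy to construct, e.g., in PyTorch (a sketch, not the lecture's code):

import torch

def causal_mask(n):
    # -inf above the diagonal: position i may only attend to positions <= i
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

scores = torch.randn(6, 6) + causal_mask(6)
p_attn = torch.softmax(scores, dim=-1)   # zero weight on future positions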
Pre-training Representations
(Diagram: CBOW predicts the middle word $x_3$ from the sum of the context embeddings $x_1, x_2, x_4, x_5$; Skip-gram predicts the context words from the middle word.)
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics
1. All human beings are born free and equal in dignity … → (All, human), (All, beings)
2. All human beings are born free and equal in dignity … → (human, All), (human, beings), (human, are)
3. All human beings are born free and equal in dignity … → (beings, All), (beings, human), (beings, are), (beings, born)
4. All human beings are born free and equal in dignity … → (are, human), (are, beings), (are, born), (are, free)
Skip-gram training objective:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{j \sim (-c, c)} \log p\left(x_{t+j} \mid x_t\right)$$
$$p\left(x_O \mid x_I\right) = \frac{\exp\left({W'}_{x_O}^\top W_{x_I}\right)}{\sum_x \exp\left({W'}_x^\top W_{x_I}\right)}$$
where $W$ is the input (embedding) matrix and $W'$ the output matrix.
Equations 1, 2. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics
The summation in the denominator is slow; use noise contrastive estimation:
$$\log \sigma\left({W'}_{x_O}^\top W_{x_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{x_i \sim P_n(x)} \left[\log \sigma\left(-{W'}_{x_i}^\top W_{x_I}\right)\right]$$
Main idea: classify independently, by logistic regression, the positive example and a few sampled negative examples.
Equations 1, 3. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics
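A sketch of this loss for one training pair (embedding matrices as plain tensors; the indexing convention and sampling are assumptions of the sketch):

import torch

def negative_sampling_loss(W, W_out, center, context, neg_samples):
    # positive example: log sigma(W'_{x_O}^T W_{x_I})
    pos = torch.log(torch.sigmoid(W_out[context] @ W[center]))
    # k negative samples: log sigma(-W'_{x_j}^T W_{x_I})
    neg = torch.log(torch.sigmoid(-W_out[neg_samples] @ W[center])).sum()
    return -(pos + neg)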
(Figure: word-vector offsets capture relations, e.g., man → woman, uncle → aunt, king → queen for gender, and king → kings, queen → queens for number.)
Image originally from Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics
FastText – Word2Vec model implementation by Facebook https://github.com/facebookresearch/fastText
./fasttext skipgram -input data.txt -output model
Pre-training Representations
a combination of known tricks, trained on extremely large data
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N18-1202
Character-level convolutions (∼ a soft search for learned n-grams):
• character embeddings of size 16
• 1D convolution to 2,048 dimensions + max-pooling:
  window   1   2   3   4    5    6    7
  filters  32  32  64  128  256  512  1024
• 2× highway layer (2,048 dimensions)
• linear projection to 512 dimensions
Highway layers contain gates that control whether a projection is needed:
$$g_{m+1} = \sigma\left(W_g h_m + b_g\right)$$
$$h_{m+1} = (1 - g_{m+1}) \odot h_m + g_{m+1} \odot \mathrm{ReLU}\left(W h_m + b\right)$$
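The highway equations translate directly into code (a sketch; passing the parameters explicitly is an assumption made for clarity):

import torch

def highway(h, W_g, b_g, W, b):
    g = torch.sigmoid(h @ W_g + b_g)                   # gate
    return (1 - g) * h + g * torch.relu(h @ W + b)     # mix copy and projection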
Learned layer combination for downstream tasks:
$$\mathrm{ELMo}^{\text{task}}_k = \gamma^{\text{task}} \sum_{\text{layer } j} s^{\text{task}}_j h^{(j)}_k$$
$\gamma^{\text{task}}$ and $s^{\text{task}}_j$ are trainable parameters.
Answer Span Selection: find an answer to a question in an unstructured text.
Semantic Role Labeling: detect who did what to whom in sentences.
Natural Language Inference: decide whether two sentences are in agreement, contradict each other, or have nothing to do with each other.
Named Entity Recognition: detect and classify names of people, locations, organizations, numbers with units, email addresses, URLs, phone numbers …
Coreference Resolution: detect which entities pronouns refer to.
Semantic Similarity: measure how similar in meaning two sentences are. (Think of clustering similar questions on StackOverflow or detecting plagiarism.)
AllenNLP framework (uses PyTorch), pre-trained models available:

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = ...
weight_file = ...
elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)
https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md
Pre-training Representations
BERT: pre-trained representations with a slightly different training objective, published in November 2018.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints, October 2018
All human beings are born [free → MASK / a random word ("hairy") / kept ("free")] and equal in dignity and rights.
Then a classifier should predict the missing/replaced word: free.
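The word replacement can be sketched as follows (the 80/10/10 proportions follow the BERT paper; the function name is an assumption):

import random

def mask_word(word, vocab, p_mask=0.8, p_random=0.1):
    r = random.random()
    if r < p_mask:
        return "[MASK]"
    if r < p_mask + p_random:
        return random.choice(vocab)   # a random word, e.g., "hairy"
    return word                       # keep the original word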
Tables 1 and 2. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints, October 2018
Summary:
• word embeddings
• recurrent, convolutional, and self-attentive sequence representations
• sequence decoding
• pre-training representations for downstream tasks