SLIDE 1

Deep Learning for Natural Language Processing

Jindřich Libovický

March 1, 2017

Introduction to Natural Language Processing

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

slide-2
SLIDE 2

Outline

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 3

Deep Learning in NLP

  • NLP tasks are learned end-to-end using deep learning, the number-one approach in current research
  • State of the art in POS tagging, parsing, named-entity recognition, machine translation, …
  • Good news: training with almost no linguistic insight
  • Bad news: requires an enormous amount of training data and a lot of computational power

SLIDE 4

What is deep learning?

  • Buzzword for machine learning using neural networks with many layers, trained by back-propagation
  • Learning a real-valued function with millions of parameters that solves a particular problem
  • Learning more and more abstract representations of the input data until we reach a representation suitable for our problem

SLIDE 5

Neural Networks Basics

SLIDE 6

Neural Networks Basics

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 7

Single Neuron

[Figure: a single neuron with inputs x, weights w, and an activation function]

  • output: y = f(∑_i w_i · x_i), here with the activation "is the sum > 0?"
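A minimal NumPy sketch of such a neuron (the function and names are illustrative, not the lecture's code):

import numpy as np

def neuron(x, w, activation=lambda s: float(s > 0)):
    # weighted sum of the inputs followed by an activation function
    return activation(np.dot(w, x))

# example: two inputs, threshold activation "is the sum > 0?"
print(neuron(np.array([1.0, -0.5]), np.array([0.3, 0.8])))  # 0.0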

SLIDE 8

Neural Network

h_1 = g(W_1 x + b_1)
h_2 = g(W_2 h_1 + b_2)
⋮
h_n = g(W_n h_{n−1} + b_n)
o = h(W_o h_n + b_o)

F = f(o, y)   (the loss compares the output o with the target y)

Back-propagation via the chain rule, e.g.:

∂F/∂W_o = ∂F/∂o · ∂o/∂W_o

SLIDE 9

Implementation

Logistic regression: ŷ = σ(Wx + b)   (1)

Computation graph:

[Figure: forward graph (x and W enter ×, b enters +, σ gives h, then the loss against the target y*) and the corresponding backward graph (σ′, +, × producing the gradients of b and W)]

SLIDE 10

Frameworks for Deep Learning

research and prototyping in Python

TensorFlow:

  • graph statically constructed, symbolic computation
  • computation happens in a session
  • allows graph export and running as a binary

PyTorch:

  • computations written dynamically as normal procedural code
  • easy debugging: inspecting variables at any time of the computation

SLIDE 11

Representing Words

SLIDE 12

Representing Words

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 13

Language Modeling

  • estimate the probability of the next word in a text

P(x_j | x_{j−1}, x_{j−2}, …, x_1)

  • standard approach: n-gram models with the Markov assumption, interpolating relative frequencies

P(x_j | x_{j−1}, …, x_1) ≈ P(x_j | x_{j−1}, …, x_{j−n}) ≈ ∑_{k=0}^{n} λ_k · c(x_{j−k}, …, x_{j−1}, x_j) / c(x_{j−k}, …, x_{j−1})

  • Let's simulate it with a neural network:

… ≈ F(x_{j−1}, …, x_{j−n}; θ)   where θ is a set of trainable parameters.

SLIDE 14

Simple Neural Language Model

[Figure: one-hot vectors 1_{x_{n−3}}, 1_{x_{n−2}}, 1_{x_{n−1}} are multiplied by a shared embedding matrix, projected by weight matrices W_1, W_2, W_3, summed with a bias b_h, passed through tanh, and a final softmax layer (· W + b) outputs]

P(x_n | x_{n−1}, x_{n−2}, x_{n−3})

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3 (Feb):1137–1155, 2003. ISSN 1532-4435

SLIDE 15

Neural LM: Word Representation

  • limited vocabulary (hundreds of thousands of words): an indexed set of words
  • words are initially represented as one-hot vectors 1_x = (0, …, 0, 1, 0, …, 0)
  • the projection 1_x · W corresponds to selecting one row from the matrix W
  • W is a table of learned word vector representations, the so-called word embeddings
  • dimension typically 100–300

The first hidden layer is then: h_1 = W_{x_{j−n}} ⊕ W_{x_{j−n+1}} ⊕ … ⊕ W_{x_{j−1}}. The matrix W is shared for all words.

SLIDE 16

Neural LM: Next Word Estimation

  • optionally add an extra hidden layer:

h_2 = g(h_1 W_1 + b_1)

  • last layer: probability distribution over the vocabulary

ŷ = softmax(h_2 W_2 + b_2) = exp(h_2 W_2 + b_2) / ∑ exp(h_2 W_2 + b_2)

  • training objective: cross-entropy between the true (i.e., one-hot) distribution and the estimated distribution

F = − ∑_j q_true(x_j) log ŷ(x_j) = ∑_j − log ŷ(x_j)

  • learned by error back-propagation

SLIDE 17

Learned Representations

  • word embeddings from LMs have interesting properties
  • cluster according to POS & meaning similarity

Table taken from Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011. ISSN 1533-7928

  • in IR: query expansion by nearest neighbors
  • in deep learning models: embedding initialization speeds up training / allows a more complex model with less data

SLIDE 18

Implementation in PyTorch I

import torch
import torch.nn as nn


class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden_layer = nn.Linear(3 * embedding_dim, hidden_dim)
        self.output_layer = nn.Linear(hidden_dim, vocab_size)
        self.loss_function = nn.CrossEntropyLoss()

    def forward(self, word_1, word_2, word_3, target=None):
        embedded_1 = self.embedding(word_1)
        embedded_2 = self.embedding(word_2)
        embedded_3 = self.embedding(word_3)

SLIDE 19

Implementation in PyTorch II

        hidden = torch.tanh(self.hidden_layer(
            torch.cat([embedded_1, embedded_2, embedded_3], dim=1)))
        logits = self.output_layer(hidden)
        loss = None
        if target is not None:
            loss = self.loss_function(logits, target)
        return logits, loss

SLIDE 20

Implementation in TensorFlow I

import tensorflow as tf

input_words = [tf.placeholder(tf.int32, shape=[None]) for _ in range(3)]
target_word = tf.placeholder(tf.int32, shape=[None])

embeddings = tf.get_variable("embeddings", shape=[vocab_size, emb_dim], dtype=tf.float32)
embedded_words = tf.concat(
    [tf.nn.embedding_lookup(embeddings, w) for w in input_words], axis=1)

hidden_layer = tf.layers.dense(embedded_words, hidden_size, activation=tf.tanh)
output_layer = tf.layers.dense(hidden_layer, vocab_size, activation=None)
output_probabilities = tf.nn.softmax(output_layer)

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target_word, logits=output_layer)

optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss)

SLIDE 21

Implementation in TensorFlow II

session = tf.Session()
session.run(tf.global_variables_initializer())  # initialize variables

Training given batch

_, loss_value = session.run([train_op, loss], feed_dict={ input_words[0]: ..., input_words[1]: ..., input_words[2]: ..., target_word: ... })

Inference given batch

probs = session.run(output_probabilities, feed_dict={ input_words[0]: ..., input_words[1]: ..., input_words[2]: ..., })

SLIDE 22

Representing Sequences

SLIDE 23

Representing Sequences

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 24

Representing Sequences

Recurrent Networks

SLIDE 25

Recurrent Networks (RNNs)

…the default choice for sequence labeling

  • inputs: x_1, …, x_T
  • initial state h_0: a zero vector, the result of a previous computation, or a trainable parameter
  • recurrent computation: h_t = A(h_{t−1}, x_t)

SLIDE 26

RNN as Imperative Code

def rnn(initial_state, inputs):
    prev_state = initial_state
    for x in inputs:
        new_state, output = rnn_cell(x, prev_state)
        prev_state = new_state
        yield output

SLIDE 27

RNN as a Fancy Image

SLIDE 28

Vanilla RNN

h_t = tanh (W [h_{t−1}; x_t] + b)

  • cannot propagate long-distance relations
  • vanishing gradient problem

SLIDE 29

Vanishing Gradient Problem (1)

tanh x = (1 − e^{−2x}) / (1 + e^{−2x})

[Plot: tanh x for x from −6 to 6; values stay in (−1, 1)]

d tanh x / dx = 1 − tanh² x ∈ (0, 1]

[Plot: the derivative of tanh for x from −6 to 6; values stay in (0, 1]]

Weights are initialized ∼ N(0, 1) to keep the gradients further from zero.

SLIDE 30

Vanishing Gradient Problem (2)

∂F_{t+1} / ∂b = ∂F_{t+1} / ∂h_{t+1} · ∂h_{t+1} / ∂b   (chain rule)

SLIDE 31

Vanishing Gradient Problem (3)

∂h_t / ∂b = ∂ tanh(W_h h_{t−1} + W_x x_t + b) / ∂b   (denote the activation a_t = W_h h_{t−1} + W_x x_t + b)

= tanh′(a_t) · ( ∂(W_h h_{t−1}) / ∂b + ∂(W_x x_t) / ∂b + ∂b / ∂b )   where ∂(W_x x_t) / ∂b = 0 and ∂b / ∂b = 1

= W_h tanh′(a_t) · ∂h_{t−1} / ∂b + tanh′(a_t)

with W_h ∼ N(0, 1) and tanh′(a_t) ∈ (0, 1], so the contribution of earlier steps keeps shrinking.
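A small NumPy experiment (an illustrative sketch, not from the slides) that shows how the repeated factor W_h · tanh′(a_t) shrinks the influence of early steps in a scalar RNN:

import numpy as np

np.random.seed(0)
w_h = np.random.randn()   # recurrent weight drawn ~ N(0, 1)
h = 0.0
factor = 1.0              # product of w_h * tanh'(a_t) over the time steps

for t in range(1, 51):
    a = w_h * h + np.random.randn()      # activation with a random input term
    h = np.tanh(a)
    factor *= w_h * (1 - h ** 2)         # tanh'(a) is in (0, 1]
    if t % 10 == 0:
        print(f"step {t}: influence of the first step ~ {factor:.2e}")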

SLIDE 32

Long Short-Term Memory Networks

LSTM = Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. ISSN 0899-7667

Control the gradient flow by explicit gating:

  • what to use from the input,
  • what to use from the hidden state,
  • what to put on the output

SLIDE 33

LSTM: Hidden State

  • two types of hidden states
  • h_t – the "public" hidden state, used as the output
  • C_t – the "private" memory cell, with no non-linearities on the way
  • direct flow of gradients (without multiplying by derivatives ≤ 1)

SLIDE 34

LSTM: Forget Gate

f_t = σ (W_f [h_{t−1}; x_t] + b_f)

  • based on input and previous state, decide what to forget from the memory

SLIDE 35

LSTM: Input Gate

i_t = σ (W_i · [h_{t−1}; x_t] + b_i)
C̃_t = tanh (W_C · [h_{t−1}; x_t] + b_C)

  • C̃_t – a candidate for what we may want to add to the memory
  • i_t – decides how much of that information we want to store

SLIDE 36

LSTM: Cell State Update

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

SLIDE 37

LSTM: Output Gate

o_t = σ (W_o · [h_{t−1}; x_t] + b_o)
h_t = o_t ⊙ tanh C_t

SLIDE 38

Here we are, LSTM!

f_t = σ (W_f [h_{t−1}; x_t] + b_f)
i_t = σ (W_i [h_{t−1}; x_t] + b_i)
o_t = σ (W_o [h_{t−1}; x_t] + b_o)
C̃_t = tanh (W_C [h_{t−1}; x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh C_t

Question: How would you implement it efficiently? Compute all gates in a single matrix multiplication.
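A minimal NumPy sketch of one LSTM step in the spirit of that answer: all four gates come out of a single matrix multiplication and are then split (names and shapes are illustrative, not the deck's code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    # W has shape [4 * dim, dim + dim_x], b has shape [4 * dim]
    z = W @ np.concatenate([h_prev, x]) + b        # one matmul for all gates
    f, i, o, c_cand = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    c = f * c_prev + i * np.tanh(c_cand)           # cell state update
    h = o * np.tanh(c)                             # public hidden state
    return h, c

dim, dim_x = 8, 4
W = np.random.randn(4 * dim, dim + dim_x) * 0.1
b = np.zeros(4 * dim)
h, c = lstm_step(np.ones(dim_x), np.zeros(dim), np.zeros(dim), W, b)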

SLIDE 39

Gated Recurrent Units

update gate:            z_t = σ(x_t W_z + h_{t−1} U_z + b_z) ∈ (0, 1)
remember gate:          r_t = σ(x_t W_r + h_{t−1} U_r + b_r) ∈ (0, 1)
candidate hidden state: h̃_t = tanh (x_t W_h + (r_t ⊙ h_{t−1}) U_h) ∈ (−1, 1)
hidden state:           h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
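For comparison, a sketch of one GRU step following the equations above (a hypothetical minimal implementation with illustrative parameter names):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    # p is a dict with input matrices W_*, recurrent matrices U_* and biases b_*
    z = sigmoid(x @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])   # update gate
    r = sigmoid(x @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])   # remember gate
    h_cand = np.tanh(x @ p["W_h"] + (r * h_prev) @ p["U_h"])   # candidate state
    return (1 - z) * h_prev + z * h_cand

dim_x, dim_h = 4, 8
rng = np.random.default_rng(0)
shapes = [("W_z", (dim_x, dim_h)), ("U_z", (dim_h, dim_h)), ("b_z", (dim_h,)),
          ("W_r", (dim_x, dim_h)), ("U_r", (dim_h, dim_h)), ("b_r", (dim_h,)),
          ("W_h", (dim_x, dim_h)), ("U_h", (dim_h, dim_h))]
params = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes}
h = gru_step(np.ones(dim_x), np.zeros(dim_h), params)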

SLIDE 40

LSTM vs. GRU

  • GRU is smaller and therefore faster
  • performance similar, task dependent
  • theoretical limitation: GRU accepts regular languages, LSTM can simulate a counter machine

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ISSN 2331-8422; Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P18-2117

SLIDE 41

RNN in PyTorch

rnn = nn.LSTM(input_dim, hidden_size=512, num_layers=1,
              bidirectional=True, dropout=0.8)
output, (hidden, cell) = rnn(x)

https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM

SLIDE 42

RNN in TensorFlow

inputs = ...   # float tf.Tensor of shape [batch, length, dim]
lengths = ...  # int tf.Tensor of shape [batch]

# Cell objects are templates
fw_cell = tf.nn.rnn_cell.LSTMCell(512, name="fw_cell")
bw_cell = tf.nn.rnn_cell.LSTMCell(512, name="bw_cell")

outputs, states = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, inputs, sequence_length=lengths, dtype=tf.float32)

https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn

SLIDE 43

Bidirectional Networks

  • simple trick to improve performance
  • run one RNN forward, second one backward and concatenate outputs

Image from: http://colah.github.io/posts/2015-09-NN-Types-FP/

  • state of the art in tagging, crucial for neural machine translation

SLIDE 44

Representing Sequences

Convolutional Networks

SLIDE 45

1-D Convolution

≈ a sliding window over the sequence

embeddings x = (x_1, …, x_N), padded with zero vectors x_0 = 0⃗, x_{N+1} = 0⃗ if we want to keep the sequence length

h_1 = f (W [x_0; x_1; x_2] + b)
h_i = f (W [x_{i−1}; x_i; x_{i+1}] + b)

SLIDE 46

1-D Convolution: Pseudocode

xs = ...          # input sequence, shape [batch, length, dim]
kernel_size = 3   # window size
filters = 300     # output dimension
strides = 1       # step size

W = trained_parameter(xs.shape[2] * kernel_size, filters)
b = trained_parameter(filters)

window = kernel_size // 2
outputs = []
for i in range(window, xs.shape[1] - window):
    # flatten the window around position i and project it
    h = np.dot(xs[:, i - window:i + window + 1].reshape(xs.shape[0], -1), W) + b
    outputs.append(h)
return np.array(outputs)

SLIDE 47

1-D Convolution: Frameworks

TensorFlow

h = tf.layers.conv1d(x, filters=300, kernel_size=3,
                     strides=1, padding='same')

https://www.tensorflow.org/api_docs/python/tf/layers/conv1d

PyTorch

conv = nn.Conv1d(in_channels, out_channels=300, kernel_size=3, stride=1, padding=0, dilation=1, groups=1, bias=True) h = conv(x)

https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d

SLIDE 48

Rectified Linear Units

ReLU: ReLU(x) = max(0, x)

[Plot: ReLU for x from −6 to 6]

Derivative of ReLU: 0 for x < 0, 1 for x > 0

[Plot: the derivative of ReLU for x from −6 to 6]

faster, suffers less from the vanishing gradient

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010

SLIDE 49

Residual Connections

embeddings x = (x_1, …, x_N), padded with zero vectors x_0 = 0⃗, x_{N+1} = 0⃗

[Figure: a stack of convolutional layers with ⊕ skip connections that add each layer's input to its output]

h_i = f (W [x_{i−1}; x_i; x_{i+1}] + b) + x_i

Allows training deeper networks. Why does it help? Better gradient flow, the same as in RNNs.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. IEEE Computer Society

SLIDE 50

Residual Connections: Numerical Stability

Numerically unstable; we need the activations to be on a similar scale ⇒ layer normalization.

The activation before the non-linearity is normalized:

a′_i = (g_i / σ) (a_i − μ)

…g is a trainable parameter; μ and σ are estimated from the data:

μ = (1/H) ∑_{i=1}^{H} a_i          σ = √( (1/H) ∑_{i=1}^{H} (a_i − μ)² )

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. ISSN 2331-8422
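A small NumPy sketch of layer normalization as defined above (the gain g is the trainable parameter; the eps term is an assumption added only for numerical safety):

import numpy as np

def layer_norm(a, g, eps=1e-6):
    # normalize a vector of activations a with a trainable gain g
    mu = a.mean()
    sigma = np.sqrt(((a - mu) ** 2).mean())
    return g / (sigma + eps) * (a - mu)

a = np.array([1.0, 2.0, 7.0, -3.0])
print(layer_norm(a, g=np.ones_like(a)))   # zero mean, unit scale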

SLIDE 51

Receptive Field

[Figure: stacked convolutions over the embeddings x = (x_1, …, x_N); each additional layer sees a slightly wider window of the input]

The receptive field can be enlarged by dilated convolutions.

SLIDE 52

Convolutional architectures

+ extremely computationally efficient

− limited context
− by default not aware of n-gram order

  • max-pooling over the hidden states = element-wise maximum over the sequence
  • can be understood as an ∃ operator over the feature extractors

SLIDE 53

Representing Sequences

Self-attentive Networks

SLIDE 54

Self-attentive Networks

  • In some layers: states are a linear combination of the previous layer's states
  • Originally introduced for the Transformer model for machine translation
  • similarity matrix between all pairs of states
  • O(n²) memory, O(1) time (when parallelized)
  • next layer: sum by rows

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc

SLIDE 55

Multi-headed scaled dot-product attention

Single-head setup:

Attn(Q, K, V) = softmax (Q K⊤ / √d) V

h^(l+1) = softmax (h^(l) h^(l)⊤ / √d) h^(l)

Multi-head setup:

Multihead(Q, V) = (head_1 ⊕ ⋯ ⊕ head_h) W_O
head_i = Attn(Q W_i^Q, V W_i^K, V W_i^V)

[Figure: queries and keys & values pass through linear projections, are split into heads, each head computes scaled dot-product attention, and the heads' outputs are concatenated]

SLIDE 56

Dot-Product Attention in PyTorch

import math

import torch
import torch.nn.functional as F


def attention(query, key, value, mask=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn

SLIDE 57

Dot-Product Attention in TensorFlow

def scaled_dot_product(self, queries, keys, values):
    o1 = tf.matmul(queries, keys, transpose_b=True)
    o2 = o1 / (dim ** 0.5)
    o3 = tf.nn.softmax(o2)
    return tf.matmul(o3, values)

SLIDE 58

Position Encoding

The model cannot otherwise be aware of the position in the sequence, so a position encoding is added:

pos(t, i) = sin (t / 10000^(i/d))       if i mod 2 = 0
            cos (t / 10000^((i−1)/d))   otherwise

[Figure: heatmap of the sinusoidal position encoding over text length and dimension]
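A sketch of the sinusoidal position encoding following the formula above (the function name is illustrative):

import numpy as np

def position_encoding(length, dim):
    # sin on even dimensions, cos on odd ones
    pos = np.arange(length)[:, None]   # positions t
    i = np.arange(dim)[None, :]        # dimension indices i
    angles = pos / np.power(10000.0, (i - i % 2) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))   # [length, dim]

print(position_encoding(4, 6).round(2))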

SLIDE 59

Stacking self-attentive Layers

[Figure: one encoder layer: input embeddings ⊕ position encoding enter a self-attentive sublayer (multi-head attention over keys & values and queries, ⊕ residual connection, layer normalization) followed by a feed-forward sublayer (non-linear layer, linear layer, ⊕ residual connection, layer normalization); the whole layer is stacked N×]

  • several layers (6 in the original paper)
  • each layer has 2 sub-layers: self-attention and a feed-forward layer
  • everything inter-connected with residual connections

SLIDE 60

Architectures Comparison

                  computation     sequential operations   memory
Recurrent         O(n · d²)       O(n)                    O(n · d)
Convolutional     O(k · n · d²)   O(1)                    O(n · d)
Self-attentive    O(n² · d)       O(1)                    O(n² · d)

d = model dimension, n = sequence length, k = convolutional kernel size

SLIDE 61

Classification and Labeling

SLIDE 62

Classification and Labeling

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 63

Sequence Classification

  • tasks like sentiment analysis or genre classification
  • need to get one vector from the sequence → average or max pooling
  • optionally hidden layers, and at the end a softmax for a probability distribution over classes

SLIDE 64

Softmax & Cross-Entropy

Output layer with softmax (with parameters W, b):

P(y = k | x) = softmax(x)_k = exp(x⊤W + b)_k / ∑_j exp(x⊤W + b)_j

Network error = cross-entropy between the estimated distribution and the one-hot ground-truth distribution T = 1(y*):

L(P, y*) = H(T, P) = −E_{j∼T} log P(j) = −∑_j T(j) log P(j) = −log P(y*)

SLIDE 65

Derivative of Cross-Entropy

Let l = x⊤W + b be the logits, with l_{y*} corresponding to the correct class.

∂L(P, y*) / ∂l = −∂/∂l log ( exp l_{y*} / ∑_k exp l_k )
               = −∂/∂l ( l_{y*} − log ∑_k exp l_k )
               = −1_{y*} + exp l / ∑_k exp l_k
               = P − 1_{y*}

Interpretation: reinforce the correct logit, suppress the rest.
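A quick numerical check of this result (an illustrative sketch; the finite-difference step is arbitrary):

import numpy as np

def softmax(l):
    e = np.exp(l - l.max())
    return e / e.sum()

def loss(l, y_true):
    return -np.log(softmax(l)[y_true])

l, y_true = np.array([2.0, -1.0, 0.5]), 0
analytic = softmax(l) - np.eye(3)[y_true]          # P - 1_{y*}
numeric = np.array([(loss(l + eps, y_true) - loss(l, y_true)) / 1e-6
                    for eps in np.eye(3) * 1e-6])  # finite differences
print(np.allclose(analytic, numeric, atol=1e-4))   # True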

SLIDE 66

Sequence Labeling

  • assign value / probability distribution to every token in a sequence
  • morphological tagging, named-entity recognition, LM with unlimited history, answer span selection
  • every state is classified independently with a classifier
  • during training, the error back-propagates from all the classifiers

Lab next time: i/y spelling as sequence labeling

SLIDE 67

Generating Sequences

SLIDE 68

Sequence-to-sequence Learning

  • the target sequence has a different length than the source
  • non-trivial (= not monotonic) correspondence between source and target
  • tasks like machine translation, text summarization, image captioning

SLIDE 69

Neural Language Model

[Figure: at each step the input symbol (a one-hot vector) goes through an embedding lookup, an RNN cell (possibly more layers), and a softmax normalization that gives the distribution for the next symbol: <s> → embed → RNN → h_0 → softmax → P(x_1 | <s>), then x_1 → … → P(x_2 | …), and so on]

  • estimate the probability of a sentence using the chain rule
  • the output distributions can be used for sampling

SLIDE 70

Sampling from a LM

[Figure: greedy decoding: at each step the argmax of the softmax distribution is embedded and fed back as the next input, starting from <s>]

when conditioned on input → autoregressive decoder

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Canada, December 2014. Curran Associates, Inc

SLIDE 71

Autoregressive Decoding: Pseudocode

last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state, dec_output = dec_cell(state, last_w_embedding)
    logits = output_projection(dec_output)
    last_w = np.argmax(logits)
    yield last_w

SLIDE 72

Architectures in the Decoder

  • RNN – original sequence-to-sequence learning (2015)
  • principle known since 2014 (University of Montreal)
  • made usable in 2016 (University of Edinburgh)
  • CNN – convolutional sequence-to-sequence by Facebook (2017)
  • Self-attention (so called Transformer) by Google (2017)

More on the topic in the MT class.

SLIDE 73

Implementation: Runtime vs. training

runtime: the decoded symbols ŷ_k are fed back as the next inputs

training: the ground-truth symbols y_k are fed as inputs and the loss is computed between the decoder outputs and the ground truth

[Figure: the decoder unrolled at runtime (feeding back its own predictions) vs. at training time (ground-truth inputs, outputs entering the loss)]

SLIDE 74

Attention Model

[Figure: encoder states h_0 … h_4 over the inputs <s>, x_1 … x_4 are weighted by attention weights α_0 … α_4 and summed; the result is combined with the decoder states s_{i−1}, s_i, s_{i+1} to produce the outputs ỹ_i, ỹ_{i+1}]

SLIDE 75

Attention Model in Equations (1)

Inputs: decoder state s_i, encoder states h_j = [h→_j; h←_j] for j = 1 … T_x

Attention energies: e_ij = v_a⊤ tanh (W_a s_{i−1} + U_a h_j + b_a)

Attention distribution: α_ij = exp(e_ij) / ∑_{k=1}^{T_x} exp(e_ik)

Context vector: c_i = ∑_{j=1}^{T_x} α_ij h_j

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. ISSN 2331-8422
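A NumPy sketch of one such attention step (matrix names mirror the formulas above; everything else, including the shapes, is illustrative):

import numpy as np

def attention_step(s_prev, H, W_a, U_a, b_a, v_a):
    # attention energies e_ij for all encoder positions j
    energies = np.tanh(s_prev @ W_a + H @ U_a + b_a) @ v_a
    alphas = np.exp(energies) / np.exp(energies).sum()   # attention distribution
    return alphas @ H                                    # context vector c_i

T_x, d_h, d_s, d_a = 5, 6, 4, 8
H = np.random.randn(T_x, d_h)      # encoder states h_j
s_prev = np.random.randn(d_s)      # previous decoder state
c = attention_step(s_prev, H,
                   np.random.randn(d_s, d_a), np.random.randn(d_h, d_a),
                   np.zeros(d_a), np.random.randn(d_a))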

SLIDE 76

Attention Model in Equations (2)

Output projection: t_i = MLP (U_o s_{i−1} + V_o E y_{i−1} + C_o c_i + b_o)

…the attention is mixed with the hidden state

Output distribution: p (y_i = k | s_i, y_{i−1}, c_i) ∝ exp ((W_o t_i)_k + b_k)

SLIDE 77

Transformer Decoder

[Figure: one decoder layer: input embeddings ⊕ position encoding enter a self-attentive sublayer, then a cross-attention sublayer whose keys & values come from the encoder and whose queries come from the decoder, then a feed-forward sublayer; each sublayer is followed by ⊕ residual connection and layer normalization; the stack of N layers is followed by a linear layer and a softmax]

  • output symbol probabilities
  • similar to the encoder, with an additional layer that attends to the encoder
  • at every step, self-attention over the complete history ⇒ O(n²) complexity

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc

SLIDE 78

Transformer Decoder: Non-autoregressive Training

[Figure: the self-attention matrix between queries q_1 … q_N and values v_1 … v_M, with the cells above the diagonal set to −∞ so that no position can attend to the future]

  • analogous to the encoder
  • the target is known at training time: no need to wait until it is generated
  • self-attention can be parallelized via matrix multiplication
  • attending to the future is prevented by a mask (see the sketch below)

Question 1: What if the matrix was diagonal?
Question 2: What would such a matrix look like for a convolutional architecture?
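A sketch of such a masked self-attention matrix (illustrative; −∞ is realized as a large negative number before the softmax):

import numpy as np

def causal_attention_weights(Q, K):
    # scaled dot-product scores with the upper triangle masked out
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)   # positions after the query
    scores = np.where(future, -1e9, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q = K = np.random.randn(5, 8)
print(causal_attention_weights(Q, K).round(2))   # lower-triangular rows summing to 1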

SLIDE 79

Pre-training Representations

SLIDE 80

Pre-training Representations

Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT

SLIDE 81

Pre-trained Representations

  • representations that emerge in the models seem to carry a lot of information about the language
  • representations pre-trained on large data can be re-used on tasks with smaller training data

SLIDE 82

Pre-training Representations

Word2Vec

SLIDE 83

Word2Vec

  • a way to learn word embeddings without training the complete LM

[Figure: CBOW predicts the middle word x_3 from the sum of the surrounding words x_1, x_2, x_4, x_5; skip-gram predicts the surrounding words from the middle word]

  • CBOW: minimize the cross-entropy of the middle word of a sliding window
  • skip-gram: minimize the cross-entropy of a bag of words around a word (LM the other way round)

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics

SLIDE 84

Word2Vec: sampling

1. All human beings are born free and equal in dignity … → (All, human), (All, beings)
2. All human beings are born free and equal in dignity … → (human, All), (human, beings), (human, are)
3. All human beings are born free and equal in dignity … → (beings, All), (beings, human), (beings, are), (beings, born)
4. All human beings are born free and equal in dignity … → (are, human), (are, beings), (are, born), (are, free)
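A small sketch of how such training pairs can be generated with a sliding window (function name and window size are illustrative):

def skipgram_pairs(tokens, window=2):
    # yield (center, context) pairs for every word and its neighbours in the window
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "All human beings are born free and equal in dignity".split()
print(list(skipgram_pairs(sentence))[:5])
# [('All', 'human'), ('All', 'beings'), ('human', 'All'), ('human', 'beings'), ('human', 'are')]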

SLIDE 85

Word2Vec: Formulas

  • Training objective:

(1/T) ∑_{t=1}^{T} ∑_{j ∼ (−c, c)} log p(x_{t+j} | x_t)

  • Probability estimation:

p(x_O | x_I) = exp (W′_{x_O}⊤ W_{x_I}) / ∑_x exp (W′_x⊤ W_{x_I})

where W is the input (embedding) matrix and W′ the output matrix

Equations 1, 2. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics

SLIDE 86

Word2Vec: Training using Negative Sampling

The summation in the denominator is slow; use noise contrastive estimation instead:

log σ (W′_{x_O}⊤ W_{x_I}) + ∑_{i=1}^{k} E_{x_i ∼ P_n(x)} [ log σ (−W′_{x_i}⊤ W_{x_I}) ]

Main idea: classify independently, by logistic regression, the positive example and a few sampled negative examples.

Equations 1, 3. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics
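A NumPy sketch of this objective for one positive pair (a hypothetical minimal version; W and W_out stand for the input and output embedding matrices from the previous slide):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(W, W_out, center, context, negatives):
    # negative of the objective: push the true pair's score up,
    # push the scores of a few sampled negative words down
    pos = np.log(sigmoid(W_out[context] @ W[center]))
    neg = sum(np.log(sigmoid(-W_out[n] @ W[center])) for n in negatives)
    return -(pos + neg)

vocab, dim = 1000, 100
W = np.random.randn(vocab, dim) * 0.01
W_out = np.random.randn(vocab, dim) * 0.01
print(neg_sampling_loss(W, W_out, center=3, context=17, negatives=[42, 7, 512]))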

SLIDE 87

Word2Vec: Vector Arithmetics

[Figure: word-vector offsets connect the pairs man–woman, uncle–aunt, king–queen, and the singular–plural pair kings–queens]

Image originally from Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics

SLIDE 88

Few More Notes on Embeddings

  • many methods for pre-trained word embeddings (the most popular being GloVe)
  • embeddings capturing character-level properties
  • multilingual embeddings

SLIDE 89

Training models

FastText – Word2Vec model implementation by Facebook https://github.com/facebookresearch/fastText

./fasttext skipgram -input data.txt -output model

SLIDE 90

Pre-training Representations

ELMo

SLIDE 91

What is ELMo?

  • pre-trained large language model
  • “nothing special” – combines all known tricks, trained on extremely large data

  • improves almost all NLP tasks
  • published in June 2018

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N18-1202

SLIDE 92

ELMo Architecture: Input

character embeddings of size 16

1D-convolution to 2,048 dimensions + max-pooling:

window width   filters
1              32
2              32
3              64
4              128
5              256
6              512
7              1024

2× highway layer (2,048 dimensions), linear projection to 512 dimensions

  • input is tokenized, but treated on the character level
  • 2,048 n-gram filters + max-pooling (∼ a soft search for learned n-grams)
  • 2 highway layers:

g_{l+1} = σ (W_g h_l + b_g)
h_{l+1} = (1 − g_{l+1}) ⊙ h_l + g_{l+1} ⊙ ReLU (W h_l + b)

contain gates that control whether a projection is needed
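A minimal NumPy sketch of one such highway layer (illustrative names, following the two equations above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(h, W_g, b_g, W, b):
    # gated mix of the unchanged input and its ReLU projection
    g = sigmoid(W_g @ h + b_g)
    return (1 - g) * h + g * np.maximum(0, W @ h + b)

dim = 16
h = np.random.randn(dim)
out = highway_layer(h, np.random.randn(dim, dim), np.zeros(dim),
                    np.random.randn(dim, dim), np.zeros(dim))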

SLIDE 93

ELMo Architecture: Language Models

  • the token representations are the input for 2 language models: forward and backward
  • both LMs have 2 layers with 4,096 dimensions, with layer normalization and residual connections
  • the output classifier is shared (only used in training, does not have to be good)

Learned layer combination for downstream tasks:

ELMo_k^task = γ^task ∑_{layers l} s_l^task h_{k,l}

γ^task, s_l^task are trainable parameters.
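A sketch of this task-specific layer combination for one token (illustrative names; the per-layer weights s^task are typically softmax-normalized):

import numpy as np

def elmo_combination(layer_states, s_task, gamma_task):
    # layer_states: [num_layers, dim]; s_task: raw per-layer weights
    weights = np.exp(s_task) / np.exp(s_task).sum()              # normalized s^task
    return gamma_task * (weights[:, None] * layer_states).sum(axis=0)

layers = np.random.randn(3, 1024)   # e.g. token layer + 2 LM layers
print(elmo_combination(layers, s_task=np.zeros(3), gamma_task=1.0).shape)  # (1024,)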

SLIDE 94

Tasks where ELMo helps

Answer Span Selection: find an answer to a question in an unstructured text.
Semantic Role Labeling: detect who did what to whom in sentences.
Natural Language Inference: decide whether two sentences are in agreement, contradict each other, or have nothing to do with each other.
Named Entity Recognition: detect and classify names of people, locations, organizations, numbers with units, email addresses, URLs, phone numbers, …
Coreference Resolution: detect what entities pronouns refer to.
Semantic Similarity: measure how similar in meaning two sentences are. (Think of clustering similar questions on StackOverflow or detecting plagiarism.)

SLIDE 95

Improvements by ELMo

SLIDE 96

How to use it

  • implemented in the AllenNLP framework (uses PyTorch)
  • pre-trained English models available

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = ...
weight_file = ...
elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)

https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md

SLIDE 97

Pre-training Representations

BERT

SLIDE 98

What is BERT

  • another way of pre-training sentence representations
  • uses the Transformer architecture and a slightly different training objective
  • even better than ELMo
  • done by Google, published in November 2018

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints, October 2018

SLIDE 99

Architecture Comparison

SLIDE 100

Masked Language Model

Example: All human beings are born free and equal in dignity and rights → the sampled word free may stay as free, be replaced with MASK, or be replaced with a random word such as hairy.

1. Randomly sample a word → free
2. With 80% chance, replace it with the special MASK token.
3. With 10% chance, replace it with a random token → hairy
4. With 10% chance, keep it as it is → free

Then a classifier should predict the missing/replaced word: free
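A sketch of this corruption step for one sampled position (probabilities as listed above; the vocabulary and helper name are illustrative):

import random

def corrupt_token(token, vocab, p_mask=0.8, p_random=0.1):
    # return the input token for a position that was selected for prediction
    r = random.random()
    if r < p_mask:
        return "[MASK]"                  # 80%: replace with the MASK token
    if r < p_mask + p_random:
        return random.choice(vocab)      # 10%: replace with a random token
    return token                         # 10%: keep the token as it is

vocab = ["free", "hairy", "born", "equal", "dignity"]
print([corrupt_token("free", vocab) for _ in range(5)])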

SLIDE 101

Additional Objective: Next Sentence Prediction

  • trained in a multi-task learning setup
  • secondary objective: next sentence prediction
  • decide for a pair of consecutive sentences whether they follow each other

SLIDE 102

Performance of BERT

Tables 1 and 2. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints, October 2018

SLIDE 103

Deep Learning for Natural Language Processing

Summary

1. Discrete symbols → continuous representations with trained embeddings
2. Architectures to get a suitable representation: recurrent, convolutional, self-attentive
3. Output: classification, sequence labeling, autoregressive decoding
4. Representations pre-trained on large data help on downstream tasks

http://ufal.mff.cuni.cz/~zabokrtsky/fel