Vanilla RNN

$h_t = \tanh\left(W[h_{t-1}; x_t] + b\right)$

• cannot propagate long-distance relations
• vanishing gradient problem
Vanishing Gradient Problem (1)

[Figure: plots of $\tanh(x)$ and its derivative $\frac{\mathrm{d}\tanh(x)}{\mathrm{d}x}$]

$\tanh(x) = \dfrac{1 - e^{-2x}}{1 + e^{-2x}}$

$\dfrac{\mathrm{d}\tanh(x)}{\mathrm{d}x} = 1 - \tanh^2(x) \in (0, 1]$

Weights initialized $\sim \mathcal{N}(0, 1)$ to have gradients further from zero.
Vanishing Gradient Problem (2)

$\dfrac{\partial E_{t+1}}{\partial b} = \dfrac{\partial E_{t+1}}{\partial h_{t+1}} \cdot \dfrac{\partial h_{t+1}}{\partial b}$ (chain rule)
Vanishing Gradient Problem (3)

$\dfrac{\partial h_t}{\partial b} = \dfrac{\partial}{\partial b} \underbrace{\tanh}_{\tanh' \in (0;1]} \big( \underbrace{W_h h_{t-1} + W_x x_t + b}_{= a_t \text{ (activation)}} \big) = \tanh'(a_t) \cdot \Big( \underbrace{W_h}_{\sim \mathcal{N}(0,1)} \dfrac{\partial h_{t-1}}{\partial b} + \underbrace{\dfrac{\partial W_x x_t}{\partial b}}_{=0} + \underbrace{\dfrac{\partial b}{\partial b}}_{=1} \Big)$

($\tanh'$ is the derivative of $\tanh$)
Long Short-Term Memory Networks

LSTM = Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667.

Control the gradient flow by explicitly gating:
• what to use from input,
• what to use from hidden state,
• what to put on output
LSTM: Hidden State

• two types of hidden states
• $h_t$ – "public" hidden state, used as output
• $c_t$ – "private" memory, no non-linearities on the way
• direct flow of gradients (without multiplying by $\leq 1$ derivatives)
LSTM: Forget Gate

$f_t = \sigma\left(W_f [h_{t-1}; x_t] + b_f\right)$

• based on input and previous state, decide what to forget from the memory
LSTM: Input Gate

$i_t = \sigma\left(W_i [h_{t-1}; x_t] + b_i\right)$
$\tilde{C}_t = \tanh\left(W_C [h_{t-1}; x_t] + b_C\right)$

• $i_t$ – decide how much of the information we want to store
• $\tilde{C}_t$ – candidate of what we may want to add to the memory
LSTM: Cell State Update

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
LSTM: Output Gate

$o_t = \sigma\left(W_o [h_{t-1}; x_t] + b_o\right)$
$h_t = o_t \odot \tanh\left(C_t\right)$
Here we are, LSTM!

$f_t = \sigma\left(W_f [h_{t-1}; x_t] + b_f\right)$
$i_t = \sigma\left(W_i [h_{t-1}; x_t] + b_i\right)$
$o_t = \sigma\left(W_o [h_{t-1}; x_t] + b_o\right)$
$\tilde{C}_t = \tanh\left(W_C [h_{t-1}; x_t] + b_C\right)$
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
$h_t = o_t \odot \tanh\left(C_t\right)$

Question: How would you implement it efficiently? Compute all gates in a single matrix multiplication (see the sketch below).
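A minimal NumPy sketch of one LSTM step that answers the question above: all four gate pre-activations are computed with a single matrix multiplication and then split into four blocks. The parameter names and shapes (`W`, `b`) are illustrative assumptions, not the exact parametrization of any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step with all gates fused into a single matmul.

    x_t:    input embedding, shape [input_dim]
    h_prev: previous hidden state, shape [hidden_dim]
    c_prev: previous cell state, shape [hidden_dim]
    W:      parameters, shape [hidden_dim + input_dim, 4 * hidden_dim]
    b:      bias, shape [4 * hidden_dim]
    """
    concat = np.concatenate([h_prev, x_t])          # [h_{t-1}; x_t]
    gates = concat @ W + b                          # one matmul for all gates
    f, i, o, c_tilde = np.split(gates, 4)           # split into the four blocks
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)    # forget, input, output gates
    c_tilde = np.tanh(c_tilde)                      # candidate memory
    c_t = f * c_prev + i * c_tilde                  # cell state update
    h_t = o * np.tanh(c_t)                          # public hidden state
    return h_t, c_t
```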
Gated Recurrent Units

update gate: $z_t = \sigma\left(x_t W_z + h_{t-1} U_z + b_z\right) \in (0, 1)$
remember gate: $r_t = \sigma\left(x_t W_r + h_{t-1} U_r + b_r\right) \in (0, 1)$
candidate hidden state: $\tilde{h}_t = \tanh\left(x_t W_h + (r_t \odot h_{t-1}) U_h\right) \in (-1, 1)$
hidden state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
LSTM vs. GRU

• GRU is smaller and therefore faster
• performance similar, task dependent
• theoretical limitation: GRU accepts regular languages, LSTM can simulate a counter

Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ISSN 2331-8422.
Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P18-2117.
RNN in PyTorch

import torch.nn as nn

rnn = nn.LSTM(input_size=input_dim, hidden_size=512, num_layers=1,
              bidirectional=True, dropout=0.8)
# note: dropout is only applied between layers, so with num_layers=1 it has no effect
output, (hidden, cell) = rnn(x)

https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM
RNN in TensorFlow

inputs = ...   # float tf.Tensor of shape [batch, length, dim]
lengths = ...  # int tf.Tensor of shape [batch]

# Cell objects are templates
fw_cell = tf.nn.rnn_cell.LSTMCell(512, name="fw_cell")
bw_cell = tf.nn.rnn_cell.LSTMCell(512, name="bw_cell")

outputs, states = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, inputs, sequence_length=lengths, dtype=tf.float32)

https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn
Bidirectional Networks

• simple trick to improve performance
• run one RNN forward, a second one backward, and concatenate the outputs
• state of the art in tagging, crucial for neural machine translation

Image from: http://colah.github.io/posts/2015-09-NN-Types-FP/
Representing Sequences Convolutional Networks
1-D Convolution

= sliding window over the sequence

embeddings $\mathbf{x} = (x_1, \ldots, x_n)$, padded with $x_0 = \mathbf{0}$, $x_{n+1} = \mathbf{0}$ if we want to keep the sequence length

$h_i = f\left(W[x_{i-1}; x_i; x_{i+1}] + b\right)$
e.g. $h_1 = f\left(W[x_0; x_1; x_2] + b\right)$
1-D Convolution: Pseudocode

xs = ...          # input sequence, shape [batch, length, dim]
kernel_size = 3   # window size
filters = 300     # output dimension
stride = 1        # step size

W = trained_parameter(kernel_size * xs.shape[2], filters)
b = trained_parameter(filters)

window = kernel_size // 2
outputs = []
for i in range(window, xs.shape[1] - window, stride):
    # concatenate the embeddings in the window and project them
    x_window = xs[:, i - window:i + window + 1, :].reshape(xs.shape[0], -1)
    h = np.matmul(x_window, W) + b
    outputs.append(h)
return np.stack(outputs, axis=1)
1-D Convolution: Frameworks

TensorFlow

h = tf.layers.conv1d(x, filters=300, kernel_size=3,
                     strides=1, padding='same')

https://www.tensorflow.org/api_docs/python/tf/layers/conv1d

PyTorch

conv = nn.Conv1d(in_channels, out_channels=300, kernel_size=3,
                 stride=1, padding=0, dilation=1, groups=1, bias=True)
h = conv(x)

https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d
Rectified Linear Units

ReLU: $\mathrm{ReLU}(x) = \max(0, x)$

[Figure: plots of ReLU and the derivative of ReLU]

faster, suffers less from the vanishing gradient problem

Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
Residual Connections

embeddings $\mathbf{x} = (x_1, \ldots, x_n)$, padded with $x_0 = \mathbf{0}$, $x_{n+1} = \mathbf{0}$

$h_i = f\left(W[x_{i-1}; x_i; x_{i+1}] + b\right) + x_i$

Better gradient flow – the same as in RNNs. Why do you think it helps? Allows training deeper networks.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. IEEE Computer Society.
Residual Connections: Numerical Stability

Numerically unstable, we need activations to be on a similar scale → layer normalization.

Activation before the non-linearity is normalized:

$\hat{a}_i = \dfrac{g_i}{\sigma}\left(a_i - \mu\right)$

$\mu = \dfrac{1}{H} \sum_{i=1}^{H} a_i \qquad \sigma = \sqrt{\dfrac{1}{H} \sum_{i=1}^{H} \left(a_i - \mu\right)^2}$

… $g$ is a trainable parameter, $\mu$, $\sigma$ are estimated from data.

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. ISSN 2331-8422.
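A minimal NumPy sketch of the layer normalization above. In a real model the gain `g` (and usually a bias) would be trainable parameters; here they are just arrays passed in, and the epsilon is an assumption for numerical safety.

```python
import numpy as np

def layer_norm(a, g, eps=1e-6):
    """Normalize the activation vector `a` before the non-linearity.

    a: activations, shape [..., H]
    g: trainable gain, shape [H]
    """
    mu = a.mean(axis=-1, keepdims=True)      # mean over the H units
    sigma = a.std(axis=-1, keepdims=True)    # std over the H units
    return g * (a - mu) / (sigma + eps)      # rescale to a common scale
```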
Receptive Field

[Figure: stacked 1-D convolutions over embeddings $\mathbf{x} = (x_1, \ldots, x_n)$ with zero padding; each layer enlarges the receptive field.]

Can be enlarged by dilated convolutions.
Convolutional architectures

+ extremely computationally efficient
− limited context
− by default not aware of $n$-gram order
• max-pooling over the hidden states = element-wise maximum over the sequence
• can be understood as an ∃ (existence) operator over the feature extractors
Representing Sequences Self-attentive Networks
Self-attentive Networks

• In some layers: states are linear combinations of the previous layer's states
• similarity matrix between all pairs of states
• next layer: sum by rows
• $\mathcal{O}(n^2)$ memory, $\mathcal{O}(1)$ time (when parallelized)
• Originally for the Transformer model for machine translation

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc.
Multi-headed Scaled Dot-Product Attention

Single-head setup:

$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\dfrac{Q K^\top}{\sqrt{d}}\right) V \qquad h^{(i+1)} = \mathrm{softmax}\left(\dfrac{h^{(i)} {h^{(i)}}^\top}{\sqrt{d}}\right) h^{(i)}$

Multi-head setup:

$\mathrm{Multihead}(Q, V) = \left(H_1 \oplus \cdots \oplus H_h\right) W^O$
$H_i = \mathrm{Attn}\left(Q W_i^Q, K W_i^K, V W_i^V\right)$

[Figure: queries and keys & values pass through linear projections, are split into heads, each head runs scaled dot-product attention, and the results are concatenated.]
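A compact PyTorch sketch of the multi-head setup: project, split into heads, run single-head scaled dot-product attention per head, concatenate, and apply the output projection. The function and parameter names (`w_q`, `w_k`, `w_v`, `w_o`, `n_heads`) are illustrative assumptions, not a specific library API.

```python
import math
import torch
import torch.nn.functional as F

def attention(query, key, value):
    # single-head scaled dot-product attention
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), value)

def multihead(query, key, value, w_q, w_k, w_v, w_o, n_heads):
    """query/key/value: [batch, length, d_model]; w_*: [d_model, d_model]."""
    batch, length, d_model = query.shape
    d_head = d_model // n_heads

    def split_heads(x, w):
        # project, then split the model dimension into n_heads smaller heads
        x = torch.matmul(x, w)
        return x.view(batch, -1, n_heads, d_head).transpose(1, 2)

    q, k, v = split_heads(query, w_q), split_heads(key, w_k), split_heads(value, w_v)
    heads = attention(q, k, v)                     # [batch, heads, length, d_head]
    concat = heads.transpose(1, 2).contiguous().view(batch, length, d_model)
    return torch.matmul(concat, w_o)               # final output projection
```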
Dot-Product Attention in PyTorch

import math, torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # block masked positions
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn
Dot-Product Attention in TensorFlow

def scaled_dot_product(self, queries, keys, values):
    dim = tf.cast(tf.shape(keys)[-1], tf.float32)
    o1 = tf.matmul(queries, keys, transpose_b=True)
    o2 = o1 / (dim ** 0.5)
    o3 = tf.nn.softmax(o2)
    return tf.matmul(o3, values)
Position Encoding

Without it, the model cannot be aware of the position in the sequence.

$\mathrm{pos}(t)_i = \begin{cases} \sin\left(t \,/\, 10^{4i/d}\right) & \text{if } i \bmod 2 = 0 \\ \cos\left(t \,/\, 10^{4(i-1)/d}\right) & \text{otherwise} \end{cases}$

[Figure: heatmap of position encoding values over text length and dimension.]
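A short NumPy sketch of the sinusoidal encoding above ($10^4 = 10000$); even dimensions use sine, odd dimensions use cosine at the same frequency as the preceding even dimension. The function name and argument names are illustrative.

```python
import numpy as np

def position_encoding(length, dim):
    """Sinusoidal position encoding, shape [length, dim]."""
    positions = np.arange(length)[:, np.newaxis]                 # t
    dims = np.arange(dim)[np.newaxis, :]                         # i
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / dim)
    angles = positions * angle_rates
    # even dimensions: sin, odd dimensions: cos
    return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))
```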
Stacking Self-attentive Layers

• several layers (6 in the original paper)
• each layer has 2 sub-layers: a self-attentive sublayer and a feed-forward sublayer
• everything inter-connected with residual connections

[Figure: input embeddings + position encoding, then N× (multihead self-attention over queries, keys & values from the previous layer, residual connection, layer normalization; feed-forward sublayer of a linear and a non-linear layer, residual connection, layer normalization).]
Architectures Comparison

               | memory                     | sequential operations | computation
Recurrent      | $\mathcal{O}(n \cdot d)$   | $\mathcal{O}(n)$      | $\mathcal{O}(n \cdot d^2)$
Convolutional  | $\mathcal{O}(n \cdot d)$   | $\mathcal{O}(1)$      | $\mathcal{O}(k \cdot n \cdot d^2)$
Self-attentive | $\mathcal{O}(n^2 \cdot d)$ | $\mathcal{O}(1)$      | $\mathcal{O}(n^2 \cdot d)$

$d$ model dimension, $n$ sequence length, $k$ convolutional kernel size
Classification and Labeling
Classification and Labeling

Outline:
• Neural Networks Basics
• Representing Words
• Representing Sequences
  • Recurrent Networks
  • Convolutional Networks
  • Self-attentive Networks
• Classification and Labeling
• Generating Sequences
• Pre-training Representations
  • Word2Vec
  • ELMo
  • BERT
Sequence Classification

• tasks like sentiment analysis, genre classification
• need to get one vector from the sequence → average or max pooling
• optionally hidden layers, at the end softmax for a probability distribution over classes (see the sketch below)
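A minimal NumPy sketch of the pipeline above: pool the sequence states into one vector, optionally apply a hidden layer, and finish with a softmax over classes. All parameter names are hypothetical; average pooling is chosen here, max pooling would work the same way.

```python
import numpy as np

def classify_sequence(states, W_hidden, b_hidden, W_out, b_out):
    """states: RNN/CNN/self-attention outputs, shape [length, dim]."""
    pooled = states.mean(axis=0)                           # average pooling over time
    hidden = np.maximum(0, pooled @ W_hidden + b_hidden)   # optional hidden layer (ReLU)
    logits = hidden @ W_out + b_out
    probs = np.exp(logits - logits.max())                  # numerically stable softmax
    return probs / probs.sum()
```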
Softmax & Cross-Entropy

Output layer with softmax (with parameters $W$, $b$):

$p_y = \mathrm{softmax}(\mathbf{x}), \qquad P(y = i \mid \mathbf{x}) = \dfrac{\exp\left(\mathbf{x}^\top W_i + b_i\right)}{\sum_j \exp\left(\mathbf{x}^\top W_j + b_j\right)}$

Network error = cross-entropy between the estimated distribution and the one-hot ground-truth distribution $T = \mathbf{1}(y^*)$:

$E = H(p_y, y^*) = -\sum_i T(i) \log p(i) = -\mathbb{E}_{i \sim T} \log p(i) = -\log p(y^*)$
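A tiny NumPy example of the two formulas: a softmax over three logits and the cross-entropy against a one-hot target. The numbers are made up for illustration only.

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # x^T W + b for three classes (made-up values)
target = 1                            # index of the correct class y*

probs = np.exp(logits) / np.exp(logits).sum()   # softmax
cross_entropy = -np.log(probs[target])          # = -log p(y*)
print(probs, cross_entropy)
```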
Derivative of Cross-Entropy

Let $l = \mathbf{x}^\top W + b$ be the logits, $l_{y^*}$ the one corresponding to the correct class.

$\dfrac{\partial E(p_y, y^*)}{\partial l} = -\dfrac{\partial}{\partial l} \log \dfrac{\exp l_{y^*}}{\sum \exp l} = -\dfrac{\partial}{\partial l}\left( l_{y^*} - \log \sum \exp l \right) = -\left( \mathbf{1}_{y^*} - \dfrac{\exp l}{\sum \exp l} \right) = p_y - \mathbf{1}(y^*)$

Interpretation: Reinforce the correct logit, suppress the rest.
Sequence Labeling

• assign a value / probability distribution to every token in a sequence
• morphological tagging, named-entity recognition, LM with unlimited history, answer span selection
• every state is classified independently with a classifier
• during training, errors backpropagate from all classifiers

Lab next time: i/y spelling as sequence labeling
Generating Sequences
Sequence-to-sequence Learning

• target sequence is of different length than the source
• non-trivial (= not monotonic) correspondence of source and target
• tasks like: machine translation, text summarization, image captioning
Neural Language Model

• estimate probability of a sentence using the chain rule (see the sketch below)
• output distributions can be used for sampling

[Figure: unrolled RNN language model – one-hot vectors for the input symbols starting with <s>, embedding lookup, RNN cell (possibly more layers), softmax normalization producing the distribution $P(w_i \mid \ldots)$ for the next symbol at every step.]
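A sketch of chain-rule scoring with such a model: the sentence log-probability is the sum of the per-step log-probabilities. The helpers `embed`, `rnn_cell`, `output_projection` and `vocab` are hypothetical, standing in for the embedding lookup, RNN cell and output softmax layer (the same style as the decoding pseudocode later in this section).

```python
import numpy as np

def sentence_log_prob(words, initial_state):
    """log P(w_1..w_n) = sum_i log P(w_i | w_<i), assuming hypothetical model helpers."""
    state = initial_state
    log_prob = 0.0
    previous = "<s>"
    for word in words + ["</s>"]:
        state, output = rnn_cell(state, embed(previous))
        logits = output_projection(output)
        probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
        log_prob += np.log(probs[vocab[word]])          # add log P(w_i | history)
        previous = word
    return log_prob
```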
Sampling from a LM

[Figure: the same unrolled RNN, at each step taking the argmax of $P(w_i \mid \ldots)$ and feeding it back as the next input symbol.]

when conditioned on input → autoregressive decoder

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Canada, December 2014. Curran Associates, Inc.
Autoregressive Decoding: Pseudocode

state = initial_state        # e.g. the final encoder state
last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state, dec_output = dec_cell(state, last_w_embedding)
    logits = output_projection(dec_output)
    last_w = vocabulary[np.argmax(logits)]   # map the argmax index back to a word
    yield last_w
Architectures in the Decoder

• RNN – original sequence-to-sequence learning (2015)
  • principle known since 2014 (University of Montreal)
  • made usable in 2016 (University of Edinburgh)
• CNN – convolutional sequence-to-sequence by Facebook (2017)
• Self-attention (so-called Transformer) by Google (2017)

More on the topic in the MT class.
Implementation: Runtime vs. Training

[Figure: at training time the decoder is fed the ground-truth symbols $y_1, \ldots, y_n$ and the loss is computed against them; at runtime it is fed its own previously decoded symbols $\tilde{y}_1, \ldots, \tilde{y}_n$.]
Attention Model

[Figure: encoder states $h_0, \ldots, h_4$ over the input $x_1, \ldots, x_4$; attention weights $\alpha_0, \ldots, \alpha_4$ combine them into a context vector that is fed to the decoder states $s_{i-1}, s_i, s_{i+1}$ when producing $\tilde{y}_i, \tilde{y}_{i+1}$.]
Attention Model in Equations (1)

Inputs: decoder state $s_i$, encoder states $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$ for $j = 1 \ldots T_x$

Attention energies: $e_{ij} = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j + b_a\right)$

Attention distribution: $\alpha_{ij} = \dfrac{\exp\left(e_{ij}\right)}{\sum_{k=1}^{T_x} \exp\left(e_{ik}\right)}$

Context vector: $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. ISSN 2331-8422.
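A NumPy sketch of the three equations above for a single decoder step; the parameter names follow the slide, the shapes are illustrative assumptions.

```python
import numpy as np

def bahdanau_attention(s_prev, H, W_a, U_a, b_a, v_a):
    """Context vector for one decoder step.

    s_prev: previous decoder state, shape [dec_dim]
    H:      encoder states, shape [T_x, enc_dim]
    W_a: [att_dim, dec_dim], U_a: [att_dim, enc_dim], b_a, v_a: [att_dim]
    """
    energies = np.tanh(W_a @ s_prev + H @ U_a.T + b_a) @ v_a   # e_ij, shape [T_x]
    weights = np.exp(energies - energies.max())
    alpha = weights / weights.sum()                            # attention distribution
    context = alpha @ H                                        # weighted sum of encoder states
    return context, alpha
```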
Attention Model in Equations (2)

Output projection – attention is mixed with the hidden state:

$t_i = \mathrm{MLP}\left(U_o s_{i-1} + V_o E y_{i-1} + C_o c_i + b_o\right)$

Output distribution:

$p\left(y_i = k \mid s_i, y_{i-1}, c_i\right) \propto \exp\left(W_k t_i + b_k\right)$
Transformer Decoder

• similar to the encoder, with an additional cross-attention sublayer attending to the encoder
• in every step self-attention over the complete history → $\mathcal{O}(n^2)$ complexity

[Figure: input embeddings + position encoding, then N× (self-attentive sublayer with multihead attention, residual connection, layer normalization; cross-attention sublayer with keys & values from the encoder and queries from the decoder, residual connection, layer normalization; feed-forward sublayer of a linear and a non-linear layer, residual connection, layer normalization); finally a linear layer and softmax producing output symbol probabilities.]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc.
Transformer Decoder: Non-autoregressive Training

• analogical to the encoder
• target is known at training time: no need to wait until it's generated
• self-attention can be parallelized via matrix multiplication
• attending to the future is prevented using a mask matrix (see the sketch below)

[Figure: matrix of queries $q_1, \ldots, q_n$ against values $w_1, \ldots, w_n$ with the upper triangle masked out.]

Question 1: What if the matrix was diagonal?
Question 2: How would such a matrix look like for a convolutional architecture?
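A small PyTorch sketch of the mask that prevents attending to the future; it plugs into the `mask` argument of the dot-product attention shown earlier, where masked positions are filled with a large negative score before the softmax.

```python
import torch

def causal_mask(length):
    """Lower-triangular mask: position i may attend to positions <= i only."""
    return torch.tril(torch.ones(length, length)).bool()

mask = causal_mask(5)
print(mask.int())   # 1 = allowed, 0 = future position, masked out
```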
Pre-training Representations
Pre-training Representations

Outline:
• Neural Networks Basics
• Representing Words
• Representing Sequences
  • Recurrent Networks
  • Convolutional Networks
  • Self-attentive Networks
• Classification and Labeling
• Generating Sequences
• Pre-training Representations
  • Word2Vec
  • ELMo
  • BERT
Pre-trained Representations

• representations that emerge in models seem to carry a lot of information about the language
• representations pre-trained on large data can be re-used on tasks with smaller training data
Pre-training Representations Word2Vec
Word2Vec

• a way to learn word embeddings without training the complete LM
• CBOW: minimize cross-entropy of the middle word of a sliding window
• skip-gram: minimize cross-entropy of a bag of words around a word (LM the other way round)

[Figure: CBOW predicts $w_3$ from the surrounding $w_1, w_2, w_4, w_5$; skip-gram predicts the surrounding $w_1, w_2, w_4, w_5$ from $w_3$.]

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.
Word2Vec: Sampling

1. [All] human beings are born free and equal in dignity …
   → (All, human) (All, beings)
2. All [human] beings are born free and equal in dignity …
   → (human, All) (human, beings) (human, are)
3. All human [beings] are born free and equal in dignity …
   → (beings, All) (beings, human) (beings, are) (beings, born)
4. All human beings [are] born free and equal in dignity …
   → (are, human) (are, beings) (are, born) (are, free)
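A short Python sketch of the sliding-window sampling above; the window size of 2 matches the example, the function name is illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as in the example above."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "All human beings are born free and equal in dignity".split()
print(skipgram_pairs(sentence)[:5])
```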
Word2Vec: Formulas

• Training objective:

$\dfrac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\, j \neq 0} \log p\left(w_{t+j} \mid w_t\right)$

• Probability estimation:

$p\left(w_O \mid w_I\right) = \dfrac{\exp\left({v'_{w_O}}^\top v_{w_I}\right)}{\sum_{w} \exp\left({v'_{w}}^\top v_{w_I}\right)}$

where $v$ is the input (embedding) matrix, $v'$ the output matrix.

Equations 1, 2 in: Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.
Word2Vec: Training using Negative Sampling

The summation in the denominator is slow, use noise contrastive estimation:

$\log \sigma\left({v'_{w_O}}^\top v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^\top v_{w_I}\right) \right]$

Main idea: classify independently by logistic regression the positive example and a few sampled negative examples.

Equations 1, 3 in: Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.
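A NumPy sketch of the objective above for one (center, context) pair, written as a loss to minimize (i.e. the negative of the expression); the argument names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_in, v_out_pos, v_out_negs):
    """v_in: input embedding of w_I; v_out_pos: output embedding of the true
    context word w_O; v_out_negs: output embeddings of k sampled noise words,
    shape [k, dim]."""
    positive = np.log(sigmoid(v_out_pos @ v_in))             # reward the true pair
    negative = np.log(sigmoid(-v_out_negs @ v_in)).sum()     # push noise words away
    return -(positive + negative)
```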
Word2Vec: Vector Arithmetics

[Figure: word embedding space where analogies correspond to vector offsets, e.g. man – woman, uncle – aunt, king – queen, and the singular–plural direction king – kings, queen – queens.]

Image originally from Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.
Few More Notes on Embeddings

• many methods for pre-trained word embeddings (most popular: GloVe)
• embeddings capturing character-level properties
• multilingual embeddings
Training models

FastText – Word2Vec model implementation by Facebook
https://github.com/facebookresearch/fastText

./fasttext skipgram -input data.txt -output model
Pre-training Representations ELMo
What is ELMo?

• pre-trained large language model
• "nothing special" – combines all known tricks, trained on extremely large data
• improves almost all NLP tasks
• published in June 2018

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N18-1202.
ELMo Architecture: Input

• input tokenized, treated on character level
• character embeddings of size 16
• 1D-convolutions to 2,048 dimensions + max-pooling over the token (∼ soft search for learned $n$-grams): window sizes 1–7 with 32, 32, 64, 128, 256, 512 and 1,024 filters respectively
• 2× highway layer (2,048 dimensions), containing gates that control how much information passes through:

$g_{i+1} = \sigma\left(h_i W_g + b_g\right)$
$h_{i+1} = \left(1 - g_{i+1}\right) \odot h_i + g_{i+1} \odot \mathrm{ReLU}\left(W h_i + b\right)$

• linear projection to 512 dimensions is needed
ELMo Architecture: Language Models

• token representations are input for 2 language models: forward and backward
• both LMs have 2 layers with 4,096 dimensions, with layer normalization and residual connections
• output classifier shared (only used in training, does not have to be good)

Learned layer combination for downstream tasks:

$\mathrm{ELMo}^{\text{task}} = \gamma^{\text{task}} \sum_{j} s_j^{\text{task}}\, h^{(j)}$

where $h^{(j)}$ is the $j$-th layer and $s^{\text{task}}$, $\gamma^{\text{task}}$ are trainable parameters.
Tasks where ELMo helps

Answer Span Selection: Find an answer to a question in unstructured text.
Named Entity Recognition: Detect and classify names of people, locations, organizations, numbers with units, email addresses, URLs, phone numbers …
Natural Language Inference: Decide whether two sentences are in agreement, contradict each other, or have nothing to do with each other.
Semantic Role Labeling: Detect who did what to whom in sentences.
Coreference Resolution: Detect what entities pronouns refer to.
Semantic Similarity: Measure how similar the meaning of two sentences is. (Think of clustering similar questions on StackOverflow or detecting plagiarism.)
Improvements by ELMo
How to use it

• implemented in the AllenNLP framework (uses PyTorch)
• pre-trained English models available

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = ...
weight_file = ...

elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)

https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md
Pre-training Representations BERT
What is BERT

• another way of pre-training sentence representations
• uses the Transformer architecture and a slightly different training objective
• even better than ELMo
• done by Google, published in November 2018

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints, October 2018.
Architecture Comparison
Masked Language Model

All human beings are born free and equal in dignity and rights

1. Randomly sample a word → free
2. With 80% chance replace it with the special MASK token.
3. With 10% chance replace it with a random token → hairy
4. With 10% chance keep it as is → free

Then a classifier should predict the missing/replaced word: free
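A small Python sketch of the corruption procedure above. The 15% sampling rate is an assumption (the slide only gives the 80/10/10 split), and sampling the random replacement from the same sentence is for illustration only; a real implementation would sample from the whole vocabulary.

```python
import random

def mask_tokens(tokens, mask_prob=0.15):
    """BERT-style input corruption sketch (probabilities as on the slide)."""
    targets = []
    corrupted = list(tokens)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:               # sample words to predict
            targets.append((i, token))
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = "[MASK]"               # 80%: replace with MASK
            elif roll < 0.9:
                corrupted[i] = random.choice(tokens)  # 10%: random replacement
            # remaining 10%: keep the token as is
    return corrupted, targets

sentence = "All human beings are born free and equal in dignity and rights".split()
print(mask_tokens(sentence))
```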