Deep Learning for Natural Language Processing

Deep Learning for Natural Language Processing - Jindřich Libovický

Deep Learning for Natural Language Processing. Jindřich Libovický. March 1, 2017. Introduction to Natural Language Processing, Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics.


  1. Vanilla RNN
h_t = tanh(W [h_{t−1}; x_t] + b)
• cannot propagate long-distance relations
• vanishing gradient problem
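A minimal NumPy sketch of the step above (not from the original slides; the dimensions and random inputs are only illustrative):

import numpy as np

def rnn_step(h_prev, x_t, W, b):
    # one vanilla RNN step: h_t = tanh(W [h_{t-1}; x_t] + b)
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)

hidden_dim, input_dim = 4, 3          # toy sizes, for illustration only
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a sequence of 5 random embeddings
    h = rnn_step(h, x_t, W, b)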

  2. Vanishing Gradient Problem (1)
[plots: tanh(x) and its derivative]
tanh x = (1 − e^{−2x}) / (1 + e^{−2x})
d tanh x / dx = 1 − tanh² x ∈ (0, 1]
• weights initialized ∼ N(0, 1) to have gradients further from zero

  3. Vanishing Gradient Problem (2)
For the loss F_{t+1} at time t+1:
∂F_{t+1} / ∂b = (∂F_{t+1} / ∂h_{t+1}) · (∂h_{t+1} / ∂b)    (chain rule)

  4. Vanishing Gradient Problem (3)
∂h_t / ∂b = ∂ tanh(W_h h_{t−1} + W_x x_t + b) / ∂b        (A_t = W_h h_{t−1} + W_x x_t + b is the activation, tanh′ the derivative of tanh)
          = tanh′(A_t) · ( W_h · ∂h_{t−1}/∂b + ∂(W_x x_t)/∂b + ∂b/∂b )
with tanh′(A_t) ∈ (0, 1],  W_h ∼ N(0, 1),  ∂(W_x x_t)/∂b = 0,  ∂b/∂b = 1.

  5. Long Short-Term Memory Networks
LSTM = Long short-term memory
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667.
Control the gradient flow by explicitly gating:
• what to use from input,
• what to use from hidden state,
• what to put on output

  6. LSTM: Hidden State
• two types of hidden states
• h_t – "public" hidden state, used as output
• c_t – "private" memory, no non-linearities on the way
• direct flow of gradients (without multiplying by ≤ 1 derivatives)

  7. LSTM: Forget Gate
f_t = σ(W_f [h_{t−1}; x_t] + b_f)
• based on the input and the previous state, decide what to forget from the memory

  8. LSTM: Input Gate
i_t = σ(W_i · [h_{t−1}; x_t] + b_i)
c̃_t = tanh(W_c · [h_{t−1}; x_t] + b_c)
• i_t – decide how much of the information we want to store
• c̃_t – candidate of what we may want to add to the memory

  9. LSTM: Cell State Update
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t

  10. LSTM: Output Gate
o_t = σ(W_o · [h_{t−1}; x_t] + b_o)
h_t = o_t ⊙ tanh c_t

  11. Here we are, LSTM!
f_t = σ(W_f [h_{t−1}; x_t] + b_f)
i_t = σ(W_i [h_{t−1}; x_t] + b_i)
o_t = σ(W_o [h_{t−1}; x_t] + b_o)
c̃_t = tanh(W_c [h_{t−1}; x_t] + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh c_t
Question: How would you implement it efficiently? Compute all gates in a single matrix multiplication.
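One possible answer to the question, as a NumPy sketch (not from the original slides): stack the parameters of all four gates into one matrix, do a single matrix multiplication, and split the result.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    # W stacks the parameters of all four gates: one matmul instead of four
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0*d:1*d])          # forget gate
    i = sigmoid(z[1*d:2*d])          # input gate
    o = sigmoid(z[2*d:3*d])          # output gate
    c_tilde = np.tanh(z[3*d:4*d])    # candidate memory
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

hidden_dim, input_dim = 4, 3         # toy sizes, for illustration only
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)
h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(h, c, x_t, W, b)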

  12. Gated Recurrent Units
update gate:            z_t = σ(x_t W_z + h_{t−1} U_z + b_z) ∈ (0, 1)
remember gate:          r_t = σ(x_t W_r + h_{t−1} U_r + b_r) ∈ (0, 1)
candidate hidden state: h̃_t = tanh(x_t W_h + (r_t ⊙ h_{t−1}) U_h) ∈ (−1, 1)
hidden state:           h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

  13. LSTM vs. GRU
• GRU is smaller and therefore faster
• performance similar, task dependent
• theoretical limitation: GRU accepts regular languages, LSTM can simulate a counter machine
Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ISSN 2331-8422.
Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P18-2117

  14. RNN in PyTorch
rnn = nn.LSTM(input_dim, hidden_size=512, num_layers=1,
              bidirectional=True, dropout=0.8)
output, (hidden, cell) = rnn(x)
https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM

  15. RNN in TensorFlow
inputs = ...   # float tf.Tensor of shape [batch, length, dim]
lengths = ...  # int tf.Tensor of shape [batch]

# Cell objects are templates
fw_cell = tf.nn.rnn_cell.LSTMCell(512, name="fw_cell")
bw_cell = tf.nn.rnn_cell.LSTMCell(512, name="bw_cell")

outputs, states = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, inputs,
    sequence_length=lengths, dtype=tf.float32)
https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn

  16. Bidirectional Networks
Image from: http://colah.github.io/posts/2015-09-NN-Types-FP/
• simple trick to improve performance
• run one RNN forward, the second one backward, and concatenate their outputs
• state of the art in tagging, crucial for neural machine translation

  17. Representing Sequences Convolutional Networks

  18. 1-D Convolution
≈ sliding window over the sequence
h_i = f(W [x_{i−1}; x_i; x_{i+1}] + b),   e.g.  h_1 = f(W [x_0; x_1; x_2] + b)
embeddings x = (x_1, …, x_N), padded with zero vectors (x_0 = 0⃗) if we want to keep the sequence length

  19. 1-D Convolution: Pseudocode
xs = ...              # input sequence, shape [batch, length, dim]
kernel_size = 3       # window size
filters = 300         # output dimension
stride = 1            # step size

W = trained_parameter(xs.shape[2] * kernel_size, filters)
b = trained_parameter(filters)

window = kernel_size // 2
outputs = []
for i in range(window, xs.shape[1] - window, stride):
    x_window = xs[:, i - window:i + window + 1].reshape(xs.shape[0], -1)
    outputs.append(np.matmul(x_window, W) + b)
return np.stack(outputs, axis=1)

  20. 1-D Convolution: Frameworks
TensorFlow
h = tf.layers.conv1d(x, filters=300, kernel_size=3,
                     strides=1, padding='same')
https://www.tensorflow.org/api_docs/python/tf/layers/conv1d
PyTorch
conv = nn.Conv1d(in_channels, out_channels=300, kernel_size=3,
                 stride=1, padding=0, dilation=1, groups=1, bias=True)
h = conv(x)
https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d

  21. Rectified Linear Units
ReLU(x) = max(0, x)
[plots: ReLU and the derivative of ReLU]
• faster, suffers less from the vanishing gradient
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

  22. Residual Connections
h_i = f(W [x_{i−1}; x_i; x_{i+1}] + b) + x_i
embeddings x = (x_1, …, x_N), zero-padded
Better gradient flow – the same as in RNNs. Why do you think it helps? Allows training deeper networks.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. IEEE Computer Society.

  23. Residual Connections: Numerical Stability
Numerically unstable, we need the activations to be on a similar scale ⇒ layer normalization.
The activation before the non-linearity is normalized:
a'_i = g_i (a_i − μ) / σ
where  μ = (1/H) Σ_{i=1}^{H} a_i,   σ = sqrt( (1/H) Σ_{i=1}^{H} (a_i − μ)² )
g is a trainable parameter, μ and σ are estimated from the data.
Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. ISSN 2331-8422.
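A NumPy sketch of the normalization above (not from the original slides). The slide's formula only rescales with the gain g; the common implementation also adds a trainable bias, included here as an assumption.

import numpy as np

def layer_norm(a, g, b, eps=1e-6):
    # normalize the pre-activation vector a, then rescale with trainable g (and bias b)
    mu = a.mean()
    sigma = np.sqrt(((a - mu) ** 2).mean() + eps)
    return g * (a - mu) / sigma + b

a = np.array([1.0, 2.0, 4.0, 8.0])
g = np.ones_like(a)      # trainable gain, initialized to 1
b = np.zeros_like(a)     # trainable bias, initialized to 0
print(layer_norm(a, g, b))   # zero mean, unit variance (up to eps)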

  24. Receptive Field
[diagram: receptive field of stacked convolutions over the zero-padded embeddings x = (x_1, …, x_N)]
• can be enlarged by dilated convolutions

  25. Convolutional architectures
+ extremely computationally efficient
− limited context
− by default not aware of n-gram order
• max-pooling over the hidden states = element-wise maximum over the sequence
• can be understood as an ∃ operator over the feature extractors

  26. Representing Sequences Self-attentive Networks

  27. Self-attentive Networks
• in some layers: states are a linear combination of the previous layer's states
• originally in the Transformer model for machine translation
• similarity matrix between all pairs of states
• next layer: sum by rows
• O(n²) memory, O(1) time (when parallelized)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc.

  28. Multi-headed scaled dot-product attention
Single-head setup:
h_i^{(l+1)} = Σ_j softmax_j( h_i^{(l)} h_j^{(l)⊤} / √d ) h_j^{(l)},   i.e.   Attn(Q, K, V) = softmax( Q K⊤ / √d ) V
Multi-head setup:
Multihead(Q, V) = (H_1 ⊕ ⋯ ⊕ H_h) W_O,   where   H_i = Attn(Q W_i^Q, V W_i^K, V W_i^V)
[diagram: queries and keys & values are linearly projected, split into heads, each head runs scaled dot-product attention, the results are concatenated]

  29. Dot-Product Attention in PyTorch
def attention(query, key, value, mask=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn
https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM

  30. Dot-Product Attention in TensorFlow
def scaled_dot_product(self, queries, keys, values):
    dim = tf.cast(tf.shape(keys)[-1], tf.float32)
    o1 = tf.matmul(queries, keys, transpose_b=True)
    o2 = o1 / (dim ** 0.5)
    o3 = tf.nn.softmax(o2)
    return tf.matmul(o3, values)
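A PyTorch sketch of the multi-head wrapper around the scaled dot-product attention above (not from the original slides; the head count and dimensions are illustrative, and keys and values are both projected from the same tensor, as in the slide's formulation):

import math
import torch
import torch.nn.functional as F

def multihead_attention(queries, values, w_q, w_k, w_v, w_o, n_heads):
    # queries: [batch, len_q, d_model], values: [batch, len_v, d_model]
    batch, len_q, d_model = queries.shape
    d_head = d_model // n_heads

    def split_heads(x):
        return x.view(batch, -1, n_heads, d_head).transpose(1, 2)

    q = split_heads(queries @ w_q)            # [batch, heads, len_q, d_head]
    k = split_heads(values @ w_k)
    v = split_heads(values @ w_v)

    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    context = F.softmax(scores, dim=-1) @ v   # [batch, heads, len_q, d_head]
    concat = context.transpose(1, 2).reshape(batch, len_q, d_model)
    return concat @ w_o

d_model, n_heads = 16, 4                      # toy sizes, for illustration only
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) for _ in range(4))
x = torch.randn(2, 5, d_model)                # batch of 2 sequences of length 5
out = multihead_attention(x, x, w_q, w_k, w_v, w_o, n_heads)   # self-attention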

  31. Position Encoding
The model cannot be aware of the position in the sequence, therefore positions are encoded explicitly:
pos(t)_i = sin( t / 10000^{i/d} )        if i mod 2 = 0
pos(t)_i = cos( t / 10000^{(i−1)/d} )    otherwise
[plot: the encoding over text length (0–100) and dimension (0–300), values between −0.5 and 1.0]
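A NumPy sketch of the sinusoidal encoding above (not from the original slides; assumes an even model dimension):

import numpy as np

def position_encoding(max_len, dim, base=10000.0):
    # pos(t)_i = sin(t / base^(i/dim)) for even i, cos(t / base^((i-1)/dim)) for odd i
    pe = np.zeros((max_len, dim))
    t = np.arange(max_len)[:, None]
    i = np.arange(0, dim, 2)[None, :]
    angle = t / base ** (i / dim)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = position_encoding(max_len=300, dim=100)
# the encoding is simply added to the input embeddings:
# x = embeddings + pe[:embeddings.shape[0]]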

  32. Stacking self-attentive Layers
• several layers (6 in the original paper)
• each layer has 2 sub-layers: a self-attentive sublayer and a feed-forward sublayer
• everything inter-connected with residual connections, layer normalization after each sub-layer
[diagram: input embeddings ⊕ position encoding → N× ( multihead self-attention over queries, keys & values, ⊕ residual, layer normalization; feed-forward sublayer (linear layer, non-linear layer), ⊕ residual, layer normalization )]

  33. Architectures Comparison

                  memory      sequential operations   computation
Recurrent         O(n · d)    O(n)                    O(n · d²)
Convolutional     O(n · d)    O(1)                    O(k · n · d²)
Self-attentive    O(n² · d)   O(1)                    O(n² · d)

d = model dimension, n = sequence length, k = convolutional kernel size

  34. Classification and Labeling

  35. Classification and Labeling
Neural Networks Basics
Representing Words
Representing Sequences: Recurrent Networks, Convolutional Networks, Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations: Word2Vec, ELMo, BERT

  36. Sequence Classification
• tasks like sentiment analysis, genre classification
• need to get one vector from the sequence → average or max pooling
• optionally hidden layers, at the end a softmax for the probability distribution over classes

  37. Softmax & Cross-Entropy
Output layer with softmax (with parameters W, b):
P_y = softmax(x) = P(y = k | x) = exp(x⊤W_k + b_k) / Σ_j exp(x⊤W_j + b_j)
Network error = cross-entropy between the estimated distribution and the one-hot ground-truth distribution T = 1(y*):
L(P_y, y*) = H(P, T) = − Σ_j T(j) log P(j) = − E_{j∼T} log P(j) = − log P(y*)
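A NumPy sketch of the output layer above (not from the original slides); the last line anticipates the gradient derived on the next slide:

import numpy as np

def softmax(logits):
    logits = logits - logits.max()     # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def cross_entropy(p, gold_class):
    # cross-entropy against a one-hot target = negative log-likelihood of the gold class
    return -np.log(p[gold_class])

logits = np.array([2.0, 0.5, -1.0])    # x^T W + b for three classes
p = softmax(logits)
loss = cross_entropy(p, gold_class=0)
grad = p.copy()
grad[0] -= 1.0                         # dL/dlogits = P_y - 1_{y*} (cf. the next slide)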

  38. Derivative of Cross-Entropy
Let l = x⊤W + b be the logits, l_{y*} the one corresponding to the correct class.
− ∂L(P_y, y*) / ∂l = − ∂/∂l ( − log ( exp l_{y*} / Σ_k exp l_k ) )
                   = ∂/∂l ( l_{y*} − log Σ_k exp l_k )
                   = 1_{y*} − exp l / Σ_k exp l_k
                   = 1_{y*} − P_y
Interpretation: reinforce the correct logit, suppress the rest.

  39. Sequence Labeling
• assign a value / probability distribution to every token in a sequence
• morphological tagging, named-entity recognition, LM with unlimited history, answer span selection
• every state is classified independently with a classifier
• during training, errors are backpropagated from all classifiers
Lab next time: i/y spelling as sequence labeling

  40. Generating Sequences

  41. Sequence-to-sequence Learning
• target sequence is of a different length than the source
• non-trivial (= not monotonic) correspondence between source and target
• tasks like: machine translation, text summarization, image captioning

  42. Neural Language Model
• estimate the probability of a sentence using the chain rule
• output distributions can be used for sampling
[diagram: input symbols <s>, w_1, w_2, … as one-hot vectors go through an embedding lookup into an RNN cell (possibly more layers); each RNN state h_t is passed through a softmax that normalizes the distribution P(w_{t+1} | …) for the next symbol]

  43. Sampling from a LM
• when conditioned on an input → autoregressive decoder
[diagram: at each step the softmax output P(w_t | …) is passed through argmax, the chosen symbol is embedded and fed as the next RNN input]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Canada, December 2014. Curran Associates, Inc.

  44. Autoregressive Decoding: Pseudocode
state = initial_state        # e.g., the final encoder state
last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state, dec_output = dec_cell(state, last_w_embedding)
    logits = output_projection(dec_output)
    last_w = vocabulary[np.argmax(logits)]   # map the argmax index back to a token
    yield last_w

  45. Architectures in the Decoder
• RNN – original sequence-to-sequence learning (2015)
• principle known since 2014 (University of Montreal)
• made usable in 2016 (University of Edinburgh)
• CNN – convolutional sequence-to-sequence by Facebook (2017)
• self-attention (the so-called Transformer) by Google (2017)
More on the topic in the MT class.

  46. Implementation: Runtime vs. training
[diagram: at training time the decoder inputs are the ground-truth symbols y_k and the loss is computed against them; at runtime the decoder inputs are its own previously decoded symbols ŷ_k]

  47. Attention Model
[diagram: encoder states h_0 … h_4 over inputs x_1 … x_4 are weighted by attention weights α_0 … α_4 computed against the previous decoder state s_{i−1}; their weighted sum enters the decoder states s_i, s_{i+1} that produce the outputs ~y_i, ~y_{i+1}]

  48. Attention Model in Equations (1)
Inputs: encoder states h_j = [→h_j; ←h_j] (forward and backward), decoder state s_i
Attention energies:      e_{ij} = v_a⊤ tanh(W_a s_{i−1} + U_a h_j + b_a)    ∀j = 1 … T_x
Attention distribution:  α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik})
Context vector:          c_i = Σ_{j=1}^{T_x} α_{ij} h_j
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. ISSN 2331-8422.
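A NumPy sketch of one attention step following the equations above (not from the original slides; all dimensions are illustrative):

import numpy as np

def bahdanau_attention(s_prev, H, W_a, U_a, v_a, b_a):
    # s_prev: previous decoder state [d_dec]; H: encoder states [T_x, d_enc]
    energies = np.tanh(s_prev @ W_a + H @ U_a + b_a) @ v_a   # e_ij for all j
    e = np.exp(energies - energies.max())
    alpha = e / e.sum()                                      # attention distribution
    context = alpha @ H                                      # weighted sum of encoder states
    return context, alpha

d_enc, d_dec, d_att, T_x = 6, 4, 5, 7                        # toy sizes
rng = np.random.default_rng(0)
W_a = rng.normal(size=(d_dec, d_att))
U_a = rng.normal(size=(d_enc, d_att))
v_a = rng.normal(size=d_att)
b_a = np.zeros(d_att)
context, alpha = bahdanau_attention(
    rng.normal(size=d_dec), rng.normal(size=(T_x, d_enc)), W_a, U_a, v_a, b_a)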

  49. Attention Model in Equations (2)
Output projection (the attention is mixed with the hidden state):
t_i = MLP(U_o s_{i−1} + V_o E y_{i−1} + C_o c_i + b_o)
Output distribution:
p(y_i = k | s_i, y_{i−1}, c_i) ∝ exp( (W_o t_i)_k + b_k )

  50. Transformer Decoder
• similar to the encoder, with an additional sub-layer attending to the encoder (queries from the decoder, keys & values from the encoder)
• in every step self-attention over the complete history ⇒ O(n²) complexity
[diagram: input embeddings ⊕ position encoding → N× ( self-attentive sublayer with multihead attention, ⊕ residual, layer normalization; cross-attention sublayer with multihead attention over encoder keys & values, ⊕ residual, layer normalization; feed-forward sublayer (linear layer, non-linear layer), ⊕ residual, layer normalization ) → linear → softmax → output symbol probabilities]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc.

  51. Transformer Decoder: Non-autoregressive training
• analogical to the encoder
• the target is known at training time: we don't need to wait until it is generated
• self-attention can be parallelized via matrix multiplication
• prevent attending to the future using a mask matrix with −∞ in the masked cells of the query–value similarities before the softmax
Question 1: What if the matrix was diagonal?
Question 2: How would such a matrix look like for the convolutional architecture?
[diagram: queries q_1 … q_N against values w_1 … w_N; positions after the current query are masked with −∞]
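A small PyTorch sketch of the future-masking trick (not from the original slides; shapes are illustrative):

import torch
import torch.nn.functional as F

def causal_mask(length):
    # lower-triangular mask: position i may attend only to positions <= i
    return torch.tril(torch.ones(length, length)).bool()

scores = torch.randn(1, 5, 5)                        # query x value similarities
masked = scores.masked_fill(~causal_mask(5), float("-inf"))
weights = F.softmax(masked, dim=-1)                  # zero probability for future positions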

  52. Pre-training Representations

  53. Pre-training Representations
Neural Networks Basics
Representing Words
Representing Sequences: Recurrent Networks, Convolutional Networks, Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations: Word2Vec, ELMo, BERT

  54. Pre-trained Representations
• representations that emerge in the models seem to carry a lot of information about the language
• representations pre-trained on large data can be re-used on tasks with smaller training data

  55. Pre-training Representations Word2Vec

  56. Word2Vec
• a way to learn word embeddings without training the complete LM
• CBOW: minimize cross-entropy of the middle word of a sliding window
• skip-gram: minimize cross-entropy of a bag of words around a word (LM the other way round)
[diagram: CBOW sums the context words to predict the middle word; skip-gram predicts the context words from the middle word]
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

  57. Word2Vec: sampling
Sliding a window over "All human beings are born free and equal in dignity …":
1. [All] human beings are born free … → (All, human), (All, beings)
2. All [human] beings are born free … → (human, All), (human, beings), (human, are)
3. All human [beings] are born free … → (beings, All), (beings, human), (beings, are), (beings, born)
4. All human beings [are] born free and … → (are, human), (are, beings), (are, born), (are, free)
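A short Python sketch that generates such (center word, context word) pairs (not from the original slides; the window size of 2 matches the example above):

def skipgram_pairs(tokens, window=2):
    # generate (center, context) training pairs from a token list
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "All human beings are born free and equal in dignity".split()
pairs = list(skipgram_pairs(sentence))
# [('All', 'human'), ('All', 'beings'), ('human', 'All'), ('human', 'beings'), ...]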

  58. Word2Vec: Formulas
• Training objective:
  (1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)
• Probability estimation:
  p(w_O | w_I) = exp(W′_{w_O}⊤ W_{w_I}) / Σ_w exp(W′_w⊤ W_{w_I})
where W is the input (embedding) matrix and W′ the output matrix.
Equations 1, 2 in Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

  59. Word2Vec: Training using Negative Sampling
The summation in the denominator is slow, use noise contrastive estimation:
log σ(W′_{w_O}⊤ W_{w_I}) + Σ_{i=1}^{k} E_{w_i ∼ P_n(w)} [ log σ(−W′_{w_i}⊤ W_{w_I}) ]
Main idea: classify independently by logistic regression the positive example and a few sampled negative examples.
Equations 1, 3 in Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.
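A NumPy sketch of the negative-sampling objective for a single training pair (not from the original slides; the noise samples stand in for k words drawn from P_n(w)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(v_in, v_out_pos, v_out_negs):
    # v_in: embedding of the input word (row of W)
    # v_out_pos: output vector of the observed context word (row of W')
    # v_out_negs: output vectors of k sampled noise words
    positive = np.log(sigmoid(v_out_pos @ v_in))
    negative = np.log(sigmoid(-(v_out_negs @ v_in))).sum()
    return -(positive + negative)          # minimized during training

rng = np.random.default_rng(0)
dim, k = 8, 5
loss = negative_sampling_loss(rng.normal(size=dim),
                              rng.normal(size=dim),
                              rng.normal(size=(k, dim)))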

  60. Word2Vec: Vector Arithmetics
[image: vector differences such as man − woman, uncle − aunt, king − queen, kings − queens pointing in a similar direction]
Image originally from Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

  61. Few More Notes on Embeddings
• many methods for pre-trained word embeddings (most popular: GloVe)
• embeddings capturing character-level properties
• multilingual embeddings

  62. Training models
FastText – Word2Vec model implementation by Facebook
https://github.com/facebookresearch/fastText
./fasttext skipgram -input data.txt -output model

  63. Pre-training Representations ELMo

  64. What is ELMo?
• pre-trained large language model
• "nothing special" – combines all known tricks, trained on extremely large data
• improves almost all NLP tasks
• published in June 2018
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N18-1202

  65. ELMo Architecture: Input
• input is tokenized, but treated on the character level
• character embeddings of size 16
• 1D-convolutions with 2,048 n-gram filters in total (widths 1–7: 32, 32, 64, 128, 256, 512, and 1,024 filters) + max-pooling over the token (∼ soft search for learned n-grams)
• 2× highway layer (2,048 dimensions); the layers contain gates that control the information flow:
  g_{l+1} = σ(W_g h_l + b_g)
  h_{l+1} = (1 − g_{l+1}) ⊙ h_l + g_{l+1} ⊙ ReLU(W h_l + b)
• a linear projection to 512 dimensions is needed
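A NumPy sketch of one highway layer as defined above (not from the original slides; the toy dimension is much smaller than ELMo's 2,048):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(h, W_g, b_g, W, b):
    # g = sigma(W_g h + b_g);  output = (1 - g) * h + g * ReLU(W h + b)
    g = sigmoid(W_g @ h + b_g)
    transformed = np.maximum(0.0, W @ h + b)    # ReLU
    return (1.0 - g) * h + g * transformed

dim = 8                                         # ELMo uses 2,048; a small value keeps the toy example light
rng = np.random.default_rng(0)
h = rng.normal(size=dim)
out = highway_layer(h,
                    rng.normal(size=(dim, dim)) * 0.1, np.zeros(dim),
                    rng.normal(size=(dim, dim)) * 0.1, np.zeros(dim))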

  66. ELMo Architecture: Language Models
• the token representations are the input for 2 language models: forward and backward
• both LMs have 2 layers with 4,096 dimensions, with layer normalization and residual connections
• the output classifier is shared (only used in training, does not have to be good)
Learned layer combination for downstream tasks:
ELMo^task = γ^task Σ_l s_l^task h^(l)
where γ^task and the layer weights s_l^task are trainable parameters.
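A NumPy sketch of the layer combination above (not from the original slides). It assumes, as in the ELMo paper, that the layer weights s_l^task are softmax-normalized; the dimensions are illustrative.

import numpy as np

def elmo_combination(layer_states, s_task, gamma_task):
    # layer_states: list of [seq_len, dim] arrays, one per LM layer
    weights = np.exp(s_task) / np.exp(s_task).sum()   # softmax-normalized layer weights
    return gamma_task * sum(w * h for w, h in zip(weights, layer_states))

layers = [np.random.randn(6, 512) for _ in range(3)]  # token layer + 2 LM layers (toy sizes)
s_task = np.zeros(3)                                  # trainable, here at initialization
gamma_task = 1.0                                      # trainable scalar
elmo_vectors = elmo_combination(layers, s_task, gamma_task)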

  67. Tasks where ELMo helps
• Named Entity Recognition – detect and classify names of people, locations, organizations, numbers with units, email addresses, URLs, phone numbers …
• Coreference Resolution – detect what entities pronouns refer to.
• Semantic Similarity – measure how similar the meaning of two sentences is. (Think of clustering similar questions on StackOverflow or detecting plagiarism.)
• Natural Language Inference – decide whether two sentences are in agreement, contradict each other, or have nothing to do with each other.
• Semantic Role Labeling – detect who did what to whom in sentences.
• Answer Span Selection – find an answer to a question in unstructured text.

  68. Improvements by ELMo

  69. How to use it
• implemented in the AllenNLP framework (uses PyTorch)
• pre-trained English models available

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = ...
weight_file = ...
elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)

https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md

  70. Pre-training Representations BERT

  71. What is BERT
• another way of pre-training sentence representations
• uses the Transformer architecture and a slightly different training objective
• even better than ELMo
• done by Google, published in November 2018
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints, October 2018.

  72. Architecture Comparison

  73. Masked Language Model
All human beings are born free and equal in dignity and rights
1. Randomly sample a word → free
2. With 80% chance, replace it with the special MASK token.
3. With 10% chance, replace it with a random token → hairy
4. With 10% chance, keep it as is → free
Then a classifier should predict the missing/replaced word: free
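A Python sketch of the masking recipe above (not from the original slides). The 15% sampling rate comes from the BERT paper, not from the slide, and the vocabulary here is a tiny stand-in.

import random

MASK = "[MASK]"

def mask_tokens(tokens, vocabulary, sample_prob=0.15):
    # return corrupted tokens and the positions the classifier must predict
    corrupted, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if random.random() < sample_prob:      # 1. randomly sample a word
            targets[i] = token                 # the classifier must predict the original
            r = random.random()
            if r < 0.8:                        # 2. 80%: replace with the MASK token
                corrupted[i] = MASK
            elif r < 0.9:                      # 3. 10%: replace with a random token
                corrupted[i] = random.choice(vocabulary)
            # 4. remaining 10%: keep as is
    return corrupted, targets

sentence = "All human beings are born free and equal in dignity and rights".split()
corrupted, targets = mask_tokens(sentence, vocabulary=["hairy", "cat", "blue"])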
