Machine Learning
Lecture 11: Transformer and BERT
Nevin L. Zhang
lzhang@cse.ust.hk
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
This set of notes is based on internet resources and references listed at the end.
Outline
1 Transformer
  Self-Attention Layer
  The Encoder
  The Decoder
2 BERT
  Overview
  Pre-training BERT
  Fine-tuning BERT
Two Seq2Seq Models
An RNN is sequential, which precludes parallelization within training examples:
http://jalammar.github.io/images/seq2seq_6.mp4
The Transformer allows significantly more parallelization:
http://jalammar.github.io/images/t/transformer_decoding_1.gif
It is based solely on attention, dispensing with recurrence. It requires less time to train and achieves better results than the RNN-based Seq2Seq model.
Self-Attention Layer
The inputs to the encoder of the Transformer are the embedding vectors of the tokens in the input sequence.
The self-attention layer processes the embedding vectors in parallel to obtain new representations of the tokens.
Self-Attention Layer
The purpose of the self-attention layer is to “improve” the representation of each token by combining information from other tokens.
Consider the sentence: The animal didn’t cross the street because it was too tired.
What does “it” refer to? The street or the animal?
Self-attention is able to associate “it” with “animal” when obtaining a new representation for “it”.
Self-Attention Layer
Let x_1, ..., x_n be the current representations of the input tokens. They are all row vectors of dimension d_m (= 512).
Consider obtaining a new representation z_i for token i. We need to decide:
How much attention to pay to each x_j?
How to combine the x_j's into z_i?
Self-Attention Layer
Moreover, we want the model to learn how to answer those questions from data. So, we introduce three matrices of learnable parameters (aka projection matrices):
W^Q: a d_m × d_k matrix (d_k = 64)
W^K: a d_m × d_k matrix
W^V: a d_m × d_v matrix (d_v = 64)
Self-Attention Layer
Using the three matrices of parameters, we compute z_i as follows:
Project x_i and x_j to get query, key, and value vectors:
  q_i = x_i W^Q: query vector of dimension d_k.
  k_j = x_j W^K: key vector of dimension d_k.
  v_j = x_j W^V: value vector of dimension d_v.
Compute attention weights:
  Dot-product attention: α_{i,j} ← q_i k_j^⊤.
  Scaled dot-product attention: α_{i,j} ← α_{i,j} / √d_k.
  Apply softmax: α_{i,j} ← e^{α_{i,j}} / Σ_j e^{α_{i,j}}.
Obtain z_i (a vector of dimension d_v) by:
  z_i = Σ_j α_{i,j} v_j
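To make the steps above concrete, here is a minimal NumPy sketch (not from the slides) of scaled dot-product self-attention for a single token i. The matrices W_Q, W_K, W_V are random placeholders for learned parameters, and the sequence length n is an illustrative assumption; the dimensions follow the slide (d_m = 512, d_k = d_v = 64).

    import numpy as np

    def softmax(x):
        # numerically stable softmax along the last axis
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    d_m, d_k, d_v, n = 512, 64, 64, 10           # model/key/value dims, sequence length
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, d_m))                # rows are x_1, ..., x_n
    W_Q = 0.01 * rng.normal(size=(d_m, d_k))     # placeholder for learned W^Q
    W_K = 0.01 * rng.normal(size=(d_m, d_k))     # placeholder for learned W^K
    W_V = 0.01 * rng.normal(size=(d_m, d_v))     # placeholder for learned W^V

    i = 0                                        # compute z_i for the first token
    q_i = X[i] @ W_Q                             # query of dimension d_k
    K = X @ W_K                                  # keys k_j for all tokens, shape (n, d_k)
    V = X @ W_V                                  # values v_j for all tokens, shape (n, d_v)
    alpha = softmax(q_i @ K.T / np.sqrt(d_k))    # scaled dot-product weights, shape (n,)
    z_i = alpha @ V                              # weighted sum of values, dimension d_v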
Self-Attention Layer: Example
[Figure: a worked example of the self-attention computation.]
Self-Attention Layer: Matrix Notation
Let X be the matrix with the x_j's as row vectors. Then
Q = XW^Q, K = XW^K, V = XW^V.
Let Z be the matrix with the z_i's as row vectors. Then
Z = softmax(QK^⊤ / √d_k) V.
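The same computation in matrix form, continuing the sketch above (it reuses X, W_Q, W_K, W_V and softmax), produces all the z_i's at once:

    Q = X @ W_Q                                  # (n, d_k)
    K = X @ W_K                                  # (n, d_k)
    V = X @ W_V                                  # (n, d_v)
    Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V      # (n, d_v); row i of Z is z_i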
Multi-Head Attention
So far, we have been talking about obtaining one new representation z_i for token i. It combines information from the x_j's in one way.
Multi-Head Attention
It is sometimes useful to consider multiple ways to combine information from the x_j's, i.e., multiple attentions. To do so, we introduce multiple sets of projection matrices W_i^Q, W_i^K, W_i^V (i = 1, ..., h), each of which is called an attention head.
Multi-Head Attention
For each head i, let Q_i = XW_i^Q, K_i = XW_i^K, V_i = XW_i^V, and we get the attention output
Z_i = softmax(Q_i K_i^⊤ / √d_k) V_i.
Then we concatenate those matrices to get the overall output Z = Concat(Z_1, ..., Z_h) with h·d_v columns.
To ensure the new embedding of each token is also of dimension d_m, we introduce another h·d_v × d_m matrix W^O of learnable parameters and project: Z ← ZW^O.
In the Transformer, d_m = 512, h = 8, d_v = 64.
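A sketch of multi-head attention, continuing the NumPy example above; the per-head projection matrices and W_O are again random placeholders for learned parameters. It is wrapped in a function so that later sketches can reuse it (the function name is ours, not from the slides).

    h = 8                                            # number of attention heads
    heads = [(0.01 * rng.normal(size=(d_m, d_k)),    # W_i^Q
              0.01 * rng.normal(size=(d_m, d_k)),    # W_i^K
              0.01 * rng.normal(size=(d_m, d_v)))    # W_i^V
             for _ in range(h)]
    W_O = 0.01 * rng.normal(size=(h * d_v, d_m))     # output projection W^O

    def multi_head_self_attention(X):
        Z_heads = []
        for W_Qi, W_Ki, W_Vi in heads:
            Qi, Ki, Vi = X @ W_Qi, X @ W_Ki, X @ W_Vi
            Z_heads.append(softmax(Qi @ Ki.T / np.sqrt(d_k)) @ Vi)   # (n, d_v) per head
        Z = np.concatenate(Z_heads, axis=-1)         # concat: (n, h*d_v)
        return Z @ W_O                               # project back to (n, d_m)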
Self-Attention Layer: Summary
[Figure summarizing the multi-head self-attention computation.]
Encoder Block
Each output vector of the self-attention layer is fed to a feedforward network (FNN). The FNNs at different positions share parameters and operate independently.
The self-attention layer and the FNN layer make up one encoder layer (aka encoder block). The self-attention layer and the FNN layer are hence called sub-layers.
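A minimal sketch of the position-wise FNN sub-layer, continuing the example above: the same weights are applied to every row (position) of the input. The inner dimension d_ff = 2048 is the value used in the original Transformer; the random weights are placeholders.

    d_ff = 2048                                      # inner dimension of the FNN
    W_1, b_1 = 0.01 * rng.normal(size=(d_m, d_ff)), np.zeros(d_ff)
    W_2, b_2 = 0.01 * rng.normal(size=(d_ff, d_m)), np.zeros(d_m)

    def ffn(Z):
        # one hidden layer with ReLU; applied independently to each position (row)
        return np.maximum(0.0, Z @ W_1 + b_1) @ W_2 + b_2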
Residual Connection
A residual connection is added around each sub-layer, followed by layer normalization. This enables us to train deep models with many layers.
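A sketch of how a sub-layer is wrapped with a residual connection and layer normalization, continuing the example (the learnable gain and bias of layer normalization are omitted for brevity):

    def layer_norm(x, eps=1e-6):
        # normalize each row to zero mean and unit variance (gain/bias omitted)
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sigma + eps)

    def sublayer(x, f):
        # residual connection around sub-layer f, followed by layer normalization
        return layer_norm(x + f(x))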
The Encoder
The encoder is composed of a stack of N = 6 encoder blocks with the same structure but different parameters (i.e., no weight sharing across different encoder blocks).
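Putting the pieces together, a single encoder block and the full encoder stack might look like the sketch below, reusing multi_head_self_attention, ffn, and sublayer from the earlier sketches; the function names encoder_block and encode are ours. In the real model every block has its own parameters, whereas here all blocks share the placeholder weights, so this only illustrates the data flow.

    def encoder_block(X):
        X = sublayer(X, multi_head_self_attention)   # self-attention sub-layer
        X = sublayer(X, ffn)                         # position-wise FNN sub-layer
        return X

    def encode(X, N=6):
        for _ in range(N):                           # stack of N encoder blocks
            X = encoder_block(X)
        return X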
Positional Encoding
The Transformer contains no recurrence, and hence needs to inject information about token positions in order to make use of the order of the sequence. Positional encoding is therefore introduced.
For a token at position pos, its PE is a vector of d_m dimensions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_m))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_m))
where 0 ≤ i ≤ 255.
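A sketch of the positional encoding formula in NumPy; the last line adds the result to the token embeddings X from the earlier sketches.

    def positional_encoding(max_len, d_model):
        PE = np.zeros((max_len, d_model))
        pos = np.arange(max_len)[:, None]            # positions 0, 1, ..., max_len-1
        i = np.arange(d_model // 2)[None, :]         # 0 <= i < d_model / 2
        angle = pos / (10000 ** (2 * i / d_model))
        PE[:, 0::2] = np.sin(angle)                  # even dimensions: PE(pos, 2i)
        PE[:, 1::2] = np.cos(angle)                  # odd dimensions:  PE(pos, 2i+1)
        return PE

    X = X + positional_encoding(n, d_m)              # added to the input embeddings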
Positional Encoding
Positional encodings are added to the input embeddings at the bottom of the encoder stack. The complete structure of the encoder is shown below.
[Figure: the complete structure of the encoder.]
The Decoder
The decoder generates one word at a time. At a given step, its input consists of all the words generated so far. Positional encodings are added to the representations of those words, and the results are fed to the decoder.
The Decoder
The decoder has a stack of N = 6 decoder blocks with identical structure. A decoder block is the same as an encoder block, except that it has an additional decoder-encoder attention layer between the self-attention and FNN sub-layers.
Decoder Block
The decoder-encoder attention layer performs multi-head attention where
the queries Q come from the previous decoder layer, and
the keys K and values V come from the output of the encoder:
Z_i = softmax(Q_i K_i^⊤ / √d_k) V_i,
where Q_i = X_decoder W_i^Q, K_i = X_encoder W_i^K, V_i = X_encoder W_i^V.
This is similar to the attention mechanism in the RNN Seq2Seq model. (Illustration with N = 2.)
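A single-head sketch of decoder-encoder attention, continuing the NumPy example: the queries come from the decoder-side representations, while the keys and values come from the encoder output. X_enc and X_dec are illustrative stand-ins for the encoder output and the representations of the words generated so far.

    X_enc = encode(X)                                # encoder output, shape (n, d_m)
    X_dec = rng.normal(size=(5, d_m))                # e.g., 5 target-side tokens so far

    Q = X_dec @ W_Q                                  # queries from the decoder
    K = X_enc @ W_K                                  # keys from the encoder output
    V = X_enc @ W_V                                  # values from the encoder output
    Z_cross = softmax(Q @ K.T / np.sqrt(d_k)) @ V    # shape (5, d_v)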
Decoder Block
The self-attention layer in the decoder differs from that in the encoder in one important way: each position can only attend to the preceding positions and itself, because there are no inputs from future positions.
Implementation: For j > i, set α_{i,j} = −∞ before the softmax. Applying the softmax α_{i,j} ← e^{α_{i,j}} / Σ_j e^{α_{i,j}} then gives α_{i,j} = 0 for all j > i.
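The masking can be implemented by setting the scores for future positions to −∞ before the softmax, as in this continuation of the sketch (single head, operating on the decoder-side representations X_dec):

    m = X_dec.shape[0]
    scores = (X_dec @ W_Q) @ (X_dec @ W_K).T / np.sqrt(d_k)   # (m, m) raw scores
    mask = np.triu(np.ones((m, m), dtype=bool), k=1)          # True where j > i
    scores[mask] = -np.inf                                    # block attention to future positions
    alpha = softmax(scores)                                   # now alpha[i, j] = 0 for all j > i
    Z_masked = alpha @ (X_dec @ W_V)                          # (m, d_v)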
The Decoder
Finally, the decoder has a softmax layer that defines a distribution over the vocabulary, and a word is sampled from the distribution as the next output.
The loss function is defined in the same way as for the RNN Seq2Seq model. All parameters of the model are optimized by minimizing the loss function using the Adam optimizer.
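A sketch of the final softmax layer and the per-step loss, under the assumptions that z_last is the final decoder representation (of dimension d_m) for the last position, and that the vocabulary size and output matrix W_out are illustrative placeholders:

    vocab_size = 10000                               # illustrative vocabulary size
    W_out = 0.01 * rng.normal(size=(d_m, vocab_size))

    def next_word_distribution(z_last):
        # z_last: decoder output for the last position, shape (d_m,)
        return softmax(z_last @ W_out)               # distribution over the vocabulary

    def step_loss(z_last, target_id):
        # negative log-likelihood of the reference word at this time step
        p = next_word_distribution(z_last)
        return -np.log(p[target_id] + 1e-12)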
Transformer in Action
http://jalammar.github.io/images/t/transformer_decoding_1.gif
http://jalammar.github.io/images/t/transformer_decoding_2.gif
Empirical Results
[Table of empirical results omitted.]
Conclusions on Transformer
The Transformer is the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
On both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, it achieved a new state of the art.
Outline
1 Transformer
  Self-Attention Layer
  The Encoder
  The Decoder
2 BERT
  Overview
  Pre-training BERT
  Fine-tuning BERT