NPFL114, Lecture 12: Transformer, External Memory Networks
Milan Straka, May 20, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Exams
Five questions, with written preparation; then we go through your answers together (or you can leave and let me grade them by myself). Each question is worth 20 points. In addition, up to 40 points are transferred from the practicals (the surplus above 80 points; there is no distinction between regular and competition points), and up to 10 points can be earned for GitHub pull requests.
To pass the exam, you need to obtain at least 60, 75 and 90 out of 100 points for the written exam (plus up to 40 points from the practicals) to obtain grades 3, 2 and 1, respectively.
The SIS should give you an exact time of the exam (including a gap between students), so that you do not all come at once.
What Next
In the winter semester:
NPFL117 – Deep Learning Seminar [0/2 Ex]: A reading group of deep learning papers (from all areas). Every participant presents a paper about deep learning, learning how to read a paper and present it in an understandable way, and gaining deep learning knowledge from the other presentations.
NPFL122 – Deep Reinforcement Learning [2/2 C+Ex]: In a sense a continuation of Deep Learning, but with reinforcement learning as the main method instead of supervised learning. Similar format to the Deep Learning course.
NPFL129 – Machine Learning 101: A course intended as a prequel to Deep Learning – an introduction to machine learning (regression, classification, structured prediction, clustering, hyperparameter optimization; decision trees, SVMs, maximum entropy classifiers, gradient boosting, …), with practicals in Python.
Neural Architecture Search (NASNet) – 2017
Using REINFORCE with baseline, we can design neural network architectures. We fix the overall architecture and design only the Normal and Reduction cells.
Figure 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Neural Architecture Search (NASNet) – 2017
Every block is designed by an RNN controller generating individual operations.
Figure 3 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Neural Architecture Search (NASNet) – 2017
Figure 4 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Neural Architecture Search (NASNet) – 2017
Table 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Neural Architecture Search (NASNet) – 2017
Figure 5 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Attention is All You Need
For some sequence processing tasks, sequential processing of the elements might be too restrictive. Instead, we may want to combine sequence elements independently of their distance.
Attention is All You Need
Figure 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Attention is All You Need
The attention module for queries $Q$, keys $K$ and values $V$ is defined as
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
The queries, keys and values are computed from the current word representations $V$ using a linear transformation as
$$Q = V \cdot W^Q, \quad K = V \cdot W^K, \quad V = V \cdot W^V.$$
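The formula translates directly into code. The following NumPy sketch is not from the paper; the function name and the unbatched 2D shapes are my own choices for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: [queries, d_k], K: [keys, d_k], V: [keys, d_v]
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # [queries, keys]
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to one
    return weights @ V                               # [queries, d_v]
```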
Attention is All You Need
Multi-head attention is used in practice. Instead of using one huge attention, we split the queries, keys and values into several groups (similarly to how ResNeXt works), compute the attention in each of the groups separately, and then concatenate the results.
Scaled Dot-Product Attention and Multi-Head Attention: Figure 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
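A sketch of this split-attend-concatenate scheme, reusing the scaled_dot_product_attention sketch above. The weight names and shapes (all projection matrices being d_model×d_model) are assumptions for illustration, not the paper's exact parametrization.

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, heads):
    # X: [seq, d_model]; W_Q, W_K, W_V, W_O: [d_model, d_model]
    seq, d_model = X.shape
    d_head = d_model // heads

    def split_heads(M):
        # [seq, d_model] -> [heads, seq, d_head]
        return M.reshape(seq, heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    # Attend in every head separately, then concatenate the results.
    heads_out = [scaled_dot_product_attention(Q[h], K[h], V[h]) for h in range(heads)]
    return np.concatenate(heads_out, axis=-1) @ W_O   # [seq, d_model]
```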
Attention is All You Need
Positional Embeddings
We need to encode positional information (which was implicit in RNNs):
Learned embeddings for every position.
Sinusoids of different frequencies:
$$\mathrm{PE}_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d}\right)$$
$$\mathrm{PE}_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$$
This choice of functions should allow the model to attend to relative positions, since for any fixed $k$, $\mathrm{PE}_{pos+k}$ is a linear function of $\mathrm{PE}_{pos}$.
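A small NumPy sketch generating the sinusoidal embeddings above (the function name is mine; an even d_model is assumed):

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(max_len)[:, np.newaxis]        # [max_len, 1]
    dims = np.arange(0, d_model, 2)[np.newaxis, :]       # [1, d_model / 2]
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```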
Attention is All You Need
Positional embeddings for 20 words of dimension 512, lighter colors representing values closer to 1 and darker colors representing values closer to -1. http://jalammar.github.io/illustrated-transformer/
Attention is All You Need
Regularization
The network is regularized by:
dropout of input embeddings,
dropout of the output of each sub-layer, just before it is added to the residual connection (and then normalized),
label smoothing.
The default dropout rate and also the label smoothing weight is 0.1.
Parallel Execution
Training can be performed in parallel because of the masked attention – the softmax weights of the self-attention are zeroed out so that a word cannot attend to later words in the sequence. However, inference is still sequential (and no substantial improvements of parallel inference similar to WaveNet have been achieved).
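One possible way to implement the masking, as a sketch (not the paper's code): scores of future positions are set to minus infinity before the softmax, so the corresponding attention weights become exactly zero.

```python
import numpy as np

def causal_attention_weights(scores):
    # scores: [seq, seq] floats; entry [i, j] is the raw score of query i attending to key j.
    seq = scores.shape[-1]
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)              # forbid attending to j > i
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)                                 # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)
```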
Why Attention
Table 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Transformers Results
Table 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Transformers Results

       N  d_model  d_ff  h   d_k  d_v  P_drop  ε_ls  train steps  PPL (dev)  BLEU (dev)  params ×10^6
base   6  512      2048  8   64   64   0.1     0.1   100K         4.92       25.8        65
(A)                      1   512  512                             5.29       24.9
                         4   128  128                             5.00       25.5
                         16  32   32                              4.91       25.8
                         32  16   16                              5.01       25.4
(B)                          16                                   5.16       25.1        58
                             32                                   5.01       25.4        60
(C)    2                                                          6.11       23.7        36
       4                                                          5.19       25.3        50
       8                                                          4.88       25.5        80
          256                32   32                              5.75       24.5        28
          1024               128  128                             4.66       26.0        168
                   1024                                           5.12       25.4        53
                   4096                                           4.75       26.2        90
(D)                                    0.0                        5.77       24.6
                                       0.2                        4.95       25.5
                                               0.0                4.67       25.3
                                               0.2                5.47       25.7
(E)    positional embedding instead of sinusoids                  4.92       25.7
big    6  1024     4096  16            0.3           300K         4.33       26.4        213

(Empty cells keep the values of the base model.)
Table 4 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Neural Turing Machines
So far, all input information was stored either directly in the network weights, or in the state of a recurrent network. However, mammal brains seem to operate with a working memory – a capacity for short-term storage of information and its rule-based manipulation.
We can therefore try to introduce an external memory $M$ to a neural network. The memory $M$ will be a matrix, where rows correspond to memory cells.
Neural Turing Machines
The network will control the memory using a controller, which reads from the memory and writes to it. Although the original paper also considered a feed-forward (non-recurrent) controller, usually the controller is a recurrent LSTM network.
Figure 1 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Neural Turing Machine
Reading
To read the memory in a differentiable way, the controller at time $t$ emits a read distribution $w_t$ over memory locations, and the returned read vector $r_t$ is then
$$r_t = \sum_i w_t(i) \cdot M_t(i).$$
Writing
Writing is performed in two steps – an erase followed by an add. The controller at time $t$ emits a write distribution $w_t$ over memory locations, and also an erase vector $e_t$ and an add vector $a_t$. The memory is then updated as
$$M_t(i) = M_{t-1}(i)\left[1 - w_t(i)\, e_t\right] + w_t(i)\, a_t.$$
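The read and write operations as a NumPy sketch; the addressing mechanism that produces the weights $w_t$ is omitted, and the shapes and function names are assumptions for illustration.

```python
import numpy as np

def ntm_read(memory, read_weights):
    # r_t = sum_i w_t(i) * M_t(i); memory: [cells, cell_size], read_weights: [cells]
    return read_weights @ memory                              # [cell_size]

def ntm_write(memory, write_weights, erase, add):
    # M_t(i) = M_{t-1}(i) * (1 - w_t(i) * e_t) + w_t(i) * a_t
    # write_weights: [cells]; erase, add: [cell_size]
    w = write_weights[:, np.newaxis]                          # [cells, 1]
    return memory * (1.0 - w * erase[np.newaxis, :]) + w * add[np.newaxis, :]
```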