Transformer Networks Amir Ali Moinfar - M. Soleymani Deep Learning Sharif University of Technology Spring 2019 1
The “simple” translation model • Embedding each word (word2vec, trainable, …) • Some Tricks: – Teacher forcing – Reversing the input This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019 2
Problems with this framework • All the information about the input is embedded into a single vector – The last hidden node is “overloaded” with information • Particularly if the input is long • Parallelization? • Problems with backpropagation through the sequence This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019 3
Parallelization: Convolutional Models • Some work: – Neural GPU – ByteNet – ConvS2S • Limited by the size of the convolution • Maximum path length: – O(log_k n) Kalchbrenner et al. “Neural Machine Translation in Linear Time”, 2017 4
Removing bottleneck: Attention Mechanism • Compute a weighted combination of all the hidden outputs into a single vector • Weights are functions of current output state • The weights are a distribution over the input (sum to 1) This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019 5
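To make this concrete, here is a minimal NumPy sketch (not from the slides; the dot-product scoring and all shapes are illustrative assumptions) of turning scores over the encoder's hidden outputs into a distribution and a single context vector:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes: T encoder hidden states of dimension d, one decoder state.
T, d = 6, 8
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(T, d))   # all hidden outputs of the encoder
dec_state  = rng.normal(size=(d,))     # current output (decoder) state

scores  = enc_states @ dec_state       # one score per input position
weights = softmax(scores)              # a distribution over the input (sums to 1)
context = weights @ enc_states         # weighted combination -> a single vector
```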
Attention Effect in machine translation • Left: Normal RNNs and long sentences • Right: Attention map in machine translation Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate", 2014 6
RNNs with Attention for VQA • Each hidden output of the LSTM selects a part of the image to look at Zhu et al. “Visual7W: Grounded Question Answering in Images”, 2016 7
Attention Mechanism - Abstract View • A Lookup Mechanism – Query – Key – Value 8
Attention Mechanism - Abstract View (cont.) • Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V 9
Attention Mechanism - Abstract View (cont.) • For large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients Jay Alammar, “The Illustrated Transformer” Vaswani et al. "Attention Is All You Need", 2017 10 http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
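A minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V, as in Vaswani et al.; the toy matrices and their shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # dividing by sqrt(d_k) keeps the softmax
    return softmax(scores) @ V        # out of its small-gradient regions

# Toy query/key/value matrices (shapes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 64))
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 64)
```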
Self Attention • AKA intra-attention • An attention mechanism relating different positions of a single sequence => Q, K, V are derived from a single sequence • Check the case when – Q_i = W^Q x_i – K_1, …, K_n = W^K x_1, …, W^K x_n – V_1, …, V_n = W^V x_1, …, W^V x_n 11 Jay Alammar, “The Illustrated Transformer” http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
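A self-contained NumPy sketch of this case; the random W^Q, W^K, W^V stand in for the learned projection matrices, and the sequence length and dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(1)
X   = rng.normal(size=(n, d_model))      # a single sequence of n token vectors
W_Q = rng.normal(size=(d_model, d_k))    # learned in practice, random stand-ins here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # Q_i = W^Q x_i, K_j = W^K x_j, V_j = W^V x_j
Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V  # every position attends over all positions
```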
Multi-head attention • Allows the model to – jointly attend to information – from different representation subspaces – at different positions [modified] Jay Alammar, “The Illustrated Transformer” Vaswani et al. "Attention Is All You Need", 2017 12 http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
Multi-head Self Attention Jay Alammar, “The Illustrated Transformer” http://jalammar.github.io/illustrated-transformer/ (5/20/2019) 13
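A minimal NumPy sketch of multi-head self-attention: each head applies attention with its own projections, the heads are concatenated and projected back. This is a simplified stand-in for the paper's formulation; all sizes and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h * d_k, d_model)
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O   # concat heads, project back to d_model

h, n, d_model, d_k = 8, 5, 64, 8
rng = np.random.default_rng(2)
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)   # shape (n, d_model)
```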
Bonus: Attention Is All She Needs Gregory Jantz “Hungry for Attention: Is Your Cell Phone Use at Dinnertime Hurting Your Kids?”, https://www.huffpost.com/entry/cell-phone-use-at-dinnertime_n_5207272 2014
Attention Is All You Need • Replace LSTMs with a lot of attention! • Advantages: – Less complex • State-of-the-art results – Can be parallelized, faster • Much less computation for training – Easier to learn distant dependencies Vaswani et al. "Attention Is All You Need", 2017 15
Transformer’s Behavior • Encoding + First decoding step [Link to gif] Jay Alammar, “The Illustrated Transformer” 16 Vaswani et al. "Attention Is All You Need", 2017 http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
Transformer’s Behavior (cont.) • Decoding [Link to gif] Jay Alammar, “The Illustrated Transformer” 17 Vaswani et al. "Attention Is All You Need", 2017 http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
Transformer architecture • The core of it – Multi-head attention – Positional encoding [Link to gif] Jakob Uszkoreit "Transformer: A Novel Neural Network Architecture for Language 18 Understanding", https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html Vaswani et al. "Attention Is All You Need", 2017
Transformer architecture (cont.) • Encoder – Input embedding (like word2vec) – Positional encoding – Multi-head self-attention – Feed-forward with residual links • Decoder – Output embedding (like word2vec) – Positional encoding – Multi-head self-attention – Multi-head encoder-decoder attention – Feed-forward with residual links • Output – Linear + Softmax 19 Vaswani et al. "Attention Is All You Need", 2017
Transformer architecture (cont.) • Output – Linear + Softmax 20 Vaswani et al. "Attention Is All You Need", 2017
Transformer architecture (cont.) • Encoder and Decoder Jay Alammar, “The Illustrated Transformer” 21 http://jalammar.github.io/illustrated-transformer/ (5/20/2019) Vaswani et al. "Attention Is All You Need", 2017
Transformer architecture (cont.) • Feed-forward layers • Residual links • Layer normalization • Dropout 22 Vaswani et al. "Attention Is All You Need", 2017
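A minimal NumPy sketch of the "residual link + normalization" pattern around a position-wise feed-forward sub-layer; learnable layer-norm gain/bias and dropout are omitted, and all sizes are illustrative assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)                 # learnable gain and bias omitted

def feed_forward(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # position-wise FFN with ReLU

def sublayer(x, fn):
    # Residual link around the sub-layer, then normalization.
    # (Dropout on fn(x) is used during training; omitted here.)
    return layer_norm(x + fn(x))

rng = np.random.default_rng(4)
n, d_model, d_ff = 5, 512, 2048
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = sublayer(x, lambda z: feed_forward(z, W1, b1, W2, b2))   # shape (n, d_model)
```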
Transformer architecture (cont.) • Attention is all it needs 23 Vaswani et al. "Attention Is All You Need", 2017
Transformer architecture (cont.) • [Multi-head] attention is all it needs 24 Vaswani et al. "Attention Is All You Need", 2017
Transformer architecture (cont.) • Two types of attention are all it needs :D Remember the signature of multi-head attention (see the sketch below) 25 Vaswani et al. "Attention Is All You Need", 2017
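A short NumPy sketch (illustrative shapes, causal masking omitted) of how the same attention signature is fed in the two cases: self-attention within one sequence versus encoder-decoder attention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):                   # same signature in both cases
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(3)
d = 8
enc_out = rng.normal(size=(6, d))         # encoder output ("memory")
dec_x   = rng.normal(size=(4, d))         # decoder-side representations

self_attn    = attention(dec_x, dec_x, dec_x)       # Q, K, V from the same sequence
enc_dec_attn = attention(dec_x, enc_out, enc_out)   # Q from decoder, K and V from encoder
```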
Transformer architecture (cont.) • Embeddings – Just a lookup table: 26 Vaswani et al. "Attention Is All You Need", 2017
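To illustrate "just a lookup table", a minimal NumPy sketch; the vocabulary size, model dimension, and token ids are made-up values:

```python
import numpy as np

vocab_size, d_model = 10000, 512
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))   # learned during training

token_ids = np.array([17, 4, 993])        # hypothetical token indices
vectors = embedding_table[token_ids]      # lookup: one row per token, shape (3, 512)
```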
Transformer architecture (cont.) • Positional Encoding • It would allow the model to easily learn to attend by relative positions: sin(pos + k) = sin(pos) cos(k) + cos(pos) sin(k), cos(pos + k) = cos(pos) cos(k) − sin(pos) sin(k) 27 Alexander Rush, “The Annotated Transformer” http://nlp.seas.harvard.edu/2018/04/03/attention.html (5/20/2019) Vaswani et al. "Attention Is All You Need", 2017
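A minimal NumPy sketch of the sinusoidal positional encoding from Vaswani et al., PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); d_model is assumed even and the sizes are illustrative:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i   = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)   # added to the input embeddings
```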
Transformer architecture (cont.) • A 2-layer Transformer network 28 Jay Alammar, “The Illustrated Transformer” http://jalammar.github.io/illustrated-transformer/ (5/20/2019) Vaswani et al. "Attention Is All You Need", 2017
Transformer’s Behavior • Encoding + First decoding step [Link to gif] Jay Alammar, “The Illustrated Transformer” 29 Vaswani et al. "Attention Is All You Need", 2017 http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
Transformer’s Behavior (cont.) • Decoding [Link to gif] Jay Alammar, “The Illustrated Transformer” 30 Vaswani et al. "Attention Is All You Need", 2017 http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
Complexity • Advantages: – Less complex – Can be parallelized, faster – Easier to learn distant dependencies 31 Vaswani et al. "Attention Is All You Need", 2017
Interpretability • Attention mechanism in the encoder self-attention in layer 5 of 6 32 Vaswani et al. "Attention Is All You Need", 2017
Interpretability (cont.) • Two heads in the encoder self-attention in layer 5 of 6 Vaswani et al. "Attention Is All You Need", 2017 33
References • Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems, 2017. • Alammar, Jay. “The Illustrated Transformer.” Jay Alammar – Visualizing Machine Learning One Concept at a Time, 27 June 2018, jalammar.github.io/illustrated-transformer/. • Zhang, Shiyue. “Attention Is All You Need - Ppt Download.” SlidePlayer, 20 June 2017, slideplayer.com/slide/13789541/. • Kurbanov, Rauf. “Attention Is All You Need.” JetBrains Research, 27 Jan. 2019, research.jetbrains.org/files/material/5ace635c03259.pdf. • Polosukhin, Illia. “Attention Is All You Need.” LinkedIn SlideShare, 26 Sept. 2017, www.slideshare.net/ilblackdragon/attention-is-all-you-need. • Rush, Alexander. “The Annotated Transformer.” 3 Apr. 2018, nlp.seas.harvard.edu/2018/04/03/attention.html. • Uszkoreit, Jakob. “Transformer: A Novel Neural Network Architecture for Language Understanding.” Google AI Blog, 31 Aug. 2017, ai.googleblog.com/2017/08/transformer-novel-neural-network.html. 34
Q&A 35
Thanks for your attention! Your Attention = Softmax(You [Presentation | Anything else]) V[Presentation | Anything else] 36