Transformer Ablation Studies
Simon Will
Institute of Formal and Applied Linguistics, Charles University
Seminar: Statistical Machine Translation
Instructor: Dr. Ondřej Bojar
May 2019
Structure
The Transformer
Ablation Concerns
Feed Forward Layers
Positional Embeddings
Self-Attention Keys and Queries
Idea
▶ Transformer successful and many variations exist (e.g. Ott et al. 2018; Dai et al. 2019)
▶ Difficult to know what the essentials are and what each part contributes
→ Train similar models differing in crucial points
Transformer (Vaswani et al. 2017)
▶ Encoder-decoder architecture based on attention
▶ No recurrence
▶ Constant in source and target “time” while training
▶ In inference, only constant in source “time”
▶ Better parallelizable than RNN-based networks
Transformer Illustration
Figure: Two-Layer Transformer (image from Alammar 2018)
Ablation Concerns
Figure: Areas of Concern for this Project
Feed Forward Layers
▶ Contribution of feed forward layers in encoder and decoder not clear
▶ Is the attention enough?
→ Three configurations (see the sketch below):
▶ No encoder FF layer
▶ No decoder FF layer
▶ No encoder and no decoder FF layer
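These configurations amount to making the position-wise feed-forward sublayer optional in each layer. Below is a minimal PyTorch sketch of this idea (an assumed implementation, not the project's code; the class name AblatableEncoderLayer and the use_ff flag are illustrative):

```python
import torch
import torch.nn as nn

class AblatableEncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, use_ff=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.use_ff = use_ff
        if use_ff:
            # Position-wise feed-forward sublayer (the part being ablated).
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sublayer with residual connection and layer norm.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # In the ablated configuration the FF sublayer is skipped entirely.
        if self.use_ff:
            x = self.norm2(x + self.ff(x))
        return x

# "No encoder FF layer" configuration: use_ff=False in every encoder layer.
layer = AblatableEncoderLayer(use_ff=False)
out = layer(torch.zeros(2, 7, 512))  # (batch, tokens, d_model)
```

The decoder variant is analogous, with the FF sublayer removed after the self-attention and encoder-decoder attention sublayers.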
Ablation Concerns
Figure: Areas of Concern for this Project
Positional Embeddings
▶ No recurrence → no information about order of tokens
→ Add information via explicit positional embeddings
▶ Added to the word embedding vector
▶ Two types:
▶ Learned embeddings of absolute position (e.g. Gehring et al. 2017)
▶ Sinusoidal embeddings (used in Vaswani et al. 2017; sketch below):
$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{\text{key}}}}\right)$
$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{\text{key}}}}\right)$
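A minimal NumPy sketch of the sinusoidal embeddings above (an assumed implementation; the function name is illustrative):

```python
import numpy as np

def sinusoidal_positional_embeddings(n_positions, d_key):
    """Return an (n_positions, d_key) matrix of sinusoidal positional embeddings."""
    positions = np.arange(n_positions)[:, None]    # (n_positions, 1)
    dims = np.arange(0, d_key, 2)[None, :]         # even dimensions 2i
    angles = positions / np.power(10000.0, dims / d_key)
    pe = np.zeros((n_positions, d_key))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i+1)
    return pe

# Example: embeddings for a 20-token sentence with key dimensionality 512,
# matching the illustration on the next slide.
pe = sinusoidal_positional_embeddings(20, 512)
```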
Positional Embeddings Illustrated
Figure: Illustrated positional embeddings for a 20-token sentence and key dimensionality 512 (taken from Alammar 2018)
Modifications
▶ Vary the “rainbow stretch” by introducing a stretching factor α (sketch below):
$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / (\alpha \cdot d_{\text{key}})}}\right)$
▶ Expectations:
▶ α too low: no positional information
▶ α too high: word embedding information destroyed
▶ Some α other than 1 is optimal
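A sketch of the stretched variant, assuming α enters the exponent as in the reconstructed formula above (α = 1 recovers the standard embeddings):

```python
import numpy as np

def stretched_positional_embeddings(n_positions, d_key, alpha=1.0):
    # Same construction as before, with the exponent scaled by 1/alpha
    # (an assumption about where alpha enters; alpha=1.0 gives Vaswani et al.).
    positions = np.arange(n_positions)[:, None]
    dims = np.arange(0, d_key, 2)[None, :]         # even dimensions 2i
    angles = positions / np.power(10000.0, dims / (alpha * d_key))
    pe = np.zeros((n_positions, d_key))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Very small alpha: wavelengths explode and the embeddings are nearly constant
# across positions (little positional information); very large alpha: every
# dimension oscillates quickly, adding noise-like values to the word embeddings.
for alpha in (0.01, 1.0, 100.0):
    pe = stretched_positional_embeddings(20, 512, alpha)
    print(alpha, float(pe.std(axis=0).mean()))  # variation across positions
```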
Self-Attention Keys and Queries
▶ $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{(QW^Q)(KW^K)^T}{\sqrt{d_k}}\right)(VW^V)$
▶ In the encoder, source words are used for key and query generation with different matrices
▶ Modification: use the same matrix for both (see the sketch below)
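A minimal NumPy sketch of encoder self-attention with the proposed tying of key and query projections (the names W_QK and shared_qk_attention are assumptions for illustration, not the project's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_qk_attention(X, W_QK, W_V):
    """Self-attention over source states X (n_tokens, d_model) with W_Q = W_K."""
    Q = X @ W_QK            # queries and keys come from the same projection
    K = X @ W_QK
    V = X @ W_V
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))   # (n_tokens, n_tokens)
    return scores @ V

# Example with random parameters: 5 source tokens, model/key dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = shared_qk_attention(X, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
```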
Experiment Design
▶ Do all basic configurations
▶ Combine well-performing modifications
▶ How to compare?
▶ BLEU score on the test set at best dev set performance
▶ Whole learning curves (similar to Popel and Bojar 2018)
Dataset
▶ Parallel image captions (Elliott et al. 2016)
▶ https://github.com/multi30k/dataset
▶ Short sentences
▶ Rather small (30k sentences)
▶ Good because fitting takes less than a day
▶ Bad because dev and test performance is far below train performance
Conclusion
▶ Experiments still pending
▶ Expecting to see mainly negative results
▶ Hopefully some positive ones
Nice translation by the system
“eine junge frau hält eine blume , um sich an der blume zu halten .”
“a young woman is holding a flower in order to hold on to the flower .”
References I
Alammar, Jay (2018). The Illustrated Transformer. URL: http://jalammar.github.io/illustrated-transformer/.
Dai, Zihang et al. (2019). “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”. In: CoRR abs/1901.02860. arXiv: 1901.02860. URL: http://arxiv.org/abs/1901.02860.
Elliott, Desmond et al. (2016). “Multi30K: Multilingual English-German Image Descriptions”. In: Proceedings of the 5th Workshop on Vision and Language. Berlin, Germany: Association for Computational Linguistics, pp. 70–74. DOI: 10.18653/v1/W16-3210. URL: http://www.aclweb.org/anthology/W16-3210.
Gehring, Jonas et al. (2017). “Convolutional Sequence to Sequence Learning”. In: CoRR abs/1705.03122. arXiv: 1705.03122. URL: http://arxiv.org/abs/1705.03122.
Ott, Myle et al. (2018). “Scaling Neural Machine Translation”. In: CoRR abs/1806.00187. arXiv: 1806.00187. URL: http://arxiv.org/abs/1806.00187.
References II
Popel, Martin and Ondřej Bojar (2018). “Training Tips for the Transformer Model”. In: CoRR abs/1804.00247. arXiv: 1804.00247. URL: http://arxiv.org/abs/1804.00247.
Vaswani, Ashish et al. (2017). “Attention Is All You Need”. In: CoRR abs/1706.03762. arXiv: 1706.03762. URL: http://arxiv.org/abs/1706.03762.