How Much Self-Attention Do We Need? Trading Attention for Feed-Forward Layers
Kazuki Irie*, Alexander Gerstenberger, Ralf Schlüter, Hermann Ney
Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany
*joining the Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland
ICASSP 2020, Barcelona, Spain
Machine Learning for Language Processing I [TU2.L3.2], May 5, 2020
Introduction
• Major trend: large Transformer language models in 2019 and early 2020.
  – OpenAI GPT-2 [Radford & Wu+ 19]
  – Nvidia Megatron
  – Microsoft Turing-NLG
• Applications to ASR (Interspeech 2019):
  – N-best rescoring, Nvidia [Li & Lavrukhin+ 19]
  – Lattice rescoring & shallow fusion, RWTH Aachen [Irie & Zeyer+ 19]
• Large ASR improvements over well-tuned LSTM language models, with lattice rescoring for hybrid NN/HMM ASR systems:
  – LibriSpeech SoTA, Interspeech 2019 [Lüscher & Beck+ 19]
  – TED-LIUM 2 SoTA, this conference (Friday) [Zhou & Michel+ 20]
• In practice: large memory requirement for search. For lattice rescoring, more than 100 GB for large lattices on some tasks...
Transformer language models in ASR: Large state size
(Figure: Transformer layer with Positional Encoding, LayerNorm, Self-Attention, LayerNorm and Feed-forward blocks.)
• Transformer LM state: key and value vectors.
• State size: L (layers) × d_kv (key dim.) × 2 (for key and value) × n (positions).
• In principle: to be stored for each hypothesis.
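As a rough illustration (added here, not part of the original slides), the per-hypothesis search state is just the cached key and value vectors of all layers. A minimal Python sketch, with the helper name transformer_state_size made up for this example:

```python
# Per-hypothesis search state of a Transformer LM: the cached keys and values.
# State size (in floats) = 2 (key + value) * L (layers) * d_kv * n (positions).

def transformer_state_size(num_layers, d_kv, num_positions):
    """Number of float values that must be stored per hypothesis."""
    return 2 * num_layers * d_kv * num_positions

# Baseline configuration from the talk: L = 32, d_kv = 768.
per_position = transformer_state_size(32, 768, 1)
print(per_position)              # 49152 floats per position (~49 K, as on the slides)
print(per_position * 4 / 1024)   # ~192 KiB per position and hypothesis in float32
```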
Increasing model size without increasing state size?
Objective/Motivation
• Reduce the memory requirement of the original Transformer LM for search!
• From the modeling perspective (quantization etc. can be applied on top of it).
• Reconsider the original Transformer layer: can we efficiently increase the model size without increasing the state size?
• Hyper-parameters in the Transformer language model:
  – Number of layers: L
  – Tied key, value, and query dimension: d_kv
  – Feed-forward dimension: d_ff
  – Number of attention heads: H
• State size: 2 × L × d_kv × n
• Only possibility: increase the feed-forward dimension d_ff.
• Can we put parameters into the feed-forward modules more efficiently?
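A back-of-the-envelope calculation (a sketch that ignores embeddings, biases and layer norms, so the numbers are approximate) makes the trade-off concrete: d_ff grows the parameter count but leaves the state size untouched:

```python
# Rough per-layer parameter count of a standard Transformer LM layer
# (self-attention + one feed-forward block), ignoring biases and layer norm.
def params_per_layer(d_model, d_kv, d_ff):
    attention = 4 * d_model * d_kv      # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff   # up- and down-projection
    return attention + feed_forward

d_model = d_kv = 768
num_layers = 32
for d_ff in (2048, 4096, 8192):
    p = num_layers * params_per_layer(d_model, d_kv, d_ff)
    s = 2 * num_layers * d_kv           # state per position: independent of d_ff
    print(f"d_ff={d_ff}: ~{p / 1e6:.0f}M layer params, state/position = {s / 1000:.0f}K")
```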
This work: two modifications for a small-state Transformer
1. F feed-forward layers per Transformer layer.
2. Sharing key and value matrices.
• (Reduce the number of Transformer layers L.)
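A minimal PyTorch-style sketch of such a layer (the actual models were trained with RETURNN; see the configurations linked on the baseline slide, so the class name, the pre-layer-norm placement and the per-sub-layer residuals here are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallStateTransformerLayer(nn.Module):
    """Illustrative layer with the two modifications from the talk:
    (1) num_ff (= F) feed-forward sub-layers per self-attention module,
    (2) one shared projection for keys and values, halving the cached state."""

    def __init__(self, d_model=768, d_ff=4096, num_heads=12, num_ff=3, share_kv=True):
        super().__init__()
        self.num_heads = num_heads
        self.share_kv = share_kv
        self.norm_att = nn.LayerNorm(d_model)
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_proj = nn.Linear(d_model, d_model, bias=False)        # shared K = V
        self.v_proj = None if share_kv else nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        # (1) F feed-forward sub-layers, each with its own residual connection and norm.
        self.ff_norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(num_ff))
        self.ffs = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_ff)
        )

    def _split_heads(self, x):
        b, n, d = x.shape
        return x.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)

    def forward(self, x):
        # Causal self-attention sub-layer with pre-layer-norm.
        h = self.norm_att(x)
        q = self._split_heads(self.q_proj(h))
        k = self._split_heads(self.kv_proj(h))
        v = k if self.share_kv else self._split_heads(self.v_proj(h))  # (2) one cached tensor
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.out_proj(a.transpose(1, 2).reshape(x.shape))
        # (1) stack of F feed-forward sub-layers.
        for norm, ff in zip(self.ff_norms, self.ffs):
            x = x + ff(norm(x))
        return x


# Example: an 8-layer, F=3 stack, roughly matching the small-state model in the talk.
layers = nn.ModuleList(SmallStateTransformerLayer() for _ in range(8))
x = torch.randn(2, 10, 768)        # (batch, positions, d_model)
for layer in layers:
    x = layer(x)
print(x.shape)                     # torch.Size([2, 10, 768])
```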
Experimental setups
Dataset: TED-LIUM release 2
• 152 K-word vocabulary.
• 270 M running words for language model training.
• 7 subsets including the transcriptions.
• Minor overlapping problem in the official data (see our paper for details).
• Some additional experiments on LibriSpeech (to be found in the paper).
ASR baseline
• Dedicated system paper (new SoTA on TED-LIUM 2), Friday 15:15-17:15, Session: Large Vocabulary Continuous Speech Recognition and Search.
  Zhou et al.: The RWTH ASR System for TED-LIUM Release 2: Improving Hybrid HMM with SpecAugment.
• Hybrid NN/HMM system.
• First pass with 4-gram/LSTM.
• Lattice rescoring to apply LSTM/Transformer language models.
Baseline LM setups: TED-LIUM 2
Basic setups
• 4-gram: interpolation.
• LSTM: 4 layers, 2048 nodes in each layer.
• Transformer: 32 layers, 4096 feed-forward dim., 768 hidden units, 12 heads, no positional encoding.

  Model         #Param. [M]   PPL Dev   PPL Test
  4-gram            343          105        125
   + pruning        161          113        128
  LSTM              450           74         71
  Transformer       414           62         61

All language model configurations/models are online:
https://github.com/rwth-i6/returnn-experiments/tree/master/2020-lm-small-state-trafo
Effect of deep feed-forward module
Perplexity results on TED-LIUM 2
L = number of Transformer layers, F = number of feed-forward layers per Transformer layer.

  Transformer          F    L   d_ff   State/position [K]   #Param. [M]   Dev   Test
  Standard             1    8   4096           12                206        68     65
  Standard             1   32   2048           49                313        63     62
  Standard             1   32   4096           49                414        62     61
  Deep feed-forward    7    6   2048            9                280        65     63
  Deep feed-forward    3    8   4096           12                338        63     62
  Deep feed-forward    3   12   4096           18                379        62     61
  Deep feed-forward    3   16   4096           24                464        61     61

Key/value dimension (d_kv) is fixed to 768 for all models.
• The (L=8, F=3) model is only 2% rel. worse than the baseline (L=32, F=1),
• with a 4 times smaller state size. Also confirmed on LibriSpeech.
Effect of sharing KV
Perplexity results on TED-LIUM 2

  Transformer layer    L   F   Shared KV   State/position [K]   #Param. [M]   Dev   Test
  Standard            32   1      No               49                414        62     61
  Standard            32   1      Yes              25                395        63     61
  Deep feed-forward    8   3      No               12                338        63     62
  Deep feed-forward    8   3      Yes               6                333        66     64

• Up to 5% degradation for the proposed model with the deep feed-forward module.
• Almost no degradation for the standard model.
• → Counter-intuitive? More components are affected in the standard case, so there should be a larger effect?
• → Intuitive? The model with fewer self-attention layers is affected more: the importance of these few layers is greater.
Knowledge distillation
• LSTM requires much less memory in ASR search.
• Knowledge distillation from the Transformer to the LSTM.
Perplexity results on TED-LIUM 2

  Model                  State size for n positions [K]   #Param. [M]   Dev   Test
  Baseline LSTM                        16                      450        74     71
  Teacher Transformer                n × 49                    414        62     61
  Student LSTM                         16                      450        66     63

• 10-12% relative improvement over the baseline LSTM.
• Still behind the Transformer teacher, but with a much smaller memory requirement.
• See also our other paper: Gerstenberger et al., Domain Robust, Fast, and Compact Neural Language Models, in Session Language Modeling on Friday.
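A common form of the distillation objective, as a hedged sketch (the interpolation weight alpha and the temperature are assumptions for illustration; the exact recipe follows the referenced papers):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, temperature=1.0):
    """Interpolate the usual cross-entropy with the KL divergence to the
    teacher's output distribution (Transformer teacher -> LSTM student)."""
    ce = F.cross_entropy(student_logits, targets)
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.log_softmax(teacher_logits / t, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (t * t)
    return alpha * ce + (1.0 - alpha) * kl

# Toy usage: a batch of 4 positions over a 10-word vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, targets)
loss.backward()
```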
ASR results: TED-LIUM 2
• Hybrid NN/HMM system (to be presented this Friday [Zhou & Michel+ 20]).
• First pass with 4-gram + LSTM.
• Lattice rescoring (→) with the Transformer.

  Model              L   F   Dev PPL   Dev WER [%]   Eval PPL   Eval WER [%]
  4-gram + LSTM      -   -      64         5.5           69          6.1
   → Transformer    32   1      55         5.1           59          5.6
   → Transformer     8   3      56         5.2           60          5.7

• Small-state Transformer: similar performance to the standard Transformer.
• Requires much less memory: 16 GB instead of 64 GB for the largest lattice.
Summary
Simple modifications to the Transformer layer:
• (1) F feed-forward layers per Transformer layer → works well. We can reduce the total number of layers, and thus the number of self-attention layers.
• (2) Sharing key and value matrices → extra reduction in state size, with some degradation when combined with (1).
The 1:1 ratio of self-attention to feed-forward modules in the original Transformer is sub-optimal for the state size.
Possible extensions to further reduce the memory requirement for search:
• Not all layers need self-attention → lower/mid layers do not require it [Irie & Zeyer+ 19]. Replace them with static weighted averaging (with constant state size).
• Combine this with fixed-memory-size Transformers, e.g. Transformer-XL [Dai & Yang+ 19], Compressive Transformer [Rae & Potapenko+ 20].
Thank you for your attention. Please send your questions to: irie@cs.rwth-aachen.de