

  1. DEJA-VU : DOUBLE FEATURE PRESENTATION AND ITERATED LOSS IN DEEP TRANSFORMER NETWORKS Andros Tjandra 1* , Chunxi Liu 2 , Frank Zhang 2 , Xiaohui Zhang 2 , Yongqiang Wang 2 , Gabriel Synnaeve 2 , Satoshi Nakamura 1 , Geoffrey Zweig 2 1) NAIST, Japan 2) Facebook AI, USA * This work was done while Andros was a research intern at Facebook

  2. Motivation • Make feature processing adaptive to what is being said. • Apply different feature processing depending on which words need to be differentiated in light of a specific utterance. • To achieve this, we allow a Transformer network to (re-)attend to the audio features, using intermediate layer activations as the Query. • Imposing the objective function on an intermediate layer ensures that it carries meaningful information, and trains much faster. • Net result: using these two methods lowers error rates by 10-20% relative on the Librispeech and video ASR datasets.

  3. Review: Self-attention in Transformers • Figure: the Transformer module, multi-head self-attention, and (scaled) dot-product attention. • Image ref: "Attention Is All You Need" (Vaswani et al., NIPS 2017).
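For concreteness, the snippet below is a minimal PyTorch sketch of the scaled dot-product attention reviewed on this slide. The function name, tensor shapes, and masking convention are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of scaled dot-product attention (Vaswani et al., 2017).
# Shapes and names are illustrative only.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, time, d_head) tensors from learned projections."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)   # attention distribution over key positions
    return torch.matmul(weights, v)       # weighted sum of value vectors

# Multi-head self-attention feeds the same sequence in as Q, K, and V after
# projecting it into several lower-dimensional heads, runs the function above
# per head, then concatenates and re-projects the head outputs.
```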

  4. Review: VGG + Transformer Acoustic Model • VGG front-end: blocks of 3x3 convolutions + stride (to sub-sample the Mel-spectrogram). • A stack of Transformer layers on top of the VGG output. • The loss function ℒ(Q, Z) is either the CTC softmax or CE (for the hybrid DNN-HMM system).
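As a rough illustration of the acoustic model on this slide, here is a PyTorch sketch of a VGG-style sub-sampling front-end feeding a Transformer encoder with a CTC/CE output head. The layer sizes (80 Mel bins, d_model = 512, 24 layers) and the class name are placeholder assumptions; the paper's exact configuration may differ.

```python
# Rough sketch of a VGG + Transformer acoustic model; sizes are placeholders.
import torch
import torch.nn as nn

class VGGTransformerEncoder(nn.Module):
    def __init__(self, n_mels=80, d_model=512, n_layers=24, n_heads=8, n_tokens=5000):
        super().__init__()
        # VGG-style blocks: 3x3 convolutions + pooling to sub-sample the
        # Mel-spectrogram in time (and frequency) before the Transformer stack.
        self.vgg = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.output = nn.Linear(d_model, n_tokens)    # CTC softmax or CE head

    def forward(self, mel):                           # mel: (batch, time, n_mels)
        x = self.vgg(mel.unsqueeze(1))                # (batch, 64, time/4, n_mels/4)
        x = x.transpose(1, 2).flatten(2)              # (batch, time/4, 64 * n_mels/4)
        x = self.encoder(self.proj(x))                # stack of Transformer layers
        return self.output(x)                         # per-frame token logits
```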

  5. Problems? • Stacking more and more layers has empirically given better results. • Computer vision: AlexNet (<10 layers) -> VGGNet (20 layers) -> ResNet (>100 layers). • However, training such deep models is difficult. • With the improvements in this paper, we can reliably train networks of up to 36 layers.

  6. Idea #1: Iterated Loss • In a deep neural network, the loss is attached to the node furthest from the input. • Early layers may therefore receive weaker feedback (due to vanishing gradients). • We add an auxiliary loss at intermediate nodes: an auxiliary layer projects the intermediate activation a_k to a prediction Q_k (this layer is removed once training is finished). • Total objective: ℒ(Q_L, Z) + μ · Σ_k ℒ(Q_k, Z), where the sum runs over the chosen intermediate layers k.
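The sketch below illustrates the iterated-loss idea under the assumption of a CTC objective: temporary projection heads at selected intermediate layers produce predictions Q_k, whose losses are scaled by μ and added to the final-layer loss. The class and argument names (IteratedLossEncoder, aux_positions, mu) are made up for the example.

```python
# Sketch of the iterated (auxiliary) loss: apply the same CTC loss at chosen
# intermediate Transformer layers, scaled by mu, on top of the final loss.
import torch
import torch.nn as nn

class IteratedLossEncoder(nn.Module):
    def __init__(self, transformer_layers, d_model, n_tokens,
                 aux_positions=(8, 16), mu=0.3):
        super().__init__()
        self.layers = nn.ModuleList(transformer_layers)
        self.aux_positions = set(aux_positions)
        self.mu = mu
        # One auxiliary projection per intermediate loss; removed after training.
        self.aux_heads = nn.ModuleDict(
            {str(p): nn.Linear(d_model, n_tokens) for p in aux_positions})
        self.final_head = nn.Linear(d_model, n_tokens)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, x, targets, input_lens, target_lens):
        loss = 0.0
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i in self.aux_positions:                 # intermediate prediction Q_k
                log_probs = self.aux_heads[str(i)](x).log_softmax(-1)
                loss = loss + self.mu * self.ctc(
                    log_probs.transpose(0, 1), targets, input_lens, target_lens)
        log_probs = self.final_head(x).log_softmax(-1)  # final prediction Q_L
        loss = loss + self.ctc(log_probs.transpose(0, 1), targets, input_lens, target_lens)
        return loss
```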

  7. Effect of Iterated Loss • Comparison of auxiliary-loss placements: • Baseline: 1 CTC loss (layer 24) • 2 CTC (12-24) • 3 CTC (8-16-24) • 4 CTC (6-12-18-24) • Auxiliary loss coefficient μ (see next slide).

  8. Effect of μ • Compared μ = 0.3 vs μ = 1.0. • μ = 0.3 is consistently better than μ = 1.0 on both the 2 CTC and 3 CTC configurations.

  9. Idea #2: Feature Re-presentation • After the iterated loss, we want to dynamically re-integrate the input features. • Why? The layer after the iterated loss may already encode a partial hypothesis, and we can look for input features correlated with that partial hypothesis. • Transformer 2 therefore combines the input features and the Transformer 1 hidden state (via a linear projection + LayerNorm). • There are several ways we have explored to do this (next slide).

  10. (Cont.) Feature Concatenation • (Top) Feature-axis concatenation: concatenate the input features and the hidden state along the feature axis, followed by a linear projection + LayerNorm. • (Bottom) Time-axis concatenation: concatenate along the time axis, self-attend over the joint sequence with a post-projection, then split the output back into two halves (✓ best performance). • Split A: the input features act as the Query. • Split B: the hidden state acts as the Query.
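The following is a minimal sketch of the time-axis concatenation variant, assuming the projected input features and the intermediate hidden state share the same length and model dimension. The names (TimeAxisRepresentation, z, a) are placeholders, and returning Split B rather than Split A is just one of the two options described above, not a claim about the paper's final choice.

```python
# Sketch of time-axis feature concatenation: concatenate (projected) input
# features z and hidden state a along the time axis, self-attend over the
# joint sequence, then split the output back into the two halves.
import torch
import torch.nn as nn

class TimeAxisRepresentation(nn.Module):
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, z, a):
        # z: input features, a: hidden state after the iterated loss;
        # both assumed to be (batch, time, d_model) with equal time lengths.
        joint = torch.cat([self.norm(self.proj(z)), a], dim=1)  # (batch, 2*time, d)
        out = self.layer(joint)                # self-attention across both halves
        split_a, split_b = out.split(z.size(1), dim=1)
        # Split A: positions whose queries were the input features.
        # Split B: positions whose queries were the hidden state.
        return split_b
```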

  11. Final architecture • VGG front-end: produces the sub-sampled input features. • Transformer 1: first stack of Transformer layers; an auxiliary layer projects its output a_k to a prediction Q_k for the iterated loss μ · ℒ(Q_k, Z) (the auxiliary layer is removed after training is finished). • Linear projection + LayerNorm, then Transformer 2 combines the input features with the Transformer 1 hidden state. • The final output Q_L is trained with the main loss ℒ(Q_L, Z).
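To show how the two ideas fit together, here is a high-level, schematic sketch of the forward pass implied by this slide, written against the illustrative modules sketched earlier; it is not the authors' implementation.

```python
# Schematic forward pass: VGG front-end -> Transformer 1 -> auxiliary head
# (iterated loss) -> feature re-presentation -> Transformer 2 -> final head.
def forward_pipeline(mel, vgg_frontend, transformer_block_1, aux_head,
                     represent, transformer_block_2, final_head):
    z = vgg_frontend(mel)               # sub-sampled input features
    a_k = transformer_block_1(z)        # first stack of Transformer layers
    aux_logits = aux_head(a_k)          # auxiliary prediction Q_k for mu * L(Q_k, Z);
                                        # this head is discarded after training
    x = represent(z, a_k)               # re-attend to the input features
    a_L = transformer_block_2(x)        # second stack of Transformer layers
    return final_head(a_L), aux_logits  # final prediction Q_L plus auxiliary Q_k
```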

  12. Result: Librispeech (CTC, without data augmentation)

  Model           Config       dev-clean  dev-other  test-clean  test-other
  CTC Baseline    VGG+24 Trf.  4.7        12.7       5.0         13.1
  + Iter. Loss    12-24        4.1        11.8       4.5         12.2
                  8-16-24      4.2        11.9       4.6         12.3
                  6-12-18-24   4.1        11.7       4.4         12.0
  + Feat. Cat.    12-24        3.9        10.9       4.2         11.1
                  8-16-24      3.7        10.3       4.1         10.7
                  6-12-18-24   3.6        10.4       4.0         10.8

  Iterated loss alone: 12% test-clean and 8% test-other relative improvement. Adding feature concatenation: 20% test-clean and 18% test-other relative improvement.

  13. Librispeech with data augmentation

  Model            Config       LM      test-clean  test-other
  CTC (Baseline)   VGG+24 Trf.  4-gram  4.0         9.4
  + Iter. Loss     8-16-24      4-gram  3.5         8.4
  + Feat. Cat.     8-16-24      4-gram  3.3         7.6
  CTC (Baseline)   VGG+36 Trf.  4-gram  4.0         9.4
  + Iter. Loss     12-24-36     4-gram  3.4         8.1
  + Feat. Cat.     12-24-36     4-gram  3.2         7.2

  Without the iterated loss and feature concatenation, increasing the number of Transformer layers does not improve performance; with them, we still gain from a deeper Transformer.

  14. Librispeech with hybrid DNN-HMM

  Model              Config       LM      test-clean  test-other
  Hybrid (Baseline)  VGG+24 Trf.  4-gram  3.2         7.7
  + Iter. Loss       8-16-24      4-gram  3.1         7.3
  + Feat. Cat.       8-16-24      4-gram  2.9         6.7

  9% test-clean and 12% test-other relative improvement.

  15. Video dataset

  Model              Config       curated  clean  other
  CTC (Baseline)     VGG+24 Trf.  14.0     17.4   23.6
  + Iter. Loss       8-16-24      13.2     16.7   22.9
  + Feat. Cat.       8-16-24      12.4     16.2   22.3
  CTC (Baseline)     VGG+36 Trf.  14.2     17.5   23.8
  + Iter. Loss       12-24-36     12.9     16.6   22.8
  + Feat. Cat.       12-24-36     12.3     16.1   22.3
  Hybrid (Baseline)  VGG+24 Trf.  12.8     16.1   22.1
  + Iter. Loss       8-16-24      12.1     15.7   21.8
  + Feat. Cat.       8-16-24      11.6     15.4   21.4

  CTC: 13% curated, 8% clean, 6% other relative improvement. Hybrid: 9% curated, 4% clean, 3% other relative improvement.

  16. Conclusion • We have proposed a method for re-processing the input features in light of the information available at an intermediate network layer. • To integrate features from different layers, we proposed self-attention across layers, concatenating the two sequences along the time axis. • Adding an iterated loss in the middle of deep Transformers improves performance (tested on hybrid ASR as well). • Librispeech: 10-20% relative improvement. • Video: 3.2-13% relative improvement.

  17. End of presentation. Thank you for your attention!
