DEJA-VU: DOUBLE FEATURE PRESENTATION IN DEEP TRANSFORMER NETWORKS

Andros Tjandra 1∗, Chunxi Liu 2, Frank Zhang 2, Xiaohui Zhang 2, Yongqiang Wang 2, Gabriel Synnaeve 2, Satoshi Nakamura 1, Geoffrey Zweig 2
1 Nara Institute of Science and Technology, Japan
2 Facebook AI, USA
{andros.tjandra.ai6, s-nakamura}@is.naist.jp, {chunxiliu, frankz, xiaohuizhang, yqw, gab, gzweig}@fb.com
∗ This work was done while the first author was a research intern at Facebook.

ABSTRACT

Deep acoustic models typically receive features in the first layer of the network and process increasingly abstract representations in the subsequent layers. Here, we propose to feed the input features at multiple depths in the acoustic model. As our motivation is to allow acoustic models to re-examine their input features in light of partial hypotheses, we introduce intermediate model heads and loss functions. We study this architecture in the context of deep Transformer networks, and we use an attention mechanism over both the previous-layer activations and the input features. To train this model's intermediate output hypotheses, we apply the objective function at each layer right before feature re-use. We find that the use of such an iterated loss significantly improves performance by itself, as well as enabling input feature re-use. We present results on both Librispeech and a large-scale video dataset, with relative improvements of 10-20% for Librispeech and 3.2-13% for videos.

Index Terms — transformer, deep learning, CTC, hybrid ASR

1. INTRODUCTION

In this paper, we propose the processing of features not only in the input layer of a deep network, but in the intermediate layers as well. We are motivated by a desire to enable a neural network acoustic model to adaptively process the features depending on partial hypotheses and noise conditions. Many previous methods for adaptation have operated by linearly transforming either input features or intermediate layers in a two-pass process, where the transform is learned to maximize the likelihood of some adaptation data [1, 2, 3]. Other methods have involved characterizing the input via factor analysis or i-vectors [4, 5]. Here, we suggest an alternative approach in which adaptation can be achieved by re-presenting the feature stream at an intermediate layer of the network that is constructed to be correlated with the ultimate graphemic or phonetic output of the system.

We present this work in the context of Transformer networks [6]. Transformers have become a popular deep learning architecture for modeling sequential datasets, showing improvements in many tasks such as machine translation [6] and language modeling [7]. In the speech recognition field, Transformers have been proposed to replace recurrent neural network (RNN) architectures such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks [8]. A recent survey of Transformers in many speech-related applications may be found in [9]. Compared to RNNs, Transformers have several advantages, specifically an ability to aggregate information across all the time-steps by using a self-attention mechanism. Unlike RNNs, the hidden representations do not need to be computed sequentially across time, thus enabling significant efficiency improvements via parallelization.

In the context of Transformer modules, secondary feature analysis is enabled through an additional mid-network transformer module that has access both to previous-layer activations and the raw features. To implement this model, we apply the objective function several times at the intermediate layers, to encourage the development of phonetically relevant hypotheses. Interestingly, we find that the iterated use of an auxiliary loss in the intermediate layers significantly improves performance by itself, as well as enabling the secondary feature analysis.

This paper makes two main contributions:

1. We present improvements in the basic training process of deep transformer networks, specifically the iterated use of connectionist temporal classification (CTC) or cross-entropy (CE) losses in intermediate layers, and
2. We show that an intermediate-layer attention model with access to both previous-layer activations and raw feature inputs can significantly improve performance.

We evaluate our proposed model on Librispeech and a large-scale video dataset. From our experimental results, we observe 10-20% relative improvement on Librispeech and 3.2-11% on the video dataset.

2. TRANSFORMER MODULES

A transformer network [6] is a powerful approach to learning and modeling sequential data. A transformer network is itself constructed from a series of transformer modules that each perform some processing. Each module has a self-attention mechanism and several feed-forward layers, enabling easy parallelization over time-steps compared to recurrent models such as RNNs or LSTMs [10]. We use the architecture defined in [6], and provide only a brief summary below.

Assume we have an input sequence of length S: X = [x_1, ..., x_S]. Each x_i is itself a vector of activations. A transformer layer encodes X into a corresponding output representation Z = [z_1, ..., z_S] as described below.

Transformers are built around the notion of a self-attention mechanism that is used to extract the relevant information for each time-step s from all time-steps [1..S] in the preceding layer. Self-attention is defined in terms of a Query, Key, Value triplet {Q, K, V} ∈ R^{S × d_k}. In self-attention, the queries, keys and values are the columns of the input itself, [x_1, ..., x_S]. The output activations are computed as:

Attn(Q, K, V) = softmax(Q K^T / √d_k) V.    (1)

Transformer modules deploy a multi-headed version of self-attention. As described in [6], this is done by linearly projecting the queries, keys and values P times with different, learned linear projections. Self-attention is then applied to each of these projected versions of the Queries, Keys and Values. These are concatenated and once again projected, resulting in the final values. We refer to the input projection matrices as W^Q_p, W^K_p, W^V_p, and to the output projection as W^O. Multi-head attention is implemented as:

MultiAttn(Q, K, V) = concat(V̄_1, ..., V̄_P) W^O    (2)
where ∀ p ∈ {1..P}, V̄_p = Attn(Q W^Q_p, K W^K_p, V W^V_p).    (3)

Here, W^Q_p, W^K_p, W^V_p ∈ R^{d_k × d_m}, d_m = d_k / P, and W^O ∈ R^{P d_m × d_k}.

After self-attention, a transformer module applies a series of linear layer, ReLU, layer-norm and dropout operations, as well as residual connections. The full sequence of processing is illustrated in Figure 1.

Fig. 1. A Transformer Module.
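To make Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch, not the implementation used in this work: it folds the P per-head projections into single d_k × d_k matrices and omits the feed-forward, dropout, layer-norm and residual operations of the full module in Figure 1.

```python
import torch
import torch.nn.functional as F

def attn(Q, K, V):
    """Scaled dot-product attention, Eq. (1): softmax(Q K^T / sqrt(d)) V."""
    d = Q.size(-1)                                   # key dimension (d_k, or d_m per head)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5      # (..., S, S)
    return F.softmax(scores, dim=-1) @ V             # (..., S, d)

class MultiHeadSelfAttention(torch.nn.Module):
    """Multi-head self-attention, Eqs. (2)-(3): P heads of size d_m = d_k / P."""
    def __init__(self, d_k, P):
        super().__init__()
        assert d_k % P == 0
        self.P, self.d_m = P, d_k // P
        # Per-head projections W^Q_p, W^K_p, W^V_p fused into single matrices.
        self.W_Q = torch.nn.Linear(d_k, d_k, bias=False)
        self.W_K = torch.nn.Linear(d_k, d_k, bias=False)
        self.W_V = torch.nn.Linear(d_k, d_k, bias=False)
        self.W_O = torch.nn.Linear(d_k, d_k, bias=False)  # output projection W^O

    def forward(self, X):                                 # X: (S, d_k)
        S = X.size(0)
        def split_heads(Z):                               # (S, d_k) -> (P, S, d_m)
            return Z.view(S, self.P, self.d_m).transpose(0, 1)
        Q = split_heads(self.W_Q(X))
        K = split_heads(self.W_K(X))
        V = split_heads(self.W_V(X))
        heads = attn(Q, K, V)                             # (P, S, d_m): one Attn per head
        concat = heads.transpose(0, 1).reshape(S, -1)     # concat(V_bar_1, ..., V_bar_P)
        return self.W_O(concat)                           # (S, d_k)

# Example: a sequence of S = 100 frames with d_k = 512 and P = 8 heads.
X = torch.randn(100, 512)
Z = MultiHeadSelfAttention(d_k=512, P=8)(X)               # Z has shape (100, 512)
```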
3. ITERATED FEATURE PRESENTATION

In this section, we present our proposal for allowing the network to (re)-consider the input features in the light of intermediate processing. We do this by again deploying a self-attention mechanism to combine the information present in the original features with the information available in the activations of an intermediate layer. As described earlier, we calculate the output posteriors and an auxiliary loss at the intermediate layer as well. The overall architecture is illustrated in Figure 2. Here, we have used a 24-layer network, with feature re-presentation after the 12th layer.

Fig. 2. A 24-layer transformer with one auxiliary loss and feature re-presentation in the 12th layer. Z_0 represents the input features. Orange boxes represent an additional MLP network and softmax. Green boxes represent linear projections and layer-norm.

In the following subsections, we provide detail on the feature re-presentation mechanism and the iterated loss calculation.

3.1. Feature Re-Presentation

We process the features in the intermediate layer by concatenating a projection of the original features with a projection of the previous hidden-layer activations, and then applying self-attention.

First, we project both the input features and the intermediate-layer activations (Z_0 ∈ R^{S × d_0}, Z_k ∈ R^{S × d_k}), apply layer normalization, and concatenate with a position encoding:

Z'_0 = cat([LayerNorm(Z_0 W_1), E], dim = 1)
Z'_k = cat([LayerNorm(Z_k W_2), E], dim = 1)

where d_0 is the input feature dimension, d_k is the Transformer output dimension, dim = 1 denotes concatenation on the feature axis, W_1 ∈ R^{d_0 × d_c}, W_2 ∈ R^{d_k × d_c}, and E ∈ R^{S × d_e} is a sinusoidal position encoding [6].

After projecting both sources of information to the same dimension, we merge them using time-axis concatenation:

O = cat([Z'_0, Z'_k], dim = 0) ∈ R^{2S × (d_c + d_e)}

Then, we extract the relevant features with an extra Transformer layer, followed by a linear projection and ReLU:

Z'_{k+1} = Transformer(Q = Z'_0, K = O, V = O)    (split A), or
Z'_{k+1} = Transformer(Q = Z'_k, K = O, V = O)    (split B)

Z_{k+1} = LayerNorm(ReLU(Z'_{k+1} W_3))
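The sketch below traces this data flow in PyTorch for the split-B variant (queries taken from Z'_k). It is an illustration under stated assumptions rather than the authors' implementation: a MultiheadAttention call stands in for the extra transformer layer (its feed-forward sub-layer is omitted), d_e is assumed even, and the number of heads is assumed to divide d_c + d_e.

```python
import torch

def sinusoidal_positions(S, d_e):
    """Sinusoidal position encoding E in R^{S x d_e}, as in [6]; assumes d_e is even."""
    pos = torch.arange(S, dtype=torch.float32).unsqueeze(1)             # (S, 1)
    div = torch.pow(10000.0, torch.arange(0, d_e, 2).float() / d_e)     # (d_e / 2,)
    E = torch.zeros(S, d_e)
    E[:, 0::2] = torch.sin(pos / div)
    E[:, 1::2] = torch.cos(pos / div)
    return E

class FeatureRePresentation(torch.nn.Module):
    """Sketch of Sec. 3.1: project Z_0 and Z_k to d_c, append position encodings,
    concatenate along time into O, attend over O with queries Z'_k (split B),
    then apply the linear projection W_3, ReLU and layer-norm."""
    def __init__(self, d_0, d_k, d_c, d_e, n_heads):
        super().__init__()
        self.d_e = d_e
        self.W_1 = torch.nn.Linear(d_0, d_c, bias=False)   # W_1 in R^{d_0 x d_c}
        self.W_2 = torch.nn.Linear(d_k, d_c, bias=False)   # W_2 in R^{d_k x d_c}
        self.ln_0 = torch.nn.LayerNorm(d_c)
        self.ln_k = torch.nn.LayerNorm(d_c)
        # Attention over the merged sequence O; n_heads must divide d_c + d_e.
        self.attn = torch.nn.MultiheadAttention(d_c + d_e, n_heads)
        self.W_3 = torch.nn.Linear(d_c + d_e, d_k, bias=False)
        self.ln_out = torch.nn.LayerNorm(d_k)

    def forward(self, Z_0, Z_k):                           # Z_0: (S, d_0), Z_k: (S, d_k)
        S = Z_0.size(0)
        E = sinusoidal_positions(S, self.d_e).to(Z_0)
        Zp_0 = torch.cat([self.ln_0(self.W_1(Z_0)), E], dim=1)   # feature-axis concat
        Zp_k = torch.cat([self.ln_k(self.W_2(Z_k)), E], dim=1)
        O = torch.cat([Zp_0, Zp_k], dim=0)                       # time-axis concat: (2S, d_c + d_e)
        # Q = Z'_k, K = V = O; a batch dimension of 1 is added for MultiheadAttention.
        out, _ = self.attn(Zp_k.unsqueeze(1), O.unsqueeze(1), O.unsqueeze(1))
        Zp_next = out.squeeze(1)                                 # (S, d_c + d_e)
        return self.ln_out(torch.relu(self.W_3(Zp_next)))        # Z_{k+1}: (S, d_k)
```

Because O stacks Z'_0 and Z'_k along the time axis, each query position can attend to both the original input features and the layer-k activations within a single attention operation.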
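The iterated loss referred to above attaches an additional output head (an MLP followed by a softmax, as in Figure 2) to the intermediate layer and applies the CTC or CE objective there as well as at the final layer. As a rough sketch only, assuming the auxiliary loss is simply interpolated with the final loss (the head modules and the weight below are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def iterated_ctc_loss(z_mid, z_final, head_mid, head_final,
                      targets, input_lengths, target_lengths, aux_weight=0.3):
    """Total loss = final-layer CTC loss + aux_weight * intermediate-layer CTC loss.
    `head_mid` / `head_final` are hypothetical output heads mapping activations of
    shape (S, N, d_k) to vocabulary logits; `aux_weight` is an assumed weight."""
    log_probs_mid = F.log_softmax(head_mid(z_mid), dim=-1)        # (S, N, vocab)
    log_probs_final = F.log_softmax(head_final(z_final), dim=-1)  # (S, N, vocab)
    loss_mid = F.ctc_loss(log_probs_mid, targets, input_lengths, target_lengths)
    loss_final = F.ctc_loss(log_probs_final, targets, input_lengths, target_lengths)
    return loss_final + aux_weight * loss_mid
```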