DEJA-VU: DOUBLE FEATURE PRESENTATION IN DEEP TRANSFORMER NETWORKS

Andros Tjandra 1∗, Chunxi Liu 2, Frank Zhang 2, Xiaohui Zhang 2, Yongqiang Wang 2, Gabriel Synnaeve 2, Satoshi Nakamura 1, Geoffrey Zweig 2
1 Nara Institute of Science and Technology, Japan
2 Facebook AI, USA
{andros.tjandra.ai6, s-nakamura}@is.naist.jp, {chunxiliu, frankz, xiaohuizhang, yqw, gab, gzweig}@fb.com
∗ This work was done while the first author was a research intern at Facebook.

ABSTRACT

Deep acoustic models typically receive features in the first layer of the network and process increasingly abstract representations in the subsequent layers. Here, we propose to feed the input features at multiple depths in the acoustic model. As our motivation is to allow acoustic models to re-examine their input features in light of partial hypotheses, we introduce intermediate model heads and loss functions. We study this architecture in the context of deep Transformer networks, and we use an attention mechanism over both the previous-layer activations and the input features. To train this model's intermediate output hypotheses, we apply the objective function at each layer right before feature re-use. We find that the use of such an iterated loss significantly improves performance by itself, as well as enabling input feature re-use. We present results on both Librispeech and a large-scale video dataset, with relative improvements of 10-20% for Librispeech and 3.2-13% for videos.

Index Terms — transformer, deep learning, CTC, hybrid ASR

1. INTRODUCTION

In this paper, we propose the processing of features not only in the input layer of a deep network, but in the intermediate layers as well. We are motivated by a desire to enable a neural network acoustic model to adaptively process the features depending on partial hypotheses and noise conditions. Many previous methods for adaptation have operated by linearly transforming either input features or intermediate layers in a two-pass process, where the transform is learned to maximize the likelihood of some adaptation data [1, 2, 3]. Other methods have involved characterizing the input via factor analysis or i-vectors [4, 5]. Here, we suggest an alternative approach in which adaptation can be achieved by re-presenting the feature stream at an intermediate layer of the network that is constructed to be correlated with the ultimate graphemic or phonetic output of the system.

We present this work in the context of Transformer networks [6]. Transformers have become a popular deep learning architecture for modeling sequential datasets, showing improvements in many tasks such as machine translation [6] and language modeling [7]. In the speech recognition field, Transformers have been proposed to replace recurrent neural network (RNN) architectures such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks [8]. A recent survey of Transformers in many speech-related applications may be found in [9]. Compared to RNNs, Transformers have several advantages, specifically an ability to aggregate information across all the time-steps by using a self-attention mechanism. Unlike RNNs, the hidden representations do not need to be computed sequentially across time, thus enabling significant efficiency improvements via parallelization.

In the context of Transformer modules, secondary feature analysis is enabled through an additional mid-network transformer module that has access both to previous-layer activations and the raw features. To implement this model, we apply the objective function several times at the intermediate layers, to encourage the development of phonetically relevant hypotheses. Interestingly, we find that the iterated use of an auxiliary loss in the intermediate layers significantly improves performance by itself, as well as enabling the secondary feature analysis.

This paper makes two main contributions:

1. We present improvements in the basic training process of deep transformer networks, specifically the iterated use of connectionist temporal classification (CTC) or cross-entropy (CE) losses in intermediate layers, and
2. We show that an intermediate-layer attention model with access to both previous-layer activations and raw feature inputs can significantly improve performance.

We evaluate our proposed model on Librispeech and a large-scale video dataset. From our experimental results, we observe 10-20% relative improvement on Librispeech and 3.2-11% on the video dataset.

2. TRANSFORMER MODULES

A transformer network [6] is a powerful approach to learning and modeling sequential data. A transformer network is itself constructed from a series of transformer modules that each perform some processing. Each module has a self-attention mechanism and several feed-forward layers, enabling easy parallelization over time-steps compared to recurrent models such as RNNs or LSTMs [10]. We use the architecture defined in [6], and provide only a brief summary below.

Assume we have an input sequence of length S: X = [x_1, ..., x_S]. Each x_i is itself a vector of activations. A transformer layer encodes X into a corresponding output representation Z = [z_1, ..., z_S] as described below.

Transformers are built around the notion of a self-attention mechanism that is used to extract the relevant information for each time-step s from all time-steps [1..S] in the preceding layer. Self-attention is defined in terms of a Query, Key, Value triplet {Q, K, V} ∈ R^{S × d_k}. In self-attention, the queries, keys and values are the columns of the input itself, [x_1, ..., x_S]. The output activations are computed as:

Attn(Q, K, V) = softmax(Q K^T / √d_k) V.    (1)

Transformer modules deploy a multi-headed version of self-attention. As described in [6], this is done by linearly projecting the queries, keys and values P times with different, learned linear projections. Self-attention is then applied to each of these projected versions of the Queries, Keys and Values. These are concatenated and once again projected, resulting in the final values. We refer to the input projection matrices as W^Q_p, W^K_p, W^V_p, and to the output projection as W^O. Multi-head attention is implemented as:

MultiAttn(Q, K, V) = concat(V̄_1, ..., V̄_P) W^O    (2)
where ∀ p ∈ {1..P}, V̄_p = Attn(Q W^Q_p, K W^K_p, V W^V_p).    (3)

Here, W^Q_p, W^K_p, W^V_p ∈ R^{d_k × d_m}, d_m = d_k / P, and W^O ∈ R^{P d_m × d_k}.

After self-attention, a transformer module applies a series of linear layer, ReLU, layer-norm and dropout operations, as well as residual connections. The full sequence of processing is illustrated in Figure 1.

Fig. 1. A Transformer Module.
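To make Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch, not the implementation used in this work: it folds the P per-head projections into single d_k × d_k matrices and omits the feed-forward, dropout, layer-norm and residual operations of the full module in Figure 1.

```python
import torch
import torch.nn.functional as F

def attn(Q, K, V):
    """Scaled dot-product attention, Eq. (1): softmax(Q K^T / sqrt(d)) V."""
    d = Q.size(-1)                                   # key dimension (d_k, or d_m per head)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5      # (..., S, S)
    return F.softmax(scores, dim=-1) @ V             # (..., S, d)

class MultiHeadSelfAttention(torch.nn.Module):
    """Multi-head self-attention, Eqs. (2)-(3): P heads of size d_m = d_k / P."""
    def __init__(self, d_k, P):
        super().__init__()
        assert d_k % P == 0
        self.P, self.d_m = P, d_k // P
        # Per-head projections W^Q_p, W^K_p, W^V_p fused into single matrices.
        self.W_Q = torch.nn.Linear(d_k, d_k, bias=False)
        self.W_K = torch.nn.Linear(d_k, d_k, bias=False)
        self.W_V = torch.nn.Linear(d_k, d_k, bias=False)
        self.W_O = torch.nn.Linear(d_k, d_k, bias=False)  # output projection W^O

    def forward(self, X):                                 # X: (S, d_k)
        S = X.size(0)
        def split_heads(Z):                               # (S, d_k) -> (P, S, d_m)
            return Z.view(S, self.P, self.d_m).transpose(0, 1)
        Q = split_heads(self.W_Q(X))
        K = split_heads(self.W_K(X))
        V = split_heads(self.W_V(X))
        heads = attn(Q, K, V)                             # (P, S, d_m): one Attn per head
        concat = heads.transpose(0, 1).reshape(S, -1)     # concat(V_bar_1, ..., V_bar_P)
        return self.W_O(concat)                           # (S, d_k)

# Example: a sequence of S = 100 frames with d_k = 512 and P = 8 heads.
X = torch.randn(100, 512)
Z = MultiHeadSelfAttention(d_k=512, P=8)(X)               # Z has shape (100, 512)
```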
3. ITERATED FEATURE PRESENTATION

In this section, we present our proposal for allowing the network to (re)-consider the input features in the light of intermediate processing. We do this by again deploying a self-attention mechanism to combine the information present in the original features with the information available in the activations of an intermediate layer. As described earlier, we calculate the output posteriors and an auxiliary loss at the intermediate layer as well. The overall architecture is illustrated in Figure 2. Here, we have used a 24-layer network, with feature re-presentation after the 12th layer.

Fig. 2. A 24-layer transformer with one auxiliary loss and feature re-presentation in the 12th layer. Z_0 represents the input features. Orange boxes represent an additional MLP network and softmax. Green boxes represent linear projections and layer-norm.

In the following subsections, we provide detail on the feature re-presentation mechanism and the iterated loss calculation.

3.1. Feature Re-Presentation

We process the features in the intermediate layer by concatenating a projection of the original features with a projection of the previous hidden-layer activations, and then applying self-attention.

First, we project both the input features and the intermediate-layer activations (Z_0 ∈ R^{S × d_0}, Z_k ∈ R^{S × d_k}), apply layer normalization, and concatenate with a position encoding:

Z'_0 = cat([LayerNorm(Z_0 W_1), E], dim = 1)
Z'_k = cat([LayerNorm(Z_k W_2), E], dim = 1)

where d_0 is the input feature dimension, d_k is the Transformer output dimension, dim = 1 denotes concatenation on the feature axis, W_1 ∈ R^{d_0 × d_c}, W_2 ∈ R^{d_k × d_c}, and E ∈ R^{S × d_e} is a sinusoidal position encoding [6].

After projecting both sources of information to the same dimension, we merge them using time-axis concatenation:

O = cat([Z'_0, Z'_k], dim = 0) ∈ R^{2S × (d_c + d_e)}

Then, we extract the relevant features with an extra Transformer layer, followed by a linear projection and ReLU:

Z'_{k+1} = Transformer(Q = Z'_0, K = O, V = O)    (split A), or
Z'_{k+1} = Transformer(Q = Z'_k, K = O, V = O)    (split B)

Z_{k+1} = LayerNorm(ReLU(Z'_{k+1} W_3))
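The sketch below traces this data flow in PyTorch for the split-B variant (queries taken from Z'_k). It is an illustration under stated assumptions rather than the authors' implementation: a MultiheadAttention call stands in for the extra transformer layer (its feed-forward sub-layer is omitted), d_e is assumed even, and the number of heads is assumed to divide d_c + d_e.

```python
import torch

def sinusoidal_positions(S, d_e):
    """Sinusoidal position encoding E in R^{S x d_e}, as in [6]; assumes d_e is even."""
    pos = torch.arange(S, dtype=torch.float32).unsqueeze(1)             # (S, 1)
    div = torch.pow(10000.0, torch.arange(0, d_e, 2).float() / d_e)     # (d_e / 2,)
    E = torch.zeros(S, d_e)
    E[:, 0::2] = torch.sin(pos / div)
    E[:, 1::2] = torch.cos(pos / div)
    return E

class FeatureRePresentation(torch.nn.Module):
    """Sketch of Sec. 3.1: project Z_0 and Z_k to d_c, append position encodings,
    concatenate along time into O, attend over O with queries Z'_k (split B),
    then apply the linear projection W_3, ReLU and layer-norm."""
    def __init__(self, d_0, d_k, d_c, d_e, n_heads):
        super().__init__()
        self.d_e = d_e
        self.W_1 = torch.nn.Linear(d_0, d_c, bias=False)   # W_1 in R^{d_0 x d_c}
        self.W_2 = torch.nn.Linear(d_k, d_c, bias=False)   # W_2 in R^{d_k x d_c}
        self.ln_0 = torch.nn.LayerNorm(d_c)
        self.ln_k = torch.nn.LayerNorm(d_c)
        # Attention over the merged sequence O; n_heads must divide d_c + d_e.
        self.attn = torch.nn.MultiheadAttention(d_c + d_e, n_heads)
        self.W_3 = torch.nn.Linear(d_c + d_e, d_k, bias=False)
        self.ln_out = torch.nn.LayerNorm(d_k)

    def forward(self, Z_0, Z_k):                           # Z_0: (S, d_0), Z_k: (S, d_k)
        S = Z_0.size(0)
        E = sinusoidal_positions(S, self.d_e).to(Z_0)
        Zp_0 = torch.cat([self.ln_0(self.W_1(Z_0)), E], dim=1)   # feature-axis concat
        Zp_k = torch.cat([self.ln_k(self.W_2(Z_k)), E], dim=1)
        O = torch.cat([Zp_0, Zp_k], dim=0)                       # time-axis concat: (2S, d_c + d_e)
        # Q = Z'_k, K = V = O; a batch dimension of 1 is added for MultiheadAttention.
        out, _ = self.attn(Zp_k.unsqueeze(1), O.unsqueeze(1), O.unsqueeze(1))
        Zp_next = out.squeeze(1)                                 # (S, d_c + d_e)
        return self.ln_out(torch.relu(self.W_3(Zp_next)))        # Z_{k+1}: (S, d_k)
```

Because O stacks Z'_0 and Z'_k along the time axis, each query position can attend to both the original input features and the layer-k activations within a single attention operation.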
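The iterated loss referred to above attaches an additional output head (an MLP followed by a softmax, as in Figure 2) to the intermediate layer and applies the CTC or CE objective there as well as at the final layer. As a rough sketch only, assuming the auxiliary loss is simply interpolated with the final loss (the head modules and the weight below are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def iterated_ctc_loss(z_mid, z_final, head_mid, head_final,
                      targets, input_lengths, target_lengths, aux_weight=0.3):
    """Total loss = final-layer CTC loss + aux_weight * intermediate-layer CTC loss.
    `head_mid` / `head_final` are hypothetical output heads mapping activations of
    shape (S, N, d_k) to vocabulary logits; `aux_weight` is an assumed weight."""
    log_probs_mid = F.log_softmax(head_mid(z_mid), dim=-1)        # (S, N, vocab)
    log_probs_final = F.log_softmax(head_final(z_final), dim=-1)  # (S, N, vocab)
    loss_mid = F.ctc_loss(log_probs_mid, targets, input_lengths, target_lengths)
    loss_final = F.ctc_loss(log_probs_final, targets, input_lengths, target_lengths)
    return loss_final + aux_weight * loss_mid
```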