A New Training Pipeline for an Improved Neural Transducer

Albert Zeyer 1,2, André Merboldt 1, Ralf Schlüter 1,2, Hermann Ney 1,2

1 Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition,
Department of Mathematics, Computer Science and Natural Sciences, RWTH Aachen University
2 AppTek
Introduction

Motivation:
• Model which allows for time-synchronous decoding

Generalization of, extension of, and comparison between:
• RNN transducer (RNN-T) and Recurrent Neural Aligner (RNA) models
• CTC, RNA and RNN-T label topologies
  – Explicit blank label, or separate emit sigmoid

Training criteria:
• Full sum (FS) over all possible alignments (standard RNN-T loss)
• Frame-wise cross-entropy (CE)
  – Allows more powerful models

Setup:
• End-to-end on subword BPE units
• Experiments on Switchboard 300h
• All code & setups published: github.com/rwth-i6/returnn-experiments
Label topologies

• Generalize over the label topologies to compare them.
• Input sequence x_1^T, output/target (non-blank) sequence y_1^N, alignment α_1^U,
  with α_u ∈ Σ′ := {⟨b⟩} ∪ Σ and y_n ∈ Σ.
  T input sequence length, N output sequence length, U alignment sequence length.
• Blank label ⟨b⟩ can be part of the labels, or handled by a separate emit sigmoid.
• Labels Σ are 1k BPE subword units in this work.

[Figure: the same BPE label sequence aligned under the three topologies, with repeated labels and ⟨b⟩ blanks between the units.]

CTC [Graves et al., 2006]:
• Time-sync., label repetition
• U = T, Δt ≡ 1, t_u = u, Δn(α_u) = 1_{α_u ≠ ⟨b⟩ ∧ α_u ≠ α_{u−1}}

RNN-T [Graves, 2012]:
• Only blank moves forward in time (no label rep. in the topology)
• U = N + T, Δt(α) = 1_{α = ⟨b⟩}, Δn(α) = 1_{α ≠ ⟨b⟩}

RNA [Sak et al., 2017] or monotonic RNN-T [Tripathi et al., 2019]:
• Time-sync., no label rep.
• U = T, Δt ≡ 1, t_u = u, Δn(α) = 1_{α ≠ ⟨b⟩}
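As a concrete illustration (not part of the original slides), the topologies differ in how an alignment α_1^U collapses to the output sequence y_1^N. A minimal Python sketch; BLANK and the example labels are hypothetical stand-ins for the 1k BPE units:

BLANK = "<b>"  # hypothetical sentinel for the blank label

def collapse_ctc(alignment):
    """CTC: merge repeated labels, then remove blanks (U = T)."""
    out, prev = [], None
    for a in alignment:
        if a != BLANK and a != prev:
            out.append(a)
        prev = a
    return out

def collapse_rnnt(alignment):
    """RNN-T: remove blanks only; repetitions are real labels (U = N + T)."""
    return [a for a in alignment if a != BLANK]

def collapse_rna(alignment):
    """RNA / monotonic RNN-T: same collapse as RNN-T, but the topology
    forbids label repetition and forces U = T (at most one label per frame)."""
    return [a for a in alignment if a != BLANK]

# The same target sequence under the three topologies:
assert collapse_ctc(["n", "n", BLANK, "g", "g", BLANK, "o"]) == ["n", "g", "o"]        # U = T
assert collapse_rnnt(["n", "g", BLANK, "o", BLANK, BLANK]) == ["n", "g", "o"]          # U = N + T
assert collapse_rna(["n", BLANK, "g", BLANK, BLANK, "o", BLANK]) == ["n", "g", "o"]    # U = T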
Generalize RNN-T and RNA

[Figure sequence, building up from RNN-T to the generalized model:]
• Original RNN transducer (RNN-T) model [Graves, 2012]: Encoder Network on x_1^T, Prediction Network on y_{n−1}, Joint Network producing Output_u.
• RNN-T unrolled over alignment frames u: Encoder, Pred. Network, Joint Network, Output per u.
• Monotonic RNN-T [Tripathi et al., 2019] (time-synchronous): Encoder, Pred. Network, Joint Network, Output per time frame t.
• Recurrent Neural Aligner (RNA) model [Sak et al., 2017]: Encoder and a single RNN, Output per time frame t.
• Generalized (time-sync.) transducer model: Encoder (per t), FastRNN (per t), SlowRNN (per label n), Output per t.
Transducer model

Unrolled generalized model over alignment frames u. Dependencies between the components are optional.
• Encoder(x_1^{T'}):
  – BLSTM
  – potentially downsampled (T' input frames → T encoder frames h_1^T)
• SlowRNN:
  – LSTM or FFNN
  – one step per (non-blank) label n
  – acts like a language model (LM)
• FastRNN:
  – LSTM+FFNN or FFNN
  – one step per alignment frame u (per time frame t if time-sync.)
• Output_u ≡ α_u ∈ Σ ∪ {⟨b⟩}: blank or a new (non-blank) label per alignment frame.
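The slow/fast decomposition can be made concrete with a small sketch (my own illustration, not the published RETURNN setup): the FastRNN steps once per alignment frame, while the SlowRNN only steps after a non-blank label was emitted. fast_rnn, slow_rnn and readout are hypothetical callables standing in for the trained networks:

def transducer_greedy(encoder_states, fast_rnn, slow_rnn, readout, blank="<b>"):
    """Greedy unrolled decoding, time-sync. case (one step per encoder frame).
    FastRNN steps every frame; SlowRNN only steps after a non-blank label."""
    s_fast = fast_rnn.initial_state()
    s_slow = slow_rnn.initial_state()
    prev_label = blank  # alpha_{u-1}, fed back to the FastRNN
    output = []
    for h_t in encoder_states:
        s_fast = fast_rnn.step(s_fast, s_slow, prev_label, h_t)
        label = readout.argmax(s_fast, s_slow)  # alpha_u in Sigma + {<b>}
        if label != blank:
            output.append(label)
            # SlowRNN advances once per (non-blank) label, like an LM;
            # simplification: we feed the current frame's encoder state.
            s_slow = slow_rnn.step(s_slow, label, h_t)
        prev_label = label
    return output  # y_1^N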
Transducer model

p(y_1^N | x_1^{T'}) := Σ_{α_1^U : (T, y_1^N)} ∏_{u=1}^U p(α_u | α_1^{u−1}, h_1^T)

Explicit blank label:
  p(α_u | …) := softmax_{Σ′}(Readout(s_u^fast, s_{n_u}^slow))

or separate emit sigmoid:
  p(α_u = ⟨b⟩ | …) := σ(Readout_b(s_u^fast, s_{n_u}^slow))
  p(α_u ≠ ⟨b⟩ | …) = σ(−Readout_b(s_u^fast, s_{n_u}^slow))
  q(α_u | …) := softmax_Σ(Readout_y(s_u^fast, s_{n_u}^slow)),  α_u ∈ Σ
  p(α_u | …) := p(α_u ≠ ⟨b⟩ | …) · q(α_u | …),  α_u ∈ Σ

with
  h_1^T := Encoder(x_1^{T'})
  s_u^fast := FastRNN(s_{u−1}^fast, s_{n_u}^slow, α_{u−1}, h_{t_u})
  s_{n_u}^slow := SlowRNN(s_{n_u−1}^slow, α_{u'−1}, h_{t_{u'}}),  u' := min{k | k ≤ u, n_k = n_u}
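To make the two readout variants concrete, here is a small numpy sketch (an illustration under the definitions above, not the original implementation). Note that σ(−z) = 1 − σ(z), so the emit-sigmoid variant also yields a proper distribution over Σ′:

import numpy as np

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def output_dist_explicit_blank(readout_logits):
    """Explicit blank: one softmax over Sigma' = {<b>} + Sigma."""
    return softmax(readout_logits)  # index 0 = blank, rest = labels

def output_dist_emit_sigmoid(blank_logit, label_logits):
    """Separate emit sigmoid: p(<b>) = sigmoid(blank_logit);
    p(label) = (1 - p(<b>)) * q(label), with q = softmax over Sigma."""
    p_blank = 1.0 / (1.0 + np.exp(-blank_logit))
    q = softmax(label_logits)
    return p_blank, (1.0 - p_blank) * q

# Both variants give a proper distribution over Sigma' (it sums to 1):
p_blank, p_labels = output_dist_emit_sigmoid(0.3, np.random.randn(5))
assert np.isclose(p_blank + p_labels.sum(), 1.0)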
Training

Full-sum (FS) loss:
  L_FS := −log p(y_1^N | x_1^{T'}) = −log Σ_{α_1^U : (T, y_1^N)} p(α_1^U | x_1^{T'})
• To be able to calculate this efficiently:
  – zeroth- or first-order dependency on α (but no restriction on y)

Frame-wise cross-entropy (CE) loss:
  L_CE := −log p(α_1^U | x_1^{T'})
(−) Needs an alignment α_1^U; we use a fixed alignment.
(+) Much faster calculation (twice as fast training)
(+) Faster and more stable convergence
(+) Chunked training:
  – even faster, because of less zero padding
  – makes use of all training data (no filtering by sequence length)
  – very effective regularization (16.3% → 14.7% WER)
(+) Other common methods apply: label smoothing, focal loss, …
(+) No restriction on the order of dependencies ⇒ allows more powerful models, required for our extended generalized model
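A minimal sketch of the CE loss and of chunking (my own illustration; it assumes a time-synchronous topology, U = T, so features and alignment share the time axis, and chunk_size/step are hypothetical hyperparameters):

import numpy as np

def frame_wise_ce(log_probs, alignment):
    """L_CE = -sum_u log p(alpha_u | ...), for a fixed alignment.
    log_probs: (U, |Sigma'|) per-frame log distributions from the model;
    alignment: (U,) integer label indices alpha_1^U (incl. blank)."""
    alignment = np.asarray(alignment)
    return -log_probs[np.arange(len(alignment)), alignment].sum()

def chunks(features, alignment, chunk_size, step):
    """Chunked training: cut (features, alignment) into overlapping windows.
    Short, equal-length chunks mean little zero padding, no filtering of
    long sequences, and act as a strong regularizer."""
    for start in range(0, max(1, len(features) - chunk_size + 1), step):
        yield (features[start:start + chunk_size],
               alignment[start:start + chunk_size])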
Training pipeline

1. Start from scratch with FS training. Train 25 epochs.
2. Calculate an alignment with the FS-trained model.
3. Start from scratch with the extended model and CE training on that alignment. Train 50 epochs.
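Step 2 can, for example, be realized as a Viterbi alignment under the FS-trained model. The sketch below is my own simplified illustration: it assumes the RNA topology (U = T), a zeroth-order model (per-frame scores independent of the label history), and T ≥ N:

import numpy as np

def viterbi_align(log_probs, targets, blank=0):
    """Best time-sync. alignment alpha_1^T of targets y_1^N (RNA topology).
    log_probs: (T, |Sigma'|) per-frame log distributions; targets: (N,) ids.
    score[n] = best log prob of having emitted the first n labels so far."""
    targets = np.asarray(targets)
    T, N = len(log_probs), len(targets)
    score = np.full(N + 1, -np.inf)
    score[0] = 0.0
    emitted = np.zeros((T, N + 1), dtype=bool)  # backpointers
    for t in range(T):
        stay = score + log_probs[t, blank]             # emit <b> at frame t
        emit = np.full(N + 1, -np.inf)
        emit[1:] = score[:-1] + log_probs[t, targets]  # emit the next label
        emitted[t] = emit > stay
        score = np.where(emitted[t], emit, stay)
    # Backtrace from the final state (t = T, n = N):
    alignment, n = [], N
    for t in reversed(range(T)):
        if emitted[t, n]:
            n -= 1
            alignment.append(int(targets[n]))
        else:
            alignment.append(blank)
    return alignment[::-1]  # alpha_1^T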
Decoding

Decision rule:
  x_1^{T'} ↦ arg max_{N, y_1^N} p(y_1^N | x_1^{T'}) ≈ arg max_{U, α_1^U} ∏_{u=1}^U p(α_u | α_1^{u−1}, x_1^{T'})

• Beam search decoding, fixed beam size (12 hyps.)
• Hypotheses: partially finished sequences α_1^u
• Pruning based on the scores p(α_1^u | x_1^{T'})
• Time-synchronous (in case U = T), or synchronous over the axis {1, …, U}
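A minimal sketch of this beam search (my own illustration; step_scores is a hypothetical callable wrapping the model's per-frame distribution p(α_u | α_1^{u−1}, x_1^{T'})):

def beam_search(step_scores, num_frames, beam_size=12, blank="<b>"):
    """Time-sync. beam search over partial alignments alpha_1^u.
    step_scores(prefix, t) returns {label: log p(label | prefix, x)}
    for frame t, for all labels in Sigma + {<b>}."""
    beam = [((), 0.0)]  # (alignment prefix, log score)
    for t in range(num_frames):
        expanded = []
        for prefix, score in beam:
            for label, logp in step_scores(prefix, t).items():
                expanded.append((prefix + (label,), score + logp))
        # prune to the fixed beam size, based on p(alpha_1^u | x_1^{T'})
        beam = sorted(expanded, key=lambda hyp: -hyp[1])[:beam_size]
    best_alignment, _ = beam[0]
    return [a for a in best_alignment if a != blank]  # collapse to y_1^N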
Experiments

Ablations and variations
• Switchboard 300h, WER on Hub5'00
• Transducer baselines B1 and B2:
  – without external LM
  – full label (α & y) context, FastRNN/SlowRNN are LSTMs
  – separate emit sigmoid (no explicit blank label)

  Variant                                                       WER[%]
                                                                B1     B2
  Baseline                                                      14.7   14.5
  No chunked training                                           16.3   15.7
  SlowRNN always updated (not slow)                             14.8   14.8
  No SlowRNN (exactly RNA)                                      14.8   14.7
  No encoder feedback to SlowRNN (like RNN-T)                   14.9   14.7
   + no FastRNN α label feedback (like RNN-T)                   14.9   14.5
   + no FastRNN, just Joint Network (exactly RNN-T)             15.2   15.1
  No separate emit sigmoid (explicit blank) (like RNN-T/RNA)    14.9   14.9