On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, Patrick Nguyen
Interspeech 2019, Graz, Austria


  1. On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
     Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, Patrick Nguyen
     Interspeech 2019, Graz, Austria. September 19, 2019

  2. Background: sequence-to-sequence speech recognition
  ● Directly outputs character-based units: graphemes, BPE, word-pieces.
  ● Key to making it "end-to-end": no need for a pronunciation lexicon.
  ● Jointly learns acoustic, pronunciation, and language modeling in a single model.
  ● But phonemes may be more natural units for acoustic modeling.
  ● In hybrid systems, context-dependent phonemes work best.
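To make the unit types concrete, here is how a single word might decompose under each of them. The word-piece split is a plausible hypothetical (marker conventions vary between toolkits), not the output of the actual 16K word-piece model; the phonemes follow CMU-dict-style notation with stress digits.

```python
# Illustrative only: the same word under the three unit types discussed here.
word = "ringing"
graphemes   = list(word)                       # ['r','i','n','g','i','n','g']
word_pieces = ["ring", "##ing"]                # hypothetical word-piece split
phonemes    = ["R", "IH1", "NG", "IH0", "NG"]  # from a pronunciation lexicon
```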

  3. Previous Work
  Sainath et al. (ICASSP 2018), "No Need For A Lexicon? Evaluating The Value Of The Pronunciation Lexica In End-To-End Models":
  ● Grapheme models outperform phoneme-based S2S models.
  ● Phoneme models win on proper nouns, e.g., [Grapheme] "Charles Lindberg" vs. [Phoneme] "Charles Lindbergh" (correct!).
  ● Very large-scale tasks: 12.5K-hour English Voice Search (and a 27.5K-hour multi-dialect task).

  4. Unanswered Questions
  ● Does this phoneme-vs-grapheme trend depend on the amount of training data? In hybrid systems, the amount of data matters; e.g., Sung et al. (ICASSP 2008), "Revisiting graphemes with increased amount of data".
  ● Can we make use of the pronunciation lexicon to improve S2S ASR? Example from Sainath et al.: [Grapheme] "Charles Lindberg" vs. [Phoneme] "Charles Lindbergh" (correct!).

  5. This work
  Systematic evaluation on a publicly available dataset:
  ● Evaluate phonemic models under different amounts of training data on LibriSpeech.
  Investigation of the complementarity of phoneme and character-based units:
  ● Separate phonemic and grapheme/word-piece models.
  ● Model combination approach: the phoneme-vs-grapheme decision is left to the score combination method (no decision taken by the model itself).
  Analysis of output hypotheses.

  6. LibriSpeech Dataset
  ● Training data settings: 100h, 460h, and 960h.
  ● Evaluation data: dev (clean, other), test (clean, other).
  Pronunciation lexicon (official):
  ● 200K vocabulary.
  ● Pronunciations include stress markers ("ah0", "ah1"): 70 phonemes.
  ● Average number of pronunciations per word: 1.03.
  ● Average word length in phonemes: 6.5 (max: 20).
  N-gram word LMs (official):
  ● Where needed, we use the 3-gram LM (no pruning).
  ● Trained on 800M words of extra text-only data.
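As a concrete illustration, a short script like the following could recompute these lexicon statistics from a CMU-dict-style file; the file name and the exact one-pronunciation-per-line format ("WORD PH1 PH2 ...") are assumptions.

```python
# Sketch: recompute the lexicon statistics quoted above from a
# pronunciation lexicon with one "WORD PH1 PH2 ..." entry per line.
from collections import defaultdict

prons = defaultdict(list)  # word -> list of pronunciations
with open("librispeech-lexicon.txt") as f:  # assumed file name
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            prons[parts[0]].append(parts[1:])

phones = {p for variants in prons.values() for v in variants for p in v}
n_prons = sum(len(v) for v in prons.values())
lengths = [len(p) for v in prons.values() for p in v]

print(f"vocabulary:          {len(prons)} words")
print(f"phoneme inventory:   {len(phones)} (stress variants counted separately)")
print(f"avg prons per word:  {n_prons / len(prons):.2f}")
print(f"avg/max word length: {sum(lengths) / len(lengths):.1f} / {max(lengths)} phonemes")
```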

  7. Model Architecture
  Standard Listen, Attend and Spell (LAS) model, Chan et al. & Zhang et al. (ICASSP 2016/2017):
  ● Input: 80-dim log-mel features + delta and acceleration coefficients.
  ● 2 CNN layers on the input (time reduction of 4).
  ● Bidirectional LSTM encoder.
  ● Attention, LSTM decoder.
  ● Models trained with TensorFlow Lingvo: https://github.com/tensorflow/lingvo
  ● We consider 3 model sizes (Small / Medium / Large, with LSTM sizes 256 / 512 / 1024) to find the best fit for different scenarios.
  A minimal encoder sketch follows this list.
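This is a minimal sketch of the encoder shape just described, not the actual Lingvo implementation: the filter count, the LSTM depth, and the frequency striding of the convolutions are assumptions; only the two conv layers (time reduction of 4), the bidirectional LSTM stack, and the LSTM widths come from the slide.

```python
import tensorflow as tf

def las_encoder(lstm_units=1024, num_lstm_layers=3):
    # Input: (time, 80 log-mel bins, 3 channels: static + delta + acceleration)
    feats = tf.keras.Input(shape=(None, 80, 3))
    x = feats
    for _ in range(2):  # stride 2 in time, twice: overall time reduction of 4
        x = tf.keras.layers.Conv2D(32, 3, strides=(2, 2),
                                   padding="same", activation="relu")(x)
    # Fold the (reduced) frequency axis and the channels into one feature axis.
    x = tf.keras.layers.Reshape((-1, (80 // 4) * 32))(x)
    for _ in range(num_lstm_layers):  # bidirectional LSTM encoder stack
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(lstm_units, return_sequences=True))(x)
    return tf.keras.Model(feats, x)

encoder = las_encoder(lstm_units=256)  # the "Small" configuration
```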

  8. Baseline Grapheme/Word-Piece Models (960h training set)

     Unit             Params  Dev-Clean  Dev-Other  Test-Clean  Test-Other
     Grapheme          35M      5.3        15.6       5.6         15.8
     Grapheme         130M      5.3        15.2       5.5         15.3
     Word-Piece 16K    60M      4.9        14.0       5.0         14.1
     Word-Piece 16K   180M      4.4        13.2       4.7         13.4
      + LSTM LM                 3.3        10.3       3.6         10.3

  ● Good baselines without SpecAugment (Park et al., Interspeech 2019!).

  9. Phonemic models
  Convert the target transcriptions to phoneme sequences and train the sequence-to-sequence model on those. This introduces two issues:
  ● For training, we must choose a pronunciation for words with multiple pronunciation variants.
  ● For recognition, the model cannot distinguish homophones without an LM.
  In our phonemic models:
  ● Randomly choose one of the pronunciations for each word and fix this deterministic mapping before training (a trade-off for simplicity; see the sketch below).
  ● WFST decoder: beam search constrained by a lexicon (L) and combined with a language model (G) score.
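A minimal sketch of that deterministic pronunciation mapping: pick one variant per word at random, once, before training, and reuse it everywhere. The fixed seed (for reproducibility) and the toy lexicon entries are assumptions.

```python
import random

def freeze_lexicon(lexicon, seed=0):
    """lexicon: dict word -> list of pronunciations (each a phoneme list).
    Returns a deterministic word -> single-pronunciation mapping."""
    rng = random.Random(seed)  # fixed seed: same mapping on every run
    return {word: rng.choice(variants) for word, variants in lexicon.items()}

lexicon = {"the":  [["DH", "AH0"], ["DH", "IY0"]],
           "read": [["R", "IY1", "D"], ["R", "EH1", "D"]]}
pron_of = freeze_lexicon(lexicon)  # one fixed pronunciation per word
```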

  10. Training phonemic models (cont'd)
  Further specifications (illustrated in the sketch below):
  ● End-of-word token <eow> in the vocabulary.
  ● Sentence-end token <eos> in the vocabulary.
  ● Out-of-lexicon words map to "<unk> <eow>".
  A further consequence: what would a phoneme-level LM in the decoder buy us? It would act like a weak class-based LM, since homophonic spellings share one phoneme sequence; it cannot prefer (1) over the garbled (2):
  (1) as i approached the city i heard bells ringing
  (2) eze ai approached thaa citty aie her'd belhs ringing
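A minimal sketch of the target construction with these tokens: each word maps to its frozen pronunciation followed by <eow>, out-of-lexicon words become "<unk> <eow>", and the utterance ends with <eos>. The toy lexicon entries are illustrative only.

```python
pron_of = {"i": ["AY1"], "heard": ["HH", "ER1", "D"],
           "bells": ["B", "EH1", "L", "Z"]}

def to_phoneme_targets(words, pron_of):
    targets = []
    for w in words:
        # Frozen pronunciation, or <unk> for out-of-lexicon words.
        targets += pron_of.get(w, ["<unk>"]) + ["<eow>"]
    return targets + ["<eos>"]

print(to_phoneme_targets("i heard bells ringing".split(), pron_of))
# ['AY1', '<eow>', 'HH', 'ER1', 'D', '<eow>', 'B', 'EH1', 'L', 'Z',
#  '<eow>', '<unk>', '<eow>', '<eos>']   ("ringing" is not in the toy lexicon)
```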

  11. WER results on the 960h training set

     Model            Dev-Clean  Dev-Other  Test-Clean  Test-Other
     Phoneme + LG       5.6        15.8       6.2         15.8
     Grapheme           5.3        15.2       5.5         15.3
     Word-Piece 16K     4.4        13.2       4.7         13.4

  ● We use the official word-level 3-gram LM as G (trained on 800M words of extra text).
  ● Phoneme + LG is comparable to but slightly worse than the grapheme model: similar observation to Sainath et al.
  ● No improvement for the grapheme model when decoded with L and G.

  12. Examples where the phonemic model wins over the word-piece model

     Word-Piece                     Phoneme
     when did you come partly       when did you come bartley
     kerklin jumped for the jetty   kirkland jumped for the jetty
     man's eyes were made fixed     man's eyes remained fixed

  ● "bartley" is unseen in the training data.
  ● "kirkland" is in the training data.

  13. Does this extrapolate to other data-size scenarios?

                Unique   Unseen word rate (%)
                words    Dev-Clean  Dev-Other  Test-Clean  Test-Other
     Lexicon    200K       0.3        0.6        0.4         0.5
     960h        89K       0.6        0.8        0.6         0.8
     460h        66K       0.9        1.2        1.0         1.3
     100h        34K       2.5        2.5        2.4         2.8

  ● Less data: higher unseen-word rate.
  ● Word-piece models are trained on the corresponding portion of data.
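For concreteness, the unseen-word rate in this table is the fraction of running words in an evaluation set that never occur in the training transcripts; a sketch of that computation follows, with the file names as assumptions.

```python
def word_set(path):
    """All distinct words appearing in a transcript file."""
    with open(path) as f:
        return {w for line in f for w in line.split()}

train_vocab = word_set("train_960h_transcripts.txt")   # assumed file name
with open("dev_clean_transcripts.txt") as f:           # assumed file name
    eval_words = [w for line in f for w in line.split()]

unseen = sum(w not in train_vocab for w in eval_words)
print(f"unseen word rate: {100 * unseen / len(eval_words):.1f}%")
```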

  14. WER results on 460h and 100h training data

     Train  Model            Dev-Clean  Dev-Other  Test-Clean  Test-Other
     460h   Phoneme + LG       7.6        27.3       8.5         27.8
            Grapheme           6.4        23.5       6.8         24.1
            Word-Piece 16K     5.7        21.8       6.5         22.5
     100h   Phoneme + LG      13.8        38.9      14.3         40.9
            Grapheme          11.6        36.1      12.0         38.0
            Word-Piece 16K    12.7        33.9      12.9         35.5

  ● Lower-resource conditions are not more favorable to phonemic models.
  ● Large degradation with less data (compared with hybrid systems).

  15. Model combination experiments
  Can we get benefits from the phonemic model through model combination?
  Rescoring:
  ● Generate an N-best list from one LAS model.
  ● Rescore it with another model.
  ● Log-linear score combination, with a weight optimized on the dev set.
  Cross-rescoring and union of N-best lists:
  ● Generate N-best lists from both models.
  ● Cross-rescore each list with the other model.
  ● Extract the 1-best from the union.
  A sketch of both schemes follows this list.
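A minimal sketch of the two combination schemes, assuming scores are log-probabilities and that `score_with`, `score_a`, and `score_b` stand in for running a model over a fixed hypothesis (teacher forcing); all names here are illustrative, not from the paper's codebase.

```python
def rescore(nbest, score_with, weight):
    """nbest: list of (hypothesis, first_model_score) pairs.
    Returns the argmax of score_1(h) + weight * score_2(h)."""
    return max(nbest, key=lambda hs: hs[1] + weight * score_with(hs[0]))[0]

def cross_rescore_union(nbest_a, nbest_b, score_a, score_b, weight):
    """Pool hypotheses from both models, score each with both models,
    and return the 1-best from the union."""
    pool = {h for h, _ in nbest_a} | {h for h, _ in nbest_b}
    return max(pool, key=lambda h: score_a(h) + weight * score_b(h))

# The combination weight is tuned on the dev set, e.g. by a grid search
# over candidate values that minimizes dev-set WER.
```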

  16. 8-best list rescoring results
  (Numbers in parentheses are oracle WERs of the 8-best lists.)

     Model          Dev-Clean   Dev-Other    Test-Clean  Test-Other
     Word-Piece     4.4 (2.4)   13.2 (9.2)   4.7 (2.6)   13.4 (9.1)
      + Phoneme     4.1         12.4         4.3         12.4
      + Grapheme    4.0         12.3         4.3         12.3
      + Both        3.9         12.2         4.3         12.2

  ● Improvements of up to 8% relative WER.
  ● Similar improvements from graphemic and phonemic model rescoring.

  17. 8-best list rescoring results (cont'd)
  Rescoring in the other direction, extending the table above:

     Model           Dev-Clean   Dev-Other     Test-Clean  Test-Other
     Phoneme + LG    5.6 (4.9)   15.8 (14.4)   6.2 (5.5)   15.8 (14.7)
      + Word-Piece   5.4         15.5          6.0         15.5

  ● Only limited improvements from rescoring the phonemic hypotheses.

  18. Examples where Word-Piece + Grapheme + Phoneme wins over Word-Piece + Grapheme

     Word-Piece + Grapheme                 Word-Piece + Grapheme + Phoneme
     oh bartly did you write to me         oh bartley did you write to me
     … lettuce leaf with mayonna is …      … lettuce leaf with mayonnaise …
     … eyes blaze of indignation           … eyes blazed with indignation

  19. Rescoring with an auxiliary decoder
  Why not put the two decoders into the same model? Shared encoder + two separate attention/decoder modules.

     Model                  Params  Dev-Clean  Dev-Other  Test-Clean  Test-Other
     Word-Piece             180M      4.4        13.2       4.7        13.4
      + Auxiliary Phoneme   210M      4.3        13.0       4.6        13.1
      + Separate Phoneme    310M      4.1        12.4       4.3        12.4

  ● The separate model gives larger improvements (with more parameters).
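A minimal sketch of this shared-encoder, two-decoder setup, trained with a weighted sum of the two decoder losses; the interpolation weight `aux_weight` and all names are assumptions for illustration, not the paper's configuration.

```python
def joint_loss(encoder, wp_decoder, ph_decoder,
               features, wp_targets, ph_targets, aux_weight=0.5):
    enc = encoder(features)                     # shared encoding of the audio
    loss_wp = wp_decoder.loss(enc, wp_targets)  # primary (word-piece) loss
    loss_ph = ph_decoder.loss(enc, ph_targets)  # auxiliary (phoneme) loss
    return loss_wp + aux_weight * loss_ph       # weighted multi-task loss
```

At rescoring time, the auxiliary phoneme decoder scores the word-piece model's hypotheses, presumably after mapping them to phoneme sequences through the lexicon.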

  20. Union of N-bests vs. simple rescoring
  Is it useful to also decode hypotheses with the phonemic model?

     Model        # Hyp.  Dev-Clean  Dev-Other  Test-Clean  Test-Other
     Word-Piece      8      4.4        13.2       4.7         13.4
      + Phoneme      8      4.1        12.4       4.3         12.4
     Union          16      4.1        12.4       4.3         12.3

  ● Limited improvements.

  21. Union of N-bests vs. simple rescoring (cont'd)
  ● It is better to increase the beam size of the word-piece model and rescore:

     Model        # Hyp.  Dev-Clean  Dev-Other  Test-Clean  Test-Other
     Word-Piece     16      4.4        13.2       4.7         13.4
      + Phoneme     16      4.0        12.3       4.3         12.2
