

  1. A Spelling Correction Model for End-to-end Speech Recognition
     Jinxi Guo¹, Tara Sainath², Ron Weiss²
     ¹ Electrical and Computer Engineering, University of California, Los Angeles, USA
     ² Google
     ICASSP 2019, Brighton, UK

  2. Motivation
     ● End-to-end ASR models...
       ○ e.g. the "Listen, Attend, and Spell" sequence-to-sequence model [Chan et al, ICASSP 2016]
     ● are trained on fewer utterances than conventional systems
       ○ far fewer audio-text pairs than the text examples used to train language models
     ● tend to make errors on proper nouns and rare words
       ○ they don't learn how to spell words that are underrepresented in the training data
     ● but do a good job recognizing the underlying acoustic content
       ○ many errors are homophonous with the ground truth

  3. Listen, Attend, and Spell (LAS) errors on Librispeech
     ● misspells proper nouns
       ○ ground truth: hand over to trevelyan -> LAS: hand over to trevellion
     ● replaces words with near homophones
       ○ ground truth: on trevelyan's arrival -> LAS: on trevelyin's arrival
       ○ ground truth: a wandering tribe of the blemmyes -> LAS: a wandering tribe of the blamies
     ● sometimes inconsistently
       ○ ground truth: a wrangler's answered big foot -> LAS: a wrangler / a ringleurs / a angler answered big foot
     A language model (LM) trained on a large text corpus can be incorporated [Chorowski and Jaitly, Interspeech 2017], [Kannan et al, ICASSP 2018]

  4. Proposed Method
     ● Pass ASR hypotheses into a Spelling Correction (SC) model
       ○ Correct recognition errors directly
       ○ or create a richer n-best list by correcting each hypothesis in turn
     ● Essentially text-to-text machine "translation", or a conditional language model
     ● Challenge: where to get training data?
       ○ Simulate recognition errors using a large text corpus:
       ○ synthesize speech from the text with TTS,
       ○ pass it through the LAS model to get hypotheses,
       ○ and form training pairs: hypothesis -> ground-truth transcript
     [Diagram: LAS outputs "hand over to trevellion"; the SC model corrects it to "hand over to trevelyan"]

  5. Experiments: Librispeech
     ● Speech
       ○ Read speech, long utterances
       ○ Training: 460 hours clean + 500 hours "other" speech
         ■ ~180k utterances
       ○ Evaluation: dev-clean, test-clean (~5.4 hours)
     ● Text (LM-TEXT)
       ○ Training: 40M sentences
     ● Synthetic speech (LM-TTS)
       ○ Speech synthesized from LM-TEXT (~60k hours) using a single-voice Parallel WaveNet TTS system [Oord et al, ICML 2018]

  6. Baseline recognizer
     ● Based on Listen, Attend, and Spell (LAS): an attention-based encoder-decoder model
     ● log-mel spectrogram + delta + acceleration features (sketched below)
     ● 2x convolutional + 3x bidirectional LSTM encoder
     ● 4-head additive attention
     ● 1x LSTM decoder
     ● 16k wordpiece outputs

     WER            DEV    TEST
     LAS baseline   5.80   6.03
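     As a concrete illustration of the front end, here is a minimal sketch of the log-mel + delta + acceleration feature computation using librosa. The 80 mel bins, 25 ms window, and 10 ms hop are assumed values for illustration; the slide does not specify them.

        import numpy as np
        import librosa

        def log_mel_with_deltas(wav, sr=16000, n_mels=80):
            # 25 ms windows with a 10 ms hop at 16 kHz (assumed, not from the slide)
            mel = librosa.feature.melspectrogram(
                y=wav, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
            log_mel = np.log(mel + 1e-6)
            delta = librosa.feature.delta(log_mel, order=1)   # first derivative
            accel = librosa.feature.delta(log_mel, order=2)   # second derivative
            # Stack into a single (3 * n_mels, frames) feature matrix
            return np.concatenate([log_mel, delta, accel], axis=0)

        feats = log_mel_with_deltas(np.random.randn(16000).astype(np.float32))
        print(feats.shape)  # (240, 101) for one second of audio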

  7. Methods for using text-only data
     1. Train LM on LM-TEXT
        ○ rescore baseline LAS output with a language model
     2. Train recognizer on LM-TTS
        ○ incorporate synthetic speech into the recognizer training set
     3. Train Spelling Corrector (SC) on decoded LM-TTS
        ○ train on recognition errors made on synthetic speech

  8. Train LM on LM-TEXT
     ● 2-layer LSTM language model
     ● 16k wordpiece output vocabulary
     ● Rescore the N-best list of 8 hypotheses (sketch below)
     [Diagram: LAS emits an 8-best list y1 (p1) ... y8 (p8); the LM rescores it to pick y*]

     WER            DEV            TEST
     LAS            5.80           6.03
     LAS → LM (8)   4.56 (21.4%)   4.72 (21.7%)

     LM rescoring gives a significant improvement over LAS
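     A minimal sketch of N-best rescoring with a log-linear combination of LAS and LM scores. The hypotheses, scores, and the lm_weight value are all invented for illustration, and lm_logprob is a stand-in for the 2-layer LSTM LM.

        import math

        # Toy N-best list of (hypothesis, LAS log-probability); values are invented.
        nbest = [
            ("hand over to trevellion", math.log(0.42)),
            ("hand over to trevelyan",  math.log(0.31)),
            ("hand over to travellian", math.log(0.27)),
        ]

        def lm_logprob(hyp):
            # Stand-in for the LSTM LM: a lookup over invented scores.
            fake_scores = {"hand over to trevelyan": -3.1,
                           "hand over to trevellion": -9.7,
                           "hand over to travellian": -11.2}
            return fake_scores[hyp]

        def rescore(nbest, lm_weight=0.3):
            # Log-linear interpolation; lm_weight would be tuned on the dev set.
            return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))

        print(rescore(nbest)[0])  # -> "hand over to trevelyan"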

  9. Methods for using text-only data
     1. Train LM on LM-TEXT
        ○ rescore baseline LAS output with a language model
     2. Train recognizer on LM-TTS
        ○ incorporate synthetic speech into the recognizer training set
     3. Train Spelling Corrector (SC) on decoded LM-TTS
        ○ train on recognition errors made on synthetic speech

  10. Train recognizer on LM-TTS
     ● Same LAS model, more training data
       ○ 960-hour speech + 60k-hour synthetic speech
       ○ "back-translation" for speech recognition [Hayashi et al, SLT 2018]
       ○ Each batch: 0.7 * real + 0.3 * LM-TTS (see the sampling sketch below)

     WER                DEV    TEST
     LAS baseline       5.80   6.03
     LAS-TTS            5.68   5.85
     LAS → LM (8)       4.56   4.72
     LAS-TTS → LM (8)   4.45   4.52

     Training on a combination of real and LM-TTS audio gives an improvement both before and after rescoring
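     One simple way to realize the 0.7/0.3 mix is per-example Bernoulli sampling, as in the sketch below; the slide does not specify the exact batching mechanism, so this is an assumption.

        import random

        def mixed_batches(real_utts, tts_utts, batch_size=8, real_frac=0.7):
            """Yield batches drawing ~70% real and ~30% synthetic utterances."""
            while True:
                yield [random.choice(real_utts) if random.random() < real_frac
                       else random.choice(tts_utts)
                       for _ in range(batch_size)]

        # Usage with toy corpora of utterance IDs:
        real = [f"real_{i}" for i in range(1000)]
        tts = [f"tts_{i}" for i in range(1000)]
        print(next(mixed_batches(real, tts)))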

  11. Methods for using text-only data
     1. Train LM on LM-TEXT
        ○ rescore baseline LAS output with a language model
     2. Train recognizer on LM-TTS
        ○ incorporate synthetic speech into the recognizer training set
     3. Train Spelling Corrector (SC) on decoded LM-TTS
        ○ train on recognition errors made on synthetic speech

  12. Train Spelling Corrector (SC) on decoded LM-TTS
     ● Training data generation (sketched below)
       ○ Take the baseline LAS model, pre-trained on real speech
       ○ Decode the 40M LM-TTS utterances
         ■ N-best (8) list after beam search
       ○ Generate text-text training pairs:
         ■ each candidate in the N-best list -> ground-truth transcript
     [Diagram: LAS hypotheses "hand over to trevellion", "hand over to trevelyin", ... are each paired with the ground truth "hand over to trevelyan" as SC training examples]
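     A minimal sketch of the pair-generation loop. synthesize and recognize_nbest are stand-ins for the Parallel WaveNet TTS system and the beam-search decoding of the baseline LAS model; here they are toy lambdas so the sketch runs end to end.

        def make_sc_training_pairs(sentences, synthesize, recognize_nbest, n=8):
            """Generate (noisy hypothesis -> ground truth) text pairs for SC training."""
            pairs = []
            for truth in sentences:
                audio = synthesize(truth)                # TTS on LM-TEXT
                for hyp in recognize_nbest(audio, n):    # LAS N-best decoding
                    pairs.append((hyp, truth))           # every candidate maps to the truth
            return pairs

        pairs = make_sc_training_pairs(
            ["hand over to trevelyan"],
            synthesize=lambda text: b"fake-waveform",
            recognize_nbest=lambda audio, n: ["hand over to trevellion"] * n,
        )
        print(len(pairs))  # 8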

  13. Model architecture
     ● Based on RNMT+ [Chen et al, ACL 2018]
     ● 16k wordpiece input/output tokens
     ● Encoder: 3 bidirectional LSTM layers
     ● Decoder: 3 unidirectional LSTM layers
     ● 4-head additive attention
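     Below is a hedged PyTorch skeleton of such an encoder-decoder. The 512-unit dimensions are assumptions, attention is applied once over the encoder output rather than per decoder layer as in RNMT+, and nn.MultiheadAttention uses scaled dot-product rather than the additive attention named on the slide; this is a shape-correct sketch, not the authors' implementation.

        import torch
        import torch.nn as nn

        class SpellingCorrector(nn.Module):
            """Skeleton of the SC encoder-decoder (dimensions are assumptions)."""
            def __init__(self, vocab=16000, dim=512):
                super().__init__()
                self.src_emb = nn.Embedding(vocab, dim)
                self.tgt_emb = nn.Embedding(vocab, dim)
                # Encoder: 3 bidirectional LSTM layers (halved hidden size keeps dim)
                self.encoder = nn.LSTM(dim, dim // 2, num_layers=3,
                                       bidirectional=True, batch_first=True)
                # 4-head attention over encoder states (dot-product stand-in)
                self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
                # Decoder: 3 unidirectional LSTM layers
                self.decoder = nn.LSTM(2 * dim, dim, num_layers=3, batch_first=True)
                self.out = nn.Linear(dim, vocab)         # 16k wordpiece logits

            def forward(self, src_ids, tgt_ids):
                enc, _ = self.encoder(self.src_emb(src_ids))
                dec_in = self.tgt_emb(tgt_ids)
                ctx, _ = self.attn(dec_in, enc, enc)     # attend over encoder states
                dec, _ = self.decoder(torch.cat([dec_in, ctx], dim=-1))
                return self.out(dec)

        model = SpellingCorrector()
        logits = model(torch.randint(0, 16000, (2, 12)), torch.randint(0, 16000, (2, 10)))
        print(logits.shape)  # torch.Size([2, 10, 16000])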

  14. LAS → SC: Correct top hypothesis
     ● Directly correct the top hypothesis
     ● Attention weights
       ○ Roughly monotonic
       ○ Attend to adjacent context at recognition errors

     WER            DEV            TEST
     LAS baseline   5.80           6.03
     LAS → SC (1)   5.04 (13.1%)   5.08 (15.8%)

     Directly applying SC to the LAS top hypothesis shows a clear improvement

  15. LAS → SC: Correct N-best hypotheses
     ● Generate an expanded N-best list (see the sketch below)
       ○ The LAS N-best list lacks diversity
       ○ Pass each of the N candidates to SC
         ■ Generate M alternatives for each one
         ■ Increase the N-best list to N*M
     [Diagram: each hypothesis H1 (p1) ... H8 (p8) in the original 8-best list is passed through SC, yielding 8 alternatives apiece (A11 ... A88) for a new 8*8 = 64-entry N-best list]

     ORACLE WER     DEV    TEST
     LAS baseline   3.11   3.28
     LAS → SC (1)   3.01   3.02
     LAS → SC (8)   1.63   1.68
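     A minimal sketch of the expansion step: each of the N hypotheses is run through the SC model's beam search to produce M alternatives. sc_generate stands in for SC decoding and is a toy lambda here.

        def expand_nbest(nbest, sc_generate, m=8):
            """Expand an N-best list of (hyp, las_score) to N*M candidates."""
            expanded = []
            for hyp, las_score in nbest:
                for alt, sc_score in sc_generate(hyp, m):   # M alternatives per input
                    expanded.append((alt, las_score, sc_score))
            return expanded

        fake_sc = lambda hyp, m: [(hyp.replace("trevellion", "trevelyan"), -0.5)] * m
        out = expand_nbest([("hand over to trevellion", -1.0)] * 8, fake_sc)
        print(len(out))  # 8 hypotheses in -> 64 candidates out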

  16. LAS → SC: Correct N-best hypotheses: Results
     ● Rescore the expanded N-best list, tuning interpolation weights on dev (sketch below)

     WER                      DEV            TEST           DEV-TTS
     LAS                      5.80           6.03           5.26
     LAS → SC (1)             5.04 (13.1%)   5.08 (15.8%)   3.45 (34.0%)
     LAS → LM (8)             4.56           4.72           3.98
     LAS → SC (8) → LM (64)   4.20 (27.6%)   4.33 (28.2%)   3.11 (40.9%)

     Rescoring the expanded N-best list gives a large improvement and outperforms LAS → LM
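     A sketch of picking the final hypothesis by log-linearly combining LAS, SC, and LM scores, with the two interpolation weights grid-searched on dev. The candidate tuple layout, the 0.0-1.0 grid, and the crude sentence-level stand-in for WER are all assumptions.

        import itertools

        def pick_best(candidates, sc_w, lm_w):
            # candidate: (text, las_score, sc_score, lm_score)
            return max(candidates, key=lambda c: c[1] + sc_w * c[2] + lm_w * c[3])[0]

        def tune_weights(dev_set, error_rate):
            """Grid-search interpolation weights on (candidates, reference) pairs."""
            grid = [i / 10 for i in range(11)]
            return min(
                itertools.product(grid, grid),
                key=lambda w: error_rate(
                    [pick_best(cands, *w) for cands, _ in dev_set],
                    [ref for _, ref in dev_set]),
            )

        # Crude stand-in for WER: fraction of sentences that do not match exactly.
        ser = lambda hyps, refs: sum(h != r for h, r in zip(hyps, refs)) / len(refs)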

  17. SC Train/Test mismatch
     ● Mismatch between recognition errors on real and TTS audio
       ○ Synthetic speech has clear pronunciation -> LAS makes fewer substitution errors

     WER                      DEV            TEST           DEV-TTS
     LAS                      5.80           6.03           5.26
     LAS → SC (1)             5.04 (13.1%)   5.08 (15.8%)   3.45 (34.0%)
     LAS → LM (8)             4.56           4.72           3.98
     LAS → SC (8) → LM (64)   4.20 (27.6%)   4.33 (28.2%)   3.11 (40.9%)

     Results on DEV-TTS show the potential of SC when errors are matched between training and test

  18. Multistyle Training (MTR)
     ● Increase SC training data variability
     ● Add noise and reverberation to LM-TTS [Kim et al, Interspeech 2017] (simplified sketch below)
     ● Train on LM-TTS clean + MTR
       ○ total of 640M training pairs

     WER                          DEV            TEST
     LAS baseline                 5.80           6.03
     LAS → SC (1)                 5.04 (13.1%)   5.08 (15.8%)
     LAS → SC-MTR (1)             4.87 (16.0%)   4.91 (18.6%)
     LAS → LM (8)                 4.56           4.72
     LAS → SC (8) → LM (64)       4.20 (27.6%)   4.33 (28.2%)
     LAS → SC-MTR (8) → LM (64)   4.12 (29.0%)   4.28 (29.0%)

     MTR makes the TTS audio more realistic and generates a noisier N-best list with better-matched errors
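     A simplified sketch of MTR-style augmentation: convolve clean TTS audio with a room impulse response and add noise scaled to a target SNR. The impulse response, noise signal, and 15 dB default are placeholders; the actual setup follows Kim et al. (2017).

        import numpy as np

        def mtr_augment(wav, rir, noise, snr_db=15.0):
            """Add reverberation and noise to clean audio (simplified MTR)."""
            reverbed = np.convolve(wav, rir)[: len(wav)]   # simulate the room
            noise = noise[: len(reverbed)]
            # Scale the noise to hit the target signal-to-noise ratio
            sig_pow = np.mean(reverbed ** 2)
            noise_pow = np.mean(noise ** 2) + 1e-12
            scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
            return reverbed + scale * noise

        noisy = mtr_augment(np.random.randn(16000),
                            rir=np.array([1.0, 0.0, 0.4]),
                            noise=np.random.randn(16000))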

  19. Example corrections
     ● Corrects proper nouns, rare words, tense errors

     Reference:               ready to hand over to trevelyan on trevelyan's arrival in england
     LAS baseline:            ready to hand over to trevellion on trevelyin's arrival in england
     LAS → LM (8):            ready to hand over to trevellion on trevelyan's arrival in england
     LAS → SC (8) → LM (64):  ready to hand over to trevelyan on trevelyan's arrival in england

     Reference:               has countenanced the belief the hope the wish that the ebionites or at least the nazarenes
     LAS baseline:            has countenance the belief the hope the wish that the epeanites or at least the nazarines
     LAS → LM (8):            has countenance the belief the hope the wish that the epeanites or at least the nazarines
     LAS → SC (8) → LM (64):  has countenanced the belief the hope the wish that the ebionites or at least the nazarenes

     Reference:               a wandering tribe of the blemmyes or nubians
     LAS baseline:            a wandering tribe of the blamies or nubians
     LAS → LM (8):            a wandering tribe of the blamis or nubians
     LAS → SC (8) → LM (64):  a wandering tribe of the blemmyes or nubians

  20. Example miscorrections
     ● The spelling corrector sometimes introduces errors

     Reference:               a laudable regard for the honor of the first proselyte
     LAS baseline:            a laudable regard for the honor of the first proselyte
     LAS → LM (8):            a laudable regard for the honor of the first proselyte
     LAS → SC (8) → LM (64):  a laudable regard for the honour of the first proselyte

     Reference:               ambrosch he make good farmer
     LAS baseline:            ambrosch he may good farmer
     LAS → LM (8):            ambrose he make good farmer
     LAS → SC (8) → LM (64):  ambrose he made good farmer

  21. Summary
     ● Spelling correction model to correct recognition errors
     ● Outperforms LM rescoring alone by expanding the N-best list
     ● MTR data augmentation improves the SC model
       ○ Overall ~29% relative improvement
     ● Future work: better strategies for creating SC training data with matched errors

     WER                          DEV    TEST
     LAS baseline                 5.80   6.03
     LAS-TTS                      5.68   5.85
     LAS → SC (1)                 5.04   5.08
     LAS → SC-MTR (1)             4.87   4.91
     LAS → LM (8)                 4.56   4.72
     LAS-TTS → LM (8)             4.45   4.52
     LAS → SC (8) → LM (64)       4.20   4.33
     LAS → SC-MTR (8) → LM (64)   4.12   4.28
