Neural Networks for Distant Speech Recognition
Steve Renals, Centre for Speech Technology Research, University of Edinburgh (s.renals@ed.ac.uk)
14 May 2014
Joint work with Pawel Świętojański. Significant contributions from Peter Bell & Arnab Ghoshal.
Distant Speech Recognition
hmm ... so you have your energy source your user interface who’s controlling the chip ... (click, rustle)
Why study meetings?
• Natural communication scenes
• Multistream - multiple asynchronous streams of data
• Multimodal - words, prosody, gesture, attention
• Multiparty - social roles, individual and group behaviours
• Meetings offer realistic, complex behaviours but in a circumscribed setting
• Applications based on meeting capture, analysis, recognition and interpretation
• Great arena for interdisciplinary research
“ASR Complete” problem
• Transcription of conversational speech
• Distant speech recognition with microphone arrays
• Speech separation, multiple acoustic channels
• Reverberation
• Overlap detection
• Utterance and speaker segmentation
• Disfluency detection
Today’s Menu
• MDM corpora: ICSI and AMI meetings corpora
• MDM systems in 2010: GMMs, beamforming, and lots of adaptation
• MDM systems in 2014: Neural networks, less beamforming, and less adaptation
Corpora
ICSI Corpus
[Figure labels: headset mics; tabletop boundary mics]
AMI Corpus
[Figure labels: headset mic; lapel mic; mic array]
http://corpus.amiproject.org
AMI Corpus Example
Meeting recording (c. 2005)
Meeting recording (2010s)
GMM-based systems (State-of-the-art 2010)
Basic system
[Figure: beamformer]
• Speech/non-speech segmentation
• PLP/MFCC features
• ML trained HMM/GMM system (122k 39D Gaussians)
• 50k vocabulary
• Trigram language model (small: 26M words, PPL 78)
• Weighted FST decoder
Additional components
• Microphone array front end
• Speaker / channel adaptation
  • Vocal tract length normalisation (VTLN)
  • Maximum likelihood linear regression (MLLR)
• Input feature transform – LDA/STC
• Discriminative training
  • e.g. boosted maximum mutual information (BMMI)
• Discriminative features
• Model combination
GMM results (WER)
ASR word error rates (WER/%) for GMM/HMM systems:
Condition          AMI    ICSI
SDM                63.2   56.1
MDM beamforming    54.8   46.8
IHM                29.6   not shown
Microphone array processing for distant speech recognition
• Mic array processing in AMIDA ASR system (Hain et al, 2012)
  • Wiener noise filter
  • Filter-sum beamforming based on time-delay-of-arrival
  • Viterbi smoother post processing
  • Track direction of maximum energy
• Optimise beamforming for speech recognition
  • LIMABEAM (Seltzer et al, 2004, 2006) [explicit]
  • Simply concatenate feature vectors from multiple mics (Marino and Hain, 2011) [implicit]
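The beamforming step on this slide can be illustrated with a toy delay-and-sum beamformer. This is a minimal sketch, not the AMIDA front end: it assumes integer-sample delays, plain (circular) cross-correlation for the time-delay-of-arrival estimate, and equal channel weights, and it omits the Wiener filter and Viterbi smoothing. The function names `estimate_delay` and `delay_and_sum` are hypothetical.

```python
import numpy as np

def estimate_delay(reference, channel, max_lag):
    """Return the integer lag (in samples) by which `channel` trails `reference`."""
    lags = np.arange(-max_lag, max_lag + 1)
    # Circular cross-correlation at each candidate lag (a simplification).
    scores = [np.dot(reference, np.roll(channel, -lag)) for lag in lags]
    return int(lags[int(np.argmax(scores))])

def delay_and_sum(channels, max_lag=50):
    """Align every channel to channel 0, then average the aligned signals."""
    reference = channels[0]
    delays = [estimate_delay(reference, ch, max_lag) for ch in channels]
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0), delays

# Synthetic test signal: one source reaching four microphones with
# different (assumed) delays and independent additive noise.
rng = np.random.default_rng(0)
source = rng.standard_normal(4000)
true_delays = [0, 7, 15, 23]
channels = np.stack([
    np.roll(source, d) + 0.5 * rng.standard_normal(source.size)
    for d in true_delays
])

enhanced, delays = delay_and_sum(channels)
mse_single = np.mean((channels[0] - source) ** 2)
mse_beamformed = np.mean((enhanced - source) ** 2)
print("estimated delays:", delays)
print("noise power, single mic vs beamformed:", mse_single, mse_beamformed)
```

Averaging the aligned channels leaves the source unchanged while the independent noise partly cancels, which is why the beamformed output is closer to the source than any single microphone.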
(Deep) Neural Networks
The Perceptron (Rosenblatt)
NN Winter #1
MLPs and backprop (mid 1980s)
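MLPs and backprop
• Train multiple layers of hidden units – nested nonlinear functions
• Powerful feature detectors
• Posterior probability estimation
• Theorem: any continuous function can be approximated arbitrarily well with a single hidden layer, given enough hidden units
[Figure: network with inputs $x_i$, first-layer weights $w^{(1)}_{ji}$, hidden units $z_j$, second-layer weights $w^{(2)}_{\ell j}$, and outputs $y_1, \dots, y_\ell, \dots, y_K$ with error signals $\delta_1, \dots, \delta_\ell, \dots, \delta_K$]
Backpropagated error for hidden unit $j$: $\delta_j = h'(b_j) \sum_{\ell} \delta_\ell \, w^{(2)}_{\ell j}$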
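The backpropagated error for a hidden unit can be checked numerically. The following is a minimal sketch under stated assumptions: a single tanh hidden layer, a softmax output with cross-entropy loss (so the output error is y - t), and no bias terms. The function names and layer sizes are illustrative only.

```python
import numpy as np

def forward(x, W1, W2):
    b = W1 @ x                        # hidden pre-activations b_j
    z = np.tanh(b)                    # hidden outputs z_j = h(b_j)
    a = W2 @ z                        # output pre-activations
    a = a - a.max()                   # stabilise the softmax
    y = np.exp(a) / np.exp(a).sum()   # softmax outputs y_l
    return b, z, y

def loss(x, t, W1, W2):
    _, _, y = forward(x, W1, W2)
    return -np.sum(t * np.log(y))     # cross-entropy

def backprop(x, t, W1, W2):
    b, z, y = forward(x, W1, W2)
    delta_out = y - t                                       # delta_l at the outputs
    # delta_j = h'(b_j) * sum_l delta_l * w2_lj, with h = tanh
    delta_hid = (1.0 - np.tanh(b) ** 2) * (W2.T @ delta_out)
    grad_W1 = np.outer(delta_hid, x)                        # dE/dw1_ji = delta_j * x_i
    grad_W2 = np.outer(delta_out, z)                        # dE/dw2_lj = delta_l * z_j
    return grad_W1, grad_W2

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
t = np.array([0.0, 1.0, 0.0])
W1 = 0.5 * rng.standard_normal((4, 5))
W2 = 0.5 * rng.standard_normal((3, 4))
grad_W1, grad_W2 = backprop(x, t, W1, W2)

# Central finite-difference check on one first-layer weight.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[2, 3] += eps
W1_minus[2, 3] -= eps
numeric = (loss(x, t, W1_plus, W2) - loss(x, t, W1_minus, W2)) / (2 * eps)
print("analytic:", grad_W1[2, 3], "numeric:", numeric)
```

Agreement between the analytic and finite-difference gradients is a quick sanity check that the error signals are being propagated through the second-layer weights correctly.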
“Hybrid” Neural network acoustic models (1990s)
[Figure: DARPA RM 1992, error (%) against million parameters for CI-HMM, CI-MLP, CD-HMM and MIX systems]
[Figure: block diagram with speech input, perceptual linear prediction and modulation spectrogram features, CI RNN, CI MLP and CD RNN models, Chronos decoders, and ROVER producing the utterance hypothesis]
• Broadcast news 1998: 20.8% WER (best GMM-based system, 13.5%)
Bourlard & Morgan, 1994
Robinson, IEEE TNN 1994
Renals, Morgan, Cohen & Franco, ICASSP 1992
Cook, Christie, Ellis, Fosler-Lussier, Gotoh, Kingsbury, Morgan, Renals, Robinson, & Williams, DARPA, 1999
NN acoustic models: Limitations vs GMMs
• Computationally restricted to monophone outputs
  • CD-RNN factored over multiple networks – limited within-word context
• Training not easily parallelisable
  • experimental turnaround slower
  • systems less complex (fewer parameters)
    • RNN – <100k parameters
    • MLP – ~1M parameters
• Rapid adaptation hard (cf MLLR)
NN Winter #2
[Figure labels: s-iy+l, f-iy-l, t-iy-n, t-iy-m; GMM, SVM, CRF]
Discriminative long-term features – Tandem
• A neural network-based technique provided the biggest increase in accuracy in speech recognition during the 2000s
• Tandem features (Hermansky, Ellis & Sharma, 2000)
  • use (transformed) outputs or (bottleneck) hidden values as input features for a GMM
  • deep networks – e.g. 5 layer MLP to obtain bottleneck features (Grézl, Karafiát, Kontár & Černocký, 2007)
  • reduces errors by about 10% relative (Hain, Burget, Dines, Garner, Grézl, el Hannani, Huijbregts, Karafiát, Lincoln & Wan, 2012)
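The bottleneck idea can be sketched as a forward pass that stops at a narrow hidden layer and appends its activations to the ordinary acoustic features. This is only an illustration with untrained random weights: the layer sizes, the linear bottleneck, the tanh hidden units and the frame-major layout of the spliced input are all assumptions, and a real system would train the network and typically decorrelate the features before GMM training.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 5-layer MLP: input, hidden, bottleneck, hidden, phone outputs.
layer_sizes = [351, 500, 26, 500, 40]
BOTTLENECK_LAYER = 2   # index of the bottleneck in layer_sizes
weights = [
    0.1 * rng.standard_normal((n_out, n_in))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def bottleneck_features(spliced_frames):
    """Propagate frames up to the bottleneck and return its linear activations."""
    h = spliced_frames
    for layer, W in enumerate(weights[:BOTTLENECK_LAYER], start=1):
        h = h @ W.T
        if layer < BOTTLENECK_LAYER:
            h = np.tanh(h)   # nonlinearity on the ordinary hidden layer only
    return h

num_frames = 100
spliced = rng.standard_normal((num_frames, 351))   # stand-in for 9 spliced 39-D frames
mfcc = spliced[:, 4 * 39:5 * 39]                   # centre frame of each 9-frame window
tandem = np.hstack([mfcc, bottleneck_features(spliced)])
print(tandem.shape)   # one tandem feature vector per frame, ready for GMM training
```

The layers above the bottleneck are only needed while training the network; at feature-extraction time they are discarded.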
Deep Neural Networks (2010s)
[Figure: network diagram]
• MFCC inputs (39*9=351)
• 3–8 hidden layers, 2000 hidden units
• Hybrid: 12000 CD phone outputs
• Tandem: bottleneck layer, 26 units
Dahl, Yu, Deng & Acero, IEEE TASLP 2012
Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath & Kingsbury, IEEE SP Mag 2012
Deep neural networks
What’s new?
Deep neural networks
1. Unsupervised pretraining (Hinton, Osindero & Teh, 2006)
  • Train a stacked RBM generative model, then finetune
  • Good initialisation
  • Regularisation
2. Deep – many hidden layers
  • Deeper models more accurate
  • GPUs gave us the computational power
3. Wide output layer (context dependent phone classes) rather than factorised into multiple nets
  • More accurate phone models
  • GPUs gave us the computational power
K Vesely, A Ghoshal, L Burget, and D Povey, “Sequence-discriminative training of deep neural networks”, Interspeech 2013.
http://kaldi.sf.net/
Switchboard: Hub5 '00 test set, 300 hour training set (WER/%)
System      SWB    CHE    AVE
GMM/BMMI    18.6   33.0   25.8
DNN/CE      14.2   25.7   20.0
DNN/sMBR    12.6   24.1   18.4
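The size of these gains is easier to judge as relative WER reductions. The short calculation below assumes the chart is read as GMM/BMMI 18.6 / 33.0 / 25.8, DNN/CE 14.2 / 25.7 / 20.0 and DNN/sMBR 12.6 / 24.1 / 18.4 for SWB / CHE / AVE; the helper name `relative_reduction` is illustrative.

```python
# WER/% values as read from the chart (assignment of bars to systems is an assumption).
wer = {
    "GMM/BMMI": {"SWB": 18.6, "CHE": 33.0, "AVE": 25.8},
    "DNN/CE":   {"SWB": 14.2, "CHE": 25.7, "AVE": 20.0},
    "DNN/sMBR": {"SWB": 12.6, "CHE": 24.1, "AVE": 18.4},
}

def relative_reduction(baseline, new):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline - new) / baseline

for subset in ("SWB", "CHE", "AVE"):
    gmm = wer["GMM/BMMI"][subset]
    ce = wer["DNN/CE"][subset]
    smbr = wer["DNN/sMBR"][subset]
    print(
        f"{subset}: GMM to DNN/CE {relative_reduction(gmm, ce):.1f}%, "
        f"DNN/CE to DNN/sMBR {relative_reduction(ce, smbr):.1f}%, "
        f"GMM to DNN/sMBR {relative_reduction(gmm, smbr):.1f}%"
    )
```

On the averaged test set this works out to roughly a 22% relative reduction from moving to the cross-entropy DNN, and a further 8% relative from sequence-discriminative (sMBR) training.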
Neural network acoustic models
• Softmax output layer: ~6000 CD phone outputs
• 3-8 hidden layers, ~2000 hidden units: automatically learned feature extraction
• 9x39 MFCC inputs
• Aim to learn representations for distant speech recognition based on multiple mic channels
• Multi-channel input; spectral domain?
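The network on this slide can be sketched as a plain feed-forward pass. This is an untrained illustration using the slide's sizes (9x39 inputs, ~2000 hidden units, ~6000 outputs); the choice of three hidden layers, sigmoid hidden units, random initial weights and the name `forward` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

INPUT_DIM = 9 * 39        # 9 spliced frames of 39-D MFCCs
HIDDEN_UNITS = 2000
NUM_HIDDEN_LAYERS = 3     # the slide gives a range of 3-8
NUM_OUTPUTS = 6000        # context-dependent phone classes

sizes = [INPUT_DIM] + [HIDDEN_UNITS] * NUM_HIDDEN_LAYERS + [NUM_OUTPUTS]
weights = [
    0.01 * rng.standard_normal((n_in, n_out), dtype=np.float32)
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
biases = [np.zeros(n_out, dtype=np.float32) for n_out in sizes[1:]]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(frames):
    """Map a batch of spliced frames to posteriors over CD phone classes."""
    h = frames
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)                            # hidden layers
    logits = h @ weights[-1] + biases[-1]
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilise the softmax
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)           # softmax output layer

frames = rng.standard_normal((4, INPUT_DIM), dtype=np.float32)
posteriors = forward(frames)
print(posteriors.shape, posteriors.sum(axis=1))
```

Each row of `posteriors` is a probability distribution over the output classes for one input frame.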
Neural network acoustic models for distant speech recognition
• NNs have proven to result in accurate systems for a variety of tasks – TIMIT, WSJ, Switchboard, Broadcast News, Lectures, Aurora4, …
• NNs can integrate information from multiple frames of data (in comparison with GMMs)
• NNs can construct feature representations from multiple sources of data
• NNs are well suited to learning multiple modules with a common objective function
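Two of these points, integrating multiple frames and combining multiple sources, come down to how the network input is assembled. The sketch below shows one simple way to do it: concatenate per-channel feature vectors frame by frame, then splice each frame with its neighbours. The function names, the edge-padding rule and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def splice(features, context):
    """Stack each frame with `context` frames on either side.

    Frames beyond the start or end of the utterance repeat the first or
    last frame (edge padding).
    """
    num_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([
        padded[offset:offset + num_frames]
        for offset in range(2 * context + 1)
    ])

def multichannel_input(channel_features, context):
    """Concatenate the channels frame by frame, then splice in time."""
    per_frame = np.hstack(channel_features)
    return splice(per_frame, context)

# Single channel: 9 frames of 39-D features give a 351-D input per frame.
mfcc = np.zeros((10, 39))
spliced = splice(mfcc, context=4)

# Two toy channels with 3-D features and one frame of context either side.
chan_a = np.arange(18, dtype=float).reshape(6, 3)
chan_b = -np.arange(18, dtype=float).reshape(6, 3)
net_input = multichannel_input([chan_a, chan_b], context=1)
print(spliced.shape, net_input.shape)
```

With this layout the network sees all channels and all context frames at once and is free to learn how to weight them, rather than relying on an explicit beamformer.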
Baseline DNN system
• Mic array front end: Wiener filter noise cancellation, smoothed TDOA estimates, delay-sum beamforming
• 11x120 FBANK inputs
• 6 hidden layers, 2048 hidden units
• ~4000 tied state outputs
• 50,000 word pronunciation dictionary
• Small trigram LM (PPL 78, trained on 26M words)
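A quick parameter count gives a feel for the size of this baseline. The calculation assumes fully connected layers with one bias per unit and exactly 4000 outputs (the slide says ~4000); `count_parameters` is an illustrative helper.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected feed-forward network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

INPUT_DIM = 11 * 120                     # 11 spliced frames of 120-D FBANK features
layer_sizes = [INPUT_DIM] + [2048] * 6 + [4000]
total = count_parameters(layer_sizes)
print(f"{INPUT_DIM}-dimensional input, {total:,} parameters")
```

Under these assumptions the network has about 32 million parameters, most of them in the hidden-to-hidden and output layers.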
Baseline GMM results
ASR word error rates (WER/%) for GMM/HMM systems:
Condition          AMI    ICSI
SDM                63.2   56.1
MDM beamforming    54.8   46.8
IHM                29.6   not shown