Neural Networks for Distant Speech Recognition (Steve Renals)

Neural Networks for Distant Speech Recognition
Steve Renals
Centre for Speech Technology Research, University of Edinburgh
s.renals@ed.ac.uk
14 May 2014
Joint work with Pawel Świętojański
Significant contributions from Peter Bell & Arnab Ghoshal

Distant Speech Recognition
hmm ... so you have your energy source your user interface who’s controlling the chip ... click rustle

Why study meetings?
• Natural communication scenes
• Multistream - multiple asynchronous streams of data
• Multimodal - words, prosody, gesture, attention
• Multiparty - social roles, individual and group behaviours
• Meetings offer realistic, complex behaviours but in a circumscribed setting
• Applications based on meeting capture, analysis, recognition and interpretation
• Great arena for interdisciplinary research

“ASR Complete” problem
• Transcription of conversational speech
• Distant speech recognition with microphone arrays
• Speech separation, multiple acoustic channels
• Reverberation
• Overlap detection
• Utterance and speaker segmentation
• Disfluency detection

Today’s Menu
• MDM corpora: ICSI and AMI meetings corpora
• MDM systems in 2010: GMMs, beamforming, and lots of adaptation
• MDM systems in 2014: Neural networks, less beamforming, and less adaptation

Corpora

ICSI Corpus
• Headset mics
• Tabletop boundary mics

AMI Corpus
• Headset mic
• Lapel mic
• Mic array
http://corpus.amiproject.org

AMI Corpus Example

Meeting recording (c. 2005)

Meeting recording (2010s)

GMM-based systems (State-of-the-art 2010)

Basic system
[Figure: beamformer front end]
• Speech/non-speech segmentation
• PLP/MFCC features
• ML trained HMM/GMM system (122k 39D Gaussians)
• 50k vocabulary
• Trigram language model (small: 26M words, PPL 78)
• Weighted FST decoder

Additional components
• Microphone array front end
• Speaker / channel adaptation
• Vocal tract length normalisation (VTLN)
• Maximum likelihood linear regression (MLLR)
• Input feature transform – LDA/STC
• Discriminative training, e.g. boosted maximum mutual information (BMMI)
• Discriminative features
• Model combination

GMM results (WER)
[Bar chart: ASR word error rates (WER/%) for GMM/HMM systems]
System            AMI    ICSI
SDM               63.2   56.1
MDM beamforming   54.8   46.8
IHM               29.6   (not shown)
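The results on this slide are word error rates. As a reminder of how that metric is usually computed, here is a minimal sketch: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is illustrative, and official scoring tools also apply text normalisation that is not modelled here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion of a reference word
                          curr[j - 1] + 1,     # insertion of a hypothesis word
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.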

Microphone array processing for distant speech recognition
• Mic array processing in AMIDA ASR system (Hain et al, 2012)
  • Wiener noise filter
  • Filter-sum beamforming based on time-delay-of-arrival
  • Viterbi smoother post processing
  • Track direction of maximum energy
• Optimise beamforming for speech recognition
  • LIMABEAM (Seltzer et al, 2004, 2006) [explicit]
  • Simply concatenate feature vectors from multiple mics (Marino and Hain, 2011) [implicit]
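To make the time-delay-of-arrival idea concrete, here is a minimal sketch of delay-and-sum beamforming, the simplest case of filter-sum beamforming in which each channel's filter is a pure delay. It is not the AMIDA front end: there is no Wiener filtering or Viterbi smoothing, delays are whole samples, and the function names and the `max_lag` parameter are assumptions made for illustration.

```python
import numpy as np

def estimate_delay(reference, channel, max_lag):
    """Return the integer lag (in samples) by which `channel` trails `reference`."""
    lags = np.arange(-max_lag, max_lag + 1)
    # np.roll shifts circularly; acceptable in a sketch when max_lag << signal length
    scores = [np.dot(reference, np.roll(channel, -lag)) for lag in lags]
    return int(lags[int(np.argmax(scores))])

def delay_and_sum(channels, max_lag=32):
    """Align every channel to the first one and average the aligned signals."""
    channels = np.asarray(channels, dtype=float)
    reference = channels[0]
    aligned = [np.roll(ch, -estimate_delay(reference, ch, max_lag)) for ch in channels]
    return np.mean(aligned, axis=0)
```

Averaging aligned channels reinforces the source while uncorrelated noise partially cancels, which is why the delay estimate matters more than the summation itself.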

(Deep) Neural Networks

The Perceptron (Rosenblatt)
NN Winter #1

MLPs and backprop (mid 1980s)

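MLPs and backprop
• Train multiple layers of hidden units – nested nonlinear functions
• Powerful feature detectors
• Posterior probability estimation
• Theorem: any continuous function can be approximated with a single hidden layer, given enough hidden units
[Figure: network with inputs $x_i$, first-layer weights $w^{(1)}_{ji}$, hidden units $z_j$, second-layer weights $w^{(2)}_{kj}$ and outputs $y_1, \ldots, y_K$ with error signals $\delta_1, \ldots, \delta_K$, backpropagated to the hidden units as $\delta_j = h'(b_j) \sum_{\ell} \delta_\ell w^{(2)}_{\ell j}$]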
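The backpropagation recursion for the hidden-unit error signals, delta_j = h'(b_j) * sum_l delta_l * w_lj, can be sketched for a single-hidden-layer network as follows. This is a minimal illustration, assuming sigmoid hidden units, softmax outputs, a cross-entropy loss and no bias terms; none of these choices is stated on the slide.

```python
import numpy as np

def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))

def forward_backward(x, target, W1, W2):
    """Return the loss and weight gradients for one training example."""
    b = W1 @ x                    # hidden pre-activations b_j
    z = sigmoid(b)                # hidden activations z_j = h(b_j)
    a = W2 @ z                    # output pre-activations
    y = np.exp(a - a.max())
    y /= y.sum()                  # softmax posteriors y_k
    loss = -np.log(y[target])     # cross-entropy against the target class
    delta_out = y.copy()
    delta_out[target] -= 1.0      # output error signals: y_k - t_k
    # hidden error signals: delta_j = h'(b_j) * sum_l delta_l * w_lj
    delta_hid = z * (1.0 - z) * (W2.T @ delta_out)
    grad_W2 = np.outer(delta_out, z)
    grad_W1 = np.outer(delta_hid, x)
    return loss, grad_W1, grad_W2
```

A quick way to check such code is to compare one gradient entry against a finite-difference estimate of the loss.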

“Hybrid” Neural network acoustic models (1990s)
[Figure: error (%) against millions of parameters on DARPA RM 1992 for CI-HMM, CI-MLP, CI RNN, CD-HMM, MIX and CD RNN systems]
[Figure: broadcast news system block diagram: speech, perceptual linear prediction and modulation spectrogram features, CI RNN, CI MLP and CD RNN models, Chronos decoders, ROVER, utterance hypothesis]
• Broadcast news 1998: 20.8% WER (best GMM-based system, 13.5%)
References: Bourlard & Morgan, 1994; Robinson, IEEE TNN 1994; Renals, Morgan, Cohen & Franco, ICASSP 1992; Cook, Christie, Ellis, Fosler-Lussier, Gotoh, Kingsbury, Morgan, Renals, Robinson & Williams, DARPA, 1999
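In the hybrid approach the network's state posteriors stand in for HMM emission probabilities. The usual step, sketched below, divides each posterior by the state prior to obtain a scaled likelihood, done here in the log domain. The function name and the `floor` parameter are illustrative assumptions, and priors are assumed to come from state counts in the training alignment.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert per-frame state posteriors p(state | x) into log scaled likelihoods.

    By Bayes' rule p(x | state) is proportional to p(state | x) / p(state),
    so subtracting log priors gives emission scores up to a per-frame constant.
    """
    posteriors = np.maximum(np.asarray(posteriors, dtype=float), floor)
    priors = np.maximum(np.asarray(priors, dtype=float), floor)
    return np.log(posteriors) - np.log(priors)
```

The per-frame constant is shared by all states, so it does not affect which state sequence the decoder prefers.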

NN acoustic models: Limitations vs GMMs
• Computationally restricted to monophone outputs
• CD-RNN factored over multiple networks – limited within-word context
• Training not easily parallelisable
  • experimental turnaround slower
  • systems less complex (fewer parameters)
  • RNN – <100k parameters
  • MLP – ~1M parameters
• Rapid adaptation hard (cf MLLR)

NN Winter #2
[Figure: context-dependent phone models (s-iy+l, f-iy-l, t-iy-n, t-iy-m); GMM, SVM, CRF]

Discriminative long-term features – Tandem
• A neural network-based technique provided the biggest increase in accuracy in speech recognition during the 2000s
• Tandem features (Hermansky, Ellis & Sharma, 2000)
  • use (transformed) outputs or (bottleneck) hidden values as input features for a GMM
  • deep networks – e.g. 5 layer MLP to obtain bottleneck features (Grézl, Karafiát, Kontár & Černocký, 2007)
  • reduces errors by about 10% relative (Hain, Burget, Dines, Garner, Grézl, el Hannani, Huijbregts, Karafiát, Lincoln & Wan, 2012)
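A rough sketch of bottleneck feature extraction follows: frames are propagated through an MLP up to a narrow layer, and that layer's activations become features for a GMM. The sigmoid hidden units, the use of linear bottleneck activations, and appending them to the original features are all assumptions for illustration; real systems typically also decorrelate the result before GMM training.

```python
import numpy as np

def bottleneck_features(frames, layers, bottleneck_index):
    """Return the linear activations of layer `bottleneck_index` for each frame.

    `layers` is a list of (weight matrix, bias vector) pairs, input side first.
    """
    h = np.asarray(frames, dtype=float)
    for i, (W, b) in enumerate(layers):
        a = h @ W + b
        if i == bottleneck_index:
            return a                      # stop at the bottleneck, before the nonlinearity
        h = 1.0 / (1.0 + np.exp(-a))      # sigmoid hidden units (assumed)
    raise IndexError("bottleneck_index is beyond the last layer")

def tandem_features(frames, layers, bottleneck_index):
    """Append bottleneck activations to the original acoustic features."""
    bn = bottleneck_features(frames, layers, bottleneck_index)
    return np.hstack([np.asarray(frames, dtype=float), bn])
```

With 39-dimensional input frames and a 26-unit bottleneck, each tandem vector would have 65 dimensions.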

Deep Neural Networks (2010s)
[Figure: deep network diagram, output to input]
• Hybrid: CD phone outputs (12000)
• Tandem: bottleneck layer (26)
• 3–8 hidden layers
• Hidden units (2000)
• MFCC inputs (39*9=351)
References: Dahl, Yu, Deng & Acero, IEEE TASLP 2012; Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath & Kingsbury, IEEE SP Mag 2012

Deep neural networks
What’s new?

Deep neural networks
1. Unsupervised pretraining (Hinton, Osindero & Teh, 2006)
  • Train a stacked RBM generative model, then finetune
  • Good initialisation
  • Regularisation
2. Deep – many hidden layers
  • Deeper models more accurate
  • GPUs gave us the computational power
3. Wide output layer (context dependent phone classes) rather than factorised into multiple nets
  • More accurate phone models
  • GPUs gave us the computational power

K Vesely, A Ghoshal, L Burget, and D Povey, “Sequence-discriminative training of deep neural networks”, Interspeech 2013.
[Bar chart: WER/% on Switchboard, Hub5 '00 test set, 300 hour training set]
System      SWB    CHE    AVE
GMM/BMMI    18.6   33.0   25.8
DNN/CE      14.2   25.7   20.0
DNN/sMBR    12.6   24.1   18.4

Switchboard results (Vesely et al, 2013): Kaldi toolkit, http://kaldi.sf.net/
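Comparisons between systems are often quoted as relative WER reductions. The arithmetic is sketched below using the Switchboard figures read off the chart (GMM/BMMI, DNN/CE, DNN/sMBR); the assignment of values to systems is a reconstruction from the chart labels, and the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# WER/% per test subset: (GMM/BMMI, DNN/CE, DNN/sMBR), as read from the chart
results = {
    "SWB": (18.6, 14.2, 12.6),
    "CHE": (33.0, 25.7, 24.1),
    "AVE": (25.8, 20.0, 18.4),
}

for subset, (gmm, dnn_ce, dnn_smbr) in results.items():
    print(f"{subset}: DNN/CE {relative_reduction(gmm, dnn_ce):.1f}% relative, "
          f"DNN/sMBR {relative_reduction(gmm, dnn_smbr):.1f}% relative vs GMM/BMMI")
```

On the averaged test set this works out to a little under 30% relative for the sMBR-trained DNN over the GMM baseline.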

Neural network acoustic models
[Figure: network diagram, output to input]
• Softmax output layer: ~6000 CD phone outputs
• 3-8 hidden layers, ~2000 hidden units: automatically learned feature extraction
• 9x39 MFCC inputs
Aim to learn representations for distant speech recognition based on multiple mic channels

Neural network acoustic models
[Figure: network diagram, output to input]
• Softmax output layer: ~6000 CD phone outputs
• 3-8 hidden layers, ~2000 hidden units: automatically learned feature extraction
• 9x39 MFCC inputs
• Multi-channel input? Spectral domain?
Aim to learn representations for distant speech recognition based on multiple mic channels

Neural network acoustic models for distant speech recognition
• NNs have proven to result in accurate systems for a variety of tasks – TIMIT, WSJ, Switchboard, Broadcast News, Lectures, Aurora4, …
• NNs can integrate information from multiple frames of data (in comparison with GMMs)
• NNs can construct feature representations, from multiple sources of data
• NNs are well suited to learning multiple modules with a common objective function

Baseline DNN system
[Figure: system diagram, output to input]
• 50,000 word pronunciation dictionary
• Small trigram LM (PPL 78, trained on 26M words)
• ~4000 tied state outputs
• 6 hidden layers, 2048 hidden units
• 11x120 FBANK inputs
• Mic array front end: Wiener filter noise cancellation, smoothed tdoa estimates, delay-sum beamforming
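The input and network shapes on this slide can be sketched as follows: 11 consecutive 120-dimensional FBANK frames are spliced into one 1320-dimensional input, passed through stacked hidden layers, and normalised by a softmax over tied states. This is only a shape-level sketch under assumptions the slide does not state: a symmetric context of 5 frames each side, sigmoid hidden units and random untrained weights.

```python
import numpy as np

def splice_frames(features, context=5):
    """Stack each frame with `context` frames on either side (edges repeat the boundary frame)."""
    num_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    window = 2 * context + 1
    return np.hstack([padded[i:i + num_frames] for i in range(window)])

def init_dnn(input_dim, hidden_dim, num_hidden, output_dim, seed=0):
    """Random (weight, bias) pairs; a trained system would load learned parameters."""
    rng = np.random.default_rng(seed)
    dims = [input_dim] + [hidden_dim] * num_hidden + [output_dim]
    return [(rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def dnn_posteriors(spliced, layers):
    """Forward pass: sigmoid hidden layers (assumed) and a softmax output layer."""
    h = spliced
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    W, b = layers[-1]
    a = h @ W + b
    a = a - a.max(axis=1, keepdims=True)   # subtract the row maximum for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)
```

For the configuration on the slide one would call `init_dnn(11 * 120, 2048, 6, 4000)`; smaller sizes are convenient for experimenting with the shapes.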

Baseline GMM results
[Bar chart: ASR word error rates (WER/%) for GMM/HMM systems]
System            AMI    ICSI
SDM               63.2   56.1
MDM beamforming   54.8   46.8
IHM               29.6   (not shown)
