Speaker Recognition and the ETSI Standard Distributed Speech Recognition Front-End
Charles Broun, David Pearce, William Campbell, Holly Kelleher
Motorola Labs, Human Interface Lab, Tempe, Arizona, USA
Motorola Limited, Basingstoke, UK
Outline
• Background
• Speaker Verification
  – Embedded Process
  – Distributed Process
• Distributed Speech Recognition (DSR)
• Classifier
• Experimental Setup
• Results
• Conclusion
Background: Motivation
• Issues with embedded solutions
  – Mobile devices do not have the necessary memory or battery capacity
  – Updating software requires access to each device
  – Multiple devices may contain different speaker models
• Potential benefits of distributed solutions
  – The server handles the computation and memory requirements
  – Software updates are handled in a single location
  – A single speaker model can serve multiple mobile devices, enabling a ‘portable’ interface
• Distributed Speech Recognition (DSR) Standard
  – This standard addresses the above issues for speech recognition
  – Can work on this standard be leveraged for speaker verification?
Speaker Verification: Embedded Process
• The feature extractor and classifier are typically combined into a single proprietary solution
• Both components can be optimized jointly
• The target system must meet the computation and memory requirements of both components
[Figure: Input Speech Data → Feature Extractor → Classifier (with Speaker Model) → Score → Compare to Threshold T → Accept if score > T, Reject if score < T]
Speaker Verification: Distributed Process
• The feature extractor is standardized
• The two components cannot be optimized jointly
• The client needs to meet only the computation and memory requirements of the feature extractor
• The server carries the heavier load of the classifier
[Figure: Input Speech Data → Feature Extractor → Wireless Channel → Classifier (with Speaker Model) → Score → Compare to Threshold T → Accept if score > T, Reject if score < T]
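The client/server split can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `client_side`, `server_side`, `toy_extractor` and `toy_score` are hypothetical names, and JSON stands in for the real DSR bitstream, which is quantized and error-protected. It only shows which step runs where and how the score is compared to the threshold T.

```python
import json
from typing import Callable, List, Sequence

Frame = List[float]  # one feature vector per frame (hypothetical type)


def client_side(speech: Sequence[float],
                extract_features: Callable[[Sequence[float]], List[Frame]]) -> bytes:
    """Mobile device: run only the standardized feature extractor."""
    frames = extract_features(speech)
    # JSON is a stand-in for the quantized, error-protected DSR bitstream.
    return json.dumps(frames).encode("utf-8")


def server_side(payload: bytes,
                score_fn: Callable[[List[Frame]], float],
                threshold: float) -> str:
    """Server: score the received features against the speaker model, then threshold."""
    frames = json.loads(payload.decode("utf-8"))
    score = score_fn(frames)
    # Accept when score > T; a tie (score == T) is treated as a reject here.
    return "accept" if score > threshold else "reject"


# Hypothetical toy components, only to exercise the split.
def toy_extractor(speech: Sequence[float]) -> List[Frame]:
    return [list(speech[i:i + 2]) for i in range(0, len(speech), 2)]


def toy_score(frames: List[Frame]) -> float:
    return sum(sum(f) for f in frames) / len(frames)


payload = client_side([0.5, 0.5, 1.0, 1.0], toy_extractor)
print(server_side(payload, toy_score, threshold=1.0))  # accept (average frame sum is 1.5)
```

Keeping the classifier and speaker model behind `score_fn` mirrors the point of the slide: the client code never changes when the server-side classifier does.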
DSR: Background of the DSR Standard
• Motivation for a standard front-end
  – Potential benefits of distributed solutions for speech recognition
  – Eliminates the voice/vocoder channel mismatch
• Activities
  – European Telecommunications Standards Institute (ETSI)
  – Aurora Working Group within ETSI
  – First standard published in February 2000
DSR: ETSI Standard DSR System Concept
• The terminal front-end is targeted at mobile devices
• Features are transmitted over a low-error data channel
• The speech recognizer runs on a high-power server
[Figure: Terminal DSR Front-End: Parameterisation (Mel-Cepstrum) → Compression (Split VQ) → Frame Structure & Error Protection → Wireless Data Channel (4.8 kbit/s) → Server DSR Back-End: Error Detection & Mitigation → Decompression → Recognition]
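Split VQ, named in the terminal block diagram, quantizes the feature vector as small sub-vectors, each against its own codebook, so only codebook indices cross the channel. The sketch below assumes feature pairs and a tiny made-up 4-entry codebook; the real codebooks, pairings and distance weights are defined in the ETSI standard and are not reproduced here, and all function names are hypothetical. As a rough budget check, assuming a 10 ms frame shift (100 frames/s), 4.8 kbit/s leaves 48 bits per frame for quantized features plus framing and error-protection overhead.

```python
from typing import List, Sequence, Tuple

Codebook = List[Tuple[float, float]]  # codewords for one pair of features


def nearest_index(pair: Tuple[float, float], codebook: Codebook) -> int:
    """Index of the codeword with the smallest squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: (pair[0] - codebook[i][0]) ** 2
                             + (pair[1] - codebook[i][1]) ** 2)


def split_vq_encode(features: Sequence[float], codebooks: List[Codebook]) -> List[int]:
    """Terminal side: quantize consecutive feature pairs, one codebook per pair."""
    if len(features) != 2 * len(codebooks):
        raise ValueError("need exactly one codebook per feature pair")
    return [nearest_index((features[2 * j], features[2 * j + 1]), cb)
            for j, cb in enumerate(codebooks)]


def split_vq_decode(indices: Sequence[int], codebooks: List[Codebook]) -> List[float]:
    """Server side: replace each index by its codeword and concatenate."""
    out: List[float] = []
    for idx, cb in zip(indices, codebooks):
        out.extend(cb[idx])
    return out


def bits_per_frame(codebooks: List[Codebook]) -> int:
    """Bits to send one frame's indices: ceil(log2(size)) per codebook."""
    return sum((len(cb) - 1).bit_length() for cb in codebooks)


# Hypothetical 4-entry codebook reused for both pairs (2 bits each).
cb = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
indices = split_vq_encode([0.9, 0.1, 0.2, 0.8], [cb, cb])
print(indices)                             # [1, 2]
print(split_vq_decode(indices, [cb, cb]))  # [1.0, 0.0, 0.0, 1.0]
print(bits_per_frame([cb, cb]))            # 4
```

Splitting the vector keeps each codebook small enough to search exhaustively on the terminal, at the cost of ignoring correlations between coefficients in different sub-vectors.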
DSR: ETSI Standard DSR Front-End
• The feature set consists of 12 mel-cepstral coefficients, logE and C0
• Quantization supports a data rate of 4800 b/s
• Error protection provides robustness to transmission errors
[Figure: Input Speech → ADC → Offcom → Framing → PE → W → FFT → MF → LOG → DCT → Feature Compression → Bit Stream Formatting → To Transmission Channel; a parallel logE block also feeds Feature Compression]
Abbreviations:
ADC: Analog-to-digital conversion
Offcom: Offset compensation
PE: Pre-emphasis
logE: Energy measure computation
W: Windowing
FFT: Fast Fourier transform
MF: Mel-filtering
LOG: Non-linear transform
DCT: Discrete cosine transform
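The processing chain in the block diagram can be approximated with NumPy. This is a generic mel-cepstrum sketch, not a bit-exact implementation of the ETSI front-end: the 8 kHz sampling rate, 25 ms frames with a 10 ms shift, 256-point FFT, 23 mel filters, 64 Hz lower filter edge, pre-emphasis factor 0.97 and offset-compensation factor 0.999 are assumed values, and the function names are hypothetical. It outputs C0 to C12 plus logE for each frame; quantization and bit-stream formatting are omitted.

```python
import numpy as np


def offset_compensation(s: np.ndarray, alpha: float = 0.999) -> np.ndarray:
    """Remove DC offset: y[n] = s[n] - s[n-1] + alpha * y[n-1]."""
    y = np.zeros(len(s))
    prev_s = prev_y = 0.0
    for n, x in enumerate(s):
        y[n] = x - prev_s + alpha * prev_y
        prev_s, prev_y = x, y[n]
    return y


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_filterbank(n_filters: int, n_fft: int, fs: float, f_low: float = 64.0) -> np.ndarray:
    """Triangular filters equally spaced on the mel scale (generic construction)."""
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(fs / 2), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge
            fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge, peak 1.0 at center
            fb[m - 1, k] = (right - k) / (right - center)
    return fb


def frontend_sketch(signal, fs=8000, frame_len=200, shift=80, n_fft=256,
                    n_filters=23, n_ceps=13, pre_emph=0.97) -> np.ndarray:
    """Return an (n_frames, 14) array: C0..C12 followed by logE."""
    s = offset_compensation(np.asarray(signal, dtype=float))          # Offcom
    window = np.hamming(frame_len)                                     # W
    fb = mel_filterbank(n_filters, n_fft, fs)                          # MF
    # DCT basis: C_i = sum_j f_j * cos(pi * i * (j + 0.5) / n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    feats = []
    for start in range(0, len(s) - frame_len + 1, shift):              # Framing
        frame = s[start:start + frame_len]
        log_e = np.log(max(float(np.sum(frame ** 2)), 1e-10))          # logE
        pe = np.append(frame[0], frame[1:] - pre_emph * frame[:-1])    # PE
        spec = np.abs(np.fft.rfft(pe * window, n_fft))                 # FFT magnitude
        log_mel = np.log(np.maximum(fb @ spec, 1e-10))                 # MF + LOG
        feats.append(np.concatenate([dct @ log_mel, [log_e]]))         # DCT + logE
    return np.array(feats)


rng = np.random.default_rng(0)
features = frontend_sketch(rng.standard_normal(8000))  # 1 s of noise at 8 kHz
print(features.shape)                                   # (98, 14)
```

The floors inside both logarithms keep silent frames from producing `-inf`; the standard defines its own flooring, which this sketch does not attempt to match.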
Classifier: Polynomial Classifier
• Given $x = [x_1\ x_2]^t$ and polynomial order $K = 2$, compute the polynomial basis vector
  $p(x) = [1\ \ x_1\ \ x_2\ \ x_1^2\ \ x_1 x_2\ \ x_2^2]^t$
• Apply a polynomial discriminant function
  $d(x, w) = w^t p(x)$
• Compute the score as the average across all frames
  $s = \frac{1}{M} \sum_{k=1}^{M} w^t p(x_k) = w^t \left( \frac{1}{M} \sum_{k=1}^{M} p(x_k) \right)$
[Figure: DSR Feature Vectors $x_k$ → Polynomial Basis Vector $p(x)$ → Discriminant Function $d(x, w)$ (Speaker Model $w$) → Average $\Sigma$ → Score $s$]
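The basis expansion and the frame-averaged score can be written directly from the equations for $p(x)$ and $s$. A minimal sketch in plain Python; `poly_basis` and `poly_score` are hypothetical names, and the speaker model `w` is taken as given because training is not covered here. Monomials are ordered by degree, which reproduces $p(x)$ exactly for two features and $K = 2$.

```python
from itertools import combinations_with_replacement
from math import comb, prod
from typing import List, Sequence


def poly_basis(x: Sequence[float], order: int) -> List[float]:
    """All monomials of x up to total degree `order`, lowest degree first.

    For x = [x1, x2] and order 2: [1, x1, x2, x1^2, x1*x2, x2^2].
    """
    terms = [1.0]
    for degree in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            terms.append(float(prod(x[i] for i in idx)))
    return terms


def poly_score(frames: Sequence[Sequence[float]], w: Sequence[float], order: int) -> float:
    """s = (1/M) sum_k w^t p(x_k), computed as w^t times the average basis vector."""
    m = len(frames)
    avg = [0.0] * len(w)
    for x in frames:
        p = poly_basis(x, order)
        if len(p) != len(w):
            raise ValueError("model length does not match basis length")
        for j, t in enumerate(p):
            avg[j] += t / m
    return sum(wj * aj for wj, aj in zip(w, avg))


print(poly_basis([2.0, 3.0], 2))                                    # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
print(poly_score([[1.0, 0.0], [0.0, 1.0]], [0, 1, 1, 0, 0, 0], 2))  # 1.0
print(len(poly_basis([0.0] * 12, 3)), comb(15, 3))                  # 455 455
```

Because the score is linear in $w$, the server can accumulate the averaged basis vector over all frames and take a single inner product at the end. The last line checks the model size for 12 features and a 3rd-order expansion: the number of monomials of degree at most $K$ in $n$ variables is $\binom{n+K}{K}$, here $\binom{15}{3} = 455$.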
Experimental Setup
YOHO Database
• 138 speakers
• Enrollment
  – 4 sessions
  – 24 phrases
  – Example phrase: “23-45-56”
• Testing
  – 10 sessions
  – 4 phrases
  – Example phrase: “45-23-56”
Speaker Verification System
• Classifier: 3rd-order polynomial
• Features: 12 MFCCs from the DSR front-end
• Channel: GSM bit-error masks
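One way to apply a bit-error mask is to XOR it with the transmitted bitstream, so every 1 bit in the mask flips the corresponding received bit. A minimal sketch, assuming the mask is stored as bytes aligned with the bitstream; the mask format is not specified here, and `apply_error_mask` and `bit_error_rate` are hypothetical names.

```python
def apply_error_mask(bitstream: bytes, mask: bytes) -> bytes:
    """Corrupt a bitstream: received = sent XOR mask (1 bits in the mask are errors)."""
    if len(mask) < len(bitstream):
        raise ValueError("mask shorter than bitstream")
    return bytes(b ^ m for b, m in zip(bitstream, mask))


def bit_error_rate(sent: bytes, received: bytes) -> float:
    """Fraction of bits that differ between two equal-length bitstreams."""
    flipped = sum(bin(a ^ b).count("1") for a, b in zip(sent, received))
    return flipped / (8 * len(sent))


sent = bytes([0b10101010, 0xFF])
mask = bytes([0b00000001, 0x00])        # hypothetical mask: one error in 16 bits
received = apply_error_mask(sent, mask)
print(received == bytes([0b10101011, 0xFF]))  # True
print(bit_error_rate(sent, received))         # 0.0625
```

Applying a fixed, stored mask rather than drawing random errors makes every condition reproducible, so enrollment and verification under the same error pattern see identical corruption.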
Results: Performance
Average Equal Error Rate (%) for a 1-Phrase Test (rows: enrollment condition; columns: verification condition)

Enroll \ Verify   Unquantized   Error-Free   EP1    EP2    EP3
Unquantized       1.18          -            -      -      -
Error-Free        -             1.22         1.22   1.26   1.67
EP1               -             1.22         1.22   1.26   1.67
EP2               -             1.22         1.22   1.27   1.66
EP3               -             1.26         1.26   1.30   1.70
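The table reports equal error rates, the operating point where the false-acceptance rate (FAR) equals the false-rejection rate (FRR). A minimal sketch of one common way to estimate it from lists of genuine and impostor scores; `equal_error_rate` is a hypothetical name, the threshold sweep over observed scores is an assumption, and how the per-speaker figures were averaged for the table is not stated.

```python
from typing import Sequence, Tuple


def equal_error_rate(genuine: Sequence[float],
                     impostor: Sequence[float]) -> Tuple[float, float]:
    """Return (EER, threshold), sweeping the threshold over all observed scores.

    A trial is accepted when score > T, as in the verification diagrams.
    FRR = fraction of genuine trials rejected; FAR = fraction of impostor
    trials accepted. The EER is the mean of FAR and FRR where they are closest.
    """
    best_gap, best_eer, best_t = float("inf"), 0.0, 0.0
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s <= t for s in genuine) / len(genuine)
        far = sum(s > t for s in impostor) / len(impostor)
        if abs(far - frr) < best_gap:
            best_gap, best_eer, best_t = abs(far - frr), (far + frr) / 2, t
    return best_eer, best_t


print(equal_error_rate([0.8, 0.9], [0.1, 0.2]))            # (0.0, 0.2)
print(equal_error_rate([0.3, 0.6, 0.9], [0.1, 0.4, 0.7]))  # EER 1/3 at threshold 0.4
```

Sweeping only the observed scores is enough for a sketch; with few trials the FAR and FRR curves are step functions and may never cross exactly, which is why the midpoint of the closest pair is reported.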
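Conclusion
Demonstrated that the ETSI Standard Distributed Speech Recognition Front-End is viable for speaker verification: quantization raises the average EER only from 1.18% to 1.22%, and even with EP3 errors during verification it stays at or below 1.70%.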