A Low-Power Text-Dependent Speaker Verification System with Narrow-Band Feature Pre-Selection and Weighted Dynamic Time Warping
Qing He, Gregory Wornell, Wei Ma
June 21, 2016
Texas Instruments
MIT - Signals, Information and Algorithms Lab
Motivation: Low-Power Wake-up
• Conventionally, the host device is always on for voice wake-up
– A high data acquisition rate minimizes information loss and enables flexible downstream processing
– Processing involves many stages operating on high-dimensional data
• Much lower power consumption can be achieved with an application-specific, voice-authenticated wake-up front-end
– Early-stage signal dimension reduction with analog components
– Adaptive data acquisition and robust processing
[Figure: a host device (ADC, DSP) that is always on draws >100 mW; a low-power front-end that is always on draws ~50-300 µW and turns the host device on with a wake-up signal]
System Architecture: A Comparison
• Conventional system (high power consumption)
– High sampling rate (e.g., ~24 kHz), windowing, MFCC acoustic feature extraction, speaker verification, accept/reject
– High-dimensional features require fast processing
• Proposed system (low power consumption)
– Analog front-end: narrowband filter bank for spectral feature extraction, with a low-rate ADC (< 4 kHz)
– Low-rate processing (kHz): NBSC features are matched against the enrollment samples by weighted DTW pattern matching, giving accept/reject
Spectral Feature Pre-Selection
• A few carefully selected narrow bands are capable of preserving most speech information
[Figure: spectral feature pre-selection feeding the speaker-verification backend, which produces the output]
Feature Selection | Weighted DTW | Experiments
Review: Speech Sound Generation
[Figure panels: (a) excitation signal; (b) vocal tract modulation; (c) speech spectrogram]
• Vocal tract modulation: essential for speech recognition
• Separable in the cepstral domain
Cepstral Representation of Speech
[Figure: spectrogram (time (s) vs. frequency (Hz)) and spectral density (vs. frequency (Hz)); an IFFT yields the cepstral coefficients (vs. quefrency (cycle/kHz)); harmonics and vocal tract modulation are labeled]
• Acquiring the entire speech spectrum and transforming it to the cepstral domain is power-expensive!
• Question: how to extract […] without acquiring the full spectrum or performing the transformation to the cepstral domain?
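To make the transformation on this slide concrete, here is a minimal sketch of a real cepstrum, computed as the inverse DFT of the log-magnitude spectrum. It uses a naive pure-Python DFT for self-containment; the names `dft` and `real_cepstrum`, the `eps` floor, and the echo test frame are assumptions for illustration and are not taken from the talk.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (illustrative, not optimized)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[i] * cmath.exp(sign * 2j * math.pi * k * i / n) for i in range(n))
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(frame, eps=1e-12):
    """Inverse DFT of the log-magnitude spectrum of one frame."""
    log_mag = [math.log(abs(v) + eps) for v in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]

# Hypothetical test frame: an impulse plus an echo 8 samples later.
# The echo adds a periodic ripple to the log spectrum.
frame = [0.0] * 64
frame[0], frame[8] = 1.0, 0.5
cep = real_cepstrum(frame)

# Location of the largest cepstral value among positive quefrencies.
peak_quefrency = max(range(1, 33), key=lambda q: cep[q])
```

A periodic ripple in the log spectrum shows up as a peak at the matching quefrency, which is the same mechanism that separates the harmonic structure from the slowly varying spectral envelope in speech.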
Point-Wise Spectral Sampling on the Harmonics
[Figure: spectrogram (time (s) vs. frequency (Hz), in dB) multiplied (×) by a point-wise sampling pattern at the harmonics, with the resulting cepstral coefficients]
• […] is retained
Narrow-Band Spectral Filtering
[Figure: spectrogram (time (s) vs. frequency (Hz), in dB) multiplied (×) by a narrow-band filter response over frequency (Hz)]
Narrow-Band Spectral Filtering: Parameters
[Figure: spectrogram (time (s) vs. frequency (Hz), in dB)]
• […] is mostly retained
• With […] = 100 Hz, […] = 200 Hz, […] = 800 Hz, aliasing at the baseband is attenuated significantly
– where […] is the narrow-band bandwidth
– and […] is the spacing between narrow bands
Narrow-Band Spectral Coefficients (NBSC)
[Figure: parallel band-pass (BP) filtering branches]
• Narrow-band spectral features retain essential speech information
• A small number of filters keeps power low
• Low-rate sampling and simple processing
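As a rough digital stand-in for the analog band-pass branches, the sketch below computes a per-frame log energy in a few narrow bands. This only illustrates the idea of narrow-band spectral features and is not the authors' front-end; the function names, frame sizes, and band centers are hypothetical, and the 200 Hz default bandwidth is the value listed on the experiment setup slide.

```python
import cmath
import math

def band_energy_db(frame, fs, center_hz, bandwidth_hz):
    """Log energy of one frame inside a single narrow band (direct DFT on in-band bins)."""
    n = len(frame)
    k_lo = max(0, math.ceil((center_hz - bandwidth_hz / 2) * n / fs))
    k_hi = min(n // 2, math.floor((center_hz + bandwidth_hz / 2) * n / fs))
    energy = 0.0
    for k in range(k_lo, k_hi + 1):
        x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(frame))
        energy += abs(x_k) ** 2
    # Small floor avoids log(0) for empty bands.
    return 10 * math.log10(energy / n + 1e-12)

def narrow_band_features(signal, fs, centers_hz, bandwidth_hz=200.0, frame_len=256, hop=128):
    """One feature vector (one value per band) for each frame of the signal."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [[band_energy_db(signal[s:s + frame_len], fs, c, bandwidth_hz)
             for c in centers_hz] for s in starts]

# Hypothetical example: a 1 kHz tone sampled at 8 kHz, three assumed band centers.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(512)]
feats = narrow_band_features(tone, fs, centers_hz=[500, 1000, 2000])
```

Each frame is reduced to as many numbers as there are bands, which is the dimension reduction the slide points to; in the proposed system this happens in analog hardware before the ADC rather than in software.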
Block Diagram of the Proposed System
[Figure: analog front-end (a bank of parallel band branches) feeding a digital back-end, where weighted dynamic time warping (DTW) against the enrollment samples is compared with a threshold to accept or reject]
• Individual bands can be discarded in the presence of noise
Weighted DTW
• Voice-authenticated wake-up: identifies the user and the passphrase in one shot
• User-defined passphrase (~1 s)
• Very few enrollment samples (e.g., 3)
[Figure: spectral feature selection, then weighted dynamic time warping, then output]
Overview: Speaker-Verification Systems
• Text-dependent speaker verification
– Model based: GMM [Reynolds, 2000], i-vectors [Dehak, 2011], DNN [Liu, 2015], HMM [Rosenberg, 1990]
[Figure: training data feeds model training; enrollments feed model adaptation; MFCC feature extraction feeds the model, whose score is compared with a threshold to accept or reject]
– Template based: DTW [Sakoe, 1978]; no prior model training
[Figure: feature extraction feeds a distance measure against the enrollments, which is compared with a threshold to accept or reject]
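The template-based path can be sketched as a distance to the enrollment templates followed by a threshold test. The sketch below is only an illustration under stated assumptions: `mean_abs_distance` is a placeholder for a DTW distance, taking the minimum over templates is one possible way to combine scores (the slides do not say how they are combined), and all names and values are hypothetical.

```python
def mean_abs_distance(a, b):
    """Placeholder distance for equal-length sequences (a stand-in for a DTW distance)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def verify(test, enrollments, threshold, distance=mean_abs_distance):
    """Accept when the closest enrollment template is within the threshold."""
    score = min(distance(test, template) for template in enrollments)
    return score <= threshold, score

# Hypothetical 1-D feature tracks standing in for 3 enrollment samples.
enrollments = [[0, 1, 2, 1, 0], [0, 1, 2, 2, 0], [0, 1, 3, 1, 0]]
accepted, score = verify([0, 1, 2, 1, 0], enrollments, threshold=0.5)
rejected, far_score = verify([3, 3, 3, 3, 3], enrollments, threshold=0.5)
```

No background model is trained here: the only stored data are the enrollment templates and one threshold, which is what distinguishes this path from the model-based one.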
Weighted Dynamic Time Warping
[Figure: reference signal and speech input aligned by stretching and compressing, with regions of large and small penalty marked; example with M = 3]
• The distance between two points […] and […] is equal to the distance […] plus a penalty term
• The penalty scales with the number of consecutive warping steps M
• The penalty scales with the signal magnitude
– The penalty for warping is low when the signal is small
– The penalty for warping is high when the signal is large
Distance Matrix Computation
[Equation: distance matrix recursion with a warping cost term]
• The cost is a function of the signal magnitude and the number of consecutive warping steps
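The exact recursion is not reproduced in this text, so the sketch below is only one possible reading of the description: classical DTW plus a penalty on each non-diagonal (warping) step that grows with the number of consecutive warping steps and with the signal magnitude. The penalty form `alpha * steps * magnitude`, the weight `alpha`, and the shortcut of tracking a single run length per cell are all assumptions, not the authors' formulation.

```python
def weighted_dtw(ref, test, alpha=0.1):
    """DTW distance with a warping penalty (illustrative form, see assumptions above)."""
    n, m = len(ref), len(test)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]  # best accumulated cost ending at (i, j)
    run = [[0] * m for _ in range(n)]     # consecutive warping steps on that best path
    cost[0][0] = abs(ref[0] - test[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            local = abs(ref[i] - test[j])
            magnitude = max(abs(ref[i]), abs(test[j]))
            best, best_run = inf, 0
            # Diagonal step: no warping, so no penalty and the run resets.
            if i > 0 and j > 0 and cost[i - 1][j - 1] < best:
                best, best_run = cost[i - 1][j - 1], 0
            # Vertical and horizontal steps: warping, penalized by run length and magnitude.
            for pi, pj in ((i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < inf:
                    steps = run[pi][pj] + 1
                    candidate = cost[pi][pj] + alpha * steps * magnitude
                    if candidate < best:
                        best, best_run = candidate, steps
            cost[i][j] = best + local
            run[i][j] = best_run
    return cost[n - 1][m - 1]

# Hypothetical sequences: the test repeats one sample of the reference.
ref = [0, 1, 2, 1, 0]
stretched = [0, 1, 1, 2, 1, 0]
classical = weighted_dtw(ref, stretched, alpha=0.0)  # alpha = 0 reduces to classical DTW
weighted = weighted_dtw(ref, stretched, alpha=0.1)
```

With `alpha=0.0` the repeated sample is absorbed at no cost, while with `alpha=0.1` the single warping step at magnitude 1 adds 0.1. Warping where both signals are near zero would add almost nothing, which matches the stated behavior that the penalty is low when the signal is small.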
Classical vs. Weighted DTW
• Classical DTW: fails to align the signal envelopes; the shape of T is mutated
• Weighted DTW: signal envelopes are well aligned; less mutation of the signal envelope
System Experiment
[Figure: spectral feature selection, then weighted dynamic time warping, then output]
Experiment Setup
Data set:
  Passphrase    # of speakers   # of repetitions
  Hi Galaxy     40              40
  OK Glass      40              20
  OK Hua Wei    30              20
• Noisy samples: wind and car noises are added to each clean sample such that the total SNR is 3 dB
• # of enrollment samples: 3
Parameters:
• Narrow-band spectral coefficients (NBSC) bandwidth: 200 Hz
• f0 estimation using the autocorrelation method [Rabiner, 1976]
Baseline systems:
• 40-dim MFCC + classical DTW
• 40-dim MFCC + GMM-UBM model
Summary of Experiment Results
EER [%]:
                 Clean                    Noisy (3 dB)
  Algorithm      MFCC       NBSC          MFCC       NBSC
                 (40-dim)   (12 bands)    (40-dim)   (8 bands)
  Weighted-DTW   0.9        1.1           10.5       5.7
  DTW            1.4        1.5           13         6.7
  GMM/UBM        2.6        N/A           6.8        N/A
Note: features below 2 kHz are dropped
• Without noise, NBSC yields accuracy comparable to the MFCC features
• At 3 dB SNR, NBSC yields much better accuracy than the MFCC features
• Weighted DTW yields better accuracy than classical DTW for all features
• The proposed system yields better accuracy than the GMM/UBM method
– Taking only 3 enrollment samples as prior
– Without prior background model training
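The results are reported as equal error rate (EER), the operating point where the false-accept and false-reject rates coincide. A minimal sketch of estimating it from distance scores follows; the threshold sweep, the function name, and the toy scores are assumptions for illustration and say nothing about how the reported numbers were computed.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from distance scores, accepting when distance <= threshold."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False reject: a genuine trial whose distance exceeds the threshold.
        frr = sum(d > threshold for d in genuine) / len(genuine)
        # False accept: an impostor trial whose distance is within the threshold.
        far = sum(d <= threshold for d in impostor) / len(impostor)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Hypothetical distance scores (smaller means a closer match).
genuine_scores = [0.1, 0.2, 0.3, 0.9]
impostor_scores = [0.25, 1.0, 1.2, 1.5]
eer = equal_error_rate(genuine_scores, impostor_scores)  # 0.25 for these toy scores
```

With finite score sets the two error rates rarely cross exactly, so the sketch returns their average at the threshold where they are closest.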
Experiments: Adaptive Band Selection
NBSC EER [%]:
  # of filters             6      8      10     12
  Clean                    1.99   1.9    1.54   1.1
  Noisy (band selection)   6.8    6.6    6.3    5.7
  Noisy (all bands)        15.5   15     15     14.5
• Accuracy improves as the # of bands increases
• Accuracy improves significantly with band selection
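Band selection amounts to dropping some feature columns before pattern matching. The results summary notes that features below 2 kHz are dropped, so the sketch below uses a simple frequency cutoff as the selection rule; how the system actually decides which bands to discard is not described in this text, and the function name, band centers, and feature values are hypothetical.

```python
def drop_low_bands(features, centers_hz, cutoff_hz=2000.0):
    """Keep only the feature columns whose band center is at or above the cutoff."""
    kept = [i for i, center in enumerate(centers_hz) if center >= cutoff_hz]
    selected = [[frame[i] for i in kept] for frame in features]
    return selected, kept

# Hypothetical band centers and two frames of per-band features.
centers_hz = [500, 1000, 1500, 2000, 2500, 3000]
features = [
    [9.0, 8.0, 7.0, 3.0, 2.0, 1.0],
    [9.5, 8.5, 7.5, 3.5, 2.5, 1.5],
]
selected, kept = drop_low_bands(features, centers_hz)
```

Because each band is a separate feature column, discarding a band needs no retraining: the same columns are dropped from the enrollment templates and the test input before the distance is computed.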
System Power Estimation
                                                  Fixed power   Additional power   Total power
                                                  (µW)          per band (µW)      (12 bands) (µW)
  Front-end: TI's 13-band filter-bank features    150           10                 270
  Back-end: text-dependent speaker verification   0             <9                 <108
  Total                                                                            <380
• Back-end implementation:
– Cortex-M0 microcontroller
– Clock speed: 40 MHz
– Decision: every 60 ms
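The totals in the table follow from fixed power plus per-band power times the number of bands. The short sketch below reproduces that arithmetic; the function name is hypothetical, the back-end figures are upper bounds as on the slide, and the cycle count assumes the 40 MHz clock runs for the full 60 ms decision interval.

```python
def total_power_uw(fixed_uw, per_band_uw, n_bands):
    """Total power in microwatts: fixed part plus a per-band part."""
    return fixed_uw + per_band_uw * n_bands

front_end_uw = total_power_uw(150, 10, 12)          # 150 + 10 * 12 = 270
back_end_bound_uw = total_power_uw(0, 9, 12)        # upper bound: 9 * 12 = 108
system_bound_uw = front_end_uw + back_end_bound_uw  # 378, consistent with "<380"

# Clock cycles available per decision at 40 MHz with one decision every 60 ms.
cycles_per_decision = 40_000_000 * 60 // 1000       # 2,400,000
```

The per-band terms are why discarding bands also saves power: each band removed takes out 10 µW in the front-end and up to 9 µW in the back-end.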
Summary: Low-Power Text-Dependent Speaker Verification
• Spectral feature selection
– Early-stage signal dimension reduction
– Analog feature-extraction front-end
– Low-rate sampling and processing
• Weighted dynamic time warping
– Supports adaptive band selection
– Improved robustness to noise (noisy bands are discarded)
• Demonstrated accuracy comparable to existing systems, with a low-power implementation
Questions?
Back-up Slides
False Positives under Continuous Running
• Out-of-vocabulary (OOV) samples:
– 50,000 samples of 1.2 s duration
– Short commands, utterances from audio books and conversations
• The decision threshold is the same as the speaker-verification EER threshold
OOV false positives [%]:
                 Clean                    Noisy (3 dB)
  Algorithm      MFCC       NBSC          MFCC       NBSC
                 (40-dim)   (12 bands)    (40-dim)   (8 bands)
  Weighted-DTW   0          0             1.4        0.6
• ~1 false positive per hour in a noisy restaurant
Experiments: Adaptive Band Selection
EER [%]:
  Features                 NBSC                         MFSC
  # of filters             6      8      10     12      13     26
  Clean                    1.99   1.9    1.54   1.1     1.95   1.83
  Noisy (band selection)   6.8    6.6    6.3    5.7     16.4   17.2
  Noisy (all bands)        15.5   15     15     14.5    33.4   33.9
• Accuracy improves as the # of bands increases
• Accuracy improves significantly with band selection
• NBSC yields much better performance than MFSC, which uses a larger number of filters
Narrow-Band Spectral Filtering
[Figure: spectrogram (time (s) vs. frequency (Hz), in dB) multiplied (×) by a narrow-band filter response, with the resulting cepstral coefficients (vs. quefrency (cycle/kHz))]