Improved Soft Decisions in Missing Data ASR: Using Harmonicity in Conjunction with Local SNR Estimates
Speech and Hearing Research Group, Dept. Computer Science, University of Sheffield, UK
January 24, 2001

Improved Soft Decisions in Missing Data ASR: Combining Masks
- Soft Decisions in Missing Data
- Harmonicity-based Fuzzy Masks
- Merging Local SNR and Harmonicity Masks
- Aurora 2000 Results
- Conclusions
September 25, 2000

Soft Decisions in Missing Data
[Figure: an SNR estimate (frequency vs. time) is converted either to a discrete 0/1 mask by a threshold, or to a fuzzy mask with values between 0 and 1.]
Soft mask values are interpreted as "the probability that the data is reliable".
So rather than using the present data likelihood OR the missing data ‘induction constraint’, every point uses a weighted sum of BOTH terms.
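The contrast between the discrete and fuzzy masks can be sketched in a few lines. This is only an illustration: the logistic (sigmoid) mapping is assumed from the deck's later references to sigmoid parameters, and the function names and the `slope` and `threshold_db` values are hypothetical, not taken from the slides.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def discrete_mask(snr_db, threshold_db=0.0):
    """Hard 0/1 mask: a point is reliable if its local SNR exceeds the threshold."""
    return [[1.0 if s > threshold_db else 0.0 for s in frame] for frame in snr_db]

def fuzzy_mask(snr_db, slope=0.5, threshold_db=0.0):
    """Soft mask in (0, 1); slope and threshold_db are illustrative assumptions."""
    return [[sigmoid(slope * (s - threshold_db)) for s in frame] for frame in snr_db]

# Two frames of three frequency channels, local SNR estimates in dB (made-up values).
snr = [[-12.0, 0.0, 9.0], [3.0, -1.0, 20.0]]
hard = discrete_mask(snr)
soft = fuzzy_mask(snr)
```

Points close to the threshold receive mask values near 0.5 in the fuzzy case, so a small error in the SNR estimate changes the mask gradually instead of flipping a hard decision.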

Using Soft Decisions
Missing data probability calculation for discrete masks, showing the separate present and missing components:
[Equation lost in extraction.]
With soft decisions the probability due to each feature vector component becomes a weighted sum of the present and missing probability terms:
[Equation lost in extraction.]
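The slide's own equations did not survive extraction. As a hedged reconstruction, one form commonly used in the missing data literature is given below. The notation is assumed, not taken from the slide: $q$ is a model state with independent components, $f(x_i \mid q)$ is the density of component $i$, $R$ and $U$ are the reliable (present) and unreliable (missing) index sets, $w_i$ is the soft mask value, and the missing term is a bounded marginal over $[0, x_i]$. The $1/x_i$ normalisation is also an assumption; some formulations omit it.

```latex
% Discrete mask: separate present and missing factors
P(\mathbf{x} \mid q) =
  \prod_{i \in R} f(x_i \mid q)
  \prod_{i \in U} \frac{1}{x_i} \int_{0}^{x_i} f(\xi \mid q)\, d\xi

% Soft mask: every component uses a weighted sum of both terms
P(\mathbf{x} \mid q) =
  \prod_{i} \left[ w_i\, f(x_i \mid q)
  + (1 - w_i)\, \frac{1}{x_i} \int_{0}^{x_i} f(\xi \mid q)\, d\xi \right]
```

Setting every $w_i$ to 0 or 1 in the second expression recovers the first, which is the sense in which the soft form generalises the discrete one.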

Using Soft Decisions
Generalising to models employing Gaussian mixtures:
[Equation lost in extraction.]
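A minimal sketch of how the soft-decision likelihood might be evaluated for a Gaussian mixture state follows. It assumes diagonal covariances, strictly positive spectral features, and a bounded-marginal missing term averaged over [0, x]; none of these details are recoverable from the slide, and all function names are hypothetical.

```python
import math

def gauss_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gauss_cdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def soft_component_term(x, w, mean, var):
    """Weighted sum of the present and missing terms for one feature component."""
    present = gauss_pdf(x, mean, var)
    # Assumed missing term: average density over the bound [0, x]; requires x > 0.
    missing = (gauss_cdf(x, mean, var) - gauss_cdf(0.0, mean, var)) / x
    return w * present + (1.0 - w) * missing

def soft_gmm_likelihood(x, mask, mixtures):
    """mixtures: list of (mixture_weight, means, variances), diagonal covariance."""
    total = 0.0
    for weight, means, variances in mixtures:
        prod = 1.0
        for xi, wi, m, v in zip(x, mask, means, variances):
            prod *= soft_component_term(xi, wi, m, v)
        total += weight * prod
    return total

# Toy two-component mixture over a two-dimensional feature vector (made-up values).
mixtures = [(0.6, [1.0, 1.0], [1.0, 1.0]), (0.4, [3.0, 0.5], [2.0, 0.5])]
x = [2.0, 1.0]
all_present = soft_gmm_likelihood(x, [1.0, 1.0], mixtures)
half_soft = soft_gmm_likelihood(x, [0.5, 0.5], mixtures)
```

With a mask of all ones the function reduces to the ordinary mixture likelihood, which gives a quick sanity check on any implementation.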

Harmonicity Masks
[Figure: processing chain. Noisy signal -> gammatone filterbank (32 channels) -> haircell model (instantaneous envelope, applied to each channel) -> autocorrelation over a temporal window, giving a correlogram (32 frequency channels, 150 lags) -> sum across frequency, giving the summary autocorrelogram -> pitch peak tracking, giving the peak's lag index (~1/f0) and the peak's height (degree of voicing) -> select that lag from the correlogram (freq, lag) -> harmonicity for each of the 32 frequency channels -> fuzzy mapping with parameters (s1, T1) -> fuzzy mask.]
- The Harmonicity Mask is designed to mark voiced speech regions.
- It works well when noise is inharmonic or the SNR is favourable.
- Refinements are necessary when noise is harmonic and dominant: -> pitch tracking, multisource decoding?
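The correlogram stages of this chain can be sketched as follows. The sketch starts from per-channel envelopes, so the gammatone filterbank and haircell model are assumed to have run already. The energy normalisation, the lag search range, and the sigmoid `slope` and `threshold` values are illustrative assumptions; the slide specifies 32 channels and 150 lags, while the toy input here uses two channels.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def autocorrelation(x, max_lag):
    """Energy-normalised autocorrelation of one channel for lags 0..max_lag."""
    energy = sum(v * v for v in x) or 1.0
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag)) / energy
            for lag in range(max_lag + 1)]

def harmonicity_mask(envelopes, min_lag, max_lag, slope=10.0, threshold=0.5):
    """Return (fuzzy mask per channel, pitch lag, degree of voicing) for one frame."""
    correlogram = [autocorrelation(ch, max_lag) for ch in envelopes]
    n_channels = len(correlogram)
    # Summary autocorrelogram: average across frequency channels.
    summary = [sum(acf[lag] for acf in correlogram) / n_channels
               for lag in range(max_lag + 1)]
    # Pitch peak: lag index approximates 1/f0, height is the degree of voicing.
    pitch_lag = max(range(min_lag, max_lag + 1), key=lambda lag: summary[lag])
    voicing = summary[pitch_lag]
    # Harmonicity: each channel's correlogram value at the pitch lag.
    harmonicity = [acf[pitch_lag] for acf in correlogram]
    mask = [sigmoid(slope * (h - threshold)) for h in harmonicity]
    return mask, pitch_lag, voicing

# Toy frame: one periodic channel (period 8 samples) and one aperiodic channel.
periodic = [math.sin(2.0 * math.pi * n / 8.0) for n in range(64)]
impulse = [1.0] + [0.0] * 63
mask, pitch_lag, voicing = harmonicity_mask([periodic, impulse], min_lag=4, max_lag=20)
```

In this toy frame the periodic channel is marked as reliable and the aperiodic one is not, which mirrors the intent of highlighting voiced speech regions.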

Mask Combination
We now have two fuzzy masks:
- Fuzzy SNR-based mask: works well in stationary noise.
- Fuzzy harmonicity-based mask: highlights voiced speech regions.
We also have a ‘degree of voicing’ parameter, V.
How do we combine the masks?

Mask Combination
Discrete combination (one parameter):
If the degree of voicing V exceeds a threshold, the frame is Voiced; otherwise the frame is Unvoiced. Then:
- Voiced frames -> use the harmonicity-based mask
- Unvoiced frames -> fall back on the SNR mask
Fuzzy combination (two parameters):
[Figure: raw harmonicity data -> fuzzy mapping (s1, T1) -> harmonicity mask M_h; local SNR estimate -> fuzzy mapping (s3, T3) -> SNR mask M_s; degree of voicing -> fuzzy mapping (s2, T2) -> weight w; hybrid mask = w M_s + (1 - w) M_h.]
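Both combination schemes can be sketched briefly. The sketch reads the slide's flattened formula as w M_s + (1 - w) M_h, and it assumes that w falls as the degree of voicing rises, so that strongly voiced frames lean on the harmonicity mask in line with the discrete rule. The subscript order, the direction of the sigmoid, and the parameter values are assumptions; the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def combine_discrete(m_h, m_s, voicing, voicing_threshold=0.5):
    """One parameter: voiced frames use the harmonicity mask, others the SNR mask."""
    return list(m_h) if voicing > voicing_threshold else list(m_s)

def combine_fuzzy(m_h, m_s, voicing, slope=10.0, threshold=0.5):
    """Two parameters (slope, threshold) map voicing to the weight w on the SNR mask."""
    w = 1.0 - sigmoid(slope * (voicing - threshold))  # assumed: w falls with voicing
    return [w * s + (1.0 - w) * h for h, s in zip(m_h, m_s)]

# Masks for one frame over four channels (made-up values).
m_h = [0.9, 0.8, 0.1, 0.0]
m_s = [0.2, 0.6, 0.7, 0.3]
voiced = combine_fuzzy(m_h, m_s, voicing=1.0)
borderline = combine_fuzzy(m_h, m_s, voicing=0.5)
hard = combine_discrete(m_h, m_s, voicing=0.9)
```

A frame whose voicing sits exactly at the sigmoid threshold gets the plain average of the two masks, which removes the abrupt switch of the discrete scheme.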

Tuning the Voicing Sigmoid
[Figure: two panels, Clean and Car 10dB. y-axis: Voicing (0.4 to 1.2); x-axis: Lag (~1/f0), 5ms to 15ms.]
Voicing vs. Lag for female and male speakers.

Comparison with A Priori Masks
[Figure: masks for male utterance "4382" + car noise at 20dB SNR; frequency 50Hz to 3800Hz, time 0 to 1.7 secs. Panels: A priori mask, Local SNR Estimate Mask, Harmonicity Based Mask, Combined Mask.]

Comparison with A Priori Masks
[Figure: masks for male utterance "4382" + car noise at 10dB SNR; frequency 50Hz to 3800Hz, time 0 to 1.7 secs. Panels: A priori mask, Local SNR Estimate Mask, Harmonicity Based Mask, Combined Mask.]

Aurora 2000 Experiments
- Trained on clean data.
- Testing using Set A (i.e. subway, exhibition, babble and car noises).
- Features: 32 channel gammatone filter bank, + deltas.
- Two slightly different sets of models:
  + Aurora Models: 16 states per digit,
  + DC Models: 11.5 states per digit on average.
- 7 mixtures per state (note: a relatively large number of mixtures is needed for spectral features).

Aurora Results: Test Set A
[Figure: four panels (Car Noise, Exhibition Noise, Subway Noise, Babble Noise). y-axis: WER (0 to 100); x-axis: SNR (dB), -5 to 20 and Clean. Curves: Discrete SNR, Fuzzy SNR, +Harmonicity.]
(32 channel filter bank + deltas)

Aurora Results: WER averaged over noise condition
[Figure: WER (0 to 100) vs. SNR (dB), -5 to 20 and Clean. Legend: Discrete SNR, Fuzzy SNR, +Harmonicity, (MultiCondition).]

MASK / SNR      -5dB    0dB    5dB   10dB   15dB   20dB  Clean
Discrete SNR    83.8   56.6   34.0   17.2    8.5    4.1    1.2
Fuzzy SNR       69.7   41.2   20.1   10.1    5.7    3.4    1.5
+ Harmonicity   66.6   36.4   16.9    8.3    4.3    2.5    1.4

Aurora WER Results: Aurora vs. DC Word Models
[Figure: WER (0 to 100) vs. SNR (dB), -5 to 20 and Clean. Curves: 16 State Models, DC Models.]

Models / SNR      -5dB    0dB    5dB   10dB   15dB   20dB  Clean
16 State Models   66.6   36.4   16.9    8.3    4.3    2.5    1.4
DC Word Models    69.4   39.8   19.9    9.9    5.3    3.2    1.7

Conclusions
- In combination, Harmonicity and Local SNR masks perform better than either mask individually, i.e.:
  + better approximation to the a priori (‘cheating’) mask,
  + better recognition results.
- The mask generation parameters are robust, i.e. one set of parameters performs well over a large range of noise types and noise levels.
- Sensible values can be estimated from clean speech.

Further Work
- Temporal Smoothing. Smoothing the masks appears to improve results for some noise types, but seriously damages results for others.
- Using F0 Information. Using F0 to distinguish between voiced speech and harmonic noise. F0 tracking. ‘Multi-pitch’ decoding.
- Adaptive Sigmoid Parameters. Techniques for fine-tuning the mask generation parameters according to the noise estimate.
- More General Mask Combination Techniques.

Learning Noise Specific Parameters
[Figure: 20 kHz TIDigits + Factory Noise. y-axis: Digit recognition accuracy (0 to 50); x-axis: SNR (dB), 0 to 20 and 200. Legend text: Discrete SNR; Fuzzy SNR (ICSLP); Tuned Fuzzy; Autoc/SNR; (Apriori).]
Parameters tuned to minimise distance to a priori masks at 0 and 5 dB.
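One simple way to tune mask parameters against an a priori mask is an exhaustive grid search, sketched below. The slide does not say which distance measure or search procedure was used; the mean squared difference, the grid values, and all function names here are assumptions for illustration only.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fuzzy_mask(snr_db, slope, threshold_db):
    """Soft mask values for a flat list of local SNR estimates."""
    return [sigmoid(slope * (s - threshold_db)) for s in snr_db]

def mask_distance(mask, reference):
    """Assumed distance: mean squared difference between two masks."""
    return sum((m - r) ** 2 for m, r in zip(mask, reference)) / len(mask)

def tune_sigmoid(snr_db, apriori, slopes, thresholds):
    """Return (distance, slope, threshold) minimising distance to the a priori mask."""
    best = None
    for slope in slopes:
        for thr in thresholds:
            d = mask_distance(fuzzy_mask(snr_db, slope, thr), apriori)
            if best is None or d < best[0]:
                best = (d, slope, thr)
    return best

# Made-up local SNR estimates and an a priori mask built with a 3 dB criterion.
snr = [-10.0, -5.0, 0.0, 2.0, 4.0, 8.0, 15.0]
apriori = [1.0 if s > 3.0 else 0.0 for s in snr]
distance, slope, threshold = tune_sigmoid(snr, apriori, [0.1, 1.0, 5.0], [0.0, 3.0, 6.0])
```

In practice the search would pool points from many utterances at the chosen noise levels, and a gradient-based optimiser could replace the grid.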