Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation

M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, A. Ogawa, T. Hori, S. Watanabe, M. Fujimoto, T. Yoshioka, T. Oba, Y. Kubo, M. Souden, S. Hahm, A. Nakamura
Motivation of our system

Speech enhancement
- Deal with highly non-stationary noise, using all available information about speech/noise: spatial, spectral and temporal
- Realized using two complementary enhancement processes

Recognition
- Interconnection of the speech enhancement and the recognizer using dynamic acoustic model adaptation
- Use of state-of-the-art ASR technologies (discriminative training, system combination, ...)

Average accuracy improves from 69% to 91.7%.
Approaches for noise-robust ASR
(compared along: information used, handling of highly non-stationary noise, interconnection with ASR)

- Acoustic model compensation (e.g., VTS): spectral information
- Speech enhancement (e.g., BSS): spatial / spectral / temporal information
- Proposed: (introduced on the following slides)
System overview

Speech enhancement (speech-noise separation: spatial & spectral; example-based enhancement: spectral & temporal) -> ASR word decoding (acoustic model, language model)
- Uses spatial, spectral and temporal information
- Enables removal of highly non-stationary noise
System overview

Dynamic model adaptation connects the speech enhancement front-end to the recognizer's acoustic model
- Good interconnection with the recognizer
Approaches for noise-robust ASR
- Acoustic model compensation (e.g., VTS): spectral information
- Speech enhancement (e.g., BSS): spatial / spectral / temporal information
- Proposed: spatial, spectral & temporal information, handling highly non-stationary noise with a good interconnection with ASR
[System overview diagram repeated; focus: speech-noise separation]
Speech-noise separation [Nakatani, 2011]

Integrates spatial-based and spectral-based separation in a single framework
- Spatial separation: speech/noise spatial models over location features, under a sparseness assumption [Yilmaz, 2004]
- Spectral separation: speech/noise spectral models over spectral features, under the log-max assumption [Roweis, 2003]
- L_k: dominant source index, i.e., indicates whether speech or noise is more dominant at each frequency k
Speech-noise separation [Nakatani, 2011] (cont.)
- The spatial and spectral separators are combined through the shared dominant source index L_k
Speech-noise separation [Nakatani, 2011]

DOLPHIN: dominance-based locational and power-spectral characteristics integration
- Estimates the speech spectral component sequence using an EM algorithm
- Speech components obtained by MMSE estimation from the spatial and spectral models
- Efficiently integrates spatial and spectral information to remove non-stationary noise
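To make the dominance idea concrete, here is a rough sketch, not the actual DOLPHIN EM algorithm: for each frequency bin it combines a spatial likelihood over a location feature with a spectral likelihood over a log-power feature to obtain the posterior that speech is the dominant source. The function names and Gaussian (mean, variance) parameters are invented for the illustration.

    # Rough illustration of dominance-based integration, not the actual
    # DOLPHIN EM algorithm. All model parameters are invented examples.
    import numpy as np

    def gauss_logpdf(x, mean, var):
        # Log density of a univariate Gaussian.
        return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

    def speech_dominance_posterior(loc_feat, logpow_feat,
                                   loc_speech, loc_noise,   # (mean, var) pairs
                                   pow_speech, pow_noise):  # (mean, var) pairs
        # Spatial cue (sparseness assumption: one source dominates per bin)
        # and spectral cue (log-max assumption: the observed log power is
        # approximately the max of the speech and noise log powers).
        log_s = (gauss_logpdf(loc_feat, *loc_speech)
                 + gauss_logpdf(logpow_feat, *pow_speech))
        log_n = (gauss_logpdf(loc_feat, *loc_noise)
                 + gauss_logpdf(logpow_feat, *pow_noise))
        return 1.0 / (1.0 + np.exp(log_n - log_s))  # P(L_k = speech)

In DOLPHIN itself, this dominance information is estimated jointly with the model parameters inside an EM loop, and the speech spectral components are then obtained by MMSE estimation rather than a one-shot mask.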
[System overview diagram repeated; focus: example-based enhancement]
Example-based enhancement [Kinoshita, 2011]

Use a parallel corpus model (clean and processed speech) that represents the fine spectral and temporal structure of speech
- Train a GMM from multi-condition training data processed with DOLPHIN
- Generate the corpus model: GMM component sequences of the processed speech (training data) paired with the corresponding clean speech examples
Example-based enhancement [Kinoshita, 2011]
- Look for the longest example segments in the corpus model that match the test utterance (best-example search over GMM component sequences; see the sketch below)
- Use the corresponding clean speech examples for Wiener filtering
- Using a precise model of the temporal structure of speech removes the remaining highly non-stationary noise and recovers speech precisely
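The sketch below illustrates the search-and-filter step under simplifying assumptions: utterances are encoded as sequences of GMM component indices, spectrograms are power spectra, and a greedy longest-common-substring search stands in for the paper's best-example search. All names are illustrative.

    # Simplified sketch of best-example search + Wiener filtering,
    # not the exact algorithm of [Kinoshita, 2011].
    import numpy as np

    def longest_match(test_seq, corpus_seq, pos):
        # Length and start of the longest corpus segment matching
        # test_seq from position `pos` onwards.
        best_len, best_start = 0, 0
        for s in range(len(corpus_seq)):
            l = 0
            while (pos + l < len(test_seq) and s + l < len(corpus_seq)
                   and test_seq[pos + l] == corpus_seq[s + l]):
                l += 1
            if l > best_len:
                best_len, best_start = l, s
        return best_len, best_start

    def enhance(noisy_spec, test_seq, corpus_seq, clean_spec, eps=1e-10):
        # Greedily cover the test utterance with the longest matching
        # corpus segments; the aligned clean-speech examples define a
        # Wiener gain applied to the noisy power spectrogram.
        out = np.zeros_like(noisy_spec)
        t = 0
        while t < len(test_seq):
            l, s = longest_match(test_seq, corpus_seq, t)
            if l == 0:
                out[t] = noisy_spec[t]       # no match: pass frame through
                t += 1
                continue
            ref = clean_spec[s:s + l]        # aligned clean example segment
            noise_est = np.maximum(noisy_spec[t:t + l] - ref, eps)
            gain = ref / (ref + noise_est)   # Wiener gain S / (S + N)
            out[t:t + l] = gain * noisy_spec[t:t + l]
            t += l
        return out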
[System overview diagram repeated; focus: dynamic model adaptation]
Dynamic model adaptation [Delcroix, 2009]

Compensates the mismatch between the enhanced speech and the acoustic model
- With non-stationary noise and frame-by-frame processing, the mismatch changes frame by frame (it is dynamic)
- Conventional acoustic model compensation techniques (e.g., MLLR) are not sufficient

Dynamic variance compensation (uncertainty decoding) [Deng, 2005]
- Mitigates the mismatch frame by frame by considering the feature variance:

  p(y_t \mid n) = \sum_m p(m) \, \mathcal{N}(y_t ; \mu_{n,m}, \sigma^2_{n,m} + \sigma^2_t)

  where y_t is the enhanced speech feature at frame t, n and m index the HMM state and mixture component, and \sigma^2_t is the dynamic feature variance.
Dynamic feature variance model

Speech -> Enhancement -> Recognizer, with u_t the observed feature and y_t the enhanced feature:

  \sigma^2_t = \alpha (u_t - y_t)^2   (for each feature dimension)

Assumption: the feature variance (feature uncertainty) is proportional to the amount of noise reduction, and hence to the amount of noise
- The more we process the signal, the more uncertainty we introduce
Dynamic feature variance model

  \sigma^2_t = \alpha (u_t - y_t)^2   (for each feature dimension)

- \alpha is optimized for recognition with an ML criterion using adaptation data (dynamic variance adaptation, DVA)
- Can be combined with MLLR for static adaptation of the acoustic model mean parameters
- Gives a good interconnection with the recognizer
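As a minimal sketch of the resulting decoding-time likelihood, assuming diagonal-covariance GMM state models (the shapes and names below are illustrative):

    # Sketch of uncertainty decoding with the dynamic feature variance
    # sigma_t^2 = alpha * (u_t - y_t)^2, per feature dimension.
    import numpy as np

    def log_likelihood_dva(y_t, u_t, alpha, weights, means, variances):
        # log p(y_t | state) = log sum_m w_m N(y_t; mu_m, var_m + sigma_t^2),
        # i.e. the model variance is inflated by the frame-dependent
        # uncertainty. Shapes: y_t, u_t, alpha: (D,); weights: (M,);
        # means, variances: (M, D).
        sigma2_t = alpha * (u_t - y_t) ** 2           # dynamic feature variance
        var = variances + sigma2_t                    # broadcast over mixtures
        log_comp = (np.log(weights)
                    - 0.5 * np.sum(np.log(2.0 * np.pi * var)
                                   + (y_t - means) ** 2 / var, axis=1))
        return np.logaddexp.reduce(log_comp)

In DVA, \alpha itself is re-estimated on adaptation data under the ML criterion; the sketch only evaluates the likelihood for a given \alpha.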
System overview (detailed)

DOLPHIN (spatial model + spectral model -> MMSE estimation of the speech spectral components) -> example-based enhancement (corpus model -> best-example searching -> Wiener filtering) -> dynamic model adaptation of the AM -> ASR word decoding
Multi-condition / discriminative training
- Multi-condition training: add background noise samples to the clean training data (see the sketch below)
- dMMI: differenced maximum mutual information [McDermott, 2010], used for discriminative training of the acoustic model
- Applied to both the clean and multi-condition acoustic models
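The multi-condition data generation can be pictured with a standard noise-mixing recipe; this is a generic sketch, not the exact corpus construction used here:

    # Generic sketch: mix background noise into a clean training sample
    # at a target SNR (dB).
    import numpy as np

    def add_noise(clean, noise, snr_db):
        # Tile/trim the noise to match the clean signal length.
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)[:len(clean)]
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2)
        # Scale the noise so that 10*log10(p_clean / p_noise_scaled) = snr_db.
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise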
System combination [Evermann, 2000]
- Run multiple complete systems and combine their decoding outputs (lattices) into the final word hypothesis (see the sketch below)
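As a toy picture of system combination, the sketch below does majority voting over already-aligned word hypotheses; the actual system combines lattices via confusion networks as in [Evermann, 2000], which also handles the alignment itself:

    # Toy voting-based combination over pre-aligned hypotheses.
    from collections import Counter

    def combine(aligned_hyps):
        # aligned_hyps: equal-length word sequences, one per system,
        # with '<eps>' marking deletions in the alignment.
        out = []
        for words in zip(*aligned_hyps):
            word, _ = Counter(words).most_common(1)[0]
            if word != '<eps>':
                out.append(word)
        return out

    # combine([['the', 'cat', '<eps>'],
    #          ['the', 'cat', 'sat'],
    #          ['a',   'cat', 'sat']])  ->  ['the', 'cat', 'sat']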
Settings - Enhancement

DOLPHIN
- Spatial model: 4 mixture components
- Spectral model: 256 mixture components, speaker-dependent model
- Models trained in advance using the noise/speech training data
- Long windows (100 ms) to capture reverberation

Example-based
- Corpus model: GMM w/ 4096 mixture components
- Trained on DOLPHIN-processed speech
- Features: 60th-order MFCC w/ log energy
Settings - Recognition

Recognizer: SOLON [Hori, 2007]

Acoustic model: trained with SOLON (ML & discriminative (dMMI))
- Clean: HMM w/ 254 states (including a silence state); each HMM state modeled by a GMM with 7 components
- Multi-condition: 20 components per HMM state; no silence model

Multi-condition data: background noise samples added to the clean training data; 7 noise environments x 6 SNR conditions

Adaptation: unsupervised, speaker-dependent; uses all test data for a given speaker
Development set results (word accuracy, %)

           -6 dB   -3 dB   0 dB    3 dB    6 dB    9 dB    Mean
Baseline   49.75   52.58   64.25   75.08   84.25   90.58   69.42
Proposed   84.33   88.58   90.17   92.33   94.50   95.00   90.82

[Bar chart: accuracy from the clean ML baseline (69.4%) up to the full system with system combination (90.8%); relative improvements per component: multi-condition training 51%, DOLPHIN 45%, SOLON vs. HTK baseline 21%, adaptation 20%, example-based enhancement 18%, dMMI 10%, system combination 7%]
Evaluation set results (word accuracy, %)

           -6 dB   -3 dB   0 dB    3 dB    6 dB    9 dB    Mean
Baseline   45.67   52.67   65.25   75.42   83.33   91.67   69.00
Proposed   85.58   88.33   92.33   93.67   94.17   95.83   91.65

[Bar chart: accuracy from the clean dMMI baseline (69.0%) and multi-condition dMMI baseline (84.6%) up to the full system with system combination (91.7%), with intermediate DOLPHIN, example-based enhancement and adaptation stages between 84.7% and 91.1%]