An Efferent-inspired Auditory Model Front-end for Speech Recognition
Chia-ying Lee, James Glass and Oded Ghitza*
MIT Computer Science and Artificial Intelligence Lab, Cambridge, MA, USA
*Boston University Hearing Research Lab, Boston, MA, USA
Motivation
• Humans vs. Automatic Speech Recognizers (ASRs)
  - Humans are particularly good at coping with previously unseen or dynamic noise
• Mounting evidence of the role of efferent feedback in mammalian auditory systems
  - The operating point of the cochlea is regulated by background noise
  - Results in stable internal representations
• Explore the potential use of a feedback mechanism for ASR
  - Use a medial olivocochlear (MOC) efferent-inspired auditory model as an ASR front-end
An Efferent-inspired Auditory Model
• Messing et al., 2009
[Block diagram: Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window, with efferent feedback gain G]
Model of Ascending Pathway
[Block diagram: Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window]
• Middle Ear
  - Modeled by a high-pass filter
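The slides do not give the filter design. As a minimal sketch, assuming a Butterworth high-pass; the cutoff frequency and order are illustrative assumptions, not values from the paper:

    from scipy.signal import butter, lfilter

    def middle_ear_highpass(x, fs, cutoff_hz=1000.0, order=1):
        # Butterworth high-pass as a stand-in for the middle-ear stage;
        # cutoff_hz and order are illustrative assumptions.
        b, a = butter(order, cutoff_hz / (fs / 2.0), btype="high")
        return lfilter(b, a, x)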
Model of Ascending Pathway
[Block diagram: Middle Ear → Non-linear Cochlea → Inner Hair Cell → Dynamic Range Window]
• J. Goldstein, 1990
• Multiple Band-Pass Non-Linear (MBPNL) model
MBPNL Model
• Models cochlear nonlinearity
• Example for center frequency = 1820 Hz
  - Filter characteristics change instantaneously as a function of input signal strength
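The MBPNL model itself is a two-path nonlinear filter structure (Goldstein, 1990); the sketch below is not that model, only a minimal illustration of the property named above: an instantaneous, level-dependent gain.

    import numpy as np

    def instantaneous_compression(x, exponent=0.3):
        # Memoryless compressive nonlinearity: effective gain falls as
        # input level rises. Illustration only; the real MBPNL is a
        # two-path filter structure, not this simple power law.
        return np.sign(x) * np.abs(x) ** exponent

    # Effective gain for a weak vs. a strong input sample:
    for level in (0.01, 1.0):
        print(level, instantaneous_compression(level) / level)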
Model of Ascending Pathway
[Block diagram: Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window]
• Inner Hair Cell
  - Generic MIT model
  - A half-wave rectifier followed by a low-pass filter
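A minimal sketch of that rectify-then-smooth structure; the cutoff frequency is an illustrative assumption, not the MIT model's actual parameter:

    import numpy as np
    from scipy.signal import butter, lfilter

    def inner_hair_cell(x, fs, cutoff_hz=1000.0):
        # Half-wave rectification followed by a low-pass filter, the
        # basic structure named on the slide; cutoff_hz is an assumption.
        rectified = np.maximum(x, 0.0)
        b, a = butter(2, cutoff_hz / (fs / 2.0), btype="low")
        return lfilter(b, a, rectified)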
Model of Ascending Pathway
[Block diagram: Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window]
• Dynamic Range Window (DRW)
  - A hard limiter with upper and lower bounds, representing the dynamic range of auditory-nerve firing
Dynamic Range Window
[Figure: DRW input/output curve, flat below the lower bound and above the upper bound]
• No firing for signals below the lower bound
• Saturation in firing rate for signals above the upper bound
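As a sketch, the DRW is a clipping operation; the bound values here are placeholders, not the paper's settings:

    import numpy as np

    def dynamic_range_window(x, lower=0.05, upper=1.0):
        # Hard limiter with lower and upper bounds (placeholder values):
        # inputs below the lower bound are floored (no firing), inputs
        # above the upper bound saturate.
        return np.clip(x, lower, upper)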
An Efferent-inspired Auditory Model
[Block diagram: n(t) → Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window, with feedback gain G]
• G is adjusted based on the background noise such that the output of the DRW is at the "epsilon level" (a sketch of the adaptation loop follows below)
  - G impacts the filter response in the MBPNL cochlear model
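A minimal sketch of the closed-loop idea, assuming a hypothetical open_loop_model(signal, gain_db) handle to the ascending-pathway model; the bisection search, and its assumption that mean DRW output grows monotonically with G, are illustrative, not the paper's actual adaptation rule:

    import numpy as np

    def adapt_gain(noise, open_loop_model, epsilon,
                   g_lo=-40.0, g_hi=40.0, iters=20):
        # Tune the gain G (in dB) on noise alone until the mean DRW
        # output sits at the target "epsilon level".
        for _ in range(iters):
            g = 0.5 * (g_lo + g_hi)
            if np.mean(open_loop_model(noise, gain_db=g)) > epsilon:
                g_hi = g   # output above epsilon: lower the gain
            else:
                g_lo = g   # output below epsilon: raise the gain
        return 0.5 * (g_lo + g_hi)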
An Efferent-inspired Auditory Model
[Block diagram: s(t) + n(t) → Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window, with gain G fixed at its adapted value]
• The noisy speech signal is processed by the tuned auditory model.
Definitions
• Open-loop model
  - The model of the ascending pathway
[Block diagram: Middle Ear → Non-linear Cochlea → Inner Hair Cell → Dynamic Range Window]
Definitions
• Closed-loop model
  - The ascending-pathway model with the efferent-inspired feedback
[Block diagram: Middle Ear → Non-linear Cochlea → Inner Hair Cell → Dynamic Range Window, with feedback gain G]
Visual Illustration
[Figure: rows show speech in different types of noise at 10 dB SNR, comparing the short-time Fourier transform with the closed-loop model output]
A Closed-loop Front-end for ASR
[Block diagram: s(t) + n(t) → Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window, with feedback gain G]
• Need to extract features that speech recognizers can process
A Closed-loop Front-end for ASR
[Block diagram: s(t) + n(t) → closed-loop auditory model (gain G) → R(n) → Framing → Log → DCT, with the DC offset removed]
• The feature generation method follows the standard MFCC extraction process.
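A minimal sketch of that MFCC-style post-processing, assuming the model output arrives as a nonnegative array R of shape (channels, samples); the frame sizes and number of retained coefficients are illustrative assumptions, not the paper's settings:

    import numpy as np
    from scipy.fftpack import dct

    def features_from_model(R, frame_len=400, hop=160, n_coeffs=13):
        # Frame the model output, take the log, apply a DCT, and drop
        # the DC (0th) coefficient, mirroring MFCC extraction.
        feats = []
        for start in range(0, R.shape[1] - frame_len + 1, hop):
            frame_energy = R[:, start:start + frame_len].mean(axis=1)
            cepstrum = dct(np.log(frame_energy + 1e-10),
                           type=2, norm="ortho")
            feats.append(cepstrum[1:n_coeffs + 1])  # discard DC term
        return np.array(feats)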
Experimental Setup
• Corpus creation (noisy speech data synthesis)
• Feature extraction methods
• Recognizer training and testing
• Experimental results
Corpus Creation
• Noise signals
  - Stationary noise: speech-shaped, white, pink
  - Non-stationary Aurora2 noise: train, subway
• Speech signals
  - Aurora2 digits (TIDigits)
• Noisy speech synthesis (see the sketch below)
  - Noise signals are fixed at 70 dB SPL
  - Speech signals are scaled to create SNRs from 5 to 20 dB
  - A 300 ms noise-only segment precedes the speech, allowing the feedback loop to adapt
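A minimal sketch of this synthesis, scaling the speech against a fixed-level noise to hit the target SNR; the SPL calibration itself is outside this sketch:

    import numpy as np

    def mix_at_snr(speech, noise, snr_db, fs, lead_ms=300):
        # Noise level stays fixed; speech is scaled to reach snr_db.
        # A noise-only lead-in of lead_ms precedes the speech.
        # Assumes noise has at least lead + len(speech) samples.
        lead = int(fs * lead_ms / 1000)
        scale = np.sqrt(np.mean(noise ** 2) / np.mean(speech ** 2)
                        * 10.0 ** (snr_db / 10.0))
        mix = noise[:lead + len(speech)].astype(float).copy()
        mix[lead:] += scale * speech
        return mix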
Feature Extraction Methods
• Three feature extraction methods
  - MFCC baseline with conventional normalization method
  - The open-loop auditory model (details in the paper)
  - The closed-loop auditory model
Recognizer Training and Testing
• The standard Aurora2 HMM-based recognizer was used
• Jackknifing experiments with mismatched training and test conditions
  - Five noise types (N1 to N5), each at 20, 15, 10, and 5 dB SNR
  - 6672 training utterances and 4004 test utterances
  - Training and test sets use different noise types
Experimental Results

    Accuracy (%)   MFCC Baseline   Closed-loop model
    Average        86              92
    STD            8.6             4.7

• The closed-loop model reduces the error rate by 43% relative to the MFCC baseline (from 14% to 8%) and reduces the standard deviation across mismatched conditions by 45%.
Experimental Results

    MFCC baseline, Acc (%)
    SNR (dB)   speech-shaped   White   Train   Pink   Subway
    20         95              92      91      88     94
    15         94              90      89      84     93
    10         91              85      85      76     92
    5          81              73      76      62     84
    Avg        90              85      85      77     91

    Closed-loop model, Acc (%)
    SNR (dB)   speech-shaped   White   Train   Pink   Subway
    20         96              94      95      93     96
    15         96              93      96      92     95
    10         94              91      95      89     93
    5          83              83      91      78     84
    Avg        92              90      94      88     92

• The closed-loop model performed better than the baseline across all mismatched training and test conditions.
Conclusions
• Key ideas
  - Efferent-inspired feedback regulates the operating point of the front-end
  - This yields a stable representation, a desired property for ASR
• Experimental validation
  - Digit recognition in noise under mismatched conditions, with multiple noise types and SNRs
  - The closed-loop model outperformed the baseline across all mismatched training and test conditions
  - The results indicate that incorporating feedback in the front-end is a promising way to generate robust speech features