
Speech Separation for Recognition and Enhancement Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia University, NY and International Computer Science Institute, Berkeley CA


  1. Title: Speech Separation for Recognition and Enhancement (Dan Ellis)
  dpwe@ee.columbia.edu / http://labrosa.ee.columbia.edu/
  Outline: 1. Speech in the Wild • 2. Separation by Space • 3. Separation by Pitch • 4. Separation by Model
  Speech Separation - Dan Ellis, 2011-10-27, slide 1 of 18

  2. 1. Speech in the Wild
  • The world is cluttered: sound is transparent, so mixtures are inevitable
  • Useful information is structured by 'sources'; specific definition of a 'source': intentional independence

  3. Speech in the Wild: Examples
  • Multi-party discussions
  • Ambient recordings
  • Applications: communications, robots, lifelogging/archives

  4. Recognizing Speech in the Wild
  • Current ASR relies on low-D representations, e.g. 13-dimensional MFCC features every 10 ms
  • very successful for clean speech!
  • inadequate for mixtures
  [Figure: ICSI Meeting Room excerpt: spectrogram (freq / kHz vs. time / s, level / dB) and MFCC-based resynthesis]
  • We need separation!
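The 13-dimensional-MFCC-every-10-ms front end this slide refers to can be sketched in plain NumPy. This is a minimal illustration, not the deck's exact pipeline; the frame sizes, mel-band count, and the synthetic sine input are all illustrative assumptions.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=400, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame, window, FFT, mel filterbank, log, DCT-II."""
    # Frame the signal with a Hamming window (25 ms frames, 10 ms hop at 16 kHz)
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    # Per-frame power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    # Triangular mel filterbank
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel2hz(np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel bands; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T

sig = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s toy input
feats = mfcc(sig)
print(feats.shape)  # (98, 13): one 13-dim vector every 10 ms
```

The point of the slide is that this projection is so lossy that overlapping talkers cannot be recovered from it, which is what motivates separating first.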

  5. 2. Speech Separation
  • How can we separate speech information?
  • Pipeline: Noisy Speech → Analyze → Select / Enhance → Application (Cleaned Speech / features)
  • Analyze by: spatial cues, pitch, speech probabilities, ...
  • Select / Enhance: T-F masking, Wiener filtering, reconstruction, ...
  • Applications: recognition, listening, ...
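The T-F masking and Wiener filtering options in the Select / Enhance box can be sketched as follows, assuming oracle access to the clean speech and noise spectrograms (a real system must estimate these; the signals here are synthetic stand-ins):

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Simple short-time Fourier transform over Hann-windowed frames."""
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.fft.rfft(x[idx] * np.hanning(n_fft), axis=1)

# Hypothetical clean/noise signals for illustration only
sr = 8000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 440 * t)
noise = 0.5 * np.random.default_rng(0).standard_normal(sr)
mix = speech + noise

S, N, X = stft(speech), stft(noise), stft(mix)
# Binary T-F mask: keep only cells where speech dominates
binary_mask = (np.abs(S) > np.abs(N)).astype(float)
# Wiener gain: speech power over total power, a soft 0..1 gain per cell
wiener = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-10)
enhanced = X * wiener  # apply to the mixture spectrogram
```

Either masked spectrogram can then be inverted for listening or fed to a recognizer as features, matching the two application branches on the slide.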

  6. Separation by Spatial Info
  • Given multiple microphones, sound carries spatial information about the source
  • E.g. model the interaural spectrum of each source as stationary level and time differences
  • e.g. at 75°, in reverb: [Figure: IPD, IPD residual, ILD]
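The interaural level and time differences can be illustrated on a toy two-channel signal; the 5-sample delay and 0.6 gain below are invented stand-ins for a single off-axis source, not values from the slide:

```python
import numpy as np

# Right channel = delayed, attenuated copy of the left (one anechoic source)
rng = np.random.default_rng(1)
sr, delay, gain = 16000, 5, 0.6          # assumed ITD (samples) and level ratio
src = rng.standard_normal(sr)
left = src
right = gain * np.concatenate([np.zeros(delay), src[:-delay]])

n_fft, hop = 512, 256
n_frames = 1 + (sr - n_fft) // hop
idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
L = np.fft.rfft(left[idx] * np.hanning(n_fft), axis=1)
R = np.fft.rfft(right[idx] * np.hanning(n_fft), axis=1)

ratio = R / (L + 1e-12)                       # interaural spectrum per T-F cell
ild = 20 * np.log10(np.abs(ratio) + 1e-12)    # interaural level difference (dB)
ipd = np.angle(ratio)                         # interaural phase difference (rad)
print(np.median(ild))  # near 20*log10(0.6), i.e. about -4.4 dB
```

For a single source these observations cluster tightly around the source's ILD/IPD; in mixtures and reverberation they scatter, which is what the model-based clustering on the next slide exploits.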

  7. Model-Based EM Source Separation and Localization (MESSL) (Mandel et al. '10)
  • Iterate: re-estimate source parameters ↔ assign spectrogram points to sources
  • can model more sources than sensors
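A heavily simplified, hypothetical version of MESSL's EM loop: collapse the model to a one-dimensional Gaussian mixture over per-cell IPD values, alternating the E-step (assign spectrogram points to sources) and M-step (re-estimate source parameters). The real algorithm is frequency-dependent and jointly models ILD; the IPD samples here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic IPD observations from two sources at different angles
ipd = np.concatenate([rng.normal(-1.0, 0.3, 3000),   # source 1 cells
                      rng.normal(1.2, 0.3, 2000)])   # source 2 cells

mu = np.array([-0.5, 0.5])       # initial per-source IPD means
sigma = np.array([1.0, 1.0])
prior = np.array([0.5, 0.5])
for _ in range(30):
    # E-step: posterior probability that each T-F cell belongs to each source
    ll = -0.5 * ((ipd[:, None] - mu) / sigma) ** 2 - np.log(sigma)
    post = prior * np.exp(ll)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate per-source mean, spread, and weight
    nk = post.sum(axis=0)
    mu = (post * ipd[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((post * (ipd[:, None] - mu) ** 2).sum(axis=0) / nk)
    prior = nk / nk.sum()

mask = post[:, 0] > 0.5   # posteriors thresholded into a T-F mask for source 1
print(np.sort(mu))        # recovered means near the true -1.0 and 1.2
```

Because the sources are distinguished by model parameters rather than by a one-channel-per-source inversion, nothing limits the number of mixture components to the number of sensors, which is the "more sources than sensors" point on the slide.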

  8. MESSL Results
  • Modeling uncertainty improves results: tradeoff between constraints & noisiness
  [Figure: algorithmic masks with output SDRs: 12.35 dB, 9.12 dB, 8.77 dB, 2.45 dB, 0.22 dB, -2.72 dB]
  • Helps with recognition
  [Figure: digits accuracy (%) vs. target-to-masker ratio (dB) for Human, Sawada, Mouba, MESSL-G, MESSL-ΩΩ, DUET, and unprocessed Mixes]

  9. 3. Separation by Pitch
  • Voiced syllables have near-periodic "pitch": perceptually salient, but lost in MFCCs (Brungart et al. '01)
  • Can we track pitch & use it for separation? ... and other speech tasks?

  10. Noise-Robust Pitch Tracking (Lee & Ellis '12)
  • Important for voice detection & separation
  • Based on channel selection (Wu, Wang & Brown '03): pitch from summary autocorrelation over "good" bands
  • a trained classifier decides which channels to include
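The autocorrelation core of this idea can be sketched as follows. The trained channel-selection classifier is omitted, so this is plain autocorrelation pitch estimation on one synthetic wideband frame; the 200 Hz input is invented for illustration.

```python
import numpy as np

sr, f0 = 16000, 200.0
t = np.arange(int(0.04 * sr)) / sr                # one 40 ms analysis frame
# Harmonic complex standing in for a voiced speech frame
frame = sum(np.sin(2 * np.pi * h * f0 * t) for h in (1, 2, 3))

# Autocorrelation; the lag of its peak is the pitch period
ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
lo, hi = int(sr / 400), int(sr / 60)              # search a 60-400 Hz pitch range
lag = lo + np.argmax(ac[lo:hi])
print(sr / lag)  # 200.0 Hz
```

In the channel-selection scheme, this autocorrelation is computed per filterbank channel and summed only over channels the classifier judges reliable, which is what makes the tracker robust to noise.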

  11. Noise-Robust Pitch Tracking
  • Channel-based classifiers learn domain channel/noise characteristics
  • then separate, or derive features for recognition
  [Figure: 08_rbf_pinknoise5dB example: spectrogram (freq / kHz, level / dB), selected channels, and summary autocorrelation (period / ms) with ground truth (CS-GT) vs. time / sec]
  • Only works for pitched sounds: need a broader description of the speech source...

  12. 4. Separation by Models (Varga & Moore '90; Hershey et al. '10)
  • If ASR is finding best-fit parameters, argmax_W P(W | X) ...
  • Recognize mixtures with a Factorial HMM: a model + state sequence for each voice/source
  • exploit sequence constraints and speaker differences
  [Figure: model 1 and model 2 state sequences vs. observations / time]
  • separation relies on a detailed speaker model
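A toy illustration of the factorial idea: search over the joint states of two hypothetical template "sources", combining their log-spectra with an element-wise max (the max-approximation commonly used in factorial-model ASR). The templates and the observation below are invented; sequence constraints (the HMM transitions) are omitted for brevity.

```python
import numpy as np
from itertools import product

# Each "source" has two 4-band log-spectrum states (made-up templates)
states = {
    "src1": np.array([[0, 8, 0, 0], [8, 0, 0, 0]], float),
    "src2": np.array([[0, 0, 8, 0], [0, 0, 0, 8]], float),
}
obs = np.array([8, 0, 0, 8], float)   # one mixture frame to explain

# Brute-force search over all joint state pairs for the best explanation
best = min(
    product(range(2), range(2)),
    key=lambda ij: np.sum((np.maximum(states["src1"][ij[0]],
                                      states["src2"][ij[1]]) - obs) ** 2),
)
print(best)  # -> (1, 1): src1's second state plus src2's second state
```

Because the joint state identifies what each source was doing in every frame, the same search both recognizes and separates; adding per-source transition probabilities turns this frame-wise search into Viterbi decoding over the product state space.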

  13. Eigenvoices (Kuhn et al. '98, '00; Weiss & Ellis '10)
  • Idea: a speaker-model parameter space; can we generalize without losing detail?
  • Speaker models lie in a subspace spanned by eigenvoice bases
  • Eigenvoice model: adapted model = mean voice + eigenvoice bases × weights + channel bases × channel weights, i.e. x = µ + U w + B h
  • an 89,600-dimensional space
  [Figure: mean voice and first three eigenvoice dimensions as spectrograms (frequency / kHz) over phone labels b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa ay ah ao ow uw ax]
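The eigenvoice equation x = µ + U w (channel term B h omitted) can be sketched at toy scale. All dimensions and data here are made up; the slide's real space is ~89,600-dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 100, 3                      # model dimension, number of eigenvoices
mu = rng.standard_normal(D)        # mean voice
U, _ = np.linalg.qr(rng.standard_normal((D, K)))  # orthonormal eigenvoice bases

# A target speaker's model is the mean voice plus a low-dim combination of bases
w_true = np.array([2.0, -1.0, 0.5])
observed = mu + U @ w_true         # stand-in for adaptation data

# Adapting = projecting onto the eigenvoice subspace to find the K weights
w_hat = U.T @ (observed - mu)
speaker_model = mu + U @ w_hat
print(np.allclose(w_hat, w_true))  # True
```

Only K weights must be estimated per speaker rather than all D parameters, which is how the model generalizes to unseen speakers without giving up the detail stored in the bases.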

  14. Eigenvoice Speech Separation

  15. Eigenvoice Speech Separation
  • Eigenvoices for the Speech Separation task
  • speaker-adapted (SA) performs midway between speaker-dependent (SD) & speaker-independent (SI)
  [Examples: Mix, SI, SA, SD]

  16. Spatial + Model Separation (Weiss, Mandel & Ellis '11)
  • MESSL + Eigenvoice "priors"

  17. Summary
  • Speech in the Wild ... a real, challenging problem ... applications in communications, lifelogs ...
  • Speech Separation ... by generic properties (location, pitch) ... via statistical models
  • Recognition and Enhancement ... separate-then-X, or an integrated solution?

  18. References
  • John Hershey, Steve Rennie, Peder Olsen, Trausti Kristjansson, "Super-human multi-talker speech recognition: A graphical modeling approach," Computer Speech & Language 24(1): 45-66, 2010.
  • Jon Barker, Martin Cooke, Dan Ellis, "Decoding Speech in the Presence of Other Sources," Speech Communication 45(1): 5-25, 2005.
  • R. Kuhn, J. Junqua, P. Nguyen, N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Speech & Audio Processing 8(6): 695-707, Nov 2000.
  • Byung-Suk Lee & Dan Ellis, "Noise-robust pitch tracking by trained channel selection," submitted to ICASSP, 2012.
  • Michael Mandel, Ron Weiss, Dan Ellis, "Model-Based Expectation-Maximization Source Separation and Localization," IEEE Trans. Audio, Speech, Language Processing 18(2): 382-394, Feb 2010.
  • A. Varga and R. Moore, "Hidden Markov model decomposition of speech and noise," ICASSP-90, 845-848, 1990.
  • Ron Weiss & Dan Ellis, "Speech separation using speaker-adapted eigenvoice speech models," Computer Speech & Language 24(1): 16-29, 2010.
  • Ron Weiss, Michael Mandel, Dan Ellis, "Combining localization cues and source model constraints for binaural source separation," Speech Communication 53(5): 606-621, May 2011.
  • Mingyang Wu, DeLiang Wang, Guy Brown, "A multipitch tracking algorithm for noisy speech," IEEE Trans. Speech & Audio Processing 11(3): 229-241, May 2003.
