EE E6820: Speech & Audio Processing & Recognition
Lecture 5: Speech modeling and synthesis

1 Modeling speech signals
2 Spectral and cepstral models
3 Linear Predictive models (LPC)
4 Other signal models
5 Speech synthesis

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/

E6820 SAPR - Dan Ellis - L05: Speech models - 2002-02-25
1. The speech signal

• Speech sounds in the spectrogram
  [spectrogram with aligned phonetic transcription: "has a watch thin as a dime"]
• Elements of the speech signal:
  - spectral resonances (formants, moving)
  - periodic excitation (voicing, pitched) + pitch contour
  - noise excitation (fricatives, unvoiced, no pitch)
  - transients (stop-release bursts)
  - amplitude modulation (nasals, approximants)
  - timing!
The source-filter model

• Notional separation of:
  source: excitation, fine time-frequency structure
  & filter: resonance, broad spectral structure
  [diagram: voiced/unvoiced switch selects glottal pulse train (pitch) or
   frication noise as the source; vocal tract resonances (formants) +
   radiation characteristic form the filter, producing speech]
• More a modeling approach than a model
Signal modeling

• Signal models are a kind of representation
  - to make some aspect explicit
  - for efficiency
  - for flexibility
• Nature of model depends on goal
  - classification: remove irrelevant details
  - coding/transmission: remove perceptual irrelevance
  - modification: isolate control parameters
• But commonalities emerge
  - perceptually irrelevant detail (coding) will also be irrelevant for classification
  - modification domain will usually reflect 'independent' perceptual attributes
  - getting at the abstract information in the signal
Different influences for signal models

• Receiver:
  - see how the signal is treated by listeners
  → cochlea-style filterbank models
• Transmitter (source):
  - the physical apparatus can generate only a limited range of signals...
  → LPC models of vocal tract resonances
• Making particular aspects explicit:
  - compact, separable resonance correlates → cepstrum
  - modeling prominent features of the narrowband spectrogram → sinusoid models
  - addressing unnaturalness in synthesis → Harmonic+Noise model
Applications of (speech) signal models

• Classification / matching
  Goal: highlight important information
  - speech recognition (lexical content)
  - speaker recognition (identity or class)
  - other signal classification
  - content-based retrieval
• Coding / transmission / storage
  Goal: represent just enough information
  - real-time transmission, e.g. mobile phones
  - archive storage, e.g. voicemail
• Modification / synthesis
  Goal: change certain parts independently
  - speech synthesis / text-to-speech (change the words)
  - speech transformation / disguise (change the speaker)
Outline

1 Modeling speech signals
2 Spectral and cepstral models
  - auditorily-inspired spectra
  - the cepstrum
  - feature correlation
3 Linear predictive models (LPC)
4 Other models
5 Speech synthesis
2. Spectral and cepstral models

• The spectrogram seems like a good representation
  - long history
  - satisfying in use
  - experts can 'read' the speech
• What is the information?
  - intensity in time-frequency cells; typically 5 ms × 200 Hz × 50 dB
  → discarded information:
    - phase
    - fine-scale timing
• The starting point for other representations
The filterbank interpretation of the short-time Fourier transform (STFT)

• Can regard spectrogram rows as coming from separate bandpass filters
  applied to the sound
• Mathematically:

  X[k, n0] = Σ_n x[n] · w[n − n0] · exp(−j 2π k (n − n0) / N)
           = Σ_n x[n] · h_k[n0 − n]

  where  h_k[n] = w[−n] · exp(j 2π k n / N)
  so that  H_k(e^jω) = W(e^j(ω − 2πk/N))

  i.e. row k is the output of a filter whose frequency response is the window
  spectrum W shifted to center frequency ω = 2πk/N
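The two views of the STFT above (a windowed DFT at each frame position, or a bank of modulated bandpass filters) can be checked numerically. A minimal numpy sketch, where the signal, window length, frame position, and bin are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                          # DFT size / window length (arbitrary)
x = rng.standard_normal(64)     # toy signal
w = np.hanning(N)               # analysis window w[0..N-1]

def stft_frame(x, w, n0, k, N):
    # X[k, n0] = sum_n x[n] w[n - n0] exp(-j 2 pi k (n - n0) / N)
    n = np.arange(n0, n0 + N)   # support where w[n - n0] is nonzero
    return np.sum(x[n] * w[n - n0] * np.exp(-2j * np.pi * k * (n - n0) / N))

def filterbank_frame(x, w, n0, k, N):
    # Same value as the output at time n0 of a convolution with
    # h_k[n] = w[-n] exp(j 2 pi k n / N):  X[k, n0] = (x * h_k)[n0]
    m = np.arange(-(N - 1), 1)  # support of w[-n]
    h_k = w[-m] * np.exp(2j * np.pi * k * m / N)
    return np.sum(x[n0 - m] * h_k)
```

The two forms agree for any valid frame position n0 and bin k, which is the content of the identity on the slide.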
Spectral models: which bandpass filters?

• Constant bandwidth? (analog / FFT)
• But: cochlea physiology & critical bandwidths
  → use actual bandpass filters in ear models
  & choose bandwidths by e.g. critical-band estimates
• Auditory frequency scales
  - constant 'Q' (center frequency / bandwidth), mel, Bark...
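The mel scale mentioned above is one such auditory frequency warping. As a sketch, here is one common parameterization (the O'Shaughnessy 2595·log10(1 + f/700) variant; the constants are a conventional choice, not from the slide), plus a helper that spaces filter centers evenly on the mel axis:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's common variant of the mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(f_lo, f_hi, n_filters):
    # filter center frequencies equally spaced on the mel scale:
    # roughly linear below ~1 kHz, logarithmic above
    mels = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters)
    return mel_to_hz(mels)
```

With this formula, 1000 Hz maps to roughly 1000 mel, and center frequencies crowd together at low frequencies, matching the constant-Q trend of the cochlea at high frequencies.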
Gammatone filterbank

• Given bandwidths, which filter shapes?
  - match inferred temporal integration window
  - match inferred spectral shape (sharp high-frequency slope)
  - keep it simple (since it's only approximate)
→ Gammatone filters:

  h[n] = n^(N−1) · exp(−bn) · cos(ω_i n)

  [z-plane pole-zero plot and magnitude responses on a log frequency axis,
   50 Hz - 5 kHz, 0 to -50 dB]
  - 2N poles, 2 zeros, low complexity
  - reasonable linear match to cochlea
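The gammatone impulse response above is easy to sample directly. A sketch, assuming the common Glasberg & Moore ERB bandwidth formula and the 1.019·ERB bandwidth factor (standard choices in the literature, not given on the slide):

```python
import numpy as np

def gammatone_ir(fc, fs, order=4, dur=0.05):
    # h(t) = t^(N-1) exp(-b t) cos(2 pi fc t), sampled at fs
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # Glasberg & Moore ERB (Hz), assumed
    b = 2 * np.pi * 1.019 * erb              # common bandwidth choice (rad/s)
    h = t ** (order - 1) * np.exp(-b * t) * np.cos(2 * np.pi * fc * t)
    return h / np.max(np.abs(h))             # peak-normalize
```

The gamma-function envelope t^(N−1)·exp(−bt) gives the gradual attack and exponential decay that match inferred auditory temporal integration; the cosine term centers the passband at fc.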
Constant-BW vs. cochlea model

• Spectrograms and frequency responses:
  [panels: FFT-based wideband spectrogram (N=128) with its effective FFT
   filterbank response on a linear frequency axis; Q=4 4-pole 2-zero
   cochlea-model (gammatone) filterbank downsampled @ 64 with its response
   on a log frequency axis]
• Magnitude smoothed over 5-20 ms time window
Limitations of spectral models

• Not much data thrown away
  - just fine phase/time structure (smoothing)
  - little actual 'modeling'
  - still a large representation!
• Little separation of features
  - e.g. formants and pitch
• Highly correlated features
  - modifications affect multiple parameters
• But quite easy to reconstruct
  - iterative reconstruction of lost phase
The cepstrum

• Original motivation: assume a source-filter model:
  excitation source g[n] → resonance filter H(e^jω) → speech
• 'Homomorphic deconvolution':
  - source-filter convolution:  g[n] * h[n]
  - FT → product:  G(e^jω) · H(e^jω)
  - log → sum:  log G(e^jω) + log H(e^jω)
  - IFT → separate fine structure:  c_g[n] + c_h[n]  = deconvolution
• Definition (real cepstrum):

  c[n] = idft( log |dft( x[n] )| )
Stages in cepstral deconvolution

• Original waveform has excitation fine structure convolved with resonances
  [waveform and minimum-phase impulse response]
• DFT shows harmonics modulated by resonances
  [abs(dft) and liftered version, 0-3000 Hz]
• Log DFT is sum of harmonic 'comb' and resonant bumps
  [log(abs(dft)) and liftered version, dB scale]
• IDFT separates out resonant bumps (low quefrency) and regular fine
  structure ('pitch pulse')
  [real cepstrum and lifter window, quefrency 0-200; pitch pulse marked]
• Selecting the low-n cepstrum separates out resonance information
  (deconvolution / 'liftering')
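The liftering step above can be sketched as follows: zero out all but the low-quefrency cepstral coefficients (and their symmetric mirror), then transform back to get a smoothed log-magnitude envelope. The cutoff `n_keep` is an arbitrary illustrative choice:

```python
import numpy as np

def lifter_envelope(x, n_keep=20):
    # keep only the low-quefrency cepstrum (resonance information),
    # then transform back to get a smoothed log-magnitude spectrum
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-10)
    c = np.real(np.fft.ifft(log_mag))
    lifter = np.zeros(len(c))
    lifter[:n_keep] = 1.0
    lifter[-(n_keep - 1):] = 1.0   # mirror: c[n] = c[N-n] for real signals
    return np.real(np.fft.fft(c * lifter))
```

Applied to any signal, the output follows the broad resonant bumps of the log spectrum while the harmonic comb (high-quefrency structure) is smoothed away.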
Properties of the cepstrum

• Separates source (fine) & filter (broad structure)
  - smooth the log magnitude spectrum to get resonances
• Smoothing the spectrum is filtering along frequency
  - i.e. convolution applied in the Fourier domain
  → multiplication in the IFT domain ('liftering')
• Periodicity in time → harmonics in spectrum
  → 'pitch pulse' in the high-n cepstrum
• Low-n cepstral coefficients are a DCT of the broad filter / resonance shape:

  c[n] = (1/2π) ∫ log|X(e^jω)| · (cos nω + j sin nω) dω

  (log|X| is even in ω, so only the cosine terms survive)
  [plots: cepstral coefs 0..5 and the resulting 5th-order cepstral
   reconstruction of the spectral envelope, 0-7000 Hz]
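The "5th-order cepstral reconstruction" in the plot can be reproduced as a truncated cosine series in the first few cepstral coefficients, the discrete analogue of the integral above. A sketch (the default order 5 matches the slide's example; other choices are equally valid):

```python
import numpy as np

def cepstral_envelope(x, order=5):
    # reconstruct a smooth log-spectral envelope from cepstral coefs 0..order:
    # log|X[k]| ~ c[0] + sum_{n=1..order} 2 c[n] cos(2 pi n k / K)
    K = len(x)
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-10)
    c = np.real(np.fft.ifft(log_mag))
    w = 2 * np.pi * np.arange(K) / K
    env = np.full(K, c[0])
    for n in range(1, order + 1):
        env = env + 2 * c[n] * np.cos(n * w)   # cosine-series term
    return env
```

c[0] carries the mean log energy and each higher coefficient adds one more cosine ripple across the frequency axis, so a handful of coefficients captures only the broad resonance shape.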
Aside: correlation of elements

• The cepstrum is popular in speech recognition
  - feature vector elements are decorrelated
  [plots: auditory spectrum vs. cepstral coefficients over ~150 frames,
   their covariance matrices, and an example joint distribution of
   elements (10, 15)]
  - c_0 'normalizes out' average log energy
• Decorrelated pdfs fit diagonal Gaussians
  - simple correlation is a waste of parameters
• DCT is close to PCA for spectra?
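The decorrelation claim can be illustrated numerically. In this sketch, synthetic 'auditory spectra' (white noise smoothed along frequency, so neighboring bins are strongly correlated) stand in for real features, and an explicit DCT-II plays the role of the cepstral transform; all sizes and the smoothing width are invented for the demonstration:

```python
import numpy as np

def dct2(v):
    # DCT-II of a vector, written as explicit cosine sums
    N = len(v)
    k = np.arange(N)
    return np.array([np.sum(v * np.cos(np.pi * n * (k + 0.5) / N))
                     for n in range(N)])

rng = np.random.default_rng(0)
n_bins, n_frames = 32, 500
# fake 'auditory spectra': smoothing along frequency correlates nearby bins
frames = np.array([np.convolve(rng.standard_normal(n_bins),
                               np.ones(6) / 6, mode='same')
                   for _ in range(n_frames)]).T        # (n_bins, n_frames)
cepstra = np.array([dct2(f) for f in frames.T]).T      # DCT along frequency

def offdiag(R):
    # mean absolute off-diagonal entry of a correlation matrix
    return np.mean(np.abs(R - np.diag(np.diag(R))))

r_spec = offdiag(np.corrcoef(frames))    # spectra: large off-diagonals
r_ceps = offdiag(np.corrcoef(cepstra))   # after DCT: much smaller
```

The DCT approximately diagonalizes the (near-Toeplitz) covariance of smooth spectra, which is the sense in which "DCT is close to PCA for spectra".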
Outline

1 Modeling speech signals
2 Spectral and cepstral models
3 Linear Predictive models (LPC)
  - the LPC model
  - interpretation & application
  - formant tracking
4 Other models
5 Speech synthesis