Computer Vision, Speech Communication & Signal Processing Group, Intelligent Robotics and Automation Laboratory, National Technical University of Athens, Greece (NTUA)
Robot Perception and Interaction Unit, Athena Research and Innovation Center (Athena RIC)
Nonlinear Aspects of Speech Production: Modulations and Energy Operators
Petros Maragos
Summer School on Speech Signal Processing (S4P), DA-IICT, Gandhinagar, India, 9-11 Sept. 2018
Outline
• Nonlinear Speech Processing
• Modulations
• Energy Operators
• AM-FM Speech Model, Demodulation Algorithms
• Applications to Speech Recognition
• Applications to Music Recognition
• Application to Audio Summarization
• Application to Distant Speech Recognition
• Applications of Spatio-Temporal Modulations to Image and Video Processing
Physics of speech airflow → LINEAR ACOUSTICS APPROXIMATION → Linear models of speech production
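Physics of Speech Airflow
• airflow variables: $\rho$ = air density; $p$ = pressure; $\mathbf{u}$ = 3D air particle velocity
• governing equations ($\mu$ = viscosity coefficient, $\mathbf{g}$ = gravitational acceleration):
  - mass conservation (continuity eqn): $\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) = 0$
  - momentum conservation (Navier-Stokes eqn): $\rho \left( \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} \right) = -\nabla p + \rho \mathbf{g} + \mu \nabla^2 \mathbf{u} + \frac{1}{3} \mu \nabla (\nabla \cdot \mathbf{u})$
  - state equation: $p / \rho^{1.4} = \text{const.}$
• time-varying boundary conditions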
Nonlinear Speech Processing • Modulations • Turbulence – Fractals – Chaos
Evidence for Speech Modulations • separated & unstable airflow • vortices • oscillators with time-varying elements • energy pulses (Teager)
Time-varying Oscillators → AM-FM
Simple second-order oscillators with time-varying elements produce modulations:
- If mass or compliance is time-varying → FM [Van der Pol, Proc. IRE 1930]
- If damping is time-varying → AM [Van der Pol, IEE J. London 1946]
AM-FM Speech Model, Energy Demodulation Algorithms
AM-FM Speech Modulation Model [Maragos, Kaiser & Quatieri, IEEE T-SP Oct. 1993]
• One single resonance as damped AM-FM: $S(t) = \underbrace{A(t)\, e^{-\sigma t}}_{a(t)} \cos\Big( \underbrace{\omega_c t + \int_0^t q(\tau)\, d\tau}_{\phi(t)} \Big)$
• Inst. frequency: $\omega(t) = 2\pi f(t) = \frac{d}{dt}\phi(t) = \omega_c + q(t)$
• If due to a 2nd-order LTI system ⇒ $A(t)$ = constant, $\omega(t) = \omega_c$
• Speech signal as multi-component AM-FM: $\text{Speech}(t) = \sum_k a_k(t) \cos(\phi_k(t))$
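The model above can be sketched numerically by sampling one damped AM-FM resonance and summing several of them. This is a minimal illustration using only the Python standard library; the function names `am_fm_resonance` and `multi_component`, the rectangle-rule phase integration, and all parameter values in the demo are assumptions made for the example, not part of the original slides.

```python
import math

def am_fm_resonance(n_samples, fs, fc, sigma, amp=lambda t: 1.0, q=lambda t: 0.0):
    """Sample S(t) = A(t) exp(-sigma t) cos(2 pi fc t + int_0^t q(tau) dtau).

    fs and fc are in Hz, sigma in 1/s, q(t) is a frequency deviation in rad/s.
    """
    dt = 1.0 / fs
    phase_dev = 0.0  # running rectangle-rule estimate of the integral of q
    samples = []
    for n in range(n_samples):
        t = n * dt
        envelope = amp(t) * math.exp(-sigma * t)      # a(t)
        phase = 2.0 * math.pi * fc * t + phase_dev    # phi(t)
        samples.append(envelope * math.cos(phase))
        phase_dev += q(t) * dt
    return samples

def multi_component(resonances):
    """Sum equal-length resonance signals sample by sample."""
    return [sum(values) for values in zip(*resonances)]

if __name__ == "__main__":
    fs = 8000.0  # illustrative sampling rate
    r1 = am_fm_resonance(
        160, fs, fc=500.0, sigma=200.0,
        q=lambda t: 2.0 * math.pi * 50.0 * math.cos(2.0 * math.pi * 100.0 * t))
    r2 = am_fm_resonance(160, fs, fc=1500.0, sigma=300.0)
    speech_like = multi_component([r1, r2])
    print(len(speech_like))
```

With `q` left at its default of zero and a constant `amp`, the resonance reduces to the exponentially damped cosine of a second-order LTI system.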
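AM-FM Demodulation Problem
Given $x(t) = a(t)\cos(\phi(t))$, estimate $a(t)$ and $\omega(t) = d\phi(t)/dt$.
• Variational approach
• Hilbert transform: $\hat{x}(t) = x(t) * \frac{1}{\pi t}$, with frequency response $+j$ for $\omega < 0$ and $-j$ for $\omega > 0$; then $a \approx \sqrt{x^2 + \hat{x}^2}$ and $\omega \approx \frac{d}{dt} \arctan\frac{\hat{x}}{x}$
• Energy operators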
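The Hilbert-transform approach can be sketched on a short discrete frame as follows. This is an illustrative implementation only: it builds the analytic signal $x + j\hat{x}$ with a naive $O(N^2)$ DFT from the standard library (a real application would use an FFT), and the function names `analytic_signal` and `hilbert_demodulate` are hypothetical.

```python
import cmath
import math

def analytic_signal(x):
    """Return z = x + j*H{x} by zeroing negative-frequency DFT bins."""
    n = len(x)
    spectrum = [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
                for m in range(n)]
    # Keep DC (and Nyquist for even n), double positive bins, drop negative bins.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for m in positive:
        gain[m] = 2.0
    return [sum(gain[m] * spectrum[m] * cmath.exp(2j * math.pi * m * k / n)
                for m in range(n)) / n
            for k in range(n)]

def hilbert_demodulate(x):
    """Amplitude envelope |z[k]| and inst. frequency (rad/sample) per step."""
    z = analytic_signal(x)
    amp = [abs(v) for v in z]
    # Phase increment between consecutive samples, wrapped to (-pi, pi].
    freq = [cmath.phase(z[k] * z[k - 1].conjugate()) for k in range(1, len(z))]
    return amp, freq

if __name__ == "__main__":
    n = 64
    x = [2.0 * math.cos(2.0 * math.pi * 8 * k / n + 0.3) for k in range(n)]
    amp, freq = hilbert_demodulate(x)
    print(amp[10], freq[10])
```

The frequency estimate uses the phase of $z[k]\,\overline{z[k-1]}$ instead of differentiating an unwrapped arctangent, which avoids explicit phase unwrapping.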
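Energy Tracking in Oscillators
• harmonic oscillator: mass $m$ on a spring with constant $k$, displacement $x(t)$
• motion equation: $m\ddot{x} + kx = 0$
• response: $x(t) = A\cos(\omega t + \theta)$, $\omega^2 = k/m$
• energy: $E = \frac{1}{2} m \dot{x}^2 + \frac{1}{2} k x^2 = \frac{m}{2}(A\omega)^2$ = constant
• energy tracking: $\Psi(x) = (\dot{x})^2 - x\ddot{x} = A^2\omega^2 = \frac{E}{m/2}$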
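The energy-tracking identity follows by direct substitution of the oscillator response. The short derivation below assumes the response $x(t) = A\cos(\omega t + \theta)$ with $\omega^2 = k/m$; the phase $\theta$ is included for generality.

```latex
\begin{align*}
\dot{x}(t) &= -A\omega \sin(\omega t + \theta), \qquad
\ddot{x}(t) = -A\omega^2 \cos(\omega t + \theta) \\
\Psi(x) &= \dot{x}^2 - x\ddot{x}
         = A^2\omega^2 \sin^2(\omega t + \theta) + A^2\omega^2 \cos^2(\omega t + \theta)
         = A^2\omega^2 \\
E &= \tfrac{1}{2} m \dot{x}^2 + \tfrac{1}{2} k x^2
   = \tfrac{1}{2} m A^2\omega^2 \left[ \sin^2(\omega t + \theta) + \cos^2(\omega t + \theta) \right]
   = \tfrac{m}{2} A^2\omega^2
   \quad (\text{using } k = m\omega^2)
\end{align*}
```

Comparing the last two lines gives $\Psi(x) = E / (m/2)$: the operator output is the oscillator's energy per half unit mass.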
1D Energy Operators (Teager, Kaiser ICASSP 1990)
• Continuous-time signals $x(t)$: $\Psi_c[x(t)] = [\dot{x}(t)]^2 - x(t)\,\ddot{x}(t)$
  property: $\Psi_c[A e^{rt} \cos(\omega_c t + \theta)] = A^2 e^{2rt} \omega_c^2$
• Discrete-time signals $x(n)$: $\Psi_d[x(n)] = x^2(n) - x(n-1)\,x(n+1)$
  - Discretize derivatives [Maragos, Kaiser & Quatieri, T-SP Apr. 1993]
  - Special case of quadratic operators [Atlas & Fang, T-SP 1995]
  property: $\Psi_d[A r^n \cos(\Omega_c n + \theta)] = A^2 r^{2n} \sin^2(\Omega_c)$
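The discrete operator $x^2(n) - x(n-1)\,x(n+1)$ needs only three consecutive samples, as the following minimal sketch shows. The function name `tkeo`, the zero-padding of the two end samples, and the test-signal parameters are choices made for this illustration.

```python
import math

def tkeo(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The result is aligned with x; the first and last samples, which lack a
    neighbour, are set to 0.0.
    """
    psi = [0.0] * len(x)
    for n in range(1, len(x) - 1):
        psi[n] = x[n] * x[n] - x[n - 1] * x[n + 1]
    return psi

if __name__ == "__main__":
    # Exponentially damped cosine A r^n cos(Omega n + theta).
    A, r, omega, theta = 2.0, 0.99, 0.2 * math.pi, 0.5
    x = [A * r ** n * math.cos(omega * n + theta) for n in range(50)]
    psi = tkeo(x)
    # Compare with the closed form A^2 r^(2n) sin^2(Omega) at one sample.
    n = 10
    print(psi[n], A ** 2 * r ** (2 * n) * math.sin(omega) ** 2)
```

For a damped cosine the operator output matches the closed-form property up to floating-point rounding, independent of the phase offset.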
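Energy Separation Algorithm (ESA) (Maragos, Kaiser & Quatieri, IEEE T-SP Oct. 1993)
• Cosine: $x(t) = A\cos(\omega_c t + \theta)$ ⇒ $\Psi[x(t)] = A^2\omega_c^2$, $\Psi[\dot{x}(t)] = A^2\omega_c^4$
• AM-FM signal: $x(t) = a(t)\cos\left(\int_0^t \omega(\tau)\,d\tau\right)$, where $a(t)$, $\omega(t)$ do not vary too fast or too much w.r.t. $\omega_c$
• ESA: $\sqrt{\dfrac{\Psi[\dot{x}(t)]}{\Psi[x(t)]}} \approx \omega(t)$, $\dfrac{\Psi[x(t)]}{\sqrt{\Psi[\dot{x}(t)]}} \approx |a(t)|$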
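A quick numeric check of the two ESA ratios on a pure cosine, where they hold exactly, may help. The values $A = 2$ and $\omega_c = 3$ rad/s are arbitrary choices for illustration.

```latex
\begin{align*}
x(t) &= 2\cos(3t + \theta)
  &&\Rightarrow\quad \Psi[x] = A^2\omega_c^2 = 4 \cdot 9 = 36 \\
\dot{x}(t) &= -6\sin(3t + \theta)
  &&\Rightarrow\quad \Psi[\dot{x}] = (A\omega_c)^2\,\omega_c^2 = 36 \cdot 9 = 324 \\
\sqrt{\frac{\Psi[\dot{x}]}{\Psi[x]}} &= \sqrt{\frac{324}{36}} = 3 = \omega_c,
  &&\qquad \frac{\Psi[x]}{\sqrt{\Psi[\dot{x}]}} = \frac{36}{18} = 2 = |A|
\end{align*}
```

The derivative $\dot{x}$ is itself a sinusoid of amplitude $A\omega_c$ at the same frequency, which is why applying the energy operator to it contributes the extra factor $\omega_c^2$ that separates frequency from amplitude.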
Discrete ESA (DESA-2)
• AM-FM signal: $x[n] = a[n]\cos\left(\int_0^n \Omega(m)\,dm\right)$
• Energy tracking: $\Psi[x[n]] \approx a^2[n]\sin^2(\Omega[n])$, $\Psi[x[n+1] - x[n-1]] \approx 4a^2[n]\sin^4(\Omega[n])$
• DESA-2: $\dfrac{2\,\Psi[x[n]]}{\sqrt{\Psi[x[n+1] - x[n-1]]}} \approx |a[n]|$, $\arcsin\sqrt{\dfrac{\Psi[x[n+1] - x[n-1]]}{4\,\Psi[x[n]]}} \approx \Omega[n]$
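The DESA-2 formulas can be sketched directly in code. This is a minimal illustration, not a reference implementation: the function names `tkeo` and `desa2`, the NaN convention for samples where an energy value is not positive, and the demo signal are all assumptions. The arcsin form recovers frequencies only in the range $0 < \Omega < \pi/2$.

```python
import math

def tkeo(x):
    """Discrete energy operator psi[n] = x[n]^2 - x[n-1]*x[n+1]; ends set to 0."""
    psi = [0.0] * len(x)
    for n in range(1, len(x) - 1):
        psi[n] = x[n] * x[n] - x[n - 1] * x[n + 1]
    return psi

def desa2(x):
    """Return (amp, freq) estimates per sample; freq is in rad/sample.

    Samples without enough neighbours, or with non-positive energies,
    are reported as NaN.
    """
    n_samples = len(x)
    # Symmetric difference y[n] = x[n+1] - x[n-1].
    y = [0.0] * n_samples
    for n in range(1, n_samples - 1):
        y[n] = x[n + 1] - x[n - 1]
    psi_x = tkeo(x)
    psi_y = tkeo(y)
    amp = [math.nan] * n_samples
    freq = [math.nan] * n_samples
    for n in range(2, n_samples - 2):
        if psi_x[n] > 0.0 and psi_y[n] > 0.0:
            ratio = min(1.0, psi_y[n] / (4.0 * psi_x[n]))  # approx. sin^2(Omega)
            freq[n] = math.asin(math.sqrt(ratio))
            amp[n] = 2.0 * psi_x[n] / math.sqrt(psi_y[n])
    return amp, freq

if __name__ == "__main__":
    # Slowly modulated AM-FM test signal (parameters are illustrative only).
    N = 400
    a_true = [1.0 + 0.25 * math.cos(math.pi * n / 100.0) for n in range(N)]
    w_true = [math.pi / 5.0 + (math.pi / 40.0) * math.cos(math.pi * n / 100.0)
              for n in range(N)]
    x, phase = [], 0.0
    for n in range(N):
        x.append(a_true[n] * math.cos(phase))
        phase += w_true[n]  # running sum approximates the phase integral
    amp, freq = desa2(x)
    print(amp[200], a_true[200], freq[200], w_true[200])
```

On a constant-amplitude, constant-frequency cosine the estimates are exact up to rounding; on modulated signals they are approximations whose accuracy depends on how slowly $a[n]$ and $\Omega[n]$ vary.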
ESA Applied to Synthetic AM-FM
[Figure: six panels plotted against sample index (0 to 400): AM-FM signal; square root of energy; amplitude envelope; instantaneous frequency / π; amplitude error; frequency error / π.]
ESA Applied to Speech Resonance
[Figure: six panels: speech signal, bandpass speech, square root of energy, amplitude envelope and instantaneous frequency (Hz), each plotted against time (0 to 30 msec); speech spectrum (dB) plotted against frequency (0 to 6 kHz).]
ESA in Noise and BP Filtering (Bovik, Maragos & Quatieri, IEEE T-SP Dec. 1993)
• AM-FM signal plus noise: $x(t) = a(t)\cos\left(\int_0^t \omega(\tau)\,d\tau\right) + n(t)$
• Noise $n(t)$: WSS Gaussian, zero-mean, power spectrum $N(\xi)$
• Bandpass filter: $x(t) \to G(\xi) \to y(t)$, with $\text{SNR}(t) = a^2(t) \Big/ \int_{\text{passband}} N(\xi)\,d\xi$
• ESA ampl./freq. estimates $\hat{a}(t)$, $\hat{\omega}(t)$: [Equations: approximations of $E[\hat{\omega}^2(t)]$ and $E[\hat{a}^2(t)]$ in terms of $\omega^2(t)$, $a^2(t)$, $\text{SNR}(t)$ and $G(\omega(t))$.]
Multiband Demodulation and F/B Tracking [A. Potamianos & P. Maragos, JASA 1996]
[Block diagram: the input passes through a bank of bandpass filters with center frequencies $f_1, f_2, f_3, \ldots, f_N$, giving band signals $x(t, f_k)$; the ESA is applied to each band to obtain $a(t, f_k)$ and $f(t, f_k)$; each band's $f$ and $a^2$ are then combined into the frequency and bandwidth estimates $F(t, f)$ and $B(t, f)$.]
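Frequency and Bandwidth Estimates
• Center frequency estimates:
  - unweighted: $F_u = \frac{1}{T} \int_{t_0}^{t_0+T} f(t)\,dt$
  - weighted: $F_w = \dfrac{\int_{t_0}^{t_0+T} f(t)\,a^2(t)\,dt}{\int_{t_0}^{t_0+T} a^2(t)\,dt}$
• Bandwidth estimates:
  - unweighted: $B_u^2 = \frac{1}{T} \int_{t_0}^{t_0+T} (f(t) - F_u)^2\,dt$
  - weighted: $B_w^2 = \dfrac{\int_{t_0}^{t_0+T} \left[ (\dot{a}(t)/2\pi)^2 + (f(t) - F_w)^2 a^2(t) \right] dt}{\int_{t_0}^{t_0+T} a^2(t)\,dt}$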
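On sampled amplitude and frequency tracks, these short-time estimates reduce to plain and $a^2$-weighted averages over a frame. The sketch below is an assumption-laden illustration: the integrals are replaced by sums over the frame's samples, $\dot{a}$ is approximated by finite differences scaled by the sampling rate `fs`, and the function names are hypothetical. Frames need at least two samples.

```python
import math

def center_frequencies(f, a):
    """Return (F_u, F_w): plain mean and a^2-weighted mean of f over a frame."""
    f_u = sum(f) / len(f)
    energy = sum(ak * ak for ak in a)
    f_w = sum(fk * ak * ak for fk, ak in zip(f, a)) / energy
    return f_u, f_w

def bandwidths(f, a, fs):
    """Return (B_u, B_w) for a frame of inst. frequency f (Hz) and amplitude a."""
    f_u, f_w = center_frequencies(f, a)
    b_u = math.sqrt(sum((fk - f_u) ** 2 for fk in f) / len(f))
    # Finite-difference derivative of a(t): central inside, one-sided at the ends.
    n = len(a)
    a_dot = []
    for k in range(n):
        lo, hi = max(k - 1, 0), min(k + 1, n - 1)
        a_dot.append((a[hi] - a[lo]) * fs / (hi - lo))
    energy = sum(ak * ak for ak in a)
    numerator = sum((ad / (2.0 * math.pi)) ** 2 + (fk - f_w) ** 2 * ak * ak
                    for ad, fk, ak in zip(a_dot, f, a))
    b_w = math.sqrt(numerator / energy)
    return b_u, b_w

if __name__ == "__main__":
    # Illustrative frame: frequency drifts upward while amplitude grows.
    f = [1000.0 + 10.0 * k for k in range(20)]
    a = [1.0 + 0.05 * k for k in range(20)]
    print(center_frequencies(f, a))
    print(bandwidths(f, a, fs=8000.0))
```

The weighted estimate $F_w$ is pulled toward the frequencies at which the amplitude is large, so low-energy portions of the frame contribute less than they do to $F_u$.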
Speech Pyknogram [ A. Potamianos & P. Maragos, JASA 1996 ]
Smooth Energy Operators and Tracking
• Teager-Kaiser Energy Operator (TKEO): $\Psi[x(t)] = [\dot{x}(t)]^2 - x(t)\,\ddot{x}(t)$
• AM-FM signals: $\Psi[a(t)\cos(\phi(t))] \approx a^2(t)\,\omega^2(t)$
• Regularized or Gabor TKEO: the TKEO applied to the Gabor-filtered signal $x * h$, where $h$ is the Gabor filter's impulse response
• Wideband signals (sum of non-stationary sinusoids): simultaneous narrowband component separation, energy tracking and denoising
• 2D Gabor TKEO
Refs: Dimitriadis & Maragos, Speech Com 2006; Kokkinos, Evangelopoulos & Maragos, T-PAMI 2009.
1/f Speech Modulation Model [Dimakis & Maragos, IEEE T-SP 2005]
• Model a resonance of a random speech phoneme as a phase-modulated 1/f signal: $S(t) = A\cos\big(\underbrace{\omega_c t + P(t)}_{\phi(t)}\big)$
• Nonlinear phase signal $P(t)$ modeled as a 1/f random process.
• Useful model for broad resonances often observed in voiced or unvoiced fricative sounds and probably caused by nonlinear phenomena during speech production.