Modeling speech using pole-zero models, Christian H. Kasess

  1. Modeling speech using pole-zero models
     - Christian H. Kasess, Acoustics Research Institute, 25.10.2012
     - Kasess (ARI), Vocal tract modeling, SPL 2012, 1 / 31

  2. The vocal tract
     - Roughly divided into three cavities: pharyngeal, oral, nasal
     - Oral vowel production: nasal section closed off by the velum
     - Nasals and nasalized vowels: nasal section coupled
     - Laterals (e.g. /l/): airflow on one (or both) sides of the tongue, which generates side branches
     - Figure source: http://pegasus.cc.ucf.edu/ cnye/vocal tract pic.htm

  3. Source-filter model
     - Glottis acts as source (pulse train)
     - Vocal tract acts as a 'slowly' varying linear filter
     - Figure source: http://health.tau.ac.il/Communication Disorders/noam

  4. Source-filter model
     - Source and filter are often assumed independent, although:
       - Glottal opening and closing changes the VT filter
       - The glottal pulse is not an ideal pulse
       - The effect of the glottis is not linear
     - Still, the source-filter model is useful:
       - Commonly used in phonetics
       - Model parameters can be used for speaker recognition
       - Useful for formant tracking

  5. All-pole model
     - The all-pole model captures resonances (formants)
     - Autoregressive model (AR), linear predictive coding (LPC):
       $$y(n) = \sum_{i=1}^{p} a_i\, y(n-i) + x(n)$$
     - Works well with vowels
     - Easy to estimate: solve the Yule-Walker equations (Toeplitz) with the Levinson-Durbin algorithm
       $$\gamma(n) = \sum_{i=1}^{p} a_i\, \gamma(n-i) + \sigma_x^2\, \delta_{n,0}$$
       with the correlation function $\gamma(i) = E[y(n)\, y(n-i)]$
     - Direct link to a simple physical model
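
The Levinson-Durbin recursion mentioned on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the talk: the function name `levinson_durbin` and the helper `ar2_autocorrelation` are hypothetical, and the sign convention follows the AR equation above, y(n) = sum_i a_i y(n-i) + x(n).

```python
def levinson_durbin(gamma, p):
    """Solve the Yule-Walker equations for an AR(p) model.

    gamma[i] is the autocorrelation at lag i (i = 0..p).
    Returns (a, sigma2), where a[i-1] is a_i in
    y(n) = sum_i a_i * y(n-i) + x(n) and sigma2 is the variance of x(n).
    """
    a = []            # coefficients a_1..a_m of the current model order m
    err = gamma[0]    # prediction error power at order 0
    for m in range(1, p + 1):
        # Reflection coefficient of order m
        acc = gamma[m] - sum(a[i] * gamma[m - 1 - i] for i in range(m - 1))
        k = acc / err
        # Order update: a_i <- a_i - k * a_{m-i}, then append a_m = k
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err


def ar2_autocorrelation(a1, a2, sigma2, max_lag):
    """Exact autocorrelation of a stable AR(2) process (illustrative helper)."""
    rho1 = a1 / (1.0 - a2)
    rho2 = a1 * rho1 + a2
    gamma = [sigma2 / (1.0 - a1 * rho1 - a2 * rho2)]
    gamma.append(rho1 * gamma[0])
    for n in range(2, max_lag + 1):
        gamma.append(a1 * gamma[n - 1] + a2 * gamma[n - 2])
    return gamma


# Recover the coefficients of a known AR(2) model from its autocorrelation
gamma = ar2_autocorrelation(0.5, -0.3, 1.0, max_lag=2)
coeffs, sigma2 = levinson_durbin(gamma, p=2)
```

In practice the autocorrelation would be estimated from a windowed speech frame rather than computed in closed form as in this toy example.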

  6. Pole-zero models
     - Nasal spectra show spectral dips:
       - Oral cavities and paranasal cavities act as resonators
       - Side branches cause a decrease in energy
     - A pole-zero model represents such spectra more efficiently than an all-pole model
     - Problems with pole-zero models:
       - Trickier to estimate; in general requires non-linear methods
       - Correspondence to a physical model is more difficult

  7. All-pole vs. pole-zero model ctd.
     - [Figure: spectral envelope and model fits, level [dB] over f [Hz], 0 to 4000 Hz. Legend: Envelope; (15,0), RMS = 0.56; (10,5), RMS = 0.46; (15,5), RMS = 0.45; (20,20), RMS = 0.2]

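  8. Pole-zero models
     - Autoregressive moving average (ARMA):
       $$y(n) - \sum_{k=1}^{p} a_k\, y(n-k) = \sum_{j=0}^{q} b_j\, x(n-j) \qquad (1)$$
     - Pole-zero model:
       $$\hat{y}(\omega) = \frac{B(e^{-i\omega}, \theta)}{A(e^{-i\omega}, \theta)}\, \hat{x}(\omega) = \frac{\sum_{j=0}^{q} b_j\, e^{-i\omega j}}{\sum_{k=0}^{p} a_k\, e^{-i\omega k}}\, \hat{x}(\omega) \qquad (2)$$
       (for (2) to agree with (1), the denominator in (2) has $a_0 = 1$ and, for $k \geq 1$, coefficients of opposite sign to the $a_k$ in (1))
     - Estimation is in general a non-linear problem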
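
To make the link between the difference equation (1) and the frequency response (2) concrete, here is a small sketch in plain Python. The function names `arma_filter` and `frequency_response` are illustrative assumptions, and the code uses the sign convention of (1), so the denominator is 1 - sum_k a_k e^{-i omega k}.

```python
import cmath


def arma_filter(b, a, x):
    """Run y(n) = sum_{k>=1} a_k y(n-k) + sum_{j>=0} b_j x(n-j).

    a[k-1] holds a_k (k = 1..p), b[j] holds b_j (j = 0..q).
    """
    y = []
    for n in range(len(x)):
        acc = sum(b[j] * x[n - j] for j in range(len(b)) if n - j >= 0)
        acc += sum(a[k - 1] * y[n - k]
                   for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(acc)
    return y


def frequency_response(b, a, omega):
    """Evaluate B/A at angular frequency omega (radians per sample)."""
    num = sum(b[j] * cmath.exp(-1j * omega * j) for j in range(len(b)))
    den = 1.0 - sum(a[k - 1] * cmath.exp(-1j * omega * k)
                    for k in range(1, len(a) + 1))
    return num / den


# Impulse response of a small, stable pole-zero system (arbitrary values)
b = [1.0, -0.8]
a = [0.5, -0.3]
h = arma_filter(b, a, [1.0] + [0.0] * 199)
```

For a stable system the discrete-time Fourier transform of the impulse response `h` converges to `frequency_response(b, a, omega)`, which is a convenient sanity check on the sign convention.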

  9. Time or frequency?
     - Time domain: not suitable for perceptual frequency scales
     - Spectral domain:
       - Perceptual frequency scales can be included
       - Logarithmic spectrum can be used
       - Spectral envelope needs to be extracted: voiced segments contain harmonics due to the glottis, and the envelope represents the VT transfer function (+ glottal pulse)

  10. Spectral error measures
     - Linear spectrum:
       - Assumptions about phase are necessary (minimum phase)
       - The speech signal is not minimum phase (glottis)
     - Log spectrum (perceptually relevant):
       $$\theta = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} \left| \log \hat{y}(\omega_k) - \log \frac{B(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta')} \right|^2$$
     - Log amplitude spectrum (phase ignored, minimum-phase system easy to obtain):
       $$\theta = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} \left| \log |\hat{y}(\omega_k)| - \log \left| \frac{B(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta')} \right| \right|^2$$
     - Cepstral domain: computationally efficient (only for linear frequency)
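
As an illustration of the log amplitude criterion, the following sketch evaluates the cost for a given set of polynomial coefficients. The names `transfer_function` and `log_amplitude_error` are hypothetical, and the code assumes that B and A are polynomials in z^-1 with z = exp(i omega), with a[0] included in the denominator list; the slides do not spell out this parametrization.

```python
import cmath
import math


def transfer_function(b, a, omega):
    """B/A with B = sum_j b[j] z^-j and A = sum_k a[k] z^-k, z = exp(i*omega)."""
    num = sum(c * cmath.exp(-1j * omega * j) for j, c in enumerate(b))
    den = sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(a))
    return num / den


def log_amplitude_error(y_mag, omegas, b, a):
    """Sum over the grid of (log|y| - log|B/A|)^2; phase is ignored."""
    return sum(
        (math.log(ym) - math.log(abs(transfer_function(b, a, w)))) ** 2
        for ym, w in zip(y_mag, omegas)
    )


# A target envelope generated by a known model gives zero error for that model
omegas = [math.pi * k / 16 for k in range(16)]
b_true, a_true = [1.0, 0.4], [1.0, -0.5, 0.3]
y_mag = [abs(transfer_function(b_true, a_true, w)) for w in omegas]
err_exact = log_amplitude_error(y_mag, omegas, b_true, a_true)
```

Because the error is measured on a log scale, a pure gain mismatch contributes the same amount at every frequency point, independent of the local spectral level.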

  11. Optimization methods
     - Estimate numerator and denominator separately
     - Recursive methods: do not necessarily converge to a local minimum
     - Non-linear optimization:
       - Newton method: calculation of the Hessian necessary; numerically expensive and potentially unstable
       - Gauss-Newton method: Hessian approximated through first derivatives; convergence issues
       - Quasi-Newton: approximate the Hessian (or its inverse) using an iterative scheme; numerically stable and inexpensive

  12. PZ representation
     - Positions of poles and zeros: requires the number of complex and real poles/zeros, and their multiplicity
     - Quadratic factors: requires the multiplicity
     - Polynomial coefficients: requires only the number of poles and zeros

  13. Recursive estimation
     - Substitute the non-linear problem with a linear one
     - Steiglitz-McBride (1965, 1977):
       $$\theta_i = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} \left| \hat{y}(\omega_k) \frac{A(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta_{i-1})} - \frac{B(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta_{i-1})} \right|^2$$
       $$\phantom{\theta_i} = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} \left| \hat{y}(\omega_k) - \frac{B(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta')} \right|^2 \left| \frac{A(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta_{i-1})} \right|^2$$
     - More general: weighted linear least squares (WLLS)
       $$\theta_i = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} W(\omega_k, \theta_{i-1}) \left| \hat{y}(\omega_k)\, A(e^{i\omega_k}, \theta') - B(e^{i\omega_k}, \theta') \right|^2$$
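
The iteration on this slide can be sketched with NumPy as follows. This is a rough illustration under stated assumptions, not the implementation used in the talk: the function names are hypothetical, A is assumed monic (a_0 = 1) and both polynomials are taken in z^-1, and the weight is the Steiglitz-McBride choice W = 1 / |A(theta_{i-1})|^2.

```python
import numpy as np


def eval_poly(c, omega):
    """Evaluate sum_m c[m] * exp(-i*omega*m) on the frequency grid omega."""
    return np.exp(-1j * np.outer(omega, np.arange(len(c)))) @ c


def wlls_step(y, omega, p, q, w):
    """Minimize sum_k w_k |y_k A(omega_k) - B(omega_k)|^2 with A monic."""
    E = np.exp(-1j * np.outer(omega, np.arange(max(p, q) + 1)))
    # Unknowns: a_1..a_p, b_0..b_q.  Residual: y + M @ theta.
    M = np.hstack([y[:, None] * E[:, 1:p + 1], -E[:, :q + 1]])
    sw = np.sqrt(w)
    Mw = sw[:, None] * M
    rhs = -sw * y
    # Stack real and imaginary parts so that the unknowns stay real
    theta, *_ = np.linalg.lstsq(
        np.vstack([Mw.real, Mw.imag]),
        np.concatenate([rhs.real, rhs.imag]),
        rcond=None,
    )
    return np.concatenate([[1.0], theta[:p]]), theta[p:]


def iterate_wlls(y, omega, p, q, n_iter=5):
    """Repeat the WLLS step, reweighting with 1/|A_prev|^2 each time."""
    a = np.ones(1)  # start from A = 1, i.e. unit weights
    for _ in range(n_iter):
        w = 1.0 / np.abs(eval_poly(a, omega)) ** 2
        a, b = wlls_step(y, omega, p, q, w)
    return a, b


# Noise-free spectrum of a known pole-zero model (arbitrary test values)
omega = np.linspace(0.0, np.pi, 64)
a_true = np.array([1.0, -0.5, 0.3])
b_true = np.array([1.0, 0.4])
y = eval_poly(b_true, omega) / eval_poly(a_true, omega)
a_est, b_est = iterate_wlls(y, omega, p=2, q=1)
```

With noise-free data from a model of matching order the first step is already exact; the reweighting only matters when the data cannot be fitted perfectly.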

  14. Marelli and Balazs 2010
     - Logarithmic amplitude spectrum
     - Estimation of polynomial coefficients
     - Quasi-Newton with line search:
       - Gradient calculated analytically
       - Broyden-Fletcher-Goldfarb-Shanno (BFGS) method
       - Iterative approximation of the inverse Hessian (rank-one updates)
       - Line search along gradient
     - Initialized using the WLLS method
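
For reference, the textbook form of the BFGS update of the inverse Hessian approximation is given below. The notation is generic and an assumption of this note, not taken from the slides or from Marelli and Balazs: $E$ is the cost function, $H_k$ the current inverse Hessian approximation, and $\alpha_k$ the step length found by the line search.

$$\theta_{k+1} = \theta_k - \alpha_k H_k \nabla E(\theta_k), \qquad s_k = \theta_{k+1} - \theta_k, \qquad d_k = \nabla E(\theta_{k+1}) - \nabla E(\theta_k)$$

$$H_{k+1} = \left(I - \rho_k s_k d_k^{T}\right) H_k \left(I - \rho_k d_k s_k^{T}\right) + \rho_k s_k s_k^{T}, \qquad \rho_k = \frac{1}{d_k^{T} s_k}$$

Only gradients are needed, which is why an analytically computed gradient is sufficient for this scheme.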

  15. Marelli and Balazs 2010
     - New method shows lowest error
     - Fewer iterations for polynomial representation

  16. Summary
     - Pole-zero:
       - Efficient representation for laterals, nasals, ...
       - Different estimation schemes
       - Newton-like method gives good results
       - Speaker verification improved as compared to LPC only (Enzinger et al. 2011)
     - Important questions:
       - What is an appropriate degree for the polynomials?
       - Should the glottal source be corrected?
       - What about physiological constraints?

  17. Segmented tube model
     - Vocal tract as a segmented tube (Wakita 1973, Fant 1960)
     - [Figure: tube with cross-sectional areas $A_{N+1}, A_N, \dots, A_1, A_0$ between glottis and lips, along the x axis]
     - Two equations per segment m (pressure $p_m$ and volume velocity $u_m$):
       $$p_m(x) = \frac{\rho c}{A_m} \left( u_m^{+} \exp(-ikx) + u_m^{-} \exp(ikx) \right), \qquad u_m(x) = u_m^{+} \exp(-ikx) - u_m^{-} \exp(ikx) \qquad (3)$$
     - Volume velocity and pressure are matched at boundaries
     - Lossless model (no friction or viscosity, below 4000 Hz ...)

  18. One-tube model
     - Transfer function $u_{\text{lips}} / u_{\text{glottis}} = u_0 / u_N$:
       $$\hat{A}(\mu, z) = z^{N/2} \begin{pmatrix} 1 & 0 \end{pmatrix} \prod_{m=N}^{0} \frac{1}{1 - \mu_m} \begin{pmatrix} 1 & \mu_m \\ \mu_m z^{-1} & z^{-1} \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad (4)$$
     - Correspondence requires a fixed segment length (related to $f_s$) and specific boundary conditions
     - Example, N = 2:
       $$\hat{A}(\mu, z) \propto 1 + (\mu_0 \mu_1 + \mu_1 \mu_2)\, z^{-1} + \mu_0 \mu_2\, z^{-2}$$
     - For $\mu_0$ or $\mu_N = \pm 1$, the reflection coefficients are calculated by a recursive algorithm (Markel and Gray, 1976)
     - m-th reflection coefficient $\mu_m := \dfrac{A_m - A_{m+1}}{A_m + A_{m+1}}$ and $z := \exp\left(i 2\pi \dfrac{f}{f_s}\right) = \exp\left(i 2\pi f\, \dfrac{2l}{c}\right)$
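
The matrix product in (4) can be expanded numerically. The sketch below is illustrative only: the function names are hypothetical, the scalar factors z^(N/2) and 1/(1 - mu_m) are dropped (so the result is only proportional to the polynomial in (4)), and polynomials are stored as coefficient lists in powers of z^-1.

```python
def reflection_coefficients(areas):
    """mu_m = (A_m - A_{m+1}) / (A_m + A_{m+1}) for areas A_0..A_{N+1}."""
    return [(areas[m] - areas[m + 1]) / (areas[m] + areas[m + 1])
            for m in range(len(areas) - 1)]


def poly_add(p, q):
    """Add two polynomials given as coefficient lists in z^-1."""
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    return [pi + qi for pi, qi in zip(p, q)]


def tube_polynomial(mu):
    """First component of prod_{m=N}^{0} [[1, mu_m], [mu_m z^-1, z^-1]] (1, 0)^T.

    mu[m] is mu_m for m = 0..N.  The matrix for m = 0 acts first because it
    is the rightmost factor of the product.
    """
    upper, lower = [1.0], [0.0]          # the column vector (1, 0)^T
    for mu_m in mu:
        new_upper = poly_add(upper, [mu_m * c for c in lower])
        mixed = poly_add([mu_m * c for c in upper], lower)
        new_lower = [0.0] + mixed        # multiplication by z^-1
        upper, lower = new_upper, new_lower
    return upper


# N = 2 with arbitrary reflection coefficients mu_0, mu_1, mu_2
coeffs = tube_polynomial([0.2, -0.5, 0.3])
```

For three reflection coefficients the returned list reproduces the N = 2 expansion on this slide: 1, mu_0 mu_1 + mu_1 mu_2, mu_0 mu_2.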

  19. Branching tubes
     - [Figure: branched tube model with glottis, pharynx, velum, oral cavity and nasal cavity]
     - Nasal tract is added
     - Each tract is modeled as a segmented tube
     - For nasals: nasal tract open, oral tract closed
     - Vocal tract model has pole-zero characteristic
     - Transfer function given as $f(\mu, z) = \dfrac{\hat{B}(\mu, z)}{\hat{A}(\mu, z)}$

  20. Pole-zero model
     - No direct way from pole-zero to branched-tube model:
       - Numerator polynomial appears also in the denominator
       - Pole-zero model has 2N + M + L coefficients
       - Two-tube model has N + M + L + 1 parameters
     - Numerator can be calculated precisely
     - Current estimation methods:
       - Estimate the pole-zero model
       - Apply step-down to the numerator
       - Minimize the error with respect to either the denominator polynomial (Lim and Lee 1996) or the signal filtered with the numerator (Schnell 2003)
       - Gives precedence to zeros
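
The step-down recursion referred to here converts polynomial coefficients back into reflection coefficients. The sketch below shows the standard backward Levinson form and is not taken from the cited papers: the function name `step_down` is hypothetical, and it assumes a monic polynomial in z^-1 built by the step-up rule A_m(z) = A_{m-1}(z) + k_m z^-m A_{m-1}(1/z). Sign conventions for reflection coefficients differ between texts.

```python
def step_down(alpha):
    """Recover reflection coefficients k_1..k_N from a monic polynomial.

    alpha = [1, alpha_1, ..., alpha_N] are coefficients in powers of z^-1.
    Raises ValueError if some |k_m| >= 1, where the recursion breaks down.
    """
    a = list(alpha)
    if a[0] != 1:
        raise ValueError("polynomial must be monic")
    order = len(a) - 1
    ks = [0.0] * order
    for m in range(order, 0, -1):
        k = a[m]
        if abs(k) >= 1.0:
            raise ValueError("reflection coefficient outside (-1, 1)")
        ks[m - 1] = k
        # Undo one step-up: a_i <- (a_i - k * a_{m-i}) / (1 - k^2), i < m
        a = [(a[i] - k * a[m - i]) / (1.0 - k * k) for i in range(m)]
    return ks


# Polynomial generated by the step-up rule with k_1 = -0.5 and k_2 = 0.3
ks = step_down([1.0, -0.65, 0.3])
```

Under this convention the example corresponds to the one-tube polynomial with boundary condition mu_0 = 1, for which the N = 2 expansion reduces to 1 + mu_1 (1 + mu_2) z^-1 + mu_2 z^-2.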
