A General-Purpose 32 ms Prosodic Vector for Hidden Markov Modeling


  1. A General-Purpose 32 ms Prosodic Vector for Hidden Markov Modeling
     Kornel Laskowski (1, 2), Mattias Heldner (3) & Jens Edlund (3)
     (1) Carnegie Mellon University, Pittsburgh PA, USA
     (2) Universität Karlsruhe, Karlsruhe, Germany
     (3) KTH Royal Institute of Technology, Stockholm, Sweden
     8 September, 2008
     Interspeech 2009, Brighton, UK
     Sections: Introduction, FFV Representation, Applicability, Experiments, Conclusion

  2. Imagine you had ...
     a local representation of tone, estimated from a single ASR-size analysis frame,
     which would not require:
       - prior determination of voicing
       - speaker normalization
     with separable codeword clusters for:
       - absence of voicing
       - presence of voicing, constant F0
       - presence of voicing, falling F0, with rate of change
       - presence of voicing, rising F0, with rate of change

  3. Then you could do lots of things cheaply ...
     Examples include:
       - online prosodic modeling
       - improved ASR for tonal languages
       - enriched ASR for other languages
       - contrastive phone models
       - variously accented same-word lexicon entries
       - (word-conditioned) prosodic phrasing for free

  4. Instead, currently you need to ...
     1. run a pitch tracker, which
        1. computes a local estimate of voicing and of pitch
        2. applies dynamic programming over a long observation time
     2. heuristically correct its output, by
        1. pruning outliers, based on long-observation-time trends, and/or
        2. applying a piecewise linear approximation
     3. normalize for the speaker, by
        1. determining a long-observation-time speaker norm
        2. applying the normalization to each frame
     4. treat unvoiced regions, by
        - interpolating inside them, or
        - posting exceptions in downstream modeling/handling
     5. compute a first-order log-difference
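
Steps 3 to 5 of the conventional pipeline listed above can be sketched as follows. This is only an illustration, assuming a pitch tracker has already produced one F0 value per frame, with 0.0 marking unvoiced frames. The function names, the linear interpolation, and the use of the mean log-F0 as the speaker norm are assumptions made for the example and are not taken from the slides.

```python
import math

def interpolate_unvoiced(f0):
    """Fill unvoiced frames (marked 0.0) by linear interpolation; edges are held."""
    voiced = [i for i, v in enumerate(f0) if v > 0.0]
    if not voiced:
        raise ValueError("no voiced frames to interpolate from")
    out = list(f0)
    for i, value in enumerate(f0):
        if value > 0.0:
            continue
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None:            # leading unvoiced run: hold first voiced value
            out[i] = f0[right]
        elif right is None:         # trailing unvoiced run: hold last voiced value
            out[i] = f0[left]
        else:
            w = (i - left) / (right - left)
            out[i] = (1.0 - w) * f0[left] + w * f0[right]
    return out

def normalize_log_f0(f0):
    """Subtract the mean log-F0 of the whole track (assumed speaker norm)."""
    logs = [math.log(v) for v in f0]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]

def log_difference(f0):
    """First-order log-difference between consecutive frames."""
    return [math.log(b) - math.log(a) for a, b in zip(f0, f0[1:])]
```

Both `interpolate_unvoiced` and `normalize_log_f0` need the whole track before they can emit a single frame, which is the long-observation-time dependence the slide points to.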

  5. What we will present ...
     1. Fundamental Frequency Variation (FFV)
     2. Applicability of the FFV Representation
        - speaker change prediction
        - speaker classification
        - dialog act classification
     3. Several Basic Questions
        - feature transformation
        - feature regularization
        - concatenation with other features
        - runtime improvements
        - acoustic model complexity
     4. Summary

  6. Computation, for Each (32 ms) Analysis Frame
     1. estimate the FFV spectrum g[ρ]
        - estimate the power spectra F_L and F_R
        - dilate F_R by a factor 2^ρ, ρ > 0
        - dot product with undilated F_L
        - repeat for a continuum of ρ values
     2. pass g[ρ] through a filterbank to yield G ∈ ℝ^7
     3. decorrelate G
     [Figure: panels labeled "time domain" and "freq domain"; later builds of the slide add a plot with a horizontal axis from −2 to +2 and a vertical axis from 0 to 1]
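
The first step of the per-frame computation above (estimating g[ρ]) can be sketched as follows. This is a rough illustration under several assumptions that the slide does not specify: F_L and F_R are taken here as magnitude spectra of the Hann-windowed first and second halves of the frame, dilation is done by linear interpolation along the frequency axis, negative ρ is handled by dilating F_L instead of F_R, and the dot product is normalized to a cosine similarity. All function names are hypothetical.

```python
import numpy as np

def half_spectra(frame, n_fft=1024):
    """Magnitude spectra of the left and right halves of one frame (assumed windowing)."""
    half = len(frame) // 2
    window = np.hanning(half)
    f_l = np.abs(np.fft.rfft(frame[:half] * window, n_fft))
    f_r = np.abs(np.fft.rfft(frame[len(frame) - half:] * window, n_fft))
    return f_l, f_r

def dilate(spec, factor):
    """Stretch a spectrum along the frequency axis: output[k] = spec[k / factor]."""
    k = np.arange(len(spec))
    return np.interp(k / factor, k, spec, right=0.0)

def ffv_spectrum(frame, rhos, n_fft=1024):
    """Normalized dot product between F_L and dilated F_R for each rho in `rhos`."""
    f_l, f_r = half_spectra(frame, n_fft)
    g = np.zeros(len(rhos))
    for i, rho in enumerate(rhos):
        if rho >= 0:
            a, b = f_l, dilate(f_r, 2.0 ** rho)    # as on the slide: dilate F_R
        else:
            a, b = dilate(f_l, 2.0 ** -rho), f_r   # assumption: mirror case
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        g[i] = np.dot(a, b) / denom if denom > 0 else 0.0
    return g
```

For a 32 ms frame at an assumed 16 kHz sampling rate (512 samples), `ffv_spectrum(frame, np.linspace(-2, 2, 41))` returns 41 values. When the two halves of the frame are identical, the maximum falls at ρ = 0, since no dilation is needed to align the two spectra.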

  18. Comparison with MFCC Computation
     MFCC features: AUDIO → PRE-EMPHASIS → POW SPECTRUM ESTIMATION → PERCEPTUAL FILTERBANK (MEL) → DECORRELATE (INV. COS-II) → MODELING
     FFV features:  AUDIO → PRE-EMPHASIS → FFV SPECTRUM ESTIMATION → FILTERBANK → DECORRELATE (KLT) → MODELING
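
The decorrelation stage of the FFV column uses a KLT, which, unlike the fixed cosine transform in the MFCC column, has to be estimated from data. A minimal sketch of that step is given below, assuming the filterbank outputs are available as rows of a NumPy array. The function names are hypothetical, and the slides do not say how the transform was actually trained.

```python
import numpy as np

def fit_klt(train):
    """Estimate a KLT from training vectors (one row per frame).

    Returns the training mean and the eigenvectors of the training
    covariance as columns, ordered from largest to smallest eigenvalue.
    """
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]
    return mean, eigvecs[:, order]

def apply_klt(features, mean, basis):
    """Project mean-removed feature vectors onto the KLT basis."""
    return (features - mean) @ basis
```

Applied to the data it was estimated on, the transform yields features whose sample covariance is diagonal, which suits acoustic models with diagonal-covariance Gaussians.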

  19. FFV versus Pitch Tracking, Conceptually
     Formant Tracking:   FFT Spectrum
     Pitch Tracking:     Autocorr Spectrum
     FFV Peak Tracking:  FFV Spectrum
     [Figure: arrow diagrams for each of the three cases]
