

  1. Speaker Classification: Supervector Approach and Detection Task Christian Müller, DFKI

  2. Speech as a Source for Non-Intrusive UM
     [Diagram: an adaptive speech dialog system ("Now it’s time to get to gate 38.") obtains information about the user through speaker classification and keeps it in a user model; speech = sensor.]
     - The system either (A) adapts its dialog behavior (e.g. detailed map with shops vs. arrows) or (B) provides recommendations (e.g. a different route to the gate).
     - User information comes from inference from sensors (not intrusive) or from an explicit statement (intrusive).

  3. Overview
     - Speech as a source of information for non-intrusive user modeling
     - Speech/signal processing topics and their take-away messages:
       - GMM/SVM supervector approach for acoustic features: classification method for independent “bag of observations” features
       - Detection task and pseudo-NIST evaluation: valid application-independent evaluation procedure
       - Rank and polynomial rank normalization: feature space warping normalization
     - Conclusions

  4. Speaker Classification Systems
     [Figure: an audio segment is passed to a system that classifies the properties listed below.]
     - Cognitive load: Best Research Paper Award, UM 2001
     - Age and gender: Voice Award 2007; Telekom live operation 2009
     - Language: 14 languages + dialects (telephone quality); NIST evaluation 2007
     - Identity: project with BKA 2009; NIST evaluation 2008
     - Acoustic events: project with VW 2008; Interspeech 2008

  5. How can your features be modeled, assuming that they
     - are multi-dimensional,
     - represent repeating observations of the same kind, and
     - can be assumed to be independent (“bag” of observations)?
     Proposing the GMM/SVM supervector approach, using frame-by-frame acoustic features as the example.

  6. Hierarchical Feature Model
     [Figure: feature hierarchy, from high-level features (learned characteristics) at the top to low-level features (physical characteristics) at the bottom.]
     - Semantics
     - Dialog (turn sequences of speakers A and B)
     - Idiolect: <s> how shall I say this <c> <s> yeah I know...
     - Phonetics: /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ /  / /n/ /i:/ ...
     - Prosody
     - Spectrum

  7. Modeling Acoustics and Prosodics
     [Figure: the same feature hierarchy (semantics, dialog, idiolect, phonetics, prosody, spectrum), annotated “no ASR”.]

  8. General Classification Scheme
     [Figure: processing pipeline with an inset sketch of a multilayer perceptron network.]
     - Stages: Preprocessing (e.g. channel compensation), Feature Extraction, Classification, Fusion, Top-Down Knowledge
     - Classifier examples: multilayer perceptron networks, support-vector machines
     - Figure annotation: “(not addressed in this talk)”

  9. Generative Approach: Gaussian Mixture Model (GMM)
     - Training: audio labeled “emergency vehicle” passes through feature extraction and is used to train the “emergency vehicle” model (a probability density).
     - Test: an unknown segment (?) passes through feature extraction; each frame of speech is scored against the “emergency vehicle” model, and the result is the average likelihood over all frames for the class “emergency vehicle”.

  10. Generative Approach: Gaussian Mixture Model (GMM)
     - Test: an unknown segment (?) passes through feature extraction; each frame of speech is scored against both the “emergency vehicle” model and a background model.
     - The result is the average log likelihood ratio over all frames for the class “emergency vehicle”.
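
The scoring described on slides 9 and 10 can be sketched in a few lines of Python. This is a minimal illustration, not the system from the talk: it assumes diagonal-covariance Gaussians, represents each GMM as a hypothetical list of `(weight, mean, variance)` tuples, and uses made-up toy parameters.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log p(x | gmm); gmm is a list of (weight, mean, var) components."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(logs)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(lg - peak) for lg in logs))

def avg_log_likelihood_ratio(frames, class_gmm, background_gmm):
    """Average per-frame log likelihood ratio: class model vs. background model."""
    total = sum(
        gmm_log_likelihood(f, class_gmm) - gmm_log_likelihood(f, background_gmm)
        for f in frames
    )
    return total / len(frames)

# Toy 2-D models with made-up parameters (illustrative only).
class_gmm = [(0.5, [2.0, 2.0], [1.0, 1.0]), (0.5, [3.0, 1.0], [1.0, 1.0])]
background_gmm = [(1.0, [0.0, 0.0], [4.0, 4.0])]

frames_near_class = [[2.1, 1.9], [2.8, 1.2], [2.5, 1.5]]
frames_near_background = [[-1.0, 0.5], [0.2, -0.3], [-0.5, -1.5]]

score_target = avg_log_likelihood_ratio(frames_near_class, class_gmm, background_gmm)
score_other = avg_log_likelihood_ratio(frames_near_background, class_gmm, background_gmm)
```

Because the score is an average over frames, segments of different lengths yield comparable values; a positive score means the class model explains the frames better than the background model.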

  11. A Mixture of Gaussians
     - Means, variances, and mixture weights are optimized in training.
     - [Figure: the black line is a mixture of 3 Gaussians.]

  12. Discriminative Method: Support Vector Machine (SVM)
     - Training: audio labeled “em. vehic.” (1) or “not em. vehic.” (-1) passes through feature extraction and is used to train the “em. vehic.” model.
     - Features are transformed into a higher-dimensional space where the problem is linear.
     - A discriminating hyperplane is learned using linear regression.
     - Trade-off between training error and width of margin.
     - The model is stored in the form of “support vectors” (data points on the margin).

  13. Discriminative Method: Support Vector Machine (SVM)
     - Test: an unknown segment (?) passes through feature extraction and receives a score (distance to the hyperplane).
     - Discriminative methods have been shown to be superior to generative methods for similar tasks.
     - Feature vectors have to be of the same length (sensitive to variable segment lengths).
     - Solutions:
       - feature statistics calculated over the entire utterance
       - a fixed portion of the segment
       - sequential kernels

  14. GMM/SVM Supervector Approach
     [Figure: feature extraction, followed by a GMM whose MAP-adapted Gaussian means serve as the input to the SVM.]
     - Combines the discriminative power of SVMs with the length independence of GMMs.
     - Very successful in similar tasks such as speaker recognition.
     - The GMM is trained using MAP adaptation.
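
How a variable-length segment turns into a fixed-length SVM input can be sketched as follows. The slide only states that the Gaussian means are MAP adapted; the relevance-factor update rule, the value `relevance=16.0`, and the `(weight, mean, variance)` tuple layout used here are assumptions made for illustration, not details taken from the talk.

```python
import math

def responsibilities(x, gmm):
    """Posterior probability of each mixture component given frame x."""
    logs = []
    for w, mean, var in gmm:
        ll = math.log(w)
        for xi, m, v in zip(x, mean, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    peak = max(logs)
    probs = [math.exp(lg - peak) for lg in logs]
    total = sum(probs)
    return [p / total for p in probs]

def map_adapted_supervector(frames, background_gmm, relevance=16.0):
    """Adapt the background means to the segment and stack them into one vector.

    relevance is a hypothetical smoothing constant: components that see
    little data stay close to their background mean.
    """
    dim = len(background_gmm[0][1])
    counts = [0.0] * len(background_gmm)
    sums = [[0.0] * dim for _ in background_gmm]
    for x in frames:
        for k, gamma in enumerate(responsibilities(x, background_gmm)):
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    supervector = []
    for k, (_w, mean, _var) in enumerate(background_gmm):
        alpha = counts[k] / (counts[k] + relevance)
        for d in range(dim):
            data_mean = sums[k][d] / counts[k] if counts[k] > 0.0 else mean[d]
            supervector.append(alpha * data_mean + (1.0 - alpha) * mean[d])
    return supervector

# Toy background model: 2 components in 2 dimensions (made-up values).
background_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])]

short_segment = [[1.0, 1.0]] * 3
long_segment = [[1.0, 1.0]] * 50

sv_short = map_adapted_supervector(short_segment, background_gmm)
sv_long = map_adapted_supervector(long_segment, background_gmm)
```

Both segments produce a vector of length (number of components) x (feature dimension), which is what lets an SVM consume segments of any duration.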

  15. Evaluation Results
     Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.

  16. - How can you evaluate your multi-class models independently of the given application?
     - How can you establish an appropriate evaluation procedure in order to obtain valid results?
     Proposing the detection task and the “pseudo NIST” evaluation procedure, using acoustic event detection and speaker age recognition as examples.

  17. Background
     - With multi-class recognition problems, many test/analysis methods are very application specific (e.g. confusion matrices).
       - We want a method that allows results to be generalized across a large set of applications.
     - With home-grown databases, parameter tuning on the evaluation set often compromises the validity of the results/inferences.
       - We want a fair “one shot” evaluation.

  18. The Detection Task
     [Figure: the system is asked “emergency vehicle?” and answers “yes, 1.324326”.]
     - Given a speech segment (s) and an acoustic event to be detected (target event, E_T), the task is to decide whether E_T is present in s (yes or no).
     - The system's output shall also contain a score indicating its confidence, with more positive scores indicating greater confidence.

  19. Terminology
     - Segment class: e.g. segment event, segment age class; the ground truth (not known).
     - Target: the hypothesized class.
     - Trial: a combination of segment and target.

  20. Evaluation
     [Figure: one segment is presented to the system with the targets emergency vehicle, music, talking, laughing, phone, and no event; the system answers each trial with a decision and a score, e.g. “yes 1.32432” or “no -0.3212”.]
     - The system performance is evaluated by presenting it with a set of trials.
     - Each test segment is used for multiple trials.
     - The absence of all targets is explicitly included.
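
The trial construction from slides 18 to 20 might look like the sketch below. The segment identifiers, the dictionary layout, and the decision threshold of 0.0 are hypothetical; the target list follows the figure on slide 20. The ground truth is unknown to the system and is used only by the evaluator.

```python
from itertools import product

# Targets as listed on the slide; "no event" covers the absence of all targets.
TARGETS = ["emergency vehicle", "music", "talking", "laughing", "phone", "no event"]

def make_trials(segments, targets=TARGETS):
    """Pair every segment with every target; each pair is one trial.

    segments is a list of (segment_id, ground_truth) pairs. The ground
    truth is kept only so the evaluator can label the trial afterwards.
    """
    return [
        {"segment": seg_id, "target": target, "is_target_trial": truth == target}
        for (seg_id, truth), target in product(segments, targets)
    ]

def decide(score, threshold=0.0):
    """System output for one trial: a yes/no decision plus the raw score."""
    return ("yes" if score >= threshold else "no", score)

# Hypothetical evaluation set with two segments.
segments = [("seg001", "emergency vehicle"), ("seg002", "no event")]
trials = make_trials(segments)
target_trials = [t for t in trials if t["is_target_trial"]]
```

Every segment contributes one target trial and five non-target trials here, which is why each test segment is reused across multiple trials.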

  21. Type of Errors
     - Miss: the segment is “em. vehic.”, the target is “em. vehic.”, and the system answers “no”.
     - False alarm: the segment is “em. vehic.”, the target is “phone”, and the system answers “yes”.

  22. Decision-Error Tradeoff
     [Figure: misses plotted against false alarms, with the “equal error rate” point marked.]
     - Selecting an operating point (decision threshold) along the dotted line trades off misses against false alarms.
     - The optimal operating point is application dependent.
     - Low false alarm rates are desirable for most applications.
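
The trade-off can be made concrete by sweeping the decision threshold over the trial scores. The sketch below assumes the system answers "yes" when the score is at or above the threshold, and it approximates the equal error rate by the threshold where the two rates are closest; the scores are made up for illustration.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false-alarm rates when answering "yes" for score >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa

def det_points(target_scores, nontarget_scores):
    """One (threshold, p_miss, p_fa) operating point per observed score."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    return [
        (t, *error_rates(target_scores, nontarget_scores, t)) for t in thresholds
    ]

def approx_equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the operating point where misses and false alarms are closest."""
    _, p_miss, p_fa = min(
        det_points(target_scores, nontarget_scores),
        key=lambda point: abs(point[1] - point[2]),
    )
    return (p_miss + p_fa) / 2.0

# Made-up scores: target trials tend to score higher than non-target trials.
target_scores = [2.0, 1.5, 0.3, -0.2]
nontarget_scores = [1.0, -0.5, -1.0, -2.0]

eer = approx_equal_error_rate(target_scores, nontarget_scores)
strict = error_rates(target_scores, nontarget_scores, threshold=1.5)
```

Raising the threshold, as in `strict`, lowers the false-alarm rate at the price of more misses, which is the movement along the curve that the slide describes.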

  23. Decision Cost Function
     $$C(E_T, E_N) = C_{\mathrm{Miss}} \cdot P_{\mathrm{Target}} \cdot P_{\mathrm{Miss}}(E_T) + C_{\mathrm{FA}} \cdot (1 - P_{\mathrm{Target}}) \cdot P_{\mathrm{FA}}(E_T, E_N)$$
     where $E_T$ and $E_N$ are the target and non-target events, and $C_{\mathrm{Miss}}$, $C_{\mathrm{FA}}$, and $P_{\mathrm{Target}}$ are application model parameters.
     The application parameters for EER are $C_{\mathrm{Miss}} = C_{\mathrm{FA}} = 1$ and $P_{\mathrm{Target}} = 0.5$.
     - Weighted sum of misses and false alarms using variable costs and priors.
     - Application model parameters are selected according to the application.
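
As a worked example of the cost function, the sketch below evaluates it for one operating point under two parameter sets. The function and argument names are illustrative, and the second parameter set (costly misses, rare targets) is a hypothetical application model, not one from the talk.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted sum of miss and false-alarm probabilities.

    The defaults are the EER parameters from the slide:
    C_Miss = C_FA = 1 and P_Target = 0.5.
    """
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

# One operating point, read off a hypothetical DET curve.
p_miss, p_fa = 0.2, 0.05

cost_eer_params = detection_cost(p_miss, p_fa)
# Hypothetical application: misses are ten times as costly, targets are rare.
cost_rare_target = detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01)
```

With rare targets the false-alarm term dominates the cost even though a miss is weighted ten times higher, which illustrates why the parameters have to be chosen per application.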

  24. Example DET Plot
     [Figure: DET plot of miss probability against false alarm probability.]
     Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.
