

  1. Speech Processing 15-492/18-492 Emotional Speech (Some slides taken from the JHU Workshop 2011 final presentation on New Parameterizations for Emotional Speech Synthesis)

  2. Processing Emotional Speech
     • What is it?
       – Emotion/expressive/style: things beyond the textual content
     • Why?
       – Detect frustrated users
       – Detect confusion/confidence in speakers
       – Detect truth/lies
       – Detect engagement in a task
     • How?
       – A combination of words, spectrum, F0, etc. (see the sketch below)
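
As a concrete, hedged illustration of the "spectrum, F0, etc." part, here is a minimal sketch that extracts F0 and energy contours from one utterance. It assumes librosa is installed; the file name "utt.wav" and the sampling rate are placeholders, not anything from the slides.

```python
# Sketch: pull the acoustic side of "words + spectrum + F0" from a wav file.
# Assumes librosa; "utt.wav" is a placeholder file name.
import librosa
import numpy as np

y, sr = librosa.load("utt.wav", sr=16000)

# Frame-level F0 via probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Frame-level energy (RMS), one value per analysis frame.
energy = librosa.feature.rms(y=y)[0]

print("mean F0 over voiced frames:", np.nanmean(f0))
print("mean RMS energy:", energy.mean())
```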

  3. What is emotional speech?
     • The standard 4 emotions: neutral, happy, sad, and angry
     • But there are many more: cold-anger, dominant, passive, shame, confident, non-confident, etc.

  4. Can machines recognize emotions?

  5. Where to get data
     • Record actors
       – For synthesis this is probably best
       – People hear more acted emotions than real ones
     • Mine TV/movies
       – But usually background music
     • Mine call-center logs
       – Lots of angry examples
     • Mine YouTube videos
       – Probably all emotions, but hard to search

  6. Can machines recognize emotions?
     • LDC Emotional Prosody Speech and Transcripts
       – English, dates and numbers, 7 actors
       – 2418 utterances, average 3 sec., total ~2 h
       – 4-class problem: happy, hot-anger, sadness, neutral
       – 6-class problem: […], interest, panic
       – 15-class problem: […], anxiety, boredom, cold-anger, contempt, despair, disgust, elation, pride, shame
     • Berlin Emotional Database (emoDB)
       – German, semantically neutral utterances, 10 actors
       – 535 utterances, average 2.8 sec., total ~25 min
       – 6 emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness

  7. Acoustic Features
     • Feature extraction: 1582 features (openSMILE)
       – 264 prosodic features: 72 F0, 38 energy, 154 duration/position
       – 140 voice-quality features: 68 jitter (JT), 34 shimmer (SH), 38 voicing (VC)
       – 1178 spectral features: 570 MFCC, 304 MEL, 304 LSP
     • Built from 38 low-level descriptors × 21 functionals (see the sketch below)
       – Low-level descriptors: PCM loudness, MFCC [0-14], log Mel freq. band [0-7], LSP frequency [0-7], F0, voicing, jitter/shimmer (local/DDP)
       – Functionals: position of max./min., arith. mean, standard deviation, skewness, kurtosis, lin. regression coefficients 1/2, lin. regression error Q/A, percentile 1/99, …
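
As a concrete illustration of the LLD-plus-functionals idea (not the exact openSMILE configuration), here is a sketch that computes a few of the listed functionals over a single F0 contour. numpy/scipy and the toy contour are assumptions for the example.

```python
# Sketch: a handful of the listed functionals over one LLD contour
# (illustrative only; not the actual openSMILE feature set).
import numpy as np
from scipy import stats

def functionals(lld: np.ndarray) -> dict:
    """Summarize a frame-level contour with utterance-level statistics."""
    t = np.arange(len(lld))
    slope, offset = np.polyfit(t, lld, 1)          # lin. regression coeffs 1/2
    err = lld - (slope * t + offset)
    return {
        "mean": lld.mean(),                        # arith. mean
        "std": lld.std(),                          # standard deviation
        "skewness": stats.skew(lld),
        "kurtosis": stats.kurtosis(lld),
        "pos_max": np.argmax(lld) / len(lld),      # relative position of max
        "pos_min": np.argmin(lld) / len(lld),
        "pctl_1": np.percentile(lld, 1),           # percentile 1/99
        "pctl_99": np.percentile(lld, 99),
        "reg_slope": slope,
        "reg_err_q": np.mean(err ** 2),            # quadratic regression error
    }

f0 = 120 + 30 * np.sin(np.linspace(0, 3, 200))     # toy F0 contour (Hz)
print(functionals(f0))
```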

  8. Classification and Evaluation
     • Classification
       – Discriminative training: multi-class SVM, one-vs-one (1:1), WEKA
       – Linear kernel, complexity parameter set by cross-validation
       – Standardized feature sets
     • Evaluation
       – 10-fold cross-validation or LOSO (leave-one-speaker/sentence-out)
       – Also testing on a held-out test set
       – Evaluation criterion: accuracy or unweighted average recall (UAR); see the sketch below
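
The slides use WEKA; a rough scikit-learn equivalent (an assumption, not the workshop's actual code) of linear-SVM training with leave-one-speaker-out evaluation and UAR scoring might look like this. X, y, and speakers here are hypothetical stand-in arrays.

```python
# Sketch: linear SVM + LOSO + UAR in scikit-learn (the slide used WEKA;
# this is an assumed equivalent). X, y, speakers are hypothetical data.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1582))         # one 1582-dim feature vector per utterance
y = rng.integers(0, 4, size=200)         # 4-class emotion labels
speakers = rng.integers(0, 7, size=200)  # speaker IDs for LOSO

# Standardized features + linear-kernel SVM (SVC trains one-vs-one internally).
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# Leave-one-speaker-out; balanced accuracy == unweighted average recall (UAR).
uar = cross_val_score(
    clf, X, y, groups=speakers,
    cv=LeaveOneGroupOut(), scoring="balanced_accuracy",
)
print("UAR per held-out speaker:", uar.round(2), "mean:", uar.mean().round(3))
```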

  9. Results
     • LDC Emotional Prosody Speech and Transcripts (LOSO)

         UAR [%]          4 classes   6 classes   15 classes
         whole data set   70.4        53.5        23.6
         test set         68.3        43.3        23.5
         chance level     25.0        16.6         6.7

     • Berlin Emotional Speech Database

         UAR [%]          7 classes
         whole data set   77.0
         test set         80.2
         chance level     14.3

  10. Results (normalized)
      • Speaker/sentence normalization: z-score(x_s) = (x_s − mean(x_s)) / std(x_s)  (sketch below)
      • LDC Emotional Prosody Speech and Transcripts (LOSO)

          UAR [%]          4 classes     6 classes     15 classes
          whole data set   75.0 (+4.6)   54.4 (+0.9)   27.2 (+3.6)

      • Berlin Emotional Speech Database (LOSO)

          UAR [%]          7 classes
          whole data set   84.2 (+7.2)
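
A minimal sketch of the per-speaker z-score normalization in the formula above, reusing the hypothetical X and speakers arrays from the earlier classification sketch.

```python
# Sketch: per-speaker z-score normalization, z = (x - mean_s) / std_s,
# where the statistics are computed over each speaker's own utterances.
import numpy as np

def speaker_zscore(X: np.ndarray, speakers: np.ndarray) -> np.ndarray:
    Xn = np.empty_like(X, dtype=float)
    for s in np.unique(speakers):
        rows = speakers == s
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0) + 1e-8   # guard against zero variance
        Xn[rows] = (X[rows] - mu) / sd
    return Xn
```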

  11. Classification Analysis: LDC Emotion

  12. Classification Analysis: emoDB

  13. Can people recognize emotions?

  14. Mechanical Turk
      • Anonymous workers (Worker-ID)
      • Simple tasks for small amounts of money
      • Minimal time, effort, and cost required for significant amounts of crowd-sourced data

  15. English LDC Emotion (4 Emotions)
      • Short, 1-2 second wav files
      • English speech: dates such as “November 3rd”
      • 4 fundamental, distinct emotions
      • 74 unique workers and 169 total HITs completed
      • Results (the slide also showed uni-directional confusions; tallying sketch below):

          Emotion     % Correct
          Anger       69%
          Sadness     67%
          Neutral     66%
          Happiness   46%
          Total       60%
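
A sketch of how per-emotion accuracy and the most common confusions could be tallied from raw HIT responses; the (true label, worker choice) pairs here are hypothetical, not the actual workshop data.

```python
# Sketch: tally per-emotion accuracy and most common confusion from
# MTurk responses, given (true_label, worker_choice) pairs (hypothetical data).
from collections import Counter, defaultdict

responses = [("anger", "anger"), ("anger", "sadness"),
             ("happiness", "neutral"), ("sadness", "sadness")]

correct = Counter()
total = Counter()
confusions = defaultdict(Counter)
for true, chosen in responses:
    total[true] += 1
    if chosen == true:
        correct[true] += 1
    else:
        confusions[true][chosen] += 1

for emo in total:
    print(f"{emo}: {correct[emo] / total[emo]:.0%} correct,"
          f" top confusion: {confusions[emo].most_common(1)}")
```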

  16. English LDC Emotion (15 Emotions)
      • Same parameters as the previous experiment
      • Including less well-defined emotions (pride, shame, etc.)
      • 68 unique workers and 218 total HITs completed
      • Results (the slide also showed uni-directional confusions):

          Emotion     % Correct        Emotion      % Correct
          Neutral     29%              Happiness     9%
          Hot-Anger   26%              Pride         9%
          Sadness     25%              Despair       8%
          Boredom     17%              Cold-Anger    7%
          Panic       14%              Anxiety       5%
          Interest    12%              Disgust       5%
          Elation     10%              Shame         4%
          Contempt    10%              Total        12%

  17. German Berlin Emotion (7 Emotions)
      • Short sentences with no emotional connotation
        – “The tablecloth is lying on the fridge.”
      • 37 unique workers and 245 total HITs completed
      • Results (the slide’s common-confusion-pair column survives only partially; its top entry was 41.8%):

          Emotion     % Correct
          Neutral     68%
          Anger       62%
          Sadness     53%
          Anxiety     45%
          Happiness   35%
          Boredom     27%
          Disgust     11%
          Total       41%

  18. Subjective Evaluation Takeaways
      • Humans are significantly more accurate than chance for smaller numbers of emotions
        – This includes cross-lingual recognition
      • Certain emotions are consistently identified accurately
        – Sadness, neutral, hot-anger

  19. Emotional TTS
      • Record lots of data
        – 1 hour plus in each domain
        – (Easy to get boredom and anger)
      • Do voice conversion/parametric synthesis
        – Better, but overall the results aren’t encouraging
        – Hard to make it sound very natural

  20. Synthesis using AF13s
      Types of synthesis (summarized in the sketch below):

        tts      Text-to-speech with no emotion/personality content;
                 predicts durations, F0, and spectrum (through AFs)
        ttsE/P   Text-to-speech with an emotion/personality flag;
                 predicts durations, F0, and spectrum (through AFs)
        cgp      No explicit emotion/personality flag, but natural durations;
                 predicts F0 and spectrum (through AFs)
        cgpE/P   Emotion/personality flag, and natural durations;
                 predicts F0 and spectrum (through AFs)
        resynth  Pure re-synthesis from natural durations, F0, and spectrum;
                 “the best we can do”
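
One way to keep the five conditions straight is to write them down as data: which parameters each condition predicts versus copies from natural speech. The labels mirror the slide; the encoding itself is purely illustrative.

```python
# Sketch: which parameters each synthesis condition predicts vs. copies
# from natural speech (encoding is illustrative; labels follow the slide).
conditions = {
    #            flag?  durations    F0           spectrum
    "tts":     (False, "predicted", "predicted", "predicted"),
    "ttsE/P":  (True,  "predicted", "predicted", "predicted"),
    "cgp":     (False, "natural",   "predicted", "predicted"),
    "cgpE/P":  (True,  "natural",   "predicted", "predicted"),
    "resynth": (None,  "natural",   "natural",   "natural"),
}

for name, (flag, dur, f0, spec) in conditions.items():
    print(f"{name:8s} flag={flag!s:5s} dur={dur:9s} f0={f0:9s} spec={spec}")
```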

  21. LDC (English) Emotion Synthesis
      Objective evaluation (classifier trained on human speech; chance = 25%):

      Without speaker normalization

        Test on   tts   ttsE   cgp   cgpE   resynth   human
        UAR       32%   36%    38%   40%    56%       70%

      With speaker normalization

        Test on   tts   ttsE   cgp   cgpE   resynth   human
        UAR       33%   54%    57%   61%    61%       75%

  22. LDC (English) Emotion Synthesis
      Mechanical Turk human evaluation (chance = 25%)
      Average workers: 59; average HITs completed: 308; files per HIT: 12

        Speech type       tts   ttsE   cgp   cgpE   resynth   natural
        Percent correct   28%   28%    35%   37%    41%       60%

      We see the same trend and ordering in the human evaluation as in the objective classification.

  23. German Synthesis
      • Berlin Emotion, objective evaluation (classifier trained on human speech; chance = 14%):

          Test on   tts   ttsE   cgp   cgpE   resynth   human
          UAR       14%   29%    65%   72%    82%       84%

      • ADFS Personality, objective evaluation (classifier trained on human speech; chance = 10%):

          Test on   tts   ttsP   cgp   cgpP   resynth   human
          UAR       10%   60%    60%   78%    89%       92%

  24. What we actually need
      • Expressive styles
        – Frustrated (angry, annoyed, etc.)
        – Interested/uninterested
        – Pleased/unhappy
        – Cooperative/non-cooperative

  25. How can we use it?
      • Detect frustrated customers
        – Be frustrated back at them (or not)
        – What techniques can deflate frustration?
      • Detect (non-)confidence
        – Better aid in tutorial systems
      • S2S (speech-to-speech) translation
        – Copy emotion across languages
