Speech Processing 15-492/18-492
Emotional Speech
(Some slides taken from the JHU Workshop 2011 final presentation on New Parameterizations for Emotional Speech Synthesis)
Processing Emotional Speech
What is it?
• Emotion/expressive/style: things beyond the textual content
Why?
• Detect frustrated users
• Detect confusion/confidence in speakers
• Detect truth/lies
• Detect engagement in task
How?
• Combination of words, spectrum, F0, etc.
What is emotional speech?
• The standard 4 emotions: neutral, happy, sad, and angry
• But there are many more: cold-anger, dominant, passive, shame, confident, non-confident, etc.
Can machines recognize emotions?
Where to get data
• Record actors: for synthesis this is probably best (people hear more acted emotions than real ones)
• Mine TV/movies: but usually background music
• Mine call-center logs: lots of angry examples
• Mine YouTube videos: probably all emotions, but hard to search
Can machines recognize emotions?
LDC Emotional Prosody Speech and Transcripts
• English, dates and numbers, 7 actors
• 2418 utterances, average 3 sec., total ~2 h
• 4-class problem: happy, hot-anger, sadness, neutral
• 6-class problem: […], interest, panic
• 15-class problem: […], anxiety, boredom, cold-anger, contempt, despair, disgust, elation, pride, shame
Berlin Emotional Database (emoDB)
• German, semantically neutral utterances, 10 actors
• 535 utterances, average 2.8 sec., total ~25 min
• 6 emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness
Acoustic Features
Feature extraction: 1582 features (openSMILE)
• 264 Prosodic Features: 72 F0, 38 Energy, 154 Dur./Pos.
• 140 Voice Quality Features: 68 Jitter (JT), 34 Shimmer (SH), 38 Voicing (VC)
• 1178 Spectral Features: 570 MFCC, 304 MEL, 304 LSP
38 low-level descriptors: PCM loudness, MFCC [0-14], Log Mel Freq. Band [0-7], LSP Frequency [0-7], F0, Voicing, Jitter / Shimmer (local / DDP)
21 functionals: position max./min., arith. mean, standard deviation, skewness, kurtosis, lin. regression coefficient 1/2, lin. regression error Q/A, percentile 1/99, …
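As an illustration of the scheme above, here is a minimal sketch (my own, not openSMILE's implementation) that applies a few of the listed functionals to one low-level descriptor contour such as a per-frame F0 track:

```python
import numpy as np
from scipy import stats

def functionals(lld):
    """Apply a handful of openSMILE-style functionals to one
    low-level descriptor contour (e.g. an F0 track).
    `lld` is a 1-D array of per-frame values."""
    x = np.asarray(lld, dtype=float)
    t = np.arange(len(x))
    # lin. regression coefficients 1 and 2 (slope, intercept)
    slope, intercept = np.polyfit(t, x, 1)
    pred = slope * t + intercept
    return {
        "mean": x.mean(),                           # arith. mean
        "std": x.std(),                             # standard deviation
        "skewness": stats.skew(x),
        "kurtosis": stats.kurtosis(x),
        "pos_max": int(x.argmax()),                 # position of max.
        "pos_min": int(x.argmin()),                 # position of min.
        "slope": slope,
        "lin_reg_err_q": np.mean((x - pred) ** 2),  # quadratic regression error
        "pctl_1": np.percentile(x, 1),              # percentile 1
        "pctl_99": np.percentile(x, 99),            # percentile 99
    }
```

Applying all 21 functionals to each of the 38 descriptors (plus deltas) is what produces a feature vector on the order of the 1582 dimensions quoted above.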
Classification and Evaluation
Classification
• Discriminative training: multi-class SVM (1 vs. 1), WEKA
• Linear kernel, complexity parameter set by cross-validation
• Standardized feature sets
Evaluation
• 10-fold cross-validation or LOSO (leave-one-speaker/sentence-out)
• Also testing on a held-out test set
• Evaluation criterion: accuracy or unweighted average recall (UAR)
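The workshop used WEKA; the same setup can be sketched in Python with scikit-learn (an assumed equivalent, not the workshop's actual code; the helper name `loso_uar` is hypothetical):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV, cross_val_predict
from sklearn.metrics import balanced_accuracy_score  # == unweighted average recall (UAR)

def loso_uar(X, y, speakers):
    """Leave-one-speaker-out evaluation of a linear multi-class SVM
    (SVC with a linear kernel decomposes multi-class as 1-vs-1),
    with the complexity parameter C set by inner cross-validation."""
    clf = make_pipeline(
        StandardScaler(),  # standardized feature sets
        GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=3),
    )
    pred = cross_val_predict(clf, X, y, cv=LeaveOneGroupOut(), groups=speakers)
    return balanced_accuracy_score(y, pred)  # UAR
```

UAR rather than plain accuracy matters here because the emotion classes are not equally frequent; UAR averages per-class recall, so chance level is simply 1/number-of-classes.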
Results
LDC Emotional Prosody Speech and Transcripts (LOSO)
UAR [%]          4 classes   6 classes   15 classes
whole data set     70.4        53.5        23.6
test set           68.3        43.3        23.5
chance level       25.0        16.6         6.7
Berlin Emotional Speech Database
UAR [%]          7 classes
whole data set     77.0
test set           80.2
chance level       14.3
Results (normalized)
Speaker / sentence normalization
• z-score(x) = (x − mean(X_s)) / std(X_s), computed per speaker (or sentence) s
LDC Emotional Prosody Speech and Transcripts (LOSO)
UAR [%]          4 classes     6 classes    15 classes
whole data set   75.0 (+4.6)   54.4 (+0.9)  27.2 (+3.6)
Berlin Emotional Speech Database (LOSO)
UAR [%]          7 classes
whole data set   84.2 (+7.2)
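The per-speaker z-score above can be sketched as follows (a minimal NumPy version; the function name is my own):

```python
import numpy as np

def speaker_zscore(X, speakers):
    """Per-speaker z-score normalization, as on the slide:
    z = (x - mean(X_s)) / std(X_s), where X_s collects all
    feature vectors belonging to speaker s."""
    X = np.asarray(X, dtype=float)
    speakers = np.asarray(speakers)
    Z = np.empty_like(X)
    for s in np.unique(speakers):
        idx = speakers == s
        mu = X[idx].mean(axis=0)
        sd = X[idx].std(axis=0)
        sd[sd == 0] = 1.0  # guard against constant features
        Z[idx] = (X[idx] - mu) / sd
    return Z
```

This removes per-speaker offsets (e.g. average pitch) so the classifier sees how a speaker deviates from their own baseline, which is why it helps most in the LOSO setting.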
Classification Analysis: LDC Emotion
Classification Analysis: emoDB
Can people recognize emotions?
Mechanical Turk
• Anonymous workers (Worker-ID)
• Simple tasks for small amounts of money
• Minimal time, effort, and cost required for significant amounts of crowd-sourced data
English LDC Emotion (4 Emotions)
• Short, 1-2 second wav files
• English speech: dates such as "November 3rd"
• 4 fundamental, distinct emotions
• 74 unique workers and 169 total HITs completed
Results (the slide also marks uni-directional confusions):
Emotion     % Correct
Anger         69%
Sadness       67%
Neutral       66%
Happiness     46%
Total         60%
English LDC Emotion (15 Emotions)
• Same parameters as the previous experiment
• Including less well-defined emotions: pride, shame, etc.
• 68 unique workers and 218 total HITs completed
Results (the slide also marks uni-directional confusions):
Emotion      % Correct
Neutral        29%
Hot-Anger      26%
Sadness        25%
Boredom        17%
Panic          14%
Interest       12%
Elation        10%
Contempt       10%
Happiness       9%
Pride           9%
Despair         8%
Cold-Anger      7%
Anxiety         5%
Disgust         5%
Shame           4%
Total          12%
German Berlin Emotion (7 Emotions)
• Short sentences with no emotional connotation: "The tablecloth is lying on the fridge."
• 37 unique workers and 245 total HITs completed
Results (most common confusion pair: 41.8%):
Emotion     % Correct
Neutral       68%
Anger         62%
Sadness       53%
Anxiety       45%
Happiness     35%
Boredom       27%
Disgust       11%
Total         41%
Subjective Evaluation Takeaways • Humans are significantly more accurate than chance for smaller numbers of emotions – This includes cross-lingual recognition • Certain emotions are consistently identified accurately – Sadness, Neutral, Hot-Anger
Emotional TTS
• Record lots of data: 1 hour plus in each domain (easy to get boredom and anger)
• Do voice conversion/parametric synthesis
• But in all, the results aren't encouraging: hard to make it sound very natural
Synthesis using AFs
Types of synthesis:
• tts: text-to-speech with no emotion/personality content; predicts durations, F0, and spectrum (through AFs)
• ttsE/P: text-to-speech with emotion/personality flag; predicts durations, F0, and spectrum (through AFs)
• cgp: no explicit emotion/personality flag, but natural durations; predicts F0 and spectrum (through AFs)
• cgpE/P: emotion/personality flag, and natural durations; predicts F0 and spectrum (through AFs)
• resynth: pure re-synthesis from natural durations, F0, and spectrum ("the best we can do")
LDC (English) Emotion Synthesis
Objective evaluation (UAR; classifier trained on human speech): chance = 25%
Without speaker normalization:
Test type:  tts   ttsE   cgp   cgpE   resynth   human
UAR:        32%   36%    38%   40%    56%       70%
With speaker normalization:
Test type:  tts   ttsE   cgp   cgpE   resynth   human
UAR:        33%   54%    57%   61%    61%       75%
LDC (English) Emotion Synthesis
Mechanical Turk human evaluation: chance = 25%
• Average workers: 59
• Average HITs completed: 308
• Files per HIT: 12
Speech type:      tts   ttsE   cgp   cgpE   resynth   natural
Percent correct:  28%   28%    35%   37%    41%       60%
We see the same trend and ordering in the human evaluation as in the objective classification.
German Synthesis
Berlin Emotion objective evaluation (UAR; classifier trained on human speech): chance = 14%
Test type:  tts   ttsE   cgp   cgpE   resynth   human
UAR:        14%   29%    65%   72%    82%       84%
ADFS Personality objective evaluation (UAR; classifier trained on human speech): chance = 10%
Test type:  tts   ttsP   cgp   cgpP   resynth   human
UAR:        10%   60%    60%   78%    89%       92%
What we actually need
Expressive styles:
• Frustrated (angry, annoyed, etc.)
• Interested/uninterested
• Pleased/unhappy
• Cooperative/non-cooperative
How can we use it?
• Detect frustrated customers: be frustrated back at them (or not); what techniques can deflate frustration?
• Detect (non-)confidence: better aid in tutorial systems
• S2S translation: copy emotion across languages