

  1. Speaker and Emotion Recognition of TV-Series Data Using Multimodal and Multitask Deep Learning
Sashi Novitasari 1, Quoc Truong Do 1, Sakriani Sakti 1,3, Dessi Lestari 2, Satoshi Nakamura 1,3
1 Graduate School of Information Science, Nara Institute of Science and Technology
2 Department of Informatics, Bandung Institute of Technology
3 RIKEN AIP
1 {sashi.novitasari.si3, do.truong.di3, ssakti, s-nakamura}@is.naist.jp
2 {dessipuji}@informatika.org

  2. Outline
1. Introduction
2. Data
3. Model Architectures
4. Features
5. Experiment
6. Conclusion

  3. I. Introduction
● Real-life communication involves linguistic and paralinguistic aspects
● Multimodal and multitask recognition of non-verbal aspects of speech
● Recognition of the speaker and emotion of speech from emotion-rich data
● Previous works: multimodal or multitask emotion-speaker recognition, but not integrated (Tang et al., 2016; Tian et al., 2016; Vallet et al., 2013)

  4. II. Data
● TV-series data → expressive conversation
○ Video graphic: facial features
○ Audio: acoustic features
○ Subtitle: lexical features
● English
● Utterance-level annotation
○ Speaker: 57 names
○ Emotion - valence: 3 classes (negative - neutral - positive)
○ Emotion - arousal: 3 classes (negative - neutral - positive)

  5. III. Model Architectures
● Multilayer perceptron models (5 layers)
● Multimodal classification
● Multitask classification
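The 5-layer perceptron trunk can be sketched as repeated affine transform + nonlinearity. This is a hypothetical illustration, not the authors' code: toy scalar weights stand in for real weight matrices, and ReLU stands in for whatever activation the paper used.

```python
# Hypothetical sketch: a multilayer perceptron as repeated
# affine transform + ReLU, with toy scalar weights instead of
# real weight matrices.

def relu(x):
    return x if x > 0 else 0.0

def mlp_forward(x, layers):
    for w, b in layers:          # one (weight, bias) pair per layer
        x = relu(w * x + b)
    return x

# five identity-like layers pass a positive input through unchanged
layers = [(1.0, 0.0)] * 5
print(mlp_forward(2.0, layers))  # 2.0
```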

  6. III. Model Architectures - Multimodal Classification
Two evaluated approaches:
a. Feature concatenation
b. Hierarchical feature fusion
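The two fusion strategies can be contrasted in a minimal sketch. This is illustrative only (not the authors' implementation): `encode` is a hypothetical stand-in for a per-modality subnetwork, and the feature vectors are toy lists rather than real acoustic/facial/lexical features.

```python
# Hypothetical sketch contrasting the two multimodal fusion
# strategies; 'encode' stands in for a per-modality subnetwork.

def concat_fusion(acoustic, facial, lexical):
    """(a) Feature concatenation: all modality features are joined
    into one vector that feeds a single classifier."""
    return acoustic + facial + lexical

def hierarchical_fusion(acoustic, facial, lexical, encode):
    """(b) Hierarchical fusion: each modality is first encoded by
    its own subnetwork, then the representations are merged."""
    return encode(acoustic) + encode(facial) + encode(lexical)

halve = lambda v: [x / 2 for x in v]  # toy per-modality encoder
a, f, l = [1.0, 2.0], [3.0], [4.0, 5.0]
print(concat_fusion(a, f, l))               # [1.0, 2.0, 3.0, 4.0, 5.0]
print(hierarchical_fusion(a, f, l, halve))  # [0.5, 1.0, 1.5, 2.0, 2.5]
```

The design difference: concatenation lets one classifier see raw features from all modalities at once, while hierarchical fusion first compresses each modality separately before combining.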

  7. III. Model Architectures - Multitask Classification
Perform classification on several tasks at once.
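A minimal sketch of the multitask idea, under the assumption (standard for multitask networks, not taken from the slides) that a shared trunk is computed once and each task applies its own output head. All weights here are illustrative, not trained.

```python
# Hypothetical sketch of multitask classification: one shared
# representation, one output head per task.

def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def multitask_predict(features, shared, heads):
    h = [dot(w, features) for w in shared]        # shared trunk, computed once
    return {task: argmax([dot(w, h) for w in W])  # task-specific head
            for task, W in heads.items()}

shared = [[1.0, 0.0], [0.0, 1.0]]                 # toy identity trunk
heads = {
    "valence": [[1.0, 0.0], [0.0, 1.0]],          # 2 toy classes per task
    "arousal": [[0.0, 1.0], [1.0, 0.0]],
}
print(multitask_predict([0.2, 0.9], shared, heads))
# {'valence': 1, 'arousal': 0}
```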

  8. IV. Features
1. Acoustic (main)
○ INTERSPEECH 2010 feature configuration
○ openSMILE toolkit (Eyben et al., 2010)
2. Lexical
○ Average of word vectors
○ Pre-trained Google Word2Vec (Mikolov et al., 2013)
3. Facial
○ Facial contours and angles
○ OpenFace toolkit (Baltrusaitis et al., 2016)
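The lexical feature (average of word vectors over a subtitle line) can be sketched as follows. Toy 2-d vectors stand in for the 300-d Word2Vec embeddings; skipping out-of-vocabulary words is an assumption on my part, not stated on the slide.

```python
# Hypothetical sketch of the lexical feature: average the pretrained
# word vectors of the words in a subtitle line.

def average_word_vectors(words, embeddings, dim=2):
    vecs = [embeddings[w] for w in words if w in embeddings]  # skip OOV
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

toy = {"i": [1.0, 0.0], "know": [0.0, 1.0]}  # toy 2-d "embeddings"
print(average_word_vectors(["i", "know", "nothing"], toy))  # [0.5, 0.5]
```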

  9. V. Experiment

  10. V. Experiment
● Train set: 2460 utterances
○ Speaker: 57 speakers (imbalanced)
○ Valence: negative 31%, neutral 60%, positive 9%
○ Arousal: negative 4%, neutral 75%, positive 21%
● Evaluated on 300 utterances
○ Speaker: 10 speakers, 30 samples each
○ Valence: negative 32%, neutral 57%, positive 11%
○ Arousal: negative 1%, neutral 78%, positive 21%
● Compared performance of unimodal, multimodal, single-task, and multitask models
● Evaluated by F1-score (%) on the evaluation set
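The evaluation metric can be computed as in this sketch of per-class F1 (a reasonable choice given the class imbalance; the slides do not specify per-class vs. averaged F1, so treat this as illustrative).

```python
# Hypothetical sketch of the evaluation metric: per-class F1-score
# from reference vs. predicted labels, no external libraries.

def f1_per_class(refs, hyps, label):
    tp = sum(1 for r, h in zip(refs, hyps) if r == h == label)
    fp = sum(1 for r, h in zip(refs, hyps) if h == label and r != label)
    fn = sum(1 for r, h in zip(refs, hyps) if r == label and h != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

refs = ["neu", "neg", "pos", "neu"]  # toy reference labels
hyps = ["neu", "neg", "neg", "pos"]  # toy predictions
print(f1_per_class(refs, hyps, "neu"))  # 0.666... (precision 1.0, recall 0.5)
```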

  11. V. Experiment Result

  12. V. Experiment Result: Speaker - F1-scores (%) on evaluation set
[Results table shown as a figure on the slide.]
Multimodal approaches: U - unimodal, C - feature concatenation, H - hierarchical feature fusion
Feature types: A - acoustic, F - facial, L - lexical

  13. V. Experiment Result: Emotion - F1-score (%) on evaluation set
[Results table shown as a figure on the slide.]
Multimodal approaches: U - unimodal, C - feature concatenation, H - hierarchical feature fusion
Feature types: A - acoustic, F - facial, L - lexical

  14. V. Experiment Result Summary
[Summary table shown as a figure on the slide.]
Multimodal approaches: U - unimodal, C - feature concatenation, H - hierarchical feature fusion
Feature types: A - acoustic, F - facial, L - lexical

  15. VI. Conclusion
● We constructed a multimodal and multitask speaker-emotion recognition model using deep learning and TV-series data
● The multitask model was able to outperform the single-task model, especially when recognizing emotion from acoustic features only
● The multimodal-multitask model did not yield a significant improvement (larger data might be needed)

  16. Thank You
