

  1. Speaker and Emotion Recognition of TV-Series Data Using Multimodal and Multitask Deep Learning
Sashi Novitasari 1, Quoc Truong Do 1, Sakriani Sakti 1,3, Dessi Lestari 2, Satoshi Nakamura 1,3
1 Graduate School of Information Science, Nara Institute of Science and Technology
2 Department of Informatics, Bandung Institute of Technology
3 RIKEN AIP
1 {sashi.novitasari.si3, do.truong.di3, ssakti, s-nakamura}@is.naist.jp
2 {dessipuji}@informatika.org

  2. Outline
1. Introduction
2. Data
3. Model Architectures
4. Features
5. Experiment
6. Conclusion

  3. I. Introduction
● Real-life communication involves linguistic and paralinguistic aspects
● Multimodal and multitask recognition of non-verbal aspects of speech
● Recognition of the speaker and emotion of speech from emotion-rich data
● Previous works: multimodal or multitask emotion-speaker recognition, but not integrated (Tang et al., 2016; Tian et al., 2016; Vallet et al., 2013)

  4. II. Data
● TV-series data → expressive conversation
○ Video graphic: facial features
○ Audio: acoustic features
○ Subtitle: lexical features
● English
● Utterance-level annotation
○ Speaker: 57 names
○ Emotion - valence: 3 classes (negative - neutral - positive)
○ Emotion - arousal: 3 classes (negative - neutral - positive)

  5. III. Model Architectures
● Multilayer perceptron models (5 layers)
● Multimodal classification
● Multitask classification
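The 5-layer perceptron trunk can be sketched as repeated affine transform + nonlinearity. This is a hypothetical illustration, not the authors' code: toy scalar weights stand in for real weight matrices, and ReLU stands in for whatever activation the paper used.

```python
# Hypothetical sketch: a multilayer perceptron as repeated
# affine transform + ReLU, with toy scalar weights instead of
# real weight matrices.

def relu(x):
    return x if x > 0 else 0.0

def mlp_forward(x, layers):
    for w, b in layers:          # one (weight, bias) pair per layer
        x = relu(w * x + b)
    return x

# five identity-like layers pass a positive input through unchanged
layers = [(1.0, 0.0)] * 5
print(mlp_forward(2.0, layers))  # 2.0
```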

  6. III. Model Architectures - Multimodal Classification
Two evaluated approaches:
a. Feature concatenation
b. Hierarchical feature fusion
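The two fusion strategies can be contrasted in a minimal sketch. This is illustrative only (not the authors' implementation): `encode` is a hypothetical stand-in for a per-modality subnetwork, and the feature vectors are toy lists rather than real acoustic/facial/lexical features.

```python
# Hypothetical sketch contrasting the two multimodal fusion
# strategies; 'encode' stands in for a per-modality subnetwork.

def concat_fusion(acoustic, facial, lexical):
    """(a) Feature concatenation: all modality features are joined
    into one vector that feeds a single classifier."""
    return acoustic + facial + lexical

def hierarchical_fusion(acoustic, facial, lexical, encode):
    """(b) Hierarchical fusion: each modality is first encoded by
    its own subnetwork, then the representations are merged."""
    return encode(acoustic) + encode(facial) + encode(lexical)

halve = lambda v: [x / 2 for x in v]  # toy per-modality encoder
a, f, l = [1.0, 2.0], [3.0], [4.0, 5.0]
print(concat_fusion(a, f, l))               # [1.0, 2.0, 3.0, 4.0, 5.0]
print(hierarchical_fusion(a, f, l, halve))  # [0.5, 1.0, 1.5, 2.0, 2.5]
```

The design difference: concatenation lets one classifier see raw features from all modalities at once, while hierarchical fusion first compresses each modality separately before combining.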

  7. III. Model Architectures - Multitask Classification
Perform classification on several tasks at once.
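A minimal sketch of the multitask idea, under the assumption (standard for multitask networks, not taken from the slides) that a shared trunk is computed once and each task applies its own output head. All weights here are illustrative, not trained.

```python
# Hypothetical sketch of multitask classification: one shared
# representation, one output head per task.

def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def multitask_predict(features, shared, heads):
    h = [dot(w, features) for w in shared]        # shared trunk, computed once
    return {task: argmax([dot(w, h) for w in W])  # task-specific head
            for task, W in heads.items()}

shared = [[1.0, 0.0], [0.0, 1.0]]                 # toy identity trunk
heads = {
    "valence": [[1.0, 0.0], [0.0, 1.0]],          # 2 toy classes per task
    "arousal": [[0.0, 1.0], [1.0, 0.0]],
}
print(multitask_predict([0.2, 0.9], shared, heads))
# {'valence': 1, 'arousal': 0}
```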

  8. IV. Features
1. Acoustic (main)
○ INTERSPEECH 2010 feature configuration
○ openSMILE toolkit (Eyben et al., 2010)
2. Lexical
○ Average of word vectors
○ Pre-trained Google Word2Vec (Mikolov et al., 2013)
3. Facial
○ Facial contours and angles
○ OpenFace toolkit (Baltrusaitis et al., 2016)
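The lexical feature (average of word vectors over a subtitle line) can be sketched as follows. Toy 2-d vectors stand in for the 300-d Word2Vec embeddings; skipping out-of-vocabulary words is an assumption on my part, not stated on the slide.

```python
# Hypothetical sketch of the lexical feature: average the pretrained
# word vectors of the words in a subtitle line.

def average_word_vectors(words, embeddings, dim=2):
    vecs = [embeddings[w] for w in words if w in embeddings]  # skip OOV
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

toy = {"i": [1.0, 0.0], "know": [0.0, 1.0]}  # toy 2-d "embeddings"
print(average_word_vectors(["i", "know", "nothing"], toy))  # [0.5, 0.5]
```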

  9. V. Experiment

  10. V. Experiment
● Train set: 2460 utterances
○ Speaker: 57 speakers (imbalanced)
○ Valence: negative 31%, neutral 60%, positive 9%
○ Arousal: negative 4%, neutral 75%, positive 21%
● Evaluated on 300 utterances
○ Speaker: 10 speakers, 30 samples each
○ Valence: negative 32%, neutral 57%, positive 11%
○ Arousal: negative 1%, neutral 78%, positive 21%
● Compared performance of unimodal, multimodal, single-task, and multitask models
● Evaluated by F1-score (%) on the evaluation set
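The evaluation metric can be computed as in this sketch of per-class F1 (a reasonable choice given the class imbalance; the slides do not specify per-class vs. averaged F1, so treat this as illustrative).

```python
# Hypothetical sketch of the evaluation metric: per-class F1-score
# from reference vs. predicted labels, no external libraries.

def f1_per_class(refs, hyps, label):
    tp = sum(1 for r, h in zip(refs, hyps) if r == h == label)
    fp = sum(1 for r, h in zip(refs, hyps) if h == label and r != label)
    fn = sum(1 for r, h in zip(refs, hyps) if r == label and h != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

refs = ["neu", "neg", "pos", "neu"]  # toy reference labels
hyps = ["neu", "neg", "neg", "pos"]  # toy predictions
print(f1_per_class(refs, hyps, "neu"))  # 0.666... (precision 1.0, recall 0.5)
```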

  11. V. Experiment Result

  12. V. Experiment Result: Speaker - F1-scores (%) on evaluation set
[Results table shown as a figure on the slide.]
Multimodal approaches: U - unimodal, C - feature concatenation, H - hierarchical feature fusion
Feature types: A - acoustic, F - facial, L - lexical

  13. V. Experiment Result: Emotion - F1-score (%) on evaluation set
[Results table shown as a figure on the slide.]
Multimodal approaches: U - unimodal, C - feature concatenation, H - hierarchical feature fusion
Feature types: A - acoustic, F - facial, L - lexical

  14. V. Experiment Result Summary
[Summary table shown as a figure on the slide.]
Multimodal approaches: U - unimodal, C - feature concatenation, H - hierarchical feature fusion
Feature types: A - acoustic, F - facial, L - lexical

  15. VI. Conclusion
● We constructed a multimodal and multitask speaker-emotion recognition model using deep learning and TV-series data
● The multitask model was able to outperform the single-task model, especially when recognizing emotion from acoustic features only
● The multimodal-multitask model did not yield a significant improvement (larger data might be needed)

  16. Thank You
