Fusical: Multimodal Fusion for Video Sentiment
Boyang Tom Jin, Leila Abdelrahman, Cong Kevin Chen, Amil Khanzada


  1. Fusical: Multimodal Fusion for Video Sentiment
     Boyang Tom Jin (Stanford University, tomjin@stanford.edu)
     Leila Abdelrahman (University of Miami, lxa215@miami.edu)
     Cong Kevin Chen (University of California, Berkeley, kevincong95@berkeley.edu)
     Amil Khanzada (University of California, Berkeley, amil@berkeley.edu)
     ACM International Conference on Multimodal Interaction, October 24–29, 2020

  2. Introductions Tom Jin | Leila Abdelrahman | Cong Kevin Chen | Amil Khanzada

  3. Introduction: Understanding Video Sentiment

  4. Dataset: EmotiW 2020 Video Group Emotion [1,2]

  5. Our Approach

  6. Multimodal Hidden Layer Ensembling
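
The fusion step combines the modalities by concatenating their hidden-layer features and feeding them to a small fully connected classifier. Below is a minimal sketch, assuming each per-modality model has already been trained and exposes a fixed-length feature vector; the feature dimensions, layer sizes, and modality order are illustrative placeholders, not the values used in the work.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Early fusion over pre-extracted hidden-layer features.

    Each modality contributes a fixed-length feature vector; the vectors are
    concatenated and passed through a fully connected classifier predicting
    the three sentiment classes (negative / neutral / positive).
    """

    def __init__(self, feature_dims=(2048, 128, 256, 512, 300), num_classes=3):
        super().__init__()
        in_dim = sum(feature_dims)  # e.g. scene, pose, audio, face, caption (assumed)
        self.classifier = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, features):
        # features: list of per-modality tensors, each of shape (batch, dim_i)
        fused = torch.cat(features, dim=1)
        return self.classifier(fused)

# Toy usage with random stand-in features for a batch of 4 videos
if __name__ == "__main__":
    batch = [torch.randn(4, d) for d in (2048, 128, 256, 512, 300)]
    print(FusionHead()(batch).shape)  # torch.Size([4, 3])
```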

  7. Scene

  8. Scene Modality

  9. Scene: Activated on People ● Strong activations on foreground individuals ● Activations followed individuals frame-to-frame
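
The slides do not say which saliency method produced these activation maps; the following is a Grad-CAM-style sketch over a ResNet-50 backbone, offered only as one way to reproduce this kind of per-frame visualization. The input frame here is a random stand-in.

```python
import torch
import torchvision.models as models

# Grad-CAM-style heat map on the last convolutional block of ResNet-50.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(map=o.detach()))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(map=go[0].detach()))

x = torch.randn(1, 3, 224, 224)          # stand-in video frame
logits = model(x)
logits[0, logits.argmax()].backward()     # gradient of the top class

channel_weights = grads["map"].mean(dim=(2, 3), keepdim=True)   # channel importance
cam = torch.relu((channel_weights * feats["map"]).sum(dim=1))   # (1, 7, 7) heat map
cam = cam / (cam.max() + 1e-8)                                  # normalize to [0, 1]
print(cam.shape)
```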

  10. Scene: ResNet-50 Outperforms Inception-v3 ● ResNet-50 activates on foreground people ● Inception-v3 gets distracted by background lighting
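
A minimal sketch of the scene branch as described: an ImageNet-pretrained ResNet-50 with its final layer replaced by a 3-way sentiment head. Training details (frame sampling, augmentation, optimizer) are not shown on the slides and are omitted here.

```python
import torch.nn as nn
import torchvision.models as models

def build_scene_model(num_classes=3):
    """ResNet-50 backbone fine-tuned for 3-class video sentiment."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```

The 2048-dimensional pooled features just before `fc` are the kind of hidden-layer representation that the fusion network consumes.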

  11. Image Captioning

  12. Captions Tell Descriptive Nouns
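
The captioning branch turns sampled frames into short text descriptions, which surface descriptive nouns about the scene. One simple way to classify sentiment from such captions is sketched below; the captions are assumed to come from an upstream pretrained captioner (not shown), and the TF-IDF plus logistic-regression pipeline is an illustrative stand-in, not necessarily the text model used in the work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy captions and labels (0 = negative, 1 = neutral, 2 = positive), for illustration only.
captions = [
    "a group of people smiling and clapping at a party",
    "a man standing alone in a dark room",
    "people sitting around a table with laptops",
]
labels = [2, 0, 1]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),       # word and bigram features from captions
    LogisticRegression(max_iter=1000),         # 3-way sentiment classifier
)
clf.fit(captions, labels)
print(clf.predict(["children laughing and playing outside"]))
```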

  13. Pose

  14. Pose Model

  15. Pose: Importance of Upper-Body Joints ● Hands ● Elbows
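
A minimal sketch of a pose branch, assuming 2-D keypoints (x, y, confidence) have already been extracted per person by an off-the-shelf pose estimator and pooled over people and frames upstream; the joint count and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PoseClassifier(nn.Module):
    """Classifies sentiment from flattened body keypoints.

    Input: (batch, num_joints * 3) with (x, y, confidence) per joint.
    According to the slides, upper-body joints (hands, elbows) carry
    most of the signal.
    """

    def __init__(self, num_joints=25, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 3, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, keypoints):
        return self.net(keypoints)

# Toy usage
print(PoseClassifier()(torch.randn(4, 25 * 3)).shape)  # torch.Size([4, 3])
```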

  16. Audio and Laughter

  17. Audio Model System Diagram ● CNN-LSTM ● Time-dependent frames
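
The audio branch is described as a CNN-LSTM over time-dependent frames. A minimal sketch follows, assuming log-mel spectrogram chunks as input; the filter counts, hidden size, and chunk shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioCNNLSTM(nn.Module):
    """CNN encoder applied to each spectrogram chunk, LSTM across time."""

    def __init__(self, hidden=128, num_classes=3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, time_steps, n_mels, frame_len) -- chunks of a spectrogram
        b, t = x.shape[:2]
        feats = self.cnn(x.reshape(b * t, 1, *x.shape[2:])).flatten(1)  # (b*t, 32)
        out, _ = self.lstm(feats.reshape(b, t, -1))
        return self.fc(out[:, -1])  # classify from the final time step

# Toy usage: 4 clips, 10 time chunks, 64 mel bins x 32 frames per chunk
print(AudioCNNLSTM()(torch.randn(4, 10, 64, 32)).shape)  # torch.Size([4, 3])
```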

  18. Laughter Model
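
The slides present laughter as its own branch but give no detail here. One simple way such a signal could feed the ensemble is sketched below, assuming a hypothetical upstream laughter detector that yields a per-clip laughter probability; this is an assumption for illustration, not the method described in the work.

```python
import torch

def add_laughter_feature(audio_features, laughter_probs):
    """Append a per-clip laughter probability to the audio feature vector.

    audio_features: (batch, d) hidden features from the audio model.
    laughter_probs: (batch,) probabilities from an upstream laughter
                    detector (hypothetical -- the slides do not name it).
    """
    return torch.cat([audio_features, laughter_probs.unsqueeze(1)], dim=1)
```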

  19. Facial

  20. Facial Pipeline
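
A minimal sketch of a facial pipeline of the kind the slides outline: detect faces in sampled frames, classify each face's emotion, and average over faces and frames. Both `detect_faces` and `face_emotion_model` are hypothetical placeholders; the slides do not name the detector or classifier used.

```python
import torch

def video_face_sentiment(frames, detect_faces, face_emotion_model, num_classes=3):
    """Average per-face emotion logits over all faces in all sampled frames.

    frames: iterable of image tensors (C, H, W).
    detect_faces(frame) -> list of cropped face tensors (hypothetical helper).
    face_emotion_model(face_batch) -> (num_faces, num_classes) logits.
    """
    all_logits = []
    for frame in frames:
        faces = detect_faces(frame)
        if faces:
            all_logits.append(face_emotion_model(torch.stack(faces)))
    if not all_logits:
        return torch.zeros(num_classes)  # no faces found in the clip
    return torch.cat(all_logits).mean(dim=0)
```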

  21. Results & Error Analysis

  22. Independent Modality Results
      Modality            Accuracy   F1-Score
      Scene               0.546      0.541
      Pose                0.486      0.489
      Audio               0.577      0.577
      Face                0.400      0.348
      Image Captioning    0.505      0.506

  23. Fully Connected Early Fusion Results
      Dataset            Accuracy
      Baseline [Test]    0.479
      Validation         0.640
      Test               0.639

  24. Ablation Study on Modalities ● ResNet-50 scene modality had high positive-class saliency ● Every modality except scene struggled to predict positive videos accurately ● Ensemble: 64.0%
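
One simple way to probe each modality's contribution to the ensemble is sketched below, reusing the fusion head from the earlier sketch and zeroing out a single modality's feature block at evaluation time. This zero-out probe is an assumption for illustration; the ablation reported above may instead have retrained the fusion network without the dropped modality.

```python
import torch

@torch.no_grad()
def ablation_accuracy(fusion_head, val_features, val_labels, drop_index=None):
    """Validation accuracy with one modality's features zeroed out.

    val_features: list of per-modality tensors, each of shape (N, dim_i).
    drop_index:   index of the modality to ablate (None = full ensemble).
    """
    feats = [torch.zeros_like(f) if i == drop_index else f
             for i, f in enumerate(val_features)]
    preds = fusion_head(feats).argmax(dim=1)
    return (preds == val_labels).float().mean().item()
```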

  25. Modality Activation Contributions

  26. Summary & Future

  27. Summary ● Required research into various modalities and ensembling methods ● We provide two novel modalities: image captioning and laughter ● Early fusion methods improved classification performance ● Beat the baseline test accuracy of 47.9% by about 16 percentage points, reaching 63.9%

  28. Future ● Transitory facial expression datasets ● 3D pose points, YOLO and object sentiment analysis ● Real world / research: YouTube likes, self-driving cars, telehealth ● Affective image captioning

  29. Acknowledgements ● Dr. Fei-Fei Li, Dr. Ranjay Krishna, Christina Yuan, and the rest of the Stanford CS231n teaching staff guided us through the project. ● Dr. Pawan Nandakishore reviewed our approaches and provided guidance. ● Vincent La helped us explore using YOLOv3 to perform object detection and text-based sentiment analysis. ● The EmotiW competition organizers provided an interesting challenge and a large dataset.

  30. References
      [1] Garima Sharma, Shreya Ghosh, and Abhinav Dhall. Automatic group level affect and cohesion prediction in videos. In International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW 2019), pages 161–167. IEEE, 2019.
      [2] Roland Goecke, Abhinav Dhall, Garima Sharma, and Tom Gedeon. EmotiW 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges. In ACM International Conference on Multimodal Interaction (ICMI 2020), 2020.

  31. Thank You! We’re on GitHub! https://github.com/kevincong95/cs231n-emotiw
