Fusical: Multimodal Fusion for Video Sentiment
Boyang Tom Jin, Leila Abdelrahman, Cong Kevin Chen, Amil Khanzada


  1. Fusical: Multimodal Fusion for Video Sentiment
     Boyang Tom Jin (Stanford University, tomjin@stanford.edu)
     Leila Abdelrahman (University of Miami, lxa215@miami.edu)
     Cong Kevin Chen (University of California, Berkeley, kevincong95@berkeley.edu)
     Amil Khanzada (University of California, Berkeley, amil@berkeley.edu)
     ACM International Conference on Multimodal Interaction, October 24–29, 2020

  2. Introductions Tom Jin | Leila Abdelrahman | Cong Kevin Chen | Amil Khanzada

  3. Introduction: Understanding Video Sentiment

  4. Dataset: EmotiW 2020 Video Group Emotion [1,2]

  5. Our Approach

  6. Multimodal Hidden Layer Ensembling
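
The fusion step combines the modalities by concatenating their hidden-layer features and feeding them to a small fully connected classifier. Below is a minimal sketch, assuming each per-modality model has already been trained and exposes a fixed-length feature vector; the feature dimensions, layer sizes, and modality order are illustrative placeholders, not the values used in the work.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Early fusion over pre-extracted hidden-layer features.

    Each modality contributes a fixed-length feature vector; the vectors are
    concatenated and passed through a fully connected classifier predicting
    the three sentiment classes (negative / neutral / positive).
    """

    def __init__(self, feature_dims=(2048, 128, 256, 512, 300), num_classes=3):
        super().__init__()
        in_dim = sum(feature_dims)  # e.g. scene, pose, audio, face, caption (assumed)
        self.classifier = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, features):
        # features: list of per-modality tensors, each of shape (batch, dim_i)
        fused = torch.cat(features, dim=1)
        return self.classifier(fused)

# Toy usage with random stand-in features for a batch of 4 videos
if __name__ == "__main__":
    batch = [torch.randn(4, d) for d in (2048, 128, 256, 512, 300)]
    print(FusionHead()(batch).shape)  # torch.Size([4, 3])
```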

  7. Scene

  8. Scene Modality

  9. Scene: Activated on People ● Strong activations on foreground individuals ● Activations followed individuals frame-to-frame
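
The slides do not say which saliency method produced these activation maps; the following is a Grad-CAM-style sketch over a ResNet-50 backbone, offered only as one way to reproduce this kind of per-frame visualization. The input frame here is a random stand-in.

```python
import torch
import torchvision.models as models

# Grad-CAM-style heat map on the last convolutional block of ResNet-50.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(map=o.detach()))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(map=go[0].detach()))

x = torch.randn(1, 3, 224, 224)          # stand-in video frame
logits = model(x)
logits[0, logits.argmax()].backward()     # gradient of the top class

channel_weights = grads["map"].mean(dim=(2, 3), keepdim=True)   # channel importance
cam = torch.relu((channel_weights * feats["map"]).sum(dim=1))   # (1, 7, 7) heat map
cam = cam / (cam.max() + 1e-8)                                  # normalize to [0, 1]
print(cam.shape)
```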

  10. Scene: ResNet-50 Outperforms Inception-v3 ● ResNet-50 activates on foreground people ● Inception-v3 gets distracted by background lighting
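
A minimal sketch of the scene branch as described: an ImageNet-pretrained ResNet-50 with its final layer replaced by a 3-way sentiment head. Training details (frame sampling, augmentation, optimizer) are not shown on the slides and are omitted here.

```python
import torch.nn as nn
import torchvision.models as models

def build_scene_model(num_classes=3):
    """ResNet-50 backbone fine-tuned for 3-class video sentiment."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```

The 2048-dimensional pooled features just before `fc` are the kind of hidden-layer representation that the fusion network consumes.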

  11. Image Captioning

  12. Captions Tell Descriptive Nouns
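
The captioning branch turns sampled frames into short text descriptions, which surface descriptive nouns about the scene. One simple way to classify sentiment from such captions is sketched below; the captions are assumed to come from an upstream pretrained captioner (not shown), and the TF-IDF plus logistic-regression pipeline is an illustrative stand-in, not necessarily the text model used in the work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy captions and labels (0 = negative, 1 = neutral, 2 = positive), for illustration only.
captions = [
    "a group of people smiling and clapping at a party",
    "a man standing alone in a dark room",
    "people sitting around a table with laptops",
]
labels = [2, 0, 1]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),       # word and bigram features from captions
    LogisticRegression(max_iter=1000),         # 3-way sentiment classifier
)
clf.fit(captions, labels)
print(clf.predict(["children laughing and playing outside"]))
```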

  13. Pose

  14. Pose Model

  15. Pose: Importance of Upper-Body Joints ● Hands ● Elbows
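
A minimal sketch of a pose branch, assuming 2-D keypoints (x, y, confidence) have already been extracted per person by an off-the-shelf pose estimator and pooled over people and frames upstream; the joint count and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PoseClassifier(nn.Module):
    """Classifies sentiment from flattened body keypoints.

    Input: (batch, num_joints * 3) with (x, y, confidence) per joint.
    According to the slides, upper-body joints (hands, elbows) carry
    most of the signal.
    """

    def __init__(self, num_joints=25, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 3, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, keypoints):
        return self.net(keypoints)

# Toy usage
print(PoseClassifier()(torch.randn(4, 25 * 3)).shape)  # torch.Size([4, 3])
```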

  16. Audio and Laughter

  17. Audio Model System Diagram ● CNN-LSTM ● Time-dependent frames
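
The audio branch is described as a CNN-LSTM over time-dependent frames. A minimal sketch follows, assuming log-mel spectrogram chunks as input; the filter counts, hidden size, and chunk shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioCNNLSTM(nn.Module):
    """CNN encoder applied to each spectrogram chunk, LSTM across time."""

    def __init__(self, hidden=128, num_classes=3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, time_steps, n_mels, frame_len) -- chunks of a spectrogram
        b, t = x.shape[:2]
        feats = self.cnn(x.reshape(b * t, 1, *x.shape[2:])).flatten(1)  # (b*t, 32)
        out, _ = self.lstm(feats.reshape(b, t, -1))
        return self.fc(out[:, -1])  # classify from the final time step

# Toy usage: 4 clips, 10 time chunks, 64 mel bins x 32 frames per chunk
print(AudioCNNLSTM()(torch.randn(4, 10, 64, 32)).shape)  # torch.Size([4, 3])
```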

  18. Laughter Model
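
The slides present laughter as its own branch but give no detail here. One simple way such a signal could feed the ensemble is sketched below, assuming a hypothetical upstream laughter detector that yields a per-clip laughter probability; this is an assumption for illustration, not the method described in the work.

```python
import torch

def add_laughter_feature(audio_features, laughter_probs):
    """Append a per-clip laughter probability to the audio feature vector.

    audio_features: (batch, d) hidden features from the audio model.
    laughter_probs: (batch,) probabilities from an upstream laughter
                    detector (hypothetical -- the slides do not name it).
    """
    return torch.cat([audio_features, laughter_probs.unsqueeze(1)], dim=1)
```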

  19. Facial

  20. Facial Pipeline
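
A minimal sketch of a facial pipeline of the kind the slides outline: detect faces in sampled frames, classify each face's emotion, and average over faces and frames. Both `detect_faces` and `face_emotion_model` are hypothetical placeholders; the slides do not name the detector or classifier used.

```python
import torch

def video_face_sentiment(frames, detect_faces, face_emotion_model, num_classes=3):
    """Average per-face emotion logits over all faces in all sampled frames.

    frames: iterable of image tensors (C, H, W).
    detect_faces(frame) -> list of cropped face tensors (hypothetical helper).
    face_emotion_model(face_batch) -> (num_faces, num_classes) logits.
    """
    all_logits = []
    for frame in frames:
        faces = detect_faces(frame)
        if faces:
            all_logits.append(face_emotion_model(torch.stack(faces)))
    if not all_logits:
        return torch.zeros(num_classes)  # no faces found in the clip
    return torch.cat(all_logits).mean(dim=0)
```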

  21. Results & Error Analysis

  22. Independent Modality Results
      Modality            Accuracy   F1-Score
      Scene               0.546      0.541
      Pose                0.486      0.489
      Audio               0.577      0.577
      Face                0.400      0.348
      Image Captioning    0.505      0.506

  23. Fully Connected Early Fusion Results
      Dataset            Accuracy
      Baseline [Test]    0.479
      Validation         0.640
      Test               0.639

  24. Ablation Study on Modalities ● ResNet-50 scene modality had high positive-class saliency ● Every modality except scene struggled to predict positive videos accurately ● Ensemble: 64.0%
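
One simple way to probe each modality's contribution to the ensemble is sketched below, reusing the fusion head from the earlier sketch and zeroing out a single modality's feature block at evaluation time. This zero-out probe is an assumption for illustration; the ablation reported above may instead have retrained the fusion network without the dropped modality.

```python
import torch

@torch.no_grad()
def ablation_accuracy(fusion_head, val_features, val_labels, drop_index=None):
    """Validation accuracy with one modality's features zeroed out.

    val_features: list of per-modality tensors, each of shape (N, dim_i).
    drop_index:   index of the modality to ablate (None = full ensemble).
    """
    feats = [torch.zeros_like(f) if i == drop_index else f
             for i, f in enumerate(val_features)]
    preds = fusion_head(feats).argmax(dim=1)
    return (preds == val_labels).float().mean().item()
```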

  25. Modality Activation Contributions

  26. Summary & Future

  27. Summary ● Required research into various modalities and ensembling methods ● We provide two novel modalities: image captioning and laughter ● Early fusion methods improved classification performance ● Beat the baseline test accuracy of 47.9% by about 16 percentage points, reaching 63.9%

  28. Future ● Transitory facial expression datasets ● 3D pose points, YOLO and object sentiment analysis ● Real world / research: YouTube likes, self-driving cars, telehealth ● Affective image captioning

  29. Acknowledgements ● Dr. Fei-Fei Li, Dr. Ranjay Krishna, Christina Yuan, and the rest of the Stanford CS231n teaching staff guided us through the project. ● Dr. Pawan Nandakishore reviewed our approaches and provided guidance. ● Vincent La helped us explore using YOLOv3 to perform object detection and text-based sentiment analysis. ● The EmotiW competition organizers provided an interesting challenge and a large dataset.

  30. References
      [1] Garima Sharma, Shreya Ghosh, and Abhinav Dhall. Automatic group level affect and cohesion prediction in videos. In International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW 2019), pages 161–167. IEEE, 2019.
      [2] Roland Goecke, Abhinav Dhall, Garima Sharma, and Tom Gedeon. EmotiW 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges. In ACM International Conference on Multimodal Interaction (ICMI 2020), 2020.

  31. Thank You! We’re on GitHub! https://github.com/kevincong95/cs231n-emotiw
