Multimodal Sentiment Analysis with Word-Level Fusion and - PowerPoint PPT Presentation

Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning Minghai Chen*, Sen Wang*, Paul Pu Liang*, Tadas Baltrusaitis, Amir Zadeh, Louis-Philippe Morency

Natural Computer Interaction
§ Intelligent Robots and Virtual Agents
§ Personal Assistant
§ Parasocial Interactions (e.g., multimedia content)

Multimodal Communicative Behaviors
Verbal
§ Lexicon
§ Words
§ Syntax
§ Part-of-speech
§ Dependencies
§ Pragmatics
§ Discourse acts
Vocal
§ Prosody
§ Intonation
§ Voice quality
§ Vocal expressions
§ Laughter, moans
Visual
§ Gestures
§ Head gestures
§ Eye gestures
§ Arm gestures
§ Body language
§ Body posture
§ Proxemics
§ Eye contact
§ Head gaze
§ Eye gaze
§ Facial expressions
§ FACS action units
§ Smile, frowning
Sentiment
§ Positive
§ Negative
Emotion
§ Anger
§ Disgust
§ Fear
§ Happiness
§ Sadness
§ Surprise
Social
§ Empathy
§ Engagement
§ Dominance

Multimodal Sentiment Analysis
Sentiment
§ Highly positive
§ Positive
§ Weakly positive
§ Neutral
§ Weakly negative
§ Negative
§ Highly negative

CMU-MOSI Dataset
§ 93 videos of movie reviews
§ 89 distinct speakers
§ 48 male and 41 female speakers
§ 2199 opinion segments
§ Average length: 4.2 sec
§ Average word count: 12
§ 5 different annotators for each opinion segment
§ Krippendorff’s alpha: 0.77


Three Main Challenges Addressed in This Work
1 What granularity should we use?
Ø The conventional approach summarizes features over the whole video
Ø But some multimodal interactions happen at the word level:
q The word “crazy” with a smile: Positive
q The word “crazy” with a frown: Negative

Three Main Challenges Addressed in This Work
2 What if a modality is noisy (e.g., occlusion)?

Three Main Challenges Addressed in This Work
3 What part of the video is relevant for the prediction task?

Main Contributions
1 What granularity should we use? Word-level feature representation
2 What if a modality is noisy (e.g., occlusion)? Modality-specific “on/off gate”
3 What part of the video is relevant for the prediction task? Temporal attention

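Challenge 1: LSTM with Word-Level Fusion
[Figure: a chain of four LSTM cells. At step t the input is the word (“I”, “like”, “the”, “movie”) together with its aligned visual features v_t and acoustic features a_t.]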
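A minimal sketch of the word-level inputs, assuming NumPy arrays and hypothetical helper names (`average_over_word_spans`, `fuse_word_level`). The slides only show that each LSTM step receives a word with its own v_t and a_t; averaging frame-level features inside each word's time interval and fusing by concatenation are assumptions made for illustration.

```python
import numpy as np

def average_over_word_spans(frames, frame_times, word_spans):
    """Average frame-level features inside each word's [start, end) interval.

    frames: (n_frames, d) array, frame_times: (n_frames,) timestamps in seconds,
    word_spans: list of (start, end) tuples, one per word (assumed to be known).
    Words with no frames inside their interval get a zero vector.
    """
    out = np.zeros((len(word_spans), frames.shape[1]))
    for i, (start, end) in enumerate(word_spans):
        mask = (frame_times >= start) & (frame_times < end)
        if mask.any():
            out[i] = frames[mask].mean(axis=0)
    return out

def fuse_word_level(text, visual, audio):
    """Build one input vector per word: x_t = [w_t ; v_t ; a_t] (concatenation)."""
    if not (len(text) == len(visual) == len(audio)):
        raise ValueError("all modalities must have one row per word")
    return np.concatenate([text, visual, audio], axis=1)
```

The fused array has one row per word, so it can be fed to any sequence model step by step.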

Challenge 2: Gated Multimodal Embedding (GME)
[Figure: a GME block sits in front of the LSTM cells in the chain.]
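The “on/off gate” idea can be illustrated with a short sketch, assuming NumPy arrays with one row per word and binary gate decisions that are already given. The function name and the choice to zero out a rejected modality before concatenation are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def gated_multimodal_embedding(text, visual, audio, visual_gate, audio_gate):
    """Apply per-word on/off gates to the visual and acoustic features.

    visual_gate / audio_gate: sequences of 0 or 1, one entry per word
    (1 = pass the modality through, 0 = reject it for that word).
    The text features are left ungated in this sketch.
    """
    vg = np.asarray(visual_gate, dtype=float)[:, None]
    ag = np.asarray(audio_gate, dtype=float)[:, None]
    return np.concatenate([text, visual * vg, audio * ag], axis=1)
```

With a visual gate of `[0, 1, 0]`, the visual features of the first and third words are zeroed, so a noisy frame (for example, an occluded face) cannot influence the LSTM input for those words.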

Challenge 3: LSTM with Temporal Attention
[Figure: a chain of LSTM cells whose outputs feed attention units, followed by an FC-ReLU layer.]
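A minimal sketch of temporal attention over per-word LSTM outputs, assuming NumPy and a single learned scoring vector `w_att` (a hypothetical parameter). The slides name only “Attention Units” and “FC-ReLU”; the dot-product scoring and softmax weighting below are a common formulation chosen as an assumption.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def temporal_attention(hidden, w_att):
    """Weight each time step and return the attended summary.

    hidden: (T, d) array of LSTM outputs, one row per word.
    w_att:  (d,) scoring vector (hypothetical learned parameter).
    Returns (context, alpha): the (d,) weighted sum and the (T,) weights.
    """
    alpha = softmax(hidden @ w_att)
    context = alpha @ hidden
    return context, alpha

def fc_relu(x, W, b):
    """Fully connected layer followed by ReLU."""
    return np.maximum(0.0, W @ x + b)
```

The weights `alpha` sum to 1 across words, so they can be read as how much each word contributes to the final prediction.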

[Figure: the full model. GME blocks feed a chain of LSTM cells, whose outputs go to attention units and an FC-ReLU layer; the diagram is annotated “Reinforcement Learning”.]
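Because an on/off decision is not differentiable, a policy-gradient update is one way such a gate could be trained. The sketch below uses plain REINFORCE on a Bernoulli gate in a toy setting; the logistic gate controller, the reward function, the constant baseline, and all names are assumptions for illustration, and the slides do not specify the actual reward or update rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reinforce_gate_step(w, x, reward_fn, rng, lr=0.5, baseline=0.5):
    """One REINFORCE update for a Bernoulli gate with pass probability sigmoid(w . x)."""
    p = sigmoid(w @ x)
    g = int(rng.random() < p)        # sample the gate: 1 = pass, 0 = reject
    r = reward_fn(g)                 # scalar reward for this decision
    grad_logp = (g - p) * x          # gradient of log-probability w.r.t. w
    return w + lr * (r - baseline) * grad_logp, g, r

def train_toy_gate(steps=200, seed=0):
    """Toy setting: the input is flagged as noisy, and rejecting it earns reward 1."""
    rng = np.random.default_rng(seed)
    w = np.zeros(2)
    x_noisy = np.array([1.0, 1.0])   # [bias, hypothetical "occluded" flag]
    reward_fn = lambda g: 1.0 if g == 0 else 0.0
    for _ in range(steps):
        w, _, _ = reinforce_gate_step(w, x_noisy, reward_fn, rng)
    return w, sigmoid(w @ x_noisy)
```

In this toy setting every update lowers the pass probability for the noisy input, so the gate drifts toward rejecting it. In a real model the reward would presumably come from the downstream prediction error.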

Experiments
Text
§ Transcripts of videos as well as pre-trained GloVe word embeddings
Audio
§ COVAREP to extract acoustic features
Video
§ FACET and OpenFace to extract facial landmarks, head pose, gaze tracking, etc.

Baseline Models
§ C-MKL: Convolutional Multi-Kernel Learning model. Uses a CNN to extract textual features, which are then used for fusion. (Poria et al., 2015)
§ SAL-CNN: Select-Additive Learning. Reduces the impact of identity-specific information. (Wang et al., 2016)
§ SVM-MD: Support Vector Machine with Multimodal Dictionary. Combines multimodal features using early fusion. (Zadeh et al., 2016b)
§ RF: Random Forest

Results – Multimodal Predictions
Method        Acc    F-score   MAE     Note
Random        50.2   48.7      1.880
SAL-CNN       73.0   -         -
SVM-MD        71.6   72.3      1.100
C-MKL         73.5   -         -
RF            57.4   59.0      -
LSTM          69.4   63.7      1.245   No attention
LSTM(A)       75.7   72.1      1.019   Without GME
GME-LSTM(A)   76.5   73.4      0.955   Our model
Human         85.7   87.5      0.710
Gain of GME-LSTM(A) over the best baseline: Acc 3.0, F-score 1.1, MAE 0.145
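For readers who want to compute the three reported quantities themselves, here is a minimal sketch assuming real-valued sentiment scores, a positive/negative split at 0, and an F-score computed for the positive class. Those choices and the function name are assumptions; the slides do not state the exact evaluation protocol. The function returns fractions, whereas the slides report Acc and F-score as percentages.

```python
import numpy as np

def sentiment_metrics(y_true, y_pred):
    """Return (accuracy, f_score, mae) for real-valued sentiment scores.

    Accuracy and F-score are computed on binary labels obtained by
    thresholding at 0 (score >= 0 counts as positive); MAE uses raw scores.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = float(np.mean(np.abs(y_true - y_pred)))
    t, p = y_true >= 0, y_pred >= 0
    acc = float(np.mean(t == p))
    tp = int(np.sum(t & p))          # true positives
    fp = int(np.sum(~t & p))         # false positives
    fn = int(np.sum(t & ~p))         # false negatives
    denom = 2 * tp + fp + fn
    f_score = 2 * tp / denom if denom else 0.0
    return acc, f_score, mae
```

For example, `sentiment_metrics([1, -1, 2, -2], [0.5, -0.5, -1, -1])` gives an accuracy of 0.75 and an MAE of 1.25.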

Results – Text Only
Method         Acc      F-score   MAE
RNTN           (73.7)   (73.4)    (0.990)
DAN            70.0     69.4      -
D-CNN          69.0     65.1      -
SAL-CNN text   73.5     -         -
SVM-MD text    73.3     72.1      1.186
RF text        57.6     57.5      -
LSTM text      67.8     51.2      1.234
LSTM(A) text   71.3     67.3      1.062
GME-LSTM(A)    76.5     73.4      0.955

LSTM with Word-Level Features
Modalities         Acc    F-score   MAE
text               67.8   51.2      1.234
audio              44.9   61.9      1.511
video              44.9   61.9      1.505
text+audio         66.8   55.3      1.211
text+video         63.0   65.6      1.302
text+audio+video   69.4   63.7      1.245

LSTM with Temporal Attention (LSTM(A))
Modalities         Acc    F-score   MAE
text               71.3   67.3      1.062
audio              55.4   63.0      1.451
video              52.3   57.3      1.443
text+audio         73.5   70.3      1.036
text+video         74.3   69.9      1.026
text+audio+video   75.7   72.1      1.019

Temporal Attention on Word Features
§ But a lot of the footage was kind of unnecessary.
§ And she really enjoyed the film.
§ I thought it was fun.
§ So yes I really enjoyed it.

Example from LSTM with Temporal Attention
Transcript: He’s not gonna be looking like a chirper bright young man but early thirties really you want me to buy that.
Visual modality: Looks disappointed
LSTM sentiment prediction: 1.24
LSTM(A) sentiment prediction: -0.94
Ground truth sentiment: -1.8

Example for Gated Multimodal Embedding
Transcript: First of all I’d like to say little James or Jimmy he’s so cute he’s so xxx.
LSTM(A) attention: little (her mouth is covered by her hands)
GME-LSTM(A) attention: cute
LSTM(A) prediction: -0.94
GME-LSTM(A) prediction: 1.57
Ground truth: 3.0

[Video example showing the effect of GME]

GME Analysis
Visual RL gate: Reject, Pass, Reject
LSTM(A) prediction: -2.0032
GME-LSTM(A) prediction: 1.4835
Ground truth: 1.2

Main Contributions
1 What granularity should we use? Word-level feature representation
2 What if a modality is noisy (e.g., occlusion)? Modality-specific “on/off gate”
3 What part of the video is relevant for the prediction task? Temporal attention

Thank you!
