Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment. Yue Gu*, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, Ivan Marsic. Multimedia Image Processing Lab, Electrical and Computer Engineering Department, Rutgers, The State University of New Jersey
Why is affective analysis necessary? An AI assistant that recognizes the affect carried in human speech can give more accurate responses in question-and-answer and recommendation settings.
Progress of Affective Computing. Affective analysis covers emotion recognition (happy/excited, sadness, anger, neutral, frustration) and sentiment analysis (strong positive, positive, neutral, negative, strong negative). Speech signal processing builds on MFCCs, prosody, and vocal quality; natural language processing builds on BoW and POS features; recent work applies CNNs and LSTMs and moves toward multi-modality.
Is multi-modality needed? Vocal signal prominence: "Oh, you don't like that, you are a west-sider." From the text alone, the utterance reads as Neutral or Frustration; with the prosodic emphasis, it is actually Happy. Acoustic ambiguity: "I love this city!" and "I hate this city!" can sound alike acoustically, so the vocal signal alone cannot separate them; the text disambiguates.
Challenges: Feature Extraction. There is a gap between low-level features and actual affective states; extracted features lack high-level associations; and not all parts of an utterance contribute equally.
Challenges: Modality Fusion. Decision-level fusion lacks mutual association learning across modalities; feature-level (utterance-level) fusion fails to learn time-dependent interactions and lacks consistency between modalities.
Proposed Solutions. Feature extraction: hierarchical attention based on bidirectional GRUs. Modality fusion: word-level fusion with attention. Together, these form an end-to-end multimodal network.
Data Pre-processing. Text branch: word embeddings via word2vec. Audio branch: Mel-frequency spectral coefficients (MFSCs). Synchronization: word-level forced alignment between the transcript and the audio.
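Since the deck only names the pre-processing steps, here is a minimal sketch of what they could look like in Python, assuming librosa for the log-mel (MFSC) features; `mfsc`, `align_frames_to_words`, and the `word_intervals` timestamp list are hypothetical names, and the deck does not say which forced aligner produced the timestamps or which frame settings were used.

```python
# Minimal pre-processing sketch (not the authors' code): MFSC extraction
# plus word-level grouping of frames using forced-alignment timestamps.
# `word_intervals` is a hypothetical list of (word, start_sec, end_sec)
# tuples produced by any forced aligner.
import numpy as np
import librosa

def mfsc(wav_path, sr=16000, n_mels=64, hop=0.010, win=0.025):
    """Log mel-frequency spectral coefficients, one row per 10 ms frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels,
        hop_length=int(hop * sr), n_fft=int(win * sr))
    return librosa.power_to_db(mel).T            # (n_frames, n_mels)

def align_frames_to_words(frames, word_intervals, hop=0.010):
    """Slice the frame sequence into per-word chunks for word-level fusion."""
    chunks = []
    for word, start, end in word_intervals:
        i = int(start / hop)
        j = max(int(end / hop), i + 1)           # at least one frame per word
        chunks.append((word, frames[i:j]))       # frames spoken during `word`
    return chunks
```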
Proposed Architecture. Text branch: embedded words ("I" ... "mean" ... "guys") feed a word-level BiGRU that produces textual contextual states $h^t_1, \dots, h^t_N$, followed by a textual attention yielding weights $\alpha^t_1, \dots, \alpha^t_N$. Audio branch: MFSC frames feed a frame-level BiGRU (hidden states $h^f_{i1}, \dots, h^f_{iL}$ for word $i$); a frame-level acoustic attention pools each word's frames, a word-level BiGRU produces acoustic contextual states $h^w_1, \dots, h^w_N$, and a word-level acoustic attention yields $\alpha^w_1, \dots, \alpha^w_N$. Word-level fusion combines the two branches into fused vectors $V_1, \dots, V_N$, which a CNN and a softmax layer map to the output label. Frame-level acoustic attention: $e^f_{ij} = \tanh(W_f h^f_{ij} + b_f)$, $\alpha^f_{ij} = \frac{\exp\big((e^f_{ij})^{\top} v_f\big)}{\sum_{k=1}^{L} \exp\big((e^f_{ik})^{\top} v_f\big)}$.
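As a hedged illustration of the frame-level acoustic attention defined above, here is a PyTorch sketch that computes $e^f_{ij}$ and $\alpha^f_{ij}$ over the frames of one word and pools them into a word-level acoustic vector; the class name, layer sizes, and weighted-sum pooling are assumptions, not the authors' released code.

```python
# Sketch of frame-level acoustic attention over BiGRU states:
# e_ij = tanh(W_f h_ij + b_f), alpha_ij = softmax over frames of e_ij^T v_f.
import torch
import torch.nn as nn

class FrameAttentionBiGRU(nn.Module):
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.bigru = nn.GRU(n_mels, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, 2 * hidden)        # W_f, b_f
        self.context = nn.Linear(2 * hidden, 1, bias=False)  # v_f

    def forward(self, frames):                 # frames: (batch, L, n_mels)
        h, _ = self.bigru(frames)              # h_ij: (batch, L, 2*hidden)
        e = torch.tanh(self.proj(h))           # e_ij
        alpha = torch.softmax(self.context(e), dim=1)  # alpha_ij over frames
        return (alpha * h).sum(dim=1)          # pooled word-level acoustic vector
```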
Word-level Fusion. Three strategies combine each word's textual and acoustic representations into a fused vector $V_i$: (a) horizontal fusion passes each modality's attention-weighted contextual state through a dense layer and concatenates the results; (b) vertical fusion concatenates the contextual states and weights them with a shared attention $\alpha^s_i$ derived from $\alpha^t_i$ and $\alpha^w_i$; (c) fine-tuning attention fusion learns an additional attention on top of the shared one: $\alpha^u_i = \frac{\exp\big((e^u_i)^{\top} v_u\big)}{\sum_{k=1}^{N} \exp\big((e^u_k)^{\top} v_u\big)} + \alpha^s_i$. Notation: $\alpha^w_i$ / $\alpha^t_i$: word-level acoustic / textual attention distributions; $h^w_i$ / $h^t_i$: word-level acoustic / textual contextual states.
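The three strategies can be sketched as follows. This is an interpretation of the slide's diagrams rather than the authors' code: the shared attention is assumed to be the average of the textual and acoustic attentions, and the `fine_tune` branch implements the $\alpha^u_i$ equation above.

```python
# Hedged sketch of the three word-level fusion strategies, applied per word.
# t_h, a_h: textual / acoustic contextual states (h^t_i, h^w_i);
# t_alpha, a_alpha: their word-level attention weights (alpha^t_i, alpha^w_i).
import torch
import torch.nn as nn

class WordLevelFusion(nn.Module):
    def __init__(self, dim, mode="fine_tune"):
        super().__init__()
        self.mode = mode
        self.dense = nn.Linear(2 * dim, 2 * dim)      # fusion dense layer
        self.e = nn.Linear(2 * dim, 2 * dim)          # e^u projection
        self.v = nn.Linear(2 * dim, 1, bias=False)    # v_u context vector

    def forward(self, t_h, a_h, t_alpha, a_alpha):
        # t_h, a_h: (batch, N, dim); t_alpha, a_alpha: (batch, N, 1)
        if self.mode == "horizontal":   # (a) weight each modality, then join
            return self.dense(torch.cat([t_alpha * t_h, a_alpha * a_h], dim=-1))
        shared = (t_alpha + a_alpha) / 2               # assumed alpha^s_i
        fused = self.dense(torch.cat([t_h, a_h], dim=-1))
        if self.mode == "vertical":     # (b) one shared weight per word
            return shared * fused
        # (c) fine-tuning attention: alpha^u = softmax((e^u)^T v_u) + alpha^s
        scores = self.v(torch.tanh(self.e(fused)))     # (batch, N, 1)
        alpha_u = torch.softmax(scores, dim=1) + shared
        return alpha_u * fused
```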
Baselines. Sentiment analysis: BL-SVM, LSTM-SVM, C-MKL, TFN, LSTM(A). Emotion recognition: SVM Trees, GSV-eVector, C-MKL, H-DMS. Fusion baselines: decision-level and feature-level (utterance-level) fusion.
Sentiment Analysis Results (MOSI): [bar chart comparing weighted accuracy and weighted F1 across methods, y-axis 60–78%].
Emotion Recognition Results (IEMOCAP): [bar chart comparing weighted accuracy and unweighted accuracy across methods, y-axis 50–75%].
Multimodal architecture is needed: [bar charts on MOSI (weighted accuracy and weighted F1, 50–80%) and IEMOCAP (55–75%) comparing text-only (T), audio-only (A), and the combined model (T+A); T+A scores highest].
Generalization: [bar charts for MOSI to YouTube (weighted accuracy and weighted F1, 60–68%) and IEMOCAP to EmotiW (56–62%) comparing Ours-HF, Ours-VF, and Ours-HAF].
Attention Visualization. The fused representations carry representative information from both text and audio, and the learned fusion successfully combines the textual and acoustic attentions. Example (label: anger): "What about the business; what the hell is this", shown with heatmaps of $\alpha^w_i$ (word-level acoustic), $\alpha^t_i$ (word-level textual), $\alpha^s_i$ (shared), and $\alpha^u_i$ (fine-tuned) attention distributions over the words.
Attention Visualization. The attentions capture emphasis and word-importance variation under vocal signal prominence. Example (label: happy): "Oh, you don't like that, you're a west-sider", shown with the same four attention heatmaps ($\alpha^w_i$, $\alpha^t_i$, $\alpha^s_i$, $\alpha^u_i$). A plotting sketch follows.
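To reproduce this kind of figure, one can stack the four word-level attention distributions into a matrix and render it as a heatmap; the sketch below uses random placeholder weights, since the actual values would come from a trained model.

```python
# Heatmap of the four word-level attention distributions over an utterance.
# The Dirichlet draw is a stand-in for real attention weights.
import numpy as np
import matplotlib.pyplot as plt

words = ["Oh", "you", "don't", "like", "that", "you're", "west-sider"]
rows = ["acoustic", "textual", "shared", "fine-tuned"]
alphas = np.random.dirichlet(np.ones(len(words)), size=len(rows))  # placeholder

fig, ax = plt.subplots(figsize=(8, 2.5))
ax.imshow(alphas, aspect="auto", cmap="Reds")   # darker = higher attention
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words)
ax.set_yticks(range(len(rows)))
ax.set_yticklabels(rows)
plt.tight_layout()
plt.show()
```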
Summary: a hierarchical attention-based multimodal structure; word-level fusion strategies; word-level attention visualization.
Thank you! Email: yg202@scarletmail.rutgers.edu Homepage: www.ieyuegu.com