

  1. Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment. Yue Gu*, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, Ivan Marsic. Multimedia Image Processing Lab, Electrical and Computer Engineering Department, Rutgers, The State University of New Jersey

  2. Why is affective analysis necessary? [Diagram: human speech flows into an AI assistant; recognizing affect supports question answering, recommendation systems, and accurate responses]

  3. Progress of Affective Computing
      Speech signal processing → emotion recognition (Happy/Excited, Sadness, Anger, Neutral, Frustration); typical features: MFCCs, prosody, vocal quality
      Natural language processing → sentiment analysis (Strong Positive, Positive, Neutral, Negative, Strong Negative); typical features: BoW, POS
      Multi-modality → CNNs, LSTMs

  4. Is multi-modality needed?  Vocal signal prominence: “Oh you don’t like that, you are west-sider” → Neutral or Frustration?

  5. Is multi-modality needed?  Vocal signal prominence: “Oh you don’t like that, you are west-sider” → Neutral or Frustration from the text alone, but Happy once the vocal signal is considered

  6. Is multi-modality needed?  Vocal signal prominence: “Oh you don’t like that, you are west-sider” → Neutral or Frustration from the text alone, but Happy with the vocal signal  Acoustic ambiguity: “I love this city!” vs. “I hate this city!”

  7. Challenges: Feature Extraction  Gap between features and actual affective states  Lack of high-level associations  Not all parts of an utterance contribute equally

  8. Challenges: Modality Fusion  Decision-level fusion: lacks mutual association learning  Feature-level fusion: fails to learn time-dependent interactions and lacks consistency

  9. Proposed Solutions  Feature extraction: hierarchical attention-based bidirectional GRUs  Modality fusion: word-level fusion with attention  An end-to-end multimodal network

  10. Data Pre-processing  Text branch: word embedding with word2vec  Audio branch: Mel-frequency spectral coefficients (MFSCs)  Synchronization: word-level forced alignment (a sketch of this step follows below)
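
As a concrete illustration of the audio branch and the alignment step, here is a minimal sketch using librosa. The sample rate, 64 mel bands, 25 ms window / 10 ms hop, and the timestamp format are assumptions for illustration, not the paper's exact settings; word timestamps are expected from an external forced aligner.

```python
# Minimal sketch: word-aligned MFSC extraction (assumed parameters,
# not the authors' exact setup).
import librosa
import numpy as np

def word_level_mfsc(wav_path, word_spans, sr=16000, n_mels=64,
                    win_length=400, hop_length=160):
    """Slice log-mel (MFSC) frames per word using forced-alignment timestamps.

    word_spans: list of (word, start_sec, end_sec) from a forced aligner.
    Returns a list of (word, frames) with frames shaped (n_mels, n_frames).
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win_length, win_length=win_length,
        hop_length=hop_length, n_mels=n_mels)
    mfsc = librosa.power_to_db(mel)           # log-mel "spectral coefficients"
    frames_per_sec = sr / hop_length
    aligned = []
    for word, start, end in word_spans:
        a = int(start * frames_per_sec)
        b = max(a + 1, int(end * frames_per_sec))
        aligned.append((word, mfsc[:, a:b]))  # frame-level features for this word
    return aligned
```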

  11. Overall architecture. [Diagram: the text branch feeds embedded words (“I”, “guys”, …, “mean”) into a word-level BiGRU, and its hidden states pass through a textual attention layer; the audio branch feeds frame-level MFSCs into a BiGRU with a frame-level acoustic attention layer, followed by a CNN, a word-level acoustic BiGRU, and a word-level acoustic attention layer; the two branches are fused word by word and the fusion result is classified by a softmax layer.] Frame-level acoustic attention for word $j$ over its $M$ frames (notation reconstructed): $e_{jk} = \tanh(W_e h_{jk} + b_e)$ and $\alpha_{jk} = \exp(e_{jk}^{\top} w_e) / \sum_{l=1}^{M} \exp(e_{jl}^{\top} w_e)$.
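
The frame-level attention fits in a few lines. Below is a minimal PyTorch sketch (the module name, layer sizes, and single-word batching are assumptions, not the authors' implementation): it runs a BiGRU over the MFSC frames of one word, scores each frame with $e_{jk} = \tanh(W_e h_{jk} + b_e)$, and returns the attention-weighted sum as the word-level acoustic summary.

```python
# A minimal PyTorch sketch of frame-level acoustic attention over BiGRU
# states (dimensions and names are assumptions, not the authors' code).
import torch
import torch.nn as nn

class FrameAttentionBiGRU(nn.Module):
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.bigru = nn.GRU(n_mels, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, 2 * hidden)   # e_jk = tanh(W_e h_jk + b_e)
        self.w = nn.Parameter(torch.randn(2 * hidden))  # context vector w_e

    def forward(self, frames):
        # frames: (batch, M, n_mels) MFSC frames aligned to one word
        h, _ = self.bigru(frames)                       # (batch, M, 2*hidden)
        e = torch.tanh(self.proj(h))                    # (batch, M, 2*hidden)
        alpha = torch.softmax(e @ self.w, dim=1)        # (batch, M) frame weights
        return (alpha.unsqueeze(-1) * h).sum(dim=1)     # word-level acoustic summary
```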

  12. Word-level Fusion. [Diagram: three strategies for fusing, at each word $j$, the word-level textual contextual state $h^t_j$ and textual attention distribution $\alpha^t_j$ with the word-level acoustic contextual state $h^a_j$ and acoustic attention distribution $\alpha^a_j$: (a) horizontal fusion, (b) vertical fusion, and (c) fine-tuning attention fusion, in which dense layers map the concatenated states to $u_j$ and an attention layer refines the shared attention distribution $\alpha^s_j$.] Fine-tuning attention (notation reconstructed): $\alpha^f_j = \exp(u_j^{\top} w_u) / \sum_{l=1}^{N} \exp(u_l^{\top} w_u) + \alpha^s_j$, with $N$ the number of words.
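
A sketch of the three fusion variants follows. The exact horizontal and vertical combination rules below (modality-weighted concatenation and an averaged shared attention) are assumptions read off the diagram, not the authors' published equations; only the fine-tuning formula follows the slide.

```python
# Minimal sketch of word-level fusion variants (layer sizes and the exact
# horizontal/vertical rules are assumptions based on the slide).
import torch
import torch.nn as nn

class WordLevelFusion(nn.Module):
    def __init__(self, d_t=256, d_a=256, d=256):
        super().__init__()
        self.dense = nn.Linear(d_t + d_a, d)   # u_j from concatenated states
        self.w_u = nn.Parameter(torch.randn(d))

    def forward(self, h_t, a_t, h_a, a_a):
        # h_t, h_a: (batch, N, d_*) word-level contextual states
        # a_t, a_a: (batch, N) word-level attention distributions
        h = torch.cat([h_t, h_a], dim=-1)                  # shared representation
        # (a) horizontal fusion: weight each modality by its own attention, concat
        horizontal = torch.cat([a_t.unsqueeze(-1) * h_t,
                                a_a.unsqueeze(-1) * h_a], dim=-1)
        # (b) vertical fusion: one shared attention over the concatenated states
        a_s = (a_t + a_a) / 2                              # shared attention (assumed)
        vertical = a_s.unsqueeze(-1) * h
        # (c) fine-tuning attention fusion: refine the shared attention (per slide)
        u = torch.tanh(self.dense(h))                      # (batch, N, d)
        a_f = torch.softmax(u @ self.w_u, dim=1) + a_s     # fine-tuned attention
        finetuned = a_f.unsqueeze(-1) * h
        return horizontal, vertical, finetuned
```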

  13. Baselines  Sentiment analysis: BL-SVM, LSTM-SVM, C-MKL, TFN, LSTM(A)  Emotion recognition: SVM Trees, GSV-eVector, C-MKL, H-DMS  Fusion baselines: decision-level and feature-level (utterance-level)

  14. Sentiment Analysis Result. [Bar chart: weighted accuracy and weighted F1 on MOSI; y-axis 60–78%]

  15. Emotion Recognition Result. [Bar chart: weighted accuracy and unweighted accuracy on IEMOCAP; y-axis 50–75%]

  16. Multimodal architecture is needed. [Bar charts: text-only (T), audio-only (A), and combined (T+A) variants; weighted accuracy and weighted F1 on MOSI (y-axis 50–80%) and IEMOCAP (y-axis 55–75%)]

  17. Generalization. [Bar charts: transfer from MOSI to YouTube (weighted accuracy and weighted F1, y-axis 60–68%) and from IEMOCAP to EmotiW (y-axis 56–62%), comparing Ours-HF, Ours-VF, and Ours-HAF]

  18. Attention Visualization: the fused representations carry representative information in both text and audio and successfully combine textual and acoustic attentions. Label: anger. Utterance: “What about the business, what the hell is this.” [Heatmap rows: word-level acoustic attention distribution $\alpha^a_j$, word-level textual attention distribution $\alpha^t_j$, shared attention distribution $\alpha^s_j$, fine-tuning attention distribution $\alpha^f_j$]

  19. Attention Visualization: the attentions capture emphasis, importance variation, and vocal signal prominence. Label: happy. Utterance: “Oh you don’t like that you’re west-sider.” [Heatmap rows as on slide 18: $\alpha^a_j$, $\alpha^t_j$, $\alpha^s_j$, $\alpha^f_j$] A plotting sketch follows below.
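
This kind of figure can be reproduced with a short matplotlib sketch. The weights below are synthetic placeholders; the real rows would come from the trained model's four attention distributions.

```python
# Minimal heatmap sketch of per-word attention rows (synthetic placeholder
# weights; real values come from the trained model's attention layers).
import matplotlib.pyplot as plt
import numpy as np

words = ["Oh", "you", "don't", "like", "that", "you're", "west-sider"]
rows = ["acoustic", "textual", "shared", "fine-tuned"]
attn = np.random.dirichlet(np.ones(len(words)), size=len(rows))  # placeholder

fig, ax = plt.subplots(figsize=(8, 2.5))
im = ax.imshow(attn, cmap="Reds", aspect="auto")
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words)
ax.set_yticks(range(len(rows)))
ax.set_yticklabels(rows)
fig.colorbar(im, ax=ax, label="attention weight")
plt.tight_layout()
plt.show()
```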

  20. Summary  A hierarchical attention-based multimodal structure  Word-level fusion strategies  Word-level attention visualization

  21. Thank you! Email: yg202@scarletmail.rutgers.edu Homepage: www.ieyuegu.com
