Efficient Low-rank Multimodal Fusion with Modality-specific Factors
Zhun Liu, Ying Shen, Varun Bharadwaj, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency


  1. Efficient Low-rank Multimodal Fusion with Modality-specific Factors. Zhun Liu, Ying Shen, Varun Bharadwaj, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency

  2. Artificial Intelligence

  3. Sentiment and Emotion Analysis. From a speaker's behaviors over time (the utterance "This movie is sick", a smile, a loud voice), infer the sentiment intensity.

  4. Multimodal Sentiment and Emotion Analysis. The same question ("This movie is sick", a smile, a loud voice) now involves unimodal, bimodal, and trimodal cues, so building the multimodal representation (multimodal fusion) must address ① intra-modal interactions, ② cross-modal interactions, and ③ computational efficiency.

  5. Multimodal Fusion using Tensor Representation. Append a constant 1 to each unimodal representation and take the tensor product; for the bimodal (language and visual) case: $\mathcal{Z} = \begin{bmatrix} z_l \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_v \\ 1 \end{bmatrix}$. The resulting tensor captures intra-modal and cross-modal interactions, but at the cost of computational efficiency. ("Tensor Fusion Network for Multimodal Sentiment Analysis" by Zadeh, A., et al., 2017)
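The tensor-product construction above can be sketched in a few lines of NumPy; the embeddings and their dimensions here are toy values, not the ones used in the paper.

```python
import numpy as np

# Minimal sketch of bimodal tensor fusion: append a constant 1 to each
# modality vector, then take the outer (tensor) product. Every entry of Z
# is then one unimodal or bimodal interaction term.
z_l = np.array([0.2, -1.0, 0.5])     # toy language embedding
z_v = np.array([1.3, 0.7])           # toy visual embedding

z_l1 = np.concatenate([z_l, [1.0]])  # shape (4,)
z_v1 = np.concatenate([z_v, [1.0]])  # shape (3,)

Z = np.outer(z_l1, z_v1)             # shape (4, 3)
print(Z.shape)                       # (4, 3)
```

The last row and column of `Z` reproduce the unimodal features themselves (because of the appended 1s), while the remaining block holds the cross-modal products.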

  6. Computational Complexity: Tensor Product. With $M$ modalities of (augmented) dimensions $d_1, \dots, d_M$, the fused tensor has $\prod_{m=1}^{M} d_m$ entries: $\mathcal{Z} \in \mathbb{R}^{d_1 \times d_2}$ for $M = 2$ and $\mathcal{Z} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ for $M = 3$, i.e. exponential growth in the number of modalities.
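A quick sketch of that growth, assuming made-up per-modality feature sizes:

```python
import math

# The fused tensor's size is the product of the augmented per-modality
# dimensions (each +1 for the appended constant), so it grows
# exponentially with the number of modalities M.
def fused_tensor_size(dims):
    """Number of entries in the tensor product of M augmented vectors."""
    return math.prod(d + 1 for d in dims)

print(fused_tensor_size([128, 32]))      # M=2: 129 * 33      = 4257
print(fused_tensor_size([128, 32, 16]))  # M=3: 129 * 33 * 17 = 72369
```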

  7. CORE CONTRIBUTIONS: Low-rank Multimodal Fusion (LMF)

  8. From Tensor Representation to Low-rank Fusion. Low-rank Multimodal Fusion modifies Tensor Fusion Networks in three steps: ① decomposition of the weight tensor $\mathcal{W}$, ② decomposition of the input tensor $\mathcal{Z}$, ③ rearranging the computation of $h$.

  9. Canonical Polyadic (CP) Decomposition of tensors. The rank of a tensor $\mathcal{W}$ is the minimum number of vector tuples needed for exact reconstruction.

  10. Canonical Polyadic (CP) Decomposition of 3D tensors: $\mathcal{W} = \sum_{i=1}^{r} w_1^{(i)} \otimes w_2^{(i)} \otimes w_3^{(i)}$.
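A minimal sketch of building a 3-way tensor from rank-$r$ CP factors; the rank and dimensions here are arbitrary toy values.

```python
import numpy as np

# A rank-r CP tensor is the sum of r outer products of vector triples.
rng = np.random.default_rng(0)
r, d1, d2, d3 = 4, 5, 6, 7
factors = [(rng.normal(size=d1), rng.normal(size=d2), rng.normal(size=d3))
           for _ in range(r)]

# W = sum_i  u_i (x) v_i (x) w_i  -- einsum computes each outer product.
W = sum(np.einsum('a,b,c->abc', u, v, w) for u, v, w in factors)
print(W.shape)  # (5, 6, 7)
```

By construction this tensor has rank at most 4, and storing the factors takes $r(d_1 + d_2 + d_3)$ numbers instead of $d_1 d_2 d_3$.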

  11. Modality-specific Decomposition. Retain the dimension for the multimodal representation $h$ during decomposition: each modality-specific factor $w_m^{(i)}$ is a matrix that keeps the output dimension $d_h$, rather than a plain vector.

  12. ① Decomposition of weight tensor $\mathcal{W}$. The fused output is $h = \mathcal{W} \cdot \mathcal{Z}$, where $\mathcal{Z} = \begin{bmatrix} z_l \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_v \\ 1 \end{bmatrix}$.

  13. ① Decomposition of weight tensor $\mathcal{W}$ (continued). Substituting the modality-specific CP decomposition: $h = \left( \sum_{i=1}^{r} w_l^{(i)} \otimes w_v^{(i)} \right) \cdot \mathcal{Z}$.

  14. ② Decomposition of $\mathcal{Z}$. The input tensor is itself an outer product and never needs to be materialized: $h = \left( \sum_{i=1}^{r} w_l^{(i)} \otimes w_v^{(i)} \right) \cdot \left( \begin{bmatrix} z_l \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_v \\ 1 \end{bmatrix} \right)$.

  15. ③ Rearranging computation. The contraction factorizes into per-modality projections: $h = \sum_{i=1}^{r} \left( w_l^{(i)} \cdot \begin{bmatrix} z_l \\ 1 \end{bmatrix} \right) \circ \left( w_v^{(i)} \cdot \begin{bmatrix} z_v \\ 1 \end{bmatrix} \right)$, where $\circ$ denotes the element-wise product.
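The rearrangement can be checked numerically. This is a minimal NumPy sketch for the bimodal case with random toy data and arbitrary small dimensions; it builds the full tensor route and the low-rank route and confirms they agree.

```python
import numpy as np

# Each modality-specific factor keeps the output dimension d_h, so
# W_l has shape (rank, d_l+1, d_h) and W_v has shape (rank, d_v+1, d_h).
rng = np.random.default_rng(1)
d_l, d_v, d_h, rank = 4, 3, 5, 2

W_l = rng.normal(size=(rank, d_l + 1, d_h))  # language factors
W_v = rng.normal(size=(rank, d_v + 1, d_h))  # visual factors

z_l = np.concatenate([rng.normal(size=d_l), [1.0]])
z_v = np.concatenate([rng.normal(size=d_v), [1.0]])

# Full-tensor route: build W and Z explicitly, then contract.
W = np.einsum('rah,rbh->abh', W_l, W_v)      # (d_l+1, d_v+1, d_h)
Z = np.outer(z_l, z_v)                       # (d_l+1, d_v+1)
h_full = np.einsum('ab,abh->h', Z, W)

# Low-rank route: project each modality, multiply element-wise per rank,
# then sum over ranks -- the full tensors W and Z are never formed.
h_lmf = ((z_l @ W_l) * (z_v @ W_v)).sum(axis=0)

print(np.allclose(h_full, h_lmf))            # True
```

The equivalence is exact: contracting $\mathcal{Z} = z_l \otimes z_v$ against each rank-1 term of $\mathcal{W}$ splits into the product of two independent matrix-vector products.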

  16. Low-rank Multimodal Fusion: the full model computes $h$ directly from the modality-specific low-rank factors, without ever forming $\mathcal{Z}$ or $\mathcal{W}$.

  17. Easily scales to more modalities: LMF still captures intra-modal and cross-modal interactions, while the computational complexity drops from exponential to linear in the number of modalities.
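A sketch of that scaling, again with random toy data and made-up dimensions: one matrix multiply per modality, followed by an element-wise product chain, so the cost grows linearly in $M$.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(2)
d_h, rank = 6, 3
dims = [4, 3, 5]  # e.g. acoustic, visual, language feature sizes (toy)

# One factor tensor per modality, each keeping the output dimension d_h.
factors = [rng.normal(size=(rank, d + 1, d_h)) for d in dims]
inputs = [np.concatenate([rng.normal(size=d), [1.0]]) for d in dims]

# h = sum over ranks of the element-wise product of per-modality
# projections; adding a modality adds one projection, not a tensor axis.
h = reduce(np.multiply, (z @ F for z, F in zip(inputs, factors))).sum(axis=0)
print(h.shape)  # (6,)
```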

  18. EXPERIMENTS AND RESULTS

  19. Datasets
  • CMU-MOSI (Sentiment Analysis): 2199 video segments from 93 movie-review videos; single speaker; segment-level, real-valued sentiment annotations.
  • POM (Speaker Trait Recognition): 1000 full movie-review video clips; single speaker; video-level categorical annotations for 16 types of speaker traits.
  • IEMOCAP (Emotion Recognition): 10039 video segments from 302 videos; dyadic interactions; segment-level categorical annotations for 10 classes of emotions.

  20. Compare to full-rank tensor fusion (CMU-MOSI). [Bar chart: LMF (our model) vs. TFN (Zadeh, et al., 2017) on five metrics. LMF improves on every metric, e.g. Acc-2 76.4 vs. 73.9, F1 75.7 vs. 73.4, Acc-7 32.8 vs. 32.1, with lower MAE and higher Correlation.]

  21. Compare to full-rank tensor fusion (CMU-MOSI, POM, IEMOCAP). [Bar charts: LMF also outperforms TFN on CMU-MOSI MAE and Correlation, POM MAE and Correlation, and IEMOCAP F1-Happy and F1-Sad.]

  22. Compare with State-of-the-Art Approaches (CMU-MOSI, Mean Absolute Error; lower is better):
  • LMF, Low-rank Multimodal Fusion (our model): 0.912
  • MFN, Memory Fusion Networks (Zadeh, et al., 2018): 0.965
  • MARN, Multi-attention Recurrent Networks (Zadeh, et al., 2018): 0.968
  • TFN, Tensor Fusion Networks (Zadeh, et al., 2017): 0.970
  • MV-LSTM, Multi-view LSTM (Rajagopalan, et al., 2016): 1.019
  • Deep Fusion (Nojavanasghari, et al., 2016): 1.143

  23. Compare with Top-2 State-of-the-Art Approaches. [Bar charts: against MFN and MARN, LMF achieves the best scores on CMU-MOSI (MAE 0.912, Correlation 0.668), POM (MAE 0.796, Correlation 0.396), and IEMOCAP (F1-Angry 89.0, F1-Sad 85.9); TFN and MV-LSTM shown for reference.]

  24. Efficiency Improvement (CMU-MOSI). Efficiency metric: number of data samples processed per second, for both training and testing. [Bar chart: training, LMF (ours) 1177.17 vs. TFN (Zadeh, et al., 2017) 340.74 samples/s; testing, LMF 2249.9 vs. TFN 1134.82 samples/s.]
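The throughput metric on this slide can be sketched as a simple timing loop; the model here is a stand-in matrix multiply, not the actual LMF or TFN implementation.

```python
import time
import numpy as np

# Throughput = samples processed per second by repeated forward passes.
def throughput(forward, batch, n_iters=50):
    start = time.perf_counter()
    for _ in range(n_iters):
        forward(batch)
    elapsed = time.perf_counter() - start
    return n_iters * len(batch) / elapsed

W = np.random.default_rng(3).normal(size=(64, 8))
batch = np.ones((32, 64))
print(f"{throughput(lambda x: x @ W, batch):.1f} samples/s")
```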

  25. Conclusions. LMF captures intra-modal and cross-modal interactions, reduces the computational complexity of tensor fusion, and achieves state-of-the-art results.

  26. Thank you! Code: https://github.com/Justin1904/Low-rank-Multimodal-Fusion http://multicomp.cs.cmu.edu/
