Efficient Low-rank Multimodal Fusion With Modality-specific Factors
Zhun Liu, Ying Shen, Varun Bharadwaj, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
Artificial Intelligence
Sentiment and Emotion Analysis
A speaker's behaviors unfold over time: the spoken words ("This movie is sick"), a smile, a loud voice. We want to infer the sentiment intensity they convey.
Multimodal Sentiment and Emotion Analysis
The same behaviors carry unimodal, bimodal, and trimodal cues about sentiment intensity. Building a multimodal representation (multimodal fusion) therefore has to address:
① Intra-modal interactions
② Cross-modal interactions
③ Computational efficiency
Multimodal Fusion using Tensor Representation
The unimodal, bimodal (and higher-order) interactions can all be captured in one multimodal representation by taking the outer product of the modality vectors after appending a constant 1 to each. For the language and visual case:

$$\mathcal{Z} = \begin{bmatrix} z_l \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_v \\ 1 \end{bmatrix}$$

This handles intra-modal and cross-modal interactions, but leaves the question of computational efficiency.
"Tensor Fusion Network for Multimodal Sentiment Analysis", Zadeh, A., et al. (2017)
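As a concrete illustration, here is a small PyTorch sketch of this tensor-fusion step for three modalities; the embedding sizes and variable names below are illustrative assumptions, not the TFN authors' code.

```python
import torch

# Toy unimodal embeddings for one sample; sizes are made up for illustration.
z_l = torch.randn(300)   # language
z_v = torch.randn(35)    # visual
z_a = torch.randn(74)    # acoustic

one = torch.ones(1)
zl1 = torch.cat([z_l, one])   # append the constant 1 so unimodal and
zv1 = torch.cat([z_v, one])   # bimodal sub-tensors survive inside Z
za1 = torch.cat([z_a, one])

# Trimodal tensor fusion: outer product of the 1-appended vectors.
Z = torch.einsum('i,j,k->ijk', zl1, zv1, za1)   # shape (301, 36, 75)

# A linear layer on Z needs a weight tensor of shape (301, 36, 75, |h|).
print(Z.shape, Z.numel())   # 812,700 entries for a single sample
```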
Computational Complexity – Tensor Product
The input tensor grows exponentially with the number of modalities $M$:

$$|\mathcal{Z}| \in \mathcal{O}\!\left(\prod_{m=1}^{M} d_m\right)$$

For $M = 2$: $\mathcal{Z} \in \mathcal{O}(d_1 \times d_2)$; for $M = 3$: $\mathcal{Z} \in \mathcal{O}(d_1 \times d_2 \times d_3)$, and the weight tensor that maps $\mathcal{Z}$ to $h$ grows by another factor of $|h|$.
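To make the blow-up concrete, a quick back-of-the-envelope count in Python (the dimensions are assumptions, roughly in the range used for this kind of data):

```python
# Entries in the fused tensor and in the weight tensor of a full-rank fusion layer.
dims = {'language': 300, 'visual': 35, 'acoustic': 74}   # assumed input sizes
h = 128                                                  # assumed fused size

tensor_entries = 1
for d in dims.values():
    tensor_entries *= d + 1          # each modality contributes a factor (d_m + 1)

weight_params = tensor_entries * h   # weight tensor mapping Z to an h-dim vector
print(tensor_entries)                # 812700 entries in Z per sample
print(weight_params)                 # 104025600 weights, exponential in #modalities
```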
CORE CONTRIBUTIONS
Low-rank Multimodal Fusion (LMF)
From Tensor Representation to Low-rank Fusion
Going from Tensor Fusion Networks to Low-rank Multimodal Fusion (shown for the language and visual case):
① Decomposition of the weight tensor W
② Decomposition of the input tensor Z
③ Rearranging the computation of h
Canonical Polyadic (CP) Decomposition of Tensors
Rank of a tensor $\mathcal{W}$: the minimum number of vector tuples needed for exact reconstruction.
Canonical Polyadic (CP) Decomposition of 3D Tensors
A 3D tensor of rank $r$ is exactly the sum of $r$ outer products of vectors:

$$\mathcal{W} = \sum_{i=1}^{r} w_1^{(i)} \otimes w_2^{(i)} \otimes w_3^{(i)}$$
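A minimal NumPy sketch of this idea: building a rank-$r$ tensor from factor vectors, where CP decomposition is the inverse problem of recovering such factors from a given tensor. Sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, d3, rank = 8, 7, 6, 3      # toy sizes (assumptions)

# rank-many factor vectors per mode
a = rng.normal(size=(rank, d1))
b = rng.normal(size=(rank, d2))
c = rng.normal(size=(rank, d3))

# W = sum_i a_i ⊗ b_i ⊗ c_i, a tensor of (at most) rank `rank`
W = np.einsum('ri,rj,rk->ijk', a, b, c)
print(W.shape)   # (8, 7, 6), but only rank*(d1+d2+d3) underlying numbers
```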
Modality-specific Decomposition
Retain the dimension of the multimodal representation $h$ during decomposition: each modality-specific factor keeps an output axis of size $|h|$, and only the modality dimensions are factorized.
① Decomposition of the Weight Tensor W
The fused representation is a linear function of the input tensor:

$$h = \mathcal{W} \cdot \mathcal{Z}, \qquad \mathcal{Z} = \begin{bmatrix} z_l \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_v \\ 1 \end{bmatrix}$$
① Decomposition of the Weight Tensor W (continued)
Decompose $\mathcal{W}$ into a sum of $r$ rank-1 modality-specific factors, each retaining the $|h|$ axis:

$$h = \left( w_l^{(1)} \otimes w_v^{(1)} + \cdots + w_l^{(r)} \otimes w_v^{(r)} \right) \cdot \mathcal{Z} = \left( \sum_{i=1}^{r} w_l^{(i)} \otimes w_v^{(i)} \right) \cdot \mathcal{Z}$$
② Decomposition of Z
The input tensor needs no approximation: by construction it is already rank 1, the exact outer product of the 1-appended modality vectors:

$$\mathcal{Z} = \begin{bmatrix} z_l \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_v \\ 1 \end{bmatrix}$$
③ Rearranging the Computation
Since both $\mathcal{W}$ and $\mathcal{Z}$ factor along the modalities, the large tensor contraction can be rearranged into per-modality projections that are combined afterwards, so neither $\mathcal{W}$ nor $\mathcal{Z}$ is ever built explicitly:

$$h = \mathcal{W} \cdot \mathcal{Z} = \left( \sum_{i=1}^{r} \bigotimes_{m=1}^{M} w_m^{(i)} \right) \cdot \left( \bigotimes_{m=1}^{M} z_m \right) = \sum_{i=1}^{r} \; \Lambda_{m=1}^{M} \left( w_m^{(i)} \cdot z_m \right)$$

where $z_m$ is the 1-appended input of modality $m$ and $\Lambda$ denotes the element-wise product of the $|h|$-dimensional projections.
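A small NumPy check of this rearrangement for the bimodal case: contracting the factor-built weight tensor with the full outer-product input gives the same $h$ as projecting each modality first and combining. Names and dimensions are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_l, d_v, d_h, r = 6, 5, 4, 3              # toy sizes (assumptions)

# Modality-specific factors; the output axis of size d_h is kept un-factorized.
W_l = rng.normal(size=(r, d_l + 1, d_h))   # language factors
W_v = rng.normal(size=(r, d_v + 1, d_h))   # visual factors

# Inputs with the appended 1.
z_l = np.append(rng.normal(size=d_l), 1.0)
z_v = np.append(rng.normal(size=d_v), 1.0)

# Full-tensor route: reconstruct W, build Z = z_l ⊗ z_v, contract.
W_full = np.einsum('rih,rjh->ijh', W_l, W_v)     # (d_l+1, d_v+1, d_h)
Z = np.outer(z_l, z_v)
h_full = np.einsum('ijh,ij->h', W_full, Z)

# Rearranged route: project each modality first, then combine.
proj_l = np.einsum('rih,i->rh', W_l, z_l)        # (r, d_h)
proj_v = np.einsum('rjh,j->rh', W_v, z_v)
h_low = (proj_l * proj_v).sum(axis=0)            # element-wise product, sum over rank

assert np.allclose(h_full, h_low)                # the two routes agree
```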
Low-rank Multimodal Fusion
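Below is a minimal PyTorch sketch of a fusion layer in this style, written from the equations above rather than taken from the authors' released implementation (linked on the final slide); the class name, initialization, and dimensions are my own choices.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Illustrative LMF-style fusion layer: the input tensor Z is never built."""

    def __init__(self, in_dims, out_dim, rank):
        super().__init__()
        # One set of factors per modality, shape (rank, d_m + 1, out_dim);
        # the +1 row corresponds to the appended constant 1.
        self.factors = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(rank, d + 1, out_dim)) for d in in_dims]
        )
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, inputs):
        # inputs: one (batch, d_m) tensor per modality
        ones = inputs[0].new_ones(inputs[0].size(0), 1)
        fused = None
        for z, factor in zip(inputs, self.factors):
            z1 = torch.cat([z, ones], dim=1)                  # append the constant 1
            proj = torch.einsum('bi,rio->rbo', z1, factor)    # (rank, batch, out_dim)
            fused = proj if fused is None else fused * proj   # element-wise across modalities
        return fused.sum(dim=0) + self.bias                   # combine the rank terms

# Usage with made-up language / visual / acoustic sizes.
fusion = LowRankFusion(in_dims=[300, 35, 74], out_dim=128, rank=4)
h = fusion([torch.randn(8, 300), torch.randn(8, 35), torch.randn(8, 74)])
print(h.shape)   # torch.Size([8, 128])
```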
Easily Scales to More Modalities
• Intra-modal interactions
• Cross-modal interactions
• Computational complexity
EXPERIMENTS AND RESULTS
Datasets
CMU-MOSI (Sentiment Analysis): 2,199 video segments from 93 single-speaker movie reviews; segment-level, real-valued sentiment annotations.
POM (Speaker Trait Recognition): 1,000 full single-speaker movie-review video clips; video-level categorical annotations for 16 types of speaker traits.
IEMOCAP (Emotion Recognition): 10,039 video segments from 302 videos of dyadic interactions; segment-level categorical annotations for 10 classes of emotions.
Comparison to Full-rank Tensor Fusion (CMU-MOSI)
LMF (Our Model) vs. TFN (Zadeh, et al., 2017):
            Correlation   Acc-2   F1     MAE    Acc-7
LMF         0.67          76.4    75.7   0.91   32.8
TFN         0.63          73.9    73.4   0.97   32.1
Comparison to Full-rank Tensor Fusion (CMU-MOSI, POM, IEMOCAP)
Bar charts of LMF vs. TFN: MAE and Correlation on CMU-MOSI and on POM, and F1-Happy and F1-Sad on IEMOCAP; LMF improves over TFN on each of these metrics.
Comparison with State-of-the-Art Approaches (CMU-MOSI)
Mean Absolute Error (MAE), lower is better:
LMF (our model)                                                 0.912
Memory Fusion Network, MFN (Zadeh, et al., 2018)                0.965
Multi-attention Recurrent Network, MARN (Zadeh, et al., 2018)   0.968
Tensor Fusion Network, TFN (Zadeh, et al., 2017)                0.970
Multi-view LSTM, MV-LSTM (Rajagopalan, et al., 2016)            1.019
Deep Fusion (Nojavanasghari, et al., 2016)                      1.143
Comparison with the Top-2 State-of-the-Art Approaches (CMU-MOSI, POM, IEMOCAP)
Bar charts of LMF vs. MFN and MARN (with TFN and MV-LSTM as reference): MAE and Correlation on CMU-MOSI and on POM, and F1-Angry and F1-Sad on IEMOCAP; LMF is on par with or better than these baselines on the reported metrics.
Efficiency Improvement (CMU-MOSI)
Efficiency metric: number of data samples processed per second, for training and for testing.
                              Training (samples/s)   Testing (samples/s)
LMF (Ours)                    1177.17                2249.9
TFN (Zadeh, et al., 2017)     340.74                 1134.82
Conclusions
• Captures intra-modal interactions
• Captures cross-modal interactions
• Reduces computational complexity
• Achieves state-of-the-art results
Thank you!
Code: https://github.com/Justin1904/Low-rank-Multimodal-Fusion
http://multicomp.cs.cmu.edu/