Multi-attention Recurrent Network for Human Communication Comprehension
Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, Louis-Philippe Morency
Presenter: Paul Pu Liang
Progress of Artificial Intelligence
• Robots and Virtual Agents
• Intelligent Personal Assistants
• Multimedia Content
Multimodal Communicative Behaviors
Language
• Lexicon (words)
• Syntax (part-of-speech, dependencies)
• Pragmatics (discourse acts)
Acoustic
• Prosody (intonation, voice quality)
• Vocal expressions (laughter, moans)
Visual
• Gestures (head gestures, eye gestures, arm gestures)
• Body language (body posture, proxemics)
• Eye contact (head gaze, eye gaze)
• Facial expressions (FACS action units; smile, frowning)
Sentiment
• Positive
• Negative
Emotion
• Anger
• Disgust
• Fear
• Happiness
• Sadness
• Surprise
Personality
• Confidence
• Persuasion
• Passion
Challenge 1: Intra-modal Dynamics
Example: within each modality, the speaker's behaviors unfold over time. Language: “This movie is great”. Visual: head nod, smile. Target: sentiment intensity.
Challenge 2: Cross-modal Dynamics
a) Multiple co-occurring interactions. Example: “This movie is great” with a smile and a loud voice. Target: sentiment intensity.
b) Different weighted combinations. Example: “This movie is fair” with a smile and a loud voice. Target: sentiment intensity.
c) Multiple prediction targets. Example: “This movie is great” with raised eyebrows and a loud voice. Targets: emotions (happy, surprised).
Multi-attention Recurrent Network (MARN)
1. Modeling intra-modal dynamics: Set of Long-short Term Memories
2. Modeling cross-modal dynamics: Set of Long-short Term Hybrid Memories + Single-attention Block
   Modeling multiple cross-modal dynamics: Set of Long-short Term Hybrid Memories + Multi-attention Block
Challenge 1: Intra-modal Dynamics
One LSTM per modality, unrolled over five time steps:
• language (LSTM_l): This, movie, is, great, -
• visual (LSTM_v): -, -, -, (smile), -
• acoustic (LSTM_a): -, -, -, -, (loud voice)
Challenge 2: Cross-modal Dynamics
How do we capture cross-modal dynamics continuously across time?
[Diagram: LSTM_l, LSTM_v, LSTM_a at two consecutive time steps]
Challenge 2: Single-attention Block
Captures cross-modal dynamics.
[Diagram, built up step by step: LSTM_l, LSTM_v, LSTM_a at two consecutive time steps]
• Inputs: the hidden states $h_t^l$ (language), $h_t^v$ (vision), $h_t^a$ (acoustic).
• An attention $a_t^1$ is computed from the hidden states and applied element-wise: $\tilde{h}_t^1 = a_t^1 \otimes h_t$.
• The attended states pass through per-modality networks $\mathcal{C}_l$, $\mathcal{C}_v$, $\mathcal{C}_a$, giving $s_t^l$, $s_t^v$, $s_t^a$.
• These are combined into $z_t$.
LSTM update (language, terms shown on the slide): $b^l + W^l x_{t+1}^l + U^l h_t^l$
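The single-attention path described on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the attention, the per-modality reductions, and the final combination are each assumed to be a single linear layer (the names `W_att`, `reducers`, and `W_out` are hypothetical), and all sizes and weights are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def single_attention_block(h, W_att, reducers, W_out):
    """h maps a modality name to its hidden state at time t."""
    names = list(h)                                  # e.g. ["l", "v", "a"]
    h_cat = np.concatenate([h[m] for m in names])    # hidden states side by side
    a = softmax(W_att @ h_cat)                       # one weight per hidden dimension
    h_tilde = a * h_cat                              # element-wise re-weighting
    # Split the attended vector back per modality and reduce each slice.
    s, start = [], 0
    for m in names:
        end = start + h[m].size
        s.append(np.tanh(reducers[m] @ h_tilde[start:end]))
        start = end
    z = np.tanh(W_out @ np.concatenate(s))           # cross-modal code z_t
    return a, z

rng = np.random.default_rng(0)
dims = {"l": 4, "v": 3, "a": 2}                      # assumed hidden sizes
total = sum(dims.values())
h = {m: rng.standard_normal(d) for m, d in dims.items()}
W_att = rng.standard_normal((total, total))
reducers = {m: rng.standard_normal((2, d)) for m, d in dims.items()}
W_out = rng.standard_normal((5, 2 * len(dims)))
a, z = single_attention_block(h, W_att, reducers, W_out)
print(a.shape, z.shape)
```

Because the attention is computed from the concatenation of all three hidden states, a large weight on a language dimension can depend on what the visual and acoustic states contain at the same time step.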
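Challenge 2: Long-short Term Hybrid Memory
LSTHM update (language, terms shown on the slide): $V^l z_t + b^l + W^l x_{t+1}^l + U^l h_t^l$
The new term $V^l z_t$ feeds the cross-modal code $z_t$ from the Single-attention Block back into each modality's memory.
[Diagram: same attention path as before ($h_t^l$, $h_t^v$, $h_t^a$; attention $a_t^1$; $\mathcal{C}_l$, $\mathcal{C}_v$, $\mathcal{C}_a$; $s_t^l$, $s_t^v$, $s_t^a$; $z_t$), with the LSTMs replaced by LSTHM_l, LSTHM_v, LSTHM_a]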
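A rough sketch of one hybrid-memory step follows, assuming the standard LSTM gate layout with one extra term that injects the cross-modal code z. The stacked parameter layout (input, forget, output, and candidate blocks in one matrix) and the function name `lsthm_step` are choices made for this illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lsthm_step(x_next, h, c, z, W, U, V, b):
    """One step for a single modality.

    W, U, V, b stack the input, forget, output and candidate blocks.
    """
    pre = W @ x_next + U @ h + V @ z + b   # V @ z is the hybrid (cross-modal) term
    i, f, o, g = np.split(pre, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_next = f * c + i * g                 # memory update
    h_next = o * np.tanh(c_next)           # new hidden state
    return h_next, c_next

rng = np.random.default_rng(0)
d_x, d_h, d_z = 6, 4, 5                    # assumed sizes
W = rng.standard_normal((4 * d_h, d_x))
U = rng.standard_normal((4 * d_h, d_h))
V = rng.standard_normal((4 * d_h, d_z))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
z = rng.standard_normal(d_z)
x_next = rng.standard_normal(d_x)

h_next, c_next = lsthm_step(x_next, h, c, z, W, U, V, b)
# With V set to zero the step reduces to a plain LSTM step.
h_plain, _ = lsthm_step(x_next, h, c, z, W, U, np.zeros_like(V), b)
print(h_next.shape, c_next.shape)
```

Setting `V` to zero makes the step ignore z, which is one way to see the hybrid memory as an LSTM with an additional cross-modal input.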
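Challenge 2: Multi-attention Block
[Diagram: the attention path is replicated, so several attentions (⋯) are applied in parallel to $h_t^l$ (language), $h_t^v$ (vision), $h_t^a$ (acoustic); the attended states pass through $\mathcal{C}_l$, $\mathcal{C}_v$, $\mathcal{C}_a$ to give $s_t^l$, $s_t^v$, $s_t^a$, which are combined into $z_t$ and fed to LSTHM_l, LSTHM_v, LSTHM_a at the next time step]
Multi-attention Recurrent Network (MARN)
[Diagram: the full model, LSTHM_l, LSTHM_v, LSTHM_a together with the Multi-attention Block]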
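The step from one attention to several can be illustrated as follows. This sketch assumes K attentions, each a softmax over the concatenated hidden state, with a single linear layer standing in for every network in the diagram; the names `W_att`, `reducers`, and `W_out` and all sizes are hypothetical.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def multi_attention_block(h, W_att, reducers, W_out, K):
    names = list(h)
    h_cat = np.concatenate([h[m] for m in names])        # shape (D,)
    scores = (W_att @ h_cat).reshape(K, h_cat.size)      # K sets of scores
    a = softmax_rows(scores)                             # each row sums to 1
    h_tilde = a * h_cat                                  # K re-weighted copies, shape (K, D)
    s, start = [], 0
    for m in names:
        end = start + h[m].size
        block = h_tilde[:, start:end].reshape(-1)        # K views of modality m
        s.append(np.tanh(reducers[m] @ block))           # reduce back to a small code
        start = end
    z = np.tanh(W_out @ np.concatenate(s))               # cross-modal code z_t
    return a, z

rng = np.random.default_rng(0)
K = 3
dims = {"l": 4, "v": 3, "a": 2}                          # assumed hidden sizes
D = sum(dims.values())
h = {m: rng.standard_normal(d) for m, d in dims.items()}
W_att = rng.standard_normal((K * D, D))
reducers = {m: rng.standard_normal((2, K * d)) for m, d in dims.items()}
W_out = rng.standard_normal((5, 2 * len(dims)))
a, z = multi_attention_block(h, W_att, reducers, W_out, K)
print(a.shape, z.shape)
```

Each row of `a` can concentrate on a different subset of dimensions, which is the property the later visualization slides examine.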
Experiments
Features
• Language: GloVe word embeddings
• Visual: Facet features (FACS action units, emotions)
• Acoustic: COVAREP features (MFCCs, pitch tracking)
• Alignment: word level, P2FA
Prediction targets
• Sentiment: positive, negative
• Emotion: anger, disgust, fear, happiness, sadness, surprise
• Personality: confidence, persuasion, passion
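Word-level alignment can be pictured as pooling frame-level visual or acoustic features inside each word's time span. The sketch below assumes mean pooling over half-open intervals `[start, end)` and uses made-up timestamps and features; the slide only states that alignment is at the word level using P2FA, so the pooling rule and the function name `align_to_words` are assumptions.

```python
import numpy as np

def align_to_words(frame_times, frame_feats, word_spans):
    """Average frame features inside each word's [start, end) interval."""
    frame_times = np.asarray(frame_times)
    frame_feats = np.asarray(frame_feats, dtype=float)
    aligned = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times < end)
        if mask.any():
            aligned.append(frame_feats[mask].mean(axis=0))
        else:
            # No frame fell inside this word: fall back to zeros.
            aligned.append(np.zeros(frame_feats.shape[1]))
    return np.stack(aligned)

# Toy data: six frames, two features per frame, three words.
frame_times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
frame_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0],
               [0.0, 4.0], [5.0, 5.0], [7.0, 7.0]]
word_spans = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6)]   # hypothetical word boundaries
word_feats = align_to_words(frame_times, frame_feats, word_spans)
print(word_feats)
```

After this step every modality has one feature vector per word, so the per-modality recurrent networks can advance in lockstep.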
Baseline Models
1. Non-temporal Models: SVM-MD, RF
2. Early Fusion: HMM, EF-LSTM, EF-HCRF, C-MKL, SAL-CNN
3. Late Fusion: DF, TFN, BC-LSTM
4. Multi-view Learning: MV-HMMs, MV-HCRFs, MV-LSTM
State-of-the-art Results: CMU-MOSI Sentiment Analysis
[Bar chart, vertical axis from 45 to 80. Baseline models: THMM, RF, EF-HCRF, MV-HCRF, SVM-MD, C-MKL, DF, SAL-CNN, EF-LSTM, MV-LSTM, BC-LSTM, TFN. Proposed: Multi-attention Recurrent Network (MARN).]
State-of-the-art Results
[Bar charts comparing the state-of-the-art baseline with the Multi-attention Recurrent Network (MARN), in three panels]
• Sentiment Analysis: CMU-MOSI, ICT-MMMO, MOUD, YouTube
• Emotion Recognition: IEMOCAP
• Personality Trait Prediction: POM Confidence, POM Persuasion, POM Passion, POM Credibility
Multi-attention Block is Important
[Bar charts comparing the model with no Multi-attention Block against the Multi-attention Recurrent Network (MARN), in three panels]
• Sentiment Analysis: CMU-MOSI, ICT-MMMO, MOUD, YouTube
• Emotion Recognition: IEMOCAP
• Personality Trait Prediction: POM Confidence, POM Persuasion, POM Passion, POM Credibility
Multiple Attentions are Important
[Two line charts: YouTube Sentiment Analysis and CMU-MOSI Sentiment Analysis. Horizontal axis: number of attentions.]
Visualization
• Attentions show diversity and are sensitive to different cross-modal dynamics.
• Some attentions are always inactive: they carry only intra-modal dynamics and no cross-modal dynamics.
• Attentions change behavior across time; some changes are more drastic than others.
• Different attentions focus on different modalities.
[Heatmaps of attention activations over time; color scale from inactive to active]
Multi-attention Recurrent Network (MARN)
1. Modeling intra-modal dynamics: Set of Long-short Term Memories
2. Modeling cross-modal dynamics: Set of Long-short Term Hybrid Memories + Single-attention Block
   Modeling multiple cross-modal dynamics: Set of Long-short Term Hybrid Memories + Multi-attention Block
The End!
Code: https://github.com/A2Zadeh/MARN
Email: pliang@cs.cmu.edu
Workshop @ ACL 2018: First Workshop on Computational Modeling of Human Multimodal Language
multicomp.cs.cmu.edu/acl2018multimodalchallenge/