March 2020
MAttNet: Modular Attention Network for Referring Expression Comprehension
Tong Gao
Background
• Referring expressions are natural language utterances that indicate particular objects within a scene
• Most prior work uses a concatenation of all features as input and an LSTM to encode/decode the whole expression, ignoring the internal structure of referring expressions
Introduction
• MAttNet is the first modular network for the general referring expression comprehension task
• It decomposes the referring expression into three phrase embeddings, which are used to trigger visual modules for:
  – Subject
  – Location
  – Relationship
Our Model - Workflow
Given a candidate object o_i and referring expression r:
1. Language Attention Network -> 3 phrase embeddings
2. Three visual modules -> matching scores for o_i against the phrase embeddings
3. Weighted combination of these scores -> overall matching score S(o_i, r)
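A minimal sketch of step 3, assuming (as in the paper) that the language attention network also predicts softmax-normalized module weights [w_subj, w_loc, w_rel]; names and shapes here are illustrative, not the authors' code:

```python
import torch

def combine_scores(s_subj, s_loc, s_rel, weights):
    """Weighted combination of the three module scores into S(o_i, r).

    s_*:     (batch,) matching scores from the three visual modules
    weights: (batch, 3) module weights [w_subj, w_loc, w_rel],
             predicted from the expression by the language network
    """
    scores = torch.stack([s_subj, s_loc, s_rel], dim=1)  # (batch, 3)
    return (weights * scores).sum(dim=1)                 # (batch,)
```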
Language Attention Network
Language Attention Network with 3 individual phrase embeddings, each computed with a trainable attention vector f_m and constructed on the word embeddings
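A sketch of this step, assuming a bi-LSTM encoder and one trainable vector f_m per module as in the paper; dimensions and naming are assumptions:

```python
import torch
import torch.nn as nn

class LanguageAttention(nn.Module):
    """Produce three phrase embeddings (subject, location, relationship).

    A bi-LSTM encodes the expression; a trainable vector f_m per module
    attends over the hidden states, and each phrase embedding is the
    attention-weighted sum of the word embeddings.
    """
    def __init__(self, vocab_size, word_dim=300, hidden_dim=512, n_modules=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.f = nn.Parameter(torch.randn(n_modules, 2 * hidden_dim))

    def forward(self, words):
        e = self.embed(words)                      # (B, T, word_dim)
        h, _ = self.lstm(e)                        # (B, T, 2*hidden_dim)
        attn = torch.einsum('btd,md->bmt', h, self.f).softmax(dim=-1)
        q = torch.einsum('bmt,btw->bmw', attn, e)  # (B, 3, word_dim)
        return q
```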
Visual Modules
• Backbone: Faster R-CNN
• ResNet as feature extractor
• Crop the C3 feature for each o_i, and further compute the C4 feature
• In the end, compute the matching scores:
  – Subject: S(o_i | q^subj)
  – Location: S(o_i | q^loc)
  – Relationship: S(o_i | q^rel)
Visual Modules – Subject Module – “woman in red”
Visual Modules – Subject Module – “woman in red”
1. Compute attention scores over the spatial grid features V, based on V and q^subj
2. Get the subject visual representation v_i^subj by attention-weighted pooling of V
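A sketch of this "attend within the box" step; layer names and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class SubjectAttention(nn.Module):
    """Pool a subject visual representation v_subj from grid features V."""
    def __init__(self, feat_dim=2048, phrase_dim=300, attn_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)
        self.proj_q = nn.Linear(phrase_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, V, q_subj):
        # V: (B, G, feat_dim) grid features; q_subj: (B, phrase_dim)
        h = torch.tanh(self.proj_v(V) + self.proj_q(q_subj).unsqueeze(1))
        a = self.score(h).squeeze(-1).softmax(dim=-1)  # (B, G) attention
        v_subj = torch.einsum('bg,bgd->bd', a, V)      # weighted pooling
        return v_subj, a
```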
Visual Modules – Location Module - “cat on the right”
• 5-d vector l_i = [x_tl/W, y_tl/H, x_br/W, y_br/H, (w·h)/(W·H)], encoding the top-left and bottom-right positions and the area relative to the image
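A one-line sketch of this encoding, assuming boxes are given as pixel coordinates (x_tl, y_tl, x_br, y_br):

```python
def location_feature(box, W, H):
    """5-d location vector: normalized corners plus relative area."""
    x_tl, y_tl, x_br, y_br = box
    area = (x_br - x_tl) * (y_br - y_tl)
    return [x_tl / W, y_tl / H, x_br / W, y_br / H, area / (W * H)]
```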
Visual Modules – Location Module - “second left person”
• 5-d vector δl_ij = [Δx_tl/w_i, Δy_tl/h_i, Δx_br/w_i, Δy_br/h_i, (w_j·h_j)/(w_i·h_i)]
• Encodes the relative location to same-category neighbors (up to five)
Visual Modules – Relationship Module - “cat on chaise lounge”
• 5-d vector δm_ij = [Δx_tl/w_i, Δy_tl/h_i, Δx_br/w_i, Δy_br/h_i, (w_j·h_j)/(w_i·h_i)]
• Looks at surrounding objects regardless of their categories
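δl_ij and δm_ij share the same form, so a single helper covers both; the box convention (x_tl, y_tl, x_br, y_br) is an assumption:

```python
def relative_offset(box_i, box_j):
    """5-d offset of neighbor box_j relative to candidate box_i:
    corner deltas normalized by the candidate's width/height,
    plus the area ratio."""
    xi1, yi1, xi2, yi2 = box_i
    xj1, yj1, xj2, yj2 = box_j
    w_i, h_i = xi2 - xi1, yi2 - yi1
    w_j, h_j = xj2 - xj1, yj2 - yj1
    return [(xj1 - xi1) / w_i, (yj1 - yi1) / h_i,
            (xj2 - xi2) / w_i, (yj2 - yi2) / h_i,
            (w_j * h_j) / (w_i * h_i)]
```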
Loss Function
• Randomly sample two negative pairs (o_i, r_j) and (o_k, r_i) for each positive pair (o_i, r_i)
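A sketch of the hinge-based ranking loss over these sampled pairs; the margin and λ weights are placeholders, not the paper's tuned values:

```python
import torch

def ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1,
                 lambda1=1.0, lambda2=1.0):
    """Hinge-based ranking loss over the sampled pairs.

    s_pos:      S(o_i, r_i), score of the matched pair
    s_neg_expr: S(o_i, r_j), same object with a negative expression
    s_neg_obj:  S(o_k, r_i), negative object with the true expression
    """
    loss = (lambda1 * torch.clamp(margin + s_neg_expr - s_pos, min=0)
            + lambda2 * torch.clamp(margin + s_neg_obj - s_pos, min=0))
    return loss.mean()
```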
Datasets

                              RefCOCO, RefCOCO+            RefCOCOg
Collected in                  Interactive game interface   Non-interactive setting
Average expression length     3.5                          8.4
Same-type objects per image   3.9                          1.63
Absolute location words       Yes                          No
Datasets - Splitting

RefCOCO, RefCOCO+:
• For evaluation: Test A (persons) and Test B (objects)
• No overlap between training, validation and testing sets

RefCOCOg:
• First partition: split by objects
  – Same images could appear in training and validation sets
  – Testing set not released
• Second partition: randomly split into training, validation and test sets
Evaluation
Ablation Study
Incorrect Examples
Critique
• Focuses on a specific domain (referring expressions) and carefully designs the model with prior knowledge
• Compared to similar work, it utilizes more visual hidden features: C3 & C4 features from ResNet
• Takes the unbalanced-data issue into account (in the loss function for attribute prediction)
• Good comparisons and ablation study
Critique
• The location module & relationship module may double-count the same object; should this case be considered?
• In the relationship module, they use an unusual encoding of relative object locations, dependent on the width & height of the given object o_i. Why not use the image dimensions W and H?
• They could add pairs of a ground-truth expression and an object of the same type as negative examples
Critique
• Could the model skip synonyms when selecting the top-5 attributes, so as to capture more attribute information?
Thank you!