MAttNet: Modular Attention Network for Referring Expression Comprehension - PowerPoint PPT Presentation

March 2020. MAttNet: Modular Attention Network for Referring Expression Comprehension. Tong Gao.


  1. March 2020. MAttNet: Modular Attention Network for Referring Expression Comprehension. Tong Gao

  2. Background
  • Referring expressions are natural language utterances that indicate particular objects within a scene
  • Most prior work uses a concatenation of all features as input and an LSTM to encode/decode the whole expression, ignoring the internal structure of referring expressions

  3. Introduction
  • MAttNet is the first modular network for the general referring expression comprehension task
  • It decomposes the referring expression into three phrase embeddings, which are used to trigger visual modules for:
    – Subject
    – Location
    – Relationship

  4. Our Model - Workflow
  Given a candidate object o_i and referring expression r:
  1. Language Attention Network -> 3 phrase embeddings
  2. Three visual modules -> matching scores of o_i against the phrase embeddings
  3. Weighted combination of these scores -> overall matching score S(o_i | r), written out below
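  Concretely, step 3 is the weighted sum from the paper, with module weights w_subj, w_loc, w_rel predicted by the Language Attention Network:

      S(o_i | r) = w_subj · S(o_i | q^subj) + w_loc · S(o_i | q^loc) + w_rel · S(o_i | q^rel)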

  5. Language Attention Network

  6. Language Attention Network with 3 individual trainable vectors f_m, which attend over the word embeddings to produce the three phrase embeddings q^m
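  A minimal numpy sketch of this step, assuming word embeddings E and bi-LSTM hidden states H are already computed (names and shapes are illustrative, not the authors' code):

      import numpy as np

      def softmax(x):
          x = x - x.max()
          e = np.exp(x)
          return e / e.sum()

      def phrase_embeddings(E, H, F):
          # E: (T, d_e) word embeddings; H: (T, d_h) bi-LSTM states
          # F: (3, d_h) trainable attention vectors f_m (subj, loc, rel)
          Q = []
          for f_m in F:
              a = softmax(H @ f_m)   # attention over the T words
              Q.append(a @ E)        # q^m = sum_t a_{m,t} * e_t
          return Q                   # [q_subj, q_loc, q_rel]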

  7. Visual Modules
  • Backbone: Faster R-CNN
  • ResNet as feature extractor
  • Crop the C3 feature for each o_i, and further compute the C4 feature
  • In the end, compute the matching scores (one plausible form is sketched below):
    – Subject: S(o_i | q^subj)
    – Location: S(o_i | q^loc)
    – Relationship: S(o_i | q^rel)
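  The slides do not spell out S(·|·); a common choice for such matching scores, and roughly what the paper describes, is an inner product of L2-normalized projections into a joint space (Wv and Wq below are hypothetical stand-ins for learned projections):

      import numpy as np

      def l2norm(x):
          return x / (np.linalg.norm(x) + 1e-8)

      def matching_score(v, q, Wv, Wq):
          # Project visual representation v and phrase embedding q into
          # a joint space, normalize, and take the inner product.
          return float(l2norm(Wv @ v) @ l2norm(Wq @ q))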

  8. Visual Modules – Subject Module – “woman in red”

  9. Visual Modules – Subject Module – “woman in red”
  1. Compute attention scores over the feature map V based on q^subj
  2. Get the subject visual representation ṽ_i^subj by attention-weighted pooling
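  A sketch of the two steps, assuming V holds the candidate's spatial features flattened to shape (HW, d) and w_att is a hypothetical learned attention weight vector:

      import numpy as np

      def subject_representation(V, q_subj, w_att):
          # Step 1: score each spatial location against the subject phrase.
          scores = np.array([w_att @ np.concatenate([v, q_subj]) for v in V])
          a = np.exp(scores - scores.max())
          a = a / a.sum()    # softmax attention over the grid
          # Step 2: attention-weighted pooling -> v~_i^subj
          return a @ V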

  10. Visual Modules – Location Module - “cat on the right”
  • 5-d vector l_i = [x_tl/W, y_tl/H, x_br/W, y_br/H, (w·h)/(W·H)], encoding the top-left position, bottom-right position, and area of o_i relative to the image
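  A direct implementation of l_i (box given as corner coordinates in an image of size W x H):

      def location_feature(box, W, H):
          # box = (x_tl, y_tl, x_br, y_br)
          x1, y1, x2, y2 = box
          # Normalized corners plus area relative to the image.
          return [x1 / W, y1 / H, x2 / W, y2 / H,
                  ((x2 - x1) * (y2 - y1)) / (W * H)]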

  11. Visual Modules – Location Module - “second left person”
  • 5-d vector δl_ij = [[Δx_tl]_ij/w_i, [Δy_tl]_ij/h_i, [Δx_br]_ij/w_i, [Δy_br]_ij/h_i, (w_j·h_j)/(w_i·h_i)]
  • Encodes the relative location to same-category neighbors (up to five)
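  One reading of this offset encoding, normalizing by the candidate's own width and height (the helper name is mine):

      def offset_feature(box_i, box_j):
          # 5-d delta-l_ij between candidate o_i and neighbor o_j.
          x1, y1, x2, y2 = box_i
          u1, v1, u2, v2 = box_j
          w_i, h_i = x2 - x1, y2 - y1
          return [(u1 - x1) / w_i, (v1 - y1) / h_i,       # top-left offsets
                  (u2 - x2) / w_i, (v2 - y2) / h_i,       # bottom-right offsets
                  ((u2 - u1) * (v2 - v1)) / (w_i * h_i)]  # area ratio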

  12. Visual Modules – Relationship Module - “cat on chaise lounge”
  • 5-d vector δm_ij, encoding the same relative offsets as δl_ij
  • Looks at surrounding objects regardless of their categories
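  The module scores each of the (up to five) surrounding objects against q^rel and keeps the best match; a sketch, reusing a generic score_fn:

      def relationship_score(neighbor_feats, q_rel, score_fn):
          # neighbor_feats: one feature per surrounding object o_j (its
          # visual feature concatenated with delta-m_ij); take the max.
          return max(score_fn(f, q_rel) for f in neighbor_feats)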

  13. Loss Function
  • Randomly sample two negative pairs (o_i, r_j) and (o_k, r_i): a mismatched expression for the true object, and a mismatched object for the true expression
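  With these negatives, the paper's ranking objective is a combined hinge loss (Δ is the margin; λ1 and λ2 weight the two negative types):

      L_rank = Σ_i [ λ1 · max(0, Δ + S(o_i | r_j) − S(o_i | r_i))
                   + λ2 · max(0, Δ + S(o_k | r_i) − S(o_i | r_i)) ]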

  14. Datasets
                                 RefCOCO, RefCOCO+            RefCOCOg
  Collected in                   interactive game interface   non-interactive setting
  Average expression length      3.5 words                    8.4 words
  Same-type objects per image    3.9                          1.63
  Absolute location words        yes                          no

  15. Datasets – Splitting
  RefCOCO, RefCOCO+:
  • For evaluation: Test A (persons) and Test B (objects)
  • No overlap between training, validation and testing sets
  RefCOCOg:
  • First partition: split by objects; the same images could appear in the training and validation sets; no testing set (not released)
  • Second partition: randomly split into training, validation and test sets

  16. Evaluation

  17. Ablation Study

  18. Ablation Study

  19. Incorrect examples

  20. Critique
  • Focuses on a specific domain (referring expressions) and carefully designs the model with prior knowledge
  • Compared to similar works, they utilize more visual hidden features: the C3 & C4 features from ResNet
  • They take unbalanced-data issues into account (in the loss function for attribute prediction)
  • Good comparison and ablation study

  21. Critique
  • The location module & relationship module may double-count the same object: should this case be considered?
  • In the relationship module, they use an unusual encoding of relative object locations, dependent on the width & height of the given object o_i. Why not use W and H?
  • They could add pairs of a ground-truth expression and an object of the same type as negative examples

  22. Critique
  • Could the model skip synonyms when selecting the top-5 attributes, so as to capture more attribute information?

  23. Thank you!
