March 2020
MAttNet: Modular Attention Network for Referring Expression Comprehension
Tong Gao
Background
• Referring expressions are natural language utterances that indicate particular objects within a scene
• Most prior work uses a concatenation of all features as input and an LSTM to encode/decode the whole expression, ignoring the internal structure of referring expressions
Introduction
• MAttNet is the first modular network for the general referring expression comprehension task
• It decomposes the referring expression into three phrase embeddings, which are used to trigger visual modules for:
  – Subject
  – Location
  – Relationship
Our Model - Workflow
Given a candidate object o_i and referring expression r:
1. Language Attention Network -> 3 phrase embeddings
2. Three visual modules -> matching scores for o_i against the phrase embeddings
3. Weighted combination of these scores -> overall matching score S(o_i, r)
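A minimal sketch of step 3, assuming (as in the paper) that the language attention network also predicts softmax-normalized module weights [w_subj, w_loc, w_rel]; names and shapes here are illustrative, not the authors' code:

```python
import torch

def combine_scores(s_subj, s_loc, s_rel, weights):
    """Weighted combination of the three module scores into S(o_i, r).

    s_*:     (batch,) matching scores from the three visual modules
    weights: (batch, 3) module weights [w_subj, w_loc, w_rel],
             predicted from the expression by the language network
    """
    scores = torch.stack([s_subj, s_loc, s_rel], dim=1)  # (batch, 3)
    return (weights * scores).sum(dim=1)                 # (batch,)
```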
Language Attention Network
Language Attention Network with 3 individual phrase embeddings, each computed with a trainable attention vector f_m and constructed on the word embeddings
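A sketch of this step, assuming a bi-LSTM encoder and one trainable vector f_m per module as in the paper; dimensions and naming are assumptions:

```python
import torch
import torch.nn as nn

class LanguageAttention(nn.Module):
    """Produce three phrase embeddings (subject, location, relationship).

    A bi-LSTM encodes the expression; a trainable vector f_m per module
    attends over the hidden states, and each phrase embedding is the
    attention-weighted sum of the word embeddings.
    """
    def __init__(self, vocab_size, word_dim=300, hidden_dim=512, n_modules=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.f = nn.Parameter(torch.randn(n_modules, 2 * hidden_dim))

    def forward(self, words):
        e = self.embed(words)                      # (B, T, word_dim)
        h, _ = self.lstm(e)                        # (B, T, 2*hidden_dim)
        attn = torch.einsum('btd,md->bmt', h, self.f).softmax(dim=-1)
        q = torch.einsum('bmt,btw->bmw', attn, e)  # (B, 3, word_dim)
        return q
```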
Visual Modules
• Backbone: Faster R-CNN
• ResNet as feature extractor
• Crop the C3 feature for each o_i, and further compute the C4 feature
• In the end, compute the matching scores:
  – Subject: S(o_i | q^subj)
  – Location: S(o_i | q^loc)
  – Relationship: S(o_i | q^rel)
Visual Modules – Subject Module – “woman in red”
Visual Modules – Subject Module – “woman in red”
1. Compute attention scores over the spatial grid features V, based on V and q^subj
2. Get the subject visual representation v_i^subj by attention-weighted pooling of V
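A sketch of this "attend within the box" step; layer names and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class SubjectAttention(nn.Module):
    """Pool a subject visual representation v_subj from grid features V."""
    def __init__(self, feat_dim=2048, phrase_dim=300, attn_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)
        self.proj_q = nn.Linear(phrase_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, V, q_subj):
        # V: (B, G, feat_dim) grid features; q_subj: (B, phrase_dim)
        h = torch.tanh(self.proj_v(V) + self.proj_q(q_subj).unsqueeze(1))
        a = self.score(h).squeeze(-1).softmax(dim=-1)  # (B, G) attention
        v_subj = torch.einsum('bg,bgd->bd', a, V)      # weighted pooling
        return v_subj, a
```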
Visual Modules – Location Module - “cat on the right”
• 5-d vector l_i = [x_tl/W, y_tl/H, x_br/W, y_br/H, (w·h)/(W·H)], encoding the top-left and bottom-right positions and the area relative to the image
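A one-line sketch of this encoding, assuming boxes are given as pixel coordinates (x_tl, y_tl, x_br, y_br):

```python
def location_feature(box, W, H):
    """5-d location vector: normalized corners plus relative area."""
    x_tl, y_tl, x_br, y_br = box
    area = (x_br - x_tl) * (y_br - y_tl)
    return [x_tl / W, y_tl / H, x_br / W, y_br / H, area / (W * H)]
```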
Visual Modules – Location Module - “second left person”
• 5-d vector δl_ij = [Δx_tl/w_i, Δy_tl/h_i, Δx_br/w_i, Δy_br/h_i, (w_j·h_j)/(w_i·h_i)]
• Encodes the relative location to same-category neighbors (up to five)
Visual Modules – Relationship Module - “cat on chaise lounge”
• 5-d vector δm_ij = [Δx_tl/w_i, Δy_tl/h_i, Δx_br/w_i, Δy_br/h_i, (w_j·h_j)/(w_i·h_i)]
• Looks at surrounding objects regardless of their categories
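δl_ij and δm_ij share the same form, so a single helper covers both; the box convention (x_tl, y_tl, x_br, y_br) is an assumption:

```python
def relative_offset(box_i, box_j):
    """5-d offset of neighbor box_j relative to candidate box_i:
    corner deltas normalized by the candidate's width/height,
    plus the area ratio."""
    xi1, yi1, xi2, yi2 = box_i
    xj1, yj1, xj2, yj2 = box_j
    w_i, h_i = xi2 - xi1, yi2 - yi1
    w_j, h_j = xj2 - xj1, yj2 - yj1
    return [(xj1 - xi1) / w_i, (yj1 - yi1) / h_i,
            (xj2 - xi2) / w_i, (yj2 - yi2) / h_i,
            (w_j * h_j) / (w_i * h_i)]
```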
Loss Function
• Randomly sample two negative pairs (o_i, r_j) and (o_k, r_i) for each positive pair (o_i, r_i)
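A sketch of the hinge-based ranking loss over these sampled pairs; the margin and λ weights are placeholders, not the paper's tuned values:

```python
import torch

def ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1,
                 lambda1=1.0, lambda2=1.0):
    """Hinge-based ranking loss over the sampled pairs.

    s_pos:      S(o_i, r_i), score of the matched pair
    s_neg_expr: S(o_i, r_j), same object with a negative expression
    s_neg_obj:  S(o_k, r_i), negative object with the true expression
    """
    loss = (lambda1 * torch.clamp(margin + s_neg_expr - s_pos, min=0)
            + lambda2 * torch.clamp(margin + s_neg_obj - s_pos, min=0))
    return loss.mean()
```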
Datasets

                              RefCOCO, RefCOCO+            RefCOCOg
Collected in                  Interactive game interface   Non-interactive setting
Average expression length     3.5                          8.4
Same-type objects per image   3.9                          1.63
Absolute location words       Yes                          No
Datasets - Splitting

RefCOCO, RefCOCO+:
• For evaluation: Test A (persons) and Test B (objects)
• No overlap between training, validation and testing sets

RefCOCOg:
• First partition: split by objects
  – Same images could appear in training and validation sets
  – Testing set not released
• Second partition: randomly split into training, validation and test sets
Evaluation
Ablation Study
Incorrect Examples
Critique
• Focuses on a specific domain (referring expressions) and carefully designs the model with prior knowledge
• Compared to similar work, it utilizes more visual hidden features: C3 & C4 features from ResNet
• Takes the unbalanced-data issue into account (in the loss function for attribute prediction)
• Good comparisons and ablation study
Critique
• The location module & relationship module may double-count the same object; should this case be considered?
• In the relationship module, they use an unusual encoding of relative object locations, dependent on the width & height of the given object o_i. Why not use the image dimensions W and H?
• They could add pairs of a ground-truth expression and an object of the same type as negative examples
Critique
• Could the model skip synonyms when selecting the top-5 attributes, so as to capture more attribute information?
Thank you!