Language-Driven Visual Reasoning for Referring Expression Comprehension. Guanbin Li, School of Data and Computer Science, Sun Yat-sen University. VALSE 2019-12-18
Outline
1. Introduction and Related Work
2. Cross-Modal Relationship Inference for Grounding Referring Expressions, CVPR 2019
3. Dynamic Graph Attention for Referring Expression Comprehension, ICCV 2019
4. Conclusion and Future Work Discussion
Introduction: Referring Expression Comprehension vs. Classic Image Understanding
1. The sheep in the middle → Sheep 2
2. The fattest sheep → Sheep 3
3. The sheep farthest from the grass → Sheep 1
Introduction: Referring expression comprehension requires relationship modeling and reasoning, and may also require common-sense knowledge. Examples:
1. The hat worn by the man bending over and stroking the dog
2. The hat on the guy to the left of the man in the yellow shirt
1. The lady to the right of the waiter
2. The person who ordered the dish served by the waiter
Related Work: (Nagaraja et al., ECCV 2016); (Rohrbach et al., ECCV 2016)
Related Work: Modular Attention Network (CVPR 2018), built on a subject-verb-object style decomposition of the expression; Accumulated Co-Attention Method (CVPR 2018)
Cross-Modal Relationship Inference (CVPR 2019)
Motivation: Extracting and modeling relationships (both first-order and multi-order) is essential for visual grounding. Graph-based information propagation helps explicitly capture multi-order relationships.
Our proposed method:
1. A language-guided visual relation graph.
2. Gated graph convolutional network based feature propagation for semantic context modeling.
3. Triplet loss with online hard negative sample mining.
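The third component, a triplet-style ranking loss with online hard negative mining, can be sketched as below. This is a minimal illustration, not the paper's exact formulation: the function name, margin value, and the restriction to mining hard negative objects (rather than also mining hard negative expressions) are assumptions.

```python
import numpy as np

def triplet_loss_hard_negative(scores, pos_idx, margin=0.1):
    """Triplet-style ranking loss with online hard negative mining.

    scores:  1-D array of matching scores between the expression and
             each candidate object.
    pos_idx: index of the ground-truth object.
    The hardest negative is the highest-scoring non-ground-truth object;
    the loss pushes the positive above it by at least `margin`.
    """
    pos_score = scores[pos_idx]
    neg_scores = np.delete(scores, pos_idx)
    hard_neg = neg_scores.max()                 # online hard negative
    return max(0.0, margin + hard_neg - pos_score)
```

Mining only the hardest negative per example keeps the loss focused on the most confusable distractor object in the image.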
Language-Guided Visual Relation Graph: spatial relation graph construction. Each directed edge between a pair of objects is assigned an index label identifying the type of spatial relationship between them.
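One way to assign such edge labels can be sketched as follows. The label set here (inside / cover / eight compass directions) is an illustrative simplification based on relative box geometry; the slide does not spell out the exact relation vocabulary or thresholds.

```python
import math

def spatial_relation(box_a, box_b):
    """Assign a discrete spatial-relation label to the directed edge a -> b.

    Boxes are (x1, y1, x2, y2). Containment is checked first; otherwise
    the centre-to-centre angle is bucketed into 8 compass directions.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    if bx1 >= ax1 and by1 >= ay1 and bx2 <= ax2 and by2 <= ay2:
        return "cover"      # b lies inside a
    if ax1 >= bx1 and ay1 >= by1 and ax2 <= bx2 and ay2 <= by2:
        return "inside"     # a lies inside b
    ca = ((ax1 + ax2) / 2, (ay1 + ay2) / 2)
    cb = ((bx1 + bx2) / 2, (by1 + by2) / 2)
    angle = math.degrees(math.atan2(cb[1] - ca[1], cb[0] - ca[0])) % 360
    directions = ["right", "lower-right", "below", "lower-left",
                  "left", "upper-left", "above", "upper-right"]
    return directions[int(((angle + 22.5) % 360) // 45)]
```

Labeling edges this way turns raw bounding boxes into a typed spatial relation graph that the later language-guided steps can operate on.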
Language-Guided Visual Relation Graph Construction:
1. Given an expression, a bidirectional LSTM extracts a feature for each word.
2. Each word is assigned a type (entity, relation, absolute location, or unnecessary word), giving a global language context. The weighted, normalized attention of each word with respect to a vertex then yields the language context at that vertex.
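The per-vertex language context in step 2 amounts to attention-weighted pooling of the word features. A minimal sketch, assuming softmax-normalized attention (the function and argument names are illustrative, and the BiLSTM producing `word_feats` is elided):

```python
import numpy as np

def language_context(word_feats, attn_logits):
    """Weighted language context for one vertex.

    word_feats:  (L, d) word features, e.g. BiLSTM hidden states.
    attn_logits: (L,) unnormalised attention of each word w.r.t. the vertex.
    Returns the attention-weighted sum of word features, mirroring the
    'weighted normalized attention' described on the slide.
    """
    attn = np.exp(attn_logits - attn_logits.max())
    attn = attn / attn.sum()                 # softmax over words
    return attn @ word_feats                 # (d,) context vector
```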
3. The language-guided visual relation graph combines the spatial relation graph with the per-vertex language context; together with the multi-modal node features, this defines the language-guided multi-modal graph.
Language-Guided Visual Relation Graph: the n-th gated graph convolution operation at each vertex aggregates features from its neighbors, with language-derived gates modulating each neighbor's contribution.
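The shape of one such gated graph convolution step can be sketched as below. This is an assumed, simplified form (shared weight matrix, tanh nonlinearity, residual update); the paper's exact gating and parameterization are not reproduced on the slide.

```python
import numpy as np

def gated_graph_conv(H, A, W, gates):
    """One gated graph-convolution step (illustrative form).

    H:     (N, d) node features.
    A:     (N, N) adjacency matrix (assumed row-normalised).
    W:     (d, d) shared linear transform.
    gates: (N, N) language-derived gates in [0, 1] that modulate how
           much each neighbour contributes to the aggregation.
    Returns updated node features with a residual connection.
    """
    messages = (A * gates) @ (H @ W)   # gated neighbourhood aggregation
    return H + np.tanh(messages)       # residual update
```

Stacking several such steps propagates context along multi-hop paths, which is how multi-order relationships get captured.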
Experiments
Evaluation datasets: RefCOCO, RefCOCO+, and RefCOCOg.
Evaluation metric: Precision@1 (the fraction of expressions for which the top-1 prediction is correct).
Comparison with state-of-the-art approaches on RefCOCO, RefCOCO+, and RefCOCOg.
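The Precision@1 metric is straightforward to compute; a minimal sketch (function and argument names are illustrative):

```python
import numpy as np

def precision_at_1(scores, gt_indices):
    """Precision@1: fraction of expressions whose top-scoring
    candidate object is the ground-truth one.

    scores:     (B, N) matching scores for B expressions x N objects.
    gt_indices: (B,) index of the ground-truth object per expression.
    """
    preds = scores.argmax(axis=1)
    return (preds == np.asarray(gt_indices)).mean()
```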
Experiments: ablation study on variants of the proposed CMRIN on RefCOCO, RefCOCO+, and RefCOCOg.
1. global langcxt + vis instance: visual feature + location feature, matched against the last hidden state of the LSTM.
2. global langcxt + global viscxt (2): GCN on the spatial relation graph.
3. weighted langcxt + guided viscxt: gated GCN on the language-guided visual relation graph.
4. weighted langcxt + guided viscxt + fusion: gated GCN on the cross-modal relation graph.
Visualization results for "an elephant between two other elephants": input image, per-object initial attention scores, final matching scores, and the grounding result.
Visualization results for "green plant behind a table visible behind a lady's head" and "sandwich in center row all the way on right": input image, objects, initial attention scores, final matching scores, and the grounding results.
Dynamic Graph Attention (ICCV 2019)
Motivation: Referring expression comprehension inherently requires visual reasoning on top of the relationships among the objects in the image, e.g. "the umbrella held by the person in the pink hat". Human visual reasoning for grounding is guided by the linguistic structure of the referring expression.
Our proposed method:
1. Specify the reasoning process as a sequence of constituent expressions.
2. A dynamic graph attention network performs multi-step visual reasoning, identifying compound objects by following the predicted reasoning process.
Dynamic Graph Attention Network: three components.
1. Graph construction: a visual graph over the detected objects, extended to a multi-modal graph.
2. Linguistic structure analysis: the expression is decomposed into constituent expressions that provide the guidance for reasoning.
3. Step-wise dynamic reasoning: performed on top of the graph under this guidance, highlighting relevant edges and nodes to identify compound objects.
Graph construction: a directed graph is built over the detected objects, and extended to a multi-modal graph by fusing each node's visual feature with word-embedding information from the expression.
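The slide only states that the visual graph is directed; one common construction, assumed here for illustration, connects each object to its k nearest neighbors by box centre distance:

```python
import numpy as np

def knn_directed_graph(centers, k=2):
    """Directed visual graph over detected objects (illustrative).

    centers: (N, 2) bounding-box centres.
    Returns a list of directed edges (i, j), where each object i points
    to its k nearest neighbours j.
    """
    centers = np.asarray(centers, dtype=float)
    edges = []
    for i, c in enumerate(centers):
        d = np.linalg.norm(centers - c, axis=1)
        d[i] = np.inf                      # exclude self-loops
        for j in np.argsort(d)[:k]:
            edges.append((i, int(j)))
    return edges
```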
Language-Guided Visual Reasoning Process: the overall expression is encoded with a bidirectional LSTM, and the reasoning process is modeled as a sequence of constituent expressions, each a soft distribution over the words in the expression.
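The constituent expressions can be sketched as per-step soft attention over the word features. The per-step query vectors here are an assumption for illustration; the slide does not specify how the steps are parameterized.

```python
import numpy as np

def constituent_distributions(word_feats, step_queries):
    """Soft constituent expressions, one per reasoning step.

    word_feats:   (L, d) BiLSTM word features of the expression.
    step_queries: (T, d) per-step query vectors (assumed here).
    Returns a (T, L) matrix whose rows are softmax distributions over
    words -- each row is one constituent expression.
    """
    logits = step_queries @ word_feats.T           # (T, L)
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```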
Step-wisely Dynamic Reasoning The probability of the l-th word referring to each node and type of edge: The weight of each node (or the edge type) being mentioned in time step: Update the gates for every node or the edge type: Identify the compound object corresponding to each node:
Experiments Comparison with state-of-the-art methods on RefCOCO, RefCOCO+ and RefCOCOg when ground-truth bounding boxes are used.
Experiments Comparison with the state-of-the-art methods on RefCOCO, RefCOCO+ and RefCOCOg when detected objects are used.
Explainable Visualization
Visualization results. For "a lady wearing a purple shirt with a birthday cake" (tree structure), the model matches "lady", "purple shirt", and "cake" over reasoning steps T = 1, 2, 3. For "the elephant behind the man wearing a gray shirt" (chain structure), it matches "elephant", "man", and "gray shirt" in sequence.
Conclusion: Cross-modal relationship modeling helps enhance contextual feature representations and improves visual grounding performance. Language-guided reasoning over an object relation graph helps locate the objects referred to in complex language descriptions and produces interpretable results.
Future Work Discussion: Spatio-Temporal Reasoning in video grounding. See: Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video, ACL 2019.
Future Work Discussion: Embodied Referring Expression Comprehension. See: RERERE: Remote Embodied Referring Expressions in Real indoor Environments, arXiv 2019.
Future Work Discussion: Commonsense Reasoning for Visual Grounding, e.g. "the lady to the right of the waiter" vs. "the person who ordered the dish served by the waiter". See: From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019.
Future Work Discussion: Task-Driven Object Detection [Sawatzky et al., CVPR 2019]. Examples: "What object in the scene would a human choose to serve wine?"; "I want to watch 'The Big Bang Theory' now; by the way, the room is too bright."
Thank You! http://guanbinli.com/