Vision and Language Learning with Graph Neural Networks Linchao Zhu 22 Apr, 2020 Recognition, LEarning, Reasoning UTS CRICOS 00099F
Overview
• RNNs for Image Captioning
• Transformer for Image Captioning
• Graph Network for Visual Commonsense Reasoning
Image Captioning
• How to generate descriptions for unseen words?
Zero-shot novel object captioning: the model needs to caption novel objects without additional training sentence data about the object.
Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.
Image Captioning
Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.
Novel Image Captioning
• Results on eight novel objects in the held-out MS COCO dataset
• A larger dataset: nocaps: novel object captioning at scale, ICCV 2019
Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.
Image Captioning
• Semantic attributes are useful
• Visual regions are useful
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
Image Captioning
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
Image Captioning
• EnTangled Attention
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
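
The slide names the EnTangled Attention module without giving its equations. As a rough illustration of the building block such cross-modal modules rest on, the sketch below implements plain scaled dot-product attention in which a query from one modality attends over features from the other. It is not the paper's formulation: the choice of a semantic-attribute query over visual regions, the names `attend`, `attribute_query`, `region_keys`, `region_values`, and the toy numbers are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the weighted sum of `values` and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights

# Hypothetical toy data: one attribute query attends over three region features.
attribute_query = [1.0, 0.0]
region_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
region_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
context, weights = attend(attribute_query, region_keys, region_values)
```

The first region's key is aligned with the query, so it receives the largest weight, while the other two regions score equally and share the remainder.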
Image Captioning
• Gated Bilateral Controller
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
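
The slide does not spell out how the Gated Bilateral Controller combines the visual and semantic streams. The sketch below shows only the generic idea of a learned gate that blends two feature vectors element by element. It should be read as an assumption-laden illustration, not as the controller from the paper: the function `gated_fusion`, the per-element weights `w_v`, `w_s`, `bias`, and the input values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual, semantic, w_v, w_s, bias):
    """Blend two equally sized feature vectors with an element-wise gate.

    gate_i = sigmoid(w_v[i] * visual[i] + w_s[i] * semantic[i] + bias[i])
    out_i  = gate_i * visual[i] + (1 - gate_i) * semantic[i]
    """
    out = []
    for v, s, wv, ws, b in zip(visual, semantic, w_v, w_s, bias):
        gate = sigmoid(wv * v + ws * s + b)
        out.append(gate * v + (1.0 - gate) * s)
    return out

# Hypothetical features from the two streams.
visual = [1.0, 2.0]
semantic = [3.0, 0.0]

# With zero weights and bias the gate is 0.5, so the output is the plain average.
balanced = gated_fusion(visual, semantic, [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])

# A large positive bias pushes the gate towards 1, so the visual stream dominates.
visual_heavy = gated_fusion(visual, semantic, [0.0, 0.0], [0.0, 0.0], [50.0, 50.0])
```

In a trained model the gate parameters would be learned, letting the network decide per dimension how much to trust each stream.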
Image Captioning
• Results on MSCOCO (Karpathy’s split)
• Fuse two models
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
Image Captioning
• Results on MSCOCO (Karpathy’s split) with sequence-level optimization
Transformer v: visual input only (w/o GBC)
Transformer s: semantic attributes only (w/o GBC)
Parallel: no ETA, but uses GBC
Stacked v: two stacked visual layers (w/o GBC)
Stacked s: two stacked semantic layers (w/o GBC)
ETA: ours
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
Visual Commonsense Reasoning
Question -> Answer -> Rationale
Zellers et al., From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019.
Visual Commonsense Reasoning
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
Visual Commonsense Reasoning
Figure labels: local features, global features
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
Visual Commonsense Reasoning
• Directional Reasoning
Figure labels: conv, local features, global features, attention
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
Visual Commonsense Reasoning
• Loss: multi-class cross-entropy loss
• Results on the VCR dataset
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
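
To make the multi-class cross-entropy loss concrete: VCR poses each question with four candidate responses, so the model's scores over the candidates are turned into a softmax distribution, and the loss is the negative log-probability of the labelled candidate. The sketch below works one example through in plain Python. The scores and the target index are made-up values for illustration, not outputs of the model in the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_index):
    """Negative log-probability assigned to the correct candidate."""
    probs = softmax(scores)
    return -math.log(probs[target_index])

# Hypothetical scores for four candidate answers; index 2 is the labelled answer.
scores = [0.5, -1.0, 2.0, 0.1]
loss = cross_entropy(scores, target_index=2)

# A model with no preference scores all candidates equally; its loss is log(4).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)
```

Because the example model scores the correct candidate highest, `loss` falls below `uniform_loss`, the value a chance-level predictor would incur.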
Visual Commonsense Reasoning
• Conditional centers
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
Visual Commonsense Reasoning
• Ablation studies for GraphVLAD
No-C: no conditional center
No-G: no graph convolution
• Ablation studies for directional reasoning
No-R: no reasoning module
LSTM-R: use LSTM for reasoning
GCN: use GCN for reasoning
D-GCN: directional GCN for reasoning
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
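
The ablation contrasts a GCN with a directional GCN. As a minimal sketch of what that distinction can mean, the code below runs one graph-convolution layer (neighbour averaging, linear projection, ReLU) on a three-node chain, first with a symmetric adjacency matrix and then with an asymmetric one. Encoding direction through an asymmetric adjacency matrix is an assumption made for this example; the slide does not state how D-GCN is defined, and the names `graph_conv`, `undirected`, and `directed` are hypothetical.

```python
def normalize_rows(adj):
    """Divide each row by its sum so a node averages over the nodes it receives from."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([a / total if total else 0.0 for a in row])
    return out

def graph_conv(adj, feats, weight):
    """One layer: ReLU(row_normalized(adj) @ feats @ weight)."""
    norm = normalize_rows(adj)
    n, d_in, d_out = len(feats), len(weight), len(weight[0])
    # Aggregate features from the nodes each node receives from.
    agg = [[sum(norm[i][j] * feats[j][k] for j in range(n)) for k in range(d_in)]
           for i in range(n)]
    # Linear projection followed by ReLU.
    return [[max(0.0, sum(agg[i][k] * weight[k][o] for k in range(d_in)))
             for o in range(d_out)]
            for i in range(n)]

# Row i lists the nodes that node i receives from (self-loops included).
undirected = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]  # chain 0 - 1 - 2
directed = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]    # chain 0 -> 1 -> 2

feats = [[0.0], [1.0], [0.0]]  # only the middle node carries a signal
weight = [[1.0]]               # identity projection keeps the arithmetic visible

out_undirected = graph_conv(undirected, feats, weight)
out_directed = graph_conv(directed, feats, weight)
```

In the undirected graph the middle node's signal spreads to both neighbours, whereas in the directed graph node 0 receives nothing from node 1 and only node 2 picks the signal up.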
Conclusion
• Visual reasoning is challenging.
• Graph networks are powerful; more studies are needed.
• One model to solve them all?