Structured Query-Based Image Retrieval using Scene Graphs


  1. Structured Query-Based Image Retrieval using Scene Graphs
     Brigit Schroeder, UCSC
     Subarna Tripathi, Intel Labs

  2. Complexity of Object Interactions for Retrieval
     [Figure: the structured query "woman rides motorcycle" vs. the single objects "woman" and "motorcycle"]
     ● Structured queries capture the complexity of object interactions, which single-object queries cannot.
     ● A visual relationship is a directed subgraph with a subject and an object as nodes, connected by a predicate.
     ● We propose to retrieve images from such queries (NOT from RGB image features) using a learned scene embedding space.

  3. Related Work
     ● Image Retrieval Using Scene Graphs (Johnson et al., CVPR 2015) ⇒ Uses a CRF model to match the best possible bounding-box groundings of a scene graph (SG) to an image for retrieval.
     ● Cross-Modal Scene Graph Matching for Relationship-Aware Image-Text Retrieval (Wang et al., WACV 2020) ⇒ Uses cross-modal scene graphs for image-text retrieval, relying on both word embeddings and image features.

  4. Subgraph Query for Retrieval
     ● Directed subgraphs are extracted from scene graphs (objects as nodes, predicates as edges).
     ● Each subgraph contains a subject and an object as nodes, connected by an edge representing a predicate relationship.
     ● A visual relationship, represented as such a subgraph, is posed as a structured query.
     ● A similarity metric in the scene embedding space is used for retrieval.
     ● The scene embedding is learned via a pretext task (described in the next slide).
     [Figure: Scene Graph]
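
The subgraph extraction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Triplet` type, the `extract_triplets` helper, and the index-based edge format are all assumptions made for the example.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """One directed subgraph: subject node -> predicate edge -> object node."""
    subject: str
    predicate: str
    obj: str


def extract_triplets(objects, edges):
    """Return one <subject, predicate, object> subgraph per directed edge.

    objects: list of object labels (the graph nodes)
    edges:   list of (subject_index, predicate, object_index) tuples
             (a hypothetical storage format chosen for this sketch)
    """
    return [Triplet(objects[s], p, objects[o]) for s, p, o in edges]


# Hypothetical scene graph for an image of a woman riding a motorcycle.
objects = ["woman", "motorcycle", "road"]
edges = [(0, "rides", 1), (1, "on", 2)]

triplets = extract_triplets(objects, edges)
query = triplets[0]  # the structured query <woman, rides, motorcycle>
```

Each extracted triplet can then be posed as a structured query on its own.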

  5. Scene Graph Embeddings from Layout Prediction
     ● The scene graph embedding is learned via a pretext task, layout prediction.
     ● Layout prediction uses object localization for individual objects AND a triplet-superbox regression network and a triplet-mask prediction network (described in the next slide).
     ● A visual relationship, as a directed subgraph, serves as the structured query, such as: <giraffe, left of, giraffe>.
     ● Euclidean distance in this scene embedding space is used for retrieval.
     [Figure: Learning Scene Graph Embedding]
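
Retrieval by Euclidean distance can be sketched as a nearest-neighbour ranking. The example below assumes the query and every database entry have already been mapped to fixed-length vectors by some trained model; the `retrieve` function, the image ids, and the toy 2-D vectors are hypothetical and only illustrate the ranking step.

```python
import math


def retrieve(query_embedding, database, k=3):
    """Return the ids of the k database entries closest to the query.

    database: dict mapping image id -> embedding (sequence of floats)
    """
    ranked = sorted(
        database,
        key=lambda image_id: math.dist(query_embedding, database[image_id]),
    )
    return ranked[:k]


# Toy 2-D embeddings; real ones would come from the learned embedding space.
database = {
    "img_a": (0.0, 0.0),
    "img_b": (1.0, 1.0),
    "img_c": (0.2, 0.1),
    "img_d": (5.0, 5.0),
}

top = retrieve((0.1, 0.1), database, k=2)
```

A full sort is fine at this scale; a large collection would more likely use a partial sort or an approximate nearest-neighbour index.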

  6. Triplet Mask Network
     Triplet mask prediction: Triplets of the form <subject, predicate, object> found in a scene graph are used to predict corresponding triplet masks, labelling pixels as either subject or object. The mask prediction is used as a supervisory signal during training.
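
To make the idea of a triplet mask concrete, the sketch below merges a binary subject mask and a binary object mask into a single label map. The slide does not say how the target masks are built, so everything here is an assumption made for illustration: the label ids, the extra background label, the rule that the subject wins where the two masks overlap, and the `triplet_mask` helper itself.

```python
# Label ids are an assumption of this sketch, not taken from the slides.
BACKGROUND, SUBJECT, OBJECT = 0, 1, 2


def triplet_mask(subject_mask, object_mask):
    """Merge two binary masks (lists of rows of 0/1) into one label map."""
    merged = []
    for s_row, o_row in zip(subject_mask, object_mask):
        row = []
        for s, o in zip(s_row, o_row):
            if s:
                row.append(SUBJECT)  # subject wins overlaps (arbitrary choice)
            elif o:
                row.append(OBJECT)
            else:
                row.append(BACKGROUND)
        merged.append(row)
    return merged


# Toy 3x4 masks that overlap in one pixel (row 1, column 1).
subject_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
object_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]

target = triplet_mask(subject_mask, object_mask)
```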

  7. Qualitative Retrieval Results
     Image Retrieval Results. Retrieval for structured queries with object types of varying frequency in the COCO-Stuff dataset: (a) head (person, tree), (b) (long-tail) medium frequency (zebra, truck), and (c) (long-tail) low frequency (skateboard, skis). The query is in the left-most column, corresponding to the red boxes.

  8. Quantitative Results
     Image Retrieval Performance. Recall@k for all classes (left) and long-tail vs. head classes (right) found in COCO-Stuff.
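
For readers unfamiliar with the metric, Recall@k can be computed as below. The slides do not state how a retrieved result is judged correct, so this sketch uses one common definition (a query counts as a hit if any relevant item appears in its top k); the `recall_at_k` function and the toy data are hypothetical.

```python
def recall_at_k(ranked_results, relevant, k):
    """Fraction of queries with at least one relevant item in the top k.

    ranked_results: dict query id -> list of retrieved ids, best first
    relevant:       dict query id -> set of ids counted as correct
    """
    hits = sum(
        1
        for query_id, ranking in ranked_results.items()
        if relevant[query_id] & set(ranking[:k])
    )
    return hits / len(ranked_results)


# Toy rankings for four queries (made-up ids, not results from the paper).
ranked_results = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["j", "k", "l"],
}
relevant = {"q1": {"a"}, "q2": {"f"}, "q3": {"z"}, "q4": {"k"}}

r1 = recall_at_k(ranked_results, relevant, k=1)
r2 = recall_at_k(ranked_results, relevant, k=2)
r3 = recall_at_k(ranked_results, relevant, k=3)
```

Under this definition Recall@k can only stay the same or grow as k increases.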

  9. Retrieval Performance (NO input RGB image features)
     Adding a visual relationship-inspired (triplet) loss boosts our recall by 10% in the best case.

  10. Conclusions
     ● We have trained scene graph embeddings for layout prediction with triplet-based loss functions.
     ● For the downstream application of image retrieval, we use structured queries formed using the learned embeddings instead of input image content.
     ● Our approach achieves high recall even on long-tailed object classes in the COCO-Stuff dataset.

  11. Thank You!
     Please check out our paper online: https://arxiv.org/pdf/2005.06653.pdf
     Brigit Schroeder, UC Santa Cruz, brschroe@ucsc.edu, http://www.cs.uml.edu/~bschroed/
     Subarna Tripathi, Intel Labs, subarna.tripathi@intel.com, https://subarnatripathi.github.io/
