Image Retrieval using Scene Graphs Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Li Fei-Fei CVPR 2015 Presented by Youngki Kwon
Contents ● Introduction ● Background ● Main approach ● Experiment ● Conclusion
Introduction ● There is a need to retrieve semantically similar images by describing the detailed semantics of a scene ● A scene graph can represent such a scene ● How about using a scene graph as the query? [Figure: a scene graph query and its ideal retrieval result]
Introduction ● Develop a novel framework for semantic image retrieval based on the notion of a scene graph ● Use scene graphs as queries ● Introduce a novel dataset of 5K human-generated scene graphs grounded to images [Figure: a query scene graph of objects, attributes, and relationships is scored against candidate images to produce the retrieval output]
Background ● A scene graph is a data structure that describes the contents of a scene ● It encodes object instances, attributes of objects, and relationships between objects <Ranjay Krishna et al. IJCV16>
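To make the data structure concrete, here is a minimal Python sketch of a scene graph; the class and field names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    cls: str                                              # object class, e.g. "man"
    attributes: List[str] = field(default_factory=list)   # e.g. ["standing"]

@dataclass
class SceneGraph:
    # G = (O, E): objects plus (subject index, predicate, object index) edges
    objects: List[SceneObject]
    relationships: List[Tuple[int, str, int]]

# Example: "a standing man wearing a black shirt"
graph = SceneGraph(
    objects=[SceneObject("man", ["standing"]), SceneObject("shirt", ["black"])],
    relationships=[(0, "wearing", 1)],
)
```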
Background ● An attribute can be a property of a single object, such as its color, material, or shape <Ali Farhadi et al. CVPR09> ● A relationship can be a connection between two objects, such as a spatial relation or an action <Cewu Lu et al. ECCV16>
Main approach ● Assume the scene graph query is given and each image is represented by a set of candidate bounding boxes ● Measure the agreement between the query scene graph and an unannotated test image ● Do so by examining the best possible grounding of the scene graph to the image ● Perform maximum a posteriori (MAP) inference to find the most likely grounding ● The likelihood of this MAP solution is taken as the score measuring the agreement between the scene graph and the image
Main approach ● Scene Graph Grounding ● G = (O, E) is a scene graph ● B is the set of candidate bounding boxes in the image ● δ is a grounding of the scene graph to the image, mapping each object o ∈ O to a box δ_o ∈ B ● Model the distribution over possible groundings as P(δ | G, B) ∝ ∏_{o ∈ O} P(δ_o | o) · ∏_{(o, r, o') ∈ E} P(δ_o, δ_{o'} | o, r, o'), where the first product is the unary potential and the second is the binary potential
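The sketch below illustrates this model under stated assumptions: it scores a grounding as a sum of log unary and binary terms and finds the MAP grounding by brute-force enumeration over candidate boxes (the paper's actual inference procedure may be more efficient). `unary_prob` and `binary_prob` are hypothetical callables, and `graph` follows the SceneGraph sketch above.

```python
import itertools
import math

def grounding_log_score(graph, delta, unary_prob, binary_prob):
    """Log of the unnormalized probability of a grounding `delta`
    (a dict mapping object index -> bounding box)."""
    score = 0.0
    for i, obj in enumerate(graph.objects):
        score += math.log(unary_prob(delta[i], obj) + 1e-12)          # unary terms
    for (i, rel, j) in graph.relationships:
        score += math.log(binary_prob(delta[i], delta[j],
                                      graph.objects[i], rel,
                                      graph.objects[j]) + 1e-12)      # binary terms
    return score

def map_grounding(graph, boxes, unary_prob, binary_prob):
    """Exhaustive MAP inference: try every assignment of boxes to objects."""
    best, best_score = None, -math.inf
    for assignment in itertools.product(range(len(boxes)), repeat=len(graph.objects)):
        delta = {i: boxes[b] for i, b in enumerate(assignment)}
        s = grounding_log_score(graph, delta, unary_prob, binary_prob)
        if s > best_score:
            best, best_score = delta, s
    return best, best_score   # best_score is used as the image's retrieval score
```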
Main approach ● Unary Potential ● Models how well the bounding box δ_o agrees with the known object class and attributes of the object o ● If o = (c, A), we decompose this term as P(δ_o | o) = P(δ_o | c) ∏_{a ∈ A} P(δ_o | a), with the class and attribute probabilities obtained from R-CNN scores on the box [Figure: R-CNN takes a candidate box as input and outputs per-class and per-attribute scores]
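A minimal sketch of this decomposition, assuming a hypothetical `rcnn_scores(box)` helper that returns calibrated class and attribute probabilities for a candidate box:

```python
def unary_prob(box, obj, rcnn_scores):
    """P(delta_o | o) = P(delta_o | c) * product over a in A of P(delta_o | a)."""
    class_probs, attr_probs = rcnn_scores(box)   # two dicts: name -> probability
    p = class_probs.get(obj.cls, 1e-6)           # class term P(delta_o | c)
    for a in obj.attributes:                     # attribute terms P(delta_o | a)
        p *= attr_probs.get(a, 1e-6)
    return p
```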
Main approach ● Binary Potential ● Models how well the pair of bounding boxes δ_o, δ_{o'} expresses the relationship tuple (o, r, o') ● Extract features g(δ_o, δ_{o'}) encoding their relative position and scale ● Train a Gaussian mixture model (GMM) on the training data to model P(g(δ_o, δ_{o'}) | c, r, c') and use the GMM density function as the probability [Figure: for each (o, r, o') tuple, the GMM maps the box pair to a probability]
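A sketch of the binary potential using scikit-learn's GaussianMixture. The exact relative position and scale features below are an assumption for illustration, not necessarily the ones used in the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def rel_features(box_a, box_b):
    """g(delta_o, delta_o'): relative offset and scale, boxes given as (x, y, w, h)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return np.array([(xb - xa) / wa, (yb - ya) / ha, wb / wa, hb / ha])

def fit_relationship_gmm(training_box_pairs, n_components=3):
    """Fit one GMM per (c, r, c') triple on grounded training box pairs."""
    feats = np.stack([rel_features(a, b) for a, b in training_box_pairs])
    return GaussianMixture(n_components=n_components).fit(feats)

def binary_prob(gmm, box_a, box_b):
    """Use the GMM density at g(delta_o, delta_o') as the binary probability."""
    log_density = gmm.score_samples(rel_features(box_a, box_b)[None, :])[0]
    return float(np.exp(log_density))
```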
Experiment ● Perform image retrieval experiments using two types of scene graphs as queries: [1] full ground-truth scene graphs and [2] simple scene graphs ● Also evaluate the groundings found by the proposed models by checking object localization performance [Figure: examples of a full scene graph and a simple scene graph]
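For illustration only, a sketch of how retrieval could be evaluated: rank the test images by their MAP grounding score for each query and measure recall at k. The function names and the exact metric are assumptions; the paper's evaluation protocol may differ.

```python
def rank_images(query_graph, image_boxes, score_fn):
    """score_fn(graph, boxes) -> MAP likelihood score; higher means better agreement."""
    scores = [(idx, score_fn(query_graph, boxes)) for idx, boxes in image_boxes.items()]
    return [idx for idx, _ in sorted(scores, key=lambda t: -t[1])]

def recall_at_k(queries, image_boxes, score_fn, k=10):
    """Fraction of queries whose ground-truth image appears in the top k results."""
    hits = 0
    for query_graph, gt_image_id in queries:
        hits += gt_image_id in rank_images(query_graph, image_boxes, score_fn)[:k]
    return hits / len(queries)
```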
Experiment ● Full scene graph queries
Experiment ● Simple scene graph queries
Experiment ● Success cases [1] [2] [3]
Experiment ● Failure case
Conclusion ● Use the scene graph as a novel representation for detailed semantics in visual scenes ● Introduce a dataset of scene graphs grounded to real-world images ● Construct a CRF model for semantic image retrieval using scene graphs as queries
References ● Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations - Ranjay Krishna et al. (IJCV16) ● Describing Objects by Their Attributes - Ali Farhadi et al. (CVPR09) ● Visual Relationship Detection with Language Priors - Cewu Lu et al. (ECCV16)
Quiz ● 1. A scene graph consists of objects, attributes, and ( ). ● A. relationships ● B. tags ● C. visual features ● D. relative positions ● 2. To measure the score, we examine the best possible ( ) of the scene graph to the image ● A. reconstruction ● B. grounding ● C. resizing ● D. transformation