ReferItGame: Referring to Objects in Photographs of Natural Scenes
Motivation
● First large-scale referring expression dataset
● Referring expressions are the natural way people talk about objects
  – Of psychological interest since the 1970s: Grice, Rosch, Winograd
● Applications to human-computer interaction and robotics
● The paper introduces
  – A large-scale dataset of referring expressions
  – A benchmark model for generating referring expressions
Motivation
● Natural referring expressions are free-form
  – 'smiling boy' – subject only
  – 'man on left' – subject and preposition
● Other work requires expressions in the fixed form (subject, preposition, object)
  – e.g. 'cat on the chair'
Dataset
● Builds on the SAIAPR TC-12 dataset with 238 object categories
● Visual features are computed from the segmentations (a rough sketch follows below), including
  – absolute properties: area, boundary, width, height, …
  – relative properties: adjacent, disjoint, beside, X-aligned, above, …
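As a rough sketch (an assumption for illustration, not the dataset's released feature code), properties of these two kinds could be computed from binary segmentation masks along the following lines; the alignment threshold and function names are hypothetical.

```python
# Hypothetical sketch of absolute and relative region properties
# computed from binary segmentation masks (not the authors' code).
import numpy as np

def absolute_properties(mask: np.ndarray) -> dict:
    """Area, bounding-box width/height, and boundary length of one region."""
    ys, xs = np.nonzero(mask)
    padded = np.pad(mask, 1)
    # a pixel is interior if all four neighbours are also inside the mask
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return {
        "area": int(mask.sum()),
        "width": int(xs.max() - xs.min() + 1),
        "height": int(ys.max() - ys.min() + 1),
        "boundary": int(mask.sum() - (mask & interior).sum()),
    }

def relative_properties(mask_a: np.ndarray, mask_b: np.ndarray) -> dict:
    """A few pairwise relations between two regions (image y grows downward)."""
    ya, xa = np.nonzero(mask_a)
    yb, xb = np.nonzero(mask_b)
    return {
        "disjoint": not (mask_a & mask_b).any(),
        "above": ya.mean() < yb.mean(),
        # illustrative threshold: centroids within 5% of image width
        "x_aligned": abs(xa.mean() - xb.mean()) < 0.05 * mask_a.shape[1],
    }
```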
Dataset
● Player 1 writes an expression referring to the segmented object
● Player 2 clicks on where that object should be
  – A correct click verifies that the expression is reasonable
Dataset
● Collected from Mechanical Turk workers and volunteers
  – ~130,000 expressions
  – ~100,000 distinct objects
  – ~20,000 photographs
● Unfortunately, www.referitgame.com is now down
Dataset
● Expressions are parsed into a 7-tuple of attributes, R
  – entry-level category: 'bird'
  – color: 'blue'
  – size: 'tiny'
  – absolute location: 'top of the image'
  – relative location relation: 'the car to the left of the tree'
  – relative location object: 'the car to the left of the tree'
  – generic: 'wooden', 'round'
● 'The big old white cabin beside the tree'
  – R = {cabin, white, big, Ø, beside, tree, old}
● Parsed with the Stanford CoreNLP parser and attribute templates (toy sketch below)
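A toy illustration of the 7-tuple representation and a crude keyword parse of the example above; the word lists and matching rules are assumptions standing in for the paper's CoreNLP-based attribute templates.

```python
# Toy sketch (assumed, not the paper's pipeline) of the 7-tuple attribute
# representation R and a crude keyword-template parse.
from dataclasses import dataclass
from typing import Optional

@dataclass
class R:
    category: Optional[str] = None      # entry-level category, e.g. 'bird'
    color: Optional[str] = None         # e.g. 'blue'
    size: Optional[str] = None          # e.g. 'tiny'
    abs_location: Optional[str] = None  # e.g. 'top of the image'
    rel_relation: Optional[str] = None  # e.g. 'beside' in 'cabin beside the tree'
    rel_object: Optional[str] = None    # e.g. 'tree' in the same expression
    generic: Optional[str] = None       # e.g. 'wooden', 'round'

# illustrative word lists, not the paper's lexicons
COLORS = {"white", "blue", "red", "green", "black"}
SIZES = {"big", "small", "tiny", "large"}
RELATIONS = {"beside", "above", "below", "under", "near"}
GENERIC = {"old", "wooden", "round"}

def parse_expression(expr: str, head_noun: str) -> R:
    """Very rough keyword parse; the real templates work over full parse trees."""
    tokens = expr.lower().replace(".", "").split()
    r = R(category=head_noun)
    for i, tok in enumerate(tokens):
        if tok in COLORS:
            r.color = tok
        elif tok in SIZES:
            r.size = tok
        elif tok in GENERIC:
            r.generic = tok
        elif tok in RELATIONS:
            r.rel_relation = tok
            r.rel_object = tokens[-1] if i + 1 < len(tokens) else None
    return r

# parse_expression("The big old white cabin beside the tree", "cabin")
# -> R(category='cabin', color='white', size='big', abs_location=None,
#      rel_relation='beside', rel_object='tree', generic='old')
```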
Dataset
● Psychological analysis
  – 'woman' is often replaced with 'person'
Dataset
● Attribute use
  – Roughly half of the parsed descriptions contain only the category
Model
● Choose the attribute set R given P and S by solving an ILP (sketched below)
  – R is the 7-tuple set of attributes
  – P is the visual features of the object being referred to
  – S is the visual features of the scene
● Different hand-engineered distributions for different attributes
● Unary priors between an attribute and the object
● Pairwise priors between pairs of attributes
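A minimal sketch of the kind of objective this implies: score a candidate R by summing unary terms (attribute vs. the target object's features P) and pairwise terms between co-occurring attributes, then pick the best-scoring candidate. The paper solves this with an ILP and hand-engineered distributions; the brute-force search and lookup tables here are simplifying assumptions.

```python
# Assumed sketch, not the paper's ILP: score a candidate attribute set R
# with unary and pairwise potentials, then take the best candidate by
# brute force. `unary` and `pairwise` are lookup tables of log-potentials
# that would be estimated from the training expressions.
from itertools import combinations

def score(R: dict, P: dict, S: dict, unary: dict, pairwise: dict) -> float:
    """R maps attribute slots (e.g. 'color') to values (e.g. 'white').
    P holds the target object's visual features, S the scene's; in the
    full model the relative-location slots would also score against S."""
    filled = [(slot, val) for slot, val in R.items() if val is not None]
    total = 0.0
    for slot, val in filled:
        # e.g. unary[('color', 'white')] could be derived from P's colour histogram
        total += unary.get((slot, val), 0.0)
    for a, b in combinations(filled, 2):
        # reward or penalise particular attribute combinations
        total += pairwise.get((a, b), 0.0)
    return total

def best_R(candidates: list, P: dict, S: dict, unary: dict, pairwise: dict) -> dict:
    """Brute-force stand-in for the paper's ILP over attribute choices."""
    return max(candidates, key=lambda R: score(R, P, S, unary, pairwise))
```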
Evaluation
● Three test sets of 500 images each
  – A contains interesting objects
  – B contains the most frequently occurring interesting objects
  – C contains interesting objects when multiple such objects are present
● Baseline model
  – Incorporates only the priors, so no S and no attributes
● Human accuracy is ~72%
Critique
● How important is the scene for the attributes?
  – S is only used for the relative-location {relation, object} attributes
  – Absolute location is the most commonly used attribute
  – Over half of the parsed descriptions include only the object category
● Why don't the authors include more information on the visual features?
  – Which visual features are most important?
● Is there a better metric than precision and recall?
  – Just ask AMT workers whether a description is reasonable?
Critique
● Why don't the authors analyze the training referring expressions more?
  – Turk workers were paid per batch of 10 images
  – Some human expressions are just the object category
Future Work
● Scale up the dataset and train state-of-the-art neural networks end-to-end
● Identify the referred object instead of generating the expression
  – Done in the later MAttNet paper
● Make the images and expressions more challenging