SPICE: Semantic Propositional Image Caption Evaluation – Presented to the COCO Consortium, Sept 2016


  1. SPICE: Semantic Propositional Image Caption Evaluation. Presented to the COCO Consortium, Sept 2016. Peter Anderson¹, Basura Fernando¹, Mark Johnson² and Stephen Gould¹. ¹Australian National University, ²Macquarie University. ARC Centre of Excellence for Robotic Vision.

  2. Image captioning. Sources: MS COCO Captions dataset; http://aipoly.com/

  3. Automatic caption evaluation • Benchmark datasets require fast-to-compute, accurate and inexpensive evaluation metrics • Good metrics can be used to help construct better models. The Evaluation Task: given a candidate caption c_i and a set of m reference captions R_i = {r_i1, …, r_im}, compute a score S_i that represents the similarity between c_i and R_i.
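A minimal sketch of the evaluation task's interface, assuming Python: the function name evaluate_caption and the bag-of-words overlap used as a stand-in scorer are illustrative assumptions only and are not part of any metric discussed in these slides.

```python
import string


def evaluate_caption(candidate: str, references: list[str]) -> float:
    """Given a candidate caption c_i and reference captions R_i, return a score S_i.

    The scorer below is a toy: the fraction of candidate words that appear
    in any reference. Real metrics (BLEU, METEOR, ROUGE, CIDEr, SPICE)
    replace this body with something far more informative.
    """
    def words(text: str) -> set[str]:
        return {w.strip(string.punctuation) for w in text.lower().split()} - {""}

    cand = words(candidate)
    refs = set().union(*(words(r) for r in references)) if references else set()
    if not cand or not refs:
        return 0.0
    return len(cand & refs) / len(cand)


print(evaluate_caption(
    "A giraffe standing on top of a green field.",
    ["A giraffe stands in a grassy field of green grass."],
))
```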

  4. Existing state of the art • Nearest neighbour captions often ranked higher than human captions. Source: Lin Cui, Large-scale Scene UNderstanding Workshop, CVPR 2015

  5. Existing metrics • BLEU: precision over n-grams, combined with a geometric mean and a brevity penalty • METEOR: align fragments, take harmonic mean of precision & recall • ROUGE-L: F-score based on Longest Common Subsequence • CIDEr: cosine similarity with TF-IDF weighting

  6. Motivation. 'False positive' (high n-gram similarity): "A young girl standing on top of a tennis court." / "A giraffe standing on top of a green field." 'False negative' (low n-gram similarity): "A shiny metal pot filled with some diced veggies." / "The pan on the stove has chopped vegetables in it." …n-gram overlap is not necessary or sufficient for two sentences to mean the same. …SPICE primarily addresses false positives. Source: MS COCO Captions dataset

  7. Is this a good caption? “A young girl standing on top of a basketball court”

  8. Is this a good caption? “A young girl standing on top of a basketball court” Semantic propositions: 1. There is a girl. 2. The girl is young. 3. The girl is standing. 4. There is a court. 5. The court is used for basketball. 6. The girl is on the court.

  9. Key Idea – scene graphs¹. 1. Input → 2. Parse² → 3. Scene Graph³ → 4. Tuples: (girl), (court), (girl, young), (girl, standing), (court, tennis), (girl, on-top-of, court). ¹Johnson et al., Image Retrieval Using Scene Graphs, CVPR 2015. ²Klein & Manning, Accurate Unlexicalized Parsing, ACL 2003. ³Schuster et al., Generating semantically precise scene graphs from textual descriptions for improved image retrieval, EMNLP 2015.
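To make the tuple representation concrete, here is a hand-written sketch (in Python) of the output for the example caption. In the pipeline above these tuples are produced automatically by a dependency parse fed to a scene graph parser; the hard-coded set below only illustrates the data structure SPICE operates on.

```python
# Semantic tuples for "A young girl standing on top of a tennis court".
# 1-tuples name objects, 2-tuples attach attributes to objects,
# 3-tuples encode relations between objects.
caption_tuples = {
    ("girl",),
    ("court",),
    ("girl", "young"),
    ("girl", "standing"),
    ("court", "tennis"),
    ("girl", "on-top-of", "court"),
}
```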

  10. SPICE Calculation. SPICE is calculated as an F-score over tuples, with: • merging of synonymous nodes, and • WordNet synsets used for tuple matching and merging. Given a candidate caption c, a set of reference captions S, and the mapping T from captions to tuples, SPICE is the F1-score of the precision and recall of matched tuples (sketched below).
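In the paper, precision and recall are taken over matched tuples: P(c, S) = |T(c) ⊗ T(S)| / |T(c)| and R(c, S) = |T(c) ⊗ T(S)| / |T(S)|, with SPICE(c, S) = 2·P·R / (P + R), where ⊗ denotes matching with synonym handling. The sketch below implements that F-score with exact tuple matching only; WordNet synset matching and synonymous-node merging are deliberately omitted, so treat it as an approximation rather than the real implementation.

```python
# Sketch of SPICE as an F1-score over semantic tuples.
# Exact matching only; the real metric also counts WordNet synonyms
# (e.g. "veggies" vs "vegetables") as matches and merges synonymous nodes.

def spice_f1(candidate_tuples: set, reference_tuples: set) -> float:
    # reference_tuples is the union of tuples over all reference captions.
    matched = len(candidate_tuples & reference_tuples)
    if matched == 0:
        return 0.0
    precision = matched / len(candidate_tuples)
    recall = matched / len(reference_tuples)
    return 2 * precision * recall / (precision + recall)


candidate = {("girl",), ("court",), ("girl", "young"), ("girl", "standing"),
             ("court", "basketball"), ("girl", "on-top-of", "court")}
reference = {("girl",), ("court",), ("girl", "young"), ("girl", "standing"),
             ("court", "tennis"), ("girl", "on-top-of", "court")}
print(spice_f1(candidate, reference))  # 5 of 6 tuples match on each side -> ~0.83
```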

  11. Example – good caption

  12. Example – good caption

  13. Example – weak caption

  14. Example – weak caption

  15. Evaluation – MS COCO (C40). Pearson ρ correlation between evaluation metrics and human judgments for the 15 competition entries plus human captions in the 2015 COCO Captioning Challenge, using 40 reference captions. Source: Our thanks to the COCO Consortium for performing this evaluation using MS COCO Captions C40.

  16. Evaluation – MS COCO (C40). SPICE picks the same top-5 as human evaluators. Absolute scores are lower with 40 reference captions (compared to 5 reference captions). Source: Our thanks to the COCO Consortium for performing this evaluation using MS COCO Captions C40.

  17. Gameability • SPICE measures how well caption models recover objects, attributes and relations • Fluency is neglected (as with n-gram metrics) • If fluency is a concern, include a fluency metric such as surprisal*. *Hale, J.: A Probabilistic Earley Parser as a Psycholinguistic Model, 2001; Levy, R.: Expectation-based syntactic comprehension, 2008

  18. SPICE for error analysis. Breakdown of SPICE F-score over objects, attributes and relations
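One way to reproduce this breakdown, assuming the tuple-arity convention above (1-tuples are objects, 2-tuples attributes, 3-tuples relations), is to filter the tuple sets by arity and compute the same F1 within each subset; the helper names below are illustrative. The slide 19 breakdown can be obtained the same way by further filtering 2-tuples against colour, number and size word lists.

```python
# Sketch: per-category SPICE breakdown by splitting tuples on arity.

def f1(cand: set, refs: set) -> float:
    matched = len(cand & refs)
    if matched == 0:
        return 0.0
    p, r = matched / len(cand), matched / len(refs)
    return 2 * p * r / (p + r)


def spice_breakdown(candidate_tuples: set, reference_tuples: set) -> dict:
    # Objects = 1-tuples, attributes = 2-tuples, relations = 3-tuples.
    return {
        name: f1({t for t in candidate_tuples if len(t) == arity},
                 {t for t in reference_tuples if len(t) == arity})
        for name, arity in (("objects", 1), ("attributes", 2), ("relations", 3))
    }
```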

  19. Can caption models count? Breakdown of attribute F-score over color, number and size attributes

  20. Summary • SPICE measures how well caption models recover objects, attributes and relations • SPICE captures human judgment better than CIDEr, BLEU, METEOR and ROUGE • Tuples can be categorized for detailed error analysis • Scope for further improvement as better semantic parsers are developed • Next steps: using SPICE to build better caption models!

  21. Thank you. Link: SPICE Project Page (http://panderson.me/spice). Acknowledgement: We are grateful to the COCO Consortium for re-evaluating the 2015 Captioning Challenge entries using SPICE.
