Understanding complex scenes
Example captions: "a man holding a tennis racquet on a tennis court"; "the man is on the tennis court playing a game"
Knowledge (Freebase), Text, Vision
Text: Barack Obama is an American politician serving as the 44th President of the United States. Born in Honolulu, Hawaii, … in 2008, he defeated Republican nominee and was inaugurated as president on January 20, 2009. (Wikipedia.org)
Vision: http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html
Winning entries of COCO 2015 Caption Challenge
The compositional framework is *less elegant* but can potentially exploit non-paired image-caption data more effectively.
Turing Test Results at the MS COCO Captioning Challenge 2015

System              % of captions that pass the Turing Test    Official Rank
MSR                 32.2%                                       1st
Google              31.7%                                       1st
MSR Captivator      30.1%                                       3rd
Montreal/Toronto    27.2%                                       3rd
Berkeley LRCN       26.8%                                       5th
Human               67.5%                                       --

Other groups: Baidu/UCLA, Stanford, Tsinghua, etc.
Still a big gap!
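[Figure: captioning pipeline. ConvNets produce a features vector that feeds the visual concepts, celebrity model, and landmark model; a language model and the DMSM produce the caption. With high confidence the output is a full caption ("A small boat in Ha Long Bay"); with low confidence the output is "This image contains: water, boat, lake, mountain, etc."]
[Kenneth Tran, Xiaodong He, Lei Zhang, Jian Sun, Cornelia Carapcea, Chris Thrasher, Chris Buehler, Chris Sienkiewicz, submitted to CVPR Deep Vision 2016]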
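The high/low confidence branch in the pipeline figure can be sketched as below. This is only an illustration under stated assumptions: the function name `describe_image`, the numeric confidence score, and the 0.5 threshold are hypothetical and do not come from the slides, which only show that a confident result yields a full caption and an unconfident one yields a list of detected concepts.

```python
def describe_image(caption, confidence, visual_concepts, threshold=0.5):
    """Return the caption when confident, otherwise list the detected concepts.

    `threshold` is a hypothetical cut-off; the slides do not give a value.
    """
    if confidence >= threshold:
        return caption
    # Low confidence: fall back to naming the visual concepts only.
    return "This image contains: " + ", ".join(visual_concepts) + ", etc."


concepts = ["water", "boat", "lake", "mountain"]
confident = describe_image("A small boat in Ha Long Bay", 0.9, concepts)
unsure = describe_image("A small boat in Ha Long Bay", 0.2, concepts)
print(confident)
print(unsure)
```

The fallback keeps the system from asserting a specific, possibly wrong, sentence when the models disagree with the image.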
[He, Zhang, Ren, Sun, 2015]
[Figure: caption generation from detected words (cabinets, room, wooden, kitchen, stove, sink, floor). Repeat to generate 500 candidates.] [Fang, et al., CVPR 2015]
The deep multimodal semantic model [Fang, et al., CVPR 2015]
Semantic space: The overall semantics of a caption will also be represented by a vector in this space. If these two vectors are close to each other, then the caption is a good match for the image. Otherwise, it is not a matching caption.
[Figure: two parallel networks, each with hidden layers H1 to H3 and weights W1 to W4. Text branch: input t1 (text: "a man holding a tennis racquet on a tennis court"), fully connected layers. Image branch: input s (raw image pixels), convolution/pooling, image feature, fully connected layers.]
[Huang, He, Gao, Deng et al., 2013] [He, Zhang, Ren, Sun, 2015]
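The matching rule described for the semantic space can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the image and each candidate caption have already been mapped to vectors in the shared space (the toy vectors below are made up), and it assumes cosine similarity as the closeness measure, which the slide does not state. The helper names `cosine_similarity` and `rank_captions` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def rank_captions(image_vec, caption_vecs):
    """Return (score, caption) pairs sorted from best to worst match."""
    scored = [
        (cosine_similarity(image_vec, vec), caption)
        for caption, vec in caption_vecs.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical, hand-picked vectors standing in for the network outputs.
image_vec = [0.9, 0.1, 0.3]
caption_vecs = {
    "a man holding a tennis racquet on a tennis court": [0.8, 0.2, 0.4],
    "a small boat on a lake": [0.1, 0.9, 0.2],
}

ranked = rank_captions(image_vec, caption_vecs)
best_score, best_caption = ranked[0]
print(best_caption)
```

With these toy vectors the tennis caption lies closest to the image vector, so it is ranked first; the same scoring could be applied to a larger pool of candidate captions.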
[Guo, Zhang, Hu, He, Gao, 2016]
[Figure: the same two-branch network (hidden layers H1 to H3, weights W1 to W4). Input t1 is the caption "a man holding a tennis racquet on a tennis court"; input s is the image.]
System               Excellent    Good     Bad      Embarrassing
Fang et al., 2015    40.6%        26.8%    28.8%    3.8%
New system           51.8%        23.4%    22.5%    2.4%

Human evaluation on 1000 random samples of the COCO test set.
System               Excellent    Good     Bad      Embarrassing
Fang et al., 2015    12.0%        13.4%    63.0%    11.6%
New system           25.4%        24.1%    45.3%    5.2%

Human evaluation on the Instagram test set, which contains 1380 random images that we scraped from Instagram.
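To compare the two human evaluations at a glance, the ratings can be collapsed into a single share of captions judged Excellent or Good. Treating those two grades together as "acceptable" is an assumption made here for illustration, not a metric defined in the slides; the percentages are copied from the COCO and Instagram evaluations.

```python
# Human ratings in percent, copied from the two evaluations.
ratings = {
    ("COCO", "Fang et al., 2015"): {"Excellent": 40.6, "Good": 26.8, "Bad": 28.8, "Embarrassing": 3.8},
    ("COCO", "New system"): {"Excellent": 51.8, "Good": 23.4, "Bad": 22.5, "Embarrassing": 2.4},
    ("Instagram", "Fang et al., 2015"): {"Excellent": 12.0, "Good": 13.4, "Bad": 63.0, "Embarrassing": 11.6},
    ("Instagram", "New system"): {"Excellent": 25.4, "Good": 24.1, "Bad": 45.3, "Embarrassing": 5.2},
}


def acceptable_share(row):
    """Percent of captions rated Excellent or Good (an assumed grouping)."""
    return round(row["Excellent"] + row["Good"], 1)


shares = {key: acceptable_share(row) for key, row in ratings.items()}
for (dataset, system), share in shares.items():
    print(f"{dataset:10s} {system:18s} {share:.1f}%")
```

On these figures the combined share rises from 67.4% to 75.2% on COCO and from 25.4% to 49.5% on Instagram, so under this grouping both systems score much lower on the Instagram images than on COCO.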
Cognitive Services http://CaptionBot.ai