Image Identification with Natural Language Specification
Qi Feng, Donghyun Kim
Department of Computer Science, Boston University
fung@bu.edu, donhk@bu.edu
December 08, 2017

Outline
1 Introduction
2 Methods
3 Results
4 Saliency Map

Photo Search
Figure: Screenshot of a natural language search on Google Photos.

The Problem
Figure: Identification of the target image by natural language specification.

GloVe Embedding
GloVe is an unsupervised learning algorithm for obtaining vector representations of words.[2]
Figure: Projection of word embeddings into 2D space.

The Baseline Model
▶ Cosine similarity
  ▶ average of word embeddings[2]
    ▶ the input query
    ▶ a generated caption for an image[4]
▶ The Inception v3
  ▶ pretrained on the ILSVRC-2012-CLS[3].
▶ The language model
  ▶ trained 20,000 iterations on MSCOCO[1].

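The baseline scoring can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the 3-dimensional `EMBEDDINGS` table is invented for the example (real GloVe vectors are much larger), the captions are assumed to come from a separate captioning model, and `average_embedding`, `cosine_similarity`, and `identify` are hypothetical helper names.

```python
import numpy as np

# Toy 3-d word vectors, invented for illustration only (not real GloVe values).
EMBEDDINGS = {
    "a": np.array([0.1, 0.1, 0.1]),
    "dog": np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "cat": np.array([0.1, 0.9, 0.0]),
    "running": np.array([0.3, 0.0, 0.9]),
    "sleeping": np.array([0.0, 0.3, 0.8]),
}

def average_embedding(sentence):
    """Mean of the vectors of all in-vocabulary words in the sentence."""
    vectors = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words in sentence")
    return np.mean(vectors, axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def identify(query, captions):
    """Return the index of the caption (one per candidate image) closest to the query."""
    q = average_embedding(query)
    scores = [cosine_similarity(q, average_embedding(c)) for c in captions]
    return int(np.argmax(scores)), scores

# Hypothetical generated captions for two candidate images.
captions = ["a cat sleeping", "a dog running"]
best, scores = identify("a puppy running", captions)
```

Here `best` is the index of the candidate image whose caption embedding is closest to the query embedding. Averaging discards word order, which is one limitation a sequence model can address.
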
The Proposed Model
[Diagram labels: Image, Query, CNN, Visual Representation, concat, Language Model (LSTM), Similarity]
Figure: The proposed model. Red rounded rectangles are inputs to the model. The blue rectangle is the intermediate result from the convolutional neural network.

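One way to read the diagram labels is sketched below in plain NumPy. Several details are assumptions rather than facts from the slides: that the visual representation is concatenated with every word vector, that the similarity is a sigmoid readout of the final LSTM hidden state, and all dimensions and parameter names. The weights are random and untrained, so the score only demonstrates the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen arbitrarily for the sketch.
VISUAL_DIM, WORD_DIM, HIDDEN_DIM = 8, 4, 6
INPUT_DIM = VISUAL_DIM + WORD_DIM

# Randomly initialised (untrained) parameters for the four LSTM gates.
W = rng.normal(scale=0.1, size=(4 * HIDDEN_DIM, INPUT_DIM + HIDDEN_DIM))
b = np.zeros(4 * HIDDEN_DIM)
w_out = rng.normal(scale=0.1, size=HIDDEN_DIM)  # hypothetical readout weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """One standard LSTM step: input, forget, output gates and candidate cell."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def similarity(visual_feature, word_vectors):
    """Score in (0, 1) for an image feature and an embedded query."""
    h = np.zeros(HIDDEN_DIM)
    c = np.zeros(HIDDEN_DIM)
    for word in word_vectors:
        # Assumed concat point: image feature joined with each word vector.
        x = np.concatenate([visual_feature, word])
        h, c = lstm_step(x, h, c)
    return float(sigmoid(w_out @ h))

visual_feature = rng.normal(size=VISUAL_DIM)  # stand-in for the CNN output
query = rng.normal(size=(5, WORD_DIM))        # stand-in for 5 embedded words
score = similarity(visual_feature, query)
```

In a trained system the candidate image with the highest score for a given query would be returned.
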
Training and Testing
Figure: Positive Training Data

Results

Results cont.
▶ The Baseline Model: 91.1%
▶ The Proposed Model: 93.4%

Excitation Back-propagation for Saliency Map
▶ Goal
  ▶ Find the salient regions of the input that explain the model's predictions, using back-propagation.
▶ Assumptions
  ▶ The response of an activation neuron is non-negative.
  ▶ An activation neuron is tuned to detect certain visual features. Its response is positively correlated with its confidence in the detection.

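Under these two assumptions, the probabilistic winner-take-all rule usually associated with Excitation Backprop passes probability from each output neuron to its inputs in proportion to activation times positive weight. The sketch below applies that rule to a single linear layer; the function name `excitation_backprop_layer` and the toy numbers are hypothetical, and the slides do not say exactly how the rule was applied to the CNN and LSTM.

```python
import numpy as np

def excitation_backprop_layer(activations, weights, top_prob, eps=1e-12):
    """Redistribute top_prob over the inputs of one linear layer.

    activations: (n_in,) non-negative input responses
    weights:     (n_out, n_in) layer weights
    top_prob:    (n_out,) probabilities assigned to the output neurons
    """
    w_pos = np.clip(weights, 0.0, None)        # keep excitatory connections only
    contrib = w_pos * activations              # (n_out, n_in): activation * positive weight
    norm = contrib.sum(axis=1, keepdims=True)  # normaliser per output neuron
    # Conditional probability of each input given each output neuron;
    # rows with no excitatory contribution stay at zero.
    cond = np.divide(contrib, norm, out=np.zeros_like(contrib), where=norm > eps)
    return cond.T @ top_prob                   # marginal probability per input

# Toy layer with 3 inputs and 2 outputs.
activations = np.array([1.0, 2.0, 0.0])
weights = np.array([[1.0, -1.0, 3.0],
                    [0.5, 0.25, 1.0]])
top_prob = np.array([0.5, 0.5])
saliency = excitation_backprop_layer(activations, weights, top_prob)
# saliency is [0.75, 0.25, 0.0]
```

The negative weight and the zero activation receive no probability, and the total mass is preserved, so repeating the step layer by layer down to the input yields a saliency map.
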
Spatial and Temporal Saliency
Figure: Spatial and temporal saliency on MS-COCO. Original images are on the left and saliency maps on the right. The queries are shown under each image. The word in red has the maximum temporal saliency.

Conclusion
▶ A model that identifies an image from a natural language specification.
▶ An RNN that measures the similarity between images and queries.
▶ Excitation Back-propagation for finding spatial and temporal groundings.

References
[1] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[2] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[3] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[4] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. CoRR, abs/1609.06647, 2016.