CS688 Paper Presentation 1 Deep Image-Text Embeddings Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016) Woobin Im ( 임우빈 ) 2016-11-08
Sentence-to-image Retrieval
● Query text (from the user): "A cat next to a blue chair and a deck"
● The retrieval system returns a result image
Image-to-sentence Retrieval
● Query image (from the user)
● The retrieval system returns a result text chosen from among a list of sentences: "A black and white cat laying on the carrying case of a computer"
Image Description Generation
● Query image (from the user)
● Result text produced by text generation with NLP techniques: "A black and white cat laying on the carrying case of a computer"
Image-sentence Embeddings
● Image → image representation → projection
● Text → text representation → projection
Source: Accounting for the Relative Importance of Objects in Image Retrieval
Examples of image-to-sentence retrieval Source: Associating neural word embeddings with deep image representations using Fisher Vectors 7
Datasets
● MSCOCO, Flickr 8K, Flickr 30K, Pascal 1K …
● Have a few captions for each image
● MSCOCO has object segment information
● Flickr30K has phrase localizations
Example of Flickr30k Entities dataset
Source: Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
Paper
● Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016)
● Image branch: Image → Pretrained CNN (VGG) → Image feature → fc → fc → B-norm
● Text branch: Sentence → Word2vec → FV-HGLMM → PCA → Text feature → fc → fc → B-norm
● The outputs of both branches feed one loss
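The fc, fc, B-norm stack of one branch can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the ReLU between the two fc layers, the toy layer sizes, the random initialization and every function name here are hypothetical and not taken from the slides.

```python
import math
import random

def fc(batch, weights, bias):
    """Fully connected layer; `weights` holds one weight vector per output unit."""
    return [[sum(x * w for x, w in zip(row, unit)) + b
             for unit, b in zip(weights, bias)]
            for row in batch]

def relu(batch):
    return [[max(v, 0.0) for v in row] for row in batch]

def batch_norm(batch, eps=1e-5):
    """Normalize each dimension to zero mean and unit variance over the batch."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(c) / n for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n + eps)
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in batch]

def init_fc(in_dim, out_dim, rng):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def embed(batch, layer1, layer2):
    """One branch: fc -> ReLU (assumed) -> fc -> batch norm."""
    hidden = relu(fc(batch, *layer1))
    return batch_norm(fc(hidden, *layer2))

rng = random.Random(0)
# Toy sizes; the slides use 4096D image features and 6000D text features.
image_dim, text_dim, hidden_dim, embed_dim = 12, 16, 8, 4
image_branch = (init_fc(image_dim, hidden_dim, rng), init_fc(hidden_dim, embed_dim, rng))
text_branch = (init_fc(text_dim, hidden_dim, rng), init_fc(hidden_dim, embed_dim, rng))

images = [[rng.gauss(0.0, 1.0) for _ in range(image_dim)] for _ in range(5)]
sentences = [[rng.gauss(0.0, 1.0) for _ in range(text_dim)] for _ in range(5)]
image_emb = embed(images, *image_branch)
text_emb = embed(sentences, *text_branch)
print(len(image_emb), len(image_emb[0]), len(text_emb[0]))
```

Both branches end in the same embedding size, so distances between an image embedding and a sentence embedding are well defined.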
Image feature extraction
● Using pretrained VGG-VD-19
● Resized image → 10 crops (4 corners + center = 5 crops, plus their flips)
● Image features (4096D) x 10 → averaging → image feature (4096D)
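The 10-crop averaging step can be illustrated on a toy 2-D "image". In this sketch `toy_extract` is a hypothetical stand-in for the pretrained CNN, and the crop size and image are made up; only the crop, flip and average logic mirrors the slide.

```python
def five_crops(image, size):
    """Return the 4 corner crops and the center crop of a 2-D image (list of rows)."""
    height, width = len(image), len(image[0])
    top_c, left_c = (height - size) // 2, (width - size) // 2
    offsets = [(0, 0), (0, width - size), (height - size, 0),
               (height - size, width - size), (top_c, left_c)]
    return [[row[left:left + size] for row in image[top:top + size]]
            for top, left in offsets]

def hflip(crop):
    """Mirror a crop horizontally."""
    return [row[::-1] for row in crop]

def ten_crop_feature(image, size, extract):
    """Average the features of the 5 crops and of their horizontal flips."""
    crops = five_crops(image, size)
    crops += [hflip(c) for c in crops]
    feats = [extract(c) for c in crops]
    return [sum(vals) / len(feats) for vals in zip(*feats)]

def toy_extract(crop):
    """Hypothetical stand-in for the CNN: flattens the crop into a vector."""
    return [float(v) for row in crop for v in row]

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy "image"
feature = ten_crop_feature(image, 2, toy_extract)
print(len(feature))  # one value per pixel of a 2x2 crop
```

With a real network, `extract` would return the 4096D activation for each crop, and the average would again be a single 4096D vector.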
Text feature extraction
● Word2Vec: word semantic embedding
Source: Distributed representations of words and phrases and their compositionality
Text feature extraction
● Fisher Vector of (HGLMM + GMM)
- HGLMM: Hybrid Gaussian-Laplacian Mixture Model
- GMM: Gaussian Mixture Model
- Mixture models are fit with EM (training)
● Pipeline: Sentence → Word2Vec word vectors → Fisher Vector under each mixture model → Concatenation → vector (18000D) → PCA → Final vector (6000D)
Work of “Associating neural word embeddings with deep image representations using Fisher Vectors”
Loss Calculation
● Structure-preserving triplet loss
● Notation
- i: anchor instance, j: matching instance, k: non-matching instance
- x: image, y: sentence
- d(x, y): Euclidean distance
- m: margin
● Loss terms: image - sentence, sentence - image, image structure preserving, text structure preserving
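For reference, the four labelled terms can be written as one objective. This is a reconstruction under assumptions rather than a copy of the slide's formula: x denotes an image, y a sentence, d the Euclidean distance and m the margin; the weights λ1, λ2, λ3 and the exact index conventions are assumed. In each term the first argument pair is (anchor, matching instance) and the second is (anchor, non-matching instance).

```latex
\begin{aligned}
L(X, Y) ={} & \sum_{i,j,k} \max\bigl[0,\; m + d(x_i, y_j) - d(x_i, y_k)\bigr]
  && \text{(image - sentence)} \\
& + \lambda_1 \sum_{i',j',k'} \max\bigl[0,\; m + d(x_{j'}, y_{i'}) - d(x_{k'}, y_{i'})\bigr]
  && \text{(sentence - image)} \\
& + \lambda_2 \sum_{i,j,k} \max\bigl[0,\; m + d(x_i, x_j) - d(x_i, x_k)\bigr]
  && \text{(image structure preserving)} \\
& + \lambda_3 \sum_{i',j',k'} \max\bigl[0,\; m + d(y_{i'}, y_{j'}) - d(y_{i'}, y_{k'})\bigr]
  && \text{(text structure preserving)}
\end{aligned}
```

The first two terms compare across modalities; the last two stay within one modality, where a "matching" instance is a neighbor of the anchor, for example another caption of the same image.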
Loss Calculation
● Triplet loss?
[Figure: triplet loss with a margin]
Source: “FaceNet: A unified embedding for face recognition and clustering”
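A single triplet term is a hinge on the gap between two distances. The sketch below uses plain Euclidean distance and a toy margin of 0.5; the function names and the example points are illustrative assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    return max(0.0, margin + euclidean(anchor, positive) - euclidean(anchor, negative))

anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # distance 1 from the anchor
near_negative = [1.0, 0.0]  # distance 1: violates the margin
far_negative = [3.0, 4.0]   # distance 5: satisfies the margin

print(triplet_loss(anchor, positive, near_negative, margin=0.5))  # 0.5
print(triplet_loss(anchor, positive, far_negative, margin=0.5))   # 0.0
```

Only triplets that violate the margin contribute, so training pushes non-matching instances away until they are at least the margin farther than matching ones.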
Loss Calculation
● Structure-preserving
[Figure legend: square = image, circle = sentence]
Evaluation
● Task:
  ● Image-to-sentence retrieval
    - Given an image, find nearest K sentences
  ● Sentence-to-image retrieval
    - Given a sentence, find nearest K images
  ● L2-distance
● Dataset
  ● MSCOCO
  ● Flickr30K
● Metric
  ● Recall @ 1, 5, 10 (GT: 5 captions per image)
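Recall@K can be sketched as follows, assuming the common convention that a query counts as a hit when at least one of its ground-truth matches appears among the K nearest candidates by L2 distance. The function names and the toy embeddings are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def recall_at_k(queries, candidates, ground_truth, k):
    """Fraction of queries with at least one ground-truth candidate in the k nearest.

    ground_truth[q] is the set of candidate indices that match query q.
    """
    hits = 0
    for q, query in enumerate(queries):
        ranked = sorted(range(len(candidates)),
                        key=lambda c: euclidean(query, candidates[c]))
        if ground_truth[q] & set(ranked[:k]):
            hits += 1
    return hits / len(queries)

# Toy embeddings: 2 image queries, 4 sentence candidates.
images = [[0.0, 0.0], [10.0, 10.0]]
sentences = [[0.1, 0.0], [5.0, 5.0], [9.0, 9.0], [10.0, 10.5]]
matches = [{1}, {3}]  # image 0 <-> sentence 1, image 1 <-> sentence 3

print(recall_at_k(images, sentences, matches, k=1))  # 0.5
print(recall_at_k(images, sentences, matches, k=2))  # 1.0
```

Swapping the roles of `images` and `sentences` gives the sentence-to-image direction.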
Evaluation setting index
● Net models
  ● Linear: just one linear projection (one fc)
  ● Non-linear: what we’ve covered
● Training constraints
  ● One-directional: λ1 = 0
  ● Bi-directional: λ1 = 1
  ● Structure: λ3 = 0.1
  ● λ2 = 0 for all cases
    - No images have the same caption
● Loss terms: image - sentence, sentence - image (weight λ1), image structure preserving (weight λ2), text structure preserving (weight λ3)
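The training settings differ only in how the loss terms are weighted. The sketch below assumes λ1 weights the sentence-to-image term, λ2 the image-structure term and λ3 the text-structure term, and that the structure setting is used together with the bi-directional one; the per-term values are made up purely to show the arithmetic.

```python
def total_loss(terms, lambda1, lambda2, lambda3):
    """Weighted sum of the four loss terms; the image-to-sentence term has weight 1."""
    return (terms["image_to_sentence"]
            + lambda1 * terms["sentence_to_image"]
            + lambda2 * terms["image_structure"]
            + lambda3 * terms["text_structure"])

# Assumed mapping of the settings to weights (lambda2 stays 0 in every case).
SETTINGS = {
    "one-directional": dict(lambda1=0.0, lambda2=0.0, lambda3=0.0),
    "bi-directional": dict(lambda1=1.0, lambda2=0.0, lambda3=0.0),
    "bi-directional + structure": dict(lambda1=1.0, lambda2=0.0, lambda3=0.1),
}

# Hypothetical per-term values for one mini-batch.
terms = {"image_to_sentence": 2.0, "sentence_to_image": 1.0,
         "image_structure": 4.0, "text_structure": 3.0}

for name, weights in SETTINGS.items():
    print(name, total_loss(terms, **weights))
```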
Result (Flickr30K) ● Mean vector: mean of word2vec vectors in a sentence ● Tf-idf: what we learned 24
Result (MSCOCO 1K test) ● Mean vector: mean of word2vec vectors in a sentence ● Tf-idf: what we learned 25
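The mean-vector baseline named in the result slides is simple enough to sketch directly. The tokenization by lowercasing and whitespace splitting, the handling of unknown words and the 3-D toy embeddings are assumptions for illustration; real word2vec vectors are typically 300-D.

```python
def mean_vector(sentence, word_vectors):
    """Average the vectors of the words that have an embedding; None if none do."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

# Hypothetical 3-D embeddings.
toy_vectors = {
    "a": [0.0, 0.0, 0.0],
    "black": [1.0, 0.0, 2.0],
    "cat": [2.0, 3.0, 1.0],
}

print(mean_vector("A black cat", toy_vectors))  # [1.0, 1.0, 1.0]
```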
Additional application - Phrase localization on Flickr30K ● Region proposal + text-image Embedding 26
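One plausible reading of "region proposal + text-image embedding" is to embed every proposed region and the phrase into the shared space and pick the closest region. The sketch below assumes exactly that; the function names and the 2-D toy embeddings are hypothetical, and the actual scoring rule used for localization may differ.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def localize_phrase(phrase_emb, region_embs):
    """Return (index, distance) of the region proposal closest to the phrase."""
    distances = [euclidean(phrase_emb, r) for r in region_embs]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]

phrase = [1.0, 0.0]                                # embedded phrase
regions = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]    # embedded region proposals
print(localize_phrase(phrase, regions)[0])  # 1
```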
Summary
● Image-to-text & text-to-image retrieval
● By embedding them into one space
● Image feature: pretrained CNN
● Text feature: word2vec + HGLMM + FV
● Loss: structure-preserving triplet loss
● Test:
  ● Image-to-text & text-to-image retrieval
  ● Phrase localization
Q&A 28