
Deep Image-Text Embeddings: Learning Deep Structure-Preserving Image-Text Embeddings



  1. CS688 Paper Presentation 1. Deep Image-Text Embeddings: Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016). Woobin Im (임우빈), 2016-11-08

  2. Sentence-to-image Retrieval: the user sends a query text ("A cat next to a blue chair and a deck") to the retrieval system and receives a result image.

  3. Image-to-sentence Retrieval: the user sends a query image to the retrieval system and receives a result text ("A black and white cat laying on the carrying case of a computer").

  4. Image-to-sentence Retrieval: the user sends a query image to the retrieval system, which picks the result text from among a sentence list ("A black and white cat laying on the carrying case of a computer").

  5. Image Description Generation: the user sends a query image to the retrieval system, and the result text is produced by text generation with NLP techniques ("A black and white cat laying on the carrying case of a computer").

  6. Image-sentence Embeddings: the image representation and the text representation are each mapped by a projection. Source: Accounting for the Relative Importance of Objects in Image Retrieval

  7. Examples of image-to-sentence retrieval. Source: Associating neural word embeddings with deep image representations using Fisher Vectors

  8. Datasets
     ● MSCOCO, Flickr 8K, Flickr 30K, Pascal 1K, …
     ● Each image has a few captions
     ● MSCOCO has object segmentation information
     ● Flickr30K has phrase localizations
     Example from the Flickr30k Entities dataset. Source: Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

  9. Paper ● Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016). Diagram: image feature and text feature.

  10. Paper ● Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016). Diagram: Image → pretrained CNN → image feature; Sentence → Word2vec → FV-HGLMM → text feature.

  11. Paper ● Learning Deep Structure-Preserving Image-Text Embeddings (CVPR 2016). Diagram: Image → pretrained CNN (VGG) → image feature → fc → fc → B-norm → loss; Sentence → Word2vec → FV-HGLMM → PCA → text feature → fc → fc → B-norm → loss.


  13. Image feature extraction ● Using pretrained VGG-VD-19: the resized image is cut into 5 crops (4 corners + center), each also flipped, giving 10 crops; the 10 resulting image features (4096D × 10) are averaged into one image feature (4096D).
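
The 10-crop averaging on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the 256×256 resized image, the 224×224 crop size, and the helper names `ten_crop_views` and `average_features` are hypothetical and not taken from the slides, and short toy vectors stand in for the 4096D CNN outputs.

```python
def ten_crop_views(width, height, crop):
    """List the 10 views: 4 corner crops + 1 center crop, each plain and flipped.

    Each view is ((left, top, right, bottom), flipped).
    """
    max_x, max_y = width - crop, height - crop
    origins = [
        (0, 0), (max_x, 0), (0, max_y), (max_x, max_y),  # 4 corners
        (max_x // 2, max_y // 2),                        # center
    ]
    boxes = [(x, y, x + crop, y + crop) for x, y in origins]
    return [(box, flipped) for flipped in (False, True) for box in boxes]


def average_features(features):
    """Element-wise mean of equally sized feature vectors."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


views = ten_crop_views(256, 256, 224)  # sizes are illustrative assumptions
# Stand-in for CNN outputs: one short vector per view instead of 4096D.
toy_features = [[float(i), 1.0] for i in range(len(views))]
image_feature = average_features(toy_features)
```

In a real pipeline each view would be cropped (and mirrored when `flipped` is true) before being passed through the network; only the final averaging step is shown with data here.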


  15. Text feature extraction ● Word2Vec: word semantic embedding. Source: Distributed representations of words and phrases and their compositionality

  16. Text feature extraction ● Fisher Vector of (HGLMM + GMM): the Word2Vec word vectors of a sentence are encoded with a Hybrid Gaussian-Laplacian Mixture Model (HGLMM) and a Gaussian Mixture Model (GMM), both trained with EM; the two Fisher Vectors are concatenated (18000D) and reduced by PCA to the final vector (6000D). Work of “Associating neural word embeddings with deep image representations using Fisher Vectors”


  18. Loss Calculation
     ● Structure-preserving triplet loss
     ● Notation: $i$: anchor instance; $j$: matching instance; $k$: non-matching instance; $x$: image; $y$: sentence; $d(\cdot,\cdot)$: Euclidean distance; $m$: margin
     ● Loss terms: image-sentence, sentence-image, image structure preserving, text structure preserving
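
The equation itself did not survive on this slide, so the following is a reconstruction from the four term labels rather than a copy of the original. It assumes $x$ is an image embedding, $y$ a sentence embedding, $d$ the Euclidean distance, $m$ the margin, and that in every triplet the first index is the anchor, the second a matching instance, and the third a non-matching instance. The weight names $\lambda_1, \lambda_2, \lambda_3$ and the exact indexing are assumptions.

```latex
\begin{aligned}
L(X, Y) ={} & \sum_{i,j,k} \max\bigl[0,\; m + d(x_i, y_j) - d(x_i, y_k)\bigr]
  && \text{(image-sentence)} \\
 & + \lambda_1 \sum_{i',j',k'} \max\bigl[0,\; m + d(x_{j'}, y_{i'}) - d(x_{k'}, y_{i'})\bigr]
  && \text{(sentence-image)} \\
 & + \lambda_2 \sum_{i,j,k} \max\bigl[0,\; m + d(x_i, x_j) - d(x_i, x_k)\bigr]
  && \text{(image structure preserving)} \\
 & + \lambda_3 \sum_{i',j',k'} \max\bigl[0,\; m + d(y_{i'}, y_{j'}) - d(y_{i'}, y_{k'})\bigr]
  && \text{(text structure preserving)}
\end{aligned}
```

Each term is the same hinge: it is zero once the matching instance is closer to the anchor than the non-matching one by at least the margin.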

  19. Loss Calculation ● Triplet loss? (Figure illustrating the margin.) Source: “FaceNet: A unified embedding for face recognition and clustering”
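
A single triplet term can be written in a few lines. This is a minimal sketch of the generic hinge-style triplet loss with Euclidean distance, not the presenter's code; the function names and the margin value used in the example are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equally sized vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss: zero once the positive is closer than the negative by `margin`."""
    return max(0.0, margin + euclidean(anchor, positive) - euclidean(anchor, negative))


# Margin 0.5 is an arbitrary example value, not one stated on the slides.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 3.0], margin=0.5)
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.5)
```

In the first call the matching instance is already closer by more than the margin, so the loss is zero; in the second the non-matching instance is closer, so the loss is positive and training would pull the matching pair together.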

  20. Loss Calculation ● Structure-preserving (in the figure, squares are images and circles are sentences)


  22. Evaluation
     ● Task:
       ● Image-to-sentence retrieval: given an image, find the nearest K sentences
       ● Sentence-to-image retrieval: given a sentence, find the nearest K images
       ● L2 distance
     ● Dataset
       ● MSCOCO
       ● Flickr30K
     ● Metric
       ● Recall @ 1, 5, 10 (GT: 5 captions per image)
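
The Recall@K metric above can be sketched as follows. This is an illustrative implementation under assumptions: a query counts as a hit if at least one of its ground-truth items is among the K nearest gallery items by L2 distance, and the names `recall_at_k`, `queries`, `gallery`, and `ground_truth` as well as the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equally sized vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def recall_at_k(queries, gallery, ground_truth, k):
    """Fraction of queries with at least one ground-truth item in the top k."""
    hits = 0
    for query_index, query in enumerate(queries):
        # Rank all gallery items by distance to the query, nearest first.
        ranked = sorted(range(len(gallery)),
                        key=lambda g: euclidean(query, gallery[g]))
        if any(g in ground_truth[query_index] for g in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Toy embeddings: 2 queries, 4 gallery items, one correct match per query.
queries = [[0.0, 0.0], [10.0, 10.0]]
gallery = [[0.0, 1.0], [5.0, 5.0], [10.0, 9.0], [0.0, 2.0]]
ground_truth = [{0}, {1}]

r_at_1 = recall_at_k(queries, gallery, ground_truth, k=1)
r_at_2 = recall_at_k(queries, gallery, ground_truth, k=2)
```

With 5 ground-truth captions per image, each `ground_truth` entry for an image query would hold 5 sentence indices instead of one.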

  23. Evaluation setting index
     ● Net models
       ● Linear: just one linear projection (one fc)
       ● Non-linear: what we’ve covered
     ● Training constraints
       ● One-directional: $\lambda_1 = 0$ (image-sentence term only)
       ● Bi-directional: $\lambda_1 = 1$ (sentence-image term added)
       ● Structure: $\lambda_3 = 0.1$ (text structure preserving)
       ● $\lambda_2 = 0$ for all cases (image structure preserving)
         ● No images have the same caption

  24. Result (Flickr30K)
     ● Mean vector: mean of the word2vec vectors in a sentence
     ● Tf-idf: what we learned
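
The "mean vector" baseline is simple enough to show directly. This sketch assumes word vectors are available as a plain dictionary and that unknown words are skipped; both choices, the function name, and the toy 2D vectors are assumptions, since the slides do not describe the tokenization or out-of-vocabulary handling.

```python
def mean_sentence_vector(sentence, word_vectors):
    """Average the vectors of the known words in a whitespace-tokenized sentence."""
    vectors = [word_vectors[word]
               for word in sentence.lower().split()
               if word in word_vectors]
    if not vectors:
        raise ValueError("no known words in sentence")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2D vectors standing in for real word2vec embeddings.
toy_vectors = {"cat": [1.0, 0.0], "chair": [0.0, 1.0]}
sentence_vector = mean_sentence_vector("A cat chair", toy_vectors)
```

Here "A" is not in the toy vocabulary and is ignored, so the result is the average of the two known word vectors.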

  25. Result (MSCOCO 1K test)
     ● Mean vector: mean of the word2vec vectors in a sentence
     ● Tf-idf: what we learned

  26. Additional application: phrase localization on Flickr30K ● Region proposal + text-image embedding

  27. Summary
     ● Image-to-text & text-to-image retrieval
       ● By embedding them into one space
     ● Image feature: pretrained CNN
     ● Text feature: word2vec + HGLMM + FV
     ● Loss: structure-preserving triplet loss
     ● Test:
       ● Image-to-text & text-to-image retrieval
       ● Phrase localization

  28. Q&A
