
Image Captioning Describe an image with meaningful and sensible - PowerPoint PPT Presentation



  1. 23rd International Conference on MultiMedia Modeling (MMM 2017) What Convnets Make for Image Captioning? Yu Liu*, Yanming Guo*, and Michael S. Lew Leiden Institute of Advanced Computer Science, Leiden University Presenter: Yanming Guo Discover the world at Leiden University

  2. Image Captioning: describe an image with meaningful and sensible sentence-level captions.
     - Objects
     - Actions
     - Descriptive words
     - Relations
     - …
     Example: "A large bus sitting next to a very tall building"

  3. Image Captioning
     - Retrieval approaches: map images to pre-defined sentences
     - Generative approaches: estimate novel sentences
     Example captions: "A white dog and a brown dog run along side each other at the beach"; "A dog running on a wet suit on the beach"

  4. Image Captioning
     - Retrieval approaches: map images to pre-defined sentences
     - Generative approaches: estimate novel sentences
     Advantages:
     - The caption does not have to have been seen previously
     - A good language model
     - More intelligent
     - Better performance

  5. General Structure
     - CNN: high-level image features
     - RNN: generates a sentence of words
     [Diagram: a CNN feeding an RNN that is unrolled over the tokens START, "White", "Cup", …, END]
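The word-by-word generation in this structure can be sketched as a greedy decoding loop. This is a minimal illustration, not the presenters' implementation: `greedy_caption`, `toy_next_word`, and the transition table are hypothetical names, and the lookup table merely stands in for a trained RNN.

```python
START, END = "<START>", "<END>"

def greedy_caption(image_features, next_word, max_len=20):
    """Generate a caption one word at a time, feeding each word back in."""
    words = [START]
    while len(words) <= max_len:
        word = next_word(image_features, words)
        if word == END:
            break
        words.append(word)
    return words[1:]  # drop the START token

# Toy stand-in for the RNN (assumption): a lookup from previous word to next word.
TOY_TRANSITIONS = {START: "white", "white": "cup", "cup": END}

def toy_next_word(image_features, words):
    # A real model would condition on image_features and the full word history.
    return TOY_TRANSITIONS[words[-1]]

caption = greedy_caption(image_features=[0.1, 0.7], next_word=toy_next_word)
print(" ".join(caption))  # prints "white cup"
```

The `max_len` cap guards against a model that never emits END.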

  6. General Structure
     Question: what Convnets make for image captioning?
     [Diagram: a CNN feeding an RNN that is unrolled over the tokens START, "White", "Cup", …, END]

  7. Three types of Convnets
     - Single-label Convnet: generic representation. Convnet pre-trained on the ImageNet dataset, e.g. AlexNet, VGG, …
     - Multi-label Convnet: salient objects. Convnet fine-tuned on the 80 object categories of MS COCO
     - Multi-attribute Convnet: salient objects, actions, relations, … Convnet fine-tuned on attributes of MS COCO (e.g. 300 attributes)
     [Diagram labels: Single-label, finetune, Multi-label, Multi-attribute]
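The difference between a single-label and a multi-label training target can be illustrated with a small sketch. It assumes a multi-hot encoding over the label set; the slides do not state the encoding or the loss used, and `CATEGORIES` is a shortened, hypothetical stand-in for the 80 MS COCO object categories.

```python
# Shortened label set for illustration only (assumption, not the full list).
CATEGORIES = ["person", "dog", "boat", "surfboard"]

def one_hot(label, categories):
    """Single-label target: exactly one category is active."""
    return [1.0 if c == label else 0.0 for c in categories]

def multi_hot(present, categories):
    """Multi-label target: every category present in the image is active."""
    return [1.0 if c in present else 0.0 for c in categories]

single_target = one_hot("boat", CATEGORIES)
multi_target = multi_hot({"person", "dog", "boat"}, CATEGORIES)
print(single_target)  # [0.0, 0.0, 1.0, 0.0]
print(multi_target)   # [1.0, 1.0, 1.0, 0.0]
```

A multi-attribute target would have the same shape, with the label set replaced by attribute words.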

  8. Three types of Convnets
     [Figure: visualization of the most activated feature map in conv5_3 for an input image, shown for the single-label, multi-label, and multi-attribute Convnets]

  9. Multi-Convnet Aggregation
     - Single-label feature, multi-label feature, and multi-attribute feature are combined into an aggregation feature ag(x)
     [Diagram: ag(x) is fed to each step of an unrolled LSTM; the steps are labeled y_0, y_1, …, y_{i−2}, …, y_{T−1} and q_1, q_2, …, q_{i−1}, …, q_T]

  10. Multi-Scale Testing
     [Diagram: the image is processed at scale 224 by the CNN and at scales 256 and 320 by an FCN (labeled "transfer"); the outputs are averaged into x_t and passed to the LSTM for caption generation]
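The averaging step in multi-scale testing can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_extract` is a hypothetical stand-in for the CNN/FCN feature extractors, and its fixed vectors are made-up values, not results from the presentation.

```python
def average_features(per_scale_features):
    """Element-wise mean of equally sized feature vectors, one per scale."""
    n = len(per_scale_features)
    dim = len(per_scale_features[0])
    if any(len(f) != dim for f in per_scale_features):
        raise ValueError("all feature vectors must have the same length")
    return [sum(f[i] for f in per_scale_features) / n for i in range(dim)]

def toy_extract(image, scale):
    # Hypothetical extractor: returns a fixed toy vector for each scale.
    toy = {224: [1.0, 2.0], 256: [2.0, 4.0], 320: [3.0, 6.0]}
    return toy[scale]

scales = [224, 256, 320]
features = [toy_extract(image=None, scale=s) for s in scales]
x_t = average_features(features)
print(x_t)  # [2.0, 4.0]
```

Averaging requires every scale to yield a vector of the same length, which is why the sketch checks dimensions before combining.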

  11. Experiments
     - BLEU: measures the precision of n-grams between the generated and reference sentences (e.g. B-1, B-2, B-3, B-4).
     - METEOR: computed from the alignment between the words in the generated and reference sentences.
     - ROUGE-L: focuses on the set of words that appear in the same order in the two sentences.
     - CIDEr: uses tf-idf weights for each n-gram.
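Two of these ideas can be made concrete with a short sketch: clipped n-gram precision (the core of BLEU) and the longest common subsequence (the core of ROUGE-L). It is a simplified illustration, not a full metric implementation: it handles one reference only and omits BLEU's brevity penalty and geometric mean. The function names are hypothetical; the example sentences are the aggregation caption and ground truth from the qualitative example in the slides.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

cand = "a man and a dog on a small boat".split()
ref = "a man and a dog on a small yellow boat".split()
print(clipped_precision(cand, ref, 1))  # 1.0
print(clipped_precision(cand, ref, 2))  # 0.875
print(lcs_length(cand, ref))            # 9
```

Only the bigram ("small", "boat") is missing from the reference, so 7 of the 8 candidate bigrams match.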

  12. Experiments
     - Multi-scale: considerable improvement
     - SL-Net: largest dimension & worst performance
     - ML-Net: smallest dimension & considerable improvement
     - MA-Net: medium dimension & significant improvement

  13. Experiments
     - Multi-scale testing using FCN is always better
     - The aggregation of different Convnets can enhance the performance

  14. Experiments
     - Single-label Convnet: A man is sitting on the water with a surfboard.
     - Multi-label Convnet: A man sitting on a boat in front of a boat.
     - Multi-attribute Convnet: A man and a dog on a boat.
     - Multi-Convnet aggregation: A man and a dog on a small boat.
     - Ground truth: A man and a dog on a small yellow boat.

  15. Experiments

  16. Experiments
     - Ours: A man riding a wave in the ocean.
       GT: A man riding a wave on a surfboard in the ocean.
     - Ours: A living room with a lot of furniture.
       GT: Living room with furniture with garage door at one end.
     - Ours: A close up of an elephant with an elephant
       GT: A man getting a kiss on the neck from an elephant's trunk
     - Ours: A man riding a horse at a horse.
       GT: A horse that threw a man off a horse.

  17. Conclusion
     - Multi-attribute Convnet performs better for image captioning
     - The aggregation of different Convnets can deliver slightly better performance than each individual Convnet
     - Efficient multi-scale augmentation test using FCNs
     - Comparable results with the state-of-the-art

  18. Thanks for your attention! Questions please?
