23rd International Conference on MultiMedia Modeling (MMM 2017)
What Convnets Make for Image Captioning?
Yu Liu*, Yanming Guo*, and Michael S. Lew
Leiden Institute of Advanced Computer Science, Leiden University
Presenter: Yanming Guo
Discover the world at Leiden University
Image Captioning
Describe an image with meaningful and sensible sentence-level captions.
Caption elements: objects, actions, descriptive words, relations, …
Example: A large bus sitting next to a very tall building
Image Captioning
Retrieval approaches ---- Map images to pre-defined sentences
Generative approaches ---- Estimate novel sentences
Example captions:
A white dog and a brown dog run along side each other at the beach
A dog running on a wet suit on the beach
Image Captioning
Retrieval approaches ---- Map images to pre-defined sentences
Generative approaches ---- Estimate novel sentences
Advantages of generative approaches:
The caption does not have to have been seen previously
A good language model
More intelligent
Better performance
General Structure
[Diagram: a CNN extracts high-level image features; an RNN generates a sentence of words. RNN inputs: START, “White”, “Cup”, …; RNN outputs: “White”, “Cup”, …, END.]
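The word-by-word generation in the diagram can be sketched as a greedy decoding loop. This is a minimal illustration, not the authors' implementation: `greedy_decode` and `toy_next_word` are hypothetical names, and the lookup table merely stands in for a trained RNN so that the example runs on its own.

```python
def greedy_decode(next_word, image_feature, max_len=20):
    """Generate words one at a time, starting from START and stopping at END."""
    words = []
    prev = "START"
    state = None  # recurrent state, opaque to this loop
    for _ in range(max_len):
        word, state = next_word(image_feature, prev, state)
        if word == "END":
            break
        words.append(word)
        prev = word  # the emitted word becomes the next input
    return words


def toy_next_word(image_feature, prev, state):
    """Hypothetical stand-in for the RNN: a fixed table instead of a learned model."""
    table = {"START": "White", "White": "Cup", "Cup": "END"}
    return table[prev], state


caption = greedy_decode(toy_next_word, image_feature=[0.1, 0.2])
print(caption)
```

In a real system `next_word` would run one RNN step conditioned on the CNN feature and return the most probable word; beam search is a common alternative to the greedy choice.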
General Structure
[Diagram: a CNN feeds an RNN. RNN inputs: START, “White”, “Cup”, …; RNN outputs: “White”, “Cup”, …, END.]
What Convnets make for image captioning?
Three types of Convnets
[Diagram: Single-label Convnet, fine-tuned into the Multi-label and Multi-attribute Convnets]
Single-label Convnet: generic representation ---- Convnet pre-trained on the ImageNet dataset, e.g. AlexNet, VGG, …
Multi-label Convnet: salient objects ---- Convnet fine-tuned on the 80 object categories of MS COCO
Multi-attribute Convnet: salient objects, actions, relations, … ---- Convnet fine-tuned on attributes of MS COCO (e.g. 300 attributes)
Three types of Convnets
[Figure: visualization of the most activated feature map in conv5_3 for an input image, shown for the Single-label, Multi-label, and Multi-attribute Convnets]
Multi-Convnet Aggregation
[Diagram: the single-label, multi-label, and multi-attribute features are combined into an aggregation feature ag(x). A chain of LSTM units receives ag(x) at every step together with the inputs y_0, y_1, …, y_{i−2}, …, y_{T−1} and produces the outputs q_1, q_2, …, q_{i−1}, …, q_T.]
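A rough sketch of how an aggregated feature could be fed to every LSTM step follows. The slide does not say how ag(x) is computed, so plain concatenation is an assumption here, and `aggregate` and `lstm_step_inputs` are hypothetical helper names. The feature values are toy numbers.

```python
def aggregate(single_label, multi_label, multi_attribute):
    """Assumed aggregation: concatenate the three Convnet features into ag(x)."""
    return list(single_label) + list(multi_label) + list(multi_attribute)


def lstm_step_inputs(word_vectors, ag_x):
    """Pair each word input y_t with the same ag(x), as the diagram suggests."""
    return [list(y) + list(ag_x) for y in word_vectors]


# Toy features: 3-dim single-label, 2-dim multi-label, 2-dim multi-attribute.
ag_x = aggregate([0.5, 0.1, 0.9], [1.0, 0.0], [0.2, 0.7])

# Two toy word vectors standing in for y_0 and y_1.
steps = lstm_step_inputs([[1.0, 0.0], [0.0, 1.0]], ag_x)
print(len(ag_x), [len(s) for s in steps])
```

Other aggregation choices (for example a learned projection before merging) would change only `aggregate`; the per-step wiring stays the same.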
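Multi-Scale Testing
[Diagram: the test image is resized to 224, 256, and 320. The 224 input passes through the CNN; the CNN weights are transferred to an FCN that processes the 256 and 320 inputs. The outputs are averaged and fed to the LSTM (input x_t) for caption generation.]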
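The averaging step can be illustrated with a small sketch. It assumes that each scale yields a grid of feature vectors (a 1 x 1 grid for the 224 input, larger grids for the FCN inputs), that each grid is first averaged over its spatial positions, and that the per-scale vectors are then averaged; the slide only says "average", so this two-stage order is an assumption. The function names and toy values are hypothetical.

```python
def spatial_mean(feature_map):
    """Average an H x W grid of D-dimensional vectors into one D-dimensional vector."""
    cells = [vec for row in feature_map for vec in row]
    dim = len(cells[0])
    return [sum(v[d] for v in cells) / len(cells) for d in range(dim)]


def multi_scale_feature(maps_by_scale):
    """Average the spatially pooled features over all test scales."""
    pooled = [spatial_mean(m) for m in maps_by_scale.values()]
    dim = len(pooled[0])
    return [sum(p[d] for p in pooled) / len(pooled) for d in range(dim)]


# Toy 2-dimensional features; grid sizes are illustrative only.
maps = {
    224: [[[1.0, 2.0]]],
    256: [[[2.0, 4.0], [4.0, 0.0]], [[0.0, 2.0], [2.0, 2.0]]],
    320: [[[3.0, 5.0], [3.0, 5.0]], [[3.0, 5.0], [3.0, 5.0]]],
}
x_t = multi_scale_feature(maps)
print(x_t)
```

Because a fully convolutional network accepts inputs of any size, the larger scales need no cropping: one forward pass per scale produces the whole grid.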
Experiments
BLEU: measures the precision of n-grams between the generated and reference sentences (e.g. B-1, B-2, B-3, B-4).
METEOR: computed from the alignment between the words of the generated and reference sentences.
ROUGE-L: focuses on the longest sequence of words that appear in the same order in the two sentences.
CIDEr: uses tf-idf weights for each n-gram.
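To make the n-gram precision behind BLEU concrete, here is a minimal sketch of clipped (modified) n-gram precision. It deliberately omits the brevity penalty and the geometric mean over n that the full BLEU score uses, and `modified_precision` is an illustrative name rather than a library function. The two sentences are the aggregated caption and the ground truth from the qualitative example in this talk.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the reference where it is most frequent."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "a man and a dog on a small boat".split()
reference = "a man and a dog on a small yellow boat".split()

p1 = modified_precision(candidate, [reference], 1)  # all 9 unigrams match
p2 = modified_precision(candidate, [reference], 2)  # 7 of 8 bigrams match
print(p1, p2)
```

Only the bigram "small boat" is missing from the reference, which is why the bigram precision drops while the unigram precision stays perfect.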
Experiments
Multi-scale: considerable improvement
SL-Net (single-label): largest dimension & worst performance
ML-Net (multi-label): smallest dimension & considerable improvement
MA-Net (multi-attribute): medium dimension & significant improvement
Experiments
Multi-scale testing using FCN is always better
The aggregation of different Convnets can enhance the performance
Experiments
Single-label Convnet: A man is sitting on the water with a surfboard.
Multi-label Convnet: A man sitting on a boat in front of a boat.
Multi-attribute Convnet: A man and a dog on a boat.
Multi-Convnet aggregation: A man and a dog on a small boat.
Ground truth: A man and a dog on a small yellow boat.
Experiments
Ours: A man riding a wave in the ocean.
GT: A man riding a wave on a surfboard in the ocean.
Ours: A living room with a lot of furniture.
GT: Living room with furniture with garage door at one end.
Ours: A man riding a horse at a horse.
GT: A horse that threw a man off a horse.
Ours: A close up of an elephant with an elephant
GT: A man getting a kiss on the neck from an elephant's trunk
Conclusion
The multi-attribute Convnet performs better for image captioning than the single-label and multi-label Convnets
The aggregation of different Convnets delivers slightly better performance than each individual Convnet
Multi-scale test-time augmentation is efficient when using FCNs
Results are comparable with the state of the art
Thanks for your attention! Questions please?