Model outline
1. How do we process images?
2. How do we process sentences?
3. How do we compare an image and a sentence?
Example caption: “A dog jumping over a hurdle”
End-to-end learning
All three questions are answered by a single differentiable function:
1. How do we process images?
2. How do we process sentences?
3. How do we compare an image and a sentence?
“A dog jumping over a hurdle”
Model outline
1. How do we process images? → Convolutional Network
2. How do we process sentences?
3. How do we compare an image and a sentence?
“A dog jumping over a hurdle”
Processing Sentences
“A man wearing a helmet jumps on his bike near a beach.”
Q: How can we encode the sentence into a fixed-size vector?
Idea 1: Bag of Words
“A man wearing a helmet jumps on his bike near a beach.”
Represent the sentence by the counts of its words, discarding word order.
Idea 2: Bag of n-grams
e.g. n = 2: “A man wearing a helmet jumps on his bike near a beach.”
Count contiguous word pairs (“a man”, “man wearing”, …) to retain some local order.
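A minimal sketch of both encodings in plain Python (the tokenization here is deliberately naive):

```python
from collections import Counter

def bag_of_ngrams(sentence, n=1):
    """Counts of contiguous n-grams; n=1 gives a plain bag of words."""
    tokens = sentence.lower().rstrip(".").split()
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams)

sent = "A man wearing a helmet jumps on his bike near a beach."
print(bag_of_ngrams(sent, n=1))  # word counts: order is discarded
print(bag_of_ngrams(sent, n=2))  # bigram counts: some local order retained
```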
Our approach
1. Extract a dependency tree [1] (root “jumps”, with arcs such as nsubj → “man”, prep → “on”/“near”, pobj → “bike”/“beach”, partmod → “wearing”, dobj → “helmet”, …).
2. Compute a representation recursively over the tree with a Recursive Tensor Neural Network, applying the recursive formula at each node.
[1] de Marneffe and Manning, 2008
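A toy sketch of the recursive idea: each node's vector is computed from its own word vector and its children's vectors, so the root yields a fixed-size sentence vector. The particular composition below (shared matrices W and Wv with a tanh) is an illustrative simplification, not the exact tensor formulation from the paper.

```python
import numpy as np

d = 4                     # toy embedding dimension
rng = np.random.default_rng(0)
W  = rng.standard_normal((d, d)) * 0.1   # maps a child vector into the parent
Wv = rng.standard_normal((d, d)) * 0.1   # maps the node's own word vector

def compose(word_vec, child_vecs):
    """Parent vector = tanh(Wv x_word + sum_c W h_c)."""
    h = Wv @ word_vec + sum((W @ c for c in child_vecs), np.zeros(d))
    return np.tanh(h)

# Tiny tree: "man jumps" with "jumps" as root and "man" as its nsubj child.
x_man, x_jumps = rng.standard_normal(d), rng.standard_normal(d)
h_man  = compose(x_man, [])           # leaf node
h_root = compose(x_jumps, [h_man])    # root: the fixed-size sentence vector
```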
Model outline
1. How do we process images? → Convolutional Network
2. How do we process sentences? → Recursive Tensor Neural Network
3. How do we compare an image and a sentence?
“A dog jumping over a hurdle”
Matching Image and Sentence
image → Convolutional Network → vector
“A dog jumping over a hurdle” → Recursive Tensor Neural Network → vector
score = product of the two vectors
Intuition: individual dimensions act as detectors.
Convolutional Network: are there fur textures?
Recursive Tensor Neural Network: is a “dog” mentioned in “A dog jumping over a hurdle”?
Score matrix (rows: sentences, columns: images):

                                        img 1   img 2   img 3
“dog jumping over a hurdle”               0.5     0.1    -1.5
“man in blue wetsuit surfing”            -1.5     2.0     0.9
“baseball player throwing the ball”       0.3     0.6     2.1
Given image and sentence vectors, collect the pairwise scores into a matrix S (as above), with S_kk the score of the k-th correct image-sentence pair. The (structured, max-margin) loss becomes:

C(θ) = Σ_k [ Σ_l max(0, S_kl − S_kk + 1) + Σ_l max(0, S_lk − S_kk + 1) ]

The first term ranks images (moving along a row, the correct image should win by a margin) and the second ranks sentences (moving down a column, the correct sentence should win).
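A minimal numpy sketch of this loss, evaluated on the slide's score matrix (the margin of 1 follows the formula above):

```python
import numpy as np

def ranking_loss(S, margin=1.0):
    """Max-margin ranking loss; S[k, l] = score(sentence k, image l),
    with correct pairs on the diagonal."""
    diag = np.diag(S)
    rank_images = np.maximum(0, margin + S - diag[:, None])     # along rows
    rank_sentences = np.maximum(0, margin + S - diag[None, :])  # down columns
    np.fill_diagonal(rank_images, 0)       # don't penalize the correct pairs
    np.fill_diagonal(rank_sentences, 0)
    return rank_images.sum() + rank_sentences.sum()

# The score matrix from the slide (sentences x images).
S = np.array([[ 0.5, 0.1, -1.5],
              [-1.5, 2.0,  0.9],
              [ 0.3, 0.6,  2.1]])
print(ranking_loss(S))
```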
Image Sentence Datasets (5 sentences per image):
[1] Pascal 1K: 1,000 images
[2] Flickr8K: 8,000 images
[3] Flickr30K: 30,000 images
[4] MSCOCO: 115,000 images
[1] Rashtchian et al., 2010  [2] Hodosh et al., 2013  [3] Young et al., 2014  [4] Lin et al., 2015
Example results: sentence retrieval
Showing the top 4 matching sentences (out of 5,000).
Image Sentence Fragment ranking
Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei, CVPR 2015.
Limitations of Ranking
1. Limiting. We cannot describe images with sentences that are not in the training data.
2. Inefficient. We have to loop over and score all sentences one by one when annotating an image.
3. Unsatisfying. Especially compared to humans, who can easily generate descriptions.
Outline
1. Matching images with sentences [1]: image + “Dog jumping over a hurdle.” → matching score
2. Generating captions for images [2]: image → “Dog jumping over a hurdle.”
3. Generating multiple localized captions [3]: image → “A black and white dog in mid-air”, “A hand of a person”, “A blue and white hurdle”
[1] Grounded Compositional Semantics for Finding and Describing Images with Sentences. Socher, Karpathy, Le, Manning, Ng, TACL 2013.
[2] Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei, CVPR 2015.
[3] DenseCap: Fully Convolutional Localization Networks for Dense Captioning. Johnson*, Karpathy*, Fei-Fei, CVPR 2016.
Generating Captions for Images: Problem Statement
image → “A dog jumping over a hurdle”
Generating Descriptions: Prior work
Baby Talk (Kulkarni et al., 2011), example template: “(noun) in (noun) is (verb) in (noun).”
[Duygulu ’02] [Barnard ’03] [Yao ’10] [Yang ’11] [Mitchell ’12] [Barbu ’12] [Gupta & Mannem ’12] [Elliott & Keller ’13] [Frome ’13] [Kiros ’14] [Yatskar ’14]
Core Challenge: how can we predict sentences?
image → ??? (differentiable function) → “A dog jumping over a hurdle”
Core Challenge: how can we predict sentences?
image → Convolutional Network → ??? (differentiable function) → “A dog jumping over a hurdle”
Core Challenge: how can we predict sentences?
In image classification: image → Convolutional Network (differentiable function) → “dog”
Core Challenge
image → Convolutional Network → ??? → “a”, “dog”, “jumping”, “over”, “a”, “hurdle”, <end>
Sentences have a variable number of words => the output is not fixed size!
Language Model
A model over sequences of words: P(word | previous words)
Recurrent Neural Network Language Model
Feed in words one at a time (e.g. with a “one-hot” encoding); the RNN predicts the distribution over the next word.
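A toy numpy sketch of one step of such a language model; all sizes and parameter names are illustrative assumptions:

```python
import numpy as np

V, H = 10, 8                              # toy vocabulary and hidden sizes
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, V)) * 0.1   # one-hot word -> hidden
Whh = rng.standard_normal((H, H)) * 0.1   # previous hidden -> hidden
Why = rng.standard_normal((V, H)) * 0.1   # hidden -> next-word scores

def rnn_step(word_idx, h_prev):
    x = np.zeros(V); x[word_idx] = 1.0          # "one-hot" encoding of the word
    h = np.tanh(Wxh @ x + Whh @ h_prev)         # new hidden state
    scores = Why @ h
    p_next = np.exp(scores) / np.exp(scores).sum()  # P(next word | previous words)
    return p_next, h

h = np.zeros(H)
for w in [0, 3, 7]:                       # feed in words one at a time
    p_next, h = rnn_step(w, h)
```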
Image Classification
image → Convolutional Network → “dog”
Image Captioning
image → Convolutional Network → RNN hidden states h1…h7 → “a”, “dog”, “jumping”, “over”, “a”, “hurdle”, <end>
Q: how do we condition the generative process on the image information?
Image Captioning
image → Convolutional Network → RNN hidden states h1…h7 → “a”, “dog”, “jumping”, “over”, “a”, “hurdle”, <end>
A: feed the CNN image features into the RNN.
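A self-contained toy sketch of this conditioning: inject the CNN features at the first step (as in Karpathy & Fei-Fei, CVPR 2015, which conditions on the image at the first time step). The sizes, the projection Wih, and the START/<end> token indices are all illustrative assumptions.

```python
import numpy as np

V, H, D = 10, 8, 6                        # toy vocab, hidden, CNN-feature sizes
rng = np.random.default_rng(1)
Wxh = rng.standard_normal((H, V)) * 0.1   # one-hot word -> hidden
Whh = rng.standard_normal((H, H)) * 0.1   # previous hidden -> hidden
Why = rng.standard_normal((V, H)) * 0.1   # hidden -> next-word scores
Wih = rng.standard_normal((H, D)) * 0.1   # CNN image features -> hidden

def rnn_step(word_idx, h_prev):
    x = np.zeros(V); x[word_idx] = 1.0
    h = np.tanh(Wxh @ x + Whh @ h_prev)
    p = np.exp(Why @ h)
    return p / p.sum(), h

cnn_features = rng.standard_normal(D)     # stand-in for CNN(image)
h = np.tanh(Wih @ cnn_features)           # image-conditioned initial state
word, caption = 0, []                     # 0 = hypothetical START token
for _ in range(20):                       # generate until <end> or a length cap
    p_next, h = rnn_step(word, h)
    word = int(np.argmax(p_next))         # greedy decoding, for the sketch
    if word == V - 1:                     # V-1 = hypothetical <end> token
        break
    caption.append(word)
```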
Image Sentence Datasets: the same datasets as before (Pascal 1K, Flickr8K, Flickr30K, MSCOCO; 5 sentences per image).
Example result: “a woman in a bikini is jumping over a hurdle.”
Limitations
Example: “A group of people in an office.”
Outline (recap)
Next: 3. Generating multiple localized captions [3]
Dense Captioning
The task sits in a 2×2 grid of label density (whole image → image regions) and label complexity (single label → sequence):
- Classification (whole image, single label): “Cat”
- Detection (image regions, single label): “Cat”, “Skateboard”
- Captioning (whole image, sequence): “A cat riding a skateboard”
- Dense Captioning (image regions, sequence): “Orange spotted cat”, “Skateboard with red wheels”, “Cat riding a skateboard”, “Brown hardwood flooring”
Our approach
End-to-end learning: formulate a single differentiable function from inputs to outputs.
Region-level description data
Example annotations: “red frisbee”, “frisbee is mid air”, “frisbee is flying”, …
# Images: 108,077
# Region descriptions: 4,297,502
Visual Genome Dataset, Krishna et al., 2016
Dense Captioning Architecture
image → Convolutional Network → … → RNN
Dense Captioning Architecture
image → Convolutional Network → Localization layer
[Ren et al., 2015] [Girshick et al., 2015] [Szegedy et al., 2015]
Dense Captioning Architecture: Localization layer
1. Propose regions of interest: predict 300 scored boxes [(x1, y1, x2, y2, score), …] (300 × 5 = 1500 numbers total)
Dense Captioning Architecture: Localization layer
2. Align the predictions to the true boxes.
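A sketch of one plausible alignment rule: match each predicted box to its best-overlapping ground-truth box by intersection-over-union. The greedy threshold here is an assumption for illustration, not necessarily the paper's exact rule.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def align(pred_boxes, true_boxes, thresh=0.5):
    """For each prediction, the index of the best-overlapping true box (or -1)."""
    matches = []
    for p in pred_boxes:
        ious = [iou(p, t) for t in true_boxes]
        best = int(np.argmax(ious))
        matches.append(best if ious[best] >= thresh else -1)
    return matches

print(align([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 11, 11)]))  # [0, -1]
```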
Dense Captioning Architecture: Localization layer
3. Crop out the aligned regions from the convolutional features (reuse computation!)
[Fast R-CNN, Girshick et al., 2015]
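A sketch of the "reuse computation" idea: the convolutional feature map is computed once per image, and each region is cropped out of it to a fixed-size grid. This toy version uses nearest-neighbor sampling; DenseCap itself uses a differentiable bilinear crop.

```python
import numpy as np

def roi_crop(features, box, out_size=7):
    """features: (H, W, C) conv feature map; box: (x1, y1, x2, y2) in feature coords."""
    H, W, _ = features.shape
    x1, y1, x2, y2 = box
    xs = np.clip(np.linspace(x1, x2, out_size).round().astype(int), 0, W - 1)
    ys = np.clip(np.linspace(y1, y2, out_size).round().astype(int), 0, H - 1)
    return features[np.ix_(ys, xs)]          # (out_size, out_size, C)

feats = np.random.rand(32, 32, 256)          # computed once for the whole image
region = roi_crop(feats, (4, 6, 20, 28))     # fixed-size features for one region
```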
Example predicted region captions:
- man wearing a blue shirt sitting on a chair
- black computer monitor
- wall is white
- people are in the background
- red and brown chair
- silver handle on the wall
- black bag on the floor
- computer monitor on a desk
- man with black hair
Quantitative Results
Dense captioning mAP (higher = better):
- Image Captioning with Region proposals [3]: 4.26
- DenseCap: 5.39
Throughput in frames per second (higher = better):
- Image Captioning with Region proposals [3]: 0.31
- DenseCap: 4.17
Better performance and a 13× speedup!
[3] Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei, CVPR 2015.
Find DenseCap on GitHub!
- Implemented in Torch
- Code for training
- Pretrained models
- Webcam demo code
Finding regions given descriptions
Query: “head of a giraffe” → search over the predicted regions
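A sketch of how such a search could work: embed the query and every predicted region in a shared space and rank regions by similarity. The embeddings below are random stand-ins for the model's learned encoders.

```python
import numpy as np

def find_regions(query_vec, region_vecs, region_boxes, top_k=3):
    """Rank regions by inner-product similarity to the query embedding."""
    scores = region_vecs @ query_vec
    order = np.argsort(-scores)[:top_k]
    return [(region_boxes[i], float(scores[i])) for i in order]

rng = np.random.default_rng(0)
region_vecs = rng.standard_normal((5, 16))    # 5 regions, 16-d embeddings
region_boxes = [(i * 10, i * 10, i * 10 + 30, i * 10 + 30) for i in range(5)]
query_vec = rng.standard_normal(16)           # stand-in for encode("head of a giraffe")
print(find_regions(query_vec, region_vecs, region_boxes))
```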
Finding regions given descriptions: more example queries
“head of a giraffe”, “white tennis shoes”, “hands holding a phone”, “front wheel of a bus”
Summary
1. Matching images with sentences
image → Convolutional Network; “A dog jumping over a hurdle” → Recursive Tensor Neural Network; compare the two vectors → score
Candidate sentences: “man in blue wetsuit surfing”, “baseball player throwing the ball”, …
Summary
2. Generating captions for images
image → Convolutional Network → RNN → “a dog jumping over a hurdle”
Summary
3. Generating multiple localized captions for images
image → Convolutional Network → Localization layer → …
Going Forward…
- Evaluation innovations (or lack thereof)
- Dataset/Task innovations
- Modeling innovations
Evaluation
Each test image comes with 5 reference sentences; compare them against the candidate generated caption: “A red car with two people next to it.”
Fixing evaluation … :s
Learning the evaluation metric?
Supervision:
- sentences for the same image should be closer to each other than to sentences from other images?
- or: directly use human judgements as supervision?
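A sketch of the first supervision idea as a triplet objective: push embeddings of same-image sentences together and cross-image sentences apart. The encoder stub and margin are illustrative assumptions, not a method from the talk's papers.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: the same-image pair should be closer
    than the cross-image pair by at least the margin."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, margin + d_pos - d_neg)

rng = np.random.default_rng(0)
emb = lambda s: rng.standard_normal(16)   # stand-in encoder (ignores its input here)
loss = triplet_loss(emb("a dog jumps a hurdle"),
                    emb("dog leaping over a bar"),   # sentence from the same image
                    emb("a man surfing a wave"))     # sentence from another image
```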