
Connecting Images with Natural Language
Andrej Karpathy
CVPR 2016, Deep Vision workshop. July 1, 2016

Connecting images and natural language: from the visual domain to the domain of natural language [Cho et al. 2014]. Example captions: "A man sitting on …", "A person sitting on a …"


  1. Model outline
     1. How do we process images?
     2. How do we process sentences?
     3. How do we compare an image and a sentence? (e.g. "A dog jumping over a hurdle")

  2. End-to-end learning: the whole pipeline (process the image, process the sentence, compare the two) is a single differentiable function.

  3-4. Model outline: the image is processed with a Convolutional Network; it remains to process the sentence ("A dog jumping over a hurdle") and to compare the image and the sentence.

  5. Processing Sentences: "A man wearing a helmet jumps on his bike near a beach." Q: How can we encode the sentence into a fixed-size vector?

  6. Idea 1: Bag of Words. Count each word of "A man wearing a helmet jumps on his bike near a beach." independently, ignoring word order.

  7. Idea 2: Bag of n-grams, e.g. n = 2: count the bigrams of "A man wearing a helmet jumps on his bike near a beach.", which preserve some local word order. A sketch of both ideas follows.
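
A minimal Python sketch of both ideas; indexing these counts into a fixed vocabulary would give the fixed-size vector asked for above (the tokenizer and example are illustrative):

```python
from collections import Counter

def bag_of_ngrams(sentence, n=1):
    """Count n-grams of a sentence; n=1 is the plain bag of words."""
    tokens = sentence.lower().replace(".", "").split()
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)

s = "A man wearing a helmet jumps on his bike near a beach."
print(bag_of_ngrams(s, n=1))   # e.g. ('a',): 3, ('man',): 1, ...
print(bag_of_ngrams(s, n=2))   # e.g. ('a', 'man'): 1, ('man', 'wearing'): 1, ...
```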

  8. Our approach:
     1. Extract a dependency tree [de Marneffe and Manning 2008].
     2. Compute the representation recursively over the tree with a Recursive Tensor Neural Network, applying the recursive formula bottom-up.
     (figure: dependency parse of "A man wearing a helmet jumps on his bike near a beach.", with relations such as nsubj, prep, pobj, partmod, poss, det, dobj)
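
The talk's model is a Recursive *Tensor* Neural Network over this parse; the sketch below shows only the simpler non-tensor recursive composition, with untrained random weights and a hand-written tree, just to convey how a tree is folded into one fixed-size vector (all names and the tree encoding are illustrative):

```python
import numpy as np

D = 8                                  # embedding size (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (D, 2 * D))    # composition weights (untrained)
vocab = {}

def word_vec(w):
    # Lazily assign a random embedding to each word.
    if w not in vocab:
        vocab[w] = rng.normal(0, 0.1, D)
    return vocab[w]

def encode(node):
    """Fold a (head_word, children) tree into one fixed-size vector,
    merging the head with each dependent's encoding in turn."""
    word, children = node
    v = word_vec(word)
    for child in children:
        v = np.tanh(W @ np.concatenate([v, encode(child)]))
    return v

# Hand-written fragment of the dependency tree above.
tree = ("jumps",
        [("man", [("A", []), ("wearing", [("helmet", [("a", [])])])]),
         ("on", [("bike", [("his", [])])]),
         ("near", [("beach", [("a", [])])])])
print(encode(tree).shape)   # (8,) -- one vector for the whole sentence
```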

  9. Model outline: the image is encoded with a Convolutional Network, the sentence ("A dog jumping over a hurdle") with a Recursive Tensor Neural Network; it remains to compare the two.

  10. Matching Image and Sentence: the Convolutional Network encodes the image, the Recursive Tensor Neural Network encodes "A dog jumping over a hurdle", and a score is computed between the two vectors (e.g. an inner product).

  11. Intuition for the score: a coordinate of the image code can act as a detector ("are there fur textures?") while the corresponding coordinate of the sentence code acts as its textual counterpart ("is a 'dog' mentioned?"), so true pairs score highly.

  12. Scores for each sentence-image pair (rows: sentences; columns: images 1-3):
      "dog jumping over a hurdle"            0.5   0.1  -1.5
      "man in blue wetsuit surfing"         -1.5   2.0   0.9
      "baseball player throwing the ball"    0.3   0.6   2.1

  13. Given image and sentence vectors, and with S_kl the score of pair (k, l) and the true pairs S_kk on the diagonal, the (structured, max-margin) loss becomes:

      C(S) = Σ_k [ Σ_{l≠k} max(0, S_kl − S_kk + 1)     (rank images, rows)
                 + Σ_{l≠k} max(0, S_lk − S_kk + 1) ]   (rank sentences, columns)
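
A numpy sketch of this loss applied to the score matrix from the previous slide (margin fixed at 1; the diagonal is assumed to hold the true pairs):

```python
import numpy as np

# S[k, l]: score between sentence k and image l; true pairs on the diagonal.
S = np.array([[ 0.5, 0.1, -1.5],
              [-1.5, 2.0,  0.9],
              [ 0.3, 0.6,  2.1]])

def ranking_loss(S, margin=1.0):
    diag = np.diag(S)
    # Rank images (rows): for each sentence, the true image should win.
    cost_rows = np.maximum(0, S - diag[:, None] + margin)
    # Rank sentences (columns): for each image, the true sentence should win.
    cost_cols = np.maximum(0, S - diag[None, :] + margin)
    np.fill_diagonal(cost_rows, 0)   # true pairs incur no cost
    np.fill_diagonal(cost_cols, 0)
    return cost_rows.sum() + cost_cols.sum()

print(ranking_loss(S))   # minimizing this pushes true pairs above the rest
```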

  14. Image Sentence Datasets (5 sentences per image):
      [1] Pascal 1K: 1,000 images (Rashtchian et al., 2010)
      [2] Flickr8K: 8,000 images (Hodosh et al., 2013)
      [3] Flickr30K: 30,000 images (Young et al., 2014)
      [4] MSCOCO: 115,000 images (Lin et al., 2015)

  15. Example results: sentence retrieval, showing the top 4 matching sentences (out of 5,000).

  16. Image-sentence fragment ranking. Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei, CVPR 2015.

  17. Limitations of Ranking
      1. Limiting: we cannot describe images with sentences that are not in the training data.
      2. Inefficient: we have to loop over and score every sentence when annotating an image.
      3. Unsatisfying: especially compared to humans, who can easily generate descriptions.

  18. Outline
      1. Matching images with sentences [1]: image + "Dog jumping over a hurdle." → matching score
      2. Generating captions for images [2]: image → "Dog jumping over a hurdle."
      3. Generating multiple localized captions [3]: image → "Dog jumping over a hurdle.", "A black and white dog in mid-air", "A hand of a person", "A blue and white hurdle"

      [1] Grounded Compositional Semantics for Finding and Describing Images with Sentences. Socher, Karpathy, Le, Manning, Ng, TACL 2013.
      [2] Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei, CVPR 2015.
      [3] DenseCap: Fully Convolutional Localization Networks for Dense Captioning. Johnson*, Karpathy*, Fei-Fei, CVPR 2016.

  19. Generating Captions for Images. Problem statement: given an image, generate a description such as "A dog jumping over a hurdle".

  20. Generating Descriptions: Prior Work. Baby Talk (Kulkarni et al. 2011), example template: "___ (noun) in ___ (noun) is ___ (verb) in ___ (noun)." See also: [Barnard '03] [Duygulu '02] [Yao '10] [Yang '11] [Barbu '12] [Gupta & Mannem '12] [Mitchell '12] [Elliott & Keller '13] [Frome '13] [Kiros '14] [Yatskar '14].

  21-23. Core challenge: how can we predict sentences? For image classification, a Convolutional Network is a single differentiable function from the image to a label ("dog"). What differentiable function maps an image to a sentence such as "A dog jumping over a hurdle"?

  24. Core challenge: the desired output ("a", "dog", "jumping", "over", "a", "hurdle", "<end>") has a variable number of words, so the output is not fixed-size!

  25. Language Model: a model of word sequences, P(word | previous words).

  26. Recurrent Neural Network Language Model: feed in words one at a time (e.g. with a "one-hot" encoding) and predict the distribution over the next word. A sketch follows.
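
A bare-bones numpy sketch of one step of such a model (sizes and weights are illustrative and untrained):

```python
import numpy as np

V, H = 10, 16                        # vocab size, hidden size (illustrative)
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (H, V))    # input-to-hidden
Whh = rng.normal(0, 0.01, (H, H))    # hidden-to-hidden (the recurrence)
Why = rng.normal(0, 0.01, (V, H))    # hidden-to-output

def step(word_index, h):
    x = np.zeros(V); x[word_index] = 1.0             # "one-hot" encoding
    h = np.tanh(Wxh @ x + Whh @ h)                   # update hidden state
    logits = Why @ h
    p = np.exp(logits - logits.max()); p /= p.sum()  # softmax
    return p, h                                      # p = P(next word | so far)

h = np.zeros(H)
for w in [3, 1, 4]:          # feed in words one at a time
    p_next, h = step(w, h)
print(p_next.round(2))       # distribution over the next word
```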

  27. Image classification recap: Convolutional Network → "dog".

  28. Image Captioning: a Convolutional Network encodes the image and an RNN with hidden states h1...h7 emits "a dog jumping over a hurdle <end>". Q: how do we condition the generative process on the image information?

  29. Image Captioning: the CNN image code is fed into the recurrent network (e.g. at the first step), so the hidden states h1...h7 that emit "a dog jumping over a hurdle <end>" are conditioned on the image.
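
A sketch of greedy decoding with this conditioning, extending the language-model step above; the image-to-hidden projection Wih and the START/END token indices are assumptions of the sketch, not the talk's exact parameterization:

```python
import numpy as np

V, H, D = 10, 16, 32       # vocab, hidden, CNN feature sizes (illustrative)
START, END = 0, 1          # special token indices (assumed)
rng = np.random.default_rng(1)
Wxh = rng.normal(0, 0.01, (H, V))
Whh = rng.normal(0, 0.01, (H, H))
Why = rng.normal(0, 0.01, (V, H))
Wih = rng.normal(0, 0.01, (H, D))   # hypothetical image-to-hidden projection

def caption(image_code, max_len=10):
    h = np.tanh(Wih @ image_code)   # condition the RNN on the image
    word, out = START, []
    for _ in range(max_len):
        x = np.zeros(V); x[word] = 1.0
        h = np.tanh(Wxh @ x + Whh @ h)
        word = int(np.argmax(Why @ h))   # greedy choice of next word
        if word == END:
            break
        out.append(word)
    return out   # word indices; a real model maps these back to words

print(caption(rng.normal(size=D)))
```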

  30. Image Sentence Datasets (as on slide 14): Pascal 1K [1], Flickr8K [2], Flickr30K [3], MSCOCO [4]; 5 sentences per image.

  31. Example result: "a woman in a bikini is jumping over a hurdle."

  32. Limitations: generated captions can be generic, e.g. "A group of people in an office."

  33. Outline (revisited): 1. Matching images with sentences [1]; 2. Generating captions for images [2]; 3. Generating multiple localized captions [3] (references as on slide 18).

  34. Dense Captioning sits in a 2x2 space (whole image vs. image regions → label density; single label vs. sequence → label complexity):
      - Classification (whole image, single label): "Cat"
      - Detection (image regions, single label): "Cat", "Skateboard"
      - Captioning (whole image, sequence): "A cat riding a skateboard"
      - Dense Captioning (image regions, sequence): "Orange spotted cat", "Skateboard with red wheels", "Cat riding a skateboard", "Brown hardwood flooring"

  35. Our approach: end-to-end learning; formulate a single differentiable function from inputs to outputs.

  36. Region-level description data: Visual Genome Dataset (Krishna et al. 2016); 108,077 images, 4,297,502 region descriptions. Example annotations: "red frisbee", "frisbee is mid air", "frisbee is flying", …

  37-39. Dense Captioning Architecture: Convolutional Network → Localization layer → RNN. The localization layer builds on region-proposal work [Ren et al., 2015] [Girshick et al., 2015] [Szegedy et al., 2015].

  40. Localization layer, step 1: propose regions of interest. Predict 300 scored boxes [(x1, y1, x2, y2, score), …] (300 × 5 = 1500 numbers total).
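
Illustrative only: one way to end up with "300 scored boxes" is to score a larger pool of candidates and keep the best; the candidate boxes and random scores below are placeholders, not the network's actual proposal mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.uniform(0, 400, (5000, 2))            # candidate top-left corners
wh = rng.uniform(10, 100, (5000, 2))           # candidate widths/heights
boxes = np.concatenate([xy, xy + wh], axis=1)  # (x1, y1, x2, y2)
scores = rng.normal(size=5000)                 # stand-in objectness scores

top = np.argsort(-scores)[:300]                # keep the 300 best-scored boxes
proposals = np.concatenate([boxes[top], scores[top, None]], axis=1)
print(proposals.shape)                         # (300, 5) -> 1500 numbers
```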

  41. Localization layer, step 2: align the predicted boxes to the true (ground-truth) boxes.

  42. Localization layer, step 3: crop out the aligned regions from the shared feature map, reusing computation [Fast R-CNN, Girshick et al., 2015].
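
DenseCap's crop is a differentiable bilinear interpolation; as a present-day stand-in for that operation, torchvision's roi_align cuts a fixed-size window per box out of the shared feature map (all sizes here are illustrative):

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 32, 32)   # shared CNN feature map, one image
# Aligned boxes in feature-map coordinates: (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0.0,  4.0,  4.0, 12.0, 12.0],
                      [0.0, 10.0,  6.0, 20.0, 28.0]])
crops = roi_align(features, boxes, output_size=(7, 7))
print(crops.shape)   # torch.Size([2, 256, 7, 7]): fixed-size crop per region
```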

  43. Example dense captions (figure): "man wearing a blue shirt sitting on a chair", "black computer monitor", "wall is white", "people are in the background", "red and brown chair", "silver handle on the wall", "black bag on the floor", "computer monitor on a desk", "man with black hair".

  44. Quantitative Results (higher is better for both):
      Method                                    mAP    Throughput (frames/sec)
      Image Captioning with Region proposals    4.26   0.31
      DenseCap                                  5.39   4.17
      Better performance and a 13x speedup! (Baseline: Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei, CVPR 2015.)

  45. (image-only slide)

  46. Find DenseCap on GitHub!
      - Implemented in Torch
      - Code for training
      - Pretrained models
      - Webcam demo code

  47-51. Finding regions given descriptions: search the image for the regions best matching a query such as "head of a giraffe", "white tennis shoes", "hands holding a phone", or "front wheel of a bus".
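
This is the part-1 matching idea applied per region: embed the query description and every proposed region in a shared space and return the highest-scoring regions. A sketch with random stand-in embeddings (a real system would use the trained region and sentence encoders):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
region_vecs = rng.normal(size=(300, D))   # one embedding per proposed region
region_vecs /= np.linalg.norm(region_vecs, axis=1, keepdims=True)

def find_regions(query_vec, k=5):
    # Rank regions by cosine similarity to the encoded query description.
    q = query_vec / np.linalg.norm(query_vec)
    return np.argsort(-(region_vecs @ q))[:k]

query = rng.normal(size=D)   # stands in for an encoded "head of a giraffe"
print(find_regions(query))   # indices of the best-matching regions
```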

  52. Outline (revisited): 1. Matching images with sentences [1]; 2. Generating captions for images [2]; 3. Generating multiple localized captions [3] (references as on slide 18).

  53. Summary 1. Matching images with sentences: score the Convolutional Network image code against Recursive Tensor Neural Network sentence codes for candidates such as "A dog jumping over a hurdle", "man in blue wetsuit surfing", "baseball player throwing the ball", …

  54. Summary 2. Generating captions for images: Convolutional Network → RNN → "a dog jumping over a hurdle".

  55. Summary 3. Generating multiple localized captions for images: Convolutional Network → Localization layer → RNN.

  56. Going Forward…
      - Evaluation innovations
      - Dataset/Task innovations
      - Modeling innovations

  57. Going Forward…
      - Evaluation innovations (or lack thereof)
      - Dataset/Task innovations
      - Modeling innovations

  58. Evaluation: given a test image and its 5 reference sentences, compare them against the candidate generated caption, e.g. "A red car with two people next to it."
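
Automatic metrics such as BLEU, METEOR, and CIDEr reduce this comparison to n-gram overlap with the references. A bare-bones unigram-precision sketch of that idea (not any one of those metrics; the references below are made up):

```python
from collections import Counter

def unigram_precision(candidate, references):
    """Fraction of candidate words matched (with clipping) in any reference."""
    cand = Counter(candidate.lower().split())
    best = Counter()
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        for w in cand:
            best[w] = max(best[w], min(cand[w], ref_counts[w]))
    return sum(best.values()) / max(1, sum(cand.values()))

refs = ["a red car parked on the street",
        "two people standing next to a red car"]
print(unigram_precision("a red car with two people next to it", refs))
```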

  59. Fixing evaluation … :s

  60. Learning the evaluation metric? Possible supervision:
      - sentences for each image should be closer to each other than to sentences from other images?
      - or: directly use human judgements as supervision?
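
The first option is a triplet-style constraint; a sketch of the corresponding hinge loss over (hypothetical) sentence embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Two captions of the same image (anchor, positive) should be closer
    than a caption of a different image (negative)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
a, p = rng.normal(size=16), rng.normal(size=16)  # same-image caption embeddings
n = rng.normal(size=16)                          # other-image caption embedding
print(triplet_loss(a, p, n))
```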

  61. Going Forward…
      - Evaluation innovations
      - Dataset/Task innovations
      - Modeling innovations
