Model outline
1. How do we process images?
2. How do we process sentences?
3. How do we compare an image and a sentence?
Example caption: “A dog jumping over a hurdle”
End-to-end learning
All three questions are answered by a single differentiable function:
1. How do we process images?
2. How do we process sentences?
3. How do we compare an image and a sentence?
“A dog jumping over a hurdle”
Model outline
1. How do we process images? → Convolutional Network
2. How do we process sentences?
3. How do we compare an image and a sentence?
“A dog jumping over a hurdle”
Processing Sentences
“A man wearing a helmet jumps on his bike near a beach.”
Q: How can we encode the sentence into a fixed-size vector?
Idea 1: Bag of Words
“A man wearing a helmet jumps on his bike near a beach.”
Represent the sentence by the counts of its words, discarding word order.
Idea 2: Bag of n-grams
e.g. n = 2: “A man wearing a helmet jumps on his bike near a beach.”
Count contiguous word pairs (“a man”, “man wearing”, …) to retain some local order.
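A minimal sketch of both encodings in plain Python (the tokenization here is deliberately naive):

```python
from collections import Counter

def bag_of_ngrams(sentence, n=1):
    """Counts of contiguous n-grams; n=1 gives a plain bag of words."""
    tokens = sentence.lower().rstrip(".").split()
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams)

sent = "A man wearing a helmet jumps on his bike near a beach."
print(bag_of_ngrams(sent, n=1))  # word counts: order is discarded
print(bag_of_ngrams(sent, n=2))  # bigram counts: some local order retained
```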
Our approach
1. Extract a dependency tree [1] (root “jumps”, with arcs such as nsubj → “man”, prep → “on”/“near”, pobj → “bike”/“beach”, partmod → “wearing”, dobj → “helmet”, …).
2. Compute a representation recursively over the tree with a Recursive Tensor Neural Network, applying the recursive formula at each node.
[1] de Marneffe and Manning, 2008
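A toy sketch of the recursive idea: each node's vector is computed from its own word vector and its children's vectors, so the root yields a fixed-size sentence vector. The particular composition below (shared matrices W and Wv with a tanh) is an illustrative simplification, not the exact tensor formulation from the paper.

```python
import numpy as np

d = 4                     # toy embedding dimension
rng = np.random.default_rng(0)
W  = rng.standard_normal((d, d)) * 0.1   # maps a child vector into the parent
Wv = rng.standard_normal((d, d)) * 0.1   # maps the node's own word vector

def compose(word_vec, child_vecs):
    """Parent vector = tanh(Wv x_word + sum_c W h_c)."""
    h = Wv @ word_vec + sum((W @ c for c in child_vecs), np.zeros(d))
    return np.tanh(h)

# Tiny tree: "man jumps" with "jumps" as root and "man" as its nsubj child.
x_man, x_jumps = rng.standard_normal(d), rng.standard_normal(d)
h_man  = compose(x_man, [])           # leaf node
h_root = compose(x_jumps, [h_man])    # root: the fixed-size sentence vector
```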
Model outline
1. How do we process images? → Convolutional Network
2. How do we process sentences? → Recursive Tensor Neural Network
3. How do we compare an image and a sentence?
“A dog jumping over a hurdle”
Matching Image and Sentence
image → Convolutional Network → vector
“A dog jumping over a hurdle” → Recursive Tensor Neural Network → vector
score = product of the two vectors
Intuition: individual dimensions act as detectors.
Convolutional Network: are there fur textures?
Recursive Tensor Neural Network: is a “dog” mentioned in “A dog jumping over a hurdle”?
Score matrix (rows: sentences, columns: images):

                                        img 1   img 2   img 3
“dog jumping over a hurdle”               0.5     0.1    -1.5
“man in blue wetsuit surfing”            -1.5     2.0     0.9
“baseball player throwing the ball”       0.3     0.6     2.1
Given image and sentence vectors, collect the pairwise scores into a matrix S (as above), with S_kk the score of the k-th correct image-sentence pair. The (structured, max-margin) loss becomes:

C(θ) = Σ_k [ Σ_l max(0, S_kl − S_kk + 1) + Σ_l max(0, S_lk − S_kk + 1) ]

The first term ranks images (moving along a row, the correct image should win by a margin) and the second ranks sentences (moving down a column, the correct sentence should win).
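A minimal numpy sketch of this loss, evaluated on the slide's score matrix (the margin of 1 follows the formula above):

```python
import numpy as np

def ranking_loss(S, margin=1.0):
    """Max-margin ranking loss; S[k, l] = score(sentence k, image l),
    with correct pairs on the diagonal."""
    diag = np.diag(S)
    rank_images = np.maximum(0, margin + S - diag[:, None])     # along rows
    rank_sentences = np.maximum(0, margin + S - diag[None, :])  # down columns
    np.fill_diagonal(rank_images, 0)       # don't penalize the correct pairs
    np.fill_diagonal(rank_sentences, 0)
    return rank_images.sum() + rank_sentences.sum()

# The score matrix from the slide (sentences x images).
S = np.array([[ 0.5, 0.1, -1.5],
              [-1.5, 2.0,  0.9],
              [ 0.3, 0.6,  2.1]])
print(ranking_loss(S))
```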
Image Sentence Datasets (5 sentences per image):
[1] Pascal 1K: 1,000 images
[2] Flickr8K: 8,000 images
[3] Flickr30K: 30,000 images
[4] MSCOCO: 115,000 images
[1] Rashtchian et al., 2010  [2] Hodosh et al., 2013  [3] Young et al., 2014  [4] Lin et al., 2015
Example results: sentence retrieval
Showing the top 4 matching sentences (out of 5,000).
Image Sentence Fragment ranking
Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei, CVPR 2015.
Limitations of Ranking
1. Limiting. We cannot describe images with sentences that are not in the training data.
2. Inefficient. We have to loop over and score all sentences one by one when annotating an image.
3. Unsatisfying. Especially compared to humans, who can easily generate descriptions.
Outline
1. Matching images with sentences [1]: image + “Dog jumping over a hurdle.” → matching score
2. Generating captions for images [2]: image → “Dog jumping over a hurdle.”
3. Generating multiple localized captions [3]: image → “A black and white dog in mid-air”, “A hand of a person”, “A blue and white hurdle”
[1] Grounded Compositional Semantics for Finding and Describing Images with Sentences. Socher, Karpathy, Le, Manning, Ng, TACL 2013.
[2] Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei, CVPR 2015.
[3] DenseCap: Fully Convolutional Localization Networks for Dense Captioning. Johnson*, Karpathy*, Fei-Fei, CVPR 2016.
Generating Captions for Images: Problem Statement
image → “A dog jumping over a hurdle”
Generating Descriptions: Prior work
Baby Talk (Kulkarni et al., 2011), example template: “(noun) in (noun) is (verb) in (noun).”
[Duygulu ’02] [Barnard ’03] [Yao ’10] [Yang ’11] [Mitchell ’12] [Barbu ’12] [Gupta & Mannem ’12] [Elliott & Keller ’13] [Frome ’13] [Kiros ’14] [Yatskar ’14]
Core Challenge: how can we predict sentences?
image → ??? (differentiable function) → “A dog jumping over a hurdle”
Core Challenge: how can we predict sentences?
image → Convolutional Network → ??? (differentiable function) → “A dog jumping over a hurdle”
Core Challenge: how can we predict sentences?
In image classification: image → Convolutional Network (differentiable function) → “dog”
Core Challenge
image → Convolutional Network → ??? → “a”, “dog”, “jumping”, “over”, “a”, “hurdle”, <end>
Sentences have a variable number of words => the output is not fixed size!
Language Model
A model over sequences of words: P(word | previous words)
Recurrent Neural Network Language Model
Feed in words one at a time (e.g. with a “one-hot” encoding); the RNN predicts the distribution over the next word.
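A toy numpy sketch of one step of such a language model; all sizes and parameter names are illustrative assumptions:

```python
import numpy as np

V, H = 10, 8                              # toy vocabulary and hidden sizes
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, V)) * 0.1   # one-hot word -> hidden
Whh = rng.standard_normal((H, H)) * 0.1   # previous hidden -> hidden
Why = rng.standard_normal((V, H)) * 0.1   # hidden -> next-word scores

def rnn_step(word_idx, h_prev):
    x = np.zeros(V); x[word_idx] = 1.0          # "one-hot" encoding of the word
    h = np.tanh(Wxh @ x + Whh @ h_prev)         # new hidden state
    scores = Why @ h
    p_next = np.exp(scores) / np.exp(scores).sum()  # P(next word | previous words)
    return p_next, h

h = np.zeros(H)
for w in [0, 3, 7]:                       # feed in words one at a time
    p_next, h = rnn_step(w, h)
```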
Image Classification
image → Convolutional Network → “dog”
Image Captioning
image → Convolutional Network → RNN hidden states h1…h7 → “a”, “dog”, “jumping”, “over”, “a”, “hurdle”, <end>
Q: how do we condition the generative process on the image information?
Image Captioning
image → Convolutional Network → RNN hidden states h1…h7 → “a”, “dog”, “jumping”, “over”, “a”, “hurdle”, <end>
A: feed the CNN image features into the RNN.
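A self-contained toy sketch of this conditioning: inject the CNN features at the first step (as in Karpathy & Fei-Fei, CVPR 2015, which conditions on the image at the first time step). The sizes, the projection Wih, and the START/<end> token indices are all illustrative assumptions.

```python
import numpy as np

V, H, D = 10, 8, 6                        # toy vocab, hidden, CNN-feature sizes
rng = np.random.default_rng(1)
Wxh = rng.standard_normal((H, V)) * 0.1   # one-hot word -> hidden
Whh = rng.standard_normal((H, H)) * 0.1   # previous hidden -> hidden
Why = rng.standard_normal((V, H)) * 0.1   # hidden -> next-word scores
Wih = rng.standard_normal((H, D)) * 0.1   # CNN image features -> hidden

def rnn_step(word_idx, h_prev):
    x = np.zeros(V); x[word_idx] = 1.0
    h = np.tanh(Wxh @ x + Whh @ h_prev)
    p = np.exp(Why @ h)
    return p / p.sum(), h

cnn_features = rng.standard_normal(D)     # stand-in for CNN(image)
h = np.tanh(Wih @ cnn_features)           # image-conditioned initial state
word, caption = 0, []                     # 0 = hypothetical START token
for _ in range(20):                       # generate until <end> or a length cap
    p_next, h = rnn_step(word, h)
    word = int(np.argmax(p_next))         # greedy decoding, for the sketch
    if word == V - 1:                     # V-1 = hypothetical <end> token
        break
    caption.append(word)
```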
Image Sentence Datasets: the same datasets as before (Pascal 1K, Flickr8K, Flickr30K, MSCOCO; 5 sentences per image).
Example result: “a woman in a bikini is jumping over a hurdle.”
Limitations
Example: “A group of people in an office.”
Outline (recap)
Next: 3. Generating multiple localized captions [3]
Dense Captioning
The task sits in a 2×2 grid of label density (whole image → image regions) and label complexity (single label → sequence):
- Classification (whole image, single label): “Cat”
- Detection (image regions, single label): “Cat”, “Skateboard”
- Captioning (whole image, sequence): “A cat riding a skateboard”
- Dense Captioning (image regions, sequence): “Orange spotted cat”, “Skateboard with red wheels”, “Cat riding a skateboard”, “Brown hardwood flooring”
Our approach
End-to-end learning: formulate a single differentiable function from inputs to outputs.
Region-level description data
Example annotations: “red frisbee”, “frisbee is mid air”, “frisbee is flying”, …
# Images: 108,077
# Region descriptions: 4,297,502
Visual Genome Dataset, Krishna et al., 2016
Dense Captioning Architecture
image → Convolutional Network → … → RNN
Dense Captioning Architecture
image → Convolutional Network → Localization layer
[Ren et al., 2015] [Girshick et al., 2015] [Szegedy et al., 2015]
Dense Captioning Architecture: Localization layer
1. Propose regions of interest: predict 300 scored boxes [(x1, y1, x2, y2, score), …] (300 × 5 = 1500 numbers total)
Dense Captioning Architecture: Localization layer
2. Align the predictions to the true boxes.
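A sketch of one plausible alignment rule: match each predicted box to its best-overlapping ground-truth box by intersection-over-union. The greedy threshold here is an assumption for illustration, not necessarily the paper's exact rule.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def align(pred_boxes, true_boxes, thresh=0.5):
    """For each prediction, the index of the best-overlapping true box (or -1)."""
    matches = []
    for p in pred_boxes:
        ious = [iou(p, t) for t in true_boxes]
        best = int(np.argmax(ious))
        matches.append(best if ious[best] >= thresh else -1)
    return matches

print(align([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 11, 11)]))  # [0, -1]
```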
Dense Captioning Architecture: Localization layer
3. Crop out the aligned regions from the convolutional features (reuse computation!)
[Fast R-CNN, Girshick et al., 2015]
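A sketch of the "reuse computation" idea: the convolutional feature map is computed once per image, and each region is cropped out of it to a fixed-size grid. This toy version uses nearest-neighbor sampling; DenseCap itself uses a differentiable bilinear crop.

```python
import numpy as np

def roi_crop(features, box, out_size=7):
    """features: (H, W, C) conv feature map; box: (x1, y1, x2, y2) in feature coords."""
    H, W, _ = features.shape
    x1, y1, x2, y2 = box
    xs = np.clip(np.linspace(x1, x2, out_size).round().astype(int), 0, W - 1)
    ys = np.clip(np.linspace(y1, y2, out_size).round().astype(int), 0, H - 1)
    return features[np.ix_(ys, xs)]          # (out_size, out_size, C)

feats = np.random.rand(32, 32, 256)          # computed once for the whole image
region = roi_crop(feats, (4, 6, 20, 28))     # fixed-size features for one region
```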
Example predicted region captions:
- man wearing a blue shirt sitting on a chair
- black computer monitor
- wall is white
- people are in the background
- red and brown chair
- silver handle on the wall
- black bag on the floor
- computer monitor on a desk
- man with black hair
Quantitative Results
Dense captioning mAP (higher = better):
- Image Captioning with Region proposals [3]: 4.26
- DenseCap: 5.39
Throughput in frames per second (higher = better):
- Image Captioning with Region proposals [3]: 0.31
- DenseCap: 4.17
Better performance and a 13× speedup!
[3] Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei, CVPR 2015.
Find DenseCap on GitHub!
- Implemented in Torch
- Code for training
- Pretrained models
- Webcam demo code
Finding regions given descriptions
Query: “head of a giraffe” → search over the predicted regions
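A sketch of how such a search could work: embed the query and every predicted region in a shared space and rank regions by similarity. The embeddings below are random stand-ins for the model's learned encoders.

```python
import numpy as np

def find_regions(query_vec, region_vecs, region_boxes, top_k=3):
    """Rank regions by inner-product similarity to the query embedding."""
    scores = region_vecs @ query_vec
    order = np.argsort(-scores)[:top_k]
    return [(region_boxes[i], float(scores[i])) for i in order]

rng = np.random.default_rng(0)
region_vecs = rng.standard_normal((5, 16))    # 5 regions, 16-d embeddings
region_boxes = [(i * 10, i * 10, i * 10 + 30, i * 10 + 30) for i in range(5)]
query_vec = rng.standard_normal(16)           # stand-in for encode("head of a giraffe")
print(find_regions(query_vec, region_vecs, region_boxes))
```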
Finding regions given descriptions: more example queries
“head of a giraffe”, “white tennis shoes”, “hands holding a phone”, “front wheel of a bus”
Summary
1. Matching images with sentences
image → Convolutional Network; “A dog jumping over a hurdle” → Recursive Tensor Neural Network; compare the two vectors → score
Candidate sentences: “man in blue wetsuit surfing”, “baseball player throwing the ball”, …
Summary
2. Generating captions for images
image → Convolutional Network → RNN → “a dog jumping over a hurdle”
Summary
3. Generating multiple localized captions for images
image → Convolutional Network → Localization layer → …
Going Forward…
- Evaluation innovations (or lack thereof)
- Dataset/Task innovations
- Modeling innovations
Evaluation
Each test image comes with 5 reference sentences; compare them against the candidate generated caption: “A red car with two people next to it.”
Fixing evaluation … :s
Learning the evaluation metric?
Supervision:
- sentences for the same image should be closer to each other than to sentences from other images?
- or: directly use human judgements as supervision?
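A sketch of the first supervision idea as a triplet objective: push embeddings of same-image sentences together and cross-image sentences apart. The encoder stub and margin are illustrative assumptions, not a method from the talk's papers.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: the same-image pair should be closer
    than the cross-image pair by at least the margin."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, margin + d_pos - d_neg)

rng = np.random.default_rng(0)
emb = lambda s: rng.standard_normal(16)   # stand-in encoder (ignores its input here)
loss = triplet_loss(emb("a dog jumps a hurdle"),
                    emb("dog leaping over a bar"),   # sentence from the same image
                    emb("a man surfing a wave"))     # sentence from another image
```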