Natural Language Video Description using Deep Recurrent Neural Networks
Thesis Proposal, 23 Nov. 2015
Subhashini Venugopalan, University of Texas at Austin
Problem Statement
Generate descriptions for events depicted in video clips.
Example: "A monkey pulls a dog's tail and is chased by the dog."
Applications
● Image and video retrieval by content.
● Video description service (e.g., "Children are wearing green shirts. They are dancing as they sing the carol.").
● Human-robot interaction.
● Video surveillance.
Outline
Related Work
Related Work - 1: Language & Vision
Language: increasingly focused on grounding meaning in perception.
Vision: exploits linguistic ontologies to "tell a story" from images.
Many early works on image description (Farhadi et al. ECCV'10, Kulkarni et al. CVPR'11, Mitchell et al. EACL'12, Kuznetsova et al. ACL'12 & ACL'13) identify objects and attributes, and combine them with linguistic knowledge to "tell a story", e.g., "There are one cow and one sky. (animal, stand, ground)" [Farhadi et al. ECCV'10] or "The golden cow is by the blue sky." [Kulkarni et al. CVPR'11]
Dramatic increase in interest in the past year (8 papers in CVPR'15), e.g., "A group of young men playing a game of soccer." [Donahue et al. CVPR'15]
Relatively little work on video description: we need videos to capture the semantics of a wider range of actions.
Related Work - 2: Video Description
● Extract object and action descriptors.
● Learn object, action, and scene classifiers.
● Use language to bias visual interpretation.
● Estimate the most likely agents and actions.
● Use a template to generate the sentence.
[Krishnamurthy et al. AAAI'13] [Yu and Siskind ACL'13] [Rohrbach et al. ICCV'13] Others: Guadarrama et al. ICCV'13, Thomason et al. COLING'14
Limitations:
● Narrow domains
● Small grammars
● Template-based sentences
● Several features and classifiers
Which objects/actions/scenes should we build classifiers for?
Can we learn directly from video-sentence pairs, without having to explicitly learn object/action/scene classifiers for our dataset? [Venugopalan et al. NAACL'15]
Recurrent Neural Networks (RNNs) can map a vector to a sequence.
● RNN encoder → RNN decoder: English sentence to French sentence [Sutskever et al. NIPS'14]
● Encode → RNN decoder: image to sentence [Donahue et al. CVPR'15] [Vinyals et al. CVPR'15]
● Encode → RNN decoder: video to sentence [Venugopalan et al. NAACL'15] (this work)
Key insight: generate a feature representation of the video and "decode" it to a sentence.
In this section
● Background: Recurrent Neural Networks
● Two deep methods for video description
○ The first learns from image description (it ignores the temporal sequence of frames in videos).
○ The second is temporally sensitive to the input.
[Background] Recurrent Neural Networks
Successful in translation and speech. RNNs can map an input sequence to an output sequence: at each time step the unit takes the input x_t and the previous hidden state h_{t-1}, and emits an output y_t with probability Pr(y_t | input, y_0 ... y_{t-1}).
Insight: each time step has a layer with the same weights.
Problems:
1. Hard to capture long-term dependencies.
2. Vanishing gradients (gradients shrink as they pass back through many layers).
Solution: the Long Short-Term Memory (LSTM) unit.
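For reference, the recurrence sketched above in standard notation (the weight names are an assumption; the slide gives only the diagram):

  h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
  y_t \sim \mathrm{softmax}(W_{hy} h_t + b_y)

The same W and b are reused at every time step, which is the weight-sharing insight above.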
[Background] LSTM [Hochreiter and Schmidhuber '97] [Graves '13]
The LSTM unit augments a memory cell with input, output, and forget gates plus an input modulation gate; each gate reads the current input x_t and the previous hidden state h_{t-1}.
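Spelled out, the standard gate updates matching the diagram (following Graves '13; the notation is an assumption, as the slide gives only the figure):

  i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)   (input gate)
  f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)   (forget gate)
  o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)   (output gate)
  z_t = \tanh(W_{xz} x_t + W_{hz} h_{t-1} + b_z)    (input modulation)
  c_t = f_t \odot c_{t-1} + i_t \odot z_t           (memory cell)
  h_t = o_t \odot \tanh(c_t)

The additive cell update c_t is what lets gradients flow over long spans, addressing the vanishing-gradient problem from the previous slide.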
[Background] LSTM sequence decoders
All the functions are differentiable, so the full gradient is computed by backpropagating through time, and the weights are updated using stochastic gradient descent. Unrolled over time, each step t maps an input through the LSTM to an output out_t.
Matches state of the art on:
● Speech recognition [Graves & Jaitly ICML'14]
● Machine translation (Eng-Fr) [Sutskever et al. NIPS'14]
● Image description [Donahue et al. CVPR'15] [Vinyals et al. CVPR'15]
LSTM sequence decoders
Two LSTM layers: the second layer adds depth in temporal processing. A softmax over the vocabulary predicts the output at each time step (input → LSTM → LSTM → softmax → out_t, unrolled over time).
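A minimal PyTorch sketch of this decoder stack (a modern stand-in, not the original implementation; all sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class SequenceDecoder(nn.Module):
    """Two stacked LSTM layers followed by a softmax over the vocabulary."""
    def __init__(self, vocab_size, input_dim=500, hidden=500):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, inputs):
        h, _ = self.lstm(inputs)   # (B, T, hidden): the unrolled time steps
        return self.out(h)         # per-step logits; softmax is applied in the loss

decoder = SequenceDecoder(vocab_size=10000)
logits = decoder(torch.randn(2, 8, 500))  # a batch of 2 sequences, 8 time steps
```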
Translating Videos to Natural Language [Venugopalan et al. NAACL'15]
Test time, step 1: input
Sample frames from the video at a rate of 1 in 10, and scale each frame to 227x227.
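A hypothetical sketch of this sampling step with OpenCV (the stride and frame size come from the slide; everything else is illustrative):

```python
import cv2

def sample_frames(path, stride=10, size=(227, 227)):
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:                         # keep every 10th frame
            frames.append(cv2.resize(frame, size))  # scale to 227x227
        i += 1
    cap.release()
    return frames

frames = sample_frames("video.mp4")  # hypothetical input file
```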
[Background] Convolutional Neural Networks (CNNs)
Successful in semantic visual recognition tasks. A layer applies linear filters followed by a nonlinear function; stacking layers learns a hierarchy of features of increasing semantic richness. (Image credit: Maurice Peeman.) Krizhevsky, Sutskever, and Hinton's 2012 ImageNet classification breakthrough.
Test time, step 2: feature extraction
Forward propagate each frame through the CNN. Output: the "fc7" features, a 4096-dimensional CNN "feature vector" (the activations before the classification layer).
Test time, step 3: mean pooling
Take the mean of the CNN features across all sampled frames to obtain a single video feature. arXiv: http://arxiv.org/abs/1505.00487
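Steps 2 and 3 in a short sketch, using torchvision's AlexNet as a stand-in for the original Caffe model (note torchvision preprocesses to 224x224 rather than the 227x227 crop above; the frame batch is a placeholder):

```python
import torch
from torchvision import models

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
# fc7 = everything up to, but not including, the final classification layer
fc7 = torch.nn.Sequential(
    alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:-1])

with torch.no_grad():
    batch = torch.randn(30, 3, 224, 224)  # 30 preprocessed frames (placeholder)
    feats = fc7(batch)                    # (30, 4096) "fc7" activations
    video_feat = feats.mean(dim=0)        # mean pool across frames -> (4096,)
```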
Test time, step 4: generation
Input video → convolutional net → recurrent net → output sentence.
Training
Annotated video data is scarce. Key insight: use supervised pre-training on data-rich auxiliary tasks, then transfer.
Step 1: CNN pre-training
● Based on AlexNet [Krizhevsky et al. NIPS'12].
● Trained on 1.2M+ images from ImageNet ILSVRC-12 [Russakovsky et al.].
● Used to initialize the weights of our network (output: the 4096-dimensional fc7 "feature vector").
Step 2: image-caption training
Train the CNN + LSTM captioning network on image-description data.
Step 3: fine-tuning
1. Switch to the video dataset.
2. Use the mean-pooled video feature.
3. Train with a lower learning rate.
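A hedged sketch of this fine-tuning loop, continuing from the SequenceDecoder above (the optimizer, learning rate, and random placeholder data are assumptions, not the thesis's settings):

```python
import torch
import torch.nn as nn

decoder = SequenceDecoder(vocab_size=10000)  # initialized from image-caption training
video_loader = [(torch.randn(2, 8, 500),     # placeholder feature/caption pairs
                 torch.randint(0, 10000, (2, 8)))]

optimizer = torch.optim.SGD(decoder.parameters(), lr=1e-3)  # lowered learning rate
criterion = nn.CrossEntropyLoss()

for inputs, targets in video_loader:
    optimizer.zero_grad()
    logits = decoder(inputs)                                  # (B, T, vocab)
    loss = criterion(logits.flatten(0, 1), targets.flatten())
    loss.backward()
    optimizer.step()
```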
Experiments: Dataset
Microsoft Research Video Description dataset [Chen & Dolan, ACL'11]
Link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
● 1970 YouTube video snippets
○ 10-30s each
○ typically a single activity
○ no dialogues
○ 1200 training, 100 validation, 670 test
● Annotations
○ descriptions in multiple languages
○ ~40 English descriptions per video
○ descriptions and videos collected on AMT
Augment with image datasets
# Training videos: 1300
● Flickr30k: 30,000 images, 150,000 descriptions
● MSCOCO: 120,000 images, 600,000 descriptions
Sample videos and gold descriptions
Video 1:
● A man appears to be plowing a rice field with a plow being pulled by two oxen.
● A team of water buffalo pull a plow through a rice paddy.
● Domesticated livestock are helping a man plow.
● A man leads a team of oxen down a muddy path.
● Two oxen walk through some mud.
● A man is tilling his land with an ox pulled plow.
● Bulls are pulling an object.
● Two oxen are plowing a field.
● The farmer is tilling the soil.
● A man in ploughing the field.
Video 2:
● A man is walking on a rope.
● A man is walking across a rope.
● A man is balancing on a rope.
● A man is balancing on a rope at the beach.
● A man walks on a tightrope at the beach.
● A man is balancing on a volleyball net.
● A man is walking on a rope held by poles.
● A man balanced on a wire.
● The man is balancing on the wire.
● A man is walking on a rope.
● A man is standing in the sea shore.
Evaluation
● Machine translation metrics
○ BLEU
○ METEOR
● Human evaluation
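For illustration, BLEU can be computed off the shelf with NLTK (this sketch is an assumption about tooling; the actual evaluation scores each sentence against all ~40 references per video, and METEOR uses its own scorer):

```python
from nltk.translate.bleu_score import sentence_bleu

references = [["a", "man", "is", "walking", "on", "a", "rope"],
              ["a", "man", "walks", "on", "a", "tightrope"]]
candidate = ["a", "man", "is", "walking", "on", "a", "tightrope"]

# Geometric mean of 1- to 4-gram precisions against all references
print(sentence_bleu(references, candidate))
```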
Results: generation
MT metrics (BLEU, METEOR) compare the system-generated sentences against (all) ground-truth references. Baseline: [Thomason et al. COLING'14].
Human Evaluation
● Relevance: rank the sentences by how accurately they describe the event depicted in the video. No two sentences can have the same rank.
● Grammar: rate the grammatical correctness of the sentences. Multiple sentences can have the same rating.
Results: Human Evaluation

Model                         Relevance   Grammar
                              2.26        3.99
[Thomason et al. COLING'14]   2.74        3.84
                              2.93        3.64
                              4.65        4.61
More Examples
Translating Videos to Natural Language [Venugopalan et al. NAACL'15]
Limitation: mean pooling does not consider the temporal sequence of frames.
Can our model be sensitive to temporal structure, allowing both the input (sequence of frames) and the output (sequence of words) to be of variable length? [Venugopalan et al. ICCV'15]
Recurrent Neural Networks (RNNs) can map a vector to a sequence.
● RNN encoder → RNN decoder: English sentence to French sentence [Sutskever et al. NIPS'14]
● Encode → RNN decoder: image to sentence [Donahue et al. CVPR'15] [Vinyals et al. CVPR'15]
● Encode → RNN decoder: video to sentence [V. NAACL'15]
● RNN encoder → RNN decoder: video to sentence [Venugopalan et al. ICCV'15] (this work)
S2VT: Sequence to Sequence - Video to Text [Venugopalan et al. ICCV'15]
A stack of two LSTMs first reads the sequence of per-frame CNN features (encoding stage), then decodes it to a sentence, e.g. "A man is talking ..." (decoding stage).
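A condensed PyTorch sketch of the S2VT idea: the bottom LSTM consumes frame features (padded with zeros while decoding), the top LSTM sees its states alongside the words (padded with zeros while encoding), and a softmax emits the sentence. Sizes and names are illustrative, and the paper's exact padding/input scheme is simplified here:

```python
import torch
import torch.nn as nn

class S2VT(nn.Module):
    def __init__(self, vocab_size, feat_dim=4096, hidden=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, words):
        # frames: (B, Tf, feat_dim) CNN features; words: (B, Tw) token ids
        B, Tw = words.shape
        pad_f = torch.zeros(B, Tw, frames.size(2))          # frame pads while decoding
        h1, _ = self.lstm1(torch.cat([frames, pad_f], 1))   # bottom LSTM, both stages
        pad_w = torch.zeros(B, frames.size(1), self.embed.embedding_dim)
        w = torch.cat([pad_w, self.embed(words)], 1)        # word pads while encoding
        h2, _ = self.lstm2(torch.cat([h1, w], 2))           # top LSTM sees h1 + words
        return self.out(h2[:, frames.size(1):])             # logits for decoding steps

model = S2VT(vocab_size=10000)
logits = model(torch.randn(2, 20, 4096), torch.randint(0, 10000, (2, 8)))
```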
Frame representation (RGB frames):
1. Train the CNN on the 1000 ImageNet categories [Krizhevsky et al. NIPS'12].
2. Forward propagate each frame and take the activations from the layer before classification, the 4096-dimensional fc7 "feature vector".