Natural Language Video Description using Deep Recurrent Neural Networks - PowerPoint PPT Presentation


  1. Natural Language Video Description using Deep Recurrent Neural Networks. Thesis Proposal, 23 Nov. 2015. Subhashini Venugopalan, University of Texas at Austin.

  2. Problem Statement: Generate descriptions for events depicted in video clips, e.g., "A monkey pulls a dog’s tail and is chased by the dog."

  3. Applications: image and video retrieval by content; video description service (e.g., "Children are wearing green shirts. They are dancing as they sing the carol."); human-robot interaction; video surveillance.

  4. Outline

  5. Related Work

  6. Related Work - 1: Language & Vision. Language: increasingly focused on grounding meaning in perception. Vision: exploit linguistic ontologies to “tell a story” from images [Farhadi et al. ECCV’10; Kulkarni et al. CVPR’11]. Many early works on image description (Farhadi et al. ECCV’10, Kulkarni et al. CVPR’11, Mitchell et al. EACL’12, Kuznetsova et al. ACL’12 & ACL’13) identify objects and attributes and combine them with linguistic knowledge to “tell a story”, e.g., "There are one cow and one sky." (animal, stand, ground) or "The golden cow is by the blue sky." Interest has increased dramatically in the past year (8 papers in CVPR’15), e.g., [Donahue et al. CVPR’15]: "A group of young men playing a game of soccer." There is relatively little work on video description, which needs videos to capture the semantics of a wider range of actions.

  7. Related Work - 2: Video Description. Typical pipeline: extract object and action descriptors; learn object, action, and scene classifiers; use language to bias the visual interpretation; estimate the most likely agents and actions [Krishnamurthy et al. AAAI’13]; fill a template to generate the sentence. Others: Guadarrama et al. ICCV’13, Thomason et al. COLING’14. Limitations: narrow domains; small grammars [Yu and Siskind, ACL’13]; template-based sentences; several separate features and classifiers [Rohrbach et al. ICCV’13]. And which objects/actions/scenes should we build classifiers for?

  8. Can we learn directly from video-sentence pairs, without having to explicitly learn object/action/scene classifiers for our dataset? [Venugopalan et al. NAACL’15]

  9. Recurrent Neural Networks (RNNs) can map a vector to a sequence.
     ● [Sutskever et al. NIPS’14]: RNN encoder -> RNN decoder (English sentence -> French sentence).
     ● [Donahue et al. CVPR’15], [Vinyals et al. CVPR’15]: encode image -> RNN decoder -> sentence.
     ● [Venugopalan et al. NAACL’15] (this work): encode video -> RNN decoder -> sentence.
     Key insight: generate a feature representation of the video and “decode” it to a sentence.

  10. In this section: ● Background - Recurrent Neural Networks. ● Two deep methods for video description: the first learns from image description (it ignores the temporal frame sequence in videos); the second is temporally sensitive to the input.

  11. [Background] Recurrent Neural Networks. Successful in translation and speech. RNNs can map an input sequence to an output sequence: at each time step t the unit takes the input x_t and the previous hidden state h_{t-1}, produces a new hidden state h_t, and outputs Pr(y_t | input, y_0 ... y_{t-1}). Insight: each time step uses a layer with the same weights. Problems: (1) hard to capture long-term dependencies; (2) vanishing gradients (they shrink as they pass through many layers). Solution: the Long Short-Term Memory (LSTM) unit.
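
  A minimal NumPy sketch of one such time step; the weight names (W_xh, W_hh, W_hy) and the softmax output layer are illustrative assumptions, not notation from the thesis:

      import numpy as np

      def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
          """One vanilla RNN time step; the same weights are reused at every step."""
          h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # new hidden state
          scores = W_hy @ h_t + b_y                         # unnormalized output scores
          e = np.exp(scores - scores.max())
          y_t = e / e.sum()                                 # softmax: Pr(y_t | input, y_0..y_{t-1})
          return h_t, y_t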

  12. [Background] LSTM [Hochreiter and Schmidhuber ’97; Graves ’13]. The LSTM unit combines the current input x_t and the previous hidden state h_{t-1} through an input gate, a forget gate, an output gate, and an input modulation gate, which together update a memory cell and produce the new hidden state h_t (= z_t).
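
  Written out, the gates on this slide correspond to the standard LSTM update (the textbook formulation following Graves ’13, not equations copied from the slide):

      i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    % input gate
      f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    % forget gate
      o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    % output gate
      g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)     % input modulation gate
      c_t = f_t \odot c_{t-1} + i_t \odot g_t            % memory cell
      h_t = o_t \odot \tanh(c_t)                         % hidden state (z_t on the slide)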

  13. [Background] LSTM sequence decoders. All functions are differentiable, so the full gradient is computed by backpropagating through time, and the weights are updated with stochastic gradient descent. At each time step t = 0, 1, 2, ... the LSTM consumes an input and emits an output. Matches state of the art on: speech recognition [Graves & Jaitly ICML’14], machine translation (Eng-Fr) [Sutskever et al. NIPS’14], and image description [Donahue et al. CVPR’15; Vinyals et al. CVPR’15].
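
  A minimal PyTorch sketch of one such training step, where backpropagation through time falls out of automatic differentiation; all sizes and hyperparameters are illustrative, not the thesis's:

      import torch
      import torch.nn as nn

      vocab_size, input_dim, hidden = 1000, 500, 500     # illustrative sizes
      decoder = nn.LSTM(input_size=input_dim, hidden_size=hidden, batch_first=True)
      to_vocab = nn.Linear(hidden, vocab_size)
      criterion = nn.CrossEntropyLoss()
      params = list(decoder.parameters()) + list(to_vocab.parameters())
      optimizer = torch.optim.SGD(params, lr=0.01)

      inputs = torch.randn(8, 10, input_dim)             # batch of 8 sequences, 10 time steps
      targets = torch.randint(0, vocab_size, (8, 10))    # target word id at each step

      hidden_states, _ = decoder(inputs)                 # unroll the LSTM over time
      logits = to_vocab(hidden_states)                   # word scores at every step
      loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
      loss.backward()                                    # full gradient via backprop through time
      optimizer.step()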

  14. LSTM sequence decoders. Two LSTM layers: the second layer adds depth in temporal processing. A softmax over the vocabulary predicts the output word at each time step.
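
  A sketch of this stacked decoder in PyTorch (sizes are illustrative); at each step the top layer's hidden state is mapped to a softmax over the vocabulary:

      import torch
      import torch.nn as nn

      vocab_size, emb_dim, hidden = 1000, 500, 1000
      lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden,
                     num_layers=2, batch_first=True)       # 2nd layer of temporal depth
      to_vocab = nn.Linear(hidden, vocab_size)

      step_input, state = torch.randn(1, 1, emb_dim), None
      out, state = lstm(step_input, state)                 # both layers advance one time step
      probs = torch.softmax(to_vocab(out[:, -1]), dim=-1)  # softmax over the vocabulary
      next_word = probs.argmax(dim=-1)                     # greedy prediction for this step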

  15. Translating Videos to Natural Language [Venugopalan et al. NAACL’15].

  16. Test time, Step 1: Input. Sample frames from the video at a rate of 1 in 10 and scale each frame to 227x227 for the CNN.
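
  A sketch of this sampling step using OpenCV (the function name and defaults are illustrative assumptions):

      import cv2

      def sample_frames(video_path, every_nth=10, size=(227, 227)):
          """Keep every 10th frame and rescale it to the CNN's 227x227 input."""
          cap = cv2.VideoCapture(video_path)
          frames, idx = [], 0
          while True:
              ok, frame = cap.read()
              if not ok:
                  break
              if idx % every_nth == 0:
                  frames.append(cv2.resize(frame, size))
              idx += 1
          cap.release()
          return frames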

  17. [Background] Convolutional Neural Networks (CNNs). Successful in semantic visual recognition tasks. Each layer applies linear filters followed by a non-linear function; stacking layers learns a hierarchy of features of increasing semantic richness. (Image credit: Maurice Peeman.) Krizhevsky, Sutskever & Hinton 2012: the ImageNet classification breakthrough.

  18. Test time, Step 2: Feature extraction. Forward propagate each frame through the CNN and take the "fc7" activations (the layer before classification) as a 4096-dimensional CNN "feature vector".
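
  A sketch of fc7 extraction using torchvision's AlexNet as a stand-in for the Caffe model used in the thesis; the preprocessing constants are the usual ImageNet values, an assumption here:

      import torch
      import torchvision.models as models
      import torchvision.transforms as T

      alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()
      # Drop the final classification layer (fc8) so the network outputs fc7 activations.
      alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

      preprocess = T.Compose([T.ToTensor(),
                              T.Normalize(mean=[0.485, 0.456, 0.406],
                                          std=[0.229, 0.224, 0.225])])

      def fc7_features(frame_rgb):
          """Forward propagate one 227x227 RGB frame; return its 4096-d fc7 vector."""
          x = preprocess(frame_rgb).unsqueeze(0)
          with torch.no_grad():
              return alexnet(x).squeeze(0)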

  19. Test time, Step 3: Mean pooling. Average the per-frame CNN features across all frames of the video. (arXiv: http://arxiv.org/abs/1505.00487)
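
  Continuing the sketches above (sample_frames and fc7_features are the illustrative helpers from the previous slides), mean pooling is a single average:

      import torch

      frames = sample_frames("video.avi")                           # illustrative path
      # (cv2 returns BGR; conversion to RGB is omitted here for brevity)
      frame_feats = torch.stack([fc7_features(f) for f in frames])  # num_frames x 4096
      video_feat = frame_feats.mean(dim=0)                          # one 4096-d video descriptor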

  20. Test time, Step 4: Generation. Input video -> convolutional net -> recurrent net -> output sentence.
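
  A simplified greedy-decoding sketch of this step. It glosses over details such as the special token ids and the exact way the model conditions on the video feature; in_proj, lstm, embed, and to_vocab are assumed to be trained modules like those sketched earlier:

      import torch

      EOS = 1                                               # illustrative end-of-sentence token id

      def generate(video_feat, in_proj, lstm, embed, to_vocab, max_len=20):
          """Condition the LSTM on the mean-pooled video feature, then greedily
          emit one word per step until the end-of-sentence token."""
          words, state = [], None
          step_input = in_proj(video_feat).view(1, 1, -1)   # video feature goes in first
          for _ in range(max_len):
              out, state = lstm(step_input, state)
              word = to_vocab(out[:, -1]).argmax(dim=-1)
              if word.item() == EOS:
                  break
              words.append(word.item())
              step_input = embed(word).view(1, 1, -1)       # last word becomes the next input
          return words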

  21. Training. Annotated video data is scarce. Key insight: use supervised pre-training on data-rich auxiliary tasks and transfer the learned weights.

  22. Step 1: CNN pre-training. ● Based on AlexNet [Krizhevsky et al. NIPS’12]. ● Trained on 1.2M+ images from ImageNet ILSVRC-12 [Russakovsky et al.]. ● Provides the fc7 4096-dimensional CNN "feature vector" and initializes the weights of our network.

  23. Step 2: Image-caption training.

  24. Step 3: Fine-tuning. 1. Video dataset. 2. Mean-pooled feature. 3. Lower learning rate.
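
  A sketch of the fine-tuning setup, reusing the decoder modules from the earlier sketches; the specific learning-rate values are illustrative, not taken from the thesis:

      import torch

      # Continue training the pre-trained decoder on mean-pooled video features,
      # but with a reduced learning rate so the transferred weights are not destroyed.
      params = (list(lstm.parameters()) + list(to_vocab.parameters())
                + list(embed.parameters()))
      optimizer = torch.optim.SGD(params, lr=0.001)   # vs. e.g. 0.01 during pre-training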

  25. Experiments: Dataset. Microsoft Research Video Description dataset [Chen & Dolan, ACL’11]. Link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/ ● 1970 YouTube video snippets: 10-30s each, typically a single activity, no dialogues; 1200 training, 100 validation, 670 test. ● Annotations: descriptions in multiple languages, ~40 English descriptions per video; descriptions and videos collected on AMT.

  26. Augment with image datasets. # training videos: 1300. Flickr30k: 30,000 images, 150,000 descriptions. MSCOCO: 120,000 images, 600,000 descriptions.

  27. Sample videos and gold descriptions.
     Video 1 (plowing): ● A man appears to be plowing a rice field with a plow being pulled by two oxen. ● A team of water buffalo pull a plow through a rice paddy. ● Domesticated livestock are helping a man plow. ● A man leads a team of oxen down a muddy path. ● Two oxen walk through some mud. ● A man is tilling his land with an ox pulled plow. ● Bulls are pulling an object. ● Two oxen are plowing a field. ● The farmer is tilling the soil. ● A man in ploughing the field.
     Video 2 (tightrope): ● A man is walking on a rope. ● A man is walking across a rope. ● A man is balancing on a rope. ● A man is balancing on a rope at the beach. ● A man walks on a tightrope at the beach. ● A man is balancing on a volleyball net. ● A man is walking on a rope held by poles. ● A man balanced on a wire. ● The man is balancing on the wire. ● A man is walking on a rope. ● A man is standing in the sea shore.

  28. Evaluation. ● Machine translation metrics: BLEU, METEOR. ● Human evaluation.

  29. Results - Generation. MT metrics (BLEU, METEOR) compare the system-generated sentences against (all) ground-truth references; compared systems include the prior work of [Thomason et al. COLING’14].
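
  A sketch of the multi-reference BLEU computation using NLTK; the example sentences are taken from the gold descriptions above, and METEOR is usually computed with its own official scorer:

      from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

      references = [r.split() for r in [
          "a man is walking on a rope",
          "a man is balancing on a rope at the beach",
          "a man walks on a tightrope at the beach",
      ]]
      candidate = "a man is walking on a rope".split()

      # Score the candidate against all ground-truth references at once.
      bleu = sentence_bleu(references, candidate,
                           smoothing_function=SmoothingFunction().method1)
      print(round(bleu, 3))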

  30. Human Evaluation. Relevance: rank the sentences by how accurately they describe the event depicted in the video; no two sentences can have the same rank. Grammar: rate the grammatical correctness of the sentences; multiple sentences can have the same rating.

  31. Results - Human Evaluation.
     Model | Relevance | Grammar
     [Thomason et al. COLING’14] | 2.26 | 3.99
     | 2.74 | 3.84
     | 2.93 | 3.64
     | 4.65 | 4.61

  32. More Examples.

  33. Translating Videos to Natural Language [Venugopalan et al. NAACL’15]: the mean-pooled CNN representation does not consider the temporal sequence of frames.

  34. Can our model be made sensitive to temporal structure, allowing both the input (sequence of frames) and the output (sequence of words) to be of variable length? [Venugopalan et al. ICCV’15]

  35. Recurrent Neural Networks (RNNs) can map a vector to a sequence.
     ● [Sutskever et al. NIPS’14]: RNN encoder -> RNN decoder (English sentence -> French sentence).
     ● [Donahue et al. CVPR’15], [Vinyals et al. CVPR’15]: encode image -> RNN decoder -> sentence.
     ● [Venugopalan et al. NAACL’15]: encode video -> RNN decoder -> sentence.
     ● [Venugopalan et al. ICCV’15] (this work): RNN encoder -> RNN decoder -> sentence.

  36. S2VT: Sequence to Sequence Video to Text [Venugopalan et al. ICCV’15]. Encoding stage: a stack of LSTMs reads the per-frame CNN features. Decoding stage: the same LSTM stack then decodes the sentence, e.g., "A man is talking ...".
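
  A much-simplified sketch of the S2VT encode-then-decode flow. It shares one two-layer LSTM across both stages but omits details such as the padding inputs used during encoding and decoding; all names and sizes are illustrative:

      import torch
      import torch.nn as nn

      feat_dim, emb_dim, hidden, vocab_size = 4096, 500, 1000, 1000

      in_proj = nn.Linear(feat_dim, emb_dim)      # project fc7 frame features to the LSTM input size
      embed = nn.Embedding(vocab_size, emb_dim)   # word embeddings use the same input size
      lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden, num_layers=2, batch_first=True)
      to_vocab = nn.Linear(hidden, vocab_size)

      frames = torch.randn(1, 30, feat_dim)       # fc7 features of 30 sampled frames

      # Encoding stage: read the whole frame sequence; keep only the final LSTM state.
      _, state = lstm(in_proj(frames))

      # Decoding stage: emit one word at a time, conditioned on that state.
      word = torch.tensor([[0]])                  # illustrative <BOS> token id
      sentence = []
      for _ in range(20):
          out, state = lstm(embed(word), state)
          word = to_vocab(out[:, -1]).argmax(dim=-1, keepdim=True)
          sentence.append(word.item())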

  37. RGB frames. 1. Train the CNN on ImageNet, 1000 categories [Krizhevsky et al. NIPS’12]. 2. Forward propagate each frame and take the activations from the layer before classification ("fc7") as the 4096-dimensional CNN "feature vector".
