  1. Natural-Language Video Description with Deep Recurrent Neural Networks. June 2017. Subhashini Venugopalan, University of Texas at Austin.

  2. Problem Statement: Generate descriptions for events depicted in video clips. Example: “A monkey pulls a dog’s tail and is chased by the dog.”

  3. Applications ● Image and video retrieval by content ● Video description service (e.g., “Children are wearing green shirts. They are dancing as they sing the carol.”) ● Human-robot interaction ● Video surveillance

  4. Outline ● Review (proposal) ○ Background ○ Encoder-Decoder approaches to video description ● External knowledge to improve video description ● External knowledge for novel object captioning ● Temporal segmentation and description for long videos ● Future Directions

  5. Early Work in Video Description ● Extract features ● Classify objects, actions, scenes ● Visual confidences over entities (Subject, Verb, Object, Scene):

     Subjects       Verbs         Objects       Scenes
     person  0.95   slice  0.19   egg     0.31  kitchen  0.64
     monkey  0.01   chop   0.11   onion   0.21  sky      0.17
     animal  0.01   play   0.09   potato  0.20  house    0.07
     parrot  0      speak  0      piano   0     snow     0

     ● Bias with statistics from language ● Factor graph to estimate the most likely entities (S, V, O, P) ● Template-based sentence generation: “A person is slicing an onion in the kitchen.” (A sketch of the template idea follows.) J. Thomason*, S. Venugopalan*, S. Guadarrama, K. Saenko, R. Mooney. COLING’14
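The template idea above can be sketched in a few lines of Python. This is only an illustration of argmax-plus-template generation, not the paper's factor-graph inference; the confidence values are copied from the slide and the `most_likely` helper is hypothetical.

```python
# Illustrative sketch of template-based generation from visual confidences.
# The real system (Thomason et al. COLING'14) biases these scores with
# language statistics in a factor graph; here we just take the visual argmax,
# which picks "egg" where the full model prefers "onion".
visual_confidences = {
    "subject": {"person": 0.95, "monkey": 0.01, "animal": 0.01, "parrot": 0.0},
    "verb":    {"slice": 0.19, "chop": 0.11, "play": 0.09, "speak": 0.0},
    "object":  {"egg": 0.31, "onion": 0.21, "potato": 0.20, "piano": 0.0},
    "scene":   {"kitchen": 0.64, "sky": 0.17, "house": 0.07, "snow": 0.0},
}

def most_likely(scores):
    """Return the entity with the highest visual confidence."""
    return max(scores, key=scores.get)

s, v, o, p = (most_likely(visual_confidences[k])
              for k in ("subject", "verb", "object", "scene"))
gerund = v[:-1] + "ing" if v.endswith("e") else v + "ing"
# Fixed template; a real generator would also choose determiners correctly.
print(f"A {s} is {gerund} an {o} in the {p}.")
# -> "A person is slicing an egg in the kitchen."
```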

  6. Early Work in Video Description ● Limitations: narrow domains; small grammars; template-based sentences; several features and classifiers ● Which objects/actions/scenes should we build classifiers for? [Guadarrama et al. ICCV’13] [Yu and Siskind, ACL’13] [Rohrbach et al. ICCV’13] [Thomason et al. COLING’14]

  7. Can we learn directly from video-sentence pairs, without having to explicitly identify objects/actions/scenes to build classifiers? S. Venugopalan, H. Xu, M. Rohrbach, J. Donahue, R. Mooney, K. Saenko. NAACL’15

  8. Outline ● Review (proposal) ○ Background ○ Encoder-Decoder approaches to video description ● External knowledge to improve video description ● External knowledge for novel object captioning ● Temporal segmentation and description for long videos ● Future Directions

  9. Deep Neural Networks ● Convolutional Neural Networks: map directly from raw pixels and labels; features and classifiers are jointly learned. ● Recurrent Neural Networks: can model sequences; successful in translation and speech; we use LSTMs.

  10. Recurrent Neural Networks (RNNs) can map a vector to a sequence. ● English sentence → RNN encoder → RNN decoder → French sentence [Sutskever et al. NIPS’14] ● Encode → RNN decoder → sentence [Donahue et al. CVPR’15] [Vinyals et al. CVPR’15] ● Encode → RNN decoder → sentence [Venugopalan et al. NAACL’15] ● Key insight: generate a feature representation of the video and “decode” it to a sentence.

  11. Inference: Feature extraction ● Forward propagate each frame through the CNN ● Output: “fc7” features, a 4096-dimension “feature vector” (the activations before the classification layer). [Diagram: per-frame CNN features feed an LSTM decoder that emits “A boy is playing golf <EOS>”.] (An extraction sketch follows.)
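A minimal sketch of the fc7 extraction step, assuming PyTorch with a torchvision VGG-16 as a stand-in for the slide's CNN; dropping the final classifier layer leaves the 4096-dimensional activations.

```python
# Sketch: extract "fc7"-style features (activations before classification).
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = vgg.classifier[:-1]  # drop the 1000-way classification layer
vgg.eval()

frames = torch.randn(16, 3, 224, 224)  # 16 preprocessed video frames
with torch.no_grad():
    fc7 = vgg(frames)                  # (16, 4096) "fc7" feature vectors
print(fc7.shape)
```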

  12. Inference: Mean Pool & Generation ● Input video → Convolutional Net → mean pooling over frames → Recurrent Net → output sentence. [Diagram: the pooled feature conditions an LSTM that emits “A boy is playing golf <EOS>”.] (A sketch follows.)
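A sketch of the mean-pool captioner under illustrative dimensions; the module names and sizes are assumptions, not the paper's exact architecture.

```python
# Sketch: mean-pool frame features into one vector, condition an LSTM decoder.
import torch
import torch.nn as nn

class MeanPoolCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=500, hidden_dim=500, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)  # video feature -> initial state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim); captions: (B, L) word ids
        pooled = frame_feats.mean(dim=1)                  # mean pool over frames
        h0 = torch.tanh(self.feat_proj(pooled)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(out)                              # (B, L, vocab) logits

model = MeanPoolCaptioner()
logits = model(torch.randn(2, 16, 4096), torch.randint(0, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 10000])
```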

  13. Translating Videos to Natural Language [Diagram: mean-pooled CNN features feed an LSTM that emits “A boy is playing golf <EOS>”.] Limitation: does not consider the temporal sequence of frames.

  14. Recurrent Neural Networks (RNNs) can map a vector to a sequence. ● English sentence → RNN encoder → RNN decoder → French sentence [Sutskever et al. NIPS’14] ● Encode → RNN decoder → sentence [Donahue et al. CVPR’15] [Vinyals et al. CVPR’15] ● Encode → RNN decoder → sentence [Venugopalan et al. NAACL’15] ● RNN encoder → RNN decoder → sentence [Venugopalan et al. ICCV’15]

  15. S2VT: Sequence to Sequence Video to Text ● A stacked LSTM first encodes the per-frame CNN features (encoding stage), then decodes the resulting state into a sentence, e.g. “A man is talking ...” (decoding stage). (A simplified sketch follows.) S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko. ICCV’15
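A simplified S2VT-style sketch: the same LSTM reads projected frame features, then continues over word embeddings. The real model stacks two LSTMs and pads the frame channel with zeros during decoding; all sizes here are illustrative.

```python
# Sketch: one LSTM encodes frames, then decodes words from the carried state.
import torch
import torch.nn as nn

class S2VT(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=500, hidden_dim=1000, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # Encoding stage: run the LSTM over projected frame features.
        _, state = self.lstm(self.feat_proj(frame_feats))
        # Decoding stage: continue from the final encoder state over words.
        dec_out, _ = self.lstm(self.embed(captions), state)
        return self.out(dec_out)

model = S2VT()
logits = model(torch.randn(2, 30, 4096), torch.randint(0, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 10000])
```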

  16. Frames: RGB, Flow ● 1. RGB frames → CNN trained on 1000 categories [Simonyan and Zisserman ICLR’15] ● 2. Use optical flow to extract flow images [T. Brox et al. ECCV’04] ● 3. Train a CNN (modified AlexNet) on the 101 action classes of UCF-101 [Donahue et al. CVPR’15] (A flow-image sketch follows.)
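A sketch of producing a 3-channel flow image from two consecutive grayscale frames. The slides use Brox et al. flow; OpenCV's Farneback method is an easy stand-in here, and the HSV encoding is one common convention rather than necessarily the paper's exact one.

```python
# Sketch: dense optical flow between two frames, rendered as a "flow image".
import cv2
import numpy as np

def flow_image(prev_gray, next_gray):
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Encode direction as hue and magnitude as value, yielding a
    # 3-channel image a CNN can consume like an RGB frame.
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```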

  17. Experiments: Dataset Microsoft Research Video Description dataset [Chen & Dolan, ACL’11] Link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/ • 1970 YouTube video snippets • 10-30s each • typically a single activity • 1200 training, 100 validation, 670 test • Annotations: descriptions in multiple languages; ~40 English descriptions per video; descriptions and videos collected on AMT

  18. Sample videos and gold descriptions

     Video 1 (plowing):
     ● A man appears to be plowing a rice field with a plow being pulled by two oxen.
     ● A team of water buffalo pull a plow through a rice paddy.
     ● Domesticated livestock are helping a man plow.
     ● A man leads a team of oxen down a muddy path.
     ● Two oxen walk through some mud.
     ● A man is tilling his land with an ox pulled plow.
     ● Bulls are pulling an object.
     ● Two oxen are plowing a field.
     ● The farmer is tilling the soil.
     ● A man in ploughing the field.

     Video 2 (tightrope):
     ● A man is walking on a rope.
     ● A man is walking across a rope.
     ● A man is balancing on a rope.
     ● A man is balancing on a rope at the beach.
     ● A man walks on a tightrope at the beach.
     ● A man is balancing on a volleyball net.
     ● A man is walking on a rope held by poles.
     ● A man balanced on a wire.
     ● The man is balancing on the wire.
     ● A man is walking on a rope.
     ● A man is standing in the sea shore.

  19. Movie Corpus: DVS ● DVS: a separate audio track for the visually impaired. ● Processed examples: “Looking troubled, someone descends the stairs.” / “Someone rushes into the courtyard. She then puts a head scarf on ...”

  20. Evaluation: Movie Corpora

                           M-VAD [Torabi et al. arXiv’15]    MPII-MD [Rohrbach et al. CVPR’15]
     Source                Univ. of Montreal                 MPII, Germany
     DVS alignment         semi-automated and crowdsourced   semi-automated and crowdsourced
     Movies                92                                94
     Clips                 46,009                            68,000
     Avg. clip length      6.2s                              3.9s
     Sentences per clip    1-2                               ~1
     Sentences             56,634                            68,375

  21. Evaluation Metrics • Machine translation metric: METEOR (based on word similarity and phrasing) • Human evaluation: relevance and grammar (A METEOR scoring sketch follows.)
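A sketch of scoring one generated caption against its references with NLTK's METEOR implementation, an approximation of the original METEOR tool. Recent NLTK versions expect pre-tokenized input and require the WordNet data via nltk.download("wordnet"); the example sentences are illustrative.

```python
# Sketch: METEOR for a single hypothesis against multiple references.
from nltk.translate.meteor_score import meteor_score

references = [
    "a man is slicing an onion".split(),
    "a person slices an onion in the kitchen".split(),
]
hypothesis = "a man is cutting an onion".split()

print(meteor_score(references, hypothesis))  # in [0, 1], higher is better
```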

  22. Results (YouTube), METEOR (%):

     Prior work (FGM)      23.9
     Mean-Pool [1]         27.7
     S2VT (RGB) [2]        29.2
     S2VT (RGB+Flow) [2]   29.8

     [1] S. Venugopalan, H. Xu, M. Rohrbach, J. Donahue, R. Mooney, K. Saenko. NAACL’15
     [2] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko. ICCV’15

  23. Proposed Work ● Short term: incorporate linguistic knowledge to improve descriptions. ● Long term: descriptions for longer videos.

  24. Outline ● Review (proposal) ○ Background ○ Encoder-Decoder approaches to video description ● External knowledge to improve video description ● External knowledge for novel object captioning ● Temporal segmentation and description for long videos ● Future Directions

  25. Can external linguistic knowledge improve descriptive quality? Unsupervised training on external text. S. Venugopalan, L.A. Hendricks, R. Mooney, K. Saenko. EMNLP’16

  26. Integrating Statistical Linguistic Knowledge S. Venugopalan, L.A. Hendricks, R. Mooney, K. Saenko. EMNLP’16

  27. Unsupervised Training on External Text ● Fuse an LSTM language model trained on external text: early fusion, late fusion, or deep fusion. ● Distributional embeddings: replace the one-hot encoding with GloVe.

  28. LSTM Language Model We learn a language model using LSTMs. ● Learns to predict the next word given the previous words in the sequence. [Diagram: input “<BOS> A man is talking” → LSTM → output “A man is talking <EOS>”.] ● Data ○ Web corpus: Wikipedia, UkWac, BNC, Gigaword ○ In-domain: MSCOCO image-caption sentences ○ Vocabulary: 72,700 (most frequent words) (A minimal LM sketch follows.)
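A minimal next-word LSTM language-model sketch in PyTorch; the vocabulary size and dimensions are illustrative.

```python
# Sketch: LSTM LM predicting token t+1 from tokens 1..t at every position.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=500, hidden_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids):
        h, _ = self.lstm(self.embed(word_ids))
        return self.out(h)  # next-word logits at every position

lm = LSTMLanguageModel(vocab_size=1000)  # tiny vocab for the demo
tokens = torch.randint(0, 1000, (4, 12))
loss = nn.functional.cross_entropy(
    lm(tokens[:, :-1]).reshape(-1, 1000),  # predict token t+1 from the prefix
    tokens[:, 1:].reshape(-1))
loss.backward()
```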

  29. Distributional Embedding “You shall know a word by the company it keeps” (J. R. Firth, 1957) ● Dense vector representation of words: semantically similar words are closer (e.g., Dolphin near Porpoise; Paris, Talking, Seaworld farther apart). ● We use GloVe [Pennington et al. EMNLP’14], trained on Wikipedia and Gigaword (6B tokens). ● Replace the one-hot encoded input ([10000], [01000], [00100], ...) with GloVe vectors. (A loading sketch follows.)
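A sketch of loading pretrained GloVe vectors into an embedding layer, assuming the standard glove.6B text format (one "word v1 ... v300" line per word); words missing from the file keep a random initialization.

```python
# Sketch: build an nn.Embedding initialized from a GloVe text file.
import numpy as np
import torch
import torch.nn as nn

def load_glove(path, vocab, dim=300):
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    word2idx = {w: i for i, w in enumerate(vocab)}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in word2idx:
                matrix[word2idx[parts[0]]] = np.asarray(parts[1:], dtype="float32")
    emb = nn.Embedding(len(vocab), dim)
    emb.weight.data.copy_(torch.from_numpy(matrix))
    return emb

# Usage: emb = load_glove("glove.6B.300d.txt", vocab=["a", "man", "is", "talking"])
```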

  30. Early Fusion • Initialize the weights of the caption model from the LSTM LM. [Diagram: use the LM to initialize weights.] (A sketch follows.)
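A sketch of the early-fusion initialization, assuming the caption decoder and the LM share layer names and shapes (as the LSTMLanguageModel and MeanPoolCaptioner sketches above would with equal dimensions); it is an illustration of the idea, not the paper's exact procedure.

```python
# Sketch: copy pretrained LM parameters into the caption decoder before
# fine-tuning on video-sentence pairs. Assumes matching layer shapes.
def init_from_language_model(caption_model, lm):
    caption_model.embed.load_state_dict(lm.embed.state_dict())
    caption_model.lstm.load_state_dict(lm.lstm.state_dict())
    caption_model.out.load_state_dict(lm.out.state_dict())
```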

  31. Late Fusion • Re-score the video LSTM output based on the language model. • Set the fusion coefficient based on a validation set. (A sketch follows.)
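A sketch of one plausible late-fusion rule: interpolate the next-word log-probabilities of the two models with a validation-tuned coefficient. The paper's exact combination may differ; alpha and the function name are assumptions.

```python
# Sketch: re-score the caption model's next-word distribution with the LM's.
import torch

def late_fusion_scores(caption_logprobs: torch.Tensor,
                       lm_logprobs: torch.Tensor,
                       alpha: float = 0.3) -> torch.Tensor:
    # Both inputs: (vocab,) log-probabilities for the next word;
    # alpha is chosen on a validation set.
    return (1.0 - alpha) * caption_logprobs + alpha * lm_logprobs
```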
