Sequence to Sequence – Video to Text (Venugopalan et al.)
Presented by Garrett Bingham
Problem

Given a variable-length sequence of video frames, generate a variable-length natural language description of the video.
Motivation

Video description in general has applications in:
● Human-robot interaction
● Video indexing
● Describing movies for the blind

Sequence to sequence in particular: video descriptions should
● Be sensitive to temporal structure
● Allow input and output of variable length

Previous work resolved variable-length input with:
● Holistic video representations
● Pooling over frames
● Sub-sampling on a fixed number of input frames
Approach
Encoding & Decoding

Encoding:
● LSTMs encode the frame sequence into a hidden state
● The hidden representation is concatenated with null input words
● No loss is computed while encoding

Decoding:
● The <BOS> tag prompts the LSTM to decode the hidden state into a sequence of words
● The model maximizes the log-likelihood of the predicted output sentence given the hidden representation of the visual frames and the previous words

Loss is propagated back in time, allowing the LSTM to learn an appropriate hidden state representation while encoding.
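The two-phase layout above (frames first with no loss, then words with loss) can be sketched as a function that builds one training sequence for a single stacked LSTM. This is a minimal illustration, not the authors' code; the token names (`PAD`, `BOS`, `EOS`) and the tuple layout are assumptions.

```python
# Illustrative sketch of the S2VT input layout: the LSTM reads frame
# features with padded word inputs (loss masked out), then padded frame
# inputs with word inputs (loss turned on). Token names are hypothetical.
PAD, BOS, EOS = "<pad>", "<BOS>", "<EOS>"

def build_s2vt_sequence(frame_feats, caption_words):
    """Return per-timestep tuples (frame_input, word_input, target, loss_mask)."""
    steps = []
    # Encoding phase: real frames, null word input, no loss.
    for f in frame_feats:
        steps.append((f, PAD, None, 0.0))
    # Decoding phase: null frame input, previous word in, next word as target.
    words_in = [BOS] + caption_words
    words_out = caption_words + [EOS]
    for w_in, w_out in zip(words_in, words_out):
        steps.append((PAD, w_in, w_out, 1.0))
    return steps
```

For a 2-frame clip captioned "a man runs", this yields 2 encoding steps followed by 4 decoding steps (the extra step predicts `<EOS>`), making the total sequence length variable in both the input and the output.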
Input Data

● RGB frames: pre-trained CNNs process the video frames. The fully-connected classification layer is replaced with a linear embedding into a 500-dimensional space.
● Optical flow: the output of a CNN pre-trained on the UCF101 video dataset is mapped into a 500-dimensional space.
● RGB + Flow: shallow fusion
● Text: one-hot vector encoding
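Shallow fusion here means combining the two models' word distributions at each decoding step with a single interpolation weight (the α parameter criticized later in the talk). A minimal sketch, assuming the fusion is a convex combination of per-word probabilities; the exact form used in the paper may differ:

```python
def shallow_fusion(p_rgb, p_flow, alpha=0.5):
    """Fuse per-word probabilities from the RGB and flow models.

    alpha weights the RGB model; (1 - alpha) weights the flow model.
    alpha would be tuned on held-out data. Dicts map word -> probability.
    """
    vocab = set(p_rgb) | set(p_flow)
    fused = {w: alpha * p_rgb.get(w, 0.0) + (1 - alpha) * p_flow.get(w, 0.0)
             for w in vocab}
    z = sum(fused.values())  # renormalize in case supports differ
    return {w: p / z for w, p in fused.items()}
```

Because the combination is a single scalar interpolation, the fused model cannot, for example, trust flow more on motion verbs and RGB more on nouns, which is the limitation raised in the critique.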
Datasets

● MSVD: Mechanical Turk workers collected short clips depicting a single activity and described each video with a single sentence. Multilingual corpus, but only the English descriptions are used.
● MPII-MD: contains video clips extracted from Hollywood movies, along with movie scripts / audio description data. Challenging due to diverse visual and textual content.
● M-VAD: similar to MPII-MD.

"Together they form the largest parallel corpora with open domain video and natural language descriptions."
METEOR

"METEOR compares exact token matches, stemmed tokens, paraphrase matches, as well as semantically similar matches using WordNet synonyms."
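To make the metric concrete, here is a simplified METEOR-style score using exact token matches only: unigram precision and recall are combined into a recall-weighted F-mean, then discounted by a fragmentation penalty based on how many contiguous chunks the matches form. Real METEOR adds the stem/paraphrase/synonym matching stages quoted above and uses tuned parameters, so this sketch understates its flexibility.

```python
def simple_meteor(hypothesis, reference):
    """Exact-match-only approximation of the METEOR score."""
    hyp, ref = hypothesis.split(), reference.split()
    # Greedy alignment: each hypothesis token to the first unused
    # matching reference position.
    used, align = set(), []
    for i, h in enumerate(hyp):
        for j, r in enumerate(ref):
            if j not in used and h == r:
                used.add(j)
                align.append((i, j))
                break
    m = len(align)
    if m == 0:
        return 0.0
    precision, recall = m / len(hyp), m / len(ref)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Chunks: maximal runs of matches contiguous in both sentences.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(align, align[1:]):
        if not (i2 == i1 + 1 and j2 == j1 + 1):
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

The recall-heavy F-mean (recall weighted 9:1) is why METEOR correlates better with human judgment than precision-oriented metrics on description tasks.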
Results

● Random frame order hurts performance, implying the full model learns temporal features
● Flow images alone do poorly, but outperform previous work when combined with RGB
● Movie datasets are hard
Examples

"A subject is verbing an object."
M-VAD is much more difficult: the descriptions are complex and have a unique style. This would be difficult for humans too!
Conclusion

● First sequence-to-sequence approach to video description
● Learns the temporal structure of the data
● State-of-the-art performance on MSVD
● Outperforms related work on MPII-MD and M-VAD
● A simple approach outperforms more complicated ones (e.g. GoogLeNet + 3D-CNN)
Critique

● Only one metric: the authors justify using METEOR over other metrics, but adding other metrics would have been straightforward and potentially insightful (e.g. what fraction of descriptions are relevant-ish?)
● Rudimentary RGB + Flow fusion: we can do more than just tune a single α parameter
● Significance of results: the improvements on each dataset are small (29.6 → 29.8, 7.0 → 7.1, 6.3 → 6.7), raising the question: are we really benefiting from temporal information? Statistical significance tests would be helpful.
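One cheap way to run the significance test asked for above is a paired bootstrap over test videos, assuming per-video metric scores are available for both systems on the same test set (METEOR is usually reported corpus-level, so this is an approximation):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Paired bootstrap test on per-video scores for systems A and B.

    Resamples the test set with replacement and returns the fraction of
    resamples in which system B scores at least as well as system A;
    a small value suggests A's advantage is not due to chance.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins_b = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) >= sum(scores_a[i] for i in idx):
            wins_b += 1
    return wins_b / n_resamples
```

With improvements as small as 29.6 → 29.8, a test like this would tell us whether the gap survives resampling noise on a few hundred test clips.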
● Lack of creativity: "... 42.9% of the predictions are identical to some training sentence, and another 38.3% can be obtained by inserting, deleting or substituting one word from some sentence in the training corpus."
Questions?