Multimodal Memory Modelling for Video Captioning
Liang Wang & Yan Huang
Center for Research on Intelligent Perception and Computing (CRIPAC)
National Laboratory of Pattern Recognition (NLPR)
Institute of Automation, Chinese Academy of Sciences (CASIA)
Mar 28, 2018
NVAIL Artificial Intelligence Laboratory
Research on artificial intelligence and deep learning
Outline Introduction Model Description Experimental Results Conclusion
Video Captioning
Generate natural sentences to describe video content:
1. A man and a woman performing a musical.
2. A teenage couple perform in an amateur musical.
3. Dancers are playing a routine.
4. People are dancing in a musical.
Potential applications
Challenges:
- Learning an effective mapping from the visual sequence space to the language space
- Modelling long-term visual-textual dependencies
Related Work
Language template-based approach
- Krishnamoorthy et al. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. AAAI 2013.
- Guadarrama et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition. ICCV 2013.
Related Work
Search-based approach
- Xu et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. AAAI 2015.
Related Work
Sequence-to-sequence learning-based approach
- Yao et al. Describing Videos by Exploiting Temporal Structure. ICCV 2015.
- Pan et al. Jointly Modeling Embedding and Translation to Bridge Video and Language. CVPR 2016.
Outline Introduction Model Description Experimental Results Conclusion
Motivation
- Recent work has pointed out that LSTMs do not work well when sequences become very long.
- Neural memory models have shown great potential for long-term dependency modelling, e.g., question answering in NLP.
- Visual working memory is one of the key factors guiding eye movements.
A. Graves et al. Neural Turing Machines. arXiv:1410.5401.
W. Wang et al. Simulating Human Saccadic Scanpaths on Natural Images. CVPR 2011.
Recent Related Work
Fakoor et al. Memory-Augmented Attention Modelling for Videos. arXiv 2016.
Recent Related Work
Agrawal et al. Recurrent Memory Addressing for Describing Videos. arXiv 2016.
Captioning Framework
[Figure: overall architecture]
1. CNN-based video encoder: 2D/3D CNN features are extracted from the frames v_1, v_2, …, v_n.
2. Temporal attention: Attend_{t+1}, Attend_{t+2}, … compute weights a_1, …, a_n over the frame features and form the attended representation Σ_{j=1}^{n} a_j v_j, interacting with the memory through read_att and write_att.
3. Multimodal memory: Mem_t, Mem_{t+1}, Mem_{t+2}, … is updated at every timestep through read/write operations.
4. LSTM-based text decoder: LSTM_t, LSTM_{t+1}, … interact with the memory through read_dec and write_dec and generate the sentence word by word from #start ("A man is …").
CNN-Based Video Encoder
C3D, ResNet (Residual), GoogLeNet, VGG-19, Inception-v3
Multimodal Memory Modelling
Multimodal memory: an N × M matrix accessed through four heads, write_dec (①), read_att (②), write_att (③), and read_dec (④)
① Writing hidden representations to update the memory
② Reading the updated memory for temporal attention
(a minimal sketch of these two interactions follows)
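A minimal numpy sketch of these two interactions. The slides only specify that the memory is an N × M matrix with read/write heads, so the erase-then-add write rule below is an assumption borrowed from Neural Turing Machines; `w_write` and `w_read` are addressing weights over the N slots, produced by the content-based addressing shown later.

```python
import numpy as np

def memory_write(mem, w_write, erase, add):
    """Step 1: write a hidden representation into the N x M memory
    (erase-then-add, an NTM-style assumption).

    mem:     (N, M) memory matrix
    w_write: (N,)   write weights over slots, summing to 1
    erase:   (M,)   erase vector in [0, 1]
    add:     (M,)   add vector derived from the hidden representation
    """
    mem = mem * (1.0 - np.outer(w_write, erase))  # selectively erase slots
    return mem + np.outer(w_write, add)           # selectively add new content

def memory_read(mem, w_read):
    """Step 2: read the updated memory as a weighted sum over its N slots."""
    return w_read @ mem  # (M,)
```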
Multimodal Memory Modelling
Temporal attention selection for video representation: at each step, Attend_{t+k} reads from the memory (read_att), computes attention weights a_1^{t+k}, a_2^{t+k}, …, a_n^{t+k} over the frame features v_1, …, v_n, and forms the attended representation Σ_{j=1}^{n} a_j^{t+k} v_j, which is written back to the memory (write_att).
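A sketch of the temporal attention selection. The additive scoring form is an assumption following Yao et al. (ICCV 2015); here the query is the memory read vector rather than the decoder state, matching the slide, and `W_a`, `U_a`, `v_a` are hypothetical learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(frames, r_att, W_a, U_a, v_a):
    """Soft selection over n frame features v_1..v_n.

    frames: (n, d) 2D/3D CNN features
    r_att:  (m,)   vector read from the multimodal memory
    W_a (k, d), U_a (k, m), v_a (k,): hypothetical learned parameters
    """
    scores = np.tanh(frames @ W_a.T + U_a @ r_att) @ v_a  # (n,) relevance scores
    a = softmax(scores)                                   # weights a_1..a_n
    return a @ frames, a                                  # sum_j a_j * v_j
```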
Multimodal Memory Modelling
(The same N × M memory, now on the visual side)
③ Writing the selected visual information to update the memory (write_att)
④ Reading the updated memory for the LSTM-based language model (read_dec)
(the sketch below puts all four interactions together in one decoding step)
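Putting interactions ①-④ together, one decoding timestep might look like the following sketch. It reuses the erase-then-add assumption from before; `p` is a dict of hypothetical addressing weights, erase vectors, and projection matrices, all illustrative names rather than the paper's notation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def write(mem, w, erase, add):
    """Erase-then-add memory update (same NTM-style assumption as above)."""
    return mem * (1.0 - np.outer(w, erase)) + np.outer(w, add)

def decode_step(mem, frames, h_prev, p):
    """One timestep covering interactions 1-4."""
    # 1) write the previous decoder hidden state into memory (write_dec)
    mem = write(mem, p["w_write_dec"], p["e_dec"], p["W_h"] @ h_prev)
    # 2) read memory to guide temporal attention over the frames (read_att)
    r_att = p["w_read_att"] @ mem
    a = softmax(np.tanh(frames @ p["W_a"].T + p["U_a"] @ r_att) @ p["v_a"])
    context = a @ frames                    # attended video representation
    # 3) write the selected visual information into memory (write_att)
    mem = write(mem, p["w_write_att"], p["e_att"], p["W_c"] @ context)
    # 4) read the updated memory for the LSTM-based language model (read_dec)
    r_dec = p["w_read_dec"] @ mem
    return mem, r_dec
```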
LSTM-Based Text Decoder
LSTM_t, LSTM_{t+1}, LSTM_{t+2}, …, LSTM_{t+k} generate the sentence word by word, from #start to #end ("A man is …").
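A sketch of one decoder step, assuming a standard LSTM cell whose input concatenates the previous word's embedding with the memory read r_dec; `p` holds hypothetical parameters (E: embedding table, W/U/b: gate weights stacked as [i, f, o, g], W_out/b_out: output projection).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoder_step(prev_word_id, h, c, r_dec, p):
    """Predict the next word from the previous word and the memory read."""
    x = np.concatenate([p["E"][prev_word_id], r_dec])  # word emb + read vector
    z = p["W"] @ x + p["U"] @ h + p["b"]
    H = h.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g                    # new cell state
    h = o * np.tanh(c)                   # new hidden state
    logits = p["W_out"] @ h + p["b_out"]
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), h, c     # distribution over the next word
```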
Memory Addressing & Regularized Loss
- Content-based memory addressing
- Regularized training loss
(a sketch of both follows)
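A sketch of the two components. Content-based addressing follows the standard NTM recipe (cosine similarity sharpened by a scalar beta). The exact regularizer is not spelled out in the slides; the doubly stochastic attention term below, which encourages every frame to be attended roughly once over the T decoding steps, is an assumption borrowed from Xu et al.'s Show, Attend and Tell, and `lam` is a hypothetical weight.

```python
import numpy as np

def content_addressing(mem, key, beta):
    """Content-based addressing: cosine similarity between an emitted key
    and each of the N memory slots, sharpened by beta, softmax-normalized."""
    sim = mem @ key / (np.linalg.norm(mem, axis=1) * np.linalg.norm(key) + 1e-8)
    e = np.exp(beta * sim)
    return e / e.sum()  # (N,) read/write weights over slots

def regularized_loss(step_log_probs, att_weights, lam=1e-4):
    """Negative log-likelihood of the ground-truth words plus an attention
    regularizer (the doubly stochastic term is an assumption here).
    att_weights: (T, n) attention weights over n frames at T steps."""
    nll = -np.sum(step_log_probs)
    reg = np.sum((1.0 - att_weights.sum(axis=0)) ** 2)
    return nll + lam * reg
```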
Outline Introduction Model Description Experimental Results Conclusion
Implementation Details
- Variable-length sentences: marked with a start tag and an end tag
- Beam search: beam size 5 (see the sketch below)
- LSTM-based decoder: 1024 visual hidden units, word embedding size 468
- Memory matrix: size (128, 512), initialized with GlorotUniform; read and write weights initialized with OneHot
- Others: minibatch size 64, ADADELTA optimizer, dropout rate 0.5, gradient norm clipped to (-10, 10)
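A minimal beam-search sketch with beam size 5, as listed above. `step_fn(tokens)` is a hypothetical stand-in for one full pass of the decoder, returning log-probabilities over the vocabulary for the next word.

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_size=5, max_len=30):
    """Keep the beam_size best partial sentences, extend each by its
    beam_size most likely next words, and collect hypotheses that emit
    the end tag; return the highest-scoring complete sentence."""
    beams = [([start_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_p = step_fn(tokens)                   # (vocab,) log-probs
            for w in np.argsort(log_p)[-beam_size:]:  # top-k extensions
                candidates.append((tokens + [int(w)], score + log_p[w]))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:                                 # all hypotheses ended
            break
    return max(finished + beams, key=lambda b: b[1])[0]
```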
Experimental Results
Microsoft Video Description Dataset
- 1,970 YouTube videos: training set (1,200), validation set (100), test set (670)
- Each clip lasts 10 to 25 seconds
- Each video has about 40 sentences
Experimental Results
[Table: comparison with recent methods from CVPR 2017 and ICLR 2017]
Experimental Results
Microsoft Research Video to Text Dataset
- The largest dataset in terms of sentences and vocabulary: 10,000 video clips and 200,000 sentences
- Each video is labelled with about 20 sentences
- Training set (6,513), validation set (497), test set (2,990)
Experimental Results
Description Generation
M³: our model; SA: Yao et al., ICCV 2015
Outline Introduction Model Description Experimental Results Conclusion
Conclusion & Future Work
Textual/Visual/Attribute Memory
Visuospatial sketchpad, central executive, phonological loop (Working Memory, Baddeley et al.)
Acknowledgement
Thanks to the NVAIL Artificial Intelligence Laboratory for sponsoring excellent hardware resources.
Thank you! (Q/A)
wangliang@nlpr.ia.ac.cn