Multimodal Memory Modelling for Video Captioning


  1. Multimodal Memory Modelling for Video Captioning. Liang Wang & Yan Huang, Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA). Mar 28, 2018

  2. NVAIL Artificial Intelligence Laboratory. Research on artificial intelligence and deep learning

  3. Outline  Introduction  Model Description  Experimental Results  Conclusion

  4. Outline  Introduction  Model Description  Experimental Results  Conclusion

  5. Video Captioning  Generate natural sentences to describe video content, e.g.: 1. A man and a woman performing a musical. 2. A teenage couple perform in an amateur musical. 3. Dancers are playing a routine. 4. People are dancing in a musical.  Potential applications  Challenges: learning an effective mapping from the visual sequence space to the language space, and modelling the long-term visual-textual dependency

  6. Related Work  Language template-based approach. Krishnamoorthy et al. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. AAAI 2013. Guadarrama et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition. ICCV 2013.

  7. Related Work  Search-based approach. Xu et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. AAAI 2015.

  8. Related Work  Sequence-to-sequence learning-based approach. Yao et al. Describing Videos by Exploiting Temporal Structure. ICCV 2015. Pan et al. Jointly Modeling Embedding and Translation to Bridge Video and Language. CVPR 2016.

  9. Outline  Introduction  Model Description  Experimental Results  Conclusion

  10. Motivation  Recent work has pointed out that LSTMs do not work well once sequences grow long.  Neural memory models have shown great potential for long-term dependency modelling, e.g., QA in NLP.  Visual working memory is one of the key factors guiding eye movements. A. Graves, et al. Neural Turing Machines. arXiv:1410.5401. W. Wang, et al. Simulating Human Saccadic Scanpath on Natural Images. CVPR 2011.

  11. Recent Related Work  Fakoor et al. Memory-Augmented Attention Modelling for Videos. arXiv 2016.

  12. Recent Related Work

  13. Recent Related Work  Agrawal et al. Recurrent Memory Addressing for Describing Videos. arXiv 2016.

  14. Captioning Framework  (architecture figure) The framework has three parts: a CNN-based video encoder (2D/3D CNNs over the frames, giving features v_1, ..., v_n), a multimodal memory Mem_t that the other modules access through read_att/write_att and read_dec/write_dec operations, and an LSTM-based text decoder (LSTM_t, LSTM_t+1, ...). At each step, an Attend module produces attention weights α_1, ..., α_n over the frame features, and the decoder generates the sentence from the #start tag ("A man is ...")

  15. CNN-Based Video Encoder  Candidate feature extractors: C3D, ResNet, GoogLeNet, VGG-19, Inception-v3

  16. Multimodal Memory Modelling  Multimodal Memory: an N × M matrix, accessed through the numbered read/write operations in the figure (read_att, write_att, read_dec, write_dec) ① Writing hidden representations to update memory ② Reading the updated memory for temporal attention
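Since the read/write equations live only in the slide figure, here is a minimal NumPy sketch of what steps ① and ② amount to, assuming NTM-style erase/add writing and soft reading (the function names follow the figure labels read_att/write_att; the paper's exact gating may differ):

    import numpy as np

    def write_memory(M, w, erase, add):
        # NTM-style write: M is the (N, m) memory matrix, w is an (N,)
        # vector of write weights, erase/add are (m,) vectors computed
        # from the hidden state being written.
        M = M * (1.0 - np.outer(w, erase))  # selectively erase old content
        return M + np.outer(w, add)         # selectively add new content

    def read_memory(M, w):
        # Soft read: a weighted sum of memory rows; w is (N,) read weights.
        return w @ M                        # (m,) read vector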

  17. Multimodal Memory Modelling  Temporal attention selection for video representation: at each decoding step t+k, the Attend_t+k module reads the memory (read_att) to compute attention weights α_1^(t+k), ..., α_n^(t+k) over the n frame features, forms the attended feature Σ_j α_j^(t+k) v_j, and writes it back to the memory (write_att)
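A small sketch of this temporal attention step, in the spirit of Yao et al.'s soft attention (the additive scoring form and the parameter names W_a, W_h, v are assumptions, since the slide shows only the figure):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def temporal_attention(frame_feats, h, W_a, W_h, v):
        # frame_feats: (n, d) per-frame features v_1..v_n; h: the state
        # read from the multimodal memory via read_att.
        scores = np.tanh(frame_feats @ W_a + h @ W_h) @ v  # (n,) relevance
        alpha = softmax(scores)                            # weights sum to 1
        return alpha @ frame_feats                         # sum_j alpha_j * v_j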

  18. Multimodal Memory Modelling  Multimodal Memory: the same N × M matrix ③ Writing selected visual information to update memory ④ Reading the updated memory for the LSTM-based language model

  19. LSTM-Based Text Decoder  A chain of LSTM steps (LSTM_t, LSTM_t+1, ..., LSTM_t+k) generates the sentence word by word, from the #start tag ("A man is ...") through to the #end tag
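A minimal decoding-loop sketch for this stage, assuming a black-box LSTM cell and a vocabulary containing the #start/#end tags (all names are illustrative, not the paper's code; the paper uses beam search with size 5, per slide 22, instead of the greedy argmax here):

    import numpy as np

    def decode(lstm_step, embed, vocab, h0, max_len=30):
        # Generate a caption word by word until #end is emitted.
        # lstm_step(h, x) -> (h_next, logits) is an assumed LSTM cell that
        # internally reads from and writes to the multimodal memory.
        h, word, caption = h0, vocab["#start"], []
        for _ in range(max_len):
            h, logits = lstm_step(h, embed[word])
            word = int(np.argmax(logits))   # greedy choice for brevity
            if word == vocab["#end"]:
                break
            caption.append(word)
        return caption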

  20. Memory Addressing & Regularized Loss  Content-Based Memory Addressing  Regularized Loss
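The equations appear only in the slide figure; for reference, content-based addressing in the Neural Turing Machine sense computes the read/write weights over the N memory rows from a sharpened cosine similarity between an emitted key k_t and each row M_t(i):

    w_t(i) = \frac{\exp\bigl(\beta_t \, K(k_t, M_t(i))\bigr)}{\sum_{j=1}^{N} \exp\bigl(\beta_t \, K(k_t, M_t(j))\bigr)}, \qquad K(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}

For the regularized loss, a standard form (an assumption here, following the attention-regularization literature rather than the slide) is the sentence negative log-likelihood plus a term encouraging the attention to cover every frame:

    L = -\sum_t \log p(w_t \mid w_{<t}, V) + \lambda \sum_{j=1}^{n} \Bigl(1 - \sum_t \alpha_{t,j}\Bigr)^2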

  21. Outline  Introduction  Model Description  Experimental Results  Conclusion

  22. Implementation Details  Variable-length sentences  A start tag and an end tag  Beam search  beam size: 5  LSTM-based decoder  visual hidden units: 1024, word embedding size: 468  Memory matrix  memory size: (128, 512), GlorotUniform initialization  read and write weights initialized with OneHot (see the sketch below)  Others  minibatch size: 64, optimization algorithm: ADADELTA  dropout with rate 0.5, gradient norm clipped to (-10, 10)
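A short NumPy sketch of the memory initialization described above; the Glorot limit for the non-square matrix is the usual fan-in/fan-out form, and placing the one-hot mass on row 0 is an assumption:

    import numpy as np

    def glorot_uniform(shape, rng=np.random.default_rng(0)):
        limit = np.sqrt(6.0 / (shape[0] + shape[1]))
        return rng.uniform(-limit, limit, size=shape)

    N, m = 128, 512                          # memory size from this slide
    memory = glorot_uniform((N, m))          # GlorotUniform-initialized memory
    w_read = np.zeros(N); w_read[0] = 1.0    # OneHot initial read weights
    w_write = np.zeros(N); w_write[0] = 1.0  # OneHot initial write weights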

  23. Experimental Results  Microsoft Video Description dataset (MSVD)  1,970 YouTube videos: training set (1,200), validation set (100), and test set (670)  each clip is 10 to 25 seconds long  each video has about 40 sentences

  24. Experimental Results  Comparison with state-of-the-art methods, including CVPR 2017 and ICLR 2017 approaches (results table in the figure)

  25. Experimental Results  Microsoft Research Video to Text dataset (MSR-VTT)  the largest dataset in terms of sentences and vocabulary: 10,000 video clips and 200,000 sentences  each video is labelled with about 20 sentences  training set (6,513), validation set (497), and test set (2,990)

  26. Experimental Results  Description Generation  M³: our model; SA: Yao et al., ICCV 2015

  27. Outline  Introduction  Model Description  Experimental Results  Conclusion

  28. Conclusion & Future Work  Textual/Visual/Attribute Memory  Inspired by the human working-memory model: visuospatial sketchpad, central executive, phonological loop (Working Memory, Baddeley et al.)

  29. Acknowledgement  NVAIL Artificial Intelligence Laboratory  Sponsoring excellent hardware resources

  30. Thank you! (Q/A)  wangliang@nlpr.ia.ac.cn
