Multimodal Memory Modelling for Video Captioning
Liang Wang & Yan Huang
Center for Research on Intelligent Perception and Computing (CRIPAC)
National Laboratory of Pattern Recognition (NLPR)
Institute of Automation, Chinese Academy of Sciences (CASIA)
Mar 28, 2018
NVAIL Artificial Intelligence Laboratory
Research on artificial intelligence and deep learning
Outline Introduction Model Description Experimental Results Conclusion
Video Captioning
Generate natural sentences to describe video content:
1. A man and a woman performing a musical.
2. A teenage couple perform in an amateur musical.
3. Dancers are playing a routine.
4. People are dancing in a musical.
Potential applications
Challenges:
- Learning an effective mapping from the visual sequence space to the language space
- Modelling long-term visual-textual dependencies
Related Work
Language template-based approach
- Krishnamoorthy et al. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. AAAI 2013.
- Guadarrama et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition. ICCV 2013.
Related Work
Search-based approach
- Xu et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. AAAI 2015.
Related Work
Sequence-to-sequence learning-based approach
- Yao et al. Describing Videos by Exploiting Temporal Structure. ICCV 2015.
- Pan et al. Jointly Modeling Embedding and Translation to Bridge Video and Language. CVPR 2016.
Outline Introduction Model Description Experimental Results Conclusion
Motivation
- Recent work has pointed out that LSTMs do not work well when sequences become very long.
- Neural memory models have shown great potential for long-term dependency modelling, e.g., question answering in NLP.
- Visual working memory is one of the key factors guiding eye movements.
A. Graves et al. Neural Turing Machines. arXiv:1410.5401.
W. Wang et al. Simulating Human Saccadic Scanpaths on Natural Images. CVPR 2011.
Recent Related Work
Fakoor et al. Memory-Augmented Attention Modelling for Videos. arXiv 2016.
Recent Related Work
Agrawal et al. Recurrent Memory Addressing for Describing Videos. arXiv 2016.
Captioning Framework
[Figure: overall architecture]
1. CNN-based video encoder: 2D/3D CNN features are extracted from the frames v_1, v_2, …, v_n.
2. Temporal attention: Attend_{t+1}, Attend_{t+2}, … compute weights a_1, …, a_n over the frame features and form the attended representation Σ_{j=1}^{n} a_j v_j, interacting with the memory through read_att and write_att.
3. Multimodal memory: Mem_t, Mem_{t+1}, Mem_{t+2}, … is updated at every timestep through read/write operations.
4. LSTM-based text decoder: LSTM_t, LSTM_{t+1}, … interact with the memory through read_dec and write_dec and generate the sentence word by word from #start ("A man is …").
CNN-Based Video Encoder
C3D, ResNet (Residual), GoogLeNet, VGG-19, Inception-v3
Multimodal Memory Modelling
Multimodal memory: an N × M matrix accessed through four heads, write_dec (①), read_att (②), write_att (③), and read_dec (④)
① Writing hidden representations to update the memory
② Reading the updated memory for temporal attention
(a minimal sketch of these two interactions follows)
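A minimal numpy sketch of these two interactions. The slides only specify that the memory is an N × M matrix with read/write heads, so the erase-then-add write rule below is an assumption borrowed from Neural Turing Machines; `w_write` and `w_read` are addressing weights over the N slots, produced by the content-based addressing shown later.

```python
import numpy as np

def memory_write(mem, w_write, erase, add):
    """Step 1: write a hidden representation into the N x M memory
    (erase-then-add, an NTM-style assumption).

    mem:     (N, M) memory matrix
    w_write: (N,)   write weights over slots, summing to 1
    erase:   (M,)   erase vector in [0, 1]
    add:     (M,)   add vector derived from the hidden representation
    """
    mem = mem * (1.0 - np.outer(w_write, erase))  # selectively erase slots
    return mem + np.outer(w_write, add)           # selectively add new content

def memory_read(mem, w_read):
    """Step 2: read the updated memory as a weighted sum over its N slots."""
    return w_read @ mem  # (M,)
```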
Multimodal Memory Modelling
Temporal attention selection for video representation: at each step, Attend_{t+k} reads from the memory (read_att), computes attention weights a_1^{t+k}, a_2^{t+k}, …, a_n^{t+k} over the frame features v_1, …, v_n, and forms the attended representation Σ_{j=1}^{n} a_j^{t+k} v_j, which is written back to the memory (write_att).
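A sketch of the temporal attention selection. The additive scoring form is an assumption following Yao et al. (ICCV 2015); here the query is the memory read vector rather than the decoder state, matching the slide, and `W_a`, `U_a`, `v_a` are hypothetical learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(frames, r_att, W_a, U_a, v_a):
    """Soft selection over n frame features v_1..v_n.

    frames: (n, d) 2D/3D CNN features
    r_att:  (m,)   vector read from the multimodal memory
    W_a (k, d), U_a (k, m), v_a (k,): hypothetical learned parameters
    """
    scores = np.tanh(frames @ W_a.T + U_a @ r_att) @ v_a  # (n,) relevance scores
    a = softmax(scores)                                   # weights a_1..a_n
    return a @ frames, a                                  # sum_j a_j * v_j
```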
Multimodal Memory Modelling
(The same N × M memory, now on the visual side)
③ Writing the selected visual information to update the memory (write_att)
④ Reading the updated memory for the LSTM-based language model (read_dec)
(the sketch below puts all four interactions together in one decoding step)
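Putting interactions ①-④ together, one decoding timestep might look like the following sketch. It reuses the erase-then-add assumption from before; `p` is a dict of hypothetical addressing weights, erase vectors, and projection matrices, all illustrative names rather than the paper's notation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def write(mem, w, erase, add):
    """Erase-then-add memory update (same NTM-style assumption as above)."""
    return mem * (1.0 - np.outer(w, erase)) + np.outer(w, add)

def decode_step(mem, frames, h_prev, p):
    """One timestep covering interactions 1-4."""
    # 1) write the previous decoder hidden state into memory (write_dec)
    mem = write(mem, p["w_write_dec"], p["e_dec"], p["W_h"] @ h_prev)
    # 2) read memory to guide temporal attention over the frames (read_att)
    r_att = p["w_read_att"] @ mem
    a = softmax(np.tanh(frames @ p["W_a"].T + p["U_a"] @ r_att) @ p["v_a"])
    context = a @ frames                    # attended video representation
    # 3) write the selected visual information into memory (write_att)
    mem = write(mem, p["w_write_att"], p["e_att"], p["W_c"] @ context)
    # 4) read the updated memory for the LSTM-based language model (read_dec)
    r_dec = p["w_read_dec"] @ mem
    return mem, r_dec
```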
LSTM-Based Text Decoder
LSTM_t, LSTM_{t+1}, LSTM_{t+2}, …, LSTM_{t+k} generate the sentence word by word, from #start to #end ("A man is …").
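A sketch of one decoder step, assuming a standard LSTM cell whose input concatenates the previous word's embedding with the memory read r_dec; `p` holds hypothetical parameters (E: embedding table, W/U/b: gate weights stacked as [i, f, o, g], W_out/b_out: output projection).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoder_step(prev_word_id, h, c, r_dec, p):
    """Predict the next word from the previous word and the memory read."""
    x = np.concatenate([p["E"][prev_word_id], r_dec])  # word emb + read vector
    z = p["W"] @ x + p["U"] @ h + p["b"]
    H = h.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g                    # new cell state
    h = o * np.tanh(c)                   # new hidden state
    logits = p["W_out"] @ h + p["b_out"]
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), h, c     # distribution over the next word
```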
Memory Addressing & Regularized Loss
- Content-based memory addressing
- Regularized training loss
(a sketch of both follows)
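A sketch of the two components. Content-based addressing follows the standard NTM recipe (cosine similarity sharpened by a scalar beta). The exact regularizer is not spelled out in the slides; the doubly stochastic attention term below, which encourages every frame to be attended roughly once over the T decoding steps, is an assumption borrowed from Xu et al.'s Show, Attend and Tell, and `lam` is a hypothetical weight.

```python
import numpy as np

def content_addressing(mem, key, beta):
    """Content-based addressing: cosine similarity between an emitted key
    and each of the N memory slots, sharpened by beta, softmax-normalized."""
    sim = mem @ key / (np.linalg.norm(mem, axis=1) * np.linalg.norm(key) + 1e-8)
    e = np.exp(beta * sim)
    return e / e.sum()  # (N,) read/write weights over slots

def regularized_loss(step_log_probs, att_weights, lam=1e-4):
    """Negative log-likelihood of the ground-truth words plus an attention
    regularizer (the doubly stochastic term is an assumption here).
    att_weights: (T, n) attention weights over n frames at T steps."""
    nll = -np.sum(step_log_probs)
    reg = np.sum((1.0 - att_weights.sum(axis=0)) ** 2)
    return nll + lam * reg
```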
Outline Introduction Model Description Experimental Results Conclusion
Implementation Details
- Variable-length sentences: marked with a start tag and an end tag
- Beam search: beam size 5 (see the sketch below)
- LSTM-based decoder: 1024 visual hidden units, word embedding size 468
- Memory matrix: size (128, 512), initialized with GlorotUniform; read and write weights initialized with OneHot
- Others: minibatch size 64, ADADELTA optimizer, dropout rate 0.5, gradient norm clipped to (-10, 10)
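A minimal beam-search sketch with beam size 5, as listed above. `step_fn(tokens)` is a hypothetical stand-in for one full pass of the decoder, returning log-probabilities over the vocabulary for the next word.

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_size=5, max_len=30):
    """Keep the beam_size best partial sentences, extend each by its
    beam_size most likely next words, and collect hypotheses that emit
    the end tag; return the highest-scoring complete sentence."""
    beams = [([start_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_p = step_fn(tokens)                   # (vocab,) log-probs
            for w in np.argsort(log_p)[-beam_size:]:  # top-k extensions
                candidates.append((tokens + [int(w)], score + log_p[w]))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:                                 # all hypotheses ended
            break
    return max(finished + beams, key=lambda b: b[1])[0]
```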
Experimental Results
Microsoft Video Description Dataset
- 1,970 YouTube videos: training set (1,200), validation set (100), test set (670)
- Each clip lasts 10 to 25 seconds
- Each video has about 40 sentences
Experimental Results
[Table: comparison with recent methods from CVPR 2017 and ICLR 2017]
Experimental Results
Microsoft Research Video to Text Dataset
- The largest dataset in terms of sentences and vocabulary: 10,000 video clips and 200,000 sentences
- Each video is labelled with about 20 sentences
- Training set (6,513), validation set (497), test set (2,990)
Experimental Results
Description Generation
M³: our model; SA: Yao et al., ICCV 2015
Outline Introduction Model Description Experimental Results Conclusion
Conclusion & Future Work
Textual/Visual/Attribute Memory
Visuospatial sketchpad, central executive, phonological loop (Working Memory, Baddeley et al.)
Acknowledgement
Thanks to the NVAIL Artificial Intelligence Laboratory for sponsoring excellent hardware resources.
Thank you! (Q/A)
wangliang@nlpr.ia.ac.cn