Video Paragraph Captioning using Hierarchical Recurrent Neural Networks
Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu
Problem
Given a video, generate a paragraph (multiple sentences):
The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife.
vs. a single sentence:
The person sharpened the knife in the kitchen.
Motivation
Inter-sentence dependency (semantic context):
The person took out some potatoes. The person peeled the potatoes. The person turned on the stove.
We want to model this dependency.
Hierarchy
A paragraph is inherently hierarchical.
[Diagram: a paragraph-level RNN sits on top of per-sentence RNNs, e.g. one RNN for "The person took out some potatoes." and another for "The person peeled the potatoes."]
Framework
(a) Sentence Generator: Input Words → Embedding (512) → Recurrent I (512) → Multimodal (1024) → Hidden (512) → Softmax → MaxID → Predicted Words. The Video Feature Pool feeds Attention I and Attention II, followed by a Sequential Softmax and a Weighted Average that also enters the Multimodal layer.
(b) Paragraph Generator: the Average and Last Instance of the sentence generator's activations form a Sentence Embedding (512) → Embedding (512) → Recurrent II (512) → Paragraph State.
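Putting the sentence-generator pipeline in code form may help. Below is a minimal sketch of one decoding step, assuming GRU cells and the layer sizes shown on the slide; the class and variable names are illustrative, not from the authors' code:

```python
import torch
import torch.nn as nn

class SentenceGenerator(nn.Module):
    """One GRU decoding step: embedding -> Recurrent I -> multimodal -> softmax."""
    def __init__(self, vocab_size, feat_dim, embed=512, hidden=512, multi=1024):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed)       # Embedding (512)
        self.recurrent1 = nn.GRUCell(embed, hidden)            # Recurrent I (512)
        self.multimodal = nn.Linear(hidden + feat_dim, multi)  # Multimodal (1024)
        self.hidden_out = nn.Linear(multi, hidden)             # Hidden (512)
        self.classifier = nn.Linear(hidden, vocab_size)        # Softmax layer

    def step(self, word_id, h_prev, video_feat):
        # word_id: (batch,) current input word ids
        # h_prev:  (batch, hidden) previous Recurrent I state
        # video_feat: (batch, feat_dim) attended video feature
        x = self.embedding(word_id)
        h = self.recurrent1(x, h_prev)
        m = torch.tanh(self.multimodal(torch.cat([h, video_feat], dim=-1)))
        logits = self.classifier(torch.tanh(self.hidden_out(m)))
        next_word = logits.argmax(dim=-1)   # MaxID -> predicted word
        return next_word, h
```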
Visual Features
The video feature pool combines appearance and action feature pools:
> Object appearance: VGG-16 (fc7) [Simonyan et al., 2015], pre-trained on the ImageNet dataset
> Action: C3D (fc6) [Tran et al., 2015], pre-trained on the Sports-1M dataset
> Action: Dense Trajectories + Fisher Vector [Wang et al., 2011]
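As a hedged illustration of extracting the per-frame appearance feature, the sketch below truncates torchvision's pre-trained VGG-16 after fc7; the C3D and Dense-Trajectory features come from their own toolkits and are not shown:

```python
import torch
import torchvision.models as models

# Build a VGG-16 truncated after fc7 (the second 4096-d fully connected layer).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc7 = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:5],  # keep fc6 -> ReLU -> Dropout -> fc7 -> ReLU
)

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)    # a dummy preprocessed video frame
    appearance = fc7(frame)                # (1, 4096) fc7 appearance feature
```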
Attention Model
Learning spatial & temporal attention simultaneously.
[Diagram: the previous recurrent state (time t-1) is scored (dot product) against each feature i-1, i, i+1, ... in the video feature pool; a softmax turns the scores into attention weights, and the resulting weighted-average feature is the input to the multimodal layer.]
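A minimal sketch of this attention step, assuming an additive scoring function over the feature pool (the module and parameter names are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Score each pooled feature against the previous recurrent state."""
    def __init__(self, feat_dim, hidden=512, att_dim=256):
        super().__init__()
        self.proj_h = nn.Linear(hidden, att_dim)    # project previous state t-1
        self.proj_v = nn.Linear(feat_dim, att_dim)  # project each feature i
        self.score = nn.Linear(att_dim, 1)          # scalar score per feature

    def forward(self, h_prev, feat_pool):
        # h_prev: (batch, hidden); feat_pool: (batch, num_feats, feat_dim)
        e = torch.tanh(self.proj_h(h_prev).unsqueeze(1) + self.proj_v(feat_pool))
        weights = torch.softmax(self.score(e).squeeze(-1), dim=1)  # attention weights
        avg = (weights.unsqueeze(-1) * feat_pool).sum(dim=1)       # weighted average
        return avg, weights  # avg is the input to the multimodal layer
```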
Paragraph Generator – unrolled
[Diagram: for sentence n-1, the sentence generator (visual features 7192-d, multimodal 1024-d, embedding/recurrent/hidden layers 512-d; current word → hidden → softmax → maxid → next word) emits a sentence embedding; the 512-d paragraph generator consumes it, and its state is the input to the sentence generator for sentence n.]
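A hedged sketch of the paragraph-level recurrence: the embedded sentence vector drives Recurrent II, whose paragraph state then (re)initializes the sentence generator for the next sentence. The GRU cell and names are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ParagraphGenerator(nn.Module):
    """Recurrent II: consume one sentence embedding, emit the paragraph state."""
    def __init__(self, sent_dim=512, hidden=512):
        super().__init__()
        self.embed = nn.Linear(sent_dim, hidden)      # sentence-embedding layer
        self.recurrent2 = nn.GRUCell(hidden, hidden)  # Recurrent II (512)

    def forward(self, sent_embedding, para_state):
        x = torch.tanh(self.embed(sent_embedding))
        para_state = self.recurrent2(x, para_state)
        return para_state  # initializes the sentence generator for sentence n
```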
Sentence Embedding
[Diagram: Input Words → Embedding (512) → Recurrent I (512); the Average and the Last Instance of these activations are combined into a 512-d Sentence Embedding.]
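One plausible reading of this slide, sketched below: the sentence embedding combines the average of the Recurrent I states over the sentence with the last state ("last instance"). The exact combination in the paper may differ; the projection layer here is illustrative:

```python
import torch
import torch.nn as nn

proj = nn.Linear(1024, 512)  # illustrative projection to the 512-d embedding

def sentence_embedding(rnn_states):
    # rnn_states: (seq_len, batch, 512) Recurrent I states over the sentence
    avg = rnn_states.mean(dim=0)   # "Average"
    last = rnn_states[-1]          # "Last Instance"
    return torch.tanh(proj(torch.cat([avg, last], dim=-1)))  # (batch, 512)
```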
Experiments - Setup
Two datasets:
YouTube2Text
> open-domain
> 1,970 videos, ~80k video-sentence pairs, 12k unique words
> only one sentence per video (a special case)
TACoS-MultiLevel
> closed-domain: cooking
> 173 videos, 16,145 intervals, ~40k interval-sentence pairs, 2k unique words
> several dependent sentences per video
Three evaluation metrics: BLEU [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005], CIDEr [Vedantam et al., 2015]. The higher, the better.
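For concreteness, here is a small example of scoring a candidate sentence with BLEU@4 using NLTK (the example sentences are made up); METEOR and CIDEr are computed analogously with their own tools, e.g. nltk.translate.meteor_score and the coco-caption toolkit:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the person sharpened the knife".split()
candidate = "the person sharpened a knife".split()
bleu4 = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),  # uniform up to 4-grams
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU@4 = {bleu4:.3f}")
```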
Experiments - YouTube2Text
[Bar chart comparing methods on BLEU@4, METEOR, and CIDEr; y-axis from 0.2 to 0.7.]
Experiments - TACoS-MultiLevel
[Bar charts comparing methods on BLEU@4 and METEOR (left axis, 0.24 to 0.31) and on CIDEr (right axis, 1.2 to 1.65).]
Evaluation metric scores are not always reliable; we need further comparison.
RNN-cat vs. h-RNN
RNN-cat: a flat structure that concatenates the sentences directly and models them with a single RNN.
[Diagram: one RNN spans "The person took out some potatoes. The person peeled the potatoes." ...]
Amazon Mechanical Turk (AMT) side-by-side comparison: Which of the two sentences better describes the video? 1. the first; 2. the second; 3. equally good or bad.
RNN-sent vs. h-RNN examples
Conclusions & Discussions
Hierarchical RNN improves paragraph generation.
Issues:
1. Most errors occur when generating nouns; small objects are hard to recognize (on TACoS-MultiLevel).
2. One-way information flow.
3. The language model helps, but it sometimes overrides the computer-vision result in a wrong way.
Thanks! Poster #4