Video Paragraph Captioning using Hierarchical Recurrent Neural Networks
Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu
Problem
Given a video, generate a paragraph (multiple sentences):
The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife.
vs. a single sentence:
The person sharpened the knife in the kitchen.
Motivation
Inter-sentence dependency (semantic context):
The person took out some potatoes. The person peeled the potatoes. The person turned on the stove.
We want to model this dependency.
Hierarchy
A paragraph is inherently hierarchical.
[Diagram: a paragraph-level RNN sits on top of per-sentence RNNs, e.g. one RNN for "The person took out some potatoes." and another for "The person peeled the potatoes."]
Framework
(a) Sentence Generator: Input Words → Embedding (512) → Recurrent I (512) → Multimodal (1024) → Hidden (512) → Softmax → MaxID → Predicted Words. The Video Feature Pool feeds Attention I and Attention II, followed by a Sequential Softmax and a Weighted Average that also enters the Multimodal layer.
(b) Paragraph Generator: the Average and Last Instance of the sentence generator's activations form a Sentence Embedding (512) → Embedding (512) → Recurrent II (512) → Paragraph State.
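Putting the sentence-generator pipeline in code form may help. Below is a minimal sketch of one decoding step, assuming GRU cells and the layer sizes shown on the slide; the class and variable names are illustrative, not from the authors' code:

```python
import torch
import torch.nn as nn

class SentenceGenerator(nn.Module):
    """One GRU decoding step: embedding -> Recurrent I -> multimodal -> softmax."""
    def __init__(self, vocab_size, feat_dim, embed=512, hidden=512, multi=1024):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed)       # Embedding (512)
        self.recurrent1 = nn.GRUCell(embed, hidden)            # Recurrent I (512)
        self.multimodal = nn.Linear(hidden + feat_dim, multi)  # Multimodal (1024)
        self.hidden_out = nn.Linear(multi, hidden)             # Hidden (512)
        self.classifier = nn.Linear(hidden, vocab_size)        # Softmax layer

    def step(self, word_id, h_prev, video_feat):
        # word_id: (batch,) current input word ids
        # h_prev:  (batch, hidden) previous Recurrent I state
        # video_feat: (batch, feat_dim) attended video feature
        x = self.embedding(word_id)
        h = self.recurrent1(x, h_prev)
        m = torch.tanh(self.multimodal(torch.cat([h, video_feat], dim=-1)))
        logits = self.classifier(torch.tanh(self.hidden_out(m)))
        next_word = logits.argmax(dim=-1)   # MaxID -> predicted word
        return next_word, h
```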
Visual Features
The video feature pool combines appearance and action feature pools:
> Object appearance: VGG-16 (fc7) [Simonyan et al., 2015], pre-trained on the ImageNet dataset
> Action: C3D (fc6) [Tran et al., 2015], pre-trained on the Sports-1M dataset
> Action: Dense Trajectories + Fisher Vector [Wang et al., 2011]
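As a hedged illustration of extracting the per-frame appearance feature, the sketch below truncates torchvision's pre-trained VGG-16 after fc7; the C3D and Dense-Trajectory features come from their own toolkits and are not shown:

```python
import torch
import torchvision.models as models

# Build a VGG-16 truncated after fc7 (the second 4096-d fully connected layer).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc7 = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:5],  # keep fc6 -> ReLU -> Dropout -> fc7 -> ReLU
)

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)    # a dummy preprocessed video frame
    appearance = fc7(frame)                # (1, 4096) fc7 appearance feature
```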
Attention Model
Learning spatial & temporal attention simultaneously.
[Diagram: the previous recurrent state (time t-1) is scored (dot product) against each feature i-1, i, i+1, ... in the video feature pool; a softmax turns the scores into attention weights, and the resulting weighted-average feature is the input to the multimodal layer.]
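A minimal sketch of this attention step, assuming an additive scoring function over the feature pool (the module and parameter names are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Score each pooled feature against the previous recurrent state."""
    def __init__(self, feat_dim, hidden=512, att_dim=256):
        super().__init__()
        self.proj_h = nn.Linear(hidden, att_dim)    # project previous state t-1
        self.proj_v = nn.Linear(feat_dim, att_dim)  # project each feature i
        self.score = nn.Linear(att_dim, 1)          # scalar score per feature

    def forward(self, h_prev, feat_pool):
        # h_prev: (batch, hidden); feat_pool: (batch, num_feats, feat_dim)
        e = torch.tanh(self.proj_h(h_prev).unsqueeze(1) + self.proj_v(feat_pool))
        weights = torch.softmax(self.score(e).squeeze(-1), dim=1)  # attention weights
        avg = (weights.unsqueeze(-1) * feat_pool).sum(dim=1)       # weighted average
        return avg, weights  # avg is the input to the multimodal layer
```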
Paragraph Generator – unrolled
[Diagram: for sentence n-1, the sentence generator (visual features 7192-d, multimodal 1024-d, embedding/recurrent/hidden layers 512-d; current word → hidden → softmax → maxid → next word) emits a sentence embedding; the 512-d paragraph generator consumes it, and its state is the input to the sentence generator for sentence n.]
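A hedged sketch of the paragraph-level recurrence: the embedded sentence vector drives Recurrent II, whose paragraph state then (re)initializes the sentence generator for the next sentence. The GRU cell and names are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ParagraphGenerator(nn.Module):
    """Recurrent II: consume one sentence embedding, emit the paragraph state."""
    def __init__(self, sent_dim=512, hidden=512):
        super().__init__()
        self.embed = nn.Linear(sent_dim, hidden)      # sentence-embedding layer
        self.recurrent2 = nn.GRUCell(hidden, hidden)  # Recurrent II (512)

    def forward(self, sent_embedding, para_state):
        x = torch.tanh(self.embed(sent_embedding))
        para_state = self.recurrent2(x, para_state)
        return para_state  # initializes the sentence generator for sentence n
```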
Sentence Embedding
[Diagram: Input Words → Embedding (512) → Recurrent I (512); the Average and the Last Instance of these activations are combined into a 512-d Sentence Embedding.]
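One plausible reading of this slide, sketched below: the sentence embedding combines the average of the Recurrent I states over the sentence with the last state ("last instance"). The exact combination in the paper may differ; the projection layer here is illustrative:

```python
import torch
import torch.nn as nn

proj = nn.Linear(1024, 512)  # illustrative projection to the 512-d embedding

def sentence_embedding(rnn_states):
    # rnn_states: (seq_len, batch, 512) Recurrent I states over the sentence
    avg = rnn_states.mean(dim=0)   # "Average"
    last = rnn_states[-1]          # "Last Instance"
    return torch.tanh(proj(torch.cat([avg, last], dim=-1)))  # (batch, 512)
```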
Experiments - Setup
Two datasets:
YouTube2Text
> open-domain
> 1,970 videos, ~80k video-sentence pairs, 12k unique words
> only one sentence per video (a special case)
TACoS-MultiLevel
> closed-domain: cooking
> 173 videos, 16,145 intervals, ~40k interval-sentence pairs, 2k unique words
> several dependent sentences per video
Three evaluation metrics: BLEU [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005], CIDEr [Vedantam et al., 2015]. The higher, the better.
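For concreteness, here is a small example of scoring a candidate sentence with BLEU@4 using NLTK (the example sentences are made up); METEOR and CIDEr are computed analogously with their own tools, e.g. nltk.translate.meteor_score and the coco-caption toolkit:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the person sharpened the knife".split()
candidate = "the person sharpened a knife".split()
bleu4 = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),  # uniform up to 4-grams
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU@4 = {bleu4:.3f}")
```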
Experiments - YouTube2Text
[Bar chart comparing methods on BLEU@4, METEOR, and CIDEr; y-axis from 0.2 to 0.7.]
Experiments - TACoS-MultiLevel
[Bar charts comparing methods on BLEU@4 and METEOR (left axis, 0.24 to 0.31) and on CIDEr (right axis, 1.2 to 1.65).]
Evaluation metric scores are not always reliable; we need further comparison.
RNN-cat vs. h-RNN
RNN-cat: a flat structure that concatenates the sentences directly and models them with a single RNN.
[Diagram: one RNN spans "The person took out some potatoes. The person peeled the potatoes." ...]
Amazon Mechanical Turk (AMT) side-by-side comparison: Which of the two sentences better describes the video? 1. the first; 2. the second; 3. equally good or bad.
RNN-sent vs. h-RNN examples
Conclusions & Discussions
Hierarchical RNN improves paragraph generation.
Issues:
1. Most errors occur when generating nouns; small objects are hard to recognize (on TACoS-MultiLevel).
2. One-way information flow.
3. The language model helps, but it sometimes overrides the computer-vision result in a wrong way.
Thanks! Poster #4