  1. Video Paragraph Captioning using Hierarchical Recurrent Neural Networks Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu

  2. Problem Given a video, generate a paragraph (multiple sentences).

  3. Problem Given a video, generate a paragraph (multiple sentences). The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife.

  4. Problem Given a video, generate a paragraph (multiple sentences). The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife. vs. a single sentence: The person sharpened the knife in the kitchen.

  5. Motivation Inter-sentence dependency (semantic context)

  6. Motivation Inter-sentence dependency (semantic context) The person took out some potatoes.

  7. Motivation Inter-sentence dependency (semantic context) The person took out some potatoes. The person peeled the potatoes. The person turned on the stove.

  8. Motivation Inter-sentence dependency (semantic context) The person took out some potatoes. The person peeled the potatoes. The person turned on the stove. We want to model this dependency.

  9. Hierarchy A paragraph is inherently hierarchical.

  10. Hierarchy A paragraph is inherently hierarchical. The person took out some potatoes.

  11. Hierarchy A paragraph is inherently hierarchical. The person took out some potatoes. The person peeled the potatoes.

  12. Hierarchy A paragraph is inherently hierarchical. [Diagram: one word-level RNN per sentence: "The person took out some potatoes." / "The person peeled the potatoes."]

  13. Hierarchy A paragraph is inherently hierarchical. [Diagram: a sentence-level RNN on top of the word-level RNNs, connecting the sentences]

  14. Framework (a) Sentence Generator: RNN. (b) Paragraph Generator: RNN.

  15. Framework – language model (a) Sentence Generator: Input Words → Embedding (512) → Recurrent I (512) → Multimodal (1024) → Hidden (512) → Softmax → MaxID → Predicted Words. (b) Paragraph Generator.
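The sentence-generator pipeline on this slide can be sketched step by step. A toy numpy sketch with tiny random weights and illustrative names (`decode_step`, `Wm`, etc. are assumptions, not the paper's identifiers); the real model learns these parameters, uses 512/1024-unit layers and a full vocabulary, and a plain tanh recurrence here stands in for the actual recurrent layer:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID = 20, 8, 8   # toy sizes; the slides use 512/1024 and a real vocabulary

E  = rng.standard_normal((VOCAB, EMB)) * 0.1    # word embedding table
Wr = rng.standard_normal((HID, EMB)) * 0.1      # Recurrent I input weights
Ur = rng.standard_normal((HID, HID)) * 0.1      # Recurrent I recurrent weights
Wm = rng.standard_normal((HID, 2 * HID)) * 0.1  # multimodal fusion (state + video feature)
Wo = rng.standard_normal((VOCAB, HID)) * 0.1    # hidden-to-softmax weights

def decode_step(word_id, h, video_feat):
    """One step of the sentence generator: embed the input word, update the
    recurrent state, fuse with the video feature, and pick the argmax word."""
    x = E[word_id]                                      # Embedding
    h = np.tanh(Wr @ x + Ur @ h)                        # Recurrent I
    m = np.tanh(Wm @ np.concatenate([h, video_feat]))   # Multimodal + Hidden
    logits = Wo @ m                                     # Softmax layer (pre-normalization)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(p.argmax()), h                           # MaxID -> predicted word

h0 = np.zeros(HID)
video = rng.standard_normal(HID)
next_word, h1 = decode_step(0, h0, video)
```

At generation time the predicted word is fed back in as the next input word, which is what the MaxID → Input Words loop in the diagram depicts.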

  16. Framework – attention model for video features (a) Sentence Generator, now with video input: Video Feature Pool → Attention I → Attention II → Sequential Softmax → Weighted Average → Multimodal (1024); the language-model path (Input Words → Embedding 512 → Recurrent I 512 → Multimodal → Hidden 512 → Softmax → MaxID → Predicted Words) is unchanged. (b) Paragraph Generator.

  17. Framework – paragraph model (a) Sentence Generator as above. (b) Paragraph Generator: the Sentence Embedding (512, built from the average of the word embeddings and the last instance of Recurrent I) feeds Recurrent II (512), which maintains the Paragraph State (512) that conditions the next sentence.

  18. Visual Features (Appearance Feature Pool + Action Feature Pool → Video Feature Pool) Object appearance: VGG-16 (fc7) [Simonyan et al., 2015], pre-trained on the ImageNet dataset. Action: C3D (fc6) [Tran et al., 2015], pre-trained on the Sports-1M dataset; Dense Trajectories + Fisher Vector [Wang et al., 2011].

  19. Attention Model Learning spatial & temporal attention simultaneously. Video Feature Pool → Attention I → Attention II → Sequential Softmax → Weighted Average, driven by Recurrent I (512).

  22. Attention Model The video feature pool holds one feature vector per frame: … v(i-1), v(i), v(i+1) …

  23. Attention Model Each feature in the pool is compared against the previous recurrent state (step t-1).

  24. Attention Model The comparison scores are normalized by the sequential softmax into attention weights.

  25. Attention Model A dot product between each feature and the previous recurrent state yields the attention weights; their softmax-weighted average feature is the input to the multimodal layer.
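The final build of this slide spells out the whole computation: dot products with the previous recurrent state are softmax-normalized into attention weights, and the weighted-average feature feeds the multimodal layer. A minimal numpy sketch of that temporal attention step (raw dot-product scoring as the slide states; the trained model additionally uses learned attention parameters, which are omitted here):

```python
import numpy as np

def attend(V, h_prev):
    """Temporal attention over a pool of frame features.

    V:      (n, d) array, one feature vector per frame.
    h_prev: (d,) previous recurrent state of the sentence generator.
    Returns the softmax attention weights and the weighted-average feature.
    """
    scores = V @ h_prev                               # dot product with previous state
    scores -= scores.max()                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # sequential softmax
    avg_feature = weights @ V                         # weighted average (input to multimodal layer)
    return weights, avg_feature

rng = np.random.default_rng(0)
V = rng.standard_normal((5, 8))   # toy pool: 5 frames, feature dim 8
h = rng.standard_normal(8)
w, f = attend(V, h)
```

Because the weights depend on the recurrent state at every time step, the model can shift its attention over the frames as the sentence unfolds.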

  26. Paragraph Generator Unrolled over sentences: each sentence generator runs current word → embedding (512) → hidden recurrence (512) → multimodal (1024, also fed by the visual features) → softmax (7192) → maxid (7192) → next word; its sentence embedding (512) enters the paragraph generator (512), whose output is the input to the sentence generator for the next sentence.
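The unrolled picture above condenses to a two-level recurrence: a sentence-level RNN carries paragraph context, and its state conditions the word-level RNN for each new sentence. A toy sketch with random weights and illustrative names (`generate_paragraph`, `W_sent`, etc. are assumptions); the real layers are learned and 512-dimensional:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy state size; the slides use 512

# Toy parameters (random; a real model learns these).
W_word = rng.standard_normal((D, D)) * 0.1   # word-level recurrence (Recurrent I)
W_sent = rng.standard_normal((D, D)) * 0.1   # sentence-level recurrence (Recurrent II)

def generate_paragraph(sentences):
    """Run the two-level recurrence: a sentence RNN keeps paragraph
    context; its state conditions the word RNN for each new sentence."""
    paragraph_state = np.zeros(D)
    sentence_embeddings = []
    for words in sentences:                     # each sentence: a list of word vectors
        h = np.tanh(W_sent @ paragraph_state)   # word RNN conditioned on paragraph context
        for x in words:
            h = np.tanh(W_word @ h + x)         # word-level step
        sentence_embeddings.append(h)           # stand-in for the sentence embedding
        paragraph_state = np.tanh(W_sent @ paragraph_state + h)  # sentence-level step
    return paragraph_state, sentence_embeddings

sents = [[rng.standard_normal(D) for _ in range(4)] for _ in range(3)]
state, embs = generate_paragraph(sents)
```

The key property is that `paragraph_state` persists across sentences, which is exactly the inter-sentence dependency the flat per-sentence baseline cannot capture.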

  27. Sentence Embedding Input Words → Embedding (512) → Recurrent I (512); the average of the word embeddings and the last instance of the Recurrent I state form the Sentence Embedding (512).
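One way to read this slide: the sentence embedding combines the average of the word embeddings with the last instance of the Recurrent I state. A hedged numpy sketch of that combination (the projection `W` and concatenate-then-project form are illustrative assumptions, not the paper's exact operation):

```python
import numpy as np

def sentence_embedding(word_embs, last_state, W):
    """Average the word embeddings, concatenate with the last recurrent
    state ('last instance'), and project to the sentence embedding."""
    avg = np.mean(word_embs, axis=0)   # average over the sentence's words
    return np.tanh(W @ np.concatenate([avg, last_state]))

rng = np.random.default_rng(0)
words = rng.standard_normal((4, 8))       # toy: 4 words, dim 8 (slides use 512)
last = rng.standard_normal(8)             # last Recurrent I state
W = rng.standard_normal((8, 16)) * 0.1    # learned projection in the real model
emb = sentence_embedding(words, last, W)
```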

  28. Experiments – Setup Two datasets: YouTube2Text > open-domain > 1,970 videos, ~80k video-sentence pairs, 12k unique words > only one sentence per video (a special case). TACoS-MultiLevel > closed-domain: cooking > 173 videos, 16,145 intervals, ~40k interval-sentence pairs, 2k unique words > several dependent sentences per video. Three evaluation metrics: BLEU [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005], CIDEr [Vedantam et al., 2015]. The higher, the better.
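Of the three metrics, BLEU is the easiest to illustrate. A toy BLEU@1 (clipped unigram precision with a brevity penalty); real evaluation uses n-grams up to 4, multiple references, and smoothing:

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """Toy BLEU@1: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("the person peeled the potatoes",
              "the person peeled some potatoes")   # 4 of 5 unigrams match -> 0.8
```

The clipping is why repeating a correct word does not inflate the score, and the brevity penalty is why a trivially short caption cannot reach full precision for free.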

  29. Experiments – YouTube2Text [Bar chart comparing methods by BLEU@4, METEOR, and CIDEr]

  30. Experiments – TACoS-MultiLevel [Bar charts comparing methods by BLEU@4 and METEOR (left axis) and CIDEr (right axis)]

  32. Experiments – TACoS-MultiLevel [Same charts] Evaluation metric scores are not always reliable; we need further comparison.
  33. RNN-cat vs. h-RNN

  34. RNN-cat vs. h-RNN RNN-cat: a flat structure that concatenates the sentences directly and models them with one RNN. [Diagram: a single RNN spanning "The person took out some potatoes. The person peeled the potatoes."]

  35. RNN-cat vs. h-RNN Amazon Mechanical Turk (AMT): side-by-side comparison. Which of the two sentences better describes the video? 1. the first; 2. the second; 3. equally good or bad.

  37. RNN-sent vs. h-RNN examples

  38. Conclusions & Discussions The hierarchical RNN improves paragraph generation.

  39. Conclusions & Discussions The hierarchical RNN improves paragraph generation. Issues: 1. Most errors occur when generating nouns; small objects are hard to recognize (on TACoS-MultiLevel). 2. One-way information flow. 3. The language model helps, but sometimes wrongly overrides the computer-vision result.

  40. Thanks! Poster #4
