Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Shizhe Chen 1, Yida Zhao 1, Qin Jin 1, Qi Wu 2
1 Renmin University of China, 2 University of Adelaide
Video-Text Cross-modal Retrieval
• Task: using sentences to retrieve videos
• Sentences contain richer and more structured details than keywords
Motivation
• Understanding fine-grained semantics in the query sentence
• Hierarchical sentence structure
  o Event
  o Actions, with action-action relationships
  o Entities, with action-entity relationships
• Fine-grained local components & how they compose to the event
• Limitations of previous works
  o Global matching (one vector): hard to capture fine-grained details
  o Local matching (word level): cannot express complex relationships among words
The Proposed Method
• Hierarchical Graph Reasoning Model (HGR)
  o Hierarchical Textual Encoding
  o Hierarchical Video Encoding
  o Multi-level Video-Text Matching
Hierarchical Textual Encoding
• Semantic role graph with event, action and entity nodes
• Node initialization: contextual word embeddings (with max pooling)
• Attention-based graph reasoning
  o Capture interactive context via an attentive relational GCN
  o Factorize the relational matrix to reduce parameters (see the sketch below)
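To make the factorized relational GCN concrete, here is a minimal PyTorch sketch of one reasoning layer. The factorization W_r ≈ W · diag(v_r) (one shared matrix plus a per-relation scaling vector), and all names and shapes, are illustrative assumptions, not the released HGR code.

```python
import torch
import torch.nn as nn

class FactorizedRelationalGCN(nn.Module):
    """One attention-based relational GCN layer (illustrative sketch).

    A full relational GCN keeps a separate D x D matrix per relation type.
    Here each relation owns only a D-dim scaling vector v_r, and a single
    D x D matrix W is shared, i.e. W_r ~ W * diag(v_r), cutting parameters
    from O(R * D^2) to O(R * D + D^2). This factorization is an assumption,
    not necessarily the authors' exact formulation.
    """

    def __init__(self, dim, num_relations):
        super().__init__()
        self.rel_scale = nn.Embedding(num_relations, dim)  # v_r per relation
        self.shared = nn.Linear(dim, dim, bias=False)      # shared W

    def forward(self, nodes, adj, rel):
        # nodes: (N, D) node embeddings (event / action / entity nodes)
        # adj:   (N, N) 0/1 mask, adj[i, j] = 1 if node j sends to node i
        # rel:   (N, N) long tensor of semantic-role relation ids per edge
        msg = self.shared(nodes.unsqueeze(0) * self.rel_scale(rel))   # (N, N, D)
        # each node attends over its incoming, role-transformed messages
        scores = (msg * nodes.unsqueeze(1)).sum(-1) / nodes.size(-1) ** 0.5
        att = torch.softmax(scores.masked_fill(adj == 0, float('-inf')), dim=1)
        att = torch.nan_to_num(att)       # nodes with no in-edges get zeros
        return nodes + (att.unsqueeze(-1) * msg).sum(dim=1)          # residual
```

Stacking a few such layers lets entity nodes pick up verb context and vice versa, which is the "interactive context" referred to above.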
Hierarchical Video Encoding
• Videos contain multiple aspects: objects, actions, events
• Challenging to parse videos directly as we parse texts: that would require object detection, tracking, action segmentation, etc.
• Instead, learn different frame weights for each level
  o Use each level of the text hierarchy as guidance to learn diverse video representations (see the sketch below)
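The per-level weighting can be sketched as separate attention-pooling heads over the same frame features; the text-side matching loss is what pushes each head toward its level. Module and parameter names below are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

class LevelwiseVideoEncoder(nn.Module):
    """Sketch: one attention-pooled video embedding per semantic level.

    Each level (event / action / entity) owns its own attention scorer and
    projection, so the same frame features yield three different summaries.
    """

    def __init__(self, frame_dim, embed_dim, num_levels=3):
        super().__init__()
        self.attn = nn.ModuleList([nn.Linear(frame_dim, 1) for _ in range(num_levels)])
        self.proj = nn.ModuleList([nn.Linear(frame_dim, embed_dim) for _ in range(num_levels)])

    def forward(self, frames, mask):
        # frames: (B, T, frame_dim) frame features; mask: (B, T) 1 for real frames
        outs = []
        for attn, proj in zip(self.attn, self.proj):
            scores = attn(frames).squeeze(-1).masked_fill(mask == 0, float('-inf'))
            w = torch.softmax(scores, dim=1)                 # (B, T) frame weights
            outs.append(proj((w.unsqueeze(-1) * frames).sum(dim=1)))
        return outs  # [event, action, entity] embeddings, each (B, embed_dim)
```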
Multi-level Cross-modal Matching
• Multi-level fusion
  o Event level: global matching (cosine similarity)
  o Action & entity levels: local matching (weakly supervised attentive alignment)
• Training objective: contrastive ranking loss (see the sketch below)
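The contrastive ranking loss is a standard bidirectional max-margin objective over the fused similarity matrix; a sketch with hardest in-batch negatives is below. The margin value and the hardest-negative choice are assumptions, not necessarily the paper's exact setting.

```python
import torch

def contrastive_ranking_loss(sim, margin=0.2):
    """Bidirectional max-margin ranking loss over a similarity matrix.

    sim[i, j] is the fused similarity between video i and sentence j;
    the diagonal holds the positive pairs.
    """
    pos = sim.diag().view(-1, 1)                     # (B, 1) positive scores
    cost_s = (margin + sim - pos).clamp(min=0)       # video -> wrong sentence
    cost_v = (margin + sim - pos.t()).clamp(min=0)   # sentence -> wrong video
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_s = cost_s.masked_fill(eye, 0)              # ignore the positives
    cost_v = cost_v.masked_fill(eye, 0)
    # hardest in-batch negative per row / column
    return cost_s.max(dim=1)[0].mean() + cost_v.max(dim=0)[0].mean()
```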
Experimental Settings
• Datasets

  Dataset        Train   Validation    Test   # sent/video
  MSR-VTT         6573          497    2990   20
  TGIF           79451        10651   11310   1
  VATEX          25991         1500    1500   10
  Youtube2Text       -            -     670   41.5

• Evaluation metrics
  o R@K, K = {1, 5, 10}
  o MedR (median rank) & MnR (mean rank)
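For reference, these metrics can be computed from a query-by-gallery similarity matrix as sketched below, assuming one relevant item per query sitting on the diagonal (which holds for the text-to-video direction here).

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@{1,5,10}, median rank (MedR) and mean rank (MnR).

    sim[i, j]: similarity of query i to gallery item j, ground truth
    on the diagonal. Ranks are 1-indexed (rank 1 = perfect retrieval).
    """
    order = np.argsort(-sim, axis=1)  # gallery indices, best match first
    ranks = np.array([np.where(order[i] == i)[0][0]
                      for i in range(len(sim))]) + 1
    return {
        **{f"R@{k}": float((ranks <= k).mean() * 100) for k in (1, 5, 10)},
        "MedR": float(np.median(ranks)),
        "MnR": float(ranks.mean()),
    }
```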
Experimental Results
• In-domain cross-modal retrieval
  o The HGR model achieves consistent improvements on all three datasets
[Results table: MSR-VTT dataset]
Experimental Results
• In-domain cross-modal retrieval: ablation study
  o Textual encoding: graph attention & semantic role awareness
  o Video encoding: different video weights at each level
[Ablation table: MSR-VTT dataset]
Experimental Results
• Cross-dataset video-text retrieval
  o Train on MSR-VTT, test on Youtube2Text
  o The HGR model also generalizes better
[Tables: in-domain vs. cross-dataset results]
Experimental Results
• Fine-grained binary selection
  o Evaluates a model's fine-grained textual discrimination ability
  o Better performance, especially on incomplete events
• Example pairs
  o positive: "a man is cutting pizza." / negative: "pizza is cutting a man."
  o positive: "a dog hits a man's hands with its paws while standing." / negative: "a dog hits a man's hands."
Conclusion
• Contributions
  o Decompose videos and texts at the event, action and entity levels for multi-level cross-modal matching
  o Utilize attention-based graph reasoning on the textual semantic role graph to generate hierarchical embeddings
  o Results on in-domain, cross-dataset and fine-grained binary selection experiments demonstrate the advantages of our model
• Future work
  o Improve video encoding with multiple modalities and fine-grained spatial-temporal information
Code is released at: https://github.com/cshizhe/hgr_v2t