Team RUC AI·M3 at Video Pentathlon Challenge 2020
Shizhe Chen, Yida Zhao, Qin Jin
Renmin University of China
Video Pentathlon Challenge
• Task
  • Text-to-video cross-modal retrieval
  • Using provided multimodal features
• Evaluation
  • A pentathlon of five video-text benchmarks
  • MSRVTT, MSVD, DiDeMo, ActivityNet (ANet), YouCook2 (YC2)
• Metric
  • Geometric mean of Recall@K (K = 1, 5, 10)
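The metric can be sketched as follows for a single benchmark. This is a minimal illustration, not the challenge's official scorer: the function names are hypothetical, and it assumes a square similarity matrix in which the correct video for query i is video i.

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of text queries whose correct video is ranked in the top k.

    sim[i, j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    correct = np.diag(sim)[:, None]
    # Zero-based rank = number of videos scoring strictly higher than the correct one.
    ranks = (sim > correct).sum(axis=1)
    return float((ranks < k).mean())

def geometric_mean_recall(sim, ks=(1, 5, 10)):
    """Geometric mean of Recall@K over the given cut-offs."""
    recalls = np.array([recall_at_k(sim, k) for k in ks])
    return float(np.prod(recalls) ** (1.0 / len(recalls)))
```

Because the geometric mean multiplies the three recalls, a system with a very low Recall@1 is penalized more heavily than it would be under an arithmetic mean.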
Our Contributions
• Hierarchical Video-Text Matching
  • Hierarchical graph reasoning model
• Enhanced Inference Methods
  • Query expansion
  • Hubness mitigation
• Knowledge Transfer from Additional Datasets
  • Multi-task training
Hierarchical Video-Text Matching
• Simple embeddings are insufficient to represent complicated video and text details
• Hierarchical Graph Reasoning (HGR) model
  • Multi-level cross-modal matching, from global to local: event, actions, entities
  • Hierarchical textual encoding
  • Hierarchical video encoding
Chen, Shizhe, et al. "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning." CVPR, 2020.
Hierarchical Video-Text Matching
• Experimental results
  • The HGR model achieves the best performance on all datasets
  • Especially on DiDeMo and ANet, whose descriptions are long

                          MSRVTT   MSVD    DiDeMo   ANet    YC2
Absolute gains            +1.25    +0.77   +4.18    +2.98   +1.84
Average sentence length   9        7       33       54      9
Enhanced Inference Methods
• Query Expansion
  • Reformulate a given query and ensemble the results from all expanded queries
  • Use the multiple query texts available for each video in the MSRVTT and MSVD datasets
• Experimental results
  • Retrieval performance improves with ground-truth expanded queries
• Future work: other techniques such as automatic paraphrasing
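One way to ensemble the results of expanded queries is to average their similarity scores. The sketch below assumes that queries describing the same video share a group id and that plain averaging is the ensembling rule; the slides do not specify either detail, and `expand_queries` is a hypothetical name.

```python
import numpy as np

def expand_queries(sim, query_groups):
    """Replace each query's scores with the mean over its expansion group.

    sim[i, j] is the similarity between text query i and video j.
    query_groups[i] is an id shared by all queries treated as expansions
    of one another (assumption: captions written for the same video).
    """
    sim = np.asarray(sim, dtype=float)
    groups = np.asarray(query_groups)
    out = np.empty_like(sim)
    for g in np.unique(groups):
        mask = groups == g
        # Every query in the group receives the group's averaged score row.
        out[mask] = sim[mask].mean(axis=0)
    return out
```

Averaging lets a well-phrased caption compensate for an ambiguous one in the same group, which is why the gain depends on having several ground-truth captions per video.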
Enhanced Inference Methods
• Hubness Mitigation
  • Hubness: some points have a high probability of being nearest neighbors of many other points
  • Inverted Softmax
• Experimental results
  • improves retrieval performance with ground-truth expanded queries
• Future work: mitigate the hubness problem during training
Smith, Samuel L., et al. "Offline bilingual word vectors, orthogonal transformations and the inverted softmax." ICLR, 2017.
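The inverted softmax of Smith et al. normalizes each retrieval candidate's scores over the set of queries, rather than each query's scores over the candidates, so a candidate that is close to every query loses its advantage. The sketch below applies that idea to a text-to-video similarity matrix; the temperature `beta` and its default value are assumptions, not settings reported in the slides.

```python
import numpy as np

def inverted_softmax(sim, beta=10.0):
    """Rescore a (num_queries, num_videos) similarity matrix.

    Each column (video) is softmax-normalized over all queries, so a
    "hub" video that is similar to every query has its scores spread
    across the queries instead of dominating all of them.
    beta is an assumed temperature; it would need tuning in practice.
    """
    scaled = beta * np.asarray(sim, dtype=float)
    # Subtract the per-column max for numerical stability (result unchanged).
    scaled = scaled - scaled.max(axis=0, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=0, keepdims=True)
```

Each query is then ranked against the videos using its row of the rescored matrix. The normalization requires a pool of queries to be available at inference time.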
Knowledge Transfer
• Training with all datasets does not perform well
  • Different dataset scales and cross-domain discrepancies

                   MSRVTT    MSVD     DiDeMo   ANet    YC2
# training pairs   117,220   43,892   7,552    8,007   7,745

• Cross-dataset performance
Knowledge Transfer
• Multi-task balanced training
  • Combine the target dataset and MSRVTT in training
  • Balance the training examples from the different datasets
• Experimental results
  • Employing additional datasets is beneficial
• Future work: more effective transfer learning approaches
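Balancing can be illustrated with a batch sampler that draws the same number of examples from the target dataset and from MSRVTT, regardless of their sizes. The equal split and sampling with replacement are assumptions made for this sketch; the slides do not state the exact balancing scheme, and `balanced_batches` is a hypothetical helper.

```python
import random

def balanced_batches(target_pairs, msrvtt_pairs, batch_size, num_batches, seed=0):
    """Yield mixed batches with an equal share from each dataset.

    Assumption: half of every batch comes from the target dataset and the
    rest from MSRVTT, sampled with replacement so the smaller dataset is
    not exhausted before the larger one.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        batch = rng.choices(target_pairs, k=half)
        batch += rng.choices(msrvtt_pairs, k=batch_size - half)
        rng.shuffle(batch)  # avoid a fixed dataset order inside the batch
        yield batch
```

Without such balancing, a target dataset with a few thousand pairs would contribute only a small fraction of each batch next to the 117,220 MSRVTT pairs.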
Testing Submissions
• Pipeline: HGR model with multi-task balanced training → average ensembling (3-5 models) → query expansion (optional) → hubness mitigation inference
• Experimental results
  • Second place in the challenge
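The average-ensembling step can be sketched as a mean over the similarity matrices produced by the individual models. Averaging raw similarity scores (rather than ranks or normalized scores) is an assumption of this sketch, and `ensemble_similarities` is a hypothetical name.

```python
import numpy as np

def ensemble_similarities(sims):
    """Average a list of (num_queries, num_videos) similarity matrices.

    Assumption: all models score the same queries and videos in the same
    order, and their similarity scores are on comparable scales.
    """
    stacked = np.stack([np.asarray(s, dtype=float) for s in sims])
    return stacked.mean(axis=0)
```

The averaged matrix would then be passed on to the later inference steps of the pipeline.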
Take Home Message
• The multi-level matching model (HGR) is more effective than global/local matching models for text-video retrieval
• The hubness problem needs to be addressed in both training and inference
• Knowledge transfer is promising
Contact email: cszhe1@ruc.edu.cn