Supervised Hierarchical Cross-Modal Hashing
Changchang Sun†, Xuemeng Song†, Fuli Feng‡, Wayne Xin Zhao$, Hao Zhang*, Liqiang Nie†
† School of Computer Science and Technology, Shandong University
‡ School of Computing, National University of Singapore
$ School of Information, Renmin University of China
* Mercari, Inc., Japan
Background
Ø Unprecedented growth of multimedia data on the Internet.
Ø Application: cross-modal retrieval.
Ø Solution: supervised cross-modal hashing.
[Figure: labeled image-text pairs (e.g., mini-skirt, long skirt, wide-leg jeans) mapped into the Hamming space.]
Related Work
Ø Define a cross-modal similarity matrix.
Qingyuan Jiang and Wujun Li. Deep Cross-Modal Hashing. In CVPR, 2017.
Related Work
Ø Learn semantic information from multiple labels.
Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, and Dacheng Tao. Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval. In CVPR, 2018.
Motivation
Ø Explore the rich semantic information conveyed by the label hierarchy.
Ø At the finest-grained layer, images $I_1$ and $I_3$ are dissimilar.
Ø At the coarser-grained layer, images $I_1$ and $I_3$ are similar.
Figure 1: Illustration of the label hierarchy.
Challenges
Ø How to employ the label hierarchy to guide cross-modal hashing and preserve the underlying correlations from the original space to the Hamming space.
[Figure: points A, B, and C mapped from the original space to the Hamming space.]
Challenges
Ø How to enhance the hierarchical discriminative power of hash codes.
[Figure: hierarchical labels (Skirt → Mini-Skirt, Jeans → Wide-leg Jeans) and their hash codes.]
Challenges
Ø The lack of a benchmark dataset whose data points involve multiple modalities and are hierarchically labeled; existing hierarchically labeled datasets such as CIFAR-100 contain only unimodal data points.
Table 1: Hierarchical labels of the benchmark dataset CIFAR-100.
Super-class | Class
Flowers | Rose, Sunflower, Lily...
Fish | Goldfish, Shark, Dolphin...
Insect | Bee, Butterfly, Caterpillar...
Fruit | Apple, Peach, Pear...
... | ...
Framework
[Figure 2: Illustration of the proposed scheme, HiCHNet. Panel labels: VGG-F, Concatenation.]
Framework
Ø Regularized Cross-modal Hashing
p Layer-wise Hash Representation: K fully connected networks, one for each of the K layers of the label hierarchy.
$h_{v_i}^k = s(W_v^k \tilde{v}_i + g_v^k)$, $k = 1, \ldots, K$
$h_{t_j}^k = s(W_t^k \tilde{t}_j + g_t^k)$, $k = 1, \ldots, K$
$b_{v_i}^k = \mathrm{sign}(h_{v_i}^k)$, $k = 1, \ldots, K$
$b_{t_j}^k = \mathrm{sign}(h_{t_j}^k)$, $k = 1, \ldots, K$
$h_{v_i}^k$ ($h_{t_j}^k$): layer-wise hash representation; $b_{v_i}^k$ ($b_{t_j}^k$): layer-wise binary hash codes; $\tilde{v}_i$ ($\tilde{t}_j$): deep image (text) feature.
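To make the layer-wise branch concrete, here is a minimal PyTorch sketch of the K per-layer hashing heads. All names (`LayerwiseHashHeads`, `feat_dim`, `code_lens`) are illustrative, and tanh stands in for the activation $s(\cdot)$ above, a common choice in deep hashing since it keeps outputs in (-1, 1) ahead of the sign binarization.

```python
import torch
import torch.nn as nn

class LayerwiseHashHeads(nn.Module):
    """Sketch: one fully connected hashing head per hierarchy layer k,
    mapping a deep feature (e.g., a 4096-D VGG-F descriptor) to a
    c_k-dimensional continuous hash representation h^k."""

    def __init__(self, feat_dim, code_lens):
        super().__init__()
        # h^k = s(W^k x + g^k); nn.Linear holds W^k and the bias g^k.
        self.heads = nn.ModuleList([nn.Linear(feat_dim, c) for c in code_lens])

    def forward(self, feat):
        hs = [torch.tanh(head(feat)) for head in self.heads]  # continuous h^k
        bs = [torch.sign(h) for h in hs]                      # binary b^k
        return hs, bs

# Hypothetical usage: a two-layer hierarchy with 16- and 32-bit codes.
heads = LayerwiseHashHeads(4096, [16, 32])
hs, bs = heads(torch.randn(8, 4096))
```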
Framework
Ø Regularized Cross-modal Hashing
p Layer-wise Semantic Similarity Preserving
Ground truth: $S_{ij}^k = 1$ if the i-th image and the j-th text share the same label at the k-th layer; $S_{ij}^k = 0$ otherwise.
• Objective function (negative log-likelihood):
$\Gamma_1 = -\sum_{k=1}^{K} \sum_{i,j=1}^{N} \lambda_k \left( S_{ij}^k \Theta_{ij}^k - \log(1 + e^{\Theta_{ij}^k}) \right)$, with $\Theta_{ij}^k = \frac{1}{2} (h_{v_i}^k)^T h_{t_j}^k$.
$\lambda_k$: layer confidence; $\Theta_{ij}^k$: semantic similarity.
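A hedged sketch of one layer-k term of $\Gamma_1$, assuming representations are batched row-wise; `softplus` computes log(1 + e^x) stably. Function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_loss(hv, ht, S, lam):
    """Layer-k negative log-likelihood term of Γ1.
    hv: (N, c_k) image representations; ht: (N, c_k) text representations;
    S:  (N, N) binary layer-k similarity; lam: layer confidence λ_k."""
    theta = 0.5 * hv @ ht.t()               # Θ_ij = ½ (h_v_i)^T h_t_j
    nll = -(S * theta - F.softplus(theta))  # softplus(x) = log(1 + e^x)
    return lam * nll.sum()
```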
Framework
Ø Regularized Cross-modal Hashing
p Binarization Difference Penalizing
To derive the optimal continuous surrogates of the hash codes, with $B_v^k = \mathrm{sgn}(H_v^k)$, $B_t^k = \mathrm{sgn}(H_t^k)$, and $\mathbf{a} = [1, 1, \ldots, 1]^T$:
$\Gamma_2 = \sum_{k=1}^{K} \left( \|B_v^k - H_v^k\|_F^2 + \|B_t^k - H_t^k\|_F^2 \right) + \left( \|H_v^k \mathbf{a}\|_2^2 + \|H_t^k \mathbf{a}\|_2^2 \right)$
The first term penalizes the binarization difference; the second serves the information maximization regularization (balanced bits).
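Both regularizers reduce to a few tensor operations. Below is a minimal sketch for one layer, assuming (samples × bits) matrices, so multiplying by the all-ones vector $\mathbf{a}$ becomes a per-bit column sum; minimizing the balance term pushes each bit to be +1 on roughly half of the samples.

```python
import torch

def binarization_and_balance(hv, ht):
    """Layer-k regularizers of Γ2: quantization gap plus bit balance."""
    bv, bt = torch.sign(hv), torch.sign(ht)                  # B^k = sgn(H^k)
    quant = (bv - hv).pow(2).sum() + (bt - ht).pow(2).sum()  # ||B - H||_F^2
    # ||H a||^2 with a the all-ones vector: per-bit sums near zero
    # mean each bit splits the samples roughly in half.
    balance = hv.sum(dim=0).pow(2).sum() + ht.sum(dim=0).pow(2).sum()
    return quant + balance
```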
Framework
Ø Hierarchical Discriminative Learning
Layer-wise label predictions:
$p_{v_i}^k = \mathrm{softmax}(U_v^k h_{v_i}^k + q_v^k)$, $k = 1, \ldots, K$
$p_{t_j}^k = \mathrm{softmax}(U_t^k h_{t_j}^k + q_t^k)$, $k = 1, \ldots, K$
• Objective function (negative log-likelihood):
$\Gamma_3 = -\sum_{k=1}^{K} \sum_{i=1}^{N} \lambda_k \left( (y_i^k)^T \log(p_{v_i}^k) + (y_i^k)^T \log(p_{t_i}^k) \right)$
$\lambda_k$: layer confidence; $y_i^k$: ground-truth label at the k-th layer.
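In code, this is a per-layer linear classifier followed by cross-entropy, which fuses the softmax with the negative log-likelihood. `clf_v` and `clf_t` (e.g., `nn.Linear(c_k, n_classes_k)`) stand in for the $(U^k, q^k)$ parameters; all names are illustrative.

```python
import torch.nn.functional as F

def discriminative_loss(hv, ht, y, clf_v, clf_t, lam):
    """Layer-k term of Γ3: classify both modalities' hash representations
    against the layer-k ground-truth labels y (class indices)."""
    # cross_entropy(logits, y) = -(one-hot y)^T log softmax(logits)
    loss = F.cross_entropy(clf_v(hv), y, reduction="sum") \
         + F.cross_entropy(clf_t(ht), y, reduction="sum")
    return lam * loss
```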
Framework
Ø Final Objective Function
$\min_{B^k, \theta_v, \theta_t} \; (1 - \alpha) \Gamma_r + \alpha \Gamma_h$
$\Gamma_r = \Gamma_1 + \Gamma_2$: regularized cross-modal hashing; $\Gamma_h = \Gamma_3$: hierarchical discriminative learning; $\alpha$: non-negative tradeoff parameter.
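Putting the pieces together, a hedged sketch of the overall training loss, reusing the per-layer functions from the sketches above; `alpha` stands in for the tradeoff parameter, and every list argument is indexed by hierarchy layer k.

```python
def hichnet_objective(hv_list, ht_list, S_list, y_list, clf_v, clf_t,
                      lams, alpha):
    """(1 - α) Γr + α Γh, summed over the K hierarchy layers."""
    gamma_r = sum(
        similarity_loss(hv, ht, S, lam) + binarization_and_balance(hv, ht)
        for hv, ht, S, lam in zip(hv_list, ht_list, S_list, lams))
    gamma_h = sum(
        discriminative_loss(hv, ht, y, cv, ct, lam)
        for hv, ht, y, cv, ct, lam
        in zip(hv_list, ht_list, y_list, clf_v, clf_t, lams))
    return (1 - alpha) * gamma_r + alpha * gamma_h
```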
Experiment
Ø Dataset
• Two datasets: FashionVC (public) and Ssense (created by ourselves).
• Ssense: collected from the online fashion platform Ssense (Dec. 14-16, 2018).
• Raw data: 25,974 image-text instances with hierarchical labels.
• Preprocessing: removed the noisy instances that involve multiple items; filtered out the categories with fewer than 70 instances.
[Figure: examples of noisy instances.]
Experiment
Ø Dataset
Table 2: Statistics of our datasets.
Experiment
Ø Dataset
• FashionVC label hierarchy: 35 categories with two layers.
Experiment
Ø Dataset
• Ssense label hierarchy: 32 categories with two layers.
Experiment
Ø Experiment Setting
• Tasks: Image→Text and Text→Image.
• Protocol: Mean Average Precision (MAP).
• Baselines: shallow learning (CCA, SCM-Or, SCM-Se, DCH); deep learning (CDQ, SSAH, DCMH).
• Features: 500-D SIFT features and 4096-D deep features.
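For reference, a minimal NumPy sketch of MAP under Hamming ranking, assuming ±1 codes and single-label relevance (a database item counts as relevant when it shares the query's label); real evaluations often truncate the ranking at a top-R cutoff. All names are illustrative.

```python
import numpy as np

def mean_average_precision(q_codes, db_codes, q_labels, db_labels):
    """MAP over Hamming-ranked retrieval lists for ±1 hash codes."""
    n_bits = q_codes.shape[1]
    aps = []
    for code, label in zip(q_codes, q_labels):
        dist = 0.5 * (n_bits - db_codes @ code)          # Hamming distance
        rel = (db_labels[np.argsort(dist)] == label).astype(float)
        if rel.sum() == 0:
            continue                                     # no relevant items
        prec_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
        aps.append((prec_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))
```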
Experiment
Ø On Model Comparison
Table 3: The MAP scores of different methods on two datasets. The shallow learning baselines use the SIFT features.
Table 4: The MAP scores of different methods on two datasets. The shallow learning baselines use the VGG-F features.
Experiment
Ø On Label Hierarchy
Figure 3: HiCHNet-flat, a derivative of our HiCHNet model.
Experiment
Ø On Label Hierarchy
Figure 4: Performance of HiCHNet and HiCHNet-flat on FashionVC.
Experiment
Ø On Case Study 1
• Retrieve from the whole retrieval set.
Figure 5: Illustration of ranking results from the whole retrieval set. The irrelevant images are highlighted in red boxes.
Experiment
Ø On Case Study 2
• Retrieve from a constrained subset of 10 images from different categories.
Figure 6: Illustration of ranking results from the constrained retrieval set.
Conclusion
Ø We are the first to validate the benefits of utilizing the category hierarchy in cross-modal hashing.
Ø We propose a novel supervised hierarchical cross-modal hashing framework, HiCHNet.
Ø We build a large-scale benchmark dataset from the global fashion platform Ssense. Extensive experiments demonstrate the superiority of HiCHNet over the state-of-the-art methods.
Thanks
Q&A
Thanks for the travel grant from SIGIR.
Email: sunchangchang123@gmail.com
Back Up
Experiment
Ø On Category Analysis
Figure 7: Performance of HiCHNet and DCMH on different categories of FashionVC and Ssense in the task of "Text→Image".
Experiment
Ø On Component Analysis
Figure 8: Sensitivity analysis of the hyper-parameters.