Composite Correlation Quantization for Efficient Multimodal Retrieval
Mingsheng Long (1), Yue Cao (1), Jianmin Wang (1), and Philip S. Yu (1,2)
(1) School of Software, Tsinghua University
(2) Department of Computer Science, University of Illinois, Chicago
ACM Conference on Research and Development in Information Retrieval, SIGIR 2016
M. Long et al. (Tsinghua University), Composite Correlation Quantization, ACM SIGIR 2016
Outline
1 Introduction
    Problem
    Effectiveness and Efficiency
    Previous Work
2 Composite Correlation Quantization
    Multimodal Correlation
    Composite Quantization
    Optimization Framework
3 Evaluation
    Results
    Discussion
4 Summary
Introduction: Problem
Multimodal Understanding
- How to utilize multimodal data to understand our real world?
- Isomorphic space: integration, fusion, correlation, transfer, ...
Introduction: Problem
Multimodal Retrieval
- Nearest Neighbor (NN) similarity retrieval across modalities
- Database: $\mathcal{X}^{img} = \{x_1^{img}, \ldots, x_N^{img}\}$ and Query: $q^{txt}$
- Cross-modal NN: $\mathrm{NN}(q^{txt}) = \arg\min_{x^{img} \in \mathcal{X}^{img}} d(x^{img}, q^{txt})$
Figure: Cross-modal retrieval: similarity retrieval across media modalities. (a) I → T (Image Query on Text DB): top 16 returned tags, precision 0.625. (b) T → I (Text Query on Image DB): top 16 returned images, precision 0.625. Tag strings shown in the panels: 'sky sun' and 'lake'.
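The cross-modal NN definition on this slide can be illustrated with a brute-force linear scan. This is only a minimal sketch: it assumes that image and text items have already been mapped into a common vector space and that d is the Euclidean distance, neither of which the slide specifies. The function name `cross_modal_nn` and the toy vectors are hypothetical.

```python
import numpy as np

def cross_modal_nn(db, query, k=1):
    """Return indices of the k database rows closest to the query.

    Assumption: rows of `db` (images) and `query` (text) live in the
    same space and are compared with the Euclidean distance.
    """
    dists = np.linalg.norm(db - query, axis=1)  # one distance per database item
    return np.argsort(dists)[:k]                # linear scan, O(N P)

# Toy image database with three 2-dimensional embeddings (made-up values).
db_img = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
q_txt = np.array([0.9, 1.2])                    # toy text query embedding
nearest = cross_modal_nn(db_img, q_txt, k=2)
```

The scan touches every database item, which is the O(NP) cost that indexing, hashing, and quantization aim to reduce.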
Introduction: Effectiveness and Efficiency
Multimodal Embedding
- Multimodal embedding reduces the cross-modal heterogeneity gap
- Coupling: $\min \sum_{i=1}^{N} d(z_i^{img}, z_i^{txt})$ → more flexible
- Fusion: $z_i = f(z_i^{img}, z_i^{txt})$ → tighter relationship
Figure: Multimodal embedding by coupling versus fusion. An image and its caption ("A Tabby cat is leaning on a wooden table, with one paw on a laser mouse and the other on a black laptop") pass through an image mapping and a text mapping and are encoded as binary codes.
Introduction: Effectiveness and Efficiency
Indexing and Hashing
Approximate Nearest Neighbor (ANN) Search
- Exact Nearest Neighbor Search: linear scan, $O(NP)$
- ANN search: efficient, acceptable accuracy, practical solutions
- Reduce the number of distance computations: $O(N'P)$, $N' \ll N$
    - Indexing: tree, neighborhood graph, inverted index, ...
- Reduce the cost of each distance computation: $O(NP')$, $P' \ll P$
    - Hashing: Locality-Sensitive Hashing, Spectral Hashing, ...
        - Produces only a few distinct distances (curse of dimensionality)
        - Limited ability and flexibility of distance approximation
    - Quantization: Vector Quantization (VQ), Iterative Quantization (ITQ), Product Quantization (PQ), Composite Quantization (CQ)
        - K-means: impossible for medium and long codes (large $K$)
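Two claims on this slide can be checked with simple counting. The sketch below assumes that hash codes are compared with the Hamming distance and that a single K-means codebook assigns each centroid its own H-bit code; both are standard readings but are not spelled out on the slide. All names are illustrative.

```python
from itertools import product

def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

# Enumerate every pair of 4-bit codes and collect the distances that occur.
H = 4
codes = list(product([0, 1], repeat=H))
distinct = sorted({hamming(a, b) for a in codes for b in codes})
# Only H + 1 values are possible (0..H), however many items are indexed.

# A single K-means codebook needs one centroid per code value,
# so an H-bit code requires K = 2**H centroids.
centroids_for_64_bits = 2 ** 64
```

With 64-bit codes, Hamming ranking can distinguish at most 65 distance levels, while plain K-means would need 2^64 centroids. That gap is what the multi-codebook quantizers listed on the slide (PQ, CQ) address.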
Introduction: Previous Work
Multimodal Hashing
Figure (labels only): 1M images; 512-dim floats; 20GB; two-stage; 128-bits; 160M.
- Previous work: separate pipeline for Multimodal Embedding and Binary Encoding → large information loss, unbalanced encoding
Composite Correlation Quantization: Problem Definition
Definition (Composite Correlation Quantization, CCQ)
Given an image set $\{x_n^1\}_{n=1}^{N_1} \subset \mathbb{R}^{P_1}$ and a text set $\{x_n^2\}_{n=1}^{N_2} \subset \mathbb{R}^{P_2}$, learn two correlation mappings $f^1 : \mathbb{R}^{P_1} \mapsto \mathbb{R}^{D}$ and $f^2 : \mathbb{R}^{P_2} \mapsto \mathbb{R}^{D}$ that transform images and texts into a $D$-dimensional isomorphic latent space, and jointly learn two composite quantizers $q^1 : \mathbb{R}^{D} \mapsto \{0,1\}^{H}$ and $q^2 : \mathbb{R}^{D} \mapsto \{0,1\}^{H}$ that quantize latent embeddings into compact $H$-bit binary codes.
Composite Correlation Quantization: Overview
- A Latent Semantic Analysis (LSA) optimization framework
    - $x_n^v \approx R^v C^v b_n^v$, where $R^v$ is the correlation-maximal mapping, $C^v$ is the similarity-preserving codebook, and $b_n^v$ is the compact binary code
- Multimodal Embedding: Correlation Mapping & Code Fusion
- Composite Quantization: Isomorphic Space (shared codebook)
- A "simple and reliable" approach to efficient multimodal retrieval
Figure: CCQ pipeline. An image mapping and a text mapping feed a multimodal embedding, followed by composite quantization with an isomorphic codebook that produces the hash code.
Composite Correlation Quantization: Multimodal Correlation
- Paired data matrices: $X^1 = [x_1^1, \ldots, x_N^1]$, $X^2 = [x_1^2, \ldots, x_N^2]$
- Fusion representation matrix: $Z = [z_1, \ldots, z_N]$
- Transformation matrices: $R^1$, $R^2$, which transform $X$ into $Z$
$$\min_{R^1, R^2, Z} \; \lambda_1 \left\| R^{1\mathsf{T}} X^1 - Z \right\|_F^2 + \lambda_2 \left\| R^{2\mathsf{T}} X^2 - Z \right\|_F^2 \tag{1}$$
Figure: CCQ pipeline diagram annotated with $X^1$, $R^1$, $X^2$, $R^2$, and $Z$.
Composite Correlation Quantization: Multimodal Correlation
- This problem is ill-posed and cannot be solved successfully (for example, $R^1 = R^2 = 0$, $Z = 0$ trivially attains zero objective)
$$\min_{R^1, R^2, Z} \; \lambda_1 \left\| R^{1\mathsf{T}} X^1 - Z \right\|_F^2 + \lambda_2 \left\| R^{2\mathsf{T}} X^2 - Z \right\|_F^2 \tag{2}$$
$$Z = \frac{\lambda_1 R^{1\mathsf{T}} X^1 + \lambda_2 R^{2\mathsf{T}} X^2}{\lambda_1 + \lambda_2} \tag{3}$$
$$R^1 = \left( X^1 X^{1\mathsf{T}} \right)^{-1} X^1 Z^{\mathsf{T}}, \qquad R^2 = \left( X^2 X^{2\mathsf{T}} \right)^{-1} X^2 Z^{\mathsf{T}} \tag{4}$$
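The closed-form updates (3) and (4) can be alternated numerically to see why the unconstrained objective (2) is problematic. The sketch below is an illustration only: the data are random, the dimensions and function names (`objective`, `update_Z`, `update_R`) are hypothetical, and the matrices are assumed to be stored with one column per sample, so X^v is P_v × N, R^v is P_v × D, and Z is D × N.

```python
import numpy as np

def objective(X1, X2, R1, R2, Z, lam1=1.0, lam2=1.0):
    """Value of the unconstrained objective (2), Frobenius norms squared."""
    return (lam1 * np.linalg.norm(R1.T @ X1 - Z) ** 2
            + lam2 * np.linalg.norm(R2.T @ X2 - Z) ** 2)

def update_Z(X1, X2, R1, R2, lam1=1.0, lam2=1.0):
    """Update (3): weighted average of the two projected modalities."""
    return (lam1 * R1.T @ X1 + lam2 * R2.T @ X2) / (lam1 + lam2)

def update_R(X, Z):
    """Update (4): R = (X X^T)^{-1} X Z^T, computed with a linear solve."""
    return np.linalg.solve(X @ X.T, X @ Z.T)

rng = np.random.default_rng(0)
P1, P2, D, N = 6, 5, 3, 50                      # toy sizes (assumed)
X1 = rng.standard_normal((P1, N))
X2 = rng.standard_normal((P2, N))
R1 = rng.standard_normal((P1, D))
R2 = rng.standard_normal((P2, D))

history = []
for _ in range(20):
    Z = update_Z(X1, X2, R1, R2)
    R1 = update_R(X1, Z)
    R2 = update_R(X2, Z)
    history.append(objective(X1, X2, R1, R2, Z))

# The all-zero solution attains objective 0 without using the data at all.
trivial = objective(X1, X2, np.zeros((P1, D)), np.zeros((P2, D)), np.zeros((D, N)))
```

Each update is the exact minimizer over its own block, so the recorded objective never increases. Nothing in (2) rules out the degenerate zero solution, however, which is one way to read the slide's "ill-posed" remark.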
Composite Correlation Quantization: Multimodal Correlation
- Add the covariance maximization with orthogonal constraints
$$\min_{R^1, R^2, Z} \; \lambda_1 \left( \left\| R^{1\mathsf{T}} X^1 - Z \right\|_F^2 + \left\| R_{\perp}^{1\mathsf{T}} X^1 \right\|_F^2 \right) + \lambda_2 \left( \left\| R^{2\mathsf{T}} X^2 - Z \right\|_F^2 + \left\| R_{\perp}^{2\mathsf{T}} X^2 \right\|_F^2 \right) \tag{5}$$
$$\min_{R^1, R^2, Z} \; \lambda_1 \left\| X^1 - R^1 Z \right\|_F^2 + \lambda_2 \left\| X^2 - R^2 Z \right\|_F^2 \tag{6}$$
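The step from (5) to (6) can be made explicit. The derivation below is a sketch under two assumptions that the slide implies but does not state: $R^{v\mathsf{T}} R^v = I$ (the orthogonal constraint), and $R_{\perp}^v$ is a basis of the orthogonal complement of $R^v$, so that $Q^v = [R^v, R_{\perp}^v]$ is a square orthogonal matrix. Multiplying by $Q^{v\mathsf{T}}$ preserves the Frobenius norm, hence for each modality $v$:

$$
\begin{aligned}
\left\| X^v - R^v Z \right\|_F^2
&= \left\| Q^{v\mathsf{T}} \left( X^v - R^v Z \right) \right\|_F^2 \\
&= \left\| R^{v\mathsf{T}} X^v - R^{v\mathsf{T}} R^v Z \right\|_F^2 + \left\| R_{\perp}^{v\mathsf{T}} X^v - R_{\perp}^{v\mathsf{T}} R^v Z \right\|_F^2 \\
&= \left\| R^{v\mathsf{T}} X^v - Z \right\|_F^2 + \left\| R_{\perp}^{v\mathsf{T}} X^v \right\|_F^2 ,
\end{aligned}
$$

using $R^{v\mathsf{T}} R^v = I$ and $R_{\perp}^{v\mathsf{T}} R^v = 0$. Weighting by $\lambda_v$ and summing over $v = 1, 2$ turns (5) into (6). Since $\| X^v \|_F^2 = \| R^{v\mathsf{T}} X^v \|_F^2 + \| R_{\perp}^{v\mathsf{T}} X^v \|_F^2$ is fixed, minimizing the added term is the same as maximizing the variance retained by $R^{v\mathsf{T}} X^v$.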
Composite Correlation Quantization: Composite Quantization
- Learn $M$ codebooks: $C = [C_1, \ldots, C_M]$; each codebook has $K$ codewords, $C_m = [c_{m1}, \ldots, c_{mK}]$ (cluster centroids of K-means)
- Each $z_i$ is approximated by the addition of $M$ codewords
    - One per codebook, each selected by the binary assignment $b_{mi}$
- Code representation: $i_1 i_2 \ldots i_M$, where $i_m = \mathrm{nz}(b_{mi})$ is the index of the nonzero entry of $b_{mi}$
- Code length: $M \log_2 K$ (1-of-$K$ encoding)
$$z \approx \hat{z} = C_1 b_1 + C_2 b_2 + \ldots + C_M b_M = c_{1 i_1} + c_{2 i_2} + \ldots + c_{M i_M} \tag{7}$$
$$C_1 = [c_{11}, \ldots, c_{1K}], \quad C_2 = [c_{21}, \ldots, c_{2K}], \quad \ldots, \quad C_M = [c_{M1}, \ldots, c_{MK}]$$
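Equation (7) and the code-length formula can be sketched in a few lines. The example assumes codewords are stored as columns of each codebook. The exhaustive encoder is a stand-in chosen for clarity, not the encoding procedure used by the authors, and it is only feasible for tiny M and K. The function names and the toy codebooks are hypothetical.

```python
import itertools
import math
import numpy as np

def decode(codebooks, indices):
    """Equation (7): z_hat is the sum of one codeword per codebook."""
    return sum(C[:, i] for C, i in zip(codebooks, indices))

def encode_exhaustive(codebooks, z):
    """Pick the index tuple (i_1, ..., i_M) with the smallest error.

    Tries all K**M combinations (illustrative only, not the slide's method).
    """
    K = codebooks[0].shape[1]
    candidates = itertools.product(range(K), repeat=len(codebooks))
    return min(candidates, key=lambda idx: np.linalg.norm(z - decode(codebooks, idx)))

def code_length_bits(M, K):
    """Code length M * log2(K) for M codebooks of K codewords each."""
    return M * math.log2(K)

# Two toy codebooks (M = 2) with two 2-dimensional codewords each (K = 2).
C1 = np.array([[0.0, 1.0], [0.0, 0.0]])   # columns: (0, 0) and (1, 0)
C2 = np.array([[0.0, 0.0], [0.0, 2.0]])   # columns: (0, 0) and (0, 2)
z = np.array([0.9, 2.1])
indices = encode_exhaustive([C1, C2], z)
z_hat = decode([C1, C2], indices)
```

With M = 8 codebooks of K = 256 codewords, `code_length_bits(8, 256)` gives a 64-bit code while storing only 8 × 256 codewords, compared with the 2^64 centroids a single codebook would need.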
Composite Correlation Quantization: Composite Quantization
- Learn $M$ codebooks: $C = [C_1, \ldots, C_M]$; each codebook has $K$ codewords, $C_m = [c_{m1}, \ldots, c_{mK}]$ (cluster centroids of K-means)
- Binary code matrices: $B = [B_1; \ldots; B_M]$, $B_m = [b_{m1}, \ldots, b_{mN}]$
- Control the quality of the binary codes by quantization error minimization
$$\min_{Z, C, B} \; \left\| Z - \sum_{m=1}^{M} C_m B_m \right\|_F^2 = \sum_{i=1}^{N} \left\| z_i - \sum_{m=1}^{M} C_m b_{mi} \right\|_2^2 \tag{8}$$
Figure: CCQ pipeline diagram annotated with $Z$, $B$, and the isomorphic codebook $C = [C_1, \ldots, C_M]$.
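The two sides of equation (8) can be compared numerically. The sketch below uses random data and assumes each B_m is a K × N matrix whose columns are 1-of-K indicator vectors, which matches the 1-of-K encoding on the previous slide. The helper names and toy sizes are hypothetical.

```python
import numpy as np

def one_hot(indices, K):
    """Build a K x N assignment matrix with a single 1 per column."""
    B = np.zeros((K, len(indices)))
    B[indices, np.arange(len(indices))] = 1.0
    return B

def quantization_error(Z, codebooks, assignments):
    """Left-hand side of (8): || Z - sum_m C_m B_m ||_F^2."""
    Z_hat = sum(C @ B for C, B in zip(codebooks, assignments))
    return np.linalg.norm(Z - Z_hat) ** 2

rng = np.random.default_rng(0)
D, N, M, K = 4, 10, 2, 3                                   # toy sizes (assumed)
Z = rng.standard_normal((D, N))
codebooks = [rng.standard_normal((D, K)) for _ in range(M)]
index_lists = [rng.integers(0, K, size=N) for _ in range(M)]
assignments = [one_hot(idx, K) for idx in index_lists]

matrix_form = quantization_error(Z, codebooks, assignments)

# Right-hand side of (8): sum over samples of the squared reconstruction error.
per_sample = sum(
    np.linalg.norm(Z[:, i] - sum(C[:, idx[i]] for C, idx in zip(codebooks, index_lists))) ** 2
    for i in range(N)
)
```

The matrix form and the per-sample sum agree up to floating-point error, which is why the objective can be minimized one sample at a time when the codebooks are held fixed.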