BEYOND “PROJECT AND SIGN” FOR COSINE ESTIMATION WITH BINARY CODES
Raghavendran Balu, Teddy Furon and Hervé Jégou
INRIA, Rennes
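Problem statement: Nearest Neighbors search
– Finding the closest vector(s) from a database for a given query
[Diagram: query $x$ → search engine over the database $y_1, \ldots, y_n$ → nearest neighbors]
– In this paper: $\mathrm{NN}(x) = \arg\min_{1 \le i \le n} \|x - y_i\| = \arg\max_{1 \le i \le n} x^\top y_i$ (the two coincide when the $y_i$ have equal norm)
– Problem: exhaustive search compares the query with every one of the n database vectors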
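The exhaustive baseline can be sketched in a few lines of plain Python. The helper names (`normalize`, `exhaustive_nn`) and the toy vectors are illustrative assumptions, not part of the slides; the point is only that every database vector is visited for each query.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def exhaustive_nn(x, database):
    """Return the index i maximizing the inner product x . y_i.

    Every database vector is visited once, so the cost grows with n * D.
    """
    best_i, best_score = -1, -math.inf
    for i, y in enumerate(database):
        score = sum(a * b for a, b in zip(x, y))
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# Toy data (hypothetical): three unit vectors in the plane.
database = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = normalize([0.9, 1.1])
print(exhaustive_nn(query, database))  # prints 2
```

Because the vectors are normalized, the index returned by the inner-product search is also the one with the smallest Euclidean distance to the query.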
2 approaches to Nearest Neighbor Search
– Space partitioning
  • The search is no longer exhaustive
  • Example: indexing technique involving several hash functions
– Approximate distance
  • Faster to compute, but the search remains exhaustive
  • In this paper: we use a Hamming embedding
Hamming embedding
[Figure: the 3-bit Hamming cube with vertices 000, 001, 010, 011, 100, 101, 110, 111]
• Design a mapping function from vectors to binary codes
• Objective
  – neighborhood in Hamming space reflects true neighborhood
• Advantages
  – compact descriptor
  – fast distance computation
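Locality Sensitive Hashing (LSH)
• Initialization: randomly draw L directions $w_1, \ldots, w_L$
• For a given vector $x$, compute a bit for each direction, as $b_j(x) = \mathrm{sign}(w_j^\top x)$
  1. Project
  2. And sign
• Properties: for two vectors $x$ and $y$ separated by an angle $\theta$
  – $\mathbb{P}[b_j(x) \neq b_j(y)] = \theta / \pi$
  – The Hamming distance is related in expectation to the angle as $\mathbb{E}[d_H(b(x), b(y))] = L\,\theta / \pi$ [Charikar 02]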
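A minimal sketch of ‘project and sign’ and of the angle estimate obtained by inverting the expectation $L\,\theta/\pi$. Drawing the directions from a Gaussian distribution is an assumption (the slide only says "randomly"), and the function names are hypothetical.

```python
import math
import random

def draw_directions(L, D, rng):
    """Draw L random directions in R^D (Gaussian entries: an assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(L)]

def project_and_sign(x, W):
    """One bit per direction: 1 if the projection is non-negative, else 0."""
    return [1 if sum(wc * xc for wc, xc in zip(w, x)) >= 0 else 0 for w in W]

def hamming(a, b):
    """Number of positions where the two codes differ."""
    return sum(u != v for u, v in zip(a, b))

def estimate_angle(a, b):
    """Invert E[d_H] = L * theta / pi to get an angle estimate in radians."""
    return math.pi * hamming(a, b) / len(a)

rng = random.Random(0)
W = draw_directions(L=4096, D=2, rng=rng)
x = [1.0, 0.0]
y = [math.cos(1.0), math.sin(1.0)]  # true angle between x and y: 1 radian
theta_hat = estimate_angle(project_and_sign(x, W), project_and_sign(y, W))
print(theta_hat)  # close to 1.0, up to sampling noise
```

With fewer bits the estimate gets noisier, which is the estimation variance the rest of the deck sets out to reduce.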
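Our approach
• Synthesis point of view
  – Reconstructed vector $\hat{x}$, synthesized from the binary code $b(x)$ and the directions $W$
  – If $\hat{x}$ is ‘close’ to $x$ on the sphere, then cosines computed from $\hat{x}$ stay close to those computed from $x$
• Minimizing the quantization error
  – If L < D and $W$ has orthonormal columns, ‘project and sign’ is optimal
  – If L > D, it is a combinatorial problem
    • Not tractable for large D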
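The optimality claim for orthonormal directions can be checked with a short derivation. It assumes the reconstruction takes the form $\hat{x} = W b / \|W b\|$ with $b \in \{-1,+1\}^L$, which the slide text does not spell out, and that minimizing the quantization error on the sphere amounts to maximizing the cosine between $x$ and $\hat{x}$.

```latex
\begin{align*}
b^\star &= \arg\max_{b \in \{-1,+1\}^L} \frac{x^\top W b}{\|W b\|} \\
\|W b\|^2 &= b^\top W^\top W b = b^\top b = L
  \quad \text{when } W^\top W = I_L \\
b^\star &= \arg\max_{b \in \{-1,+1\}^L} \sum_{j=1}^{L} b_j \, w_j^\top x
  \;\Longrightarrow\; b_j^\star = \mathrm{sign}(w_j^\top x)
\end{align*}
```

When L > D, $W^\top W$ can no longer be the identity, so $\|W b\|$ depends on $b$ and the bits cannot be chosen independently. This is where the combinatorial search comes from.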
Reconstruction point of view
Methods ordered from simplicity to optimality:
• ‘Project and sign’ with a frame W
• ‘Project and sign’ with a tight frame W
• Our algorithm qoLSH
  – quantization-optimized LSH
• ‘AntiSparse’ [Jégou 11]
  – Too slow for large D
• Optimal
  – Intractable for large D
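qoLSH algorithm
• Parameter: randomly draw a tight frame $W$
• Initialization: input $x$
  – ‘project and sign’: $b^{(0)} = \mathrm{sign}(W^\top x)$
• Iteration k + 1
  – For any j
    • Flip the j-th bit of $b^{(k)}$
    • Measure the cosine between $x$ and the vector reconstructed from the flipped code
  – Keep the best flip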
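The bit-flipping loop can be sketched as follows. This is a reading of the slide, not the authors' released package, and it rests on several assumptions: the tight frame is built by orthonormalizing the rows of a random Gaussian matrix, the reconstruction is $W b$ with $b \in \{-1,+1\}^L$, and the loop stops when no single flip improves the cosine (or after `max_iter` rounds). All function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def random_tight_frame(D, L, rng):
    """L frame vectors in R^D; the D x L matrix W has orthonormal rows (W W^T = I_D)."""
    rows = []
    for _ in range(D):
        r = [rng.gauss(0.0, 1.0) for _ in range(L)]
        for q in rows:  # Gram-Schmidt against the rows already built
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        rows.append([ri / norm for ri in r])
    return [[rows[d][j] for d in range(D)] for j in range(L)]  # entry j is w_j

def cosine(x, v):
    denom = math.sqrt(dot(x, x) * dot(v, v))
    return dot(x, v) / denom if denom > 0.0 else -1.0

def reconstruct(b, frame):
    """Unnormalized reconstruction W b = sum_j b_j * w_j."""
    return [sum(bj * w[d] for bj, w in zip(b, frame)) for d in range(len(frame[0]))]

def qolsh_encode(x, frame, max_iter=100):
    b = [1 if dot(w, x) >= 0 else -1 for w in frame]  # 'project and sign' start
    v = reconstruct(b, frame)
    best = cosine(x, v)
    for _ in range(max_iter):
        best_j, best_cos = None, best
        for j, w in enumerate(frame):
            # Flipping bit j moves W b by -2 * b_j * w_j.
            cand = [vd - 2 * b[j] * wd for vd, wd in zip(v, w)]
            c = cosine(x, cand)
            if c > best_cos:
                best_j, best_cos = j, c
        if best_j is None:  # assumed stopping rule: no flip improves the cosine
            break
        v = [vd - 2 * b[best_j] * wd for vd, wd in zip(v, frame[best_j])]
        b[best_j] = -b[best_j]
        best = best_cos
    return b, best

rng = random.Random(0)
frame = random_tight_frame(D=4, L=16, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]
lsh_code = [1 if dot(w, x) >= 0 else -1 for w in frame]
lsh_cos = cosine(x, reconstruct(lsh_code, frame))
code, qolsh_cos = qolsh_encode(x, frame)
print(lsh_cos, qolsh_cos)  # the second value is never smaller than the first
```

Each round costs L candidate evaluations, and the incremental update avoids recomputing $W b$ from scratch for every flip. Since only strictly improving flips are accepted, the cosine never drops below the ‘project and sign’ starting point.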
Estimated angle vs True angle
[Figure: estimated angle against true angle, both axes from 0 to 3.14 rad, for LSH and qoLSH. Synthetic data, D = 8, L = 64]
Angle estimation error analysis
[Figure: two panels, BIAS and STANDARD DEVIATION. Synthetic data, D = 128, L = 256]
qoLSH reduces estimation bias and variance compared to LSH
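Application to Nearest Neighbor Search
[Diagram: the database and the query are encoded with qoLSH; a symmetric similarity between codes feeds a max-heap that outputs candidates; the candidates are then re-ranked with an asymmetric similarity that uses reconstruction]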
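A two-stage search of this kind could look like the sketch below. It makes several assumptions: the symmetric stage ranks by Hamming distance between codes, the asymmetric stage takes the inner product between the uncompressed query and the vector reconstructed from each candidate code, plain ‘project and sign’ stands in for the qoLSH encoder to keep the example short, and `heapq.nsmallest` plays the role of the max-heap. The frame and data are toy values.

```python
import heapq
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def encode(x, frame):
    """Stand-in encoder ('project and sign'); a qoLSH encoder would slot in here."""
    return [1 if dot(w, x) >= 0 else -1 for w in frame]

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def reconstruct(b, frame):
    """Unit-norm reconstruction proportional to sum_j b_j * w_j."""
    v = [sum(bj * w[d] for bj, w in zip(b, frame)) for d in range(len(frame[0]))]
    norm = math.sqrt(dot(v, v))
    return [vd / norm for vd in v] if norm > 0.0 else v

def search(query, codes, frame, shortlist, k):
    q_code = encode(query, frame)
    # Stage 1 (symmetric): code against code, keep the closest `shortlist` entries.
    candidates = heapq.nsmallest(
        shortlist, range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    # Stage 2 (asymmetric): uncompressed query against reconstructed candidates.
    candidates.sort(key=lambda i: dot(query, reconstruct(codes[i], frame)), reverse=True)
    return candidates[:k]

# Toy setup (hypothetical): two axis-aligned directions, one vector per quadrant.
frame = [[1.0, 0.0], [0.0, 1.0]]
database = [[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
codes = [encode(y, frame) for y in database]
query = [0.8, 0.6]
print(search(query, codes, frame, shortlist=3, k=3))  # prints [0, 3, 1]
```

In the toy run, candidates 1 and 3 are tied in Hamming distance, and the asymmetric re-ranking separates them because the query itself is not quantized.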
Experimental details
• Dataset
  – Synthetic (n = 1 million, D = 8)
  – SIFT (n = 1 million, D = 128)
  – http://corpus-texmex.irisa.fr
• Algorithms
  – LSH with or without tight frame
  – qoLSH
  – anti-sparse
  – quantization optimal (if tractable)
• Performance measurement
  – 1-Recall@R: probability that the true nearest neighbor belongs to a short list of R candidates
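The performance measure can be estimated empirically as the fraction of queries whose true nearest neighbor appears among the first R candidates. The function name and the toy ranked lists below are illustrative assumptions.

```python
def recall_at_r(ranked_lists, true_nn, R):
    """Fraction of queries whose true nearest neighbor is among the first R candidates."""
    hits = sum(1 for ranked, nn in zip(ranked_lists, true_nn) if nn in ranked[:R])
    return hits / len(true_nn)

# Toy example (hypothetical): three queries, each with a ranked candidate list.
ranked_lists = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
true_nn = [1, 0, 1]  # index of the true nearest neighbor of each query

for R in (1, 2, 3):
    print(R, recall_at_r(ranked_lists, true_nn, R))
```

The measure is non-decreasing in R and reaches 1 once every list is long enough to contain the true neighbor.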
Recall on synthetic data (n = 1M, D = 8)
Recall on real SIFT data (n = 1M, D = 128)
Conclusion
• Hamming embedding dedicated to cosine similarity estimation
• L < D
  – ‘Project and sign’ is optimal with orthogonal random projection
• L > D
  – Tight frame is a good choice
  – ‘Project and sign’ is suboptimal
  – Our reconstruction-based approach
    • decreases quantization error
    • improves cosine similarity estimation
    • improves quality of approximate NN search
    • strikes a good trade-off between quality and complexity
Package online: http://people.rennes.inria.fr/Raghavendran.Balu/code/qolsh.zip
Thank You! QUESTIONS?
LSH suboptimality when L > D
• When L > D, $W$ is not orthogonal
  – Entropy H(B) < L bits
• Example
  [Figure: LSH (suboptimal) compared with the optimal quantization]
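The entropy gap can be illustrated by counting which codes ‘project and sign’ can actually produce. The setup below is a toy assumption (D = 2, L = 4, directions at 0°, 45°, 90° and 135°), not the example from the slide: sweeping $x$ around the circle shows that only some of the $2^L$ codes are reachable, and the entropy of the code cannot exceed the base-2 logarithm of that count.

```python
import math

L = 4
# Hypothetical directions in the plane, 45 degrees apart.
directions = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
              for a in (0, 45, 90, 135)]

patterns = set()
for k in range(360):
    phi = math.radians(k + 0.5)  # offset keeps x away from the decision boundaries
    x = (math.cos(phi), math.sin(phi))
    bits = tuple(1 if w[0] * x[0] + w[1] * x[1] >= 0 else 0 for w in directions)
    patterns.add(bits)

print(len(patterns), 2 ** L)     # prints 8 16: reachable codes vs all L-bit codes
print(math.log2(len(patterns)))  # prints 3.0: upper bound on H(B), below L = 4 bits
```

Half of the 16 possible codes are never used here, so at least one of the 4 bits is wasted. A quantizer that is not restricted to ‘project and sign’ is free to assign those unused codes.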