Permutation Search Methods are Efficient, Yet Faster Search is Possible
Bileg (Bilegsaikhan) Naidan (1), Leo (Leonid) Boytsov (2), Eric Nyberg (2)
(1) Norwegian University of Science and Technology (NTNU)
(2) Carnegie Mellon University (CMU)
https://github.com/searchivarius/NonMetricSpaceLib
Nearest-neighbor search (NN-search)
• Input: A set of n objects and a distance function d(x, y)
• Query: New object q and k
• Task: Quickly find the k objects in the database that are most similar to q
(Figure: a query q and its k = 3 nearest neighbors, numbered 1 to 3.)
1/17 4/9/15
Distance function d(x, y)
Name              Formula                              Symmetry   Triangle ineq.
Euclidean (L2)    $\sqrt{\sum_i (x_i - y_i)^2}$        yes        yes
Cosine distance   $1 - \frac{x \cdot y}{|x| |y|}$      yes        no
KL-diverg.        $\sum_i x_i \log \frac{x_i}{y_i}$    no         no
JS-diverg.        symmetrized & smoothed KL-diverg.    yes        no
Distance functions can be metric or non-metric
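The four distances can be written down in a few lines. This is a minimal sketch in plain Python, not the library's implementation: the function names and the sample vectors are made up for illustration, and the KL-divergence assumes strictly positive entries.

```python
import math

def l2(x, y):
    """Euclidean distance: symmetric and obeys the triangle inequality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    """1 - cosine similarity: symmetric, but the triangle inequality can fail."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def kl_divergence(x, y):
    """sum_i x_i * log(x_i / y_i); assumes all entries are strictly positive."""
    return sum(a * math.log(a / b) for a, b in zip(x, y))

def js_divergence(x, y):
    """Symmetrized and smoothed KL: average KL-divergence to the midpoint m."""
    m = [(a + b) / 2 for a, b in zip(x, y)]
    return 0.5 * kl_divergence(x, m) + 0.5 * kl_divergence(y, m)

# Hypothetical probability vectors used only to illustrate (a)symmetry.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
print(kl_divergence(p, q), kl_divergence(q, p))  # two different values
print(js_divergence(p, q), js_divergence(q, p))  # the same value twice
```

For the cosine distance, the vectors (1, 0), (1, 1) and (0, 1) give a concrete triangle-inequality violation: the direct distance from (1, 0) to (0, 1) is 1, while the detour through (1, 1) costs about 0.59.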
How to find similar objects?
• Brute-force
  • Exact search
  • Slow: n distance computations
• Indexing
  • Exact search is mostly slow in high dimensions and/or non-metric spaces: O(n) distance computations
  • Approximate search can be fast
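Brute-force search is the baseline that every other method is measured against. A minimal sketch, assuming the data fits in a Python list; `brute_force_knn` and the sample points are illustrative names and values, not part of the library:

```python
import heapq
import math

def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def brute_force_knn(data, query, k, dist):
    """Exact k-NN: computes dist(query, obj) once for each of the n objects."""
    return heapq.nsmallest(k, data, key=lambda obj: dist(query, obj))

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0), (9.0, 1.0)]
nearest = brute_force_knn(points, (0.2, 0.1), k=3, dist=l2)
print(nearest)
```

The query is passed as the first argument of `dist`, which matters once the distance is not symmetric (for example, the KL-divergence).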
State-of-the-art approximate search methods
• Locality-Sensitive Hashing (LSH)
• VP-tree/ball-tree (data-dependent tuning)
• Proximity graphs (kNN-graphs)
• Permutation methods
Why should we care about permutation methods?
• Promising universal methods for non-metric spaces
• Mapping data from "hard" spaces to "easy" spaces (the Euclidean space)
• Database-friendly methods that are easy to implement on top of a database system or Lucene
Research questions
• How good are permutation-based projections?
• How well do permutation methods fare against the state of the art?
Permutation Methods
• Filter-and-refine methods using pivot-based projection to the permutation space (L1 or L2)
• Randomly select a set of reference points called pivots
• Order pivots by their distances to data points to obtain pivot rankings, which we call permutations
• Filter by comparing permutations to obtain candidate points
• Refine by comparing candidate points to the query
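The steps above can be sketched end to end. This is an illustrative brute-force variant, not the library's code: the names `permutation`, `build_index` and `search`, the parameter `gamma` (number of candidates), and the random 2-dim. data are all assumptions made for the example.

```python
import heapq
import math
import random

def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def permutation(obj, pivots, dist):
    """Entry i is the rank of pivot i when pivots are sorted by distance to obj."""
    order = sorted(range(len(pivots)), key=lambda i: dist(obj, pivots[i]))
    ranks = [0] * len(pivots)
    for position, pivot_id in enumerate(order, start=1):
        ranks[pivot_id] = position
    return ranks

def footrule(p, q):
    """L1 distance between two permutations."""
    return sum(abs(a - b) for a, b in zip(p, q))

def build_index(data, num_pivots, dist, rng):
    pivots = rng.sample(data, num_pivots)  # randomly selected pivots
    perms = [permutation(obj, pivots, dist) for obj in data]
    return pivots, perms

def search(data, pivots, perms, query, k, gamma, dist):
    q_perm = permutation(query, pivots, dist)
    # Filter: keep the gamma points whose permutations are closest to the query's.
    candidates = heapq.nsmallest(
        gamma, range(len(data)), key=lambda i: footrule(q_perm, perms[i]))
    # Refine: compare only the candidates to the query in the original space.
    return heapq.nsmallest(
        k, (data[i] for i in candidates), key=lambda obj: dist(query, obj))

rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(1000)]
pivots, perms = build_index(data, num_pivots=16, dist=l2, rng=rng)
result = search(data, pivots, perms, (0.5, 0.5), k=3, gamma=50, dist=l2)
print(result)
```

With `gamma` much smaller than n, only `num_pivots + gamma` original-space distances are computed per query; the remaining work happens in the cheap permutation space.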
Permutation Methods
How do we carry out the filtering step?
• Brute-force searching
• Indexing of permutations
• Neighborhood APProximation Index (NAPP) is the best approach
Experiments: Datasets
Name          Distance function   Number of points   Brute-force (sec.)   Dimens.
Metric Data
CoPhIR        L2                  5 · 10^6           0.6                  282
SIFT          L2                  5 · 10^6           0.3                  128
ImageNet      SQFD                1 · 10^6           4.1                  N/A
Non-Metric Data
Wiki-sparse   Cosine sim.         4 · 10^6           1.9                  10^5
Wiki-8        KL-div/JS-div       2 · 10^6           0.045/0.28           8
Wiki-128      KL-div/JS-div       2 · 10^6           0.22/4               128
DNA           Norm. Leven.        1 · 10^6           3.5                  N/A
Experiments: Projection Quality
Distance in the original space vs. distance in the projected space. The closer to a monotonic mapping, the better:
(Figure: scatter plot of original vs. projected distance. Good projection, original distance: L2.)
(Figure: scatter plot of original vs. projected distance. Bad projection, original distance: JS-div.)
Experiments: Efficiency vs Accuracy
Improvement in efficiency over brute-force search vs. accuracy. Higher and to the right is better:
(Figure: SIFT (L2). x-axis: Recall, 0.6 to 1; y-axis: Improv. in efficiency (log. scale). Methods: VP-tree, MPLSH, kNN-graph (SW), NAPP.)
(Figure: Norm. Levenshtein. x-axis: Recall, 0.6 to 1; y-axis: Improv. in efficiency (log. scale). Methods: VP-tree, kNN-graph (NN-desc), brute-force filt. bin., NAPP.)
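Both plot axes can be computed from a search run. A minimal sketch, assuming recall is the fraction of the true k nearest neighbors that the method returns and improvement in efficiency is the brute-force query time divided by the method's query time; the function names and the 0.003 sec. timing below are made up for illustration.

```python
import math

def recall(approx_ids, exact_ids):
    """Fraction of the exact k nearest neighbors found by the approximate search."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)

def improvement_in_efficiency(brute_force_time, method_time):
    """How many times faster the method is than brute-force search."""
    return brute_force_time / method_time

# Hypothetical query: 2 of the 3 true neighbors were retrieved.
r = recall(approx_ids=[1, 2, 9], exact_ids=[1, 2, 3])
# Hypothetical timing: brute force 0.3 sec., method 0.003 sec.
speedup = improvement_in_efficiency(0.3, 0.003)
print(r, speedup)
```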
Conclusions
• Permutation methods beat state-of-the-art methods (VP-trees, kNN-graphs and Multiprobe LSH) for some data sets, in particular when the distance function is expensive
• The quality of permutation-based projection can be both good and poor: it appears to be better when the space is metric and/or dimensionality is low
Poster Session Discussion Points
What makes a good, amenable, non-metric space?
Thank you for your attention!
Some technical details
Permutation Methods
Permutation is a fancy word for a pivot ranking!
The data points are a, b, c, d in 2-dim. Euclidean space (L2). The Voronoi diagram is produced by 4 pivots π_i.
Point   Pivot order          Permutation
a       (π1, π2, π3, π4)     (1, 2, 3, 4)
b       (π1, π2, π4, π3)     (1, 2, 4, 3)
c       (π3, π1, π2, π4)     (2, 3, 1, 4)
d       (π4, π2, π1, π3)     (3, 2, 4, 1)
The permutations of a and b are similar. In the permutation of d, the position of π4 is 1.
(Figure: Voronoi diagram of the pivots π1 to π4 with the data points a, b, c, d.)
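The step from a pivot order to a permutation is an inversion: entry i of the permutation stores the position of pivot π_i in the pivot order. A small sketch that reproduces the four rows of the example; the function name `order_to_permutation` is an illustrative choice.

```python
def order_to_permutation(pivot_order):
    """Convert a pivot order (closest pivot first) into a permutation.

    Entry i - 1 of the result is the position of pivot i in pivot_order.
    """
    permutation = [0] * len(pivot_order)
    for position, pivot in enumerate(pivot_order, start=1):
        permutation[pivot - 1] = position
    return tuple(permutation)

# Pivot orders of the example points, written as pivot indices.
pivot_orders = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (3, 1, 2, 4),
    "d": (4, 2, 1, 3),
}
permutations = {p: order_to_permutation(o) for p, o in pivot_orders.items()}
for point, perm in permutations.items():
    print(point, perm)
```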
Permutation Methods
• Filtering step - compare permutations instead of original data points to obtain γ candidate points
  • Footrule distance: $\mathrm{Footrule}(x, y) = \sum_i |x_i - y_i|$ (same as L1)
  • Spearman's rho distance (same as L2)
Point   Footrule(a, •)
b       |1 − 1| + |2 − 2| + |3 − 4| + |4 − 3| = 2   (candidate point)
c       |1 − 2| + |2 − 3| + |3 − 1| + |4 − 4| = 4   (candidate point)
d       |1 − 3| + |2 − 2| + |3 − 4| + |4 − 1| = 6
• Refinement step - apply d(q, •) to the candidate points (in our example, γ = 2 and q = a, so we compute d(q, b) and d(q, c))
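The filtering arithmetic of this example can be checked in a few lines. This is a sketch only: Spearman's rho is taken here as the sum of squared rank differences, an assumption that ranks candidates in the same order as L2.

```python
permutations = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (2, 3, 1, 4),
    "d": (3, 2, 4, 1),
}

def footrule(x, y):
    """Sum of absolute rank differences (L1 on permutations)."""
    return sum(abs(u - v) for u, v in zip(x, y))

def spearman_rho(x, y):
    """Sum of squared rank differences (assumed form; monotone in L2)."""
    return sum((u - v) ** 2 for u, v in zip(x, y))

query = "a"
gamma = 2  # number of candidate points kept by the filter
scores = {
    point: footrule(permutations[query], perm)
    for point, perm in permutations.items()
    if point != query
}
candidates = sorted(scores, key=scores.get)[:gamma]
print(scores)      # Footrule distances from a to b, c and d
print(candidates)  # points passed on to the refinement step
```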
Permutation Methods
Filtering step:
• Naive approach - Brute-force searching
  • using a priority queue
  • incremental sorting [Gonzales 2008] (2× faster than the priority queue approach)
  • binarized permutations (select a threshold b and use the Hamming distance)
• Brute force in the permutation space is efficient if the distance is expensive.
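Binarization can be sketched as follows. The convention used here (a bit is 1 when the rank exceeds the threshold b) is an assumption made for illustration, as are the function names and the choice b = 2; packing the bits into an integer lets the Hamming distance be computed with one XOR and a bit count.

```python
def binarize(permutation, b):
    """Replace each rank by 1 if it exceeds the threshold b, else 0 (assumed convention)."""
    return tuple(1 if rank > b else 0 for rank in permutation)

def hamming(x, y):
    """Number of positions where two bit tuples differ."""
    return sum(u != v for u, v in zip(x, y))

def pack(bits):
    """Pack a bit tuple into one integer, first bit most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def hamming_packed(x, y):
    """Hamming distance between two packed bit vectors."""
    return bin(x ^ y).count("1")

permutations = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (2, 3, 1, 4),
    "d": (3, 2, 4, 1),
}
binary = {p: binarize(perm, b=2) for p, perm in permutations.items()}
for point in "bcd":
    print(point, hamming(binary["a"], binary[point]))
```

Binarization loses information: with b = 2, points a and b collapse to the same bit vector, so the filter can no longer tell them apart.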
Permutation Methods
To reduce the cost of the filtering stage, three types of indices were proposed:
• use the existing methods for metric spaces [Figueroa 2009]
• the Permutation Prefix Index (PP-Index) [Esuli 2009]
• the Metric Inverted File (MI-file) [Amato et al. 2008]
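Permutation Methods
Permutation Prefix Index (PP-index) [Esuli 2009]
Point   Pivot order
a       (π1, π2, π3, π4)
b       (π1, π2, π4, π3)
c       (π3, π1, π2, π4)
d       (π4, π2, π1, π3)
(Figure: prefix tree over the first three pivots of each pivot order. Level 1 holds pivots 1, 3, 4; level 2 holds 2, 1, 2; level 3 holds 3, 4, 2, 1; the points a, b, c, d sit at the leaves.)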
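The idea of grouping points by a shared pivot-order prefix can be sketched with a dictionary keyed by prefixes. This is a simplified stand-in for a real prefix tree and is not taken from [Esuli 2009]: the function names, the parameter `min_candidates`, and the rule of shortening the query prefix until enough points match are assumptions made for the example.

```python
from collections import defaultdict

def build_prefix_index(pivot_orders, prefix_len):
    """Group points by the first prefix_len pivots of their pivot order."""
    index = defaultdict(list)
    for point, order in pivot_orders.items():
        index[order[:prefix_len]].append(point)
    return index

def find_candidates(index, query_order, prefix_len, min_candidates):
    """Shorten the query prefix until at least min_candidates points share it."""
    found = []
    for length in range(prefix_len, -1, -1):
        prefix = query_order[:length]
        found = [point
                 for key, points in index.items() if key[:length] == prefix
                 for point in points]
        if len(found) >= min_candidates:
            break
    return found

pivot_orders = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 4, 3),
    "c": (3, 1, 2, 4),
    "d": (4, 2, 1, 3),
}
index = build_prefix_index(pivot_orders, prefix_len=3)
print(find_candidates(index, (1, 2, 3, 4), prefix_len=3, min_candidates=2))
```

A query whose pivot order starts with (π1, π2, π3) matches only a at full prefix length, so the prefix is shortened to (π1, π2), which yields a and b as candidates.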