Algorithm Engineering for High-Dimensional Similarity Search Problems
Martin Aumüller, IT University of Copenhagen
Roadmap
01 Similarity Search in High-Dimensions: Setup/Experimental Approach
02 Survey of state-of-the-art Nearest Neighbor Search algorithms
03 Similarity Search on the GPU, in external memory, and in distributed settings
1. Similarity Search in High-Dimensions: Setup/Experimental Approach
k-Nearest Neighbor Problem
• Preprocessing: Build a data structure for a set S of n data points
• Task: Given a query point q, return the k closest points to q in S
Nearest neighbor search on words
• GloVe: learning algorithm to find vector representations for words
• GloVe.twitter dataset: 1.2M words, vectors trained from 2B tweets, 100 dimensions
• Semantically similar words: nearest neighbor search on vectors
https://nlp.stanford.edu/projects/glove/
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
GloVe Examples
• "sicily": sardinia, tuscany, dubrovnik, liguria, naples
• "algorithm": algorithms, optimization, approximation, iterative, computation
• "engineering": engineer, accounting, research, science, development

$ grep -n "sicily" glove.twitter.27B.100d.txt
118340:sicily -0.43731 -1.1003 0.93183 0.13311 0.17207 …
Basic Setup
• Data is described by high-dimensional feature vectors
• Exact similarity search is difficult in high dimensions
• data structures and algorithms suffer an exponential dependence on dimensionality, in time, space, or both
Why is Exact NN difficult?
• Choose n random points from N(0, 1/d)^d, for large d
• Choose a random query point the same way
• All distances concentrate around √2 ± 1/√d
• nearest and furthest neighbor are basically at the same distance
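This concentration effect is easy to observe empirically. A small simulation sketch (parameters d and n are illustrative; coordinates are drawn from N(0, 1/d) so points have expected unit norm):

```python
import math
import random

def random_point(d):
    # each coordinate drawn from N(0, 1/d), so the expected squared norm is 1
    return [random.gauss(0.0, math.sqrt(1.0 / d)) for _ in range(d)]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

random.seed(42)
d, n = 400, 200
points = [random_point(d) for _ in range(n)]
query = random_point(d)

dists = sorted(dist(query, p) for p in points)
nearest, furthest = dists[0], dists[-1]
# all distances concentrate around sqrt(2); the contrast between the
# nearest and the furthest neighbor vanishes as d grows
print(nearest, furthest, furthest / nearest)
```

Even with only 200 points, the ratio between furthest and nearest distance is already close to 1.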
Performance on GloVe
Difficulty measure for queries
• Given query q and distances r_1, …, r_k to its k nearest neighbors, define

  LID(q) = −( (1/k) · Σ_{i=1}^{k} ln(r_i / r_k) )^{−1}

Based on the concept of local intrinsic dimensionality [Houle, 2013] and its MLE estimator [Amsaleg et al., 2015]
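The estimator is a one-liner. A sketch (the synthetic distances r_i = (i/k)^(1/2) follow a locally 2-dimensional model, so the estimate should come out near 2):

```python
import math

def lid_estimate(dists):
    """MLE estimate of local intrinsic dimensionality from the sorted
    distances r_1 <= ... <= r_k to a query's k nearest neighbors."""
    r_k = dists[-1]
    return -1.0 / (sum(math.log(r / r_k) for r in dists) / len(dists))

# distances drawn from a locally 2-dimensional model
k = 1000
dists = [(i / k) ** 0.5 for i in range(1, k + 1)]
print(lid_estimate(dists))  # close to 2
```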
LID Distribution
Results (GloVe, 10-NN, 1.2M points): easy, middle, and difficult query buckets
http://ann-benchmarks.com/sisap19/faiss-ivf.html
2. State-of-the-Art Nearest Neighbor Search
General Pipeline
Index generates candidates → Brute-force search on candidates
Brute-force search
• GloVe: 1.2M points, inner product as distance measure
• each vector x and the query q: 100 floats = 400 bytes
• 100 ms per scan
• 4.2 GB/s throughput
• CPU-bound
Automatically SIMD-vectorized with clang -O3: https://godbolt.org/z/TJX68s
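The scan itself is trivial; a plain-Python sketch of the logic (the C++ equivalent of the inner loop is what clang auto-vectorizes):

```python
import heapq

def brute_force_knn(data, q, k):
    """Scan all points, keeping the k largest inner products
    (GloVe vectors are compared by inner product here)."""
    def ip(x, y):
        return sum(a * b for a, b in zip(x, y))
    return heapq.nlargest(k, range(len(data)), key=lambda i: ip(data[i], q))

data = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
q = [1.0, 0.1]
print(brute_force_knn(data, q, 2))  # -> [0, 2]
```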
Manual vectorization (256-bit registers)
https://gist.github.com/maumueller/720d0f71664bef694bd56b2aeff80b17
• parallel multiply of x and q, parallel add into a result register initialized to all zeros, horizontal sum and cast to float
• 25 ms per query
• 16 GB/s throughput (16.5 GB/s single-thread max on my laptop)
• Memory-bound
Brute-force on bit vectors
• Another popular distance measure is Hamming distance: the number of positions in which two bit strings differ
• Can be nicely packed into 64-bit words
• Hamming distance of two words is just the bitcount of their XOR
• 1.3 ms per query (128 bits)
• 6 GB/s throughput
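The XOR-and-bitcount trick in a sketch (bit strings packed into tuples of 64-bit words; a C++ version would use a popcount instruction per word):

```python
def hamming(a, b):
    """Hamming distance of two bit strings packed into equal-length
    tuples of machine words: bitcount of the XOR, word by word."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

x = (0b1011, 0b0001)
y = (0b0011, 0b0111)
print(hamming(x, y))  # -> 3: one differing bit in the first word, two in the second
```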
Sketching to avoid distance computations [Christiani, 2019]
SimHash [Charikar, 2002], 1-Bit MinHash [König-Li, 2010]
• Distance computations on bit vectors are faster than Euclidean distance/inner product
• Their number can be reduced by storing compact sketch representations
• Sketch representation: q ↦ 1011100101, y ↦ 0101110101
• Easy to analyze: the number of collisions is a sum of Bernoulli trials with Pr(X_i = 1) = f(dist(q, y))
• Can distance computations be avoided? At least t collisions? Yes: compute dist(q, y). No: skip y.
• Set t such that with probability at least 1 − δ we do not disregard a point that could be among the nearest neighbors.
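The filter loop can be sketched like this (sketches are assumed precomputed, e.g. by SimHash; the threshold t comes from the analysis above, and the 64-bit sketch width is illustrative):

```python
def filtered_search(query_sketch, query, points, t, dist):
    """Compute the full distance to a point only if its 64-bit sketch
    collides with the query's sketch in at least t positions."""
    results = []
    for sketch, x in points:
        collisions = 64 - bin(sketch ^ query_sketch).count("1")
        if collisions >= t:      # promising candidate: pay for the real distance
            results.append((dist(query, x), x))
        # otherwise the expensive distance computation is skipped entirely
    return sorted(results)

# toy example: first point collides in all 64 positions, second in none
points = [(0, 1.0), ((1 << 64) - 1, 0.5)]
res = filtered_search(0, 0.0, points, 32, lambda q, x: abs(q - x))
print(res)  # -> [(1.0, 1.0)]; the second point was filtered out
```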
General Pipeline
Index generates candidates → Brute-force search on candidates
PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors
[A., Christiani, Pagh, Vesterli, 2019]
https://github.com/puffinn/puffinn
Credit: Richard Bartz
How does it work? Locality-Sensitive Hashing (LSH) [Indyk-Motwani, 1998]
• A family H of hash functions is locality-sensitive if the collision probability of two points decreases with their distance to each other.
• Example: concatenate hash values, h(x) = h_1(x) ∘ h_2(x) ∘ h_3(x) ∈ {0,1}^3
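A concrete locality-sensitive family for angular distance is the SimHash family from the sketching slide: one hash bit per random hyperplane. For two vectors at angle θ, the collision probability of one bit is 1 − θ/π, which decreases with distance. A quick empirical check (dimensions and trial count are arbitrary):

```python
import random

def simhash_bit(r, x):
    # one random-hyperplane hash: on which side of the hyperplane
    # with normal vector r does x lie?
    return 1 if sum(a * b for a, b in zip(r, x)) >= 0 else 0

random.seed(1)
d, trials = 25, 2000
x = [1.0] + [0.0] * (d - 1)
y = [0.0, 1.0] + [0.0] * (d - 2)   # orthogonal to x: angle pi/2
collisions = 0
for _ in range(trials):
    r = [random.gauss(0, 1) for _ in range(d)]
    collisions += simhash_bit(r, x) == simhash_bit(r, y)
# Pr[collision] = 1 - theta/pi; theta = pi/2 gives 1/2
print(collisions / trials)  # close to 0.5
```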
Solving k-NN using LSH (with failure probability δ)
• Dataset S; repetitions with hash functions h_1, h_2, …, h_L
• Termination: if (1 − p)^L ≤ δ, where p is the collision probability of the current k-th nearest neighbor, report the current top-k.
• Not terminated? Decrease the hash length K!
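The termination rule can be turned around to give the number of repetitions required for a target failure probability. A tiny sketch (p and δ are illustrative; repetitions are assumed independent):

```python
import math

def repetitions_needed(p, delta):
    """Smallest L with (1 - p)^L <= delta: after L independent repetitions,
    a point that collides with probability p per repetition is missed
    with probability at most delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p))

print(repetitions_needed(0.1, 0.01))  # -> 44
```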
The Data Structure
Theoretical:
• LSH Forest: each repetition is a trie built from LSH hash values [Bawa et al., 2005]
Practical:
• Store indices of data set points sorted by hash code
• "Traversing the trie" by binary search
• use a lookup table for the first levels
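"Traversing the trie by binary search" can be sketched as a range lookup over the sorted hash codes (integer codes and a bit-string prefix are assumed for illustration):

```python
from bisect import bisect_left, bisect_right

def points_with_prefix(sorted_codes, prefix, code_len):
    """Positions (in sorted order) of all hash codes that start with
    `prefix`; codes are integers of `code_len` bits, stored sorted."""
    shift = code_len - len(prefix)
    lo = int(prefix, 2) << shift           # smallest code with this prefix
    hi = (int(prefix, 2) + 1) << shift     # one past the largest
    return range(bisect_left(sorted_codes, lo),
                 bisect_right(sorted_codes, hi - 1))

codes = sorted([0b0010, 0b0111, 0b0100, 0b1101, 0b0101])
# sorted: [0010, 0100, 0101, 0111, 1101]; prefix "01" covers 0100..0111
print(list(points_with_prefix(codes, "01", 4)))  # -> [1, 2, 3]
```

Shortening the prefix widens the range, which matches descending fewer trie levels.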
Overall System Design
Running time (Glove 100d, 1.2M, 10-NN)
A difficult (?) data set in R^{3d}
• n data points: x_1 = (0^d, y_1, z_1), …, x_{n−1} = (0^d, y_{n−1}, z_{n−1}), x_n = (v, w, 0^d)
• m query points: q_1 = (v, 0^d, r_1), …, q_m = (v, 0^d, r_m)
• all components y_i, z_i, v, w, r_i ∼ N(0, 1/(2d))
Running time ("Difficult", 1M, 10-NN)
Graph-based Similarity Search
Building a Small World Graph
Refining a Small World Graph
Goal: Keep out-degree as small as possible (while maintaining "large-enough" in-degree)!
HNSW/ONNG: [Malkov et al., 2020], [Iwasaki et al., 2018]
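The query side of these graphs is a greedy beam search: repeatedly expand the closest unexplored candidate while keeping the best results seen so far. A minimal sketch (the graph, the beam width ef, and the toy 1-D distance are illustrative, not the HNSW implementation):

```python
import heapq

def greedy_search(graph, dist_to_q, start, ef):
    """Beam search on a proximity graph: expand the closest unexplored
    candidate, keep the ef best results seen so far."""
    visited = {start}
    d0 = dist_to_q(start)
    candidates = [(d0, start)]   # min-heap of frontier vertices
    best = [(-d0, start)]        # max-heap (negated) of current results
    while candidates:
        d, v = heapq.heappop(candidates)
        if d > -best[0][0] and len(best) >= ef:
            break                # no frontier vertex can improve the result
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                du = dist_to_q(u)
                if len(best) < ef or du < -best[0][0]:
                    heapq.heappush(candidates, (du, u))
                    heapq.heappush(best, (-du, u))
                    if len(best) > ef:
                        heapq.heappop(best)
    return sorted((-d, u) for d, u in best)

# toy line graph 0 - 1 - 2 - 3 with the query sitting at coordinate 3
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
res = greedy_search(graph, lambda v: abs(v - 3.0), 0, 2)
print(res)  # the search walks 0 -> 1 -> 2 -> 3 and returns vertices 3 and 2
```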
Running time (Glove 100d, 1.2M, 10-NN)
Open Problems: Nearest Neighbor Search
• Data-dependent LSH with guarantees?
• Theoretically sound small-world graphs?
• Multi-core implementations: good? [Malkov et al., 2020]
• Alternative ways of sketching data?
3. Similarity Search on the GPU, in External Memory, and in Distributed Settings
Nearest Neighbors on the GPU: FAISS [Johnson et al., 2017]
https://github.com/facebookresearch/faiss
• GPU setting
• Data structure is held in GPU memory
• Queries come in batches of, say, 10,000 queries at a time
• Results: http://ann-benchmarks.com/sift-128-euclidean_10_euclidean-batch.html
FAISS/2
• Data structure
• Run k-means with a large number of centroids
• Each data point is associated with its closest centroid
• Query
• Find the closest centroids to the query
• Return the k closest points among the points associated with these centroids
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
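The inverted-file scheme above in miniature (this is a sketch of the idea, not the FAISS API; the k-means step is replaced by fixed centroids, and n_probe is the number of centroid lists scanned per query):

```python
def dist(a, b):
    # squared Euclidean distance is enough for ranking
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(data, centroids):
    """Inverted file: assign every point to its closest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for idx, x in enumerate(data):
        c = min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))
        lists[c].append(idx)
    return lists

def query_ivf(lists, data, centroids, q, n_probe, k):
    # probe only the n_probe closest centroids, then brute-force their lists
    probed = sorted(range(len(centroids)),
                    key=lambda i: dist(q, centroids[i]))[:n_probe]
    cand = [idx for c in probed for idx in lists[c]]
    return sorted(cand, key=lambda idx: dist(q, data[idx]))[:k]

data = [[0.1], [0.2], [0.9], [1.1], [2.0]]
centroids = [[0.0], [1.0], [2.0]]   # stand-ins for k-means centroids
lists = build_ivf(data, centroids)
print(query_ivf(lists, data, centroids, [1.05], 1, 2))  # -> [3, 2]
```

Probing few lists trades recall for speed: points near a probed cluster's boundary may be missed.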
Nearest Neighbors on the GPU: GGNN [Groh et al., 2019]
Nearest Neighbors in External Memory [Subramanya et al., 2019]
• RAM: compressed vectors x̃_1, …, x̃_n from Product Quantization (32 bytes per vector)
• SSD: original vectors x_1, …, x_n (~400 bytes per vector)
Distributed Setting: Similarity Join
• Problem: given sets R and S of size n, and a similarity threshold λ, compute
  R ⋈_λ S = { (x, y) ∈ R × S ∣ sim(x, y) ≥ λ }
• Similarity measures: Jaccard similarity, cosine similarity
• Naive: O(n²) distance computations
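The naive baseline is the all-pairs scan; a sketch with Jaccard similarity on character sets (the record representation is illustrative):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def similarity_join(R, S, lam):
    """Naive O(|R| * |S|) similarity join: all pairs with Jaccard >= lam."""
    return [(x, y) for x in R for y in S
            if jaccard(set(x), set(y)) >= lam]

R = ["abc", "xyz"]
S = ["abd", "uvw"]
print(similarity_join(R, S, 0.5))  # -> [('abc', 'abd')]
```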
Scalability! But at what COST? [McSherry et al., 2015]
• Map-Reduce-based similarity join on a Hadoop cluster (12 nodes, 24 hyper-threads per node) [Fier et al., 2018]
• vs. a single core on a Xeon E5-2630v2 (2.60 GHz) [Mann et al., 2016]
Solved almost-optimally in the MPC model [Hu et al., 2019]
• Hash using LSH h_i: emit (x, h_i(x)) and (y, h_i(y))
• Join on hash values: pairs (x, y, h_i(x)) with h_i(x) = h_i(y)
• Similarity at least λ? Emit (x, y)
• But: O(n²) local work for distance computations!
Another approach: DANNY [A., Ceccarello, Pagh, 2020]
In preparation, https://github.com/cecca/danny
• LSH + sketching, Cartesian product, local candidate verification, emit/collect
• Implementation in Rust using timely dataflow: https://github.com/TimelyDataflow/timely-dataflow
Results
Roadmap
01 Similarity Search in High-Dimensions: Setup/Experimental Approach
02 Survey of state-of-the-art Nearest Neighbor Search algorithms
03 Similarity Search on the GPU, in external memory, and in distributed settings