Algorithm Engineering for High-Dimensional Similarity Search Problems
Martin Aumüller, IT University of Copenhagen


  1. Algorithm Engineering for High-Dimensional Similarity Search Problems
     Martin Aumüller, IT University of Copenhagen

  2. Roadmap
     01 Similarity Search in High Dimensions: Setup/Experimental Approach
     02 Survey of State-of-the-Art Nearest Neighbor Search Algorithms
     03 Similarity Search on the GPU, in External Memory, and in Distributed Settings

  3. 1. Similarity Search in High Dimensions: Setup/Experimental Approach

  4. k-Nearest Neighbor Problem
     • Preprocessing: build a data structure for a set S of n data points
     • Task: given a query point q, return the k closest points to q in S
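As a baseline, the k-NN task can be solved by a linear scan over the data set; a minimal sketch (numpy assumed, all names hypothetical):

```python
import numpy as np

def knn_brute_force(S, q, k):
    """Return the indices of the k points in S closest to q (Euclidean)."""
    dists = np.linalg.norm(S - q, axis=1)  # one distance per data point
    return np.argsort(dists)[:k]           # indices of the k smallest

S = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 1.5]])
q = np.array([0.1, 0.1])
print(knn_brute_force(S, q, 2))  # -> [0 1]
```

Everything that follows in the talk is about answering such queries faster than this linear scan.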

  5. Nearest neighbor search on words
     • GloVe: learning algorithm to find vector representations for words
     • GloVe.twitter dataset: 1.2M words, vectors trained from 2B tweets, 100 dimensions
     • Semantically similar words: nearest neighbor search on the vectors
     https://nlp.stanford.edu/projects/glove/
     Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

  6. GloVe Examples
     "sicily": sardinia, tuscany, dubrovnik, liguria, naples
     "algorithm": algorithms, optimization, approximation, iterative, computation
     "engineering": engineer, accounting, research, science, development
     $ grep -n "sicily" glove.twitter.27B.100d.txt
     118340:sicily -0.43731 -1.1003 0.93183 0.13311 0.17207 …

  7. Basic Setup
     • Data is described by high-dimensional feature vectors
     • Exact similarity search is difficult in high dimensions:
       data structures and algorithms suffer an exponential dependence
       on the dimensionality, in time, space, or both

  8. Why is exact NN difficult?
     • Choose n random points with coordinates from N(0, 1/d), for large d
     • Choose a random query point the same way
     • Nearest and furthest neighbor are then basically at the same
       (squared) distance, 2 ± O(1/√d)
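This concentration of distances is easy to reproduce; a small simulation under the setup above (sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 1_000
X = rng.normal(0, np.sqrt(1 / d), size=(n, d))  # n points, N(0, 1/d) coordinates
q = rng.normal(0, np.sqrt(1 / d), size=d)       # random query point

sq = ((X - q) ** 2).sum(axis=1)  # squared distances to the query
print(sq.min(), sq.mean(), sq.max())  # all close to 2
```

The nearest and the furthest of the 10,000 points differ only in a lower-order term, so any pruning rule based on distance comparisons has almost nothing to work with.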

  9. Performance on GloVe

  10. Difficulty measure for queries
      • Given a query q and the distances r_1, …, r_k to its k nearest neighbors, define
        LID(q) = −( (1/k) · Σ_{i=1}^{k} ln(r_i / r_k) )^{−1}
      Based on the concept of local intrinsic dimensionality [Houle, 2013]
      and its MLE estimator [Amsaleg et al., 2015]
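In code, the estimator is a direct transcription of the formula (assuming the k distances are given in increasing order):

```python
import math

def lid_mle(dists):
    """MLE of the local intrinsic dimensionality at a query, from the
    sorted distances r_1 <= ... <= r_k to its k nearest neighbors."""
    k, r_k = len(dists), dists[-1]
    s = sum(math.log(r / r_k) for r in dists)  # each term is <= 0
    return -k / s                              # = -((1/k) * s)^(-1)

# distances whose count grows like r^3 (local intrinsic dimension ~3)
dists = [(i / 10) ** (1 / 3) for i in range(1, 11)]
print(lid_mle(dists))  # roughly 3-4 for this small k
```

Queries with a high estimate have many points at nearly the nearest-neighbor distance and are exactly the ones the benchmarks below flag as difficult.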

  11. LID Distribution

  12. Results (GloVe, 10-NN, 1.2M points): easy / middle / difficult query sets
      http://ann-benchmarks.com/sisap19/faiss-ivf.html

  13. 2. State-of-the-Art Nearest Neighbor Search

  14. General Pipeline
      Index generates candidates → brute-force search on candidates

  15. Brute-force search
      • GloVe: 1.2M points, inner product as distance measure
      • Vectors x_1, …, x_n, 400 bytes each
      • 100 ms per scan, 4.2 GB/s throughput, CPU-bound
      Automatically SIMD-vectorized with clang -O3: https://godbolt.org/z/TJX68s
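The scan itself is a single pass over the data; a scaled-down sketch of the same computation in numpy (sizes reduced from the slide's 1.2M × 100):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100_000, 100)).astype(np.float32)  # 400 bytes/vector
q = rng.standard_normal(100).astype(np.float32)

scores = X @ q                 # inner product with every vector: one linear scan
best = int(np.argmax(scores))  # maximum inner product = most similar point
```

The cost is dominated by streaming the vectors through the CPU, which is why the slide reports throughput in GB/s rather than operations.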

  16. Manual vectorization (256-bit registers)
      https://gist.github.com/maumueller/720d0f71664bef694bd56b2aeff80b17
      • Parallel multiply of x and y, parallel add into a result register,
        then horizontal sum and cast to float
      • 25 ms per query, 16 GB/s (16.5 GB/s single-thread max on my laptop)
      • Memory-bound

  17. Brute-force on bit vectors
      • Another popular distance measure is Hamming distance:
        the number of positions in which two bit strings differ
      • Bits can be nicely packed into 64-bit words; the Hamming distance
        of two words is just the bitcount of their XOR
      • 1.3 ms per query (128 bits), 6 GB/s throughput
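The word-level routine is a one-liner; a sketch using Python integers as packed bit strings:

```python
def hamming(a, b):
    """Hamming distance: popcount of the XOR of two packed bit strings."""
    return bin(a ^ b).count("1")

print(hamming(0b1011100101, 0b0101110101))  # -> 4
print(hamming(0xFFFF, 0x0000))              # -> 16
```

In C/C++ the `bin(...).count("1")` step would be a single `popcount` instruction per 64-bit word, which is where the 6 GB/s figure comes from.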

  18. Sketching to avoid distance computations [Christiani, 2019]
      SimHash [Charikar, 2002], 1-Bit MinHash [König-Li, 2010]
      • Distance computations on bit vectors are faster than Euclidean
        distance/inner product
      • Their number can be reduced by storing compact sketch representations,
        e.g. q ↦ 1011100101 and y ↦ 0101110101
      • Easy to analyze: the number of sketch collisions is a sum of Bernoulli
        trials with Pr(bit collides) = f(dist(q, y))
      • Can the distance computation be avoided? At least t collisions?
        Yes: compute dist(q, y). No: skip y.
      • Set t such that with probability at least 1 − δ we don't disregard
        a point that could be among the nearest neighbors.
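A minimal sketch of such a filter using SimHash sign bits (the threshold t and all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, bits = 100, 64
H = rng.standard_normal((bits, d))  # random hyperplanes (SimHash)

def sketch(v):
    return H @ v > 0  # one sign bit per hyperplane

def maybe_close(q, y, t):
    """Spend an exact distance computation only if the sketches
    collide in at least t of the 64 positions."""
    return int((sketch(q) == sketch(y)).sum()) >= t

q = rng.standard_normal(d)
near = q + 0.01 * rng.standard_normal(d)  # nearly identical point
far = -q                                  # maximally distant direction
print(maybe_close(q, near, 48), maybe_close(q, far, 48))  # -> True False
```

A real implementation would pack the 64 sign bits into one machine word and compare sketches with the XOR/popcount routine from the previous slide.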

  19. General Pipeline
      Index generates candidates → brute-force search on candidates

  20. PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors
      [A., Christiani, Pagh, Vesterli, 2019]
      https://github.com/puffinn/puffinn
      Credit: Richard Bartz

  21. How does it work? Locality-Sensitive Hashing (LSH) [Indyk-Motwani, 1998]
      • Example: h(x) = h_1(x) ∘ h_2(x) ∘ h_3(x) ∈ {0,1}^3
      • A family H of hash functions is locality-sensitive if the collision
        probability of two points decreases with their distance to each other.
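A concrete instance of the h(x) = h_1(x) ∘ h_2(x) ∘ h_3(x) ∈ {0,1}^3 above, using random-hyperplane (SimHash) functions, which are locality-sensitive for angular distance:

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 100, 3
planes = rng.standard_normal((K, d))  # h_1, h_2, h_3: one hyperplane each

def h(x):
    """Concatenated hash h(x) in {0,1}^3: close points often share a bucket."""
    return tuple((planes @ x > 0).astype(int))

x = rng.standard_normal(d)
print(h(x) == h(x + 1e-6))                  # tiny perturbation: same bucket
print(h(-x) == tuple(1 - b for b in h(x)))  # antipodal point: all bits flip
```

Concatenating K functions makes buckets smaller and collisions more selective; the next slide controls the miss probability by repeating the whole construction L times.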

  22. Solving k-NN using LSH (with failure probability δ)
      • Dataset S, repetitions h_1, h_2, …, h_L
      • Let p be the collision probability of the current k-th nearest neighbor
      • Termination: if (1 − p)^L ≤ δ, report the current top-k.
      • Not terminated? Decrease the hash length K and continue!
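The termination rule can be read off directly: after L independent repetitions, a point colliding with probability p is missed in all of them with probability (1 − p)^L. A small helper (hypothetical name) showing how many repetitions the rule demands:

```python
import math

def repetitions_needed(p, delta):
    """Smallest L with (1 - p)^L <= delta, i.e. enough repetitions that a
    point colliding with prob. p is missed with prob. at most delta."""
    return math.ceil(math.log(delta) / math.log(1 - p))

print(repetitions_needed(0.5, 0.01))  # -> 7
print(repetitions_needed(0.1, 0.01))  # -> 44
```

Since p depends on the (unknown) distance of the true k-th neighbor, the adaptive query evaluates the condition with the collision probability of the current candidate and keeps probing until it holds.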

  23. The Data Structure
      Theoretical:
      • LSH Forest: each repetition is a trie built from the LSH hash values
        [Bawa et al., 2005]
      Practical:
      • Store indices of the data set points sorted by hash code
      • "Traverse the trie" by binary search; use a lookup table for the first levels
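The practical variant works because all points whose hash code starts with a given prefix form a contiguous run in sorted order, which is exactly the subtree of the corresponding trie node. A toy sketch with 4-bit codes (names hypothetical):

```python
import bisect

codes = sorted([0b0010, 0b0101, 0b0110, 0b0111, 0b1011, 0b1100])

def prefix_range(codes, prefix, plen, total_bits=4):
    """Half-open index range [lo, hi) of codes whose top plen bits equal
    prefix: "traversing the trie" via two binary searches."""
    shift = total_bits - plen
    lo = bisect.bisect_left(codes, prefix << shift)
    hi = bisect.bisect_left(codes, (prefix + 1) << shift)
    return lo, hi

print(prefix_range(codes, 0b01, 2))  # -> (1, 4): codes 0101, 0110, 0111
```

Shortening the prefix (moving up the trie) only widens the range, so decreasing the hash length during a query is a constant-time adjustment of the two bounds.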

  24. Overall System Design

  25. Running time (GloVe 100d, 1.2M, 10-NN)

  26. A difficult(?) data set in ℝ^{3d}
      n data points:  x_1 = (0^d, y_1, z_1), …, x_{n−1} = (0^d, y_{n−1}, z_{n−1}),
                      x_n = (u, v, 0^d)
      m query points: q_1 = (u, 0^d, r_1), …, q_m = (u, 0^d, r_m)
      with y_i, z_i, u, v, r_j ∼ N(0, 1/(2d))^d

  27. Running time ("Difficult", 1M, 10-NN)

  28. Graph-based Similarity Search

  29. Building a Small World Graph

  30. Refining a Small World Graph
      Goal: keep the out-degree as small as possible
      (while maintaining a "large-enough" in-degree)!
      HNSW/ONNG: [Malkov et al., 2020], [Iwasaki et al., 2018]
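Queries on such a graph are answered by a greedy walk towards the query point; a minimal sketch (real systems like HNSW keep a priority queue of candidates rather than a single current node):

```python
import numpy as np

def greedy_search(graph, X, q, start):
    """Greedy walk on a neighborhood graph: repeatedly move to the
    out-neighbor closest to q; stop at a local minimum."""
    cur = start
    while True:
        cand = min(graph[cur], key=lambda v: np.linalg.norm(X[v] - q))
        if np.linalg.norm(X[cand] - q) >= np.linalg.norm(X[cur] - q):
            return cur
        cur = cand

# toy 1-d example: a path graph 0 - 1 - 2 - 3
X = np.array([[0.0], [1.0], [2.0], [3.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, X, np.array([2.9]), 0))  # walks 0 -> 1 -> 2 -> 3
```

The refinement step in the slide matters exactly here: a small out-degree keeps each walk step cheap, while enough in-degree keeps every point reachable by such walks.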

  31. Running time (GloVe 100d, 1.2M, 10-NN)

  32. Open Problems: Nearest Neighbor Search
      • Data-dependent LSH with guarantees?
      • Theoretically sound small-world graphs?
      • Multi-core implementations
        • Good? [Malkov et al., 2020]
      • Alternative ways of sketching data?

  33. 3. Similarity Search on the GPU, in External Memory, and in Distributed Settings

  34. Nearest Neighbors on the GPU: FAISS [Johnson et al., 2017]
      https://github.com/facebookresearch/faiss
      • GPU setting: the data structure is held in GPU memory
      • Queries come in batches of, say, 10,000 queries at a time
      • Results: http://ann-benchmarks.com/sift-128-euclidean_10_euclidean-batch.html

  35. FAISS/2
      • Data structure: run k-means with a large number of centroids;
        each data point is associated with its closest centroid
      • Query: find the L closest centroids, then return the k closest points
        among the points associated with these centroids
      https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
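A toy sketch of this inverted-file (IVF) scheme; for brevity the "centroids" here are random samples where FAISS would run k-means, and all sizes are scaled down:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((2000, 16)).astype(np.float32)

C = X[rng.choice(len(X), 32, replace=False)]  # stand-in centroids
assign = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
lists = {c: np.where(assign == c)[0] for c in range(len(C))}  # inverted lists

def ivf_query(q, k=10, nprobe=4):
    """Scan only the inverted lists of the nprobe closest centroids."""
    probe = np.argsort(((C - q) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    dist = ((X[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(dist)[:k]]

res = ivf_query(rng.standard_normal(16).astype(np.float32))
print(len(res))
```

The probed lists cover only a small fraction of the data, trading a little recall for a large speedup; `nprobe` is the knob between the two.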

  36. Nearest Neighbors on the GPU: GGNN [Groh et al., 2019]

  37. Nearest Neighbors in External Memory [Subramanya et al., 2019]
      • RAM: compressed vectors x̃_1, …, x̃_n via Product Quantization
        (32 bytes per vector)
      • SSD: original vectors x_1, …, x_n (~400 bytes per vector)
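Product Quantization can be sketched in a few lines; here the codebooks are random for brevity (in practice they come from k-means per subspace), so the reconstruction is coarse, but the compression idea is the same:

```python
import numpy as np

rng = np.random.default_rng(5)
d, M, ks = 16, 4, 256  # 4 subspaces, 256 codewords each -> 1 byte per subspace
sub = d // M
books = rng.standard_normal((M, ks, sub)).astype(np.float32)  # codebooks

def pq_encode(x):
    """Compress x to M bytes: per subspace, the closest codeword's index."""
    parts = x.reshape(M, sub)
    return np.array([int(np.argmin(((books[m] - parts[m]) ** 2).sum(-1)))
                     for m in range(M)], dtype=np.uint8)

def pq_decode(code):
    """Approximate x by concatenating the chosen codewords."""
    return np.concatenate([books[m][code[m]] for m in range(M)])

x = rng.standard_normal(d).astype(np.float32)
code = pq_encode(x)
print(code.nbytes, pq_decode(code).shape)  # 64-byte vector stored in 4 bytes
```

The compressed codes are small enough to keep all n of them in RAM for candidate generation, and only the few most promising full vectors are fetched from SSD for exact re-ranking.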

  38. Distributed Setting: Similarity Join
      • Problem: given sets R and S of size n and a similarity threshold λ, compute
        R ⋈_λ S = { (x, y) ∈ R × S | sim(x, y) ≥ λ }
      • Similarity measures: Jaccard similarity, cosine similarity
      • Naive: O(n²) distance computations
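The naive algorithm is the obvious double loop; a minimal sketch with Jaccard similarity:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def similarity_join(R, S, lam, sim=jaccard):
    """Naive O(n^2) join: verify every pair in R x S."""
    return [(x, y) for x in R for y in S if sim(x, y) >= lam]

R = [frozenset("abc"), frozenset("xyz")]
S = [frozenset("abd"), frozenset("xy")]
out = similarity_join(R, S, 0.5)
print(len(out))  # -> 2 pairs: (abc, abd) and (xyz, xy)
```

The rest of this section is about beating this quadratic verification cost in distributed settings.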

  39. Scalability! But at what COST? [McSherry et al., 2015]
      • Map-Reduce-based similarity join on a Hadoop cluster
        (12 nodes, 24 HT per node) [Fier et al., 2018]
      • vs. a single core on a Xeon E5-2630v2 (2.60 GHz) [Mann et al., 2016]

  40. Solved almost-optimally in the MPC model [Hu et al., 2019]
      • Hash both sides using LSH: emit (x, h_i(x)) for x ∈ R
        and (y, h_i(y)) for y ∈ S
      • Hash join on the hash values; for each candidate pair,
        emit (x, y) if the similarity is at least λ
      • O(n²) local work for distance computations!
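The shape of the algorithm, sketched as a local group-by-hash join (using the minimum element of a set as a stand-in for one MinHash function; all names illustrative):

```python
from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b)

def lsh_hash_join(R, S, h, lam, sim=jaccard):
    """Group both sides by hash value, verify candidates per bucket only."""
    buckets = defaultdict(lambda: ([], []))
    for x in R:
        buckets[h(x)][0].append(x)
    for y in S:
        buckets[h(y)][1].append(y)
    return {(x, y) for xs, ys in buckets.values()
            for x in xs for y in ys if sim(x, y) >= lam}

R = [frozenset("abc"), frozenset("xyz")]
S = [frozenset("abd"), frozenset("xy")]
print(len(lsh_hash_join(R, S, h=min, lam=0.5)))
```

A single repetition can miss similar pairs that land in different buckets; the slide's algorithm repeats with several h_i, and its bottleneck is the quadratic verification work inside heavy buckets.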

  41. Another approach: DANNY [A., Ceccarello, Pagh, 2020]
      In preparation, https://github.com/cecca/danny
      • LSH + sketching, local candidate verification on the Cartesian product
      • Emit/collect the result pairs
      • Implementation in Rust using timely dataflow:
        https://github.com/TimelyDataflow/timely-dataflow

  42. Results

  43. Roadmap
      01 Similarity Search in High Dimensions: Setup/Experimental Approach
      02 Survey of State-of-the-Art Nearest Neighbor Search Algorithms
      03 Similarity Search on the GPU, in External Memory, and in Distributed Settings
