  1. Locality-Sensitive Hashing  CS 395T: Visual Recognition and Search  Marc Alban  Feb 22, 2008

  2. Nearest Neighbor  Given a query point q, return the point p closest to q.  Useful for finding similar objects in a database.  Brute force linear search is not practical for massive databases.

  3. The “Curse of Dimensionality”  For d < 10 to 20, data structures exist that require sublinear time and near-linear space to perform a NN search.  Time or space requirements grow exponentially in the dimension.  The dimensionality of images or documents is usually on the order of several hundred or more.  Brute force linear search is the best we can do.

  4. (r, ε)-Nearest Neighbor  An approximate nearest neighbor should suffice in most cases.  Definition: If for any query point q there exists a point p such that ||q − p|| ≤ r, then w.h.p. return a point p′ such that ||q − p′|| ≤ (1 + ε)r.

  5. Locality-Sensitive Hash Families  Definition: An LSH family H(c, r, P1, P2) has the following properties for any q, p ∈ S: 1. If ||p − q|| ≤ r, then Pr_H[h(p) = h(q)] ≥ P1. 2. If ||p − q|| ≥ cr, then Pr_H[h(p) = h(q)] ≤ P2.

  6. Hamming Space  Definition: Hamming space is the set of all 2^N binary strings of length N.  Definition: The Hamming distance between two equal-length binary strings is the number of positions at which the bits differ.  ||1011101, 1001001||_H = 2  ||1110101, 1111101||_H = 1
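The two example distances above can be checked in a few lines of Python (a minimal sketch; the bit strings are the ones from the slide):

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance: number of positions where two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("1011101", "1001001"))  # 2
print(hamming("1110101", "1111101"))  # 1
```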

  7. Hamming Space  Let a hashing family be defined as h_i(p) = p_i, where p_i is the i-th bit of p. Then Pr_H[h(p) ≠ h(q)] = ||p, q||_H / d and Pr_H[h(p) = h(q)] = 1 − ||p, q||_H / d.  Clearly, this family is locality sensitive.
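This bit-sampling family is easy to verify empirically. A sketch, assuming points are d-bit strings and h_i simply returns bit i (function names are mine, not from the slides):

```python
import random

def sample_hash(d, rng):
    """Pick h_i uniformly from the family {h_i(p) = p_i : i = 0..d-1}."""
    i = rng.randrange(d)
    return lambda p: p[i]

p, q = "1011101", "1001001"   # Hamming distance 2, dimension d = 7
rng = random.Random(0)
trials = 100_000
collisions = 0
for _ in range(trials):
    h = sample_hash(len(p), rng)      # draw a fresh h for each trial
    collisions += (h(p) == h(q))
rate = collisions / trials
print(rate)                           # close to 1 - 2/7 ≈ 0.714
```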

  8. k-bit LSH Functions  A k-bit locality-sensitive hash function (LSHF) is defined as g(p) = [h_1(p), h_2(p), …, h_k(p)]^T.  Each h_i is chosen randomly from H.  Each h_i results in a single bit.  Pr(similar points collide) ≥ P1^k  Pr(dissimilar points collide) ≤ P2^k
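A sketch of such a g over binary strings, concatenating k randomly sampled bit positions (names are mine, not from the slides):

```python
import random

def make_g(d, k, rng):
    """Concatenate k randomly chosen single-bit hashes into one k-bit key."""
    idx = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in idx)

rng = random.Random(1)
g = make_g(d=7, k=4, rng=rng)
key = g("1011101")
print(len(key))  # 4: one bit per h_i
```

Because the bit positions are fixed when g is constructed, the same point always maps to the same key, which is what makes the key usable as a hash-table index.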

  9. LSH Preprocessing  Each training example is entered into l hash tables indexed by independently constructed g_1, …, g_l.  Preprocessing space: O(lN)

  10. LSH Querying  For each hash table i, 1 ≤ i ≤ l, retrieve the bin indexed by g_i(q).  Perform a linear search on the union of the bins.
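Slides 9 and 10 together amount to a small index: build l tables keyed by independently drawn g_i, then take the union of the bins q hashes to and linearly scan it. A runnable sketch assuming binary-string points and the bit-sampling hashes of slide 7 (class and method names are mine):

```python
import random
from collections import defaultdict

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

class LSHIndex:
    def __init__(self, d, k, l, seed=0):
        rng = random.Random(seed)
        # l independently constructed k-bit functions g_1, ..., g_l,
        # each stored as its list of sampled bit positions
        self.gs = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _key(self, g, p):
        return tuple(p[i] for i in g)

    def insert(self, p):
        for g, table in zip(self.gs, self.tables):
            table[self._key(g, p)].append(p)

    def query(self, q):
        # Union of the bins q hashes to, then a linear search over that union.
        candidates = set()
        for g, table in zip(self.gs, self.tables):
            candidates.update(table.get(self._key(g, q), []))
        return min(candidates, key=lambda p: hamming(p, q), default=None)

index = LSHIndex(d=8, k=3, l=5)
for p in ["00000000", "11111111", "00001111", "11110000"]:
    index.insert(p)
print(index.query("00000001"))
```

A point always collides with itself in every table, so querying a stored point returns that point; for other queries the linear scan over the candidate union picks the closest candidate found.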

  11. Parameter Selection  Suppose we want to search at most B examples. Then setting k = log_{1/P2}(N/B) and l = (N/B)^{log(1/P1) / log(1/P2)} ensures that the search will succeed with high probability.
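As a sketch of what these formulas give in practice (the P1 = 0.9, P2 = 0.5 values are made-up illustrations, not from the slides; N and B echo the later experiments; I round up to integer parameters):

```python
import math

def lsh_params(N, B, p1, p2):
    """k = log_{1/p2}(N/B); l = (N/B)^(log(1/p1)/log(1/p2)), rounded up."""
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(N / B) / math.log(1 / p2))
    l = math.ceil((N / B) ** rho)
    return k, l

print(lsh_params(N=59_500, B=250, p1=0.9, p2=0.5))  # (8, 3)
```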

  12. Experiment 1  Compare LSH accuracy and performance to exact NN search. Examine the influence of:  k, the number of hash bits.  l, the number of hash tables.  B, the maximum search length.  Dataset:  59,500 20x20 patches taken from motorcycle images.  Represented as 400-dimensional column vectors.

  13. Hash Function  Convert the feature vectors into binary strings and use the Hamming hash functions.  Given a vector x ∈ N^d, we can create a unary representation for each element x_i.  Unary_C(x_i) = x_i 1's followed by (C − x_i) 0's, where C is the max coordinate over all points.  u(x) = Unary_C(x_1) … Unary_C(x_d)  Note that for any two points p, q: ||p, q|| = ||u(p), u(q)||_H, where ||·,·|| on the left is the L1 distance.
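The unary embedding is easy to sanity-check: the L1 distance between integer vectors equals the Hamming distance between their embeddings. A sketch with small made-up vectors:

```python
def unary(xi, C):
    """x_i ones followed by (C - x_i) zeros."""
    return "1" * xi + "0" * (C - xi)

def embed(x, C):
    return "".join(unary(xi, C) for xi in x)

def hamming(a, b):
    return sum(s != t for s, t in zip(a, b))

p, q = [3, 0, 5], [1, 2, 5]
C = 5                                          # max coordinate over all points
l1 = sum(abs(a - b) for a, b in zip(p, q))
print(l1, hamming(embed(p, C), embed(q, C)))   # 4 4
```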

  14. Example Query  l = 20, k = 24, B = ∞  Query = [image]  Examples searched: 7,722 of 59,500  Result = [image]  Actual NNs = [image]

  15. Average Search Length  Let B = ∞  [heat map: average search length (×1000) as a function of k and l]

  16. Average Search Length  Let B = ∞  More hash bits (k) result in shorter searches.  More hash tables (l) result in longer searches.  [heat map: average search length (×1000) as a function of k and l]

  17. Average Approximation Error  Let B = ∞  [heat map: average approximation error (about 1.04–1.11) as a function of k and l]

  18. Average Approximation Error  Let B = ∞  Over hashing can result in too few candidates to return a good approximation.  Over hashing can cause the algorithm to fail.  [heat map: average approximation error (about 1.04–1.11) as a function of k and l]

  19. Average Approximation Error  Let B = ∞  Over hashing can result in too few candidates to return a good approximation.  Over hashing can cause the algorithm to fail.  [heat map as on the previous slide, annotated with a contour at average search length = 8000]

  20. Average Approximation Error  Let B = 5500 ≈ N / ln N  [heat map: average approximation error (about 1.08–1.15) as a function of k and l]

  21. Average Approximation Error  Let B = 250 ≈ √N  [heat map: average approximation error (about 1.25–1.6) as a function of k and l]

  22. Experiment 2  Examine the effect of the approximation on the subjective quality of the results.  Dataset (D. Nistér and H. Stewénius, “Scalable Recognition with a Vocabulary Tree”):  2550 sets of 4 images, represented as a document-term matrix of visual words.

  23. Experiment 2: Issues  LSH requires a vector representation.  Not clear how to easily convert a bag-of-words representation into a vector one.  A binary vector with one bit per word's presence does not provide a good distance measure.  Each image differs from any other image in roughly the same number of words.  BoostMap?

  24. Conclusions  Approximate nearest neighbor search is necessary for very large, high-dimensional datasets.  LSH is a simple approach to aNN.  LSH requires a vector representation.  There is a clear relationship between search length and approximation error.

  25. Tools  Octave (MATLAB)  LSH Matlab Toolbox – http://www.cs.brown.edu/~gregory/code/lsh/  Python  Gnuplot

  26. References  'Fast Pose Estimation with Parameter-Sensitive Hashing' – Shakhnarovich et al.  'Similarity Search in High Dimensions via Hashing' – Gionis et al.  'Object Recognition Using Locality-Sensitive Hashing of Shape Contexts' – Andrea Frome and Jitendra Malik  'Nearest Neighbors in High-Dimensional Spaces', Handbook of Discrete and Computational Geometry – Piotr Indyk  Algorithms for Nearest Neighbor Search – http://simsearch.yury.name/tutorial.html  LSH Matlab Toolbox – http://www.cs.brown.edu/~gregory/code/lsh/
