Locality Sensitive Hashing Scheme Based on p-Stable Distributions

  1. Locality Sensitive Hashing Scheme Based on p-Stable Distributions
     Mayur Datar (Stanford), Nicole Immorlica (MIT), Piotr Indyk (MIT), Vahab Mirrokni (MIT)

  2. (Streaming) Massive Data Sets ⇒ High Dimensional Vectors
     • Massive data sets visualized as high dimensional vectors
     • E.g. number of IP packets sent to address i from IP address j: $v_j = \{v_{j,1}, v_{j,2}, \dots, v_{j,i}, \dots, v_{j,N}\}$, dimensionality $= 2^{32}$
     • E.g. number of phone calls made from telephone number j to telephone number k: $v_j = \{v_{j,1}, v_{j,2}, \dots, v_{j,k}, \dots, v_{j,N'}\}$, dimensionality $= 10^{9}$
     Mayur Datar. LSH Scheme based on p-Stable distributions

  3. Update Model
     • Vectors constantly updated as per the cash register model
     • Update element $(i, a)$ for vector $v$ changes it as follows: $v = \{v_1, v_2, \dots, (v_i + a), \dots, v_N\}$
     • Numerous high dimensional vectors, e.g. one vector per telephone customer (millions), one vector per IP address (millions), etc.: rows of a huge matrix
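
A minimal Python sketch of the update model above, assuming a sparse dictionary representation. The class name `StreamVector` and the `coefficient` callback are hypothetical, not from the slides. Because a dot product is linear, a running projection r · v can be kept current under an update (i, a) without touching any other coordinate.

```python
from collections import defaultdict


class StreamVector:
    """Sparse vector under cash-register updates, with one running projection."""

    def __init__(self, coefficient):
        # coefficient(i) returns r_i, the i-th entry of a fixed projection
        # vector r (hypothetical callback; r is never stored densely).
        self.coefficient = coefficient
        self.entries = defaultdict(float)  # coordinate index -> current value
        self.dot = 0.0                     # running value of r . v

    def update(self, i, a):
        # v_i <- v_i + a; by linearity, r . v <- r . v + a * r_i
        self.entries[i] += a
        self.dot += a * self.coefficient(i)


if __name__ == "__main__":
    v = StreamVector(lambda i: float(i))
    v.update(3, 2.0)
    v.update(3, 3.0)
    print(v.entries[3], v.dot)
```

Only the touched coordinate and the running projection change per update, which is what makes a small update time plausible in this setting.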

  4. $l_p$ Norms
     • $l_p(v) = \left(\sum_{i=1}^{N} |v_i|^p\right)^{1/p}$, e.g. $l_1$ norm (Manhattan), $l_2$ norm (Euclidean)
     • $l_p$ norms usually computed over vector differences, e.g. $l_1(v_j - v_k)$, $l_2(v_j - v_k)$, $l_{0.005}(v_j - v_k)$ etc.
     • What do $l_p$ norms capture?
       – $l_1$ norm applied to telephone vectors: symmetric (multi)set difference between two customers
       – $l_p$ norms for small values of $p$ (0.005): capture Hamming norms, distinct values [CDIM'02]
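
The definitions above can be sketched in a few lines of Python. The function names are illustrative. The helper `lp_power_sum` is an added assumption: it drops the outer 1/p root, since for very small p each nonzero coordinate contributes roughly 1 to the sum, which is one way to see why such norms track the number of differing coordinates.

```python
def lp_norm(v, p):
    """Return (sum_i |v_i|^p)^(1/p) for p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def lp_distance(u, v, p):
    """l_p norm of the difference u - v."""
    return lp_norm([a - b for a, b in zip(u, v)], p)


def lp_power_sum(v, p):
    """sum_i |v_i|^p without the outer root (illustrative helper)."""
    return sum(abs(x) ** p for x in v)


if __name__ == "__main__":
    u, v = [1, 2, 0, 7], [4, 6, 0, 7]
    print(lp_distance(u, v, 1), lp_distance(u, v, 2))
    # Two coordinates differ, so the small-p power sum is close to 2.
    print(lp_power_sum([a - b for a, b in zip(u, v)], 0.005))
```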

  5. Proximity Queries
     • Nearest Neighbor: given a query $q$, find the closest (smallest $l_p$ norm) point $p$
     • Near Neighbor: given a query $q$ and distance $R$, find all (or most) points $p$ s.t. $l_p(p - q) \le R$
     • Applications: classification, fraud detection etc. E.g. find cell phone customers whose calling pattern is similar to that of XYZ (UBL)

  6. Approximate Nearest Neighbor
     • Curse of dimensionality
     • Error parameter $\epsilon$: find any point that is within $(1 + \epsilon)$ times the distance from the true nearest neighbor
     [Figure: query q, its true nearest neighbor p* at distance r, and the radius (1+ε)r]

  7. Approximate Near Neighbor (($R, \epsilon$)-PLEB)
     • $B(c, R)$ denotes a ball of radius $R$ centered at $c$
     • Given: radius $R$, error parameter $\epsilon$ and query point $q$:
       – if there exists a data point $p$ s.t. $q \in B(p, R)$, return Yes and a point (or all points) $p'$ s.t. $q \in B(p', (1 + \epsilon)R)$,
       – if $q \notin B(p, (1 + \epsilon)R)$ for all data points $p$, return No,
       – if the closest data point to $q$ is at distance between $R$ and $R(1 + \epsilon)$, then return Yes or No
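
To make the three cases concrete, here is a brute-force reference in Python. It is only a linear scan that satisfies the contract, not the data structure the talk builds. The function name and return convention are assumptions for illustration.

```python
def lp_distance(u, v, p):
    """l_p norm of the difference u - v."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def pleb_bruteforce(points, q, R, eps, p=2):
    """Return (True, point) or (False, None) following the PLEB contract.

    Any point within (1 + eps) * R is an acceptable witness, so:
      - a point within R always yields Yes,
      - no point within (1 + eps) * R always yields No,
      - the in-between case may go either way (here it yields Yes).
    """
    for point in points:
        if lp_distance(point, q, p) <= (1.0 + eps) * R:
            return True, point
    return False, None


if __name__ == "__main__":
    data = [[0.0, 0.0], [3.0, 4.0]]
    print(pleb_bruteforce(data, [0.0, 1.0], R=1.0, eps=0.5))
    print(pleb_bruteforce(data, [10.0, 10.0], R=1.0, eps=0.5))
```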

  8. Approximate Near Neighbor
     • Useful problem formulation in itself
     • Approximate nearest neighbor can be reduced to approximate near neighbor (binary search on $R$)
     • Henceforth, we will concentrate on solving approximate near neighbor
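
A simplified Python sketch of the binary search on R mentioned above. It assumes the nearest distance lies in a known range [r_min, r_max] and searches over the geometric grid of radii r_min(1+ε)^k. The exact linear-scan oracle `near_neighbor` stands in for the approximate data structure, and all names here are hypothetical. The actual reduction in the literature is more involved.

```python
import math


def near_neighbor(points, q, R, dist):
    """Stand-in oracle: return any point within distance R of q, else None."""
    for point in points:
        if dist(point, q) <= R:
            return point
    return None


def approx_nearest(points, q, r_min, r_max, eps, dist):
    """Binary search for the smallest grid radius at which the oracle says Yes."""
    top = math.ceil(math.log(r_max / r_min) / math.log(1.0 + eps))
    lo, hi, best = 0, top, None
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = near_neighbor(points, q, r_min * (1.0 + eps) ** mid, dist)
        if hit is not None:
            best, hi = hit, mid - 1   # a smaller radius may still succeed
        else:
            lo = mid + 1              # need a larger radius
    return best


if __name__ == "__main__":
    def dist(a, b):
        return abs(a[0] - b[0])

    print(approx_nearest([[0.0], [10.0], [3.0]], [2.0], 0.01, 100.0, 0.5, dist))
```

With this exact oracle, the returned point is within a factor (1+ε) of the true nearest distance whenever that distance is at least r_min, using O(log log(r_max/r_min)) oracle calls up to the dependence on ε.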

  9. Our contribution
     • Data structure for the approximate near neighbor problem (($R, \epsilon$)-PLEB)
     • Small query time, small update time, and easy to implement
     • Works for $l_p$ norms, for $0 < p \le 2$; in particular $0 < p < 1$
     • Earlier result ([IM'98]) worked for $l_1$, $l_2$ and Hamming norm
     • Our technique improves the query time for the $l_2$ norm

  10. Locality Sensitive Hashing (LSH) ([IM'98])
      • Intuition: if two points are close (less than distance $r_1$) they hash to the same bucket with probability at least $p_1$. Else, if they are far (more than distance $r_2 > r_1$) they hash to the same bucket with probability no more than $p_2 < p_1$
      • Formally: a family $H = \{h : S \to U\}$ is called $(r_1, r_2, p_1, p_2)$-sensitive for distance function $D$ if for any $v, q \in S$
        – if $v \in B(q, r_1)$ then $\Pr_H[h(q) = h(v)] \ge p_1$,
        – if $v \notin B(q, r_2)$ then $\Pr_H[h(q) = h(v)] \le p_2$,
        – $r_1 < r_2$, $p_1 > p_2$

  11. Using LSH to solve ($R, \epsilon$)-PLEB ([IM'98])
      • Let $c = 1 + \epsilon$
      Theorem. Suppose there is a $(R, cR, p_1, p_2)$-sensitive family $H$ for a distance measure $D$. Then there exists an algorithm for $(R, c)$-PLEB under measure $D$ which uses $O(dn + n^{1+\rho})$ space, with query time dominated by $O(n^{\rho})$ distance computations and $O(n^{\rho} \log_{1/p_2} n)$ evaluations of hash functions from $H$, where $\rho = \frac{\ln 1/p_1}{\ln 1/p_2}$
      • Bottom line: design an LSH scheme with small $\rho$ for $l_p$ norms
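
The quantities in the theorem are easy to compute for given p1 and p2. The Python sketch below evaluates ρ and splits the n^ρ log_{1/p2} n hash evaluations into L = n^ρ tables of k = log_{1/p2} n concatenated hashes. That split is the usual construction and is an assumption here, since the slide states only the product.

```python
import math


def rho(p1, p2):
    """rho = ln(1/p1) / ln(1/p2), for 0 < p2 < p1 < 1."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)


def lsh_parameters(n, p1, p2):
    """Return (k, L): hashes per table and number of tables (assumed split)."""
    k = math.ceil(math.log(n) / math.log(1.0 / p2))  # log base 1/p2 of n
    L = math.ceil(n ** rho(p1, p2))                  # n^rho
    return k, L


if __name__ == "__main__":
    print(rho(0.5, 0.25))
    print(lsh_parameters(10_000, 0.5, 0.25))
```

A smaller ρ shrinks both the n^{1+ρ} space term and the n^ρ query term, which is why the rest of the talk focuses on the ratio of the two logarithms.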

  12. Recap
      • Proximity problems reduced to designing LSH schemes
      • Design LSH schemes for $l_p$ norms with small $\rho$, update time etc.
      • A family $H = \{h : S \to U\}$ is called $(r_1, r_2, p_1, p_2)$-sensitive for distance function $D$ if for any $v, q \in S$
        – if $v \in B(q, r_1)$ then $\Pr_H[h(q) = h(v)] \ge p_1$,
        – if $v \notin B(q, r_2)$ then $\Pr_H[h(q) = h(v)] \le p_2$
      • $r_1 = R = 1$, $r_2 = R(1 + \epsilon) = 1 + \epsilon = c$

  13. p-Stable distributions
      • p-stable distribution ($p \ge 0$): a distribution $D$ over $\Re$ s.t. for
        – $n$ real numbers $v_1 \dots v_n$,
        – i.i.d. variables $X_1 \dots X_n$ with distribution $D$,
        – the r.v. $\sum_i v_i X_i$ has the same distribution as the variable $\left(\sum_i |v_i|^p\right)^{1/p} X = l_p(v)\, X$, where $X$ is a r.v. with distribution $D$
      • E.g. the p-stable distribution for $p = 1$ is the Cauchy distribution, for $p = 2$ the Gaussian distribution
      • For $0 < p < 2$ there is a way to sample from a p-stable distribution given two uniform r.v.'s over $[0, 1]$ [Nol]
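
One known way to sample a symmetric p-stable variable from two uniforms is the Chambers-Mallows-Stuck formula, sketched below in Python. Whether this is the exact method the slide refers to is an assumption. In this parametrization p = 1 gives a standard Cauchy variable and p = 2 a Gaussian with variance 2.

```python
import math
import random
import statistics


def sample_p_stable(p, rng):
    """Draw one symmetric p-stable sample (0 < p <= 2) from two uniforms."""
    theta = math.pi * (rng.random() - 0.5)   # uniform on [-pi/2, pi/2)
    w = -math.log(1.0 - rng.random())        # exponential(1); zero only with negligible probability
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))


def median_abs(samples):
    """Median of absolute values, a scale estimate that tolerates heavy tails."""
    return statistics.median(abs(x) for x in samples)


if __name__ == "__main__":
    rng = random.Random(0)
    cauchy = [sample_p_stable(1.0, rng) for _ in range(20000)]
    gauss = [sample_p_stable(2.0, rng) for _ in range(20000)]
    print(median_abs(cauchy))                       # should be near 1 for Cauchy
    print(sum(x * x for x in gauss) / len(gauss))   # should be near 2
```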

  14. How are p-stable distributions useful?
      • Consider a vector $X = \{X_1, X_2, \dots, X_N\}$, where each $X_i$ is drawn from a p-stable distribution
      • For any pair of vectors $a, b$: $a \cdot X - b \cdot X = (a - b) \cdot X$ (by linearity)
      • Thus $a \cdot X - b \cdot X$ is distributed as $l_p(a - b)\, X'$, where $X'$ is a p-stable r.v.
      • Using multiple independent $X$'s we can use $a \cdot X - b \cdot X$ to estimate $l_p(a - b)$ [Ind'01]
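
As an illustration of the last bullet for p = 1, the Python sketch below stores k Cauchy projections per vector and estimates the l1 distance as the median of the absolute sketch differences (the median of |X| for a standard Cauchy X is 1). The function names and the choice of a median estimator are assumptions for this sketch.

```python
import math
import random
import statistics


def cauchy(rng):
    """Standard Cauchy sample via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def make_projections(d, k, rng):
    """k independent d-dimensional vectors with i.i.d. Cauchy entries."""
    return [[cauchy(rng) for _ in range(d)] for _ in range(k)]


def sketch(v, projections):
    """Dot product of v with each projection vector."""
    return [sum(x * vi for x, vi in zip(X, v)) for X in projections]


def estimate_l1(sketch_a, sketch_b):
    """Median of |a.X - b.X| over the k projections estimates l_1(a - b)."""
    return statistics.median(abs(sa - sb) for sa, sb in zip(sketch_a, sketch_b))


if __name__ == "__main__":
    rng = random.Random(7)
    proj = make_projections(4, 2001, rng)
    a, b = [1, 2, 3, 4], [2, 0, 3, 1]
    print(estimate_l1(sketch(a, proj), sketch(b, proj)))  # true l1 distance is 6
```

The sketches are computed once per vector, so the distance between any pair can be estimated from k numbers each rather than from the full vectors.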

  15. How are p-stable distributions useful?
      • For a vector $a$, the dot product $a \cdot X$ projects it onto the real line
      • For any pair of vectors $a, b$ these projections are "close" (w.h.p.) if $l_p(a - b)$ is "small" and "far" otherwise
      • Divide the real line into segments of width $w$
      • Each segment defines a hash bucket, i.e. vectors that project onto the same segment belong to the same bucket

  16. Hashing (formal) definition
      [Figure: real line divided into segments of width w, shifted by the offset b]
      • Consider $h_{a,b} \in H_w$, $h_{a,b}(v) : R^d \to N$
      • $a$ is a $d$-dimensional random vector, each entry of which is drawn from a p-stable distribution
      • $b$ is a random real number chosen uniformly from $[0, w]$ (random shift)
      • $h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor$
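
A minimal Python sketch of one hash function from this family, drawing a from the Gaussian distribution (the p = 2 case). The class name and constructor layout are assumptions. Note that the floor can be negative when the projection is, so in code the bucket index is simply an integer.

```python
import math
import random


class PStableHash:
    """h_{a,b}(v) = floor((a . v + b) / w)."""

    def __init__(self, a, b, w):
        self.a, self.b, self.w = list(a), b, w

    @classmethod
    def gaussian(cls, d, w, rng):
        """Draw a with i.i.d. N(0, 1) entries (2-stable) and b uniform on [0, w]."""
        a = [rng.gauss(0.0, 1.0) for _ in range(d)]
        b = rng.uniform(0.0, w)
        return cls(a, b, w)

    def __call__(self, v):
        projection = sum(ai * vi for ai, vi in zip(self.a, v))
        return math.floor((projection + self.b) / self.w)


if __name__ == "__main__":
    h = PStableHash.gaussian(d=3, w=4.0, rng=random.Random(0))
    print(h([1.0, 2.0, 3.0]), h([1.1, 2.0, 2.9]), h([40.0, -15.0, 8.0]))
```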

  17. Collision probabilities
      [Figure: real line divided into segments of width w, shifted by the offset b]
      • Consider two vectors $v_1, v_2$ and let $\ell = l_p(v_1 - v_2)$
      • Let $Y$ denote the distance between their projections onto the random vector $a$ ($Y$ is distributed as $\ell |X|$, where $X$ is a p-stable r.v.)
      • If $Y > w$, $v_1, v_2$ will not collide
      • If $Y \le w$, $v_1, v_2$ will collide with probability equal to $1 - Y/w$ (over the random shift $b$)
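
The effect of the random shift can be checked empirically. The Python sketch below fixes two projected values a distance Y apart and estimates, over random b, how often they land in the same segment. The function name and parameters are illustrative.

```python
import math
import random


def shift_collision_rate(x, gap, w, trials, rng):
    """Fraction of random shifts b for which x and x + gap share a bucket."""
    hits = 0
    for _ in range(trials):
        b = rng.uniform(0.0, w)
        if math.floor((x + b) / w) == math.floor((x + gap + b) / w):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(1)
    # Expect a rate near 1 - Y/w = 0.7 for Y = 0.3 and w = 1.
    print(shift_collision_rate(2.3, 0.3, 1.0, 20000, rng))
    # A gap larger than w never collides.
    print(shift_collision_rate(2.3, 1.5, 1.0, 20000, rng))
```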

  18. Collision probabilities
      • $f_p(t)$: p.d.f. of the absolute value of a p-stable distribution
      • $\ell = l_p(v_1 - v_2)$
      • $\ell \le 1$: $p_1 = \Pr[h_{a,b}(v_1) = h_{a,b}(v_2)] \ge \int_0^w f_p(t)\left(1 - \frac{t}{w}\right) dt$
      • $\ell > 1 + \epsilon = c$: $p_2 = \Pr[h_{a,b}(v_1) = h_{a,b}(v_2)] \le \int_0^w \frac{1}{c} f_p\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right) dt$
      • The $H_w$ hash family is $(r_1, r_2, p_1, p_2)$-sensitive for $r_1 = 1$, $r_2 = c$ and $p_1, p_2$ given as above
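
The two integrals can be evaluated numerically for any density of |X|. The Python sketch below uses a plain midpoint rule. The function names, the step count, and the two example densities (absolute Gaussian and absolute Cauchy) are choices made for illustration.

```python
import math


def gaussian_abs_pdf(t):
    """Density of |X| for X ~ N(0, 1)."""
    return 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-t * t / 2.0)


def cauchy_abs_pdf(t):
    """Density of |X| for X standard Cauchy."""
    return 2.0 / (math.pi * (1.0 + t * t))


def collision_probability(abs_pdf, w, c=1.0, steps=20000):
    """Midpoint-rule value of the integral of (1/c) f(t/c) (1 - t/w) over [0, w]."""
    h = w / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += abs_pdf(t / c) / c * (1.0 - t / w)
    return total * h


if __name__ == "__main__":
    w, c = 4.0, 1.5
    print(collision_probability(gaussian_abs_pdf, w), collision_probability(gaussian_abs_pdf, w, c))
    print(collision_probability(cauchy_abs_pdf, w), collision_probability(cauchy_abs_pdf, w, c))
```

Setting c = 1 gives the bound for p1, and any c > 1 gives the bound for p2, so one function covers both cases.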

  19. Special cases
      • $p = 1$ (Cauchy distribution): $f_p(t) = \frac{2}{\pi} \frac{1}{1 + t^2}$
      • $p_2 = \frac{2 \tan^{-1}(w/c)}{\pi} - \frac{1}{\pi (w/c)} \ln\left(1 + (w/c)^2\right)$
      • $p_1$ obtained by substituting $c = 1$
      [Plot: p1 and p2 against r, for c = 1.5]
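
The Cauchy closed form above translates directly into Python. The function name is an assumption, and the values of w and c in the usage lines are arbitrary examples rather than settings taken from the plot.

```python
import math


def cauchy_collision_probability(w, c=1.0):
    """Closed-form collision bound for p = 1 at bucket width w and distance c."""
    r = w / c
    return 2.0 * math.atan(r) / math.pi - math.log(1.0 + r * r) / (math.pi * r)


if __name__ == "__main__":
    w, c = 4.0, 1.5
    p1 = cauchy_collision_probability(w)       # substitute c = 1
    p2 = cauchy_collision_probability(w, c)
    print(p1, p2)
```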

  20. Special cases
      • $p = 2$ (Gaussian distribution): $f_p(t) = \frac{2}{\sqrt{2\pi}} e^{-t^2/2}$
      • $p_2 = 1 - 2\,\mathrm{norm}(-w/c) - \frac{2}{\sqrt{2\pi}\, w/c}\left(1 - e^{-w^2/(2c^2)}\right)$
      • $p_1$ obtained by substituting $c = 1$
      [Plot: p1 and p2 against r, for c = 1.5]

  21. Comparison with previous scheme
      • Previous hashing scheme for $p = 1, 2$ achieved $\rho = 1/c$
      • Based on reduction to Hamming distance
      • New scheme achieves smaller $\rho$ (than $1/c$) for $p = 2$
      • Large constants and log factors for $p = 2$ in query time besides $n^{\rho}$
      • Achieves $\rho = 1/c$ for $p = 1$
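
The claim for p = 2 can be checked numerically from the Gaussian closed form, where norm(·) is the standard normal c.d.f. The Python sketch below computes ρ = ln(1/p1)/ln(1/p2) and does a coarse grid search over the bucket width w. The grid range and step are arbitrary assumptions, not values from the talk.

```python
import math


def gaussian_collision_probability(w, c=1.0):
    """Closed-form collision bound for p = 2 at bucket width w and distance c."""
    r = w / c
    two_norm_tail = math.erfc(r / math.sqrt(2.0))   # equals 2 * norm(-r)
    return (1.0 - two_norm_tail
            - 2.0 / (math.sqrt(2.0 * math.pi) * r) * (1.0 - math.exp(-r * r / 2.0)))


def rho_gaussian(w, c):
    """rho = ln(1/p1) / ln(1/p2) with p1 at c = 1 and p2 at the given c."""
    p1 = gaussian_collision_probability(w, 1.0)
    p2 = gaussian_collision_probability(w, c)
    return math.log(1.0 / p1) / math.log(1.0 / p2)


def best_rho_gaussian(c, w_grid=None):
    """Return (rho, w) minimizing rho over a coarse grid of widths."""
    if w_grid is None:
        w_grid = [0.5 + 0.1 * k for k in range(96)]   # 0.5, 0.6, ..., 10.0
    return min((rho_gaussian(w, c), w) for w in w_grid)


if __name__ == "__main__":
    for c in (1.5, 2.0, 3.0):
        best, w = best_rho_gaussian(c)
        print(c, best, 1.0 / c, w)
```

Since ρ depends on w, the comparison with 1/c is only meaningful after choosing w, which is why the sketch searches over it rather than fixing a single value.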

  22. $\rho$ for $p = 2$
      [Plot: ρ and 1/c against the approximation factor c, for c from 1 to 10]

  23. $\rho$ for $p = 1$
      [Plot: ρ and 1/c against the approximation factor c, for c from 1 to 10]
