Locality Sensitive Hashing Scheme Based on p-Stable Distributions
Mayur Datar (Stanford), Nicole Immorlica (MIT), Piotr Indyk (MIT), Vahab Mirrokni (MIT)
(Streaming) Massive Data Sets ⇒ High Dimensional Vectors
• Massive data sets visualized as high dimensional vectors
• E.g. number of IP packets sent to address $i$ from IP address $j$: $v^j = \{ v^j_1, v^j_2, \ldots, v^j_i, \ldots, v^j_N \}$, dimensionality $= 2^{32}$
• E.g. number of phone calls made from telephone number $j$ to telephone number $k$: $v^j = \{ v^j_1, v^j_2, \ldots, v^j_k, \ldots, v^j_{N'} \}$, dimensionality $= 10^9$
Mayur Datar. LSH Scheme based on p-Stable distributions 1
Update Model
• Vectors are constantly updated as per the cash register model
• Update element $(i, a)$ for vector $v$ changes it as follows: $v = \{ v_1, v_2, \ldots, (v_i + a), \ldots, v_N \}$
• Numerous high dimensional vectors, e.g. one vector per telephone customer (millions), one vector per IP address (millions), etc.
• The vectors are the rows of a huge matrix
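The update rule above can be sketched in a few lines. This is a minimal illustration that assumes each vector is stored sparsely as a dictionary; the name `apply_update` and the sample updates are hypothetical and not from the talk.

```python
from collections import defaultdict

def apply_update(vector, i, a):
    """Cash-register update (i, a): add a to coordinate i of a sparse vector."""
    vector[i] += a
    return vector

# Hypothetical stream of updates for one customer's vector.
v = defaultdict(int)
for i, a in [(7, 1), (42, 3), (7, 2)]:
    apply_update(v, i, a)
```

Storing only the nonzero coordinates is one way to cope with dimensionalities such as $2^{32}$; other representations are equally possible.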
$l_p$ Norms
• $l_p(v) = \left( \sum_{i=1}^{N} |v_i|^p \right)^{1/p}$, e.g. $l_1$ norm (Manhattan), $l_2$ norm (Euclidean)
• $l_p$ norms are usually computed over vector differences, e.g. $l_1(v^j - v^k)$, $l_2(v^j - v^k)$, $l_{0.005}(v^j - v^k)$ etc.
• What do $l_p$ norms capture?
 – $l_1$ norm applied to telephone vectors: symmetric (multi)set difference between two customers
 – $l_p$ norms for small values of $p$ (0.005): capture Hamming norms, distinct values [CDIM'02]
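As a concrete reading of the definition, here is a small sketch of the $l_p$ norm and the induced distance between two vectors. The function names are illustrative only.

```python
def lp_norm(v, p):
    """Return (sum_i |v_i|^p)^(1/p) for p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def lp_distance(u, v, p):
    """l_p norm of the difference u - v (vectors of equal length)."""
    return lp_norm([a - b for a, b in zip(u, v)], p)
```

For example, `lp_distance([1, 1], [4, 5], 2)` is the Euclidean distance between the two points and `lp_norm([1, -2, 3], 1)` is the Manhattan norm.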
Proximity Queries
• Nearest Neighbor: given a query $q$, find the closest (smallest $l_p$ norm) point $p$
• Near Neighbor: given a query $q$ and distance $R$, find all (or most) points $p$ s.t. $l_p(p - q) \le R$
• Applications: classification, fraud detection etc. E.g. find cell phone customers whose calling pattern is similar to that of XYZ (UBL)
Approximate Nearest Neighbor
• Curse of dimensionality
• Error parameter $\epsilon$: find any point whose distance to the query is at most $(1+\epsilon)$ times the distance to the true nearest neighbor
[Figure: query $q$, true nearest neighbor $p^*$ at distance $r$, and the radius $(1+\epsilon) r$]
Approximate Near Neighbor ($(R, \epsilon)$-PLEB)
• $B(c, R)$ denotes a ball of radius $R$ centered at $c$
• Given: radius $R$, error parameter $\epsilon$ and query point $q$:
 – if there exists a data point $p$ s.t. $q \in B(p, R)$, return Yes and a point (or all points) $p'$ s.t. $q \in B(p', (1+\epsilon)R)$,
 – if $q \notin B(p, (1+\epsilon)R)$ for all data points $p$, return No,
 – if the closest data point to $q$ is at distance between $R$ and $R(1+\epsilon)$, then return Yes or No
Approximate Near Neighbor
• Useful problem formulation in itself
• Approximate nearest neighbor can be reduced to approximate near neighbor (binary search on $R$)
• Henceforth, we will concentrate on solving approximate near neighbor
Our contribution
• Data structure for the approximate near neighbor problem ($(R, \epsilon)$-PLEB)
• Small query time and update time, and easy to implement
• Works for $l_p$ norms with $0 < p \le 2$, in particular for $0 < p < 1$
• Earlier result ([IM'98]) worked for $l_1$, $l_2$ and Hamming norm
• Our technique improves the query time for the $l_2$ norm
Locality Sensitive Hashing (LSH) ([IM'98])
• Intuition: if two points are close (less than distance $r_1$) they hash to the same bucket with probability at least $p_1$. Else, if they are far (more than distance $r_2 > r_1$) they hash to the same bucket with probability no more than $p_2 < p_1$
• Formally: a family $H = \{ h : S \to U \}$ is called $(r_1, r_2, p_1, p_2)$-sensitive for distance function $D$ if for any $v, q \in S$
 – if $v \in B(q, r_1)$ then $\Pr_H[h(q) = h(v)] \ge p_1$,
 – if $v \notin B(q, r_2)$ then $\Pr_H[h(q) = h(v)] \le p_2$,
 – $r_1 < r_2$, $p_1 > p_2$
Using LSH to solve $(R, \epsilon)$-PLEB ([IM'98])
• Let $c = 1 + \epsilon$
Theorem. Suppose there is an $(R, cR, p_1, p_2)$-sensitive family $H$ for a distance measure $D$. Then there exists an algorithm for $(R, c)$-PLEB under measure $D$ which uses $O(dn + n^{1+\rho})$ space, with query time dominated by $O(n^{\rho})$ distance computations and $O(n^{\rho} \log_{1/p_2} n)$ evaluations of hash functions from $H$, where $\rho = \frac{\ln 1/p_1}{\ln 1/p_2}$
• Bottom line: design an LSH scheme with small $\rho$ for $l_p$ norms
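To make the role of $\rho$ concrete, the sketch below computes $\rho$ from $p_1$ and $p_2$. It also derives $k = \lceil \log_{1/p_2} n \rceil$ hash functions per table and $L = \lceil n^{\rho} \rceil$ tables. That parameter choice is the usual one in the [IM'98] analysis but is not stated on the slide, so treat it and the function names as assumptions.

```python
import math

def rho(p1, p2):
    """rho = ln(1/p1) / ln(1/p2), for 0 < p2 < p1 < 1."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)

def lsh_parameters(n, p1, p2):
    """Return (k, L): hash functions per table and number of tables.

    Assumed choice: k = ceil(log_{1/p2} n), L = ceil(n^rho).
    """
    k = math.ceil(math.log(n) / math.log(1.0 / p2))
    L = math.ceil(n ** rho(p1, p2))
    return k, L
```

With this choice a query evaluates about $L \cdot k$ hash functions, which matches the $O(n^{\rho} \log_{1/p_2} n)$ term in the theorem.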
Recap
• Proximity problems reduced to designing LSH schemes
• Design LSH schemes for $l_p$ norms with small $\rho$, update time etc.
• A family $H = \{ h : S \to U \}$ is called $(r_1, r_2, p_1, p_2)$-sensitive for distance function $D$ if for any $v, q \in S$
 – if $v \in B(q, r_1)$ then $\Pr_H[h(q) = h(v)] \ge p_1$,
 – if $v \notin B(q, r_2)$ then $\Pr_H[h(q) = h(v)] \le p_2$
• $r_1 = R = 1$, $r_2 = R(1+\epsilon) = 1 + \epsilon = c$
p-Stable distributions
• p-stable distribution ($p \ge 0$): a distribution $D$ over $\Re$ s.t. for any
 – $n$ real numbers $v_1 \ldots v_n$,
 – i.i.d. variables $X_1 \ldots X_n$ with distribution $D$,
 – the r.v. $\sum_i v_i X_i$ has the same distribution as the variable $\left( \sum_i |v_i|^p \right)^{1/p} X = l_p(v) X$, where $X$ is a r.v. with distribution $D$
• E.g. the p-stable distribution for $p = 1$ is the Cauchy distribution, for $p = 2$ it is the Gaussian distribution
• For $0 < p < 2$ there is a way to sample from a p-stable distribution given two uniform r.v.'s over $[0, 1]$ [Nol]
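One well-known way to turn two uniforms into a symmetric p-stable sample is the Chambers-Mallows-Stuck transform, sketched below. The slide only cites [Nol] and does not give a formula, so this particular transform and the function names are assumptions. The scale is a matter of convention: with this formula $p = 1$ yields a standard Cauchy sample and $p = 2$ yields a Gaussian with variance 2.

```python
import math
import random

def p_stable_from_uniforms(p, u1, u2):
    """Map two uniforms in (0, 1) to one symmetric p-stable sample, 0 < p <= 2."""
    theta = math.pi * (u1 - 0.5)      # uniform on (-pi/2, pi/2)
    w = -math.log(u2)                 # exponential with mean 1
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos(theta * (1.0 - p)) / w) ** ((1.0 - p) / p))

def sample_p_stable(p, rng):
    """Draw one sample using a random.Random-like generator."""
    u1, u2 = rng.random(), rng.random()
    while u1 == 0.0 or u2 == 0.0:     # keep both strictly inside (0, 1)
        u1, u2 = rng.random(), rng.random()
    return p_stable_from_uniforms(p, u1, u2)
```

A call such as `sample_p_stable(0.5, random.Random(0))` draws one sample for $p = 0.5$, a case with no simple closed-form density.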
How are p-Stable distributions useful?
• Consider a vector $X = \{ X_1, X_2, \ldots, X_N \}$, where each $X_i$ is drawn from a p-stable distribution
• For any pair of vectors $a, b$: $a \cdot X - b \cdot X = (a - b) \cdot X$ (by linearity)
• Thus $a \cdot X - b \cdot X$ is distributed as $l_p(a - b)\, X'$, where $X'$ is a p-stable r.v.
• Using multiple independent $X$'s we can use $a \cdot X - b \cdot X$ to estimate $l_p(a - b)$ [Ind'01]
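The estimation idea can be sketched for $p = 1$. The absolute value of a standard Cauchy variable has median 1, so the median of $|(a - b) \cdot X|$ over many independent Cauchy vectors $X$ approximates $l_1(a - b)$. Using the median as the estimator, the number of projections and the function name are assumptions made for this illustration.

```python
import math
import random
import statistics

def estimate_l1(a, b, num_projections=2000, seed=0):
    """Estimate l_1(a - b) from random Cauchy (1-stable) projections."""
    rng = random.Random(seed)
    diff = [x - y for x, y in zip(a, b)]
    projections = []
    for _ in range(num_projections):
        # One fresh Cauchy vector X per projection.
        X = [math.tan(math.pi * (rng.random() - 0.5)) for _ in diff]
        projections.append(abs(sum(d * x for d, x in zip(diff, X))))
    # |Cauchy| has median 1, so the sample median estimates l_1(a - b).
    return statistics.median(projections)
```

In a streaming setting the projections $a \cdot X$ would be maintained incrementally under updates rather than recomputed from the full vectors as done here.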
How are p-Stable distributions useful?
• For a vector $a$, the dot product $a \cdot X$ projects it onto the real line
• For any pair of vectors $a, b$ these projections are "close" (w.h.p.) if $l_p(a - b)$ is "small" and "far" otherwise
• Divide the real line into segments of width $w$
• Each segment defines a hash bucket, i.e. vectors that project onto the same segment belong to the same bucket
Hashing (formal) definition
[Figure: the real line divided into segments of width $w$, starting from 0 with shift $b$]
• Consider $h_{a,b} \in H_w$, $h_{a,b}(v) : R^d \to N$
• $a$ is a $d$-dimensional random vector, each entry of which is drawn from a p-stable distribution
• $b$ is a random real number chosen uniformly from $[0, w]$ (random shift)
• $h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor$
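The definition translates almost directly into code. The sketch below builds one hash function from a caller-supplied sampler for the stable distribution. The names `make_hash` and `gaussian` are illustrative, and the Gaussian sampler covers only the $p = 2$ case.

```python
import math
import random

def make_hash(d, w, sample_stable, rng):
    """Build h_{a,b}(v) = floor((a . v + b) / w) for d-dimensional inputs."""
    a = [sample_stable(rng) for _ in range(d)]   # p-stable entries
    b = rng.uniform(0.0, w)                      # random shift
    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
    return h

def gaussian(rng):
    """2-stable sampler (standard normal), for the l_2 case."""
    return rng.gauss(0.0, 1.0)
```

For example, `h = make_hash(3, 4.0, gaussian, random.Random(0))` gives a function that maps any 3-dimensional vector to an integer bucket index. Nearby vectors in $l_2$ tend to receive the same index.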
Collision probabilities
[Figure: the real line divided into segments of width $w$, starting from 0 with shift $b$]
• Consider two vectors $v_1, v_2$ and let $\ell = l_p(v_1 - v_2)$
• Let $Y$ denote the distance between their projections onto the random vector $a$ ($Y$ is distributed as $\ell |X|$, where $X$ is a p-stable r.v.)
• If $Y > w$, $v_1, v_2$ will not collide
• If $Y \le w$, $v_1, v_2$ will collide with probability equal to $1 - Y/w$ (over the random shift $b$)
Collision probabilities
• $f_p(t)$: p.d.f. of the absolute value of a p-stable distribution
• $\ell = l_p(v_1 - v_2)$
• $\ell \le 1$: $p_1 = \Pr[h_{a,b}(v_1) = h_{a,b}(v_2)] \ge \int_0^w f_p(t) \left( 1 - \frac{t}{w} \right) dt$
• $\ell > 1 + \epsilon = c$: $p_2 = \Pr[h_{a,b}(v_1) = h_{a,b}(v_2)] \le \int_0^w \frac{1}{c} f_p\!\left( \frac{t}{c} \right) \left( 1 - \frac{t}{w} \right) dt$
• The $H_w$ hash family is $(r_1, r_2, p_1, p_2)$-sensitive for $r_1 = 1$, $r_2 = c$ and $p_1, p_2$ given as above
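The integrals above can be evaluated numerically for any density $f_p$. The sketch below uses a plain midpoint rule on $\int_0^w \frac{1}{\ell} f_p(t/\ell)(1 - t/w)\,dt$, which gives the $p_1$ bound at $\ell = 1$ and the $p_2$ bound at $\ell = c$. The step count and function names are arbitrary choices for illustration.

```python
import math

def collision_probability(pdf_abs, ell, w, steps=20000):
    """Midpoint-rule estimate of int_0^w (1/ell) f_p(t/ell) (1 - t/w) dt."""
    dt = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += pdf_abs(t / ell) / ell * (1.0 - t / w)
    return total * dt

def cauchy_abs_pdf(t):
    """p.d.f. of |X| for a standard Cauchy X (the p = 1 case)."""
    return 2.0 / (math.pi * (1.0 + t * t))
```

Calling `collision_probability(cauchy_abs_pdf, 1.0, w)` and `collision_probability(cauchy_abs_pdf, c, w)` gives the two bounds for $p = 1$ at a chosen width $w$.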
Special cases
• $p = 1$ (Cauchy distribution): $f_p(t) = \frac{2}{\pi} \cdot \frac{1}{1 + t^2}$
• $p_2 = \frac{2 \tan^{-1}(w/c)}{\pi} - \frac{1}{\pi (w/c)} \ln\left( 1 + (w/c)^2 \right)$
• $p_1$ obtained by substituting $c = 1$
[Figure: $p_1$ and $p_2$ for $c = 1.5$, plotted against $r$ from 0 to 20]
Special cases
• $p = 2$ (Gaussian distribution): $f_p(t) = \frac{2}{\sqrt{2\pi}} e^{-t^2/2}$
• $p_2 = 1 - 2\,\mathrm{norm}(-w/c) - \frac{2}{\sqrt{2\pi}\, w/c} \left( 1 - e^{-(w^2 / 2c^2)} \right)$
• $p_1$ obtained by substituting $c = 1$
[Figure: $p_1$ and $p_2$ for $c = 1.5$, plotted against $r$ from 0 to 20]
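The closed forms for $p = 1$ and $p = 2$ are easy to evaluate directly. In the sketch below, $\mathrm{norm}(\cdot)$ is taken to be the standard normal c.d.f., computed via `math.erf`. The function names and the helper that forms $\rho = \ln(1/p_1) / \ln(1/p_2)$ from them are illustrative.

```python
import math

def p_cauchy(w, c=1.0):
    """Collision probability bound for p = 1; c = 1 gives p1."""
    r = w / c
    return 2.0 * math.atan(r) / math.pi - math.log(1.0 + r * r) / (math.pi * r)

def p_gauss(w, c=1.0):
    """Collision probability bound for p = 2; c = 1 gives p1."""
    r = w / c
    norm_cdf = 0.5 * (1.0 + math.erf(-r / math.sqrt(2.0)))   # norm(-w/c)
    tail = 2.0 / (math.sqrt(2.0 * math.pi) * r) * (1.0 - math.exp(-r * r / 2.0))
    return 1.0 - 2.0 * norm_cdf - tail

def rho_for(p_func, w, c):
    """rho = ln(1/p1) / ln(1/p2) for a given width w and factor c."""
    return math.log(1.0 / p_func(w)) / math.log(1.0 / p_func(w, c))
```

Since $\rho$ depends on the width $w$ as well as on $c$, one would typically scan a range of $w$ values and keep the one that minimizes `rho_for(p_gauss, w, c)`.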
Comparison with previous scheme
• Previous hashing scheme for $p = 1, 2$ achieved $\rho = 1/c$
• It was based on a reduction to Hamming distance
• New scheme achieves smaller $\rho$ (than $1/c$) for $p = 2$
• Large constants and log factors for $p = 2$ in query time besides $n^{\rho}$
• Achieves $\rho = 1/c$ for $p = 1$
$\rho$ for $p = 2$
[Figure: $\rho$ and $1/c$ plotted against the approximation factor $c$, for $c$ from 1 to 10]
$\rho$ for $p = 1$
[Figure: $\rho$ and $1/c$ plotted against the approximation factor $c$, for $c$ from 1 to 10]