Locality Sensitive Hashing Scheme Based on p-Stable Distributions
Mayur Datar (Stanford), Nicole Immorlica (MIT), Piotr Indyk (MIT), Vahab Mirrokni (MIT)

(Streaming) Massive Data Sets ⇒ High Dimensional Vectors
• Massive data sets visualized as high dimensional vectors
• E.g. number of IP packets sent to address $i$ from IP address $j$: $v_j = \{ v_{j1}, v_{j2}, \ldots, v_{ji}, \ldots, v_{jN} \}$, dimensionality $N = 2^{32}$
• E.g. number of phone calls made from telephone number $j$ to telephone number $k$: $v_j = \{ v_{j1}, v_{j2}, \ldots, v_{jk}, \ldots, v_{jN'} \}$, dimensionality $N' = 10^9$
Mayur Datar. LSH Scheme based on p-Stable distributions

Update Model
• Vectors are constantly updated as per the cash register model
• Update element $(i, a)$ for vector $v$ changes it as follows: $v = \{ v_1, v_2, \ldots, (v_i + a), \ldots, v_N \}$
• Numerous high dimensional vectors, e.g. one vector per telephone customer (millions), one vector per IP address (millions), etc.
– Rows of a huge matrix
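The update rule above can be sketched in a few lines of Python. The sparse dictionary storage and the name `apply_update` are assumptions made for illustration; the slides only specify how an update element $(i, a)$ changes the vector.

```python
from collections import defaultdict

def apply_update(v, i, a):
    """Apply the update element (i, a): coordinate i of v grows by a."""
    v[i] += a
    return v

# Sparse storage (an assumption): only touched coordinates are kept,
# since the dimensionality N can be as large as 2**32.
v = defaultdict(int)
apply_update(v, 7, 3)    # three more packets or calls for index 7
apply_update(v, 7, 2)
apply_update(v, 42, 1)
```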

$l_p$ Norms
• $l_p(v) = \left( \sum_{i=1}^{N} |v_i|^p \right)^{1/p}$, e.g. $l_1$ norm (Manhattan), $l_2$ norm (Euclidean)
• $l_p$ norms usually computed over vector differences, e.g. $l_1(v_j - v_k)$, $l_2(v_j - v_k)$, $l_{0.005}(v_j - v_k)$ etc.
• What do $l_p$ norms capture?
– $l_1$ norm applied to telephone vectors: symmetric (multi)set difference between two customers
– $l_p$ norms for small values of $p$ (0.005): capture Hamming norms, distinct values [CDIM’02]
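A minimal sketch of the norm definition, using dense Python lists. The function names are illustrative. The last helper shows why a very small $p$ is related to the Hamming norm: the sum of $|v_i|^p$ (the $p$-th power of the norm, an interpretation assumed here) is close to the number of non-zero coordinates.

```python
def lp_norm(v, p):
    """l_p(v) = (sum_i |v_i|^p)^(1/p) for p > 0."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def lp_distance(u, v, p):
    """l_p norm of the difference u - v."""
    return lp_norm([x - y for x, y in zip(u, v)], p)

def sum_of_pth_powers(v, p):
    """sum_i |v_i|^p; for tiny p this approaches the count of non-zeros."""
    return sum(abs(x) ** p for x in v)

d1 = lp_distance([1, 2, 3], [4, 6, 3], 1)        # Manhattan distance
d2 = lp_distance([1, 2, 3], [4, 6, 3], 2)        # Euclidean distance
nonzeros = sum_of_pth_powers([3, -4, 0], 0.005)  # close to 2
```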

Proximity Queries
• Nearest Neighbor: given a query $q$, find the closest (smallest $l_p$ norm) point $p$
• Near Neighbor: given a query $q$ and distance $R$, find all (or most) points $p$ s.t. $l_p(p - q) \le R$
• Applications: classification, fraud detection etc. E.g. find cell phone customers whose calling pattern is similar to that of XYZ (UBL)

Approximate Nearest Neighbor
• Curse of dimensionality
• Error parameter $\epsilon$: find any point that is within $(1 + \epsilon)$ times the distance to the true nearest neighbor
[Figure: query q, true nearest neighbor p*, radii r and (1+ε)r]

Approximate Near Neighbor ($(R, \epsilon)$–PLEB)
• $B(c, R)$ denotes a ball of radius $R$ centered at $c$
• Given: radius $R$, error parameter $\epsilon$ and query point $q$:
– if there exists a data point $p$ s.t. $q \in B(p, R)$, return Yes and a point (or all points) $p'$ s.t. $q \in B(p', (1 + \epsilon) R)$,
– if $q \notin B(p, (1 + \epsilon) R)$ for all data points $p$, return No,
– if the closest data point to $q$ is at distance between $R$ and $R(1 + \epsilon)$, then return Yes or No
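To make the three cases concrete, here is a linear-scan reference that meets the specification. It is not the data structure from the talk, only a brute-force baseline; `pleb_query` and the `dist` callback are names chosen for illustration.

```python
def pleb_query(points, q, R, eps, dist):
    """Brute-force (R, eps)-PLEB: return a data point within (1 + eps) * R
    of q (the "Yes" answer), or None (the "No" answer)."""
    for p in points:
        if dist(p, q) <= (1 + eps) * R:
            return p
    return None

def l1(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))

points = [(0, 0), (10, 10)]
yes_answer = pleb_query(points, (1, 0), R=1.0, eps=0.5, dist=l1)
no_answer = pleb_query(points, (5, 5), R=1.0, eps=0.5, dist=l1)
```

When the closest point lies between $R$ and $(1 + \epsilon) R$, this scan happens to answer Yes, which the definition permits.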

Approximate Near Neighbor
• Useful problem formulation in itself
• Approximate nearest neighbor can be reduced to approximate near neighbor (binary search on $R$)
• Henceforth, we will concentrate on solving approximate near neighbor

Our contribution
• Data structure for the approximate near neighbor problem ($(R, \epsilon)$–PLEB)
• Small query time and update time, and easy to implement
• Works for $l_p$ norms for $0 < p \le 2$, in particular for $0 < p < 1$
• Earlier result ([IM’98]) worked for $l_1$, $l_2$ and Hamming norm
• Our technique improves the query time for the $l_2$ norm

Locality Sensitive Hashing (LSH) ([IM’98])
• Intuition: if two points are close (less than distance $r_1$) they hash to the same bucket with probability at least $p_1$. Else, if they are far (more than distance $r_2 > r_1$) they hash to the same bucket with probability no more than $p_2 < p_1$
• Formally: a family $H = \{ h : S \to U \}$ is called $(r_1, r_2, p_1, p_2)$-sensitive for distance function $D$ if for any $v, q \in S$
– if $v \in B(q, r_1)$ then $\Pr_H[h(q) = h(v)] \ge p_1$,
– if $v \notin B(q, r_2)$ then $\Pr_H[h(q) = h(v)] \le p_2$,
– $r_1 < r_2$, $p_1 > p_2$

Using LSH to solve $(R, \epsilon)$–PLEB ([IM’98])
• Let $c = 1 + \epsilon$
• Theorem. Suppose there is an $(R, cR, p_1, p_2)$-sensitive family $H$ for a distance measure $D$. Then there exists an algorithm for $(R, c)$-PLEB under measure $D$ which uses $O(dn + n^{1+\rho})$ space, with query time dominated by $O(n^{\rho})$ distance computations and $O(n^{\rho} \log_{1/p_2} n)$ evaluations of hash functions from $H$, where $\rho = \frac{\ln 1/p_1}{\ln 1/p_2}$
• Bottom line: design an LSH scheme with small $\rho$ for $l_p$ norms
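The exponent $\rho$ is easy to evaluate numerically. The sketch below also derives a number of hash functions per table, $k = \lceil \log_{1/p_2} n \rceil$, and a number of tables, $L = \lceil n^{\rho} \rceil$. These two parameter choices are the usual ones for this kind of scheme and are an assumption here (the slide states only the bounds), although $L \cdot k$ matches the stated $O(n^{\rho} \log_{1/p_2} n)$ hash evaluations.

```python
import math

def rho(p1, p2):
    """rho = ln(1/p1) / ln(1/p2)."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)

def lsh_parameters(n, p1, p2):
    """Assumed parameter choice: k hashes per table, L tables."""
    k = math.ceil(math.log(n) / math.log(1.0 / p2))
    L = math.ceil(n ** rho(p1, p2))
    return k, L

# Hypothetical numbers, for illustration only
k, L = lsh_parameters(n=10**6, p1=0.8, p2=0.6)
```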

Recap
• Proximity problems reduced to designing LSH schemes
• Design LSH schemes for $l_p$ norms with small $\rho$, update time etc.
• A family $H = \{ h : S \to U \}$ is called $(r_1, r_2, p_1, p_2)$-sensitive for distance function $D$ if for any $v, q \in S$
– if $v \in B(q, r_1)$ then $\Pr_H[h(q) = h(v)] \ge p_1$,
– if $v \notin B(q, r_2)$ then $\Pr_H[h(q) = h(v)] \le p_2$
• $r_1 = R = 1$, $r_2 = R(1 + \epsilon) = 1 + \epsilon = c$

p–Stable distributions
• $p$–stable distribution ($p \ge 0$): a distribution $D$ over $\Re$ s.t. for any
– $n$ real numbers $v_1 \ldots v_n$,
– i.i.d. variables $X_1 \ldots X_n$ with distribution $D$,
the r.v. $\sum_i v_i X_i$ has the same distribution as the variable $\left( \sum_i |v_i|^p \right)^{1/p} X = l_p(v) X$, where $X$ is a r.v. with distribution $D$
• E.g. the $p$–stable distr for $p = 1$ is the Cauchy distr, for $p = 2$ it is the Gaussian distr
• For $0 < p < 2$ there is a way to sample from a $p$–stable distribution given two uniform r.v.’s over $[0, 1]$ [Nol]
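One known way to turn two uniform variables into a symmetric stable sample is the Chambers-Mallows-Stuck formula, sketched below. Whether this is exactly the method meant by [Nol] is an assumption, and the scale convention may differ from other references: for $p = 2$ this formula yields a Gaussian with variance 2, and for $p = 1$ the standard Cauchy.

```python
import math
import random

def sample_p_stable(p, u1, u2):
    """Symmetric p-stable sample (0 < p <= 2) from uniforms u1, u2 in (0, 1),
    via the Chambers-Mallows-Stuck formula."""
    theta = math.pi * (u1 - 0.5)   # uniform on (-pi/2, pi/2)
    w = -math.log(u2)              # exponential with mean 1
    if p == 1:
        return math.tan(theta)     # Cauchy case
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))

rng = random.Random(0)
# 1 - random() lies in (0, 1], which keeps log(u2) finite
x = sample_p_stable(0.5, rng.random(), 1.0 - rng.random())
```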

How are p–Stable distributions useful?
• Consider a vector $X = \{ X_1, X_2, \ldots, X_N \}$, where each $X_i$ is drawn from a $p$–stable distr
• For any pair of vectors $a, b$: $a \cdot X - b \cdot X = (a - b) \cdot X$ (by linearity)
• Thus $a \cdot X - b \cdot X$ is distributed as $l_p(a - b) X'$, where $X'$ is a $p$–stable distr r.v.
• Using multiple independent $X$’s we can use $a \cdot X - b \cdot X$ to estimate $l_p(a - b)$ [Ind’01]
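As an illustration for $p = 1$: the absolute value of a standard Cauchy variable has median 1, so the median of $|a \cdot X - b \cdot X|$ over many independent Cauchy vectors $X$ estimates $l_1(a - b)$. The median estimator, the sample count and the seed below are choices made for this sketch and are not taken from the slides.

```python
import math
import random
import statistics

def cauchy(rng):
    """Standard Cauchy sample from one uniform."""
    return math.tan(math.pi * (rng.random() - 0.5))

def estimate_l1(a, b, m, rng):
    """Median of |a.X - b.X| over m independent Cauchy vectors X."""
    gaps = []
    for _ in range(m):
        X = [cauchy(rng) for _ in a]
        gaps.append(abs(sum((ai - bi) * xi for ai, bi, xi in zip(a, b, X))))
    return statistics.median(gaps)

rng = random.Random(0)
a = [3, 0, 5, 1]
b = [1, 2, 5, 4]
estimate = estimate_l1(a, b, 2001, rng)   # the true l_1(a - b) is 7
```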

How are p–Stable distributions useful?
• For a vector $a$, the dot product $a \cdot X$ projects it onto the real line
• For any pair of vectors $a, b$ these projections are “close” (w.h.p.) if $l_p(a - b)$ is “small” and “far” otherwise
• Divide the real line into segments of width $w$
• Each segment defines a hash bucket, i.e. vectors that project onto the same segment belong to the same bucket

Hashing (formal) definition
[Figure: real line divided into segments of width w]
• Consider $h_{a,b} \in H_w$, $h_{a,b}(v) : R^d \to N$
• $a$ is a $d$-dimensional random vector, each entry of which is drawn from a $p$-stable distr
• $b$ is a random real number chosen uniformly from $[0, w]$ (random shift)
• $h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor$
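A minimal sketch of one hash function from this family for $p = 2$, where the entries of $a$ are Gaussian. The names `lsh_hash` and `make_hash` are illustrative, and the random generator is passed in explicitly so that results are reproducible.

```python
import math
import random

def lsh_hash(v, a, b, w):
    """h_{a,b}(v) = floor((a . v + b) / w)."""
    return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

def make_hash(d, w, rng):
    """Draw a (Gaussian entries, p = 2) and the shift b, return h_{a,b}."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    return lambda v: lsh_hash(v, a, b, w)

h = make_hash(d=3, w=4.0, rng=random.Random(0))
bucket = h([1.0, 2.0, 3.0])
```

For other values of $p$, only the distribution used to draw the entries of `a` changes.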

Collision probabilities
[Figure: real line divided into segments of width w]
• Consider two vectors $v_1, v_2$ and let $\ell = l_p(v_1 - v_2)$
• Let $Y$ denote the distance between their projections onto the random vector $a$ ($Y$ is distributed as $\ell |X|$, where $X$ is a $p$-stable distr r.v.)
• If $Y > w$, $v_1, v_2$ will not collide
• If $Y \le w$, $v_1, v_2$ will collide with probability equal to $1 - Y/w$ (over the random shift $b$)
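The $1 - Y/w$ rule can be checked with a small simulation over the random shift $b$. Two projections at a fixed gap are bucketed repeatedly; the trial count and seed are arbitrary choices for this sketch.

```python
import math
import random

def same_bucket(x, y, b, w):
    """True if projections x and y fall in the same width-w segment."""
    return math.floor((x + b) / w) == math.floor((y + b) / w)

def collision_rate(gap, w, trials, rng):
    """Fraction of random shifts b for which projections 0 and gap collide."""
    hits = 0
    for _ in range(trials):
        b = rng.uniform(0.0, w)
        hits += same_bucket(0.0, gap, b, w)
    return hits / trials

rng = random.Random(1)
near = collision_rate(gap=1.0, w=4.0, trials=20000, rng=rng)  # about 1 - 1/4
far = collision_rate(gap=5.0, w=4.0, trials=20000, rng=rng)   # gap > w
```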

Collision probabilities
• $f_p(t)$: p.d.f. of the absolute value of a $p$-stable distribution
• $\ell = l_p(v_1 - v_2)$
• $\ell \le 1$: $\Pr[h_{a,b}(v_1) = h_{a,b}(v_2)] \ge p_1 = \int_0^w f_p(t) \left( 1 - \frac{t}{w} \right) dt$
• $\ell > 1 + \epsilon = c$: $\Pr[h_{a,b}(v_1) = h_{a,b}(v_2)] \le p_2 = \int_0^w \frac{1}{c} f_p\!\left( \frac{t}{c} \right) \left( 1 - \frac{t}{w} \right) dt$
• The $H_w$ hash family is $(r_1, r_2, p_1, p_2)$-sensitive for $r_1 = 1$, $r_2 = c$ and $p_1, p_2$ given as above
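When no closed form is at hand, the two integrals can be evaluated numerically. The sketch below uses a plain midpoint rule; the step count is an arbitrary choice, and the two densities are those of the absolute value of a standard Cauchy and a standard Gaussian variable.

```python
import math

def abs_cauchy_pdf(t):
    """p.d.f. of |X| for a standard Cauchy X (p = 1)."""
    return 2.0 / (math.pi * (1.0 + t * t))

def abs_gaussian_pdf(t):
    """p.d.f. of |X| for a standard Gaussian X (p = 2)."""
    return 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-t * t / 2.0)

def collision_probability(f, w, c=1.0, steps=20000):
    """Midpoint-rule value of int_0^w (1/c) f(t/c) (1 - t/w) dt."""
    h = w / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += f(t / c) / c * (1.0 - t / w)
    return total * h

p1 = collision_probability(abs_gaussian_pdf, w=4.0)          # c = 1
p2 = collision_probability(abs_gaussian_pdf, w=4.0, c=1.5)
```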

Special cases
• $p = 1$ (Cauchy distr): $f_p(t) = \frac{2}{\pi} \frac{1}{1 + t^2}$
• $p_2 = \frac{2 \tan^{-1}(w/c)}{\pi} - \frac{1}{\pi (w/c)} \ln\left( 1 + (w/c)^2 \right)$
• $p_1$ obtained by substituting $c = 1$
[Figure: p1 and p2 plotted against r for c = 1.5]
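The Cauchy closed form translates directly into code. The function name `p_cauchy` and the sample values of $w$ and $c$ are illustrative.

```python
import math

def p_cauchy(w, c=1.0):
    """Collision probability for p = 1 at distance c and segment width w."""
    r = w / c
    return 2.0 * math.atan(r) / math.pi - math.log(1.0 + r * r) / (math.pi * r)

p1 = p_cauchy(4.0)         # c = 1
p2 = p_cauchy(4.0, 1.5)    # c = 1.5, as in the plot
```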

Special cases
• $p = 2$ (Gaussian distr): $f_p(t) = \frac{2}{\sqrt{2\pi}} e^{-t^2/2}$
• $p_2 = 1 - 2\,\mathrm{norm}(-w/c) - \frac{2}{\sqrt{2\pi}\, w/c} \left( 1 - e^{-w^2/2c^2} \right)$, where $\mathrm{norm}(\cdot)$ is the c.d.f. of the standard Gaussian
• $p_1$ obtained by substituting $c = 1$
[Figure: p1 and p2 plotted against r for c = 1.5]
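The Gaussian closed form can be written with the standard library alone, using the identity $2\,\mathrm{norm}(-r) = \mathrm{erfc}(r/\sqrt{2})$. The function name and the sample values of $w$ and $c$ are illustrative.

```python
import math

def p_gaussian(w, c=1.0):
    """Collision probability for p = 2 at distance c and segment width w."""
    r = w / c
    two_norm_minus_r = math.erfc(r / math.sqrt(2.0))   # 2 * norm(-r)
    tail = 2.0 / (math.sqrt(2.0 * math.pi) * r) * (1.0 - math.exp(-r * r / 2.0))
    return 1.0 - two_norm_minus_r - tail

p1 = p_gaussian(4.0)         # c = 1
p2 = p_gaussian(4.0, 2.0)    # c = 2
```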

Comparison with previous scheme
• Previous hashing scheme for $p = 1, 2$ achieved $\rho = 1/c$
• Based on reduction to Hamming distance
• New scheme achieves smaller $\rho$ (than $1/c$) for $p = 2$
• Large constants and log factors for $p = 2$ in query time besides $n^{\rho}$
• Achieves $\rho = 1/c$ for $p = 1$
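The claim for $p = 2$ can be spot-checked by combining the Gaussian collision probability with $\rho = \ln(1/p_1) / \ln(1/p_2)$. The width $w = 4$ and the factor $c = 2$ below are arbitrary picks for this sketch, not values from the slides; at that point $\rho$ evaluates below $1/c = 0.5$.

```python
import math

def p_gaussian(w, c=1.0):
    """Collision probability for p = 2 at distance c and segment width w."""
    r = w / c
    return (1.0 - math.erfc(r / math.sqrt(2.0))
            - 2.0 / (math.sqrt(2.0 * math.pi) * r) * (1.0 - math.exp(-r * r / 2.0)))

def rho_l2(w, c):
    """rho = ln(1/p1) / ln(1/p2) with p1 at distance 1 and p2 at distance c."""
    return math.log(1.0 / p_gaussian(w)) / math.log(1.0 / p_gaussian(w, c))

c = 2.0
rho_value = rho_l2(4.0, c)
improves = rho_value < 1.0 / c
```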

ρ for p = 2
[Figure: ρ and 1/c plotted against the approximation factor c, for c from 1 to 10]

ρ for p = 1
[Figure: ρ and 1/c plotted against the approximation factor c, for c from 1 to 10]
