Locality-Sensitive Hashing
Anil Maheshwari
School of Computer Science, Carleton University, Canada

Outline
1. Introduction
2. Similarity of Documents
3. LSH
4. Fingerprints
5. References

Objectives
How to efficiently find:
1. Similar documents among a collection of documents
2. Similar web pages among web pages
3. Similar fingerprints among a database of fingerprints
4. Similar sets among a collection of sets
5. Similar images from a database of images

Similarity of Documents

Problem Definition
Input: A collection of web pages.
Output: Report near-duplicate web pages.

k-shingles
A k-shingle is any sequence of k consecutive words that appears in the document.
Text document = "What is the likely date that the regular classes may resume in Ontario"
2-shingles: What is, is the, the likely, ..., in Ontario
3-shingles: What is the, is the likely, ..., resume in Ontario
In practice: 9-shingles for English text and 5-shingles for e-mails.

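The shingling step can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `k_shingles` and the choice to tokenize on whitespace are assumptions made here.

```python
def k_shingles(text: str, k: int) -> set:
    """Return the set of k-word shingles of `text`.

    Assumption: words are obtained by splitting on whitespace; each
    shingle is stored as a tuple of k consecutive words.
    """
    words = text.split()
    # A text with fewer than k words yields an empty set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


doc = "What is the likely date that the regular classes may resume in Ontario"
two_shingles = k_shingles(doc, 2)
three_shingles = k_shingles(doc, 3)

print(("What", "is") in two_shingles)                  # first 2-shingle of the slide
print(("resume", "in", "Ontario") in three_shingles)   # last 3-shingle of the slide
```

Storing shingles in a set discards repeated shingles, which matches the set-based view of documents used in the rest of the slides.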
Similarity between sets

Text document D → set S
1. Form all the k-shingles of D.
2. S is the collection of all k-shingles of D.

Jaccard Similarity
For a pair of sets S and T, the Jaccard similarity is defined as
SIM(S, T) = |S ∩ T| / |S ∪ T|

New Problem
Given a constant 0 ≤ s ≤ 1 and a collection of sets, find the pairs of sets in the collection with Jaccard similarity ≥ s.

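As a baseline for the new problem, the definition can be applied directly to every pair of sets. The sketch below is illustrative only: the names `jaccard` and `similar_pairs`, the example sets, and the convention that two empty sets have similarity 1 are all assumptions made here.

```python
from itertools import combinations


def jaccard(s: set, t: set) -> float:
    """Jaccard similarity |S ∩ T| / |S ∪ T|."""
    union = s | t
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(s & t) / len(union)


def similar_pairs(sets: dict, s: float) -> list:
    """Brute force: test every pair of named sets against the threshold s."""
    return [(a, b) for a, b in combinations(sets, 2)
            if jaccard(sets[a], sets[b]) >= s]


# Hypothetical example sets, not taken from the slides.
example = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {7, 8}}
print(jaccard(example["A"], example["B"]))  # 2 common elements out of 4
print(similar_pairs(example, 0.5))
```

This brute-force check inspects a number of pairs quadratic in the number of sets, which is the cost that locality-sensitive hashing is meant to avoid.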
Characteristic Matrix Representation of Sets

U = {Cruise, Ski, Resorts, Safari, Stay@Home}
S = {S_1, S_2, S_3, S_4}, where each S_i ⊆ U
e.g. S_1 = {Cruise, Safari} and S_2 = {Resorts}

Characteristic matrix for S:

             S_1  S_2  S_3  S_4
Cruise        1    0    0    1
Ski           0    0    1    0
Resorts       0    1    0    1
Safari        1    0    1    1
Stay@Home     0    0    1    0

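The characteristic matrix above can be built mechanically from the sets. In this sketch the function name `characteristic_matrix` is an assumption, and S_3 and S_4 are read off the columns of the matrix on the slide.

```python
universe = ["Cruise", "Ski", "Resorts", "Safari", "Stay@Home"]
sets = {
    "S1": {"Cruise", "Safari"},
    "S2": {"Resorts"},
    "S3": {"Ski", "Safari", "Stay@Home"},    # read off column S_3
    "S4": {"Cruise", "Resorts", "Safari"},   # read off column S_4
}


def characteristic_matrix(universe: list, sets: dict) -> list:
    """One row per element of the universe, one column per set (0/1 entries)."""
    names = list(sets)
    return [[1 if element in sets[name] else 0 for name in names]
            for element in universe]


matrix = characteristic_matrix(universe, sets)
for element, row in zip(universe, matrix):
    print(f"{element:<10}", *row)
```

For real shingle universes the matrix is very sparse, so it would normally be kept implicitly as the sets themselves rather than as a dense array.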
MinHash Signatures

Row  Element     S_1  S_2  S_3  S_4
0    Cruise       1    0    0    1
1    Ski          0    0    1    0
2    Resorts      0    1    0    1
3    Safari       1    0    1    1
4    Stay@Home    0    0    1    0

Permute rows π: 01234 → 40312 (row i moves to position π(i), so Cruise moves to position 4 and Ski to position 0)

Row  Element     S_1  S_2  S_3  S_4
0    Ski          0    0    1    0
1    Safari       1    0    1    1
2    Stay@Home    0    0    1    0
3    Resorts      0    1    0    1
4    Cruise       1    0    0    1

The minhash value h(S_i) is the index of the first row, in the permuted order, in which column S_i contains a 1.
MinHash signatures for π: h(S_1) = 1, h(S_2) = 3, h(S_3) = 0, and h(S_4) = 1

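The minhash computation for this permutation can be sketched as follows. The function name `minhash` and the list encoding of π (entry i is the new position of original row i) are assumptions made for the illustration; the code also assumes every column contains at least one 1.

```python
# Characteristic matrix from the slide: rows are Cruise, Ski, Resorts,
# Safari, Stay@Home; columns are S_1..S_4.
matrix = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
]

# pi: 01234 -> 40312, i.e. perm[i] is the new position of original row i.
perm = [4, 0, 3, 1, 2]


def minhash(matrix: list, perm: list) -> list:
    """For each column, the smallest permuted position of a row holding a 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [min(perm[r] for r in range(n_rows) if matrix[r][c] == 1)
            for c in range(n_cols)]


print(minhash(matrix, perm))  # [1, 3, 0, 1]
```

Taking the minimum of the permuted positions avoids physically reordering the rows, which is how the permutation is usually simulated in practice.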
Key Observation

Lemma
For any two sets S_i and S_j in a collection of sets S whose elements are drawn from the universe U, the probability, over a uniformly random permutation of the rows, that the minhash value h(S_i) equals h(S_j) is equal to the Jaccard similarity of S_i and S_j, i.e.,
Pr[h(S_i) = h(S_j)] = SIM(S_i, S_j) = |S_i ∩ S_j| / |S_i ∪ S_j|.

Row  Element     S_1  S_2  S_3  S_4
0    Ski          0    0    1    0
1    Safari       1    0    1    1
2    Stay@Home    0    0    1    0
3    Resorts      0    1    0    1
4    Cruise       1    0    0    1

Example: Pr[h(S_1) = h(S_4)] = SIM(S_1, S_4) = |S_1 ∩ S_4| / |S_1 ∪ S_4| = 2/3

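For a universe this small the lemma can be checked exhaustively by enumerating all 5! = 120 row permutations. This is a sketch for illustration only: the function name `collision_probability` and the encoding of sets as sets of original row indices (Cruise = 0, Ski = 1, Resorts = 2, Safari = 3, Stay@Home = 4) are assumptions made here.

```python
from fractions import Fraction
from itertools import permutations


def collision_probability(a_rows: set, b_rows: set, n_rows: int) -> Fraction:
    """Exact Pr[h(A) = h(B)] over all permutations of n_rows rows."""
    hits = total = 0
    for perm in permutations(range(n_rows)):
        total += 1
        # perm[r] is the permuted position of original row r.
        if min(perm[r] for r in a_rows) == min(perm[r] for r in b_rows):
            hits += 1
    return Fraction(hits, total)


s1_rows = {0, 3}       # S_1 = {Cruise, Safari}
s4_rows = {0, 2, 3}    # S_4 = {Cruise, Resorts, Safari}

print(collision_probability(s1_rows, s4_rows, 5))  # 2/3
```

The two minhash values coincide exactly when the first row of S_1 ∪ S_4 in permuted order belongs to S_1 ∩ S_4, which happens for 2 of the 3 equally likely choices.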
MinHash Signature Matrix

MinHash signature matrix for |S| = 11 sets with 12 hash functions:

S_1  S_2  S_3  S_4  S_5  S_6  S_7  S_8  S_9  S_10  S_11
 2    2    1    0    0    1    3    2    5    0     3
 1    3    2    0    2    2    1    4    2    1     2
 3    0    3    0    4    3    2    0    0    4     2
 0    4    3    1    5    3    3    2    3    5     4
 2    1    1    0    4    1    2    1    4    2     5
 4    2    1    0    5    2    3    2    3    5     4
 2    4    3    0    5    3    3    4    4    5     3
 0    2    4    1    3    4    3    2    2    2     4
 0    2    1    0    5    1    1    1    1    5     1
 0    5    1    0    2    1    3    2    1    5     4
 1    3    1    0    5    2    3    3    6    3     2
 0    5    2    1    5    1    2    2    6    5     4

LSH for MinHash

Partitioning of a signature matrix into b = 4 bands of r = 3 rows each.

Band  S_1  S_2  S_3  S_4  S_5  S_6  S_7  S_8  S_9  S_10  S_11
I      2    2    1    0    0    1    3    2    5    0     3
       1    3    2    0    2    2    1    4    2    1     2
       3    0    3    0    4    3    2    0    0    4     2
II     0    4    3    1    5    3    3    2    3    5     4
       2    1    1    0    4    1    2    1    4    2     5
       4    2    1    0    5    2    3    2    3    5     4
III    2    4    3    0    5    3    3    4    4    5     3
       0    2    4    1    3    4    3    2    2    2     4
       0    2    1    0    5    1    1    1    1    5     1
IV     0    5    1    0    2    1    3    2    1    5     4
       1    3    1    0    5    2    3    3    6    3     2
       0    5    2    1    5    1    2    2    6    5     4

Band I: {S_3, S_6} are hashed into the same bucket
Band III: {S_3, S_6, S_11} are hashed into the same bucket, and so are {S_8, S_9}
Band IV: {S_2, S_10} are hashed into the same bucket

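The banding step on this signature matrix can be sketched as below. The function name `candidate_pairs` is an assumption, and the sketch uses each band's column vector directly as the bucket key, so two columns share a bucket only when they agree in every row of the band; a real hash function could also place different vectors in the same bucket.

```python
from collections import defaultdict
from itertools import combinations

# Signature matrix from the slide: 12 rows (hash functions), 11 columns (S_1..S_11).
SIGNATURES = [
    [2, 2, 1, 0, 0, 1, 3, 2, 5, 0, 3],
    [1, 3, 2, 0, 2, 2, 1, 4, 2, 1, 2],
    [3, 0, 3, 0, 4, 3, 2, 0, 0, 4, 2],
    [0, 4, 3, 1, 5, 3, 3, 2, 3, 5, 4],
    [2, 1, 1, 0, 4, 1, 2, 1, 4, 2, 5],
    [4, 2, 1, 0, 5, 2, 3, 2, 3, 5, 4],
    [2, 4, 3, 0, 5, 3, 3, 4, 4, 5, 3],
    [0, 2, 4, 1, 3, 4, 3, 2, 2, 2, 4],
    [0, 2, 1, 0, 5, 1, 1, 1, 1, 5, 1],
    [0, 5, 1, 0, 2, 1, 3, 2, 1, 5, 4],
    [1, 3, 1, 0, 5, 2, 3, 3, 6, 3, 2],
    [0, 5, 2, 1, 5, 1, 2, 2, 6, 5, 4],
]


def candidate_pairs(sig: list, r: int) -> set:
    """Pairs of 1-based column indices that agree on all rows of some band."""
    pairs = set()
    for start in range(0, len(sig), r):
        buckets = defaultdict(list)
        for col in range(len(sig[0])):
            key = tuple(row[col] for row in sig[start:start + r])
            buckets[key].append(col + 1)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


print(sorted(candidate_pairs(SIGNATURES, 3)))
# [(2, 10), (3, 6), (3, 11), (6, 11), (8, 9)]
```

The result reproduces the buckets listed on the slide: {S_3, S_6} from band I, {S_3, S_6, S_11} and {S_8, S_9} from band III, and {S_2, S_10} from band IV, with band II contributing no candidates.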
Probability of finding similar sets

Lemma
Let s > 0 be the Jaccard similarity of two sets. The probability that the minhash signature matrix agrees in all the rows of at least one of the bands for these two sets is
f(s) = 1 − (1 − s^r)^b.

The banded signature matrix of the previous slide (b = 4 bands of r = 3 rows) illustrates the setting.

Proof

Claim: Pr(signatures agree in all the rows of at least one band for these two sets) = f(s) = 1 − (1 − s^r)^b

Proof (the rows use independent permutations, so agreements in different rows are independent events):
1. Pr(minhash values of the two sets are the same in any particular row) = s (key observation)
2. Pr(signatures agree in all r rows of one particular band) = s^r
3. Pr(signatures disagree in at least one row of this band) = 1 − s^r
4. Pr(signatures disagree in at least one row of each of the b bands) = (1 − s^r)^b
5. Pr(signatures agree in all the rows of at least one band) = f(s) = 1 − (1 − s^r)^b

Understanding f(s)

f(s) = 1 − (1 − s^r)^b for different values of s, b, and r:

(b, r)                      (4, 3)   (16, 4)  (20, 5)  (25, 5)  (100, 10)
s = 0.2                     0.0316   0.0252   0.0063   0.0079   0.0000
s = 0.4                     0.2324   0.3396   0.1860   0.2268   0.0104
s = 0.5                     0.4138   0.6439   0.4700   0.5478   0.0930
s = 0.6                     0.6221   0.8914   0.8019   0.8678   0.4547
s = 0.8                     0.9432   0.9997   0.9996   0.9999   0.9999
s = 1.0                     1.0      1.0      1.0      1.0      1.0
Threshold t = (1/b)^(1/r)   0.6299   0.5      0.5492   0.5253   0.6309

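The entries of this table follow directly from the two formulas and can be recomputed with a short script. The function names `f` and `threshold` below are chosen for this sketch; the printed values are rounded, so the last digit may differ slightly from the table.

```python
def f(s: float, r: int, b: int) -> float:
    """Probability that two sets of Jaccard similarity s become a candidate pair."""
    return 1 - (1 - s ** r) ** b


def threshold(r: int, b: int) -> float:
    """Approximate location of the steep rise of the S-curve: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)


configs = [(4, 3), (16, 4), (20, 5), (25, 5), (100, 10)]  # (b, r) pairs from the table

for s in (0.2, 0.4, 0.5, 0.6, 0.8, 1.0):
    print(f"s = {s:.1f}", *(f"{f(s, r, b):.4f}" for b, r in configs))
print("t      ", *(f"{threshold(r, b):.4f}" for b, r in configs))
```

Increasing r pushes the curve to the right (fewer false positives), while increasing b pushes it to the left (fewer false negatives), so the pair (b, r) is tuned to place t near the desired similarity cutoff.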
S-curve

[Figure: S-curves of f(s) = 1 − (1 − s^r)^b plotted against s, both axes ranging from 0 to 1, for (r, b) = (3, 4), (4, 16), (5, 20), (5, 25), and (10, 100).]

Comments on S-Curve

1. For what values of s is f''(s) = 0? At s = ((r − 1) / (br − 1))^(1/r).
2. For values of br ≫ 1, s ≈ (1/b)^(1/r).
3. The steepest slope occurs at s ≈ (1/b)^(1/r).
4. If the Jaccard similarity s of the two sets is above the threshold t = (1/b)^(1/r), the probability that they will be found potentially similar is very high.
5. Consider the entries in the row corresponding to s = 0.8 in the table: the values of f(0.8) are close to 1, since 0.8 > t in every column.

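The inflection point in the first comment can be recovered by differentiating f twice. The derivation below is a worked sketch for 0 < s < 1 and r ≥ 2; the final approximation replaces (r − 1)/(br − 1) by 1/b, which is rough for small r and improves as r grows.

```latex
\begin{align*}
f(s)   &= 1 - (1 - s^r)^b \\
f'(s)  &= b\,r\,s^{r-1}(1 - s^r)^{b-1} \\
f''(s) &= b\,r\,s^{r-2}(1 - s^r)^{b-2}
          \bigl[(r-1)(1 - s^r) - (b-1)\,r\,s^r\bigr] \\
f''(s) = 0
       &\iff (r-1) - s^r\,(br - 1) = 0 \\
       &\iff s = \left(\frac{r-1}{br-1}\right)^{1/r}
        \approx \left(\frac{1}{b}\right)^{1/r}
\end{align*}
```

For example, with b = 4 and r = 3 the exact inflection point is (2/11)^(1/3) ≈ 0.567, while the approximation gives (1/4)^(1/3) ≈ 0.630.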
Computational Summary

Input: Collection of m text documents of size D
k-shingles: size = kD
Characteristic matrix of size |U| × m, where U is the universe of all possible k-shingles
Signature matrix of size n × m using n permutations
⌈n/r⌉ bands, each consisting of r rows
Hash maps from bands to buckets
Output: All pairs of documents that fall into the same bucket in at least one band
Check whether the pairs correspond to similar documents! With the right choice of threshold, Pr(the pair is similar) → 1.

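The whole pipeline can be put together in a compact sketch. Everything here is an illustrative assumption rather than the author's implementation: the function names, the default parameters, the use of explicit random permutations of the observed shingle universe, band vectors used directly as bucket keys, and the requirement that every document has at least k words.

```python
import random
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int) -> set:
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0


def signature_matrix(sets: list, n: int, rng: random.Random) -> list:
    """n rows, one per random permutation of the observed shingle universe."""
    universe = sorted(set().union(*sets))
    sig = []
    for _ in range(n):
        order = list(range(len(universe)))
        rng.shuffle(order)
        position = dict(zip(universe, order))  # shingle -> permuted row index
        sig.append([min(position[x] for x in s) for s in sets])
    return sig


def lsh_candidates(sig: list, r: int) -> set:
    """Pairs of column indices that agree on every row of at least one band."""
    pairs = set()
    for start in range(0, len(sig), r):
        buckets = defaultdict(list)
        for col in range(len(sig[0])):
            buckets[tuple(row[col] for row in sig[start:start + r])].append(col)
        for cols in buckets.values():
            pairs.update(combinations(cols, 2))
    return pairs


def near_duplicates(docs: list, k: int = 2, n: int = 12, r: int = 3,
                    s: float = 0.5, seed: int = 0) -> list:
    sets = [shingles(d, k) for d in docs]
    sig = signature_matrix(sets, n, random.Random(seed))
    # Final check: keep only candidates whose true Jaccard similarity is >= s.
    return sorted((i, j) for i, j in lsh_candidates(sig, r)
                  if jaccard(sets[i], sets[j]) >= s)


docs = ["a b c d e", "a b c d e", "v w x y z"]
print(near_duplicates(docs))  # [(0, 1)]
```

Identical documents always share every band and are therefore always reported, while the final Jaccard check removes false positives; pairs with similarity near the threshold can still be missed, as described by f(s).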
Matching Fingerprints

Fingerprints consist of minutiae points and patterns that form ridges and bifurcations.

[Figure: fingerprint with labelled minutiae: bifurcations, ridge ending, ridge dot.]

Fingerprint with an overlay grid

Fingerprint mapped onto a normalized grid of cells.

[Figure: fingerprint with an overlay grid.]