Locality-Sensitive Hashing

Anil Maheshwari
anil@scs.carleton.ca
School of Computer Science, Carleton University, Canada

Outline

1. Introduction
2. Similarity of Documents
3. LSH
4. Metric Spaces
5. Sensitive Function Family
6. AND-OR Family
7. Fingerprints
8. References

Objectives

How to efficiently find:
1. Similar documents among a collection of documents
2. Similar web pages among web pages
3. Similar fingerprints among a database of fingerprints
4. Similar sets among a collection of sets
5. Similar images from a database of images
6. Similar vectors in higher dimensions

Similarity of Documents

Problem Definition
Input: A collection of web pages.
Output: Report near-duplicate web pages.

k-shingles
Any substring of k consecutive words that appears in the document.

Example
Text document = "What is the likely date that the regular classes may resume in Ontario"
2-shingles: What is, is the, the likely, ..., in Ontario
3-shingles: What is the, is the likely, ..., resume in Ontario

In practice: 9-shingles for English text and 5-shingles for e-mails.
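
A minimal sketch of word-level shingling for the example sentence above. The function name `word_shingles` is an assumption made for illustration, as is the choice to split on whitespace without normalising case or punctuation.

```python
def word_shingles(text: str, k: int) -> set[str]:
    """Return the set of k-shingles (runs of k consecutive words) of text."""
    words = text.split()  # assumption: words are separated by whitespace
    # A text with fewer than k words yields an empty set.
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


doc = "What is the likely date that the regular classes may resume in Ontario"
two_shingles = word_shingles(doc, 2)
three_shingles = word_shingles(doc, 3)

print(len(two_shingles), len(three_shingles))
print("in Ontario" in two_shingles, "resume in Ontario" in three_shingles)
```

Returning a set rather than a list matches the next step, where each document is represented by the set of its shingles.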

Similarity between sets

Text Document D → Set S
1. Form all the k-shingles of D.
2. S is the collection of all k-shingles of D.

Jaccard Similarity
For a pair of sets $S$ and $T$, the Jaccard similarity is defined as

$SIM(S, T) = \frac{|S \cap T|}{|S \cup T|}$
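
The definition translates directly into set operations. The sketch below is illustrative only: the name `jaccard` and the two small shingle sets are made up, and returning 1.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
def jaccard(s: set, t: set) -> float:
    """Jaccard similarity |S ∩ T| / |S ∪ T| of two sets."""
    union = s | t
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(s & t) / len(union)


# Hypothetical 2-shingle sets of two short documents.
a = {"what is", "is the", "the likely"}
b = {"is the", "the likely", "likely date"}

print(jaccard(a, b))  # 2 shared shingles out of 4 distinct ones
```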

Problem: Find Similar Sets

New Problem
Given a constant $0 \le s \le 1$ and a collection of sets $\mathcal{S}$, find the pairs of sets in $\mathcal{S}$ with Jaccard similarity $\ge s$.

Characteristic Matrix Representation of Sets

$U$ = { Cruise, Ski, Resorts, Safari, Stay@Home }
$\mathcal{S} = \{S_1, S_2, S_3, S_4\}$, where each $S_i \subseteq U$
e.g. $S_1$ = { Cruise, Safari } and $S_2$ = { Resorts }

Characteristic matrix for $\mathcal{S}$:

             S1  S2  S3  S4
Cruise        1   0   0   1
Ski           0   0   1   0
Resorts       0   1   0   1
Safari        1   0   1   1
Stay@Home     0   0   1   0
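
As a sketch, the characteristic matrix can be built from the sets themselves: one row per element of $U$, one column per set, with a 1 where the element belongs to the set. The function name `characteristic_matrix` is hypothetical, and $S_3$ and $S_4$ are read off the columns of the matrix above.

```python
universe = ["Cruise", "Ski", "Resorts", "Safari", "Stay@Home"]
sets = {
    "S1": {"Cruise", "Safari"},
    "S2": {"Resorts"},
    "S3": {"Ski", "Safari", "Stay@Home"},   # read off column S3
    "S4": {"Cruise", "Resorts", "Safari"},  # read off column S4
}


def characteristic_matrix(universe: list[str], sets: dict[str, set[str]]) -> list[list[int]]:
    """Rows follow the order of `universe`, columns the order of `sets`."""
    return [[1 if element in members else 0 for members in sets.values()]
            for element in universe]


matrix = characteristic_matrix(universe, sets)
for element, row in zip(universe, matrix):
    print(f"{element:<10}", *row)
```

In practice the matrix is very sparse, so it is usually kept implicit (as the sets) rather than materialised like this.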

MinHash Signatures

Original row order:

                S1  S2  S3  S4
0  Cruise        1   0   0   1
1  Ski           0   0   1   0
2  Resorts       0   1   0   1
3  Safari        1   0   1   1
4  Stay@Home     0   0   1   0

Permute rows $\pi$: 01234 → 40312

                S1  S2  S3  S4
0  Ski           0   0   1   0
1  Safari        1   0   1   1
2  Stay@Home     0   0   1   0
3  Resorts       0   1   0   1
4  Cruise        1   0   0   1

MinHash signatures for $\pi$ (index of the first row, in permuted order, that contains a 1):
$h(S_1) = 1$, $h(S_2) = 3$, $h(S_3) = 0$, and $h(S_4) = 1$
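
The minhash value of a set under a permutation is the smallest permuted position among its rows. The sketch below reproduces the four values for $\pi$; the name `minhash` and the encoding of each set as a set of original row indices are assumptions made for illustration.

```python
def minhash(rows: set[int], perm: list[int]) -> int:
    """perm[i] is the new position of original row i.

    Returns the smallest new position among the rows where the set has a 1.
    """
    return min(perm[i] for i in rows)


# pi: 01234 -> 40312, i.e. Cruise (row 0) moves to position 4, Ski (row 1) to 0, ...
perm = [4, 0, 3, 1, 2]

# Each set as the original row indices where its column holds a 1.
columns = {
    "S1": {0, 3},      # Cruise, Safari
    "S2": {2},         # Resorts
    "S3": {1, 3, 4},   # Ski, Safari, Stay@Home
    "S4": {0, 2, 3},   # Cruise, Resorts, Safari
}

signature = {name: minhash(rows, perm) for name, rows in columns.items()}
print(signature)  # {'S1': 1, 'S2': 3, 'S3': 0, 'S4': 1}
```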

Key Observation

Lemma
For any two sets $S_i$ and $S_j$ in a collection of sets $\mathcal{S}$ whose elements are drawn from the universe $U$, the probability that the minhash value $h(S_i)$ equals $h(S_j)$ is equal to the Jaccard similarity of $S_i$ and $S_j$, i.e.,

$\Pr[h(S_i) = h(S_j)] = SIM(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$

Example:

                S1  S2  S3  S4
0  Ski           0   0   1   0
1  Safari        1   0   1   1
2  Stay@Home     0   0   1   0
3  Resorts       0   1   0   1
4  Cruise        1   0   0   1

$\Pr[h(S_1) = h(S_4)] = SIM(S_1, S_4) = \frac{|S_1 \cap S_4|}{|S_1 \cup S_4|} = \frac{2}{3}$
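
For a universe this small the lemma can be checked exhaustively rather than by sampling. The sketch below enumerates all 5! = 120 row permutations and compares the fraction on which $S_1$ and $S_4$ receive the same minhash value with their Jaccard similarity. The helper `minhash` and the row-index encoding of the sets are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations


def minhash(rows: set[int], perm: tuple[int, ...]) -> int:
    """Smallest permuted position among the rows where the set has a 1."""
    return min(perm[i] for i in rows)


s1 = {0, 3}      # Cruise, Safari (original row indices)
s4 = {0, 2, 3}   # Cruise, Resorts, Safari

all_perms = list(permutations(range(5)))
agreements = sum(minhash(s1, p) == minhash(s4, p) for p in all_perms)

collision_probability = Fraction(agreements, len(all_perms))
jaccard_similarity = Fraction(len(s1 & s4), len(s1 | s4))

print(collision_probability, jaccard_similarity)  # 2/3 2/3
```

The two values agree because the row of $S_1 \cup S_4$ that lands first under a uniformly random permutation is equally likely to be any of its three elements, and the minhash values coincide exactly when that row lies in $S_1 \cap S_4$.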

MinHash Signature Matrix

MinHash signature matrix for $|\mathcal{S}| = 11$ sets with 12 hash functions:

S1  S2  S3  S4  S5  S6  S7  S8  S9  S10  S11
 2   2   1   0   0   1   3   2   5   0    3
 1   3   2   0   2   2   1   4   2   1    2
 3   0   3   0   4   3   2   0   0   4    2
 0   4   3   1   5   3   3   2   3   5    4
 2   1   1   0   4   1   2   1   4   2    5
 4   2   1   0   5   2   3   2   3   5    4
 2   4   3   0   5   3   3   4   4   5    3
 0   2   4   1   3   4   3   2   2   2    4
 0   2   1   0   5   1   1   1   1   5    1
 0   5   1   0   2   1   3   2   1   5    4
 1   3   1   0   5   2   3   3   6   3    2
 0   5   2   1   5   1   2   2   6   5    4

LSH for MinHash

Partitioning of a signature matrix into b = 4 bands of r = 3 rows each.

Band   S1  S2  S3  S4  S5  S6  S7  S8  S9  S10  S11
        2   2   1   0   0   1   3   2   5   0    3
I       1   3   2   0   2   2   1   4   2   1    2
        3   0   3   0   4   3   2   0   0   4    2

        0   4   3   1   5   3   3   2   3   5    4
II      2   1   1   0   4   1   2   1   4   2    5
        4   2   1   0   5   2   3   2   3   5    4

        2   4   3   0   5   3   3   4   4   5    3
III     0   2   4   1   3   4   3   2   2   2    4
        0   2   1   0   5   1   1   1   1   5    1

        0   5   1   0   2   1   3   2   1   5    4
IV      1   3   1   0   5   2   3   3   6   3    2
        0   5   2   1   5   1   2   2   6   5    4

Band III: { S3, S6, S11 } are hashed into the same bucket, and so are { S8, S9 }.
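
The banding step can be sketched as follows. This is an illustration under stated assumptions: the function name `band_buckets` is hypothetical, and a Python dictionary keyed by the exact r-tuple of a column stands in for the hash table of buckets, so two columns share a bucket only when they agree in every row of the band (a real hash table could also produce accidental collisions).

```python
signature = [
    [2, 2, 1, 0, 0, 1, 3, 2, 5, 0, 3],
    [1, 3, 2, 0, 2, 2, 1, 4, 2, 1, 2],
    [3, 0, 3, 0, 4, 3, 2, 0, 0, 4, 2],
    [0, 4, 3, 1, 5, 3, 3, 2, 3, 5, 4],
    [2, 1, 1, 0, 4, 1, 2, 1, 4, 2, 5],
    [4, 2, 1, 0, 5, 2, 3, 2, 3, 5, 4],
    [2, 4, 3, 0, 5, 3, 3, 4, 4, 5, 3],
    [0, 2, 4, 1, 3, 4, 3, 2, 2, 2, 4],
    [0, 2, 1, 0, 5, 1, 1, 1, 1, 5, 1],
    [0, 5, 1, 0, 2, 1, 3, 2, 1, 5, 4],
    [1, 3, 1, 0, 5, 2, 3, 3, 6, 3, 2],
    [0, 5, 2, 1, 5, 1, 2, 2, 6, 5, 4],
]


def band_buckets(signature: list[list[int]], r: int) -> list[list[list[str]]]:
    """For each band of r rows, list the groups of sets whose columns agree on the whole band."""
    n_sets = len(signature[0])
    result = []
    for start in range(0, len(signature), r):
        band = signature[start:start + r]
        buckets: dict[tuple[int, ...], list[str]] = {}
        for j in range(n_sets):
            key = tuple(row[j] for row in band)  # the column of set j within this band
            buckets.setdefault(key, []).append(f"S{j + 1}")
        # Only buckets holding two or more sets produce candidate pairs.
        result.append([members for members in buckets.values() if len(members) > 1])
    return result


buckets_per_band = band_buckets(signature, r=3)
print(buckets_per_band[2])  # Band III: [['S3', 'S6', 'S11'], ['S8', 'S9']]
```

Every pair of sets that shares a bucket in at least one band becomes a candidate pair, to be verified against the actual Jaccard similarity.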

Probability of finding similar sets

Lemma
Let $s > 0$ be the Jaccard similarity of two sets. The probability that the MinHash signature matrix agrees in all the rows of at least one of the bands for these two sets is

$f(s) = 1 - (1 - s^r)^b$

Proof

Claim: For sets $S_i$ and $S_j$ with Jaccard similarity $s$,
Pr(signatures agree in all rows of at least one band) $= f(s) = 1 - (1 - s^r)^b$.
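
The slide states only the claim. One standard way to argue it is sketched below, assuming that the $n = br$ minhash functions are chosen independently, so that agreement in different rows of the signature matrix are independent events, each of probability $s$ by the Key Observation lemma.

```latex
\begin{align*}
\Pr[\text{agree in one given row}] &= s \\
\Pr[\text{agree in all } r \text{ rows of a given band}] &= s^{r} \\
\Pr[\text{disagree in at least one row of a given band}] &= 1 - s^{r} \\
\Pr[\text{disagree in at least one row of every band}] &= (1 - s^{r})^{b} \\
\Pr[\text{agree in all rows of at least one band}] &= 1 - (1 - s^{r})^{b} = f(s)
\end{align*}
```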

Understanding f(s)

$f(s) = 1 - (1 - s^r)^b$ for different values of $s$, $b$, and $r$:

(b, r)                     (4, 3)   (16, 4)   (20, 5)   (25, 5)   (100, 10)
s = 0.2                    0.0316   0.0252    0.0063    0.0079    0.0000
s = 0.4                    0.2324   0.3396    0.1860    0.2268    0.0104
s = 0.5                    0.4138   0.6439    0.4700    0.5478    0.0930
s = 0.6                    0.6221   0.8914    0.8019    0.8678    0.4547
s = 0.8                    0.9432   0.9997    0.9996    0.9999    0.9999
s = 1.0                    1.0      1.0       1.0       1.0       1.0
Threshold t = (1/b)^(1/r)  0.6299   0.5       0.5492    0.5253    0.6309
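
The table entries can be recomputed from the two formulas. The sketch below uses hypothetical function names `candidate_probability` and `threshold`. Its output is rounded to four decimals, so a few entries may differ from the table in the last digit.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """f(s) = 1 - (1 - s^r)^b: chance that two sets agree on all rows of some band."""
    return 1 - (1 - s ** r) ** b


def threshold(r: int, b: int) -> float:
    """t = (1/b)^(1/r): approximate similarity at which the S-curve rises most steeply."""
    return (1 / b) ** (1 / r)


configurations = [(4, 3), (16, 4), (20, 5), (25, 5), (100, 10)]  # (b, r) pairs

for s in (0.2, 0.4, 0.5, 0.6, 0.8, 1.0):
    row = [f"{candidate_probability(s, r, b):.4f}" for b, r in configurations]
    print(f"s = {s}:", *row)

print("t:", *[f"{threshold(r, b):.4f}" for b, r in configurations])
```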

S-curve

[Figure: plot of $f(s) = 1 - (1 - s^r)^b$ against $s$, both axes ranging from 0 to 1, with one curve for each of (r = 3, b = 4), (r = 4, b = 16), (r = 5, b = 20), (r = 5, b = 25), and (r = 10, b = 100).]

Comments on S-Curve

1. For what values of $s$ is $f''(s) = 0$? At $s = \left(\frac{r-1}{br-1}\right)^{1/r}$.
2. For values of $br \gg 1$, $s \approx \left(\frac{1}{b}\right)^{1/r}$.
3. The steepest slope occurs at $s \approx (1/b)^{1/r}$.
4. If the Jaccard similarity $s$ of the two sets is above the threshold $t = \left(\frac{1}{b}\right)^{1/r}$, the probability that they will be found potentially similar is very high.
5. Consider the row for $s = 0.8$ in the table of $f(s)$ values: every $(b, r)$ pair there has threshold $t < 0.8$, and $f(0.8)$ is close to 1 in each case.
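
The inflection point in item 1 follows from differentiating $f$ twice. The working below is a sketch of that calculation for $0 < s < 1$; the final line makes explicit the extra factor that is dropped in the approximation of item 2.

```latex
\begin{align*}
f(s)   &= 1 - (1 - s^{r})^{b} \\
f'(s)  &= b\,r\,s^{r-1}(1 - s^{r})^{b-1} \\
f''(s) &= b\,r\,s^{r-2}(1 - s^{r})^{b-2}\bigl[(r-1)(1 - s^{r}) - r(b-1)s^{r}\bigr] \\
       &= b\,r\,s^{r-2}(1 - s^{r})^{b-2}\bigl[(r-1) - (br-1)s^{r}\bigr] \\
f''(s) = 0 \;&\Longrightarrow\; s^{r} = \frac{r-1}{br-1}
  \;\Longrightarrow\; s = \left(\frac{r-1}{br-1}\right)^{1/r}
  = \left(\frac{1}{b}\right)^{1/r}\left(\frac{r-1}{r-\frac{1}{b}}\right)^{1/r}
\end{align*}
```

The second factor in the last expression approaches 1 as $r$ and $b$ grow, which gives $s \approx (1/b)^{1/r}$.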

Computational Summary

Input: Collection of m text documents of size D
k-shingles: Size = kD
Characteristic matrix of size $|U| \times m$, where $U$ is the universe of all possible k-shingles
Signature matrix of size $n \times m$ using n permutations
$\lceil n/r \rceil$ bands, each consisting of r rows
Hash maps from bands to buckets
Output: All pairs of documents that are in the same bucket corresponding to a band
Check whether the pairs correspond to similar documents!
With the right choice of threshold, Pr(the pair is similar) → 1
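
The steps of the summary can be strung together in a short end-to-end sketch. Everything specific here is an assumption made for illustration: the function names, the default parameters, the use of explicit random permutations of the shingle universe (workable only for small inputs), and a dictionary keyed by exact band tuples in place of a hash table.

```python
import random
from itertools import combinations


def word_shingles(text: str, k: int) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def lsh_similar_pairs(docs, k=2, r=2, b=10, s=0.5, seed=0):
    """Return index pairs (i, j), i < j, that share a bucket and have similarity >= s."""
    rng = random.Random(seed)
    sets = [word_shingles(d, k) for d in docs]
    universe = sorted(set().union(*sets))

    # Signature matrix: n = b * r rows, one random permutation of the universe per row.
    signature = []
    for _ in range(b * r):
        order = universe[:]
        rng.shuffle(order)
        position = {shingle: i for i, shingle in enumerate(order)}
        signature.append([min((position[x] for x in st), default=len(universe))
                          for st in sets])

    # Banding: columns that agree on all r rows of a band land in the same bucket.
    candidates = set()
    for band in range(b):
        buckets = {}
        for j in range(len(docs)):
            key = tuple(signature[band * r + i][j] for i in range(r))
            buckets.setdefault(key, []).append(j)
        for members in buckets.values():
            candidates.update(combinations(members, 2))

    # Verification: keep only candidate pairs that really are similar.
    return sorted(p for p in candidates if jaccard(sets[p[0]], sets[p[1]]) >= s)


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "locality sensitive hashing finds near duplicate documents",
]
print(lsh_similar_pairs(docs))  # [(0, 1)]
```

The final verification step removes false positives among the candidate pairs. Pairs that never share a bucket (false negatives) are not recovered, which is why the choice of b and r relative to the threshold matters.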