Duplicate Detection and Similarity Computing
UCSB 290N, 2013. Tao Yang
Some slides are from the textbook [CMS] and Rajaraman/Ullman.

Table of Contents
• Motivation
• Shingling for duplicate comparison
• Minhashing
• LSH

Applications of Duplicate Detection and Similarity Computing
• Duplicate and near-duplicate documents occur in many situations
– Copies, versions, plagiarism, spam, mirror sites
– Over 30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70%
• Duplicates consume significant resources during crawling, indexing, and search
– Little value to most users
• Similar query suggestions
• Advertisement: coalition and spam detection

Duplicate Detection
• Exact duplicate detection is relatively easy
– Content fingerprints: MD5, cyclic redundancy check (CRC)
• Checksum techniques
– A checksum is a value computed from the content of the document, e.g., the sum of the bytes in the document file
– It is possible for files with different text to have the same checksum
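The contrast between a weak checksum and a content fingerprint can be illustrated with a short sketch (the byte-sum checksum and the use of MD5 here are minimal illustrations, not the exact implementations the slides refer to). Anagrams have the same multiset of bytes, so a byte-sum checksum cannot tell them apart, while MD5 does:

```python
import hashlib

def byte_checksum(data: bytes) -> int:
    """Naive checksum: sum of all bytes (mod 2**32)."""
    return sum(data) % 2**32

def md5_fingerprint(data: bytes) -> str:
    """Content fingerprint using MD5."""
    return hashlib.md5(data).hexdigest()

# Two different texts with the same byte-sum checksum (anagrams):
a, b = b"listen", b"silent"
print(byte_checksum(a) == byte_checksum(b))      # True: checksum collision
print(md5_fingerprint(a) == md5_fingerprint(b))  # False: MD5 distinguishes them
```

This is why checksums detect accidental corruption well but are too weak for duplicate detection on their own.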
Near-Duplicate Detection
• More challenging task
– Are web pages with the same text content but different advertising or format near-duplicates?
• Near-duplication: approximate match
– Compute syntactic similarity with an edit-distance measure
– Use a similarity threshold to detect near-duplicates
– E.g., similarity > 80% => documents are "near duplicates"
– Not transitive, though sometimes used transitively
• SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Near-Duplicate Detection: Two Tasks
• Search: find near-duplicates of a document D
– O(N) comparisons required
• Discovery: find all pairs of near-duplicate documents in the collection
– All-pair document comparison: O(N^2) comparisons
• IR techniques are effective for the search scenario
• For discovery, other techniques are used to generate compact representations

Two Techniques for Computing Similarity
1. Shingling: convert documents, emails, etc., to fingerprint sets.
2. Minhashing: convert large sets to short signatures, while preserving similarity.
(figure: pipeline — document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets and reflect their similarity)
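The edit-distance approach to syntactic similarity mentioned above can be sketched as follows (the normalization by the longer string's length is one common convention, assumed here for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1 - edit_distance(a, b) / max(len(a), len(b))

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumped over the lazy dog"
print(edit_similarity(d1, d2) > 0.8)  # True: near duplicates at the 80% threshold
```

Note that this similarity is not transitive: d1 ~ d2 and d2 ~ d3 do not imply d1 ~ d3, which is why chaining near-duplicate decisions transitively is only a heuristic.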
Computing Similarity with Shingles
• Shingles (word k-grams) [Brin95, Brod98]
– "a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is
• Similarity measure between two docs (= sets of shingles)
– Size of intersection / size of union: the Jaccard measure

Example: Jaccard Similarity
• The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
– Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
• Example: 3 in intersection, 8 in union → Jaccard similarity = 3/8

Fingerprint Generation Process for Web Documents
(figure: fingerprint generation and example for web documents)
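The shingling and Jaccard computation above can be written directly (a minimal sketch; k = 4 word-grams to match the "a rose is a" example):

```python
def shingles(text: str, k: int = 4) -> set:
    """The set of word k-grams (shingles) appearing in the text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1: set, s2: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

doc = "a rose is a rose is a rose"
print(sorted(shingles(doc)))
# ['a rose is a', 'is a rose is', 'rose is a rose']
```

Note the repeated phrase collapses into only three distinct shingles, exactly as on the slide.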
Approximate Representation with Sketching
• Computing the exact set intersection of shingles between all pairs of documents is expensive
– Approximate using a subset of shingles (called sketch vectors)
• Start with 64-bit shingles (hashed onto the number line 0..2^64)
• Create a sketch vector using minhashing
– For doc d, sketch_d[i] is computed as follows:
– Let f map all shingles in the universe to 0..2^m
– Let p_i be a specific random permutation on 0..2^m
– Permute on the number line with p_i, then pick MIN p_i(f(s)) over all shingles s in document d
• Documents which share more than t (say 80%) of their sketch vectors' elements are considered similar

Shingling with Minhashing
• Given two documents d1, d2, let S1 and S2 be their shingle sets
• Resemblance = |S1 ∩ S2| / |S1 ∪ S2|
• For a random permutation p, let Alpha = min(p(S1)) and Beta = min(p(S2))
• Probability(Alpha = Beta) = Resemblance
• Estimate this probability by sampling (e.g., 200 times): test whether Doc1.Sketch[i] = Doc2.Sketch[i] for 200 random permutations p_1, p_2, …, p_200
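A sketch of the sketch-vector construction (assumptions for brevity: random linear hash functions stand in for the random permutations p_i, and MD5 supplies the stable 64-bit shingle fingerprints f(s)):

```python
import hashlib
import random

def make_minhash_sketch(shingle_set, num_perms=200, seed=0):
    """Sketch[i] = min over all shingles of the i-th random linear hash
    of the shingle's fingerprint (approximating a random permutation)."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1  # large prime for the hash family
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_perms)]
    # Stable 64-bit fingerprints of the shingles (the map f).
    vals = [int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")
            for s in shingle_set]
    return [min((a * v + b) % prime for v in vals) for a, b in params]

def sketch_similarity(sk1, sk2):
    """Fraction of agreeing sketch positions: an estimate of resemblance."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

s1 = {"a rose is a", "rose is a rose", "is a rose is", "a rose is red"}
s2 = {"a rose is a", "rose is a rose", "is a rose is", "a daisy is red"}
est = sketch_similarity(make_minhash_sketch(s1), make_minhash_sketch(s2))
# True resemblance = 3/5 = 0.6; with 200 samples est should be close to it
```

With 200 samples the standard error of the estimate is about sqrt(0.6 × 0.4 / 200) ≈ 0.035, so the estimate typically lands within a few percent of the true resemblance.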
Proof with Boolean Matrices
• Rows = elements of the universal set
• Columns = sets
• 1 in row e and column S if and only if e is a member of S
• Column similarity is the Jaccard similarity of the sets of rows with 1
• A typical matrix is sparse

Key Observation
• For columns Ci, Cj, there are four types of rows:

  Type  Ci  Cj
  A     1   1
  B     1   0
  C     0   1
  D     0   0

• Overload notation: A = # of rows of type A (similarly B, C, D)
• Claim: Sim_J(Ci, Cj) = A / (A + B + C)
• Example:

  C1  C2
  0   1
  1   0
  1   1
  0   0
  1   1
  0   1

  Sim(C1, C2) = 2/5 = 0.4

Minhashing Property
• Imagine the rows permuted randomly
• "Hash" function h(C) = the index of the first row (in the permuted order) with a 1 in column C
• The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2):
– P[h(Ci) = h(Cj)] = Sim_J(Ci, Cj)
• Why? Both are A / (A + B + C)
– Look down the permuted columns C1 and C2 until we see a 1
– If it is a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not
• Use several (e.g., 100) independent hash functions to create a signature
• The similarity of signatures is the fraction of the hash functions in which they agree
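The minhash property can be checked empirically on the example matrix above by averaging over many random row permutations (a small simulation sketch; the column values match the 2/5 example):

```python
import random

# Columns from the example matrix, rows listed top to bottom.
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
# Type-A rows (1,1): 2; rows with any 1: 5 -> Sim = 2/5 = 0.4

def minhash(col, perm):
    """Index (in permuted order) of the first row with a 1."""
    return min(perm[r] for r in range(len(col)) if col[r])

rng = random.Random(42)
n, agree, trials = len(C1), 0, 100_000
for _ in range(trials):
    perm = list(range(n))
    rng.shuffle(perm)
    agree += minhash(C1, perm) == minhash(C2, perm)

print(agree / trials)  # ≈ 0.4 = Sim(C1, C2)
```

The agreement frequency converges to A / (A + B + C), exactly the Jaccard similarity, confirming the claim.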
All-Pair Comparison Is Expensive
• We want to compare objects, finding those pairs that are sufficiently similar
• Comparing the signatures of all pairs of objects is quadratic in the number of objects
– Example: 10^6 objects implies 5×10^11 comparisons
– At 1 microsecond/comparison: 6 days

Locality-Sensitive Hashing: The Big Picture
• General idea: use a function f(x, y) that tells whether or not x and y are a candidate pair: a pair of elements whose similarity must be evaluated
• Map a document to many buckets of signatures
• Make elements of the same bucket candidate pairs
– Candidate pairs: those pairs of signatures that we need to test for similarity
(figure: pipeline — document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets and reflect their similarity → LSH → candidate pairs)
Another View of LSH: Signatures with Bands
• Produce one short signature per document
• Partition each signature into b bands, with r rows per band
• Agreement on a band => mapped into the same bucket

Signature Generation and Bucket Comparison
• Create b bands for each document
– Use r minhash values (r rows) for each band
• If the signatures of docs X and Y agree in the same band => a candidate pair
– E.g., docs 2 and 6 land in the same bucket: probably identical; docs 6 and 7 never share a bucket: surely different
• Tune b and r to catch most similar pairs, but few non-similar pairs
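The banding scheme above can be sketched as follows (the signature values are toy data; the band's tuple of values serves directly as the bucket key):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Split each signature into b bands of r rows; documents whose
    bands collide in any band's bucket become candidate pairs."""
    assert all(len(sig) == b * r for sig in signatures.values())
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # bucket key for this band
            buckets[key].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

# Toy signatures with b=2 bands of r=2 rows: A and B agree only on band 0.
sigs = {"A": [1, 2, 9, 9], "B": [1, 2, 7, 7], "C": [5, 5, 5, 5]}
print(lsh_candidate_pairs(sigs, b=2, r=2))  # {('A', 'B')}
```

Only A and B become a candidate pair: they collide in band 0 even though their second bands differ, while C shares no band with anyone.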
Example
• Suppose C1, C2 are 80% similar; choose 20 bands of 5 integers/band
• Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328
• Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035
– i.e., about 1/3000th of the 80%-similar column pairs are false negatives

Analysis of LSH
• Probability the minhash signatures of C1, C2 agree in one row: s (the similarity of the two columns)
• Probability C1, C2 identical in one particular band: s^r
• Probability C1, C2 do not agree in at least one row of a band: 1 − s^r
• Probability C1, C2 do not agree in any band: (1 − s^r)^b — the false negative probability
• Probability C1, C2 agree in at least one of the bands: 1 − (1 − s^r)^b — the probability that we find such a pair

What One Band Gives You vs What We Want
(figure: ideally the probability of sharing a bucket is a step function — probability 1 if similarity s > t, no chance if s < t; with one band of rows, the probability of sharing a bucket rises smoothly with s, since the probability of equal hash values equals the similarity)
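The numbers in the example follow directly from the formulas (a one-function sketch):

```python
def prob_candidate(s, r, b):
    """Probability that two columns with similarity s agree in at
    least one of b bands of r rows: 1 - (1 - s**r)**b."""
    return 1 - (1 - s ** r) ** b

s, r, b = 0.8, 5, 20
print(round(s ** r, 3))            # 0.328: one particular band identical
print((1 - s ** r) ** b)           # ≈ 0.00035: false negative probability
print(prob_candidate(s, r, b))     # ≈ 0.9996: the pair becomes a candidate
```

So 80%-similar pairs are found with probability about 0.9996, i.e., roughly one in 3000 of them is missed.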
Example: b = 20, r = 5
What b Bands of r Rows Gives You
• Probability 1 − (1 − s^r)^b of a similar pair sharing a bucket (at least one band identical, i.e., all rows of some band equal):

  s    1 − (1 − s^5)^20
  .2   .006
  .3   .047
  .4   .186
  .5   .470
  .6   .802
  .7   .975
  .8   .9996

• The resulting S-curve has its threshold at approximately t ≈ (1/b)^(1/r)

LSH Summary
• Get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
– Check that candidate pairs really do have similar signatures
• LSH involves a tradeoff
– Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives
– Example: with only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up
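The table and the approximate threshold can be reproduced in a few lines (a verification sketch for b = 20, r = 5):

```python
def prob_share_bucket(s, r=5, b=20):
    """Probability a pair with similarity s shares at least one bucket."""
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {prob_share_bucket(s):.4f}")

threshold = (1 / 20) ** (1 / 5)
print(f"t ~ {threshold:.3f}")  # ≈ 0.55: where the S-curve rises steepest
```

The printed column matches the slide's table (.006, .047, .186, .470, .802, .975, .9996), and the steep rise of the S-curve indeed sits near s ≈ 0.55.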