INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/ IR 18/25: Web Search Basics Paul Ginsparg Cornell University, Ithaca, NY 3 Nov 2011 1 / 63
Administrativa Assignment 3 due Sun 6 Nov 2 / 63
Overview Recap 1 Duplicate detection 2 Web IR 3 Queries Links Context Users Documents 3 / 63
Outline Recap 1 Duplicate detection 2 Web IR 3 Queries Links Context Users Documents 4 / 63
Duplicate detection The web is full of duplicated content. More so than many other collections Exact duplicates Easy to eliminate E.g., use hash/fingerprint Near-duplicates Abundant on the web Difficult to eliminate For the user, it’s annoying to get a search result with near-identical documents. Recall marginal relevance We need to eliminate near-duplicates. 5 / 63
Detecting near-duplicates Compute similarity with an edit-distance measure We want syntactic (as opposed to semantic) similarity. We do not consider documents near-duplicates if they have the same content, but express it with different words. Use similarity threshold θ to make the call “is/isn’t a near-duplicate”. E.g., two documents are near-duplicates if similarity > θ = 80%. 6 / 63
Shingles A shingle is simply a word n-gram. Shingles are used as features to measure syntactic similarity of documents. For example, for n = 3, “a rose is a rose is a rose” would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose } We can map shingles to 1..2^m (e.g., m = 64) by fingerprinting. From now on: s_k refers to the shingle’s fingerprint in [1, 2^m]. The similarity of two documents can then be defined as the Jaccard coefficient of their shingle sets. 7 / 63
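A minimal sketch of shingling and fingerprinting, assuming Python with naive whitespace tokenization and a hashlib-based fingerprint; the names shingles() and fingerprint() are illustrative, not from the slides.

import hashlib

def shingles(text, n=3):
    # word n-grams (shingles) of a document, using naive whitespace tokenization
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def fingerprint(shingle, m=64):
    # map a shingle to an integer in [1, 2^m] via a hash-based fingerprint
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m) + 1

doc = "a rose is a rose is a rose"
print(shingles(doc))                            # {'a rose is', 'rose is a', 'is a rose'}
print({fingerprint(s) for s in shingles(doc)})  # three 64-bit fingerprints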
Recall (from lecture 4): Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient: jaccard(A, B) = |A ∩ B| / |A ∪ B| (for A ≠ ∅ or B ≠ ∅) jaccard(A, A) = 1 jaccard(A, B) = 0 if A ∩ B = ∅ A and B don’t have to be the same size. Always assigns a number between 0 and 1. 8 / 63
Jaccard coefficient: Example Three documents: d_1: “Jack London traveled to Oakland” d_2: “Jack London traveled to the city of Oakland” d_3: “Jack traveled from Oakland to London” Based on shingles of size 2, what are the Jaccard coefficients J(d_1, d_2) and J(d_1, d_3)? J(d_1, d_2) = 3/8 = 0.375 J(d_1, d_3) = 0 Note: very sensitive to dissimilarity 9 / 63
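A self-contained check of this worked example, assuming the same naive whitespace tokenization as above; shingles() and jaccard() are hypothetical helper names.

def shingles(text, n=2):
    # word n-grams of a document
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # Jaccard coefficient of two sets
    return len(a & b) / len(a | b)

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")

print(jaccard(d1, d2))   # 3/8 = 0.375
print(jaccard(d1, d3))   # 0.0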
Outline Recap 1 Duplicate detection 2 Web IR 3 Queries Links Context Users Documents 10 / 63
Intuition Let S_1 be the set of shingles from document d_1, and S_2 the set of shingles from document d_2. Start with the full set of shingles S_1 ∪ S_2 from the two documents, of total size |S_1 ∪ S_2|. From this full set, pick a shingle at random, and test whether it’s contained in both documents or just one. Intuition: if picking at random more often than not retrieves shingles shared by both documents, then more likely than not the two documents have many shingles in common. In particular, the probability of picking one in common equals the Jaccard coefficient: J(S_1, S_2) = |S_1 ∩ S_2| / |S_1 ∪ S_2| 11 / 63
Intuition, cont’d So if we choose at random, say, 200 times, and find 190 in common, then we’ve estimated J(S_1, S_2) ≈ .95; and if we find only 10 in common, then we estimate J(S_1, S_2) ≈ .05. The number of trials determines the accuracy of the approximation. [Each test is equivalent to an independent Bernoulli trial with p = J, so the expected number k of “successes” in n trials is E[k] = pn, with standard deviation σ = √(np(1 − p)). The estimate for J is thus k/n, with error σ/n = √(p(1 − p)/n) < 1/(2√n).] For n = 200 trials, our estimate of J is good to ±.1 accuracy with greater than 99% confidence (i.e., less than 1% chance that the number of successes m in n trials gives an estimate m/n differing from J by more than .1), and that’s good enough for a .8 threshold. [More precisely, the estimate is accurate to ±.035 with 68.3% confidence (1 standard deviation), to ±.07 with 95.4% confidence (2 s.d.), and to ±.105 with 99.7% confidence (3 s.d.).] 12 / 63
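A toy simulation of this sampling argument, assuming the two documents are already given as sets of shingle fingerprints; the sets, seed, and trial count below are made up for illustration.

import random

def sample_jaccard(s1, s2, n_trials=200, seed=0):
    # estimate J(s1, s2): pick a shingle uniformly from the union and
    # check whether it lies in both sets; repeat n_trials times
    rng = random.Random(seed)
    union = list(s1 | s2)
    both = s1 & s2
    hits = sum(1 for _ in range(n_trials) if rng.choice(union) in both)
    return hits / n_trials

s1 = set(range(0, 3000))        # toy "documents" as fingerprint sets
s2 = set(range(1000, 4000))     # true J = 2000 / 4000 = 0.5
print(sample_jaccard(s1, s2))   # near 0.5; one s.d. of error < 1/(2*sqrt(200)) ≈ 0.035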
Implementation How do we choose shingles at random? Idea 1: order them, do a random permutation, then pick the one now in the first position (or any fixed position). Since the permutation is random, that picks a random shingle from the set. But how do we explicitly implement a random permutation? Idea 2: map the shingles onto integers in some range, then use a random (hash) function on them. The numbers representing the shingles are jumbled around at random, and hence the one that is smallest (lands in first position) can be taken as the desired random selection. Then repeat with 200 different hash functions, giving the desired number of trials: each time check whether the chosen shingle comes from a single document or from both. 13 / 63
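A minimal sketch of Idea 2, with a simple linear hash (a·x + b mod p) standing in for one random permutation of the fingerprint range; the constants and toy fingerprints below are made up for illustration.

import random

rng = random.Random(7)
p = (1 << 61) - 1                        # a large prime modulus (2^61 - 1)
a, b = rng.randrange(1, p), rng.randrange(0, p)

def h(x):
    # one "random permutation" of the shingle fingerprints (illustrative)
    return (a * x + b) % p

s1 = {17, 4242, 998877}                  # toy fingerprints, document 1
s2 = {4242, 998877, 31337}               # toy fingerprints, document 2

# one trial: the smallest hashed value plays the role of a random pick
# from s1 | s2, and the trial "succeeds" if both documents agree on it
print(min(s1, key=h) == min(s2, key=h))

# repeating with 200 independently seeded hash functions gives the
# 200 trials described above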
One additional bonus It’s not necessary to choose the n = 200 shingles repeatedly for each document pair. Instead, the 200 shingles can be chosen once and for all for each document, giving what is called a sketch of the document: a vector of 200 integers K_α^(i), the smallest value under each of the 200 random permutations of the shingles in document d_i. Key point: since we use the same 200 random permutations (hash functions) for every document, the test of whether the smallest value of the permuted set S_1 ∪ S_2 is a shingle shared by the two documents is simply whether the corresponding entries of the sketch vectors K^(i) and K^(j) for the two documents coincide. So we pre-calculate a set of 200 numbers for each of N documents, and estimate the Jaccard coefficient for the overlap between any pair of documents d_i, d_j as the number of coincident entries of their sketch vectors (i.e., those satisfying K_α^(i) = K_α^(j) for α = 1, . . . , 200), divided by 200. 14 / 63
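A sketch of this pre-computation, repeating the linear-hash stand-in from the previous snippet so it runs on its own; sketch() and estimate_jaccard() are illustrative names rather than the lecture’s notation.

import random

def make_hash(seed):
    # a random linear hash, standing in for one random permutation on 1..2^m
    rng = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rng.randrange(1, p), rng.randrange(0, p)
    return lambda x: (a * x + b) % p

def sketch(shingle_ids, hashes):
    # sketch vector: for each hash (permutation), the smallest hashed
    # fingerprint over the document's shingles
    return [min(h(s) for s in shingle_ids) for h in hashes]

def estimate_jaccard(sketch_i, sketch_j):
    # fraction of coincident sketch positions estimates J(d_i, d_j)
    matches = sum(1 for x, y in zip(sketch_i, sketch_j) if x == y)
    return matches / len(sketch_i)

hashes = [make_hash(alpha) for alpha in range(200)]
K1 = sketch(set(range(0, 3000)), hashes)      # toy fingerprint sets,
K2 = sketch(set(range(1000, 4000)), hashes)   # true J = 2000/4000 = 0.5
print(estimate_jaccard(K1, K2))               # close to 0.5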
Redux Simple intuition: consider the ensemble of shingles S_1 ∪ S_2 in two documents. |S_1 ∪ S_2| may be in the many thousands if either document contains thousands of words. Pick 200 of these at random; the percentage of those 200 that are shared between the documents gives a good (enough) estimate of the fraction in common, because it’s a random sample (just as for election polls, where a representative sample can be used to estimate the preference of a much larger population to great accuracy). The technical implementation is to map the shingles to numbers, use a hash function to randomly permute the numbers, then test the smallest one (it could be any fixed position, but the smallest is easiest to retain while progressing linearly through the full set S_1 ∪ S_2). 15 / 63
Sketches The number of shingles per document is large (of order the number of words in the document). To increase efficiency, we will use a sketch, a cleverly chosen subset of the shingles of a document. The size of a sketch is, say, 200 . . . . . . and is defined by a set of permutations π_1, . . . , π_200. Each π_i is a random permutation on 1..2^m. The sketch of d is defined as: ⟨ min_{s ∈ d} π_1(s), min_{s ∈ d} π_2(s), . . . , min_{s ∈ d} π_200(s) ⟩ (a vector of 200 numbers). 16 / 63
Permutation and minimum: Example [Figure: document 1 contains shingles s_1, s_2, s_3, s_4 and document 2 contains shingles s_1, s_5, s_3, s_4; a permutation π maps each shingle to x_k = π(s_k) in 1..2^m, and both documents end up with the same minimum, x_3.] Roughly: we use min_{s ∈ d_1} π(s) = min_{s ∈ d_2} π(s) as a test for: are d_1 and d_2 near-duplicates? 17 / 63
Computing Jaccard for sketches Sketches: Each document is now a vector of 200 numbers. Much easier to deal with than the very high-dimensional space of shingles But how do we compute Jaccard? 18 / 63
Computing Jaccard for sketches (2) How do we compute Jaccard? Let U be the union of the sets of shingles of d_1 and d_2, and I the intersection. There are |U|! permutations on U. For s′ ∈ I, for how many permutations π do we have arg min_{s ∈ d_1} π(s) = s′ = arg min_{s ∈ d_2} π(s)? Answer: (|U| − 1)! There is a set of (|U| − 1)! different permutations for each s′ in I. Thus, the proportion of permutations that make min_{s ∈ d_1} π(s) = min_{s ∈ d_2} π(s) true is: |I| (|U| − 1)! / |U|! = |I| / |U| = J(d_1, d_2) 19 / 63
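A brute-force check of this counting argument on a tiny made-up universe: enumerate all |U|! permutations and count how many make the two minima agree.

from itertools import permutations

d1, d2 = {1, 2, 3}, {2, 3, 4}          # toy shingle sets
U = sorted(d1 | d2)                    # |U| = 4, so there are 4! = 24 permutations
I = d1 & d2                            # |I| = 2, so J(d1, d2) = 2/4 = 0.5

total = agree = 0
for perm in permutations(range(len(U))):
    pi = dict(zip(U, perm))            # one permutation of U
    total += 1
    if min(d1, key=pi.get) == min(d2, key=pi.get):
        agree += 1

print(agree, "/", total)               # 12 / 24, i.e. |I|(|U|-1)! out of |U|!
print(agree / total, len(I) / len(U))  # 0.5 0.5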
Estimating Jaccard Thus, the proportion of permutations that make min_{s ∈ d_1} π(s) = min_{s ∈ d_2} π(s) true is the Jaccard coefficient. Picking a permutation at random and outputting 0/1 depending on whether min_{s ∈ d_1} π(s) = min_{s ∈ d_2} π(s) is a Bernoulli trial. Estimator of probability of success: proportion of successes in n Bernoulli trials. Our sketch is based on a random selection of permutations. Thus, to compute Jaccard, count the number k of “successful” permutations (minima are the same) for ⟨d_1, d_2⟩ and divide by n = 200. k/200 estimates J(d_1, d_2). 20 / 63
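A minimal illustration of the k/200 estimator together with the θ = 0.8 threshold from the earlier slide; the two sketch vectors below are made-up data and near_duplicates() is a hypothetical helper.

def estimate_jaccard(sketch_i, sketch_j):
    # k / 200: matching sketch positions divided by the sketch length
    matches = sum(1 for x, y in zip(sketch_i, sketch_j) if x == y)
    return matches / len(sketch_i)

def near_duplicates(sketch_i, sketch_j, theta=0.8):
    # flag a pair as near-duplicates if the estimated Jaccard exceeds theta
    return estimate_jaccard(sketch_i, sketch_j) > theta

sketch_a = list(range(200))                    # made-up 200-entry sketches
sketch_b = list(range(170)) + [-1] * 30        # agreeing on 170 positions
print(estimate_jaccard(sketch_a, sketch_b))    # 170 / 200 = 0.85
print(near_duplicates(sketch_a, sketch_b))     # True (0.85 > 0.8)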