Recitation sessions:
• Review of proof techniques and probability
  - Friday, January 17, 3:00-4:10 PM in Skilling Auditorium
• Review of linear algebra
  - Friday, January 17, 4:20-5:20 PM in Skilling Auditorium
Deadlines tonight, 11:59 PM:
• Colab 0 (Spark Tutorial), Colab 1
1/16/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets
Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: CS246: Mining Massive Datasets Jure Leskovec, Stanford University
• Task: Given a large number of documents (N in the millions or billions), find “near duplicates”
• Problem:
  - Too many documents to compare all pairs
• Solution: Hash documents so that similar documents hash into the same bucket
  - Documents in the same bucket are then candidate pairs, whose similarity is then evaluated
[Figure: pipeline. Document → Shingling → the set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]
• A k-shingle (or k-gram) is a sequence of k tokens that appears in the document
  - Example: k = 2; $D_1$ = abcab. Set of 2-shingles: $C_1 = S(D_1)$ = {ab, bc, ca}
• Represent a doc by the set of hash values of its k-shingles
• A natural similarity measure is then the Jaccard similarity: $\mathrm{sim}(D_1, D_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$
  - The similarity of two documents is the Jaccard similarity of their shingle sets
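The shingling example above can be sketched in a few lines of Python. This is a minimal illustration, assuming that tokens are single characters and skipping the hashing of shingles; the function names and the second document `"abcad"` are hypothetical, not part of the slides.

```python
def shingles(doc, k):
    """Return the set of k-shingles of doc, treating each character as a token."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b|. Two empty sets are treated as identical (an assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


c1 = shingles("abcab", 2)   # the example document D1 from the slide
c2 = shingles("abcad", 2)   # hypothetical second document
print(sorted(c1))
print(jaccard_similarity(c1, c2))
```

In a real pipeline each shingle would typically be replaced by a hash value to save space, as the slide notes.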
• Min-Hashing: Convert large sets into short signatures while preserving similarity: $\Pr[h(C_1) = h(C_2)] = \mathrm{sim}(D_1, D_2)$
[Figure: permutations π applied to the input matrix (shingles × documents) give the signature matrix M]
• Similarities of columns and signatures (approximately) match:
  Column pair   1-3    2-4    1-2    3-4
  Col/Col       0.75   0.75   0      0
  Sig/Sig       0.67   1.00   0      0
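As a rough sketch of how a signature can be built, the code below draws explicit random permutations of a small universe and records, for each permutation, the smallest permuted position of any element in the set. The universe size, the number of permutations, the seed, and all names are assumptions chosen for illustration; practical implementations usually simulate permutations with hash functions instead.

```python
import random


def minhash_signature(items, perms):
    """One entry per permutation: the smallest permuted position among items."""
    return [min(perm[x] for x in items) for perm in perms]


def signature_similarity(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)


rng = random.Random(0)            # fixed seed so the sketch is repeatable
universe = list(range(100))       # hypothetical universe of shingle ids
perms = []
for _ in range(200):              # 200 permutations, an arbitrary choice
    order = universe[:]
    rng.shuffle(order)
    perms.append({elem: pos for pos, elem in enumerate(order)})

c1 = set(range(0, 60))
c2 = set(range(20, 80))           # Jaccard similarity = 40 / 80 = 0.5
estimate = signature_similarity(minhash_signature(c1, perms),
                                minhash_signature(c2, perms))
print(estimate)                   # should land near 0.5, up to sampling noise
```

The estimate is noisy because each signature position is an independent coin flip with success probability equal to the Jaccard similarity; more permutations tighten it.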
• Hash columns of the signature matrix M: similar columns likely hash to the same bucket
  - Divide matrix M into b bands of r rows each (M has b·r rows)
  - Candidate column pairs are those that hash to the same bucket for ≥ 1 band
[Figure: left, matrix M split into b bands of r rows, each band hashed into buckets; right, probability of sharing ≥ 1 bucket vs. similarity, with threshold t]
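The banding step can be sketched as follows. This is a simplified illustration: it uses each band's tuple of r values directly as the bucket key, so only identical bands collide, whereas a real implementation would hash each band into a fixed number of buckets and could see occasional extra collisions. The function name and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict mapping doc id -> list of b*r signature values.
    Returns the set of pairs that share a bucket in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)       # a separate bucket table per band
        for doc_id, sig in signatures.items():
            if len(sig) != b * r:
                raise ValueError("signature length must equal b * r")
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates


toy = {
    "A": [1, 2, 3, 4],
    "B": [1, 2, 9, 9],   # agrees with A on the first band only
    "C": [7, 7, 3, 5],
    "D": [8, 8, 8, 8],
}
print(lsh_candidate_pairs(toy, b=2, r=2))
```

Only the candidate pairs returned here would then be checked against the full signatures or the original documents.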
[Figure: pipeline. Points → hash function → Signatures: short integer signatures that reflect point similarity → Locality-sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity. Design a locality-sensitive hash function (for a given distance metric), then apply the “bands” technique.]
• The S-curve is where the “magic” happens
  - Remember: probability of equal hash values = similarity, $\Pr[h_\pi(C_1) = h_\pi(C_2)] = \mathrm{sim}(D_1, D_2)$. This is what 1 hash code gives you.
  - This is what we want: the probability of sharing ≥ 1 bucket is 1 if s > t, and there is no chance if s < t, for a threshold t
• How to get a step function? By choosing r and b!
[Figure: two plots against the similarity s of two sets. Left: probability of equal hash values, a straight line. Right: probability of sharing ≥ 1 bucket, a step at threshold t]
• Remember: b bands, r rows/band
• Let $\mathrm{sim}(C_1, C_2) = s$. What’s the probability that at least 1 band is equal?
• Pick some band (r rows)
  - Prob. that the elements in a single row of columns $C_1$ and $C_2$ are equal $= s$
  - Prob. that all rows in a band are equal $= s^r$
  - Prob. that some row in a band is not equal $= 1 - s^r$
• Prob. that no band is equal $= (1 - s^r)^b$
• Prob. that at least 1 band is equal $= 1 - (1 - s^r)^b$
$P(C_1, C_2 \text{ is a candidate pair}) = 1 - (1 - s^r)^b$
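The final formula is easy to evaluate numerically. The sketch below tabulates it for a few similarities; the function name is hypothetical, and r = 5, b = 10 is just one example setting.

```python
def candidate_probability(s, r, b):
    """Probability that two columns with similarity s agree on all r rows
    of at least one of the b bands: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b


for s in (0.2, 0.4, 0.6, 0.8):
    p = candidate_probability(s, r=5, b=10)
    print(f"s = {s:.1f}   P(candidate pair) = {p:.3f}")
```

With r = 1 and b = 1 the function reduces to s itself, which is the straight line that a single hash function gives.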
• Picking r and b to get the best S-curve
  - 50 hash functions (r = 5, b = 10)
[Figure: probability of sharing a bucket vs. similarity s]
• Given a fixed threshold t, we want to choose r and b such that P(candidate pair) has a “step” right around t
  - $\text{prob} = 1 - (1 - s^r)^b$
[Figure: four panels of Prob(candidate pair) vs. similarity: r = 1..10, b = 1; r = 1, b = 1..10; r = 5, b = 1..50; r = 10, b = 1..50]
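One way to pick r and b for a given threshold is sketched below. It relies on a rule of thumb that is not stated on the slide: the steepest rise of the curve $1 - (1 - s^r)^b$ sits at roughly $(1/b)^{1/r}$. Given a fixed budget of n hash functions, the hypothetical helper `choose_bands` tries every split n = b·r and keeps the one whose approximate threshold is closest to t.

```python
def approx_threshold(r, b):
    """Rule-of-thumb location of the step for b bands of r rows: (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


def choose_bands(n, t):
    """Among all (r, b) with r * b == n, return the pair whose approximate
    threshold is closest to the target similarity threshold t."""
    best = None
    for r in range(1, n + 1):
        if n % r:
            continue                      # r must divide n evenly
        b = n // r
        gap = abs(approx_threshold(r, b) - t)
        if best is None or gap < best[0]:
            best = (gap, r, b)
    return best[1], best[2]


print(choose_bands(50, 0.6))
print(approx_threshold(5, 10))
```

This only positions the step; how sharp it is, and so how many false positives and false negatives remain, still depends on the total number of hash functions.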
Visualization of the effect of threshold, band size, and # of rows in LSH by Trenton Chang (Thank you!!)
[Figure: pipeline. Min-Hashing → Signatures: short vectors that represent the sets and reflect their similarity → Locality-sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]
• We have used LSH to find similar documents
  - More generally, we found columns with high Jaccard similarity in large sparse matrices
• Can we use LSH for other distance measures?
  - e.g., Euclidean distance, cosine distance
  - Let’s generalize what we’ve learned!
• d(·) is a distance measure if it is a function from pairs of points x, y to real numbers such that:
  - $d(x, y) \ge 0$
  - $d(x, y) = 0$ iff $x = y$
  - $d(x, y) = d(y, x)$
  - $d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality)
• Jaccard distance for sets = 1 - Jaccard similarity
• Cosine distance for vectors = angle between the vectors
• Euclidean distances:
  - L2 norm: d(x, y) = square root of the sum of the squares of the differences between x and y in each dimension
  - The most common notion of “distance”
  - L1 norm: sum of the absolute values of the differences in each dimension
  - Manhattan distance = distance if you travel along axes only
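The four distances listed above can be written down directly. The sketch below uses only the standard library; the function names are hypothetical, vectors are plain sequences of numbers of equal length, and the cosine distance is returned as an angle in radians and assumes neither vector is all zeros.

```python
import math


def jaccard_distance(a, b):
    """1 - Jaccard similarity of two sets (0.0 for two empty sets, an assumption)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cosine_distance(x, y):
    """Angle between vectors x and y, in radians."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def l2_distance(x, y):
    """Square root of the sum of squared per-dimension differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def l1_distance(x, y):
    """Sum of absolute per-dimension differences (Manhattan distance)."""
    return sum(abs(p - q) for p, q in zip(x, y))


print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
print(cosine_distance((1, 0), (0, 1)))
print(l2_distance((0, 0), (3, 4)), l1_distance((0, 0), (3, 4)))
```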
• For Min-Hashing signatures, we got a Min-Hash function for each permutation of rows
• A “hash function” is any function that allows us to say whether two elements are “equal”
  - Shorthand: h(x) = h(y) means “h says x and y are equal”
• A family of hash functions is any set of hash functions from which we can efficiently pick one at random
  - Example: the set of Min-Hash functions generated from permutations of rows
• Suppose we have a space S of points with a distance measure d(x, y) (critical assumption)
• A family H of hash functions is said to be $(d_1, d_2, p_1, p_2)$-sensitive if for any x and y in S:
  1. If $d(x, y) < d_1$, then the probability over all $h \in H$ that $h(x) = h(y)$ is at least $p_1$
  2. If $d(x, y) > d_2$, then the probability over all $h \in H$ that $h(x) = h(y)$ is at most $p_2$
• With an LS family we can do LSH!
[Figure: $\Pr[h(x) = h(y)]$ vs. distance d(x, y), with a distance threshold t. Small distance (below $d_1$): high probability, at least $p_1$. Large distance (above $d_2$): low probability of hashing to the same value, at most $p_2$. Notice it’s distance, not similarity, hence the S-curve is flipped!]
• Let:
  - S = space of all sets,
  - d = Jaccard distance,
  - H = the family of Min-Hash functions for all permutations of rows
• Then, for any sets x and y and a hash function $h \in H$ picked at random: $\Pr[h(x) = h(y)] = 1 - d(x, y)$
  - This simply restates the theorem about Min-Hashing in terms of distances rather than similarities
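For a small universe this identity can be checked exactly by enumerating every permutation. The sketch below does that with exact fractions; the five-element universe, the two example sets, and the function names are assumptions chosen so that the enumeration (120 permutations) stays tiny.

```python
from fractions import Fraction
from itertools import permutations


def minhash(items, perm):
    """perm maps element -> position; return the smallest position in items."""
    return min(perm[x] for x in items)


def collision_probability(x, y, universe):
    """Exact fraction of permutations of universe on which x and y
    receive the same Min-Hash value."""
    equal = total = 0
    for order in permutations(universe):
        perm = {elem: pos for pos, elem in enumerate(order)}
        equal += minhash(x, perm) == minhash(y, perm)
        total += 1
    return Fraction(equal, total)


universe = "abcde"
x, y = set("abc"), set("bcd")
jaccard_dist = 1 - Fraction(len(x & y), len(x | y))
print(collision_probability(x, y, universe), 1 - jaccard_dist)
```

Both printed values agree because the two sets hash equally exactly when the first element of the union under the permutation lies in the intersection.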