  1. Note to other teachers and users of these slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org Slide credits: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University, http://www.mmds.org

  2.  Many problems can be expressed as finding “similar” objects:  Find near(est)-neighbors  Examples:  Pages with similar words  For duplicate detection, classification by topic  Customers who purchased similar products  Products with similar customer sets  Images with similar features  Users who visited similar websites  Record linkage (deduplication) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2

  3. [Hays and Efros, SIGGRAPH 2007] 10 nearest neighbors from a collection of 2 million images J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3

  4.  Given: (High-dimensional) data points x1, x2, …  For example: An image is a vector of pixel colors:

      1 2 1
      0 2 1  →  [1 2 1 0 2 1 0 1 0]
      0 1 0

       And some distance function d(x1, x2)  Which quantifies the “distance” between x1 and x2  Goal: Find all pairs of data points (xi, xj) that are within some distance threshold d(xi, xj) ≤ s  Note: The naïve solution would take O(N²), where N is the number of data points J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4
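The naïve quadratic search can be sketched as follows (a minimal illustration; the 1-D toy data and the absolute-difference distance are made up for the example):

```python
from itertools import combinations

def naive_near_pairs(points, dist, s):
    """Brute-force search: compare every pair, O(N^2) distance computations."""
    return [(x, y) for x, y in combinations(points, 2) if dist(x, y) <= s]

# Toy run: 1-D points, absolute difference as the distance, threshold s = 0.5.
pairs = naive_near_pairs([0.0, 0.1, 5.0], lambda a, b: abs(a - b), 0.5)
print(pairs)  # only (0.0, 0.1) is within the threshold
```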

  5.  Hash objects to buckets such that objects that are similar hash to the same bucket  Only compare candidate pairs within each bucket  Benefit: Instead of O(N²) comparisons, we need only O(N) to find similar documents  The hash functions depend on the similarity function J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9
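The bucket-then-compare idea can be sketched as follows (the `bucket_key` function is a hypothetical stand-in for a real locality-sensitive hash; grouping strings by first letter is only for illustration):

```python
from collections import defaultdict

def candidate_pairs(objects, bucket_key):
    """Group objects by a hash; only pairs sharing a bucket become candidates."""
    buckets = defaultdict(list)
    for obj in objects:
        buckets[bucket_key(obj)].append(obj)
    cands = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                cands.add((members[i], members[j]))
    return cands

# Toy bucket key: first letter, so only same-letter strings are ever compared.
cands = candidate_pairs(["apple", "apricot", "banana"], lambda s: s[0])
```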

  6.  Goal: Given a large number (N in the millions or billions) of documents, find “near duplicate” pairs  Applications:  Mirror websites, or approximate mirrors  Similar news articles at many news sites  Problems:  Documents are so large or so many that they cannot fit in main memory  Too many documents to compare all pairs J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10

  7.  Shingling: Convert documents to sets  Simple approaches:  Document = set of words appearing in document  Document = set of “important” words  Need to account for ordering of words!  Document = set of Shingles J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14

  8.  The set of k-shingles (or k-grams) for a document is the set of all sequences of k consecutive tokens that appear in the doc  Tokens can be characters or words  Example:  k=2;  D1 = abcab  Set of 2-shingles: S(D1) = {ab, bc, ca} J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15
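A character-level shingling sketch reproducing the slide's example:

```python
def shingles(doc, k):
    """Set of all k-character shingles (k-grams) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# The slide's example: k = 2, D1 = "abcab" yields the 3 shingles ab, bc, ca.
s1 = shingles("abcab", 2)
```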

  9.  Document Di is represented as the set of its k-shingles Ci = S(Di)  A natural similarity measure is the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2| J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17
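The Jaccard similarity maps directly onto Python set operations (the second shingle set is made up for the example):

```python
def jaccard(c1, c2):
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2| of two shingle sets."""
    if not c1 and not c2:
        return 0.0  # convention for two empty sets
    return len(c1 & c2) / len(c1 | c2)

sim = jaccard({"ab", "bc", "ca"}, {"ab", "ca", "de"})  # 2 shared / 4 total = 0.5
```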

  10.  Rows = elements (shingles)  Columns = sets (documents)  1 in row e and column s if and only if e is a member of s  Column similarity is the Jaccard similarity of the corresponding sets  Typical matrix is sparse!  Example: sim(C1, C2) = ?

                 Documents
                 C1 C2 C3 C4
                  1  1  1  0
                  1  1  0  1
      Shingles    0  1  0  1
                  0  0  0  1
                  1  0  0  1
                  1  1  1  0
                  1  0  1  0

      J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19

  11.  Rows = elements (shingles)  Columns = sets (documents)  1 in row e and column s if and only if e is a member of s  Column similarity is the Jaccard similarity of the corresponding sets  Typical matrix is sparse!  Example: sim(C1, C2) = ?  Size of intersection = 3; size of union = 6, Jaccard similarity = 3/6  d(C1, C2) = 1 − (Jaccard similarity) = 3/6

                 Documents
                 C1 C2 C3 C4
                  1  1  1  0
                  1  1  0  1
      Shingles    0  1  0  1
                  0  0  0  1
                  1  0  0  1
                  1  1  1  0
                  1  0  1  0

      J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20
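The slide's computation can be reproduced directly from the boolean matrix (columns are indexed from 0 here, so C1 and C2 are columns 0 and 1):

```python
def column_jaccard(matrix, i, j):
    """Jaccard similarity of columns i and j of a boolean (rows x cols) matrix."""
    both = sum(1 for row in matrix if row[i] and row[j])    # |intersection|
    either = sum(1 for row in matrix if row[i] or row[j])   # |union|
    return both / either

M = [[1, 1, 1, 0],
     [1, 1, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 0, 1],
     [1, 0, 0, 1],
     [1, 1, 1, 0],
     [1, 0, 1, 0]]

print(column_jaccard(M, 0, 1))  # 3/6 = 0.5
```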

  12.  Suppose we need to find near-duplicate documents among N = 1 million documents  Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs  N(N − 1)/2 ≈ 5·10^11 comparisons  At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days  For N = 10 million, it takes more than a year… J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 22

  13.  Key Idea: “hash” each column C to a small signature h(C) :  (1) h(C) is small enough that the signature fits in RAM  (2) sim(C 1 , C 2 ) is the same as the “similarity” of signatures h(C 1 ) and h(C 2 )  Locality sensitive hashing:  If sim(C 1 ,C 2 ) is high, then with high prob. h(C 1 ) = h(C 2 )  If sim(C 1 ,C 2 ) is low, then with high prob. h(C 1 ) ≠ h(C 2 )  Expect that “most” pairs of near duplicate docs hash into the same bucket! J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26

  14. Permutation π  Input matrix (Shingles x Documents)  Signature matrix M

      π1 π2 π3     C1 C2 C3 C4
       2  4  3      1  0  1  0        2 1 2 1
       3  2  4      1  0  0  1        2 1 4 1
       7  1  7      0  1  0  1        1 2 1 2
       6  3  2      0  1  0  1
       1  6  6      0  1  0  1
       5  7  1      1  0  1  0
       4  5  5      1  0  1  0

      Under π1, the 2nd element of the permutation is the first to map to a 1 (giving signature value 2); under π2, the 4th element of the permutation is the first to map to a 1 (giving signature value 4).

      J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27

  15.  Imagine the rows of the boolean matrix permuted under a random permutation π  Define a “hash” function h_π(C) = the index of the first (in the permuted order π) row in which column C has value 1: h_π(C) = min over {i : C(i) = 1} of π(i)  Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28
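A sketch of min-hashing with explicit random permutations, as defined above (inefficient, but it matches the definition literally; the toy matrix is made up for the example):

```python
import random

def minhash_signature(matrix, num_perms, seed=0):
    """Min-Hash signature via explicit random row permutations.

    sig[p][c] = position, under permutation p, of the first row where column c
    has a 1 (positions are 1..n_rows, as on the slide).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    sig = []
    for _ in range(num_perms):
        order = list(range(1, n_rows + 1))
        rng.shuffle(order)  # order[r] = position assigned to row r
        sig.append([min(order[r] for r in range(n_rows) if matrix[r][c])
                    for c in range(n_cols)])
    return sig

# Columns 0 and 2 are identical, so their signatures must always agree.
M_ex = [[1, 0, 1],
        [0, 1, 0],
        [1, 1, 1]]
sig = minhash_signature(M_ex, num_perms=5)
```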

  16.  Permuting the rows even once is prohibitive  Row hashing!  Pick K hash functions k_i  Ordering the rows under k_i gives a random row permutation!  How to pick a random hash function h(x)? Universal hashing: h_{a,b}(x) = ((a·x + b) mod p) mod N, where: a, b … random integers; p … a prime number (p > N) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29
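The universal hash family h_{a,b}(x) = ((a·x + b) mod p) mod N can be sketched as follows (the prime 10007 and the seed are arbitrary choices for the example):

```python
import random

def make_universal_hash(p, n, seed=None):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod n, with p a prime > n."""
    rng = random.Random(seed)
    a = rng.randrange(1, p)   # random non-zero multiplier
    b = rng.randrange(0, p)   # random offset
    return lambda x: ((a * x + b) % p) % n

h = make_universal_hash(p=10007, n=1000, seed=42)
print(h(12345))  # some value in [0, 1000)
```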

  17.  One-pass implementation  For each column C and hash function k_i, keep a “slot” sig(C)[i] for the min-hash value  Initialize all sig(C)[i] = ∞  For each row j:  If there is a 1 in row j of column C, update sig(C)[i] if k_i(j) is smaller than the current value:  If k_i(j) < sig(C)[i], then sig(C)[i] ← k_i(j) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30
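The one-pass algorithm can be sketched as follows (the two toy hash functions of the row index, and the 3×2 matrix, are made up for the example):

```python
def minhash_one_pass(matrix, hash_funcs):
    """One-pass Min-Hash: sig[i][c] = min of h_i(j) over rows j with matrix[j][c] == 1."""
    n_cols = len(matrix[0])
    INF = float("inf")
    sig = [[INF] * n_cols for _ in hash_funcs]   # initialize all slots to infinity
    for j, row in enumerate(matrix):             # single scan over the rows
        hashes = [h(j) for h in hash_funcs]      # compute each h_i(j) once per row
        for c in range(n_cols):
            if row[c]:                           # a 1 in row j, column c
                for i, hj in enumerate(hashes):
                    if hj < sig[i][c]:
                        sig[i][c] = hj           # keep the running minimum
    return sig

sig = minhash_one_pass([[1, 0], [0, 1], [1, 1]],
                       [lambda j: (j + 1) % 3, lambda j: (2 * j + 1) % 3])
print(sig)  # [[0, 0], [1, 0]]
```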

  18. (Figure slide; no recoverable content.) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  19.  Given a random permutation π  What is Pr[h_π(C1) = h_π(C2)]?

      C1 C2
       0  0
       0  0
       1  1
       0  0
       0  1
       1  0

      J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 33

  20.  Given a random permutation π  Claim: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)  Why? Look at the first row (in the permuted order) that is not all 0  If both columns have a 1 in that row, then h_π(C1) = h_π(C2)  Therefore Pr[h_π(C1) = h_π(C2)] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)

      C1 C2
       0  0
       0  0
       1  1
       0  0
       0  1
       1  0

      J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 34

  21.  The similarity of two signatures is the fraction of their values that agree  We know: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)  Because of the Min-Hash property, the expected similarity of two signatures = the similarity of the corresponding columns

      Signature matrix M
      2 1 2 1
      2 1 4 1
      1 2 1 2

      J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 36
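The Min-Hash property can be checked empirically by sampling random permutations (the two sets, the universe size, and the trial count are arbitrary choices; the pair below has Jaccard similarity 1/3, so the estimated collision probability should land near 1/3):

```python
import random

def estimate_collision_prob(c1, c2, universe, trials=20000, seed=1):
    """Estimate Pr[h_pi(C1) = h_pi(C2)] over uniformly random permutations pi."""
    rng = random.Random(seed)
    elems = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(elems)                       # draw a random permutation
        pos = {e: i for i, e in enumerate(elems)}
        if min(pos[e] for e in c1) == min(pos[e] for e in c2):
            hits += 1
    return hits / trials

c1, c2 = {0, 2}, {0, 3}   # Jaccard = |{0}| / |{0, 2, 3}| = 1/3
est = estimate_collision_prob(c1, c2, universe=range(5))
```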

  22. Permutation π  Input matrix (Shingles x Documents)  Signature matrix M

      π1 π2 π3     C1 C2 C3 C4
       2  4  3      1  0  1  0        2 1 2 1
       3  2  4      1  0  0  1        2 1 4 1
       7  1  7      0  1  0  1        1 2 1 2
       6  3  2      0  1  0  1
       1  6  6      0  1  0  1
       5  7  1      1  0  1  0
       4  5  5      1  0  1  0

      Similarities:
               1-3   2-4   1-2   3-4
      Col/Col  0.75  0.75  0     0
      Sig/Sig  0.67  1.00  0     0

      J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37
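The slide's numbers can be reproduced in code, using the input matrix and the three permutations shown on the slide (columns are indexed from 0 here, so the pair 1-3 is columns 0 and 2):

```python
# Input matrix (Shingles x Documents) and the slide's three permutations.
M = [[1, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 1, 0, 1],
     [0, 1, 0, 1],
     [1, 0, 1, 0],
     [1, 0, 1, 0]]
perms = [(2, 3, 7, 6, 1, 5, 4), (4, 2, 1, 3, 6, 7, 5), (3, 4, 7, 2, 6, 1, 5)]

# Signature: for each permutation, the smallest position among rows with a 1.
sig = [[min(p[r] for r in range(7) if M[r][c]) for c in range(4)] for p in perms]

def col_sim(i, j):
    """Jaccard similarity of columns i and j of M."""
    both = sum(1 for row in M if row[i] and row[j])
    either = sum(1 for row in M if row[i] or row[j])
    return both / either

def sig_sim(i, j):
    """Fraction of hash functions (signature rows) on which columns i, j agree."""
    return sum(1 for row in sig if row[i] == row[j]) / len(sig)
```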

  23. The big picture:

      Document → Shingling → The set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

      Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

  24.  The probability that two columns hash to the same value under one permutation is their similarity:  Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)  The similarity of two signatures is the fraction of the hash functions on which they agree:  Sim[h(C1), h(C2)] = sim(C1, C2) (in expectation) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 40
