
Locality-Sensitive Hashing

[Course-map slide: High-dim. data → Locality-sensitive hashing, Clustering; Graph data → PageRank, SimRank, Network Analysis; Infinite data → Filtering data streams, Web advertising; Machine learning → SVM, Decision trees; Apps → Recommender systems, Association rules.]


  1. We know: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2). Now generalize to multiple hash functions:
     - The similarity of two signatures is the fraction of the hash functions on which they agree.
     - Thus, the expected similarity of two signatures equals the Jaccard similarity of the columns (sets) that the signatures represent.
     - And the longer the signatures, the smaller the expected error. (A small sketch of signature similarity follows.)
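A minimal sketch of "fraction of hash functions on which they agree"; the two example signatures are illustrative, not from the slides:

    def signature_similarity(sig1, sig2):
        # estimate of Jaccard similarity: fraction of agreeing min-hash values
        assert len(sig1) == len(sig2)
        return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

    print(signature_similarity([2, 1, 2, 1], [2, 1, 4, 1]))   # 3 of 4 agree -> 0.75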

  2. Example:
     [Figure: three random row permutations of the input matrix (shingles × documents) and the 3-row signature matrix M they produce.]

     Similarities:  1-3   2-4   1-2   3-4
     Col/Col:       0.75  0.75  0     0
     Sig/Sig:       0.34  0.67  0     0

  3. Permuting the rows even once is prohibitive. Instead: row hashing!
     - Pick K = 100 hash functions h_i; ordering the rows under h_i gives a random permutation of rows.
     - One-pass implementation: for each column c and hash function h_i, keep a "slot" M(i, c) for the min-hash value; initialize all M(i, c) = ∞.
     - Scan the rows looking for 1s: suppose row j has a 1 in column c; then for each h_i, if h_i(j) < M(i, c), set M(i, c) ← h_i(j).
     - How to pick a random hash function h(x)? Universal hashing: h_{a,b}(x) = ((a·x + b) mod p) mod N, where a, b are random integers and p is a prime number (p > N).

  4. One-pass pseudocode:

     for each row r do begin
       for each hash function h_i do
         compute h_i(r);                -- important: hash r only once per hash
                                        -- function, not once per 1 in row r
       for each column c do
         if c has 1 in row r then
           for each hash function h_i do
             if h_i(r) < M(i, c) then M(i, c) := h_i(r);
     end;

  5. Example: two hash functions h(x) = x mod 5 and g(x) = (2x + 1) mod 5 on a 5-row input with columns C1, C2:

     Row  C1  C2      hash values
      1    1   0      h(1) = 1, g(1) = 3
      2    0   1      h(2) = 2, g(2) = 0
      3    1   1      h(3) = 3, g(3) = 2
      4    1   0      h(4) = 4, g(4) = 4
      5    0   1      h(5) = 0, g(5) = 1

     Scanning the rows and updating the slots:

                   M(h, C1)  M(h, C2)  M(g, C1)  M(g, C2)
     initially         ∞         ∞         ∞         ∞
     after row 1       1         ∞         3         ∞
     after row 2       1         2         3         0
     after row 3       1         2         2         0
     after row 4       1         2         2         0
     after row 5       1         0         2         0

     The last line is the signature matrix M. (A runnable sketch follows.)
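A minimal sketch of the one-pass algorithm of slides 3-5, run on this example; INF plays the role of ∞:

    INF = float("inf")

    def minhash_signatures(matrix, hash_funcs):
        # matrix[r][c] is 0/1; rows are numbered 1..len(matrix) as on the slide
        n_cols = len(matrix[0])
        M = [[INF] * n_cols for _ in hash_funcs]     # all slots M(i, c) = infinity
        for r, row in enumerate(matrix, start=1):
            hashes = [h(r) for h in hash_funcs]      # hash r once per hash function
            for c, bit in enumerate(row):
                if bit == 1:                         # row r has a 1 in column c
                    for i, hr in enumerate(hashes):
                        if hr < M[i][c]:
                            M[i][c] = hr
        return M

    matrix = [[1, 0],   # row 1: C1 = 1, C2 = 0
              [0, 1],
              [1, 1],
              [1, 0],
              [0, 1]]
    h = lambda x: x % 5
    g = lambda x: (2 * x + 1) % 5
    print(minhash_signatures(matrix, [h, g]))   # [[1, 0], [2, 0]], as on the slide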

  6. Step 3, Locality-Sensitive Hashing: focus on pairs of signatures likely to be from similar documents.
     [Pipeline: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Candidate pairs: those pairs of signatures that we need to test for similarity.]

  7. Goal: find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8).
     - LSH, the general idea: use a hash function that tells whether x and y form a candidate pair, i.e., a pair of elements whose similarity must be evaluated.
     - For Min-Hash matrices: hash the columns of the signature matrix M to many buckets; each pair of documents that hashes into the same bucket is a candidate pair.

  8. Pick a similarity threshold s (0 < s < 1).
     - Columns x and y of M are a candidate pair if their signatures agree on at least a fraction s of their rows: M(i, x) = M(i, y) for at least a fraction s of the values of i.
     - We expect documents x and y to have the same (Jaccard) similarity as their signatures.

  9. Big idea: hash the columns of the signature matrix M several times.
     - Arrange that (only) similar columns are likely to hash to the same bucket.
     - Candidate pairs are those that hash to the same bucket.

  10. [Figure: the signature matrix M divided into b bands of r rows each; one column of M is one signature.]

  11. Divide matrix M into b bands of r rows.
      - For each band, hash its portion of each column to a hash table with k buckets; make k as large as possible.
      - Candidate column pairs are those that hash to the same bucket for ≥ 1 band.
      - Tune b and r to catch most similar pairs, but few non-similar pairs. (A sketch of the banding step follows.)
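A minimal sketch of the banding step, assuming signatures are equal-length lists of integers; a Python dict keyed by each band's tuple of values stands in for the per-band hash table:

    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(signatures, b, r):
        # signatures: one length-(b*r) min-hash signature per document;
        # returns pairs of document indices sharing a bucket in >= 1 band
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)
            for doc, sig in enumerate(signatures):
                chunk = tuple(sig[band * r:(band + 1) * r])   # this band's r values
                buckets[chunk].append(doc)                    # hash the band's portion
            for docs in buckets.values():
                candidates.update(combinations(docs, 2))      # same bucket -> candidates
        return candidates

    # tiny illustrative example: 4 docs, b = 2 bands of r = 2 rows
    sigs = [[1, 2, 3, 4], [1, 2, 9, 9], [5, 6, 3, 4], [7, 8, 9, 0]]
    print(candidate_pairs(sigs, b=2, r=2))   # {(0, 1), (0, 2)}, in some order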

  12. [Figure: columns of M hashed band-by-band into buckets. Columns 2 and 6 land in the same bucket, so they are probably identical in that band (a candidate pair); columns 6 and 7 land in different buckets, so they are surely different in that band.]

  13. There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band.
      - Hereafter, we assume that "same bucket" means "identical in that band".
      - This assumption is needed only to simplify the analysis, not for the correctness of the algorithm.

  14. Assume the following case:
      - 100,000 columns of M (100k docs)
      - Signatures of 100 integers (rows), so the signatures take 40 MB
      - Goal: find pairs of documents that are at least s = 0.8 similar
      - Choose b = 20 bands of r = 5 integers/band

  15. Find pairs with ≥ s = 0.8 similarity; set b = 20, r = 5. Assume sim(C1, C2) = 0.8.
      - Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: they should hash to at least one common bucket (at least one band is identical).
      - Probability that C1, C2 are identical in one particular band: (0.8)^5 = 0.328
      - Probability that C1, C2 are identical in none of the 20 bands: (1 - 0.328)^20 = 0.00035
      - I.e., only about 1/3000th of the 80%-similar column pairs are false negatives (we miss them); we would find 99.965% of the pairs of truly similar documents.

  16. Find pairs with ≥ s = 0.8 similarity; set b = 20, r = 5. Assume sim(C1, C2) = 0.3.
      - Since sim(C1, C2) < s, we want C1, C2 to hash to NO common bucket (all bands should be different).
      - Probability that C1, C2 are identical in one particular band: (0.3)^5 = 0.00243
      - Probability that C1, C2 are identical in at least 1 of the 20 bands: 1 - (1 - 0.00243)^20 = 0.0474
      - In other words, approximately 4.74% of the document pairs with similarity 0.3 end up as candidate pairs. They are false positives: we will have to examine them (they are candidate pairs), but it will then turn out that their similarity is below the threshold s. (The sketch below reproduces both calculations.)
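A minimal sketch reproducing the calculations of slides 15 and 16; nothing here beyond the formula derived on the slides that follow:

    def p_candidate(sim, r, b):
        # probability that two columns with similarity `sim` become a candidate
        # pair: 1 - (1 - sim^r)^b, i.e., at least one of b bands of r rows matches
        return 1 - (1 - sim ** r) ** b

    print(1 - p_candidate(0.8, r=5, b=20))   # ~0.00035: truly similar pairs we miss
    print(p_candidate(0.3, r=5, b=20))       # ~0.0474: dissimilar pairs we test anyway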

  17. Pick the number of Min-Hashes (rows of M), the number of bands b, and the number of rows r per band to balance false positives and false negatives.
      - Example: if we had only 10 bands of 10 rows, the number of false positives would go down, but the number of false negatives would go up.

  18. What we want: probability of sharing a bucket = 1 if t > s, and no chance if t < s, where t = sim(C1, C2) of two sets and s is the similarity threshold.
      [Figure: an ideal step function of the probability of sharing a bucket against the similarity t, annotated "say 'yes' if you are below the line".]

  19. Remember: with a single Min-Hash function, the probability of equal hash-values equals the similarity.
      [Figure: probability of sharing a bucket against the similarity t = sim(C1, C2) of two sets; a straight diagonal line, not a step.]

  20. [Figure: the same diagonal line with the threshold s marked, annotated "say 'yes' if you are below the line": pairs with t > s above the line are false negatives, and pairs with t < s below the line are false positives.]

  21. Say columns C1 and C2 have similarity t. Pick any band (r rows):
      - Prob. that all rows in the band are equal: t^r
      - Prob. that some row in the band is unequal: 1 - t^r
      - Prob. that no band is identical: (1 - t^r)^b
      - Prob. that at least 1 band is identical: 1 - (1 - t^r)^b

  22. [Figure: the S-curve 1 - (1 - t^r)^b of the probability of sharing a bucket against the similarity t = sim(C1, C2), annotated term by term: t^r = all rows of a band are equal; 1 - t^r = some row of a band is unequal; (1 - t^r)^b = no bands identical; 1 - (1 - t^r)^b = at least one band identical.]

  23. Prob. that at least 1 band is identical, 1 - (1 - s^r)^b, at similarity threshold s (here r = 5, b = 20):

      s     1 - (1 - s^r)^b
      0.2   0.006
      0.3   0.047
      0.4   0.186
      0.5   0.470
      0.6   0.802
      0.7   0.975
      0.8   0.9996

  24. Picking r and b to get the best S-curve: 50 hash-functions (r = 5, b = 10).
      [Plot: probability of sharing a bucket against similarity; the blue area is the false-negative rate, the green area the false-positive rate.]

  25. Tune M (the number of Min-Hashes), b, and r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures.
      - Check in main memory that the candidate pairs really do have similar signatures.
      - Optional: in another pass through the data, check that the remaining candidate pairs really represent similar documents.

  26. Summary of the three steps:
      - Shingling: convert documents to a set representation. We used hashing to assign each shingle an ID.
      - Min-Hashing: convert large sets to short signatures while preserving similarity. We used a similarity-preserving hash family with the property Pr[h_π(C1) = h_π(C2)] = sim(C1, C2), and we used hashing to get around generating random permutations.
      - Locality-Sensitive Hashing: focus on pairs of signatures likely to be from similar documents. We used hashing to find candidate pairs of similarity ≥ s.

  27. Task: given a large number of documents (N in the millions or billions), find "near duplicates".
      - Problem: too many documents to compare all pairs.
      - Solution: hash the documents so that similar documents hash into the same bucket; documents in the same bucket are then candidate pairs, whose similarity is then evaluated.

  28. [Pipeline recap: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.]

  29. A k-shingle (or k-gram) is a sequence of k tokens that appears in the document.
      - Example: k = 2; D1 = abcab. Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}.
      - Represent a doc by the set of hash values of its k-shingles.
      - A natural similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|. The similarity of two documents is the Jaccard similarity of their shingle sets. (A short sketch follows.)
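A minimal sketch of shingling and Jaccard similarity, using character-level shingles as in the abcab example; the second document is illustrative:

    def shingles(doc, k):
        # the set of k-shingles (here: length-k substrings) of a document
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    def jaccard(c1, c2):
        # Jaccard similarity: |intersection| / |union|
        return len(c1 & c2) / len(c1 | c2)

    c1 = shingles("abcab", 2)
    print(sorted(c1))                           # ['ab', 'bc', 'ca'], as on the slide
    print(jaccard(c1, shingles("bcaab", 2)))    # 0.75: three shared of four shingles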

  30. Min-Hashing: convert large sets into short signatures while preserving similarity: Pr[h(C1) = h(C2)] = sim(D1, D2).
      [Figure, as before: row permutations of the input matrix (shingles × documents) and the signature matrix M; the similarities of columns and of signatures (approximately) match: Col/Col 0.75, 0.75, 0, 0 vs. Sig/Sig 0.34, 0.67, 0, 0 for the pairs 1-3, 2-4, 1-2, 3-4.]

  31. Locality-Sensitive Hashing: hash the columns of the signature matrix M; similar columns are likely to hash to the same bucket.
      - Divide matrix M into b bands of r rows (the number of rows of M is b·r).
      - Candidate column pairs are those that hash to the same bucket for ≥ 1 band.
      [Figure: bands of M hashed into buckets, and the resulting S-curve of the probability of sharing ≥ 1 bucket against similarity, with the threshold s marked.]

  32. [Pipeline for general data: Points → design a locality-sensitive hash function (for a given distance metric) → Signatures: short integer signatures that reflect point similarity → apply the "bands" technique → Candidate pairs: those pairs of signatures that we need to test for similarity.]

  33. The S-curve is where the "magic" happens.
      - What one hash-code gives you: Pr[h_π(C1) = h_π(C2)] = sim(D1, D2), i.e., the probability of equal hash-values is the similarity: a straight line in t.
      - What we want: probability 1 if t > s, and no chance if t < s: a step function at the threshold.
      - How do we get a step function? By choosing r and b!

  34. Remember: b bands, r rows per band. Let sim(C1, C2) = s. What is the probability that at least 1 band is equal?
      - Pick some band (r rows).
      - Prob. that the elements in a single row of columns C1 and C2 are equal: s
      - Prob. that all rows in a band are equal: s^r
      - Prob. that some row in a band is not equal: 1 - s^r
      - Prob. that no band is equal: (1 - s^r)^b
      - Prob. that at least 1 band is equal: 1 - (1 - s^r)^b
      So P(C1, C2 is a candidate pair) = 1 - (1 - s^r)^b.

  35. Picking r and b to get the best S-curve: 50 hash-functions (r = 5, b = 10).
      [Plot: probability of sharing a bucket against similarity s.]

  36. Given a fixed threshold s, we want to choose r and b such that P(candidate pair) = 1 - (1 - t^r)^b has a "step" right around s.
      [Four panels of prob = 1 - (1 - t^r)^b against similarity t: r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; r = 10 with b = 1..50.]
      (A sketch tabulating such curves follows.)
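A minimal sketch tabulating 1 - (1 - t^r)^b for a few illustrative (r, b) choices, to see where the step lands:

    def p_candidate(t, r, b):
        # probability of >= 1 identical band
        return 1 - (1 - t ** r) ** b

    for r, b in [(1, 1), (5, 10), (10, 5)]:
        row = "  ".join(f"{p_candidate(i / 10, r, b):.3f}" for i in range(1, 10))
        print(f"r={r:2d}, b={b:2d}: {row}")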

  37. [Pipeline recap: Signatures: short vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.]

  38. We have used LSH to find similar documents; more generally, we found similar columns in large sparse matrices with high Jaccard similarity.
      - Can we use LSH for other distance measures, e.g., Euclidean distance or cosine distance?
      - Let's generalize what we've learned!

  39. d(·,·) is a distance measure if it is a function from pairs of points x, y to real numbers such that:
      - d(x, y) ≥ 0
      - d(x, y) = 0 iff x = y
      - d(x, y) = d(y, x)
      - d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
      Examples:
      - Jaccard distance for sets: 1 minus the Jaccard similarity
      - Cosine distance for vectors: the angle between the vectors
      - Euclidean distances: the L2 norm, d(x, y) = the square root of the sum of the squares of the differences between x and y in each dimension (the most common notion of "distance"); and the L1 norm, the sum of the absolute differences in each dimension (Manhattan distance: the distance if you travel along coordinates only)

  40. The Jaccard distance d(x, y) = 1 - |x ∩ y| / |x ∪ y| is a distance measure:
      - d(x, y) ≥ 0 because |x ∩ y| ≤ |x ∪ y|, so the similarity is ≤ 1 and the distance 1 - similarity is ≥ 0
      - d(x, x) = 0 because x ∩ x = x ∪ x; and if x ≠ y, then |x ∩ y| is strictly less than |x ∪ y|, so sim(x, y) < 1 and thus d(x, y) > 0
      - d(x, y) = d(y, x) because union and intersection are symmetric
      - d(x, y) ≤ d(x, z) + d(z, y) is trickier; we must show
        (1 - |x ∩ z| / |x ∪ z|) + (1 - |y ∩ z| / |y ∪ z|) ≥ 1 - |x ∩ y| / |x ∪ y|

  41. Proving the triangle inequality d(x, z) + d(z, y) ≥ d(x, y), i.e., (1 - |x ∩ z| / |x ∪ z|) + (1 - |y ∩ z| / |y ∪ z|) ≥ 1 - |x ∩ y| / |x ∪ y|:
      - Remember: |a ∩ b| / |a ∪ b| is the probability that minhash(a) = minhash(b).
      - Thus, 1 - |a ∩ b| / |a ∪ b| is the probability that minhash(a) ≠ minhash(b).
      - So we need to show: Pr[minhash(x) ≠ minhash(y)] ≤ Pr[minhash(x) ≠ minhash(z)] + Pr[minhash(z) ≠ minhash(y)]

  42. Whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true: if we had both minhash(x) = minhash(z) and minhash(z) = minhash(y), then minhash(x) = minhash(y) would follow. So the event on the left is contained in the union of the two events on the right, and the inequality of probabilities follows.

  43. For Min-Hashing signatures, we got one Min-Hash function for each permutation of rows.
      - A "hash function" here is any function that allows us to say whether two elements are "equal". Shorthand: h(x) = h(y) means "h says x and y are equal".
      - A family of hash functions is any set of hash functions from which we can pick one at random efficiently. Example: the set of Min-Hash functions generated from permutations of rows.

  44. Suppose we have a space S of points with a distance measure d(x, y) (a critical assumption). A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
      1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1
      2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2
      With a locality-sensitive family we can do LSH!

  45. [Figure: Pr[h(x) = h(y)] against the distance d(x, y): small distance (below the threshold d1) means high probability (at least p1) of hashing to the same value; large distance (above d2) means low probability (at most p2).]

  46. Let S = the space of all sets, d = the Jaccard distance, and H = the family of Min-Hash functions for all permutations of rows. Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y). This simply restates the theorem about Min-Hashing in terms of distances rather than similarities.

  47. Claim: the Min-Hash family H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d: if the distance is < 1/3 (so the similarity is > 2/3), then the probability that the Min-Hash values agree is > 2/3.
      - For Jaccard similarity, Min-Hashing gives a (d1, d2, (1 - d1), (1 - d2))-sensitive family for any d1 < d2.

  48. Can we reproduce the "S-curve" effect we saw before for any locality-sensitive family?
      - The "bands" technique we learned for signature matrices carries over to this more general setting: we can do LSH with any (d1, d2, p1, p2)-sensitive family!
      - Two constructions: the AND construction, like "rows in a band", and the OR construction, like "many bands".

  49. AND construction: given family H, construct family H' consisting of r functions from H.
      - For h = [h1, …, hr] in H', we say h(x) = h(y) if and only if h_i(x) = h_i(y) for all 1 ≤ i ≤ r. Note this corresponds to creating a band of size r.
      - Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)^r, (p2)^r)-sensitive. Proof: use the fact that the h_i's are independent.
      - This lowers the probability for large distances (good), but also lowers the probability for small distances (bad).

  50. Independence of hash functions (HFs) here really means that the probability of two HFs both saying "yes" is the product of the probabilities of each saying "yes".
      - Two particular hash functions could be highly correlated, for example two Min-Hash functions whose permutations agree in the first one million entries.
      - However, the probabilities in the definition of an LS family are over all possible members of H and H' (i.e., the average case, not the worst case).

  51. OR construction: given family H, construct family H' consisting of b functions from H.
      - For h = [h1, …, hb] in H', h(x) = h(y) if and only if h_i(x) = h_i(y) for at least one i.
      - Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1 - (1 - p1)^b, 1 - (1 - p2)^b)-sensitive. Proof: use the fact that the h_i's are independent.
      - This raises the probability for small distances (good), but also raises the probability for large distances (bad). (A sketch of both constructions follows.)
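A minimal sketch of how the two constructions transform a family's probabilities; just the two theorems above, applied to an assumed (d1, d2, 0.8, 0.2)-sensitive starting family:

    def and_construction(p, r):
        # all r independent functions must agree: p -> p^r
        return p ** r

    def or_construction(p, b):
        # at least one of b independent functions agrees: p -> 1 - (1 - p)^b
        return 1 - (1 - p) ** b

    p1, p2 = 0.8, 0.2
    print(and_construction(p1, 4), and_construction(p2, 4))   # ~0.4096, ~0.0016
    print(or_construction(p1, 4), or_construction(p2, 4))     # ~0.9984, ~0.5904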

  52. AND makes all probabilities shrink, but by choosing r correctly we can make the lower probability approach 0 while the higher does not.
      - OR makes all probabilities grow, but by choosing b correctly we can make the upper probability approach 1 while the lower does not.
      [Two plots of the probability of sharing a bucket against the similarity of a pair of items: AND with r = 1..10, b = 1, and OR with r = 1, b = 1..10.]

  53. By choosing b and r correctly, we can make the lower probability approach 0 while the higher approaches 1.
      - As for the signature matrix, we can use the AND construction followed by the OR construction, or vice versa, or any alternating sequence of ANDs and ORs.

  54. r-way AND followed by b-way OR construction: exactly what we did with Min-Hashing.
      - AND: if the bands match in all r values, hash to the same bucket.
      - OR: columns that share ≥ 1 common bucket become candidates.
      - Take points x and y such that Pr[h(x) = h(y)] = s. H will make (x, y) a candidate pair with probability s; the construction makes (x, y) a candidate pair with probability 1 - (1 - s^r)^b. The S-curve!
      - Example: take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4.

  55. The r = 4, b = 4 AND-OR example:

      s     p = 1 - (1 - s^4)^4
      0.2   0.0064
      0.3   0.0320
      0.4   0.0985
      0.5   0.2275
      0.6   0.4260
      0.7   0.6666
      0.8   0.8785
      0.9   0.9860

      r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family. (A sketch reproducing the table follows.)
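A minimal sketch reproducing this table from the cascade formula:

    def and_or(s, r, b):
        # r-way AND then b-way OR: s -> 1 - (1 - s^r)^b
        return 1 - (1 - s ** r) ** b

    for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"{s:.1f}  {and_or(s, r=4, b=4):.4f}")   # matches the table above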

  56. Picking r and b to get desired performance: 50 hash-functions (r = 5, b = 10) and a threshold s.
      [Plot: Prob(candidate pair) against similarity s, with two shaded areas.]
      - Blue area X, the false-negative rate: pairs with sim > s that won't share a band and so will never become candidates; we will never consider these pairs for the (slow/exact) similarity calculation!
      - Green area Y, the false-positive rate: pairs with sim < s that we will consider as candidates. This is not too bad: we will consider them for the (slow/exact) similarity computation and then discard them.

  57. Picking r and b to get desired performance: 50 hash-functions (r · b = 50).
      [Plot: Prob(candidate pair) against similarity s for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5, with the threshold s marked.]

  58. OR-AND composition: apply a b-way OR construction followed by an r-way AND construction.
      - This transforms similarity s into (1 - (1 - s)^b)^r: the same S-curve, mirrored horizontally and vertically.
      - Example: take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4.

  59. The b = 4, r = 4 OR-AND example:

      s     p = (1 - (1 - s)^4)^4
      0.1   0.0140
      0.2   0.1215
      0.3   0.3334
      0.4   0.5740
      0.5   0.7725
      0.6   0.9015
      0.7   0.9680
      0.8   0.9936

      The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family.

  60. Example: apply the (4, 4) OR-AND construction followed by the (4, 4) AND-OR construction.
      - This transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family.
      - Note this family uses 256 (= 4·4·4·4) of the original hash functions. (A quick check follows.)
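A quick check of the composed construction, using the same cascade formulas as above on the (.2, .8, .8, .2) starting family:

    def or_and(p, b, r):
        # b-way OR then r-way AND: p -> (1 - (1 - p)^b)^r
        return (1 - (1 - p) ** b) ** r

    def and_or(p, r, b):
        # r-way AND then b-way OR: p -> 1 - (1 - p^r)^b
        return 1 - (1 - p ** r) ** b

    for p in (0.8, 0.2):                               # p1 and p2 of the family
        print(and_or(or_and(p, b=4, r=4), r=4, b=4))   # ~0.9999996 and ~0.0008715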

  61. For each AND-OR S-curve 1 - (1 - s^r)^b, there is a threshold t for which 1 - (1 - t^r)^b = t.
      - Above t, high probabilities are increased; below t, low probabilities are decreased.
      - You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t. Iterate as you like.
      - A similar observation holds for the OR-AND type of S-curve, (1 - (1 - s)^b)^r. (A small fixed-point sketch follows.)
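A minimal sketch locating the threshold t by bisection on f(t) = 1 - (1 - t^r)^b - t, assuming a single interior crossing (which holds for the r, b used here):

    def fixed_point(r, b, iters=60):
        # f is negative below the interior fixed point and positive above it
        f = lambda t: 1 - (1 - t ** r) ** b - t
        lo, hi = 1e-6, 1 - 1e-6
        for _ in range(iters):
            mid = (lo + hi) / 2
            if f(mid) > 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    print(fixed_point(r=4, b=4))   # ~0.72: where the (4,4) AND-OR curve pivots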

  62. [Figure: the S-curve of Prob(candidate pair) against similarity, with the threshold t marked: above t the probability is raised, below t it is lowered.]

  63. Pick any two distances d1 < d2.
      - Start with a (d1, d2, (1 - d1), (1 - d2))-sensitive family.
      - Apply constructions to amplify it into a (d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0.
      - The closer to 0 and 1 we get, the more hash functions must be used!

  64. LSH methods for other distance metrics: cosine distance → random hyperplanes; Euclidean distance → project on lines.
      [Pipeline: Points → design a (d1, d2, p1, p2)-sensitive family of hash functions (for the particular distance metric; this step depends on the distance function used) → Signatures: short integer signatures that reflect their similarity → amplify the family using AND and OR constructions → Candidate pairs: those pairs of signatures that we need to test for similarity.]

  65. [Pipeline with examples:
      - Documents → MinHash → integer signatures → "bands" technique → candidate pairs
      - Data points → random hyperplanes → ±1 signatures → "bands" technique → candidate pairs]

  66. A  Cosine distance = angle between vectors from the origin to the points in question d(A, B) =  = arccos(A  B / ǁ A ǁ · ǁ B ǁ ) B A  B  Has range 𝟏 … 𝝆 (equivalently 0...180 ° ) ‖B‖  Can divide  by 𝝆 to have distance in range 0…1  Cosine similarity = 1-d(A,B) 𝐵⋅𝐶  But often defined as cosine sim: cos(𝜄) = 𝐵 𝐶 - Has range - 1…1 for general vectors - Range 0..1 for non-negative vectors (angles up to 90 ° ) 3/2/2020 99

  67. For cosine distance, there is a technique called random hyperplanes, similar in spirit to Min-Hashing.
      - The random hyperplanes method is a (d1, d2, (1 - d1/π), (1 - d2/π))-sensitive family for any d1 and d2.
      - Reminder, (d1, d2, p1, p2)-sensitive: (1) if d(x, y) < d1, then the prob. that h(x) = h(y) is at least p1; (2) if d(x, y) > d2, then the prob. that h(x) = h(y) is at most p2.

  68. Each vector v determines a hash function h_v with two buckets: h_v(x) = +1 if v·x ≥ 0, and h_v(x) = -1 if v·x < 0.
      - LS-family H = the set of all functions derived from any vector.
      - Claim: for points x and y, Pr[h(x) = h(y)] = 1 - d(x, y)/π. (A sketch follows.)
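A minimal sketch of random-hyperplane hashing, drawing each v from a spherically symmetric (Gaussian) distribution so that every hyperplane direction is equally likely; the empirical agreement rate should approach 1 - θ/π:

    import math
    import random

    def h_v(v, x):
        # h_v(x) = +1 if v.x >= 0, else -1
        return 1 if sum(vi * xi for vi, xi in zip(v, x)) >= 0 else -1

    random.seed(0)
    n = 10_000
    planes = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]

    x, y = [1.0, 0.0], [1.0, 1.0]                  # angle theta = pi/4
    agree = sum(h_v(v, x) == h_v(v, y) for v in planes) / n
    print(agree, 1 - (math.pi / 4) / math.pi)      # both ~0.75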

  69. v’ Look in the plane of x x v and y . Hyperplane θ normal to v’. Here h(x) ≠ h(y) Hyperplane y normal to v . Here h(x) = h(y) Note: what is important is that hyperplane is outside the angle, not that the vector is inside. 3/2/2020 102

  70. So Prob[the hyperplane separates x and y (the red case in the figure)] = θ/π, and therefore P[h(x) = h(y)] = 1 - θ/π = 1 - d(x, y)/π.
