We know: Pr[h(C1) = h(C2)] = sim(C1, C2). Now generalize to multiple hash functions. The similarity of two signatures is the fraction of the hash functions on which they agree. Thus, the expected similarity of two signatures equals the Jaccard similarity of the columns or sets that the signatures represent. And the longer the signatures, the smaller the expected error. 3/2/2020
[Figure: a permutation example. Input matrix (Shingles x Documents) with three row permutations and the resulting signature matrix M. Similarities for column pairs 1-3, 2-4, 1-2, 3-4: Col/Col 0.75, 0.75, 0, 0; Sig/Sig 0.34, 0.67, 0, 0.]
Permuting rows even once is prohibitive. Instead: row hashing! Pick K = 100 hash functions h_i; ordering rows under h_i gives a random permutation of rows. One-pass implementation: for each column c and hash function h_i, keep a "slot" M(i, c) for the min-hash value, initialized to ∞. Scan rows looking for 1s: suppose row j has 1 in column c; then for each h_i, if h_i(j) < M(i, c), set M(i, c) := h_i(j). How to pick a random hash function h(x)? Universal hashing: h_{a,b}(x) = ((a·x + b) mod p) mod N, where a, b are random integers and p is a prime number (p > N).
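The universal hashing scheme just described can be sketched as follows (the prime and the seed here are illustrative choices, not values from the slides):

```python
import random

def make_hash(p, N):
    """Return a random hash function h_{a,b}(x) = ((a*x + b) mod p) mod N."""
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % N

random.seed(0)
p, N = 2_147_483_647, 100          # p is a prime > N (here: the Mersenne prime 2^31 - 1)
hs = [make_hash(p, N) for _ in range(3)]   # K independent hash functions
print([h(42) for h in hs])                 # three pseudo-random values in [0, N)
```

Each call to `make_hash` draws fresh a, b, giving an (approximately) independent permutation-like ordering of the rows.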
Important: hash r only once per hash function, not once per 1 in row r.

for each row r do begin
  for each hash function h_i do
    compute h_i(r);
  for each column c
    if c has 1 in row r then
      for each hash function h_i do
        if h_i(r) < M(i, c) then M(i, c) := h_i(r);
end;
Example: h(x) = x mod 5, g(x) = (2x+1) mod 5.

Input matrix (Row, C1, C2):
Row 1: 1 0
Row 2: 0 1
Row 3: 1 1
Row 4: 1 0
Row 5: 0 1

Processing rows in order, the slots (M(h,C1), M(h,C2)) and (M(g,C1), M(g,C2)) evolve:
Row 1: h(1)=1, g(1)=3  ->  h: (1, ∞),  g: (3, ∞)
Row 2: h(2)=2, g(2)=0  ->  h: (1, 2),  g: (3, 0)
Row 3: h(3)=3, g(3)=2  ->  h: (1, 2),  g: (2, 0)
Row 4: h(4)=4, g(4)=4  ->  h: (1, 2),  g: (2, 0)
Row 5: h(5)=0, g(5)=1  ->  h: (1, 0),  g: (2, 0)

Signature matrix M: C1 = (1, 2), C2 = (0, 0).
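The one-pass algorithm, run on this worked example, can be sketched in Python (the column sets and the two hash functions are exactly those of the example):

```python
INF = float('inf')

# Input matrix: rows 1..5, columns (C1, C2); 1 = shingle present
matrix = {1: (1, 0), 2: (0, 1), 3: (1, 1), 4: (1, 0), 5: (0, 1)}
hash_funcs = [lambda x: x % 5,            # h(x)
              lambda x: (2 * x + 1) % 5]  # g(x)

# M[i][c] holds the running min-hash value for hash function i, column c
M = [[INF, INF] for _ in hash_funcs]
for r, row in matrix.items():
    hvals = [h(r) for h in hash_funcs]     # hash r once per function
    for c, bit in enumerate(row):
        if bit:                            # row r has a 1 in column c
            for i, hv in enumerate(hvals):
                M[i][c] = min(M[i][c], hv)

print(M)  # [[1, 0], [2, 0]]: signature C1 = (1, 2), C2 = (0, 0)
```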
[Pipeline: Document -> The set of strings of length k that appear in the document -> Signatures: short integer vectors that represent the sets and reflect their similarity -> Locality-Sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity.]

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8). The general LSH idea: use a hash function that tells whether x and y form a candidate pair: a pair of elements whose similarity must be evaluated. For Min-Hash matrices: hash columns of signature matrix M to many buckets; each pair of documents that hashes into the same bucket is a candidate pair.
Pick a similarity threshold s (0 < s < 1). Columns x and y of M are a candidate pair if their signatures agree on at least a fraction s of their rows: M(i, x) = M(i, y) for at least a fraction s of the values of i. We expect documents x and y to have roughly the same (Jaccard) similarity as their signatures.
Big idea: Hash columns of signature matrix M several times. Arrange that (only) similar columns are likely to hash to the same bucket, with high probability. Candidate pairs are those that hash to the same bucket at least once.
[Figure: the signature matrix M divided into b bands of r rows each; each column of M is one signature.]
Divide matrix M into b bands of r rows. For each band, hash its portion of each column to a hash table with k buckets. Make k as large as possible. Candidate column pairs are those that hash to the same bucket for ≥ 1 band. Tune b and r to catch most similar pairs, but few non-similar pairs.
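A minimal sketch of the banding step. Here the bucket hash simply uses each band's tuple of values as a dictionary key (i.e., k is effectively unbounded), and the signatures are toy values invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b, r):
    """signatures: {doc_id: list of b*r min-hash values}. Returns candidate pairs."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            chunk = tuple(sig[band * r:(band + 1) * r])  # this band's r values
            buckets[chunk].append(doc)                   # hash the band to a bucket
        for docs in buckets.values():                    # same bucket in >= 1 band
            candidates.update(combinations(sorted(docs), 2))
    return candidates

sigs = {'A': [1, 2, 3, 4], 'B': [1, 2, 9, 9], 'C': [7, 8, 9, 9]}
print(candidate_pairs(sigs, b=2, r=2))  # A,B agree in band 0; B,C agree in band 1
```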
[Figure: one band of matrix M hashed to buckets.] Columns 2 and 6 hash to the same bucket, so they are probably identical in this band (candidate pair). Columns 6 and 7 hash to different buckets, so they are surely different in this band.
There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band. Hereafter, we assume that "same bucket" means "identical in that band". This assumption is needed only to simplify the analysis, not for the correctness of the algorithm.
Assume the following case: 100,000 columns of M (100k docs), with signatures of 100 integers (rows). Therefore, the signatures take 40 MB (100,000 signatures × 100 integers × 4 bytes). Goal: find pairs of documents that are at least s = 0.8 similar. Choose b = 20 bands of r = 5 integers/band.
Find pairs of s = 0.8 similarity; set b = 20, r = 5. Assume sim(C1, C2) = 0.8. Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: we want them to hash to at least 1 common bucket (at least one band is identical). Probability C1, C2 are identical in one particular band: (0.8)^5 = 0.328. Probability C1, C2 are not identical in any of the 20 bands: (1 - 0.328)^20 ≈ 0.00035. That is, about 1/3000 of the 80%-similar column pairs are false negatives (we miss them); we would find about 99.965% of the truly similar pairs of documents.
Find pairs of s = 0.8 similarity; set b = 20, r = 5. Assume sim(C1, C2) = 0.3. Since sim(C1, C2) < s, we want C1, C2 to hash to NO common bucket (all bands should be different). Probability C1, C2 are identical in one particular band: (0.3)^5 = 0.00243. Probability C1, C2 are identical in at least 1 of 20 bands: 1 - (1 - 0.00243)^20 ≈ 0.0474. In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up becoming candidate pairs. They are false positives: we will have to examine them (they are candidate pairs), but it will then turn out that their similarity is below threshold s.
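Both calculations can be checked in a few lines (the slides' 0.00035 and 4.74% are slightly looser roundings of the same numbers):

```python
r, b = 5, 20

def p_candidate(s, r=r, b=b):
    """Probability that two columns with similarity s become a candidate pair."""
    return 1 - (1 - s ** r) ** b

# 80%-similar pair
print(round(0.8 ** r, 3))             # 0.328: identical in one particular band
print(round((1 - 0.8 ** r) ** b, 5))  # 0.00036: missed by all 20 bands (false negative)
print(round(p_candidate(0.8), 4))     # 0.9996: found as a candidate pair

# 30%-similar pair
print(round(0.3 ** r, 5))             # 0.00243: identical in one particular band
print(round(p_candidate(0.3), 4))     # 0.0475: false-positive candidate (about 4.7%)
```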
Pick: the number of Min-Hashes (rows of M), the number of bands b, and the number of rows r per band, to balance false positives and false negatives. Example: with only 10 bands of 10 rows, the number of false positives would go down, but the number of false negatives would go up.
[Figure: the ideal case. Probability of sharing a bucket as a function of t = sim(C1, C2): probability 1 if t > s, no chance if t < s. We say "yes" exactly for pairs above the similarity threshold s.]
[Figure: with a single hash function, the probability of sharing a bucket grows linearly with t = sim(C1, C2). Remember: probability of equal hash-values = similarity.]
[Figure: an S-curve approximating the step at s. The region above the curve to the right of s gives false negatives; the region under the curve to the left of s gives false positives.]
Say columns C1 and C2 have similarity t. Pick any band (r rows). Probability that all rows in the band are equal = t^r. Probability that some row in the band is unequal = 1 - t^r. Probability that no band is identical = (1 - t^r)^b. Probability that at least 1 band is identical = 1 - (1 - t^r)^b.
[Figure: the S-curve 1 - (1 - t^r)^b, probability of sharing a bucket vs. similarity t = sim(C1, C2). The t^r term: all rows of a band are equal; 1 - t^r: some row of a band is unequal; (1 - t^r)^b: no band is identical; 1 - (1 - t^r)^b: at least one band is identical.]
Example: r = 5, b = 20. Probability that at least 1 band is identical, 1 - (1 - s^r)^b, as a function of the similarity s:

s     1-(1-s^r)^b
0.2   0.006
0.3   0.047
0.4   0.186
0.5   0.470
0.6   0.802
0.7   0.975
0.8   0.9996
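The table can be reproduced directly:

```python
r, b = 5, 20
# probability of becoming a candidate pair, for each similarity value
table = {s: 1 - (1 - s ** r) ** b for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)}
for s, p in table.items():
    print(f"s = {s:.1f}  ->  {p:.4f}")
```

The jump from 0.047 at s = 0.3 to 0.9996 at s = 0.8 is the "step" behavior the banding technique is designed to produce.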
[Plot: picking r and b to get the best S-curve; 50 hash functions (r = 5, b = 10). Probability of sharing a bucket vs. similarity. Blue area: false negative rate; green area: false positive rate.]
Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures. Then check in main memory that candidate pairs really do have similar signatures. Optional: in another pass through the data, check that the remaining candidate pairs really represent similar documents.
Shingling: Convert documents to a set representation; we used hashing to assign each shingle an ID. Min-Hashing: Convert large sets to short signatures while preserving similarity; we used similarity-preserving hashing to generate signatures with the property Pr[h(C1) = h(C2)] = sim(C1, C2), and we used hashing to get around generating random permutations. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents; we used hashing to find candidate pairs of similarity ≥ s.
Task: Given a large number (N in the millions or billions) of documents, find "near duplicates". Problem: too many documents to compare all pairs. Solution: hash documents so that similar documents hash into the same bucket; documents in the same bucket are then candidate pairs whose similarity is evaluated.
[Pipeline (recap): Document -> The set of strings of length k that appear in the document -> Signatures: short integer vectors that represent the sets and reflect their similarity -> Locality-Sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity.]
A k-shingle (or k-gram) is a sequence of k tokens that appears in the document. Example: k = 2, D1 = abcab; set of 2-shingles: C1 = S(D1) = {ab, bc, ca}. Represent a doc by the set of hash values of its k-shingles. A natural similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|. The similarity of two documents is the Jaccard similarity of their shingle sets.
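A sketch of character-level shingling and Jaccard similarity; D1 = "abcab" is from the example above, while the second document is invented for illustration:

```python
def shingles(doc, k=2):
    """Set of k-shingles (character k-grams) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

C1 = shingles("abcab")
print(sorted(C1))                          # ['ab', 'bc', 'ca']
print(jaccard(C1, shingles("abcd")))       # 0.5: shares {ab, bc} out of 4 shingles
```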
Min-Hashing: Convert large sets into short signatures, while preserving similarity: Pr[h(C1) = h(C2)] = sim(D1, D2). [Figure (recap): permutations of the input matrix (Shingles x Documents) and the resulting signature matrix M; the similarities of columns and of signatures (approximately) match.]
Hash columns of the signature matrix M: similar columns likely hash to the same bucket. Divide matrix M into b bands of r rows (so M has b·r rows). Candidate column pairs are those that hash to the same bucket for ≥ 1 band. [Figure: bands of M hashed to buckets, and the resulting S-curve for the probability of sharing ≥ 1 bucket around the threshold s.]
[Pipeline, generalized: Points -> Signatures: short integer signatures that reflect point similarity (design a locality-sensitive hash function for a given distance metric) -> Locality-Sensitive Hashing (apply the "bands" technique) -> Candidate pairs: those pairs of signatures that we need to test for similarity.]
The S-curve is where the "magic" happens. [Left figure: what 1 hash-code gives you, Pr[h(C1) = h(C2)] = sim(D1, D2), a straight line in the similarity t of two sets; remember, probability of equal hash-values = similarity. Right figure: what we want, probability = 1 if t > s and no chance if t < s.] How to get a step function? By choosing r and b!
Remember: b bands, r rows/band. Let sim(C1, C2) = s. What is the probability that at least 1 band is equal? Pick some band (r rows). Probability that the elements in a single row of columns C1 and C2 are equal = s. Probability that all rows in a band are equal = s^r. Probability that some row in a band is not equal = 1 - s^r. Probability that no band is equal = (1 - s^r)^b. Probability that at least 1 band is equal = 1 - (1 - s^r)^b. So P(C1, C2 is a candidate pair) = 1 - (1 - s^r)^b.
[Plot: picking r and b to get the best S-curve; 50 hash functions (r = 5, b = 10). Probability of sharing a bucket vs. similarity s.]
[Plots: the S-curve prob = 1 - (1 - t^r)^b for varying parameters: r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; r = 10 with b = 1..50.] Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a "step" right around s.
[Pipeline (recap): Signatures: short vectors that represent the sets and reflect their similarity -> Locality-Sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity.]
We have used LSH to find similar documents. More generally, we found similar columns in large sparse matrices with high Jaccard similarity. Can we use LSH for other distance measures, e.g., Euclidean distance or cosine distance? Let's generalize what we've learned!
d() is a distance measure if it is a function from pairs of points x, y to real numbers such that:
d(x, y) ≥ 0
d(x, y) = 0 iff x = y
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y)  (triangle inequality)

Jaccard distance for sets = 1 minus Jaccard similarity. Cosine distance for vectors = angle between the vectors. Euclidean distances: the L2 norm, d(x, y) = square root of the sum of the squares of the differences between x and y in each dimension, is the most common notion of "distance"; the L1 norm is the sum of the (absolute) differences in each dimension (Manhattan distance = distance if you travel along coordinates only).
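The distance measures listed above can be sketched directly (cosine distance here in radians, i.e., the un-normalized angle):

```python
import math

def jaccard_dist(a, b):
    """Jaccard distance of two sets: 1 minus Jaccard similarity."""
    return 1 - len(a & b) / len(a | b)

def cosine_dist(x, y):
    """Angle between two vectors, in [0, pi]."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return math.acos(dot / (nx * ny))

def l2(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1(x, y):
    """Manhattan (L1) distance."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

x, y = (1.0, 0.0), (0.0, 1.0)
print(cosine_dist(x, y))                     # pi/2: orthogonal vectors
print(l2(x, y), l1(x, y))                    # sqrt(2) and 2.0
print(jaccard_dist({'a', 'b'}, {'b', 'c'}))  # 1 - 1/3
```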
d(x, y) ≥ 0 because |x ∩ y| ≤ |x ∪ y|; thus similarity ≤ 1 and distance = 1 - similarity ≥ 0. d(x, x) = 0 because x ∩ x = x ∪ x. And if x ≠ y, then |x ∩ y| is strictly less than |x ∪ y|, so sim(x, y) < 1 and thus d(x, y) > 0. d(x, y) = d(y, x) because union and intersection are symmetric. The triangle inequality d(x, y) ≤ d(x, z) + d(z, y) is trickier; it requires showing
(1 - |x ∩ z|/|x ∪ z|) + (1 - |y ∩ z|/|y ∪ z|) ≥ 1 - |x ∩ y|/|x ∪ y|.
d(x, z) + d(z, y) ≥ d(x, y), i.e.
(1 - |x ∩ z|/|x ∪ z|) + (1 - |y ∩ z|/|y ∪ z|) ≥ 1 - |x ∩ y|/|x ∪ y|.
Remember: |a ∩ b|/|a ∪ b| = probability that minhash(a) = minhash(b); thus 1 - |a ∩ b|/|a ∪ b| = probability that minhash(a) ≠ minhash(b). So we need to show: Pr[minhash(x) ≠ minhash(y)] ≤ Pr[minhash(x) ≠ minhash(z)] + Pr[minhash(z) ≠ minhash(y)].
This holds because whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true: if both were equalities, we would have minhash(x) = minhash(z) = minhash(y).
For Min-Hashing signatures, we got a Min-Hash function for each permutation of rows. A "hash function" here is any function that allows us to say whether two elements are "equal". Shorthand: h(x) = h(y) means "h says x and y are equal". A family of hash functions is any set of hash functions from which we can pick one at random efficiently. Example: the set of Min-Hash functions generated from permutations of rows.
Suppose we have a space S of points with a distance measure d(x, y). A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S: (1) if d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1; (2) if d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2. This is the critical assumption: with an LS family we can do LSH!
[Figure: Pr[h(x) = h(y)] vs. distance d(x, y). Small distance (d(x, y) < d1): high probability, at least p1. Large distance (d(x, y) > d2): low probability of hashing to the same value, at most p2.]
Let S = the space of all sets, d = the Jaccard distance, and H = the family of Min-Hash functions for all permutations of rows. Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y). This simply restates the theorem about Min-Hashing in terms of distances rather than similarities.
Claim: the Min-Hash family H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d: if the distance is < 1/3 (so the similarity is > 2/3), then the probability that the Min-Hash values agree is > 2/3. In general, for Jaccard distance, Min-Hashing gives a (d1, d2, (1-d1), (1-d2))-sensitive family for any d1 < d2.
Can we reproduce the "S-curve" effect we saw before for any LS family? [Figure: probability of sharing a bucket vs. similarity t.] The "bands" technique we learned for signature matrices carries over to this more general setting, so we can do LSH with any (d1, d2, p1, p2)-sensitive family! Two constructions: the AND construction, like "rows in a band", and the OR construction, like "many bands".
AND construction: given family H, construct family H' consisting of r functions from H. For h = [h1, ..., hr] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for all i, 1 ≤ i ≤ r. Note this corresponds to creating a band of size r. Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)^r, (p2)^r)-sensitive. Proof: use the fact that the hi's are independent. This lowers the probability for large distances (good), but also lowers the probability for small distances (bad).
Independence of hash functions (HFs) really means that the probability of two HFs saying "yes" is the product of each saying "yes". But two particular hash functions could be highly correlated; for example, in Min-Hash, if their permutations agree in the first one million entries. However, the probabilities in the definition of an LS family are over all possible members of H, H' (i.e., the average case, not the worst case).
OR construction: given family H, construct family H' consisting of b functions from H. For h = [h1, ..., hb] in H', h(x) = h(y) if and only if hi(x) = hi(y) for at least one i. Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1-(1-p1)^b, 1-(1-p2)^b)-sensitive. Proof: use the fact that the hi's are independent. This raises the probability for small distances (good), but also raises the probability for large distances (bad).
AND makes all probabilities shrink, but by choosing r correctly we can make the lower probability approach 0 while the higher one does not. OR makes all probabilities grow, but by choosing b correctly we can make the upper probability approach 1 while the lower one does not. [Plots: probability of sharing a bucket vs. similarity of a pair of items, for AND with r = 1..10, b = 1 and for OR with r = 1, b = 1..10.]
By choosing b and r correctly, we can make the lower probability approach 0 while the higher approaches 1. As for the signature matrix, we can use the AND construction followed by the OR construction, or vice versa, or any alternating sequence of ANDs and ORs.
r-way AND followed by b-way OR construction: exactly what we did with Min-Hashing. AND: if bands match in all r values, hash to the same bucket. OR: columns that share ≥ 1 common bucket become candidates. Take points x and y such that Pr[h(x) = h(y)] = s. H will make (x, y) a candidate pair with probability s; the construction makes (x, y) a candidate pair with probability 1-(1-s^r)^b, the S-curve! Example: take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4.
s     p = 1-(1-s^4)^4
0.2   0.0064
0.3   0.0320
0.4   0.0985
0.5   0.2275
0.6   0.4260
0.7   0.6666
0.8   0.8785
0.9   0.9860

r = 4, b = 4 transforms a (0.2, 0.8, 0.8, 0.2)-sensitive family into a (0.2, 0.8, 0.8785, 0.0064)-sensitive family.
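The r = 4, b = 4 AND-OR transform can be verified numerically:

```python
def and_or(p, r=4, b=4):
    """Amplified agreement probability: r-way AND followed by b-way OR."""
    return 1 - (1 - p ** r) ** b

# (.2, .8, .8, .2)-sensitive  ->  (.2, .8, .8785, .0064)-sensitive
print(round(and_or(0.8), 4))  # 0.8785: high probability stays fairly high
print(round(and_or(0.2), 4))  # 0.0064: low probability is pushed toward 0
```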
Picking r and b to get desired performance: 50 hash functions (r = 5, b = 10), threshold s. [Plot: P(candidate pair) vs. similarity s.] Blue area X: false negative rate. These are pairs with sim > s, but the X fraction won't share a band and will never become candidates; we will never consider these pairs for the (slow/exact) similarity calculation! Green area Y: false positive rate. These are pairs with sim < s that we will consider as candidates. This is not too bad: we will consider them for the (slow/exact) similarity computation and then discard them.
Picking r and b to get desired performance: 50 hash functions (r · b = 50). [Plot: P(candidate pair) vs. similarity s for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5, with the threshold s marked.]
Apply a b-way OR construction followed by an r-way AND construction. This transforms similarity s (probability p) into (1-(1-s)^b)^r: the same S-curve, mirrored horizontally and vertically. Example: take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4.
[Plot: P(candidate pair) vs. similarity s for the OR-AND construction.]

s     p = (1-(1-s)^4)^4
0.1   0.0140
0.2   0.1215
0.3   0.3334
0.4   0.5740
0.5   0.7725
0.6   0.9015
0.7   0.9680
0.8   0.9936

The example transforms a (0.2, 0.8, 0.8, 0.2)-sensitive family into a (0.2, 0.8, 0.9936, 0.1215)-sensitive family.
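Likewise, the b = 4, r = 4 OR-AND transform can be verified numerically:

```python
def or_and(p, b=4, r=4):
    """Amplified agreement probability: b-way OR followed by r-way AND."""
    return (1 - (1 - p) ** b) ** r

# (.2, .8, .8, .2)-sensitive  ->  (.2, .8, .9936, .1215)-sensitive
print(round(or_and(0.8), 4))  # 0.9936: high probability pushed toward 1
print(round(or_and(0.2), 4))  # 0.1215: low probability is raised (the cost)
```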
Example: apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction. This transforms a (0.2, 0.8, 0.8, 0.2)-sensitive family into a (0.2, 0.8, 0.9999996, 0.0008715)-sensitive family. Note this family uses 256 (= 4·4·4·4) of the original hash functions.
For each AND-OR S-curve 1-(1-s^r)^b, there is a threshold t for which 1-(1-t^r)^b = t. Above t, high probabilities are increased; below t, low probabilities are decreased. You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t. Iterate as you like. A similar observation holds for the OR-AND type of S-curve, (1-(1-s)^b)^r.
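The fixed point t can be found numerically; a sketch using simple bisection (the choice of r = 5, b = 10 is illustrative, and the helper names are invented here):

```python
def s_curve(t, r, b):
    return 1 - (1 - t ** r) ** b

def fixed_point(r, b, lo=1e-6, hi=1 - 1e-6, iters=100):
    """Bisect for the nontrivial t in (0, 1) where s_curve(t) = t.
    The S-curve lies below the diagonal left of t and above it right of t."""
    f = lambda t: s_curve(t, r, b) - t
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:   # curve above the diagonal: the crossing is to the left
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

t = fixed_point(5, 10)
print(round(t, 3))   # roughly 0.62 for r = 5, b = 10
```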
[Figure: P(candidate pair) vs. s, with the fixed-point threshold t marked: the probability is raised above t and lowered below t.]
Pick any two distances d1 < d2. Start with a (d1, d2, (1-d1), (1-d2))-sensitive family, then apply constructions to amplify it into a (d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0. The closer to 0 and 1 we get, the more hash functions must be used!
LSH methods for other distance metrics: cosine distance (random hyperplanes) and Euclidean distance (project on lines). [Pipeline: Points -> Signatures: short integer signatures that reflect their similarity (design a (d1, d2, p1, p2)-sensitive family of hash functions for that particular distance metric; this step depends on the distance function used) -> Locality-Sensitive Hashing (amplify the family using AND and OR constructions) -> Candidate pairs: those pairs of signatures that we need to test for similarity.]
[Figure: the two instantiations of the pipeline. Documents: MinHash signatures, then the "bands" technique, yielding candidate pairs. Data points: Random Hyperplanes signatures (entries in {+1, -1}), then the "bands" technique, yielding candidate pairs.]
Cosine distance = the angle between the vectors from the origin to the points in question: d(A, B) = θ = arccos(A·B / (ǁAǁ·ǁBǁ)). It has range 0...π (equivalently 0...180°); we can divide by π to have a distance in the range 0...1. Cosine similarity = 1 - d(A, B), but it is often defined instead as cos(θ) = A·B / (ǁAǁ ǁBǁ), which has range -1...1 for general vectors and 0...1 for non-negative vectors (angles up to 90°).
For cosine distance, there is a technique called Random Hyperplanes, similar to Min-Hashing. The Random Hyperplanes method gives a (d1, d2, (1-d1/π), (1-d2/π))-sensitive family for any d1 and d2. Reminder: (d1, d2, p1, p2)-sensitive means: (1) if d(x, y) < d1, then the probability that h(x) = h(y) is at least p1; (2) if d(x, y) > d2, then the probability that h(x) = h(y) is at most p2.
Each vector v determines a hash function h_v with two buckets: h_v(x) = +1 if v·x ≥ 0, and h_v(x) = -1 if v·x < 0. The LS family H is the set of all functions derived from any vector. Claim: for points x and y, Pr[h(x) = h(y)] = 1 - d(x, y)/π.
[Figure: look in the plane of x and y, with angle θ between them. A hyperplane normal to a vector v' that falls inside the angle separates x and y, so h(x) ≠ h(y); a hyperplane normal to a vector v outside the angle gives h(x) = h(y). Note: what matters is whether the hyperplane is outside the angle, not whether the vector is inside it.]
So: Pr[separating case, h(x) ≠ h(y)] = θ/π, and therefore Pr[h(x) = h(y)] = 1 - θ/π = 1 - d(x, y)/π.
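The claim can be checked empirically by drawing random Gaussian normal vectors (the sample count and the two test points are arbitrary choices for illustration):

```python
import math, random

def rh_hash(v, x):
    """Hyperplane hash: +1 or -1 depending on which side of v's hyperplane x is."""
    dot = sum(vi * xi for vi, xi in zip(v, x))
    return 1 if dot >= 0 else -1

def angle(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(dot / (nx * ny))

random.seed(0)
x, y = (1.0, 0.0), (1.0, 1.0)   # 45 degrees apart, so 1 - theta/pi = 0.75
vs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20000)]
agree = sum(rh_hash(v, x) == rh_hash(v, y) for v in vs) / len(vs)
print(round(agree, 2), round(1 - angle(x, y) / math.pi, 2))  # both approx. 0.75
```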