CS 498ABD: Algorithms for Big Data, Spring 2019 Similarity Estimation Lecture 13 March 5, 2019 Chandra (UIUC) CS498ABD 1 Spring 2019 1 / 30
Similar Items
Modern data: often unstructured and high-dimensional.
Examples: documents, web pages, reviews, images, audio, video.
Given a collection of objects:
  find all “similar” items (application: duplicate detection in documents)
  for an item x, find all items in the collection similar to x (near-neighbor search, many applications)
Comparing two items is expensive. Comparing all pairs is infeasible.
High-level Ideas
How to measure similarity/dissimilarity?
Proxy functions for estimating/capturing similarity
Focus only on highly similar items rather than trying to find similarity for all pairs
Compression/sketching/hashing to create compact representations of objects
Fast/approximate near-neighbor search via ideas such as locality-sensitive hashing, clustering, etc.
Topics
Jaccard similarity for sets and minhash
Angular distance and simhash
Locality-sensitive hashing
Part I: Jaccard Similarity and Min-wise Independent Hashing
Set Similarity
Motivation: How do we detect near-duplicate text documents? Web pages, papers, homeworks, ...?
Model documents as (multi)sets of “words” or, more generally, “shingles”:
  a very large set of words/shingles
  each document is a set of words/shingles
  a large number of documents, and each document is sparse in the space of words/shingles
Jaccard similarity of sets
Definition: given two sets S, T, the Jaccard similarity between S and T is defined as $\frac{|S \cap T|}{|S \cup T|}$ and denoted by $\mathrm{SIM}(S, T)$.
Assumption: S, T are very similar if $\mathrm{SIM}(S, T) \ge \alpha$ for some fixed threshold $\alpha$, say $\alpha = 0.7$.
Question: Given many documents, how do we find similar documents?
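As a small illustration of the definition, here is a minimal Python sketch. The function name `jaccard`, the example documents, and the convention that two empty sets have similarity 1 are assumptions made for this example, not part of the lecture.

```python
def jaccard(s: set, t: set) -> float:
    """Return |S ∩ T| / |S ∪ T| for two sets."""
    union = s | t
    if not union:
        # Convention chosen for this sketch only: two empty sets count as identical.
        return 1.0
    return len(s & t) / len(union)


# Two tiny "documents" modeled as sets of words.
doc_a = {"the", "quick", "brown", "fox"}
doc_b = {"the", "quick", "red", "fox"}

# 3 shared words out of 5 distinct words overall.
print(jaccard(doc_a, doc_b))
```

With a threshold such as α = 0.7, this pair would not be reported as very similar.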
Min Hashing
Let n be the size of the vocabulary.
For a permutation $\sigma$ of $[n]$ and a set $S \subseteq [n]$, let $\sigma_{\min}(S) = \min\{\sigma(i) \mid i \in S\}$.
[Example figure omitted]
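To make the definition of σ_min concrete, the following sketch evaluates it on a hand-picked permutation. The helper name `sigma_min`, the 0-based indexing (the slides use [n] = {1, ..., n}), and the specific permutation and sets are all assumptions chosen for illustration.

```python
def sigma_min(sigma: list, s: set) -> int:
    """Smallest value that the permutation sigma assigns to an element of the non-empty set s."""
    return min(sigma[i] for i in s)


# A fixed permutation of {0, ..., 5}, stored so that sigma[i] is the image of i.
sigma = [3, 0, 5, 1, 4, 2]

S = {0, 2, 4}  # images under sigma: 3, 5, 4
T = {2, 3, 4}  # images under sigma: 5, 1, 4

print(sigma_min(sigma, S))  # 3
print(sigma_min(sigma, T))  # 1
```

Here the two minima differ because the element of S ∪ T with the smallest image (element 3) lies in T only.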
Min Hashing
Lemma: Let S, T be two subsets of $[n]$. Suppose $\sigma$ is a random permutation of $[n]$. Then $\Pr[\sigma_{\min}(S) = \sigma_{\min}(T)] = \frac{|S \cap T|}{|S \cup T|}$.
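One way to see why the lemma holds is sketched below. The symbols $U$ and $e^\ast$ are introduced here for the argument and do not appear on the slides; the sketch assumes $S \cup T$ is non-empty and that $\sigma$ is uniform over all permutations of $[n]$.

```latex
Let $U = S \cup T$ and let $e^\ast \in U$ be the element of $U$ with the smallest image,
i.e. $\sigma(e^\ast) = \sigma_{\min}(U)$.

% Step 1: by symmetry, every element of U is equally likely to be the minimizer.
\[
  \Pr[e^\ast = u] = \frac{1}{|U|} \quad \text{for every } u \in U .
\]

% Step 2: the two minima agree exactly when the minimizer lies in both sets.
% If e* is in both S and T, both minima equal sigma(e*).
% If e* is in S but not T, then sigma_min(S) = sigma(e*) < sigma_min(T), since sigma is injective
% (and symmetrically when e* is in T but not S).
\[
  \sigma_{\min}(S) = \sigma_{\min}(T) \iff e^\ast \in S \cap T .
\]

% Step 3: combine the two steps.
\[
  \Pr[\sigma_{\min}(S) = \sigma_{\min}(T)]
  = \Pr[e^\ast \in S \cap T]
  = \frac{|S \cap T|}{|S \cup T|} .
\]
```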
Min Hashing
Pick $\ell$ random permutations $\sigma^1, \sigma^2, \ldots, \sigma^\ell$.
For each set S store the $\ell$-tuple $(\sigma^1_{\min}(S), \ldots, \sigma^\ell_{\min}(S))$.
To check similarity between S and T, let $s = |\{ i \mid \sigma^i_{\min}(S) = \sigma^i_{\min}(T) \}|$.
Output the estimator $Z = s/\ell$ for $\mathrm{SIM}(S, T)$.
Z is an exact (unbiased) estimator for $\mathrm{SIM}(S, T)$: $\mathbb{E}[Z] = \mathrm{SIM}(S, T)$.
Exercise: Suppose $\mathrm{SIM}(S, T) \ge \alpha$. How large should $\ell$ be such that $\Pr[Z < (1 - \epsilon)\alpha] < \delta$?
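The scheme above can be sketched in a few lines of Python. This is an illustrative sketch using explicitly stored random permutations; the function names, the seed, the 0-based universe `range(n)`, and the parameter values are assumptions for the example only.

```python
import random


def make_permutations(n: int, ell: int, seed: int = 0) -> list:
    """Draw ell independent uniformly random permutations of range(n)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(ell):
        sigma = list(range(n))
        rng.shuffle(sigma)  # sigma[i] is the image of element i
        perms.append(sigma)
    return perms


def signature(perms: list, s: set) -> tuple:
    """The ell-tuple of min-hash values of the non-empty set s."""
    return tuple(min(sigma[i] for i in s) for sigma in perms)


def estimate_similarity(sig_s: tuple, sig_t: tuple) -> float:
    """Fraction of coordinates on which two signatures agree (the estimator Z = s / ell)."""
    matches = sum(1 for a, b in zip(sig_s, sig_t) if a == b)
    return matches / len(sig_s)


n, ell = 1000, 400
perms = make_permutations(n, ell)

S = set(range(0, 80))
T = set(range(20, 100))
true_sim = len(S & T) / len(S | T)  # 60 common elements out of 100

z = estimate_similarity(signature(perms, S), signature(perms, T))
print("true:", true_sim, "estimate:", z)
```

Once the signatures are stored, comparing two sets costs ℓ comparisons regardless of the set sizes.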
Min Hashing
In practice:
  pick some sufficiently large $\ell$
  use “shingles” instead of “words”; the choice depends on the application
  store for each S the compact “sketch/signature” $(\sigma^1_{\min}(S), \ldots, \sigma^\ell_{\min}(S))$
  do further optimizations for performance/space
See Chapter 3 of the book Mining of Massive Datasets by Leskovec, Rajaraman, and Ullman.
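To illustrate what using shingles instead of words might look like, here is a minimal sketch that turns a text into a set of k-word shingles. The function name `word_shingles`, the choice of word-level (rather than character-level) shingles, and k = 3 are assumptions for this example; the right choice depends on the application.

```python
def word_shingles(text: str, k: int = 3) -> set:
    """Return the set of all runs of k consecutive words in text, as tuples."""
    words = text.lower().split()
    # A text with fewer than k words yields an empty set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


a = word_shingles("the quick brown fox jumps over the lazy dog")
b = word_shingles("the quick brown fox leaps over the lazy dog")

# Changing one word affects every shingle that contains it.
print(len(a), len(b), len(a & b), len(a | b))
print(len(a & b) / len(a | b))
```

In this pair a single changed word alters three of the seven shingles in each document, so shingle-based similarity is more sensitive to local edits than word-based similarity.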
Random permutation?
A random permutation, like a random hash function, is complex:
  it cannot be stored compactly
  computing $\sigma_{\min}(S)$ is expensive
Need pseudorandom permutations that suffice.
Minwise Independent Permutations [Broder-Charikar-Frieze-Mitzenmacher]
Given n, $S_n$ is the set of all $n!$ permutations of $[n]$.
Want a family $\mathcal{F} \subseteq S_n$ of permutations such that picking a random $\sigma$ from $\mathcal{F}$ behaves like a random permutation (uniformly chosen from $S_n$).
Definition: A family $\mathcal{F} \subseteq S_n$ is a minwise independent family of permutations if for every $X \subseteq [n]$ and $a \in X$, for a $\sigma$ chosen uniformly from $\mathcal{F}$, $\Pr[\sigma_{\min}(X) = \sigma(a)] = \frac{1}{|X|}$.
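For very small n the definition can be checked by brute force. The sketch below tests a family against every subset X and every a ∈ X; the checker name `is_minwise_independent`, the 0-based universe, and the family of cyclic shifts used as a counterexample are assumptions introduced for illustration.

```python
from fractions import Fraction
from itertools import combinations, permutations


def is_minwise_independent(family, n: int) -> bool:
    """Check Pr[a attains the minimum of sigma over X] == 1/|X| for all non-empty X and a in X."""
    family = list(family)
    for size in range(1, n + 1):
        for X in combinations(range(n), size):
            for a in X:
                # Count permutations in the family under which a has the smallest image in X.
                hits = sum(1 for sigma in family
                           if min(X, key=lambda i: sigma[i]) == a)
                if Fraction(hits, len(family)) != Fraction(1, size):
                    return False
    return True


n = 4
all_perms = list(permutations(range(n)))  # the full symmetric group, n! permutations
cyclic_shifts = [tuple((i + k) % n for i in range(n)) for k in range(n)]

print(is_minwise_independent(all_perms, n))      # True
print(is_minwise_independent(cyclic_shifts, n))  # False
```

The cyclic shifts fail already for X = {0, 1}: element 0 has the smaller image under three of the four shifts, so the probability is 3/4 rather than 1/2.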
Minwise Independent Permutations
Exercise: Minwise independent permutations suffice for Jaccard similarity estimation.
Question: is there a small $\mathcal{F}$? It is not obvious that there is a non-trivial family.
  There exist minwise independent families of size $4^n$.
  Any minwise independent family must have size at least $e^{(1 - o(1))n}$.
Hence we need to relax the requirement further.
Minwise Independent Permutations
Two relaxations of the definition:
  $\epsilon$-approximate minwise independence: $\frac{1 - \epsilon}{|X|} \le \Pr[\sigma_{\min}(X) = \sigma(a)] \le \frac{1 + \epsilon}{|X|}$.
  The condition needs to hold only for sets X with $|X| \le k$ for some $k < n$. This is sufficient for applications where sets are much smaller than n.
Relaxation of Minwise Independence
Definition: A family $\mathcal{F} \subseteq S_n$ is an $(\epsilon, k)$ min-wise independent family if for all $X \subseteq [n]$ such that $|X| \le k$ and all $a \in X$, if $\sigma$ is chosen uniformly from $\mathcal{F}$, $\frac{1 - \epsilon}{|X|} \le \Pr[\sigma_{\min}(X) = \sigma(a)] \le \frac{1 + \epsilon}{|X|}$.
Minwise Independence and Hashing
Question: Is there a connection between minwise independent permutations and hashing?
Suppose $\mathcal{H}$ is a family of t-wise independent hash functions from $[n]$ to $[n]$. Let $h \in \mathcal{H}$. Why is h not a permutation? Because of collisions.
Suppose $h : [n] \to [m]$ where $m \gg n$; then h has a very low probability of collisions. Would h then behave like a minwise independent permutation?
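One way to explore the last question empirically is to draw many hash functions from a simple family and record how often each element of a fixed set X receives the smallest hash value, then compare against the target 1/|X|. The sketch below uses the linear family h(x) = (a·x + b) mod p with a large prime p; this family, the prime, the set X, and all function names are assumptions chosen for the experiment, and the output is an observation for one set, not a proof either way.

```python
import random
from collections import Counter

P = 2_147_483_647  # the Mersenne prime 2^31 - 1, playing the role of a range m >> n


def random_linear_hash(rng: random.Random):
    """Draw h(x) = (a*x + b) mod P with a != 0, so h is injective on {0, ..., P-1}."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P


def argmin_frequencies(X: list, trials: int, seed: int = 0) -> dict:
    """Empirical frequency with which each element of X gets the smallest hash value."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        h = random_linear_hash(rng)
        counts[min(X, key=h)] += 1
    return {a: counts[a] / trials for a in X}


X = [3, 17, 42, 99, 256]
freqs = argmin_frequencies(X, trials=20000)
for a in X:
    print(a, round(freqs[a], 3), "target", 1 / len(X))
```

Deviations from the target that persist as the number of trials grows would indicate that the family is only approximately minwise independent on this set.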