CS 498ABD: Algorithms for Big Data, Spring 2019
Similarity Estimation
Lecture 13
March 5, 2019
Chandra (UIUC) CS498ABD 1 Spring 2019 1 / 30
Similar Items
Modern data is often unstructured and high-dimensional. Examples: documents, web pages, reviews, images, audio, video.
Given a collection of objects from a data collection:
find all "similar" pairs of items (application: duplicate detection in documents)
for an item x, find all items in the collection similar to x (near-neighbor search, many applications)
Comparing two items is expensive. Comparing all pairs is infeasible.
How to measure similarity/dissimilarity?
Proxy functions for estimating/capturing similarity
Focus only on highly similar items rather than trying to find the similarity for all pairs
Compression/sketching/hashing to create compact representations of objects
Fast/approximate near-neighbor search via ideas such as locality-sensitive hashing, clustering, etc.
Jaccard similarity for sets and minhash
Angular distance and simhash
Locality-sensitive hashing
Motivation: How do we detect near-duplicate text documents? Web pages, papers, homeworks, . . .?
Model documents as (multi)sets of "words" or, more generally, "shingles":
a very large universe of words/shingles
each document is a set of words/shingles
a large number of documents, and each document is sparse in the space of words/shingles
Definition: Given two sets S, T, the Jaccard similarity between S and T is defined as |S ∩ T| / |S ∪ T| and is denoted by SIM(S, T).
Assumption: S, T are very similar if SIM(S, T) ≥ α for some fixed threshold α, say α = 0.7.
Question: Given many documents, how do we find the similar documents?
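The definition above translates directly into code. A minimal sketch in Python (the helper name `jaccard` and the sample documents are ours, not from the lecture):

```python
def jaccard(S, T):
    """Jaccard similarity |S intersect T| / |S union T| of two finite sets."""
    if not S and not T:
        return 1.0  # convention for two empty sets
    return len(S & T) / len(S | T)

# Two toy documents modeled as sets of words
doc1 = {"the", "quick", "brown", "fox"}
doc2 = {"the", "quick", "red", "fox"}
print(jaccard(doc1, doc2))  # 3 common words out of 5 total -> 0.6
```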
Let n be the size of the vocabulary. For a permutation σ of [n] and a set S, let σmin(S) = min{σ(i) | i ∈ S}.
Lemma: Let S, T be two subsets of [n] and suppose σ is a uniformly random permutation of [n]. Then
Pr[σmin(S) = σmin(T)] = |S ∩ T| / |S ∪ T|.
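The lemma can be checked empirically with a small simulation. The sketch below (our own code, with an arbitrary seed and parameters) draws many uniformly random permutations and compares the collision rate of σmin against the Jaccard similarity:

```python
import random

def sigma_min(perm, S):
    # perm[i] is sigma(i); sigma_min(S) = min over i in S of sigma(i)
    return min(perm[i] for i in S)

def collision_rate(S, T, n, trials=20000, seed=0):
    """Fraction of random permutations with sigma_min(S) == sigma_min(T)."""
    rng = random.Random(seed)
    perm = list(range(n))
    hits = 0
    for _ in range(trials):
        rng.shuffle(perm)  # a fresh uniformly random permutation of [n]
        hits += sigma_min(perm, S) == sigma_min(perm, T)
    return hits / trials

S, T = {0, 1, 2}, {1, 2, 3}
true_sim = len(S & T) / len(S | T)   # 2/4 = 0.5
est = collision_rate(S, T, n=50)
print(true_sim, round(est, 3))       # estimate concentrates near 0.5
```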
Similarity estimation via min-hashing:
Pick ℓ random permutations σ1, σ2, . . . , σℓ.
For each set S store the ℓ-tuple (σ^1_min(S), . . . , σ^ℓ_min(S)).
To check similarity between S and T, let s = |{i | σ^i_min(S) = σ^i_min(T)}| and output the estimator Z = s/ℓ.
Z is an unbiased estimator for SIM(S, T): by the lemma, E[Z] = SIM(S, T).
Exercise: Suppose SIM(S, T) ≥ α. How large should ℓ be so that Pr[Z < (1 − ε)α] < δ?
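Putting the pieces together, a minimal sketch of the min-hash estimator (function names and parameters are ours; it stores explicit permutations, so it is only practical for small n):

```python
import random

def minhash_sketch(S, perms):
    """The tuple (sigma^1_min(S), ..., sigma^l_min(S))."""
    return tuple(min(p[i] for i in S) for p in perms)

def estimate_sim(sketch_S, sketch_T):
    """Z = s / l: fraction of coordinates where the two sketches agree."""
    s = sum(a == b for a, b in zip(sketch_S, sketch_T))
    return s / len(sketch_S)

n, l = 1000, 200
rng = random.Random(1)
perms = []
for _ in range(l):
    p = list(range(n))
    rng.shuffle(p)
    perms.append(p)

S = set(range(0, 100))     # {0, ..., 99}
T = set(range(50, 150))    # {50, ..., 149}; SIM(S, T) = 50/150 = 1/3
Z = estimate_sim(minhash_sketch(S, perms), minhash_sketch(T, perms))
print(Z)  # concentrates around 1/3
```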
In practice:
Pick some sufficiently large ℓ.
Use "shingles" instead of "words": depends on the application.
Store for each S the compact sketch/signature (σ^1_min(S), . . . , σ^ℓ_min(S)).
Do further optimizations for performance/space.
See Chapter 3 of the Mining of Massive Datasets book by Leskovec, Rajaraman, and Ullman.
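As an illustration of shingling (one of many possible choices; the function name and parameters are ours), here is a sketch using character k-shingles. In practice the shingle length, and whether to shingle over characters or words, depends on the application:

```python
def shingles(text, k=4):
    """Set of character k-shingles (k-grams) of a document."""
    s = " ".join(text.split())  # normalize whitespace
    return {s[i:i + k] for i in range(len(s) - k + 1)}

a = shingles("the quick brown fox")
b = shingles("the quick red fox")
# Jaccard similarity of the shingle sets
print(len(a & b) / len(a | b))
```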
A truly random permutation, like a truly random hash function, is complex:
cannot be stored compactly
computing σmin(S) is expensive
We need pseudorandom permutations that suffice.
Minwise independent permutations [Broder-Charikar-Frieze-Mitzenmacher]:
Given n, Sn is the set of all n! permutations of [n]. We want a family F ⊆ Sn of permutations such that picking a random σ from F behaves like a random permutation (uniformly chosen from Sn).
Definition: A family F ⊆ Sn is a minwise independent family of permutations if for every X ⊆ [n] and every a ∈ X, for σ chosen uniformly from F, Pr[σmin(X) = a] = 1/|X|.
Exercise: Minwise independent permutations suffice for Jaccard similarity estimation.
Question: Is there a small such family F? It is not obvious that a non-trivial family exists.
There exist minwise independent families of size 4^n. Any minwise independent family must have size at least e^((1−o(1))n). Hence we need to relax the requirement further.
Two relaxations:
ε-approximate minwise independence: (1 − ε)/|X| ≤ Pr[σmin(X) = a] ≤ (1 + ε)/|X|.
Require the condition to hold only for sets X with |X| ≤ k for some k < n. This is sufficient for applications where the sets are much smaller than n.
Definition: A family F ⊆ Sn is an (ε, k)-minwise independent family if for all X ⊆ [n] with |X| ≤ k and all a ∈ X, when σ is chosen uniformly from F,
(1 − ε)/|X| ≤ Pr[σmin(X) = a] ≤ (1 + ε)/|X|.
Question: Is there a connection between minwise independent permutations and hashing?
Suppose H is a family of t-wise independent hash functions from [n] to [n], and let h ∈ H. Why is h not a permutation? Because of collisions.
Suppose instead h : [n] → [m] where m ≫ n; then h has a very low probability of collisions. Would such an h behave like a minwise independent permutation?
Theorem: Let H be a t-wise independent family of hash functions from [n] to [n] where t = Ω(log(1/ε)). Then H is an (ε, k)-minwise independent family of permutations for k = Ω(εn).
Thus hash functions from [n] to [n] effectively suffice for minwise independence and can be used in min-hashing.
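A standard way to obtain a t-wise independent family is a random degree-(t − 1) polynomial over a prime field. The sketch below (our construction, with arbitrary parameters t = 8 and ℓ = 200 rather than the bounds from the theorem) uses such hash functions in place of explicit permutations for min-hashing:

```python
import random

P = 2**61 - 1  # a Mersenne prime, far larger than the universe size

def make_twise_hash(t, rng):
    """h(x) = (c_0 x^(t-1) + ... + c_{t-2} x + c_{t-1}) mod P.
    A uniformly random degree-(t-1) polynomial is t-wise independent."""
    coeffs = [rng.randrange(P) for _ in range(t)]
    def h(x):
        v = 0
        for c in coeffs:        # Horner's rule
            v = (v * x + c) % P
        return v
    return h

def minhash(S, h):
    return min(h(i) for i in S)

rng = random.Random(7)
hs = [make_twise_hash(8, rng) for _ in range(200)]
S, T = set(range(100)), set(range(50, 150))   # SIM(S, T) = 1/3
Z = sum(minhash(S, h) == minhash(T, h) for h in hs) / len(hs)
print(Z)  # close to 1/3; collisions are negligible since P is huge
```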
Do you see a connection between minwise independent permutations/hashing and distinct-element sampling?
Exercise: How would you use minwise independent permutations to sample near-uniformly from the set of distinct elements in a stream?
Angular similarity:
Given a collection of vectors v1, v2, . . . , vn in R^d representing some data objects. Two vectors u, v are "similar" if they point roughly in the same direction.
Define dist(u, v) = θ(u, v)/π where θ(u, v) is the angle between the vectors u and v. Assuming wlog that u, v are unit vectors, we have u · v = cos(θ(u, v)). Similarity is 1 − dist(u, v).
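The distance just defined is essentially a one-liner. A small sketch (the helper name is ours), with a clamp to guard acos against floating-point drift:

```python
import math

def angular_dist(u, v):
    """dist(u, v) = theta(u, v) / pi, for unit vectors u, v."""
    dot = sum(a * b for a, b in zip(u, v))
    dot = max(-1.0, min(1.0, dot))  # clamp rounding error before acos
    return math.acos(dot) / math.pi

u = (1.0, 0.0)
v = (0.0, 1.0)
print(angular_dist(u, v))      # orthogonal vectors: theta = pi/2 -> 0.5
print(1 - angular_dist(u, v))  # the corresponding similarity, 0.5
```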
SimHash [Charikar], as a special case of a connection between rounding algorithms and hashing:
Pick a random hyperplane through the origin, given by a random unit normal vector r.
For each vi set hr(vi) = sign(r · vi).
Lemma: Pr[hr(vi) ≠ hr(vj)] = θ(vi, vj)/π.
Using several random hyperplanes r1, r2, . . . , rℓ we create a compact hash value/sketch for angular similarity.
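A quick simulation of SimHash (our code and parameters; sampling i.i.d. Gaussian coordinates gives a uniformly random direction) estimating θ/π for two orthogonal vectors:

```python
import math
import random

def random_unit_vector(d, rng):
    # i.i.d. Gaussian components yield a uniformly random direction
    v = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def simhash_bit(r, v):
    return sum(a * b for a, b in zip(r, v)) >= 0  # sign(r . v)

rng = random.Random(3)
d, l = 16, 2000
rs = [random_unit_vector(d, rng) for _ in range(l)]

u = [1.0] + [0.0] * (d - 1)
w = [0.0, 1.0] + [0.0] * (d - 2)   # orthogonal to u: theta = pi/2
disagree = sum(simhash_bit(r, u) != simhash_bit(r, w) for r in rs) / l
print(disagree)  # estimates theta/pi = 0.5
```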
For Jaccard similarity and angular similarity we had the property that there is a family of hash functions H such that for h chosen randomly from H,
Pr[h(A) = h(B)] = sim(A, B).
Question: When is the above true in general?
Theorem: If there is a hash family for a similarity measure sim(·, ·) with the preceding property, then d(·, ·) = 1 − sim(·, ·) is a metric, and further d is embeddable in generalized Hamming distance.
Different objects and applications drive different similarity measures. A large similarity between x and y means that x and y are much alike. Another common way is to use distances, where a small distance means higher similarity.
Jaccard similarity measure for sets
Cosine of the angle between vectors
Distance measures: norm-based measures ‖x − y‖_p, say p = 1, 2, . . .; Hamming distance between vectors; edit distance between strings
Distance measures between probability distributions: earth-mover distance, KL divergence/relative entropy (not symmetric), . . .
Collection of data items/objects D. We saw ways to compress objects to speed up similarity estimation between objects. Still, two problems remain:
find all highly similar pairs: we cannot afford quadratic time even with compressed hashes
given a new point x, find all points "similar" to x in D: linear search is not feasible
Near-neighbor search:
Collection of data items/objects D. Preprocess D using small space so that given a query x, we can output all y ∈ D with high similarity to x (or small distance to x).
A fundamental data structure problem with many applications.
Classical (exact) solution approaches from geometry: Voronoi diagrams, k-d trees, space partition/filling approaches. Major drawback: the curse of dimensionality for exact search.
Modern/recent approaches: approximate NN search via locality-sensitive hashing (LSH), randomized k-d trees, etc.
Locality-sensitive hashing: initially developed for NN search in high-dimensional Euclidean space and then generalized to other similarity/distance measures.
High-level ideas:
a collection of n objects p1, p2, . . . , pn in some space, with some distance/similarity measure d on pairs of objects
create a hash function family H with a "locality preserving" property: each hash function h maps points that are similar to each other (or closer in distance) to the same bucket with higher probability than points that are not so similar
use multiple independent hash functions to create a data structure
the hashing family depends on the similarity/distance measure
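To make the high-level idea concrete for min-hash signatures, here is a sketch of the standard banding trick (the names `band_buckets`, `b`, `r` and the toy signatures are ours): split each signature into b bands of r rows and report as candidates the pairs that agree on an entire band. A pair whose signatures agree in a q fraction of coordinates collides in a given band with probability roughly q^r, so highly similar pairs become candidates with much higher probability than dissimilar ones.

```python
import random
from collections import defaultdict

def band_buckets(signatures, b, r):
    """Split each length-(b*r) minhash signature into b bands of r rows;
    candidate pairs are those agreeing on all r rows of some band."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].add(name)
    candidates = set()
    for group in buckets.values():
        for x in group:
            for y in group:
                if x < y:
                    candidates.add((x, y))
    return candidates

# Toy signatures: A and B nearly identical, C unrelated
rng = random.Random(5)
sigA = [rng.randrange(1000) for _ in range(20)]
sigB = list(sigA)
sigB[3] = 999                   # differs from A in one coordinate
sigC = [rng.randrange(1000) for _ in range(20)]
print(band_buckets({"A": sigA, "B": sigB, "C": sigC}, b=5, r=4))
```

Only the pair (A, B) survives: they agree on four full bands, while C almost surely shares no full band with either.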