Efficient Algorithms for Streaming Datasets with Near-Duplicates
Qin Zhang, Indiana University Bloomington
Based on work with: Djamal Belazzougui (CERIST), Di Chen (HKUST), Jiecao Chen (IUB), Haoyu Zhang (IUB)
Theory and Applications of Hashing, May 4, 2017
Disclaimer
Not really a survey talk; the results are all very recent, and the solutions may be quite preliminary.
Agenda
1. Background and motivation
2. Distinct elements on data with near-duplicates
3. Similarity join under edit distance
Model of computation
The streaming model:
– high-speed online data
– want space/time-efficient algorithms
[Figure: a stream 1 7 9 1 7 3 2 ... processed by a CPU with a small RAM]
E.g., what is the number of distinct elements?
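As a concrete illustration of the streaming setting (our illustration, not code from the talk), here is a minimal bottom-k (KMV) estimator for the ordinary distinct-elements problem; the sketch size k and the hash construction are arbitrary choices.

```python
import bisect
import hashlib

def h(x):
    """Hash an item to a pseudo-random value in [0, 1)."""
    b = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
    return int.from_bytes(b, "big") / 2.0**64

def f0_estimate(stream, k=64):
    """Bottom-k (KMV) estimator: keep the k smallest distinct hash values
    seen so far; if the k-th smallest is v, roughly (k - 1) / v distinct
    elements have appeared. Space is O(k), independent of the stream length."""
    smallest = []                        # sorted list of <= k smallest hashes
    for x in stream:
        v = h(x)
        if v not in smallest and (len(smallest) < k or v < smallest[-1]):
            bisect.insort(smallest, v)
            del smallest[k:]             # keep only the k smallest
    if len(smallest) < k:                # fewer than k distinct items: exact
        return len(smallest)
    return (k - 1) / smallest[-1]

print(f0_estimate([1, 7, 9, 1, 7, 3, 2], k=4))   # the stream from the figure
```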
Linear sketches
Problem: given a data vector x ∈ R^d, compute f(x).
Can do this using linear sketches: apply a linear mapping M (which sometimes embeds a hash function) to obtain the sketching vector Mx, then recover g(Mx) ≈ f(x).
Simple and useful: used extensively in streaming/distributed algorithms, compressive sensing, ...
Linear sketches in the streaming model
View each incoming element i as an update x ← x + e_i.
The sketching vector can be updated incrementally: M(x + e_i) = Mx + M e_i.
Space = size of the sketch Mx; time per update ≤ space (usually).
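A minimal sketch of this update rule (our illustration, not the talk's code): an AMS-style ±1 matrix whose entries are generated on the fly from a hash function, so only the k-entry vector Mx is ever stored; k = 50 is an arbitrary choice.

```python
import hashlib

k = 50                                   # number of rows of M (sketch size)
Mx = [0] * k                             # sketch of the current x (x starts at 0)

def m(r, i):
    """Entry M[r][i] in {-1, +1}, derived from a hash so that the matrix M
    never has to be stored explicitly (it 'embeds a hash function')."""
    b = hashlib.blake2b(f"{r},{i}".encode(), digest_size=1).digest()
    return 1 if b[0] & 1 else -1

def update(i):
    """Element i arrives: x <- x + e_i, so by linearity
    Mx <- Mx + M e_i, i.e. add column i of M to the sketch."""
    for r in range(k):
        Mx[r] += m(r, i)

for i in [1, 7, 9, 1, 7, 3, 2]:          # the stream from the figure
    update(i)

# with +/-1 entries, the mean of Mx[r]**2 estimates F2 = ||x||^2 (AMS-style)
print(sum(v * v for v in Mx) / k)
```

Other choices of M recover other functions f; the point here is only the linearity of the update.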
Real-world data is often noisy
Music, images, videos, ... after compression, resizing, photoshopping, etc.
Queries of the same meaning sent to Google:
“theory and applications of hashing”, “theory application of hash”, “dagstuhl hashing”, “dagstuhl seminar hash”
Robust streaming algorithms
We have to consider near-duplicates as one element. Then how do we compute f(x)?
Linear sketches do not work
Why? Items representing the same entity may be hashed into different coordinates of the sketching vector.
Magic hash functions?
Does there exist a magic hash function that (1) maps only items representing the same element into the same bucket, and (2) can be described succinctly?
Answer: (in general) no. Some hash functions may help (will discuss later).
History and the New Question
Related to entity resolution: identify and group different manifestations of the same real-world object. A key problem in data cleaning/integration, studied for 40+ years in DB, also in AI, NT.
Previous solutions use at least linear space: they detect items representing the same entity and output all distinct entities.
Question: can we analyze data with near-duplicates in the streaming model space/time efficiently?
Distinct Elements
• Data: points in a metric space
• Problem: compute the number of robust distinct elements (robust F0)
  (useful in traffic monitoring, query optimization, ...)
Robust F0: given a threshold α, partition the input item set S into a set of groups G = {G1, ..., Gn} of minimum cardinality so that ∀p, q ∈ Gi, d(p, q) ≤ α.
– Chen, Z., SIGMOD 2016 (will discuss today)
– Chen, Z., ???? (extends to sliding windows and ℓ0-sampling)
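To make the definition concrete, here is an exact robust F0 computation for the special case of points on the real line, where a greedy left-to-right scan yields a minimum-cardinality partition (our illustration; the general metric case is harder).

```python
def robust_f0_1d(points, alpha):
    """Exact robust F0 for points on the real line: after sorting, start a
    new group whenever the next point is more than alpha from the first
    point of the current group (classic greedy interval grouping)."""
    groups, group_start = 0, None
    for x in sorted(points):
        if group_start is None or x - group_start > alpha:
            groups += 1
            group_start = x
    return groups

# two clusters of near-duplicates -> robust F0 = 2
print(robust_f0_1d([1.00, 1.05, 5.00, 5.04], alpha=0.1))
```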
Well-shaped dataset
(α, β)-sparse dataset: pairs of items in the same group have distance at most α; pairs of items in different groups have distance at least β.
If the separation ratio β/α > 2, we call the dataset well-shaped. A natural partition exists for a well-shaped dataset (see the sketch below); general datasets are discussed later.
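A small illustration of the natural partition (our sketch, quadratic time, offline): when β > α (in particular when β/α > 2), the groups are exactly the connected components of the graph joining pairs at distance at most α.

```python
import math

def natural_partition(points, alpha):
    """Groups of a well-shaped dataset = connected components of the graph
    joining pairs at distance <= alpha (union-find over all pairs)."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= alpha:
                parent[find(i)] = find(j)
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())

# two groups of near-duplicate 2D points; separation > 2 * alpha
print(natural_partition([(0, 0), (0.1, 0), (3, 3), (3, 3.1)], alpha=0.2))
```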
Algorithm for well-shaped (α, β)-sparse datasets (β > 2α) in 2D
Impose a random grid G of side length α/2.
[Figure: groups G1, G2, G3 overlaid on the randomly shifted grid]
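A randomly shifted grid is easy to implement; a minimal 2D sketch (variable names are ours):

```python
import math
import random

ALPHA = 1.0                               # duplicate threshold (assumed value)
SIDE = ALPHA / 2                          # grid side length, as on the slide
SHIFT = (random.uniform(0, SIDE), random.uniform(0, SIDE))  # random offset

def cell(p):
    """Id of the randomly shifted grid cell containing the 2D point p."""
    return (math.floor((p[0] + SHIFT[0]) / SIDE),
            math.floor((p[1] + SHIFT[1]) / SIDE))
```

Note that with side α/2 a cell has diameter α/√2 < 2α < β, so each cell intersects at most one group; this is why the group G_C in the next slide is well-defined.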
Simple sampling (needs two passes)
Algorithm Simple Sampling:
1. Sample η ∈ Õ(1/ε²) non-empty cells C.
2. Use another pass to compute, for each sampled cell C, the weight w(C) = 1/w(G_C), where G_C is the (only) group intersecting C and w(G_C) is the number of cells G_C intersects.
3. Output (z/η) · Σ_{C∈C} w(C), where z is the number of non-empty cells in G.
Gives a (1 + ε)-approximation of robust F0 using Õ(1/ε²) bits of space and 2 passes.
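A direct, illustrative implementation of the two passes (ours, assuming the dataset can be iterated twice and using a cell function like the one above); one witness point per cell stands in for the first pass's bookkeeping.

```python
import math
import random

def simple_sampling(points, cell, alpha, eta):
    """Two-pass simple-sampling estimate of robust F0 on well-shaped data.
    Pass 1: record one witness point per non-empty cell; sample eta cells.
    Pass 2: for each sampled cell C, collect the cells touched by points
    within alpha of C's witness -- for well-shaped data these are exactly
    the cells that the group G_C intersects, so w(C) = 1/len(touched[C])."""
    witness = {}
    for p in points:                              # pass 1
        witness.setdefault(cell(p), p)
    z = len(witness)                              # number of non-empty cells
    sampled = random.sample(list(witness), min(eta, z))
    touched = {C: set() for C in sampled}
    for p in points:                              # pass 2
        for C in sampled:
            if math.dist(p, witness[C]) <= alpha:  # p belongs to G_C
                touched[C].add(cell(p))
    return (z / len(sampled)) * sum(1.0 / len(touched[C]) for C in sampled)

# usage sketch: print(simple_sampling(pts, cell, ALPHA, eta=100))
```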
Bucket sampling
• Cannot sample cells early: most sampled cells will be empty and thus useless for the estimation.
• Cannot sample late: we cannot then obtain the “neighborhood” information needed to compute w(C) for a sampled cell C.
What to do? We sample a collection of cells implicitly, but only maintain the neighborhood information for “non-empty” sampled cells. The collection is maintained via a hash function h (we keep all cells C with h(C) = 1), and h itself is adjusted so that |{C | h(C) = 1 ∧ ∃p ∈ S, d(p, C) ≤ α}| = O(1/ε²).
Bucket sampling (cont.)
For each sampled cell, store one point of each non-empty neighboring cell; these witnesses are used to compute the weight of the sampled cell.
[Figure: a sampled cell among groups G1, G2, G3, with its stored neighbor points]
For a well-shaped dataset, this gives a (1 + ε)-approximation of robust F0 using Õ(1/ε²) bits of space and Õ(1) time per item.
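The following one-pass sketch is our reconstruction of the downsampling idea in 2D (names such as BucketSampler and the capacity parameter are ours, not the paper's): a cell is implicitly sampled iff its hash falls below 2^(-level), and the level is raised whenever too many sampled cells have a point within α; since the sample set only ever shrinks, surviving cells have collected complete neighborhood information.

```python
import hashlib
from collections import defaultdict

def h01(C, salt=0):
    """Pseudo-random hash of a cell id into [0, 1)."""
    b = hashlib.blake2b(repr((salt, C)).encode(), digest_size=8).digest()
    return int.from_bytes(b, "big") / 2.0**64

class BucketSampler:
    """One-pass bucket sampling over a 2D grid of side alpha/2.
    A cell C is implicitly sampled iff h01(C) < 2**-level; for each sampled
    cell we record the cells that contain a point within distance alpha of it."""

    def __init__(self, alpha, shift, cap, salt=0):
        self.alpha, self.side, self.shift = alpha, alpha / 2, shift
        self.cap, self.salt, self.level = cap, salt, 0
        self.nbrs = defaultdict(set)    # sampled cell -> cells near it with a point

    def _cell(self, p):
        return (int((p[0] + self.shift[0]) // self.side),
                int((p[1] + self.shift[1]) // self.side))

    def _dist_to_cell(self, p, C):
        d2 = 0.0
        for i in range(2):              # distance from p to the cell's box
            lo = C[i] * self.side - self.shift[i]
            x = min(max(p[i], lo), lo + self.side)
            d2 += (p[i] - x) ** 2
        return d2 ** 0.5

    def add(self, p):
        cp = self._cell(p)
        # only cells within 3 grid steps can be within alpha = 2 * side of p
        for dx in range(-3, 4):
            for dy in range(-3, 4):
                C = (cp[0] + dx, cp[1] + dy)
                if (h01(C, self.salt) < 2.0 ** -self.level
                        and self._dist_to_cell(p, C) <= self.alpha):
                    self.nbrs[C].add(cp)
        while len(self.nbrs) > self.cap:        # too many active sampled cells:
            self.level += 1                     # halve the sampling probability
            for C in [c for c in self.nbrs
                      if h01(c, self.salt) >= 2.0 ** -self.level]:
                del self.nbrs[C]

    def estimate(self):
        """A non-empty sampled cell C appears in its own neighbor set, and
        nbrs[C] is then exactly the set of cells its group intersects, so
        its weight is 1/len(nbrs[C]); rescale by the sampling probability."""
        s = sum(1.0 / len(nb) for C, nb in self.nbrs.items() if C in nb)
        return s * 2 ** self.level

# usage sketch: bs = BucketSampler(alpha=1.0, shift=(0.3, 0.7), cap=200)
#               for p in stream: bs.add(p)
#               print(bs.estimate())
```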
General datasets
For general datasets, we introduce F0-ambiguity: the F0-ambiguity of S is the minimum δ such that there exists T ⊆ S with
• S \ T well-shaped, and
• F0(S \ T) ≥ (1 − δ) F0(S).
Unfortunately, approximating δ is hard: we cannot even distinguish δ = 0 from δ = 1/2 without Ω(m) space, by a reduction from the Diameter problem. However, we can still guarantee the following, even without knowing the value of δ: for a dataset with F0-ambiguity δ, we can get a (1 + O(ε + δ))-approximation of robust F0 using Õ(1/ε²) bits.