Fast Evaluation of Union-Intersection Expressions
Philip Bille, Anna Pagh, Rasmus Pagh
IT University of Copenhagen
Data Structures for Intersection Queries
• Preprocess a collection of sets $S_1, \ldots, S_m$ independently into a representation that supports intersection queries of the form $S_i \cap S_j$, $1 \le i, j \le m$.
• Application: Boolean AND-queries in search engines.
  • For each word, store the set of documents containing the word.
  • To search for documents that contain words $x$ and $y$, compute the intersection of the corresponding document sets.
• Generalizes to arbitrary expressions over the set collection involving intersection, union, and difference.
Previous Comparison-Based Results
• Query: $S_1 \cap S_2$
• Classical solution:
  • Represent sets as sorted lists.
  • Query by merging and reporting duplicates: $O(|S_1| + |S_2|)$ time.
• Special cases with faster solutions:
  • When $|S_1| \ll |S_2|$: $O(|S_1| \log(1 + |S_2|/|S_1|))$ time [HL1972].
  • When $S_1 \cap S_2$ consists of few sublists from $S_1$ and $S_2$ (adaptive algorithms) [DLM2000, BK2002].
• Generalizations to more complicated expressions involving intersections and unions [CFM2005].
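The classical merge-based query can be sketched as follows. This is a minimal illustration assuming each set is stored as a sorted Python list of distinct values; the function name `intersect_sorted` is hypothetical.

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists of distinct values with one linear merge pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Value present in both lists: report it and advance both cursors.
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


print(intersect_sorted([1, 2, 3, 4, 5, 7, 8, 10, 12, 13], [1, 4, 6, 8, 11, 12]))
```

Each loop iteration advances at least one cursor, so the number of comparisons is at most $|S_1| + |S_2|$, even when the intersection is empty.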
Previous Non-Comparison Based Results
• Fast solution when $|S_1| \ll |S_2|$:
  • Build a hashing-based dictionary for each set.
  • Look up the elements of $S_1$ in the dictionary for $S_2$: $O(|S_1|)$ time.
• For very small universes:
  • Represent sets as bitstrings.
  • Compute intersections as a bitwise-AND.
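Both approaches can be sketched in a few lines. This is an illustration only: Python's built-in `set` stands in for the hashing-based dictionary, a Python integer stands in for the bitstring, and all function names are hypothetical.

```python
def intersect_by_lookup(small, large_dict):
    """Probe each element of the smaller set in a hash-based dictionary of the larger set."""
    return [x for x in small if x in large_dict]


def to_bitstring(s):
    """Encode a set of small non-negative integers as a bitmask (bit x set iff x is in s)."""
    mask = 0
    for x in s:
        mask |= 1 << x
    return mask


def from_bitstring(mask):
    """Decode a bitmask back into a sorted list of elements."""
    out = []
    x = 0
    while mask:
        if mask & 1:
            out.append(x)
        mask >>= 1
        x += 1
    return out


print(intersect_by_lookup([4, 9], {1, 4, 6, 8, 11, 12}))
print(from_bitstring(to_bitstring({1, 4, 6}) & to_bitstring({4, 6, 7})))
```

The lookup variant does work proportional to the smaller set, while the bitstring variant does work proportional to the universe size, which is why it only suits very small universes.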
Our Results
• Theorem: There is a non-comparison based linear space representation supporting intersection queries $S_1 \cap S_2$ in expected time
  $O\left(\frac{(|S_1| + |S_2|) \log^2 w}{w} + occ\right)$
• Output-sensitive algorithm.
• For $occ < (|S_1| + |S_2|)/w$ the algorithm runs in sublinear time.
• All previously known solutions use worst-case linear time even if the intersection is empty.
• We show how to generalize the result to arbitrary union-intersection expressions.
• We give a communication complexity lower bound proving that the result is near optimal.
Approximate Set Representation
[Figure: elements $x_1, x_2, x_3$ of $S$ mapped to the hash values $h(x_1), h(x_2), h(x_3)$ in $h(S)$]
• Represent a set $S \subseteq \{0, 1\}^w$ as the set of hash function values $h(S)$.
• $h(S)$ is an approximate set representation:
  • If $x \in S$ then $h(x) \in h(S)$.
  • If $x \notin S$ then $h(x) \notin h(S)$ with probability close to 1.
Computing Intersections
1. Compute the intersection of the approximate representations, $H = h(S_1) \cap h(S_2)$.
  • We do this in $o(|S_1| + |S_2|)$ time.
2. Compute $S_1' = \{x \in S_1 \mid h(x) \in H\}$ and $S_2' = \{x \in S_2 \mid h(x) \in H\}$.
  • With a hash table that allows us to look up a value $h(x)$ and retrieve all elements with this value, this takes $O(|S_1'| + |S_2'|)$ time.
3. Compute and return $S_1' \cap S_2'$.
• Idea: If the hash function is suitably chosen, the number of elements to be checked in step 2 is small.
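The three steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the construction from the slides: the hash `h` is a multiply-shift hash with an arbitrary odd multiplier `A` and an assumed word size `W = 64`, and step 1 uses an ordinary set intersection, which does not achieve the $o(|S_1| + |S_2|)$ bound. The names `preprocess` and `intersect` are hypothetical.

```python
from collections import defaultdict

W = 64                     # assumed word size
A = 0x9E3779B97F4A7C15     # arbitrary odd multiplier (an assumption)


def h(x, r):
    """Multiply-shift hash of a non-negative integer x to r bits (a stand-in hash)."""
    return ((A * x) & ((1 << W) - 1)) >> (W - r)


def preprocess(s, r):
    """Build a table mapping each hash value to the elements of s with that value."""
    table = defaultdict(list)
    for x in s:
        table[h(x, r)].append(x)
    return table


def intersect(t1, t2):
    # Step 1: intersect the approximate representations h(S1) and h(S2).
    H = t1.keys() & t2.keys()
    # Step 2: retrieve the surviving elements S1' and S2' via the hash tables.
    s1_prime = [x for v in H for x in t1[v]]
    s2_prime = {x for v in H for x in t2[v]}
    # Step 3: intersect the survivors exactly.
    return sorted(x for x in s1_prime if x in s2_prime)


t1 = preprocess(range(0, 1000, 3), r=16)
t2 = preprocess(range(0, 1000, 5), r=16)
print(intersect(t1, t2)[:5])
```

Because equal elements always hash to equal values, no common element is lost in step 1; elements that differ but collide are filtered out by the exact check in step 3.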
Choosing Hash Functions
• The number of bits used for the hash values should be:
  • Small enough so that $H = h(S_1) \cap h(S_2)$ can be computed quickly.
  • Large enough to get a significant reduction in the number of remaining elements in $S_1' = \{x \in S_1 \mid h(x) \in H\}$ and $S_2' = \{x \in S_2 \mid h(x) \in H\}$, so that $S_1' \cap S_2'$ can be computed quickly.
• The optimal range of the hash function depends on the size of the input sets.
• We store $S$ at “multiple resolutions” using hash functions with different ranges.
[Figure: bucketed set; labels: $2^b$ buckets, $r - b$ bits per element, $w$ bits per word]
• We store a set of $r$-bit hash values $h_r(S)$ as a bucketed set for parameter $b$:
  • Elements with the same $b$ most significant bits are stored in the same bucket.
  • Elements in the same bucket are represented by their $r - b$ least significant bits as a sorted packed array.
• We choose $b$ to minimize total space.
• We can store a sufficient set of resolutions of $S$ in total linear space.
• $r - b = O(\log w)$.
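A bucketed set can be sketched as follows. This is an illustration only: each bucket is a plain sorted Python list rather than a packed array, and the helper `pack` merely shows how several $(r - b)$-bit fields could share one machine word. The names `build_bucketed` and `pack` are hypothetical.

```python
def build_bucketed(values, r, b):
    """Split r-bit values into 2**b buckets keyed by their b most significant bits.

    Each bucket keeps only the r - b least significant bits, in sorted order.
    """
    low_bits = r - b
    buckets = [[] for _ in range(1 << b)]
    # Visiting values in increasing order keeps every bucket sorted.
    for v in sorted(set(values)):
        buckets[v >> low_bits].append(v & ((1 << low_bits) - 1))
    return buckets


def pack(fields, width):
    """Pack small integers into one integer, `width` bits each, first field lowest."""
    word = 0
    for i, f in enumerate(fields):
        word |= f << (i * width)
    return word


buckets = build_bucketed([0b1011, 0b1001, 0b0010, 0b1110], r=4, b=2)
print(buckets)
print(pack(buckets[2], width=2))
```

With $r - b = O(\log w)$ bits per element, on the order of $w / \log w$ elements fit in a single $w$-bit word, which is what makes word-level operations on the packed arrays pay off.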
[Figure: bucketed set; labels: $2^b$ buckets, $r - b$ bits per element, $w$ bits per word]
• Intersection algorithm for bucketed sets:
  • Convert buckets to have a common (suitably chosen) parameter $b$.
    • Create a new array of size $2^b$.
    • Repartition packed arrays among the new buckets.
    • Modify the number of bits in the packed array representation.
  • Compute the intersection among each of the sorted packed arrays.
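The conversion to a common parameter followed by a bucket-by-bucket intersection can be sketched as follows. This is an element-at-a-time illustration, assuming buckets are plain sorted lists and using a Python `set` inside each bucket instead of the word-level operations on packed arrays; the names `rebucket` and `intersect_bucketed` are hypothetical.

```python
def rebucket(buckets, r, b_old, b_new):
    """Convert a bucketed set of r-bit values from parameter b_old to b_new."""
    low_old, low_new = r - b_old, r - b_new
    out = [[] for _ in range(1 << b_new)]   # new array of size 2**b_new
    for i, bucket in enumerate(buckets):
        for low in bucket:
            v = (i << low_old) | low        # reassemble the full r-bit value
            # Re-split with the new number of low-order bits.
            out[v >> low_new].append(v & ((1 << low_new) - 1))
    return out


def intersect_bucketed(x, b_x, y, b_y, r, b):
    """Return the r-bit values common to two bucketed sets, using common parameter b."""
    xs = rebucket(x, r, b_x, b)
    ys = rebucket(y, r, b_y, b)
    low = r - b
    common = []
    for i, (p, q) in enumerate(zip(xs, ys)):
        q_set = set(q)
        common.extend((i << low) | v for v in p if v in q_set)
    return common


x = [[2], [], [1, 3], [2]]   # values {2, 9, 11, 14} with r = 4, b = 2
y = [[2], [3, 4]]            # values {2, 11, 12} with r = 4, b = 1
print(intersect_bucketed(x, 2, y, 1, r=4, b=3))
```

Since buckets are visited in index order and each is sorted, the reassembled values come out in increasing order, so the repartitioned buckets stay sorted without an extra sort.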
[Figure: intersecting two sorted packed arrays in three steps]
  Input arrays:          1 2 3 4 5 7 8 10 12 13   and   1 4 6 8 11 12
  merge:                 1 1 2 3 4 4 5 6 7 8 8 10 11 12 12 13
  keep duplicate values: 1 4 8 12 (not yet contiguous)
  compact:               1 4 8 12
• Lemma [AH1992, ATNR1995]: All of the above operations can be computed in $O(\log w)$ time per word in the packed arrays.
• Total time: $O\left(\frac{(|S_1| + |S_2|) \log w}{w}\right) \cdot \log w$
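The merge, keep-duplicates, and compact steps can be traced on the example values with a short sketch. It works on ordinary Python lists one element at a time, so it illustrates only the logic of the three steps, not the $O(\log w)$ per-word packed-array operations of the lemma; the function name `merge_keep_compact` is hypothetical.

```python
import heapq


def merge_keep_compact(a, b):
    """Intersect two sorted lists of distinct values via merge, filter, and compaction."""
    # Merge: interleave both sorted inputs into one sorted sequence.
    merged = list(heapq.merge(a, b))
    # Keep duplicate values: a value survives only where it equals its predecessor;
    # every other position becomes a gap (None).
    kept = [v if i > 0 and v == merged[i - 1] else None
            for i, v in enumerate(merged)]
    # Compact: close the gaps.
    compacted = [v for v in kept if v is not None]
    return merged, kept, compacted


merged, kept, compacted = merge_keep_compact(
    [1, 2, 3, 4, 5, 7, 8, 10, 12, 13], [1, 4, 6, 8, 11, 12])
print(merged)
print(compacted)
```

Because each input holds distinct values, a value appears twice in the merged sequence exactly when it belongs to both inputs, so the surviving duplicates are the intersection.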
Our Results
• Theorem: There is a non-comparison based linear space representation supporting intersection queries $S_1 \cap S_2$ in expected time
  $O\left(\frac{(|S_1| + |S_2|) \log^2 w}{w} + occ\right)$
• In the paper:
  • Generalization to arbitrary union-intersection expressions
  • Lower bound
• Open Problem:
  • Can we extend this to set difference?