Fast Evaluation of Union-Intersection Expressions
Philip Bille, Anna Pagh, Rasmus Pagh
IT University of Copenhagen

Data Structures for Intersection Queries
• Preprocess a collection of sets $S_1, \ldots, S_m$ independently into a representation that supports intersection queries of the form $S_i \cap S_j$, $1 \le i, j \le m$.
• Application: Boolean AND-queries in search engines.
  • For each word, store the set of documents containing the word.
  • To search for documents that contain words x and y, compute the intersection of the corresponding document sets.
• Generalizes to arbitrary expressions over the set collection involving intersection, union, and difference.
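The search-engine application can be sketched as follows. This is a minimal illustration, assuming a toy in-memory index; the sample documents and the names `build_index` and `and_query` are hypothetical and not from the talk.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def and_query(index, x, y):
    """Boolean AND-query: ids of documents that contain both x and y."""
    return index.get(x, set()) & index.get(y, set())

# Hypothetical sample data, for illustration only.
docs = {
    1: "fast set intersection",
    2: "fast hashing",
    3: "set union and intersection",
}
index = build_index(docs)
print(sorted(and_query(index, "set", "intersection")))  # [1, 3]
```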

Previous Comparison-Based Results
• Query: $S_1 \cap S_2$
• Classical solution:
  • Represent sets as sorted lists.
  • Query by merging and reporting duplicates: $O(|S_1| + |S_2|)$ time.
• Special cases with faster solutions:
  • When $|S_1| \ll |S_2|$: $O\left(|S_1| \log\left(1 + \frac{|S_2|}{|S_1|}\right)\right)$ time [HL1972].
  • When $S_1 \cap S_2$ consists of few sublists from $S_1$ and $S_2$ (adaptive algorithms) [DLM2000, BK2002].
• Generalizations to more complicated expressions involving intersections and unions [CFM2005].
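A minimal sketch of the classical merge, assuming each set is given as a sorted list of distinct values. The second function is a simple binary-search stand-in for the case where one set is much smaller; it runs in $O(|S_1| \log |S_2|)$ time and is not the [HL1972] algorithm. Both function names are hypothetical.

```python
from bisect import bisect_left

def intersect_sorted(a, b):
    """Intersect two sorted lists of distinct values by a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:  # value present in both lists
            out.append(a[i])
            i += 1
            j += 1
    return out

def intersect_small_large(small, large):
    """Binary-search each element of the small list in the large one."""
    out = []
    lo = 0  # both lists are sorted, so the search window only moves right
    for x in small:
        lo = bisect_left(large, x, lo)
        if lo == len(large):
            break
        if large[lo] == x:
            out.append(x)
    return out

print(intersect_sorted([1, 2, 4, 7], [2, 3, 4, 8]))       # [2, 4]
print(intersect_small_large([4, 9], list(range(0, 20, 2))))  # [4]
```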

Previous Non-Comparison Based Results
• Fast solution when $|S_1| \ll |S_2|$:
  • Build a hashing-based dictionary for each set.
  • Look up the elements of $S_1$ in the dictionary for $S_2$: $O(|S_1|)$ time.
• For very small universes:
  • Represent sets as bitstrings.
  • Compute intersections as a bitwise AND.
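Both non-comparison approaches can be sketched in a few lines. This is an illustration only: Python's built-in `set` stands in for the hashing-based dictionary, an arbitrary-precision integer stands in for the bitstring, and all function names are hypothetical.

```python
def intersect_by_lookup(s1, s2_dict):
    """Look up every element of the smaller set s1 in a hash-based set."""
    return {x for x in s1 if x in s2_dict}

def to_bitstring(s):
    """Represent a set of small non-negative integers as a bitstring."""
    bits = 0
    for x in s:
        bits |= 1 << x
    return bits

def from_bitstring(bits):
    """List the positions of the set bits in increasing order."""
    out = []
    x = 0
    while bits:
        if bits & 1:
            out.append(x)
        bits >>= 1
        x += 1
    return out

print(sorted(intersect_by_lookup({3, 8}, set(range(0, 100, 2)))))    # [8]
print(from_bitstring(to_bitstring({1, 4, 8}) & to_bitstring({4, 8, 9})))  # [4, 8]
```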

Our Results
• Theorem: There is a non-comparison-based linear-space representation supporting intersection queries $S_1 \cap S_2$ in expected time
  $O\left(\frac{(|S_1| + |S_2|) \log^2 w}{w} + occ\right)$,
  where $w$ is the word size and $occ$ is the number of elements in the intersection.
• Output-sensitive algorithm.
• For $occ < (|S_1| + |S_2|)/w$ the algorithm runs in sublinear time.
• All previously known solutions use worst-case linear time even if the intersection is empty.
• We show how to generalize the result to arbitrary union-intersection expressions.
• We give a communication complexity lower bound proving that the result is near optimal.
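To get a feel for the factor $\log^2 w / w$ in the bound, here is a small worked evaluation. It assumes base-2 logarithms and ignores the constants hidden by the $O$-notation, so it indicates the trend only; the word sizes are chosen for illustration.

```latex
\[
\left.\frac{\log^2 w}{w}\right|_{w=64} = \frac{6^2}{64} = \frac{36}{64} \approx 0.56,
\qquad
\left.\frac{\log^2 w}{w}\right|_{w=256} = \frac{8^2}{256} = \frac{64}{256} = 0.25 .
\]
```

The factor shrinks as the word size grows, which is where the sublinear behaviour for small $occ$ comes from.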

Approximate Set Representation
[Figure: the elements $x_1, x_2, x_3$ of $S$ mapped to the hash values $h(x_1), h(x_2), h(x_3)$ in $h(S)$.]
• Represent a set $S \subseteq \{0, 1\}^w$ as a set of hash function values $h(S)$.
• $h(S)$ is an approximate set representation:
  • If $x \in S$ then $h(x) \in h(S)$.
  • If $x \notin S$ then $h(x) \notin h(S)$ with probability close to 1.
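The two properties can be checked on a small example. The talk does not say which hash family is used; the sketch below assumes a multiply-shift hash from 64-bit keys to $r$-bit values, and `make_hash`, the word size `W`, and the sample set are all hypothetical.

```python
import random

W = 64  # assumed word size in bits

def make_hash(r, seed=0):
    """Return a multiply-shift hash from W-bit keys to r-bit values."""
    a = random.Random(seed).getrandbits(W) | 1  # random odd multiplier
    mask = (1 << W) - 1
    return lambda x: ((a * x) & mask) >> (W - r)

S = {3, 17, 42, 1000, 123456}
h = make_hash(r=16)
hS = {h(x) for x in S}  # the approximate representation h(S)

# Members are never rejected: x in S implies h(x) in h(S).
assert all(h(x) in hS for x in S)

# Non-members are only rarely accepted (false positives).
false_pos = sum(1 for x in range(200000, 210000) if h(x) in hS)
print(false_pos, "false positives among 10000 non-members")
```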

Computing Intersections
1. Compute the intersection of the approximate representations, $H = h(S_1) \cap h(S_2)$.
  • We do this in $o(|S_1| + |S_2|)$ time.
2. Compute $S'_1 = \{x \in S_1 \mid h(x) \in H\}$ and $S'_2 = \{x \in S_2 \mid h(x) \in H\}$.
  • With a hash table that allows us to look up a value $h(x)$ and retrieve all elements with this value, this takes $O(|S'_1| + |S'_2|)$ time.
3. Compute and return $S'_1 \cap S'_2$.
• Idea: If the hash function is suitably chosen, the number of elements to be checked in step 2 is small.
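The three steps can be sketched as follows. This shows the structure of the algorithm only: step 1 uses ordinary Python set intersection, so the sketch does not achieve the $o(|S_1| + |S_2|)$ bound. The multiply-shift hash and the names `make_hash`, `preprocess`, and `intersect` are assumptions, not taken from the talk.

```python
import random
from collections import defaultdict

W = 64  # assumed word size in bits

def make_hash(r, seed=0):
    """Return a multiply-shift hash from W-bit keys to r-bit values."""
    a = random.Random(seed).getrandbits(W) | 1
    mask = (1 << W) - 1
    return lambda x: ((a * x) & mask) >> (W - r)

def preprocess(s, h):
    """Hash table from each hash value to the elements of s with that value."""
    table = defaultdict(list)
    for x in s:
        table[h(x)].append(x)
    return table

def intersect(t1, t2):
    # Step 1: intersect the approximate representations h(S1) and h(S2).
    H = t1.keys() & t2.keys()
    # Step 2: retrieve the surviving candidates S1' and S2'.
    c1 = {x for v in H for x in t1[v]}
    c2 = {x for v in H for x in t2[v]}
    # Step 3: intersect the (hopefully small) candidate sets exactly.
    return c1 & c2

h = make_hash(r=20)
s1 = set(range(0, 1000, 3))
s2 = set(range(0, 1000, 5))
result = intersect(preprocess(s1, h), preprocess(s2, h))
print(len(result))  # 67 multiples of 15 below 1000
```

Step 3 is what removes false positives: two different elements may share a hash value, but only elements present in both sets survive the exact intersection.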

Choosing Hash Functions
• The number of bits used for the hash values should be:
  • Small enough so that $H = h(S_1) \cap h(S_2)$ can be computed quickly.
  • Large enough to get a significant reduction in the number of remaining elements in $S'_1 = \{x \in S_1 \mid h(x) \in H\}$ and $S'_2 = \{x \in S_2 \mid h(x) \in H\}$, so that $S'_1 \cap S'_2$ can be computed quickly.
• The optimal range of the hash function depends on the size of the input sets.
• We store $S$ at “multiple resolutions” using hash functions with different ranges.

[Figure: an array of $2^b$ buckets; each bucket holds a packed array of $(r - b)$-bit values stored in $w$-bit words.]
• We store a set $h_r(S)$ of $r$-bit hash values as a bucketed set for parameter $b$:
  • Elements with the same $b$ most significant bits are stored in the same bucket.
  • Elements in the same bucket are represented by their $r - b$ least significant bits as a sorted packed array.
• We choose $b$ to minimize total space.
• We can store a sufficient set of resolutions of $S$ in total linear space.
• $r - b = O(\log w)$.
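A minimal sketch of a bucketed set, assuming the hash values are already computed. Plain Python lists stand in for the buckets, and `pack` uses one arbitrary-precision integer in place of a sequence of $w$-bit words. The function names and the sample values are hypothetical.

```python
def bucketed_set(hash_values, r, b):
    """Split r-bit values into 2**b buckets by their b most significant bits.

    Each bucket keeps only the r - b least significant bits, in sorted order.
    """
    low_bits = r - b
    low_mask = (1 << low_bits) - 1
    buckets = [[] for _ in range(1 << b)]
    for v in sorted(set(hash_values)):
        buckets[v >> low_bits].append(v & low_mask)
    return buckets

def pack(bucket, low_bits):
    """Pack the fields of one bucket side by side into a single integer."""
    word = 0
    for i, field in enumerate(bucket):
        word |= field << (i * low_bits)
    return word

# r = 8-bit hash values, b = 2 bucket bits, so 6 bits remain per element.
values = [0b00000101, 0b00111111, 0b01000001, 0b11000000]
buckets = bucketed_set(values, r=8, b=2)
print(buckets)              # [[5, 63], [1], [], [0]]
print(pack(buckets[0], 6))  # 4037, i.e. 5 | (63 << 6)
```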

[Figure: an array of $2^b$ buckets; each bucket holds a packed array of $(r - b)$-bit values stored in $w$-bit words.]
• Intersection algorithm for bucketed sets:
  • Convert buckets to have a common (suitably chosen) parameter $b$:
    • Create a new array of size $2^b$.
    • Repartition packed arrays among the new buckets.
    • Modify the number of bits in the packed array representation.
  • Compute the intersection of the sorted packed arrays in each bucket.
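The conversion to a common parameter can be sketched by rebuilding each full hash value and splitting it again at the new bit position. This is an element-by-element illustration with plain lists, not the word-parallel procedure of the talk; `rebucket` and its arguments are hypothetical names.

```python
def rebucket(buckets, r, b_old, b_new):
    """Convert a bucketed set of r-bit values from parameter b_old to b_new."""
    low_old = r - b_old
    low_new = r - b_new
    low_mask = (1 << low_new) - 1
    new = [[] for _ in range(1 << b_new)]  # new array of size 2**b_new
    for prefix, bucket in enumerate(buckets):
        for field in bucket:
            v = (prefix << low_old) | field           # rebuild the r-bit value
            new[v >> low_new].append(v & low_mask)    # store with the new width
    return new

old = [[5, 63], [1], [], [0]]  # r = 8, b = 2
print(rebucket(old, r=8, b_old=2, b_new=1))  # [[5, 63, 65], [64]]
print(rebucket(old, r=8, b_old=2, b_new=3))  # [[5], [31], [1], [], [], [], [0], []]
```

Because buckets and fields are visited in increasing order, every new bucket is again sorted, so the per-bucket intersection can still use a merge.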

[Figure: worked example. The sorted packed arrays 1 2 3 4 5 7 8 10 12 13 and 1 4 6 8 11 12 are merged into 1 1 2 3 4 4 5 6 7 8 8 10 11 12 12 13; keeping the duplicate values and compacting yields 1 4 8 12.]
• Lemma: [AH1992, ATNR1995] All of the above operations can be computed in $O(\log w)$ time per word in the packed arrays.
• Total time: $O\left(\frac{(|S_1| + |S_2|) \log w}{w} \cdot \log w\right)$
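The merge, keep-duplicates, and compact steps can be traced on the two arrays from the figure. The sketch below works on one element at a time with plain lists, whereas the cited lemma covers word-parallel versions operating on packed words. It assumes each input array holds distinct values; `intersect_bucket` is a hypothetical name.

```python
from heapq import merge

def intersect_bucket(a, b):
    """Intersect two sorted arrays of distinct values: merge, mark, compact."""
    # Merge: one sorted sequence containing every element of a and b.
    merged = list(merge(a, b))
    # Keep duplicate values: a value in both inputs appears twice in a row.
    marked = [x if x == nxt else None for x, nxt in zip(merged, merged[1:])]
    # Compact: drop the empty slots.
    return [x for x in marked if x is not None]

a = [1, 2, 3, 4, 5, 7, 8, 10, 12, 13]
b = [1, 4, 6, 8, 11, 12]
print(intersect_bucket(a, b))  # [1, 4, 8, 12]
```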

Our Results
• Theorem: There is a non-comparison-based linear-space representation supporting intersection queries $S_1 \cap S_2$ in expected time
  $O\left(\frac{(|S_1| + |S_2|) \log^2 w}{w} + occ\right)$.
• In the paper:
  • Generalization to arbitrary union-intersection expressions
  • Lower bound
• Open problem:
  • Can we extend this to set difference?
