RE-Tree: An Efficient Index Structure for Regular Expressions Chee-Yong Chan, Minos Garofalakis, Rajeev Rastogi Information Sciences Research Center Bell Laboratories, Lucent Technologies
Motivation
• Regular expressions (REs) provide a simple yet powerful formalism for pattern/structure specifications.
• Example applications:
  – XPath pattern language for XML documents
  – Policy language of the Border Gateway Protocol (BGP)
• RE filtering problem: given a set R of REs and an input string s, the RE filter returns the subset of R that matches s.
Our Approach: RE-Tree
• Idea: partition the RE data set using a height-balanced hierarchical index structure to maximize pruning of the search space.
• Challenge: REs generally define infinite sets, and there is no well-defined metric for clustering REs.
RE-Tree Overview
• Dynamic, height-balanced, hierarchical index structure.
• REs are stored as finite automata (FAs) in the leaf nodes.
• Internal nodes contain directory entries pointing to nodes at the next level; each directory entry = (FA, pointer).
[Figure: an RE-tree whose internal nodes hold FAs M1 to M4 and whose leaf nodes hold FAs M5 to M8.]
RE-Tree: Containment Property
• Example: L(M1) ⊇ L(M2) ∪ L(M3) ∪ L(M4)
• M1 = bounding FA of {M2, M3, M4}
[Figure: node N' holds the entry M1 = a(a|b)(a|b|c)*, which points to node N holding M2 = a(a|b)c*, M3 = aa(a|b|c)*c, and M4 = ab(bc|cc)*.]
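The containment property is what makes pruning safe: if a bounding FA rejects a string, no RE below it can match. The slides do not spell out the search procedure, so the following is only a minimal sketch of how a lookup might use the property. It assumes Python `re` patterns as stand-ins for FAs, and the `Node` class and `search` function are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    # Each entry is (compiled pattern, child). child is a Node for a
    # directory entry and None for a leaf entry (a stored RE).
    entries: list = field(default_factory=list)


def search(node, s):
    """Return the patterns of all stored REs that match the whole string s."""
    matches = []
    for pattern, child in node.entries:
        if pattern.fullmatch(s) is None:
            continue  # entry rejects s, so nothing below it can match: prune
        if child is None:
            matches.append(pattern.pattern)
        else:
            matches.extend(search(child, s))
    return matches


# The example from the slide: M1 bounds M2, M3, M4.
leaf_n = Node([(re.compile(p), None)
               for p in ("a(a|b)c*", "aa(a|b|c)*c", "ab(bc|cc)*")])
root = Node([(re.compile("a(a|b)(a|b|c)*"), leaf_n)])

print(search(root, "aac"))
print(search(root, "ba"))
```

For the string "ba" the root entry already fails, so node N is never visited.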
Bounding Finite Automata
• Many possible bounding FAs for a given set of FAs.
  – The most precise FA accepts the union of L(Mi) for all Mi in the set.
  – The least precise FA accepts Σ*.
• Space-precision tradeoff for bounding FAs:
  – A more precise FA improves search pruning, but its size could be large, resulting in lower fan-out of the index node.
• The RE-tree controls fan-out by bounding the maximum number of states per internal FA (using an index parameter α).
• Goal: optimize search performance by maximizing the precision of bounding FAs.
RE-Trees vs. R-Trees
• RE-trees are similar in spirit to R-trees.

                        R-trees                              RE-trees
Data type               Multi-dimensional rectangles         Regular languages
Internal node entries   Minimal bounding rectangles (MBRs)   Bounding FAs
Update operations       Minimize volume of MBRs              Minimize size of languages accepted by bounding FAs
RE-Tree Algorithms
• RE-tree construction involves three key operations:
  – Selecting an optimal insertion node
  – Computing an optimal bounding FA
  – Computing an optimal node split
RE-Tree Optimization Problems
• Let S = {M1, M2, ..., Mn} be the set of FAs in a node N.
• Selecting an optimal insertion node: select the node corresponding to the Mi that maximizes |L(M ∩ Mi)|, where M is the FA to be inserted.
• Computing an optimal bounding FA: compute M, a bounding FA of S (with at most α states), that minimizes |L(M)|.
• Computing an optimal node split: partition S into S1 & S2 such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.
• Note: the language sizes |L(·)| being optimized are possibly infinite!
Main Challenge
• Problem: how to measure the size of REs?
• Observe: REs that define infinite languages need not have the same size. Example: (a|b)* is larger than a(a|b)*.
• Idea: need a computable measure for the size of REs that captures the intuition of the "larger than" relationship.
• Let L(M,i) = set of length-i strings in L(M).
• Intuitively, L(M) is larger than L(M') iff
  $\exists N \in \mathbb{Z}^{+} \ \text{s.t.}\ \forall k > N:\ \sum_{i=1}^{k} |L(M,i)| > \sum_{i=1}^{k} |L(M',i)|$
Max-Count Size Measure
• Idea: count the size of L(M) up to some maximum length k.
  |L(M)| = |L(M,1)| + |L(M,2)| + ... + |L(M,k)|
• Cons: sensitive to the value of the maximum length parameter.
  Example:
    L(M1) = (b|c)* d (a|b)* d (b|c)* d
    L(M2) = dd (a|b|c)* d
  L(M2) is larger than L(M1), but the max-count measure is correct iff the maximum length parameter value > 15.
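The max-count measure can be computed without enumerating strings, because in a deterministic FA each accepted string corresponds to exactly one path. The sketch below counts paths by dynamic programming. The tuple layout `(start, accepting, delta)` and the hand-built DFAs for M1 and M2 are assumptions made for illustration, not taken from the slides.

```python
def max_count(dfa, k):
    """Number of strings of length 1..k accepted by a DFA.

    dfa = (start, accepting, delta) with delta[state][symbol] -> state.
    A missing transition means the string is rejected.
    """
    start, accepting, delta = dfa
    paths = {start: 1}  # paths[q] = number of strings of the current length reaching q
    total = 0
    for _ in range(k):
        nxt = {}
        for state, n in paths.items():
            for target in delta.get(state, {}).values():
                nxt[target] = nxt.get(target, 0) + n
        paths = nxt
        total += sum(n for state, n in paths.items() if state in accepting)
    return total


# M1 = (b|c)* d (a|b)* d (b|c)* d   (hand-built DFA, assumed)
M1 = (0, {3}, {0: {"b": 0, "c": 0, "d": 1},
               1: {"a": 1, "b": 1, "d": 2},
               2: {"b": 2, "c": 2, "d": 3}})
# M2 = dd (a|b|c)* d                (hand-built DFA, assumed)
M2 = (0, {3}, {0: {"d": 1},
               1: {"d": 2},
               2: {"a": 2, "b": 2, "c": 2, "d": 3}})

for k in (5, 10, 20):
    print(k, max_count(M1, k), max_count(M2, k))
```

For small k the count for M1 exceeds the count for M2, and the ordering flips as k grows, which is the sensitivity the slide warns about.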
MDL-based Size Measure
• MDL principle: provides an information-theoretic definition of an optimal model for a given data set.
• Observation: for $L(M_1) \supseteq S$ and $L(M_2) \supseteq S$,
  $\sum_{w \in S} \mathrm{Encode}(w, M_1) < \sum_{w \in S} \mathrm{Encode}(w, M_2) \implies M_1 \text{ is more precise than } M_2$
• MDL-based measure:
  $\sum_{w \in S_1} \mathrm{Encode}(w, M_1)/|w| < \sum_{w \in S_2} \mathrm{Encode}(w, M_2)/|w| \implies L(M_2) \text{ is larger than } L(M_1)$
Definition of Encoding(w,M)
• How to encode w ∈ L(M) using M?
• Let p = <s0, s1, ..., sn> be the accepting path of w in M.
• $\mathrm{Encode}(w, M) = \sum_{i=0}^{n-1} \log(\text{number of outgoing transitions of } s_i)$
• Example: [Figure: automaton M with transitions labeled b; a,b,c; and d, d, d.]
  Encode(ddbd, M) = log(1) + log(2) + 2 log(4) = 5 (logarithms base 2; one term per symbol of w)
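The definition translates directly into a walk along the accepting path. In the sketch below, the automaton M is an assumption: it is reconstructed from the transition labels that survive from the slide's figure as b* d d (a|b|c)* d, and the `(start, accepting, delta)` layout is a hypothetical representation. Logarithms are taken base 2.

```python
import math


def encode(w, dfa):
    """Sum of log2(number of outgoing transitions) over the states on w's path."""
    start, accepting, delta = dfa
    state, bits = start, 0.0
    for symbol in w:
        out = delta.get(state, {})
        if symbol not in out:
            raise ValueError(f"{w!r} is not accepted")
        bits += math.log2(len(out))  # cost of choosing one of the outgoing transitions
        state = out[symbol]
    if state not in accepting:
        raise ValueError(f"{w!r} is not accepted")
    return bits


# Assumed automaton for b* d d (a|b|c)* d
M = (0, {3}, {0: {"b": 0, "d": 1},
              1: {"d": 2},
              2: {"a": 2, "b": 2, "c": 2, "d": 3}})

print(encode("ddbd", M))  # 1 + 0 + 2 + 2 = 5.0
```

A state with a single outgoing transition costs nothing to encode, so strings that pass through tightly constrained states get short encodings.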
Algorithm to Optimize Bounding FA
• Compute a bounding FA M for a given set of FAs S s.t.
  (1) M has at most α states, and
  (2) |L(M)| is minimized.
• Problem is NP-hard.
• Heuristic: compute the most precise FA for S, then incrementally relax its precision (by greedily merging pairs of states) until the space constraint is satisfied.
An Example
• Compute a bounding FA for S = { abb*, aa*b } with α = 3.
[Figure: sequence of automata over {a, b} illustrating the state merges; the diagrams are not recoverable from the extracted text.]
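The greedy state-merging heuristic can be sketched on this example. The slides do not say which pair is merged at each step, so this sketch makes an assumption: it picks the merge whose result accepts the fewest strings of length up to k (the max-count measure). The NFA layout `(start, accepting, transitions)` and all function names are hypothetical.

```python
from itertools import combinations


def states_of(nfa):
    start, accepting, trans = nfa
    return {start} | set(accepting) | {s for s, _, _ in trans} | {d for _, _, d in trans}


def step(trans, subset, symbol):
    return frozenset(d for s, a, d in trans if s in subset and a == symbol)


def accepts(nfa, w):
    start, accepting, trans = nfa
    subset = frozenset([start])
    for symbol in w:
        subset = step(trans, subset, symbol)
    return bool(subset & accepting)


def max_count(nfa, k):
    """Distinct accepted strings of length 1..k, via on-the-fly subset construction."""
    start, accepting, trans = nfa
    alphabet = {a for _, a, _ in trans}
    level, total = {frozenset([start]): 1}, 0
    for _ in range(k):
        nxt = {}
        for subset, n in level.items():
            for symbol in alphabet:
                target = step(trans, subset, symbol)
                if target:
                    nxt[target] = nxt.get(target, 0) + n
        level = nxt
        total += sum(n for subset, n in level.items() if subset & accepting)
    return total


def merge(nfa, p, q):
    """Merge state q into p. Every old accepting path survives, so the language only grows."""
    start, accepting, trans = nfa

    def rename(s):
        return p if s == q else s

    return (rename(start),
            frozenset(rename(s) for s in accepting),
            frozenset((rename(s), a, rename(d)) for s, a, d in trans))


def bound(nfa, alpha, k=6):
    """Greedily merge state pairs until at most alpha states remain."""
    while len(states_of(nfa)) > alpha:
        pairs = combinations(sorted(states_of(nfa)), 2)
        nfa = min((merge(nfa, p, q) for p, q in pairs),
                  key=lambda m: max_count(m, k))
    return nfa


# Most precise FA for S = {abb*, aa*b}: the union of both automata with a shared start state.
union = (0, frozenset({2, 4}),
         frozenset({(0, "a", 1), (1, "b", 2), (2, "b", 2),
                    (0, "a", 3), (3, "a", 3), (3, "b", 4)}))
bounding = bound(union, alpha=3)
print(len(states_of(bounding)), max_count(union, 6), max_count(bounding, 6))
```

Because merging can only add accepting paths, every intermediate automaton is still a bounding FA for S; the greedy choice only affects how much precision is lost.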
Other RE-Tree Algorithms
• Selecting an optimal insertion node
  – Select the node corresponding to the Mi that maximizes |L(M ∩ Mi)|, where M is the FA to be inserted.
• Computing an optimal node split
  – Partition S into S1 & S2 (each with at least m FAs) such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.
  – Problem is NP-hard.
  – The heuristic used is similar to the R-tree's Quadratic Split algorithm.
Optimizing RE-Tree Operations
• RE-tree algorithms involve many FA operations (i.e., union & intersection).
• Speed up performance using sampling techniques.
  Example: selecting an optimal insertion node requires computing |L(Mi ∩ M)| for each Mi in the current node.
  An unbiased estimate of |L(Mi ∩ M, k)| is given by
  $\frac{\#\,\text{strings in } S \text{ accepted by } M_i}{|S|} \cdot |L(M, k)|$
  where S = uniform random sample of L(M, k).
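The estimator above can be sketched as follows. The slides do not say how the uniform sample of L(M, k) is drawn; this sketch assumes M is available as a DFA and samples by weighting each transition with the number of accepted completions. Mi is stood in for by a Python `re` pattern, and all names are hypothetical.

```python
import random
import re


def completions(dfa, k):
    """table[j][q] = number of length-j strings leading from q to an accepting state."""
    _, accepting, delta = dfa
    states = set(delta) | set(accepting) | {t for out in delta.values() for t in out.values()}
    table = [{q: int(q in accepting) for q in states}]
    for _ in range(k):
        prev = table[-1]
        table.append({q: sum(prev[t] for t in delta.get(q, {}).values()) for q in states})
    return table


def sample(dfa, k, table, rng):
    """Draw one string uniformly at random from L(M, k); requires |L(M, k)| > 0."""
    start, _, delta = dfa
    state, out = start, []
    for remaining in range(k, 0, -1):
        symbols = sorted(delta[state])
        # Weight each symbol by how many accepted completions follow it.
        weights = [table[remaining - 1][delta[state][a]] for a in symbols]
        a = rng.choices(symbols, weights=weights)[0]
        out.append(a)
        state = delta[state][a]
    return "".join(out)


def estimate_intersection(dfa, pattern, k, n_samples, rng):
    """Estimate |L(Mi ∩ M, k)| as (fraction of samples accepted by Mi) * |L(M, k)|."""
    table = completions(dfa, k)
    size = table[k][dfa[0]]  # exact |L(M, k)|
    if size == 0:
        return 0.0
    hits = sum(pattern.fullmatch(sample(dfa, k, table, rng)) is not None
               for _ in range(n_samples))
    return hits / n_samples * size


# M = a(a|b)* as a DFA (assumed layout: start, accepting, delta).
M = (0, {1}, {0: {"a": 1}, 1: {"a": 1, "b": 1}})
rng = random.Random(0)
print(estimate_intersection(M, re.compile("(a|b)*b"), 5, 200, rng))
```

Only membership tests against Mi are needed, so the intersection automaton is never built.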
Related Work
• A lot of work on the traditional RE search problem: how to speed up searching of an RE query.
• But none on the RE filtering problem.
• Indexes for filtering XPath expressions: XFilter [VLDB'00], YFilter [ICDE'02], XTrie [ICDE'02], matchMaker [EDBT'02].
  – The class of REs supported in XPath is more restrictive.
  – Indexes for filtering XPath are all main-memory structures.
Experimental Evaluation
• Algorithms: RE-tree vs. sequential file approach.
• Data set: generated synthetic RE data sets.
  – Vary RE similarity, α, and size of data set.
• Queries: generated 1000 random query strings from the RE data set.
• System: 700 MHz Intel Pentium III with 512 MB memory running FreeBSD 4.1.
Varying Similarity of REs
[Figure: ratio of FA comparisons (y-axis, 0 to 3.5) vs. result size (x-axis, 0 to 50), with curves for p = 0.5, p = 0.75, and p = 1.0.]
Varying Similarity of REs
[Figure: ratio of evaluation time (y-axis, 0 to 2) vs. result size (x-axis, 0 to 50), with curves for p = 0.5, p = 0.75, and p = 1.0.]
Conclusions
• RE-tree, a novel index structure for REs.
• Novel size measures for REs.
• Update algorithms to optimize bounding FAs.
• Sampling-based techniques to speed up RE-tree operations.