RE-Tree: An Efficient Index Structure for Regular Expressions

Chee-Yong Chan, Minos Garofalakis, Rajeev Rastogi. Information Sciences Research Center, Bell Laboratories, Lucent Technologies.


  1. RE-Tree: An Efficient Index Structure for Regular Expressions
     Chee-Yong Chan, Minos Garofalakis, Rajeev Rastogi
     Information Sciences Research Center, Bell Laboratories, Lucent Technologies

  2. Motivation
     • Regular expressions (REs) provide a simple yet powerful formalism for pattern/structure specifications.
     • Example applications:
       – XPath pattern language for XML documents
       – Policy language of the Border Gateway Protocol (BGP)
     • RE filtering problem: given a set R of REs and an input string s, the RE filter returns the subset of R that matches s.
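
As a point of reference, the RE filtering problem can be solved without any index by testing s against every RE in R. The sketch below is only an illustration of that baseline: it uses Python's `re` module as a stand-in for finite automata, and the function name `filter_res` and the sample set `R` are assumptions, not part of the slides.

```python
import re

def filter_res(patterns, s):
    """Return every pattern whose language contains the whole string s."""
    # re.fullmatch requires the entire string to match, which mirrors
    # acceptance of s by the automaton of each RE.
    return [p for p in patterns if re.fullmatch(p, s)]

# Hypothetical RE set, borrowed from the containment example in the deck.
R = ["a(a|b)c*", "aa(a|b|c)*c", "ab(bc|cc)*"]
matches = filter_res(R, "abcc")
print(matches)
```

Every query costs one match test per RE in R, which is the cost an index tries to avoid.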

  3. Our Approach: RE-Tree
     • Idea: Partition the RE data set using a height-balanced hierarchical index structure to maximize pruning of the search space.
     • Challenge: REs generally define infinite sets, and there is no well-defined metric for clustering REs.

  4. RE-Tree Overview
     • Dynamic, height-balanced, hierarchical index structure.
     • REs are stored as finite automata (FA) in the leaf nodes.
     • Internal nodes contain directory entries pointing to nodes at the next level; each directory entry = (FA, Pointer).
     [Figure: RE-tree with internal FAs M1 to M4 above leaf FAs M5 to M8]

  5. RE-Tree: Containment Property
     • M1 = Bounding FA of { M2, M3, M4 }: L(M1) ⊇ L(M2) ∪ L(M3) ∪ L(M4)
     • Example:
       – Entry in parent node N': M1 = a (a | b) (a | b | c)*
       – Entries in child node N: M2 = a (a | b) c*, M3 = aa (a | b | c)* c, M4 = ab (bc | cc)*
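
The containment property is what allows a search to skip whole subtrees: if the bounding FA of an entry rejects the query string, no FA below it can accept it. The following is a minimal sketch of such a search, assuming Python `re` patterns stand in for FAs. The `Node` class, the `search` function, and the second leaf (with its bounding pattern `c(a|b)*`) are hypothetical and only serve the illustration.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    is_leaf: bool
    # Leaf: list of patterns. Internal: list of (bounding pattern, child Node).
    entries: list = field(default_factory=list)

def search(node, s, stats):
    """Collect all leaf patterns matching s, counting match tests in stats."""
    found = []
    if node.is_leaf:
        for pattern in node.entries:
            stats["tests"] += 1
            if re.fullmatch(pattern, s):
                found.append(pattern)
        return found
    for bound, child in node.entries:
        stats["tests"] += 1
        # Containment: if the bounding pattern rejects s, skip the subtree.
        if re.fullmatch(bound, s):
            found += search(child, s, stats)
    return found

leaf_n = Node(True, ["a(a|b)c*", "aa(a|b|c)*c", "ab(bc|cc)*"])
leaf_other = Node(True, ["c(a|b)*", "cb*"])          # hypothetical second leaf
root = Node(False, [("a(a|b)(a|b|c)*", leaf_n), ("c(a|b)*", leaf_other)])

stats = {"tests": 0}
result = search(root, "cab", stats)
print(result, stats["tests"])
```

For the query `cab`, the first bounding pattern rejects the string, so the three REs in node N are never tested.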

  6. Bounding Finite Automata
     • Many possible bounding FAs for a given set of FAs.
       – Most precise FA accepts the union of L(Mi) for all Mi in the set.
       – Least precise FA accepts Σ*.
     • Space-precision tradeoff for bounding FAs:
       – A more precise FA improves search pruning, but its size could be large, resulting in lower fan-out of the index node.
     • RE-tree controls fan-out by bounding the maximum number of states per internal FA (using an index parameter α).
     • Goal: Optimize search performance by maximizing precision of bounding FAs.

  7. RE-Trees vs. R-Trees
     • RE-trees are similar in spirit to R-trees.

                                R-trees                             RE-trees
       Data type                Multi-dimensional rectangles        Regular languages
       Internal node entries    Minimal bounding rectangles (MBR)   Bounding FAs
       Update operations        Minimize volume of MBRs             Minimize size of languages accepted by bounding FAs

  8. RE-Tree Algorithms
     • RE-tree construction involves three key operations:
       – Selecting an optimal insertion node
       – Computing an optimal bounding FA
       – Computing an optimal node split

  9. RE-Tree Optimization Problems
     • Let S = {M1, M2, ..., Mn} be the set of FAs in a node N.
     • Selecting an optimal insertion node: Select the node corresponding to the Mi that maximizes |L(M) ∩ L(Mi)|, where M is the FA to be inserted.
     • Computing an optimal bounding FA: Compute M, a bounding FA of S (with at most α states), that minimizes |L(M)|.
     • Computing an optimal node split: Partition S into S1 & S2 such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.

  10. RE-Tree Optimization Problems
     • Same three problems as on the previous slide: selecting an optimal insertion node, computing an optimal bounding FA, and computing an optimal node split.
     • Difficulty: the language sizes |L(·)| being maximized or minimized are possibly infinite!

  11. Main Challenge
     • Problem: How to measure the size of REs?
     • Observe: REs with infinite languages may not have the same size. Example: (a|b)* is larger than a(a|b)*.
     • Idea: Need a computable measure for the size of REs that captures the intuition of the "larger than" relationship.
     • Let L(M,i) = set of length-i strings in L(M).
     • Intuitively, L(M) is larger than L(M') iff $\exists N \in \mathbb{Z}^{+}$ s.t. $\forall k > N$: $\sum_{i=1}^{k} |L(M,i)| > \sum_{i=1}^{k} |L(M',i)|$

  12. Max-Count Size Measure
     • Idea: Count the size of L(M) up to some maximum length k: |L(M)| = |L(M,1)| + |L(M,2)| + ... + |L(M,k)|
     • Cons: Sensitive to the maximum length parameter value.
     • Example:
       – L(M1) = (b|c)* d (a|b)* d (b|c)* d
       – L(M2) = dd (a|b|c)* d
       – L(M2) is larger than L(M1), but the max-count measure is correct iff the maximum length parameter value > 15.
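
The max-count measure can be computed without enumerating strings when the FA is deterministic, because the number of length-i strings reaching each state follows from the counts for length i-1. The sketch below illustrates this on the pair (a|b)* and a(a|b)* from the previous slide. The dictionary encoding of a DFA and the names `length_counts` and `max_count` are assumptions made for the example.

```python
def length_counts(dfa, start, accepting, k):
    """Return [|L(M,1)|, ..., |L(M,k)|] for a DFA {state: {symbol: next_state}}."""
    paths = {start: 1}          # state -> number of strings that reach it
    counts = []
    for _ in range(k):
        nxt = {}
        for state, n in paths.items():
            for dest in dfa.get(state, {}).values():
                nxt[dest] = nxt.get(dest, 0) + n
        paths = nxt
        # In a DFA each string has one path, so path counts are string counts.
        counts.append(sum(n for st, n in paths.items() if st in accepting))
    return counts

def max_count(dfa, start, accepting, k):
    """Max-count size: number of accepted strings of length 1..k."""
    return sum(length_counts(dfa, start, accepting, k))

ANY_AB = {0: {"a": 0, "b": 0}}                     # (a|b)*, accepting state 0
A_THEN_AB = {0: {"a": 1}, 1: {"a": 1, "b": 1}}     # a(a|b)*, accepting state 1

print(max_count(ANY_AB, 0, {0}, 4), max_count(A_THEN_AB, 0, {1}, 4))
```

With k = 4 the measure already ranks (a|b)* above a(a|b)*; pairs such as the M1 and M2 above need a much larger k before the ranking is right, which is the sensitivity the slide warns about.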

  13. MDL-based Size Measure
     • MDL principle: Provides an information-theoretic definition of an optimal model for a given data set.
     • Observation: For L(M1) ⊇ S and L(M2) ⊇ S, $\sum_{w \in S} \mathrm{Encode}(w, M_1) < \sum_{w \in S} \mathrm{Encode}(w, M_2)$ ⇒ M1 is more precise than M2.
     • MDL-based measure: $\sum_{w \in S_1} \mathrm{Encode}(w, M_1) / |w| < \sum_{w \in S_2} \mathrm{Encode}(w, M_2) / |w|$ ⇒ L(M2) is larger than L(M1).

  14. Definition of Encode(w, M)
     • How to encode w ∈ L(M) using M?
     • Let p = <s0, s1, ..., sn> be the accepting path of w in M.
     • $\mathrm{Encode}(w, M) = \sum_{i=0}^{n-1} \log(\#\,\text{outgoing transitions in } s_i)$
     • Example: [Figure: FA M with transitions labeled b; a,b,c; d; d; d]
       Encode(ddbd, M) = log(1) + log(2) + log(4) + log(4) = 5 (logs base 2, one term per symbol of ddbd)
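
The encoding cost is a walk along the accepting path that adds the log of the out-degree of every state left. The sketch below is a hedged illustration: it assumes base-2 logarithms and a deterministic FA stored as a dictionary, and the two automata are hypothetical examples (for (a|b)* and a(a|b)*), not the FA drawn on the slide.

```python
import math

def encode(dfa, start, accepting, w):
    """Bits needed to encode w along its accepting path in a DFA.

    dfa is {state: {symbol: next_state}}; raises ValueError if w is rejected.
    """
    state, bits = start, 0.0
    for ch in w:
        out = dfa.get(state, {})
        if ch not in out:
            raise ValueError(f"{w!r} is not accepted")
        bits += math.log2(len(out))   # log of the number of outgoing transitions
        state = out[ch]
    if state not in accepting:
        raise ValueError(f"{w!r} is not accepted")
    return bits

ANY_AB = {0: {"a": 0, "b": 0}}                     # (a|b)*, accepting state 0
A_THEN_AB = {0: {"a": 1}, 1: {"a": 1, "b": 1}}     # a(a|b)*, accepting state 1

print(encode(A_THEN_AB, 0, {1}, "abb"), encode(ANY_AB, 0, {0}, "abb"))
```

The string `abb` costs fewer bits under a(a|b)* than under (a|b)*, matching the MDL observation that the more precise FA gives the shorter encoding.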

  15. Algorithm to Optimize Bounding FA
     • Compute a bounding FA M for a given set of FAs S s.t. (1) M has at most α states, and (2) |L(M)| is minimized.
     • Problem is NP-hard.
     • Heuristic: Compute the most precise FA for S and then incrementally relax its precision (by greedily merging pairs of states) until the space constraint is satisfied.

  16. An Example
     • Compute a bounding FA for S = { abb*, aa*b } with α = 3.
     [Figure: sequence of automata over {a, b} produced by successive state merges]
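
The greedy state-merging heuristic can be sketched on this example as follows. This is an illustrative sketch under stated assumptions, not the authors' algorithm: the starting automaton is a hand-built NFA for the union of abb* and aa*b, each step tries every pair of states, and the pair is chosen by the max-count measure up to a hypothetical length bound `k` (the slides do not say which measure drives the merge choice). All function names are made up for the example.

```python
from itertools import combinations

# An NFA is (transitions, start, accepting); transitions = {(src, symbol, dst)}.

def states(nfa):
    trans, start, accepting = nfa
    return {start} | set(accepting) | {s for s, _, _ in trans} | {d for _, _, d in trans}

def accepts(nfa, w):
    trans, start, accepting = nfa
    current = {start}
    for ch in w:
        current = {d for s, x, d in trans if s in current and x == ch}
    return bool(current & accepting)

def count_upto(nfa, k):
    """Distinct accepted strings of length 1..k, via on-the-fly subset construction."""
    trans, start, accepting = nfa
    symbols = {x for _, x, _ in trans}
    level, total = {frozenset([start]): 1}, 0
    for _ in range(k):
        nxt = {}
        for subset, n in level.items():
            for a in symbols:
                dest = frozenset(d for s, x, d in trans if s in subset and x == a)
                if dest:
                    nxt[dest] = nxt.get(dest, 0) + n
        level = nxt
        total += sum(n for subset, n in level.items() if subset & accepting)
    return total

def merge(nfa, p, q):
    """Fold state q into state p; the language can only grow."""
    trans, start, accepting = nfa
    f = lambda s: p if s == q else s
    return ({(f(s), x, f(d)) for s, x, d in trans}, f(start),
            frozenset(f(s) for s in accepting))

def bounding_fa(nfa, alpha, k=8):
    while len(states(nfa)) > alpha:
        candidates = (merge(nfa, p, q)
                      for p, q in combinations(sorted(states(nfa)), 2))
        nfa = min(candidates, key=lambda m: count_upto(m, k))  # least growth
    return nfa

# Union of abb* (states 0,1,2) and aa*b (states 0,3,4); 0 is the start state.
UNION = ({(0, "a", 1), (1, "b", 2), (2, "b", 2),
          (0, "a", 3), (3, "a", 3), (3, "b", 4)}, 0, frozenset({2, 4}))
bounded = bounding_fa(UNION, alpha=3)
print(len(states(bounded)))
```

Because merging two states never removes an accepting path, the result still accepts every string of both REs, so it remains a valid bounding FA.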

  17. Other RE-Tree Algorithms
     • Selecting an optimal insertion node
       – Select the node corresponding to the Mi that maximizes |L(M) ∩ L(Mi)|, where M is the FA to be inserted.
     • Computing an optimal node split
       – Partition S into S1 & S2 (each with at least m FAs) such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.
       – Problem is NP-hard.
       – Heuristic used is similar to the R-tree's Quadratic Split Algorithm.
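
A split in the spirit of quadratic split can be sketched as: pick the two entries that overlap least as seeds, then hand each remaining entry to the group whose language grows least. The code below is a simplified illustration, not the heuristic from the talk: languages are approximated by enumerating strings up to a hypothetical length `k` over a fixed alphabet, Python `re` patterns stand in for FAs, patterns are assumed distinct, and entries are assigned in input order rather than by a pick-next rule.

```python
import itertools
import re

def language(pattern, alphabet="abc", k=4):
    """Finite stand-in for L(M): accepted strings of length 1..k."""
    rx = re.compile(pattern)
    words = ("".join(t) for n in range(1, k + 1)
             for t in itertools.product(alphabet, repeat=n))
    return frozenset(w for w in words if rx.fullmatch(w))

def split(patterns, m=1):
    """Partition patterns into two groups, each with at least m entries."""
    langs = {p: language(p) for p in patterns}
    # Seeds: the pair with the smallest overlap wastes the most if kept together.
    seed1, seed2 = min(itertools.combinations(patterns, 2),
                       key=lambda pq: len(langs[pq[0]] & langs[pq[1]]))
    groups = [[seed1], [seed2]]
    unions = [set(langs[seed1]), set(langs[seed2])]
    rest = [p for p in patterns if p not in (seed1, seed2)]
    for i, p in enumerate(rest):
        remaining = len(rest) - i
        # A group that needs every remaining entry to reach m gets them all.
        starved = [g for g in (0, 1) if len(groups[g]) + remaining <= m]
        if starved:
            g = starved[0]
        else:
            g = min((0, 1), key=lambda j: len(langs[p] - unions[j]))
        groups[g].append(p)
        unions[g] |= langs[p]
    return groups

groups = split(["a(a|b)c*", "ab(bc|cc)*", "c(a|b)*", "cb*"])
print(groups)
```

On this hypothetical input the REs starting with `a` end up in one group and those starting with `c` in the other, which keeps both union languages small.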

  18. Optimizing RE-Tree Operations
     • RE-tree algorithms involve many FA operations (i.e., union & intersection).
     • Speed up performance using sampling techniques.
     • Example: Selecting an optimal insertion node requires computing |L(Mi) ∩ L(M)| for each Mi in the current node.
     • An unbiased estimate of |L(Mi) ∩ L(M, k)| is given by $\frac{\#\,\text{strings in } S \text{ accepted by } M_i}{|S|} \cdot |L(M, k)|$, where S = uniform random sample of L(M, k).
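
The estimator above can be tried out on small cases. The sketch below is only an illustration: it draws the uniform sample by brute-force enumeration of L(M, k) over a small alphabet, which would not scale, and it uses Python `re` patterns in place of FAs. The function names, the alphabet, and the seeded `random.Random` are assumptions for the example.

```python
import itertools
import random
import re

def strings_of_length(pattern, alphabet, k):
    """Enumerate L(M, k): all length-k strings over alphabet matching pattern."""
    rx = re.compile(pattern)
    words = ("".join(t) for t in itertools.product(alphabet, repeat=k))
    return [w for w in words if rx.fullmatch(w)]

def estimate_overlap(pattern_m, pattern_mi, alphabet, k, sample_size, rng):
    """Estimate |L(Mi) ∩ L(M, k)| from a uniform sample S of L(M, k)."""
    population = strings_of_length(pattern_m, alphabet, k)
    if not population:
        return 0.0
    # Uniform sampling with replacement from L(M, k).
    sample = [rng.choice(population) for _ in range(sample_size)]
    rx = re.compile(pattern_mi)
    hits = sum(1 for w in sample if rx.fullmatch(w))
    return hits / len(sample) * len(population)

rng = random.Random(0)
M = "a(a|b)(a|b|c)*"       # FA to be inserted (18 strings of length 4)
Mi = "a(a|b)c*"            # candidate entry (2 of those 18 strings)
estimate = estimate_overlap(M, Mi, "abc", 4, 200, rng)
print(estimate)
```

The fraction of sampled strings accepted by Mi estimates the share of L(M, k) that lies inside L(Mi); scaling by |L(M, k)| turns that share into a count.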

  19. Related Work
     • A lot of work on the traditional RE search problem: how to speed up searching of an RE query.
     • But none on the RE filtering problem.
     • Indexes for filtering XPath expressions: XFilter [VLDB'00], YFilter [ICDE'02], XTrie [ICDE'02], matchMaker [EDBT'02].
       – Class of REs supported in XPath is more restrictive.
       – Indexes for filtering XPath are all main-memory structures.

  20. Experimental Evaluation
     • Algorithms: RE-tree vs. Sequential File Approach.
     • Data set: Generated synthetic RE data sets.
       – Vary RE similarity, α, and size of data set.
     • Queries: Generated 1000 random query strings from the RE data set.
     • System: 700 MHz Intel Pentium III with 512 MB memory running FreeBSD 4.1.

  21. Varying Similarity of REs
     [Plot: ratio of FA comparisons (y-axis, 0 to 3.5) vs. result size (x-axis, 0 to 50), with curves for p = 0.5, p = 0.75, and p = 1.0]

  22. Varying Similarity of REs
     [Plot: ratio of evaluation time (y-axis, 0 to 2) vs. result size (x-axis, 0 to 50), with curves for p = 0.5, p = 0.75, and p = 1.0]

  23. Conclusions
     • RE-tree, a novel index structure for REs.
     • Novel size measures for REs.
     • Update algorithms to optimize bounding FAs.
     • Sampling-based techniques to speed up RE-tree operations.
