Efficient Filtering of XML Documents with XPath Expressions
Authors: Chee-Yong Chan, Pascal Felber, Minos Garofalakis, Rajeev Rastogi
Bell Laboratories, Lucent Technologies
{cychan,pascal,minos,rastogi}@research.bell-labs.com
Speaker: Lam-Son LE, LamSon.Le@epfl.ch, EPFL, I&C Doctoral School, WS 2002/2003
Distributed Information Processing
Page 1
Outline
• Introduction
– publish/subscribe systems, “bags of words” vs. XPath language
• Background
– XPE-tree, unordered/ordered matching
• XPE Decompositions and Matchings
– substring/minimal/simple decomposition, substring-tree
• The XTrie Indexing Scheme
– substring table, Trie
– matching algorithm
• Evaluation
– comparison with XFilter
Page 2
Introduction
• Selective data dissemination
– publishers selectively deliver data to subscribers
• Simple matching scheme: “bags of words”
• XML data emergence
– XPath as filter-specification language; XPE = XPath Expression
– Retrieval problem: given a collection P of XPEs and an input XML document D, find the subset of XPEs in P that match D
• XTrie is an index built on XPath expressions
• XTrie efficiently filters XML documents
– indexes a set of substrings rather than individual elements
– supports both ordered and unordered matching
Page 3
Background (1/3)
• XML documents as trees
– root element; sub-elements can be nested to any depth
– level(root) = 1, level(d) = level(d’) + 1 if d’ is the parent of d
• XPath expressions (XPEs)
– “/”: parent/child operator
– “//”: ancestor/descendant operator
– “*”: wildcard operator
– “[”, “]”: delimit a predicate
– example: p = //a//b[*/c]/d
– 2 patterns: path pattern and tree pattern
Page 4
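The level definition above can be illustrated with a small SAX handler. This is only a sketch using Python's standard xml.sax module; the names LevelHandler and element_levels are made up for this example and do not come from the slides.

```python
import xml.sax


class LevelHandler(xml.sax.ContentHandler):
    """Records (element name, level) pairs in document (pre-)order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.levels = []

    def startElement(self, name, attrs):
        # level(root) = 1, level(d) = level(parent of d) + 1
        self.depth += 1
        self.levels.append((name, self.depth))

    def endElement(self, name):
        self.depth -= 1


def element_levels(xml_text):
    """Hypothetical helper: parse a string and return the recorded levels."""
    handler = LevelHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.levels
```

For example, element_levels("<a><b><c/></b><d/></a>") lists a at level 1, b at level 2, c at level 3 and d at level 2.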
Background (2/3)
• XPE-tree
– predicate expressions give rise to branches of the tree
– the XPE-tree is ordered if the elements in the XPE are required to appear in order
– relative level of a node t_i in the XPE-tree
• relLevel(t_i) = [k, ∞] if t_i is prefixed with “//” followed by (k-1) “*” (a range)
• relLevel(t_i) = [k, k] if t_i is prefixed with “/” followed by (k-1) “*” (a precise value)
Page 5
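The two relLevel rules can be sketched for a predicate-free path XPE as follows. The function name rel_levels is hypothetical, and the handling of prefixes that mix “/” and “//” (any “//” makes the range unbounded) is an assumption that goes beyond the two cases listed on the slide.

```python
import math
import re


def rel_levels(path):
    """Return [(name, (lo, hi))] for each named step of a predicate-free XPE.

    lo counts the steps since the previous named element (wildcards included);
    hi is infinite when a "//" occurs among those steps (assumed generalization).
    """
    result = []
    steps, unbounded = 0, False
    for op, name in re.findall(r"(//|/)([^/]+)", path):
        steps += 1
        unbounded = unbounded or op == "//"
        if name != "*":
            result.append((name, (steps, math.inf if unbounded else steps)))
            steps, unbounded = 0, False
    return result
```

Applied to the path //a//b/*/c, this gives [1, ∞] for a, [1, ∞] for b and [2, 2] for c.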
Background (3/3)
• Unordered matching
– a set of document nodes whose names match the nodes of the XPE-tree
– the level differences between matched nodes agree with the relative levels
• Ordered matching is stronger: the order of elements in the XPE-tree is also taken into account
• Matching example
– p = //a//b[*/c]/d
– { a2, b4, c6, d7 } is an ordered matching of D to p
[Figure: XML tree D with nodes b1, a2, b3, b4, e5, c6, d7, f8, c9, b10; XPE-tree T with nodes //a [1, ∞], //b [1, ∞], /*/c [2,2], /d [1,1]]
Page 6
XPE Decompositions (1/3)
• Substring of an XPE
– a sequence of consecutive nodes of the XPE-tree, each joined to the next by “/”
– example: p = /a/b[c/d//e][g//e/f]//*/*/e/f. Possible substrings: abg, bcd, ef, b
• Substring decomposition: a set of substrings that covers all nodes in the XPE-tree
• Minimal decomposition: no substring may be a prefix of another
– advantage: substrings are as long as possible, so each is less likely to be found and matched
Page 7
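For a path pattern without predicates, the longest substrings are the runs of named elements joined directly by “/”. The sketch below computes them together with each substring's level relative to the previous one. The name decompose_path is hypothetical, element names are assumed to be single letters as in the slides, and predicates (tree patterns) are not handled.

```python
import math
import re


def decompose_path(path):
    """Split a predicate-free XPE into maximal "/"-connected substrings.

    Returns [(substring, (lo, hi))], where the range is the level of the
    substring's last element relative to the previous substring's last element.
    """
    steps, unbounded = 0, False
    runs = []  # each run is [names, lo, unbounded]
    for op, name in re.findall(r"(//|/)([^/]+)", path):
        steps += 1
        unbounded = unbounded or op == "//"
        if name == "*":
            continue
        if runs and steps == 1 and not unbounded:
            # joined to the previous named element by a single "/"
            runs[-1][0].append(name)
            runs[-1][1] += 1
        else:
            runs.append([[name], steps, unbounded])
        steps, unbounded = 0, False
    return [("".join(names), (lo, math.inf if unb else lo))
            for names, lo, unb in runs]
```

On the deck's later example p1 = //a/a/b/c/*/a/b this yields aabc with [4, ∞] and ab with [3, 3].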
XPE Decompositions (2/3)
• Simple decomposition: add a substring for each branching node to the minimal decomposition
• Substring-tree: nodes are the substrings of the simple decomposition
– a substring is the parent of another if it is a prefix of the child, or
– the last element of the parent substring is the parent node of the first element of the child substring
• Relative level is extended to substrings
– computed from the relative levels of the elements that lie between the given substring and its parent
Page 8
XPE Decompositions (3/3)
• Example for p = /a/b[c/d//e][g//e/f]//*/*/e/f
[Figure: three panels showing the simple decomposition, the minimal decomposition, and the substring-tree of p; substring labels in the figure: ab, abcd, e, abg, ef, ef]
Page 9
Matching with Substrings (1/2)
• A substring matches a node in the XML document if its last element matches that node
• Typically, XML documents are parsed in pre-order (SAX parser), so substrings should also be ordered by a pre-order traversal of the substring-tree
• Partial matching: a matching for all consecutive substrings from the first up to the given substring
• Complete matching: a partial matching for the final substring
• Subtree-matching: partial matchings found for all descendants of the given substring
• Redundant matching: a subtree-matching was already found at some earlier node in the XML document
Page 10
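The pre-order numbering of substrings can be sketched from parent and rank information alone. The function name preorder_rows and the dictionary layout are hypothetical; the sample data are the rows for p3 = /a/b[c/*/d]//b/c as they appear in the substring-table example later in the deck.

```python
def preorder_rows(rows, root):
    """rows maps a row index to (substring, parent index, rank among siblings)."""
    children = {}
    for index, (_, parent, rank) in rows.items():
        children.setdefault(parent, []).append((rank, index))

    order = []

    def visit(index):
        order.append(index)
        # visit children in rank order, each subtree before the next sibling
        for _, child in sorted(children.get(index, [])):
            visit(child)

    visit(root)
    return order


# Rows for p3 (substring, parent row, rank); parent 0 means "no parent".
P3_ROWS = {6: ("ab", 0, 1), 7: ("abc", 6, 1), 8: ("d", 7, 1), 9: ("bc", 6, 2)}
```

Here preorder_rows(P3_ROWS, 6) visits ab, abc, d, bc, which is the order in which those rows are numbered in the table.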
Matching with Substrings (2/2)
• Again, p = //a//b[*/c]/d
– s1 = a, s2 = b, s3 = c, s4 = bd
– the matchings at c9 and b10 are redundant
[Figure: substring-tree with s1 = a [1, ∞], s2 = b [1, ∞], s3 = c [2,2], s4 = bd [1,1]; XML tree D (nodes b1 to b10) annotated with the substrings matched at each node]
Page 11
XTrie Indexing Scheme (1/2)
• The XTrie index is built for a set of XPEs
– derive the simple decomposition of every XPE
– associate each substring with its relative level
• Consists of 2 data structures
– Trie T: a tree whose edges are labeled with element names from the XML document
– Substring-Table ST: each row represents a substring
Page 12
XTrie Indexing Scheme (2/2)
• Example
– p1 = //a/a/b/c/*/a/b
– p2 = /a/b[c/e]/*/b/c/d
– p3 = /a/b[c/*/d]//b/c
– p4 = //c/b//c/d/*/*/d
[Figure: trie T with edges labeled a, b, c, d, e]
Substring-Table ST:
Substring  Row  Parent row  Relative level  Rank  Number of children  Next row
aabc       1    0           [4, ∞]          1     1                   0
ab         2    1           [3, 3]          1     0                   3
ab         3    0           [2, 2]          1     2                   6
abce       4    3           [2, 2]          1     0                   0
bcd        5    3           [4, 4]          2     0                   0
ab         6    0           [2, 2]          1     2                   0
abc        7    6           [1, 1]          1     1                   0
d          8    7           [2, 2]          1     0                   12
bc         9    6           [2, ∞]          2     0                   0
cb         10   0           [2, ∞]          1     1                   0
cd         11   10          [2, ∞]          1     1                   0
d          12   11          [3, 3]          1     0                   0
Page 13
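The two data structures can be sketched as a list of rows plus a nested-dictionary trie. This is a simplified illustration, not the authors' implementation: the names Row, build_trie and rows_for are hypothetical, the row values are read from the table on this slide, and the trie here stores only a pointer to the first row of each substring.

```python
from dataclasses import dataclass
import math

INF = math.inf


@dataclass
class Row:
    substring: str
    parent: int
    rel_level: tuple
    rank: int
    num_children: int
    next_row: int  # next row with the same substring, 0 if none


# Index 0 is unused so that list indices equal the table's row numbers.
ST = [
    None,
    Row("aabc", 0, (4, INF), 1, 1, 0),
    Row("ab", 1, (3, 3), 1, 0, 3),
    Row("ab", 0, (2, 2), 1, 2, 6),
    Row("abce", 3, (2, 2), 1, 0, 0),
    Row("bcd", 3, (4, 4), 2, 0, 0),
    Row("ab", 0, (2, 2), 1, 2, 0),
    Row("abc", 6, (1, 1), 1, 1, 0),
    Row("d", 7, (2, 2), 1, 0, 12),
    Row("bc", 6, (2, INF), 2, 0, 0),
    Row("cb", 0, (2, INF), 1, 1, 0),
    Row("cd", 10, (2, INF), 1, 1, 0),
    Row("d", 11, (3, 3), 1, 0, 0),
]


def build_trie(st):
    """Nested dicts keyed by element name; "#row" holds the first ST row."""
    trie = {}
    for index, row in enumerate(st):
        if row is None:
            continue
        node = trie
        for name in row.substring:
            node = node.setdefault(name, {})
        node.setdefault("#row", index)
    return trie


def rows_for(trie, st, substring):
    """Follow the trie, then the Next-row chain, to list all rows of a substring."""
    node = trie
    for name in substring:
        node = node.get(name)
        if node is None:
            return []
    found, index = [], node.get("#row", 0)
    while index:
        found.append(index)
        index = st[index].next_row
    return found
```

With these definitions, rows_for(build_trie(ST), ST, "ab") follows the chain 2, 3, 6, so a single trie lookup reaches every XPE that contains the substring ab.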
XTrie Matching Algorithm (1/2)
• Based on SAX: the algorithm is notified each time an element name is parsed
• Requires an additional two-dimensional array B of size <number of rows in ST> × <maximum level of XML document>
• B[s, l]
– is initialized to 0 at the beginning
– is incremented by 1 when a non-redundant matching of s at level l is found
– is reset to 0 when the end-tag at level l is parsed
• An XPE p matches the XML document if B[rs, l] = m + 1 for some level l, where
– rs is the root substring in the substring-tree for p
– m is the number of child substrings of rs
Page 14
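The bookkeeping rules for B can be sketched as a small class. This covers only the counter array and the final test; deciding when a matching is non-redundant is left to the caller and is not shown. The class and method names are hypothetical.

```python
class MatchCounters:
    """Counter array B[s][l], indexed by ST row s and document level l."""

    def __init__(self, num_rows, max_level):
        # all counters start at 0; index 0 is unused in both dimensions
        self.B = [[0] * (max_level + 1) for _ in range(num_rows + 1)]

    def record(self, s, level):
        """Call when a non-redundant matching of row s at this level is found."""
        self.B[s][level] += 1

    def end_tag(self, level):
        """Call when the end-tag of an element at this level is parsed."""
        for row in self.B:
            row[level] = 0

    def xpe_matches(self, root_row, num_children):
        """True if B[rs, l] = m + 1 for some level l."""
        return any(count == num_children + 1 for count in self.B[root_row])
```

For an XPE whose root substring has m = 1 child, the counter at some level must reach 2 before xpe_matches reports a match, and an end-tag at that level clears it again.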
XTrie Matching Algorithm (2/2)
• Again, p = //a//b[*/c]/d
[Figure: trie T over the element names a, b, c, d and XML tree D with nodes b1, a2, b3, b4, e5, c6, d7, f8, c9, b10]
Substring-Table ST:
Substring  Row  Parent row  Relative level  Rank  Number of children  Next row
a          1    0           [1, ∞]          1     1                   0
b          2    1           [1, ∞]          1     2                   0
c          3    2           [2, 2]          1     0                   0
bd         4    2           [1, 1]          2     0                   0
Page 15
Evaluation
• In comparison with XFilter (using a hashtable on single element names)
[Figure: two plots of filtering time (ms). Left: varying P (L=20, p_w=0.1, p_d=0.1, p_b=0). Right: varying document length (P=100k, L=20, p_w=0.1, p_d=0.1, p_b=0)]
Page 16
Thank you! Questions? Page 17