
TASM: Top-k Approximate Subtree Matching



  1. TASM: Top-k Approximate Subtree Matching
     Nikolaus Augsten (1), Denilson Barbosa (2), Michael Böhlen (3), Themis Palpanas (4)
     (1) Free University of Bozen-Bolzano, Italy, augsten@inf.unibz.it
     (2) University of Alberta, Canada, denilson@cs.ualberta.ca
     (3) University of Zurich, Switzerland, boehlen@ifi.uzh.ch
     (4) University of Trento, Italy, themis@disi.unitn.eu
     ICDE 2010, March 3, Long Beach, CA, USA

  2. Outline
     1 Motivation and Problem Definition
     2 TASM-Postorder
         Upper Bound on Subtree Size
         Prefix Ring Buffer Pruning
     3 Experiments
     4 Conclusion and Future Work

  3. Outline (section: Motivation and Problem Definition)

  4. Motivation
     Query (XML fragment): an article with authors (author Tim, author John) and booktitle ICDE.
     Document (very large XML): DBLP, 28M nodes, 531MB.
     Task: rank the top-k matches for the article query in the DBLP document.
     Example answer for k = 3:
     [Figure: three matching subtrees, rooted at inproceedings, inproceedings, and article, labelled 1 error, 2 errors, and 3 errors.]

  5. TASM: Top-k Approximate Subtree Matching
     Definition (TASM: Top-k Approximate Subtree Matching)
     Given: query tree $Q$, document tree $T$, size $k$ of the ranking.
     Goal: compute a top-k ranking $R = (T_1, T_2, \ldots, T_k)$ of all subtrees $T_i$ of document $T$ with respect to query $Q$, using the tree edit distance for the ranking.
     - Subtree $T_i$: a node and all its descendants; the largest subtree is the document itself.
     - Top-k ranking $R = (T_1, T_2, \ldots, T_k)$: subtrees sorted by distance to the query.
     - Best k subtrees: $T_i \notin R \Rightarrow \mathrm{ted}(Q, T_k) \le \mathrm{ted}(Q, T_i)$.
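The definition can be read as a brute-force procedure: enumerate every subtree of the document and keep the k closest ones. The Python sketch below illustrates only that reading. Trees are assumed to be `(label, children)` tuples, `subtrees`, `size`, and `top_k` are hypothetical helper names, and `size_diff` is a stand-in distance used for the demo, not the tree edit distance.

```python
import heapq

def subtrees(tree):
    """Yield every subtree (a node and all its descendants) in postorder."""
    _label, children = tree
    for child in children:
        yield from subtrees(child)
    yield tree

def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])

def top_k(query, document, k, dist):
    """Return the k subtrees of `document` that are closest to `query` under `dist`."""
    return heapq.nsmallest(k, subtrees(document), key=lambda t: dist(query, t))

def size_diff(a, b):
    """Stand-in distance for the demo only (NOT the tree edit distance)."""
    return abs(size(a) - size(b))

query = ("x", (("y", ()),))
document = ("a", (("b", ()), ("c", (("d", ()),))))
ranking = top_k(query, document, 2, size_diff)
```

Any function with the signature `dist(query, subtree)` can be plugged in; the slides use the tree edit distance.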

  6. Ranking Function: Tree Edit Distance (TED)
     [Figure: three article trees connected by the edit operations ren(ICDE) and del(authors).]
     Tree edit distance: the minimum number of node edit operations (insert, rename, delete) that transform one tree into the other.
     - TASM computes TED between the query and document subtrees.
     - The size and number of the computed subtrees define the complexity of TASM.
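To make the unit-cost tree edit distance concrete, here is a minimal Python sketch based on the textbook forest recursion with memoization. It is meant for small ordered trees only and is not the Zhang and Shasha algorithm cited in the slides. Trees are assumed to be `(label, children)` tuples; `forest_size`, `fed`, and `ted` are hypothetical names, and the two example trees are one reading of the slide's figure.

```python
from functools import lru_cache

def forest_size(forest):
    """Total number of nodes in a forest (a tuple of trees)."""
    return sum(1 + forest_size(children) for _label, children in forest)

@lru_cache(maxsize=None)
def fed(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return forest_size(g)          # insert every node of g
    if not g:
        return forest_size(f)          # delete every node of f
    (lv, cv), (lw, cw) = f[-1], g[-1]  # rightmost roots v and w
    return min(
        fed(f[:-1] + cv, g) + 1,       # delete v: its children move up
        fed(f, g[:-1] + cw) + 1,       # insert w
        fed(cv, cw) + fed(f[:-1], g[:-1]) + (lv != lw),  # map v to w, rename if needed
    )

def ted(t1, t2):
    """Tree edit distance between two trees."""
    return fed((t1,), (t2,))

def leaf(label):
    return (label, ())

authors = (("author", (leaf("Tim"),)), ("author", (leaf("John"),)))
query = ("article", (("authors", authors), ("booktitle", (leaf("ICDE"),))))
match = ("article", authors + (("booktitle", (leaf("TKDE"),)),))
distance = ted(query, match)  # one delete (authors) plus one rename (ICDE)
```

The memoized recursion can visit many sub-forest pairs, so it only suits trees of a handful of nodes such as the 8-node example query.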

  7. State of the Art
     TASM-Dynamic: dynamic programming solution [1]
     - computes the distance to every subtree of the document
     - uses smaller subtrees to compute larger ones
     - ranks subtrees by visiting the memoization table
     Space complexity: $O(mn)$, where $m$ is the query size and $n$ is the document size.
     Space complexity limits application to databases:
     - in database applications $n$ is huge (database size!)
     - TASM-Dynamic maintains two $m \times n$ matrices in RAM
     - more than 6GB RAM for our tiny query ($m = 8$) on DBLP ($n = 28 \times 10^6$)
     At database scale, dynamic programming is too expensive. State-of-the-art algorithms do not scale!
     [1] Zhang and Shasha 1989, Demaine et al. 2007

  8. Problem Definition
     Find a solution for TASM (Top-k Approximate Subtree Matching) that
     - scales to very large documents
     - runs in small memory
     - ranks subtrees correctly (no heuristics!)

  9. Outline (section: TASM-Postorder)

  10. Outline (section: TASM-Postorder, Upper Bound on Subtree Size)

  11. Subtree Size Upper Bound in Three Steps
      1. Rank the first $k$ subtrees of $T$ in postorder: $R' = (T'_1, T'_2, \ldots, T'_k)$. For the worst match $T'_k$ (delete $Q$, then insert $T'_k$):
         (i) $\mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$, with $|T'_k| \le k$
      2. Final ranking $R = (T_1, T_2, \ldots, T_k)$ (= TASM result). The $T_i$ in $R$ are at least as good as the worst match $T'_k$ of $R'$:
         (ii) $\mathrm{ted}(Q, T_i) \le \mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$
      3. Size upper bound for subtree $T_i$. Transforming $Q$ into $T_i$ requires at least inserting the $|T_i| - |Q|$ missing nodes:
         $|T_i| - |Q| \le \mathrm{ted}(Q, T_i)$
         $|T_i| \le \mathrm{ted}(Q, T_i) + |Q| \le 2|Q| + |T'_k| \le 2|Q| + k$

  12. Upper Bound on Subtree Size
      Theorem (Upper Bound on Subtree Size): TASM needs to consider only small document subtrees of size $\tau$ or less, where $\tau = 2|Q| + k$.
      The upper bound is very powerful:
      - independent of document size and structure!
      - linear in query size and $k$
      Example: top-10 with the example query ($|Q| = 8$) on DBLP (28M nodes)
      - with the bound: maximum subtree size $\tau = 2 \cdot 8 + 10 = 26$
      - without the bound: maximum subtree size is 28M (the whole document)!
      Document-independent upper bound on subtree size!
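The arithmetic of the theorem is small enough to check directly. The Python snippet below is only an illustration; the function name `subtree_size_bound` is hypothetical.

```python
def subtree_size_bound(query_size, k):
    """Upper bound tau = 2|Q| + k on the size of subtrees that can enter the top-k ranking."""
    return 2 * query_size + k

# Example from the slide: |Q| = 8, k = 10.
tau = subtree_size_bound(8, 10)

# The bound takes no document size as input, so it is the same for a
# 22-node toy document and for the 28M-node DBLP document.
```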

  13. Outline (section: TASM-Postorder, Prefix Ring Buffer Pruning)

  14. Document Format: Postorder Queue
      [Figure: example document tree rooted at dblp, with an article, a proceedings node (one conf and two articles), and a book.]
      Postorder queue of the example document:
      (John,1) (auth,2) (X1,1) (title,2) (article,5) (VLDB,1) (conf,2) (Peter,1) (auth,2) (X3,1) (title,2) (article,5) (Mike,1) (auth,2) (X4,1) (title,2) (article,5) (proc,13) (X2,1) (title,2) (book,3) (dblp,22)
      Postorder queue: a queue of (label, size) pairs
      - dequeue removes the leftmost element, e.g., (John, 1)
      - no random access!
      Relevant and state of the art for XML:
      - Parsing: the full subtree is known only at the closing tag; closing tags appear in postorder
      - Implementation is efficient and heavily used for XML streams, plain XML files (e.g., SAX), and XML in databases (Dewey, interval encoding, ...)
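The postorder queue can be produced with a single traversal that emits each node after its children, together with its subtree size. The Python sketch below rebuilds the example document under the assumption that trees are `(label, children)` tuples; `postorder_queue`, `leaf`, and `article` are hypothetical helper names.

```python
from collections import deque

def postorder_queue(tree):
    """Return a deque of (label, size) pairs of `tree` in postorder."""
    queue = deque()

    def visit(node):
        label, children = node
        size = 1 + sum(visit(child) for child in children)  # children are emitted first
        queue.append((label, size))
        return size

    visit(tree)
    return queue

def leaf(label):
    return (label, ())

def article(author, title):
    return ("article", (("auth", (leaf(author),)), ("title", (leaf(title),))))

dblp = ("dblp", (
    article("John", "X1"),
    ("proc", (("conf", (leaf("VLDB"),)), article("Peter", "X3"), article("Mike", "X4"))),
    ("book", (("title", (leaf("X2"),)),)),
))

queue = postorder_queue(dblp)
pairs = list(queue)        # snapshot of the full queue
first = queue.popleft()    # dequeue removes the leftmost element
```

A streaming parser would emit the same pairs at each closing tag instead of materializing the tree first.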

  15. Candidate Subtrees
      Candidate subtrees are all subtrees $T_i$ of the document such that
      - $|T_i| \le \tau$, AND
      - $T_i$ is not contained in a larger subtree $T_j$ with $|T_j| \le \tau$.
      Pruning: find the candidate subtrees.

  16. Simple Pruning Approach
      [Figure: the example document tree with nodes numbered in postorder: John 1, auth 2, X1 3, title 4, article 5, VLDB 6, conf 7, Peter 8, auth 9, X3 10, title 11, article 12, Mike 13, auth 14, X4 15, title 16, article 17, proceedings 18, X2 19, title 20, book 21, dblp 22.]
      Simple pruning approach ($\tau = 6$ in the example above):
      - add nodes to a memory buffer until a non-candidate ($|T_i| > \tau$) is added
      - subtrees of the non-candidate with $|T_i| \le \tau$ are candidate subtrees
      Problem: the memory buffer can grow very large!
      - subtrees must be kept in memory until a non-candidate ancestor is read
      - worst case: the memory buffer stores $O(n)$ nodes (frequent in data-centric XML!)
      Example: DBLP, $\tau = 50$: 99% of the nodes are still in the buffer when the root node is read!
      Simple pruning is not feasible for large documents!
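The simple approach can be sketched in Python on the 22-node example. This is an illustrative reading of the slide, not the authors' code: `simple_pruning` is a hypothetical name, the buffer is modelled as a stack of completed subtree roots, and `peak` counts how many nodes are held at the worst moment.

```python
QUEUE = [("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5),
         ("VLDB", 1), ("conf", 2), ("Peter", 1), ("auth", 2), ("X3", 1),
         ("title", 2), ("article", 5), ("Mike", 1), ("auth", 2), ("X4", 1),
         ("title", 2), ("article", 5), ("proc", 13), ("X2", 1), ("title", 2),
         ("book", 3), ("dblp", 22)]

def simple_pruning(queue, tau):
    """Return (candidates, peak): candidate roots as (position, label, size) and peak buffer load."""
    stack = []            # roots of complete subtrees whose parent is still unread
    candidates = []
    buffered = peak = 0   # nodes currently held in memory, and the maximum seen
    for pos, (label, size) in enumerate(queue, start=1):
        buffered += 1
        peak = max(peak, buffered)
        children, remaining = [], size - 1
        while remaining > 0:              # the children are the topmost stack entries
            child = stack.pop()
            children.append(child)
            remaining -= child[2]
        if size > tau:                    # non-candidate: its small children are candidates
            for child in reversed(children):
                if child[2] <= tau:
                    candidates.append(child)
                    buffered -= child[2]
            buffered -= 1                 # the non-candidate node itself is dropped
        stack.append((pos, label, size))
    # Only the document root is left; it is a candidate if it is small enough.
    candidates.extend(root for root in stack if root[2] <= tau)
    return candidates, peak

candidates, peak = simple_pruning(QUEUE, tau=6)
```

On this toy document the buffer already holds 18 of the 22 nodes when the proceedings node arrives, which mirrors the memory problem described on the slide.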

  17. Efficient Pruning is Tricky!
      Problem: when can we remove a node from the buffer?
      - when we see $|T_i| \le \tau$, we don't yet know about the parent (postorder!)
      - the subtree of the parent might also be smaller than $\tau$!
      Our solution:
      - does not wait for the parent
      - prefix ring buffer: fixed-size buffer
      - pruning rule: prune based on the following nodes

  18. Pruning in Small Memory
      [Figure: prefix ring buffer ($\tau = 6$) holding (VLDB,1) (John,1) (auth,2) (X1,1) (title,4) (article,5), with pointers e and s.]
      Prefix ring buffer of size $\tau + 1$ (main memory):
      - stores a prefix ($\tau$ nodes in postorder) of the document
      - two operations: append a new node; remove the leftmost subtree/node
      Pruning rule: if the leftmost node in the full ring buffer is a
      - leaf: the leftmost subtree is a candidate subtree
      - non-leaf: the leftmost node is a non-candidate node
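The pruning rule can be tried out on the same 22-node example with a short Python sketch. Several points are assumptions rather than details from the slides: a `collections.deque` capped at $\tau$ nodes stands in for the ring buffer of size $\tau + 1$, the "leftmost subtree" is taken to be the largest buffered subtree whose leftmost leaf is the leftmost node, the rule keeps being applied after the input ends to drain the buffer, and `ring_buffer_pruning` is a hypothetical name.

```python
from collections import deque

QUEUE = [("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5),
         ("VLDB", 1), ("conf", 2), ("Peter", 1), ("auth", 2), ("X3", 1),
         ("title", 2), ("article", 5), ("Mike", 1), ("auth", 2), ("X4", 1),
         ("title", 2), ("article", 5), ("proc", 13), ("X2", 1), ("title", 2),
         ("book", 3), ("dblp", 22)]

def ring_buffer_pruning(queue, tau):
    """Return (candidates, peak) using a buffer that never holds more than tau nodes."""
    buf = deque()         # (postorder position, label, size), leftmost node first
    candidates = []
    peak = 0

    def prune():
        pos, _label, size = buf[0]
        if size > 1:      # non-leaf: its descendants are already gone, so it is a non-candidate
            buf.popleft()
            return
        # Leaf: pick the largest buffered subtree that starts at this leaf.
        root = max((n for n in buf if n[0] - n[2] + 1 == pos), key=lambda n: n[0])
        for _ in range(root[2]):          # the subtree occupies the leftmost root[2] slots
            buf.popleft()
        candidates.append(root)

    for pos, (label, size) in enumerate(queue, start=1):
        buf.append((pos, label, size))
        peak = max(peak, len(buf))
        if len(buf) == tau:               # buffer is full: apply the pruning rule
            prune()
    while buf:                            # end of input: drain the remaining nodes
        prune()
    return candidates, peak

candidates, peak = ring_buffer_pruning(QUEUE, tau=6)
```

With $\tau = 6$ the sketch reports the subtrees rooted at postorder positions 5, 7, 12, 17, and 21 while never holding more than 6 nodes. The first pruning step happens with John through VLDB in the buffer, the same six nodes shown in the slide's figure.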
