Approximate Tree Matching with pq-Grams

Nikolaus Augsten (a), Michael Böhlen, Johann Gamper
DIS - Center for Database and Information Systems
Free University of Bozen-Bolzano, Italy
www.inf.unibz.it

1 – Motivation
2 – Related Work
3 – pq-Grams
4 – Properties
5 – Experiments
6 – Conclusion and Future Work

(a) Supported by the Municipality of Bozen-Bolzano.

Motivation – Example Data Sources

☞ We want to link data items in different databases that correspond to the same real-world object.
☞ Example query: Who lives in Braun's apartment?

Land Register (LR):

  id   num  entr  apt  owner
  91   1    -     1    Maier
  91   1    -     2    Rossi
  91   1    -     3    Maier
  91   2    A     -    Braun
  ...
  74   3    A     1    Spiro
  74   3    A     2    Barducci
  74   3    A     3    Costanzi
  ...

Registration Office (RO):

  resident  id   num  entr  apt
  Pichler   30   1    -     1
  Rieder    30   1    -     3
  Fischer   30   2    A     -
  Rossi     30   2    B     1
  ...
  Spiro     120  3    A     1
  Barducci  120  3    A     2
  Costanzi  120  3    A     3
  ...

Streets in the Land Register (SLR):

  id   street
  139  SIEGESPLATZ
  109  GILMWEG
  185  P. R. GIULIANI STR.
  91   CESARE ABBA STRASSE
  165  MUSTERPLATZ
  115  ITALIENSTRASSE
  259  TELSERDURCHGANG
  207  SERNESIDURCHGANG
  33   BOZNER BODENWEG
  263  TRIESTER STRASSE
  262  TRIENTER STRASSE
  285  WALTHERPLATZ
  266  TURINER STRASSE
  ...

Streets in the Registration Office (SRO):

  id    street
  30    Giuseppe-Cesare-Abba-Str.
  5220  Bozner-Boden-Str.
  3000  Hermann-von-Gilm-Str.
  3030  Pater-Reginaldo-Giuliani-Str.
  3540  Italienallee
  4440  Musterplatzl
  7180  Raffaello-Sernesi-Galerie
  7590  Telsergalerie
  7620  Friedensplatz
  7650  Turiner Str.
  7740  Trienter Str.
  7860  Triester Str.
  8580  Walther-v.-d.-Vogelweide-Pl.
  ...

VLDB 2005, Trondheim. Nikolaus Augsten, Michael Böhlen, Johann Gamper

Motivation – Address Trees

☞ Residential addresses are hierarchical → address tree
☞ Idea: corresponding streets ⇒ similar address trees

How similar are two address trees?

[Figure: address trees of CESARE ABBA STRASSE and Giuseppe-Cesare-Abba-Str.]

Motivation – Standard Solution: The Edit Distance

☞ Edit distance: cost of the cheapest sequence of edit operations (node insertion, node deletion, and label change) that transforms one tree into another.

[Figure: T → T' by insert(k, e, 3), then T' → T'' by rename(a, x). Edit distance: dist_ed(T, T'') = 2.]

☞ Problem: the best algorithms run in $O(n^2 \log^2 n)$ ⇒ not scalable.

Motivation – Problem Definition

☞ Our goal: find an efficient and effective approximation of the tree edit distance that
  ➠ is scalable for large trees,
  ➠ emphasizes structure.

Related Work – Tree Distances

☞ n → number of tree nodes
☞ Tree edit distance:
  ➳ for balanced trees [Zhang and Shasha, 1989]: $O(n^2 \log^2 n)$
  ➳ for arbitrary trees [Klein, 1998]: $O(n^3 \log n)$
☞ Tree edit distance approximations:
  ➳ Restricted versions of the tree edit distance:
    ➠ Alignment [Jiang et al., 1995]: $O(n^2)$
    ➠ Isolated subtree [Tanaka and Tanaka, 1988]: $O(n^2)$
    ➠ Top-down [Selkow, 1977, Yang, 1991]: $O(n^2)$
    ➠ Bottom-up [Valiente, 2001]: $O(n)$ → only very specific domains
  ➳ XML versioning [Chawathe et al., 1996, Chawathe and Garcia-Molina, 1997, Lee et al., 2004]: $O(n^2)$ for very different trees
  ➳ Tree edit distance embedding [Garofalakis and Kumar, 2003, Garofalakis and Kumar, 2005]:
    ➠ $O(n \log n)$
    ➠ guaranteed distance distortion for tree edit distance with subtree move
☞ Related work for strings:
  ➳ Navarro [Navarro, 2001]: good overview of the edit distance for strings and its variants
  ➳ Ukkonen [Ukkonen, 1992]: q-grams as lower bound for string edit distance

pq-Grams – Subtrees of the pq-Extended Tree

☞ Extended tree $T^{p,q}$: patch the boundaries of T by adding null nodes (*):
  ➳ p − 1 ancestors to the root
  ➳ q − 1 nodes before the first and after the last child of each non-leaf node
  ➳ q children to each leaf
☞ pq-gram G: subtree of $T^{p,q}$ consisting of
  ➳ an anchor node
  ➳ with p − 1 ancestors
  ➳ and q children.
  Contiguous siblings in G are contiguous siblings in $T^{p,q}$.
☞ pq-gram profile $P_{p,q}(T)$:
  ➳ Bag of all pq-grams of T.

[Figure: example tree T and its 2,3-extended tree $T^{2,3}$]
[Figure: 2,3-gram pattern (p − 1 ancestors, anchor, q children) and example pq-grams for T: a 2,3-gram, a 1,2-gram, and a 3,2-gram]
[Figure: 2,3-gram profile of T]
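
The three padding rules for the extended tree can be sketched in Python. This is only an illustration, not the authors' code: the `(label, children)` tuple representation and the names `extend`, `extended_tree`, and `count_nodes` are assumptions made for this example, and the example tree is one reading of the slide's figure (root a with children a, b, c, where the first child a has children e and b).

```python
NULL = "*"  # label of the null nodes used for padding


def extend(node, q):
    """Pad one subtree: q null children per leaf, q - 1 nulls around siblings."""
    label, children = node
    if not children:
        # A leaf receives q null children.
        return (label, [(NULL, []) for _ in range(q)])
    # A non-leaf receives q - 1 nulls before its first and after its last child.
    before = [(NULL, []) for _ in range(q - 1)]
    after = [(NULL, []) for _ in range(q - 1)]
    return (label, before + [extend(c, q) for c in children] + after)


def extended_tree(root, p, q):
    """Return the pq-extended tree of a (label, children) tree."""
    tree = extend(root, q)
    # The root receives p - 1 null ancestors.
    for _ in range(p - 1):
        tree = (NULL, [tree])
    return tree


def count_nodes(node):
    """Count all nodes of a (label, children) tree."""
    return 1 + sum(count_nodes(c) for c in node[1])


# Assumed example tree: a(a(e, b), b, c)
T = ("a", [("a", [("e", []), ("b", [])]), ("b", []), ("c", [])])
T23 = extended_tree(T, p=2, q=3)
print(count_nodes(T), count_nodes(T23))
```

Materializing the extended tree is useful for understanding the definition; computing a profile does not require building it explicitly.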

pq-Grams – Algorithm for pq-Gram Profile

CREATEPROFILE(T, p, q, P, r, anc)
    anc := shift(anc, l(r))
    sib : shift register of size q (initialized with *)

    if r is a leaf then
        P := P ∪ (anc ◦ sib)
    else
        for each child c (from left to right) of r do
            sib := shift(sib, l(c))
            P := P ∪ (anc ◦ sib)
            P := CREATEPROFILE(T, p, q, P, c, anc)
        for k := 1 to q − 1 do
            sib := shift(sib, *)
            P := P ∪ (anc ◦ sib)
    return P

[Figure: step-by-step contents of the shift registers anc and sib and of the profile P while the algorithm traverses the example tree T]
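
The profile algorithm translates almost line by line into Python. The following is a minimal sketch under stated assumptions, not the authors' implementation: trees are `(label, children)` tuples, shift registers are tuples, the profile is a `collections.Counter` used as a bag, and the names `shift` and `pq_gram_profile` are made up for this example. The example tree is a reading of the slide's figure: a(a(e, b), b, c).

```python
from collections import Counter

NULL = "*"  # label of the null nodes


def shift(register, label):
    """Drop the oldest entry of a shift register and append a new label."""
    return register[1:] + (label,)


def pq_gram_profile(tree, p=2, q=3):
    """Return the pq-gram profile of a (label, children) tree as a bag."""
    profile = Counter()

    def visit(node, anc):
        label, children = node
        anc = shift(anc, label)   # the current node becomes the anchor
        sib = (NULL,) * q         # sibling register, initialized with nulls
        if not children:
            profile[anc + sib] += 1
            return
        for child in children:    # left to right
            sib = shift(sib, child[0])
            profile[anc + sib] += 1
            visit(child, anc)
        for _ in range(q - 1):    # flush the register with trailing nulls
            sib = shift(sib, NULL)
            profile[anc + sib] += 1

    visit(tree, (NULL,) * p)      # ancestor register starts with p nulls
    return profile


# Assumed example tree: a(a(e, b), b, c)
T = ("a", [("a", [("e", []), ("b", [])]), ("b", []), ("c", [])])
profile = pq_gram_profile(T, p=2, q=3)
for gram, count in profile.items():
    print(gram, count)
```

Each node is visited once and each pq-gram is emitted in constant time for fixed p and q, so the sketch runs in time linear in the tree size. Because it recurses once per tree level, very deep trees would need an explicit stack in Python.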

pq-Grams – The pq-Gram Profile

The pq-gram profile
☞ is small → size $O(n)$
☞ is easy to store
  ➳ represent the pq-grams by a fingerprint hash value
  ➳ store the profile in a single-attribute relation
☞ allows effective distance computation between trees

Theorem 1. For a tree T with l leaves and i non-leaves: $|P_{p,q}(T)| = 2l + qi - 1$.

Example: tree T → $P_{2,3}(T)$ as pq-grams → $P_{2,3}(T)$ as hash values.

T:
        a
      / | \
     a  b  c
    / \
   e   b

  pq-gram          hash
  (*, a, *, *, a)  10AE
  (a, a, *, *, e)  2F1E
  (a, e, *, *, *)  1008
  (a, a, *, e, b)  13E1
  (a, b, *, *, *)  5F31
  (a, a, e, b, *)  AE1D
  (a, a, b, *, *)  13DF
  (*, a, *, a, b)  F310
  (a, b, *, *, *)  5F31
  (*, a, a, b, c)  45A1
  (a, c, *, *, *)  973F
  (*, a, b, c, *)  3F1E
  (*, a, c, *, *)  11EF
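
The size formula of Theorem 1 can be checked by a short counting argument. The derivation below is a sketch based on the profile algorithm, not a proof taken from the slides; n and f(v) (the number of children of a non-leaf v) are symbols introduced here. It uses the fact that every node except the root is the child of exactly one non-leaf, and it instantiates the formula for the example tree, read here as a(a(e, b), b, c) with l = 4 and i = 2.

```latex
% Each leaf is the anchor of exactly one pq-gram.
% Each non-leaf v with f(v) children is the anchor of f(v) + q - 1 pq-grams.
\begin{align*}
|P_{p,q}(T)|
  &= l + \sum_{v \text{ non-leaf}} \bigl(f(v) + q - 1\bigr) \\
  &= l + (n - 1) + i\,(q - 1), \qquad n = l + i \\
  &= l + (l + i - 1) + qi - i \\
  &= 2l + qi - 1 .
\end{align*}
% Example tree with l = 4 leaves, i = 2 non-leaves, q = 3:
\[
|P_{2,3}(T)| = 2 \cdot 4 + 3 \cdot 2 - 1 = 13 ,
\]
% which agrees with the 13 pq-grams listed for the example.
```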

pq-Grams – pq-Gram Distance

☞ Definition 1. For two trees $T_1$ and $T_2$ the pq-gram distance is:

  $\Delta_{p,q}(T_1, T_2) = 1 - 2\,\dfrac{|P_{p,q}(T_1) \cap P_{p,q}(T_2)|}{|P_{p,q}(T_1) \cup P_{p,q}(T_2)|}$

  Here ∩ and ∪ are bag operations; the bag union keeps all occurrences, so its size is $|P_{p,q}(T_1)| + |P_{p,q}(T_2)|$.
☞ can be computed in $O(n \log n)$ time and $O(n)$ space (bag intersection of relations)
☞ other terms are constants for normalization:
  ➳ $\Delta_{p,q}(T_1, T_2) = 1$ if the trees have no pq-grams in common
  ➳ $\Delta_{p,q}(T_1, T_2) = 0$ if the trees have the same pq-gram profile
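
The distance of Definition 1 can be sketched with Python's `collections.Counter` as the bag type. This is an illustration under assumptions, not the authors' implementation: the function name `pq_gram_distance` is made up, profiles are passed as any iterable of hashable pq-grams, and the bag union is taken to be additive, which is what the two normalization cases require. The sample bags use single letters in place of real pq-grams.

```python
from collections import Counter


def pq_gram_distance(profile1, profile2):
    """pq-gram distance between two non-empty profiles given as bags."""
    bag1, bag2 = Counter(profile1), Counter(profile2)
    # Bag intersection: each pq-gram counts with its minimum multiplicity.
    common = sum((bag1 & bag2).values())
    # Bag union keeps every occurrence, so the sizes simply add up.
    total = sum(bag1.values()) + sum(bag2.values())
    return 1 - 2 * common / total


# Illustrative bags (letters stand in for pq-grams)
P1 = ["x", "x", "y"]
P2 = ["x", "y", "z"]
print(pq_gram_distance(P1, P2))
```

This sketch relies on hashing; the slide's $O(n \log n)$ bound instead refers to a bag intersection of stored relations, for example by sorting both profiles and merging them.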

Properties – Sensitivity to Structure Change

☞ Intuition: nodes with structural information → more significant
☞ Address application: a mismatch of houses (with subnumbers and apartment numbers) is more significant than a mismatch of apartments.

[Figure: T' ← T → T''. Each of T' and T'' is at edit distance dist_ed = 2 from T, yet the pq-gram distances are $\Delta_{2,3} = 0.30$ and $\Delta_{2,3} = 0.89$.]