Approximate Tree Matching with pq-Grams



  1. Approximate Tree Matching with pq-Grams
     Nikolaus Augsten (a), Michael Böhlen, Johann Gamper
     DIS - Center for Database and Information Systems, Free University of Bozen-Bolzano, Italy
     www.inf.unibz.it
     Outline:
       1 - Motivation
       2 - Related Work
       3 - pq-Grams
       4 - Properties
       5 - Experiments
       6 - Conclusion and Future Work
     (a) Supported by the Municipality of Bozen-Bolzano.

  2. Motivation - Example Data Sources
     - We want to link data items in different databases that correspond to the same real-world object.
     - Example query: Who lives in Braun's apartment?

     Land Register (LR):
       id   num  entr  apt  owner
       91   1    -     1    Maier
       91   1    -     2    Rossi
       91   1    -     3    Maier
       91   2    A     -    Braun
       ...
       74   3    A     1    Spiro
       74   3    A     2    Barducci
       74   3    A     3    Costanzi
       ...

     Registration Office (RO):
       resident  id   num  entr  apt
       Pichler   30   1    -     1
       Rieder    30   1    -     3
       Fischer   30   2    A     -
       Rossi     30   2    B     1
       ...
       Spiro     120  3    A     1
       Barducci  120  3    A     2
       Costanzi  120  3    A     3
       ...

     Streets of the Registration Office (SRO):
       id    street
       30    Giuseppe-Cesare-Abba-Str.
       5220  Bozner-Boden-Str.
       3000  Hermann-von-Gilm-Str.
       3030  Pater-Reginaldo-Giuliani-Str.
       3540  Italienallee
       4440  Musterplatzl
       7180  Raffaello-Sernesi-Galerie
       7590  Telsergalerie
       7620  Friedensplatz
       7650  Turiner Str.
       7740  Trienter Str.
       7860  Triester Str.
       8580  Walther-v.-d.-Vogelweide-Pl.
       ...

     Streets of the Land Register (SLR):
       id   street
       139  SIEGESPLATZ
       109  GILMWEG
       185  P. R. GIULIANI STR.
       91   CESARE ABBA STRASSE
       165  MUSTERPLATZ
       115  ITALIENSTRASSE
       259  TELSERDURCHGANG
       207  SERNESIDURCHGANG
       33   BOZNER BODENWEG
       263  TRIESTER STRASSE
       262  TRIENTER STRASSE
       285  WALTHERPLATZ
       266  TURINER STRASSE
       ...

     Markers on the slide: "?" beside the street tables, "!" beside the resident Fischer.
     Nikolaus Augsten, Michael Böhlen, Johann Gamper. VLDB 2005, Trondheim.
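The example query above can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the LR and RO tables, while the street correspondence `STREET_LINK` (91 to 30, 74 to 120) and the function name `residents_in_apartment_of` are assumptions made for this sketch. Finding that correspondence is exactly the part the tables leave open.

```python
# Rows copied from the example tables: (street id, num, entr, apt, owner)
LR = [
    (91, "1", "-", "1", "Maier"),
    (91, "1", "-", "2", "Rossi"),
    (91, "1", "-", "3", "Maier"),
    (91, "2", "A", "-", "Braun"),
    (74, "3", "A", "1", "Spiro"),
    (74, "3", "A", "2", "Barducci"),
    (74, "3", "A", "3", "Costanzi"),
]
# Rows copied from the example tables: (resident, street id, num, entr, apt)
RO = [
    ("Pichler", 30, "1", "-", "1"),
    ("Rieder", 30, "1", "-", "3"),
    ("Fischer", 30, "2", "A", "-"),
    ("Rossi", 30, "2", "B", "1"),
    ("Spiro", 120, "3", "A", "1"),
    ("Barducci", 120, "3", "A", "2"),
    ("Costanzi", 120, "3", "A", "3"),
]

# ASSUMPTION: mapping from LR street ids to RO street ids.
# The source tables do not provide this link; it is hard-coded here.
STREET_LINK = {91: 30, 74: 120}


def residents_in_apartment_of(owner):
    """Return the residents registered at the address(es) owned by `owner`."""
    found = []
    for street, num, entr, apt, lr_owner in LR:
        if lr_owner != owner:
            continue
        ro_key = (STREET_LINK.get(street), num, entr, apt)
        for resident, ro_street, ro_num, ro_entr, ro_apt in RO:
            if ro_key == (ro_street, ro_num, ro_entr, ro_apt):
                found.append(resident)
    return found


print(residents_in_apartment_of("Braun"))
```

Once the street link is known, the query is a plain join on street, number, entrance and apartment. Without it, the two databases cannot be joined at all.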

  7. Motivation - Address Trees
     - Residential addresses are hierarchical → address tree.
     - Idea: corresponding streets ⇒ similar address trees. How similar are two address trees?
     [Figure: address trees of CESARE ABBA STRASSE and Giuseppe-Cesare-Abba-Str.]
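A minimal sketch of how an address tree could be built from address rows, assuming the hierarchy is house number, then entrance, then apartment (the column order of the example tables). The nested-dictionary representation, the function name `build_address_tree`, and the choice to keep "-" as an ordinary label are assumptions of this sketch, not something the slides specify.

```python
def build_address_tree(rows):
    """Build a nested dict: house number -> entrance -> apartment.

    ASSUMPTION: a missing value ("-") is kept as an ordinary node label.
    """
    tree = {}
    for num, entr, apt in rows:
        tree.setdefault(num, {}).setdefault(entr, {}).setdefault(apt, {})
    return tree


# (num, entr, apt) rows of the example tables for one street in each database.
lr_street_91 = [("1", "-", "1"), ("1", "-", "2"), ("1", "-", "3"), ("2", "A", "-")]
ro_street_30 = [("1", "-", "1"), ("1", "-", "3"), ("2", "A", "-"), ("2", "B", "1")]

lr_tree = build_address_tree(lr_street_91)
ro_tree = build_address_tree(ro_street_30)
print(lr_tree)
print(ro_tree)
```

The two trees share most of their structure but are not identical, which is why an approximate notion of tree similarity is needed rather than an equality test.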

  9. Motivation - Standard Solution: The Edit Distance
     - Edit distance: the cost of the cheapest sequence of edit operations (node insertion, node deletion, and label change) that transforms one tree into another.
     - Example: insert(k, e, 3) transforms T into T', and rename(a, x) transforms T' into T'', so $\mathrm{dist}_{ed}(T, T'') = 2$.
       [Figure: trees T, T' and T'' with node labels a (x in T''), b, c, d, e, f, g, h, i, and k (in T' and T'' only).]
     - Problem: the best algorithms run in $O(n^2 \log^2 n)$ time ⇒ not scalable.
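The two edit operations of the example can be replayed on a small tree as follows. This sketch does not compute the edit distance (it does not search for the cheapest sequence); it only applies insert(k, e, 3) and rename(a, x). The tree shape used here is hypothetical because the figure's exact structure is not recoverable from the text, node labels are assumed to be unique, and `insert` only handles the insertion of a new leaf.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def find(node, label):
    """Depth-first search for the first node with the given label."""
    if node.label == label:
        return node
    for child in node.children:
        hit = find(child, label)
        if hit is not None:
            return hit
    return None


def insert(root, new_label, parent_label, k):
    """Insert a new leaf as the k-th child (1-based) of the parent node."""
    find(root, parent_label).children.insert(k - 1, Node(new_label))


def rename(root, old_label, new_label):
    """Change the label of a node."""
    find(root, old_label).label = new_label


def to_str(node):
    """Serialize a tree as label(child,child,...)."""
    if not node.children:
        return node.label
    return node.label + "(" + ",".join(to_str(c) for c in node.children) + ")"


# Hypothetical tree, not the tree of the figure: a(b, c, e(h, i))
t = Node("a", [Node("b"), Node("c"), Node("e", [Node("h"), Node("i")])])
before = to_str(t)

insert(t, "k", "e", 3)      # T  -> T'
after_insert = to_str(t)
rename(t, "a", "x")         # T' -> T''
after_rename = to_str(t)

print(before, after_insert, after_rename)
```

Two operations with unit cost give an upper bound of 2 on the distance between the first and the last tree. Showing that no cheaper sequence exists is the expensive part that the edit distance algorithms solve.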

  14. Motivation - Problem Definition
      - Our goal: find an efficient and effective approximation of the tree edit distance that
        - is scalable for large trees,
        - emphasizes structure.

  15. Related Work - Tree Distances
      - n: number of tree nodes
      - Tree edit distance:
        - for balanced trees [Zhang and Shasha, 1989]: $O(n^2 \log^2 n)$
        - for arbitrary trees [Klein, 1998]: $O(n^3 \log n)$
