on the fly token similarity joins in relational databases
play

On-the-Fly Token Similarity Joins in Relational Databases N. Augsten - PowerPoint PPT Presentation

On-the-Fly Token Similarity Joins in Relational Databases N. Augsten 1 A. Miraglia 2 T. Neumann 3 A. Kemper 3 1 University of Salzburg, Austria nikolaus.augsten@sbg.ac.at 2 VU University Amsterdam, Netherlands a.miraglia@student.vu.nl 3 TU M


  1. On-the-Fly Token Similarity Joins in Relational Databases N. Augsten 1 A. Miraglia 2 T. Neumann 3 A. Kemper 3 1 University of Salzburg, Austria nikolaus.augsten@sbg.ac.at 2 VU University Amsterdam, Netherlands a.miraglia@student.vu.nl 3 TU M¨ unchen, Germany { neumann,kemper } @in.tum.de June 26, 2014 SIGMOD, Snowbird, Utah Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 1 / 21

  2. Outline 1 Motivation 2 The Tokenize Operator Efficient Implementation Query Optimization 3 Experiments Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 2 / 21

  3. Motivation Outline 1 Motivation 2 The Tokenize Operator Efficient Implementation Query Optimization 3 Experiments Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 3 / 21

  4. Motivation Token Similarity Join R S A B snowbird ⋊ ⋉ sim( A , B ) ≥ 70% snowbasin canyons snowbirds . . . sim( A , B ) = | α ( A ) ∩ α ( B ) | . . . | α ( A ) ∪ α ( B ) | Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 4 / 21

  5. Motivation Token Similarity Join R S A B snowbird ⋊ ⋉ sim( A , B ) ≥ 70% snowbasin canyons snowbirds . . . sim( A , B ) = | α ( A ) ∩ α ( B ) | . . . | α ( A ) ∪ α ( B ) | α ( A ) α ( B ) #s ba bi sn as ir no si rd ow in d# wb n# 14 = 36% ✗ sim( A , B ) = 5 Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 4 / 21

  6. Motivation Token Similarity Join R S A B snowbird ⋊ ⋉ sim( A , B ) ≥ 70% snowbasin canyons snowbirds . . . sim( A , B ) = | α ( A ) ∩ α ( B ) | . . . | α ( A ) ∪ α ( B ) | α ( A ) α ( B ) α ( A ) α ( B ) #s ba bi #s sn sn as ir no ow ds no si d# rd wb bi s# ow in d# ir rd wb n# 14 = 36% ✗ 11 = 73% ✓ sim( A , B ) = 5 sim( A , B ) = 8 Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 4 / 21

  7. Motivation Token Generation in Similarity Joins recived little attention well studied generate R token set similarity ∼ ⊲ ⊳ S R join on generate tokens S token Part-Enum (VLDB’06) All-Pairs (WWW’07) precomputed tokens assumed PP-Join (WWW’08) token generation not part of MP-Join (Inf. Syst.’11) query plan Adapt-Join (SIGMOD’12) . . . Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

  8. Motivation Token Generation in Similarity Joins recived little attention well studied generate R token set similarity ∼ ⊲ ⊳ S R join on generate tokens S token Part-Enum (VLDB’06) All-Pairs (WWW’07) precomputed tokens assumed PP-Join (WWW’08) token generation not part of MP-Join (Inf. Syst.’11) query plan Adapt-Join (SIGMOD’12) . . . Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

  9. Motivation Token Generation in Similarity Joins recived little attention well studied generate R token set similarity ∼ ⊲ ⊳ S R join on generate tokens S token Part-Enum (VLDB’06) All-Pairs (WWW’07) precomputed tokens assumed PP-Join (WWW’08) token generation not part of MP-Join (Inf. Syst.’11) query plan Adapt-Join (SIGMOD’12) . . . Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

  10. Motivation Token Generation in Similarity Joins recived little attention well studied generate R token set similarity ∼ ⊲ ⊳ S R join on generate tokens S token Part-Enum (VLDB’06) All-Pairs (WWW’07) precomputed tokens assumed Goal: integrate token generation into query plan! PP-Join (WWW’08) token generation not part of MP-Join (Inf. Syst.’11) query plan Adapt-Join (SIGMOD’12) . . . Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 5 / 21

  11. Motivation Generating Tokens Stand-alone client: export data, generate tokens, import tokens ➥ overhead for export/import ➥ no integration into query plan ➥ only good for precomputation Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 6 / 21

  12. Motivation Generating Tokens Stand-alone client: export data, generate tokens, import tokens ➥ overhead for export/import ➥ no integration into query plan ➥ only good for precomputation Table function: ➥ UDF generates tokens on-the-fly ➥ table function used like a table in query Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 6 / 21

  13. Motivation State-of-the-Art: Table Function Customer tables R , S : ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’ SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) > = k; Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  14. Motivation State-of-the-Art: Table Function Customer tables R , S : ➥ join customers with similar names ➥ only customers from ’SLC’ and ’Salt Lake’ SELECT TR.ssn, TS.ssn FROM tblfunc(’R’, ’name’) TR, tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ AND TS.county=’Salt Lake’ AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn HAVING COUNT(*) > = k; Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  15. Motivation State-of-the-Art: Table Function π TR . ssn , TS . ssn Customer tables R , S : ➥ join customers with σ olap ≥ k ◦ true cardinality similar names γ TR . ssn , TS . ssn ;COUNT( ∗ ) → olap ➥ only customers from ’SLC’ and ’Salt Lake’ ⋉ TR . token = TS . token ⋊ SELECT TR.ssn, TS.ssn 10 2’000 10 1’000 FROM tblfunc(’R’, ’name’) TR, σ city = ’SLC’ σ county = ’Salt Lake’ tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ 1000 100’000 1000 200’000 AND TS.county=’Salt Lake’ tblfunc / TS tblfunc/TR AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn 20’000 10’000 HAVING COUNT(*) > = k; R ( ssn , name , city ) S ( ssn , name , county ) Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  16. Motivation State-of-the-Art: Table Function π TR . ssn , TS . ssn Customer tables R , S : ➥ join customers with σ olap ≥ k ◦ true cardinality similar names γ TR . ssn , TS . ssn ;COUNT( ∗ ) → olap ➥ only customers from ’SLC’ and ’Salt Lake’ ⋉ TR . token = TS . token ⋊ SELECT TR.ssn, TS.ssn 10 2’000 10 1’000 FROM tblfunc(’R’, ’name’) TR, σ city = ’SLC’ [1%] σ county = ’Salt Lake’ [5%] tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ 1000 100’000 1000 200’000 AND TS.county=’Salt Lake’ tblfunc / TS tblfunc/TR AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn 20’000 10’000 HAVING COUNT(*) > = k; R ( ssn , name , city ) S ( ssn , name , county ) black box: selection and projection not pushed down ➥ tokens computed for all customers ➥ too many attributes replicated (name, city, county) Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  17. Motivation State-of-the-Art: Table Function π TR . ssn , TS . ssn Customer tables R , S : ➥ join customers with σ olap ≥ k ◦ true cardinality similar names ◦ estimated card. γ TR . ssn , TS . ssn ;COUNT( ∗ ) → olap ➥ only customers from ’SLC’ and ’Salt Lake’ ⋊ ⋉ TR . token = TS . token SELECT TR.ssn, TS.ssn 10 2’000 10 1’000 FROM tblfunc(’R’, ’name’) TR, σ city = ’SLC’ [1%] σ county = ’Salt Lake’ [5%] tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ 1000 100’000 1000 200’000 AND TS.county=’Salt Lake’ tblfunc / TS tblfunc/TR AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn 20’000 10’000 HAVING COUNT(*) > = k; R ( ssn , name , city ) S ( ssn , name , county ) black box: selection and projection not pushed down ➥ tokens computed for all customers ➥ too many attributes replicated (name, city, county) unknown cardinality of table function (often assumed a constant) Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

  18. Motivation State-of-the-Art: Table Function π TR . ssn , TS . ssn Customer tables R , S : ➥ join customers with σ olap ≥ k ◦ true cardinality similar names ◦ estimated card. γ TR . ssn , TS . ssn ;COUNT( ∗ ) → olap ➥ only customers from ’SLC’ and ’Salt Lake’ ⋉ TR . token = TS . token ⋊ SELECT TR.ssn, TS.ssn 10 2’000 10 1’000 FROM tblfunc(’R’, ’name’) TR, σ city = ’SLC’ [1%] σ county = ’Salt Lake’ [5%] tblfunc(’S’, ’name’) TS WHERE TR.city = ’SLC’ 1000 100’000 1000 200’000 AND TS.county=’Salt Lake’ tblfunc / TS tblfunc/TR AND TR.token = TS.token GROUP BY TR.ssn, TS.ssn 20’000 10’000 HAVING COUNT(*) > = k; R ( ssn , name , city ) S ( ssn , name , county ) Problem: poor query plans with table functions. Nikolaus Augsten (Salzburg, Austria) On-the-Fly Token Sim. Join in RDBs SIGMOD 14 – Snowbird, UT 7 / 21

Recommend


More recommend