Scaling out the Discovery of Inclusion Dependencies, BTW 2015

Scaling out the Discovery of Inclusion Dependencies
BTW 2015, Hamburg, Germany
Sebastian Kruse, Thorsten Papenbrock, Felix Naumann
Research Assistant, Hasso Plattner Institute, Potsdam, Germany

Inclusion Dependencies
Examples

Customers
ID  Name             Address
1   Tanja Jager      Marseiller Str. 12
2   Sandra Möller    Flughafenstr. 63
3   Dennis Eberhart  Sonnenallee 19
4   Barbara Pabst    Ziegelstr. 76
5   Thorsten Mauer   Güntzelstr. 90

Orders
Customer  Item      Quantity
3         CK-242-1  1
3         DF-098-7  1
3         KE-883-6  1
1         LM-437-2  2
5         PE-383-5  1

Inclusion dependencies: Customer ⊆ ID, Quantity ⊆ ID

Scaling out the Discovery of INDs, Sebastian Kruse, March 5, 2015
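The two inclusion dependencies in this example can be checked directly with set containment. The following is a minimal sketch, assuming the relevant columns are held in memory as Python lists; the dictionary layout and the helper name `holds` are illustrative and not part of the slides.

```python
# Hypothetical in-memory encoding of the example columns from the slide.
customers = {"ID": [1, 2, 3, 4, 5]}
orders = {
    "Customer": [3, 3, 3, 1, 5],
    "Quantity": [1, 1, 1, 2, 1],
}


def holds(dependent, referenced):
    """Return True if every value of `dependent` also occurs in `referenced`."""
    return set(dependent) <= set(referenced)


customer_in_id = holds(orders["Customer"], customers["ID"])  # Customer ⊆ ID
quantity_in_id = holds(orders["Quantity"], customers["ID"])  # Quantity ⊆ ID
id_in_customer = holds(customers["ID"], orders["Customer"])  # ID ⊆ Customer

print(customer_in_id, quantity_in_id, id_in_customer)  # True True False
```

Checking every pair of columns this way needs one comparison per ordered pair, which is the cost the discovery algorithms on the following slides try to avoid.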

Inclusion Dependencies
Examples
[Figure: http://geneontology.org/sites/default/files/public/diag-godb-er.jpg]

Inclusion Dependencies
Example
[Figure: http://www.ibm.com/developerworks/data/library/techarticle/dm-1109proteindatadb2purexml/pdb_scheme_large.jpg]

Scaling Out the Discovery of Inclusion Dependencies
Agenda
1. Discovering Inclusion Dependencies
2. Related Work
3. SINDY: A distributed discovery algorithm
4. Evaluation
5. Conclusions

Related Work
MIND
■ Fabien De Marchi, Stéphan Lopes, and Jean-Marc Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53–73, 2009.

Running example: the Customers (ID, Name, Address) and Orders (Customer, Item, Quantity) tables from the Examples slide.

Related Work
MIND

Inverted index from each value to the attributes that contain it:

Value               Attributes
1                   ID, Customer, Quantity
Tanja Jager         Name
Marseiller Str. 12  Address
2                   ID, Quantity
Sandra Möller       Name
Flughafenstr. 63    Address
…                   …

Quantity ⊆ ?
Intersection of the attribute sets of all Quantity values (1 and 2): ID, Quantity
Result: Quantity ⊆ Quantity (trivial) and Quantity ⊆ ID
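The inverted-index idea on this slide can be sketched in a few lines. This is only an illustration of the value-to-attributes index and the per-attribute intersection shown above, not the MIND implementation; the function name `unary_inds` and the string encoding of the columns are assumptions.

```python
from collections import defaultdict

# Example columns from the slides, encoded as strings (an assumption).
columns = {
    "ID": ["1", "2", "3", "4", "5"],
    "Name": ["Tanja Jager", "Sandra Möller", "Dennis Eberhart",
             "Barbara Pabst", "Thorsten Mauer"],
    "Address": ["Marseiller Str. 12", "Flughafenstr. 63", "Sonnenallee 19",
                "Ziegelstr. 76", "Güntzelstr. 90"],
    "Customer": ["3", "3", "3", "1", "5"],
    "Item": ["CK-242-1", "DF-098-7", "KE-883-6", "LM-437-2", "PE-383-5"],
    "Quantity": ["1", "1", "1", "2", "1"],
}


def unary_inds(columns):
    """Return all non-trivial unary INDs as sorted (dependent, referenced) pairs."""
    # Inverted index: value -> set of attributes in which the value occurs.
    index = defaultdict(set)
    for attr, values in columns.items():
        for value in values:
            index[value].add(attr)

    # For every attribute, intersect the attribute sets of all its values.
    referenced = {}
    for attrs in index.values():
        for attr in attrs:
            if attr in referenced:
                referenced[attr] &= attrs
            else:
                referenced[attr] = set(attrs)

    return sorted((dep, ref) for dep, refs in referenced.items()
                  for ref in refs if ref != dep)


inds = unary_inds(columns)
print(inds)  # [('Customer', 'ID'), ('Quantity', 'ID')]
```

For Quantity, the values 1 and 2 contribute the sets {ID, Customer, Quantity} and {ID, Quantity}, whose intersection {ID, Quantity} yields Quantity ⊆ ID once the trivial self-inclusion is dropped.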

Related Work
SPIDER
■ Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, 2006.

Running example: the Customers (ID, Name, Address) and Orders (Customer, Item, Quantity) tables from the Examples slide.

Related Work
SPIDER
■ Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, 2006.

Sorted distinct values per attribute:

ID  Name             Item      Customer  Quantity
1   Barbara Pabst    CK-242-1  1         1
2   Dennis Eberhart  DF-098-7  3         2
3   Sandra Möller    KE-883-6  5
4   Tanja Jager      LM-437-2
5   Thorsten Mauer   PE-383-5
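The sorted value lists on this slide suggest a sort-merge style of processing. The sketch below is a simplified illustration under that assumption: it sorts the distinct values of each attribute and walks all lists in a single merge pass, pruning candidates as it goes. It omits the cursor management and early termination of the published SPIDER algorithm, and the name `sort_merge_inds` is hypothetical.

```python
import heapq
from itertools import groupby

# Abbreviated example columns, encoded as strings (an assumption).
columns = {
    "ID": ["1", "2", "3", "4", "5"],
    "Name": ["T.J.", "S.M.", "D.E.", "B.P.", "T.M."],
    "Item": ["CK", "DF", "KE", "LM", "PE"],
    "Customer": ["3", "3", "3", "1", "5"],
    "Quantity": ["1", "1", "1", "2", "1"],
}


def sort_merge_inds(columns):
    """Return non-trivial unary INDs found by one merge pass over sorted columns."""
    # Step 1: sort the distinct values of every attribute.
    streams = [[(value, attr) for value in sorted(set(values))]
               for attr, values in columns.items()]

    # Initially every other attribute is a candidate referenced attribute.
    refs = {attr: set(columns) - {attr} for attr in columns}

    # Step 2: merge all sorted streams and group equal values together.
    merged = heapq.merge(*streams)
    for _, group in groupby(merged, key=lambda cell: cell[0]):
        attrs = {attr for _, attr in group}
        # An attribute containing this value can only be included in
        # attributes that contain the value as well.
        for attr in attrs:
            refs[attr] &= attrs

    return sorted((dep, ref) for dep, rs in refs.items() for ref in rs)


result = sort_merge_inds(columns)
print(result)  # [('Customer', 'ID'), ('Quantity', 'ID')]
```

Because every list is read once in sorted order, the merge never has to hold a complete column in memory, which is the main appeal of this style on a single node.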

Related Work
Common procedure

Input Data → Full Outer Join → Inclusion Dependencies
Step 1: Calculate the full outer join of all attributes
Step 2: Extract inclusion dependencies

Input data (abbreviated):

ID  Name  Addr
1   T.J.  M.12
2   S.M.  F.63
3   D.E.  S.19
4   B.P.  Z.76
5   T.M.  G.90

Cus  Item  Qty
3    CK    1
3    DF    1
3    KE    1
1    LM    2
5    PE    1

Full outer join of all attributes on their values:

ID  Name  Addr  Cus  Item  Qty
1               1          1
2                          2
3               3
4
5               5
    T.J.
    S.M.
…   …     …     …    …     …

Inclusion dependencies: Quantity ⊆ ID, Customer ⊆ ID

SINDY: A distributed discovery algorithm
Distributed setting

Both tables are split into fragments:

ID  Name  Addr
1   T.J.  M.12
2   S.M.  F.63
3   D.E.  S.19

ID  Name  Addr
4   B.P.  Z.76
5   T.M.  G.90

Cus  Item  Qty
3    CK    1
3    DF    1
3    KE    1

Cus  Item  Qty
1    LM    2
5    PE    1

SINDY: A distributed discovery algorithm
Perform full outer join (three animation frames over the four table fragments)

Frame 1: every tuple is split into cells that pair a value with its attribute, e.g. 1 → ID, T.J. → Name, M.12 → Addr for the first Customers tuple and 3 → Cus, CK → Item, 1 → Qty for the first Orders tuple.

Frame 2: cells with equal values are merged. Groups visible on the slide: 1 → ID, Cus, Qty; 2 → ID, Qty; 3 → ID, Cus; 1 → Cus, Qty; 5 → Cus, ID.

Frame 3: the values are dropped and only the attribute sets remain, e.g. ID, Cus, Qty; ID, Qty; Cus, ID; ID; Name; Addr; Item.

SINDY: A distributed discovery algorithm
Distributed join product

Attribute sets of the join, spread over the workers:
ID, Cus, Qty
ID, Qty
ID, Cus (twice)
ID
Name (three times)
Addr (twice)
Item (three times)

SINDY: A distributed discovery algorithm
Evaluate full outer join

Each attribute set yields one IND candidate per attribute; the remaining attributes of the set are the possible referenced attributes:

Attribute set  IND candidates (dependent ⊆ possible referenced attributes)
ID, Cus, Qty   ID ⊆ Cus, Qty    Cus ⊆ ID, Qty    Qty ⊆ ID, Cus
ID, Cus        ID ⊆ Cus         Cus ⊆ ID
ID, Qty        ID ⊆ Qty         Qty ⊆ ID
ID             ID ⊆ Ø
Name           Name ⊆ Ø
Addr           Addr ⊆ Ø
Item           Item ⊆ Ø

In the second frame, candidates with the same dependent attribute are intersected; for example, the ID candidates collapse to ID ⊆ Ø.

SINDY: A distributed discovery algorithm
Distributed inclusion dependencies

Dependent attribute  Referenced attributes
Addr                 Ø
ID                   Ø
Qty                  ID
Cus                  ID
Item                 Ø
Name                 Ø

Result: Quantity ⊆ ID, Customer ⊆ ID
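The steps on the SINDY slides (split tuples into cells, group cells by value, emit IND candidates per attribute set, intersect candidates per dependent attribute) can be traced on a single machine. The sketch below simulates the fragments with plain Python lists; it is not the Flink-based implementation used in the evaluation, and the fragment layout and function names are assumptions made for illustration.

```python
from collections import defaultdict

# Four table fragments as in the "Distributed setting" slide (layout assumed).
fragments = [
    [{"ID": "1", "Name": "T.J.", "Addr": "M.12"},
     {"ID": "2", "Name": "S.M.", "Addr": "F.63"},
     {"ID": "3", "Name": "D.E.", "Addr": "S.19"}],
    [{"ID": "4", "Name": "B.P.", "Addr": "Z.76"},
     {"ID": "5", "Name": "T.M.", "Addr": "G.90"}],
    [{"Cus": "3", "Item": "CK", "Qty": "1"},
     {"Cus": "3", "Item": "DF", "Qty": "1"},
     {"Cus": "3", "Item": "KE", "Qty": "1"}],
    [{"Cus": "1", "Item": "LM", "Qty": "2"},
     {"Cus": "5", "Item": "PE", "Qty": "1"}],
]


def attribute_sets(fragments):
    """Split tuples into (value, attribute) cells and group them by value."""
    merged = defaultdict(set)
    for fragment in fragments:
        # Local pre-aggregation within one fragment.
        local = defaultdict(set)
        for row in fragment:
            for attr, value in row.items():
                local[value].add(attr)
        # Stand-in for the shuffle: merge local groups by value.
        for value, attrs in local.items():
            merged[value] |= attrs
    return list(merged.values())


def discover(fragments):
    """Map each dependent attribute to the set of attributes that include it."""
    referenced = {}
    for attrs in attribute_sets(fragments):
        for dep in attrs:
            candidate = attrs - {dep}  # IND candidate: dep ⊆ each of these
            if dep in referenced:
                referenced[dep] &= candidate
            else:
                referenced[dep] = candidate
    return referenced


inds = discover(fragments)
print(inds["Qty"], inds["Cus"])  # {'ID'} {'ID'}
```

Both the grouping by value and the intersection per dependent attribute are keyed aggregations, which is what makes the approach a natural fit for a distributed dataflow engine.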

SINDY Variants
■ Inclusion dependencies on combinations of columns (aka n-ary INDs)
  □ Adaptation: create cells for combinations of values
  □ Powerful in combination with an apriori-like procedure
■ Partial inclusion dependencies
  □ Adaptation: aggregate IND candidates with multiset union instead of intersection
  □ Compare the resulting counts with the number of distinct values of the dependent column
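The partial-IND variant can be illustrated with counters: instead of intersecting candidates, count for each attribute pair how many distinct values they share, then divide by the number of distinct values of the dependent column. This is a minimal single-machine sketch of that idea; the function name `partial_inds` and the `threshold` parameter are assumptions, not taken from the slides.

```python
from collections import Counter, defaultdict

# Key columns of the running example, encoded as strings (an assumption).
columns = {
    "ID": ["1", "2", "3", "4", "5"],
    "Customer": ["3", "3", "3", "1", "5"],
    "Quantity": ["1", "1", "1", "2", "1"],
}


def partial_inds(columns, threshold):
    """Return {(dependent, referenced): coverage} for coverage >= threshold."""
    # Group attributes by value, as in the exact algorithm.
    index = defaultdict(set)
    for attr, values in columns.items():
        for value in values:
            index[value].add(attr)

    shared = Counter()    # (dep, ref) -> number of distinct values in both
    distinct = Counter()  # dep -> number of distinct values
    for attrs in index.values():
        for dep in attrs:
            distinct[dep] += 1
            # Multiset union: every co-occurring attribute gets one vote.
            for ref in attrs - {dep}:
                shared[(dep, ref)] += 1

    coverage = {pair: count / distinct[pair[0]] for pair, count in shared.items()}
    return {pair: c for pair, c in coverage.items() if c >= threshold}


result = partial_inds(columns, threshold=0.5)
for pair in sorted(result):
    print(pair, result[pair])
```

With a threshold of 0.5 the exact dependencies Customer ⊆ ID and Quantity ⊆ ID appear with coverage 1.0, alongside the partial ones ID ⊆ Customer (3 of 5 distinct values) and Quantity ⊆ Customer (1 of 2).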

Evaluation
Experimental setup
■ Cluster setup
  □ 1 master node (Intel Xeon @ 2x2.67 GHz, 8 GiB RAM)
  □ 10 worker nodes (Intel Core 2 Duo @ 2x2.6 GHz, 8 GiB RAM)
  □ Apache HDFS 2.2, Apache Flink 0.6.2
■ Single node (for SPIDER)
  □ Intel Xeon @ 8x2 GHz, 128 GiB RAM, RAID-1
■ Datasets
  □ Relational datasets from different domains
  □ 16 KB to 44.9 GB

Evaluation
Performance comparison with SPIDER
[Figure: runtime in seconds (logarithmic axis, 0.1 to 10000) for SINDY and SPIDER, with the speed-up (0 to 10) on a second axis.]

Evaluation
Scale-Out Behavior
[Figure: runtime in seconds (logarithmic axis, 8 to 2048) against logical workers/physical workers (1/1 to 10/10, and 20/10). Datasets: MB-core (5.8 GB), TPC-H (1.4 GB), CATH (907 MB), LOD+ (825 MB), BIOSQLSP (567 MB), WIKIPEDIA (539 MB), CENSUS (111 MB), SCOP (15 MB), COMA (16 KB).]
