Scaling out the Discovery of Inclusion Dependencies
BTW 2015, Hamburg, Germany
Sebastian Kruse, Thorsten Papenbrock, Felix Naumann
Research Assistant, Hasso Plattner Institute, Potsdam, Germany
Inclusion Dependencies
Examples

Customers
ID  Name             Address
1   Tanja Jager      Marseiller Str. 12
2   Sandra Möller    Flughafenstr. 63
3   Dennis Eberhart  Sonnenallee 19
4   Barbara Pabst    Ziegelstr. 76
5   Thorsten Mauer   Güntzelstr. 90

Orders
Customer  Item      Quantity
3         CK-242-1  1
3         DF-098-7  1
3         KE-883-6  1
1         LM-437-2  2
5         PE-383-5  1

Inclusion dependencies: Customer ⊆ ID, Quantity ⊆ ID

Scaling out the Discovery of INDs, Sebastian Kruse, March 5, 2015
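The two inclusion dependencies on this slide can be checked directly from the definition: A ⊆ B holds when every value of column A also occurs in column B. The following is a minimal sketch of that check on the example tables; the helper names `column` and `holds` are illustrative and not part of the talk.

```python
# Example tables from the slide, stored as lists of tuples.
customers = [
    (1, "Tanja Jager", "Marseiller Str. 12"),
    (2, "Sandra Möller", "Flughafenstr. 63"),
    (3, "Dennis Eberhart", "Sonnenallee 19"),
    (4, "Barbara Pabst", "Ziegelstr. 76"),
    (5, "Thorsten Mauer", "Güntzelstr. 90"),
]
orders = [
    (3, "CK-242-1", 1),
    (3, "DF-098-7", 1),
    (3, "KE-883-6", 1),
    (1, "LM-437-2", 2),
    (5, "PE-383-5", 1),
]


def column(rows, index):
    """Return the set of distinct values in one column (hypothetical helper)."""
    return {row[index] for row in rows}


def holds(dependent, referenced):
    """A ⊆ B holds if every dependent value also occurs in the referenced column."""
    return dependent <= referenced


customer_ids = column(customers, 0)
order_customers = column(orders, 0)
order_quantities = column(orders, 2)

customer_in_id = holds(order_customers, customer_ids)   # Customer ⊆ ID
quantity_in_id = holds(order_quantities, customer_ids)  # Quantity ⊆ ID
id_in_customer = holds(customer_ids, order_customers)   # ID ⊆ Customer
```

Checking every pair of columns this way needs a number of comparisons that is quadratic in the number of columns, which is what the algorithms on the following slides try to avoid.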
Inclusion Dependencies
Examples
[Figure: database schema diagram, image not reproduced. Source: http://geneontology.org/sites/default/files/public/diag-godb-er.jpg]
Inclusion Dependencies
Example
[Figure: database schema diagram, image not reproduced. Source: http://www.ibm.com/developerworks/data/library/techarticle/dm-1109proteindatadb2purexml/pdb_scheme_large.jpg]
Scaling Out the Discovery of Inclusion Dependencies
Agenda
1. Discovering Inclusion Dependencies
2. Related Work
3. SINDY: A distributed discovery algorithm
4. Evaluation
5. Conclusions
Related Work
MIND
■ Fabien De Marchi, Stéphan Lopes, and Jean-Marc Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53–73, 2009.
Running example: the Customers (ID, Name, Address) and Orders (Customer, Item, Quantity) tables from the "Inclusion Dependencies Examples" slide.
Related Work
MIND
Query: Quantity ⊆ ?

Value               Attributes
1                   ID, Customer, Quantity
Tanja Jager         Name
Marseiller Str. 12  Address
2                   ID, Quantity
Sandra Möller       Name
Flughafenstr. 63    Address
…                   …

Intersection of the attribute sets of all Quantity values (1 and 2): ID, Quantity
Result: Quantity ⊆ ID (and, trivially, Quantity ⊆ Quantity)
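The lookup shown on this slide can be sketched as an inverted index from values to the attributes they occur in, followed by an intersection over the values of the dependent attribute. This is a simplified in-memory illustration of what the slide depicts, not the MIND implementation itself; the function names are assumptions.

```python
from collections import defaultdict

# Columns of the running example (Customers and Orders side by side).
columns = {
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Tanja Jager", "Sandra Möller", "Dennis Eberhart",
             "Barbara Pabst", "Thorsten Mauer"],
    "Address": ["Marseiller Str. 12", "Flughafenstr. 63", "Sonnenallee 19",
                "Ziegelstr. 76", "Güntzelstr. 90"],
    "Customer": [3, 3, 3, 1, 5],
    "Item": ["CK-242-1", "DF-098-7", "KE-883-6", "LM-437-2", "PE-383-5"],
    "Quantity": [1, 1, 1, 2, 1],
}


def build_inverted_index(columns):
    """Map every value to the set of attributes in which it occurs."""
    index = defaultdict(set)
    for attribute, values in columns.items():
        for value in values:
            index[value].add(attribute)
    return index


def referenced_attributes(dependent, columns, index):
    """Intersect the attribute sets of all values of the dependent attribute."""
    result = set(columns)
    for value in set(columns[dependent]):
        result &= index[value]
    return result - {dependent}  # drop the trivial IND A ⊆ A


index = build_inverted_index(columns)
quantity_refs = referenced_attributes("Quantity", columns, index)
customer_refs = referenced_attributes("Customer", columns, index)
```

For `Quantity`, the attribute sets of the values 1 and 2 are intersected, which leaves `ID` after the trivial self-inclusion is removed.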
Related Work
SPIDER
■ Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, 2006.
Running example: the Customers (ID, Name, Address) and Orders (Customer, Item, Quantity) tables from the "Inclusion Dependencies Examples" slide.
Related Work
SPIDER
Sorted distinct values per attribute:

ID  Name             Item      Customer  Quantity
1   Barbara Pabst    CK-242-1  1         1
2   Dennis Eberhart  DF-098-7  3         2
3   Sandra Möller    KE-883-6  5
4   Tanja Jager      LM-437-2
5   Thorsten Mauer   PE-383-5
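The sorted value lists above can be processed with a single simultaneous merge pass. The sketch below is a simplified in-memory illustration of that sort-merge idea, not the SPIDER implementation; it treats all values as strings so that they are comparable, and the function name `sort_merge_inds` is an assumption.

```python
import heapq
from itertools import groupby

# Columns of the running example, with all values as strings.
columns = {
    "ID": ["1", "2", "3", "4", "5"],
    "Name": ["Tanja Jager", "Sandra Möller", "Dennis Eberhart",
             "Barbara Pabst", "Thorsten Mauer"],
    "Item": ["CK-242-1", "DF-098-7", "KE-883-6", "LM-437-2", "PE-383-5"],
    "Customer": ["3", "3", "3", "1", "5"],
    "Quantity": ["1", "1", "1", "2", "1"],
}


def sort_merge_inds(columns):
    """Merge the sorted distinct values of all attributes in one pass."""
    # One sorted stream of (value, attribute) pairs per attribute.
    streams = [
        [(value, attribute) for value in sorted(set(values))]
        for attribute, values in columns.items()
    ]
    # Initially every attribute may be included in every other attribute.
    referenced = {a: set(columns) - {a} for a in columns}
    merged = heapq.merge(*streams)
    for _, group in groupby(merged, key=lambda cell: cell[0]):
        # All attributes that contain the current value.
        attributes = {attribute for _, attribute in group}
        for attribute in attributes:
            # Keep only referenced candidates that also contain this value.
            referenced[attribute] &= attributes
    return referenced


spider_result = sort_merge_inds(columns)
```

On the example data, only `Customer` and `Quantity` keep a referenced attribute, namely `ID`.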
Related Work
Common procedure
Input Data → Full Outer Join → Inclusion Dependencies
Step 1: Calculate the full outer join of all attributes
Step 2: Extract inclusion dependencies

Input data (abbreviated values)
ID  Name  Addr          Cus  Item  Qty
1   T.J.  M.12          3    CK    1
2   S.M.  F.63          3    DF    1
3   D.E.  S.19          3    KE    1
4   B.P.  Z.76          1    LM    2
5   T.M.  G.90          5    PE    1

Full outer join (one row per distinct value)
ID  Name  Addr  Cus  Item  Qty
1               1          1
2                          2
3               3
4
5               5
    T.J.
    S.M.
…   …     …     …    …     …

Inclusion dependencies: Quantity ⊆ ID, Customer ⊆ ID
SINDY: A distributed discovery algorithm
Distributed setting
The rows of both tables are spread over several data fragments:

Fragment of Customers      Fragment of Customers
ID  Name  Addr             ID  Name  Addr
1   T.J.  M.12             4   B.P.  Z.76
2   S.M.  F.63             5   T.M.  G.90
3   D.E.  S.19

Fragment of Orders         Fragment of Orders
Cus  Item  Qty             Cus  Item  Qty
3    CK    1               1    LM    2
3    DF    1               5    PE    1
3    KE    1
SINDY: A distributed discovery algorithm
Perform full outer join
■ Every row is split into cells of a value and its attribute, e.g. (1, ID), (T.J., Name), (M.12, Addr), (3, Cus), (CK, Item), (1, Qty)
■ Cells with the same value are merged, so that each value carries the set of attributes it occurs in:

Value  Attributes
1      ID, Cus, Qty
2      ID, Qty
3      ID, Cus
5      Cus, ID
T.J.   Name
M.12   Addr
CK     Item
…      …

■ Afterwards only the attribute sets are kept: ID, Cus, Qty; ID, Qty; Cus, ID; ID; Name; Addr; Item
SINDY: A distributed discovery algorithm
Distributed join product
The attribute sets of the join product remain spread over the workers:
■ ID, Cus, Qty
■ ID, Qty
■ ID, Cus
■ ID
■ Name
■ Addr
■ Item
SINDY: A distributed discovery algorithm
Evaluate full outer join
Every attribute set yields one IND candidate per contained attribute: the attribute itself as dependent attribute and the remaining attributes as possible referenced attributes.

Attribute set  Dependent  Referenced candidates
ID, Cus, Qty   ID         Cus, Qty
               Cus        ID, Qty
               Qty        ID, Cus
ID, Qty        ID         Qty
               Qty        ID
ID, Cus        ID         Cus
               Cus        ID
ID             ID         Ø
Name           Name       Ø
Addr           Addr       Ø
Item           Item       Ø

Candidates with the same dependent attribute are then aggregated by intersecting their referenced candidates.
SINDY: A distributed discovery algorithm
Distributed inclusion dependencies

Dependent  Referenced
Addr       Ø
ID         Ø
Qty        ID
Cus        ID
Item       Ø
Name       Ø

Result: Quantity ⊆ ID, Customer ⊆ ID
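The walkthrough on the preceding slides can be condensed into a small single-process sketch. It imitates the distributed steps with plain Python functions and is not the Flink implementation used in the talk; the split of the rows into two partitions and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical partitioning of the example rows onto two workers.
partitions = [
    [
        {"ID": 1, "Name": "T.J.", "Addr": "M.12"},
        {"ID": 2, "Name": "S.M.", "Addr": "F.63"},
        {"ID": 3, "Name": "D.E.", "Addr": "S.19"},
        {"Cus": 3, "Item": "CK", "Qty": 1},
        {"Cus": 3, "Item": "DF", "Qty": 1},
        {"Cus": 3, "Item": "KE", "Qty": 1},
    ],
    [
        {"ID": 4, "Name": "B.P.", "Addr": "Z.76"},
        {"ID": 5, "Name": "T.M.", "Addr": "G.90"},
        {"Cus": 1, "Item": "LM", "Qty": 2},
        {"Cus": 5, "Item": "PE", "Qty": 1},
    ],
]


def local_cells(rows):
    """Split rows into (value, attribute) cells and merge them per worker."""
    cells = defaultdict(set)
    for row in rows:
        for attribute, value in row.items():
            cells[value].add(attribute)
    return cells


def global_join(local_results):
    """Merge the per-worker cells by value (the full outer join)."""
    joined = defaultdict(set)
    for cells in local_results:
        for value, attributes in cells.items():
            joined[value] |= attributes
    return joined


def ind_candidates(attribute_sets):
    """Emit (dependent, referenced candidates) for every attribute set."""
    for attributes in attribute_sets:
        for dependent in attributes:
            yield dependent, attributes - {dependent}


def intersect_candidates(candidates):
    """Aggregate candidates per dependent attribute by intersection."""
    referenced = {}
    for dependent, refs in candidates:
        if dependent in referenced:
            referenced[dependent] &= refs
        else:
            referenced[dependent] = set(refs)
    return referenced


local_results = [local_cells(rows) for rows in partitions]
joined = global_join(local_results)
referenced = intersect_candidates(ind_candidates(joined.values()))
inds = sorted((dep, ref) for dep, refs in referenced.items() for ref in refs)
```

With this partitioning, the second worker locally produces the merged cells 1 → Cus, Qty and 5 → Cus, ID, and the final list `inds` contains the two dependencies Cus ⊆ ID and Qty ⊆ ID.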
SINDY
Variants
■ Inclusion dependencies on combinations of columns (aka n-ary INDs)
□ Adaptation: create cells for combinations of values
□ Powerful in combination with an apriori-like procedure
■ Partial inclusion dependencies
□ Adaptation: aggregate IND candidates with multiset union instead of intersection
□ Compare the resulting counts with the number of distinct values of the dependent column
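The partial-IND variant can be sketched by counting candidates instead of intersecting them. The example below is a minimal single-process illustration under stated assumptions: the data (an order that refers to a non-existent customer 7), the `threshold` parameter, and the function name are hypothetical and not taken from the talk.

```python
from collections import Counter, defaultdict

# Hypothetical data: customer 7 is referenced by an order but does not exist.
columns = {
    "ID": [1, 2, 3, 4, 5],
    "Cus": [3, 3, 3, 1, 5, 7],
}


def partial_inds(columns, threshold):
    """Return candidate INDs whose share of covered distinct values reaches the threshold."""
    # Group attributes by value, as in the full outer join.
    attribute_sets = defaultdict(set)
    for attribute, values in columns.items():
        for value in values:
            attribute_sets[value].add(attribute)

    overlap = Counter()   # multiset union of all IND candidates
    distinct = Counter()  # number of distinct values per dependent attribute
    for attributes in attribute_sets.values():
        for dependent in attributes:
            distinct[dependent] += 1
            for referenced in attributes - {dependent}:
                overlap[(dependent, referenced)] += 1

    result = {}
    for (dependent, referenced), count in overlap.items():
        coverage = count / distinct[dependent]
        if coverage >= threshold:
            result[(dependent, referenced)] = coverage
    return result


partial = partial_inds(columns, threshold=0.7)
```

Here three of the four distinct `Cus` values occur in `ID`, so Cus ⊆ ID is reported with a coverage of 0.75, while ID ⊆ Cus (coverage 0.6) falls below the assumed threshold.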
Evaluation
Experimental setup
■ Cluster setup
□ 1 master node (Intel Xeon @ 2x2.67 GHz, 8 GiB RAM)
□ 10 worker nodes (Intel Core 2 Duo @ 2x2.6 GHz, 8 GiB RAM)
□ Apache HDFS 2.2, Apache Flink 0.6.2
■ Single node (for SPIDER)
□ Intel Xeon @ 8x2 GHz, 128 GiB RAM, RAID-1
■ Datasets
□ Relational datasets from different domains
□ 16 KB to 44.9 GB
Evaluation
Performance comparison with SPIDER
[Figure: runtime in seconds of SINDY and SPIDER (logarithmic axis, 0.1 to 10000) and the resulting speed-up (axis 0 to 10). The plotted values are not recoverable.]
Evaluation
Scale-Out Behavior
[Figure: runtime in seconds (logarithmic axis, 8 to 2048) over the number of logical workers/physical workers (1/1, 2/2, 3/3, 4/4, 5/5, 6/6, 7/7, 8/8, 9/9, 10/10, 20/10). The plotted values are not recoverable.]
Datasets in the legend: MB-core (5.8 GB), TPC-H (1.4 GB), CATH (907 MB), LOD+ (825 MB), BIOSQLSP (567 MB), WIKIPEDIA (539 MB), CATH (907 MB), CENSUS (111 MB), SCOP (15 MB), COMA (16 KB)