ApproxJoin: Approximate Distributed Joins
Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas, Ruichuan Chen, Christof Fetzer, Thorsten Strufe
10/2018

Motivation
• Join is a critical operation in big data analytics systems, but it is very expensive
• Goal: reduce the overhead of join operations using a sampling-based approach
[Figure: query plan tree with a projection (π) on top of a series of joins (⨝) over R1, R2, R3, and R4]

Motivation
[Figure: example join of R1 and R2 on attribute A]
  R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn)
  R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm)
  R1 ⨝ R2: (A1, B0, C1), (A1, B0, C2), …, (A1, B0, Cm), (A2, B1, C0), (A2, B2, C0), …, (A2, Bn, C0)

Motivation
[Figure: joining independently sampled inputs]
  R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn)
  R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm)
  Sample(R1): (A2, B2), (A2, B5), …, (A2, Bn-2)
  Sample(R2): (A1, C3), (A1, C4), …, (A1, Cm-1)
  Sample(R1) ⨝ Sample(R2) = NULL
• Sample(R1) ⨝ Sample(R2) != Sample(R1 ⨝ R2)
• Sampling over joins is challenging with respect to output quality

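The inequality on this slide can be reproduced with a small experiment. The sketch below is illustrative only and is not taken from the talk: the helper names `join` and `bernoulli_sample`, the 10% sampling fraction, and the input sizes are assumptions. With a sampling probability p applied to each input independently, an output tuple survives only if both of its source tuples survive, so the join of the samples keeps a fraction of about p² of the output instead of p, and in skewed inputs like the one on the slide it is often empty.

```python
import random

def join(r1, r2):
    """Equi-join two lists of (key, value) pairs on the key."""
    index = {}
    for key, c in r2:
        index.setdefault(key, []).append(c)
    return [(key, b, c) for key, b in r1 for c in index.get(key, [])]

def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`."""
    return [row for row in rows if rng.random() < fraction]

rng = random.Random(0)  # fixed seed so the run is repeatable (assumption)
n = m = 1000            # illustrative sizes, not from the slides

# Inputs shaped like the slide: R1 has one A1 row and n A2 rows,
# R2 has one A2 row and m A1 rows.
r1 = [("A1", "B0")] + [("A2", f"B{i}") for i in range(1, n + 1)]
r2 = [("A2", "C0")] + [("A1", f"C{j}") for j in range(1, m + 1)]

full = join(r1, r2)  # exact join: n + m output rows

# Sample before the join: the output depends on whether the two
# single rows (A1, B0) and (A2, C0) happen to survive sampling.
sample_then_join = join(bernoulli_sample(r1, 0.1, rng),
                        bernoulli_sample(r2, 0.1, rng))

# Sample after the join: a uniform sample of the exact output.
join_then_sample = bernoulli_sample(full, 0.1, rng)

print(len(full), len(sample_then_join), len(join_then_sample))
```

Sampling after the join gives the better sample but pays the full join cost first, which is the gap that sampling during the join is meant to close.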
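Motivation
[Figure: example join of R1 and R2 where both inputs also contain non-join items]
  R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn), plus non-join items (A3, D1), (A3, D2), …, (A3, Dk)
  R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm), plus non-join items (A4, E1), (A4, E2), …, (A4, El)
  R1 ⨝ R2: (A1, B0, C1), (A1, B0, C2), …, (A1, B0, Cm), (A2, B1, C0), (A2, B2, C0), …, (A2, Bn, C0)
• Non-join items cause unnecessary data shuffling across the cluster
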
State-of-the-art Systems
• AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of inputs (statistical info, indices)
• RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins
• SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries

State-of-the-art Systems: Limitations
• AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of inputs (statistical info, indices)
• RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins
  Limitation of the systems above: designed for a single-node system
• SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries
  Limitation: do not support sampling over joins

Outline
• Motivation
• Design
• Evaluation

ApproxJoin: System Overview
Query with a budget:
  SELECT SUM(R1.V + R2.V + … + Rn.V)
  FROM R1, R2, …, Rn
  WHERE R1.A = R2.A = … = Rn.A
  WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%
[Figure: input datasets R1, R2, …, Rn flow into ApproxJoin, which combines filtering (Bloom filters) with approximation (sampling over the distributed join) and returns the result 192.68 ± 0.05 (95% confidence)]
• Goals: achieve low latency and reduce shuffled data size

ApproxJoin: Core Idea
• Input datasets: R1, R2
• Build Bloom filters: BF(R1), BF(R2)
• Combine them: JoinBF = BF(R1) & BF(R2)
• Filter out overlap items: pass R1 and R2 through JoinBF
• Sampling
• Join result

ApproxJoin: Filtering
[Figure: example join of R1 and R2 where JoinBF is used to remove non-join items]
  R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn), plus non-join items (A3, D1), (A3, D2), …, (A3, Dk)
  R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm), plus non-join items (A4, E1), (A4, E2), …, (A4, El)
  R1 ⨝ R2: (A1, B0, C1), (A1, B0, C2), …, (A1, B0, Cm), (A2, B1, C0), (A2, B2, C0), …, (A2, Bn, C0)
• BF(R1) = {A1, A2, A3}
• BF(R2) = {A1, A2, A4}
• JoinBF = {A1, A2}

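The filtering step can be sketched with a toy Bloom filter. This is a minimal illustration under stated assumptions, not the ApproxJoin implementation: the `BloomFilter` class, its size (1024 bits), its three SHA-256-based hash functions, and the helper `build_filter` are all hypothetical. Two filters built with the same size and hash functions can be intersected with a bitwise AND, which matches the `BF(R1) & BF(R2)` notation on the slides.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit vector."""

    def __init__(self, num_bits=1024, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True for every added key; may also be True for others
        # (false positive), never False for an added key.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share size and hash functions")
        return BloomFilter(self.num_bits, self.num_hashes, self.bits & other.bits)

def build_filter(rows):
    """Build a Bloom filter over the join keys of (key, value) rows."""
    bf = BloomFilter()
    for key, _ in rows:
        bf.add(key)
    return bf

# Inputs following the slide: A3 occurs only in R1, A4 only in R2.
r1 = [("A1", "B0"), ("A2", "B1"), ("A2", "B2"), ("A3", "D1"), ("A3", "D2")]
r2 = [("A2", "C0"), ("A1", "C1"), ("A1", "C2"), ("A4", "E1"), ("A4", "E2")]

join_bf = build_filter(r1).intersect(build_filter(r2))
r1_filtered = [row for row in r1 if join_bf.might_contain(row[0])]
r2_filtered = [row for row in r2 if join_bf.might_contain(row[0])]
```

Rows with keys A1 and A2 always pass the filter. Rows with A3 or A4 are dropped unless they hit a false positive, so the filter can let a few non-join items through but never removes an item that participates in the join.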
ApproxJoin: Sampling
[Figure: computing Sample(R1 ⨝ R2) with CoGroup followed by stratified sampling]
  R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn)
  R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm)
  CoGroup by join key:
    A1: [B0] and [C1, C2, …, Cm]
    A2: [B1, B2, …, Bn] and [C0]
  Stratified sampling within each key:
    A1: (A1, B0, C1), (A1, B0, C3), …, (A1, B0, Cm-2)
    A2: (A2, B2, C0), (A2, B5, C0), …, (A2, Bn-3, C0)

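The CoGroup and stratified sampling steps can be sketched in plain Python. This is an illustration under assumptions rather than the system's code: the function names `cogroup` and `sample_during_join`, the rule of sampling a fixed fraction of each stratum, and sampling with replacement are choices made here for brevity. The point it shows is that each join key forms its own stratum, and output tuples are drawn from that stratum without materializing its full cross product.

```python
import random
from collections import defaultdict

def cogroup(r1, r2):
    """Group the values of both inputs by join key."""
    groups = defaultdict(lambda: ([], []))
    for key, b in r1:
        groups[key][0].append(b)
    for key, c in r2:
        groups[key][1].append(c)
    return groups

def sample_during_join(r1, r2, fraction, rng):
    """Draw a per-key sample of the join output.

    Each key is a stratum of size len(left) * len(right). Pairs are
    drawn with replacement, so the cross product is never built.
    """
    samples = {}
    for key, (left, right) in cogroup(r1, r2).items():
        stratum_size = len(left) * len(right)
        if stratum_size == 0:
            continue  # key appears in only one input: no join output
        k = max(1, round(fraction * stratum_size))
        samples[key] = [(key, rng.choice(left), rng.choice(right))
                        for _ in range(k)]
    return samples

rng = random.Random(42)  # fixed seed, illustrative only
n = m = 20
r1 = [("A1", "B0")] + [("A2", f"B{i}") for i in range(1, n + 1)]
r2 = [("A2", "C0")] + [("A1", f"C{j}") for j in range(1, m + 1)]

samples = sample_during_join(r1, r2, 0.25, rng)
for key, rows in samples.items():
    print(key, len(rows))
```

Unlike sampling the inputs before the join, every key that has join output is represented in the sample, because sampling happens inside each key's group.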
ApproxJoin: Implementation
Query with a budget:
  SELECT SUM(R1.V + R2.V + … + Rn.V)
  FROM R1, R2, …, Rn
  WHERE R1.A = R2.A = … = Rn.A
  WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%
[Figure: system components]
• Input datasets (HDFS)
• Cluster configuration
• Multi-way Bloom filter constructor
• Sample sizes estimator (cost function)
• Stratified sampling during join operator
• Aggregation engine (Apache Spark)
• Error-bound estimator
• Result: 192.68 ± 0.05 (95% confidence)

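A result of the form "estimate ± error (95% confidence)" can be produced from a stratified sample with a standard textbook estimator. The sketch below is an assumption about how such a bound could be computed, not a description of ApproxJoin's error-bound estimator: the function name `stratified_sum_estimate`, the normal approximation, and the with-replacement variance formula are choices made here. For stratum i with population size N_i, sample size n_i, sample mean ȳ_i, and sample variance s_i², the estimate of the SUM is Σ N_i·ȳ_i and its variance is approximated by Σ N_i²·s_i²/n_i.

```python
import math
from statistics import NormalDist, mean, variance

def stratified_sum_estimate(strata, confidence=0.95):
    """Estimate a SUM and its error margin from a stratified sample.

    strata: list of (population_size, sampled_values), one per join key.
    Returns (estimate, margin), so the result reads estimate ± margin.
    """
    # Two-sided z-score for the requested confidence (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    total = 0.0
    var = 0.0
    for population_size, values in strata:
        n = len(values)
        total += population_size * mean(values)
        if n > 1:  # sample variance needs at least two values
            var += population_size ** 2 * variance(values) / n
    return total, z * math.sqrt(var)

# Hypothetical sample: two join keys with 100 and 50 output tuples each.
strata = [(100, [1.0, 2.0, 3.0, 2.0]), (50, [10.0, 12.0, 11.0])]
estimate, margin = stratified_sum_estimate(strata)
print(f"{estimate:.2f} ± {margin:.2f} (95% confidence)")
```

A larger sample per stratum shrinks the margin, which is the lever a sample size estimator can use to meet an error budget such as `ERROR 0.05 CONFIDENCE 95%`.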
Outline
• Motivation
• Design
• Evaluation

Experimental Setup
• Evaluation questions (see the paper for more results!)
  • Latency vs overlap fraction
  • Shuffled data size vs overlap fraction
  • Latency vs sampling fraction
• Testbed
  • Cluster: 10 nodes
  • Datasets:
    • Synthetic: Poisson distribution datasets, TPC-H
    • CAIDA network traffic traces; Netflix Prize

Latency vs Overlap Fraction
[Figure: latency in minutes (y-axis, ticks from 0.1 to 1000) against overlap fraction in % (x-axis, 1 to 10) for ApproxJoin, Spark repartition join, and native Spark join; lower is better]
• ApproxJoin is ~2.6X and ~8X faster than Spark repartition join and native Spark join, respectively, at an overlap fraction of 1%

Shuffled Data Size vs Overlap Fraction
[Figure: shuffled data size in MB (y-axis, ticks from 0.1 to 1000) against overlap fraction in % (x-axis, 1 to 10) for ApproxJoin, Spark repartition join, and native Spark join; lower is better]
• ApproxJoin shuffles ~29X and ~26X less data than Spark repartition join and native Spark join, respectively, at an overlap fraction of 1%

Latency vs Sampling Fraction
[Figure: latency in minutes (y-axis, ticks from 0.1 to 1000) against sampling fraction in % (x-axis, 10 to 90) for ApproxJoin, Spark with sampling after join, and Spark with sampling before join; lower is better]
• ApproxJoin is 3X to 7X faster than Spark with sampling after join
• ApproxJoin is 1.01X to 1.3X slower than Spark with sampling before join

Accuracy vs Sampling Fraction
[Figure: accuracy loss in % (y-axis, ticks from 0.001 to 100) against sampling fraction in % (x-axis, 10 to 90) for ApproxJoin (sample during join), Spark with sampling after join, and Spark with sampling before join; lower is better]
• ApproxJoin achieves accuracy comparable to Spark with sampling after join
• ApproxJoin is ~42X more accurate than Spark with sampling before join

Outline
• Motivation
• Our work
• Conclusion

Conclusion
ApproxJoin: Approximate Distributed Joins
• Transparent: supports applications with minor code changes
• Practical: adaptive execution based on query budget
• Efficient: employs sketch and sampling techniques
Thank you!