
ApproxJoin: Approximate Distributed Joins



  1. ApproxJoin: Approximate Distributed Joins. Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas, Ruichuan Chen, Christof Fetzer, Thorsten Strufe. 10/2018

  2. Motivation
     • Join is a critical operation in big data analytics systems, but it is very expensive.
     • Goal: reduce the overhead of join operations using a sampling-based approach.
     [Figure: query plan with a projection (π) over joins (⨝) of R1, R2, R3, and R4]

  3. Motivation: R1 ⨝ R2
     R1 = {(A1, B0), (A2, B1), (A2, B2), …, (A2, Bn)}
     R2 = {(A2, C0), (A1, C1), (A1, C2), …, (A1, Cm)}
     R1 ⨝ R2 = {(A1, B0, C1), (A1, B0, C2), …, (A1, B0, Cm), (A2, B1, C0), (A2, B2, C0), …, (A2, Bn, C0)}

  4. Motivation: Sample(R1) ⨝ Sample(R2) != Sample(R1 ⨝ R2)
     R1 = {(A1, B0), (A2, B1), (A2, B2), …, (A2, Bn)}
     R2 = {(A2, C0), (A1, C1), (A1, C2), …, (A1, Cm)}
     Sample(R1) = {(A2, B2), (A2, B5), …, (A2, Bn-2)}
     Sample(R2) = {(A1, C3), (A1, C4), …, (A1, Cm-1)}
     Sample(R1) ⨝ Sample(R2) = NULL
     Sampling over joins is a challenging task with respect to output quality.
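
The inequality on this slide can be checked with a small experiment. The sketch below is an illustration only, not code from the talk: the helper names `join` and `bernoulli_sample`, the relation sizes, and the 10% fraction are all assumptions. With independent Bernoulli sampling at fraction f, a joined row survives sampling before the join only if both of its input rows are kept, so the expected output shrinks to f² of the join instead of f.

```python
import random

def join(r1, r2):
    """Inner equi-join of two lists of (key, value) rows on the key."""
    index = {}
    for key, c in r2:
        index.setdefault(key, []).append(c)
    return [(key, b, c) for key, b in r1 for c in index.get(key, [])]

def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`."""
    return [row for row in rows if rng.random() < fraction]

rng = random.Random(0)  # fixed seed so reruns draw the same samples

# Toy inputs shaped like the slide: A1 is rare in R1, A2 is rare in R2.
r1 = [("A1", "B0")] + [("A2", f"B{i}") for i in range(1, 101)]
r2 = [("A2", "C0")] + [("A1", f"C{i}") for i in range(1, 101)]

full = join(r1, r2)  # 1 * 100 rows for A1 plus 100 * 1 rows for A2
fraction = 0.1       # hypothetical sampling fraction

# Sample each input first, then join the samples.
sample_before = join(bernoulli_sample(r1, fraction, rng),
                     bernoulli_sample(r2, fraction, rng))
# Join first, then sample the join output.
sample_after = bernoulli_sample(full, fraction, rng)

print("rows in full join:        ", len(full))
print("rows in join of samples:  ", len(sample_before))
print("rows in sample of join:   ", len(sample_after))
```

In this toy data the join of the samples is non-empty only if the single row (A1, B0) or the single row (A2, C0) happens to be kept, which mirrors the NULL result on the slide.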

  5. Motivation: non-join items
     R1 = {(A1, B0), (A2, B1), (A2, B2), …, (A2, Bn), (A3, D1), (A3, D2), …, (A3, Dk)}
     R2 = {(A2, C0), (A1, C1), (A1, C2), …, (A1, Cm), (A4, E1), (A4, E2), …, (A4, El)}
     R1 ⨝ R2 = {(A1, B0, C1), (A1, B0, C2), …, (A1, B0, Cm), (A2, B1, C0), (A2, B2, C0), …, (A2, Bn, C0)}
     Non-join items: the A3 rows of R1 and the A4 rows of R2.
     These items cause unnecessary data shuffle through the cluster.

  6. State-of-the-art Systems
     • AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of inputs (statistical info, indices)
     • RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins
     • SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries

  7. State-of-the-art Systems
     • AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of inputs (statistical info, indices)
     • RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins
     • SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries
     Limitations: the first two groups are designed for single-node systems; the third group does not support sampling over joins.

  8. Outline • Motivation • Design • Evaluation

  9. ApproxJoin: System Overview
     Query: SELECT SUM(R1.V + R2.V + … + Rn.V) FROM R1, R2, …, Rn WHERE R1.A = R2.A = … = Rn.A WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%
     Pipeline: input datasets (R1, R2, …, Rn) → ApproxJoin (filtering with Bloom filters, sampling over distributed join) → approximate result: 192.68 ± 0.05 (95% confidence)
     Benefits: reduce shuffled data size; achieve low latency

  10. ApproxJoin: Core Idea
      • Input datasets: R1, R2
      • Build Bloom filters: BF(R1), BF(R2)
      • JoinBF = BF(R1) & BF(R2)
      • Filter out overlap items: apply JoinBF to R1 and to R2
      • Sampling → join result

  11. ApproxJoin: Filtering
      R1 = {(A1, B0), (A2, B1), (A2, B2), …, (A2, Bn), (A3, D1), (A3, D2), …, (A3, Dk)}
      R2 = {(A2, C0), (A1, C1), (A1, C2), …, (A1, Cm), (A4, E1), (A4, E2), …, (A4, El)}
      BF(R1) = {A1, A2, A3}
      BF(R2) = {A1, A2, A4}
      JoinBF = {A1, A2}
      Use JoinBF to remove non-join items (the A3 rows of R1 and the A4 rows of R2).
      R1 ⨝ R2 = {(A1, B0, C1), (A1, B0, C2), …, (A1, B0, Cm), (A2, B1, C0), (A2, B2, C0), …, (A2, Bn, C0)}
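
The filtering step can be sketched with a toy Bloom filter. This is a minimal illustration under stated assumptions, not the ApproxJoin implementation: the `BloomFilter` class, its bit count, its number of hash functions, and the SHA-256-based hashing are all choices made for this example. The join filter is the bitwise AND of the per-input filters, which requires both filters to share the same size and hash functions.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit vector."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True for every added key; may also be True for others (false positives).
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        # Bitwise AND only makes sense for identically configured filters.
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = self.bits & other.bits
        return result

def build_filter(rows):
    """Insert the join key (first column) of every row into a new filter."""
    bf = BloomFilter()
    for key, _ in rows:
        bf.add(key)
    return bf

r1 = [("A1", "B0"), ("A2", "B1"), ("A2", "B2"), ("A3", "D1"), ("A3", "D2")]
r2 = [("A2", "C0"), ("A1", "C1"), ("A1", "C2"), ("A4", "E1"), ("A4", "E2")]

join_bf = build_filter(r1).intersect(build_filter(r2))  # JoinBF = BF(R1) & BF(R2)
r1_filtered = [row for row in r1 if join_bf.might_contain(row[0])]
r2_filtered = [row for row in r2 if join_bf.might_contain(row[0])]
print(r1_filtered)
print(r2_filtered)
```

Rows whose key appears in both inputs always pass the filter. Rows with keys A3 or A4 are dropped unless a false positive occurs, which is unlikely with so few keys and this many bits but is never ruled out by a Bloom filter.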

  12. ApproxJoin: Sampling
      R1 = {(A1, B0), (A2, B1), (A2, B2), …, (A2, Bn)}
      R2 = {(A2, C0), (A1, C1), (A1, C2), …, (A1, Cm)}
      CoGroup:
      • A1: {(A1, B0)} with {(A1, C1), (A1, C2), …, (A1, Cm)}
      • A2: {(A2, B1), (A2, B2), …, (A2, Bn)} with {(A2, C0)}
      Stratified sampling:
      Sample(R1 ⨝ R2) = {(A1, B0, C1), (A1, B0, C3), …, (A1, B0, Cm-2), (A2, B2, C0), (A2, B5, C0), …, (A2, Bn-3, C0)}
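
One way to read this slide: after cogrouping, each join key forms a stratum whose members are the pairs in the cross product of its two sides, and pairs are sampled per stratum without materializing the cross product. The sketch below follows that reading. The function names, the per-stratum sample size rule, and the choice to draw pairs with replacement are assumptions made for illustration, not details confirmed by the slides.

```python
import random
from collections import defaultdict

def cogroup(r1, r2):
    """Group both inputs by key: key -> (values from r1, values from r2)."""
    groups = defaultdict(lambda: ([], []))
    for key, b in r1:
        groups[key][0].append(b)
    for key, c in r2:
        groups[key][1].append(c)
    return groups

def sample_during_join(r1, r2, fraction, rng):
    """Return key -> (stratum_size, sampled (b, c) pairs).

    Each key is one stratum. Pairs are drawn with replacement by picking one
    value from each side, so the full cross product is never built.
    """
    sampled = {}
    for key, (left, right) in cogroup(r1, r2).items():
        stratum_size = len(left) * len(right)
        if stratum_size == 0:
            continue  # key present on one side only: contributes no join rows
        k = max(1, round(fraction * stratum_size))
        pairs = [(rng.choice(left), rng.choice(right)) for _ in range(k)]
        sampled[key] = (stratum_size, pairs)
    return sampled

r1 = [("A1", "B0")] + [("A2", f"B{i}") for i in range(1, 5)]
r2 = [("A2", "C0")] + [("A1", f"C{i}") for i in range(1, 5)]

result = sample_during_join(r1, r2, fraction=0.5, rng=random.Random(1))
for key, (stratum_size, pairs) in sorted(result.items()):
    print(key, stratum_size, pairs)
```

Because every key with at least one matching pair gets its own sample, rare keys such as A1 in R1 are still represented, unlike with independent sampling of the inputs.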

  13. ApproxJoin: Implementation
      • Inputs: query with budget (SELECT SUM(R1.V + R2.V + … + Rn.V) FROM R1, R2, …, Rn WHERE R1.A = R2.A = … = Rn.A WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%), input datasets (HDFS), cluster configuration
      • Components: sample sizes estimator (cost function), multi-way Bloom filter constructor, stratified sampling during join operator, aggregation engine (Apache Spark), error-bound estimator
      • Result: 192.68 ± 0.05 (95% confidence)
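
The slides do not say how the error-bound estimator computes a result of the form 192.68 ± 0.05. As a rough illustration, the sketch below uses the textbook stratified-sampling estimator for a SUM with a normal-approximation confidence interval. The function name `estimate_sum`, the input layout, and the normal approximation are assumptions, and the estimator ApproxJoin actually uses may differ.

```python
import math
from statistics import NormalDist, mean, variance

def estimate_sum(strata, confidence=0.95):
    """Estimate a population SUM from per-stratum samples.

    strata: iterable of (stratum_size, sampled_values), where stratum_size is
    the number of items in the stratum and sampled_values are the values of
    the items drawn from it (with replacement).
    Returns (estimate, margin) so the interval is estimate ± margin.
    """
    total = 0.0
    var = 0.0
    for size, values in strata:
        total += size * mean(values)  # scale the sample mean up to the stratum
        if len(values) > 1:
            # Variance of the scaled mean; a single-value sample adds nothing.
            var += size ** 2 * variance(values) / len(values)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return total, z * math.sqrt(var)

# Hypothetical strata: (number of join rows for the key, sampled row values).
strata = [(100, [1.0, 2.0, 3.0, 2.0]), (50, [10.0, 12.0, 11.0])]
estimate, margin = estimate_sum(strata)
print(f"{estimate:.2f} ± {margin:.2f} (95% confidence)")
```

Larger per-stratum samples shrink the margin, which is the trade-off a query budget such as ERROR 0.05 CONFIDENCE 95% would control.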

  14. Outline • Motivation • Design • Evaluation

  15. Experimental Setup
      • Evaluation questions (see the paper for more results!)
        • Latency vs overlap fraction
        • Shuffled data size vs overlap fraction
        • Latency vs sampling fraction
      • Testbed
        • Cluster: 10 nodes
        • Datasets:
          • Synthetic: Poisson distribution datasets, TPC-H
          • CAIDA network traffic traces; Netflix Prize

  16. Latency (lower is better)
      [Figure: latency in minutes (log scale, 0.1 to 1000) vs overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join, and native Spark join]
      ApproxJoin is ~2.6X and ~8X faster than Spark repartition join and native Spark join, respectively, at an overlap fraction of 1%.

  17. Shuffled Data Size (lower is better)
      [Figure: shuffled data size in MB (log scale, 0.1 to 1000) vs overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join, and native Spark join]
      ApproxJoin shuffles ~29X and ~26X less data than Spark repartition join and native Spark join, respectively, at an overlap fraction of 1%.

  18. Latency (lower is better)
      [Figure: latency in minutes (log scale, 0.1 to 1000) vs sampling fraction (10% to 90%) for ApproxJoin, Spark with sampling after join, and Spark with sampling before join]
      • 3X to 7X faster than Spark with sampling after join
      • 1.01X to 1.3X slower than Spark with sampling before join

  19. Accuracy (lower is better)
      [Figure: accuracy loss in % (log scale, 0.001 to 100) vs sampling fraction (10% to 90%) for ApproxJoin (sample during join), Spark with sampling after join, and Spark with sampling before join]
      • Comparable accuracy to Spark with sampling after join
      • ~42X more accurate than Spark with sampling before join

  20. Outline • Motivation • Our work • Conclusion

  21. Conclusion
      ApproxJoin: Approximate Distributed Joins
      • Transparent: supports applications with minor code changes
      • Practical: adaptive execution based on query budget
      • Efficient: employs sketch and sampling techniques
      Thank you!
