ApproxJoin Approximate Distributed Joins Do Le Quoc, Istemi Ekin - PowerPoint PPT Presentation

ApproxJoin: Approximate Distributed Joins
Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas, Ruichuan Chen, Christof Fetzer, Thorsten Strufe
10/2018

Motivation
• Join is a critical operation in big data analytics systems, but it is very expensive
• Goal: reduce the overhead of join operations using a sampling-based approach
[Figure: query plan with a projection (π) on top of a tree of joins (⨝) over R1, R2, R3, R4]

Motivation
R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn)
R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm)
R1 ⨝ R2 = (A1, B0, C1), (A1, B0, C2), …, (A1, B0, Cm), (A2, B1, C0), (A2, B2, C0), …, (A2, Bn, C0)

Motivation
R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn)
R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm)
Sample(R1) ⨝ Sample(R2) != Sample(R1 ⨝ R2)
Example: Sample(R1) = (A2, B2), (A2, B5), …, (A2, Bn-2) and Sample(R2) = (A1, C3), (A1, C4), …, (A1, Cm-1), so Sample(R1) ⨝ Sample(R2) = NULL
Sampling over joins is a challenging task with regard to output quality

Motivation
R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn), plus non-join items (A3, D1), (A3, D2), …, (A3, Dk)
R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm), plus non-join items (A4, E1), (A4, E2), …, (A4, El)
R1 ⨝ R2 = (A1, B0, C1), (A1, B0, C2), …, (A1, B0, Cm), (A2, B1, C0), (A2, B2, C0), …, (A2, Bn, C0)
Non-join items cause unnecessary data shuffling through the cluster

State-of-the-art Systems
• AQUA (SIGMOD’99), Sampling over joins (SIGMOD’99): require a priori knowledge of inputs (statistical info, indices)
• RippleJoin (SIGMOD’99), WanderJoin (SIGMOD’16): use an online aggregation approach for joins
  Limitation of the systems above: designed for single-node systems
• SparkSQL (SIGMOD’15), SnappyData (SIGMOD’16): use pre-existing samples to serve queries
  Limitation: do not support sampling over joins

Outline • Motivation • Design • Evaluation

ApproxJoin: System Overview
Query with a budget:
    SELECT SUM(R1.V + R2.V + … + Rn.V)
    FROM R1, R2, …, Rn
    WHERE R1.A = R2.A = … = Rn.A
    WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%
Pipeline: input datasets (R1, R2, …, Rn) → ApproxJoin → result, e.g. 192.68 ± 0.05 (95% confidence)
ApproxJoin stages:
• Filtering (Bloom filters): reduces shuffled data size
• Approximate sampling over distributed join: achieves low latency

ApproxJoin: Core Idea
• Input datasets: R1, R2
• Build Bloom filters: BF(R1), BF(R2)
• Combine them: JoinBF = BF(R1) & BF(R2)
• Keep only the overlapping items: filter R1 with JoinBF, filter R2 with JoinBF
• Sample during the join to produce the join result

ApproxJoin: Filtering
R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn), plus non-join items (A3, D1), (A3, D2), …, (A3, Dk)
R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm), plus non-join items (A4, E1), (A4, E2), …, (A4, El)
R1 ⨝ R2 = (A1, B0, C1), (A1, B0, C2), …, (A1, B0, Cm), (A2, B1, C0), (A2, B2, C0), …, (A2, Bn, C0)
Use JoinBF to remove non-join items:
BF(R1) = {A1, A2, A3}
BF(R2) = {A1, A2, A4}
JoinBF = {A1, A2}
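The filtering step can be illustrated with a minimal Python sketch. This is not the ApproxJoin implementation (the slides describe a system built on Apache Spark); the `BloomFilter` class, its size, its hash scheme, and the helper `build_filter` are all assumptions made for illustration. It builds one filter per input on the join key, combines them with a bitwise AND to obtain JoinBF, and keeps only tuples whose key passes JoinBF.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter. Size and hash count are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True for every added key; may also be True for others (false positives).
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        # Bitwise AND; both filters must share the same size and hash functions.
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = self.bits & other.bits
        return result


def build_filter(relation):
    """Build a Bloom filter over the join keys (first field) of a relation."""
    bf = BloomFilter()
    for key, _ in relation:
        bf.add(key)
    return bf


# Toy inputs modelled on the slide: A3 and A4 appear in only one relation.
r1 = [("A1", "B0"), ("A2", "B1"), ("A2", "B2"), ("A3", "D1"), ("A3", "D2")]
r2 = [("A2", "C0"), ("A1", "C1"), ("A1", "C2"), ("A4", "E1"), ("A4", "E2")]

join_bf = build_filter(r1).intersect(build_filter(r2))
filtered_r1 = [row for row in r1 if join_bf.might_contain(row[0])]
filtered_r2 = [row for row in r2 if join_bf.might_contain(row[0])]
```

Tuples with keys A1 and A2 are always retained, because a Bloom filter has no false negatives. Non-join keys such as A3 and A4 are dropped unless they hit a false positive, so the filtered inputs are a superset of the true join participants and the join itself still has to check keys.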

ApproxJoin: Sampling
R1: (A1, B0), (A2, B1), (A2, B2), …, (A2, Bn)
R2: (A2, C0), (A1, C1), (A1, C2), …, (A1, Cm)
CoGroup by join key:
• A1: {B0} with {C1, C2, …, Cm}
• A2: {B1, B2, …, Bn} with {C0}
Stratified sampling over the groups:
Sample(R1 ⨝ R2) = (A1, B0, C1), (A1, B0, C3), …, (A1, B0, Cm-2), (A2, B2, C0), (A2, B5, C0), …, (A2, Bn-3, C0)
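A rough Python sketch of sampling during the join, under stated assumptions: inputs are lists of (key, numeric value) pairs, each join key is one stratum, and pairs are drawn uniformly with replacement from each stratum's cross product without materializing it. The function names (`cogroup`, `sample_join`, `estimate_sum`), the per-stratum sample-size rule, and the scale-up estimator are illustrative choices, not details taken from the slides.

```python
import random
from collections import defaultdict


def cogroup(r1, r2):
    """Group both relations by join key: key -> (values from r1, values from r2)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in r1:
        groups[key][0].append(value)
    for key, value in r2:
        groups[key][1].append(value)
    return groups


def sample_join(r1, r2, fraction, seed=0):
    """Sample join tuples per key (stratum) without building the full cross product.

    Returns (sampled, sizes): sampled[key] is a list of (key, left, right) tuples,
    sizes[key] is the number of tuples the full join would produce for that key.
    """
    rng = random.Random(seed)
    sampled, sizes = {}, {}
    for key, (left, right) in cogroup(r1, r2).items():
        if not left or not right:
            continue  # key present on one side only: contributes nothing to the join
        sizes[key] = len(left) * len(right)
        k = max(1, round(fraction * sizes[key]))  # assumed sample-size rule
        # Assumption: draw pairs uniformly with replacement.
        sampled[key] = [(key, rng.choice(left), rng.choice(right)) for _ in range(k)]
    return sampled, sizes


def estimate_sum(sampled, sizes):
    """Estimate SUM(left + right) over the full join by scaling each stratum's mean."""
    total = 0.0
    for key, rows in sampled.items():
        mean = sum(left + right for _, left, right in rows) / len(rows)
        total += sizes[key] * mean
    return total


# Toy data: values are constant within each key, so the estimate is exact here.
left_rel = [("A1", 1.0), ("A2", 2.0), ("A2", 2.0), ("A3", 7.0)]
right_rel = [("A1", 3.0), ("A1", 3.0), ("A2", 4.0), ("A4", 5.0)]

sampled, sizes = sample_join(left_rel, right_rel, fraction=0.5, seed=1)
estimate = estimate_sum(sampled, sizes)
```

Because each key is sampled separately, every join key is represented in the sample, which is what sampling the inputs independently before the join cannot guarantee.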

ApproxJoin: Implementation
Inputs:
• Query with a budget:
    SELECT SUM(R1.V + R2.V + … + Rn.V)
    FROM R1, R2, …, Rn
    WHERE R1.A = R2.A = … = Rn.A
    WITHIN 120 seconds OR ERROR 0.05 CONFIDENCE 95%
• Input datasets (HDFS)
• Cluster configuration
Components:
• Sample sizes estimator (cost function)
• Multi-way Bloom filter constructor
• Stratified sampling during the join operator
• Aggregation engine (Apache Spark)
• Error-bound estimator
Result: 192.68 ± 0.05 (95% confidence)

Outline • Motivation • Design • Evaluation

Experimental Setup
• Evaluation questions (see the paper for more results)
  • Latency vs overlap fraction
  • Shuffled data size vs overlap fraction
  • Latency vs sampling fraction
• Testbed
  • Cluster: 10 nodes
  • Datasets:
    • Synthetic: Poisson distribution datasets, TPC-H
    • CAIDA network traffic traces; Netflix Prize

Latency (lower is better)
[Figure: latency in minutes (log scale) vs overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join, and native Spark join]
ApproxJoin is ~2.6X and ~8X faster than Spark repartition join and native Spark join, respectively, with an overlap fraction of 1%

Shuffled Data Size (lower is better)
[Figure: shuffled data size in MB (log scale) vs overlap fraction (1% to 10%) for ApproxJoin, Spark repartition join, and native Spark join]
ApproxJoin shuffles ~29X and ~26X less data than Spark repartition join and native Spark join, respectively, with an overlap fraction of 1%

Latency (lower is better)
[Figure: latency in minutes (log scale) vs sampling fraction (10% to 90%) for ApproxJoin, Spark with sampling after join, and Spark with sampling before join]
• 3X to 7X faster than Spark with sampling after join
• 1.01X to 1.3X slower than Spark with sampling before join

Accuracy (lower is better)
[Figure: accuracy loss in % (log scale) vs sampling fraction (10% to 90%) for ApproxJoin (sample during join), Spark with sampling after join, and Spark with sampling before join]
• Comparable accuracy to Spark with sampling after join
• ~42X more accurate than Spark with sampling before join

Outline • Motivation • Our work • Conclusion

Conclusion
ApproxJoin: Approximate Distributed Joins
• Transparent: supports applications with minor code changes
• Practical: adaptive execution based on query budget
• Efficient: employs sketch and sampling techniques
Thank you!
