Online Aggregation & Sampling from Joins CompSci - PowerPoint PPT Presentation

Online Aggregation & Sampling from Joins
CompSci 590.02
Instructor: Ashwin Machanavajjhala

Outline
• Online Aggregation
• Ripple Joins
• On the hardness of sampling from Joins

Online Aggregation
• Most systems compute aggregates such as averages and counts exactly.
• But aggregates only provide a "summary view" of the data.
• Why wait for an aggregate computation over the entire data?

Online Aggregation

Examples of Queries
• Select Sum(Salary) From R
• DISTINCT: Select Count(DISTINCT hashtags) from T
• GroupBy: Select Average(Grade) from STable GroupBy CourseID
• JOIN: Select Sum(Grade*Difficulty) from STable, Course

Example Scenarios
• Compute the number of individuals in the table that satisfy function F, where F is a computationally intensive property.
  – Running the query on the entire data takes O(nf), where f is the time for checking F on one record.
  – We can get an approximate answer much faster …
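The scenario above can be sketched in a few lines: check F only on a uniform sample and scale the hit count up to the full table. This is a minimal illustration, not the method from the lecture; `estimate_count` and `is_slow_property` are hypothetical names, and the cheap modulo test merely stands in for an expensive F.

```python
import random

def estimate_count(records, predicate, sample_size, seed=0):
    """Estimate how many records satisfy `predicate` from a uniform sample."""
    rng = random.Random(seed)
    # Uniform sample without replacement: only `sample_size` calls to F.
    sample = rng.sample(records, sample_size)
    hits = sum(1 for r in sample if predicate(r))
    # Scale the sample count by N/n.
    return len(records) * hits / sample_size

def is_slow_property(x):
    # Hypothetical stand-in for a computationally intensive property F.
    return x % 7 == 0

records = list(range(100_000))
approx = estimate_count(records, is_slow_property, sample_size=2_000)
exact = sum(1 for r in records if is_slow_property(r))
```

The cost drops from O(nf) for the full table to O(f) times the sample size, at the price of sampling error.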

Example Scenarios
• Compute the sum of all elements in a database, which is partitioned on k machines.
  – Compute the sum S_i on each machine i, and then add up all the S_i's
  – Time taken to compute the aggregate = max(time taken by any one machine)
  – If a machine fails …

Example Scenarios
• Find the number of people in database D1 who also appear in database D2
  – An exact answer needs checking |D1|·|D2| pairs of records.
  – Can we get an approximate answer faster?

Aggregations on a single table
1. Read the records of the table in a random order
2. Maintain a running estimate of the required aggregate
3. Compute confidence bounds on the error in the running estimate.

Random access
• Random I/Os are expensive
• Heap Scans
  – Heaps maintain the data in the order in which it was inserted
  – If insertion order is not correlated with values, then this can be used instead of a true random ordering
• Index Scans
  – Can be used if the index is on an attribute that is not the same as the aggregated column
• Sampling from indexes
  – From previous class

Group-By
• E.g., Select Avg(Salary) from R GroupBy Department
• Standard technique
  – Sort the relation by the grouping attribute
  – Compute the within-group aggregate by scanning the sorted output
• Sorting is a blocking operation
• Alternative: Hashing
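A minimal sketch of the hashing alternative, assuming rows arrive as (group, value) pairs: a hash table keyed by the grouping attribute keeps per-group running sums and counts, so an estimate is available after every row instead of only after a sort completes. The function name `running_group_avg` and the sample rows are illustrative only.

```python
from collections import defaultdict

def running_group_avg(rows):
    """Yield a snapshot of per-group running averages after each row."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in rows:
        sums[group] += value
        counts[group] += 1
        # Non-blocking: a current answer exists after every tuple.
        yield {g: sums[g] / counts[g] for g in sums}

rows = [("sales", 50.0), ("eng", 90.0), ("sales", 70.0), ("eng", 110.0)]
snapshots = list(running_group_avg(rows))
```

Here `snapshots[0]` holds only the first group seen, and the last snapshot holds the exact per-group averages once every row has been read.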

Running Estimate
• Let N be the number of tuples in the data
• Let n be the number of tuples seen so far
• SUM: (N/n) · (current sum)
• COUNT: (N/n) · (current count)
• AVG: (1/n) · (current sum)
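The three estimators can be sketched as a generator that reads a table in random order and reports the scaled aggregates after each tuple. This is an illustrative sketch: `online_aggregates` is a hypothetical name, the table is assumed to fit in memory as a list of numbers, and COUNT is taken over tuples satisfying an optional predicate.

```python
import random

def online_aggregates(values, predicate=lambda v: True, seed=0):
    """Yield (n, sum_est, count_est, avg_est) after each tuple read."""
    N = len(values)
    order = list(values)
    random.Random(seed).shuffle(order)  # read the tuples in a random order
    current_sum = 0.0
    current_count = 0
    for n, v in enumerate(order, start=1):
        current_sum += v
        current_count += bool(predicate(v))
        yield (n,
               (N / n) * current_sum,     # SUM estimate
               (N / n) * current_count,   # COUNT estimate
               current_sum / n)           # AVG estimate

values = list(range(1, 101))
estimates = list(online_aggregates(values))
```

Once n reaches N the scaling factor is 1, so the final estimates coincide with the exact aggregates.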

Confidence bounds
Assuming the input tuples are randomly chosen.
If $X_i$ is the random variable corresponding to the $i$-th tuple, then $X_1, X_2, \ldots$ are independent random variables.

$$P\{|Y_n - \mu| > \varepsilon\} < 2\exp\{-2n\varepsilon^2/(b-a)^2\}$$

Where
• $Y_n$ is the running estimate after seeing $n$ elements
• $\mu$ is the actual aggregate
• $[a,b]$: range of the values in the database
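To turn the bound into a reported interval, set the right-hand side equal to a failure probability δ and solve for ε, which gives ε = (b − a)·sqrt(ln(2/δ) / (2n)). The sketch below assumes the estimate is a running average of values in [a, b]; the names `hoeffding_half_width` and `tuples_needed` and the symbol δ are not from the slides.

```python
import math

def hoeffding_half_width(n, a, b, delta):
    """Half-width eps such that P{|Y_n - mu| > eps} < delta after n tuples."""
    return (b - a) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def tuples_needed(eps, a, b, delta):
    """Smallest n for which the bound guarantees error at most eps."""
    return math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

# Values in [0, 100], 95% confidence.
eps_after_1000 = hoeffding_half_width(1000, 0.0, 100.0, 0.05)
n_for_eps_1 = tuples_needed(1.0, 0.0, 100.0, 0.05)
```

The half-width shrinks as 1/sqrt(n), so halving the error takes roughly four times as many tuples.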

Online Aggregation over Joins
• How to generate a random ordering of pairs of tuples from the join of two relations?
  – Option 1: Compute the join and then read the output of the join in a random order. BLOCKING!
  – Option 2: Nested Loop Join (over random orderings of the two tables)

Nested Loop Join
[Figure: axes labeled Inner Relation and Outer Relation]

Nested Loop Join
[Figure: axes labeled Inner Relation and Outer Relation]
Unnecessary work is done if:
 - Values in the inner relation are roughly the same
 - Output of the aggregate is not very sensitive to the values in the inner relation

Ripple Join
[Figure: axes labeled Inner Relation and Outer Relation]
Read x records from each table, and compute the join on these records.
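A rough sketch of this step-wise expansion for a SUM over an equi-join, assuming both relations are non-empty lists held in memory. Each step reads x more tuples from each shuffled relation and joins only the newly exposed pairs. Scaling the partial sum by |R||S| divided by the number of pairs examined is an assumption made here by analogy with the N/n factor for a single table; `ripple_join_sum`, `join_pred`, and `expr` are hypothetical names.

```python
import random

def ripple_join_sum(R, S, join_pred, expr, x=1, seed=0):
    """Yield running estimates of SUM(expr(r, s)) over the join of R and S."""
    rng = random.Random(seed)
    R, S = list(R), list(S)
    rng.shuffle(R)
    rng.shuffle(S)
    seen_r, seen_s = [], []
    current_sum = 0.0
    pos = 0
    while len(seen_r) < len(R) or len(seen_s) < len(S):
        new_r, new_s = R[pos:pos + x], S[pos:pos + x]
        pos += x
        # New R tuples against every S tuple read so far, including new ones.
        for r in new_r:
            for s in seen_s + new_s:
                if join_pred(r, s):
                    current_sum += expr(r, s)
        # New S tuples against previously seen R tuples only (no double count).
        for s in new_s:
            for r in seen_r:
                if join_pred(r, s):
                    current_sum += expr(r, s)
        seen_r += new_r
        seen_s += new_s
        scale = (len(R) * len(S)) / (len(seen_r) * len(seen_s))
        yield scale * current_sum

# Sum(Grade*Difficulty) over STable joined with Course on the course id.
stable = [("c1", 3.0), ("c2", 4.0), ("c1", 2.0)]
course = [("c1", 2.0), ("c2", 5.0)]
estimates = list(ripple_join_sum(
    stable, course,
    join_pred=lambda r, s: r[0] == s[0],
    expr=lambda r, s: r[1] * s[1]))
```

When both relations have been read completely, the scale factor is 1 and the last estimate equals the exact join aggregate.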

Online Aggregation with Joins
• The output tuples are no longer independent samples from the underlying distribution
  – Why?

Difficulty of Join Sampling
• Sample(Join(R,S)) ≠ Join(Sample(R), Sample(S))
• R: {(a,x0), (b,x1), (b,x2), …, (b,xn)}
• S: {(b,y0), (a,y1), (a,y2), …, (a,yn)}
• In the join of R and S on the first attribute: half the records have 'a' and half the records have 'b'
• In Sample(R): the probability that 'a' appears is very small.
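The example can be checked numerically. The sketch below builds R and S for n = 1000, counts join output tuples per key, and then joins two small uniform samples; the sample size of 10 and the helper name `join_key_counts` are arbitrary choices made for illustration.

```python
import random
from collections import Counter

n = 1000
R = [("a", "x0")] + [("b", f"x{i}") for i in range(1, n + 1)]
S = [("b", "y0")] + [("a", f"y{i}") for i in range(1, n + 1)]

def join_key_counts(R, S):
    """Count output tuples per key for the equi-join on the first attribute."""
    s_counts = Counter(key for key, _ in S)
    out = Counter()
    for key, _ in R:
        if s_counts[key]:
            out[key] += s_counts[key]
    return out

full = join_key_counts(R, S)            # n tuples with 'a', n with 'b'

rng = random.Random(0)
sample_R = rng.sample(R, 10)            # (a, x0) is picked with prob. 10/1001
sample_S = rng.sample(S, 10)            # (b, y0) is picked with prob. 10/1001
sampled = join_key_counts(sample_R, sample_S)
```

The join of the samples is non-empty only if `("a", "x0")` lands in `sample_R` or `("b", "y0")` lands in `sample_S`, each of which happens with probability of about 1%, so for most seeds it contains no tuples at all even though the full join has 2n.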

Using statistics
• If we know for each tuple t ∈ R, how many tuples it joins with in S (call it n_S(t))
• Pick a random tuple t ∈ R
• Include it with probability proportional to n_S(t)
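One way to realize this idea, sketched under the assumption that both relations fit in memory and are joined on their first attribute: draw t from R with weight n_S(t), then pair it with a uniformly chosen matching tuple of S. The second step is not stated on the slide and is an assumption here; with it, every join tuple is drawn with probability n_S(t)/|J| · 1/n_S(t) = 1/|J|, where |J| is the join size. `sample_join` is a hypothetical name and the draw is with replacement.

```python
import random
from collections import defaultdict

def sample_join(R, S, k, seed=0):
    """Draw k join tuples uniformly (with replacement) from the equi-join."""
    rng = random.Random(seed)
    s_by_key = defaultdict(list)
    for s in S:
        s_by_key[s[0]].append(s)
    # n_S(t): number of S tuples that each t in R joins with.
    weights = [len(s_by_key.get(r[0], [])) for r in R]
    if sum(weights) == 0:
        return []                        # empty join, nothing to sample
    out = []
    for r in rng.choices(R, weights=weights, k=k):
        s = rng.choice(s_by_key[r[0]])   # uniform among the matches of r
        out.append((r, s))
    return out

n = 1000
R = [("a", "x0")] + [("b", f"x{i}") for i in range(1, n + 1)]
S = [("b", "y0")] + [("a", f"y{i}") for i in range(1, n + 1)]
pairs = sample_join(R, S, k=2000)
frac_a = sum(1 for r, _ in pairs if r[0] == "a") / len(pairs)
```

On the skewed example, `frac_a` should come out close to one half, matching the composition of the full join, whereas independent uniform samples of R and S would almost never contain `("a", "x0")`.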

Summary
• Online aggregation helps provide approximate answers without waiting for the exact answer
• Requires iterating over the data in a random order
• Sampling over Joins is difficult.
