Processing Complex Aggregate Queries over Data Streams
SIGMOD 2002
Alin Dobra, Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi
June 4, 2002

Processing Network Data Streams
[Figure: Telco/LAN routers send measurement and alarm streams R1, R2, R3 to a Network Operations Center, which evaluates a data-stream join query.]
Data-stream join query:
SELECT COUNT(*) FROM R1, R2, R3 WHERE R1.a = R2.b AND R2.b = R3.c
Dobra, Garofalakis, Gehrke and Rastogi – Processing Aggregate Queries over Streams

Computations over Streaming Data
[Figure: the streams for R1, R2, ..., Rr feed a query-processing engine that keeps one sketch per stream in memory (sketch for R1, sketch for R2, ..., sketch for Rr) and returns an approximate answer to the query Q(R1, ..., Rr).]
• Goal: approximately answer JOIN-COUNT and JOIN-SUM queries over streams

Outline of the Talk
• Motivation
• Sketch-based randomized algorithms
• Sketch-based approximation of aggregate query results
• Sketch partitioning to boost estimation accuracy
• Experimental evaluation
• Summary

Sketch-Based Randomized Algorithms [AMS96]
• Estimate $F(D)$ for some function $F$ and some data $D$
Method:
• Build a probability space and a random variable $X$ with the properties:
  1) $E[X] = F(D) \ge L_E$
  2) $\mathrm{Var}(X) \le U_V$
• Combine samples of $X$ to achieve relative error $\epsilon$ with probability at least $1 - \delta$
• Boost accuracy to $\epsilon$ by averaging $\frac{8 U_V}{\epsilon^2 L_E^2}$ pairwise independent samples of $X$
• Boost confidence to $1 - \delta$ by taking the median of $2 \log(1/\delta)$ averages
Example usage: frequency moments [AMS96], size of join [AGMS99], $L_1$ norm [FKSV99], wavelet decomposition [GKMS01]
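The averaging and median steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `median_of_means` is hypothetical, the logarithm is taken base 2, the samples are drawn fully independently (stronger than the pairwise independence the slide asks for), and the estimator being boosted is a toy Gaussian with mean 100 and standard deviation 20 rather than a real sketch.

```python
import math
import random
import statistics

def median_of_means(sample_x, eps, delta, var_upper, mean_lower):
    """Boost an unbiased estimator: average s1 samples, take the median of s2 averages."""
    # Copies per average: 8 * U_V / (eps^2 * L_E^2), as on the slide.
    s1 = math.ceil(8 * var_upper / (eps ** 2 * mean_lower ** 2))
    # Number of averages: 2 * log(1 / delta); base 2 is an assumption here.
    s2 = math.ceil(2 * math.log2(1 / delta))
    averages = [sum(sample_x() for _ in range(s1)) / s1 for _ in range(s2)]
    return statistics.median(averages)

# Toy estimator (assumption): unbiased for 100 with variance 400.
rng = random.Random(0)
estimate = median_of_means(lambda: rng.gauss(100, 20),
                           eps=0.1, delta=0.05,
                           var_upper=400, mean_lower=100)
```

With these parameters each average uses 32 samples and the median is taken over 9 averages, so the memory cost is the product of the two counts.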

Sketch-Based Randomized Algorithms (cont.)
[Figure: a function $h$ maps a seed $s$ from a uniform random seed space (size $2^{65}$) and a data value (up to $2^{32} - 1$) to a value in $\{-1, +1\}$; the outputs form the family $\xi$ of random variables.]
• $\xi_i(s) = h(s, i) \in \{-1, +1\}$
• The family $\xi$ is 4-wise independent, i.e. for all pairwise distinct $i_1, i_2, i_3, i_4$ and all $v_1, v_2, v_3, v_4 \in \{-1, +1\}$:
  $P[\xi_{i_1} = v_1 \wedge \xi_{i_2} = v_2 \wedge \xi_{i_3} = v_3 \wedge \xi_{i_4} = v_4] = P[\xi_{i_1} = v_1]\, P[\xi_{i_2} = v_2]\, P[\xi_{i_3} = v_3]\, P[\xi_{i_4} = v_4]$
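One standard way to obtain such a family is to evaluate a random cubic polynomial over a prime field and keep one bit of the result. The sketch below uses that construction as a stand-in; it is not necessarily the generator used in the talk. The class name `FourWiseSigns`, the prime $2^{61} - 1$, and the use of the low bit are all assumptions, and because the prime is odd the resulting signs carry a negligible bias of about $1/p$.

```python
import random

P = (1 << 61) - 1  # Mersenne prime, assumed larger than every data value

class FourWiseSigns:
    """xi_i(s) = h(s, i): a +/-1 value from a seeded random cubic polynomial mod P."""

    def __init__(self, seed):
        rng = random.Random(seed)
        # Four random coefficients define a polynomial of degree at most 3.
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def __call__(self, i):
        acc = 0
        for c in self.coeffs:      # Horner evaluation of the polynomial at i
            acc = (acc * i + c) % P
        return 1 if acc & 1 else -1

xi = FourWiseSigns(seed=42)
signs = [xi(i) for i in range(1, 11)]
```

Only the seed has to be stored; every $\xi_i$ is recomputed on demand, which is what keeps the sketch small.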

Estimation of $\mathrm{COUNT}(F \bowtie_a G)$ [AGMS99]
[Figure: the $a$ columns of $F$ and $G$ are summarized (⇒) by the frequency vectors $f_i$ and $g_i$.]
  i   f_i   g_i
  1   3     3
  2   1     0
  3   2     2
• Estimate $\mathrm{COUNT}(F \bowtie_a G) = \sum_{i=1}^{3} f_i g_i = 3 \cdot 3 + 1 \cdot 0 + 2 \cdot 2 = 13$

Estimation of $\mathrm{COUNT}(F \bowtie_a G)$ (cont.)
  i      1    2    3
  ξ_i   −1   +1   −1
$X_F = \sum_{t \in F} \xi_{t.a}$, $X_G = \sum_{t \in G} \xi_{t.a}$
  F.a   ξ_a   running X_F        G.a   ξ_a   running X_G
  1     −1    −1                 3     −1    −1
  1     −1    −2                 3     −1    −2
  2     +1    −1                 1     −1    −3
  3     −1    −2                 1     −1    −4
  1     −1    −3                 1     −1    −5
  3     −1    −4
$X = X_F X_G = (-4) \cdot (-5) = 20$, a single sample of an estimator whose expectation is the true count, 13
$SJ(F) = (3 \cdot 3) + (1 \cdot 1) + (2 \cdot 2) = 14$, $SJ(G) = 13$
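The per-tuple update behind $X_F$ and $X_G$ is a single addition. The following sketch replays the example streams through one counter each, using the $\xi$ values from the slide; the helper name `atomic_sketch` and the list encoding of the streams are illustrative assumptions.

```python
from collections import Counter

xi = {1: -1, 2: +1, 3: -1}       # the xi values from the example

def atomic_sketch(stream, xi):
    """Maintain X = sum of xi[t.a] over the tuples seen so far (one counter)."""
    x = 0
    for a in stream:
        x += xi[a]               # constant-time update per arriving tuple
    return x

F_a = [1, 1, 2, 3, 1, 3]         # join-attribute values of F, in arrival order
G_a = [3, 3, 1, 1, 1]            # join-attribute values of G

x_f = atomic_sketch(F_a, xi)
x_g = atomic_sketch(G_a, xi)
sample = x_f * x_g               # one sample of the estimator X

# Exact join size from the frequency vectors, for comparison.
f, g = Counter(F_a), Counter(G_a)
exact = sum(f[i] * g[i] for i in f)
```

A single sample can be far from the exact count; accuracy comes from averaging many such samples built with independent seeds.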

Estimation of $\mathrm{COUNT}(F \bowtie_a G)$ (cont.)
• To estimate $\mathrm{COUNT}(F \bowtie_a G) = \sum_{i=1}^{n} f_i g_i$ define:
  $X_F = \sum_{i=1}^{n} f_i \xi_i = \sum_{t \in F} \xi_{t.a}$
  $X_G = \sum_{i=1}^{n} g_i \xi_i = \sum_{t \in G} \xi_{t.a}$
• With $X = X_F X_G$ we have:
  $E[X] = E\left[\sum_{i=1}^{n} f_i g_i \xi_i^2 + \sum_{i \ne i'} f_i g_{i'} \xi_i \xi_{i'}\right] = \mathrm{COUNT}(F \bowtie_a G)$
  $\mathrm{Var}(X) \le 2\, SJ(F)\, SJ(G)$
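For a domain this small, the expectation and variance claims can be checked exactly by enumerating every sign assignment. The sketch below does that for the frequency vectors $f = (3, 1, 2)$ and $g = (3, 0, 2)$ of the running example. It assumes fully independent uniform signs, which is a special case of 4-wise independence; the variable names are illustrative.

```python
from itertools import product

f = {1: 3, 2: 1, 3: 2}
g = {1: 3, 2: 0, 3: 2}
domain = sorted(f)

# Evaluate X = X_F * X_G for each of the 2^3 equally likely sign assignments.
samples = []
for signs in product((-1, 1), repeat=len(domain)):
    xi = dict(zip(domain, signs))
    x_f = sum(f[i] * xi[i] for i in domain)
    x_g = sum(g[i] * xi[i] for i in domain)
    samples.append(x_f * x_g)

mean = sum(samples) / len(samples)                          # E[X]
var = sum((x - mean) ** 2 for x in samples) / len(samples)  # Var(X)

sj_f = sum(v * v for v in f.values())                       # SJ(F) = 14
sj_g = sum(v * v for v in g.values())                       # SJ(G) = 13
bound = 2 * sj_f * sj_g                                     # 2 SJ(F) SJ(G)
```

Here the mean equals the join size because the cross terms $\xi_i \xi_{i'}$ average to zero, and the variance stays below the bound.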

Using Sketches to Answer SUM Queries
• Estimate $\mathrm{SUM}_b(F(a) \bowtie_a G(a, b)) = \sum_{i=1}^{3} f_i \sum_{t \in G,\, t.a = i} t.b$
  i      1    2    3
  ξ_i   −1   +1   −1
$X_F = \sum_{t \in F} \xi_{t.a}$, $X_G = \sum_{t \in G} t.b\, \xi_{t.a}$
  F.a   ξ_a   running X_F        G.a   G.b   ξ_a   running X_G
  1     −1    −1                 3     2     −1    −2
  1     −1    −2                 3     2     −1    −4
  2     +1    −1                 1     1     −1    −5
  3     −1    −2                 1     2     −1    −7
  1     −1    −3                 1     1     −1    −8
  3     −1    −4
$X = X_F X_G = (-4) \cdot (-8) = 32$, a single sample of an estimator whose expectation is the true sum, 20

Using Sketches to Answer SUM Queries (cont.)
• To estimate $\mathrm{SUM}_b(F(a) \bowtie_a G(a, b)) = \sum_{i=1}^{n} f_i \sum_{t \in G,\, t.a = i} t.b$ define:
  $X_F = \sum_{i=1}^{n} f_i \xi_i = \sum_{t \in F} \xi_{t.a}$
  $X_G = \sum_{i=1}^{n} \left(\sum_{t \in G,\, t.a = i} t.b\right) \xi_i = \sum_{t \in G} t.b\, \xi_{t.a}$
• With $X = X_F X_G$:
  $E[X] = \mathrm{SUM}_b(F(a) \bowtie_a G(a, b))$
  $\mathrm{Var}(X) \le 2\, SJ(F) \sum_{i=1}^{n} \left(\sum_{t \in G,\, t.a = i} t.b\right)^2$
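The only change from the COUNT sketch is that each tuple of $G$ adds $t.b\, \xi_{t.a}$ instead of $\xi_{t.a}$. The sketch below applies this to the example relations and checks the expectation by enumerating all sign assignments; the function name `sum_sketches` and the tuple encoding are assumptions made for illustration.

```python
from itertools import product

F_a = [1, 1, 2, 3, 1, 3]                          # F(a)
G_ab = [(3, 2), (3, 2), (1, 1), (1, 2), (1, 1)]   # G(a, b)

def sum_sketches(xi):
    """Return (X_F, X_G) for one assignment of the signs xi."""
    x_f = sum(xi[a] for a in F_a)
    x_g = sum(b * xi[a] for a, b in G_ab)         # each G tuple is weighted by t.b
    return x_f, x_g

# Exact answer: sum of t.b over all joining (F tuple, G tuple) pairs.
exact = sum(b for fa in F_a for a, b in G_ab if a == fa)

# The single sample obtained with the xi values of the example.
x_f, x_g = sum_sketches({1: -1, 2: +1, 3: -1})

# Expectation over all 2^3 sign assignments (fully independent signs assumed).
products = []
for signs in product((-1, 1), repeat=3):
    sf, sg = sum_sketches(dict(zip((1, 2, 3), signs)))
    products.append(sf * sg)
mean = sum(products) / len(products)
```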

Extension to $\mathrm{COUNT}(F \bowtie_a G \bowtie_b H)$
• Key idea: use independent $\xi$ families for each join attribute
  i        1    2    3          j        1    2
  ξ^a_i   −1   +1   −1          ξ^b_j   +1   −1
$X_F = \sum_{t \in F} \xi^a_{t.a}$, $X_G = \sum_{t \in G} \xi^a_{t.a}\, \xi^b_{t.b}$, $X_H = \sum_{t \in H} \xi^b_{t.b}$
  F.a   ξ^a   running X_F
  1     −1    −1
  1     −1    −2
  2     +1    −1
  3     −1    −2
  1     −1    −3
  3     −1    −4
  G.a   G.b   ξ^a   ξ^b   running X_G
  3     2     −1    −1    +1
  3     2     −1    −1    +2
  1     1     −1    +1    +1
  1     2     −1    −1    +2
  1     1     −1    +1    +1
  H.b   ξ^b   running X_H
  2     −1    −1
  2     −1    −2
  1     +1    −1
  2     −1    −2
$X = X_F X_G X_H = (-4) \cdot 1 \cdot (-2) = 8$, a single sample of an estimator whose expectation is the true count, 27

Extension to $\mathrm{COUNT}(F \bowtie_a G \bowtie_b H)$ (cont.)
• To estimate
  $\mathrm{COUNT}(F \bowtie_a G \bowtie_b H) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} f_i\, g_{ij}\, h_j$
• Define:
  $X_F = \sum_{i=1}^{n_1} f_i \xi^a_i$, $X_G = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} g_{ij}\, \xi^a_i \xi^b_j$, $X_H = \sum_{j=1}^{n_2} h_j \xi^b_j$
• If $\xi^a$ and $\xi^b$ are independent families of 4-wise independent $\pm 1$ pseudo-random variables:
  $E[X_F X_G X_H] = \mathrm{COUNT}(F \bowtie_a G \bowtie_b H)$
  $\mathrm{Var}(X_F X_G X_H) \le 4\, SJ(F)\, SJ(G)\, SJ(H)$
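The unbiasedness claim can be checked by brute force on the example relations, enumerating every combination of signs for the two families. This is a sketch under the assumption of fully independent uniform signs within and across $\xi^a$ and $\xi^b$; the function name `sample` and the list encodings are illustrative.

```python
from itertools import product

F_a = [1, 1, 2, 3, 1, 3]                          # F(a)
G_ab = [(3, 2), (3, 2), (1, 1), (1, 2), (1, 1)]   # G(a, b)
H_b = [2, 2, 1, 2]                                # H(b)

def sample(xi_a, xi_b):
    """One value of X = X_F * X_G * X_H for the given sign families."""
    x_f = sum(xi_a[a] for a in F_a)
    x_g = sum(xi_a[a] * xi_b[b] for a, b in G_ab)  # one factor per join attribute
    x_h = sum(xi_b[b] for b in H_b)
    return x_f * x_g * x_h

# Exact size of the three-way join by nested loops.
exact = sum(1 for fa in F_a for a, b in G_ab for hb in H_b
            if fa == a and b == hb)

# The single sample obtained with the sign values of the example.
example = sample({1: -1, 2: +1, 3: -1}, {1: +1, 2: -1})

# Expectation over all 2^3 * 2^2 sign assignments.
values = [sample(dict(zip((1, 2, 3), sa)), dict(zip((1, 2), sb)))
          for sa in product((-1, 1), repeat=3)
          for sb in product((-1, 1), repeat=2)]
mean = sum(values) / len(values)
```

If the same family were reused for both attributes, cross terms between $a$ and $b$ values would no longer cancel, which is why the slide asks for independent families.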

Estimation of $\mathrm{COUNT}(R_1 \bowtie \cdots \bowtie R_r)$
• For each of the $n$ equality-join constraints, build an independent family of pseudo-random variables
• For every relation $R_l(a_1, \ldots, a_m)$ compute samples of the random variable $X_{R_l}$ defined as:
  $X_{R_l} = \sum_{i_1}^{n_1} \cdots \sum_{i_m}^{n_m} f_{i_1, \ldots, i_m}\, \xi_{1, i_1} \cdots \xi_{m, i_m} = \sum_{t \in R_l} \xi_{1, t.a_1} \cdots \xi_{m, t.a_m}$
  $X = \prod_{l=1}^{r} X_{R_l}$
• Can show:
  $E[X] = \mathrm{COUNT}(R_1 \bowtie \cdots \bowtie R_r)$
  $\mathrm{Var}(X) \le 2^{2n} \prod_{l=1}^{r} SJ(R_l)$

Sketch Partitioning
Problem: large variance ⇒ loose estimation guarantees. Our solution: sketch partitioning.
  i   f_i   g_i
  1   20    2
  2   5     15
  3   10    3
  4   2     10
$\mathrm{Var}(X) \approx 2\, SJ(F)\, SJ(G) = 2 (20^2 + 5^2 + 10^2 + 2^2)(2^2 + 15^2 + 3^2 + 10^2) = 357604$
Idea: split the domain $I = \{1, 2, 3, 4\}$ into $I_1 = \{1, 3\}$ and $I_2 = \{2, 4\}$
• $F$ splits into $F_1$ and $F_2$, $G$ into $G_1$ and $G_2$
• Build $X_1$ to estimate $\mathrm{COUNT}(F_1 \bowtie G_1)$ and, independently, $X_2$ to estimate $\mathrm{COUNT}(F_2 \bowtie G_2)$
• Take $X' = X_1 + X_2$; then $E[X'] = \mathrm{COUNT}(F \bowtie G)$

Sketch Partitioning (cont.)
• Estimation of $\mathrm{COUNT}(F_1 \bowtie G_1)$
  i   f_i   g_i
  1   20    2
  3   10    3
  $\mathrm{Var}(X_1) \approx 2\, SJ(F_1)\, SJ(G_1) = 2 (20^2 + 10^2)(2^2 + 3^2) = 13000$
• Estimation of $\mathrm{COUNT}(F_2 \bowtie G_2)$
  i   f_i   g_i
  2   5     15
  4   2     10
  $\mathrm{Var}(X_2) \approx 2\, SJ(F_2)\, SJ(G_2) = 2 (5^2 + 2^2)(15^2 + 10^2) = 18850$
• $\mathrm{Var}(X') = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) = 31850$
• Improvement: $\frac{\mathrm{Var}(X)}{2\, \mathrm{Var}(X')} = \frac{357604}{2 \cdot 31850} \approx 5.6$ (the factor 2 accounts for each of the two sketches receiving half of the space)
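The arithmetic of this example is easy to reproduce. The sketch below evaluates the approximation $2\, SJ(F)\, SJ(G)$ on the whole domain and on each part; the helper name `var_bound` is an assumption introduced for illustration.

```python
f = {1: 20, 2: 5, 3: 10, 4: 2}
g = {1: 2, 2: 15, 3: 3, 4: 10}

def var_bound(keys):
    """2 * SJ(F) * SJ(G), restricted to the domain values in keys."""
    sj_f = sum(f[i] ** 2 for i in keys)
    sj_g = sum(g[i] ** 2 for i in keys)
    return 2 * sj_f * sj_g

whole = var_bound([1, 2, 3, 4])      # unpartitioned sketch
part1 = var_bound([1, 3])            # I_1
part2 = var_bound([2, 4])            # I_2
split = part1 + part2                # Var(X') for independent X_1, X_2

# Equal total space: each partition gets half the copies, hence the factor 2.
improvement = whole / (2 * split)
```

The gain comes from separating the values where $f_i$ is large from those where $g_i$ is large, so that large frequencies are never multiplied together in the self-join product.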

Binary Sketch Partitioning
• Prior information: historical data, histograms.
• Find the partitioning $I = I_1 \cup I_2$ and the space allocation $m = m_1 + m_2$ that minimize
  $\frac{\mathrm{Var}(X_1)}{m_1} + \frac{\mathrm{Var}(X_2)}{m_2}$,
  where $\mathrm{Var}(X_k) \approx 2 \sum_{i \in I_k} f_i^2 \sum_{i \in I_k} g_i^2$.
• Allocate space proportional to $\sqrt{\mathrm{Var}(X_k)}$. In the example: 5:6
• Only partitionings along the order of $f_i / g_i$ have to be examined to find the optimum ⇒ $O(|I|)$
• In the example the order is $\{1, 3, 2, 4\}$. The optimal partition is $\{1, 3\} \cup \{2, 4\}$.
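A minimal sketch of the search, assuming the frequencies are known in advance: sort the domain by $f_i / g_i$, try every cut point in that order, and keep the cut with the lowest cost. With space allocated proportionally to $\sqrt{\mathrm{Var}(X_k)}$, the objective becomes $(\sqrt{\mathrm{Var}(X_1)} + \sqrt{\mathrm{Var}(X_2)})^2 / m$, so the code compares the sum of square roots. The function names are hypothetical, values with $g_i = 0$ are placed first by assumption, and the sums are recomputed at every cut for clarity; running prefix sums would give the linear scan the slide refers to.

```python
import math

def var_bound(f, g, keys):
    """2 * sum of f_i^2 * sum of g_i^2 over the domain values in keys."""
    return 2 * sum(f[i] ** 2 for i in keys) * sum(g[i] ** 2 for i in keys)

def best_binary_partition(f, g):
    """Return (order, I_1, I_2); the domain needs at least two values."""
    # Decreasing f_i / g_i; values with g_i == 0 sort first (assumption).
    order = sorted(f, key=lambda i: f[i] / g[i] if g[i] else math.inf,
                   reverse=True)
    best = None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        # Cost under the optimal space split, up to the constant factor 1/m.
        cost = (math.sqrt(var_bound(f, g, left))
                + math.sqrt(var_bound(f, g, right)))
        if best is None or cost < best[0]:
            best = (cost, left, right)
    return order, best[1], best[2]

f = {1: 20, 2: 5, 3: 10, 4: 2}
g = {1: 2, 2: 15, 3: 3, 4: 10}
order, part1, part2 = best_binary_partition(f, g)

# Space ratio m_1 : m_2, proportional to the square roots of the variances.
ratio = math.sqrt(var_bound(f, g, part1) / var_bound(f, g, part2))
```

On the example data the ratio comes out close to 5:6, matching the allocation quoted on the slide.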
