Processing Complex Aggregate Queries over Data Streams
SIGMOD 2002
Alin Dobra, Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi
June 4, 2002

Processing Network Data Streams
[Figure: network diagram with six Telco/LAN routers, measurement streams R1, R2, R3, alarms, and a Network Operations Center that runs the data-stream join query.]
Data-stream join query:
  SELECT COUNT(*) FROM R1, R2, R3 WHERE R1.a = R2.b AND R2.b = R3.c
Dobra, Garofalakis, Gehrke and Rastogi – Processing Aggregate Queries over Streams

Computations over Streaming Data
[Figure: streams for R1, R2, ..., Rr enter a query-processing engine whose memory holds one sketch per stream (sketch for R1, sketch for R2, ..., sketch for Rr); given a query Q(R1, ..., Rr), the engine returns an approximate answer to Q.]
• Goal: approximately answer JOIN-COUNT and JOIN-SUM queries over streams

Outline of the Talk
• Motivation
• Sketch-based randomized algorithms
• Sketch-based approximation of aggregate query results
• Sketch partitioning for boosting estimation accuracy
• Experimental evaluation
• Summary

Sketch-Based Randomized Algorithms [AMS96]
• Estimate $F(D)$ for some function $F$ and some data $D$
Method:
• Build a probability space and a random variable $X$ with the properties:
  1) $E[X] = F(D) \ge L_E$
  2) $Var(X) \le U_V$
• Combine samples of $X$ to achieve relative error $\epsilon$ with probability at least $1 - \delta$
• Boost accuracy to $\epsilon$ by averaging $\frac{8 U_V}{\epsilon^2 L_E^2}$ pairwise independent samples of $X$
• Boost confidence to $1 - \delta$ by taking the median of $2 \log(1/\delta)$ averages
Example usage: frequency moments [AMS96], size of join [AGMS99], $L_1$ norm [FKSV99], wavelet decomposition [GKMS01]

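The average-then-median recipe can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the name `median_of_means`, the callable `sample_x`, and the base-2 logarithm are choices made here, and the Gaussian toy estimator only stands in for a real sketch-based X (its samples are fully independent, which is stronger than the pairwise independence the method needs).

```python
import math
import random
import statistics


def median_of_means(sample_x, var_bound, mean_lower_bound, eps, delta):
    """Combine samples of an unbiased estimator X into one estimate.

    sample_x          -- zero-argument callable returning a fresh sample of X
    var_bound         -- U_V, an upper bound on Var(X)
    mean_lower_bound  -- L_E, a lower bound on E[X]
    """
    # Samples per average: 8 * U_V / (eps^2 * L_E^2).
    s1 = math.ceil(8 * var_bound / (eps ** 2 * mean_lower_bound ** 2))
    # Number of averages: 2 * log(1 / delta); base 2 is assumed here.
    s2 = math.ceil(2 * math.log2(1 / delta))
    averages = [sum(sample_x() for _ in range(s1)) / s1 for _ in range(s2)]
    return statistics.median(averages)


# Toy stand-in for X (hypothetical): E[X] = 10, Var(X) = 4.
rng = random.Random(0)
estimate = median_of_means(lambda: rng.gauss(10.0, 2.0),
                           var_bound=4.0, mean_lower_bound=10.0,
                           eps=0.1, delta=0.05)
```

Averaging shrinks the variance until Chebyshev's inequality gives a constant success probability; the median over several averages then drives the failure probability down to δ.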
Sketch-Based Randomized Algorithms (cont.)
[Figure: a hash function h takes a seed s from a uniform random seed space (size 2^65) and a data value i (data values 1, 2, ..., 2^32 − 1) and outputs a value in {−1, +1}; the outputs form the family ξ of random variables.]
• $\xi_i(s) = h(s, i) \in \{-1, +1\}$
• The family $\xi$ is 4-wise independent, i.e. for all distinct $i_1, i_2, i_3, i_4$ and all $v_1, v_2, v_3, v_4 \in \{-1, +1\}$:
  $P[\xi_{i_1} = v_1 \wedge \xi_{i_2} = v_2 \wedge \xi_{i_3} = v_3 \wedge \xi_{i_4} = v_4] = P[\xi_{i_1} = v_1] \, P[\xi_{i_2} = v_2] \, P[\xi_{i_3} = v_3] \, P[\xi_{i_4} = v_4]$

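One common way to obtain such a family from a short seed is a random cubic polynomial over a prime field. The Python sketch below is an assumption made for illustration: the slides do not specify h, and the class name `FourWiseSigns` and the prime 2^61 − 1 are chosen here. The polynomial values at any four distinct points are independent and uniform over the field; reducing them to a sign through the lowest bit introduces a tiny bias because the prime is odd, so the resulting ±1 family is only approximately 4-wise independent.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as field size (an assumption of this sketch)


class FourWiseSigns:
    """xi_i(s) = h(s, i), where the seed s is four random coefficients in Z_P."""

    def __init__(self, seed):
        rng = random.Random(seed)
        # Coefficients a0..a3 of a random polynomial of degree at most 3.
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def __call__(self, i):
        # Horner evaluation of a3*i^3 + a2*i^2 + a1*i + a0 modulo P.
        acc = 0
        for c in reversed(self.coeffs):
            acc = (acc * i + c) % P
        # Map the field element to {-1, +1} through its lowest bit.
        return 1 if acc & 1 else -1


xi = FourWiseSigns(seed=42)
signs = [xi(i) for i in range(1, 11)]
```

Only the four coefficients need to be stored, so the memory cost per family is independent of the domain size.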
Estimation of COUNT(F ⋈_a G) [AGMS99]
Tuples of F (attribute a): 1, 1, 2, 3, 1, 3
Tuples of G (attribute a): 3, 3, 1, 1, 1
Frequency vectors:
  i   f_i   g_i
  1    3     3
  2    1     0
  3    2     2
• Estimate $COUNT(F \bowtie_a G) = \sum_{i=1}^{3} f_i g_i = 3 \cdot 3 + 1 \cdot 0 + 2 \cdot 2 = 13$

Estimation of COUNT(F ⋈_a G) (cont.)
  i      1    2    3
  ξ_i   −1   +1   −1

$X_F = \sum_{t \in F} \xi_{t.a}$
  F.a   ξ_a   running sum
   1    −1       −1
   1    −1       −2
   2    +1       −1
   3    −1       −2
   1    −1       −3
   3    −1       −4

$X_G = \sum_{t \in G} \xi_{t.a}$
  G.a   ξ_a   running sum
   3    −1       −1
   3    −1       −2
   1    −1       −3
   1    −1       −4
   1    −1       −5

$X = X_F X_G = (-4)(-5) = 20$, a single-sample estimate of the true count 13
$SJ(F) = 3 \cdot 3 + 1 \cdot 1 + 2 \cdot 2 = 14$, $SJ(G) = 3 \cdot 3 + 0 \cdot 0 + 2 \cdot 2 = 13$

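The single-pass nature of the estimator is easy to see in code: each stream only maintains a running sum of ξ values. The Python sketch below replays the example streams under the ξ assignment (−1, +1, −1); the function names `sketch`, `self_join_size`, and `join_count` are illustrative choices, and the exact quantities are computed only to compare against the one-sample estimate.

```python
from collections import Counter


def sketch(stream, xi):
    """Atomic sketch of a stream of join-attribute values: sum of xi[value]."""
    total = 0
    for value in stream:
        total += xi[value]  # one addition per arriving tuple
    return total


def self_join_size(stream):
    """SJ = sum of squared frequencies of the attribute values."""
    return sum(c * c for c in Counter(stream).values())


def join_count(f_stream, g_stream):
    """Exact COUNT of the equi-join, computed from frequency vectors."""
    f, g = Counter(f_stream), Counter(g_stream)
    return sum(f[v] * g[v] for v in f)


F_A = [1, 1, 2, 3, 1, 3]       # attribute a of the tuples of F
G_A = [3, 3, 1, 1, 1]          # attribute a of the tuples of G
XI = {1: -1, 2: +1, 3: -1}     # one fixed draw of the xi family

x_f = sketch(F_A, XI)
x_g = sketch(G_A, XI)
x = x_f * x_g                  # one sample of the estimator X
exact = join_count(F_A, G_A)
```

A single sample can be far from the exact count; accuracy comes from averaging many independent copies, each with its own ξ family.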
Estimation of COUNT(F ⋈_a G) (cont.)
• To estimate $COUNT(F \bowtie_a G) = \sum_{i=1}^{n} f_i g_i$ define:
  $X_F = \sum_{i=1}^{n} f_i \xi_i = \sum_{t \in F} \xi_{t.a}$
  $X_G = \sum_{i=1}^{n} g_i \xi_i = \sum_{t \in G} \xi_{t.a}$
• With $X = X_F X_G$ we have:
  $E[X] = E\left[ \sum_{i=1}^{n} f_i g_i \xi_i^2 + \sum_{i \ne i'} f_i g_{i'} \xi_i \xi_{i'} \right] = COUNT(F \bowtie_a G)$
  $Var(X) \le 2 \, SJ(F) \, SJ(G)$

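For a tiny domain, the expectation and the variance bound can be checked exactly by enumerating every sign assignment. The Python sketch below does this for the frequency vectors f = (3, 1, 2) and g = (3, 0, 2); it assumes fully independent uniform signs (which are in particular 4-wise independent), and the helper name `exact_moments` is hypothetical.

```python
from itertools import product


def exact_moments(f, g):
    """Exact mean and variance of X = X_F * X_G over all sign vectors."""
    values = []
    for signs in product((-1, 1), repeat=len(f)):
        x_f = sum(fi * s for fi, s in zip(f, signs))
        x_g = sum(gi * s for gi, s in zip(g, signs))
        values.append(x_f * x_g)
    mean = sum(values) / len(values)
    var = sum(v * v for v in values) / len(values) - mean ** 2
    return mean, var


f = [3, 1, 2]
g = [3, 0, 2]
mean, var = exact_moments(f, g)
sj_f = sum(v * v for v in f)   # self-join size of F
sj_g = sum(v * v for v in g)   # self-join size of G
bound = 2 * sj_f * sj_g        # the variance bound 2 * SJ(F) * SJ(G)
```

The cross terms vanish in the mean because E[ξ_i ξ_i'] = 0 for i ≠ i', which leaves exactly the sum of f_i g_i.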
Using Sketches to Answer SUM Queries
• Estimate $SUM_b(F(a) \bowtie_a G(a, b)) = \sum_{i=1}^{3} f_i \left( \sum_{t \in G, \, t.a = i} t.b \right)$
  i      1    2    3
  ξ_i   −1   +1   −1

$X_F = \sum_{t \in F} \xi_{t.a}$
  F.a   ξ_a   running sum
   1    −1       −1
   1    −1       −2
   2    +1       −1
   3    −1       −2
   1    −1       −3
   3    −1       −4

$X_G = \sum_{t \in G} t.b \, \xi_{t.a}$
  G.a   G.b   ξ_a   running sum
   3     2    −1       −2
   3     2    −1       −4
   1     1    −1       −5
   1     2    −1       −7
   1     1    −1       −8

$X = X_F X_G = (-4)(-8) = 32$, a single-sample estimate of the true sum 20

Using Sketches to Answer SUM Queries (cont.)
• To estimate $SUM_b(F(a) \bowtie_a G(a, b)) = \sum_{i=1}^{n} f_i \left( \sum_{t \in G, \, t.a = i} t.b \right)$ define:
  $X_F = \sum_{i=1}^{n} f_i \xi_i = \sum_{t \in F} \xi_{t.a}$
  $X_G = \sum_{i=1}^{n} \left( \sum_{t \in G, \, t.a = i} t.b \right) \xi_i = \sum_{t \in G} t.b \, \xi_{t.a}$
• With $X = X_F X_G$:
  $E[X] = SUM_b(F(a) \bowtie_a G(a, b))$
  $Var(X) \le 2 \, SJ(F) \sum_{i=1}^{n} \left( \sum_{t \in G, \, t.a = i} t.b \right)^2$

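The only change from the COUNT case is that each tuple of G contributes its b value times the sign of its a value. A minimal Python sketch on the example relations follows; the function names `count_sketch`, `sum_sketch`, and `exact_sum` are illustrative, and the exact answer is computed by brute force purely for comparison.

```python
from collections import Counter


def count_sketch(a_values, xi):
    """X_F: sum of xi[t.a] over the tuples of F."""
    return sum(xi[a] for a in a_values)


def sum_sketch(ab_tuples, xi):
    """X_G: sum of t.b * xi[t.a] over the tuples (a, b) of G."""
    return sum(b * xi[a] for a, b in ab_tuples)


def exact_sum(a_values, ab_tuples):
    """Exact SUM_b over the join: each G tuple joins with f[t.a] tuples of F."""
    f = Counter(a_values)
    return sum(f[a] * b for a, b in ab_tuples)


F_A = [1, 1, 2, 3, 1, 3]
G_AB = [(3, 2), (3, 2), (1, 1), (1, 2), (1, 1)]
XI = {1: -1, 2: +1, 3: -1}

x_f = count_sketch(F_A, XI)
x_g = sum_sketch(G_AB, XI)
x = x_f * x_g                   # one sample of the SUM estimator
exact = exact_sum(F_A, G_AB)
```

Because the update for a new tuple is a single multiply-add, the SUM sketch is maintained in one pass just like the COUNT sketch.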
Extension to COUNT(F ⋈_a G ⋈_b H)
• Key idea: use independent ξ families for each join attribute
  i        1    2    3          j        1    2
  ξ^a_i   −1   +1   −1          ξ^b_j   +1   −1

$X_F = \sum_{t \in F} \xi^a_{t.a}$
  F.a   ξ^a   running sum
   1    −1       −1
   1    −1       −2
   2    +1       −1
   3    −1       −2
   1    −1       −3
   3    −1       −4

$X_G = \sum_{t \in G} \xi^a_{t.a} \, \xi^b_{t.b}$
  G.a   G.b   ξ^a   ξ^b   running sum
   3     2    −1    −1       +1
   3     2    −1    −1       +2
   1     1    −1    +1       +1
   1     2    −1    −1       +2
   1     1    −1    +1       +1

$X_H = \sum_{t \in H} \xi^b_{t.b}$
  H.b   ξ^b   running sum
   2    −1       −1
   2    −1       −2
   1    +1       −1
   2    −1       −2

$X = X_F X_G X_H = (-4)(1)(-2) = 8$, a single-sample estimate of the true count 27

Extension to COUNT(F ⋈_a G ⋈_b H) (cont.)
• To estimate
  $COUNT(F \bowtie_a G \bowtie_b H) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} f_i \, g_{ij} \, h_j$
• Define:
  $X_F = \sum_{i=1}^{n_1} f_i \xi^a_i, \quad X_G = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} g_{ij} \xi^a_i \xi^b_j, \quad X_H = \sum_{j=1}^{n_2} h_j \xi^b_j$
• If $\xi^a$ and $\xi^b$ are independent families of ±1 4-wise independent pseudo-random variables:
  $E[X_F X_G X_H] = COUNT(F \bowtie_a G \bowtie_b H)$
  $Var(X_F X_G X_H) \le 4 \, SJ(F) \, SJ(G) \, SJ(H)$

Estimation of COUNT(R_1 ⋈ ··· ⋈ R_r)
• For each of the n equality join constraints, build an independent family of pseudo-random variables
• For every relation $R_l(a_1, \ldots, a_m)$, compute samples of the random variable $X_{R_l}$ defined as:
  $X_{R_l} = \sum_{i_1}^{n_1} \cdots \sum_{i_m}^{n_m} f_{i_1, \ldots, i_m} \, \xi_{1, i_1} \cdots \xi_{m, i_m} = \sum_{t \in R_l} \xi_{1, t.a_1} \cdots \xi_{m, t.a_m}$
  $X = \prod_{l=1}^{r} X_{R_l}$
• Can show:
  $E[X] = COUNT(R_1 \bowtie \cdots \bowtie R_r)$
  $Var(X) \le 2^{2n} \prod_{l=1}^{r} SJ(R_l)$

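The general construction reduces to one rule: a relation's sketch adds, for each tuple, the product of the ξ values of its join attributes. The Python sketch below applies that rule to three toy relations F(a), G(a, b), H(b) as read from the three-way example; the function name `relation_sketch` and the tuple encoding are assumptions made here, and the brute-force count is included only for comparison.

```python
from math import prod


def relation_sketch(tuples, families):
    """Sum over tuples of the product of xi_k[t.a_k] across join attributes.

    tuples   -- iterable of tuples holding the join-attribute values
    families -- one xi mapping per attribute position, in the same order
    """
    return sum(prod(xi[v] for xi, v in zip(families, t)) for t in tuples)


XI_A = {1: -1, 2: +1, 3: -1}    # family for the join on attribute a
XI_B = {1: +1, 2: -1}           # independent family for the join on b

F = [(1,), (1,), (2,), (3,), (1,), (3,)]
G = [(3, 2), (3, 2), (1, 1), (1, 2), (1, 1)]
H = [(2,), (2,), (1,), (2,)]

x_f = relation_sketch(F, [XI_A])
x_g = relation_sketch(G, [XI_A, XI_B])
x_h = relation_sketch(H, [XI_B])
x = x_f * x_g * x_h             # one sample of the multi-join estimator

# Brute-force join size, feasible only for toy inputs.
exact = sum(1 for (fa,) in F for (ga, gb) in G for (hb,) in H
            if fa == ga and gb == hb)
```

The variance bound grows with the number of join constraints, so a single sample is typically far from the exact count, and many more copies are needed than in the two-way case.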
Sketch Partitioning
Problem: large variance ⇒ loose estimation guarantees. Our solution: sketch partitioning
  i   f_i   g_i
  1   20     2
  2    5    15
  3   10     3
  4    2    10
$Var(X) \approx 2 \, SJ(F) \, SJ(G) = 2(20^2 + 5^2 + 10^2 + 2^2)(2^2 + 15^2 + 3^2 + 10^2) = 357604$
Idea: split domain $I = \{1, 2, 3, 4\}$ into $I_1 = \{1, 3\}$ and $I_2 = \{2, 4\}$
• F splits into $F_1$ and $F_2$, G into $G_1$ and $G_2$
• Build $X_1$ to estimate $COUNT(F_1 \bowtie G_1)$ and, independently, $X_2$ to estimate $COUNT(F_2 \bowtie G_2)$
• Take $X' = X_1 + X_2$; then $E[X'] = COUNT(F \bowtie G)$

Sketch Partitioning (cont.)
• Estimation of $COUNT(F_1 \bowtie G_1)$
  i   f_i   g_i
  1   20     2
  3   10     3
  $Var(X_1) \approx 2 \, SJ(F_1) \, SJ(G_1) = 2(20^2 + 10^2)(2^2 + 3^2) = 13000$
• Estimation of $COUNT(F_2 \bowtie G_2)$
  i   f_i   g_i
  2    5    15
  4    2    10
  $Var(X_2) \approx 2 \, SJ(F_2) \, SJ(G_2) = 2(5^2 + 2^2)(15^2 + 10^2) = 18850$
• $Var(X') = Var(X_1) + Var(X_2) = 31850$
• Improvement: $\frac{Var(X)}{2 \, Var(X')} = \frac{357604}{2 \cdot 31850} \approx 5.6$ (the factor 2 reflects that, for the same total space, each partition receives half of the sketch copies)

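The arithmetic of the partitioning example can be reproduced directly from the frequency vectors. The Python sketch below is only a worked check of the numbers; the helper names `self_join` and `variance_bound` are illustrative, and the bound 2·SJ(F)·SJ(G) is used as the variance approximation.

```python
def self_join(freq, domain):
    """Self-join size restricted to a sub-domain: sum of squared frequencies."""
    return sum(freq[i] ** 2 for i in domain)


def variance_bound(f, g, domain):
    """Approximate Var(X) on a sub-domain as 2 * SJ(F) * SJ(G)."""
    return 2 * self_join(f, domain) * self_join(g, domain)


F = {1: 20, 2: 5, 3: 10, 4: 2}
G = {1: 2, 2: 15, 3: 3, 4: 10}

var_full = variance_bound(F, G, [1, 2, 3, 4])
var_1 = variance_bound(F, G, [1, 3])
var_2 = variance_bound(F, G, [2, 4])
var_split = var_1 + var_2

# Same total space, split equally between the two partitions.
improvement = var_full / (2 * var_split)
```

The gain comes from separating values where f is large and g is small from values where the opposite holds, so neither self-join size is inflated by the other relation's heavy values.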
Binary Sketch Partitioning
• Prior information: historical data, histograms.
• Find the partitioning $I = I_1 \cup I_2$ and the space allocation $m = m_1 + m_2$ that minimize
  $\frac{Var(X_1)}{m_1} + \frac{Var(X_2)}{m_2}$, where $Var(X_k) \approx 2 \sum_{i \in I_k} f_i^2 \sum_{i \in I_k} g_i^2$.
• Allocate space proportional to $\sqrt{Var(X_k)}$. In the example: 5:6
• Only partitions that split the domain along the order of $f_i / g_i$ need to be examined to find the optimum ⇒ $O(|I|)$
• In the example the order is {1, 3, 2, 4}. The optimal partition is {1, 3} ∪ {2, 4}.

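The search over candidate splits can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the authors' algorithm as implemented: the name `best_binary_partition` is hypothetical, values with g_i = 0 are placed first in the order, and each candidate is re-evaluated from scratch for clarity (prefix sums would make each evaluation constant time). With space allocated proportionally to the square root of each variance, the objective Var(X_1)/m_1 + Var(X_2)/m_2 equals (√Var(X_1) + √Var(X_2))² / m, so only that numerator is compared.

```python
import math


def variance_bound(f, g, domain):
    """Approximate Var(X_k) as 2 * sum(f_i^2) * sum(g_i^2) over the sub-domain."""
    return 2 * sum(f[i] ** 2 for i in domain) * sum(g[i] ** 2 for i in domain)


def best_binary_partition(f, g):
    """Try every split of the domain sorted by f_i / g_i (descending)."""
    order = sorted(f, key=lambda i: f[i] / g[i] if g[i] else math.inf,
                   reverse=True)
    best = None
    for cut in range(1, len(order)):
        i1, i2 = order[:cut], order[cut:]
        root1 = math.sqrt(variance_bound(f, g, i1))
        root2 = math.sqrt(variance_bound(f, g, i2))
        cost = (root1 + root2) ** 2      # objective times total space m
        if best is None or cost < best[2]:
            best = (i1, i2, cost, root1 / root2)
    return order, best


F = {1: 20, 2: 5, 3: 10, 4: 2}
G = {1: 2, 2: 15, 3: 3, 4: 10}
order, (part1, part2, cost, space_ratio) = best_binary_partition(F, G)
```

On the example frequencies, `space_ratio` is the m_1 : m_2 allocation for the chosen split and comes out close to 5:6.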