
Processing Complex Aggregate Queries over Data Streams (SIGMOD 2002)


  1. Processing Complex Aggregate Queries over Data Streams. SIGMOD 2002. Alin Dobra, Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi. June 4, 2002.

  2. Processing Network Data Streams
     [Figure: Telco/LAN routers send measurement and alarm streams R1, R2, R3 to a Network Operations Center, which runs a data-stream join query.]
     Data-stream join query: SELECT COUNT(*) FROM R1, R2, R3 WHERE R1.a = R2.b AND R2.b = R3.c
     Dobra, Garofalakis, Gehrke and Rastogi – Processing Aggregate Queries over Streams

  3. Computations over Streaming Data
     [Figure: the streams for R1, R2, ..., Rr feed a query-processing engine that keeps one sketch per stream (for R1, for R2, ..., for Rr) in memory and returns an approximate answer to a query Q(R1, ..., Rr).]
     • Goal: approximately answer JOIN-COUNT and JOIN-SUM queries over streams

  4. Outline of the Talk
     • Motivation
     • Sketch-based randomized algorithms
     • Sketch-based approximation of aggregate query results
     • Sketch partitioning to boost estimation accuracy
     • Experimental evaluation
     • Summary

  5. Sketch-Based Randomized Algorithms [AMS96]
     • Estimate $F(D)$ for some function $F$ and some data $D$
     • Method: build a probability space and a random variable $X$ with the properties (1) $E[X] = F(D) \ge L_E$ and (2) $\mathrm{Var}(X) \le U_V$
     • Combine samples of $X$ to achieve relative error $\epsilon$ with probability at least $1 - \delta$
     • Boost accuracy to $\epsilon$ by averaging $\frac{8 U_V}{\epsilon^2 L_E^2}$ pairwise independent samples of $X$
     • Boost confidence to $1 - \delta$ by taking the median of $2 \log(1/\delta)$ averages
     • Example usage: frequency moments [AMS96], size of join [AGMS99], $L_1$ norm [FKSV99], wavelet decomposition [GKMS01]
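
The averaging and median steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the logarithm is taken base 2 (the slide does not state the base), and the noisy sampler is a made-up stand-in for a real sketch-based random variable $X$.

```python
import math
import random
import statistics


def samples_per_average(var_upper, mean_lower, eps):
    """Samples per average: 8 * U_V / (eps^2 * L_E^2), rounded up."""
    return math.ceil(8 * var_upper / (eps ** 2 * mean_lower ** 2))


def number_of_averages(delta):
    """Averages fed to the median: 2 * log(1/delta), rounded up.

    Assumption: log base 2 (the slide does not state the base).
    """
    return math.ceil(2 * math.log2(1 / delta))


def median_of_means(draw_sample, s1, s2):
    """Median of s2 averages, each taken over s1 fresh samples of X."""
    averages = [
        sum(draw_sample() for _ in range(s1)) / s1
        for _ in range(s2)
    ]
    return statistics.median(averages)


if __name__ == "__main__":
    rng = random.Random(0)

    def draw():
        # Hypothetical X: true value 13 plus zero-mean noise (std 19)
        return 13 + rng.gauss(0, 19)

    s1 = samples_per_average(var_upper=364, mean_lower=13, eps=0.5)
    s2 = number_of_averages(delta=0.05)
    print(s1, s2, median_of_means(draw, s1, s2))
```

With $U_V = 364$, $L_E = 13$ and $\epsilon = 0.5$ this asks for 69 samples per average, and $\delta = 0.05$ asks for 9 averages under the base-2 assumption.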

  6. Sketch-Based Randomized Algorithms (cont.)
     [Figure: a function $h$ combines a seed $s$ from a uniform random seed space (size $2^{65}$) with a data value in $1, \dots, 2^{32} - 1$ to produce the family $\xi$ of $\pm 1$ random variables.]
     • $\xi_i(s) = h(s, i) \in \{-1, +1\}$
     • The family $\xi$ is 4-wise independent, i.e. for all pairwise distinct $i_1, i_2, i_3, i_4$ and all $v_1, v_2, v_3, v_4 \in \{-1, +1\}$: $P[\xi_{i_1} = v_1 \wedge \xi_{i_2} = v_2 \wedge \xi_{i_3} = v_3 \wedge \xi_{i_4} = v_4] = P[\xi_{i_1} = v_1]\, P[\xi_{i_2} = v_2]\, P[\xi_{i_3} = v_3]\, P[\xi_{i_4} = v_4]$
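
The slide does not say how $h$ is built. One well-known way to get a 4-wise independent $\pm 1$ family is a BCH-code-style construction over $GF(2^k)$ with a seed of $2k + 1$ bits, which would be consistent with a $2^{65}$ seed space for 32-bit values; whether the authors use it is an assumption here. The sketch below uses a toy field with $k = 3$ so that 4-wise independence can be checked exhaustively. All names are hypothetical.

```python
from itertools import combinations, product

K = 3            # toy field GF(2^3); the slides use 32-bit values
POLY = 0b1011    # x^3 + x + 1, irreducible over GF(2)


def gf_mul(x, y):
    """Multiply x and y in GF(2^K): carry-less multiply, reduce by POLY."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        y >>= 1
        x <<= 1
        if x >> K:          # degree reached K: reduce
            x ^= POLY
    return result


def parity(v):
    """Parity of the number of set bits in v."""
    return bin(v).count("1") & 1


def xi(seed, i):
    """xi_i(s) in {-1, +1} for a seed s = (s0, s1, s2) of 1 + K + K bits."""
    s0, s1, s2 = seed
    cube = gf_mul(gf_mul(i, i), i)
    bit = s0 ^ parity(s1 & i) ^ parity(s2 & cube)
    return 1 - 2 * bit


def is_four_wise_independent():
    """Check that every 4 distinct indices see all 16 sign patterns equally often."""
    seeds = list(product(range(2), range(1 << K), range(1 << K)))
    for idx in combinations(range(1 << K), 4):
        counts = {}
        for s in seeds:
            pattern = tuple(xi(s, i) for i in idx)
            counts[pattern] = counts.get(pattern, 0) + 1
        if len(counts) != 16 or set(counts.values()) != {len(seeds) // 16}:
            return False
    return True


print(is_four_wise_independent())
```

The point of this design is that a short seed stands in for a full table of random signs, so $\xi_i$ can be recomputed on demand for each arriving tuple.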

  7. Estimation of $\mathrm{COUNT}(F \bowtie_a G)$ [AGMS99]
     • Values of $a$ in $F$: 1, 1, 2, 3, 1, 3 ⇒ frequencies $f_1 = 3$, $f_2 = 1$, $f_3 = 2$
     • Values of $a$ in $G$: 3, 3, 1, 1, 1 ⇒ frequencies $g_1 = 3$, $g_2 = 0$, $g_3 = 2$
     • Estimate $\mathrm{COUNT}(F \bowtie_a G) = \sum_{i=1}^{3} f_i g_i = 3 \cdot 3 + 1 \cdot 0 + 2 \cdot 2 = 13$

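  8. Estimation of $\mathrm{COUNT}(F \bowtie_a G)$ (cont.)
     • Random signs: $\xi_1 = -1$, $\xi_2 = +1$, $\xi_3 = -1$
     • $X_F = \sum_{t \in F} \xi_{t.a}$ and $X_G = \sum_{t \in G} \xi_{t.a}$, accumulated tuple by tuple:

     | $F.a$ | $\xi_a$ | running $\sum_{t \in F} \xi_{t.a}$ |
     |---|---|---|
     | 1 | $-1$ | $-1$ |
     | 1 | $-1$ | $-2$ |
     | 2 | $+1$ | $-1$ |
     | 3 | $-1$ | $+0$ |
     | 1 | $-1$ | $-1$ |
     | 3 | $-1$ | $-2$ |

     | $G.a$ | $\xi_a$ | running $\sum_{t \in G} \xi_{t.a}$ |
     |---|---|---|
     | 3 | $-1$ | $-1$ |
     | 3 | $-1$ | $-2$ |
     | 1 | $-1$ | $-3$ |
     | 1 | $-1$ | $-4$ |
     | 1 | $-1$ | $-5$ |

     • $X = X_F X_G = (-2) \cdot (-5) = 10 \approx 13$
     • $\mathrm{SJ}(F) = 3 \cdot 3 + 1 \cdot 1 + 2 \cdot 2 = 14$, $\mathrm{SJ}(G) = 13$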

  9. Estimation of $\mathrm{COUNT}(F \bowtie_a G)$ (cont.)
     • To estimate $\mathrm{COUNT}(F \bowtie_a G) = \sum_{i=1}^{n} f_i g_i$ define: $X_F = \sum_{i=1}^{n} f_i \xi_i = \sum_{t \in F} \xi_{t.a}$ and $X_G = \sum_{i=1}^{n} g_i \xi_i = \sum_{t \in G} \xi_{t.a}$
     • With $X = X_F X_G$ we have: $E[X] = E\left[\sum_{i=1}^{n} f_i g_i \xi_i^2 + \sum_{i \ne i'} f_i g_{i'} \xi_i \xi_{i'}\right] = \mathrm{COUNT}(F \bowtie_a G)$, because $\xi_i^2 = 1$ and $E[\xi_i \xi_{i'}] = 0$ for $i \ne i'$
     • $\mathrm{Var}(X) \le 2\, \mathrm{SJ}(F)\, \mathrm{SJ}(G)$, where $\mathrm{SJ}(F) = \sum_{i=1}^{n} f_i^2$ is the self-join size of $F$
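
The expectation and variance claims can be checked exactly on the three-value example by enumerating every sign vector. This is a minimal sketch, assuming fully independent uniform signs (which are in particular 4-wise independent); the function names are hypothetical.

```python
from itertools import product


def sketch(freqs, signs):
    """Atomic sketch: sum_i freqs[i] * signs[i]."""
    return sum(f * s for f, s in zip(freqs, signs))


def self_join(freqs):
    """Self-join size SJ = sum_i freqs[i]^2."""
    return sum(f * f for f in freqs)


def exact_moments(f, g):
    """Mean and variance of X = X_F * X_G over all sign vectors.

    Assumption: fully independent uniform signs, a special case of
    4-wise independence.
    """
    values = [
        sketch(f, signs) * sketch(g, signs)
        for signs in product((-1, 1), repeat=len(f))
    ]
    mean = sum(values) / len(values)
    var = sum(v * v for v in values) / len(values) - mean ** 2
    return mean, var


f = [3, 1, 2]   # frequencies of a = 1, 2, 3 in F (slide 7)
g = [3, 0, 2]   # frequencies of a = 1, 2, 3 in G (slide 7)
mean, var = exact_moments(f, g)
bound = 2 * self_join(f) * self_join(g)
print(mean, var, bound)   # 13.0 157.0 364
```

The mean equals the true join size 13, and the exact variance 157 sits below the bound $2\,\mathrm{SJ}(F)\,\mathrm{SJ}(G) = 364$.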

  10. Outline of the Talk
      • Motivation
      • Sketch-based randomized algorithms
      • Sketch-based approximation of aggregate query results
      • Sketch partitioning to boost estimation accuracy
      • Experimental evaluation
      • Summary

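  11. Using Sketches to Answer SUM Queries
      • Estimate $\mathrm{SUM}_b(F(a) \bowtie_a G(a, b)) = \sum_{i=1}^{3} f_i \left( \sum_{t \in G,\, t.a = i} t.b \right)$
      • Random signs: $\xi_1 = -1$, $\xi_2 = +1$, $\xi_3 = -1$
      • $X_F = \sum_{t \in F} \xi_{t.a}$ and $X_G = \sum_{t \in G} t.b\, \xi_{t.a}$, accumulated tuple by tuple:

      | $F.a$ | $\xi_a$ | running $\sum_{t \in F} \xi_{t.a}$ |
      |---|---|---|
      | 1 | $-1$ | $-1$ |
      | 1 | $-1$ | $-2$ |
      | 2 | $+1$ | $-1$ |
      | 3 | $-1$ | $+0$ |
      | 1 | $-1$ | $-1$ |
      | 3 | $-1$ | $-2$ |

      | $G.a$ | $G.b$ | $\xi_a$ | running $\sum_{t \in G} t.b\, \xi_{t.a}$ |
      |---|---|---|---|
      | 3 | 2 | $-1$ | $-2$ |
      | 3 | 2 | $-1$ | $-4$ |
      | 1 | 1 | $-1$ | $-5$ |
      | 1 | 2 | $-1$ | $-7$ |
      | 1 | 1 | $-1$ | $-8$ |

      • $X = X_F X_G = (-2) \cdot (-8) = 16 \approx 20$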

  12. Using Sketches to Answer SUM Queries (cont.)
      • To estimate $\mathrm{SUM}_b(F(a) \bowtie_a G(a, b)) = \sum_{i=1}^{n} f_i \left( \sum_{t \in G,\, t.a = i} t.b \right)$ define: $X_F = \sum_{i=1}^{n} f_i \xi_i = \sum_{t \in F} \xi_{t.a}$ and $X_G = \sum_{i=1}^{n} \left( \sum_{t \in G,\, t.a = i} t.b \right) \xi_i = \sum_{t \in G} t.b\, \xi_{t.a}$
      • With $X = X_F X_G$: $E[X] = \mathrm{SUM}_b(F(a) \bowtie_a G(a, b))$ and $\mathrm{Var}(X) \le 2\, \mathrm{SJ}(F) \sum_{i=1}^{n} \left( \sum_{t \in G,\, t.a = i} t.b \right)^2$
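
A small check of the SUM estimator on the tuples of the worked example: the only change from the COUNT sketch is that each $G$ tuple contributes $t.b\, \xi_{t.a}$ instead of $\xi_{t.a}$. The sketch assumes fully independent signs and enumerates all of them; helper names are hypothetical.

```python
from itertools import product

# Tuples from the worked example: F has attribute a, G has (a, b)
F = [1, 1, 2, 3, 1, 3]
G = [(3, 2), (3, 2), (1, 1), (1, 2), (1, 1)]
DOMAIN = [1, 2, 3]


def sketch_f(xi):
    """X_F = sum over tuples t in F of xi[t.a]."""
    return sum(xi[a] for a in F)


def sketch_g(xi):
    """X_G = sum over tuples t in G of t.b * xi[t.a]."""
    return sum(b * xi[a] for a, b in G)


def exact_sum_join():
    """SUM_b(F join_a G) computed directly by nested loops, for comparison."""
    return sum(b for a_f in F for a_g, b in G if a_f == a_g)


def expected_estimate():
    """E[X_F * X_G] over all sign assignments (assumed fully independent)."""
    values = []
    for signs in product((-1, 1), repeat=len(DOMAIN)):
        xi = dict(zip(DOMAIN, signs))
        values.append(sketch_f(xi) * sketch_g(xi))
    return sum(values) / len(values)


print(exact_sum_join(), expected_estimate())   # 20 20.0
```

Both sketches are updated with one addition per arriving tuple, so neither relation has to be stored.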

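  13. Extension to $\mathrm{COUNT}(F \bowtie_a G \bowtie_b H)$
      • Key idea: use independent $\xi$ families for each join attribute
      • $\xi^a_1 = -1$, $\xi^a_2 = +1$, $\xi^a_3 = -1$; $\xi^b_1 = +1$, $\xi^b_2 = -1$
      • $X_F = \sum_{t \in F} \xi^a_{t.a}$, $X_G = \sum_{t \in G} \xi^a_{t.a} \xi^b_{t.b}$, $X_H = \sum_{t \in H} \xi^b_{t.b}$, accumulated tuple by tuple:

      | $F.a$ | $\xi^a_{t.a}$ | running $X_F$ |
      |---|---|---|
      | 1 | $-1$ | $-1$ |
      | 1 | $-1$ | $-2$ |
      | 2 | $+1$ | $-1$ |
      | 3 | $-1$ | $+0$ |
      | 1 | $-1$ | $-1$ |
      | 3 | $-1$ | $-2$ |

      | $G.a$ | $G.b$ | $\xi^a_{t.a}$ | $\xi^b_{t.b}$ | running $X_G$ |
      |---|---|---|---|---|
      | 3 | 2 | $-1$ | $-1$ | $-1$ |
      | 3 | 2 | $-1$ | $-1$ | $-2$ |
      | 1 | 1 | $-1$ | $+1$ | $-1$ |
      | 1 | 2 | $-1$ | $-1$ | $0$ |
      | 1 | 1 | $-1$ | $+1$ | $1$ |

      | $H.b$ | $\xi^b_{t.b}$ | running $X_H$ |
      |---|---|---|
      | 2 | $-1$ | $-1$ |
      | 2 | $-1$ | $-2$ |
      | 1 | $+1$ | $-1$ |
      | 2 | $-1$ | $-2$ |

      • $X = X_F X_G X_H = (-2) \cdot 1 \cdot (-2) = 4 \approx 21$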

  14. Extension to $\mathrm{COUNT}(F \bowtie_a G \bowtie_b H)$ (cont.)
      • To estimate $\mathrm{COUNT}(F \bowtie_a G \bowtie_b H) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} f_i\, g_{ij}\, h_j$
      • Define: $X_F = \sum_{i=1}^{n_1} f_i \xi^a_i$, $X_G = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} g_{ij}\, \xi^a_i \xi^b_j$, $X_H = \sum_{j=1}^{n_2} h_j \xi^b_j$
      • If $\xi^a$ and $\xi^b$ are independent families of $\pm 1$ 4-wise independent pseudo-random variables, then $E[X_F X_G X_H] = \mathrm{COUNT}(F \bowtie_a G \bowtie_b H)$ and $\mathrm{Var}(X_F X_G X_H) \le 4\, \mathrm{SJ}(F)\, \mathrm{SJ}(G)\, \mathrm{SJ}(H)$

  15. Estimation of $\mathrm{COUNT}(R_1 \bowtie \cdots \bowtie R_r)$
      • For each of the $n$ equality join constraints, build an independent family of pseudo-random variables
      • For every relation $R_l(a_1, \dots, a_m)$ compute samples of the random variable $X_{R_l}$ defined as: $X_{R_l} = \sum_{i_1}^{n_1} \cdots \sum_{i_m}^{n_m} f_{i_1, \dots, i_m}\, \xi_{1, i_1} \cdots \xi_{m, i_m} = \sum_{t \in R_l} \xi_{1, t.a_1} \cdots \xi_{m, t.a_m}$, and set $X = \prod_{l=1}^{r} X_{R_l}$
      • Can show: $E[X] = \mathrm{COUNT}(R_1 \bowtie \cdots \bowtie R_r)$ and $\mathrm{Var}(X) \le 2^{2n} \prod_{l=1}^{r} \mathrm{SJ}(R_l)$
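
As an illustration of the multi-join estimator, here is a sketch for a three-relation chain with one independent sign family per join attribute. The relations are hypothetical toy data (not the slides' example), the signs are assumed fully independent, and the expectation is checked by enumerating every sign assignment.

```python
from itertools import product

# Hypothetical toy relations (not from the slides): F(a), G(a, b), H(b)
F = [(1,), (1,), (2,)]
G = [(1, 1), (2, 1), (2, 2)]
H = [(1,), (2,), (2,)]
DOM_A, DOM_B = [1, 2], [1, 2]


def atomic_sketch(relation, families):
    """X_R = sum over tuples t of the product of xi_k[t.a_k] over R's join attributes."""
    total = 0
    for t in relation:
        term = 1
        for xi, value in zip(families, t):
            term *= xi[value]
        total += term
    return total


def exact_count():
    """COUNT(F join_a G join_b H) by nested loops."""
    return sum(1 for (fa,) in F for (ga, gb) in G for (hb,) in H
               if fa == ga and gb == hb)


def estimator_moments():
    """Exact mean and variance of X = X_F * X_G * X_H over all sign choices."""
    values = []
    for sa in product((-1, 1), repeat=len(DOM_A)):
        for sb in product((-1, 1), repeat=len(DOM_B)):
            xi_a, xi_b = dict(zip(DOM_A, sa)), dict(zip(DOM_B, sb))
            x = (atomic_sketch(F, [xi_a])
                 * atomic_sketch(G, [xi_a, xi_b])
                 * atomic_sketch(H, [xi_b]))
            values.append(x)
    mean = sum(values) / len(values)
    var = sum(v * v for v in values) / len(values) - mean ** 2
    return mean, var


print(exact_count(), estimator_moments())
```

On this toy input the mean matches the exact count of 5, and the variance stays under $2^{2n} \prod_l \mathrm{SJ}(R_l) = 16 \cdot 5 \cdot 3 \cdot 5$ with $n = 2$ join constraints.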

  16. Outline of the Talk
      • Motivation
      • Sketch-based randomized algorithms
      • Sketch-based approximation of aggregate query results
      • Sketch partitioning to boost estimation accuracy
      • Experimental evaluation
      • Summary

  17. Sketch Partitioning
      Problem: large variance ⇒ loose estimation guarantees. Our solution: sketch partitioning.

      | $i$ | $f_i$ | $g_i$ |
      |---|---|---|
      | 1 | 20 | 2 |
      | 2 | 5 | 15 |
      | 3 | 10 | 3 |
      | 4 | 2 | 10 |

      $\mathrm{Var}(X) \approx 2\, \mathrm{SJ}(F)\, \mathrm{SJ}(G) = 2 (20^2 + 5^2 + 10^2 + 2^2)(2^2 + 15^2 + 3^2 + 10^2) = 357604$
      Idea: split the domain $I = \{1, 2, 3, 4\}$ into $I_1 = \{1, 3\}$ and $I_2 = \{2, 4\}$
      • $F$ splits into $F_1$ and $F_2$, $G$ into $G_1$ and $G_2$
      • Build $X_1$ to estimate $\mathrm{COUNT}(F_1 \bowtie G_1)$ and, independently, $X_2$ to estimate $\mathrm{COUNT}(F_2 \bowtie G_2)$
      • Take $X' = X_1 + X_2$; then $E[X'] = \mathrm{COUNT}(F \bowtie G)$

  18. Sketch Partitioning (cont.)
      • Estimation of $\mathrm{COUNT}(F_1 \bowtie G_1)$, with $(i, f_i, g_i) = (1, 20, 2), (3, 10, 3)$: $\mathrm{Var}(X_1) \approx 2\, \mathrm{SJ}(F_1)\, \mathrm{SJ}(G_1) = 2 (20^2 + 10^2)(2^2 + 3^2) = 13000$
      • Estimation of $\mathrm{COUNT}(F_2 \bowtie G_2)$, with $(i, f_i, g_i) = (2, 5, 15), (4, 2, 10)$: $\mathrm{Var}(X_2) \approx 2\, \mathrm{SJ}(F_2)\, \mathrm{SJ}(G_2) = 2 (5^2 + 2^2)(15^2 + 10^2) = 18850$
      • $\mathrm{Var}(X') = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) = 31850$
      • Improvement: $\frac{\mathrm{Var}(X)/2}{\mathrm{Var}(X')} = \frac{357604/2}{31850} \approx 5.6$
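
The arithmetic of the partitioning example can be reproduced directly. A minimal sketch, with a hypothetical helper `approx_variance`; the factor $1/2$ in the improvement ratio is taken from the slide and presumably reflects an equal-space comparison, in which the unpartitioned sketch could average two copies.

```python
def approx_variance(f, g, domain):
    """Var(X) ~ 2 * SJ(F) * SJ(G), restricted to the given domain values."""
    sj_f = sum(f[i] ** 2 for i in domain)
    sj_g = sum(g[i] ** 2 for i in domain)
    return 2 * sj_f * sj_g


f = {1: 20, 2: 5, 3: 10, 4: 2}
g = {1: 2, 2: 15, 3: 3, 4: 10}

var_single = approx_variance(f, g, [1, 2, 3, 4])
var_part = approx_variance(f, g, [1, 3]) + approx_variance(f, g, [2, 4])
# Comparison as on the slide: Var(X)/2 against Var(X')
improvement = (var_single / 2) / var_part
print(var_single, var_part, round(improvement, 1))   # 357604 31850 5.6
```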

  19. Binary Sketch Partitioning
      • Prior information: historical data, histograms
      • Find the partitioning $I = I_1 \cup I_2$ and the space allocation $m = m_1 + m_2$ that minimize $\frac{\mathrm{Var}(X_1)}{m_1} + \frac{\mathrm{Var}(X_2)}{m_2}$, where $\mathrm{Var}(X_k) \approx 2 \sum_{i \in I_k} f_i^2 \sum_{i \in I_k} g_i^2$
      • Allocate space proportional to $\sqrt{\mathrm{Var}(X_k)}$. In the example: 5:6
      • Only partitionings that follow the order of $f_i / g_i$ need to be examined to find the optimum ⇒ $O(|I|)$
      • In the example the order is $\{1, 3, 2, 4\}$. The optimal partition is $\{1, 3\} \cup \{2, 4\}$
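
The search described on this slide can be sketched as a single scan over the $f_i / g_i$ order. This is an illustration under stated assumptions, not the authors' implementation: every $g_i$ is assumed positive, and space is assumed to be split in proportion to $\sqrt{\mathrm{Var}(X_k)}$, in which case minimizing $\mathrm{Var}(X_1)/m_1 + \mathrm{Var}(X_2)/m_2$ reduces to minimizing $\sqrt{\mathrm{Var}(X_1)} + \sqrt{\mathrm{Var}(X_2)}$. The function name is hypothetical.

```python
import math


def binary_sketch_partition(f, g):
    """Best split of the domain into I1, I2 along the f_i/g_i order.

    Assumptions: every g[i] > 0, and space is allocated in proportion to
    sqrt(Var(X_k)), so the objective reduces to
    sqrt(Var(X1)) + sqrt(Var(X2)).
    """
    order = sorted(f, key=lambda i: f[i] / g[i], reverse=True)
    total_f = sum(f[i] ** 2 for i in order)
    total_g = sum(g[i] ** 2 for i in order)

    best = None
    left_f = left_g = 0
    for cut in range(1, len(order)):          # O(|I|) candidate splits
        i = order[cut - 1]
        left_f += f[i] ** 2
        left_g += g[i] ** 2
        var1 = 2 * left_f * left_g
        var2 = 2 * (total_f - left_f) * (total_g - left_g)
        cost = math.sqrt(var1) + math.sqrt(var2)
        if best is None or cost < best[0]:
            best = (cost, order[:cut], order[cut:], var1, var2)

    _, part1, part2, var1, var2 = best
    share1 = math.sqrt(var1) / (math.sqrt(var1) + math.sqrt(var2))
    return part1, part2, share1


f = {1: 20, 2: 5, 3: 10, 4: 2}
g = {1: 2, 2: 15, 3: 3, 4: 10}
part1, part2, share1 = binary_sketch_partition(f, g)
print(part1, part2, round(share1 / (1 - share1), 2))
```

On the slide's data this returns the partition $\{1, 3\}$, $\{2, 4\}$ with a space ratio close to 5:6. Running sums keep the scan linear after the initial sort.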
