Beyond Simple Aggregates: Indexing for Summary Queries
Zhewei Wei and Ke Yi
Hong Kong University of Science and Technology
Reporting vs. Aggregation
Reporting query:
  SELECT salary FROM T WHERE age > 30 AND age < 40
  Result: 50,000 records ($32,000, $76,300, $54,400, ..., $68,000, $28,000)
Aggregation query:
  SELECT AVG(salary) FROM T WHERE age > 30 AND age < 40
  Result: $52,312
[Figure: # of employees (y-axis) vs. Salary (x-axis)]
Reporting vs. Aggregation
Search Engine Log
  Date        Keyword
  2011.04.08  Masters 2011
  2011.04.08  Libya
  2011.04.07  Japan nuclear crisis
  2011.04.07  Libya
  ...
  2011.03.11  Japan earthquake
  2011.03.11  Japan tsunami
  2011.03.10  NCAA
  ...
Summary of the log
  Keyword               Frequency
  Libya                 19.3%
  Japan nuclear crisis  16.5%
  Japan earthquake      10.2%
  ...
Summary Queries
Let D be a database containing N records. Each record p ∈ D is associated with a query attribute A_q(p) (e.g., age) and a summary attribute A_s(p) (e.g., salary).
A summary query specifies a range constraint [q_1, q_2] on A_q, and the database returns a summary of the A_s attribute of all records whose A_q attribute lies within the range.
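The definition can be read as "filter on A_q, then summarize A_s". The sketch below is a naive linear scan, not the index discussed in this talk; the record layout (pairs of query attribute and summary attribute) and the names `summary_query` and `summarize` are assumptions made for illustration.

```python
from statistics import mean

def summary_query(records, q1, q2, summarize):
    """Naive summary query: scan every record, keep those with
    q1 <= A_q <= q2, and summarize their A_s values."""
    selected = [a_s for a_q, a_s in records if q1 <= a_q <= q2]
    return summarize(selected)

# Hypothetical (age, salary) records.
records = [(25, 28000.0), (31, 32000.0), (35, 76300.0),
           (39, 54400.0), (52, 68000.0)]

# A simple aggregate is one possible summary ...
avg_salary = summary_query(records, 31, 39, mean)

# ... but any function of the selected values can be plugged in,
# for example a (min, median, max) triple.
def three_point(values):
    ordered = sorted(values)
    return ordered[0], ordered[len(ordered) // 2], ordered[-1]

spread = summary_query(records, 31, 39, three_point)
```

The scan costs time linear in N for every query, which is the cost an index for summary queries aims to avoid.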
Summary Queries
Data summarization techniques:
  Heavy hitters (a.k.a. frequent items): [MG 82], [MAA 06], ...
  Quantiles: [MP 80], [GK 01], ...
  Histograms: [PHIJ 96], [JKMPSS 98], [GGIKMS 02], ...
  Wavelets: [MVW 98], [VM 99], [GKMS 01], ...
  Various sketches: [AMS 99], Count-Min [CM 05], ...
  ...
Past research focuses on computing summaries of the whole data set, either offline or in a streaming setting.
Algorithm Problem vs. Data Structure Problem
The algorithm problem
  Space: O(N) offline; sublinear for streaming
  Time: Õ(N); sublinear when sampling works
The data structure problem
  Space: O(N), since the data must be stored
  Time: preprocessing time is less important; query time is O(log N + s_ε) in internal memory and O(log_B N + s_ε/B) in external memory
s_ε: summary size; B: block size
Quantile Summaries
φ-quantile: the value ranked at φ|D| in D.
ε-approximate φ-quantile: any value whose rank lies in [(φ − ε)|D|, (φ + ε)|D|].
Quantile summary: a structure from which an ε-approximate φ-quantile can be extracted for any 0 < φ < 1.
[Figure: # of employees vs. Salary, with markers at min, 20%, 40%, 60%, 80%, and max]
Quantile Summaries
[Figure: example data values (1, 3, 3, 4, 6, 7, 9, 11, 13, 16, 21, 24, 26) with a span of ε|D| values marked]
Size: s_ε = Θ(1/ε); Error: ε|D|
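One simple way to obtain a quantile summary of size about 1/ε with error at most ε|D| is to keep every (ε|D|)-th value of the sorted data. The sketch below assumes this equal-spacing construction and assumes ε|D| ≥ 1; it is not necessarily the summary used in the talk, and the function names are hypothetical.

```python
import math

def build_quantile_summary(data, eps):
    """Keep every g-th value of the sorted data, g = max(1, floor(eps * n)).
    Entry j (0-based) is the value of rank (j + 1) * g."""
    ordered = sorted(data)
    g = max(1, math.floor(eps * len(ordered)))
    return ordered[g - 1::g], g

def query_quantile(summary, g, n, phi):
    """Return the stored value whose rank is closest to phi * n.
    Stored ranks are g apart, so the rank error is below g <= eps * n."""
    j = round(phi * n / g) - 1
    j = min(max(j, 0), len(summary) - 1)  # clamp to the stored entries
    return summary[j]

# Example data: a permutation of 1..1000, so the rank of a value v is v.
data = [(7 * i) % 1000 + 1 for i in range(1000)]
eps = 0.05
summary, g = build_quantile_summary(data, eps)
median_estimate = query_quantile(summary, g, len(data), 0.5)
```

With n = 1000 and ε = 0.05 the summary holds 20 values, and any returned value is within 50 ranks of the requested quantile.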
A Baseline Solution
Decomposable summaries: the ε-summaries of D_1, D_2, ..., D_t can be combined into an ε-summary of D = D_1 ⊎ ··· ⊎ D_t.
  ε-summary(D_1) + ε-summary(D_2) + ··· + ε-summary(D_t) = ε-summary(D)
Error: ε|D_1| + ··· + ε|D_t| = ε|D|
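The additive error bound can be checked on a small example. The sketch below assumes a simplistic weighted summary (every g-th sorted value, each standing for g records) and merges by pooling entries; it illustrates why the errors add up to ε|D|, but unlike the summaries referred to on the slide it does not shrink the merged result back to size s_ε. All names are hypothetical.

```python
import math

def build_summary(data, eps):
    """Weighted eps-summary of one part: every g-th sorted value,
    with weight g, where g = max(1, floor(eps * n))."""
    ordered = sorted(data)
    g = max(1, math.floor(eps * len(ordered)))
    return [(v, g) for v in ordered[g - 1::g]]

def merge_summaries(summaries):
    """Combine summaries of disjoint parts by pooling their entries."""
    merged = [entry for summary in summaries for entry in summary]
    merged.sort()
    return merged

def estimate_rank(summary, x):
    """Estimated number of records <= x: total weight of entries <= x.
    Each part underestimates by less than its own g <= eps * |D_i|."""
    return sum(w for v, w in summary if v <= x)

eps = 0.1
parts = [list(range(0, 300)), list(range(300, 1000, 2)), [5] * 100]
merged = merge_summaries(build_summary(part, eps) for part in parts)
total = sum(len(part) for part in parts)  # |D| = 750
estimate = estimate_rank(merged, 400)
```

Because each part contributes a rank error below ε|D_i|, the estimate for the union is off by less than ε|D_1| + ··· + ε|D_t| = ε|D|.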
A Baseline Solution
[Figure: ε-summaries and a query range]
Query Cost
log N sorted lists, each of size s_ε
log N-way merging: O(s_ε log N log log N)
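The "log N sorted lists" come from covering the query range with O(log N) precomputed pieces. The sketch below assumes the baseline stores one summary per node of a balanced binary tree over the records sorted by A_q (the slides do not spell out the layout), and shows only the range decomposition; `canonical_nodes` is a hypothetical name.

```python
def canonical_nodes(q_lo, q_hi, lo, hi):
    """Decompose the half-open range [q_lo, q_hi) into maximal nodes of a
    balanced binary tree whose root covers positions [lo, hi)."""
    if q_hi <= lo or hi <= q_lo:
        return []            # node is disjoint from the query
    if q_lo <= lo and hi <= q_hi:
        return [(lo, hi)]    # node is fully covered: use its summary
    mid = (lo + hi) // 2
    return (canonical_nodes(q_lo, q_hi, lo, mid)
            + canonical_nodes(q_lo, q_hi, mid, hi))

# Positions 3..999 in a tree over 1024 sorted records.
nodes = canonical_nodes(3, 1000, 0, 1024)
```

Each returned interval corresponds to one stored summary, so a query collects at most about 2 log N summaries of size s_ε before merging them.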
A Baseline Solution
Internal memory
  Query time: O(s_ε log N log log N)
  Space: O(N s_ε); with fat leaves of size s_ε, O(N)
Optimal Data Structure
[Figure: summaries S(ε, D_1), S((3/2)ε, D_2), S((3/2)^2 ε, D_3), with the query range marked]
Optimal Data Structure
Quantile summary S(ε, D): an ε-quantile summary for data set D. Size: Θ(1/ε); Error: ε|D|.

  Data set   Data size    Error param.     Summary size        Absolute error
  D_1        k            ε                1/ε                 εk
  D_2        k/2          (3/2)ε           (2/3)(1/ε)          (3/4)εk
  D_3        k/4          (3/2)^2 ε        (2/3)^2 (1/ε)       (3/4)^2 εk
  ...
  D_t        k/2^(t-1)    (3/2)^(t-1) ε    (2/3)^(t-1) (1/ε)   (3/4)^(t-1) εk
  D          Θ(k)                          O(1/ε)              O(εk)
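The totals in the last row follow from geometric series. As a short worked check, using the per-row sizes k/2^(i-1), summary sizes (2/3)^(i-1)(1/ε), and absolute errors (3/4)^(i-1) εk:

```latex
\begin{align*}
|D| &= \sum_{i=1}^{t} \frac{k}{2^{\,i-1}} < 2k
      && \text{so } |D| = \Theta(k), \\
\text{total summary size} &= \sum_{i=1}^{t} \left(\tfrac{2}{3}\right)^{i-1} \frac{1}{\varepsilon}
      < \frac{3}{\varepsilon}
      && \text{so it is } O(1/\varepsilon), \\
\text{total absolute error} &= \sum_{i=1}^{t} \left(\tfrac{3}{4}\right)^{i-1} \varepsilon k
      < 4\,\varepsilon k
      && \text{so it is } O(\varepsilon k).
\end{align*}
```

The per-row error is the product of the data size and the error parameter: (k/2^(i-1)) · (3/2)^(i-1) ε = (3/4)^(i-1) εk.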
Optimal Data Structure
[Figure: ε-summary, (3/2)ε-summary, ((3/2)^2 ε)-summary, ..., with the query range marked]
Query Cost
s_ε; log N sorted lists
log N-way merging: Θ(s_ε log log N)
Bottom-up two-way merging: O(s_ε)
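The O(s_ε) bound relies on the list sizes shrinking geometrically. The sketch below takes one reading of "bottom-up": start from the smallest list and repeatedly merge the running result into the next larger one. The 2/3 size ratio and the function names are assumptions for illustration.

```python
def merge_two(a, b):
    """Standard two-way merge of two sorted lists.
    Returns the merged list and the number of elements written."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out, len(out)

def merge_bottom_up(lists):
    """Merge lists given from largest to smallest, smallest first.
    Returns the merged list and the total number of elements written."""
    acc, work = [], 0
    for lst in reversed(lists):
        acc, written = merge_two(lst, acc)
        work += written
    return acc, work

# Sorted lists whose sizes shrink by a factor of 2/3 each time.
sizes = [729, 486, 324, 216, 144, 96, 64]
lists = [[i + 7 * j for j in range(size)] for i, size in enumerate(sizes)]
merged, work = merge_bottom_up(lists)
```

When list i is merged in, the running result holds at most 3 times its size, so the total work is at most 3 · (729 + 486 + ···) < 9 · 729, which is linear in the largest list rather than multiplied by a log factor.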
α-Exponentially Decomposable
For multisets D_1, ..., D_t with F_1(D_i) ≤ α^(i−1) F_1(D_1), there exists a constant c such that, given S(ε, D_1), S(cε, D_2), ..., S(c^(t−1) ε, D_t):
  We can construct an O(ε)-summary for D_1 ⊎ ··· ⊎ D_t.
  The total size of S(ε, D_1), ..., S(c^(t−1) ε, D_t) is O(s_ε), and they can be combined in O(s_ε) time.
  The total size of S(ε, D), ..., S(c^(t−1) ε, D) is O(s_ε).
Theorem: For any (1/2)-exponentially decomposable summary, a database D of N records can be stored in an internal memory structure of linear size so that a summary query can be answered in O(log N + s_ε) time.
Optimal Data Structure - External Memory
Standard B-tree blocking with fat leaves
[Figure: tree blocks labeled Θ(B) and O(log B)]
Leaf size: s_ε
Query Path
[Figure: query path between u and v, with nodes labeled v_0, v_1, v_2, r_1, r_2, r_3, w_1, w_2]
Summary Set
[Figure: path from u to v with nodes w_1, w_2, w_3]
R(u, v) = {w_1, w_2, w_3}
RS(u, v, ε): S(ε, w_1), S(cε, w_2), S(c^3 ε, w_3)
Focus on a Block
[Figure: a block with root r_B and nodes u, v_1, v_2]
Case 1. RS(u, v_1, ε). Size: s_ε B log B
Case 2. RS(r_B, v_2, ε), RS(r_B, v_2, cε), ... Size: s_ε B
Case 3. S(r_B, ε), S(r_B, cε), S(r_B, c^2 ε), ...