COMPRESSED REPRESENTATIONS OF CONJUNCTIVE QUERY RESULTS
Paris Koutris, University of Wisconsin-Madison. Joint work with Shaleen Deep (UW-Madison).


  1. COMPRESSED REPRESENTATIONS OF CONJUNCTIVE QUERY RESULTS
     Paris Koutris, University of Wisconsin-Madison. Joint work with Shaleen Deep (UW-Madison).

  2. MOTIVATION
     • In many scenarios, a data processing pipeline repeatedly accesses the result of a join query using some access pattern.
     • But a query over large data can produce a very large result, which is expensive to store directly.
     Can we compress the query result so that these accesses can still be performed efficiently?

  3. EXAMPLE: GRAPH DATA
     Consider the author relation from DBLP: $R(\text{author}, \text{paper})$.
     We want to run a data analysis over the co-author graph, which can be expressed as the following view:
     $V(x, y) \leftarrow R(x, p), R(y, p)$

  4. EXAMPLE: GRAPH DATA
     Graph analytics algorithms access a graph through an API that asks for the set of neighbors of a given vertex, expressed by an adorned view:
     $V^{bf}(x, y) \leftarrow R(x, p), R(y, p)$
     • $x$ is a bound (b) variable
     • $y$ is a free (f) variable
     [Xirogiannopoulos & Deshpande '17]

  5. EXAMPLE: GRAPH DATA
     $V^{bf}(x, y) \leftarrow R(x, p), R(y, p)$
     How can we solve this problem?
     1. Run each access request from scratch
        • no extra space needed
        • answer time can be $\Omega(N)$
     2. Create an index on the materialized $V$
        • space can be $\Omega(N^2)$
        • answer in constant time
     What exists between the two extremes?
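The two extremes can be sketched on a toy instance. This is a minimal illustration, not the authors' code: the data in `R` and the function names are invented here, and the view is taken literally, so every author appears among their own neighbors.

```python
from collections import defaultdict

# Toy instance of R(author, paper); the tuples are invented for illustration.
R = [("ann", "p1"), ("bob", "p1"), ("cat", "p2"), ("ann", "p2"), ("dan", "p3")]

def neighbors_from_scratch(R, x):
    """Extreme 1: no extra space, but every request scans R."""
    papers = {p for (a, p) in R if a == x}
    return {a for (a, p) in R if p in papers}

def build_materialized_index(R):
    """Extreme 2: materialize V(x, y) and index it by the bound variable x."""
    by_paper = defaultdict(set)
    for a, p in R:
        by_paper[p].add(a)
    index = defaultdict(set)
    for authors in by_paper.values():
        for x in authors:
            index[x] |= authors  # can grow quadratically in the worst case
    return index

index = build_materialized_index(R)
print(sorted(index["ann"]))
```

The index answers a request with a single lookup, while the from-scratch version keeps only `R` and pays a scan per request.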

  6. TALK OUTLINE
     1. Problem Setting
     2. Main Result #1
     3. Main Result #2
     4. Future Work

  7. ADORNED VIEWS
     We consider the class of conjunctive queries:
     $V(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$
     An adorned view [Ullman '85] describes an access pattern where some variables are bound (b) and others free (f):
     $V^{bbf}(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$
     $V^{fff}(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$
     An adorned view is full if every variable appears in the head.

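  8. COMPRESSED REPRESENTATIONS
     [Diagram] Preprocessing: a database $D$ and an adorned view $V^{\eta}$ are turned, in compression time $T_C$, into a compressed representation of space $S$.
     Online phase: access requests are answered from the compressed representation, measured by answer time and delay.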

  9. PARAMETERS
     Goal: construct a space-efficient representation to answer access requests that originate from a given adorned view.
     Parameters:
     • compression time $T_C$
     • space $S$
     • answering time
       - total answer time $T_A$ (time to enumerate all results)
       - delay $\delta$ (maximum time between outputting two consecutive tuples)

  10. FACTORIZED DATABASES [Olteanu & Závodný '15]
     Suppose that the adorned view has only free variables and the query is full:
     $V^{f \cdots f}(x_1, \dots, x_k) \leftarrow \dots$
     In compression time $T_C = O(|D|^{fhw(Q)})$ we can construct a compressed representation with space $S = O(|D|^{fhw(Q)})$ such that we can answer any access request over $D$ with constant delay.
     * $fhw(Q)$ = fractional hypertree width of $Q$

  11. EXAMPLE
     $|R| = |S| = |T| = N$
     $V^{bbf}(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$

                                  compression time   space                 total answer time     delay
     do nothing                   -                  $O(N)$                $O(N)$                $O(N)$
     materialize + create index   $O(N^{3/2})$       $O(N^{3/2})$          $O(|OUT|)$ *          $O(1)$
     our results                  $O(N^{3/2})$       $O(N^{3/2} / \tau)$   $O(\tau |OUT|)$       $O(\tau)$

     * OUT is the output of an access request

  12. ALL BOUND VARIABLES
     Suppose that the adorned view has only bound variables:
     $V^{b \cdots b}(x_1, \dots, x_k) \leftarrow \dots$
     Then, in time linear in the database size, we can construct a compressed representation with linear space that can answer any access request over $D$ with constant delay.
     IDEA: simply create a hash index for every relation.
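A minimal sketch of the hash-index idea, assuming the triangle view with every variable bound. The relation contents and the function names are illustrative assumptions, not taken from the talk.

```python
# Toy relations for V^{bbb}(x, y, z) <- R(x, y), S(y, z), T(z, x); data is invented.
relations = {
    "R": [(1, 2), (2, 3)],
    "S": [(2, 3), (3, 1)],
    "T": [(3, 1), (1, 2)],
}

def build_hash_indexes(relations):
    """Linear time and space: one hash set of tuples per relation."""
    return {name: set(tuples) for name, tuples in relations.items()}

def answer_all_bound(indexes, x, y, z):
    """With all variables bound, a request is one membership probe per atom."""
    return (x, y) in indexes["R"] and (y, z) in indexes["S"] and (z, x) in indexes["T"]

indexes = build_hash_indexes(relations)
print(answer_all_bound(indexes, 1, 2, 3))
```

Each request costs a constant number of expected-constant-time probes, which is where the constant delay comes from in this setting.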

  13. TALK OUTLINE
     1. Problem Setting
     2. Main Result #1
     3. Main Result #2
     4. Future Work

  14. QUERY AS A HYPERGRAPH
     Given an adorned view $Q^{\eta}$, it will be convenient to view it as a hypergraph $H = (V, E)$.
     $Q^{fffbbb}(x, y, z, w_1, w_2, w_3) \leftarrow R_1(w_1, x, y), R_2(w_2, y, z), R_3(w_3, x, z)$
     bound variables: $V_b = \{w_1, w_2, w_3\}$
     free variables: $V_f = \{x, y, z\}$
     [Figure: the hypergraph on vertices $x, y, z, w_1, w_2, w_3$ with one hyperedge per relation]

  15. FRACTIONAL EDGE COVER
     fractional edge cover: assign a weight to each hyperedge such that, for every variable, the sum of the weights of the hyperedges that include it is at least 1.
     [Figure: the example hypergraph with weight 1 on each of the three hyperedges]
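The definition can be checked mechanically. The sketch below encodes the example hypergraph from the previous slide; the dictionary layout and the function name are assumptions made for illustration.

```python
# Hyperedges of the example query, one per relation.
edges = {
    "R1": {"w1", "x", "y"},
    "R2": {"w2", "y", "z"},
    "R3": {"w3", "x", "z"},
}

def is_fractional_edge_cover(edges, weights, variables):
    """True if every variable is covered with total weight at least 1."""
    return all(
        sum(w for name, w in weights.items() if v in edges[name]) >= 1
        for v in variables
    )

all_vars = set().union(*edges.values())
ones = {"R1": 1, "R2": 1, "R3": 1}
halves = {"R1": 0.5, "R2": 0.5, "R3": 0.5}
print(is_fractional_edge_cover(edges, ones, all_vars))
```

Weight 1/2 on every hyperedge still covers $\{x, y, z\}$, because each of these variables lies in two hyperedges, but it no longer covers $w_1, w_2, w_3$.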

  16. SLACK
     slack: given a fractional edge cover $u$ and a subset $S$ of the variables, the slack $\alpha(S)$ is the maximum quantity such that $u / \alpha(S)$ is still a fractional cover of $S$.
     Example: with weight 1 on each hyperedge and $V_f = \{x, y, z\}$, we have $\alpha(V_f) = 2$.
     The slack is always at least one!
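Since $u / \alpha$ covers $S$ exactly when every variable of $S$ receives total weight at least $\alpha$, the slack equals the smallest total weight received by a variable of $S$. A small sketch of that computation on the example hypergraph follows; the names are illustrative.

```python
edges = {
    "R1": {"w1", "x", "y"},
    "R2": {"w2", "y", "z"},
    "R3": {"w3", "x", "z"},
}

def slack(edges, weights, subset):
    """Largest alpha such that weights / alpha still covers every variable in subset."""
    return min(
        sum(w for name, w in weights.items() if v in edges[name])
        for v in subset
    )

u = {"R1": 1, "R2": 1, "R3": 1}
print(slack(edges, u, {"x", "y", "z"}))
```

For the free variables the result is 2, while any subset containing one of the $w_i$ has slack 1, since each $w_i$ lies in a single hyperedge.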

  17. AGM BOUND
     Let $H = (V, E)$ be a hypergraph. For every fractional edge cover $u$ of $V$, the output size of the corresponding join query is upper bounded by
     $\prod_{F \in E} |R_F|^{u_F}$
     In our example, with weight 1 on every hyperedge, if all relations have size $N$ we obtain a bound of $N^3$.
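The bound is a plain product, so it is easy to evaluate for concrete sizes. In this sketch the relation size 100 is an arbitrary illustrative choice.

```python
import math

def agm_bound(sizes, weights):
    """Product over hyperedges F of |R_F| ** u_F."""
    return math.prod(sizes[name] ** weights[name] for name in sizes)

N = 100  # illustrative relation size
sizes = {"R1": N, "R2": N, "R3": N}
u = {"R1": 1, "R2": 1, "R3": 1}
print(agm_bound(sizes, u))
```

With the cover $u = (1, 1, 1)$ this evaluates to $N^3$.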

  18. MAIN THEOREM #1
     $Q^{\eta}$: full adorned view with hypergraph $H = (V, E)$
     $u$: any fractional edge cover of $V$
     For any database $D$ and parameter $\tau > 0$, we can construct a compressed representation with:
     compression time $T_C = \tilde{O}\big(|D| + \prod_{F \in E} |R_F|^{u_F}\big)$
     space $S = \tilde{O}\big(|D| + \prod_{F \in E} |R_F|^{u_F} / \tau^{\alpha(V_f)}\big)$
     delay $\delta = \tilde{O}(\tau)$
     answer time $T_A = \tilde{O}\big(|q(D)| + \tau \cdot |q(D)|^{1/\alpha(V_f)}\big)$

  19. EXAMPLE
     $u = (1, 1, 1)$, $V_f = \{x, y, z\}$, $\alpha(V_f) = 2$
     compression time $T_C = \tilde{O}(N^3)$
     space $S = \tilde{O}(N^3 / \tau^2)$
     delay $\delta = \tilde{O}(\tau)$
     answer time $T_A = \tilde{O}\big(|q(D)| + \tau \cdot |q(D)|^{1/2}\big)$
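To see the tradeoff, one can substitute a few values of the threshold into the space and delay bounds of this example. The three choices of $\tau$ below are illustrative and are not taken from the talk; the database has size $|D| = 3N$.

```latex
\begin{align*}
\tau = 1:       \quad & S = \tilde{O}(N^3),                       & \delta &= \tilde{O}(1) \\
\tau = N^{1/2}: \quad & S = \tilde{O}(N^3 / N) = \tilde{O}(N^2),  & \delta &= \tilde{O}(N^{1/2}) \\
\tau = N:       \quad & S = \tilde{O}(|D| + N^3 / N^2) = \tilde{O}(N), & \delta &= \tilde{O}(N)
\end{align*}
```

Because the slack is 2, each factor gained in delay is paid back quadratically in space.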

  20. THE DATA STRUCTURE (1)
     • Consider an ordering of the free variables $V_f$, e.g. $x \le y \le z$
     • This induces a lexicographic ordering for all the valuations over $V_f$
     • Using this ordering, we can define intervals: $I_1 = [(0, 0, 10), (0, 10, 20)]$, $I_2 = [(3, 1, 0), (4, 5, 0))$
     • Given a valuation $v_b$ over $V_b$ and an interval $I$, we can estimate an upper bound $T(v_b, I)$ on the cost of computing the query restricted to $v_b$ and $I$, using the AGM bound
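Lexicographic intervals over valuations are easy to experiment with, because Python tuples already compare lexicographically. The sketch below encodes the two example intervals from the slide, one closed and one half-open; the helper name is an assumption.

```python
def in_interval(v, lo, hi, right_closed=True):
    """Membership of valuation v in the lexicographic interval from lo to hi."""
    return lo <= v <= hi if right_closed else lo <= v < hi

closed = ((0, 0, 10), (0, 10, 20))   # [(0,0,10), (0,10,20)]
half_open = ((3, 1, 0), (4, 5, 0))   # [(3,1,0), (4,5,0))

print(in_interval((0, 5, 99), *closed))
print(in_interval((4, 5, 0), *half_open, right_closed=False))
```

The valuation $(0, 5, 99)$ lies in the first interval even though its last component exceeds 20, since the comparison is decided at the second component.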

  21. THE DATA STRUCTURE (2)
     • The data structure is a binary tree parameterized by a threshold $\tau$
     • Each node is labeled by an interval $I$
     • Each node stores a bit for every valuation $v_b$ over $V_b$ with cost $T(v_b, I) > \tau$:
       0: the query over $v_b, I$ is empty
       1: the query over $v_b, I$ is not empty

  22. THE DATA STRUCTURE (3)
     • The interval in the root node includes all valuations.
     • At the next level, the interval of the parent is split into two smaller intervals.
     • We stop at $\log |D|$ levels.
     We split the intervals such that the numbers of bits we need to store at the two sub-intervals are balanced.
     [Figure: binary tree with root $I$, children $I_1, I_2$, and grandchildren $I_3, I_4, I_5, I_6$]

  23. USING THE DATA STRUCTURE
     We are given a valuation $v_b$ over $V_b$. Starting from the root, at each node:
     • if no bit is stored for $v_b$, we run the query on the interval
     • if bit = 0, we exit the node
     • if bit = 1, we visit the left and then the right child
     The delay is bounded by the threshold $\tau$.
     [Figure: the tree from the previous slide with example bits at some nodes]

  24. COROLLARY OF THEOREM #1
     $Q^{\eta}$: full adorned view with hypergraph $H = (V, E)$
     $\rho(H)$: fractional edge cover number (the value of a minimum fractional edge cover)
     For any input database $D$ and parameter $\tau > 0$, we can construct a compressed representation with:
     space $S = \tilde{O}(|D| + |D|^{\rho(H)} / \tau)$
     delay $\delta = \tilde{O}(\tau)$
     For $\tau = 1$, the space matches the AGM bound.

  25. BETTER BOUNDS USING SLACK
     Star join query
     [Figure: a star with center $z$ and leaves $x_1, x_2, x_3, \dots, x_n$, with weight 1 on every hyperedge]
     • the fractional edge cover assigns weight 1 to every hyperedge
     • the slack for $\{z\}$ in this case is $n$
     space $S = \tilde{O}(|D| + |D|^n / \tau^n)$
     delay $\delta = \tilde{O}(\tau)$
     answer time $T_A = \tilde{O}(|q(D)| + \tau |q(D)|^{1/n})$
     If we ignored slack, the space would be $|D|^n / \tau$.

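  26. DELAY-SPACE TRADEOFF
     [Figure: delay plotted against space, with one curve for slack = 1 and one for slack > 1; the space axis is marked at $|D|$ and at $|D|^{\rho}$ (the AGM bound), the delay axis at 1]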

  27. FAST SET INTERSECTION
     Given a family of sets $\{S_1, \dots, S_n\}$ with total size $m$, construct a space-efficient data structure such that given any $i, j$ we can compute $S_i \cap S_j$ as fast as possible [Cohen & Porat '10].
     This is a special case of Theorem #1:
     $Q^{bbf}(x, y, z) \leftarrow R(x, z), R(y, z)$
     where $R(s, b)$ encodes that set $s$ contains element $b$.
     [Figure: the hypergraph on $x, y, z$ with weight 1 on each of the two hyperedges]
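A heavily simplified sketch of the threshold-tree idea from the data structure slides, specialised to this set-intersection view. Several things are assumptions made for illustration and are not the authors' construction: elements are integers, intervals are split at the midpoint rather than balanced by stored bits, and `cost` is a stand-in for the AGM-based bound (the size of the smaller restricted set). The sketch illustrates the traversal rules only and does not reproduce the space or delay guarantees.

```python
from bisect import bisect_left
from itertools import combinations

def restrict(sorted_elems, lo, hi):
    """Elements e of a sorted list with lo <= e < hi."""
    return sorted_elems[bisect_left(sorted_elems, lo):bisect_left(sorted_elems, hi)]

def cost(sets, x, y, lo, hi):
    """Hypothetical stand-in for the cost bound on the interval [lo, hi)."""
    return min(len(restrict(sets[x], lo, hi)), len(restrict(sets[y], lo, hi)))

def build(sets, lo, hi, tau):
    """Node = (lo, hi, bits, left, right); a bit is stored only when cost > tau."""
    bits = {}
    for x, y in combinations(sorted(sets), 2):
        if cost(sets, x, y, lo, hi) > tau:
            common = set(restrict(sets[x], lo, hi)) & set(restrict(sets[y], lo, hi))
            bits[(x, y)] = bool(common)
    if not bits or hi - lo <= 1:
        return (lo, hi, bits, None, None)
    mid = (lo + hi) // 2  # simplification: midpoint split, not a balanced split
    return (lo, hi, bits, build(sets, lo, mid, tau), build(sets, mid, hi, tau))

def answer(node, sets, members, x, y):
    """Enumerate the intersection of sets x and y in increasing order."""
    lo, hi, bits, left, right = node
    key = (x, y) if x <= y else (y, x)
    if key not in bits or left is None:
        # No bit stored (cheap interval): run the query on the interval directly.
        a, b = restrict(sets[x], lo, hi), restrict(sets[y], lo, hi)
        small, other = (a, y) if len(a) <= len(b) else (b, x)
        yield from (e for e in small if e in members[other])
    elif bits[key]:
        # Bit 1: visit the left child, then the right child.
        yield from answer(left, sets, members, x, y)
        yield from answer(right, sets, members, x, y)
    # Bit 0: the interval contributes nothing, so the node is exited.

sets = {"a": [1, 3, 5, 7, 9, 11], "b": [3, 4, 5, 6, 11, 12], "c": [2, 4]}
members = {name: set(elems) for name, elems in sets.items()}
root = build(sets, 1, 13, tau=2)
print(list(answer(root, sets, members, "a", "b")))
```

A larger `tau` stores fewer bits and does more work per request, which mirrors the space-delay tradeoff in spirit.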

  28. LIMITATIONS OF THEOREM #1
     Consider the following adorned view:
     $Q^{fff}(x, y, z) \leftarrow R(x, z), S(y, z)$
     • Theorem #1 implies that for constant delay we need space $O(N^2)$
     • But we know that we can achieve the same delay with only linear space (because of acyclicity)
     • Why is there a mismatch in the space bounds?
     We must take the query structure into account as well.

  29. TALK OUTLINE
     1. Problem Setting
     2. Main Result #1
     3. Main Result #2
     4. Future Work

  30. TREE DECOMPOSITION
     $Q(x_1, \dots, x_7) \leftarrow R_1(x_1, x_2), R_2(x_2, x_3), \dots, R_6(x_6, x_7)$
     Given a hypergraph $H = (V, E)$, a tree decomposition of $H$ is a tuple $(T, (B_t))$ where $T$ is a tree, and each bag $B_t$ is a subset of $V$ such that:
     • each edge is contained in some bag
     • for each variable $x$, the set of tree nodes that contain $x$ in their bag is connected
     [Figure: a tree decomposition of the path query, drawn as a chain of bags $\{x_1, x_2\}, \{x_2, x_3\}, \dots, \{x_6, x_7\}$]
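The two conditions can be verified directly. The sketch below checks them for the path query; the function name and the integer node labels are illustrative assumptions, and the function trusts that `tree_edges` really forms a tree.

```python
def is_tree_decomposition(hyperedges, tree_edges, bags):
    """Check edge containment and per-variable connectivity of the bags."""
    # Condition 1: each hyperedge is contained in some bag.
    if not all(any(e <= bag for bag in bags.values()) for e in hyperedges):
        return False
    adj = {t: set() for t in bags}
    for s, t in tree_edges:
        adj[s].add(t)
        adj[t].add(s)
    # Condition 2: the tree nodes whose bag contains a variable are connected.
    for v in set().union(*hyperedges):
        nodes = {t for t, bag in bags.items() if v in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            t = stack.pop()
            for n in (adj[t] & nodes) - seen:
                seen.add(n)
                stack.append(n)
        if seen != nodes:
            return False
    return True

# Path query R1(x1, x2), ..., R6(x6, x7): one bag per atom, arranged in a chain.
hyperedges = [{f"x{i}", f"x{i + 1}"} for i in range(1, 7)]
bags = {i: {f"x{i}", f"x{i + 1}"} for i in range(1, 7)}
chain = [(i, i + 1) for i in range(1, 6)]
print(is_tree_decomposition(hyperedges, chain, bags))
```

Reordering the chain so that the two bags containing $x_2$ are no longer adjacent breaks the connectivity condition.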
