optimal join algorithms meet top
play

Optimal Join Algorithms meet Top- Nikolaos Tziavelis, Wolfgang - PowerPoint PPT Presentation

SIGMOD 2020 tutorial Optimal Join Algorithms meet Top- Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald Ranked results Northeastern University, Boston Part 1 : Top- Time Slides:


  1. SIGMOD 2020 tutorial Optimal Join Algorithms meet Top- 𝑙 Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald Ranked results Northeastern University, Boston Part 1 : Top- 𝑙 Time Slides: https://northeastern-datalab.github.io/topk-join-tutorial/ DOI: https://doi.org/10.1145/3318464.3383132 Data Lab: https://db.khoury.northeastern.edu This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License. 1 See https://creativecommons.org/licenses/by-nc-sa/4.0/ for details

  2. Why "Optimal Join Algorithms meet Top- 𝑙 "? Optimal Join algorithms Top- 𝑙 Return all results over joins Given 𝑙 , return 𝑙 β€œbest” results β‡’ How to avoid large β‡’ How to avoid working on intermediate results? any lower ranked results? Ranked Enumeration (Any- 𝑙 ) Incrementally return the 𝑙 β€œbest” results over joins (for any 𝑙 = 1, 2, ...) β‡’ How to most effectively push sorting through joins? 3

  3. Top- 𝑙 Optimal Join Algorithms Any- 𝑙 middleware cost model RAM cost model return all results; (# accesses) wish: 𝑃 𝑠 , 𝑠 > π‘œ conjunctive queries ranking function small result size; query wish: 𝑃(𝑙) decompositions most important all results minimize results first are equally return only intermediate important 𝑙 - best results results incremental computation 4

  4. Outline tutorial β€’ Part 1: Top- 𝑙 (Wolfgang): ~20min – Top- 𝑙 selection problem – Threshold algorithm [Fagin+ '03] – Top- 𝑙 join problem – J* algorithm [Natsev+ '01] – Discussion on cost models β€’ Part 2: Optimal Join Algorithms (Mirek): ~30min β€’ Part 3: Ranked enumeration over joins (Nikolaos): ~40min 5

  5. Top- 𝑙 Selection Query: overall setup β€’ π‘œ objects π‘Œ ! , π‘Œ " , … , π‘Œ # with β„“ numeric weight attributes π‘₯ ! , π‘₯ " , … , π‘₯ β„“ β€’ weight of object = aggregate function over its weights 𝜍 π‘₯ ! , π‘₯ " , … , π‘₯ β„“ = 𝜍 π‘Œ β€’ Goal: Find top- 𝑙 objects according to some order (e.g. min) In most original papers assumed to be max! id π‘₯ ! π‘₯ " π‘₯ % sum Example aggregate function: 𝜍 = sum {π‘₯ ! , π‘₯ " , π‘₯ % } π‘Œ ! 3 4 3 10 π‘Œ " 4 2 4 10 π‘Œ % 6 8 1 15 π‘Œ & 7 6 6 18 Top- 𝑙 : a set of 𝑙 objects s.t. 𝜍 π‘Œ ( ≀ 𝜍(π‘Œ ) ) π‘Œ ' 8 7 5 20 for every π‘Œ ( ∈ π‘ˆ and every π‘Œ ) βˆ‰ π‘ˆ π‘œ = 5 , β„“ = 3 , 𝑙 = 2 6

  6. Top- 𝑙 Selection Query: information in different relations β€’ Weights are stored in β„“ distinct relations 𝑆 ! 𝑆 ! 𝑆 " 𝑆 % id π‘₯ ! π‘₯ " π‘₯ % sum id π‘₯ ! id π‘₯ " id π‘₯ % π‘Œ ! 3 4 3 10 π‘Œ ! 3 π‘Œ ! 4 π‘Œ ! 3 π‘Œ " 4 2 4 10 π‘Œ " 4 π‘Œ " 2 π‘Œ " 4 π‘Œ % 6 8 1 15 π‘Œ % 6 π‘Œ % 1 π‘Œ % 8 π‘Œ & 7 6 6 18 π‘Œ & 7 π‘Œ & 6 π‘Œ & 6 π‘Œ ' 8 7 5 20 π‘Œ ' 8 π‘Œ ' 7 π‘Œ ' 5 7

  7. Top- 𝑙 Selection Query: sorted access β€’ Weights are stored in β„“ distinct relations 𝑆 ! - each 𝑆 ! is sorted by attribute π‘₯ ! 𝑆 ! 𝑆 " 𝑆 % id π‘₯ ! π‘₯ " π‘₯ % sum id π‘₯ ! id π‘₯ " id π‘₯ % π‘Œ ! 3 4 3 10 π‘Œ ! 3 π‘Œ ! 4 π‘Œ ! 3 π‘Œ " 4 2 4 10 π‘Œ " 4 π‘Œ " 2 π‘Œ " 4 π‘Œ % 6 8 1 15 π‘Œ % 6 π‘Œ % 1 π‘Œ % 8 π‘Œ & 7 6 6 18 π‘Œ & 7 π‘Œ & 6 π‘Œ & 6 π‘Œ ' 8 7 5 20 π‘Œ ' 8 π‘Œ ' 7 π‘Œ ' 5 8

  8. Top- 𝑙 Selection Query: sorted access β€’ Weights are stored in β„“ distinct relations 𝑆 ! - each 𝑆 ! is sorted by attribute π‘₯ ! 𝑆 ! 𝑆 " 𝑆 % id π‘₯ ! π‘₯ " π‘₯ % sum id π‘₯ ! id π‘₯ " id π‘₯ % Notice we sort in increasing order π‘Œ ! 3 4 3 10 π‘Œ ! 3 π‘Œ " 2 π‘Œ % 1 π‘Œ " 4 2 4 10 π‘Œ " 4 π‘Œ ! 4 π‘Œ ! 3 π‘Œ % 6 8 1 15 π‘Œ % 6 π‘Œ & 6 π‘Œ " 4 π‘Œ & 7 6 6 18 π‘Œ & 7 π‘Œ ' 7 π‘Œ ' 5 π‘Œ ' 8 7 5 20 π‘Œ ' 8 π‘Œ % 8 π‘Œ & 6 9

  9. Top- 𝑙 Selection Query: "middleware" assumption As Assumption 1: 1: Mi Middl ddleware c cost m mode del : β€’ Weights are stored in β„“ distinct relations 𝑆 ! we aggregate rankings of other services. - each 𝑆 ! is sorted by attribute π‘₯ ! β€’ we only pay for accesses to attribute lists β€’ Goal: Find top- 𝑙 with minimal access cost β€’ 2 types of access: sequential / random - get next object in 𝑆 ! sequentially: "sorted" sequential access cost 𝑑 "#$ - obtain the weight for a specific object in 𝑆 ! : random access (index lookup) cost 𝑑 %&'( 𝑆 ! 𝑆 " 𝑆 % id π‘₯ ! π‘₯ " π‘₯ % sum id π‘₯ ! id π‘₯ " id π‘₯ % Notice we sort in increasing order π‘Œ ! 3 4 3 10 π‘Œ ! 3 π‘Œ " 2 π‘Œ % 1 π‘Œ " 4 2 4 10 π‘Œ " 4 π‘Œ ! 4 π‘Œ ! 3 π‘Œ % 6 8 1 15 π‘Œ % 6 π‘Œ & 6 π‘Œ " 4 π‘Œ & 7 6 6 18 π‘Œ & 7 π‘Œ ' 7 π‘Œ ' 5 π‘Œ ' 8 7 5 20 π‘Œ ' 8 π‘Œ % 8 π‘Œ & 6 10

  10. Top- 𝑙 Selection Query as a Join Problem As Assumption 1: 1: Mi Middl ddleware c cost m mode del : β€’ Weights are stored in β„“ distinct relations 𝑆 ! we aggregate rankings of other services. - each 𝑆 ! is sorted by attribute π‘₯ ! β€’ we only pay for accesses to attribute lists β€’ Goal: Find top- 𝑙 with minimal access cost β€’ 2 types of access: sequential / random - get next object in 𝑆 ! sequentially: "sorted" sequential access cost 𝑑 "#$ - obtain the weight for a specific object in 𝑆 ! : random access (index lookup) cost 𝑑 %&'( 𝑆 ! 𝑆 " 𝑆 % select R 1 .id, id π‘₯ ! π‘₯ " π‘₯ % sum id π‘₯ ! id π‘₯ " id π‘₯ % sum(w 1 ,w 2 ,w 3 ) as weight π‘Œ ! 3 4 3 10 π‘Œ ! 3 π‘Œ " 2 π‘Œ % 1 from R 1 , R 2 , R 3 π‘Œ " 4 2 4 10 π‘Œ " 4 π‘Œ ! 4 π‘Œ ! 3 where R 1 .id=R 2 .id π‘Œ % 6 8 1 15 π‘Œ % 6 π‘Œ & 6 π‘Œ " 4 and R 2 .id=R 3 .id π‘Œ & 7 6 6 18 π‘Œ & 7 π‘Œ ' 7 π‘Œ ' 5 order by weight π‘Œ ' 8 7 5 20 π‘Œ ' 8 π‘Œ % 8 π‘Œ & 6 limit 2 ~ Joins on unique object id: 1–1 relationships 11

  11. Naive algorithm: retrieve all items Assumption 1: As 1: Mi Middl ddleware c cost m mode del : β€’ Weights are stored in β„“ distinct relations 𝑆 ! we aggregate rankings of other services. - each 𝑆 ! is sorted by attribute π‘₯ ! β€’ we only pay for accesses to attribute lists β€’ Goal: Find top- 𝑙 with minimal access cost β€’ 2 types of access: sequential / random - get next object in 𝑆 ! sequentially: "sorted" sequential access cost 𝑑 "#$ - obtain the weight for a specific object in 𝑆 ! : random access (index lookup) cost 𝑑 %&'( 𝑆 ! 𝑆 " 𝑆 % select R 1 .id, id π‘₯ ! π‘₯ " π‘₯ % sum id π‘₯ ! id π‘₯ " id π‘₯ % sum(w 1 ,w 2 ,w 3 ) as weight π‘Œ ! 3 4 3 10 π‘Œ ! 3 π‘Œ " 2 π‘Œ % 1 from R 1 , R 2 , R 3 π‘Œ " 4 2 4 10 π‘Œ " 4 π‘Œ ! 4 π‘Œ ! 3 where R 1 .id=R 2 .id π‘Œ % 6 8 1 15 π‘Œ % 6 π‘Œ & 6 π‘Œ " 4 and R 2 .id=R 3 .id π‘Œ & 7 6 6 18 π‘Œ & 7 π‘Œ ' 7 π‘Œ ' 5 order by weight π‘Œ ' 8 7 5 20 π‘Œ ' 8 π‘Œ % 8 π‘Œ & 6 limit 2 Naive algorithm: retrieve all items, sort, return top- 𝑙 Cost = π‘œ β‹… β„“ β‹… 𝑑 "#$% 12

  12. Assumption 2: monotonicity of 𝜍 Assumption 1: As 1: Mi Middl ddleware c cost m mode del : β€’ Weights are stored in β„“ distinct relations 𝑆 ! we aggregate rankings of other services. - each 𝑆 ! is sorted by attribute π‘₯ ! β€’ we only pay for accesses to attribute lists β€’ Goal: Find top- 𝑙 with minimal access cost β€’ 2 types of access: sequential / random - get next object in 𝑆 ! sequentially: "sorted" sequential access cost 𝑑 "#$ - obtain the weight for a specific object in 𝑆 ! : random access (index lookup) cost 𝑑 %&'( 𝑆 ! 𝑆 " 𝑆 % select R 1 .id, id π‘₯ ! π‘₯ " π‘₯ % sum id π‘₯ ! id π‘₯ " id π‘₯ % sum(w 1 ,w 2 ,w 3 ) as weight π‘Œ ! 3 4 3 10 π‘Œ ! 3 π‘Œ " 2 π‘Œ % 1 from R 1 , R 2 , R 3 π‘Œ " 4 2 4 10 π‘Œ " 4 π‘Œ ! 4 π‘Œ ! 3 where R 1 .id=R 2 .id π‘Œ % 6 8 1 15 π‘Œ % 6 π‘Œ & 6 π‘Œ " 4 and R 2 .id=R 3 .id π‘Œ & 7 6 6 18 π‘Œ & 7 π‘Œ ' 7 π‘Œ ' 5 order by weight π‘Œ ' 8 7 5 20 π‘Œ ' 8 π‘Œ % 8 π‘Œ & 6 limit 2 Part 3: tropical semiring (min, sum) is instance Assumption 2: As 2: The aggregate function 𝜍 is mo monotone : of " sele lective di dioid " (i.e. min(a,b) = a or b). , if π‘₯ ! ≀ π‘₯ ! , for all i , , π‘₯ * , , … , π‘₯ β„“ 𝜍 π‘₯ ) , π‘₯ * , … , π‘₯ β„“ ≀ 𝜍 π‘₯ ) 𝜍 is decomposable: 𝜍 π‘₯ ) , π‘₯ * , π‘₯ - = 𝜍{π‘₯ ) , π‘₯ * , π‘₯ - } 13

Recommend


More recommend