top k queries marcin kwietniewski
play

Top-K Queries Marcin Kwietniewski Agenda Introduction Early - PowerPoint PPT Presentation

Top-K Queries Marcin Kwietniewski Agenda Introduction Early solution Translation approach RankSQL Related work Introduction Consider a large travel database with some spatial information We're interested in


  1. Top-K Queries Marcin Kwietniewski

  2. Agenda • Introduction • Early solution • Translation approach • RankSQL • Related work

  3. Introduction • Consider a large travel database with some spatial information • We're interested in finding 5 hotels that are closest to Pearson airport • Traditionally: SELECT * FROM hotels h, airports a WHERE a.name = 'Pearson' ORDER BY distance ( a.location, h.location) • In application, fetch only 5 results from the cursor • Even though we only need a few results, the database sorts the whole result set

  4. Introduction (contd.) • Often, amount of needed results is much less than the cardinality of the (sorted) tables • Need to support limiting the output size: • Query engine might use the information to enhance performance of queries

  5. New operator • Additional logical operator is needed in order to process Top-K queries • Scan-stop – if input is sorted, operator just keeps first K elements from input stream • Sort-stop – if input not sorted, keep top-K tuples in a priority heap while scanning the input stream • Output cardinality is known Stop(5) Join Hotels Airports

  6. Stop placement • Placing Stop deep in the execution tree may help a lot • Pushing down the Stop operator is tricky: – Other operators might decrease the number of tuples in the stream – We might keep more tuples, but not too many – Optimizer has no notion of risk

  7. Examples Stop(10) Join Join Stop(10) Dept Emp Dept Emp

  8. Conservative stop placement • Never insert a Stop operator in a place where it may discard tuples that should be in the final result • Example: Join Stop(10) Dept Emp • Rule: push below non-reductive predicates • Joins with equality predicate, where there is a guarantee that matching tuples will always be found (eg. foreign key)

  9. Aggressive • Place Stop operator whenever it can provide a beneficial cardinality reduction • Rule: perform Stop as soon as the first expression in the ORDER BY clause can be evaluated • How many tuples to keep? K = later_reduction_factor * final_K * safety_factor • Restart operator needed • Additional rule: Stop should be placed on top of a pipeline (no loss, less risk)

  10. Aggressive example Stop(N) Join Restart Dept Join TEA Stop(2*N) Emp

  11. Performance • Single table queries: – Oder of magnitude improvement vs traditional model for K up to 100 – As K approaches 10.000 improvement is getting small • Joins: – Orders of magnitude improvement for K < 1000 – Comparable to traditional model only for K around 100.000 • Aggressive policy: – When Stops overestimate, response time 3-10 times shorter – Even with underestimates not performing worse than Conservative

  12. Alternative: translation • Query Model: – Query: desired attribute values (ie. query point) – Answer: a set ordered by how closely the tuples match the query (score = distance from query point) • Idea: translate the query into SQL, using statistics on the data • Algorithm: – Use histograms to find the score S such that K tuples will have score > S – Formulate SQL constraints on attributes in order for score to be > S – Execute SQL query – If too few tuples, decrease S and restart

  13. Histogram usage • Bucket histograms: No restarts With restarts

  14. RankSQL • 2005: RankSQL – an approach to fully integrate rank and Top-K concepts with the relational model • Problem to tackle: • Ranking functions are monolithic to the query engine and are evaluated at the root of execution tree • Solution: – Extended algebra – Incremental execution model – Rank-aware optimization

  15. Rank algebra • Query (base relations, selection) augumented with: – A set of rank predicates, such that each tuple has a predicate score for every predicate – Ranking function – a monotonic function of the predicates • Idea: allow query engine to split the evaluation of ranking predicates and interleave them as it does with Boolean predicates • New concept: Rank-relation, characterized by – Ranking function F – Predicates already evaluated • Each tuple has the maximal possible score for predicates not yet evaluated

  16. Rank-relation example • F = p 1 + p 2 + p 3 • p 1 &p 2 already evaluated • Assume max score for p 3 is 2.0

  17. Operators • New operator Rank(predicate p): – Doesn't change tuple membership – Order induced by F with p evaluated • Other operators modified: – Join, Union, Intersection: order of joined tuples determined by predicates already evaluated for any input relation

  18. Equivalence laws • When can we push the operators down? • • • When p is available in R • When p is available in R and S

  19. Query execution model • Operators incrementally output rank relations – Tuples are returned in order • Queries have an explicit bound on number of desired results (K) • Operators need to know when to stop: – Example: Rank(p) operator can output tuple t if it gets a t' such that score t' with max score on p is smaller than score of t with real score on p. • Priority queues used to keep Top-K results

  20. Example • HRJN – hash rank join • NRJN – nested loops rank join • Scans in order of some predicate

  21. Optimization • Extension of the bottom-up System-R type optimizer • Plan enumeration performed in two dimensional space (membership and rank) • Subplans identified by set of relations and set of ranking predicates • Algorithm exponential in number of relations and predicates • Heuristic: – Left-deep (relations) – Greedy choice of a rank operator 1 − card  plan'  card  plan  goodness  rank op = cost  rank op 

  22. Cost Model • Cardinality estimation is important in the cost model • Here, input size depends on the consumer, ie. operators may choose to stop processing input • Use sampling to estimate the score of the K-th tuple in the final result – Run any conventional plan with a few random tuples from input relations – Derive cardinalities of the real results from the sample run • Authors admit this is hard

  23. Conclusion • First fully rank-aware DBMS • It seems that output cardinality estimation is the weak point (order of magnitude errors)

  24. Related work • Follow-up to RankSQL: adaptive optimization, ie. changing the execution plan at runtime • Rank and generalized Group-By operations combined • Rank in uncertain databases • Approximate algorithms for Top-K

  25. Thank you! Q&A

  26. References • Stop operator: – On Saying “Enough Already!” in SQL Michael J. Carey, Donald Kossmann • RankSQL: – RankSQL: Query Algebra and Optimization for Relational Top­k Queries Chengkai Li et al. – Adaptive Rank-Aware Query Optimization in Relational Databases Ihab F. Ilyas et al. • Translations: – Evaluating Top-k Selection Queries Surajit Chaudhuri, Luis Gravano

Recommend


More recommend