Top-K Queries Marcin Kwietniewski
Agenda • Introduction • Early solution • Translation approach • RankSQL • Related work
Introduction • Consider a large travel database with some spatial information • We're interested in finding 5 hotels that are closest to Pearson airport • Traditionally: SELECT * FROM hotels h, airports a WHERE a.name = 'Pearson' ORDER BY distance ( a.location, h.location) • In application, fetch only 5 results from the cursor • Even though we only need a few results, the database sorts the whole result set
Introduction (contd.) • Often, amount of needed results is much less than the cardinality of the (sorted) tables • Need to support limiting the output size: • Query engine might use the information to enhance performance of queries
New operator • Additional logical operator is needed in order to process Top-K queries • Scan-stop – if input is sorted, operator just keeps first K elements from input stream • Sort-stop – if input not sorted, keep top-K tuples in a priority heap while scanning the input stream • Output cardinality is known Stop(5) Join Hotels Airports
Stop placement • Placing Stop deep in the execution tree may help a lot • Pushing down the Stop operator is tricky: – Other operators might decrease the number of tuples in the stream – We might keep more tuples, but not too many – Optimizer has no notion of risk
Examples Stop(10) Join Join Stop(10) Dept Emp Dept Emp
Conservative stop placement • Never insert a Stop operator in a place where it may discard tuples that should be in the final result • Example: Join Stop(10) Dept Emp • Rule: push below non-reductive predicates • Joins with equality predicate, where there is a guarantee that matching tuples will always be found (eg. foreign key)
Aggressive • Place Stop operator whenever it can provide a beneficial cardinality reduction • Rule: perform Stop as soon as the first expression in the ORDER BY clause can be evaluated • How many tuples to keep? K = later_reduction_factor * final_K * safety_factor • Restart operator needed • Additional rule: Stop should be placed on top of a pipeline (no loss, less risk)
Aggressive example Stop(N) Join Restart Dept Join TEA Stop(2*N) Emp
Performance • Single table queries: – Oder of magnitude improvement vs traditional model for K up to 100 – As K approaches 10.000 improvement is getting small • Joins: – Orders of magnitude improvement for K < 1000 – Comparable to traditional model only for K around 100.000 • Aggressive policy: – When Stops overestimate, response time 3-10 times shorter – Even with underestimates not performing worse than Conservative
Alternative: translation • Query Model: – Query: desired attribute values (ie. query point) – Answer: a set ordered by how closely the tuples match the query (score = distance from query point) • Idea: translate the query into SQL, using statistics on the data • Algorithm: – Use histograms to find the score S such that K tuples will have score > S – Formulate SQL constraints on attributes in order for score to be > S – Execute SQL query – If too few tuples, decrease S and restart
Histogram usage • Bucket histograms: No restarts With restarts
RankSQL • 2005: RankSQL – an approach to fully integrate rank and Top-K concepts with the relational model • Problem to tackle: • Ranking functions are monolithic to the query engine and are evaluated at the root of execution tree • Solution: – Extended algebra – Incremental execution model – Rank-aware optimization
Rank algebra • Query (base relations, selection) augumented with: – A set of rank predicates, such that each tuple has a predicate score for every predicate – Ranking function – a monotonic function of the predicates • Idea: allow query engine to split the evaluation of ranking predicates and interleave them as it does with Boolean predicates • New concept: Rank-relation, characterized by – Ranking function F – Predicates already evaluated • Each tuple has the maximal possible score for predicates not yet evaluated
Rank-relation example • F = p 1 + p 2 + p 3 • p 1 &p 2 already evaluated • Assume max score for p 3 is 2.0
Operators • New operator Rank(predicate p): – Doesn't change tuple membership – Order induced by F with p evaluated • Other operators modified: – Join, Union, Intersection: order of joined tuples determined by predicates already evaluated for any input relation
Equivalence laws • When can we push the operators down? • • • When p is available in R • When p is available in R and S
Query execution model • Operators incrementally output rank relations – Tuples are returned in order • Queries have an explicit bound on number of desired results (K) • Operators need to know when to stop: – Example: Rank(p) operator can output tuple t if it gets a t' such that score t' with max score on p is smaller than score of t with real score on p. • Priority queues used to keep Top-K results
Example • HRJN – hash rank join • NRJN – nested loops rank join • Scans in order of some predicate
Optimization • Extension of the bottom-up System-R type optimizer • Plan enumeration performed in two dimensional space (membership and rank) • Subplans identified by set of relations and set of ranking predicates • Algorithm exponential in number of relations and predicates • Heuristic: – Left-deep (relations) – Greedy choice of a rank operator 1 − card plan' card plan goodness rank op = cost rank op
Cost Model • Cardinality estimation is important in the cost model • Here, input size depends on the consumer, ie. operators may choose to stop processing input • Use sampling to estimate the score of the K-th tuple in the final result – Run any conventional plan with a few random tuples from input relations – Derive cardinalities of the real results from the sample run • Authors admit this is hard
Conclusion • First fully rank-aware DBMS • It seems that output cardinality estimation is the weak point (order of magnitude errors)
Related work • Follow-up to RankSQL: adaptive optimization, ie. changing the execution plan at runtime • Rank and generalized Group-By operations combined • Rank in uncertain databases • Approximate algorithms for Top-K
Thank you! Q&A
References • Stop operator: – On Saying “Enough Already!” in SQL Michael J. Carey, Donald Kossmann • RankSQL: – RankSQL: Query Algebra and Optimization for Relational Topk Queries Chengkai Li et al. – Adaptive Rank-Aware Query Optimization in Relational Databases Ihab F. Ilyas et al. • Translations: – Evaluating Top-k Selection Queries Surajit Chaudhuri, Luis Gravano
Recommend
More recommend