Top-K Queries Marcin Kwietniewski Agenda Introduction Early - PowerPoint PPT Presentation

Top-K Queries Marcin Kwietniewski

Agenda • Introduction • Early solution • Translation approach • RankSQL • Related work

Introduction • Consider a large travel database with some spatial information • We're interested in finding 5 hotels that are closest to Pearson airport • Traditionally: SELECT * FROM hotels h, airports a WHERE a.name = 'Pearson' ORDER BY distance ( a.location, h.location) • In application, fetch only 5 results from the cursor • Even though we only need a few results, the database sorts the whole result set

Introduction (contd.) • Often, amount of needed results is much less than the cardinality of the (sorted) tables • Need to support limiting the output size: • Query engine might use the information to enhance performance of queries

New operator • Additional logical operator is needed in order to process Top-K queries • Scan-stop – if input is sorted, operator just keeps first K elements from input stream • Sort-stop – if input not sorted, keep top-K tuples in a priority heap while scanning the input stream • Output cardinality is known Stop(5) Join Hotels Airports

Stop placement • Placing Stop deep in the execution tree may help a lot • Pushing down the Stop operator is tricky: – Other operators might decrease the number of tuples in the stream – We might keep more tuples, but not too many – Optimizer has no notion of risk

Examples Stop(10) Join Join Stop(10) Dept Emp Dept Emp

Conservative stop placement • Never insert a Stop operator in a place where it may discard tuples that should be in the final result • Example: Join Stop(10) Dept Emp • Rule: push below non-reductive predicates • Joins with equality predicate, where there is a guarantee that matching tuples will always be found (eg. foreign key)

Aggressive • Place Stop operator whenever it can provide a beneficial cardinality reduction • Rule: perform Stop as soon as the first expression in the ORDER BY clause can be evaluated • How many tuples to keep? K = later_reduction_factor * final_K * safety_factor • Restart operator needed • Additional rule: Stop should be placed on top of a pipeline (no loss, less risk)

Aggressive example Stop(N) Join Restart Dept Join TEA Stop(2*N) Emp

Performance • Single table queries: – Oder of magnitude improvement vs traditional model for K up to 100 – As K approaches 10.000 improvement is getting small • Joins: – Orders of magnitude improvement for K < 1000 – Comparable to traditional model only for K around 100.000 • Aggressive policy: – When Stops overestimate, response time 3-10 times shorter – Even with underestimates not performing worse than Conservative

Alternative: translation • Query Model: – Query: desired attribute values (ie. query point) – Answer: a set ordered by how closely the tuples match the query (score = distance from query point) • Idea: translate the query into SQL, using statistics on the data • Algorithm: – Use histograms to find the score S such that K tuples will have score > S – Formulate SQL constraints on attributes in order for score to be > S – Execute SQL query – If too few tuples, decrease S and restart

Histogram usage • Bucket histograms: No restarts With restarts

RankSQL • 2005: RankSQL – an approach to fully integrate rank and Top-K concepts with the relational model • Problem to tackle: • Ranking functions are monolithic to the query engine and are evaluated at the root of execution tree • Solution: – Extended algebra – Incremental execution model – Rank-aware optimization

Rank algebra • Query (base relations, selection) augumented with: – A set of rank predicates, such that each tuple has a predicate score for every predicate – Ranking function – a monotonic function of the predicates • Idea: allow query engine to split the evaluation of ranking predicates and interleave them as it does with Boolean predicates • New concept: Rank-relation, characterized by – Ranking function F – Predicates already evaluated • Each tuple has the maximal possible score for predicates not yet evaluated

Rank-relation example • F = p 1 + p 2 + p 3 • p 1 &p 2 already evaluated • Assume max score for p 3 is 2.0

Operators • New operator Rank(predicate p): – Doesn't change tuple membership – Order induced by F with p evaluated • Other operators modified: – Join, Union, Intersection: order of joined tuples determined by predicates already evaluated for any input relation

Equivalence laws • When can we push the operators down? • • • When p is available in R • When p is available in R and S

Query execution model • Operators incrementally output rank relations – Tuples are returned in order • Queries have an explicit bound on number of desired results (K) • Operators need to know when to stop: – Example: Rank(p) operator can output tuple t if it gets a t' such that score t' with max score on p is smaller than score of t with real score on p. • Priority queues used to keep Top-K results

Example • HRJN – hash rank join • NRJN – nested loops rank join • Scans in order of some predicate

Optimization • Extension of the bottom-up System-R type optimizer • Plan enumeration performed in two dimensional space (membership and rank) • Subplans identified by set of relations and set of ranking predicates • Algorithm exponential in number of relations and predicates • Heuristic: – Left-deep (relations) – Greedy choice of a rank operator 1 − card  plan'  card  plan  goodness  rank op = cost  rank op 

Cost Model • Cardinality estimation is important in the cost model • Here, input size depends on the consumer, ie. operators may choose to stop processing input • Use sampling to estimate the score of the K-th tuple in the final result – Run any conventional plan with a few random tuples from input relations – Derive cardinalities of the real results from the sample run • Authors admit this is hard

Conclusion • First fully rank-aware DBMS • It seems that output cardinality estimation is the weak point (order of magnitude errors)

Related work • Follow-up to RankSQL: adaptive optimization, ie. changing the execution plan at runtime • Rank and generalized Group-By operations combined • Rank in uncertain databases • Approximate algorithms for Top-K

Thank you! Q&A

References • Stop operator: – On Saying “Enough Already!” in SQL Michael J. Carey, Donald Kossmann • RankSQL: – RankSQL: Query Algebra and Optimization for Relational Topk Queries Chengkai Li et al. – Adaptive Rank-Aware Query Optimization in Relational Databases Ihab F. Ilyas et al. • Translations: – Evaluating Top-k Selection Queries Surajit Chaudhuri, Luis Gravano

Top-K Queries Marcin Kwietniewski Agenda Introduction Early - PowerPoint PPT Presentation

Top-K Queries Marcin Kwietniewski Agenda Introduction Early solution Translation approach RankSQL Related work Introduction Consider a large travel database with some spatial information We're interested in

Top- -k k Queries Queries on SQL on SQL Databases Databases Top Top-k Queries on SQL

Top-k Queries over Uncertain Scores Qing Liu, Debabrota Basu, Talel Abdessalem, St ephane

HawkTracer profiler Marcin Kolny Amazon Prime Video marcin.kolny@gmail.com February 2, 2020

Queries in PSM The following rules apply to the use of queries: CS 235: 1. Queries

New Requirements Top-N/Bottom-N queries Interactive queries Decision making

Range Minimum and Lowest Common Ancestor Queries Slides by Solon P. Pissis November 15, 2019

Middleware Queries Queries Middleware Middleware Queries Prof. Paolo Ciaccia Prof. Paolo

To TOP or NOT to TOP www.SAS.com To TOP or NOT to TOP Using the TOP command in Linux By Len van

Flask Marcin Jenczmyk Clearcode m.jenczmyk@clearcode.cc 27/05/2017 Marcin Jenczmyk Flask

Incorporating Detractors into SVM Marcin Orchel AGH University of Science and Technology Marcin

Pro-health 65+ WP.5 - Initial report from scoping review. Mariusz Duplaga, Marcin Grysztar, Marcin

Core Service Failures Co e Se ce a u es Results from TIC WG Marcin Radecki Marcin Radecki

ELCE 2013 - Secure Embedded Linux Product (A Success Story) Marcin Bis http://bis-linux.com

Exploring Linux Kernel Source Code with Eclipse and QTCreator Marcin Bis 2016.10.12 ELCE 2016,

Boosted Top Tagging Seung J. Lee Outline Introduction: top jets @ LHC Modern boosted top

Geometric Algorithms Range & windowing queries (2 lectures) Database queries 2/180 G.

An Accurate Join for Zonotopes, Preserving Affine Input/Output Relations Eric Goubault, Tristan

Relational algebra with discriminative naively Relational algebra, joins and lazy products

When to Optimize Enumerating all possible plans Selection Pushdown Join Conversion Join

Joining Ranked Input In Practice Ihab F. Ilyas Purdue University Joint work with Walid G. Aref

SQL Queries 1 / 28 The SELECT-FROM-WHERE Structure SELECT <attributes > FROM <tables

LINQ to SQL: Taking the Boredom out of Querying Introduction LINQ = Language INtegrated Query =

Query Processing, Resource Management, And Approximation in a Data Stream Management System

Relational Operators indexing technology. Now we can move on to query processing.

Top-K Queries Marcin Kwietniewski Agenda Introduction Early - PowerPoint PPT Presentation

Top-K Queries Marcin Kwietniewski Agenda Introduction Early solution Translation approach RankSQL Related work Introduction Consider a large travel database with some spatial information We're interested in

Top- -k k Queries Queries on SQL on SQL Databases Databases Top Top-k Queries on SQL

Top-k Queries over Uncertain Scores Qing Liu, Debabrota Basu, Talel Abdessalem, St ephane

HawkTracer profiler Marcin Kolny Amazon Prime Video marcin.kolny@gmail.com February 2, 2020

Queries in PSM The following rules apply to the use of queries: CS 235: 1. Queries

New Requirements Top-N/Bottom-N queries Interactive queries Decision making

Range Minimum and Lowest Common Ancestor Queries Slides by Solon P. Pissis November 15, 2019

Middleware Queries Queries Middleware Middleware Queries Prof. Paolo Ciaccia Prof. Paolo

To TOP or NOT to TOP www.SAS.com To TOP or NOT to TOP Using the TOP command in Linux By Len van

Flask Marcin Jenczmyk Clearcode m.jenczmyk@clearcode.cc 27/05/2017 Marcin Jenczmyk Flask

Incorporating Detractors into SVM Marcin Orchel AGH University of Science and Technology Marcin

Pro-health 65+ WP.5 - Initial report from scoping review. Mariusz Duplaga, Marcin Grysztar, Marcin

Core Service Failures Co e Se ce a u es Results from TIC WG Marcin Radecki Marcin Radecki

ELCE 2013 - Secure Embedded Linux Product (A Success Story) Marcin Bis http://bis-linux.com

Exploring Linux Kernel Source Code with Eclipse and QTCreator Marcin Bis 2016.10.12 ELCE 2016,

Boosted Top Tagging Seung J. Lee Outline Introduction: top jets @ LHC Modern boosted top

Geometric Algorithms Range &amp; windowing queries (2 lectures) Database queries 2/180 G.

An Accurate Join for Zonotopes, Preserving Affine Input/Output Relations Eric Goubault, Tristan

Relational algebra with discriminative naively Relational algebra, joins and lazy products

When to Optimize Enumerating all possible plans Selection Pushdown Join Conversion Join

Joining Ranked Input In Practice Ihab F. Ilyas Purdue University Joint work with Walid G. Aref

SQL Queries 1 / 28 The SELECT-FROM-WHERE Structure SELECT &lt;attributes &gt; FROM &lt;tables

LINQ to SQL: Taking the Boredom out of Querying Introduction LINQ = Language INtegrated Query =

Query Processing, Resource Management, And Approximation in a Data Stream Management System

Relational Operators indexing technology. Now we can move on to query processing.

Geometric Algorithms Range & windowing queries (2 lectures) Database queries 2/180 G.

SQL Queries 1 / 28 The SELECT-FROM-WHERE Structure SELECT <attributes > FROM <tables