Cardinality Estimation Done Right: Index-Based Join Sampling Viktor - PowerPoint PPT Presentation

Cardinality Estimation Done Right: Index-Based Join Sampling Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, Thomas Neumann Technische Universität München Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Join Ordering ◮ finding a good join order is arguably the most important problem in query optimization ◮ the costs of different join orders often vary by orders of magnitude ◮ to distinguish good plans from bad ones, the cost of plans must be estimated using cardinality estimates for intermediate results ◮ example: to find the optimal join order for A ⊲ ⊳ A . id = B . aid B ⊲ ⊳ B . cid = C . id C , one needs estimates for A ⊲ ⊳ B and B ⊲ ⊳ C Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

← → The State of the (Industrial) Art in Cardinality Estimation ◮ in virtually all systems estimation is based on: 1. histograms and unique value counts 2. strong assumptions: uniformity, independence, inclusion ◮ results with real-world data are terrible (Join Order Benchmark [Leis et al., VLDB 2016]): Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

The State of the (Industrial) Art in Cardinality Estimation ◮ in virtually all systems estimation is based on: 1. histograms and unique value counts 2. strong assumptions: uniformity, independence, inclusion ◮ results with real-world data are terrible (Join Order Benchmark [Leis et al., VLDB 2016]): PostgreSQL ← underest. [log scale] overest. → 1e4 1e2 1 1e2 1e4 1e6 1e8 0 1 2 3 4 5 6 number of joins Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

The State of the (Industrial) Art in Cardinality Estimation ◮ in virtually all systems estimation is based on: 1. histograms and unique value counts 2. strong assumptions: uniformity, independence, inclusion ◮ results with real-world data are terrible (Join Order Benchmark [Leis et al., VLDB 2016]): PostgreSQL PostgreSQL normalized plan cost [log scale] ← underest. [log scale] overest. → 1e4 30 1e2 1 10 1e2 1e4 3 1e6 1e8 1 0 1 2 3 4 5 6 0% 25% 50% 75%100% number of joins percentile Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Sampling to the Rescue? ◮ sampling, which does not rely on strong assumptions, is a highly promising alternative ◮ some systems (e.g., HyPer) use sampling to estimate base table selections: ◮ keep random samples for each table (e.g., 10,000 rows) and execute selection on sample ◮ produces accurate estimates for arbitrary predicates (correlations etc. are not a problem) ◮ How to use sampling for joins, which are the main source of errors? Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Using Samples for Joins We could pretend that samples are base tables and compute the join result ◮ would allows use to estimate the result size ◮ could become expensive, too (unlikely, but possible) ◮ usually, joining samples will produce small/empty results ◮ need a mechanism to sample the real join result ◮ but we cannot sample everything ◮ number of join candidates is exponential Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Related (Sampling-Based) Work ◮ CS2 [Yu et al., SIGMOD 2013] ◮ pre-materialize correlated samples to avoid joining independent samples ◮ works well for star queries, but it is unknown how to apply this idea automatically for arbitrary queries ◮ ROX [Kader et al., SIGMOD 2009] ◮ greedy heuristics that uses sampling through indexes to make more informed decisions ◮ does not enumerate all join orders ◮ Sampling-Based Re-Optimization [Wu et al., SIGMOD 2016] ◮ get plan from traditional optimizer ◮ repeat until plan does not change: execute plan using 5% samples of each table ◮ high overhead (large samples), avoids some bad plans but often misses optimal plan (no systematic exploration) ◮ sampling-based approaches proposed so far have weaknesses that preclude their use in industrial-strength systems Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Index-based Join Sampling: Main Ideas 1. use existing index structures and fixed-size samples (1000) to get samples for larger intermediate results 2. systematically explore intermediate results in a bottom-up fashion (2-way joins, 3-way joins, . . . ) 3. inject cardinalities computed in step 2 and run exhaustive join ordering algorithm (e.g., dynamic programming) Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

1. Cheap Sampling Using Indexes ◮ given a sample S obtain a sample for S ⊲ ⊳ S . id = A . id A using an existing index on A . id : 1. count the number of join partners for each tuple in S 2. draw tuples at random to obtain desired number of results S sample of T count per tuple ( cpt ) sum : 15 6 1 3 5 S A B (not materialized) Sout sample of sid : {2,4,8,13} 2 4 8 13 T B A chosen : 0 chosen : 0 chosen : 2 chosen : 3 o ff set : 2 o ff set : 4 o ff set : 1 o ff set : 3 ◮ makes each sampling step cheap by avoiding “exploding” intermediate result sizes ◮ O ( | S | ) or O ( | S | log | A | ), | S | ≤ 1000 Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

2. Systematic Bottom-Up Exploration ◮ generate samples for base tables ◮ join using existing indexes obtaining results for all 2-way joins, only then proceed to 3-way joins ◮ avoids “fleeing to ignorace” of the optimizer ◮ we need only one estimate per equivalence class ◮ stop early if sampling budget (e.g., 100K lookups) runs out ◮ budget is a parameter that determines how much time is spent in the additional sampling phase Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

3. Join Ordering ◮ inject cardinalities from step 2 into traditional optimizer ◮ fall back to traditional estimation if a result could not be estimated using sampling (due to missing indexes, a too small budget, or very high selectivities) ◮ run traditional (typically exhaustive) join enumeration algorithm, which now has much more accurate information ◮ execute resulting plan Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Does Sampling Improve Estimation? PostgreSQL index-based (10K) index-based (100K) index-based (no budget) 1e4 ← underestimation [log scale] overest. → 1e2 1 1e2 1e4 95th percentile 1e6 75th percentile median 25th percentile 1e8 5th percentile 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 number of joins Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

How Expensive is Sampling? sampling-based re-optimization index-based sampling (100K) index-based sampling (10K) index-based sampling (no budget) sampling time [ms] [log scale] 1000 100 10 4 8 12 16 number of joins Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Does Sampling Improve Plan Quality? PostgreSQL normalized plan cost [log scale] 30 sampling-based re-opt. index-based samp. (10K) 10 index-based samp. (100K) index-based samp. (no budget) 3 1 0.3 0% 25% 50% 75% 100% percentile Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Summary ◮ index-based join sampling is an effective approach for cardinality estimation in main-memory database systems ◮ considerably improves estimation and plan quality ◮ low and configurable overhead ◮ easy to integrate into existing systems ◮ also possible as an optional phase (e.g., triggered by the user) for hard, long-running queries Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Cardinality Estimation Done Right: Index-Based Join Sampling Viktor - PowerPoint PPT Presentation

Cardinality Estimation Done Right: Index-Based Join Sampling Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, Thomas Neumann Technische Universitt Mnchen Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right:

Chapter 9 Cardinality Estimation How Many Rows Does a Query Yield? Cardinality Estimation

Section 2.5 Section Summary ! Cardinality ! Countable Sets ! Computability Cardinality Definition

Section 2.5 1 Cardinality Definition : The cardinality of a set A is equal to the cardinality of

CS143: Index 1 Topics to Learn Important concepts Dense index vs. sparse index Primary

Winning Presentation in a Day Get It Done Right, Get It Done Fast Winning Presentation in a Day

Winning Presentation in a Day Get It Done Right, Get It Winning Presentation in a Day Get It Done

JOINS IN SQL By Rohit Dhanwani OBJECTIVES Define and use different types of joins INNER

CSE 322, Fall 2010 A set is countable if it is finite or has the same cardinality as the natural

Cardinality 2/12 Example: Do the sets Do the sets Cardinality { 5 , 10 , 15 , . . . , 155 } N

Revisiting the cardinality Operator and Introducing the cardinality-path Constraint Family

Index Rules and Methodology Index Name Ticker S-Network US Equity 3000 Index SN3000 S-Network

When to Optimize Enumerating all possible plans Selection Pushdown Join Conversion Join

Finding the Right Target Audience Defining the Right Audience Right Visitors Right Time

Motion Estimation by Affine Transforms Motion Estimation by Affine Transforms Motion Estimation

Matrix COSEC Right People in Right Place at Right Time Matrix COmplete SECurity Matrix COSEC

light right light right light right light right to steady the tongue, hold the sides of

Politics Inequality in the United States 1 download slides at: www.inequality.com/slides

CS 147: Computer Systems Performance Analysis Introduction to Queueing Theory 1 / 27 Overview

Develop Your Data Mindset Module 13 - Student Level Goal Evaluation Part 2 - Accumulate &

Trends in the Well-Being of the Aged and Their Prospects through 2030 Gary Burtless T HE B

and other platforms Sankalp Sah, Manish Singh MityLytics Inc Why ARM for Cassandra ? RISC

Lecture 10/Chapter 8 Bell-Shaped Curves & Other Shapes From a Histogram to a Frequency

Lecture 11 : The Basic Numerical Quantities Associated to a Continuous X 0/ 25 In this lecture we

Bootstrap approach for dissolution similarity testing, performance and limitations Leslie Van

Cardinality Estimation Done Right: Index-Based Join Sampling Viktor - PowerPoint PPT Presentation

Cardinality Estimation Done Right: Index-Based Join Sampling Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, Thomas Neumann Technische Universitt Mnchen Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right:

Chapter 9 Cardinality Estimation How Many Rows Does a Query Yield? Cardinality Estimation

Section 2.5 Section Summary ! Cardinality ! Countable Sets ! Computability Cardinality Definition

Section 2.5 1 Cardinality Definition : The cardinality of a set A is equal to the cardinality of

CS143: Index 1 Topics to Learn Important concepts Dense index vs. sparse index Primary

Winning Presentation in a Day Get It Done Right, Get It Done Fast Winning Presentation in a Day

Winning Presentation in a Day Get It Done Right, Get It Winning Presentation in a Day Get It Done

JOINS IN SQL By Rohit Dhanwani OBJECTIVES Define and use different types of joins INNER

CSE 322, Fall 2010 A set is countable if it is finite or has the same cardinality as the natural

Cardinality 2/12 Example: Do the sets Do the sets Cardinality { 5 , 10 , 15 , . . . , 155 } N

Revisiting the cardinality Operator and Introducing the cardinality-path Constraint Family

Index Rules and Methodology Index Name Ticker S-Network US Equity 3000 Index SN3000 S-Network

When to Optimize Enumerating all possible plans Selection Pushdown Join Conversion Join

Finding the Right Target Audience Defining the Right Audience Right Visitors Right Time

Motion Estimation by Affine Transforms Motion Estimation by Affine Transforms Motion Estimation

Matrix COSEC Right People in Right Place at Right Time Matrix COmplete SECurity Matrix COSEC

light right light right light right light right to steady the tongue, hold the sides of

Politics Inequality in the United States 1 download slides at: www.inequality.com/slides

CS 147: Computer Systems Performance Analysis Introduction to Queueing Theory 1 / 27 Overview

Develop Your Data Mindset Module 13 - Student Level Goal Evaluation Part 2 - Accumulate &amp;

Trends in the Well-Being of the Aged and Their Prospects through 2030 Gary Burtless T HE B

and other platforms Sankalp Sah, Manish Singh MityLytics Inc Why ARM for Cassandra ? RISC

Lecture 10/Chapter 8 Bell-Shaped Curves &amp; Other Shapes From a Histogram to a Frequency

Lecture 11 : The Basic Numerical Quantities Associated to a Continuous X 0/ 25 In this lecture we

Bootstrap approach for dissolution similarity testing, performance and limitations Leslie Van

Develop Your Data Mindset Module 13 - Student Level Goal Evaluation Part 2 - Accumulate &

Lecture 10/Chapter 8 Bell-Shaped Curves & Other Shapes From a Histogram to a Frequency