Data Management Systems
• Query Processing
• Execution models
• Optimization I – heuristics & rewriting (Rule Based Optimizer)
• Optimization II – cost models (Cost Based Optimizer)
• Optimization III – Operators

This lecture:
• Understanding the data
• Calculating costs

Gustavo Alonso
Institute of Computing Platforms
Department of Computer Science
ETH Zürich
Query optimization
• Since SQL is declarative, a database engine has many options to translate a query into an executable program
• Problems to solve:
  • Which starting point? Definitely not the query provided by the user
    • Queries are normalized, put in capital letters, syntactic sugar removed, etc.
    • Queries are rewritten
  • How to generate possible plans for the same query
  • How to decide which plans are best
    • Rule based (use heuristics)
    • Cost based (estimate the cost of the plans and choose the cheapest one)
What are the choices?
• There are many choices involved in coming up with the best plan:
  • Access method for each table (leaves of the query tree)
    • Is there an index on the table?
    • Is there a predicate on the query? Is the predicate on a key?
    • Are the tables clustered (on the same extent)?
  • Which implementation of each operator is chosen
    • Nested loop join, hash join
    • Is the input data sorted?
    • Is it useful if the result is sorted?
  • Shape and form of the query tree
    • Pushdown of selection and projection
    • In which order joins are executed
Cost based query optimization
• The key to running queries faster is being able to estimate their performance without actually executing them
  • That way we can choose which plan is best
• Complex problem:
  • Many components involved: CPU, memory, I/O, network
  • Typically, focus on the major bottlenecks: memory and I/O
• The structured processing of queries helps by providing some relevant parameters (number of tuples, distribution, selectivity, etc.)
• Try to make clever decisions by using as much information as is available
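To make this concrete, here is a minimal sketch of how an optimizer might compare two access paths under a simplified, I/O-only cost model. The statistics, cost formulas, and numbers are illustrative assumptions, not any specific engine's model:

```python
# A sketch of cost-based access path selection, assuming a simplified
# I/O-only cost model and hypothetical table statistics.

def full_scan_cost(num_blocks: int) -> float:
    # A full table scan reads every block of the table once.
    return float(num_blocks)

def index_scan_cost(index_levels: int, selectivity: float, num_rows: int) -> float:
    # Traverse the B-tree, then fetch one block per matching row
    # (pessimistic: assumes an unclustered index).
    return index_levels + selectivity * num_rows

def choose_access_path(stats: dict, selectivity: float) -> str:
    scan = full_scan_cost(stats["blocks"])
    index = index_scan_cost(stats["index_levels"], selectivity, stats["rows"])
    return "index scan" if index < scan else "full table scan"

stats = {"rows": 1_000_000, "blocks": 10_000, "index_levels": 3}
print(choose_access_path(stats, selectivity=0.0001))  # index scan
print(choose_access_path(stats, selectivity=0.5))     # full table scan
```

The crossover between the two plans depends entirely on the estimated selectivity, which is why the statistics discussed next matter so much.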
Understanding the data
https://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i82005
The basics for optimization
• The main information source for query optimization is statistics on the data
• These statistics are constantly collected on tables, indexes, buffers, and the system, and made available (in Oracle, through the “Dictionary”)
• The statistical data is the basis for the decisions the query optimizer makes when choosing one plan over another, and also when deciding which operator implementation to use
Typical statistics

Table statistics
• Number of rows
• Number of blocks
• Average row length

Column statistics
• Number of distinct values (NDV) in column
• Number of nulls in column
• Data distribution (histogram)

Extended statistics

Index statistics
• Number of leaf blocks
• Levels
• Clustering factor

System statistics
• I/O performance and utilization
• CPU performance and utilization

https://docs.oracle.com/cd/B28359_01/server.111/b28274/stats.htm#i37048
Why such statistics? Examples
• Number of rows:
  • Tells you the size of tables
  • Helps decide how to run a join
• Number of distinct values:
  • Helps to estimate the selectivity of a predicate (how many results will be produced)
  • Helps to decide on the join order
• Histograms:
  • Help to estimate the selectivity of predicates on skewed tables
  • Help to decide on the join order
Table sizes and joins
• In a join R ⋈ S, there is an outer table and an inner table (do not confuse with inner and outer joins):
  • The outer table is typically the smaller one
• Nested loop join: the smaller table is the outer loop, the big table the inner loop
  • Access to the big table is sequential
• Hash join: the smaller table is used to build the hash table, the big table is used for probing
  • Smaller hash table to maintain
  • Sequential access to the big table
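A minimal sketch of a hash join illustrating why the smaller input is the build side; the tables R and S and their contents are hypothetical:

```python
# Sketch: the smaller input builds the hash table, the bigger input
# is streamed past it sequentially. Row counts are hypothetical.

def hash_join(build, probe):
    # Build a hash table on the join key of the smaller input.
    table = {}
    for row in build:
        table.setdefault(row[0], []).append(row)
    # Probe with the bigger input: one sequential pass, one lookup per row.
    for row in probe:
        for match in table.get(row[0], []):
            yield match + row[1:]

R = [(1, "a"), (2, "b")]                      # small table: build side
S = [(1, "x"), (1, "y"), (2, "z"), (3, "w")]  # big table: probed sequentially
print(list(hash_join(R, S)))  # [(1, 'a', 'x'), (1, 'a', 'y'), (2, 'b', 'z')]
```

Swapping the arguments would produce the same result but a bigger hash table to keep in memory, which is exactly the decision the optimizer makes from the row-count statistics.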
Histograms
• Histograms are widely used in all major databases:
  • “Cardinality estimation (CE) in SQL Server is derived primarily from histograms that are created when indexes or statistics are created, either manually or automatically.” (SQL Server Manual)
  • “By default the optimizer assumes a uniform distribution of rows across the distinct values in a column. For columns that contain data skew (a non-uniform distribution of data within the column), a histogram enables the optimizer to generate accurate cardinality estimates for filter and join predicates that involve these columns.” (Oracle Manual)
  • “… in MySQL we have chosen to support two different types: the ‘singleton’ histogram and the ‘equi-height’ histogram.” (MySQL Server Blog)
Histograms
Example query: SELECT * FROM person WHERE 25 < age < 40;

EQUI-DEPTH (EQUI-HEIGHT)
• Same number of tuples per bucket
• Helps to partition data evenly
• The size of a range helps with cardinality estimates
• May store distinct values and min/max, etc.
[Chart: equi-depth histogram over age with buckets 20 to 42, 42 to 48, 48 to 53, 53 to 59, 59 to 70]

EQUI-WIDTH
• Ranges of values are fixed and equal
• Tells how many values fall in each range
• Helps identify hot spots
• May store distinct values and min/max, etc.
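A sketch of how an equi-depth histogram answers the range query above, assuming the bucket boundaries from the chart, 200 rows per bucket, and uniformity within each bucket (all assumptions for illustration):

```python
# Sketch: range cardinality estimate from an equi-depth histogram.
# Bucket boundaries follow the chart; counts and uniformity within
# a bucket are assumptions.

buckets = [(20, 42), (42, 48), (48, 53), (53, 59), (59, 70)]  # age ranges
rows_per_bucket = 200  # equi-depth: every bucket holds the same count

def estimate_rows(lo, hi):
    total = 0.0
    for b_lo, b_hi in buckets:
        # Fraction of the bucket's range that overlaps the predicate.
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo))
        total += rows_per_bucket * overlap / (b_hi - b_lo)
    return total

# SELECT * FROM person WHERE 25 < age < 40;
print(estimate_rows(25, 40))  # ~136 of the 1000 rows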
Singleton or frequency histogram
• The frequency histogram plots the frequency of every distinct item in a table
  • In essence, how often each value appears in the table
• Very useful to compute the selectivity of queries
• Highly accurate, as it gives counts for every possible value
• Can be done if the number of distinct values is not too high
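A minimal sketch of selectivity estimation from a frequency histogram; the column, its values, and the counts are made up:

```python
# Sketch: equality selectivity from a singleton (frequency) histogram.
# Exact, because every distinct value carries its own count.

freq = {"DE": 500, "CH": 300, "AT": 150, "LI": 50}  # country -> row count
total = sum(freq.values())

def selectivity_eq(value):
    # Selectivity of: WHERE country = value
    return freq.get(value, 0) / total

print(selectivity_eq("CH"))  # 0.3
print(selectivity_eq("FR"))  # 0.0 -- value does not occur in the table
```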
Selecting a type of histogram (example)
• NDV: this represents the number of distinct values in a column. For example, if a column only contains the values 100, 200, and 300, then the NDV for this column is 3.
• n: this variable represents the number of histogram buckets. The default is 254.
• p: this variable represents an internal percentage threshold that is equal to (1 – (1/n)) * 100. For example, if n = 254, then p is 99.6.
https://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm#TGSQL-GUID-FFA0C0AF-3761-4829-995E-9AFA524F96CE
Zone Maps
• A zone map is a combination of a coarse index and statistics
• For every block of a table:
  • Keep the max and min values for some or all columns
• Before reading a block, check the zone map:
  • If the range does not match the predicate, do not read the block
• It can significantly reduce I/O cost
• In some cases it can replace an index
• Other statistics can be kept in a zone map
• An example of the use of the zone map concept is Snowflake (see the chapter on Storage)
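A sketch of the pruning idea, with a hypothetical table laid out in three blocks:

```python
# Sketch: zone-map pruning. Per-block (min, max) values let the scan
# skip blocks that cannot contain matches. Data layout is hypothetical.

blocks = [[3, 7, 5], [12, 15, 11], [28, 22, 25]]   # rows grouped in blocks
zone_map = [(min(b), max(b)) for b in blocks]      # (min, max) per block

def scan_gt(threshold):
    # SELECT ... WHERE col > threshold
    for (lo, hi), block in zip(zone_map, blocks):
        if hi <= threshold:
            continue  # whole block out of range: block is never read
        yield from (v for v in block if v > threshold)

print(list(scan_gt(20)))  # [28, 22, 25] -- the first two blocks are skipped
```

The win is largest when the data is clustered or sorted on the filtered column, so that the per-block ranges barely overlap.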
Calculating costs
Cardinality I
• WARNING: there are several definitions of cardinality going around…
  • The cardinality of an attribute: how many distinct values there are for that attribute
  • Table cardinality (Oracle): the number of tuples in the table
  • Operator cardinality: the number of tuples that must be processed to get the result
  • Predicate cardinality: how many tuples match the predicate
• All these definitions are related and sometimes can actually be the same, but they mean different things:
  • Attribute cardinality is used to determine selectivity (how many tuples will be produced after applying an operator) => the output of the operator
  • Operator cardinality is used to determine the cost of running an operator (how many tuples need to be read and processed) => the input of the operator
  • The cardinality of an attribute and predicate cardinality are often related
Cardinality II
• The number of tuples to be processed can be estimated from:
  • Size of the table or input
  • Type of operator and access method
    • Full table scan
    • Index scan
    • Range predicates
    • Equality predicates
• The attribute cardinality is estimated using statistics
• Both are easier to determine for base tables. The difficulty is how to estimate them for intermediate results
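For equi-joins, a classic textbook estimate sizes the intermediate result from the input cardinalities and the attribute NDVs. A sketch, assuming uniform value distributions and containment of the smaller value set in the larger one:

```python
# Sketch: the textbook cardinality estimate for an equi-join R.a = S.a,
# used to size intermediate results. Assumes uniform values and
# value-set containment; the numbers below are made up.

def join_cardinality(rows_r, rows_s, ndv_r_a, ndv_s_a):
    # |R JOIN S| ~= |R| * |S| / max(NDV(R.a), NDV(S.a))
    return rows_r * rows_s / max(ndv_r_a, ndv_s_a)

print(join_cardinality(1_000, 10_000, ndv_r_a=100, ndv_s_a=500))  # 20000.0
```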
Selectivity
• The selectivity of an operator is how much data it will produce, typically expressed as a fraction of the data in the table:
  • Selectivity 1 = all the data will be selected
  • Selectivity 0 = none of the data will be selected
• If we can guess the selectivity, we know how much data the operator will produce for the next operator to process
• The selectivity is related to the cardinality of an attribute, as that tells us how many results will be produced for a given selection predicate
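Written out, the standard textbook estimates look as follows; real optimizers refine them with NDV statistics and histograms:

```latex
% Standard textbook selectivity estimates (a sketch; real optimizers
% refine these with NDV statistics and histograms).
\[
\mathrm{sel}(p) = \frac{|\sigma_p(T)|}{|T|}, \qquad
\mathrm{sel}(a = c) \approx \frac{1}{\mathrm{NDV}(a)}, \qquad
\mathrm{sel}(a > c) \approx \frac{\max(a) - c}{\max(a) - \min(a)}
\]
```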
Example
• Table T with 1000 rows
• SELECT * FROM T
  • Operator cardinality = 1000 rows
  • Query selectivity = 1
• SELECT * FROM T WHERE T.id = 123456 (T.id is the key)
  • Operator cardinality = depends on the access method
  • Query selectivity = 1/1000 = 0.001
• SELECT * FROM T WHERE T.price > 57
  • Operator cardinality = depends on data distribution and access method
  • Query selectivity = depends on data distribution
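A sketch reproducing these estimates with the textbook formulas above; the uniformity assumption and the price range [0, 100] are made up for illustration:

```python
# Sketch: the slide's selectivity estimates, derived with the textbook
# formulas. Uniformity and the price range [0, 100] are assumptions.

rows = 1000

# SELECT * FROM T -> every row qualifies
print(rows / rows)            # selectivity 1.0

# SELECT * FROM T WHERE T.id = 123456 (id is the key: NDV = rows)
print(1 / rows)               # selectivity 0.001

# SELECT * FROM T WHERE T.price > 57 (assume price uniform in [0, 100])
lo, hi = 0, 100
print((hi - 57) / (hi - lo))  # selectivity 0.43 -- data dependent
```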