Data Management Systems • Query Processing • Execution models Views and Schemas • Optimization I – heuristics & Query rewriting rewriting • Optimization II – cost models Basic optimization heuristics • Optimization III - Operators Gustavo Alonso Institute of Computing Platforms Department of Computer Science ETH Zürich Query processing Optimization I 1
Views and Schemas Query processing Optimization I 2
Let’s take one step back • Remember “logical data independence” and “access controls”? • They are implemented using views • What is exactly a view? • Have you asked yourselves where queries come from? • Most often, from applications and programs • Heavy use of templates and parameterized queries • Database engines aim at making it easier to provide these abstractions. In the process, there is a lot going on in the background – mostly related to query processing Query processing Optimization I 3
A Database with a View • In the relational model: • with the exception of base tables, a query and a table are the same • the table contains the actual data, the query tells you how to get the data in the table (i.e., the table itself) • both can be used to reason about queries • A view, when created, results in a virtual table being added to the schema • A materialized view results in an actual table being added to the schema with the data described in the view Query processing Optimization I 4
Example of a view QUERY: SELECT employee_id, last_name, job_id, manager_id, dept_id FROM employees VIEW: CREATE VIEW staff AS SELECT employee_id, last_name, job_id, manager_id, dept_id FROM employees https://docs.oracle.com/cd/B19306_01/network.102/b14266/ accessre.htm#i1007436 Query processing Optimization I 5
Why views? • Views are extensively used • To implement logical data independence (create a schema different from the original one) • To make it easier to write applications (writing SQL is programming) by providing views that combine data in useful ways • To make it easier to write queries by writing intermediate stages as views • For access control by giving a user access to a view instead of to the base tables • To speed up processing by materializing the view and having the data already pre-processed for some queries (e.g., a join) Query processing Optimization I 6
Schemas TPC-C schema • Schemas provide the basic organization of the data • Views allow to tailor that logical organization to the needs of particular applications without changing the basic schema: • Orders in a given district • Orders in a period of time • Orders from a customer • … Query processing Optimization I 7
Analytics (TPC-H) • Databases for analytics often use a start schema: • Fact table (very big) with the central element of the schema • Dimension tables (typically smaller): with more data on the different attributes that are mentioned in the fact table • It helps separate the relevant operational data from information that might be needed but it is not central to processing, e.g., an order Query processing Optimization I 8
Snowflake schema • In star schemas, the dimension tables and the fact table are not normalized • A snowflake schema is a star schema where the dimension tables are normalized (some or all of them). Normalization is applied to low cardinality attributes to remove redundancy Query processing Optimization I 9
Modern analytics (TPC-DS): Snow-storm • Multiple snow-flake schemas linked to each other • TPC-DS • Large number of tables (26) • Large number of columns per table (38) • Multiple fact tables to enable joins between large tables (fact to fact joins) Query processing Optimization I 10
Schemas and views • The more complex the schema (very common in serious applications), the more extensive use of views: • Provide an application with the data it needs without having to understand the whole schema • Since many applications will be running off the same schema (marketing, sales, auditing, logistics, etc.), each one can get just the data it needs simplifying application development • Common operations over the schema can be captured with materialized views Query processing Optimization I 11
Where do queries come from? • In a commercial setting, it is very rare that a database will be used interactively with a user typing queries directly at a terminal • Databases are used programmatically • An application program contains queries that are run to extract the data the program needs • A user interface generates queries after the user checks some options • A service can be provided by providing an RPC like interface to already written queries that implement the service (users call a procedure, not a query) • Database engines aim at simplifying how to write such queries to be used by programs and applications Query processing Optimization I 12
Query templates in user interfaces Brand Lenovo HP SELECT model, price, delivery_date FROM Laptops Apple WHERE Samsung Brand = X Price range AND Less than 800 price < Y AND price > Z Between 800 and 1500 More than 1500 Query processing Optimization I 13
Implications for query processing • The use of views and query templates affects query processing: • Query might be expressed over a view instead of over base tables • A query might contain many conjunctions and disjunctions of arbitrarily complex predicates • Such queries might not be the most efficient way to express the query Query processing Optimization I 14
Query rewriting Query processing Optimization I 15
Anatomy of query processing QUERY Validation, access control Check caches QUERY PARSER INTERMEDIATE REPRESENTATION (Abstract Syntax Tree) Interpretation DB SCHEMA REWRITING Compilation OPERATOR TREE (PLAN) QUERY CODE OPTIMIZATION STATISTICS EXECUTION GENERATION QUERY CODE/ PLAN PLAN Query processing introduction 16
IBM DB2 Query processing introduction 17 https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5 .0/com.ibm.db2.luw.admin.perf.doc/doc/c0005292.html
Query rewritring • Query rewriting refers to transforming the original SQL query into an equivalent SQL query that: • Is more efficient by removing operations • Gives the query optimizer more freedom to operate • Makes more explicit what the query wants to do (and the optimizer can then act accordingly) • Maps the query to actual base tables and views as needed and for efficiency reasons • Query rewriting is typically done just looking at the schema, without paying attention to statistics or cost estimates Query processing Optimization I 18
Rewriting Predicates • Predicates indicate operations to be performed on a tuple or tuples SELECT * FROM T WHERE T.salary > 50.000 AND T.age < 45 AND T.dept = D17 • Think about it as a big “if … then … else” construct that must be run on every tuple of table T • If we find ways to reduce the number of comparisons that need to be made, the query will run faster • If we avoid having to go several times over the same tuple, each one checking a different predicate, the query will run faster Query processing Optimization I 19
Predicate transformation SELECT * FROM T SELECT * FROM T WHERE (T.price > 50 AND T.price < 100) WHERE (T.price > 50 AND T.price < 200) OR (T.price > 90 AND T.price < 200) SELECT * FROM employee WHERE SELECT * FROM employee deptno = 'D11' or where deptno in ('D11', 'D21', 'E21') deptno = 'D21' or deptno = 'E21' Preferred if there is an Preferred if there is no index over deptno index over deptno (follow the index for the (match every tuple against three values) set of values) https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0 Query processing Optimization I 20 /com.ibm.db2.luw.admin.perf.doc/doc/c0005296.html
Predicate augmentation SELECT empno, lastname, firstname, deptno, deptname SELECT empno, lastname, firstname, deptno, deptname FROM employee emp, department dept FROM employee emp, department dept WHERE WHERE emp.workdept = dept.deptno AND emp.workdept = dept.deptno AND dept.deptno > 'E00‘ AND dept.deptno > 'E00' emp.workdept > ’E00’ The query is telling us that emp.workdept and By adding the second dept.deptno match and can predicate on emp.workdept, be compared the optimizer can filter out tuples form the table emp before doing the join, making it cheaper https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0 /com.ibm.db2.luw.admin.perf.doc/doc/c0005296.html Query processing Optimization I 21
Predicate augmentation SELECT * FROM T, R, S SELECT * FROM T, R, S WHERE WHERE T.id = R.id AND T.id = R.id AND R.id = S.id AND R.id = S.id T.id = S.id As written, the query join T By adding T.id = S.id, we are and R, and R and S telling the optimizer that it But that is not the only can choose any combination option of joins of those three tables Query processing Optimization I 22
Recommend
More recommend