15-721 ADVANCED DATABASE SYSTEMS Lecture #20 Query Compilation - PowerPoint PPT Presentation

15-721 ADVANCED DATABASE SYSTEMS Lecture #20 – Query Compilation Andy Pavlo / / Carnegie Mellon University / / Spring 2016 @Andy_Pavlo // Carnegie Mellon University // Spring 2017

2 TODAY’S AGENDA Background Code Generation / Transpilation JIT Compilation (LLVM) Real-world Implementations CMU 15-721 (Spring 2017)

3 HEKATON REMARK After switching to an in-memory DBMS, the only way to increase throughput is to reduce the number of instructions executed. → To go 10x faster, the DBMS must execute 90% fewer instructions… → To go 100x faster, the DBMS must execute 99% fewer instructions… COMPILATION IN THE MICROSOFT SQL SERVER HEKATON ENGINE IEEE Data Engineering Bulletin 2011 CMU 15-721 (Spring 2017)

4 OBSERVATION The only way that we can achieve such a reduction in the number of instructions is through code specialization . This means generating code that is specific to a particular task in the DBMS. Most code is written to make it easy for humans to understand rather than performance… CMU 15-721 (Spring 2017)

5 EXAMPLE DATABASE CREATE TABLE A ( CREATE TABLE B ( id INT PRIMARY KEY , id INT PRIMARY KEY, val INT val INT ); ); CREATE TABLE C ( a_id INT REFERENCES A( id ), b_id INT REFERENCES B( id ), PRIMARY KEY ( a_id , b_id ) ); CMU 15-721 (Spring 2017)

6 QUERY INTERPRETATION ⨝ SELECT * A.id=C.a_id FROM A, C, ( SELECT B.id, COUNT (*) FROM B ⨝ σ WHERE B.val = ? + 1 A.val=123 B.id=C.b_id GROUP BY B.id) AS B WHERE A.val = 123 AND A.id = C.a_id A Γ AND B.id = C.b_id B.id, COUNT(*) σ B.val= ? +1 B C CMU 15-721 (Spring 2017)

6 QUERY INTERPRETATION ⨝ ⨝ for t 1 in left.getNext() : SELECT * A.id=C.a_id buildHashTable (t 1 ) FROM A, C, for t 2 in right.getNext() : if probe (t 2 ): emit (t 1 ⨝ t 2 ) ( SELECT B.id, COUNT (*) FROM B ⨝ σ σ ⨝ for t 1 in left.getNext() : WHERE B.val = ? + 1 for t in child.getNext() : buildHashTable (t 1 ) A.val=123 B.id=C.b_id GROUP BY B.id) AS B if evalPred (t): emit (t) for t 2 in right.getNext() : WHERE A.val = 123 if probe (t 2 ): emit (t 1 ⨝ t 2 ) AND A.id = C.a_id A Γ A Γ for t in child.getNext() : AND B.id = C.b_id for t in A : buildAggregateTable (t) B.id, COUNT(*) emit (t) for t in aggregateTable : emit (t) σ σ for t in child.getNext() : B.val= ? +1 if evalPred (t): emit (t) B C B C for t in B : for t in C : emit (t) emit (t) CMU 15-721 (Spring 2017)

7 PREDICATE INTERPRETATION SELECT * FROM A, C, ( SELECT B.id, COUNT (*) FROM B WHERE B.val = ? + 1 GROUP BY B.id) AS B WHERE A.val = 123 AND A.id = C.a_id AND B.id = C.b_id CMU 15-721 (Spring 2017)

7 PREDICATE INTERPRETATION Execution Context SELECT * FROM A, C, ( SELECT B.id, COUNT (*) Current Tuple Query Parameters Table Schema FROM B (123, 1000) (int:999) B→( int:id, int:val) WHERE B.val = ? + 1 GROUP BY B.id) AS B WHERE A.val = 123 = AND A.id = C.a_id AND B.id = C.b_id TupleAttribute(val) + Parameter(0) Constant(1) CMU 15-721 (Spring 2017)

7 PREDICATE INTERPRETATION Execution Context SELECT * FROM A, C, ( SELECT B.id, COUNT (*) Current Tuple Query Parameters Table Schema FROM B (123, 1000) (int:999) B→( int:id, int:val) WHERE B.val = ? + 1 GROUP BY B.id) AS B WHERE A.val = 123 = AND A.id = C.a_id AND B.id = C.b_id TupleAttribute(val) + 1000 Parameter(0) Constant(1) CMU 15-721 (Spring 2017)

7 PREDICATE INTERPRETATION Execution Context SELECT * FROM A, C, ( SELECT B.id, COUNT (*) Current Tuple Query Parameters Table Schema FROM B (123, 1000) (int:999) B→( int:id, int:val) WHERE B.val = ? + 1 GROUP BY B.id) AS B WHERE A.val = 123 = AND A.id = C.a_id AND B.id = C.b_id TupleAttribute(val) + 1000 Parameter(0) Constant(1) 999 CMU 15-721 (Spring 2017)

7 PREDICATE INTERPRETATION Execution Context SELECT * FROM A, C, ( SELECT B.id, COUNT (*) Current Tuple Query Parameters Table Schema FROM B (123, 1000) (int:999) B→( int:id, int:val) WHERE B.val = ? + 1 GROUP BY B.id) AS B WHERE A.val = 123 = AND A.id = C.a_id AND B.id = C.b_id TupleAttribute(val) + 1000 Parameter(0) Constant(1) 999 1 CMU 15-721 (Spring 2017)

7 PREDICATE INTERPRETATION Execution Context SELECT * FROM A, C, ( SELECT B.id, COUNT (*) Current Tuple Query Parameters Table Schema FROM B (123, 1000) (int:999) B→( int:id, int:val) WHERE B.val = ? + 1 GROUP BY B.id) AS B WHERE A.val = 123 = AND A.id = C.a_id AND B.id = C.b_id true TupleAttribute(val) + 1000 1000 Parameter(0) Constant(1) 999 1 CMU 15-721 (Spring 2017)

8 CODE SPECIALIZATION Any CPU intensive entity of database can be natively compiled if they have a similar execution pattern on different inputs. → Access Methods → Stored Procedures → Operator Execution → Predicate Evaluation → Logging Operations CMU 15-721 (Spring 2017)

9 BENEFITS Attribute types are known a priori . → Data access function calls can be converted to inline pointer casting. Predicates are known a priori . → They can be evaluated using primitive data comparisons. No function calls in loops → Allows the compiler to efficiently distribute data to registers and increase cache reuse. CMU 15-721 (Spring 2017)

10 ARCHITECTURE OVERVIEW Cost Estimates System Catalog SQL Query Optimizer Annotated AST SQL Rewriter Physical (Optional) Tree Rewriter Plan (Optional) Compiler SQL Query Binder Annotated AST Parser Native Code Abstract Syntax Tree CMU 15-721 (Spring 2017)

11 CODE GENERATION Approach #1: Transpilation → Write code that converts a relational query plan into C/C++ and then run it through a conventional compiler to generate native code. Approach #2: JIT Compilation → Generate an intermediate representation (IR) of the query that can be quickly compiled into native code . CMU 15-721 (Spring 2017)

12 HIQUE – CODE GENERATION For a given query plan, create a C/C++ program that implements that query’s execution. → Bake in all the predicates and type conversions. Use an off-shelf compiler to convert the code into a shared object, link it to the DBMS process, and then invoke the exec function. GENERATING CODE FOR HOLISTIC QUERY EVALUATION ICDE 2010 CMU 15-721 (Spring 2017)

13 OPERATOR TEMPLATES SELECT * FROM A WHERE A.val = ? + 1 CMU 15-721 (Spring 2017)

13 OPERATOR TEMPLATES Interpreted Plan for t in range (table.num_tuples): tuple = get_tuple(table, t) if eval(predicate, tuple, params): emit(tuple) CMU 15-721 (Spring 2017)

13 OPERATOR TEMPLATES Interpreted Plan for t in range (table.num_tuples): tuple = get_tuple(table, t) if eval(predicate, tuple, params): emit(tuple) 1. Get schema in catalog for table. 2. Calculate offset based on tuple size. 3. Return pointer to tuple. CMU 15-721 (Spring 2017)

13 OPERATOR TEMPLATES Interpreted Plan for t in range (table.num_tuples): tuple = get_tuple(table, t) if eval(predicate, tuple, params): emit(tuple) 1. Get schema in catalog for table. 2. Calculate offset based on tuple size. 3. Return pointer to tuple. 1. Traverse predicate tree and pull values up. 2. If tuple value, calculate the offset of the target attribute. 3. Perform casting as needed for comparison operators. 4. Return true / false. CMU 15-721 (Spring 2017)

13 OPERATOR TEMPLATES Interpreted Plan Templated Plan for t in range (table.num_tuples): tuple_size = ### tuple = get_tuple(table, t) predicate_offset = ### if eval(predicate, tuple, params): parameter_value = ### emit(tuple) for t in range (table.num_tuples): tuple = table.data + t ∗ tuple_size 1. Get schema in catalog for table. 2. Calculate offset based on tuple size. val = (tuple+predicate_offset) + 1 3. Return pointer to tuple. if (val == parameter_value): emit(tuple) 1. Traverse predicate tree and pull values up. 2. If tuple value, calculate the offset of the target attribute. 3. Perform casting as needed for comparison operators. 4. Return true / false. CMU 15-721 (Spring 2017)

15-721 ADVANCED DATABASE SYSTEMS Lecture #20 Query Compilation - PowerPoint PPT Presentation

15-721 ADVANCED DATABASE SYSTEMS Lecture #20 Query Compilation Andy Pavlo / / Carnegie Mellon University / / Spring 2016 @Andy_Pavlo // Carnegie Mellon University // Spring 2017 2 TODAYS AGENDA Background Code Generation /

ADVANCED DATABASE SYSTEMS Server-side Logic Execution @ Andy_Pavlo // 15- 721 // Spring 2019

Homework Assignment: 5 11-721: Grammars and Lexicons 11-721: Grammars and Lexicons Fall 2007

ADVANCED DATABASE SYSTEMS Networking @ Andy_Pavlo // 15- 721 // Spring 2019 CMU 15-721

ADVANCED DATABASE SYSTEMS History of Databases @ Andy_Pavlo // 15- 721 // Spring 2020 2

ADVANCED DATABASE SYSTEMS Vectorized Execution @ Andy_Pavlo // 15- 721 // Spring 2019 CMU

ADVANCED DATABASE SYSTEMS OLTP Indexes (Trie Data Structures) @ Andy_Pavlo // 15- 721 //

ADVANCED DATABASE SYSTEMS Self-Driving Database Management Systems @ Andy_Pavlo // 15- 721 //

ADVANCED DATABASE SYSTEMS Database Compression @ Andy_Pavlo // 15- 721 // Spring 2019 CMU

15-721 DATABASE SYSTEMS [Source] Lecture #08 Indexing (OLAP) Andy Pavlo / / Carnegie

ADVANCED DATABASE SYSTEMS Vectorization vs. Compilation @ Andy_Pavlo // 15- 721 // Spring

ADVANCED DATABASE SYSTEMS Storage Models & Data Layout @ Andy_Pavlo // 15- 721 // Spring

15-721 DATABASE SYSTEMS Lecture #09 Storage Models & Data Layout Andy Pavlo / /

15-721 DATABASE SYSTEMS [Source] Lecture #03 Concurrency Control Part I Andy Pavlo / /

Lect ure # 03 ADVANCED DATABASE SYSTEMS Query Compilation @ Andy_Pavlo // 15- 721 // Spring

15-721 DATABASE SYSTEMS [Source] Lecture #04 Concurrency Control Part II Andy Pavlo / /

ADVANCED DATABASE SYSTEMS Recovery Protocols @ Andy_Pavlo // 15- 721 // Spring 2019 CMU

Lecture 2: Overview 1. Discussion of Joes Garage 2. Lean Terminology in Joes Garage 3.

Welcome! Recording, poll results, notes, and Q&A debrief will be sent to participants

9/1/20 Using Authentic Assessment Amidst Covid-19 September 2nd, 2020 HIGHSCOPE CONFIDENTIAL -

COVID-19 and SNFs in OC Orange County Health Care Agency (OCHCA) June 4, 2020 Presenters

Query Optimization Lecture # 13 Database Systems Andy Pavlo AP AP Computer Science

Query Processing in a Self-Organized Storage System Hannes Mhleisen, supervised by Robert

How we run SQL queries in-memory when available memory is constrained with Kognitio analytical

From relation algebra to semi-join algebra: an approach for graph query optimization Jelle

15-721 ADVANCED DATABASE SYSTEMS Lecture #20 Query Compilation - PowerPoint PPT Presentation

15-721 ADVANCED DATABASE SYSTEMS Lecture #20 Query Compilation Andy Pavlo / / Carnegie Mellon University / / Spring 2016 @Andy_Pavlo // Carnegie Mellon University // Spring 2017 2 TODAYS AGENDA Background Code Generation /

ADVANCED DATABASE SYSTEMS Server-side Logic Execution @ Andy_Pavlo // 15- 721 // Spring 2019

Homework Assignment: 5 11-721: Grammars and Lexicons 11-721: Grammars and Lexicons Fall 2007

ADVANCED DATABASE SYSTEMS Networking @ Andy_Pavlo // 15- 721 // Spring 2019 CMU 15-721

ADVANCED DATABASE SYSTEMS History of Databases @ Andy_Pavlo // 15- 721 // Spring 2020 2

ADVANCED DATABASE SYSTEMS Vectorized Execution @ Andy_Pavlo // 15- 721 // Spring 2019 CMU

ADVANCED DATABASE SYSTEMS OLTP Indexes (Trie Data Structures) @ Andy_Pavlo // 15- 721 //

ADVANCED DATABASE SYSTEMS Self-Driving Database Management Systems @ Andy_Pavlo // 15- 721 //

ADVANCED DATABASE SYSTEMS Database Compression @ Andy_Pavlo // 15- 721 // Spring 2019 CMU

15-721 DATABASE SYSTEMS [Source] Lecture #08 Indexing (OLAP) Andy Pavlo / / Carnegie

ADVANCED DATABASE SYSTEMS Vectorization vs. Compilation @ Andy_Pavlo // 15- 721 // Spring

ADVANCED DATABASE SYSTEMS Storage Models &amp; Data Layout @ Andy_Pavlo // 15- 721 // Spring

15-721 DATABASE SYSTEMS Lecture #09 Storage Models &amp; Data Layout Andy Pavlo / /

15-721 DATABASE SYSTEMS [Source] Lecture #03 Concurrency Control Part I Andy Pavlo / /

Lect ure # 03 ADVANCED DATABASE SYSTEMS Query Compilation @ Andy_Pavlo // 15- 721 // Spring

15-721 DATABASE SYSTEMS [Source] Lecture #04 Concurrency Control Part II Andy Pavlo / /

ADVANCED DATABASE SYSTEMS Recovery Protocols @ Andy_Pavlo // 15- 721 // Spring 2019 CMU

Lecture 2: Overview 1. Discussion of Joes Garage 2. Lean Terminology in Joes Garage 3.

Welcome! Recording, poll results, notes, and Q&amp;A debrief will be sent to participants

9/1/20 Using Authentic Assessment Amidst Covid-19 September 2nd, 2020 HIGHSCOPE CONFIDENTIAL -

COVID-19 and SNFs in OC Orange County Health Care Agency (OCHCA) June 4, 2020 Presenters

Query Optimization Lecture # 13 Database Systems Andy Pavlo AP AP Computer Science

Query Processing in a Self-Organized Storage System Hannes Mhleisen, supervised by Robert

How we run SQL queries in-memory when available memory is constrained with Kognitio analytical

From relation algebra to semi-join algebra: an approach for graph query optimization Jelle

ADVANCED DATABASE SYSTEMS Storage Models & Data Layout @ Andy_Pavlo // 15- 721 // Spring

15-721 DATABASE SYSTEMS Lecture #09 Storage Models & Data Layout Andy Pavlo / /

Welcome! Recording, poll results, notes, and Q&A debrief will be sent to participants