FUSED TABLE SCANS: COMBINING AVX-512 AND JIT
Markus Dreseler, Jan Kossmann, Johannes Frohnhofen, Matthias Uflacker, Hasso Plattner
Joint Workshop of HardBD & Active @ ICDE
Paris, 16th of April 2018

FUSED TABLE SCANS - COMBINING AVX-512 AND JIT
▸ AVX-512: Intel’s newest instruction set for SIMD operations
▸ Just-In-Time compilation: Creating binary code at program runtime
▸ Efficient (multi-predicate) sequential scans are a necessity for relational database systems
▸ Secondary indexes can speed up such operations
  ▸ Drawbacks: memory consumption and maintenance cost
▸ Contribution: Combine the above techniques to accelerate table scans

▸ Optimizations of sequential scans can be grouped into two categories
  ▸ Block-at-a-time: Evaluate multiple values (SIMD) of a column at a time
    ▸ Store results in position list
    ▸ Materialization between operators
  ▸ Data-centric compilation: Generate (JIT) a tight, optimized loop to process one tuple at a time
    ▸ No utilization of SIMD until now
    ▸ Suboptimal interplay with some hardware optimizations

WHY SHOULD WE COMBINE DATA-CENTRIC OPERATION & SIMD?
▸ Assumptions: Data resides in-memory in column-major format with fixed-size values
▸ Query: SELECT COUNT(*) FROM tbl WHERE a = 5 AND b = 2
▸ The data-centric code for this query could look similar to:

    int total_results = 0;
    // One tuple at a time: both predicates are checked in a single loop
    for (pos_t i = 0; i < col_a.size(); ++i) {
      if (col_a[i] == 5 && col_b[i] == 2) {
        ++total_results;
      }
    }

▸ Experiment: Does evaluating a single value at a time fully utilize the available memory bandwidth?
▸ Reduce the number of CPU operations, but still load all data from memory
[Figure: x-axis: 4-byte values skipped per scanned item]
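The experiment can be approximated with a strided scalar scan. The following is a minimal sketch, not the authors' benchmark: the function name `count_matches_strided` and the `skip` parameter are illustrative, and the comment about cache lines assumes 4-byte values and 64-byte cache lines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: evaluate only every (skip + 1)-th value of the column.
// Assuming 4-byte values and 64-byte cache lines, any skip below 16 still
// touches every cache line, so the amount of data loaded from memory stays
// roughly the same while the number of comparisons drops.
std::size_t count_matches_strided(const std::vector<std::uint32_t>& column,
                                  std::uint32_t search_value,
                                  std::size_t skip) {
  std::size_t matches = 0;
  for (std::size_t i = 0; i < column.size(); i += skip + 1) {
    if (column[i] == search_value) {
      ++matches;
    }
  }
  return matches;
}
```

Timing this function on a large column for skip values from 0 to 15 gives a rough indication: if the run time barely drops as skip grows, the scan is limited by memory bandwidth; if it drops noticeably, the scalar loop is limited by the CPU work per value.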

IMPLEMENTATION
▸ Utilizing the new instruction set AVX-512
  ▸ Wider (doubled) register sizes
  ▸ New instructions offer efficiency advantages
  ▸ We built equivalent functions using AVX2 (up to 32 lines)
▸ Basic idea
  ▸ Keep data in the AVX registers during the whole scan

[Figure: worked example with 128-bit registers (__m128i), four uint32_t values per block]

Column A: 2 5 4 5 6 1 5 7 6 8 5 3 5 9 9 5
Column B: 5 2 3 1 8 7 3 3 4 5 6 7 2 9 3 2

First iteration
▸ _mm_loadu_si128 loads the block 2 5 4 5 into an __m128i
▸ _mm_cmpeq_epi32_mask compares it with the first search value (5 5 5 5); resulting __mmask8: 0 1 0 1
▸ _mm_mask_compress_epi32 applies the mask to the indexes of the current block (0 1 2 3); position list: 0 0 1 3

Second iteration
▸ Block 6 1 5 7 compared with 5 5 5 5; mask: 0 0 1 0; indexes of current block: 4 5 6 7
▸ _mm_permutex2var_epi32 rearranges the position list to 0 1 3 0
▸ _mm_mask_compress_epi32 adds the new match; position list: 0 1 3 6

Third iteration
▸ Block 6 8 5 3 compared with 5 5 5 5; mask: 0 0 1 0; indexes of current block: 8 9 10 11
▸ _mm_permutex2var_epi32 rearranges the position list to 1 3 6 0
▸ _mm_mask_compress_epi32 adds the new match; matching positions in column a: 1 3 6 10

Second predicate
▸ _mm_i32gather_epi32 loads the values of Column B at the positions 1 3 6 10; gathered values: 2 1 3 6
▸ _mm_mask_cmpeq_epi32_mask compares them with the search value for column b (2 2 2 2); mask: 1 0 0 0
▸ _mm_mask_compress_epi32 keeps the remaining position: 0 0 0 1
▸ Final result: row 1 (the second entry) matches both conditions
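The data flow of the example can be followed without AVX-512 hardware. The sketch below is a scalar emulation, not the authors' intrinsic code: the function name `fused_scan_emulated` and the constant `kLanes` are illustrative, the small array stands in for the position-list register, and the second predicate is evaluated as soon as the list is full, which simplifies how a real implementation would handle overflow.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 4;  // four 32-bit lanes, as in the 128-bit example

// Scalar stand-in for the fused two-predicate scan (equality on both columns).
// Returns the rows where col_a == value_a and col_b == value_b.
std::vector<std::uint32_t> fused_scan_emulated(
    const std::vector<std::uint32_t>& col_a, std::uint32_t value_a,
    const std::vector<std::uint32_t>& col_b, std::uint32_t value_b) {
  assert(col_a.size() == col_b.size());
  std::vector<std::uint32_t> result;
  std::array<std::uint32_t, kLanes> positions{};  // emulated position-list register
  std::size_t filled = 0;

  // Second predicate: "gather" col_b at the buffered positions, compare,
  // and keep only the positions that still match.
  auto flush = [&]() {
    for (std::size_t lane = 0; lane < filled; ++lane) {
      if (col_b[positions[lane]] == value_b) {
        result.push_back(positions[lane]);
      }
    }
    filled = 0;
  };

  for (std::size_t block = 0; block < col_a.size(); block += kLanes) {
    for (std::size_t lane = 0; lane < kLanes && block + lane < col_a.size(); ++lane) {
      // First predicate: compare, then "compress" the matching index
      // into the position list.
      if (col_a[block + lane] == value_a) {
        positions[filled++] = static_cast<std::uint32_t>(block + lane);
        if (filled == kLanes) {
          flush();  // position list is full: evaluate the second predicate
        }
      }
    }
  }
  flush();  // evaluate the positions left over after the last block
  return result;
}
```

On the two example columns with search values 5 and 2, the first full position list is 1, 3, 6, 10 and yields row 1, as in the figure; scanning the last block as well adds rows 12 and 15.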

IMPLEMENTATION - ACHIEVEMENTS
▸ Fully utilize the CPU’s computation power
  ▸ More comparisons/cycle by using SIMD instructions
▸ Avoid useless prefetches, only load necessary data of the second column
  ▸ Increased memory bus efficiency
▸ Fewer and cheaper branch mispredictions
  ▸ Reduced number of conditions in code
▸ Reduced memory transfers
  ▸ Intermediary results are kept in AVX registers and are not materialized

JIT - RUNTIME CODE GENERATION
[Figure: a regular query plan with selections (σ) on Table A and Table B below a join (⨝), next to the same plan with a Fused Table Scan]

▸ Problem: Some parameters are only known at runtime
  ▸ Size of scanned values
  ▸ Exact data types
    ▸ Signed & unsigned 1, 2, 4, or 8 byte int plus float & double
  ▸ Type of comparison operators: !=, ==, <, >, <=, >=
▸ Larger number of possible code paths
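The growth in code paths can be made concrete: eight integer types plus float and double give 10 data types, and with 6 comparison operators that is 60 variants for a single predicate. If each predicate of a fused two-predicate scan is specialized independently, this becomes 60 × 60 = 3600 variants. The sketch below shows the ahead-of-time alternative to JIT compilation, one template instantiation per variant plus a runtime dispatch; all names (`CompareOp`, `count_matches_static`, `count_matches_dynamic`) are illustrative and not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class CompareOp { Equal, NotEqual, Less, Greater, LessEqual, GreaterEqual };

// Comparison selected at compile time; the switch on the template parameter
// is constant for each instantiation.
template <typename T, CompareOp Op>
bool compare(T lhs, T rhs) {
  switch (Op) {
    case CompareOp::Equal:        return lhs == rhs;
    case CompareOp::NotEqual:     return lhs != rhs;
    case CompareOp::Less:         return lhs < rhs;
    case CompareOp::Greater:      return lhs > rhs;
    case CompareOp::LessEqual:    return lhs <= rhs;
    case CompareOp::GreaterEqual: return lhs >= rhs;
  }
  return false;
}

// One specialized scan loop per (type, operator) pair.
template <typename T, CompareOp Op>
std::size_t count_matches_static(const std::vector<T>& column, T search_value) {
  std::size_t matches = 0;
  for (const T& value : column) {
    matches += compare<T, Op>(value, search_value) ? 1 : 0;
  }
  return matches;
}

// Runtime dispatch: every operator needs its own pre-compiled instantiation,
// and the same applies again to every supported data type.
template <typename T>
std::size_t count_matches_dynamic(const std::vector<T>& column, T search_value,
                                  CompareOp op) {
  switch (op) {
    case CompareOp::Equal:        return count_matches_static<T, CompareOp::Equal>(column, search_value);
    case CompareOp::NotEqual:     return count_matches_static<T, CompareOp::NotEqual>(column, search_value);
    case CompareOp::Less:         return count_matches_static<T, CompareOp::Less>(column, search_value);
    case CompareOp::Greater:      return count_matches_static<T, CompareOp::Greater>(column, search_value);
    case CompareOp::LessEqual:    return count_matches_static<T, CompareOp::LessEqual>(column, search_value);
    case CompareOp::GreaterEqual: return count_matches_static<T, CompareOp::GreaterEqual>(column, search_value);
  }
  return 0;
}
```

Generating the function at runtime avoids shipping all of these instantiations: only the combination that a query actually uses has to be compiled.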

▸ The query optimizer identifies fusable operator chains
▸ Parameters are determined by the translator during runtime
▸ Result: Specialized, monolithic function
  ▸ Cached for efficiency
[Figure: SQL String → SQL Parser → Abstract Syntax Tree → SQL Translator → Logical Query Plan (LQP) → Optimizer → Logical Query Plan (LQP) → LQP Translator → Physical Query Plan (PQP) → Executor; Predicates → JIT Compiler → Binary]
https://github.com/hyrise/hyrise/
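Caching the specialized function could look roughly like the sketch below. It is purely illustrative and does not reflect the Hyrise code base: the class `FusedScanCache`, the signature string format, and the `compile` stub (which returns a hand-written closure instead of generating machine code) are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Column = std::vector<std::uint32_t>;
using FusedScan = std::function<std::size_t(const Column& col_a, std::uint32_t value_a,
                                            const Column& col_b, std::uint32_t value_b)>;

// Hypothetical cache: one specialized function per scan signature
// (data types and comparison operators), created on first use.
class FusedScanCache {
 public:
  const FusedScan& get_or_compile(const std::string& signature) {
    auto it = cache_.find(signature);
    if (it == cache_.end()) {
      ++compilations_;
      it = cache_.emplace(signature, compile(signature)).first;
    }
    return it->second;
  }

  std::size_t compilations() const { return compilations_; }

 private:
  // Placeholder for real code generation. Only one signature is supported
  // in this sketch: two equality predicates on uint32 columns.
  static FusedScan compile(const std::string& signature) {
    assert(signature == "uint32:==,uint32:==");
    (void)signature;  // silence the unused-parameter warning when NDEBUG is set
    return [](const Column& col_a, std::uint32_t value_a,
              const Column& col_b, std::uint32_t value_b) {
      std::size_t matches = 0;
      for (std::size_t i = 0; i < col_a.size() && i < col_b.size(); ++i) {
        if (col_a[i] == value_a && col_b[i] == value_b) {
          ++matches;
        }
      }
      return matches;
    };
  }

  std::unordered_map<std::string, FusedScan> cache_;
  std::size_t compilations_ = 0;
};
```

Keying the cache on the signature rather than on the search values means that queries differing only in their constants can reuse the same function; whether the actual system makes the same choice is not stated on the slide.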

EVALUATION
▸ On current Skylake system
  ▸ Intel Xeon Platinum @ 2.5 - 3.8 GHz with 2 TB of PC4-2666 main memory
▸ Evaluated dimensions during experiments
  ▸ Table size
  ▸ Selectivity
  ▸ Implementations / instruction sets
    ▸ SISD, AVX2, and AVX-512, automatic compiler vectorization
  ▸ AVX register width: 128, 256, and 512 bit
  ▸ Number of predicates

EVALUATION - PERFORMANCE RELATIVE TO SISD IMPLEMENTATION
[Figure: performance relative to the SISD implementation]

EVALUATION - INSTRUCTION SETS & REGISTER WIDTH
[Figure: table with 32M rows; x-axis: matching rows (%)]

EVALUATION - NUMBER OF PREDICATES
[Figure: table with 32M rows]

SUMMARY & CONCLUSION
▸ Branch mispredictions and useless prefetches are a huge cost factor in multi-predicate scans
▸ Doubling the register size does not (yet) double the performance
▸ Bringing together AVX-512 with Just-In-Time compilation
  ▸ Use new AVX-512 instructions to efficiently load and remove tuples from AVX registers without leaving SIMD mode
  ▸ Performance was at least doubled in 80% of test cases
▸ Future work: Impact of other encoding methods
