Lecture #20
ADVANCED DATABASE SYSTEMS
Vectorized Execution
@Andy_Pavlo // 15-721 // Spring 2019
OUTLINE
Background
Hardware
Vectorized Algorithms (Columbia)
VECTORIZATION
The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, into a vector implementation that performs one operation on multiple pairs of operands at once.
WHY THIS MATTERS
Say we can parallelize our algorithm over 32 cores, and each core has 4-wide SIMD registers.
Potential speed-up: 32x × 4x = 128x
MULTI-CORE CPUS
Use a small number of high-powered cores.
→ Intel Xeon Skylake / Kaby Lake
→ High power consumption and area per core.
Massively superscalar and aggressive out-of-order execution.
→ Instructions are issued from a sequential stream.
→ Check for dependencies between instructions.
→ Process multiple instructions per clock cycle.
MANY INTEGRATED CORES (MIC)
Use a larger number of low-powered cores.
→ Intel Xeon Phi
→ Low power consumption and area per core.
→ Expanded SIMD instructions with larger register sizes.
Knights Ferry (Columbia Paper)
→ Non-superscalar and in-order execution.
→ Cores = Intel P54C (aka Pentium from the 1990s).
Knights Landing (since 2016)
→ Superscalar and out-of-order execution.
→ Cores = Silvermont (aka Atom).
SINGLE INSTRUCTION, MULTIPLE DATA (SIMD)
A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.
All major ISAs have microarchitectural support for SIMD operations:
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX-512
→ PowerPC: AltiVec
→ ARM: NEON
SIMD EXAMPLE
X + Y = Z, computed elementwise: Z[i] = X[i] + Y[i] for i = 1..n.

SISD: a scalar loop issues one addition per iteration:

    for (i = 0; i < n; i++) {
      Z[i] = X[i] + Y[i];
    }

SIMD: the same loop instead loads four 32-bit elements of X and four of Y into two 128-bit SIMD registers, adds all four pairs with a single instruction, and writes four results to Z at once.

[Figure: animation of the elements of X and Y stepping through 128-bit SIMD registers, four lanes per step]
STREAMING SIMD EXTENSIONS (SSE)
SSE is a collection of SIMD instructions that target special 128-bit SIMD registers.
These registers can be packed with four 32-bit scalars, after which an operation can be performed on each of the four elements simultaneously.
First introduced by Intel in 1999.
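As a minimal sketch of what "packing four 32-bit scalars" looks like in practice (an illustration, not from the slides; uses the standard SSE intrinsics from <xmmintrin.h>, and the function and array names are made up):

    #include <xmmintrin.h>  // SSE intrinsics

    void add4(const float *x, const float *y, float *z) {
      __m128 vx = _mm_loadu_ps(x);     // pack x[0..3] into a 128-bit register
      __m128 vy = _mm_loadu_ps(y);     // pack y[0..3] into another register
      __m128 vz = _mm_add_ps(vx, vy);  // four additions in one instruction
      _mm_storeu_ps(z, vz);            // write all four results to z[0..3]
    }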
SIMD INSTRUCTIONS (1)
Data Movement
→ Moving data in and out of vector registers.
Arithmetic Operations
→ Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes).
→ Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN
Logical Instructions
→ Logical operations on multiple data items.
→ Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS
SIMD INSTRUCTIONS (2)
Comparison Instructions
→ Comparing multiple data items (==, <, <=, >, >=, !=).
Shuffle Instructions
→ Move data between SIMD registers.
Miscellaneous
→ Conversion: Transform data between x86 and SIMD registers.
→ Cache Control: Move data directly from SIMD registers to memory (bypassing the CPU cache).
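A sketch of how a SIMD comparison is typically used for predicate evaluation (an illustration using SSE2 intrinsics; the function name and arguments are made up): the comparison fills each matching 32-bit lane with all 1s, and a movemask instruction condenses those lanes into a scalar bitmask.

    #include <emmintrin.h>  // SSE2 intrinsics

    // Returns a 4-bit mask: bit i is set iff vals[i] == key.
    int match4(const int *vals, int key) {
      __m128i v  = _mm_loadu_si128((const __m128i *)vals);
      __m128i k  = _mm_set1_epi32(key);      // broadcast key to all 4 lanes
      __m128i eq = _mm_cmpeq_epi32(v, k);    // lane = 0xFFFFFFFF on match
      return _mm_movemask_ps(_mm_castsi128_ps(eq));  // condense to 4 bits
    }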
INTEL SIMD EXTENSIONS

Year  Extension  Width     Integers  Single-P  Double-P
1997  MMX        64 bits   ✔
1999  SSE        128 bits  ✔         ✔ (×4)
2001  SSE2       128 bits  ✔         ✔         ✔ (×2)
2004  SSE3       128 bits  ✔         ✔         ✔
2006  SSSE3      128 bits  ✔         ✔         ✔
2006  SSE4.1     128 bits  ✔         ✔         ✔
2008  SSE4.2     128 bits  ✔         ✔         ✔
2011  AVX        256 bits  ✔         ✔ (×8)    ✔ (×4)
2013  AVX2       256 bits  ✔         ✔         ✔
2017  AVX-512    512 bits  ✔         ✔ (×16)   ✔ (×8)

Source: James Reinders
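Which of these extensions a given CPU supports can be checked at run time. A minimal sketch (not from the slides), using the __builtin_cpu_supports builtin available in GCC and Clang on x86:

    #include <stdio.h>

    int main(void) {
      // __builtin_cpu_supports is backed by the CPUID instruction.
      if (__builtin_cpu_supports("avx2"))
        printf("AVX2 available\n");
      if (__builtin_cpu_supports("avx512f"))
        printf("AVX-512 Foundation available\n");
      return 0;
    }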
VECTORIZATION
Three approaches, ordered from greatest ease of use to greatest programmer control:
Choice #1: Automatic Vectorization
Choice #2: Compiler Hints
Choice #3: Explicit Vectorization
Source: James Reinders
AUTOMATIC VECTORIZATION
The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.
Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.
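As a quick way to see this in action (a sketch using GCC-specific flags; Clang and MSVC have their own equivalents), a trivially vectorizable loop can be compiled with vectorization reports enabled:

    // add.c -- compile with: gcc -O3 -march=native -fopt-info-vec-optimized -c add.c
    // GCC prints a note for each loop it vectorizes.
    #define MAX 1024
    float X[MAX], Y[MAX], Z[MAX];

    void add(void) {
      // Globals with known, distinct storage: no possible aliasing,
      // so the compiler is free to vectorize this loop on its own.
      for (int i = 0; i < MAX; i++)
        Z[i] = X[i] + Y[i];
    }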
AUTOMATIC VECTORIZATION

    void add(int *X,
             int *Y,
             int *Z) {
      for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }

This loop is not legal to automatically vectorize: the pointers might refer to overlapping memory (e.g., Z == X + 1), and the code is written such that the additions are described as being done sequentially.
COMPILER HINTS
Provide the compiler with additional information about the code to let it know that it is safe to vectorize.
Two approaches:
→ Give explicit information about memory locations.
→ Tell the compiler to ignore vector dependencies.
COMPILER HINTS
The restrict keyword (standard in C99; available as __restrict in most C++ compilers) tells the compiler that the arrays are distinct locations in memory:

    void add(int * restrict X,
             int * restrict Y,
             int * restrict Z) {
      for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }
COMPILER HINTS
This pragma tells the compiler to ignore loop dependencies for the vectors. It's up to you to make sure that this is correct:

    void add(int *X,
             int *Y,
             int *Z) {
      #pragma ivdep
      for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }
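A portability note (an addition, not from the slides): #pragma ivdep as written is the Intel compiler's spelling; the same hint exists under other names.

    // GCC's spelling of the same hint:
    #pragma GCC ivdep
    for (int i = 0; i < MAX; i++) { Z[i] = X[i] + Y[i]; }

    // Portable OpenMP 4.0+ form (compile with -fopenmp or -fopenmp-simd):
    #pragma omp simd
    for (int i = 0; i < MAX; i++) { Z[i] = X[i] + Y[i]; }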
EXPLICIT VECTORIZATION
Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions.
Potentially not portable.
EXPLICIT VECTORIZATION
Store the vectors in 128-bit SIMD registers. Then invoke the intrinsic to add together the vectors and write them to the output location:

    void add(int *X,
             int *Y,
             int *Z) {
      __m128i *vecX = (__m128i *)X;
      __m128i *vecY = (__m128i *)Y;
      __m128i *vecZ = (__m128i *)Z;
      for (int i = 0; i < MAX / 4; i++) {
        // Add four 32-bit lanes at a time and store the result.
        _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
      }
    }
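One caveat worth flagging (an addition, not on the original slide): _mm_store_si128 and the dereferencing loads above require the arrays to be 16-byte aligned. For arbitrary pointers, the unaligned variants are the safe choice:

    __m128i vx = _mm_loadu_si128((const __m128i *)&X[i]);       // unaligned load
    __m128i vy = _mm_loadu_si128((const __m128i *)&Y[i]);
    _mm_storeu_si128((__m128i *)&Z[i], _mm_add_epi32(vx, vy));  // unaligned store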
VECTORIZATION DIRECTION
Approach #1: Horizontal
→ Perform operation on all elements together within a single vector.
→ Example: a horizontal SIMD add reduces [0, 1, 2, 3] to 6.
Approach #2: Vertical
→ Perform operation in an elementwise manner on elements of each vector.
→ Example: a vertical SIMD add of [0, 1, 2, 3] and [1, 1, 1, 1] yields [1, 2, 3, 4].
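A sketch of the two directions with SSE/SSSE3 intrinsics (an illustration, not from the slides; the two-step hadd reduction is a common idiom):

    #include <tmmintrin.h>  // SSSE3 (for _mm_hadd_epi32); pulls in SSE2

    // Vertical: lane-wise add of two vectors -> {a0+b0, a1+b1, a2+b2, a3+b3}.
    __m128i vertical_add(__m128i a, __m128i b) {
      return _mm_add_epi32(a, b);
    }

    // Horizontal: sum all four lanes of one vector into a scalar.
    int horizontal_sum(__m128i a) {
      __m128i t = _mm_hadd_epi32(a, a);  // {a0+a1, a2+a3, a0+a1, a2+a3}
      t = _mm_hadd_epi32(t, t);          // every lane now holds the total
      return _mm_cvtsi128_si32(t);       // extract lane 0
    }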
EXPLICIT VECTORIZATION
Linear Access Operators
→ Predicate evaluation
→ Compression
Ad-hoc Vectorization
→ Sorting
→ Merging
Composable Operations
→ Multi-way trees
→ Bucketized hash tables
Source: Orestis Polychroniou
VECTORIZED DBMS ALGORITHMS
Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality:
→ Favor vertical vectorization by processing different input data per lane.
→ Maximize lane utilization by executing different things per lane subset.
RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)
FUNDAMENTAL OPERATIONS
Selective Load
Selective Store
Selective Gather
Selective Scatter
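The paper builds its operators out of these four primitives. As a hedged sketch of what each one maps to on modern hardware (my mapping to AVX-512 intrinsics, not from the slides; the names mem, vals, idx, and mask are illustrative; compile with -mavx512f):

    #include <immintrin.h>  // AVX-512 intrinsics

    void fundamental_ops(int *mem, __m512i vals, __m512i idx, __mmask16 mask) {
      // Selective load: read consecutive values from memory into only
      // the lanes enabled by mask; other lanes keep their old contents.
      __m512i ld = _mm512_mask_expandloadu_epi32(vals, mask, mem);

      // Selective store: write only the lanes enabled by mask,
      // packed contiguously into memory.
      _mm512_mask_compressstoreu_epi32(mem, mask, vals);

      // Gather: load mem[idx[i]] for every lane (scale = 4 bytes per int).
      __m512i g = _mm512_i32gather_epi32(idx, mem, 4);

      // Scatter: store vals[i] to mem[idx[i]] for every lane.
      _mm512_i32scatter_epi32(mem, idx, vals, 4);

      (void)ld; (void)g;  // silence unused-variable warnings in this sketch
    }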