37 COMPILER HINTS
The restrict keyword (standard in C99; most C++ compilers accept it as __restrict) tells the compiler that the arrays are distinct locations in memory, so loads and stores cannot alias.

void add(int *restrict X,
         int *restrict Y,
         int *restrict Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
38 COMPILER HINTS
This pragma tells the compiler to ignore loop dependencies for the vectors. It's up to you to make sure that this is correct.

void add(int *X, int *Y, int *Z) {
  #pragma ivdep
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
39 EXPLICIT VECTORIZATION Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions. Potentially not portable.
40 EXPLICIT VECTORIZATION
Store the vectors in 128-bit SIMD registers, then invoke the intrinsic to add together the vectors and write them to the output location.

void add(int *X, int *Y, int *Z) {
  __m128i *vecX = (__m128i*)X;
  __m128i *vecY = (__m128i*)Y;
  __m128i *vecZ = (__m128i*)Z;
  for (int i = 0; i < MAX/4; i++) {
    _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
  }
}
41 VECTORIZATION DIRECTION
Approach #1: Horizontal
→ Perform operation on all elements together within a single vector.
→ Example: a SIMD add over the single vector [0 1 2 3] produces the scalar 6.
Approach #2: Vertical
→ Perform operation in an elementwise manner on elements of each vector.
→ Example: a SIMD add of [0 1 2 3] and [1 1 1 1] produces [1 2 3 4].
Source: Przemysław Karpiński
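The two directions above can be sketched in scalar C. This is a minimal emulation assuming 4-lane integer vectors; the function names are illustrative, not real intrinsics (the comments name the kind of hardware instruction each loop stands in for):

```c
#define LANES 4

/* Horizontal: reduce all lanes of a single vector to one value
   (the kind of result horizontal-add instructions build up to). */
int horizontal_add(const int v[LANES]) {
    int sum = 0;
    for (int i = 0; i < LANES; i++)
        sum += v[i];
    return sum;
}

/* Vertical: elementwise add across two vectors, one result per lane
   (what an instruction like _mm_add_epi32 does in a single step). */
void vertical_add(const int a[LANES], const int b[LANES], int out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = a[i] + b[i];
}
```

With a = [0 1 2 3] and b = [1 1 1 1], horizontal_add(a) yields 6 and vertical_add(a, b, out) yields [1 2 3 4], matching the diagrams on the slide.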
44 EXPLICIT VECTORIZATION
Linear Access Operators
→ Predicate evaluation
→ Compression
Ad-hoc Vectorization
→ Sorting
→ Merging
Composable Operations
→ Multi-way trees
→ Bucketized hash tables
Source: Orestis Polychroniou
45 VECTORIZED DBMS ALGORITHMS
Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality.
→ Favor vertical vectorization by processing different input data per lane.
→ Maximize lane utilization by executing different things per lane subset.
RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)
46 FUNDAMENTAL OPERATIONS
→ Selective Load
→ Selective Store
→ Selective Gather
→ Selective Scatter
47 FUNDAMENTAL VECTOR OPERATIONS
Selective Load
→ Given Vector [A B C D], Mask [0 1 0 1], and Memory [U V W X Y Z …], consecutive values from memory are loaded into the lanes where the mask is set, yielding [A U C V].
55 FUNDAMENTAL VECTOR OPERATIONS
Selective Store
→ The mirror of a selective load: given Vector [A B C D] and Mask [0 1 0 1], the set lanes (B and D) are written to consecutive memory locations.
61 FUNDAMENTAL VECTOR OPERATIONS
Selective Gather
→ Given Index Vector [2 1 5 3] and Memory [U V W X Y Z] (offsets 0–5), each lane reads from memory at its own index, yielding Value Vector [W V Z X].
67 FUNDAMENTAL VECTOR OPERATIONS
Selective Scatter
→ The mirror of a gather: given Value Vector [A B C D] and Index Vector [2 1 5 3], each lane is written to memory at its own index (memory[2]=A, memory[1]=B, memory[5]=C, memory[3]=D).
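The semantics of the four operations can be sketched as a scalar emulation in C. This is a sketch assuming 4-lane vectors represented as plain int arrays, not the actual masked/compressed instructions a CPU provides:

```c
#define LANES 4

/* Selective load: fill the masked lanes of the vector with
   consecutive values read from memory. */
void selective_load(int vec[LANES], const int mask[LANES], const int *mem) {
    for (int i = 0; i < LANES; i++)
        if (mask[i]) vec[i] = *mem++;
}

/* Selective store: write the masked lanes of the vector to
   consecutive memory locations. */
void selective_store(int *mem, const int mask[LANES], const int vec[LANES]) {
    for (int i = 0; i < LANES; i++)
        if (mask[i]) *mem++ = vec[i];
}

/* Gather: each lane reads from memory at its own index. */
void gather(int vec[LANES], const int idx[LANES], const int *mem) {
    for (int i = 0; i < LANES; i++)
        vec[i] = mem[idx[i]];
}

/* Scatter: each lane writes to memory at its own index. */
void scatter(int *mem, const int idx[LANES], const int vec[LANES]) {
    for (int i = 0; i < LANES; i++)
        mem[idx[i]] = vec[i];
}
```

With the slide's data — vector [A B C D], mask [0 1 0 1], memory [U V W X Y Z] — selective_load yields [A U C V], and gather with indices [2 1 5 3] yields [W V Z X].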
72 ISSUES
→ Gathers and scatters are not really executed in parallel because the L1 cache only allows one or two distinct accesses per cycle.
→ Gathers are only supported in newer CPUs.
→ Selective loads and stores are also emulated in Xeon CPUs using vector permutations.
73 VECTORIZED OPERATORS
→ Selection Scans
→ Hash Tables
→ Partitioning
Paper provides additional info:
→ Joins, Sorting, Bloom filters.
RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)
74 SELECTION SCANS SELECT * FROM table WHERE key >= $(low) AND key <= $(high)
75 SELECTION SCANS
Scalar (Branching)

i = 0
for t in table:
  key = t.key
  if (key ≥ low) && (key ≤ high):
    copy(t, output[i])
    i = i + 1

Scalar (Branchless)

i = 0
for t in table:
  copy(t, output[i])
  key = t.key
  m = (key ≥ low ? 1 : 0) &&
      (key ≤ high ? 1 : 0)
  i = i + m

Source: Bogdan Raducanu
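The branchless variant can be sketched concretely in C. This is a minimal sketch over an int key column (names like select_scan are illustrative); every tuple is copied to the output cursor unconditionally, and only the cursor increment depends on the predicate, so there is no hard-to-predict branch in the loop body:

```c
/* Branchless selection scan: out[i] is written every iteration and
   mismatches are simply overwritten by the next tuple, so out must
   have capacity for n entries even though only the matches survive.
   Returns the number of matching keys. */
int select_scan(const int *keys, int n, int low, int high, int *out) {
    int i = 0;
    for (int t = 0; t < n; t++) {
        out[i] = keys[t];                             /* copy(t, output[i]) */
        int m = (keys[t] >= low) & (keys[t] <= high); /* predicate as 0/1 */
        i += m;                                       /* data dependency, not control */
    }
    return i;
}
```

Note the use of bitwise & rather than &&: both comparisons already yield 0 or 1, and & avoids reintroducing a short-circuit branch.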
80 SELECTION SCANS
Vectorized

i = 0
for vt in table:
  simdLoad(vt.key, vk)
  vm = (vk ≥ low ? 1 : 0) &&
       (vk ≤ high ? 1 : 0)
  simdStore(vt, vm, output[i])
  i = i + |vm ≠ false|

Example: SELECT * FROM table WHERE key >= "O" AND key <= "U"
Table keys (IDs 1–6): J, O, Y, S, U, X
→ Key Vector [J O Y S U X]
→ SIMD Compare produces Mask [0 1 0 1 1 0]
→ SIMD Store compresses All Offsets [0 1 2 3 4 5] through the mask into Matched Offsets [1 3 4]
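The vectorized loop above can be emulated lane-by-lane in scalar C. This is a sketch assuming 4-lane vectors and an input length that is a multiple of the lane count; a real implementation would replace the two inner loops with a SIMD compare and a selective store of offsets:

```c
#define LANES 4

/* Emulated vectorized selection scan: a full-width compare produces a
   per-lane mask, then the matching lane offsets are compressed (a
   selective store) into the output, and the cursor advances by the
   number of set mask bits. n must be a multiple of LANES. Returns the
   number of matched offsets written to out. */
int select_scan_simd(const int *keys, int n, int low, int high, int *out) {
    int i = 0;
    for (int base = 0; base < n; base += LANES) {
        int mask[LANES];
        for (int l = 0; l < LANES; l++)   /* SIMD compare */
            mask[l] = (keys[base + l] >= low) & (keys[base + l] <= high);
        for (int l = 0; l < LANES; l++)   /* selective store of offsets */
            if (mask[l])
                out[i++] = base + l;      /* i += |mask != false| overall */
    }
    return i;
}
```

On the slide's example — keys [J O Y S U X] with the range "O" to "U" — this produces the matched offsets [1 3 4].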
92 SELECTION SCANS
[Figure: Throughput (billion tuples/sec) vs. selectivity (0–100%) for Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), and Vectorized (Late Mat) on MIC (Xeon Phi 7120P – 61 Cores + 4×HT, up to ~48 billion tuples/sec) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT, up to ~6 billion tuples/sec). The vectorized variants run close to the memory-bandwidth limit on both platforms.]
98 HASH TABLES – PROBING
Scalar probing of a linear-probing hash table (each slot holds a KEY and a PAYLOAD):
→ Hash the input key (k1) to get a hash index (h1) into the table.
→ Compare the input key against the key stored at that slot (k1 = k9?); on a mismatch, advance to the next slot and repeat until a match or an empty slot.
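The scalar probe can be sketched in C. This is a minimal sketch with a toy modulo hash standing in for hash(key) on the slide, and a sentinel key marking empty slots (names like slot_t and probe are illustrative):

```c
#define TABLE_SIZE 8
#define EMPTY_KEY (-1)

typedef struct { int key; int payload; } slot_t;

/* Toy hash function; TABLE_SIZE is a power of two so & works as mod. */
static int hash_key(int key) { return key & (TABLE_SIZE - 1); }

/* Scalar linear probing: start at hash(key), compare keys at
   successive slots (wrapping around) until the key matches or an
   empty slot ends the search. Returns the payload, or -1 if absent.
   Assumes the table is never completely full. */
int probe(const slot_t table[TABLE_SIZE], int key) {
    int h = hash_key(key);
    while (table[h].key != EMPTY_KEY) {
        if (table[h].key == key)
            return table[h].payload;
        h = (h + 1) & (TABLE_SIZE - 1);  /* advance to next slot */
    }
    return -1;  /* hit an empty slot: key not present */
}
```

A collision plays out exactly as on the slide: if key 9 already occupies slot hash(1) = hash(9) = 1, probing for key 1 compares against 9 first, then finds 1 in the next slot.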