Changelog
Changes made in this version not seen in first lecture:
15 November: vector add picture: make order of result consistent with order of inputs
15 November: correct square to matmul on several vector slides
15 November: correct mixups of A and B, B and C on several matmul vector slides
15 November: correct some si128 instances to si256 in vectorization slides
15 November: addressing transformation: correct more A/B/C mixups
Vector Insts / Profilers / Exceptions intro
last time
loop unrolling / cache blocking
instruction queues and out-of-order execution:
    list of available instructions
    multiple execution units (ALUs + other things that can run instructions)
    each cycle: send ready instructions from the queue to execution units
reassociation:
    reorder operations to reduce data dependencies, expose more parallelism
    multiple accumulators: reassociation for loops (sketch below)
shifting bottlenecks:
    need to optimize whatever is slowest, i.e. whatever determines the longest latency
    e.g. loop unrolling helps until the parallelism limit is reached…, but after improving parallelism, more loop unrolling helps again
    e.g. cache optimizations won't matter until loop overhead is lowered, or vice-versa
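As a reminder of the multiple-accumulators idea, a minimal sketch (my addition, not from the slides; the function name and the array-sum example are assumptions):

    /* sum an array with two accumulators: the two adds per iteration are
       independent, so they can execute in parallel instead of forming one
       long dependency chain */
    float sum2(const float *a, long n) {
        float acc0 = 0.0f, acc1 = 0.0f;
        for (long i = 0; i + 1 < n; i += 2) {
            acc0 += a[i];       /* chain 0 */
            acc1 += a[i + 1];   /* chain 1: does not depend on acc0 */
        }
        if (n % 2)              /* leftover element if n is odd */
            acc0 += a[n - 1];
        return acc0 + acc1;     /* combine the chains at the end */
    }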
aliasing problems with cache blocking

    for (int k = 0; k < N; k++) {
        for (int i = 0; i < N; i += 2) {
            for (int j = 0; j < N; j += 2) {
                C[(i+0)*N + j+0] += A[i*N+k] * B[k*N+j];
                C[(i+1)*N + j+0] += A[(i+1)*N+k] * B[k*N+j];
                C[(i+0)*N + j+1] += A[i*N+k] * B[k*N+j+1];
                C[(i+1)*N + j+1] += A[(i+1)*N+k] * B[k*N+j+1];
            }
        }
    }

can the compiler keep A[i*N+k] in a register?
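A hedged aside (my addition, not on the slide): if the pointers were declared with restrict, the compiler could assume the stores through C never modify A or B, and could then keep A[i*N+k] in a register on its own. A hypothetical signature:

    /* hypothetical signature: restrict promises the caller that A, B, and C do
       not overlap, so the C[...] stores cannot change A[i*N+k] or B[k*N+j] */
    void blocked_matmul(float * restrict A, float * restrict B, float * restrict C);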
“register blocking”

    for (int k = 0; k < N; ++k) {
        for (int i = 0; i < N; i += 2) {
            float Ai0k = A[(i+0)*N + k];
            float Ai1k = A[(i+1)*N + k];
            for (int j = 0; j < N; j += 2) {
                float Bkj0 = B[k*N + j+0];
                float Bkj1 = B[k*N + j+1];
                C[(i+0)*N + j+0] += Ai0k * Bkj0;
                C[(i+1)*N + j+0] += Ai1k * Bkj0;
                C[(i+0)*N + j+1] += Ai0k * Bkj1;
                C[(i+1)*N + j+1] += Ai1k * Bkj1;
            }
        }
    }
vector instructions
modern processors have registers that hold a “vector” of values
example: current x86-64 processors have 256-bit registers
    8 ints or 8 floats or 4 doubles or …
    256-bit registers named %ymm0 through %ymm15
    (also 128-bit versions named %xmm0 through %xmm15)
vector instructions or SIMD (single instruction, multiple data) instructions: instructions that act on all values in a register
extra copies of ALUs only accessed by vector instructions
example vector instruction
vpaddd %ymm0, %ymm1, %ymm2 (packed add dword (32-bit))
Suppose the registers contain (interpreted as 8 ints):
    %ymm0: [1, 2, 3, 4, 5, 6, 7, 8]
    %ymm1: [9, 10, 11, 12, 13, 14, 15, 16]
Result will be:
    %ymm2: [10, 12, 14, 16, 18, 20, 22, 24]
vector instructions

    void add(int * restrict a, int * restrict b) {
        for (int i = 0; i < 512; ++i)
            a[i] += b[i];
    }

    add:
        xorl %eax, %eax
    the_loop:
        vmovdqu (%rdi,%rax), %ymm0      /* load A into ymm0 */
        vmovdqu (%rsi,%rax), %ymm1      /* load B into ymm1 */
        vpaddd %ymm1, %ymm0, %ymm0      /* ymm1 + ymm0 -> ymm0 */
        vmovdqu %ymm0, (%rdi,%rax)      /* store ymm0 into A */
        addq $32, %rax                  /* increment index by 32 bytes */
        cmpq $2048, %rax
        jne the_loop
        vzeroupper                      /* <-- for calling convention reasons */
        ret
vector add picture
[figure: vmovdqu loads A[8]…A[15] into %ymm0 and B[8]…B[15] into %ymm1 from the A and B arrays in memory; vpaddd then leaves A[8]+B[8] through A[15]+B[15] in %ymm0]
one view of vector functional units
[figure: a pipelined vector ALU accepts one vector of input values per cycle and produces one vector of output values per cycle; internally it is four per-lane ALUs (lanes 1 to 4), each with three pipeline stages]
why vector instructions?
lots of logic not dedicated to computation: instruction queue, reorder buffer, instruction fetch, branch prediction, …
adding vector instructions: little extra control logic
…but a lot more computational capacity
vector instructions and compilers
compilers can sometimes figure out how to use vector instructions (and have gotten much, much better at it over the past decade)
but easily messed up:
    by aliasing (see the sketch below)
    by conditionals
    by some operation with no vector instruction
    …
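A small illustration of the aliasing point (my addition, not from the slides; function names are made up): in the first function the compiler must assume a and b might overlap, so it has to prove otherwise, insert a runtime overlap check, or give up; the restrict version removes that obstacle.

    /* compiler must be conservative: a and b could alias */
    void scale_add(int *a, const int *b, int n) {
        for (int i = 0; i < n; ++i)
            a[i] += 2 * b[i];
    }

    /* restrict promises the arrays do not overlap, so the loop is
       straightforward to vectorize */
    void scale_add_restrict(int * restrict a, const int * restrict b, int n) {
        for (int i = 0; i < n; ++i)
            a[i] += 2 * b[i];
    }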
fickle compiler vectorization (1)
GCC 8.2 and Clang 7.0 generate vector instructions for this:

    #define N 1024
    void foo(unsigned int *A, unsigned int *B) {
        for (int k = 0; k < N; ++k)
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    B[i * N + j] += A[i * N + k] * A[k * N + j];
    }

but not:

    #define N 1024
    void foo(unsigned int *A, unsigned int *B) {
        for (int k = 0; k < N; ++k)
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    B[i * N + j] += A[i * N + k] * A[j * N + k];
    }
fickle compiler vectorization (2)
Clang 5.0.0 generates vector instructions for this:

    void foo(int N, unsigned int *A, unsigned int *B) {
        for (int k = 0; k < N; ++k)
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    B[i * N + j] += A[i * N + k] * A[k * N + j];
    }

but not this (fixed in later versions):

    void foo(long N, unsigned int *A, unsigned int *B) {
        for (long k = 0; k < N; ++k)
            for (long i = 0; i < N; ++i)
                for (long j = 0; j < N; ++j)
                    B[i * N + j] += A[i * N + k] * A[k * N + j];
    }
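A practical note (my addition, not from the slides): both major compilers can report their vectorization decisions, e.g. GCC with -fopt-info-vec / -fopt-info-vec-missed and Clang with -Rpass=loop-vectorize / -Rpass-missed=loop-vectorize; inspecting the generated assembly for %ymm registers also works.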
vector intrinsics
if the compiler doesn't work…
    could write vector instruction assembly by hand
    second option: “intrinsic functions”
        C functions that compile to particular instructions
vector intrinsics: add example

    #include <immintrin.h>   /* declares __m256i and the _mm256_* intrinsics */

    void vectorized_add(int *a, int *b) {
        for (int i = 0; i < 128; i += 8) {
            // "si256" --> 256-bit integer
            __m256i a_values = _mm256_loadu_si256((__m256i *) &a[i]);
            // a_values = {a[i], a[i+1], ..., a[i+7]}
            __m256i b_values = _mm256_loadu_si256((__m256i *) &b[i]);
            // b_values = {b[i], b[i+1], ..., b[i+7]}
            // add eight 32-bit integers
            __m256i sums = _mm256_add_epi32(a_values, b_values);
            // sums = {a[i] + b[i], a[i+1] + b[i+1], ...}
            _mm256_storeu_si256((__m256i *) &a[i], sums);
            // {a[i], a[i+1], ..., a[i+7]} = sums
        }
    }

special type __m256i: “256 bits of integers”
_mm256_loadu_si256 / _mm256_storeu_si256: functions to load/store
    si256 means “256-bit integer value”
    u for “unaligned” (otherwise, the pointer address must be a multiple of 32)
    functions for other types: __m256 (floats), __m128d (doubles)
_mm256_add_epi32: function to add
    epi32 means “8 32-bit integers”
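A hedged usage sketch (my addition, not on the slide) showing how the function above might be exercised; it assumes vectorized_add is in the same file, that the arrays hold exactly 128 ints to match the loop bound, and that the file is compiled with AVX2 enabled (e.g. -mavx2 on GCC/Clang):

    #include <stdio.h>

    int main(void) {
        int a[128], b[128];
        for (int i = 0; i < 128; ++i) { a[i] = i; b[i] = 2 * i; }
        vectorized_add(a, b);          /* a[i] becomes i + 2*i = 3*i */
        printf("%d\n", a[10]);         /* prints 30 */
        return 0;
    }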