Optimization part 1

Changelog
Changes made in this version not seen in first lecture:
29 Feb 2018: loop unrolling performance: remove bogus instruction cache overhead remark
29 Feb 2018: spatial locality in A_{kj}: correct reference to B_{k+1,j} to A_{k+1,j}
last time
what things in C code map to same set?
    key idea: if bytes per way apart from each other
finding conflict misses in C
    how "overloaded" is each cache set
cache 'blocking' for matrix-like code
    maximize work per cache miss
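The "bytes per way apart" idea can be sketched in a few lines of C. The cache geometry below (64-byte blocks, 64 sets) and the helper names are assumptions made up for illustration, not a cache discussed in the lecture, and the element type `double` is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache geometry, chosen only for illustration. */
enum { BLOCK_BYTES = 64, NUM_SETS = 64, BYTES_PER_WAY = BLOCK_BYTES * NUM_SETS };

/* Set index of a byte address: drop the block offset, keep the index bits. */
static unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / BLOCK_BYTES) % NUM_SETS);
}

/* Set index of element `index` of an array of doubles starting at `base`. */
static unsigned set_of_element(uintptr_t base, size_t index) {
    assert(index <= SIZE_MAX / sizeof(double));
    return set_index(base + index * sizeof(double));
}
```

Under these assumptions, two addresses that differ by a multiple of `BYTES_PER_WAY` get the same set index, so array elements that far apart compete for the same set.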
some logistics
exam next week
    everything up to and including this lecture
yes, I know office hours were very slow…
like to think about how to help with
    'group' office hours?
    better tools?
    different priorities on queue?
view as an explicit cache
imagine we explicitly moved things into cache
original loop:
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i) {
            loadIntoCache(&A[i*N+k]);
            for (int j = 0; j < N; ++j) {
                loadIntoCache(&B[i*N+j]);
                loadIntoCache(&A[k*N+j]);
                B[i*N+j] += A[i*N+k] * A[k*N+j];
            }
        }
view as an explicit cache
imagine we explicitly moved things into cache
blocking in k:
    for (int kk = 0; kk < N; kk += 2)
        for (int i = 0; i < N; ++i) {
            loadIntoCache(&A[i*N+kk]);
            loadIntoCache(&A[i*N+kk+1]);
            for (int j = 0; j < N; ++j) {
                loadIntoCache(&B[i*N+j]);
                loadIntoCache(&A[kk*N+j]);
                loadIntoCache(&A[(kk+1)*N+j]);
                for (int k = kk; k < kk + 2; ++k)
                    B[i*N+j] += A[i*N+k] * A[k*N+j];
            }
        }
calculation counting with explicit cache
before: load ~2 values for one add+multiply
after: load ~3 values for two add+multiply
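The ~2 versus ~3-per-two figures can be checked by counting in the explicit-cache model, treating each `loadIntoCache` call as one load. The sketch below only counts; it does not touch any matrix. The struct and function names are hypothetical, and the blocked version assumes N is even.

```c
#include <assert.h>

/* Counts for the explicit-cache model: each loadIntoCache() call is one load. */
struct counts { long loads; long multiply_adds; };

/* Original loop order (k, i, j): one load of A[i*N+k] per (k, i),
   then loads of B[i*N+j] and A[k*N+j] for every j. */
static struct counts count_unblocked(int n) {
    struct counts c = {0, 0};
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i) {
            c.loads += 1;
            for (int j = 0; j < n; ++j) {
                c.loads += 2;
                c.multiply_adds += 1;
            }
        }
    return c;
}

/* Blocked in k by 2: two loads of A[i*N+k..] per (kk, i), then three loads
   (one B value, two A values) for every j, feeding two multiply-adds. */
static struct counts count_blocked_k2(int n) {
    assert(n % 2 == 0); /* assumption: N divisible by the block size */
    struct counts c = {0, 0};
    for (int kk = 0; kk < n; kk += 2)
        for (int i = 0; i < n; ++i) {
            c.loads += 2;
            for (int j = 0; j < n; ++j) {
                c.loads += 3;
                c.multiply_adds += 2;
            }
        }
    return c;
}
```

For N = 100 this gives 2.01 loads per multiply-add before blocking and 1.51 after, matching the "~2 for one" and "~3 for two" estimates.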
simple blocking: temporal locality in B_{ij}
    for (int k = 0; k < N; k += 2)
        for (int i = 0; i < N; i += 2)
            /* load a block around A_{ik} */
            for (int j = 0; j < N; ++j) {
                /* process a "block": */
                B_{i+0,j} += A_{i+0,k+0} * A_{k+0,j}
                B_{i+0,j} += A_{i+0,k+1} * A_{k+1,j}
                B_{i+1,j} += A_{i+1,k+0} * A_{k+0,j}
                B_{i+1,j} += A_{i+1,k+1} * A_{k+1,j}
            }
before: B_{ij}s accessed once, then not again for N^2 iters
after: B_{ij}s accessed twice, then not again for N^2 iters (next k)
simple blocking: temporal locality in A_{kj}
    for (int k = 0; k < N; k += 2)
        for (int i = 0; i < N; i += 2)
            /* load a block around A_{ik} */
            for (int j = 0; j < N; ++j) {
                /* process a "block": */
                B_{i+0,j} += A_{i+0,k+0} * A_{k+0,j}
                B_{i+0,j} += A_{i+0,k+1} * A_{k+1,j}
                B_{i+1,j} += A_{i+1,k+0} * A_{k+0,j}
                B_{i+1,j} += A_{i+1,k+1} * A_{k+1,j}
            }
before blocking: A_{kj}s accessed once, then not again for N iters
after blocking: A_{kj}s accessed twice, then not again for N iters (next i)
simple blocking: temporal locality in A_{ik}
    for (int k = 0; k < N; k += 2)
        for (int i = 0; i < N; i += 2)
            /* load a block around A_{ik} */
            for (int j = 0; j < N; ++j) {
                /* process a "block": */
                B_{i+0,j} += A_{i+0,k+0} * A_{k+0,j}
                B_{i+0,j} += A_{i+0,k+1} * A_{k+1,j}
                B_{i+1,j} += A_{i+1,k+0} * A_{k+0,j}
                B_{i+1,j} += A_{i+1,k+1} * A_{k+1,j}
            }
before: A_{ik}s accessed N times, then never again
after: A_{ik}s accessed N times, but other parts of A_{ik} accessed in between
    slightly less temporal locality
simple blocking: spatial locality in B_{ij}
    for (int k = 0; k < N; k += 2)
        for (int i = 0; i < N; i += 2)
            /* load a block around A_{ik} */
            for (int j = 0; j < N; ++j) {
                /* process a "block": */
                B_{i+0,j} += A_{i+0,k+0} * A_{k+0,j}
                B_{i+0,j} += A_{i+0,k+1} * A_{k+1,j}
                B_{i+1,j} += A_{i+1,k+0} * A_{k+0,j}
                B_{i+1,j} += A_{i+1,k+1} * A_{k+1,j}
            }
before blocking: perfect spatial locality (B_{i,j} and B_{i,j+1} adjacent)
after blocking: slightly less spatial locality
    B_{i,j} and B_{i+1,j} far apart (N elements)
    but still B_{i,j+1} accessed iteration after B_{i,j} (adjacent)
simple blocking: spatial locality in A_{kj}
    for (int k = 0; k < N; k += 2)
        for (int i = 0; i < N; i += 2)
            /* load a block around A_{ik} */
            for (int j = 0; j < N; ++j) {
                /* process a "block": */
                B_{i+0,j} += A_{i+0,k+0} * A_{k+0,j}
                B_{i+0,j} += A_{i+0,k+1} * A_{k+1,j}
                B_{i+1,j} += A_{i+1,k+0} * A_{k+0,j}
                B_{i+1,j} += A_{i+1,k+1} * A_{k+1,j}
            }
before: perfect spatial locality (A_{k,j} and A_{k,j+1} adjacent)
after: slightly less spatial locality
    A_{k,j} and A_{k+1,j} far apart (N elements)
    but still A_{k,j+1} accessed iteration after A_{k,j} (adjacent)
simple blocking: spatial locality in A_{ik}
    for (int k = 0; k < N; k += 2)
        for (int i = 0; i < N; i += 2)
            /* load a block around A_{ik} */
            for (int j = 0; j < N; ++j) {
                /* process a "block": */
                B_{i+0,j} += A_{i+0,k+0} * A_{k+0,j}
                B_{i+0,j} += A_{i+0,k+1} * A_{k+1,j}
                B_{i+1,j} += A_{i+1,k+0} * A_{k+0,j}
                B_{i+1,j} += A_{i+1,k+1} * A_{k+1,j}
            }
before: very poor spatial locality (A_{i,k} and A_{i+1,k} far apart)
after: some spatial locality
    A_{i,k} and A_{i+1,k} still far apart (N elements)
    but still A_{i,k} accessed together with A_{i,k+1}
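The 2-by-2 blocked loop from these slides could be written out in C roughly as follows. This is a sketch under stated assumptions: row-major N x N matrices of `double` (the slides do not give an element type), N even, A and B not overlapping, and hypothetical function names. Holding the block around A_{ik} in local variables is one way to express the "load a block" comment, not necessarily what the lecture's code did.

```c
#include <assert.h>

/* Reference version: B += A * A, loop order k, i, j. */
static void square_unblocked(int n, const double *A, double *B) {
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                B[i * n + j] += A[i * n + k] * A[k * n + j];
}

/* Blocked by 2 in both k and i, as in the "simple blocking" slides. */
static void square_blocked_2x2(int n, const double *A, double *B) {
    assert(n % 2 == 0); /* assumption: N divisible by the block size */
    for (int k = 0; k < n; k += 2)
        for (int i = 0; i < n; i += 2) {
            /* load the 2x2 block around A[i][k]; it is reused for every j */
            double a00 = A[(i + 0) * n + (k + 0)];
            double a01 = A[(i + 0) * n + (k + 1)];
            double a10 = A[(i + 1) * n + (k + 0)];
            double a11 = A[(i + 1) * n + (k + 1)];
            for (int j = 0; j < n; ++j) {
                /* process a "block": four multiply-adds per j */
                B[(i + 0) * n + j] += a00 * A[(k + 0) * n + j];
                B[(i + 0) * n + j] += a01 * A[(k + 1) * n + j];
                B[(i + 1) * n + j] += a10 * A[(k + 0) * n + j];
                B[(i + 1) * n + j] += a11 * A[(k + 1) * n + j];
            }
        }
}
```

Each B element still receives its products in increasing k order, so both functions should produce the same result; only the order in which memory is touched changes.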
generalizing cache blocking
    for (int kk = 0; kk < N; kk += K) {
        for (int ii = 0; ii < N; ii += I) {
            /* with I by K block of A hopefully cached: */
            for (int jj = 0; jj < N; jj += J) {
                /* with K by J block of A, I by J block of B cached: */
                for (int i = ii; i < ii + I; ++i)
                    for (int j = jj; j < jj + J; ++j)
                        for (int k = kk; k < kk + K; ++k)
                            B[i * N + j] += A[i * N + k] * A[k * N + j];
            }
        }
    }
B_{ij} used K times for one miss: N^2/K misses
A_{ik} used J times for one miss: N^2/J misses
A_{kj} used I times for one miss: N^2/I misses
catch: IK + KJ + IJ elements must fit in cache
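A runnable version of the generalized loop might look like the sketch below. The block sizes are passed as parameters (`bi`, `bj`, `bk` standing in for I, J, K), and each block is clamped to N so that N need not be a multiple of the block size. The clamping, the `double` element type, and the function names are assumptions added here, not part of the slide.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* B += A * A for row-major n x n matrices, blocked bi x bj x bk. */
static void square_blocked(int n, int bi, int bj, int bk,
                           const double *A, double *B) {
    assert(bi > 0 && bj > 0 && bk > 0);
    for (int kk = 0; kk < n; kk += bk) {
        int k_end = min_int(kk + bk, n);
        for (int ii = 0; ii < n; ii += bi) {
            int i_end = min_int(ii + bi, n);
            /* the bi-by-bk block of A is reused across the whole jj loop */
            for (int jj = 0; jj < n; jj += bj) {
                int j_end = min_int(jj + bj, n);
                /* bk-by-bj block of A and bi-by-bj block of B are live here */
                for (int i = ii; i < i_end; ++i)
                    for (int j = jj; j < j_end; ++j)
                        for (int k = kk; k < k_end; ++k)
                            B[i * n + j] += A[i * n + k] * A[k * n + j];
            }
        }
    }
}
```

A call such as `square_blocked(N, 32, 32, 32, A, B)` would use square blocks. Which sizes work well depends on the cache, since the three blocks together have to fit, so 32 is only a placeholder.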
cache blocking overview
reorder calculations
    typically work in square-ish chunks of input
goal: maximum calculations per load into cache
    typically: use every value several times after loading it
versus naive loop code:
    some values loaded, then used once
    some values loaded, then used all possible times
cache blocking and miss rate
[Figure: read misses per multiply or add (0.00 to 0.09) versus N (0 to 500), blocked and unblocked]
what about performance?
[Figure, less optimized loop: cycles per multiply/add (0.0 to 2.0) versus N (0 to 500), unblocked and blocked]
[Figure, optimized loop: cycles per multiply/add (0.0 to 0.5) versus N (0 to 1000), unblocked and blocked]
performance for big sizes
[Figure: cycles per multiply or add (0.0 to 1.0) versus N (0 to 10000), unblocked and blocked; annotation: "matrix in L3 cache"]
optimized loop???
performance difference wasn't visible at small sizes
    until I optimized arithmetic in the loop
    (mostly by supplying better options to GCC)
1: reducing number of loads
2: doing adds/multiplies/etc. with fewer instructions
3: simplifying address computations
but… how can that make cache blocking better???
overlapping loads and arithmetic
[Diagram: timeline in which the loads run at the same time as the multiplies and adds]
speed of load might not matter if the multiplies and adds are slower
optimization and bottlenecks
arithmetic/loop efficiency was the bottleneck
after fixing this, cache performance was the bottleneck
common theme when optimizing:
    X may not matter until Y is optimized
example assembly (unoptimized)
    long sum(long *A, int N) {
        long result = 0;
        for (int i = 0; i < N; ++i)
            result += A[i];
        return result;
    }

    sum:
        ...
    the_loop:
        ...
        leaq 0(,%rax,8), %rdx   // offset ← i * 8
        movq -24(%rbp), %rax    // get A from stack
        addq %rdx, %rax         // add offset
        movq (%rax), %rax       // get *(A + offset)
        addq %rax, -8(%rbp)     // add to sum, on stack
        addl $1, -12(%rbp)      // increment i
    condition:
        movl -12(%rbp), %eax
        cmpl -28(%rbp), %eax
        jl the_loop
        ...
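In the unoptimized assembly, `A`, `i`, and `result` all live on the stack, and the element address is recomputed from `i` on every iteration. A source-level sketch of the same loop with fewer loads and simpler address computation is shown below. The function name `sum_pointer` is hypothetical. Whether writing the C this way changes the generated code depends on the compiler and its options, and an optimizing compiler may well produce similar code from the original `sum`.

```c
#include <assert.h>

/* Same result as sum(), written so that the running total and the current
   position are plain locals and the address is advanced rather than
   recomputed as A + i * 8 each iteration. Assumes A points to N longs. */
long sum_pointer(const long *A, int N) {
    assert(N >= 0);
    long result = 0;              /* accumulator, a candidate for a register */
    const long *end = A + N;      /* one past the last element */
    for (const long *p = A; p != end; ++p)
        result += *p;             /* one load per element */
    return result;
}
```

For example, `sum_pointer(values, 4)` with `values` holding 1, 2, 3, 4 returns 10.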