Cache Memories Lecture, Oct. 30, 2018 1 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
General Cache Concept
• Cache: smaller, faster, more expensive memory; caches a subset of the blocks
• Data is copied in block-sized transfer units
• Memory: larger, slower, cheaper memory, viewed as partitioned into “blocks”
[Figure: a small cache above a 16-block memory (blocks 0 to 15); blocks 4 and 10 are shown being transferred between the two]
Structure Representation
    struct rec {
        int a[4];
        size_t i;
        struct rec *next;
    };
[Figure: layout of the struct pointed to by r: a at byte offset 0, i at offset 16, next at offset 24, total size 32 bytes]
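The offsets in the figure can be checked directly with `offsetof`. The following is a minimal sketch, assuming an LP64 platform such as x86-64 Linux (4-byte `int`, 8-byte `size_t` and pointers); the helper name `rec_layout_matches_slide` is introduced here for illustration and is not part of the lecture.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption of this sketch: 4-byte int, as on common LP64 systems. */
static_assert(sizeof(int) == 4, "sketch assumes 4-byte int");

struct rec {
    int a[4];          /* 16 bytes: offsets 0..15 */
    size_t i;          /* 8 bytes on LP64: offset 16 */
    struct rec *next;  /* 8 bytes on LP64: offset 24 */
};

/* Returns 1 if the compiler's layout matches the figure
   (a at 0, i at 16, next at 24, total size 32), else 0.
   Other ABIs (for example 32-bit targets) may lay the struct out differently. */
int rec_layout_matches_slide(void)
{
    return offsetof(struct rec, a) == 0
        && offsetof(struct rec, i) == 16
        && offsetof(struct rec, next) == 24
        && sizeof(struct rec) == 32;
}
```

Calling `rec_layout_matches_slide()` from a test program is a quick way to confirm that a given compiler and target agree with the figure before relying on the offsets.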
I[0].A  I[0].B  I[0].BV[0]  I[0].B[1]
I[1].A  I[1].B  I[1].BV[0]  I[1].B[1]
I[0].A  I[0].B  I[0].BV[0]  I[0].B[1]
I[1].A  I[1].B  I[1].BV[0]  I[1].B[1]
I[2].A  I[2].B  I[2].BV[0]  I[2].B[1]
I[3].A  I[3].B  I[3].BV[0]  I[3].B[1]
Each block associated with the first half of the array has a unique spot in memory
2^9
Cache Optimization Techniques
Loop 1 (inner loop over i):
    for (j = 0; j < 3; j = j + 1) {
        for (i = 0; i < 3; i = i + 1) {
            x[i][j] = 2 * x[i][j];
        }
    }
Loop 2 (inner loop over j):
    for (i = 0; i < 3; i = i + 1) {
        for (j = 0; j < 3; j = j + 1) {
            x[i][j] = 2 * x[i][j];
        }
    }
These two loops compute the same result
Inner loop analysis
Array in row major order:
    X[0][0] X[0][1] X[0][2]
    X[1][0] X[1][1] X[1][2]
    X[2][0] X[2][1] X[2][2]
Byte addresses of the first six elements (4-byte elements): 0x0-0x3, 0x4-0x7, 0x8-0xB, 0xC-0xF, 0x10-0x13, 0x14-0x17
Order in memory: X[0][0] X[0][1] X[0][2] X[1][0] X[1][1] X[1][2] X[2][0] X[2][1] X[2][2]
Cache Optimization Techniques
Loop 1 (inner loop over i):
    for (j = 0; j < 3; j = j + 1) {
        for (i = 0; i < 3; i = i + 1) {
            x[i][j] = 2 * x[i][j];
        }
    }
Loop 2 (inner loop over j):
    for (i = 0; i < 3; i = i + 1) {
        for (j = 0; j < 3; j = j + 1) {
            x[i][j] = 2 * x[i][j];
        }
    }
These two loops compute the same result
The same array allocated dynamically and indexed by hand:
    int *x = malloc(N * N * sizeof(int));
    for (i = 0; i < 3; i = i + 1) {
        for (j = 0; j < 3; j = j + 1) {
            x[i*N + j] = 2 * x[i*N + j];
        }
    }
Array in row major order:
    X[0][0] X[0][1] X[0][2]
    X[1][0] X[1][1] X[1][2]
    X[2][0] X[2][1] X[2][2]
Byte addresses of the first six elements (4-byte elements): 0x0-0x3, 0x4-0x7, 0x8-0xB, 0xC-0xF, 0x10-0x13, 0x14-0x17
Order in memory: X[0][0] X[0][1] X[0][2] X[1][0] X[1][1] X[1][2] X[2][0] X[2][1] X[2][2]
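The claim that both loop orders compute the same result, and that `x[i][j]` lives at flat index `i*N + j`, can be checked with a short sketch. The function names `scale_row_order`, `scale_col_order`, and `byte_offset` are illustrative names introduced here, and `N` is fixed at 3 to match the slide's example.

```c
#include <assert.h>
#include <stddef.h>

#define N 3  /* matrix dimension, as in the 3 x 3 example */

/* Double every element with the inner loop over columns:
   consecutive accesses are adjacent in memory (stride 1). */
void scale_row_order(int x[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}

/* Same result with the inner loop over rows:
   consecutive accesses are N elements apart (stride N). */
void scale_col_order(int x[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2 * x[i][j];
}

/* Byte offset of x[i][j] from the start of a row-major N x N int array. */
size_t byte_offset(size_t i, size_t j)
{
    assert(i < N && j < N);
    return (i * N + j) * sizeof(int);
}
```

Only the order of the memory accesses differs between the two functions, which is why the choice matters for the cache but not for the result.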
Matrix Multiplication Refresher
Miss Rate Analysis for Matrix Multiply
• Assume:
  • Block size = 32B (big enough for four doubles)
  • Matrix dimension (N) is very large
  • Cache is not even big enough to hold multiple rows
• Analysis Method:
  • Look at access pattern of inner loop
[Figure: C = A x B; element (i, j) of C is computed from row i of A and column j of B, both indexed by k]
Layout of C Arrays in Memory (review)
• C arrays allocated in row-major order
  • each row in contiguous memory locations
• Stepping through columns in one row:
  • for (i = 0; i < N; i++) sum += a[0][i];
  • accesses successive elements
  • if block size (B) > sizeof(a_ij) bytes, exploit spatial locality
  • miss rate = sizeof(a_ij) / B
• Stepping through rows in one column:
  • for (i = 0; i < N; i++) sum += a[i][0];
  • accesses distant elements
  • no spatial locality!
  • miss rate = 1 (i.e. 100%)
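The two miss rates above can be reproduced with a small counting model. This is a sketch under simplifying assumptions that are not in the slides: the array starts on a block boundary, and only the most recently used block is retained, so an access misses exactly when it falls in a different block than the previous access. The function name `strided_miss_rate` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of accesses that miss when reading `count` elements starting
   at byte 0 with a fixed byte stride.
   Simplified model (assumption of this sketch): only the most recently
   used block stays cached, and the array is block-aligned. */
double strided_miss_rate(size_t count, size_t stride_bytes, size_t block_bytes)
{
    assert(count > 0 && block_bytes > 0);
    size_t misses = 0;
    size_t prev_block = (size_t)-1;   /* sentinel: no block loaded yet */
    for (size_t k = 0; k < count; k++) {
        size_t block = (k * stride_bytes) / block_bytes;
        if (block != prev_block) {    /* new block: count a miss */
            misses++;
            prev_block = block;
        }
    }
    return (double)misses / (double)count;
}
```

With 8-byte doubles and 32-byte blocks, a row traversal has stride 8 and gives `strided_miss_rate(1024, 8, 32)` equal to 8/32 = 0.25, while a column traversal of a 1024-wide array has stride 1024 * 8 bytes and misses on every access.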
Matrix Multiplication (ijk)
    /* ijk */
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            sum = 0.0;
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
    matmult/mm.c
Inner loop:
    Matrix   Access   Pattern       Misses per inner loop iteration
    A        (i,*)    Row-wise      0.25
    B        (*,j)    Column-wise   1.0
    C        (i,j)    Fixed         0.0
Matrix Multiplication (jik)
    /* jik */
    for (j=0; j<n; j++) {
        for (i=0; i<n; i++) {
            sum = 0.0;
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
    matmult/mm.c
Inner loop:
    Matrix   Access   Pattern       Misses per inner loop iteration
    A        (i,*)    Row-wise      0.25
    B        (*,j)    Column-wise   1.0
    C        (i,j)    Fixed         0.0
Matrix Multiplication (kij)
    /* kij */
    for (k=0; k<n; k++) {
        for (i=0; i<n; i++) {
            r = a[i][k];
            for (j=0; j<n; j++)
                c[i][j] += r * b[k][j];
        }
    }
    matmult/mm.c
Inner loop:
    Matrix   Access   Pattern    Misses per inner loop iteration
    A        (i,k)    Fixed      0.0
    B        (k,*)    Row-wise   0.25
    C        (i,*)    Row-wise   0.25
Matrix Multiplication (ikj)
    /* ikj */
    for (i=0; i<n; i++) {
        for (k=0; k<n; k++) {
            r = a[i][k];
            for (j=0; j<n; j++)
                c[i][j] += r * b[k][j];
        }
    }
    matmult/mm.c
Inner loop:
    Matrix   Access   Pattern    Misses per inner loop iteration
    A        (i,k)    Fixed      0.0
    B        (k,*)    Row-wise   0.25
    C        (i,*)    Row-wise   0.25
Matrix Multiplication (jki)
    /* jki */
    for (j=0; j<n; j++) {
        for (k=0; k<n; k++) {
            r = b[k][j];
            for (i=0; i<n; i++)
                c[i][j] += a[i][k] * r;
        }
    }
    matmult/mm.c
Inner loop:
    Matrix   Access   Pattern       Misses per inner loop iteration
    A        (*,k)    Column-wise   1.0
    B        (k,j)    Fixed         0.0
    C        (*,j)    Column-wise   1.0
Matrix Multiplication (kji)
    /* kji */
    for (k=0; k<n; k++) {
        for (j=0; j<n; j++) {
            r = b[k][j];
            for (i=0; i<n; i++)
                c[i][j] += a[i][k] * r;
        }
    }
    matmult/mm.c
Inner loop:
    Matrix   Access   Pattern       Misses per inner loop iteration
    A        (*,k)    Column-wise   1.0
    B        (k,j)    Fixed         0.0
    C        (*,j)    Column-wise   1.0
Summary of Matrix Multiplication
ijk (& jik):
• 2 loads, 0 stores
• misses/iter = 1.25
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            sum = 0.0;
            for (k=0; k<n; k++) {
                sum += a[i][k] * b[k][j];}
            c[i][j] = sum;
        }
    }
kij (& ikj):
• 2 loads, 1 store
• misses/iter = 0.5
    for (k=0; k<n; k++) {
        for (i=0; i<n; i++) {
            r = a[i][k];
            for (j=0; j<n; j++){
                c[i][j] += r * b[k][j];}
        }
    }
jki (& kji):
• 2 loads, 1 store
• misses/iter = 2.0
    for (j=0; j<n; j++) {
        for (k=0; k<n; k++) {
            r = b[k][j];
            for (i=0; i<n; i++){
                c[i][j] += a[i][k] * r;}
        }
    }
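The three loop orderings in the summary differ only in memory access pattern, not in the product they compute. The sketch below packages one representative of each pair as a function so that this can be checked; the fixed size `DIM`, the `matrix` typedef, and the function names are assumptions made for illustration. Note that the kij and jki versions accumulate into `c`, so they zero it first.

```c
#include <assert.h>
#include <string.h>

#define DIM 8  /* small illustrative size; the analysis assumes very large n */

typedef double matrix[DIM][DIM];

/* ijk: A row-wise, B column-wise, C fixed in the inner loop. */
void mm_ijk(matrix a, matrix b, matrix c)
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double sum = 0.0;
            for (int k = 0; k < DIM; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* kij: A fixed, B and C row-wise in the inner loop. */
void mm_kij(matrix a, matrix b, matrix c)
{
    memset(c, 0, sizeof(matrix));   /* c is accumulated into, so clear it */
    for (int k = 0; k < DIM; k++)
        for (int i = 0; i < DIM; i++) {
            double r = a[i][k];
            for (int j = 0; j < DIM; j++)
                c[i][j] += r * b[k][j];
        }
}

/* jki: A and C column-wise, B fixed in the inner loop. */
void mm_jki(matrix a, matrix b, matrix c)
{
    memset(c, 0, sizeof(matrix));   /* c is accumulated into, so clear it */
    for (int j = 0; j < DIM; j++)
        for (int k = 0; k < DIM; k++) {
            double r = b[k][j];
            for (int i = 0; i < DIM; i++)
                c[i][j] += a[i][k] * r;
        }
}
```

Timing these functions for large sizes is one way to observe the miss-rate differences, although a size as small as `DIM = 8` fits entirely in cache and would not show them.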
Core i7 Matrix Multiply Performance
[Figure: cycles per inner loop iteration (log scale, 1 to 100) versus array size n (50 to 700) for the six loop orderings. The curves form three groups, from highest to lowest: jki / kji, ijk / jik, kij / ikj]
Example: Matrix Multiplication
    c = (double *) calloc(n*n, sizeof(double));

    /* Multiply n x n matrices a and b */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }
[Figure: c = a * b; element (i, j) of c is computed from row i of a and column j of b]
Cache Miss Analysis
• Assume:
  • Matrix elements are doubles
  • Assume the matrix is square
  • Cache block = 8 doubles
  • Cache size C << n (much smaller than n)
• First iteration:
  • n/8 + n = 9n/8 misses (n/8 for the row of a, n for the column of b)
  • Afterwards in cache: (schematic)
[Figure: schematic of c = a * b for n x n matrices; the blocks left in cache are 8 doubles wide]
Cache Miss Analysis
• Assume:
  • Matrix elements are doubles
  • Cache block = 8 doubles
  • Cache size C << n (much smaller than n)
• Second iteration:
  • Again: n/8 + n = 9n/8 misses
• Total misses:
  • 9n/8 * n^2 = (9/8) * n^3
[Figure: schematic of c = a * b for n x n matrices; the cached blocks are 8 doubles wide]
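As a worked instance of the total above, the estimate can be written as a small function. This is a sketch of the slide's model only (8 doubles per block, nothing reused between inner products); the name `naive_mmm_misses` is hypothetical, and n is assumed to be a multiple of 8 so that n/8 is exact.

```c
#include <assert.h>

/* Estimated misses for the unblocked multiply under the slide's model:
   each of the n*n inner products costs n/8 misses for the row of a
   plus n misses for the column of b.
   Assumption: n is a multiple of 8. */
unsigned long long naive_mmm_misses(unsigned long long n)
{
    assert(n % 8 == 0);
    unsigned long long per_element = n / 8 + n;   /* 9n/8 */
    return per_element * n * n;                   /* (9/8) * n^3 */
}
```

For n = 1024 this gives 1152 misses per element of c and 1152 * 1024^2 = 1,207,959,552 misses in total under the model.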
Blocked Matrix Multiplication
[Figure: c += a * b, computed one block at a time; the highlighted block of c is at row i1, column j1. Block size B x B]
[Figure: c += a * b with block size B x B and block indices i1, j1. Example: two identical 4 x 4 matrices whose entries are numbered by 2 x 2 block:]
     1  2  5  6
     3  4  7  8
     9 10 13 14
    11 12 15 16
[Figure: c += a * b with block size B x B, continuing the 4 x 4 example; one 2 x 2 block of the result is a sum of block products:]
    | 1 2 |   | 1 2 |   | 5 6 |   |  9 10 |
    | 3 4 | * | 3 4 | + | 7 8 | * | 11 12 |
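The block-by-block update in the figure can be sketched in C as follows. This is one possible formulation, not necessarily the lecture's own code: the function name `blocked_mmm` and the run-time block size parameter `bsize` are assumptions, `n` is assumed to be a multiple of `bsize`, and `c` must be zeroed by the caller because the routine accumulates into it.

```c
#include <assert.h>

/* Blocked multiply of n x n row-major matrices: c += a * b.
   The three outer loops pick one bsize x bsize block each of c, a, and b;
   the three inner loops do a small matrix multiply on those blocks.
   Assumptions of this sketch: n % bsize == 0, and c starts at zero
   if a plain product is wanted. */
void blocked_mmm(const double *a, const double *b, double *c,
                 int n, int bsize)
{
    assert(bsize > 0 && n % bsize == 0);
    for (int i = 0; i < n; i += bsize)
        for (int j = 0; j < n; j += bsize)
            for (int k = 0; k < n; k += bsize)
                /* bsize x bsize mini matrix multiplication */
                for (int i1 = i; i1 < i + bsize; i1++)
                    for (int j1 = j; j1 < j + bsize; j1++)
                        for (int k1 = k; k1 < k + bsize; k1++)
                            c[i1 * n + j1] += a[i1 * n + k1] * b[k1 * n + j1];
}
```

Applied to the 4 x 4 example matrix with `bsize = 2`, the top-left block of the result is the sum of the two block products shown in the figure. In practice the block size would be chosen so that the three active blocks fit in the cache together.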