page 1
play

Page 1 Ridges of Temporal Locality Ridges of Temporal Locality - PDF document

Writing Cache Friendly Code Writing Cache Friendly Code The Memory Mountain The Memory Mountain Repeated references to variables are good (temporal Repeated references to variables are good (temporal Read throughput (read bandwidth) Read


  1. Writing Cache Friendly Code Writing Cache Friendly Code The Memory Mountain The Memory Mountain Repeated references to variables are good (temporal Repeated references to variables are good (temporal Read throughput (read bandwidth) Read throughput (read bandwidth) locality) locality) � Number of bytes read from memory per second (MB/s) Stride Stride- -1 reference patterns are good (spatial locality) 1 reference patterns are good (spatial locality) Memory mountain Memory mountain � Measured read throughput as a function of spatial and Examples: Examples: temporal locality. � cold cache, 4-byte words, 4-word cache blocks � Compact way to characterize memory system performance � Compact way to characterize memory system performance. int sumarrayrows(int a[M][N]) int sumarraycols(int a[M][N]) { { int i, j, sum = 0; int i, j, sum = 0; for (i = 0; i < M; i++) for (j = 0; j < N; j++) for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; sum += a[i][j]; return sum; return sum; } } Miss rate = 1/4 = 25% Miss rate = 100% – 1 – – 2 – Memory Mountain Test Function Memory Mountain Test Function Memory Mountain Main Routine Memory Mountain Main Routine /* mountain.c - Generate the memory mountain. */ #define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */ /* The test function */ #define MAXBYTES (1 << 23) /* ... up to 8 MB */ void test(int elems, int stride) { #define MAXSTRIDE 16 /* Strides range from 1 to 16 */ int i, result = 0; #define MAXELEMS MAXBYTES/sizeof(int) volatile int sink; int data[MAXELEMS]; /* The array we'll be traversing */ for (i = 0; i < elems; i += stride) result += data[i]; int main() sink = result; /* So compiler doesn't optimize away the loop */ { } } i t int size; /* Working set size (in bytes) */ i /* W ki t i (i b t ) */ int stride; /* Stride (in array elements) */ /* Run test(elems, stride) and return read throughput (MB/s) */ double Mhz; /* Clock frequency */ double run(int size, int stride, double Mhz) { init_data(data, MAXELEMS); /* Initialize each element in data to 1 */ double cycles; Mhz = mhz(0); /* Estimate the clock frequency */ int elems = size / sizeof(int); for (size = MAXBYTES; size >= MINBYTES; size >>= 1) { for (stride = 1; stride <= MAXSTRIDE; stride++) test(elems, stride); /* warm up the cache */ printf("%.1f\t", run(size, stride, Mhz)); cycles = fcyc2(test, elems, stride, 0); /* call test(elems,stride) */ printf("\n"); return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */ } } exit(0); } – 3 – – 4 – The Memory Mountain The Memory Mountain Pentium 4 – 2.4 GHz Pentium 4 – 2.4 GHz Pentium III Xeon 5000 1200 550 MHz 4500 16 KB on-chip L1 d-cache hroughput (MB/s) 16 KB on-chip L1 i-cache 1000 4000 L1 512 KB off-chip unified put (MB/s) 4500-5000 L2 cache 3500 800 4000-4500 3500-4000 3000 read th Read throughp 3000 3500 3000-3500 600 2500 2500-3000 Ridges of 2000-2500 400 2000 xe Temporal 1500-2000 Slopes of L2 Locality 1500 1000-1500 Spatial 200 500-1000 Locality 1000 0-500 2k 0 500 16k s1 mem s3 2k 0 128k Working set s5 8k s7 32k s1 size (bytes) s9 s2 128k s3 1024k s4 s5 s11 s6 stride (words) s7 512k s8 s13 working set size (bytes) s9 s10 2m s11 8m s15 s12 s13 s14 s15 8m s16 Stride (words) – 5 – – 6 – Page 1

  2. Ridges of Temporal Locality Ridges of Temporal Locality Pentium 4 Pentium 4 Memory performance (stride = 6) Slice through the memory mountain with stride=1 Slice through the memory mountain with stride=1 � illuminates read throughputs of different caches and 4500 memory 4000 1200 main memory L2 cache L1 cache 3500 (MB/s) region region region 1000 3000 Read throughput read througput (MB/s) 2500 800 2000 600 1500 400 1000 200 500 0 0 8m 4m 2m 1024k 512k 256k 128k 64k 32k 16k 8k 4k 2k 1k 8m 4m 2m 1024k 512k 256k 128k 64k 32k 16k 8k 4k 2k Working set size (bytes) – 7 – working set size (bytes) – 8 – A Slope of Spatial Locality A Slope of Spatial Locality Pentium 4 Pentium 4 Memory performance (working set size = 512 Kbytes) Slice through memory mountain with size=256KB Slice through memory mountain with size=256KB � shows cache block size. 3500 800 3000 700 (MB/s) 2500 600 B/s) read throughput (MB Read throughput 2000 500 one access per cache line 400 1500 300 1000 200 100 500 0 0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 stride (words) – 9 – – 10 – Stride (words) Matrix Multiplication Example Matrix Multiplication Example Miss Rate Analysis for Matrix Multiply Miss Rate Analysis for Matrix Multiply Assume: Assume: Major Cache Effects to Consider Major Cache Effects to Consider � Line size = 32B (big enough for 4 64-bit words) � Total cache size � Matrix dimension (N) is very large � Exploit temporal locality and keep the working set small (e.g., by using � Approximate 1/N as 0.0 blocking) /* ijk */ Variable sum � Cache is not even big enough to hold multiple rows � Block size for (i=0; i<n; i++) { held in register � Exploit spatial locality � Exploit spatial locality for (j=0; j<n; j++) { for (j=0; j<n; j++) { Analysis Method: Analysis Method: Analysis Method: Analysis Method: sum = 0.0; � Look at access pattern of inner loop for (k=0; k<n; k++) Description: Description: sum += a[i][k] * b[k][j]; c[i][j] = sum; k j j � Multiply N x N matrices } � O(N3) total operations i k i } � Accesses C A B � N reads per source element � N values summed per destination » but may be able to hold in register – 11 – – 12 – Page 2

  3. Layout of C Arrays in Memory Layout of C Arrays in Memory Matrix Multiplication (ijk) Matrix Multiplication (ijk) (review) (review) C arrays allocated in row C arrays allocated in row- -major order major order � each row in contiguous memory locations /* ijk */ Inner loop: for (i=0; i<n; i++) { Stepping through columns in one row: Stepping through columns in one row: (*,j) for (j=0; j<n; j++) { � for (i = 0; i < N; i++) (i,j) sum = 0.0; sum += a[0][i]; (i,*) for (k=0; k<n; k++) � accesses successive elements A A B B C C sum += a[i][k] * b[k][j]; sum += a[i][k] * b[k][j]; � if block size (B) > 4 bytes, exploit spatial locality c[i][j] = sum; � compulsory miss rate = 4 bytes / B } Stepping through rows in one column: Stepping through rows in one column: } Row-wise Column- Fixed wise � for (i = 0; i < n; i++) sum += a[i][0]; Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: � accesses distant elements A B C � no spatial locality! 0.25 1.0 0.0 � compulsory miss rate = 1 (i.e. 100%) – 13 – – 14 – Matrix Multiplication (jik) Matrix Multiplication (jik) Matrix Multiplication (kij) Matrix Multiplication (kij) /* jik */ /* kij */ Inner loop: Inner loop: for (k=0; k<n; k++) { for (j=0; j<n; j++) { for (i=0; i<n; i++) { (*,j) for (i=0; i<n; i++) { (i,k) (k,*) (i,*) (i,j) sum = 0.0; r = a[i][k]; (i,*) for (j=0; j<n; j++) A B C for (k=0; k<n; k++) A A B B C C sum += a[i][k] * b[k][j]; + [i][k] * b[k][j] c[i][j] += r * b[k][j]; c[i][j] += r * b[k][j]; c[i][j] = sum } } } Fixed Row-wise Row-wise } Row-wise Column- Fixed wise Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: A B C A B C 0.25 1.0 0.0 0.0 0.25 0.25 – 15 – – 16 – Matrix Multiplication (ikj) Matrix Multiplication (ikj) Matrix Multiplication (jki) Matrix Multiplication (jki) /* ikj */ /* jki */ Inner loop: Inner loop: for (i=0; i<n; i++) { for (j=0; j<n; j++) { (*,k) (*,j) for (k=0; k<n; k++) { for (k=0; k<n; k++) { (i,k) (k,*) (i,*) (k,j) r = a[i][k]; r = b[k][j]; for (j=0; j<n; j++) A B C for (i=0; i<n; i++) A A B B C C c[i][j] += r * b[k][j]; c[i][j] += r * b[k][j]; c[i][j] += a[i][k] * r; c[i][j] += a[i][k] * r; } } } } Fixed Row-wise Row-wise Column - Fixed Column- wise wise Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: A B C A B C 0.0 0.25 0.25 1.0 0.0 1.0 – 17 – – 18 – Page 3

Recommend


More recommend