Improving Cache Performance

AMAT: Average Memory Access Time
AMAT = T_hit + Miss Rate x Miss Penalty

Optimizations based on AMAT:
• Reducing Miss Rate
  • Structural: Cache size, Associativity, Block size, Compiler support
• Reducing Miss Penalty
  • Structural: Multi-level caches, Critical word first / Early restart
• Latency Hiding: Using concurrency to reduce miss rate or miss penalty
• Improving Hit Time
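As a quick illustration of how the formula is used in the examples that follow, here is a minimal C helper (the function name and parameters are invented for this sketch):

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty (all times in cycles). */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Example numbers only: 1-cycle hit, 8% miss rate, 25-cycle penalty. */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.08, 25.0));
        return 0;
    }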
Cache Performance Models

Temporal Locality: Repeated access to the same word
Spatial Locality: Access to words in physical proximity to a recently accessed word

Miss Categories:
• Compulsory: Cold-start (first-reference) misses
  • Equal to the miss rate of an infinite cache
  • Characteristic of the workload: e.g., streaming workloads, where the majority of misses are compulsory
• Capacity: Data set size larger than that of the cache
  • Increase the size of the cache to avoid thrashing
  • Measured with a fully associative cache abstraction
  • Replacement Algorithms: Optimal off-line algorithm (Belady Rule): Evict the cache block whose next reference is furthest in the future; provides a lower bound on the number of capacity misses for a given cache size
• Conflict: The cache organization causes a block to be discarded and later retrieved
  • Also called collision or interference misses
Cache Replacement

Replacement Algorithms:
Optimal off-line algorithm (Belady Rule): Evict the cache block whose next reference is furthest in the future
Provides a lower bound on the number of capacity misses for a given cache size

Cache size: 4 blocks
Block access sequence: A B C D E C E A D B C D E A B

OPTIMAL (Belady):
• Misses 1-4: compulsory misses on A, B, C, D
• Miss 5 (access E, compulsory): evict B, whose next reference is furthest in the future; cache holds A C D E
• Miss 6 (access B): evict A; cache holds B C D E
• Miss 7 (access A): evict D (or E or C, none of which is referenced again); cache keeps B
5 compulsory misses (A, B, C, D, E) + 2 capacity misses = 7 misses total
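A minimal C sketch of this off-line policy (not from the slides; the trace string, sizes, and function names are illustrative only):

    #include <stdio.h>
    #include <string.h>

    #define WAYS 4

    /* Position of the next use of block b after position pos, or a value
     * past the end of the trace if b is never referenced again. */
    static int next_use(const char *trace, int n, int pos, char b)
    {
        for (int i = pos + 1; i < n; i++)
            if (trace[i] == b) return i;
        return n + 1;
    }

    int main(void)
    {
        const char *trace = "ABCDECEADBCDEAB";   /* the access sequence above */
        int n = (int)strlen(trace);
        char cache[WAYS];
        int used = 0, misses = 0;

        for (int t = 0; t < n; t++) {
            char b = trace[t];
            int hit = 0;
            for (int i = 0; i < used; i++)
                if (cache[i] == b) hit = 1;
            if (hit) continue;

            misses++;
            if (used < WAYS) { cache[used++] = b; continue; }

            /* Belady rule: evict the resident block whose next reference
             * is furthest in the future. */
            int victim = 0;
            for (int i = 1; i < WAYS; i++)
                if (next_use(trace, n, t, cache[i]) > next_use(trace, n, t, cache[victim]))
                    victim = i;
            cache[victim] = b;
        }
        printf("Belady misses: %d\n", misses);   /* 7 for this trace */
        return 0;
    }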
Cache Replacement

Replacement Algorithms:
Least Recently Used (LRU): Evict the cache block whose last reference is furthest in the past

Cache size: 4 blocks
Block access sequence: A B C D E C E A D B C D E A B

LRU:
• Misses 1-4: compulsory misses on A, B, C, D
• Miss 5 (access E, compulsory): evict A; cache holds B C D E
• Miss 6 (access A): evict B; cache holds A C D E
• Miss 7 (access B): evict C; cache holds A B D E
• Miss 8 (access C): evict E; cache holds A B C D
• Miss 9 (access E): evict A; cache holds B C D E
• Miss 10 (access A): evict B; cache holds A C D E
• Miss 11 (access B): evict C; cache holds A B D E
5 compulsory misses + 6 replacement misses = 11 misses total: 4 additional misses compared to the optimal policy, due to non-optimal replacement
LRU
• Hard to implement efficiently in hardware
• Software: LRU stack -- on each access move the referenced block to the top of the stack; the block at the bottom is the LRU block

Access sequence: A B C D E C E A D
After the first four accesses the stack (top to bottom) is D C B A. Continuing:

    Access         E      C      E      A      D
                   miss   hit    hit    miss   hit
    TOP            E      C      E      A      D
                   D      E      C      E      A
                   C      D      D      C      E
    LRU block      B      B      B      D      C

On every hit the ordering information must be read and rewritten, so a full LRU stack is not practical for a hardware-maintained cache.
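A minimal software sketch of the LRU stack (not from the slides; the array-based stack and names are illustrative only):

    #include <stdio.h>
    #include <string.h>

    #define WAYS 4

    /* stack[0] is the most recently used block, stack[used-1] is the LRU block. */
    static char stack[WAYS];
    static int used = 0;

    /* Access block b; returns 1 on a hit, 0 on a miss. */
    static int access_block(char b)
    {
        int pos = -1;
        for (int i = 0; i < used; i++)
            if (stack[i] == b) { pos = i; break; }

        int hit = (pos >= 0);
        if (!hit) {
            if (used < WAYS) used++;   /* room left: grow the stack */
            pos = used - 1;            /* otherwise overwrite the LRU (bottom) slot */
        }
        /* Shift everything above pos down one slot, then place b on top. */
        memmove(&stack[1], &stack[0], (size_t)pos * sizeof stack[0]);
        stack[0] = b;
        return hit;
    }

    int main(void)
    {
        const char *trace = "ABCDECEAD";
        int misses = 0;
        for (const char *p = trace; *p; p++)
            if (!access_block(*p)) misses++;
        printf("misses: %d, LRU block now: %c\n", misses, stack[used - 1]);
        return 0;
    }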
LRU (contd)
• Approximate (tree) LRU -- used in some Intel processors
  • Arrange the 8 ways A, B, C, D, E, F, G, H as the leaves of a binary tree; each of the 7 internal nodes holds one bit recording whether its Left or Right subtree was accessed last
  • On a hit, set the bits along the path to the accessed way
  • On a miss, follow the path of the subtree NOT accessed last at each node to select the victim
• Random selection
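A compact C sketch of tree pseudo-LRU victim selection for 8 ways (not from the slides; the heap-style bit layout and function names are assumptions made for illustration):

    #include <stdio.h>

    /* 7 internal nodes of a binary tree over 8 ways, stored heap-style:
     * node 0 is the root, the children of node i are 2*i+1 and 2*i+2.
     * bit = 0 means "left accessed last", bit = 1 means "right accessed last". */
    static unsigned char plru_bits[7];

    /* On a hit (or a fill) to 'way' (0..7): record the path taken to reach it. */
    static void plru_touch(int way)
    {
        int node = 0;
        for (int level = 2; level >= 0; level--) {
            int right = (way >> level) & 1;          /* branch taken at this node */
            plru_bits[node] = (unsigned char)right;  /* remember which side was used last */
            node = 2 * node + 1 + right;
        }
    }

    /* On a miss: at every node walk away from the side that was accessed last. */
    static int plru_victim(void)
    {
        int node = 0, way = 0;
        for (int level = 0; level < 3; level++) {
            int go_right = !plru_bits[node];
            way = (way << 1) | go_right;
            node = 2 * node + 1 + go_right;
        }
        return way;                                   /* 0..7, i.e., ways A..H */
    }

    int main(void)
    {
        for (int w = 0; w < 8; w += 2)
            plru_touch(w);                            /* touch ways A, C, E, G */
        printf("victim way: %d\n", plru_victim());    /* an untouched way is chosen */
        return 0;
    }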
Reducing Miss Rate

1. Larger cache size:
   + Reduces capacity misses
   - Hit time may increase
   - Cost increases

2. Increased associativity:
   + Miss rate decreases (fewer conflict misses)
   - Hit time increases and may increase the clock cycle time
   - Hardware cost increases
   Miss rate with 8-way set associative is comparable to fully associative (empirical finding)

   Example
   Direct mapped cache: Hit time 1 cycle, Miss penalty 25 cycles (low!), Miss rate = 0.08
   8-way set associative: Clock cycle 1.5x longer, Miss rate = 0.07
   Let T be the clock cycle of the direct mapped cache.
   AMAT (direct mapped) = (1 + 0.08 x 25) x T = 3.0T
   AMAT (set associative): new clock period = 1.5T
     Miss penalty = ceiling(25T / 1.5T) x 1.5T = ceiling(25/1.5) x 1.5T = 17 x 1.5T = 25.5T
     AMAT = 1.5T + 0.07 x 25.5T = (1.5 + 1.785)T = 3.285T
   (Increasing associativity hurts in this example!)
Reducing Miss Rate

3. Block size (B):
   Miss rate first decreases and then increases with increasing block size
   + a) Compulsory miss rate decreases due to better use of spatial locality
   - b) Capacity (and conflict) misses increase as the effective cache size (number of blocks) decreases
   Miss penalty increases with increasing block size
   - c) Wasted memory access time: the miss penalty increases without providing any gain
        Do (a) and (c) balance each other?
   + d) Amortized memory access time per byte decreases (burst-mode memory)
   + Tag overhead decreases
   Low latency, low bandwidth memory: favors a smaller block size
   High latency, high bandwidth memory: favors a larger block size
Reducing Miss Rate

Block size B (contd):
Low latency, low bandwidth memory: favors a smaller block size
High latency, high bandwidth memory: favors a larger block size

Example:
Case 1: Miss ratio of 5% with B = 8; Case 2: Miss ratio of 4% with B = 16.
Burst-mode memory: latency 8 cycles, transfer rate 2 bytes/cycle. Cache hit time 1 cycle.
AMAT = Hit time + Miss Rate x Miss Penalty
Case 1: AMAT = 1 + 5% x (8 + 8/2)  = 1.6 cycles
Case 2: AMAT = 1 + 4% x (8 + 16/2) = 1.64 cycles
Suppose the memory latency were 16 cycles instead: this favors the larger block size.
Case 1: AMAT = 1 + 5% x (16 + 8/2)  = 2.0 cycles
Case 2: AMAT = 1 + 4% x (16 + 16/2) = 1.96 cycles
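A small C sketch that reproduces this calculation (not from the slides; the function and parameter names are invented, and the miss rates are the example's numbers):

    #include <stdio.h>

    /* AMAT with a burst-mode memory: the miss penalty is the access latency
     * plus the time to stream in one block at the given transfer rate. */
    static double amat_burst(double hit_time, double miss_rate,
                             double latency, double bytes_per_cycle,
                             double block_bytes)
    {
        double miss_penalty = latency + block_bytes / bytes_per_cycle;
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        printf("latency 8,  B=8 : %.2f cycles\n", amat_burst(1, 0.05, 8, 2, 8));
        printf("latency 8,  B=16: %.2f cycles\n", amat_burst(1, 0.04, 8, 2, 16));
        printf("latency 16, B=8 : %.2f cycles\n", amat_burst(1, 0.05, 16, 2, 8));
        printf("latency 16, B=16: %.2f cycles\n", amat_burst(1, 0.04, 16, 2, 16));
        return 0;
    }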
Reducing Miss Rate

4. Pseudo-associative caches
   + Maintain the hit speed of a direct mapped cache
   + Reduce conflict misses

   Column (or pseudo) associative: on a miss, check one more location in the direct mapped cache
   Like having a fixed way prediction

   Way prediction: predict which block in the set will be read on the next access
   If the tag matches: 1-cycle hit
   If the prediction fails: do a complete selection on subsequent cycles
   + Potential power savings
   - Poor prediction increases hit time
Column (or pseudo) associative
On a miss: check one more location in the direct mapped cache
Like having a fixed way prediction

[Figure: direct mapped cache; a block whose index is 0xxxx has its alternate location at 1xxxx, i.e., the location with the high-order index bit flipped.]
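A rough C sketch of a column-associative lookup where the alternate location is found by flipping the high-order index bit, as in the figure (not from the slides; sizes, names, and the simplification of storing the whole block number per line are assumptions):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BITS 6                    /* assume 64-byte blocks */
    #define INDEX_BITS 10
    #define NLINES (1u << INDEX_BITS)

    struct line { bool valid; uint32_t blkno; /* data omitted */ };
    static struct line cache[NLINES];

    /* Probe the primary (direct mapped) location; on a miss there, probe the
     * alternate location obtained by flipping the high-order index bit.
     * A real design would store a tag plus a one-bit "rehash" flag instead of
     * the full block number. */
    static bool lookup(uint32_t addr)
    {
        uint32_t blkno = addr >> BLOCK_BITS;
        uint32_t index = blkno & (NLINES - 1);

        if (cache[index].valid && cache[index].blkno == blkno)
            return true;                    /* fast hit at direct mapped speed */

        uint32_t alt = index ^ (1u << (INDEX_BITS - 1));
        if (cache[alt].valid && cache[alt].blkno == blkno)
            return true;                    /* slower hit in the alternate column */

        return false;                       /* miss: fill/swap policy omitted */
    }

    int main(void)
    {
        cache[3].valid = true; cache[3].blkno = 3;   /* pretend block 3 is resident */
        printf("hit: %d\n", lookup(3u << BLOCK_BITS));
        return 0;
    }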
Way Prediction
Predict which block in the set will be read on the next access.
If the tag matches: 1-cycle hit
If the prediction fails: do a complete selection on subsequent cycles

[Figure: 2-way set associative cache; the predicted way of the indexed set is read first.]
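A brief C sketch of way prediction for a 2-way set associative cache (not from the slides; the per-set prediction bit and the simple cycle counts are assumptions for illustration):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BITS 6
    #define SET_BITS   9
    #define NSETS (1u << SET_BITS)

    struct way { bool valid; uint32_t tag; };
    static struct way sets[NSETS][2];
    static uint8_t predicted_way[NSETS];        /* one prediction bit per set */

    /* Returns the cycles taken in this simplified model: 1 if the predicted way
     * hits, 2 if the other way hits (and the predictor is retrained). */
    static int access_cycles(uint32_t addr, bool *miss)
    {
        uint32_t set = (addr >> BLOCK_BITS) & (NSETS - 1);
        uint32_t tag = addr >> (BLOCK_BITS + SET_BITS);
        int p = predicted_way[set];

        *miss = false;
        if (sets[set][p].valid && sets[set][p].tag == tag)
            return 1;                            /* predicted way hit: 1-cycle hit */

        if (sets[set][1 - p].valid && sets[set][1 - p].tag == tag) {
            predicted_way[set] = (uint8_t)(1 - p);   /* retrain the predictor */
            return 2;                            /* hit in the other way costs extra */
        }

        *miss = true;                            /* miss: fill and update omitted */
        return 2;
    }

    int main(void)
    {
        bool miss;
        sets[0][1].valid = true; sets[0][1].tag = 0; /* block resident in way 1 of set 0 */
        printf("first access:  %d cycles\n", access_cycles(0, &miss));
        printf("second access: %d cycles\n", access_cycles(0, &miss));
        return 0;
    }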
Reducing Miss Rate

5. Compiler Optimizations

Instruction access:
• Rearrange code (procedure and code-block placement) to reduce conflict misses
• Align the entry point of a basic block with the start of a cache block

Data access: improve spatial/temporal locality in arrays

a) Merging arrays: replace parallel arrays with an array of structs (spatial locality)

   Parallel arrays:
   update(j): { *name[j] = ...; id[j] = ...; age[j] = ...; salary[j] = ...; }

   Array of structs:
   update(j): { *(person[j].name) = ...; person[j].id = ...; person[j].age = ...; person[j].salary = ...; }

   When might separate arrays be better?

b) Loop fusion: combine loops that use the same data (temporal locality)

   Separate loops:
   for (j = 0; j < n; j++)
       x[j] = y[2 * j];
   for (j = 0; j < n; j++)
       sum += x[j];

   Fused loop:
   for (j = 0; j < n; j++) {
       x[j] = y[2 * j];
       sum += x[j];
   }

   When might separate loops be better?
Reducing Miss Rate

Compiler Optimizations (contd ...)
• Data access: improve spatial/temporal locality in arrays

c) Loop interchange: convert column-major matrix access to row-major access (spatial locality)

   [Figure: an m x n array a, stored row-major in memory]

   Array element size w bytes, block size B bytes, so B/w elements per block.
   Assuming row-major storage in memory:

   Column-order access -- could miss on each access of a[ ][ ]:
   for (j = 0; j < n; j++)
       for (k = 0; k < m; k++)
           a[k][j] = 0;
   Misses: mn

   Row-order access -- only compulsory misses, 1 per block:
   for (k = 0; k < m; k++)
       for (j = 0; j < n; j++)
           a[k][j] = 0;
   Misses: mn / (B/w) = mnw/B
Reducing Miss Rate

Compiler/Programmer Optimizations (contd ...)

d) Blocking: use block-oriented access to maximize both temporal and spatial locality

   Cache-insensitive matrix multiplication: O(n^3) cache misses for accessing the elements of matrix b
   for (i = 0; i < n; i++)
       for (j = 0; j < n; j++)
           for (k = 0; k < n; k++)
               c[i][j] += a[i][k] * b[k][j];

   [Figure: access patterns through matrices a and b]
Reducing Miss Rate

Compiler/Programmer Optimizations (contd ...)

d) Blocking (contd): use block-oriented access to maximize both temporal and spatial locality

   Blocked version of the multiplication above, which suffered O(n^3) cache misses for the elements of matrix b.
   View each matrix as an (n/s) x (n/s) grid of s x s sub-blocks and multiply block by block:
   for (i = 0; i < n/s; i++)
       for (j = 0; j < n/s; j++)
           for (k = 0; k < n/s; k++)
               C[i][j] = C[i][j] + A[i][k] * B[k][j];   /* operations on s x s sub-blocks */
   Here each "+" is an s x s matrix addition and each "*" is a block matrix multiplication of
   sub-block A[i][k] with sub-block B[k][j], producing one update of sub-block C[i][j].
   A sketch of this scheme with scalar loops follows.

   [Figure: matrices a and b partitioned into s x s sub-blocks, e.g. A[0][0] and B[0][0]]
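A runnable C sketch of the blocked multiplication with scalar inner loops (not from the slides; N, the block size S, and the initialization are illustrative, and S would normally be chosen so that about three S x S sub-blocks fit in the cache):

    #include <stdio.h>

    #define N 8          /* matrix dimension (illustrative) */
    #define S 4          /* block size: ~3 SxS sub-blocks should fit in cache */

    static double a[N][N], b[N][N], c[N][N];

    /* Blocked matrix multiply: the outer three loops walk over SxS sub-blocks,
     * the inner three loops do the sub-block multiply-accumulate, so each
     * sub-block of b is reused while it is still resident in the cache. */
    static void blocked_matmul(void)
    {
        for (int ib = 0; ib < N; ib += S)
            for (int jb = 0; jb < N; jb += S)
                for (int kb = 0; kb < N; kb += S)
                    for (int i = ib; i < ib + S; i++)
                        for (int k = kb; k < kb + S; k++) {
                            double aik = a[i][k];
                            for (int j = jb; j < jb + S; j++)
                                c[i][j] += aik * b[k][j];
                        }
    }

    int main(void)
    {
        /* Easy-to-check initialization: a = identity, b[i][j] = i + j, so c = b. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = (i == j) ? 1.0 : 0.0;
                b[i][j] = (double)(i + j);
            }
        blocked_matmul();
        printf("c[2][3] = %.1f (expected 5.0)\n", c[2][3]);
        return 0;
    }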