Improving Cache Performance

AMAT: Average Memory Access Time
AMAT = T_hit + Miss Rate x Miss Penalty

Optimizations based on AMAT:
• Reducing Miss Rate
  • Structural: Cache size, Associativity, Block size, Compiler support
• Reducing Miss Penalty
  • Structural: Multi-level caches, Critical word first / Early restart
• Latency Hiding: Using concurrency to reduce miss rate or miss penalty
• Improving Hit Time
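As a quick illustration of how the formula is used in the examples that follow, here is a minimal C helper (the function name and parameters are invented for this sketch):

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty (all times in cycles). */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Example numbers only: 1-cycle hit, 8% miss rate, 25-cycle penalty. */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.08, 25.0));
        return 0;
    }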
Cache Performance Models

Temporal Locality: Repeated access to the same word
Spatial Locality: Access to words in physical proximity to a recently accessed word

Miss Categories:
• Compulsory: Cold-start (first-reference) misses
  • Equal to the miss rate of an infinite cache
  • Characteristic of the workload: e.g., streaming workloads, where the majority of misses are compulsory
• Capacity: Data set size larger than that of the cache
  • Increase the size of the cache to avoid thrashing
  • Measured with a fully associative cache abstraction
  • Replacement Algorithms: Optimal off-line algorithm (Belady Rule): Evict the cache block whose next reference is furthest in the future; provides a lower bound on the number of capacity misses for a given cache size
• Conflict: The cache organization causes a block to be discarded and later retrieved
  • Also called collision or interference misses
Cache Replacement

Replacement Algorithms:
Optimal off-line algorithm (Belady Rule): Evict the cache block whose next reference is furthest in the future
Provides a lower bound on the number of capacity misses for a given cache size

Cache size: 4 blocks
Block access sequence: A B C D E C E A D B C D E A B

OPTIMAL (Belady):
• Misses 1-4: compulsory misses on A, B, C, D
• Miss 5 (access E, compulsory): evict B, whose next reference is furthest in the future; cache holds A C D E
• Miss 6 (access B): evict A; cache holds B C D E
• Miss 7 (access A): evict D (or E or C, none of which is referenced again); cache keeps B
5 compulsory misses (A, B, C, D, E) + 2 capacity misses = 7 misses total
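A minimal C sketch of this off-line policy (not from the slides; the trace string, sizes, and function names are illustrative only):

    #include <stdio.h>
    #include <string.h>

    #define WAYS 4

    /* Position of the next use of block b after position pos, or a value
     * past the end of the trace if b is never referenced again. */
    static int next_use(const char *trace, int n, int pos, char b)
    {
        for (int i = pos + 1; i < n; i++)
            if (trace[i] == b) return i;
        return n + 1;
    }

    int main(void)
    {
        const char *trace = "ABCDECEADBCDEAB";   /* the access sequence above */
        int n = (int)strlen(trace);
        char cache[WAYS];
        int used = 0, misses = 0;

        for (int t = 0; t < n; t++) {
            char b = trace[t];
            int hit = 0;
            for (int i = 0; i < used; i++)
                if (cache[i] == b) hit = 1;
            if (hit) continue;

            misses++;
            if (used < WAYS) { cache[used++] = b; continue; }

            /* Belady rule: evict the resident block whose next reference
             * is furthest in the future. */
            int victim = 0;
            for (int i = 1; i < WAYS; i++)
                if (next_use(trace, n, t, cache[i]) > next_use(trace, n, t, cache[victim]))
                    victim = i;
            cache[victim] = b;
        }
        printf("Belady misses: %d\n", misses);   /* 7 for this trace */
        return 0;
    }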
Cache Replacement

Replacement Algorithms:
Least Recently Used (LRU): Evict the cache block whose last reference is furthest in the past

Cache size: 4 blocks
Block access sequence: A B C D E C E A D B C D E A B

LRU:
• Misses 1-4: compulsory misses on A, B, C, D
• Miss 5 (access E, compulsory): evict A; cache holds B C D E
• Miss 6 (access A): evict B; cache holds A C D E
• Miss 7 (access B): evict C; cache holds A B D E
• Miss 8 (access C): evict E; cache holds A B C D
• Miss 9 (access E): evict A; cache holds B C D E
• Miss 10 (access A): evict B; cache holds A C D E
• Miss 11 (access B): evict C; cache holds A B D E
5 compulsory misses + 6 replacement misses = 11 misses total: 4 additional misses compared to the optimal policy, due to non-optimal replacement
LRU
• Hard to implement efficiently in hardware
• Software: LRU stack -- on each access move the referenced block to the top of the stack; the block at the bottom is the LRU block

Access sequence: A B C D E C E A D
After the first four accesses the stack (top to bottom) is D C B A. Continuing:

    Access         E      C      E      A      D
                   miss   hit    hit    miss   hit
    TOP            E      C      E      A      D
                   D      E      C      E      A
                   C      D      D      C      E
    LRU block      B      B      B      D      C

On every hit the ordering information must be read and rewritten, so a full LRU stack is not practical for a hardware-maintained cache.
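A minimal software sketch of the LRU stack (not from the slides; the array-based stack and names are illustrative only):

    #include <stdio.h>
    #include <string.h>

    #define WAYS 4

    /* stack[0] is the most recently used block, stack[used-1] is the LRU block. */
    static char stack[WAYS];
    static int used = 0;

    /* Access block b; returns 1 on a hit, 0 on a miss. */
    static int access_block(char b)
    {
        int pos = -1;
        for (int i = 0; i < used; i++)
            if (stack[i] == b) { pos = i; break; }

        int hit = (pos >= 0);
        if (!hit) {
            if (used < WAYS) used++;   /* room left: grow the stack */
            pos = used - 1;            /* otherwise overwrite the LRU (bottom) slot */
        }
        /* Shift everything above pos down one slot, then place b on top. */
        memmove(&stack[1], &stack[0], (size_t)pos * sizeof stack[0]);
        stack[0] = b;
        return hit;
    }

    int main(void)
    {
        const char *trace = "ABCDECEAD";
        int misses = 0;
        for (const char *p = trace; *p; p++)
            if (!access_block(*p)) misses++;
        printf("misses: %d, LRU block now: %c\n", misses, stack[used - 1]);
        return 0;
    }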
LRU (contd)
• Approximate (tree) LRU -- used in some Intel processors
  • Arrange the 8 ways A, B, C, D, E, F, G, H as the leaves of a binary tree; each of the 7 internal nodes holds one bit recording whether its Left or Right subtree was accessed last
  • On a hit, set the bits along the path to the accessed way
  • On a miss, follow the path of the subtree NOT accessed last at each node to select the victim
• Random selection
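A compact C sketch of tree pseudo-LRU victim selection for 8 ways (not from the slides; the heap-style bit layout and function names are assumptions made for illustration):

    #include <stdio.h>

    /* 7 internal nodes of a binary tree over 8 ways, stored heap-style:
     * node 0 is the root, the children of node i are 2*i+1 and 2*i+2.
     * bit = 0 means "left accessed last", bit = 1 means "right accessed last". */
    static unsigned char plru_bits[7];

    /* On a hit (or a fill) to 'way' (0..7): record the path taken to reach it. */
    static void plru_touch(int way)
    {
        int node = 0;
        for (int level = 2; level >= 0; level--) {
            int right = (way >> level) & 1;          /* branch taken at this node */
            plru_bits[node] = (unsigned char)right;  /* remember which side was used last */
            node = 2 * node + 1 + right;
        }
    }

    /* On a miss: at every node walk away from the side that was accessed last. */
    static int plru_victim(void)
    {
        int node = 0, way = 0;
        for (int level = 0; level < 3; level++) {
            int go_right = !plru_bits[node];
            way = (way << 1) | go_right;
            node = 2 * node + 1 + go_right;
        }
        return way;                                   /* 0..7, i.e., ways A..H */
    }

    int main(void)
    {
        for (int w = 0; w < 8; w += 2)
            plru_touch(w);                            /* touch ways A, C, E, G */
        printf("victim way: %d\n", plru_victim());    /* an untouched way is chosen */
        return 0;
    }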
Reducing Miss Rate

1. Larger cache size:
   + Reduces capacity misses
   - Hit time may increase
   - Cost increases

2. Increased associativity:
   + Miss rate decreases (fewer conflict misses)
   - Hit time increases and may increase the clock cycle time
   - Hardware cost increases
   Miss rate with 8-way set associative is comparable to fully associative (empirical finding)

   Example
   Direct mapped cache: Hit time 1 cycle, Miss penalty 25 cycles (low!), Miss rate = 0.08
   8-way set associative: Clock cycle 1.5x longer, Miss rate = 0.07
   Let T be the clock cycle of the direct mapped cache.
   AMAT (direct mapped) = (1 + 0.08 x 25) x T = 3.0T
   AMAT (set associative): new clock period = 1.5T
     Miss penalty = ceiling(25T / 1.5T) x 1.5T = ceiling(25/1.5) x 1.5T = 17 x 1.5T = 25.5T
     AMAT = 1.5T + 0.07 x 25.5T = (1.5 + 1.785)T = 3.285T
   (Increasing associativity hurts in this example!)
Reducing Miss Rate

3. Block size (B):
   Miss rate first decreases and then increases with increasing block size
   + a) Compulsory miss rate decreases due to better use of spatial locality
   - b) Capacity (and conflict) misses increase as the effective cache size (number of blocks) decreases
   Miss penalty increases with increasing block size
   - c) Wasted memory access time: the miss penalty increases without providing any gain
        Do (a) and (c) balance each other?
   + d) Amortized memory access time per byte decreases (burst-mode memory)
   + Tag overhead decreases
   Low latency, low bandwidth memory: favors a smaller block size
   High latency, high bandwidth memory: favors a larger block size
Reducing Miss Rate

Block size B (contd):
Low latency, low bandwidth memory: favors a smaller block size
High latency, high bandwidth memory: favors a larger block size

Example:
Case 1: Miss ratio of 5% with B = 8; Case 2: Miss ratio of 4% with B = 16.
Burst-mode memory: latency 8 cycles, transfer rate 2 bytes/cycle. Cache hit time 1 cycle.
AMAT = Hit time + Miss Rate x Miss Penalty
Case 1: AMAT = 1 + 5% x (8 + 8/2)  = 1.6 cycles
Case 2: AMAT = 1 + 4% x (8 + 16/2) = 1.64 cycles
Suppose the memory latency were 16 cycles instead: this favors the larger block size.
Case 1: AMAT = 1 + 5% x (16 + 8/2)  = 2.0 cycles
Case 2: AMAT = 1 + 4% x (16 + 16/2) = 1.96 cycles
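A small C sketch that reproduces this calculation (not from the slides; the function and parameter names are invented, and the miss rates are the example's numbers):

    #include <stdio.h>

    /* AMAT with a burst-mode memory: the miss penalty is the access latency
     * plus the time to stream in one block at the given transfer rate. */
    static double amat_burst(double hit_time, double miss_rate,
                             double latency, double bytes_per_cycle,
                             double block_bytes)
    {
        double miss_penalty = latency + block_bytes / bytes_per_cycle;
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        printf("latency 8,  B=8 : %.2f cycles\n", amat_burst(1, 0.05, 8, 2, 8));
        printf("latency 8,  B=16: %.2f cycles\n", amat_burst(1, 0.04, 8, 2, 16));
        printf("latency 16, B=8 : %.2f cycles\n", amat_burst(1, 0.05, 16, 2, 8));
        printf("latency 16, B=16: %.2f cycles\n", amat_burst(1, 0.04, 16, 2, 16));
        return 0;
    }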
Reducing Miss Rate

4. Pseudo-associative caches
   + Maintain the hit speed of a direct mapped cache
   + Reduce conflict misses

   Column (or pseudo) associative: on a miss, check one more location in the direct mapped cache
   Like having a fixed way prediction

   Way prediction: predict which block in the set will be read on the next access
   If the tag matches: 1-cycle hit
   If the prediction fails: do a complete selection on subsequent cycles
   + Potential power savings
   - Poor prediction increases hit time
Column (or pseudo) associative
On a miss: check one more location in the direct mapped cache
Like having a fixed way prediction

[Figure: direct mapped cache; a block whose index is 0xxxx has its alternate location at 1xxxx, i.e., the location with the high-order index bit flipped.]
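A rough C sketch of a column-associative lookup where the alternate location is found by flipping the high-order index bit, as in the figure (not from the slides; sizes, names, and the simplification of storing the whole block number per line are assumptions):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BITS 6                    /* assume 64-byte blocks */
    #define INDEX_BITS 10
    #define NLINES (1u << INDEX_BITS)

    struct line { bool valid; uint32_t blkno; /* data omitted */ };
    static struct line cache[NLINES];

    /* Probe the primary (direct mapped) location; on a miss there, probe the
     * alternate location obtained by flipping the high-order index bit.
     * A real design would store a tag plus a one-bit "rehash" flag instead of
     * the full block number. */
    static bool lookup(uint32_t addr)
    {
        uint32_t blkno = addr >> BLOCK_BITS;
        uint32_t index = blkno & (NLINES - 1);

        if (cache[index].valid && cache[index].blkno == blkno)
            return true;                    /* fast hit at direct mapped speed */

        uint32_t alt = index ^ (1u << (INDEX_BITS - 1));
        if (cache[alt].valid && cache[alt].blkno == blkno)
            return true;                    /* slower hit in the alternate column */

        return false;                       /* miss: fill/swap policy omitted */
    }

    int main(void)
    {
        cache[3].valid = true; cache[3].blkno = 3;   /* pretend block 3 is resident */
        printf("hit: %d\n", lookup(3u << BLOCK_BITS));
        return 0;
    }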
Way Prediction
Predict which block in the set will be read on the next access.
If the tag matches: 1-cycle hit
If the prediction fails: do a complete selection on subsequent cycles

[Figure: 2-way set associative cache; the predicted way of the indexed set is read first.]
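A brief C sketch of way prediction for a 2-way set associative cache (not from the slides; the per-set prediction bit and the simple cycle counts are assumptions for illustration):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BITS 6
    #define SET_BITS   9
    #define NSETS (1u << SET_BITS)

    struct way { bool valid; uint32_t tag; };
    static struct way sets[NSETS][2];
    static uint8_t predicted_way[NSETS];        /* one prediction bit per set */

    /* Returns the cycles taken in this simplified model: 1 if the predicted way
     * hits, 2 if the other way hits (and the predictor is retrained). */
    static int access_cycles(uint32_t addr, bool *miss)
    {
        uint32_t set = (addr >> BLOCK_BITS) & (NSETS - 1);
        uint32_t tag = addr >> (BLOCK_BITS + SET_BITS);
        int p = predicted_way[set];

        *miss = false;
        if (sets[set][p].valid && sets[set][p].tag == tag)
            return 1;                            /* predicted way hit: 1-cycle hit */

        if (sets[set][1 - p].valid && sets[set][1 - p].tag == tag) {
            predicted_way[set] = (uint8_t)(1 - p);   /* retrain the predictor */
            return 2;                            /* hit in the other way costs extra */
        }

        *miss = true;                            /* miss: fill and update omitted */
        return 2;
    }

    int main(void)
    {
        bool miss;
        sets[0][1].valid = true; sets[0][1].tag = 0; /* block resident in way 1 of set 0 */
        printf("first access:  %d cycles\n", access_cycles(0, &miss));
        printf("second access: %d cycles\n", access_cycles(0, &miss));
        return 0;
    }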
Reducing Miss Rate

5. Compiler Optimizations

Instruction access:
• Rearrange code (procedure and code-block placement) to reduce conflict misses
• Align the entry point of a basic block with the start of a cache block

Data access: improve spatial/temporal locality in arrays

a) Merging arrays: replace parallel arrays with an array of structs (spatial locality)

   Parallel arrays:
   update(j): { *name[j] = ...; id[j] = ...; age[j] = ...; salary[j] = ...; }

   Array of structs:
   update(j): { *(person[j].name) = ...; person[j].id = ...; person[j].age = ...; person[j].salary = ...; }

   When might separate arrays be better?

b) Loop fusion: combine loops that use the same data (temporal locality)

   Separate loops:
   for (j = 0; j < n; j++)
       x[j] = y[2 * j];
   for (j = 0; j < n; j++)
       sum += x[j];

   Fused loop:
   for (j = 0; j < n; j++) {
       x[j] = y[2 * j];
       sum += x[j];
   }

   When might separate loops be better?
Reducing Miss Rate

Compiler Optimizations (contd ...)
• Data access: improve spatial/temporal locality in arrays

c) Loop interchange: convert column-major matrix access to row-major access (spatial locality)

   [Figure: an m x n array a, stored row-major in memory]

   Array element size w bytes, block size B bytes, so B/w elements per block.
   Assuming row-major storage in memory:

   Column-order access -- could miss on each access of a[ ][ ]:
   for (j = 0; j < n; j++)
       for (k = 0; k < m; k++)
           a[k][j] = 0;
   Misses: mn

   Row-order access -- only compulsory misses, 1 per block:
   for (k = 0; k < m; k++)
       for (j = 0; j < n; j++)
           a[k][j] = 0;
   Misses: mn / (B/w) = mnw/B
Reducing Miss Rate

Compiler/Programmer Optimizations (contd ...)

d) Blocking: use block-oriented access to maximize both temporal and spatial locality

   Cache-insensitive matrix multiplication: O(n^3) cache misses for accessing the elements of matrix b
   for (i = 0; i < n; i++)
       for (j = 0; j < n; j++)
           for (k = 0; k < n; k++)
               c[i][j] += a[i][k] * b[k][j];

   [Figure: access patterns through matrices a and b]
Reducing Miss Rate

Compiler/Programmer Optimizations (contd ...)

d) Blocking (contd): use block-oriented access to maximize both temporal and spatial locality

   Blocked version of the multiplication above, which suffered O(n^3) cache misses for the elements of matrix b.
   View each matrix as an (n/s) x (n/s) grid of s x s sub-blocks and multiply block by block:
   for (i = 0; i < n/s; i++)
       for (j = 0; j < n/s; j++)
           for (k = 0; k < n/s; k++)
               C[i][j] = C[i][j] + A[i][k] * B[k][j];   /* operations on s x s sub-blocks */
   Here each "+" is an s x s matrix addition and each "*" is a block matrix multiplication of
   sub-block A[i][k] with sub-block B[k][j], producing one update of sub-block C[i][j].
   A sketch of this scheme with scalar loops follows.

   [Figure: matrices a and b partitioned into s x s sub-blocks, e.g. A[0][0] and B[0][0]]
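A runnable C sketch of the blocked multiplication with scalar inner loops (not from the slides; N, the block size S, and the initialization are illustrative, and S would normally be chosen so that about three S x S sub-blocks fit in the cache):

    #include <stdio.h>

    #define N 8          /* matrix dimension (illustrative) */
    #define S 4          /* block size: ~3 SxS sub-blocks should fit in cache */

    static double a[N][N], b[N][N], c[N][N];

    /* Blocked matrix multiply: the outer three loops walk over SxS sub-blocks,
     * the inner three loops do the sub-block multiply-accumulate, so each
     * sub-block of b is reused while it is still resident in the cache. */
    static void blocked_matmul(void)
    {
        for (int ib = 0; ib < N; ib += S)
            for (int jb = 0; jb < N; jb += S)
                for (int kb = 0; kb < N; kb += S)
                    for (int i = ib; i < ib + S; i++)
                        for (int k = kb; k < kb + S; k++) {
                            double aik = a[i][k];
                            for (int j = jb; j < jb + S; j++)
                                c[i][j] += aik * b[k][j];
                        }
    }

    int main(void)
    {
        /* Easy-to-check initialization: a = identity, b[i][j] = i + j, so c = b. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = (i == j) ? 1.0 : 0.0;
                b[i][j] = (double)(i + j);
            }
        blocked_matmul();
        printf("c[2][3] = %.1f (expected 5.0)\n", c[2][3]);
        return 0;
    }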