Memory Hierarchy Design
Chapter 5 and Appendix C

Overview
• Problem
  – CPU vs. memory performance imbalance
• Solution
  – Driven by temporal and spatial locality
  – Memory hierarchies
    • Fast L1, L2, L3 caches
    • Larger but slower main memories
    • Even larger but even slower secondary storage
  – Keep most of the action in the higher levels

Locality of Reference
• Temporal and spatial
• Sequential access to memory
• Unit-stride loop (cache lines = 256 bits)
    for (i = 1; i < 100000; i++)
        sum = sum + a[i];
• Non-unit-stride loop (cache lines = 256 bits)
    for (i = 0; i <= 100000; i = i + 8)
        sum = sum + a[i];

Cache Systems
[Figure: a 400-MHz CPU connected over a 66-MHz bus to 10-MHz main memory, shown without and with a cache; data objects transfer between the CPU and the cache, while blocks transfer between the cache and main memory.]

Example: Two-level Hierarchy
[Figure: average access time versus hit ratio, falling from T1 + T2 at a hit ratio of 0 to T1 at a hit ratio of 1.]

Basic Cache Read Operation
• CPU requests the contents of a memory location
• Check the cache for this data
• If present, get it from the cache (fast)
• If not present, read the required block from main memory into the cache
• Then deliver it from the cache to the CPU
• Cache includes tags to identify which block of main memory is in each cache slot
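The read sequence above can be made concrete with a small sketch. The C fragment below is not from the slides; the line count, block size, and the stand-in read_block_from_memory routine are illustrative assumptions. It shows a direct-mapped lookup: the index selects a slot, the stored tag identifies which main-memory block occupies it, and a miss refills the slot before the byte is delivered.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES  64          /* number of cache slots (assumed)       */
    #define BLOCK_SIZE 32          /* bytes per block (assumed)             */

    typedef struct {
        int      valid;            /* does this slot hold a block at all?   */
        uint32_t tag;              /* identifies which memory block is held */
        uint8_t  data[BLOCK_SIZE];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    /* Stand-in for the slow main-memory access; fabricates bytes so the
       sketch is self-contained. */
    static void read_block_from_memory(uint32_t block_addr, uint8_t *dst)
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            dst[i] = (uint8_t)(block_addr + i);
    }

    /* Basic cache read: check the tag, fill the slot on a miss, then
       deliver the requested byte from the cache. */
    static uint8_t cache_read_byte(uint32_t addr)
    {
        uint32_t offset = addr % BLOCK_SIZE;      /* byte within the block     */
        uint32_t block  = addr / BLOCK_SIZE;      /* block address             */
        uint32_t index  = block % NUM_LINES;      /* (block address) MOD lines */
        uint32_t tag    = block / NUM_LINES;      /* remaining high-order bits */

        cache_line_t *line = &cache[index];
        if (!line->valid || line->tag != tag) {   /* miss: fill from memory    */
            read_block_from_memory(block, line->data);
            line->tag   = tag;
            line->valid = 1;
        }
        return line->data[offset];                /* deliver from cache to CPU */
    }

    int main(void)
    {
        printf("byte at 0x1234 = %u\n", cache_read_byte(0x1234));
        printf("byte at 0x1234 = %u (now a hit)\n", cache_read_byte(0x1234));
        return 0;
    }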
Number of Caches
• Increased logic density => on-chip cache
  – Internal cache: level 1 (L1)
  – External cache: level 2 (L2)
• Unified cache
  – Balances the load between instruction and data fetches
  – Only one cache needs to be designed / implemented
• Split caches (data and instruction)
  – Suits pipelined, parallel architectures

Cache Size
• Cache size << main memory size
• Small enough
  – Minimize cost
  – Speed up access (fewer gates to address the cache)
  – Keep the cache on chip
• Large enough
  – Minimize average access time
• Optimum size depends on the workload
• Practical size?

Line Size
• Optimum size depends on the workload
• Small blocks do not exploit the locality-of-reference principle
• Larger blocks reduce the number of blocks
  – Replacement overhead
• Practical sizes?

Elements of Cache Design
• Cache size
• Line (block) size
• Number of caches
• Mapping function
  – Block placement
  – Block identification
• Replacement algorithm
• Write policy

Mapping Function
• Cache lines << main memory blocks
• Direct mapping
  – Maps each block into only one possible line
  – (block address) MOD (number of lines)
• Fully associative
  – Block can be placed anywhere in the cache
• Set associative
  – Block can be placed in a restricted set of lines
  – (block address) MOD (number of sets in cache)

Cache Addressing
• Block address (tag + index) and block offset
  – Block offset – selects the data object from the block
  – Index – selects the block set
  – Tag – used to detect a hit
[Figure: a main-memory address split into tag, index, and block offset fields, used to look up the cache.]
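To make the tag / index / block-offset split on the Cache Addressing slide concrete, here is a minimal C sketch; the block size, set count, and example address are assumptions, not values from the slides.

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BYTES 32    /* bytes per block  -> 5 offset bits (assumed) */
    #define NUM_SETS    128   /* sets in the cache -> 7 index bits (assumed) */

    int main(void)
    {
        uint32_t addr   = 0x0001F4A7u;                      /* arbitrary example address            */
        uint32_t offset = addr % BLOCK_BYTES;               /* selects the data object in the block */
        uint32_t index  = (addr / BLOCK_BYTES) % NUM_SETS;  /* (block address) MOD (number of sets) */
        uint32_t tag    = (addr / BLOCK_BYTES) / NUM_SETS;  /* compared to stored tags for a hit    */

        printf("addr=0x%08X -> tag=0x%X index=%u offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }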
Direct Mapping
[Figure: direct-mapped cache organization.]

Associative Mapping
[Figure: fully associative cache organization.]

K-Way Set Associative Mapping
[Figure: k-way set-associative cache organization.]

Replacement Algorithm
• Simple for direct-mapped: no choice
• Random
  – Simple to build in hardware
• LRU

  Miss rates, LRU vs. random replacement:
  Associativity:  Two-way           Four-way          Eight-way
  Size            LRU      Random   LRU      Random   LRU      Random
  16 KB           5.18%    5.69%    4.67%    5.29%    4.39%    4.96%
  64 KB           1.88%    2.01%    1.54%    1.66%    1.39%    1.53%
  256 KB          1.15%    1.17%    1.13%    1.13%    1.12%    1.12%

Write Policy
• Write is more complex than read
  – Write and tag comparison cannot proceed simultaneously
  – Only a portion of the line has to be updated
• Write policies
  – Write through – write to the cache and to memory
  – Write back – write only to the cache (dirty bit)
• Write miss
  – Write allocate – load the block on a write miss
  – No-write allocate – update directly in memory

Alpha AXP 21064 Cache
[Figure: 21064 data cache; the address is split into a 21-bit tag, an 8-bit index, and a 5-bit block offset; each line holds a valid bit, a tag, and 256 bits of data; the tag comparator (=?) signals a hit, and a write buffer sits between the cache and the lower-level memory.]
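One way to picture the LRU policy compared in the Replacement Algorithm table is a sketch of victim selection within a single set. The 4-way configuration and the per-way timestamp are illustrative assumptions; real hardware typically uses cheaper approximations of LRU.

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4                          /* 4-way set associative (assumed) */

    typedef struct {
        int      valid;
        uint32_t tag;
        uint64_t last_used;                 /* "time" of the most recent access */
    } way_t;

    /* Pick the way to evict: an empty way if one exists, otherwise the way
       whose last access is oldest (least recently used). */
    static int choose_lru_victim(const way_t set[WAYS])
    {
        int victim = 0;
        for (int w = 1; w < WAYS; w++) {
            if (!set[w].valid)
                return w;                   /* free slot: no replacement needed */
            if (set[w].last_used < set[victim].last_used)
                victim = w;                 /* older access -> better victim    */
        }
        return victim;
    }

    int main(void)
    {
        way_t set[WAYS] = {
            { 1, 0x10, 7 }, { 1, 0x22, 3 }, { 1, 0x31, 9 }, { 1, 0x4C, 5 },
        };
        printf("evict way %d\n", choose_lru_victim(set));   /* way 1 (time 3) */
        return 0;
    }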
Write Merging
[Figure: a write buffer without and with write merging; sequential writes to addresses 100, 104, 108, and 112 occupy four separate entries when merging is not used, but collapse into a single entry with all four valid bits set when it is.]

DECstation 5000 Miss Rates
[Figure: miss rates (0–30%) of instruction, data, and unified caches for cache sizes from 1 KB to 128 KB.]
• Direct-mapped caches with 32-byte blocks
• Percentage of instruction references is 75%

Cache Performance Measures
• Hit rate: fraction of accesses found in that level
  – Usually so high that we talk about the miss rate instead
  – Miss rate fallacy: the miss rate can be as misleading for memory behavior as MIPS is for CPU performance
• Average memory-access time
  = Hit time + Miss rate x Miss penalty (ns)
• Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
  – Access time to the lower level = f(latency to lower level)
  – Transfer time: time to transfer the block = f(bandwidth)

Cache Performance Improvements
• Average memory-access time
  = Hit time + Miss rate x Miss penalty
• Cache optimizations
  – Reducing the miss rate
  – Reducing the miss penalty
  – Reducing the hit time

Example
Which has the lower average memory access time:
  a 16-KB instruction cache with a 16-KB data cache, or
  a 32-KB unified cache?
  Hit time = 1 cycle
  Miss penalty = 50 cycles
  Load/store hit = 2 cycles on a unified cache
  Given: 75% of memory accesses are instruction references.
Overall miss rate for split caches = 0.75 * 0.64% + 0.25 * 6.47% = 2.10%
Miss rate for unified cache = 1.99%
Average memory access times:
  Split   = 0.75 * (1 + 0.0064 * 50) + 0.25 * (1 + 0.0647 * 50) = 2.05
  Unified = 0.75 * (1 + 0.0199 * 50) + 0.25 * (2 + 0.0199 * 50) = 2.24

Cache Performance Equations
CPU time = (CPU execution cycles + Mem stall cycles) * Cycle time
Mem stall cycles = Mem accesses * Miss rate * Miss penalty
CPU time = IC * (CPI_execution + Mem accesses per instr * Miss rate * Miss penalty) * Cycle time
Misses per instr = Mem accesses per instr * Miss rate
CPU time = IC * (CPI_execution + Misses per instr * Miss penalty) * Cycle time
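The arithmetic in the split-vs-unified example can be checked with a few lines of C. All constants below come from the example slide; only the program structure is added.

    #include <stdio.h>

    int main(void)
    {
        /* Constants taken from the example slide. */
        const double miss_penalty   = 50.0;   /* cycles                 */
        const double instr_fraction = 0.75;   /* instruction references */
        const double data_fraction  = 0.25;   /* data references        */

        /* Split: 16-KB I-cache (0.64% misses) + 16-KB D-cache (6.47% misses),
           1-cycle hit time for both. */
        double split = instr_fraction * (1.0 + 0.0064 * miss_penalty)
                     + data_fraction  * (1.0 + 0.0647 * miss_penalty);

        /* Unified 32-KB cache: 1.99% miss rate, but a load/store hit costs
           2 cycles because it competes with the instruction fetch. */
        double unified = instr_fraction * (1.0 + 0.0199 * miss_penalty)
                       + data_fraction  * (2.0 + 0.0199 * miss_penalty);

        printf("split   = %.2f cycles per access\n", split);    /* ~2.05 */
        printf("unified = %.2f cycles per access\n", unified);  /* ~2.24 */
        return 0;
    }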
Reducing Miss Penalty
• Multi-level caches
• Critical word first and early restart
• Priority to read misses over writes
• Merging write buffers
• Victim caches

Multi-Level Caches
• Avg mem access time = Hit time (L1) + Miss rate (L1) x Miss penalty (L1)
• Miss penalty (L1) = Hit time (L2) + Miss rate (L2) x Miss penalty (L2)
• Avg mem access time = Hit time (L1) + Miss rate (L1) x (Hit time (L2) + Miss rate (L2) x Miss penalty (L2))
• Local miss rate: number of misses in a cache divided by the total number of accesses to that cache
• Global miss rate: number of misses in a cache divided by the total number of memory accesses generated by the CPU

Performance of Multi-Level Caches
[Figure: performance of multi-level caches.]

Critical Word First and Early Restart
• Critical word first: request the missed word first from memory
• Early restart: fetch in normal order, but as soon as the requested word arrives, send it to the CPU

Giving Priority to Read Misses over Writes
    SW R3, 512(R0)
    LW R1, 1024(R0)
    LW R2, 512(R0)
• Direct-mapped, write-through cache mapping 512 and 1024 to the same block, and a four-word write buffer
• Will R2 = R3?
• Priority for the read miss?

Victim Caches
[Figure: victim cache organization.]
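A small sketch of the two-level formulas from the Multi-Level Caches slide; the L1/L2 hit times, miss rates, and main-memory penalty below are invented purely for illustration.

    #include <stdio.h>

    int main(void)
    {
        /* All numbers below are assumptions, not from the slides. */
        double hit_l1       = 1.0;     /* cycles                */
        double miss_rate_l1 = 0.04;
        double hit_l2       = 10.0;    /* cycles                */
        double miss_rate_l2 = 0.25;    /* local L2 miss rate    */
        double penalty_l2   = 100.0;   /* cycles to main memory */

        /* Miss penalty (L1) = Hit time (L2) + Miss rate (L2) x Miss penalty (L2) */
        double penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;

        /* Avg mem access time = Hit time (L1) + Miss rate (L1) x Miss penalty (L1) */
        double amat = hit_l1 + miss_rate_l1 * penalty_l1;

        /* Global L2 miss rate = Miss rate (L1) x local Miss rate (L2) */
        double global_l2 = miss_rate_l1 * miss_rate_l2;

        printf("AMAT = %.2f cycles, global L2 miss rate = %.3f\n", amat, global_l2);
        return 0;
    }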
Types of Cache Misses
• Compulsory
  – First-reference or cold-start misses
• Capacity
  – Working set is too big for the cache
  – Occur even in fully associative caches
• Conflict (collision)
  – Many blocks map to the same block frame (line)
  – Affects
    • Set associative caches
    • Direct mapped caches

Miss Rates: Absolute and Distribution
[Figure: absolute miss rates and their distribution among compulsory, capacity, and conflict misses.]

Reducing the Miss Rates
1. Larger block size
2. Larger caches
3. Higher associativity
4. Pseudo-associative caches
5. Compiler optimizations

Reducing Miss Rates: 1. Larger Block Size
• Effects of larger block sizes
  – Reduction of compulsory misses
    • Spatial locality
  – Increase of miss penalty (transfer time)
  – Reduction of the number of blocks
    • Potential increase of conflict misses
• Latency and bandwidth of lower-level memory
  – High latency and bandwidth => large block size
    • Small increase in miss penalty

Example
[Figure: worked example.]

2. Larger Caches
• More blocks
• Higher probability of finding the data
• Longer hit time and higher cost
• Primarily used in 2nd-level caches
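As an illustration of item 5 in the list above (compiler optimizations), the sketch below shows loop interchange; the array size is an assumption. Swapping the loop order turns a large-stride traversal into a unit-stride one, so every word of a fetched block is used before the block is evicted, which reduces the miss rate without any hardware change.

    #include <stdio.h>

    #define N 1024
    static double x[N][N];             /* size chosen only for illustration */

    /* Column-major traversal: consecutive accesses are N doubles apart, so
       each access may touch a different cache block (poor spatial locality). */
    static double sum_columns_first(void)
    {
        double total = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                total += x[i][j];
        return total;
    }

    /* After loop interchange: row-major traversal over consecutive addresses,
       exploiting the spatial locality within each cache block. */
    static double sum_rows_first(void)
    {
        double total = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                total += x[i][j];
        return total;
    }

    int main(void)
    {
        printf("%.1f %.1f (same result, different miss rates)\n",
               sum_columns_first(), sum_rows_first());
        return 0;
    }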