CS252 Graduate Computer Architecture
Lecture 4: Cache Design
January 31, 2002
Prof. David Culler
CS252/Culler 1/31/02

Who Cares About the Memory Hierarchy?
[Figure: Processor-Memory Performance Gap. Performance (log scale, 1 to 1000) vs. year (1980 to 2000). CPU (µProc) improves 60%/yr ("Moore's Law"); DRAM improves 7%/yr ("Less' Law?"); the gap grows 50%/year.]
• 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)

Generations of Microprocessors
• Time of a full cache miss in instructions executed:
    1st Alpha: 340 ns / 5.0 ns =  68 clks x 2 or 136
    2nd Alpha: 266 ns / 3.3 ns =  80 clks x 4 or 320
    3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6 or 648
• 1/2X latency x 3X clock rate x 3X Instr/clock ⇒ ≈5X

Processor-Memory Performance Gap "Tax"
    Processor          % Area (≈ cost)   % Transistors (≈ power)
    Alpha 21164        37%               77%
    StrongArm SA110    61%               94%
    Pentium Pro        64%               88%
      – 2 dies per package: Proc/I$/D$ + L2$
• Caches have no "inherent value"; they only try to close the performance gap

What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: "a cache" on variables (software managed)
  – First-level cache: a cache on second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on page table
  – Branch prediction: a cache on prediction information?
[Figure: hierarchy from top to bottom: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk, Tape, etc. Levels get bigger going down and faster going up.]

Traditional Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
  – Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
  – Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
  – Random, LRU
• Q4: What happens on a write? (Write strategy)
  – Write Back or Write Through (with Write Buffer)
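The miss-cost arithmetic on the "Generations of Microprocessors" slide can be sketched in a few lines of Python. This is only an illustration: the clock counts and issue widths are copied from the slide as given, while the function name and the tuple layout are assumptions made for this sketch.

```python
# Cost of one full cache miss, measured in instruction issue slots lost.
# Clock counts (68, 80, 108) and issue widths (2, 4, 6) are taken directly
# from the slide; nothing here is measured.

def miss_cost_in_instructions(miss_clocks, issue_width):
    """Instructions that could have issued while the miss was outstanding."""
    return miss_clocks * issue_width

# (name, clocks per full miss, instructions issued per clock)
generations = [
    ("1st Alpha", 68, 2),
    ("2nd Alpha", 80, 4),
    ("3rd Alpha", 108, 6),
]

costs = {name: miss_cost_in_instructions(clks, width)
         for name, clks, width in generations}

# Growth in miss cost from the first to the third generation.
growth = costs["3rd Alpha"] / costs["1st Alpha"]

for name, cost in costs.items():
    print(f"{name}: {cost} instruction slots per miss")
print(f"growth across generations: {growth:.2f}x")
```

The ratio between the third and first generation is what the slide rounds to roughly 5X: memory latency halves, but clock rate and issue width each triple.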
Review: Cache performance
• Miss-oriented Approach to Memory Access:

$$\text{CPUtime} = \text{IC} \times \left( \text{CPI}_{\text{Execution}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty} \right) \times \text{CycleTime}$$

$$\text{CPUtime} = \text{IC} \times \left( \text{CPI}_{\text{Execution}} + \frac{\text{MemMisses}}{\text{Inst}} \times \text{MissPenalty} \right) \times \text{CycleTime}$$

  – CPI_Execution includes ALU and Memory instructions
• Separating out Memory component entirely
  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions

$$\text{CPUtime} = \text{IC} \times \left( \frac{\text{AluOps}}{\text{Inst}} \times \text{CPI}_{\text{AluOps}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{AMAT} \right) \times \text{CycleTime}$$

$$\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$

$$\phantom{\text{AMAT}} = \left( \text{HitTime}_{\text{Inst}} + \text{MissRate}_{\text{Inst}} \times \text{MissPenalty}_{\text{Inst}} \right) + \left( \text{HitTime}_{\text{Data}} + \text{MissRate}_{\text{Data}} \times \text{MissPenalty}_{\text{Data}} \right)$$

  (Each parenthesized term is weighted by its share of all memory accesses, as in the 75%/25% split of the Unified vs Split example.)

What are all the aspects of cache organization that impact performance?

Impact on Performance
• Suppose a processor executes at
  – Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a 50-cycle miss penalty
• Suppose that 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
    1.1 (cycles/ins)
    + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
    + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
    = (1.1 + 1.5 + 0.5) cycle/ins = 3.1
• About 65% of the time (2.0 of the 3.1 cycles per instruction) the proc is stalled waiting for memory!
• AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54

Unified vs Split Caches
• Unified vs Separate I&D
[Figure: two organizations. Split: Proc feeding I-Cache-1 and D-Cache-1, both backed by Unified Cache-2. Unified: Proc feeding Unified Cache-1, backed by Unified Cache-2.]
• Example:
  – 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47%
  – 32KB unified: Aggregate miss rate = 1.99%
• Which is better (ignore L2 cache)?
  – Assume 33% data ops ⇒ 75% of accesses from instructions (1.0/1.33)
  – hit time = 1, miss time = 50
  – Note that a data hit has 1 stall for the unified cache (only one port)
    AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
    AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24

Where do misses come from?
• Classifying Misses: 3 Cs
  – Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
  – Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
  – Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)
• 4th "C":
  – Coherence: Misses caused by cache coherence.

How to Improve Cache Performance?

$$\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
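The CPI and AMAT examples above are plain arithmetic, so they can be checked with a short script. The sketch below uses only the numbers given on the "Impact on Performance" and "Unified vs Split Caches" slides; the function and variable names are assumptions chosen for this illustration.

```python
# Worked versions of the CPI and AMAT examples from the slides.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: HitTime + MissRate x MissPenalty."""
    return hit_time + miss_rate * miss_penalty

# Impact on Performance: CPI = ideal CPI + average stalls per instruction.
ideal_cpi = 1.1
data_stalls = 0.30 * 0.10 * 50   # ld/st per instr x data miss rate x penalty
inst_stalls = 1.00 * 0.01 * 50   # fetches per instr x inst miss rate x penalty
cpi = ideal_cpi + data_stalls + inst_stalls

# Fraction of execution time spent stalled on memory.
stall_fraction = (data_stalls + inst_stalls) / cpi

# Unified vs split: 75% of accesses are instruction fetches, 25% are data.
inst_frac, data_frac = 0.75, 0.25
amat_split = (inst_frac * amat(1, 0.0064, 50)
              + data_frac * amat(1, 0.0647, 50))
# In the unified cache a data access pays one extra stall cycle (single port).
amat_unified = (inst_frac * amat(1, 0.0199, 50)
                + data_frac * amat(1 + 1, 0.0199, 50))

print(f"CPI = {cpi:.2f}, stalled fraction = {stall_fraction:.1%}")
print(f"AMAT split = {amat_split:.3f}, AMAT unified = {amat_unified:.3f}")
```

Under these numbers the split organization comes out ahead even though the unified cache has the lower aggregate miss rate, because the single port adds a stall cycle to every data access.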
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate (0 to 0.14) vs. Cache Size (KB), 1 to 128, with curves for 1-way, 2-way, 4-way, and 8-way associativity. Regions are labeled Conflict, Capacity, and Compulsory.]

Cache Size
[Figure: the same 3Cs plot of miss rate vs. Cache Size (KB), 1 to 128.]
• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce?

Cache Organization?
• Assume total cache size not changed:
• What happens if:
  1) Change Block Size:
  2) Change Associativity:
  3) Change Compiler:
Which of 3Cs is obviously affected?

Huge Caches => Working Sets
[Figure: data traffic vs. replication capacity (cache size), with knees at the first and second working sets. Traffic components: cold-start (compulsory) traffic, inherent communication, other capacity-independent communication, and capacity-generated traffic (including conflicts).]
[Figure: Example LU Decomposition from NAS Parallel Benchmarks. Miss Rate (%), 0 to 14, vs. Per Processor Cache Size (KB), 1 to 4096, with curves for 4-node, 8-node, 16-node, and 32-node systems.]

Larger Block Size (fixed size & assoc)
[Figure: Miss Rate (0% to 25%) vs. Block Size (bytes), 16 to 256, with curves for cache sizes 1K, 4K, 16K, 64K, and 256K. Annotations: "Reduced compulsory misses" and "Increased Conflict Misses".]
What else drives up block size?

Associativity
[Figure: the 3Cs plot of miss rate (0 to 0.14) vs. Cache Size (KB), 1 to 128, with curves for 1-way, 2-way, 4-way, and 8-way associativity. Regions are labeled Conflict, Capacity, and Compulsory.]
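The questions on the "Cache Organization?" slide (what happens when block size or associativity changes at fixed total size) can be explored with a toy simulator. The following is a minimal sketch, not a model of any real cache: the function name `miss_rate`, the LRU policy, and the two synthetic address traces are all assumptions made for illustration.

```python
from collections import OrderedDict

def miss_rate(addresses, cache_bytes, block_bytes, ways):
    """Miss rate of a set-associative cache with LRU replacement.

    Total size is fixed by cache_bytes; raising `ways` or `block_bytes`
    lowers the number of sets accordingly.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    # One OrderedDict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_sets      # block placement
        tag = block // num_sets       # block identification
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)    # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict least recently used
            lines[tag] = True
    return misses / len(addresses)

# Trace 1: two blocks 1 KB apart, accessed alternately. In a 1 KB
# direct-mapped cache they share a set and evict each other every time.
ping_pong = [0, 1024] * 50
direct_mapped = miss_rate(ping_pong, 1024, 16, ways=1)
two_way = miss_rate(ping_pong, 1024, 16, ways=2)

# Trace 2: one sequential pass over 4 KB in 4-byte words. Every miss is
# compulsory, so larger blocks mean fewer misses.
sequential = list(range(0, 4096, 4))
small_blocks = miss_rate(sequential, 1024, 16, ways=1)
large_blocks = miss_rate(sequential, 1024, 64, ways=1)

print(direct_mapped, two_way, small_blocks, large_blocks)
```

In the first trace only the two first-touch misses are compulsory; the remaining misses in the direct-mapped case are conflict misses, which is why 2-way associativity removes them. The second trace has no reuse at all, so it isolates the compulsory component that larger blocks reduce.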