EECS 252 Graduate Computer Architecture
Lec 12 - Caches

David Culler
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~culler
http://www-inst.eecs.berkeley.edu/~cs252

1/28/2004 CS252-S05 L12 Caches

Review: Who Cares About the Memory Hierarchy?
• Processor only thus far in course:
  – CPU cost/performance, ISA, pipelined execution
• [Figure: processor vs. DRAM performance, 1980-2000. µProc performance improves 60%/yr ("Moore's Law") while DRAM improves only 7%/yr ("Less' Law?"); the processor-memory performance gap grows 50%/yr.]
• 1980: no cache in µproc; 1995: 2-level cache on chip
  (1989: first Intel µproc with a cache on chip)

Review: What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality.
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on the second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on the page table
  – Branch prediction: a cache on prediction information?
• The hierarchy: Proc/Regs, L1-Cache, L2-Cache, Memory, then Disk, Tape, etc.; upper levels are faster, lower levels are bigger.

Review: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data must be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)
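The terminology above fixes the basic accounting for a memory hierarchy. As a minimal numeric sketch (the hit rate, hit time, and miss penalty here are made-up illustrative values, not from the slides):

```python
# Average access time from the definitions above:
#   avg = HitTime + MissRate * MissPenalty,  with MissRate = 1 - HitRate

def avg_access_time(hit_time, hit_rate, miss_penalty):
    miss_rate = 1.0 - hit_rate            # Miss Rate = 1 - (Hit Rate)
    return hit_time + miss_rate * miss_penalty

# 1-cycle hits, 98% hit rate, 50-cycle miss penalty:
print(round(avg_access_time(1, 0.98, 50), 3))   # -> 2.0 cycles on average
```

Even a 2% miss rate doubles the average access time here, which is why Hit Time << Miss Penalty makes the hit rate so critical.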
Why it works
• Exploit the statistical properties of programs: P(access, t)
• Locality of reference
  – Temporal
  – Spatial
• Average Memory Access Time:
  AMAT = HitTime + MissRate × MissPenalty
       = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst)
       + (HitTime_Data + MissRate_Data × MissPenalty_Data)
• A cache is a simple hardware structure that observes program behavior and reacts to improve future performance.
• Is the cache visible in the ISA?

Block Placement
• Q1: Where can a block be placed in the upper level?
  – Fully Associative
  – Set Associative
  – Direct Mapped
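The three placement policies in Q1 differ only in how many frames a given block may occupy. A small sketch, assuming a hypothetical 8-frame cache (the frame count and block address are illustrative, not from the slides):

```python
# Q1 sketch: which cache frames may hold a given block under each policy.
# ways=1 is direct mapped, ways=NUM_FRAMES is fully associative,
# anything in between is set associative.

NUM_FRAMES = 8

def candidate_frames(block_addr, ways):
    num_sets = NUM_FRAMES // ways
    set_index = block_addr % num_sets     # low bits of the block address
    return [set_index * ways + w for w in range(ways)]

print(candidate_frames(13, 1))            # direct mapped: [5], one choice
print(candidate_frames(13, 2))            # 2-way set associative: [2, 3]
print(candidate_frames(13, 8))            # fully associative: any of 0..7
```

Direct mapped is the degenerate 1-way case and fully associative the single-set case, which is why the three policies can share one lookup structure parameterized by associativity.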
1 KB Direct Mapped Cache, 32B blocks
• For a 2^N byte cache:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2^M)
  – How big is the tag?
• Address layout for this example: bits 31-10 are the Cache Tag (example: 0x50), bits 9-5 are the Cache Index (ex: 0x01), bits 4-0 are the Byte Select (ex: 0x00)
• The Cache Tag and a Valid Bit are stored as part of the cache "state"; the data array holds 32 blocks of 32 bytes each (Cache Block 0 holds Byte 0 ... Byte 31, block 1 holds Byte 32 ... Byte 63, and so on up to Byte 1023)

Review: Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct-mapped caches operate in parallel
  – How big is the tag?
• Example: two-way set associative cache
  – The Cache Index selects a "set" from the cache
  – The two tags in the set are compared to the incoming address tag in parallel
  – Data is selected based on the tag comparison result (the two compare outputs drive the Sel1/Sel0 mux; their OR signals a Hit and the selected Cache Block is returned)

Q2: How is a block found if it is in the upper level?
• The block address divides into Tag | Index, followed by the Block Offset
• The Index identifies the set of possibilities
• Tag on each block
  – No need to check index or block offset
• Increasing associativity shrinks the index and expands the tag
• Cache size = Associativity × 2^index_size × 2^offset_size

Q3: Which block should be replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
  – Random
  – LRU (Least Recently Used)
• Miss rates, LRU vs. Random:

  Cache size   2-way: LRU / Random   4-way: LRU / Random   8-way: LRU / Random
  16 KB        5.2% / 5.7%           4.7% / 5.3%           4.4% / 5.0%
  64 KB        1.9% / 2.0%           1.5% / 1.7%           1.4% / 1.5%
  256 KB       1.15% / 1.17%         1.13% / 1.13%         1.12% / 1.12%

Q4: What happens on a write?
• Write through — the information is written to both the block in the cache and the block in the lower-level memory.
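The address breakdown in the 1 KB direct-mapped example above can be sketched directly in code: 5 byte-select bits, 5 index bits (32 blocks), and a 22-bit tag.

```python
# Tag / index / byte-select extraction for the 1 KB, 32 B-block
# direct-mapped cache described above.

def split_address(addr):
    byte_select = addr & 0x1F       # bits 4-0: block size = 2^5 = 32 B
    index = (addr >> 5) & 0x1F      # bits 9-5: 1 KB / 32 B = 32 blocks
    tag = addr >> 10                # bits 31-10: the remaining 22 bits
    return tag, index, byte_select

# The slide's example values: tag 0x50, index 0x01, byte select 0x00
print([hex(f) for f in split_address(0x14020)])   # -> ['0x50', '0x1', '0x0']
```

On a lookup, the index selects one block, the stored tag is compared against the extracted tag, and the byte select picks bytes out of the block; this is exactly why the tag needs only (32 - N) bits.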
• Write back — the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  – Is the block clean or dirty?
• Pros and cons of each?
  – WT: read misses cannot result in writes
  – WB: no repeated writes to the same location
• WT is always combined with write buffers so that the processor doesn't wait for the lower-level memory.
• What about on a miss?
  – Write-no-allocate vs. write-allocate

Write Buffer for Write Through
• A Write Buffer is needed between the Cache and DRAM
  – Processor: writes data into the cache and the write buffer
  – Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO:
  – Typical number of entries: 4
  – Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
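The memory-traffic difference between the two write policies can be sketched with a toy counter. This is a deliberately minimal model, assuming repeated stores that all hit one already-cached block:

```python
# Write-through sends every store to the lower level; write-back only marks
# the block dirty and writes memory once, when the block is replaced.

def memory_writes(n_stores, policy):
    writes = 0
    dirty = False
    for _ in range(n_stores):
        if policy == "WT":
            writes += 1             # each store also goes to memory
        else:                       # "WB"
            dirty = True            # just set the dirty bit in the cache
    if policy == "WB" and dirty:
        writes += 1                 # one write-back when the block is evicted
    return writes

print(memory_writes(100, "WT"))     # -> 100 memory writes
print(memory_writes(100, "WB"))     # -> 1 memory write
```

This is the "no repeated writes to the same location" advantage of write-back; the write buffer hides write-through's latency from the processor, but it cannot reduce the traffic itself.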
Review: Cache performance
• Miss-oriented approach to memory access:
  CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime
• Separating out the memory component entirely:
  – AMAT = Average Memory Access Time
  CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime

Impact on Performance
• Suppose a processor executes at
  – Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations incur a 50-cycle miss penalty
• Suppose that 1% of instructions incur the same miss penalty
• CPI = ideal CPI + average stalls per instruction
  = 1.1 (cycles/ins)
    + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
    + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
  = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• 58% of the time the proc is stalled waiting for memory!
• Effective CPI = CPI_ideal_mem + P_mem × AMAT
• AMAT = (1/1.3) × [1 + 0.01 × 50] + (0.3/1.3) × [1 + 0.1 × 50] = 2.54

Example: Harvard Architecture
• Unified vs. separate I&D (Harvard)
  [Figure: one Proc with separate I-Cache-1 and D-Cache-1 backed by a Unified Cache-2, vs. one Proc with a Unified Cache-1 backed by a Unified Cache-2]
• Statistics (given in H&P):
  – 16KB I&D: inst miss rate = 0.64%, data miss rate = 6.47%
  – 32KB unified: aggregate miss rate = 1.99%
• Which is better (ignore the L2 cache)?

The Cache Design Space
• Several interacting dimensions:
  – cache size
  – block size
  – associativity
  – replacement policy
  – write-through vs. write-back
• The optimal choice is a compromise
  – depends on access characteristics
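The arithmetic in the Impact on Performance example above can be re-derived as a quick check:

```python
# Ideal CPI 1.1; 30% of instructions are loads/stores with a 10% miss rate;
# 1% of instruction fetches miss; every miss costs 50 cycles.

ideal_cpi = 1.1
data_stalls = 0.30 * 0.10 * 50      # 1.5 stall cycles per instruction
inst_stalls = 1.00 * 0.01 * 50      # 0.5 stall cycles per instruction
cpi = ideal_cpi + data_stalls + inst_stalls
print(round(cpi, 2))                # -> 3.1

# AMAT over the 1.3 memory accesses per instruction (1 fetch + 0.3 data):
amat = (1 / 1.3) * (1 + 0.01 * 50) + (0.3 / 1.3) * (1 + 0.10 * 50)
print(round(amat, 2))               # -> 2.54
```

Note that a modest 10% data miss rate adds more to CPI (1.5 cycles) than the entire ideal pipeline costs (1.1 cycles), which is the whole motivation for the cache-optimization techniques that follow.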
  » workload
  » use (I-cache, D-cache, TLB)
  – depends on technology / cost
• Simplicity often wins
  [Figure: the design space sketched as "Good" and "Bad" regions over Factor A vs. Factor B, Less vs. More]

Harvard architecture example, continued:
  – Assume 33% data ops ⇒ 75% of accesses are instruction fetches (1.0/1.33)
  – hit time = 1, miss time = 50
  – Note that a data hit incurs 1 extra stall for the unified cache (only one port)

  AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
  AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24

Review: Improving Cache Performance
  CPUtime = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Misses
• Classifying misses: the 3 Cs
  – Compulsory — the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
  – Capacity — if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
  – Conflict — if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative, size X cache.)
• A more recent, 4th "C":
  – Coherence — misses caused by cache coherence.
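The split vs. unified AMAT comparison from the Harvard-architecture example can also be checked numerically; a quick sketch, assuming the H&P miss rates quoted above:

```python
# 75% of accesses are instruction fetches, hit time 1, miss penalty 50;
# a unified-cache data hit pays one extra stall cycle because the single
# port is shared with instruction fetch.

penalty = 50
amat_harvard = 0.75 * (1 + 0.0064 * penalty) + 0.25 * (1 + 0.0647 * penalty)
amat_unified = 0.75 * (1 + 0.0199 * penalty) + 0.25 * (1 + 1 + 0.0199 * penalty)
print(round(amat_harvard, 2))       # -> 2.05
print(amat_harvard < amat_unified)  # -> True: the split cache wins
```

The split design wins even though its data miss rate (6.47%) is far worse than the unified aggregate (1.99%), because the extra port removes the structural-hazard stall on every data hit.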