ENCM 501: Principles of Computer Architecture
Winter 2014 Term
Slides for Lecture 9
Steve Norman, PhD, PEng
Electrical & Computer Engineering, Schulich School of Engineering, University of Calgary
6 February, 2014


Slide 2/23: Previous Lecture
◮ completion of DRAM coverage
◮ introduction to caches

Slide 3/23: Today's Lecture
◮ continued coverage of cache design and cache performance
Related reading in Hennessy & Patterson: Sections B.1–B.3.

Slide 4/23: Review: Example computer with only one level of cache and no virtual memory
[Figure: a core with an L1 I-cache and an L1 D-cache, connected through a DRAM controller to DRAM modules.]
We're looking at this simple system because it helps us to think about cache design and performance issues while avoiding the complexity of real systems like the Intel i7 shown in textbook Figure 2.21.

Slide 5/23: Review: Example data cache organization
[Figure: a 2-way set-associative cache with way 0 and way 1, sets 0 to 255, and an 8-to-256 index decoder.]
Key:
◮ block status: 1 valid bit and 1 dirty bit per block
◮ tag: one 18-bit stored tag per block
◮ data: one 64-byte (512-bit) data block
◮ set status: 1 LRU bit per set
This could fit within our simple example hierarchy, but is also not much different from some current L1 D-cache designs.

Slide 6/23: Review: Completion of a read hit in the example cache
See the previous lecture for details of hit detection.
We supposed that the data size requested for a read at address 0x1001abc4 is word. The address split was
  0001 0000 0000 0001 10 | 10 1011 11 | 00 0100
(tag | index | block offset).
What updates happen in the cache, and what information goes back to the core?
Answer: In the way where the hit did happen, the index (binary 10101111 = decimal 175), the block offset and the data size are used to find a 32-bit word within a data block to copy into the core. The LRU bit for set 175 is updated to equal the number of the way where the hit did not happen.
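The address split on slide 6 can be checked with a few lines of C. The sketch below is not from the slides; it assumes the slide-5 geometry (256 sets, 64-byte blocks, 18-bit tags) and 32-bit addresses, and simply extracts the three fields.

  #include <stdio.h>
  #include <stdint.h>

  /* Example cache geometry from slide 5: 64-byte blocks (6 offset bits),
     256 sets (8 index bits), so the tag is the remaining 18 bits. */
  #define OFFSET_BITS 6
  #define INDEX_BITS  8

  int main(void) {
      uint32_t addr   = 0x1001abc4u;                        /* address used on slide 6 */
      uint32_t offset = addr & ((1u << OFFSET_BITS) - 1u);
      uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
      uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

      /* Prints tag=0x04006, index=175, offset=4, matching the split
         0001 0000 0000 0001 10 | 10 1011 11 | 00 0100 shown on slide 6. */
      printf("tag=0x%05x  index=%u  offset=%u\n",
             (unsigned)tag, (unsigned)index, (unsigned)offset);
      return 0;
  }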

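Slide 6's answer can also be written as a behavioural sketch. Nothing below comes from the slides: the struct layout, field names and function name are invented for illustration, and hit detection (covered in the previous lecture) is assumed to have already identified the hitting way.

  #include <stdint.h>
  #include <string.h>

  /* Invented model of one set of the 2-way example cache on slide 5. */
  struct cache_set {
      uint8_t  valid[2], dirty[2];   /* block status: 1 valid + 1 dirty bit per block */
      uint32_t tag[2];               /* one 18-bit stored tag per block */
      uint8_t  data[2][64];          /* one 64-byte data block per way */
      uint8_t  lru;                  /* set status: 1 LRU bit per set (number of the LRU way) */
  };

  /* Completion of a read hit: 'way' is where hit detection found the match. */
  static uint32_t complete_read_hit(struct cache_set *sets, int way,
                                    uint32_t index, uint32_t offset)
  {
      uint32_t word;
      /* The block offset and data size pick one 32-bit word out of the 64-byte block. */
      memcpy(&word, &sets[index].data[way][offset & ~3u], sizeof word);
      /* The LRU bit is set to the number of the way where the hit did NOT happen. */
      sets[index].lru = (uint8_t)(1 - way);
      return word;                   /* this word goes back to the core */
  }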
Slide 7/23: How did the stored tags get into the D-cache? And how did V-bit values go from 0 to 1?
These are good questions. The answer to both is that stored tags and valid data blocks get into a cache as a result of misses.
The story of cache access is being told starting not at the beginning, when a program is launched, but in the middle, after a program has been running for a while, and some of the program's data has already been copied from main memory into the D-cache.
The reason for telling the story this way is that hits are easier to describe (and, we hope, much more frequent) than misses.

Slide 8/23: Completion of a write hit in the example cache
As with a read, this only happens if the result of hit detection was a hit.
To handle writes, this particular cache uses a write-back strategy, in which write hits update the cache, and do not cause updates in the next level of the memory hierarchy.
Let's suppose that the data size requested for a write at address 0x1001abc4 is word.
What updates happen in the cache, and what information goes back to the core?

Slide 9/23: Completion of a read miss in the example cache
A read miss means that the data the core was seeking is not in the cache.
Again suppose that the data size requested for a read at address 0x1001abc4 is word.
What will happen to get the core the data it needs? What updates are needed to make sure that the cache will behave efficiently and correctly in the future?
Important: Make sure you know why it is absolutely not good enough to copy only one 32-bit word from DRAM to the cache!

Slide 10/23: Completion of a write miss in the example cache
A write miss means that the memory contents the core wanted to update are not currently reflected in the cache.
Again suppose that the data size requested for a write at address 0x1001abc4 is word.
What will happen to complete the write? What updates are needed to make sure that the cache will behave efficiently and correctly in the future? (A behavioural sketch of write handling follows slide 12 below.)

Slide 11/23: Storage cells in the example cache
Data blocks are implemented with SRAM cells. The design tradeoffs for the cell design relate to speed, chip area, and energy use per read or write.
Tags and status bits might be SRAM cells or might be CAM ("content addressable memory") cells.
A CMOS CAM cell uses the same 6-transistor structure as an SRAM cell for reads, writes, and holding a stored bit for long periods of time during which there are neither reads nor writes.
A CAM cell also has 3 or 4 extra transistors that help in determining whether the bit pattern in a group of CAM cells (e.g., a stored tag) matches some other bit pattern (e.g., a search tag).

Slide 12/23: CAM cells organized to make a J-bit stored tag
[Figure: a row of J CAM cells sharing wordline WL_i and matchline MATCH_i, driven by bitline pairs BL_{J-1} ... BL_0.]
For reads or writes the wordline and bitlines play the same roles they do in a row of an SRAM array.
To check for a match, the wordline is held LOW, and the search tag is applied to the bitlines. If every search tag bit matches the corresponding stored tag bit, the matchline stays HIGH; if there is even a single mismatch, the matchline goes LOW.
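Slides 8-10 ask what happens on write hits, read misses and write misses. One possible answer, for a write, is sketched below; it is not from the slides. It reuses the invented struct cache_set from the read-hit sketch above, assumes a write-allocate policy (the whole 64-byte block is fetched on a miss, as slide 9 insists for reads), and relies on three hypothetical helpers (find_hit, writeback_block, fetch_block_from_dram) that stand in for hit detection and DRAM traffic.

  /* Hypothetical helpers, declared only so the sketch is self-contained. */
  int  find_hit(const struct cache_set *set, uint32_t tag);   /* returns way 0 or 1, or -1 on a miss */
  void writeback_block(struct cache_set *set, int way);       /* dirty victim block goes to DRAM */
  void fetch_block_from_dram(struct cache_set *set, int way, uint32_t addr);

  static void write_word(struct cache_set *sets, uint32_t addr, uint32_t value)
  {
      uint32_t offset = addr & 0x3fu;
      uint32_t index  = (addr >> 6) & 0xffu;
      uint32_t tag    = addr >> 14;

      int way = find_hit(&sets[index], tag);
      if (way < 0) {                                      /* write miss */
          way = sets[index].lru;                          /* evict the least recently used way */
          if (sets[index].valid[way] && sets[index].dirty[way])
              writeback_block(&sets[index], way);         /* write-back: modified victim must reach DRAM */
          fetch_block_from_dram(&sets[index], way, addr); /* bring in the WHOLE 64-byte block */
          sets[index].valid[way] = 1;
          sets[index].dirty[way] = 0;
          sets[index].tag[way]   = tag;
      }
      /* Hit (or miss now repaired): update only the cache, not the next level ... */
      memcpy(&sets[index].data[way][offset & ~3u], &value, sizeof value);
      sets[index].dirty[way] = 1;                         /* ... and mark the block modified */
      sets[index].lru = (uint8_t)(1 - way);
  }

A read miss would be handled the same way up to the block fetch, after which the requested word is sent to the core instead of being overwritten.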

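The match behaviour described on slide 12 amounts to an equality test over J bits. The function below is not from the slides and is a behavioural model only; it says nothing about wordlines, precharged matchlines, or the extra transistors in the cell.

  #include <stdint.h>

  /* Behavioural model of one J-bit CAM tag row: the result is 1 exactly when
     every search-tag bit equals the corresponding stored-tag bit, i.e. when
     the matchline would stay HIGH. */
  static int cam_row_match(uint32_t stored_tag, uint32_t search_tag, int j_bits)
  {
      uint32_t mask = (j_bits >= 32) ? 0xffffffffu : ((1u << j_bits) - 1u);
      /* XOR exposes mismatching bits; any one of them "pulls the matchline LOW". */
      return ((stored_tag ^ search_tag) & mask) == 0;
  }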
Slide 13/23: CAM cells versus SRAM cells for stored tags
With CAM cells, tag comparisons can be done in place. With SRAM cells, stored tags would have to be read via bitlines to comparator circuits outside the tag array, which is a slower process.
(Schematics of caches in the textbook tend to show tag comparison done outside of tag arrays, but that is likely done to show that a comparison is needed, not to indicate physical design.)
CAM cells are larger than SRAM cells. But the total area needed for CAM-cell tag arrays will still be much smaller than the total area needed for SRAM data blocks.
Would it make sense to use CAM cells for V (valid) bits?

Slide 14/23: Cache line is a synonym for cache block
Hennessy and Patterson are fairly consistent in their use of the term cache block, and a lot of other literature uses that term as well. However, the term cache line, which means the same thing, is also in wide use. So you will probably read things like . . .
◮ "In a 4-way set-associative cache, an index finds a set containing 4 cache lines."
◮ "In a direct-mapped cache, there is one cache line per index."
◮ "A cache miss, even if it is for access to a single byte, will result in the transfer of an entire cache line."

Slide 15/23: Direct-mapped caches
A direct-mapped cache can be thought of as a special case of a set-associative cache, in which there is only one way.
For a given cache capacity, a direct-mapped cache is easier to build than an N-way set-associative cache with N ≥ 2:
◮ no logic is required to find the correct way for data transfer after a hit is detected;
◮ no logic is needed to decide which block in a set to replace in handling a miss.
Direct-mapped caches may also be faster and more energy-efficient.
However, direct-mapped caches are vulnerable to index conflicts (sometimes called index collisions).

Slide 16/23: Data cache index conflict example
Consider this sketch of a C function:

  int g1, g2, g3, g4;

  void func(int *x, int n) {
      int loc[10], k;
      while ( /* condition */ ) {
          /* make accesses to g1, g2, g3, g4 and loc */
      }
  }

What will happen in the following scenario? (An address-arithmetic sketch follows slide 18 below.)
◮ the addresses of g1 to g4 are 0x0804fff0 to 0x0804fffc
◮ the address of loc[0] is 0xbfbdfff0

Slide 17/23: Instruction cache index conflict example
Suppose a program spends much of its time in a loop within function f . . .

  void f(double *x, double *y, int n) {
      int k;
      for (k = 0; k < n; k++)
          y[k] = g(x[k]) + h(x[k]);
  }

Suppose that g and h are small, simple functions that don't call other functions.
What kind of bad luck could cause huge numbers of misses in a direct-mapped instruction cache?

Slide 18/23: Motivation for set-associative caches (1)
Qualitatively:
◮ In our example of data cache index conflicts, the conflicts go away if the cache is changed from direct-mapped to 2-way set-associative.
◮ In our example of instruction cache index conflicts, in the worst case, the conflicts go away if the cache is changed from direct-mapped to 4-way set-associative.
Quantitatively, see Figure B.8 on page B-24 of the textbook:
◮ Conflict misses are a big problem in direct-mapped caches.
◮ Moving from direct-mapped to 2-way to 4-way to 8-way reduces the conflict miss rate at each step.
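The data cache index conflict on slide 16 can be checked with address arithmetic. The sketch below is not from the slides; it assumes 64-byte blocks and 256 sets, but the collision holds for any direct-mapped cache whose index and offset bits fit inside the low 16 address bits, because the two addresses agree in all of those bits.

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      uint32_t g1_addr  = 0x0804fff0u;   /* address of g1, from slide 16 */
      uint32_t loc_addr = 0xbfbdfff0u;   /* address of loc[0], from slide 16 */

      /* Direct-mapped: 6 offset bits (64-byte blocks), 8 index bits (assumed 256 sets). */
      uint32_t g1_set  = (g1_addr  >> 6) & 0xffu;
      uint32_t loc_set = (loc_addr >> 6) & 0xffu;

      /* Both print set 255: the block holding g1..g4 and the block holding the
         start of loc map to the same index, so inside the loop each kind of
         access keeps evicting the other, causing misses on nearly every pass. */
      printf("g1..g4 -> set %u, loc[0] -> set %u\n",
             (unsigned)g1_set, (unsigned)loc_set);
      return 0;
  }

With a 2-way set-associative cache of the same capacity, both blocks can sit in the same set at the same time, which is the point made on slide 18.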
