
Chapter Overview


  1. Chapter Overview
     5.1 Introduction
     5.2 The ABCs of Caches
     5.3 Reducing Cache Misses
     5.4 Reducing Cache Miss Penalty
     5.5 Reducing Hit Time
     5.6 Main Memory
     5.7 Virtual Memory
     5.8 Protection and Examples of Virtual Memory

  2. Introduction – The Big Picture: Where Are We Now?
     The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output.
     • Topics in this chapter:
     – SRAM Memory Technology
     – DRAM Memory Technology
     – Memory Organization

  3. Introduction – The Big Picture: Where Are We Now? Technology Trends
     Capacity / Speed (latency):
     – Logic: 2x in 3 years / 2x in 3 years
     – DRAM:  4x in 3 years / 2x in 10 years
     – Disk:  4x in 3 years / 2x in 10 years
     DRAM generations (Year / Size / Cycle Time):
     – 1980 /  64 Kb / 250 ns
     – 1983 / 256 Kb / 220 ns
     – 1986 /   1 Mb / 190 ns
     – 1989 /   4 Mb / 165 ns
     – 1992 /  16 Mb / 145 ns
     – 1995 /  64 Mb / 120 ns
     That is a 1000:1 improvement in capacity against roughly a 2:1 improvement in speed.

  4. Introduction – Who Cares About the Memory Hierarchy? The Processor-DRAM memory (latency) gap.
     [Chart: relative performance vs. time, 1980-2000]
     – µProc performance ("Moore's Law"): ~60%/yr, i.e. 2x every 1.5 years
     – DRAM performance: ~9%/yr, i.e. 2x every 10 years
     – Processor-memory performance gap: grows ~50% per year

  5. Interfacing Memory
     [Diagram: Processor (P) – Cache (C) – bus – Memory (Mem)]
     • Memory is connected to the cache via a bus
     • The bus is usually narrower than the cache block size (e.g. 8 bytes vs. 32)
     • Board-level wiring: 100 MHz is "good"

  6. Miss Penalty
     [Diagram: Processor – Cache – bus – Memory; T_hit vs. T_miss]
     • Three components of the miss penalty:
     – 1. Wait for the bus
     – 2. Memory latency (wait for the first byte)
     – 3. Transfer time (move the rest of the block over the bus)

  7. Memory Busses
     • Crude: lock the bus for the entire transaction
     – simple: can use the DRAM core interface exactly
     • Better: split-transaction
     – pass commands and data as separate chunks
     – requires buffering at the end-points, tagging of requests if you permit multiple outstanding requests, etc.
     [Timeline: the bus carries the address, sits unused while the memory accesses, then carries the data]

  8. Latency vs. Bandwidth
     • Two metrics of interest:
     – Latency: i.e. the bulk of the cache miss penalty
       • Access Time: time between the request and the arrival of the word
       • Cycle Time: time between successive requests
     – Bandwidth: contributes to transfer time
       • relevant for large cache blocks
       • relevant for I/O
     – bandwidth is easier to improve than latency (just add money)
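A rough sketch (mine, not from the slides) of how the two metrics combine into a block miss penalty, assuming a fixed access latency to the first word and a fixed sustained bus bandwidth for the rest of the block:

    /* Sketch only: how latency and bandwidth combine into a block miss penalty.
       The parameter names and the two-term split are illustrative. */
    double miss_penalty_ns(double access_latency_ns,  /* time until the first word arrives */
                           double bus_bytes_per_ns,   /* sustained bus bandwidth */
                           double block_bytes)        /* cache block size */
    {
        /* latency dominates for small blocks; the transfer term grows with block size */
        return access_latency_ns + block_bytes / bus_bytes_per_ns;
    }

For the 32-byte line and 8-bytes-per-10-ns bus used in the slides that follow, this gives 80 ns of latency plus 40 ns of transfer, which is roughly the interleaved case below.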

  9. Memory Organizations
     [Diagrams: simple (P–C–M over one narrow bus); wide (P–mux–C with wide bus and memory); interleaved (P–C, narrow bus, mux to banks M M M M)]
     • Simple: CPU, Cache, Bus, and Memory all the same width (32 or 64 bits)
     • Wide: CPU/Mux 1 word; Mux/Cache, Bus, and Memory N words (Alpha: 64 bits & 256 bus bits; UltraSPARC: 512)
     • Interleaved: CPU, Cache, and Bus 1 word wide; Memory of N interleaved modules (4 modules here); the example is word interleaved

  10. Performance: Simple
      Each memory transaction is handled separately.
      Example: 32-byte cache lines, 8-byte-wide bus and memory. Bus: 100 MHz (10 ns), memory: 80 ns.
      [Timeline: address 0, data 0, address 1, data 1, ... – four serialized transactions]
      Total: 40 cycles = 4 * (1 + 8 + 1)

  11. Performance: Wide
      Bus and memory are 32 bytes wide!
      Example: 32-byte cache lines fetched in one transaction. Bus: 100 MHz, memory: 80 ns.
      [Timeline: one address, one data transfer]
      Total: 10 cycles = 1 * (1 + 8 + 1)
      Works great but is expensive!

  12. Performance: Interleaved
      Memory is 8 bytes wide but there are four banks.
      Example: 32-byte cache lines fetched in four overlapped transactions. Bus: 100 MHz, memory: 80 ns.
      [Timeline: addresses 0-3 issued back to back; data 0-3 return back to back]
      Total: 13 cycles = 1 + 8 + 4
      A nice tradeoff.
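A small sketch that reproduces the three cycle counts above. The assumptions (one bus cycle to send an address, eight bus cycles of memory latency for 80 ns at a 100 MHz bus, one bus cycle per bus-width transfer) follow the slides; the function names are mine:

    #include <stdio.h>

    enum { ADDR_CYCLES = 1, MEM_LATENCY_CYCLES = 8, LINE_BYTES = 32 };

    /* Simple: each bus-width word is a separate, serialized transaction. */
    int simple_cycles(int bus_width_bytes)
    {
        int words = LINE_BYTES / bus_width_bytes;
        return words * (ADDR_CYCLES + MEM_LATENCY_CYCLES + 1);
    }

    /* Wide: bus and memory are as wide as the whole line, so one transaction. */
    int wide_cycles(void)
    {
        return ADDR_CYCLES + MEM_LATENCY_CYCLES + 1;
    }

    /* Interleaved: the later addresses overlap with the first bank's latency;
       the data words then return back to back over the narrow bus. */
    int interleaved_cycles(int bus_width_bytes)
    {
        return ADDR_CYCLES + MEM_LATENCY_CYCLES + LINE_BYTES / bus_width_bytes;
    }

    int main(void)
    {
        printf("simple:      %d cycles\n", simple_cycles(8));      /* 40 */
        printf("wide:        %d cycles\n", wide_cycles());         /* 10 */
        printf("interleaved: %d cycles\n", interleaved_cycles(8)); /* 13 */
        return 0;
    }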

  13. Interleaving: Two Variations
      • 1. Strictly sequential accesses
      – i.e. for a cache line fill, as above
      – can be implemented with one DRAM array using column-only ("page mode") accesses (up to the width of a column)
      • 2. Arbitrary accesses
      – requires a source of multiple, overlapped requests:
        • advanced CPU and cache technology: write buffers, non-blocking miss-under-miss caches, prefetching
        • I/O with a DMA-capable controller
        • multiprocessors
      – requires multiple independent arrays

  14. Independent Memory Banks
      • How many banks?
      – IF accesses are sequential, THEN making the number of banks equal to the number of clock cycles to access one bank allows all latency to be covered.
      – BUT if the pattern is non-sequential, the first bank will be reused prematurely, inducing a stall.
      [Example with 4 banks: sequential accesses 0, 1, 2, 3, 4, 5, ... rotate through the banks --> fast; the pathological pattern 0, 4, 8, ... reuses the same bank on every access --> slow]
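A minimal sketch of the bank-selection arithmetic, assuming low-order ("word") interleaving, i.e. bank = word index mod number of banks:

    #include <stdio.h>

    #define NUM_BANKS 4   /* matches the 4-bank example above */

    static int bank_of(unsigned word_index) { return word_index % NUM_BANKS; }

    int main(void)
    {
        /* Sequential accesses rotate through the banks, hiding each bank's latency. */
        printf("stride 1: ");
        for (unsigned i = 0; i < 8; i++) printf("%d ", bank_of(i));      /* 0 1 2 3 0 1 2 3 */

        /* The pathological stride-4 pattern keeps reusing bank 0 and stalls. */
        printf("\nstride 4: ");
        for (unsigned i = 0; i < 8; i++) printf("%d ", bank_of(i * 4));  /* 0 0 0 0 ... */
        printf("\n");
        return 0;
    }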

  15. Avoiding Bank Conflicts
      • Lots of banks:
        int x[256][512];
        for (j = 0; j < 512; j = j+1)
          for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];
      • Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
      • Solutions (the two software fixes are sketched below):
      – software: loop interchange
      – software: adjust the array size to a prime # ("array padding")
      – hardware: prime number of banks (e.g. 17)
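A hedged sketch of the two software fixes named above, using the slide's array sizes; the padded row length of 513 is one illustrative choice of a size that shares no factor with a power-of-two bank count:

    /* Fix 1: loop interchange -- walk each row contiguously (stride-1 in words),
       so successive accesses fall in successive banks. */
    void scale_interchanged(int x[256][512])
    {
        for (int i = 0; i < 256; i++)
            for (int j = 0; j < 512; j++)
                x[i][j] = 2 * x[i][j];
    }

    /* Fix 2: array padding -- declare the rows one element longer (513), so that
       walking down a column no longer maps every access to the same bank. */
    void scale_padded(int x[256][513])
    {
        for (int j = 0; j < 512; j++)       /* still only 512 useful columns */
            for (int i = 0; i < 256; i++)
                x[i][j] = 2 * x[i][j];
    }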

  16. Improved DRAM Interfaces
      • Multiple CAS accesses: known under several names (page mode)
      – Extended Data Out (EDO): 30% faster in page mode
      • New DRAMs to address the gap; what will they cost, will they survive?
      – Synchronous DRAM: 2-4 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66-150 MHz)
      – RAMBUS: startup company; reinvents the DRAM interface
        • each chip is a module rather than a slice of memory (8-16 banks)
        • short bus between CPU and chips
        • does its own refresh
        • block transfer mechanism, arbitrary sequencing
        • 2 bytes @ 800 MHz (1.6 GB/s per bus)

  17. Current Memory Technology: Direct RDRAM (see also SDRAM, DDR SDRAM)
      • Packet-switched bus interface
      – 18 bits of data, 8 bits of control at 800 MHz
      – collected into packets of 8 at 100 MHz (10 ns)
      – that's 16 bytes w/ECC plus controls for multiple banks
      • Internally banked
      – a 128 Mbit (16 Mbyte) part has 32 banks
      • Timing
      – TRA is 40 ns (but effectively 60 ns due to the interface)
      – one row cycle time is 80 ns, but a new bank can start every 10 ns
      – TCA is 20 ns (effectively 40 ns); a new column can start every 10 ns
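A quick back-of-the-envelope check of the packet figures above (the arithmetic is mine; reading the 18 bits as 16 data + 2 ECC bits is inferred from "16 bytes w/ECC", not stated on the slide):

    #include <stdio.h>

    int main(void)
    {
        const int bits_per_transfer    = 18;    /* per the slide: 18 bits at 800 MHz */
        const int transfers_per_packet = 8;     /* collected at 100 MHz, i.e. every 10 ns */
        const double packet_time_ns    = 10.0;

        int bytes_per_packet = bits_per_transfer * transfers_per_packet / 8;  /* 18 bytes */
        double data_gb_per_s = 16.0 / packet_time_ns;  /* 16 data bytes / 10 ns = 1.6 GB/s */

        printf("bytes per packet (data + ECC): %d\n", bytes_per_packet);
        printf("data bandwidth per bus:        %.1f GB/s\n", data_gb_per_s);
        return 0;
    }

This matches the 1.6 GB/s per bus quoted on the previous slide.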

  18. [Figure-only slide; no text captured]

  19. [Figure-only slide; no text captured]

  20. [Figure-only slide; no text captured]

  21. Introduction – The Big Picture: Where Are We Now? Today's Situation: Microprocessor
      • Rely on caches to bridge the gap
      • Microprocessor-DRAM performance gap
      – time of a full cache miss, measured in instructions executed:
        1st Alpha (7000):   340 ns / 5.0 ns =  68 clks x 2, or 136 instructions
        2nd Alpha (8400):   266 ns / 3.3 ns =  80 clks x 4, or 320 instructions
        3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions
      – 1/2X latency x 3X clock rate x 3X instr/clock ⇒ ~5X
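A small sketch that reproduces the slide's arithmetic (the numbers are the slide's; the function is mine, and it rounds slightly differently than the slide, which rounds the clock count first):

    #include <stdio.h>

    /* instructions lost per miss = (miss latency / cycle time) * instructions per clock */
    static double instructions_per_miss(double miss_ns, double cycle_ns, int issue_width)
    {
        return miss_ns / cycle_ns * issue_width;
    }

    int main(void)
    {
        printf("Alpha 7000: %.0f instructions\n", instructions_per_miss(340, 5.0, 2)); /* ~136 */
        printf("Alpha 8400: %.0f instructions\n", instructions_per_miss(266, 3.3, 4)); /* ~322 */
        printf("3rd Alpha:  %.0f instructions\n", instructions_per_miss(180, 1.7, 6)); /* ~635 */
        return 0;
    }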

  22. Introduction – Levels of the Memory Hierarchy (upper level = smaller, faster, costlier per bit; lower level = larger, slower, cheaper)
      • CPU Registers: 100s of bytes, ~1 ns; staged by the program/compiler; transfer unit: instruction operands, 1-8 bytes
      • Cache: K bytes, ~4 ns, 1-0.1 cents/bit; staged by the cache controller; transfer unit: blocks, 8-128 bytes
      • Main Memory: M bytes, 100-300 ns, $.0001-.00001 cents/bit; staged by the OS; transfer unit: pages, 512-4K bytes
      • Disk: G bytes, ~10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit; staged by the user/operator; transfer unit: files, Mbytes
      • Tape: "infinite" capacity, sec-min access time, 10^-8 cents/bit

  23. The ABCs of Caches
      In this section we will:
      • Learn lots of definitions about caches – you can't talk about something until you understand it (this is true in computer science at least!)
      • Answer some fundamental questions about caches:
      – Q1: Where can a block be placed in the upper level? (Block placement)
      – Q2: How is a block found if it is in the upper level? (Block identification)
      – Q3: Which block should be replaced on a miss? (Block replacement)
      – Q4: What happens on a write? (Write strategy)

  24. The ABCs of Caches – Definitions: The Principle of Locality
      • The Principle of Locality:
      – Programs access a relatively small portion of the address space at any instant of time.
      • Two different types of locality:
      – Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
      – Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array accesses)
      • For the last 15 years, HW has relied on locality for speed

  25. The ABCs of Caches – Definitions: Memory Hierarchy Terminology
      • Hit: the data appears in some block in the upper level (example: Block X)
      – Hit Rate: the fraction of memory accesses found in the upper level
      – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
      • Miss: the data must be retrieved from a block in the lower level (Block Y)
      – Miss Rate = 1 - (Hit Rate)
      – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
      • Hit Time << Miss Penalty (500 instructions on the 21264!)
      [Diagram: the processor exchanges Blk X with the upper-level memory, which exchanges Blk Y with the lower-level memory]
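These terms are usually combined into an average memory access time; a minimal sketch of the standard formula (the parameter names are mine):

    /* average memory access time = hit time + miss rate * miss penalty */
    double average_memory_access_time(double hit_time,
                                      double miss_rate,     /* = 1 - hit rate */
                                      double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }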
