Caches & Memory Hakim Weatherspoon CS 3410 Computer Science Cornell University [Weatherspoon, Bala, Bracy, McKee, and Sirer]
Programs 101

C Code:

    int main (int argc, char* argv[ ]) {
        int i;
        int m = n;
        int sum = 0;
        for (i = 1; i <= m; i++) {
            sum += i;
        }
        printf ("...", n, sum);
    }

RISC-V Assembly:

    main:   addi sp,sp,-48
            sw   x1,44(sp)
            sw   fp,40(sp)
            mv   fp,sp
            sw   x10,-36(fp)
            sw   x11,-40(fp)
            la   x15,n
            lw   x15,0(x15)
            sw   x15,-28(fp)
            sw   x0,-24(fp)
            li   x15,1
            sw   x15,-20(fp)
    L2:     lw   x14,-20(fp)
            lw   x15,-28(fp)
            blt  x15,x14,L3
            . . .

Load/Store Architectures:
• Read data from memory (put in registers)
• Manipulate it
• Store it back to memory

Instructions that read from or write to memory…
Programs 101 (same program, with ABI register names)

C Code: as on the previous slide.

RISC-V Assembly:

    main:   addi sp,sp,-48
            sw   ra,44(sp)
            sw   fp,40(sp)
            mv   fp,sp
            sw   a0,-36(fp)
            sw   a1,-40(fp)
            la   a5,n
            lw   a5,0(a5)
            sw   a5,-28(fp)
            sw   x0,-24(fp)
            li   a5,1
            sw   a5,-20(fp)
    L2:     lw   a4,-20(fp)
            lw   a5,-28(fp)
            blt  a5,a4,L3
            . . .

Load/Store Architectures:
• Read data from memory (put in registers)
• Manipulate it
• Store it back to memory

Instructions that read from or write to memory…
1 Cycle Per Stage: the Biggest Lie (So Far)

Code is stored in memory (and so are data and the stack).

[Figure: the five-stage pipelined datapath: Instruction Fetch, Instruction Decode, Execute, Memory, and Write-Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, with the register file, ALU, hazard-detect/forwarding unit, and the instruction and data memories.]
What's the problem?

CPU vs. Main Memory: main memory is big, but slow and far away.

[Photo: SandyBridge motherboard, 2011. http://news.softpedia.com]
The Need for Speed

Instruction speeds:
• add, sub, shift: 1 cycle
• mult: 3 cycles
• load/store: 100 cycles

Why so slow? Off-chip memory is 50(-70) ns away, while a 2(-3) GHz processor has a 0.5 ns clock period.

[Figure: the CPU pipeline next to off-chip main memory.]
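Where the 100-cycle load/store figure comes from (a rough back-of-the-envelope, using the 50 ns access time and the 0.5 ns clock period above):

    50 ns per off-chip access / 0.5 ns per cycle = 100 cycles per load/store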
What's the solution? Caches!

[Die photo: Intel Pentium 3, 1999, with the Level 1 Data $, Level 1 Insn $, and Level 2 $ labeled.]
Aside
• Go back to 04-state and 05-memory and look at how registers, SRAM, and DRAM are built.
What's the solution? Caches!

What lucky data gets to go here?

[Die photo: Intel Pentium 3, 1999, with the Level 1 Data $, Level 1 Insn $, and Level 2 $ labeled.]
Locality, Locality, Locality

If you ask for something, you're likely to ask for:
• the same thing again soon (Temporal Locality)
• something near that thing, soon (Spatial Locality)

    total = 0;
    for (i = 0; i < n; i++)
        total += a[i];
    return total;
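To see both kinds of locality in that loop, here is a small, self-contained C version (function and variable names are mine, not from the slides) with the accesses annotated:

    #include <stddef.h>

    /* Sum an array, as in the slide's loop. */
    long sum(const int *a, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)   /* total and i are touched every      */
            total += a[i];               /* iteration -> temporal locality;    */
        return total;                    /* a[0], a[1], a[2], ... are adjacent */
    }                                    /* addresses -> spatial locality      */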
Your life is full of Locality
• Last Called
• Speed Dial
• Favorites
• Contacts
• Google/Facebook/email
The Memory Hierarchy (Intel Haswell processor, 2013)

Small, Fast
• Registers: 1 cycle, 128 bytes
• L1 Caches: 4 cycles, 64 KB
• L2 Cache: 12 cycles, 256 KB
• L3 Cache: 36 cycles, 2-20 MB
• Main Memory: 50-70 ns, 512 MB - 4 GB
• Disk: 5-20 ms, 16 GB - 4 TB
Big, Slow
Some Terminology

Cache hit
• data is in the Cache
• t_hit: time it takes to access the cache
• Hit rate (%hit): # cache hits / # cache accesses

Cache miss
• data is not in the Cache
• t_miss: time it takes to get the data from below the $
• Miss rate (%miss): # cache misses / # cache accesses

Cacheline (or cacheblock, or simply line or block)
• Minimum unit of info that is present (or not) in the cache
The Memory Hierarchy: average access time

    t_avg = t_hit + %miss * t_miss
          = 4 + 5% x 100
          = 9 cycles

(Example numbers: a 4-cycle L1 hit time, a 5% miss rate, and a 100-cycle penalty to get the data from below the $.)

[Figure: the same hierarchy pyramid as on the previous slide: Registers, L1 Caches, L2 Cache, L3 Cache, Main Memory, Disk. Intel Haswell processor, 2013.]
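As a sanity check on that arithmetic, a minimal C helper (names are mine) that evaluates the same average-access-time formula:

    #include <stdio.h>

    /* t_avg = t_hit + %miss * t_miss */
    static double amat(double t_hit, double miss_rate, double t_miss) {
        return t_hit + miss_rate * t_miss;
    }

    int main(void) {
        /* The slide's numbers: 4-cycle hit, 5% miss rate, 100-cycle miss penalty. */
        printf("t_avg = %.1f cycles\n", amat(4.0, 0.05, 100.0));   /* prints 9.0 */
        return 0;
    }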
Single Core Memory Hierarchy

[Diagram] On chip: the processor with its Regs, I$, D$, and L2.
Hierarchy levels, top to bottom: Registers, L1 Caches, L2 Cache, L3 Cache, Main Memory, Disk.
Multi-Core Memory Hierarchy

[Diagram] On chip: four processors, each with its own Regs, I$, D$, and L2, sharing a single L3.
Below: Main Memory, then Disk.
Memory Hierarchy by the Numbers

CPU clock rates: ~0.33 ns - 2 ns (3 GHz - 500 MHz)

    Memory technology      Transistor count*   Access time   Access time in cycles   $ per GiB in 2012   Capacity
    SRAM (on chip)         6-8 transistors     0.5-2.5 ns    1-3 cycles              $4k                 256 KB
    SRAM (off chip)                            1.5-30 ns     5-15 cycles             $4k                 32 MB
    DRAM (needs refresh)   1 transistor        50-70 ns      150-200 cycles          $10-$20             8 GB
    SSD (Flash)                                5k-50k ns     Tens of thousands       $0.75-$1            512 GB
    Disk                                       5M-20M ns     Millions                $0.05-$0.1          4 TB

*Registers, D-Flip Flops: 10-100's of registers
Basic Cache Design: Direct Mapped Caches
16-Byte Memory

    MEMORY
    addr   data
    0000   A
    0001   B
    0010   C
    0011   D
    0100   E
    0101   F
    0110   G
    0111   H
    1000   J
    1001   K
    1010   L
    1011   M
    1100   N
    1101   O
    1110   P
    1111   Q

    load 1100 → r1

• Byte-addressable memory
• 4 address bits → 16 bytes total
• b address bits → 2^b bytes in memory
4-Byte, Direct Mapped Cache

MEMORY: the same 16 bytes as before (addresses 0000-1111 holding A ... Q).

    address: XXXX (the low bits are the index)

    CACHE
    index   data
    00      A
    01      B
    10      C
    11      D

Cache entry = row = (cache) line = (cache) block
Block Size: 1 byte

Direct mapped:
• Each address maps to 1 cache block
• 4 entries → 2 index bits (2^n entries → n index bits)

Index with LSB:
• Supports spatial locality
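Since 4 entries means 2 index bits, picking the block for an address is just masking off the low bits. A minimal C sketch (the constants match this 4-entry, 1-byte-block example only; the names are mine):

    #include <stdint.h>

    #define NUM_ENTRIES 4                       /* 2^2 entries -> 2 index bits */

    static inline uint8_t cache_index(uint8_t addr) {
        return addr & (NUM_ENTRIES - 1);        /* low 2 bits of the address   */
    }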
Analogy to a Spice Rack

Spice Wall (Memory) vs. Spice Rack (Cache).
[Diagram: the rack's slots are indexed A, B, C, D, E, F, ..., Z.]

Compared to your spice wall, the rack is:
• Smaller
• Faster
• More costly (per oz.)

[Photo: http://www.bedbathandbeyond.com]
Analogy to a Spice Rack

Spice Wall (Memory) vs. Spice Rack (Cache).
[Diagram: each rack slot now shows an index, a tag, and a spice; the slot at index C holds a jar labeled "innamon", which together with the index spells "Cinnamon".]

• How do you know what's in the jar?
• Need labels

Tag = Ultra-minimalist label

[Photo: http://www.bedbathandbeyond.com]
4-Byte, Direct Mapped Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address = tag | index   (XX|XX)

    CACHE
    index   tag   data
    00      00    A
    01      00    B
    10      00    C
    11      00    D

Tag: minimalist label / address
address = tag + index
4-Byte, Direct Mapped Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    CACHE
    index   V   tag   data
    00      0   00    X
    01      0   00    X
    10      0   00    X
    11      0   00    X

One last tweak: the valid bit
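Putting the pieces together (index, tag, valid bit), a minimal C sketch of this 4-entry, 1-byte-block, direct-mapped cache. The struct, field, and function names are all mine, and the data fill from memory is omitted:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES  4                         /* 4 entries -> 2 index bits  */
    #define INDEX_BITS 2

    struct line {
        bool    valid;                           /* does this line hold data?  */
        uint8_t tag;                             /* high bits of the address   */
        uint8_t data;                            /* block size = 1 byte        */
    };

    static struct line cache[NUM_LINES];

    /* Lookup: index into $, check tag, check valid bit.
       Returns true on a hit; on a miss, installs the new tag. */
    static bool cache_access(uint8_t addr) {
        uint8_t index = addr & (NUM_LINES - 1);  /* low 2 bits                 */
        uint8_t tag   = addr >> INDEX_BITS;      /* remaining high bits        */
        if (cache[index].valid && cache[index].tag == tag)
            return true;                         /* hit                        */
        cache[index].valid = true;               /* miss: fill this line       */
        cache[index].tag   = tag;
        return false;
    }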
Simulation #1 of a 4-byte, DM Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address = tag | index   (XX|XX)

    CACHE
    index   V   tag   data
    00      0   11    X
    01      0   11    X
    10      0   11    X
    11      0   11    X

load 1100

Lookup:
• Index into $
• Check tag
• Check valid bit
Block Diagram: 4-entry, direct mapped Cache

    address: 1101 = tag 11 | index 01   (2 bits each)

    CACHE
    index   V   tag   data
    00      1   00    1111 0000
    01      1   11    1010 0101
    10      0   01    1010 1010
    11      1   11    0000 0000

The index selects a row; that row's tag is compared (=) against the address tag and its valid bit is checked. Here V = 1 and the tags match, so the 8-bit data out is 1010 0101. Hit!

Great! Are we done?
Simulation #2: 4-byte, DM Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    CACHE
    index   V   tag   data
    00      1   11    N
    01      0   11    X
    10      0   11    X
    11      0   11    X

load 1100   Miss (the cache above shows the state after this miss fills index 00)
load 1101
load 0100
load 1100

Lookup:
• Index into $
• Check tag
• Check valid bit
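To replay this trace with the cache_access() sketch from a few slides back, a small hypothetical driver (starting from an empty cache). With 1-byte blocks, every load in this trace misses: the first two are cold misses, and then 0100 and 1100 keep evicting each other at index 00:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumes the struct line / cache_access() definitions sketched earlier. */
    int main(void) {
        uint8_t trace[] = { 0xC, 0xD, 0x4, 0xC };    /* 1100, 1101, 0100, 1100 */
        for (int i = 0; i < 4; i++)
            printf("load %x : %s\n", (unsigned)trace[i],
                   cache_access(trace[i]) ? "Hit" : "Miss");
        return 0;
    }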
Reducing Cold Misses by Increasing Block Size
• Leveraging Spatial Locality
Increasing Block Size

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address: XXXX (the least significant bit is the block offset)

    CACHE
    index   V   tag   data
    00      0   x     A | B
    01      0   x     C | D
    10      0   x     E | F
    11      0   x     G | H

• Block Size: 2 bytes
• Block Offset: least significant bits indicate where you live in the block
• Which bits are the index? tag?
Simulation #3: 8-byte, DM Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address = tag | index | offset   (X|XX|X)

    CACHE
    index   V   tag   data
    00      0   x     X | X
    01      0   x     X | X
    10      0   x     X | X
    11      0   x     X | X

load 1100
load 1101
load 0100
load 1100

Lookup:
• Index into $
• Check tag
• Check valid bit
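With 2-byte blocks the 4-bit address now splits into tag | index | offset = 1 | 2 | 1 bits. A brief C sketch of that split (names and constants are mine and match this example only), plus what it implies for the trace above:

    #include <stdint.h>

    #define OFFSET_BITS 1                        /* 2-byte blocks */
    #define INDEX_BITS  2                        /* 4 entries     */

    static inline uint8_t offset_of(uint8_t a) { return a & 1; }
    static inline uint8_t index_of (uint8_t a) { return (a >> OFFSET_BITS) & 3; }
    static inline uint8_t tag_of   (uint8_t a) { return a >> (OFFSET_BITS + INDEX_BITS); }

    /* Trace 1100, 1101, 0100, 1100 with this split: miss, hit, miss, miss.
       1101 now hits because it shares 1100's block (spatial locality), but
       0100 and 1100 still conflict with each other at index 10.            */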
Removing Conflict Misses with Fully-Associative Caches
Simulation #4: 8-byte, FA Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address = tag | offset   (XXX|X)

    CACHE (fully associative: 4 ways, and a block can live in any of them)
    V   tag   data      V   tag   data      V   tag   data      V   tag   data
    0   xxx   X | X     0   xxx   X | X     0   xxx   X | X     0   xxx   X | X

    LRU Pointer

load 1100   Miss
load 1101
load 0100
load 1100

Lookup:
• Index into $
• Check tags
• Check valid bits
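A matching C sketch of the fully-associative lookup (4 ways, 2-byte blocks). All names are mine, and the slide's LRU pointer is approximated here by a simple round-robin fill pointer:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS        4
    #define OFFSET_BITS 1                       /* 2-byte blocks                */

    struct way { bool valid; uint8_t tag; uint8_t data[2]; };

    static struct way cache_ways[WAYS];
    static int fill_ptr;                        /* stand-in for the LRU pointer */

    static bool fa_access(uint8_t addr) {
        uint8_t tag = addr >> OFFSET_BITS;      /* no index bits: 3-bit tag     */
        for (int w = 0; w < WAYS; w++)          /* compare every way's tag      */
            if (cache_ways[w].valid && cache_ways[w].tag == tag)
                return true;                    /* hit                          */
        cache_ways[fill_ptr].valid = true;      /* miss: fill the chosen way    */
        cache_ways[fill_ptr].tag   = tag;
        fill_ptr = (fill_ptr + 1) % WAYS;       /* advance the victim pointer   */
        return false;
    }

    /* On the trace 1100, 1101, 0100, 1100 this gives miss, hit, miss, hit:
       with no fixed index, 0100 no longer evicts 1100's block.              */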