Caches & Memory Hakim Weatherspoon CS 3410 Computer Science Cornell University [Weatherspoon, Bala, Bracy, McKee, and Sirer]
Programs 101

C Code:

    int main (int argc, char* argv[ ]) {
        int i;
        int m = n;
        int sum = 0;
        for (i = 1; i <= m; i++) {
            sum += i;
        }
        printf ("...", n, sum);
    }

RISC-V Assembly:

    main:   addi sp,sp,-48
            sw   x1,44(sp)
            sw   fp,40(sp)
            mv   fp,sp
            sw   x10,-36(fp)
            sw   x11,-40(fp)
            la   x15,n
            lw   x15,0(x15)
            sw   x15,-28(fp)
            sw   x0,-24(fp)
            li   x15,1
            sw   x15,-20(fp)
    L2:     lw   x14,-20(fp)
            lw   x15,-28(fp)
            blt  x15,x14,L3
            . . .

Load/Store Architectures:
• Read data from memory (put in registers)
• Manipulate it
• Store it back to memory

Instructions that read from or write to memory…
Programs 101 (same program, with ABI register names)

C Code: as on the previous slide.

RISC-V Assembly:

    main:   addi sp,sp,-48
            sw   ra,44(sp)
            sw   fp,40(sp)
            mv   fp,sp
            sw   a0,-36(fp)
            sw   a1,-40(fp)
            la   a5,n
            lw   a5,0(a5)
            sw   a5,-28(fp)
            sw   x0,-24(fp)
            li   a5,1
            sw   a5,-20(fp)
    L2:     lw   a4,-20(fp)
            lw   a5,-28(fp)
            blt  a5,a4,L3
            . . .

Load/Store Architectures:
• Read data from memory (put in registers)
• Manipulate it
• Store it back to memory

Instructions that read from or write to memory…
1 Cycle Per Stage: the Biggest Lie (So Far)

Code is stored in memory (and so are data and the stack).

[Figure: the five-stage pipelined datapath: Instruction Fetch, Instruction Decode, Execute, Memory, and Write-Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, with the register file, ALU, hazard-detect/forwarding unit, and the instruction and data memories.]
What's the problem?

CPU vs. Main Memory: main memory is big, but slow and far away.

[Photo: SandyBridge motherboard, 2011. http://news.softpedia.com]
The Need for Speed

Instruction speeds:
• add, sub, shift: 1 cycle
• mult: 3 cycles
• load/store: 100 cycles

Why so slow? Off-chip memory is 50(-70) ns away, while a 2(-3) GHz processor has a 0.5 ns clock period.

[Figure: the CPU pipeline next to off-chip main memory.]
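Where the 100-cycle load/store figure comes from (a rough back-of-the-envelope, using the 50 ns access time and the 0.5 ns clock period above):

    50 ns per off-chip access / 0.5 ns per cycle = 100 cycles per load/store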
What's the solution? Caches!

[Die photo: Intel Pentium 3, 1999, with the Level 1 Data $, Level 1 Insn $, and Level 2 $ labeled.]
Aside
• Go back to 04-state and 05-memory and look at how registers, SRAM, and DRAM are built.
What's the solution? Caches!

What lucky data gets to go here?

[Die photo: Intel Pentium 3, 1999, with the Level 1 Data $, Level 1 Insn $, and Level 2 $ labeled.]
Locality, Locality, Locality

If you ask for something, you're likely to ask for:
• the same thing again soon (Temporal Locality)
• something near that thing, soon (Spatial Locality)

    total = 0;
    for (i = 0; i < n; i++)
        total += a[i];
    return total;
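To see both kinds of locality in that loop, here is a small, self-contained C version (function and variable names are mine, not from the slides) with the accesses annotated:

    #include <stddef.h>

    /* Sum an array, as in the slide's loop. */
    long sum(const int *a, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)   /* total and i are touched every      */
            total += a[i];               /* iteration -> temporal locality;    */
        return total;                    /* a[0], a[1], a[2], ... are adjacent */
    }                                    /* addresses -> spatial locality      */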
Your life is full of Locality
• Last Called
• Speed Dial
• Favorites
• Contacts
• Google/Facebook/email
The Memory Hierarchy (Intel Haswell processor, 2013)

Small, Fast
• Registers: 1 cycle, 128 bytes
• L1 Caches: 4 cycles, 64 KB
• L2 Cache: 12 cycles, 256 KB
• L3 Cache: 36 cycles, 2-20 MB
• Main Memory: 50-70 ns, 512 MB - 4 GB
• Disk: 5-20 ms, 16 GB - 4 TB
Big, Slow
Some Terminology

Cache hit
• data is in the Cache
• t_hit: time it takes to access the cache
• Hit rate (%hit): # cache hits / # cache accesses

Cache miss
• data is not in the Cache
• t_miss: time it takes to get the data from below the $
• Miss rate (%miss): # cache misses / # cache accesses

Cacheline (or cacheblock, or simply line or block)
• Minimum unit of info that is present (or not) in the cache
The Memory Hierarchy: average access time

    t_avg = t_hit + %miss * t_miss
          = 4 + 5% x 100
          = 9 cycles

(Example numbers: a 4-cycle L1 hit time, a 5% miss rate, and a 100-cycle penalty to get the data from below the $.)

[Figure: the same hierarchy pyramid as on the previous slide: Registers, L1 Caches, L2 Cache, L3 Cache, Main Memory, Disk. Intel Haswell processor, 2013.]
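As a sanity check on that arithmetic, a minimal C helper (names are mine) that evaluates the same average-access-time formula:

    #include <stdio.h>

    /* t_avg = t_hit + %miss * t_miss */
    static double amat(double t_hit, double miss_rate, double t_miss) {
        return t_hit + miss_rate * t_miss;
    }

    int main(void) {
        /* The slide's numbers: 4-cycle hit, 5% miss rate, 100-cycle miss penalty. */
        printf("t_avg = %.1f cycles\n", amat(4.0, 0.05, 100.0));   /* prints 9.0 */
        return 0;
    }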
Single Core Memory Hierarchy

[Diagram] On chip: the processor with its Regs, I$, D$, and L2.
Hierarchy levels, top to bottom: Registers, L1 Caches, L2 Cache, L3 Cache, Main Memory, Disk.
Multi-Core Memory Hierarchy

[Diagram] On chip: four processors, each with its own Regs, I$, D$, and L2, sharing a single L3.
Below: Main Memory, then Disk.
Memory Hierarchy by the Numbers

CPU clock rates: ~0.33 ns - 2 ns (3 GHz - 500 MHz)

    Memory technology      Transistor count*   Access time   Access time in cycles   $ per GiB in 2012   Capacity
    SRAM (on chip)         6-8 transistors     0.5-2.5 ns    1-3 cycles              $4k                 256 KB
    SRAM (off chip)                            1.5-30 ns     5-15 cycles             $4k                 32 MB
    DRAM (needs refresh)   1 transistor        50-70 ns      150-200 cycles          $10-$20             8 GB
    SSD (Flash)                                5k-50k ns     Tens of thousands       $0.75-$1            512 GB
    Disk                                       5M-20M ns     Millions                $0.05-$0.1          4 TB

*Registers, D-Flip Flops: 10-100's of registers
Basic Cache Design: Direct Mapped Caches
16-Byte Memory

    MEMORY
    addr   data
    0000   A
    0001   B
    0010   C
    0011   D
    0100   E
    0101   F
    0110   G
    0111   H
    1000   J
    1001   K
    1010   L
    1011   M
    1100   N
    1101   O
    1110   P
    1111   Q

    load 1100 → r1

• Byte-addressable memory
• 4 address bits → 16 bytes total
• b address bits → 2^b bytes in memory
4-Byte, Direct Mapped Cache

MEMORY: the same 16 bytes as before (addresses 0000-1111 holding A ... Q).

    address: XXXX (the low bits are the index)

    CACHE
    index   data
    00      A
    01      B
    10      C
    11      D

Cache entry = row = (cache) line = (cache) block
Block Size: 1 byte

Direct mapped:
• Each address maps to 1 cache block
• 4 entries → 2 index bits (2^n entries → n index bits)

Index with LSB:
• Supports spatial locality
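Since 4 entries means 2 index bits, picking the block for an address is just masking off the low bits. A minimal C sketch (the constants match this 4-entry, 1-byte-block example only; the names are mine):

    #include <stdint.h>

    #define NUM_ENTRIES 4                       /* 2^2 entries -> 2 index bits */

    static inline uint8_t cache_index(uint8_t addr) {
        return addr & (NUM_ENTRIES - 1);        /* low 2 bits of the address   */
    }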
Analogy to a Spice Rack

Spice Wall (Memory) vs. Spice Rack (Cache).
[Diagram: the rack's slots are indexed A, B, C, D, E, F, ..., Z.]

Compared to your spice wall, the rack is:
• Smaller
• Faster
• More costly (per oz.)

[Photo: http://www.bedbathandbeyond.com]
Analogy to a Spice Rack

Spice Wall (Memory) vs. Spice Rack (Cache).
[Diagram: each rack slot now shows an index, a tag, and a spice; the slot at index C holds a jar labeled "innamon", which together with the index spells "Cinnamon".]

• How do you know what's in the jar?
• Need labels

Tag = Ultra-minimalist label

[Photo: http://www.bedbathandbeyond.com]
4-Byte, Direct Mapped Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address = tag | index   (XX|XX)

    CACHE
    index   tag   data
    00      00    A
    01      00    B
    10      00    C
    11      00    D

Tag: minimalist label / address
address = tag + index
4-Byte, Direct Mapped Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    CACHE
    index   V   tag   data
    00      0   00    X
    01      0   00    X
    10      0   00    X
    11      0   00    X

One last tweak: the valid bit
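Putting the pieces together (index, tag, valid bit), a minimal C sketch of this 4-entry, 1-byte-block, direct-mapped cache. The struct, field, and function names are all mine, and the data fill from memory is omitted:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES  4                         /* 4 entries -> 2 index bits  */
    #define INDEX_BITS 2

    struct line {
        bool    valid;                           /* does this line hold data?  */
        uint8_t tag;                             /* high bits of the address   */
        uint8_t data;                            /* block size = 1 byte        */
    };

    static struct line cache[NUM_LINES];

    /* Lookup: index into $, check tag, check valid bit.
       Returns true on a hit; on a miss, installs the new tag. */
    static bool cache_access(uint8_t addr) {
        uint8_t index = addr & (NUM_LINES - 1);  /* low 2 bits                 */
        uint8_t tag   = addr >> INDEX_BITS;      /* remaining high bits        */
        if (cache[index].valid && cache[index].tag == tag)
            return true;                         /* hit                        */
        cache[index].valid = true;               /* miss: fill this line       */
        cache[index].tag   = tag;
        return false;
    }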
Simulation #1 of a 4-byte, DM Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address = tag | index   (XX|XX)

    CACHE
    index   V   tag   data
    00      0   11    X
    01      0   11    X
    10      0   11    X
    11      0   11    X

load 1100

Lookup:
• Index into $
• Check tag
• Check valid bit
Block Diagram: 4-entry, direct mapped Cache

    address: 1101 = tag 11 | index 01   (2 bits each)

    CACHE
    index   V   tag   data
    00      1   00    1111 0000
    01      1   11    1010 0101
    10      0   01    1010 1010
    11      1   11    0000 0000

The index selects a row; that row's tag is compared (=) against the address tag and its valid bit is checked. Here V = 1 and the tags match, so the 8-bit data out is 1010 0101. Hit!

Great! Are we done?
Simulation #2: 4-byte, DM Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    CACHE
    index   V   tag   data
    00      1   11    N
    01      0   11    X
    10      0   11    X
    11      0   11    X

load 1100   Miss (the cache above shows the state after this miss fills index 00)
load 1101
load 0100
load 1100

Lookup:
• Index into $
• Check tag
• Check valid bit
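To replay this trace with the cache_access() sketch from a few slides back, a small hypothetical driver (starting from an empty cache). With 1-byte blocks, every load in this trace misses: the first two are cold misses, and then 0100 and 1100 keep evicting each other at index 00:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumes the struct line / cache_access() definitions sketched earlier. */
    int main(void) {
        uint8_t trace[] = { 0xC, 0xD, 0x4, 0xC };    /* 1100, 1101, 0100, 1100 */
        for (int i = 0; i < 4; i++)
            printf("load %x : %s\n", (unsigned)trace[i],
                   cache_access(trace[i]) ? "Hit" : "Miss");
        return 0;
    }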
Reducing Cold Misses by Increasing Block Size
• Leveraging Spatial Locality
Increasing Block Size

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address: XXXX (the least significant bit is the block offset)

    CACHE
    index   V   tag   data
    00      0   x     A | B
    01      0   x     C | D
    10      0   x     E | F
    11      0   x     G | H

• Block Size: 2 bytes
• Block Offset: least significant bits indicate where you live in the block
• Which bits are the index? tag?
Simulation #3: 8-byte, DM Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address = tag | index | offset   (X|XX|X)

    CACHE
    index   V   tag   data
    00      0   x     X | X
    01      0   x     X | X
    10      0   x     X | X
    11      0   x     X | X

load 1100
load 1101
load 0100
load 1100

Lookup:
• Index into $
• Check tag
• Check valid bit
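With 2-byte blocks the 4-bit address now splits into tag | index | offset = 1 | 2 | 1 bits. A brief C sketch of that split (names and constants are mine and match this example only), plus what it implies for the trace above:

    #include <stdint.h>

    #define OFFSET_BITS 1                        /* 2-byte blocks */
    #define INDEX_BITS  2                        /* 4 entries     */

    static inline uint8_t offset_of(uint8_t a) { return a & 1; }
    static inline uint8_t index_of (uint8_t a) { return (a >> OFFSET_BITS) & 3; }
    static inline uint8_t tag_of   (uint8_t a) { return a >> (OFFSET_BITS + INDEX_BITS); }

    /* Trace 1100, 1101, 0100, 1100 with this split: miss, hit, miss, miss.
       1101 now hits because it shares 1100's block (spatial locality), but
       0100 and 1100 still conflict with each other at index 10.            */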
Removing Conflict Misses with Fully-Associative Caches
Simulation #4: 8-byte, FA Cache

MEMORY: the same 16 bytes as before (0000-1111 → A ... Q).

    address = tag | offset   (XXX|X)

    CACHE (fully associative: 4 ways, and a block can live in any of them)
    V   tag   data      V   tag   data      V   tag   data      V   tag   data
    0   xxx   X | X     0   xxx   X | X     0   xxx   X | X     0   xxx   X | X

    LRU Pointer

load 1100   Miss
load 1101
load 0100
load 1100

Lookup:
• Index into $
• Check tags
• Check valid bits
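A matching C sketch of the fully-associative lookup (4 ways, 2-byte blocks). All names are mine, and the slide's LRU pointer is approximated here by a simple round-robin fill pointer:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS        4
    #define OFFSET_BITS 1                       /* 2-byte blocks                */

    struct way { bool valid; uint8_t tag; uint8_t data[2]; };

    static struct way cache_ways[WAYS];
    static int fill_ptr;                        /* stand-in for the LRU pointer */

    static bool fa_access(uint8_t addr) {
        uint8_t tag = addr >> OFFSET_BITS;      /* no index bits: 3-bit tag     */
        for (int w = 0; w < WAYS; w++)          /* compare every way's tag      */
            if (cache_ways[w].valid && cache_ways[w].tag == tag)
                return true;                    /* hit                          */
        cache_ways[fill_ptr].valid = true;      /* miss: fill the chosen way    */
        cache_ways[fill_ptr].tag   = tag;
        fill_ptr = (fill_ptr + 1) % WAYS;       /* advance the victim pointer   */
        return false;
    }

    /* On the trace 1100, 1101, 0100, 1100 this gives miss, hit, miss, hit:
       with no fixed index, 0100 no longer evicts 1100's block.              */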