


1. Today
 Cache memory organization and operation
 Performance impact of caches
  The memory mountain
  Rearranging loops to improve spatial locality
  Using blocking to improve temporal locality

CSci 2021: Machine Architecture and Organization, April 1st-3rd, 2020
Your instructor: Stephen McCamant
Based on slides originally by: Randy Bryant, Dave O'Hallaron (Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, Third Edition)

Example Memory Hierarchy
 L0: Registers. CPU registers hold words retrieved from the L1 cache.
 L1: L1 cache (SRAM). Holds cache lines retrieved from the L2 cache.
 L2: L2 cache (SRAM). Holds cache lines retrieved from the L3 cache.
 L3: L3 cache (SRAM). Holds cache lines retrieved from main memory.
 L4: Main memory (DRAM). Holds disk blocks retrieved from local disks.
 L5: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote servers.
 L6: Remote secondary storage (e.g., Web servers).
Moving up the hierarchy: smaller, faster, and costlier (per byte) storage devices. Moving down: larger, slower, and cheaper (per byte).

General Cache Concept
 A cache holds a subset of the blocks of the larger, slower memory below it.
 Data is copied between levels in block-sized transfer units.
 Memory is viewed as partitioned into "blocks."

Cache Memories
 Cache memories are small, fast SRAM-based memories managed automatically in hardware.
 They hold frequently accessed blocks of main memory.
 The CPU looks first for data in the cache.
 Typical system structure: the CPU chip contains the register file, ALU, and cache memory; a bus interface connects through the system bus and I/O bridge to the memory bus and main memory.

General Cache Organization (S, E, B)
 S = 2^s sets
 E = 2^e lines per set
 B = 2^b bytes per cache block (the data); each line also has a valid bit and a tag
 Cache size: C = S x E x B data bytes

2. Cache Read
 Locate the set (using the set index bits).
 Check whether any line in the set has a matching tag.
 Tag matches and the line is valid: hit.
 Locate the data starting at the block offset.
Address layout: t tag bits | s set index bits | b block offset bits.

Example: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size 8 bytes.
 Address of an int: t bits | 0…01 | 100
 The set index bits (0…01) select the set.
 valid? + tag match: assume yes = hit.
 The block offset (100) locates the data: the int (4 bytes) begins at that offset within the block.
 If the tag doesn't match: the old line is evicted and replaced.

Direct-Mapped Cache Simulation
t=1, s=2, b=1; M=16 bytes (4-bit addresses), B=2 bytes/block, S=4 sets, E=1 block/set.
Address trace (reads, one byte per read):
 0 [0000₂] miss
 1 [0001₂] hit
 7 [0111₂] miss
 8 [1000₂] miss
 0 [0000₂] miss
Final state: Set 0 holds M[0-1] with tag 0 (the block M[8-9] loaded by address 8 was evicted by the final read of address 0); Set 3 holds M[6-7]; Sets 1 and 2 are empty.

3. E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size 8 bytes.
 Address of a short int: t bits | 0…01 | 100
 The set index bits select the set; compare both tags in the set.
 valid? + match: yes = hit.
 The block offset locates the data: the short int (2 bytes) is at that offset within the matching block.
 No match: one line in the set is selected for eviction and replacement.
  Replacement policies: random, least recently used (LRU), …

What about writes?
 Multiple copies of the data exist: L1, L2, L3, main memory, disk.
 What to do on a write-hit?
  Write-through (write immediately to memory)
  Write-back (defer the write to memory until replacement of the line)
   Needs a dirty bit (is the line different from memory or not?)
 What to do on a write-miss?
  Write-allocate (load the block into the cache, update the line in the cache)
   Good if more writes to the location follow
  No-write-allocate (write straight to memory, do not load into the cache)
 Typical combinations:
  Write-through + no-write-allocate
  Write-back + write-allocate

2-Way Set Associative Cache Simulation
t=2, s=1, b=1; M=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 blocks/set.
Address trace (reads, one byte per read):
 0 [0000₂] miss
 1 [0001₂] hit
 7 [0111₂] miss
 8 [1000₂] miss
 0 [0000₂] hit
Final state: Set 0 holds M[0-1] (tag 00) and M[8-9] (tag 10); Set 1 holds M[6-7] (tag 01) and one empty line.

Intel Core i7 Cache Hierarchy
Processor package: each core (Core 0 … Core 3) has its own L1 and L2 caches; the L3 is shared by all cores.
 L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
 L2 unified cache: 256 KB, 8-way, access: 10 cycles
 L3 unified cache (shared by all cores): 8 MB, 16-way, access: 40-75 cycles
 Block size: 64 bytes for all caches

Cache Performance Metrics
 Miss Rate
  Fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate
  Typical numbers (in percentages): 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
 Hit Time
  Time to deliver a line in the cache to the processor
  Includes time to determine whether the line is in the cache
  Typical numbers: 4 clock cycles for L1, 10 clock cycles for L2
 Miss Penalty
  Additional time required because of a miss
  Typically 50-200 cycles for main memory (Trend: increasing!)
