General Cache Mechanics



1. General Cache Mechanics

Lecture outline: memory hierarchy, cache basics, locality, cache organization, cache-aware programming.

Block: the unit of data in cache and memory (a.k.a. a line).

Memory hierarchy:
  Cache: smaller, faster, more expensive. Stores a subset of memory blocks (lines). Data is moved between memory and cache in block units.
  Memory: larger, slower, cheaper. Partitioned into blocks (lines).

Cache hit:
  1. CPU requests data in block b.
  2. Cache hit: block b is in the cache.

Cache miss:
  1. CPU requests data in block b.
  2. Cache miss: block b is not in the cache.
  3. Cache eviction: evict a block to make room, possibly storing it back to memory.
  4. Cache fill: fetch block b from memory and store it in the cache.

Placement policy: where to put a block in the cache.
Replacement policy: which block to evict.
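The hit/miss/eviction flow above can be sketched in code. Below is a minimal simulation of a tiny direct-mapped cache that operates at block granularity; the 4-slot size and the names (cache_access, slot_t) are illustrative choices, not taken from the slides.

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_SLOTS 4                     /* tiny cache: 4 block slots */

    typedef struct {
        bool valid;                         /* does this slot hold a block? */
        int  block_id;                      /* which memory block it holds  */
    } slot_t;

    static slot_t cache[NUM_SLOTS];         /* starts empty (all zeros)     */

    /* Returns true on a hit; on a miss, evicts whatever occupies the slot
       (a write-back would happen here) and fills it with the new block. */
    bool cache_access(int block_id) {
        int index = block_id % NUM_SLOTS;   /* placement: Block ID mod S    */
        if (cache[index].valid && cache[index].block_id == block_id)
            return true;                    /* cache hit                    */
        cache[index].valid = true;          /* cache eviction + cache fill  */
        cache[index].block_id = block_id;
        return false;                       /* cache miss                   */
    }

    int main(void) {
        int requests[] = {14, 12, 14, 12};  /* prints: miss, miss, hit, hit */
        for (int i = 0; i < 4; i++)
            printf("block %d: %s\n", requests[i],
                   cache_access(requests[i]) ? "hit" : "miss");
        return 0;
    }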

2. Locality

Locality #1:

    sum = 0;
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;

Data:
  Temporal: sum is referenced in each iteration.
  Spatial: array a[] is accessed in a stride-1 pattern.
Instructions:
  Temporal: the loop is executed repeatedly.
  Spatial: instructions are executed in sequence.
Assessing locality in code is an important programming skill.

Locality #2: a row-major M x N 2D array in C. What is stored in memory? The rows are laid out one after another: a[0][0], a[0][1], a[0][2], a[0][3], a[1][0], a[1][1], ..., a[2][3].

    int sum_array_rows(int a[M][N]) {
        int sum = 0;
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

Access order: a[0][0], a[0][1], a[0][2], a[0][3], a[1][0], a[1][1], a[1][2], a[1][3], a[2][0], ... (stride 1).

Locality #3: the same row-major M x N 2D array.

    int sum_array_cols(int a[M][N]) {
        int sum = 0;
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < M; i++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

Access order: a[0][0], a[1][0], a[2][0], a[0][1], a[1][1], a[2][1], a[0][2], ... (stride N).

Locality #4:

    int sum_array_3d(int a[M][N][N]) {
        int sum = 0;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                for (int k = 0; k < M; k++) {
                    sum += a[k][i][j];
                }
            }
        }
        return sum;
    }

What is "wrong" with this code? How can it be fixed? (One fix is sketched below.)
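The question at the end of Locality #4 is left open on the slide. Under the usual reading (every element of a should be visited, but the innermost loop strides across the slowest-varying index), one standard fix is to reorder the loops so the fastest-varying index is innermost. The sketch below assumes that reading; it is not taken from the slides.

    /* Same traversal as sum_array_3d, but with the loops reordered so the
       innermost loop walks memory with stride 1 (for a[k][i][j] in row-major
       order, j varies fastest). M and N are assumed to be defined as on the
       slides. */
    int sum_array_3d_fixed(int a[M][N][N]) {
        int sum = 0;
        for (int k = 0; k < M; k++) {           /* slowest-varying index outermost */
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < N; j++) {   /* fastest-varying index innermost */
                    sum += a[k][i][j];
                }
            }
        }
        return sum;
    }

With this order, consecutive iterations touch adjacent elements, giving the same stride-1 access pattern as sum_array_rows above.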

3. Memory Hierarchy, Cache Performance Metrics, and Cache Organization

Memory hierarchy (why does it work?), from small, fast, power-hungry, and expensive at the top to large, slow, power-efficient, and cheap at the bottom:
  registers (explicitly program-controlled)
  L1 cache (SRAM, on-chip)
  L2 cache (SRAM, on-chip)
  L3 cache (SRAM, off-chip)
  main memory (DRAM)
  persistent storage (hard disk, flash, over the network, cloud, etc.)

Cache performance metrics (a sketch combining them follows at the end of this section):
  Miss rate: the fraction of memory accesses to data not in the cache (misses / accesses). Typically 3%-10% for L1; maybe < 1% for L2, depending on size, etc.
  Hit time: the time to find a block in the cache and deliver it to the processor. Typically 1-2 clock cycles for L1; 5-20 clock cycles for L2.
  Miss penalty: the additional time required on a cache miss, i.e. the main memory access time. Typically 50-200 cycles (trend: increasing!).

Cache organization: key points
(Note: address order is drawn differently from here on.)
  Block: the fixed-size unit of data in memory and cache. The address space is divided into fixed-size, aligned blocks; the block size is a power of 2. Example: block size = 8 bytes, so byte address 00010 010 is offset 010 within block 00010 (block 2, which spans byte addresses 00010000 through 00010111).
  A full byte address = Block ID | offset within block, where
    offset bits = log2(block size)
    Block ID bits = address bits - offset bits
  Placement policy: where in the cache should a given block be stored? (direct-mapped, set-associative)
  Replacement policy: what if there is no room in the cache for the requested data? (least recently used, most recently used)
  Write policy: when should writes update lower levels of the memory hierarchy? (write-back, write-through, write-allocate, no-write-allocate)

...remember withinSameBlock from the Pointers Lab?
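As a rough sketch of how the three metrics interact, the snippet below combines them into an average access time using the standard formula hit time + miss rate x miss penalty; the formula and the sample numbers are illustrative additions, not taken from the slides.

    #include <stdio.h>

    /* Average time per memory access, combining the three metrics above.
       The combining formula (hit_time + miss_rate * miss_penalty) and the
       sample numbers are illustrative, not from the slides. */
    double avg_access_time(double hit_time_cycles, double miss_rate,
                           double miss_penalty_cycles) {
        return hit_time_cycles + miss_rate * miss_penalty_cycles;
    }

    int main(void) {
        /* e.g. 2-cycle L1 hit time, 5% miss rate, 100-cycle miss penalty */
        printf("average access time: %.1f cycles\n",
               avg_access_time(2.0, 0.05, 100.0));   /* 2 + 0.05*100 = 7.0 */
        return 0;
    }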

4. Placement: Direct-Mapped

Mapping: index(Block ID) = Block ID mod S, where S = # slots in the cache (easy to compute when S is a power of 2: just take the low bits of the Block ID). Example: with S = 4 slots, block ID 1110 maps to index 10.

Tags resolve ambiguity. Many block IDs map to the same index, so each cache slot stores a tag next to its data: the tag is the Block ID bits not used for the index.

Address = Tag | Index | Offset. For an a-bit address:
  Offset: where within a block? b = log2(block size) bits.
  Index: which slot in the cache? s = log2(# cache slots) bits.
  Tag: disambiguates the slot's contents. (a - s - b) bits, i.e. Block ID bits minus index bits.
Example: full byte address 00010 010 = Block ID 00010, offset 010. (A sketch that computes this split follows.)

A puzzle. The cache starts empty. Access (address, hit/miss) stream: (10, miss), (11, hit), (12, miss). What could the block size be? Since 11 hits immediately after 10 misses, 10 and 11 share a block: block size >= 2 bytes. Since 12 misses even though 10's block was just loaded, 12 is not in the same block as 10: block size < 8 bytes.
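The Tag | Index | Offset split can be written down directly. Below is a small sketch that splits an address for a direct-mapped cache, assuming power-of-2 block size and slot count as on the slides; the struct and function names are illustrative.

    #include <stdio.h>
    #include <stdint.h>

    /* Split an address into tag / index / offset for a direct-mapped cache. */
    typedef struct { uint32_t tag, index, offset; } addr_parts;

    addr_parts split_address(uint32_t addr, uint32_t block_size, uint32_t num_slots) {
        uint32_t b = 0, s = 0;
        while ((1u << b) < block_size) b++;          /* b = log2(block size) */
        while ((1u << s) < num_slots)  s++;          /* s = log2(# slots)    */

        addr_parts p;
        p.offset = addr & (block_size - 1);          /* low b bits                   */
        p.index  = (addr >> b) & (num_slots - 1);    /* next s bits = Block ID mod S */
        p.tag    = addr >> (b + s);                  /* remaining high bits          */
        return p;
    }

    int main(void) {
        /* Slide example: byte address 00010 010 with 8-byte blocks, 4 slots. */
        addr_parts p = split_address(0x12, 8, 4);
        printf("tag=%u index=%u offset=%u\n", p.tag, p.index, p.offset);   /* 0, 2, 2 */
        return 0;
    }

For the slide's 8-byte blocks and 4 slots, address 00010 010 decodes to offset 010, index 10, and tag 0: the two index bits are simply Block ID mod S.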

5. Placement: Conflicts and Set-Associativity

Direct-mapping conflicts. With index(Block ID) = Block ID mod S and S = # slots in the cache, what happens when block accesses repeat the pattern 0010, 0110, 0010, 0110, 0010, ...? Both blocks map to the same index, so every access suffers a miss and evicts the cache line needed by the next access: a cache conflict.

Set-associative placement. Group slots into sets: one index per set of block slots, and a block may be stored in any slot within its set. For an 8-slot cache:
  1-way: 8 sets, 1 block each (direct-mapped)
  2-way: 4 sets, 2 blocks each
  4-way: 2 sets, 4 blocks each
  8-way: 1 set, 8 blocks (fully associative)

Replacement policy: if the set is full, which block should be replaced? Common: least recently used (LRU), but hardware usually implements "not most recently used".

Example: Tag, Index, Offset?
  (a) 4-bit address, direct-mapped, 4 slots, 2-byte blocks.
      tag bits ____, set index bits ____, block offset bits ____, index(1101) = ____
  (b) 16-bit address, E-way set-associative, 8 slots, 16-byte blocks.
      E = 1-way (S = 8 sets), E = 2-way (S = 4 sets), E = 4-way (S = 2 sets).
      For each: tag bits ____, set index bits ____, block offset bits ____, index(0x1833) = ____
  (A sketch that computes the 16-bit case follows.)
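The sketch below fills in example (b) by computing the bit widths and set index for each associativity; the printed values are computed here, not given on the slide.

    #include <stdio.h>
    #include <stdint.h>

    /* Tag / set-index / offset widths for example (b): 16-bit address,
       8 slots total, 16-byte blocks, varying associativity E. */
    static unsigned log2u(unsigned x) { unsigned n = 0; while ((1u << n) < x) n++; return n; }

    static void breakdown(uint16_t addr, unsigned slots, unsigned ways, unsigned block_size) {
        unsigned sets = slots / ways;                /* S = slots / E     */
        unsigned b = log2u(block_size);              /* block offset bits */
        unsigned s = log2u(sets);                    /* set index bits    */
        unsigned t = 16 - s - b;                     /* tag bits          */
        unsigned index = (addr >> b) & (sets - 1);
        printf("E=%u-way: tag=%2u, index=%u, offset=%u bits; index(0x%04X)=%u\n",
               ways, t, s, b, addr, index);
    }

    int main(void) {
        breakdown(0x1833, 8, 1, 16);   /* direct-mapped, 8 sets */
        breakdown(0x1833, 8, 2, 16);   /* 2-way,         4 sets */
        breakdown(0x1833, 8, 4, 16);   /* 4-way,         2 sets */
        return 0;
    }

For example (a), the same arithmetic (again computed here, not given on the slide) yields 1 offset bit, 2 set index bits, 1 tag bit, and index(1101) = 10.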

6. Replacement Policy and General Cache Organization (S, E, B)

Replacement policy: if a set is full, which block should be replaced? Common: least recently used (LRU), but hardware usually implements "not most recently used".

Another puzzle. The cache starts empty and uses LRU. Access (address, hit/miss) stream: (10, miss); (12, miss); (10, miss). Here 12's block replaced 10's block even though 12 is not in the same block as 10 and the cache was otherwise empty; with LRU that can only happen if each set holds a single line. What is the associativity of the cache? It must be a direct-mapped cache.

General cache organization (S, E, B):
  S sets, E lines per set ("E-way"), B bytes of data per block/line (powers of 2).
  Cache capacity: S x E x B data bytes.
  Address size: t + s + b bits.
  Each cache line holds a valid bit v, a tag, and B = 2^b bytes of data (the data block).

Cache read:
  Address of a byte in memory: t tag bits | s set index bits | b block offset bits, with S = 2^s sets and E = 2^e lines per set.
  1. Locate the set using the set index.
  2. Hit if any line in the set is valid and has a matching tag.
  3. On a hit, the requested data begins at the block offset within that line.

Direct-mapped cache practice: 12-bit address, 16 lines, 4-byte block size, direct-mapped.
Offset bits? Index bits? Tag bits?
Practice addresses: 0x354 and 0xA20. (A decoded breakdown is sketched after the table.)

Cache contents (Index | Tag | Valid | B0 B1 B2 B3):
  0 | 19 | 1 | 99 11 23 11
  1 | 15 | 0 | –  –  –  –
  2 | 1B | 1 | 00 02 04 08
  3 | 36 | 0 | –  –  –  –
  4 | 32 | 1 | 43 6D 8F 09
  5 | 0D | 1 | 36 72 F0 1D
  6 | 31 | 0 | –  –  –  –
  7 | 16 | 1 | 11 C2 DF 03
  8 | 24 | 1 | 3A 00 51 89
  9 | 2D | 0 | –  –  –  –
  A | 2D | 1 | 93 15 DA 3B
  B | 0B | 0 | –  –  –  –
  C | 12 | 0 | –  –  –  –
  D | 16 | 1 | 04 96 34 15
  E | 13 | 1 | 83 77 1B D3
  F | 14 | 0 | –  –  –  –
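A minimal decode of the two practice addresses under the parameters above: 12-bit address, 16 lines, 4-byte blocks, so 2 offset bits, 4 index bits, and 12 - 4 - 2 = 6 tag bits (this breakdown is computed here, not stated on the slide). Hit or miss is then determined by comparing the printed tag against the stored tag at the printed index in the table.

    #include <stdio.h>
    #include <stdint.h>

    /* Decode the practice addresses: 2 offset bits, 4 index bits, 6 tag bits. */
    int main(void) {
        uint16_t addrs[] = {0x354, 0xA20};
        for (int i = 0; i < 2; i++) {
            uint16_t a = addrs[i];
            unsigned offset = a & 0x3;           /* low 2 bits  */
            unsigned index  = (a >> 2) & 0xF;    /* next 4 bits */
            unsigned tag    = a >> 6;            /* high 6 bits */
            printf("0x%03X -> tag=0x%02X, index=0x%X, offset=%u\n", a, tag, index, offset);
        }
        return 0;
    }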
