CS 240 Stage 3: Abstractions for Practical Systems
Topics: Caching and the memory hierarchy; Operating systems and the process model; Virtual memory; Dynamic memory allocation; Victory lap
Memory Hierarchy: Cache
Outline: Memory hierarchy; Cache basics; Locality; Cache organization; Cache-aware programming
How does execution time grow with SIZE?

    int array[SIZE];
    fillArrayRandomly(array);
    int s = 0;
    for (int i = 0; i < 200000; i++) {
        for (int j = 0; j < SIZE; j++) {
            s += array[j];
        }
    }

[Plot axes: TIME versus SIZE.]
Reality
[Plot: measured Time (y-axis, 0 to 45) versus SIZE (x-axis, 0 to 9000).]
Processor-Memory Bottleneck
Processor performance doubled about every 18 months. Bus bandwidth evolved much slower.
Example:
  CPU registers and cache: bandwidth 256 bytes/cycle, latency 1 to a few cycles
  Main memory: bandwidth 2 bytes/cycle, latency 100 cycles
Solution: caches
Cache
English:
  n. a hidden storage space for provisions, weapons, or treasures
  v. to store away in hiding for future use
Computer Science:
  n. a computer memory with short access time used to store frequently or recently used instructions or data
  v. to store [data/instructions] temporarily for later quick retrieval
Also used more broadly in CS: software caches, file caches, etc.
General Cache Mechanics
Block: unit of data in cache and memory (a.k.a. line).
Cache: smaller, faster, more expensive. Stores a subset of memory blocks (lines). Example contents: blocks 8, 9, 14, 3.
Data is moved between cache and memory in block units.
Memory: larger, slower, cheaper. Partitioned into blocks (lines). Example: blocks 0 through 15.
Cache Hit
1. Request data in block b. (Request: 14)
2. Cache hit: block b is in cache. (Cache holds blocks 8, 9, 14, 3, so block 14 is delivered to the CPU.)
Memory: blocks 0 through 15.
Cache Miss
1. Request data in block b. (Request: 12)
2. Cache miss: block is not in cache. (Cache holds blocks 8, 9, 14, 3.)
3. Cache eviction: evict a block to make room, maybe store to memory. (Here block 9 is evicted.)
4. Cache fill: fetch block from memory, store in cache. (Block 12 takes the freed slot; cache now holds 8, 12, 14, 3.)
Placement policy: where to put block in cache.
Replacement policy: which block to evict.
Locality: why caches work
Programs tend to use data and instructions at addresses near or equal to those they have used recently.
Temporal locality: recently referenced items are likely to be referenced again in the near future.
Spatial locality: items with nearby addresses are likely to be referenced close together in time.
How do caches exploit temporal and spatial locality?
Locality #1
What is stored in memory?

    sum = 0;
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;

Data:
  Temporal: sum referenced in each iteration
  Spatial: array a[] accessed in stride-1 pattern
Instructions:
  Temporal: execute loop repeatedly
  Spatial: execute instructions in sequence
Assessing locality in code is an important programming skill.
Locality #2
Row-major M x N 2D array in C:

    a[0][0] a[0][1] a[0][2] a[0][3]
    a[1][0] a[1][1] a[1][2] a[1][3]
    a[2][0] a[2][1] a[2][2] a[2][3]

    int sum_array_rows(int a[M][N]) {
        int sum = 0;
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

Access order: 1: a[0][0], 2: a[0][1], 3: a[0][2], 4: a[0][3], 5: a[1][0], 6: a[1][1], 7: a[1][2], 8: a[1][3], 9: a[2][0], 10: a[2][1], 11: a[2][2], 12: a[2][3]
Access pattern: stride 1
Locality #3
Row-major M x N 2D array in C:

    a[0][0] a[0][1] a[0][2] a[0][3] …
    a[1][0] a[1][1] a[1][2] a[1][3]
    a[2][0] a[2][1] a[2][2] a[2][3]
    …

    int sum_array_cols(int a[M][N]) {
        int sum = 0;
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < M; i++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

Access order: 1: a[0][0], 2: a[1][0], 3: a[2][0], 4: a[0][1], 5: a[1][1], 6: a[2][1], 7: a[0][2], 8: a[1][2], 9: a[2][2], 10: a[0][3], 11: a[1][3], 12: a[2][3]
Access pattern: stride N
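The stride labels on the two traversal slides follow from how a row-major array is laid out. The sketch below is only illustrative: the helper names `row_major_offset`, `row_traversal_stride`, and `col_traversal_stride` are hypothetical, and the column count `n` is passed as a parameter rather than using the slide's `N`.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in elements, of a[i][j] from a[0][0] in a row-major
   array with n columns. (Hypothetical helper for illustration.) */
size_t row_major_offset(size_t i, size_t j, size_t n) {
    assert(n > 0 && j < n);
    return i * n + j;
}

/* Row-wise traversal: the inner loop advances j by 1. */
size_t row_traversal_stride(size_t n) {
    return row_major_offset(0, 1, n) - row_major_offset(0, 0, n);
}

/* Column-wise traversal: the inner loop advances i by 1. */
size_t col_traversal_stride(size_t n) {
    return row_major_offset(1, 0, n) - row_major_offset(0, 0, n);
}
```

For the 3 x 4 picture on the slides, `row_traversal_stride(4)` is 1 element and `col_traversal_stride(4)` is 4 elements, which matches the "stride 1" and "stride N" labels.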
Locality #4

    int sum_array_3d(int a[M][N][N]) {
        int sum = 0;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                for (int k = 0; k < M; k++) {
                    sum += a[k][i][j];
                }
            }
        }
        return sum;
    }

What is "wrong" with this code?
How can it be fixed?
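One possible answer, offered as a sketch rather than the official solution: the innermost loop varies k, the leftmost index, so consecutive accesses are N*N elements apart. Reordering the loops so that the rightmost index varies fastest gives a stride-1 pattern. The sizes `M = 2` and `N = 3` and the function names below are assumptions chosen only to keep the example small.

```c
enum { M = 2, N = 3 };  /* assumed small sizes, for illustration only */

/* Loop order from the slide: the inner loop strides by N*N elements. */
int sum_array_3d_original(int a[M][N][N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

/* Reordered: k outermost, j innermost, so the inner loop
   walks adjacent elements (stride 1). */
int sum_array_3d_reordered(int a[M][N][N]) {
    int sum = 0;
    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}
```

Both functions compute the same sum; only the order in which memory is touched changes.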
Cost of Cache Misses
Miss cost could be 100 × hit cost.
99% hits could be twice as good as 97%. How?
Assume cache hit time of 1 cycle, miss penalty of 100 cycles.
Mean access time (from hit/miss rates):
  97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
  99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
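The arithmetic above can be written as a small function. This is a minimal sketch assuming the single-level model on the slide (every access pays the hit time, and misses additionally pay the penalty); the name `mean_access_time` is made up for this example.

```c
#include <assert.h>

/* Mean access time in cycles: every access pays the hit time,
   and the fraction of accesses that miss also pays the miss penalty. */
double mean_access_time(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With a 1-cycle hit time and a 100-cycle penalty, a miss rate of 0.03 gives about 4 cycles and a miss rate of 0.01 gives about 2 cycles, as on the slide.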
Cache Performance Metrics
Miss Rate: fraction of memory accesses to data not in cache (misses / accesses).
  Typically: 3% - 10% for L1; maybe < 1% for L2, depending on size, etc.
Hit Time: time to find and deliver a block in the cache to the processor.
  Typically: 1 - 2 clock cycles for L1; 5 - 20 clock cycles for L2.
Miss Penalty: additional time required on cache miss = main memory access time.
  Typically: 50 - 200 cycles for L2 (trend: increasing!)
Memory Hierarchy
Why does it work?
From small, fast, power-hungry, expensive (top) to large, slow, power-efficient, cheap (bottom):
  registers (explicitly program-controlled)
  L1 cache (SRAM, on-chip)
  L2 cache (SRAM, on-chip)
  L3 cache (SRAM, off-chip)
  main memory (DRAM)
  persistent storage (hard disk, flash, over network, cloud, etc.)
Cache Organization: Key Points
Block: fixed-size unit of data in memory/cache.
Placement Policy: where in the cache should a given block be stored?
  direct-mapped, set associative
Replacement Policy: what if there is no room in the cache for requested data?
  least recently used, most recently used
Write Policy: when should writes update lower levels of memory hierarchy?
  write back, write through, write allocate, no write allocate
Memory Blocks
Note: drawing address order differently from here on!
Divide address space into fixed-size aligned blocks (block size is a power of 2).
Example: block size = 8
  block 0 starts at byte address 00000 000
  block 1 starts at byte address 00001 000
  block 2 covers byte addresses 00010 000 through 00010 111
  block 3 starts at byte address 00011 000
  ...
Full byte address 00010 010 = Block ID (00010) followed by offset within block (010).
  Block ID bits = address bits - offset bits
  Offset bits = log2(block size)
Remember withinSameBlock? (Pointers Lab)
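The split of a byte address into Block ID and offset can be sketched in C as follows. The function names are hypothetical, and `same_block` is not claimed to match the Pointers Lab's withinSameBlock signature; it only illustrates the same idea. The code assumes the block size is a power of 2.

```c
#include <assert.h>
#include <stdint.h>

/* Block ID: the address bits above the offset bits.
   Assumes block_size is a nonzero power of 2. */
uintptr_t block_id(uintptr_t addr, uintptr_t block_size) {
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return addr / block_size;
}

/* Offset within block: the low log2(block_size) address bits. */
uintptr_t block_offset(uintptr_t addr, uintptr_t block_size) {
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return addr & (block_size - 1);
}

/* Nonzero if two byte addresses fall in the same aligned block. */
int same_block(uintptr_t a, uintptr_t b, uintptr_t block_size) {
    return block_id(a, block_size) == block_id(b, block_size);
}
```

For the slide's example address 00010 010 (0x12) with 8-byte blocks, `block_id` returns 2 and `block_offset` returns 2.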
Placement Policy
Mapping: index(Block ID) = ???
Memory: large, fixed number of block slots (Block IDs 0000 through 1111).
Cache: small, fixed number of block slots (S = # slots = 4; indices 00, 01, 10, 11).
Placement: Direct-Mapped
Mapping: index(Block ID) = Block ID mod S (easy for power-of-2 block sizes...)
S = # slots = 4; cache indices 00, 01, 10, 11.
Memory Block IDs 0000 through 1111: the low two bits of each Block ID are its cache index (for example, 01 10 maps to index 10).
Placement: mapping ambiguity
Mapping: index(Block ID) = Block ID mod S
S = # slots = 4; cache indices 00, 01, 10, 11.
Memory Block IDs 0000 through 1111: Block IDs 0010, 0110, 1010, and 1110 all map to index 10.
Which block is in slot 2?
Placement: Tags resolve ambiguity
Mapping: index(Block ID) = Block ID mod S
Each cache slot stores a Tag alongside its Data.
Tag: Block ID bits not used for index.
Example cache contents (index: tag): 00: 00, 01: 11, 10: 01, 11: 01.
Memory Block IDs 0000 through 1111: the high two bits are the tag, the low two bits are the index.
Address = Tag, Index, Offset
a-bit address = Tag | Index | Offset (a = # address bits)
  Tag: (a-s-b) bits = Block ID bits - Index bits. Disambiguates slot contents.
  Index: s bits = log2(# cache slots). What slot in the cache?
  Offset: b bits = log2(block size). Where within a block?
Block ID = Tag and Index together = Address bits - Offset bits.
Example full byte address: 00010 010 (Block ID 00010, offset within block 010).
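The tag/index/offset split can be sketched with shifts and masks. This is an illustrative helper, not code from the course: the struct `addr_fields`, the function `split_address`, and the choice of 32-bit addresses are assumptions. The parameters `s_bits` and `b_bits` correspond to s and b above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three address fields. */
typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} addr_fields;

/* Split an address into tag, index (s_bits wide), and offset (b_bits wide). */
addr_fields split_address(uint32_t addr, unsigned s_bits, unsigned b_bits) {
    assert(s_bits + b_bits < 32);
    addr_fields f;
    f.offset = addr & ((1u << b_bits) - 1);             /* low b bits */
    f.index  = (addr >> b_bits) & ((1u << s_bits) - 1); /* next s bits */
    f.tag    = addr >> (b_bits + s_bits);               /* remaining high bits */
    return f;
}
```

As a usage example under assumed parameters (4 slots, 8-byte blocks, so s = 2 and b = 3), the address 10110110 (0xB6) splits into tag 101, index 10, and offset 110.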
Placement: Direct-Mapped
Why not this mapping? index(Block ID) = Block ID / S (still easy for power-of-2 block sizes...)
Cache indices 00, 01, 10, 11.
Memory Block IDs 0000 through 1111: under this mapping the high two bits of each Block ID would be its cache index, so adjacent blocks (for example, 0100 through 0111) would all share one slot.
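A puzzle.
Cache starts empty.
Access (address, hit/miss) stream: (10, miss), (11, hit), (12, miss)
What could the block size be?
  block size >= 2 bytes (11 hits, so addresses 10 and 11 share a block)
  block size < 8 bytes (12 misses, so addresses 10 and 12 are in different blocks)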
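The puzzle can be checked mechanically. The sketch below assumes aligned blocks and that nothing is evicted during this three-access stream; the function name `consistent_with_stream` is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* True if a cache that starts empty, using the given block size,
   would produce miss, hit, miss for byte addresses 10, 11, 12.
   Assumes aligned blocks and no evictions during the stream. */
bool consistent_with_stream(unsigned block_size) {
    assert(block_size > 0);
    unsigned b10 = 10 / block_size;  /* Block ID of address 10 */
    unsigned b11 = 11 / block_size;
    unsigned b12 = 12 / block_size;
    bool hit11 = (b11 == b10);       /* 11 hits only if 10 already loaded its block */
    bool hit12 = (b12 == b10);       /* if 11 hit, block b10 is the only one loaded */
    return hit11 && !hit12;
}
```

Trying power-of-2 sizes, 2 and 4 are consistent with the stream, while 1, 8, and 16 are not.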
Placement: direct mapping conflicts
What happens when accessing Block IDs in the repeated pattern 0010, 0110, 0010, 0110, 0010...?
Block IDs 0010 and 0110 both map to index 10: cache conflict.
Every access suffers a miss, evicts cache line needed by next access.
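A tiny simulation makes the conflict concrete. This is a sketch under simplifying assumptions: it tracks Block IDs only (no data, no writes), stores the whole Block ID in each slot where hardware would store just the tag, and uses the hypothetical names `dm_cache`, `dm_access`, and `dm_count_misses`.

```c
#include <stdbool.h>

enum { SLOTS = 4 };  /* S = 4 slots, as in the slides */

/* Hypothetical direct-mapped cache model. */
typedef struct {
    bool valid[SLOTS];
    unsigned block[SLOTS];  /* Block ID currently held in each slot */
} dm_cache;

/* Access one block. Returns true on a hit; on a miss, the block
   replaces whatever was in its slot (eviction plus fill). */
bool dm_access(dm_cache *c, unsigned block_id) {
    unsigned idx = block_id % SLOTS;  /* index = Block ID mod S */
    if (c->valid[idx] && c->block[idx] == block_id) {
        return true;
    }
    c->valid[idx] = true;
    c->block[idx] = block_id;
    return false;
}

/* Count misses for a stream of Block IDs, starting from an empty cache. */
unsigned dm_count_misses(const unsigned *stream, unsigned n) {
    dm_cache c = {0};
    unsigned misses = 0;
    for (unsigned i = 0; i < n; i++) {
        if (!dm_access(&c, stream[i])) {
            misses++;
        }
    }
    return misses;
}
```

Feeding it the stream 2, 6, 2, 6, 2 (Block IDs 0010 and 0110) yields a miss on every access, while a stream such as 2, 3, 2, 3 misses only on the first access to each block.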
Placement: Set Associative
S = # sets in cache. One index per set of block slots. Store block in any slot within set.
Mapping: index(Block ID) = Block ID mod S
For a cache with 8 block slots:
  1-way: 8 sets, 1 block each (direct mapped)
  2-way: 4 sets, 2 blocks each
  4-way: 2 sets, 4 blocks each
  8-way: 1 set, 8 blocks (fully associative)
Replacement policy: if set is full, what block should be replaced?
Common: least recently used (LRU), but hardware usually implements "not most recently used".
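To see how associativity helps, here is a sketch of a 2-way set-associative cache with exact LRU replacement. The sizes (4 slots arranged as 2 sets of 2 ways), the timestamp-based LRU bookkeeping, and the names `sa_cache`, `sa_access`, and `sa_count_misses` are all assumptions for illustration; real hardware usually approximates LRU, as noted above.

```c
#include <stdbool.h>

enum { SETS = 2, WAYS = 2 };  /* assumed: 4 slots as 2 sets of 2 ways */

/* Hypothetical set-associative cache model (Block IDs only, no data). */
typedef struct {
    bool valid[SETS][WAYS];
    unsigned block[SETS][WAYS];      /* Block ID held in each way */
    unsigned last_used[SETS][WAYS];  /* time of most recent access */
    unsigned clock;                  /* counts accesses */
} sa_cache;

/* Access one block. Returns true on a hit. On a miss, fills an empty
   way if one exists, otherwise evicts the least recently used way. */
bool sa_access(sa_cache *c, unsigned block_id) {
    unsigned set = block_id % SETS;  /* index = Block ID mod (# sets) */
    c->clock++;
    for (unsigned w = 0; w < WAYS; w++) {
        if (c->valid[set][w] && c->block[set][w] == block_id) {
            c->last_used[set][w] = c->clock;
            return true;
        }
    }
    unsigned victim = 0;
    for (unsigned w = 0; w < WAYS; w++) {
        if (!c->valid[set][w]) { victim = w; break; }
        if (c->last_used[set][w] < c->last_used[set][victim]) victim = w;
    }
    c->valid[set][victim] = true;
    c->block[set][victim] = block_id;
    c->last_used[set][victim] = c->clock;
    return false;
}

/* Count misses for a stream of Block IDs, starting from an empty cache. */
unsigned sa_count_misses(const unsigned *stream, unsigned n) {
    sa_cache c = {0};
    unsigned misses = 0;
    for (unsigned i = 0; i < n; i++) {
        if (!sa_access(&c, stream[i])) misses++;
    }
    return misses;
}
```

With the same total of 4 slots, the alternating stream 2, 6, 2, 6, 2 now misses only twice, because both blocks fit in one set.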
Example: Tag, Index, Offset?
4-bit address: Tag | Index | Offset
Direct-mapped, 4 slots, 2-byte blocks
  tag bits ____
  set index bits ____
  block offset bits ____
  index(1101) = ____