Time for Compressed Memory Hierarchies
Biswabandan Panda
WOCS 2018, December 8th, 2018
Memory in Single-core Systems
Figure: core, L1, L2, L3, and DRAM controller.
Access latency: L1 4 cycles, L2 12 cycles, L3 30 cycles, DRAM 200 cycles.
Latency wall: DRAM was already critical for performance.
Memory in Multi-core Systems
Figure: eight cores sharing one DRAM controller, with latency labels of 200 cycles and 800 cycles.
Core count doubles every 2 years; DRAM bandwidth doubles every 4 years.
Memory wall = latency wall + bandwidth wall.
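The two growth rates on this slide can be turned into a small piece of arithmetic. The sketch below assumes idealized exponential growth starting from a normalized value of 1.0 at year 0; the function name is hypothetical and the output is not measured data.

```python
def per_core_bandwidth_ratio(years: float) -> float:
    """Relative DRAM bandwidth per core after `years` (1.0 at year 0)."""
    cores = 2 ** (years / 2)       # core count doubles every 2 years
    bandwidth = 2 ** (years / 4)   # DRAM bandwidth doubles every 4 years
    return bandwidth / cores

for years in (0, 4, 8):
    print(years, per_core_bandwidth_ratio(years))
```

Under these assumptions the bandwidth available to each core halves every 4 years, which is one way to read the bandwidth wall.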
Solution
Cache compression and cache reuse at the L3.
Cache Compression: 10,000-Foot View
Figure: cores, L1, L2, L3, and DRAM controller.
Increases cache capacity without increasing cache size.
Exploits redundancy in data patterns.
Examples – Data Patterns (4-byte values)
A[i][j]=0: 0x00000000 0x00000000 0x00000000 0x00000000
A[i][j]=constant: 0x333333FF 0x333333FF 0x333333FF 0x333333FF
Narrow values (small value in large data type): 0x00000001 0x00000002 0x00000003 0x00000004
*ptr: 0x888888C0 0x888888C8 0x888888D0 0x888888D8
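The four patterns on this slide can be recognized with a few comparisons. The sketch below is only an illustration: the function name, the one-byte threshold for narrow values, and the 24-bit shared prefix are assumptions, not taken from any particular compression scheme.

```python
def classify_words(words):
    """Label a list of 32-bit words with the first matching pattern."""
    if all(w == 0 for w in words):
        return "zero"
    if all(w == words[0] for w in words):
        return "repeated constant"
    if all(w < 256 for w in words):          # assumed: value fits in one byte
        return "narrow values"
    if len({w >> 8 for w in words}) == 1:    # assumed: upper 24 bits identical
        return "shared upper bits"
    return "no simple pattern"

print(classify_words([0x00000000] * 4))
print(classify_words([0x333333FF] * 4))
print(classify_words([0x00000001, 0x00000002, 0x00000003, 0x00000004]))
print(classify_words([0x888888C0, 0x888888C8, 0x888888D0, 0x888888D8]))
```

The pointer example falls into the last recognized category because consecutive pointers differ only in their low-order bits.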
Cache Compression
Fixed block size (64B): B0.
Compressed block size (< 64B): B0.
Compaction
Compacts multiple contiguous compressed blocks of similar compressibility.
Example: B0, B1, B2, and B3, each compressed to 16B, are placed together in one 64B entry.
DCC [MICRO ’13] and SCC [MICRO ’14].
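Compressed Cache Layout
Conventional address: TAG | SET INDEX | OFFSET, with the set index starting at bit 6 and the offset at bit 0.
Super-block address: SUPER BLOCK TAG | SET INDEX | BLK-ID | OFFSET, with the set index starting at bit 8, BLK-ID at bit 6, and the offset at bit 0.
Super-block indexing: B0, B1, B2, and B3 all map to SET 0.
Conventional indexing: B0 maps to SET 0, B1 to SET 1, B2 to SET 2, and B3 to SET 3.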
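The effect of moving the set index above the BLK-ID bits can be checked with a few shifts and masks. This is a minimal sketch: the offset and BLK-ID positions follow the bit labels on the slide, while `SET_BITS`, the function names, and the example address are hypothetical.

```python
OFFSET_BITS = 6    # 64B blocks: offset in bits 0-5 (from the slide)
BLK_ID_BITS = 2    # four blocks per super block: bits 6-7 (from the slide)
SET_BITS = 10      # hypothetical: a cache with 1024 sets

def conventional_set(addr: int) -> int:
    """Set index taken directly above the block offset."""
    return (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)

def super_block_fields(addr: int):
    """Split an address into (tag, set index, BLK-ID) for super-block indexing."""
    blk_id = (addr >> OFFSET_BITS) & ((1 << BLK_ID_BITS) - 1)
    set_index = (addr >> (OFFSET_BITS + BLK_ID_BITS)) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + BLK_ID_BITS + SET_BITS)
    return tag, set_index, blk_id

base = 0x12345600  # hypothetical address, aligned to a 256B super block
blocks = [base + 64 * i for i in range(4)]  # B0..B3
print([conventional_set(a) for a in blocks])       # four consecutive sets
print([super_block_fields(a)[1:] for a in blocks]) # same set, BLK-ID 0..3
```

With super-block indexing the four neighboring blocks land in the same set and share one tag, which is what lets a single tag entry describe several compressed blocks.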
Compressed Cache Layout: YACC [Sardashti et al., TACO ’16]
Each tag array entry holds the TAG, two encoding bits, and block IDs; the matching data array entry holds the blocks.
Bits 0 0: IDs - - - ID0; data B0.
Bits 0 1: IDs ID2 ID0; data B2 B0.
Bits 1 X: IDs ID3 ID2 ID1 ID0; data B3 B2 B1 B0.
Observations
Figure: four cases, Block-I to Block-IV, showing compressed blocks B0 to B3 of 16B, 32B, 48B, and 64B, with un-occupied space highlighted.
Observations
Compression and compaction techniques are oblivious to each other.
Need for coordination.
Figure: steps ❶ and ❷, compression and compaction, applied to blocks B0 and B1.
DISH: DICTIONARY SHARING BASED CACHE COMPRESSION [MICRO ’16]
Scheme-I (Data Content Locality)
64B block as sixteen 4B chunks (chunk 0 to 15): A B A Z C A A D A E A G C H G H
Dictionary of 8 4B entries (index 0 to 7), 32B: A B Z C D E G H
16 3-bit pointers, one per 4B chunk, 6B: 0 1 0 2 3 0 0 4 0 5 0 6 3 7 6 7
Compressed block: 32B + 6B = 38B.
Shared dictionary (32B) + B0’s PTRs, B1’s PTRs, B2’s PTRs, and B3’s PTRs (6B each) = 56B compressed block.
Four 64B blocks inside one 64B block.
Compression latency of 24 cycles, not in the critical path.
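The Scheme-I example can be reproduced in a few lines. This is a minimal sketch, not the DISH hardware design: the function names are hypothetical, single letters stand in for 4B chunk values, and a block is simply reported as incompressible when the dictionary would overflow.

```python
def compress_block(chunks, dictionary=None, max_entries=8):
    """Return (dictionary, pointers), or None if the dictionary would overflow."""
    dictionary = list(dictionary) if dictionary is not None else []
    pointers = []
    for chunk in chunks:
        if chunk not in dictionary:
            if len(dictionary) == max_entries:
                return None  # block does not fit this dictionary
            dictionary.append(chunk)
        pointers.append(dictionary.index(chunk))
    return dictionary, pointers

def compressed_size_bytes(num_blocks, entries=8, entry_bytes=4,
                          chunks_per_block=16, pointer_bits=3):
    """Size of one shared dictionary plus the pointers of `num_blocks` blocks."""
    pointer_bytes = chunks_per_block * pointer_bits // 8
    return entries * entry_bytes + num_blocks * pointer_bytes

block = list("ABAZCAADAEAGCHGH")  # the sixteen chunks shown on the slide
dictionary, pointers = compress_block(block)
print(dictionary)
print(pointers)
print(compressed_size_bytes(1), compressed_size_bytes(4))
```

Passing the returned dictionary back in as the `dictionary` argument for a neighboring block models sharing: the second block adds only its own 6B of pointers as long as its chunks fit in the remaining entries.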
Scheme-II (Upper-bits Locality)
Figure: a 64B block of sixteen 4B chunks (chunk 0 to 15), such as 0x00000021 and 0x00000030, each split into its upper 28 bits (for example 0x0000002) and a 4-bit offset.
Dictionary of 4 28-bit entries: 14B.
16 2-bit pointers and 16 4-bit offsets per block: 12B.
Shared dictionary (14B) + B0’s, B1’s, B2’s, and B3’s PTRs & offsets (12B each) = 62B compressed block.
Four 64B blocks inside one 64B block.
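The sizes on this slide follow from splitting each 4B chunk into its upper 28 bits and lower 4 bits. The sketch below assumes that split and uses hypothetical function names; apart from 0x00000021 and 0x00000030, the example values are made up for illustration.

```python
def compress_upper_bits(words, dictionary=None, max_entries=4):
    """Return (dictionary, pointers, offsets), or None on dictionary overflow."""
    dictionary = list(dictionary) if dictionary is not None else []
    pointers, offsets = [], []
    for word in words:
        upper, offset = word >> 4, word & 0xF  # upper 28 bits, lower 4 bits
        if upper not in dictionary:
            if len(dictionary) == max_entries:
                return None
            dictionary.append(upper)
        pointers.append(dictionary.index(upper))
        offsets.append(offset)
    return dictionary, pointers, offsets

def scheme2_size_bytes(num_blocks, entries=4, entry_bits=28,
                       chunks_per_block=16, pointer_bits=2, offset_bits=4):
    """Shared dictionary plus per-block pointers and offsets, in bytes."""
    dictionary_bytes = entries * entry_bits // 8
    per_block_bytes = chunks_per_block * (pointer_bits + offset_bits) // 8
    return dictionary_bytes + num_blocks * per_block_bytes

# Hypothetical chunk values; only the first two appear on the slide.
words = [0x00000021, 0x00000030, 0x00000002, 0x00000032]
print(compress_upper_bits(words))
print(scheme2_size_bytes(4))
```

The size helper reproduces the slide's arithmetic: 14B for the dictionary and 12B per block, for 62B when four blocks share one dictionary.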
Decompression
Compressed block: shared dictionary (32B) + B0’s PTRs, B1’s PTRs, B2’s PTRs, and B3’s PTRs (6B each).
Dictionary (index 0 to 7): A B Z C D E G H
Pointers (chunk 0 to 15): 0 1 0 2 3 0 0 4 0 5 0 6 3 7 6 7
Decompressed block (chunk 0 to 15): A B A Z C A A D A E A G C H G H
One-cycle decompression latency.
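Decompression in this example is one dictionary lookup per chunk. The sketch below uses the slide's dictionary and pointers; the function name and the second block's pointers are hypothetical.

```python
def decompress_block(dictionary, pointers):
    """Rebuild a block by looking up each pointer in the shared dictionary."""
    return [dictionary[p] for p in pointers]

shared_dictionary = list("ABZCDEGH")
b0_pointers = [0, 1, 0, 2, 3, 0, 0, 4, 0, 5, 0, 6, 3, 7, 6, 7]
b1_pointers = [7] * 16  # hypothetical second block sharing the dictionary

print("".join(decompress_block(shared_dictionary, b0_pointers)))
print("".join(decompress_block(shared_dictionary, b1_pointers)))
```

Each lookup is independent of the others, so all sixteen chunks could in principle be read in parallel.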
DISH Layout
Each tag array entry holds the TAG, encoding bits, and block IDs; the matching data array entry holds the blocks.
Bit 0: IDs - - - ID0; data B0. One uncompressed block.
Bits 1 0 (Scheme-I/Scheme-II): IDs ID3 ID2 ID1 ID0; data B3 B2 B1 B0. One or more compressed blocks within a block.
No additional metadata.
DISH in Action: Compression
Figure: steps ❶ to ❹ of an L3 miss. Components: L2, L3, compressor, and DRAM controller. The tag entry holds TAG, bits 1 0, and ID3 ID2 ID1 ID0; the data entry holds DICT, B0’s PTR, B1’s PTR, and B3’s PTR. Address fields: TAG | SET INDEX | BLK-ID | OFFSET, with boundaries at bits 8, 6, and 0.
DISH in Action: Decompression
Figure: steps ⓿ to ❹ of an L3 hit. Components: core, L2, L3, and decompressor. The tag entry holds TAG, bits 1 0, and ID3 ID2 ID1 ID0; the data entry holds B0 to B3, and B0 is decompressed.
Compression Ratio
Figure: bar chart of compression ratio (higher the better; y-axis from 1 to 3) for CPACK+Z [TVLSI '10], BDI [PACT '12], and DISH, with a 2.3X label.
Speedup
Figure: bar chart of IPC speedup (higher the better; y-axis from 0.9 to 1.2). Benchmarks: astar, bwaves, bzip2, cactusADM, GemsFDTD, gromacs, h264ref, hmmer, lbm, leslie3d, GRAPHS, GMEAN SPEC. Series: CPACK+Z, BDI, DISH, 2X Baseline, 4X Baseline.
12.4% improvement over an uncompressed cache.
Summary of Contributions
Case for compaction-aware compression.
Inter-block data localities.
Leverages the compressed cache layout.
What Else Can Be Done with the Layout?
Cache reuse: reuse detection at the L3.
Reuse Cache [MICRO ’13]
Conventional LLC: 8MB (tags and data).
Reuse LLC: 4MB tag + 1MB data.
Reuse Cache: 1st Access
Figure: L2, Reuse LLC (tag and data arrays), and DRAM.
Only the tag entry is allocated.
Reuse Cache: 2nd Access
Figure: L2, Reuse LLC (tag and data arrays), and DRAM.
Tag hit: the data entry is allocated, because the block is reused.
Decoupled tag/data array.
Highly efficient.
4X more tag entries.
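The first-access and second-access behavior of the reuse cache can be modeled with two containers. This is a minimal sketch that assumes unlimited capacity and no replacement policy; the class name, method names, and return strings are hypothetical.

```python
class ReuseCacheSketch:
    """Toy model of a decoupled tag/data LLC with no capacity limit."""

    def __init__(self):
        self.tags = set()   # tag array: every block seen at least once
        self.data = {}      # data array: only blocks that were reused

    def access(self, addr, fetch_from_dram):
        if addr in self.data:
            return "data hit"
        block = fetch_from_dram(addr)  # data is not in the LLC yet
        if addr in self.tags:
            self.data[addr] = block    # second access: block shows reuse
            return "tag hit, data entry allocated"
        self.tags.add(addr)            # first access: allocate the tag only
        return "miss, tag entry allocated"

llc = ReuseCacheSketch()
for _ in range(3):
    print(llc.access(0x1000, lambda addr: b"\x00" * 64))
```

A block touched only once never occupies a data entry in this model, which is why the tag array can track many more blocks than the data array stores.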
The Question
Can we detect the reusability of LLC blocks without 4X more tags in conventional and compressed caches?
Can we also use the existing layout of compressed caches for reuse?
The Answer: Our Contribution
Synergistic cache layout for reuse and compression, combining cache reuse and cache compression.
Figure: Tag A0 and Data A0.
SRC: SYNERGISTIC CACHE LAYOUT FOR REUSE AND COMPRESSION [PACT ’18]