  1. MORC: A Manycore-ORiented Compressed Cache. Tri M. Nguyen, David Wentzlaff. 12/7/2015

  2. Architectures are moving toward manycore: Tilera (2007): 64-72 cores; Intel MIC (2015): 288 threads; NVIDIA GPGPUs (2015): 3072 threads. Thread aggregation is also increasing ◦ cloud computing ◦ massive warehouse-scale data centers

  3. Motivation: off-chip bandwidth scalability. Throughput = min(compute_avail, bandwidth_avail), and throughput is already bandwidth-bound ◦ Assumption: 1000 threads at 1 GB/s per thread ◦ Demand: 1000 GB/s ◦ Supply: 102.4 GB/s (four DDR4 channels) ◦ Oversubscription ratio: ~10x. The bandwidth wall will stall practical manycore scaling ◦ high pin-count packaging is uneconomical ◦ pin size is hard to shrink even in high-cost chips ◦ pin frequency does not scale well
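The oversubscription figure on this slide can be reproduced with a quick calculation. The per-channel bandwidth of 25.6 GB/s is an assumption chosen to be consistent with the slide's 102.4 GB/s total for four DDR4 channels:

```python
# Back-of-envelope check of the slide's bandwidth-wall arithmetic.
threads = 1000
demand_per_thread_gbs = 1.0          # GB/s per thread (slide's assumption)
demand_gbs = threads * demand_per_thread_gbs

ddr4_channels = 4
channel_bw_gbs = 25.6                # assumed GB/s per DDR4 channel
supply_gbs = ddr4_channels * channel_bw_gbs   # 102.4 GB/s total

oversubscription = demand_gbs / supply_gbs
print(f"demand={demand_gbs} GB/s, supply={supply_gbs} GB/s, "
      f"oversubscribed {oversubscription:.1f}x")  # ~9.8x, i.e. roughly 10x
```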

  5. Compressing the LLC as a solution. More on-chip cache correlates with higher performance, so a more effective cache through compression correlates with performance. MORC ◦ Manycore-ORiented Compressed cache ◦ compresses the LLC (last-level cache) to reduce off-chip misses. Insight ◦ target throughput over single-threaded latency ◦ this affords expensive stream-based compression algorithms

  6. Outline ◦ Stream compression is great! ◦ …but is hard with set-based caches ◦ …and is not for single-threaded performance ◦ Stream compression with log-based caches ◦ Architecture of a log-based compressed cache ◦ Results ◦ Performance ◦ Energy

  7. What is stream-based compression? Common software data compression algorithms ◦ LZ77, gzip, LZMA. It sequentially compresses cache lines as a single stream ◦ compresses using pointers that copy repeated strings (data)
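The pointer-copy idea can be sketched as a toy LZ77-style round trip: repeated strings are replaced by (offset, length) pointers into the already-seen stream. This is an illustrative sketch only, not the deck's actual hardware compressor; function names and the window size are assumptions.

```python
# Toy LZ77-style stream compression: repeated strings become
# (offset, length) pointers back into the data seen so far.

def lz77_compress(data: bytes, window: int = 255):
    out, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Scan the window for the longest earlier match at position i.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= 3:                  # only emit pointers that pay off
            out.append(("copy", best_off, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz77_decompress(tokens) -> bytes:
    buf = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            buf.append(tok[1])
        else:                              # ("copy", offset, length)
            _, off, length = tok
            for _ in range(length):
                buf.append(buf[-off])      # byte-by-byte handles overlap
    return bytes(buf)
```

Because each pointer can reference anything earlier in the stream, redundancy across cache lines is captured, which is exactly what per-block schemes give up.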

  11. Stream compression example

  15. Stream vs. block-based compression. Stream-based compression achieves much higher compression, yet much prior work uses block-based compression, for two reasons: single-threaded performance and implementability.

  19. First reason: well-matched for throughput. Decompression is inherently expensive. Insight: memory accesses are expensive too! ◦ high latency ◦ high energy consumption. With many threads, decompression latency can be overlapped, while every avoided off-chip access saves both.

  22. Second reason: hard to implement with set-based caches. The natural implementation compresses each cache set as a compressed stream, but cache sets are unsuited for stream-based compression ◦ evictions and write-backs corrupt the compression stream

  23. Introducing log-based caches. Log-based caches organize cache lines by temporal fill order.

  24. Fill data-path architecture ◦ Lines stream to one active log sequentially ◦ Each fill records its address-to-log mapping (e.g., address_1 → log_3, address_2 → log_3) in a table

  26. A log-flush happens when there is not enough space ◦ not on the critical path ◦ only writes back dirty cache lines. After the flush, fills continue into a new active log (e.g., address_3 → log_4)

  28. Request data-path. LMT: Line-Map Table (a redirection table) ◦ indexed by addresses ◦ points to logs

  29. Request data-path components: 1. stream compressor, 2. LMT, 3. eviction policy (flush)
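The fill and request data-paths described above can be sketched with plain bookkeeping. The class and field names below are hypothetical, and lines are stored raw so the mechanics stay visible; in the real design each log holds a compressed stream:

```python
# Sketch of MORC-style log-based bookkeeping: fills append lines to one
# active log, the Line-Map Table (LMT) redirects each address to its
# log, and a flush reclaims a whole log (writing back dirty lines).

class LogCache:
    def __init__(self, num_logs=4, log_capacity=8):
        self.logs = [[] for _ in range(num_logs)]
        self.log_capacity = log_capacity
        self.active = 0
        self.lmt = {}                         # address -> (log_id, slot)

    def fill(self, address, line, dirty=False):
        """Fill path: lines stream sequentially into the active log."""
        log = self.logs[self.active]
        if len(log) >= self.log_capacity:     # not enough space: rotate
            self.active = (self.active + 1) % len(self.logs)
            self._flush(self.active)          # reclaim the victim log
            log = self.logs[self.active]
        log.append((address, line, dirty))
        self.lmt[address] = (self.active, len(log) - 1)

    def lookup(self, address):
        """Request path: the LMT redirects the address to its log."""
        if address not in self.lmt:
            return None                       # miss -> go off-chip
        log_id, slot = self.lmt[address]
        return self.logs[log_id][slot][1]

    def _flush(self, log_id):
        for address, line, dirty in self.logs[log_id]:
            if dirty:
                pass                          # write back to memory here
            self.lmt.pop(address, None)
        self.logs[log_id] = []
```

Because whole logs are reclaimed at once, individual evictions never punch holes in the middle of a compressed stream, which is the property set-based organizations lack.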

  30. Content-aware compression with logs. Multiple active logs enable content-aware compression ◦ dynamically choose the best stream for each line based on content similarity ◦ better compression than strict sequential placement
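One way to picture the dynamic stream choice: route each incoming line to the active log whose content it most resembles, since similar data compresses better within one stream. The byte-overlap metric below is an illustrative stand-in, not the paper's actual similarity measure:

```python
# Content-aware placement sketch: score each active log's recent
# content against the incoming line and pick the best match.

def similarity(line: bytes, log_tail: bytes) -> int:
    # Count distinct byte values the line shares with the log's tail.
    return len(set(line) & set(log_tail))

def pick_log(line: bytes, active_logs: list) -> int:
    scores = [similarity(line, tail) for tail in active_logs]
    return scores.index(max(scores))

logs = [b"\x00\x00\x01\x02",     # log of mostly-zero binary data
        b"deadbeef"]             # log of ASCII-like data
print(pick_log(b"beefcafe", logs))   # -> 1 (matches the ASCII log)
```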

  31. Prior work in LLC compression

  Scheme       | Int. frag.  | Ext. frag. | Tag overhead | Requires software | Organization | Algorithm
  Adaptive[1]  | Yes         | Yes        | Medium       | No                | Set-based    | Block
  Decoupled[2] | Yes         | No         | Low          | No                | Set-based    | Block
  SC2[3]       | Yes         | Yes        | High         | Yes               | Set-based    | Centralized
  MORC         | Very little | No         | Low          | No                | Log-based    | Stream

  Internal fragmentation in compression blocks decreases the absolute compression ratio by as much as 12.5%. External fragmentation increases LLC energy by as much as 200% (studied in [2]).

  [1] Alameldeen et al., "Adaptive cache compression for high-performance processors," ISCA'04
  [2] Sardashti et al., "Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching," MICRO'13
  [3] Arelakis et al., "SC2: A statistical compression cache scheme," ISCA'14

  32. Simulation methodology. Simulator: PriME[1] ◦ execution-driven, x86 in-order. Benchmarks: SPEC CPU2006. Modeled future manycore system ◦ 1024 cores in a single chip ◦ 128MB LLC (128KB per core) ◦ 100GB/s off-chip bandwidth (100MB/s per core). [1] Y. Fu et al., "PriME: A parallel and distributed simulator for thousand-core chips," ISPASS 2014

  34. Compression results ◦ maximum average compression ratio: 6x ◦ arithmetic mean: 3x

  38. Throughput improvements ◦ maximum average compression ratio: 6x; arithmetic mean: 3x ◦ throughput improvement: 40%, versus 20% for the best prior work ◦ improvements depend on working-set sizes

  39. Energy. Two questions ◦ how much DRAM access energy is saved? ◦ is compression/decompression energy a concern?

  40. Energy ◦ DRAM accesses are expensive ◦ compression energy is negligible ◦ decompression energy is small. (Memory-subsystem energy is normalized to the uncompressed baseline.)

  42. Summary. Stream compression achieves much higher compression than block-based ◦ …but is hard with set-based caches ◦ …and is not the right approach for single-threaded performance. Log-based caches efficiently support stream-based compression ◦ sequential cache-line placement. Architecture ◦ stream compressor, LMT, eviction policy. Results ◦ 50% better compression and 100% better throughput improvement than prior work ◦ better energy efficiency