
Rethinking Last-Level Cache Management for Multicores Operating at Near-Threshold
Farrukh Hijaz, Omer Khan
University of Connecticut
COMPUTER ARCHITECTURE GROUP

Power Efficiency
• Multicores enable efficiency
• Still need >3× improvement to meet HPC GFLOPS/Watt goal
• Power-performance efficiency stalled
[Figure: Performance/Watt vs. Years, annotated ’14. Image credit: http://www.vr-zone.com]

The Value of Operating at NTV
• Near-Threshold Voltage (NTV) operation potentially enables a 5-10× improvement in power-performance efficiency [Intel: DAC’12]

NTV Operation?
Logic (✓) NTV (✓) [Intel: DAC’12]

NTV Operation?
Cache (✗) NTV (✗) [Intel: DAC’12]
• SRAM bit-cells susceptible to errors at NTV

NTV Approaches for On-chip Memory
• High voltage, high frequency
   • High performance
   • Low energy efficiency
   • No faults
• Low voltage, low frequency
   • Low performance
   • Highest energy efficiency
   • No faults
• Low voltage, high frequency (our approach)
   • High performance
   • High energy efficiency
   • Permanent faults

NTV Approaches for Permanent Faults
• Circuit level (8T, 10T SRAM bit-cell)
   • High area overhead
   • Higher leakage current
• ECC based (SECDED, MS-ECC)
   • Constant latency overhead
• Disabling based (e.g., cache line disabling)
   • Lower available capacity
• Hybrid of ECC and disabling (e.g., VS-ECC) (our approach)
   • Trades off available capacity and latency overhead

The NTV Challenge in Multicores
• Future multicores will have 100s of cores
• LLC management is key to optimizing performance and energy
• Last-level cache (LLC) data locality and off-chip miss rates are first-order constraints and often show opposing trends
• Lower available LLC capacity at NTV presents new challenges
[Figure callouts: limited off-chip bandwidth; on-chip latency, since the diameter of the on-chip network increases with core count]
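To make the "diameter increases with core count" callout concrete, the following is a minimal sketch that assumes a square 2D mesh network. The slides do not state the topology, so the mesh and the helper name `mesh_diameter` are assumptions made for illustration only.

```python
import math


def mesh_diameter(num_cores: int) -> int:
    """Worst-case hop count between two tiles of a square k x k 2D mesh.

    Assumption: one core per tile and a square mesh; the slides do not
    name the on-chip network topology.
    """
    k = math.isqrt(num_cores)
    if k * k != num_cores:
        raise ValueError("this sketch only handles square meshes")
    # Corner-to-corner route: (k - 1) hops in X plus (k - 1) hops in Y.
    return 2 * (k - 1)


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n, "cores ->", mesh_diameter(n), "hops")
```

Under this assumption the diameter grows with the square root of the core count, so the worst-case distance to a remote LLC slice keeps rising as cores are added.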

Static-NUCA (LLC Data Placement)
• Statically interleaves data by address across all physically distributed LLC slices
• No replication of data in the LLC slices
• High cache utilization, since all data is evenly distributed
• Data resides in a remote LLC slice with high probability
• The high remote LLC slice access rate results in higher on-chip network traffic and high average LLC access latency/energy
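Static address interleaving can be sketched as a modulo over the cache-line index. The 64-byte line size and the names `snuca_home_slice` and `remote_fraction` are assumptions for illustration; the 64 slices mirror the 64-core configuration used in the evaluation.

```python
LINE_BYTES = 64   # assumed cache-line size (not stated in the slides)
NUM_SLICES = 64   # one LLC slice per core, as in the 64-core evaluation


def snuca_home_slice(phys_addr: int,
                     num_slices: int = NUM_SLICES,
                     line_bytes: int = LINE_BYTES) -> int:
    """Home LLC slice of an address under static line interleaving."""
    return (phys_addr // line_bytes) % num_slices


def remote_fraction(requesting_core: int, addrs) -> float:
    """Fraction of accesses whose home slice is not the requester's own."""
    addrs = list(addrs)
    remote = sum(1 for a in addrs if snuca_home_slice(a) != requesting_core)
    return remote / len(addrs)


if __name__ == "__main__":
    # A core streaming through consecutive lines finds only 1 in 64 locally.
    stream = (i * LINE_BYTES for i in range(6400))
    print(remote_fraction(0, stream))
```

With data spread evenly, a core's own slice holds roughly 1/64 of the lines it touches, which is why the remote access rate is high.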

Reactive-NUCA (LLC Data Placement, Limited Replication)
• Classifies data as private or shared at page granularity using the existing virtual memory system
• Maps private pages to the requesting core’s local LLC slice
• Maps shared pages across the chip based on static address interleaving (similar to Static-NUCA)
• Replication of data is not allowed
• Instructions are replicated in an LLC slice per cluster of 4, using rotational interleaving
• Low LLC access latency/energy for correctly classified private data and instructions
• No locality optimizations for shared data
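The private/shared data placement above can be sketched as follows. This is a simplified illustration, not the R-NUCA implementation: the first-touch ownership rule, the 4 KB page size, the 64-byte line size, and the `PageClassifier` class are all assumptions, and instruction placement and the cleanup needed when a page is reclassified are omitted.

```python
PAGE_BYTES = 4096  # assumed page size
LINE_BYTES = 64    # assumed cache-line size
NUM_SLICES = 64    # one LLC slice per core


class PageClassifier:
    """Hypothetical page-granularity private/shared classifier."""

    def __init__(self):
        self.owner = {}      # page number -> first core that touched it
        self.shared = set()  # pages touched by more than one core

    def home_slice(self, core: int, addr: int) -> int:
        page = addr // PAGE_BYTES
        if page not in self.shared:
            first = self.owner.setdefault(page, core)
            if first == core:
                return core          # private page: requester's local slice
            self.shared.add(page)    # a second core touched it: now shared
        # Shared page: static address interleaving, as in S-NUCA.
        return (addr // LINE_BYTES) % NUM_SLICES


if __name__ == "__main__":
    pc = PageClassifier()
    print(pc.home_slice(5, 0x1000))  # private to core 5
    print(pc.home_slice(7, 0x1040))  # core 7 touches the page: shared
    print(pc.home_slice(5, 0x1000))  # core 5 now sees interleaved placement
```

Because classification is per page, one line touched by a second core moves every line of that page to interleaved placement, including lines only one core ever uses.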

Victim Replication (LLC Data Placement and Replication)
• Starts with S-NUCA and uses the local LLC slice of a core as a victim cache for the cache lines evicted from its L1 cache
• Inserts a replica only if there exists:
   • an invalid cache line,
   • a home cache line with zero sharers, or
   • another replica
• Improves locality and reduces on-chip traffic
• The replication strategy causes LLC pollution, resulting in more evictions of home cache lines with zero sharers and of other replicas
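The replica insertion rule above can be sketched as a victim-selection function over one LLC set. Treating the three conditions as a priority order, and the `disabled` flag for lines switched off at NTV, are assumptions added for illustration; `Line` and `pick_replica_victim` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    valid: bool = False
    is_replica: bool = False
    sharers: int = 0        # number of L1 sharers (meaningful for home lines)
    disabled: bool = False  # assumed flag: line disabled due to faults


def pick_replica_victim(cache_set: List[Line]) -> Optional[int]:
    """Return the way to overwrite with a replica, or None to skip insertion."""
    usable = [(way, line) for way, line in enumerate(cache_set)
              if not line.disabled]
    # Assumed priority: invalid line, then unshared home line, then a replica.
    priorities = (
        lambda l: not l.valid,
        lambda l: l.valid and not l.is_replica and l.sharers == 0,
        lambda l: l.valid and l.is_replica,
    )
    for wanted in priorities:
        for way, line in usable:
            if wanted(line):
                return way
    return None  # every usable line is a home line with sharers


if __name__ == "__main__":
    ways = [Line(valid=True, sharers=2), Line(valid=True, is_replica=True),
            Line(valid=True, sharers=0), Line()]
    print(pick_replica_victim(ways))
```

In this sketch, every disabled way removes one candidate from the set, so fewer evicted L1 lines find a slot as the number of disabled lines grows.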

Evaluation Methodology
• Evaluation using the Graphite multicore simulator for 64 cores
• McPAT/CACTI cache energy models and DSENT network energy models at 11 nm
• Evaluated 21 benchmarks from the SPLASH-2 (11), PARSEC (8), Parallel MI-bench (1), and UHPC (1) suites
• LLC management schemes compared:
   • Static-NUCA (S-NUCA)
   • Reactive-NUCA (R-NUCA)
   • Victim Replication (VR)

NTV Fault Model for LLC
• Normal distribution of error bits in a cache line with random occurrence probabilities
[Figure: LLC cache capacity (0% to 100%) split by error bits per cache line (0e, 1e, 2e, 3e, >=4e) at 0.10%, 0.30%, 0.50%, 0.70%, and 1.00%]
• LLC tag arrays extended to record “disable bits”
• 0e – 2e: ECC correction with additional 1-cycle latency
• >2e: Cache line disabling
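The split between ECC-corrected and disabled lines can be explored with a small sketch. Note the substitution: the slide describes a normal distribution of error bits, whereas this sketch assumes independent bit-cell faults, which gives a binomial count per line. The 512-bit line (64 bytes, ignoring tag and ECC bits) and all function names are also assumptions, so the numbers are illustrative and need not match the slide's figure.

```python
from math import comb

LINE_BITS = 512  # assumed 64-byte cache line; tag and ECC bits ignored


def p_errors(k: int, fault_rate: float, n: int = LINE_BITS) -> float:
    """Probability that a line has exactly k faulty bits (binomial model)."""
    return comb(n, k) * fault_rate ** k * (1 - fault_rate) ** (n - k)


def line_breakdown(fault_rate: float) -> dict:
    """Fraction of lines with 0, 1, 2, 3, and >=4 faulty bits."""
    p = [p_errors(k, fault_rate) for k in range(4)]
    return {"0e": p[0], "1e": p[1], "2e": p[2], "3e": p[3],
            ">=4e": 1.0 - sum(p)}


def usable_capacity(fault_rate: float, max_correctable: int = 2) -> float:
    """Fraction of lines kept: those with at most max_correctable errors."""
    return sum(p_errors(k, fault_rate) for k in range(max_correctable + 1))


if __name__ == "__main__":
    for rate in (0.001, 0.003, 0.005, 0.007, 0.010):
        print(f"{rate:.2%} fault rate -> "
              f"{usable_capacity(rate):.1%} of lines usable")
```

Lines with more than `max_correctable` errors correspond to the ">2e: cache line disabling" case, so `usable_capacity` is the capacity left to the LLC management scheme.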

Average Results – Completion Time
[Figure: Normalized completion time for S-NUCA, R-NUCA, and VR at 0%, 0.10%, 0.30%, and 0.50%, broken down into Compute, L1C-LLCReplica, L1C-LLCHome, LLCHome-Waiting, LLCHome-Sharers, LLCHome-OffChip, and Synchronization]
• R-NUCA and VR perform consistently better than S-NUCA
• VR’s replication helps at low fault rates
• Lower replication opportunities for VR at higher fault rates result in completion time on par with R-NUCA

Average Results – Energy
[Figure: Normalized energy for S-NUCA, R-NUCA, and VR at 0%, 0.10%, 0.30%, and 0.50%, broken down into L1-I Cache, L1-D Cache, LLC, Directory, Network Router, Network Link, and DRAM]
• Static energy dominates the overall energy
• Energy consumption tracks completion time

Benchmark Results – Barnes
[Figure: Normalized completion time for S-NUCA, R-NUCA, and VR at 0%, 0.10%, 0.30%, and 0.50%, broken down into Compute, L1C-LLCReplica, L1C-LLCHome, LLCHome-Waiting, LLCHome-Sharers, LLCHome-OffChip, and Synchronization]
• Replication helps significantly at lower fault rates
• Lower replication opportunity at higher fault rates diminishes the advantage over R-NUCA

Benchmark Results – Ocean_NC
[Figure: Normalized completion time for S-NUCA, R-NUCA, and VR at 0%, 0.10%, 0.30%, and 0.50%, broken down into Compute, L1C-LLCReplica, L1C-LLCHome, LLCHome-Waiting, LLCHome-Sharers, LLCHome-OffChip, and Synchronization]
[Figure: LLC accesses (0% to 100%) classified as Instruction, Private, Shared Read-Only, and Shared Read-Write at Cache-Line and Page granularity]
• R-NUCA performance degrades due to false sharing
• VR is better than R-NUCA, but its advantage is lower at higher fault rates

Benchmark Results – Dedup
[Figure: Normalized completion time for S-NUCA, R-NUCA, and VR at 0%, 0.10%, 0.30%, and 0.50%, broken down into Compute, L1C-LLCReplica, L1C-LLCHome, LLCHome-Waiting, LLCHome-Sharers, LLCHome-OffChip, and Synchronization]
• High number of LLC accesses to thread-private data
• R-NUCA’s local placement of private data is effective in improving completion time over VR

Observations
• No one-size-fits-all data management scheme exists at the lower LLC capacity available when operating at NTV
• A scheme that works optimally at higher LLC capacity might not be effective at the lower usable capacity
• Optimizing locality ends up putting extra stress on the LLC, increasing the off-chip miss rate
• There is a need for a data management scheme that not only utilizes LLC capacity more intelligently but also possesses the ability to handle the random distribution of faults
