KPart: A Hybrid Cache Sharing-Partitioning Technique for Commodity Multicores

Nosayba El-Sayed, Anurag Mukkara, Po-An Tsai, Harshad Kasture, Xiaosong Ma, Daniel Sanchez

Cache partitioning in commodity multicores
- Partitioning the last-level cache among co-running apps can reduce interference → improve system performance
[Figure: App 1, App 2, App 3, and App 4 sharing the Last-Level Cache]
✔ Recent processors offer hardware cache-partitioning support!
✖ Two key challenges limit its usability
1. Current hardware implements coarse-grained way-partitioning → hurts system performance!
2. Lacks hardware monitoring units to collect cache-profiling data
KPart tackles these limitations, unlocking significant performance on real hardware (avg gain: 24%, max: 79%), and is publicly available

Limitations of hardware cache partitioning
1. Implements coarse-grained way-partitioning → hurts system performance
- Real-system example (benchmarks: SPEC-CPU2006, PBBS)
- Baseline: NoPart (all apps share all ways)
- Conventional policy: per-app, utility-based cache partitioning (UCP), driven by application cache profiles
[Figure: Last-Level Cache (12MB) divided into 12 ways, Way0 to Way11, 1MB each; the smallest partition size is one way (1MB)]
Conventional policies yield small partitions with few ways: low associativity → more misses
This example: throughput degrades by 3.8%
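To make the granularity problem concrete, here is a minimal Python sketch of utility-based allocation at way granularity. It uses a plain greedy marginal-utility rule as a simplified stand-in for UCP, and the function name `allocate_ways`, the `curves` layout, and the `min_ways` parameter are illustrative assumptions rather than anything taken from the talk.

```python
def allocate_ways(curves, total_ways, min_ways=1):
    """Greedy marginal-utility way allocation (simplified stand-in for UCP).

    curves[app][w] is the miss count of `app` when it owns w ways, for
    w = 0 .. total_ways. Every app starts with `min_ways`; each remaining
    way goes to the app whose misses drop the most from one extra way.
    """
    alloc = {app: min_ways for app in curves}
    remaining = total_ways - min_ways * len(curves)
    if remaining < 0:
        raise ValueError("not enough ways for one partition per app")

    def gain(app):
        # Misses saved by growing this app's partition by one way.
        w = alloc[app]
        return curves[app][w] - curves[app][w + 1]

    for _ in range(remaining):
        best = max(sorted(curves), key=gain)  # sorted() makes ties deterministic
        alloc[best] += 1
    return alloc
```

With many apps and only 12 ways, a policy like this tends to leave several apps with one or two ways each, which is the low-associativity situation the slide describes.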

Prior work on cache partitioning
- Hardware way-partitioning: restrict insertions into subsets of ways
  - Available in commodity hardware
  - Small number of coarsely-grained partitions!
- Page coloring
  - No hardware support required
  - Not compatible with superpages; costly repartitioning due to recoloring; heavy OS modifications
- High-performance, fine-grained hardware partitioners (e.g. Vantage [ISCA'11], Futility Scaling [MICRO'14])
  - Support hundreds of partitions
  - Not available in existing hardware
- Hybrid technique: Set and WAy Partitioning (SWAP) [HPCA'17]
  - Combines page coloring and way-partitioning → fine-grained partitions
  - Inherits page coloring limitations

KPart performs hybrid cache sharing-partitioning to make use of coarse-grained partitions
- Cache-aware app grouping (group 1, group 2, group 3): grouping must be done carefully!
- Avoids significant reduction in cache associativity → throughput improves by 17%

KPart overview: Hybrid cache sharing-partitioning
How?
1. Application profiles: miss curves (cache misses vs. cache capacity), collected online or offline
2. Group applications into clusters → cache-sharing clusters (Cluster#1, Cluster#2, Cluster#3)
3. Assign cache partitions to clusters → per-cluster cache partition plan

Clustering apps based on cache-compatibility: Distance metric
- How many additional cache misses are expected when two apps share cache capacity vs. when it's partitioned?
- Use cache miss curves to estimate:
  - Shared LLC: combined miss curve [Mukkara et al., ASPLOS'16]
  - Partitioned LLC: partitioned miss curve [divide capacity using UCP]
  - distance = area between the combined and partitioned miss curves
[Figure: cache misses vs. cache capacity for app1 and app2, with the area between the two curves marked]
Area → expected performance degradation when apps share cache capacity (due to additional misses)
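The distance can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the combined (shared) miss curve is taken as an input because producing it requires a sharing model, the partitioned curve uses an exhaustive best split as a simple stand-in for the UCP split, and the names `partitioned_curve` and `distance` are hypothetical.

```python
def partitioned_curve(m1, m2):
    """Miss curve of two apps under the best static split of the capacity.

    m1[a] and m2[a] are the miss counts with a units of cache. For each
    total capacity s, try every split (a, s - a) and keep the minimum.
    Exhaustive search is an assumption standing in for the UCP split.
    """
    n = min(len(m1), len(m2))
    return [min(m1[a] + m2[s - a] for a in range(s + 1)) for s in range(n)]


def distance(m1, m2, combined):
    """Discrete area between the shared and the partitioned miss curves.

    combined[s] is the estimated miss count when both apps share s units
    of cache (supplied by the caller's sharing model).
    """
    part = partitioned_curve(m1, m2)
    return sum(c - p for c, p in zip(combined, part))
```

A small distance suggests the two apps lose little by sharing, so they are good candidates for the same cluster.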

Grouping applications into clusters
- Hierarchical clustering:
  - Start with the applications as individual clusters
  - At each step, merge the closest pair of clusters until only one cluster is left
[Figure: application miss curves merged step by step, with K=3 and K=2 marked]
How do we find the best K without running the mix?
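The merge loop above can be sketched as follows. This is a generic agglomerative procedure, assuming unique app names and a caller-supplied `cluster_distance` function (hypothetical here); it records the clustering at every K so that a later step can choose among them.

```python
from itertools import combinations


def hierarchical_clusterings(apps, cluster_distance):
    """Agglomerative clustering over uniquely named apps.

    Returns {K: list of clusters} for every K from len(apps) down to 1.
    Each cluster is a tuple of app names. cluster_distance(c1, c2) takes
    two such tuples and returns a number (smaller = more compatible).
    """
    clusters = [(a,) for a in apps]
    levels = {len(clusters): list(clusters)}
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        levels[len(clusters)] = list(clusters)
    return levels
```

Keeping every level is a deliberate choice in this sketch: the whole hierarchy is built once, and each K is just a different cut of it.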

Automatic selection of K in KPart
How?
- Input: application profiles
- Performance estimator: estimate throughput under all possible Ks
  - Account for bandwidth contention
  - Estimate speedup curves
- Return the K_auto that produces the best result
- Output: per-cluster cache partition plan (Cluster#1, Cluster#2, ..., Cluster#K_auto)
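In code, the selection step reduces to scoring each candidate clustering and keeping the best one. The sketch below assumes a caller-supplied `estimate_throughput` model (hypothetical; the talk's estimator also accounts for bandwidth contention, which is not modelled here) and a `levels` mapping from K to a list of clusters.

```python
def select_k(levels, estimate_throughput):
    """Pick the K whose clustering has the highest estimated throughput.

    levels: {K: list of clusters}, one candidate clustering per K.
    estimate_throughput: hypothetical model that scores one clustering.
    Returns (k_auto, clustering). Ties go to the smallest K.
    """
    k_auto = max(sorted(levels), key=lambda k: estimate_throughput(levels[k]))
    return k_auto, levels[k_auto]
```

Because only the estimator is evaluated, no candidate K has to be run on the real mix.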

Cache-partitioning in commodity multicores
- Partitioning the last-level cache among co-running apps can reduce interference → improve system performance
✔ Recent processors offer hardware cache-partitioning support!
✖ Two key challenges limit its usability
1. Implements coarse-grained way-partitioning → hurts system performance!
2. Lacks hardware monitoring units to collect cache-profiling data

How do we profile applications online at low overhead and high accuracy?
- Prior work mostly simulated hardware monitors that don't exist in real systems, or used expensive software-based memory address sampling
- DynaWay exploits hardware partitioning support to adjust partition sizes periodically → measure performance (misses, IPC, bandwidth)
- We applied optimizations to reduce measurement points and interval length (see paper) → less than 1% profiling overhead (8-app workloads)
[Figure: application profiles as miss curves, cache misses vs. cache capacity]
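The idea of profiling by resizing a partition can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: `set_ways` and `read_misses` are hypothetical hooks for the platform's partitioning and counter interfaces, and filling unmeasured sizes by linear interpolation is one simple way to use fewer measurement points, not necessarily the optimization described in the paper.

```python
def profile_miss_curve(set_ways, read_misses, way_points):
    """Sample a miss curve by resizing the profiled app's partition.

    set_ways(w)   -- hypothetical hook: give the profiled app w ways
    read_misses() -- hypothetical hook: run one interval, return its misses
    way_points    -- partition sizes (in ways) to measure directly
    Sizes between two measured points are linearly interpolated.
    Returns {ways: estimated misses}.
    """
    pts = sorted(set(way_points))
    samples = {}
    for w in pts:
        set_ways(w)
        samples[w] = read_misses()

    curve = {}
    for lo, hi in zip(pts, pts[1:]):
        for w in range(lo, hi):
            frac = (w - lo) / (hi - lo)
            curve[w] = samples[lo] + frac * (samples[hi] - samples[lo])
    curve[pts[-1]] = samples[pts[-1]]
    return curve
```

Measuring only a few sizes per round keeps the number of intervals spent at unfavourable partition sizes small, which is where most of the profiling cost would come from.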

KPart+DynaWay profiles applications online, partitions the cache dynamically
- KPart invokes DynaWay to generate online profiles and update them periodically
- KPart produces the per-cluster partition plan (Cluster#1, ..., Cluster#K_auto)

KPart Evaluation

Evaluation methodology
- Platform: 8-core Intel Broadwell D-1540 processor (12MB LLC)
- Benchmarks: SPEC-CPU2006, PBBS
- Mixes: 30 different mixes of 8 apps (randomly selected), each app running at least 10B instructions
- Experiments:
  - KPart on real system with offline profiling
  - KPart on real system with online profiling (using DynaWay)
  - KPart in simulation compared against high-performance techniques
  - KPart with mix of batch and latency-critical applications

KPart unlocks significant performance on real hardware
- Evaluation results on a real system with offline profiling
- KPart improves system performance by 24% on average, and by up to 79%
- NoClust hurts ~30% of mixes
- Important to use K_auto instead of fixed K
[Figures: throughput gain over NoPart (%) across application mixes (%) for K_auto, K_oracle, and NoClust, and for fixed K (K2, K4, K6); average throughput gain over NoPart (%)]

KPart unlocks significant performance on real hardware
- Evaluation results on a real system with offline profiling
- Case studies of individual mixes
[Figures: Mix 1 and Mix 2]

KPart evaluation with DynaWay's online profiles
- KPart+DynaWay can even outperform static KPart with offline profiling (adapts to application phase changes!)
[Figure: KPart+DynaWay vs. K_auto (offline profiles) and K_oracle (offline profiles), plotted against reconfiguration interval (cycles)]
