KPart: A Hybrid Cache Sharing-Partitioning Technique for Commodity Multicores
Nosayba El-Sayed, Anurag Mukkara, Po-An Tsai, Harshad Kasture, Xiaosong Ma, Daniel Sanchez
Cache partitioning in commodity multicores

- Partitioning the last-level cache among co-running apps can reduce interference → improve system performance
  [Figure: four apps (App 1-4) sharing a last-level cache, backed by DRAM]
✔ Recent processors offer hardware cache-partitioning support!
✖ Two key challenges limit its usability:
  1. Current hardware implements coarse-grained way-partitioning → hurts system performance!
  2. Lacks hardware monitoring units to collect cache-profiling data

KPart tackles these limitations, unlocking significant performance on real hardware (avg gain: 24%, max: 79%), and is publicly available.
Limitations of hardware cache partitioning

1. Implements coarse-grained way-partitioning → hurts system performance
- Real-system example (benchmarks: SPEC-CPU2006, PBBS) on a 12-way, 12MB last-level cache (Way0-Way11); the smallest possible partition is a single 1MB way
- Baseline: NoPart (all apps share all ways)
- Conventional policy: per-app, utility-based cache partitioning (UCP), driven by each application's cache profile
- Conventional policies yield small partitions with few ways: low associativity → more misses. In this example, throughput degrades by 3.8%. (A UCP sketch follows below.)
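To make the UCP step concrete, here is a minimal sketch of a UCP-style greedy way allocator over per-app miss curves. The published UCP algorithm uses a lookahead refinement of this greedy loop, and the miss curves below are synthetic, illustrative data:

```python
# Minimal UCP-style sketch: greedily give each additional way to the app
# with the largest marginal utility (misses saved by one more way).
# Assumption: miss_curves[i][w] = misses of app i when given w ways.

def ucp_allocate(miss_curves, total_ways):
    n = len(miss_curves)
    alloc = [1] * n                      # every app gets at least one way
    for _ in range(total_ways - n):      # hand out the remaining ways
        gains = [miss_curves[i][alloc[i]] - miss_curves[i][alloc[i] + 1]
                 for i in range(n)]
        best = max(range(n), key=lambda i: gains[i])
        alloc[best] += 1
    return alloc

# Toy example: 4 identical apps on a 12-way LLC -> 3 ways each.
curves = [[1200 // (w + 1) for w in range(13)] for _ in range(4)]
print(ucp_allocate(curves, 12))          # [3, 3, 3, 3]
```

With eight co-running apps on the same 12 ways, most partitions shrink to one or two ways each, which is exactly the low-associativity problem this slide describes.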
Prior work on cache partitioning

- Hardware way-partitioning: restrict insertions into subsets of ways
  - Available in commodity hardware
  - Small number of coarse-grained partitions!
- Page coloring
  - No hardware support required
  - Not compatible with superpages; costly repartitioning due to recoloring; heavy OS modifications
- High-performance, fine-grained hardware partitioners (e.g., Vantage [ISCA'11], Futility Scaling [MICRO'14])
  - Support hundreds of partitions
  - Not available in existing hardware
- Hybrid technique: Set and WAy Partitioning (SWAP) [HPCA'17]
  - Combines page coloring and way-partitioning → fine-grained partitions
  - Inherits page coloring limitations
KPart performs hybrid cache sharing-partitioning to make use of coarse-grained partitions

- Cache-aware app grouping [Figure: apps assigned to group 1, group 2, group 3, each sharing one multi-way partition]
- Avoids a significant reduction in cache associativity → throughput improves by 17%
- Grouping must be done carefully!
KPart overview: Hybrid cache sharing-partitioning

- Inputs: application profiles (miss curves: cache misses vs. cache capacity), collected online or offline
- Step 1: Group applications into cache-sharing clusters (Cluster#1, Cluster#2, Cluster#3)
- Step 2: Assign cache partitions to clusters → per-cluster cache partition plan
- How?
Clustering apps based on cache-compatibility: distance metric

- How many additional cache misses are expected when two apps share cache capacity vs. when it is partitioned between them?
- Use cache miss curves to estimate:
  - the combined miss curve when the apps share capacity [Mukkara et al., ASPLOS'16]
  - the partitioned miss curve [capacity divided using UCP]
  - distance = area between the two curves
- This area → expected performance degradation when apps share cache capacity (due to additional misses); a sketch follows below
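A minimal sketch of this distance metric over miss curves indexed by capacity buckets. The shared_curve() estimator is a simplified stand-in for the combined-curve method of Mukkara et al. (it assumes each app occupies capacity in proportion to its miss rate and iterates to a fixed point), and the partitioned curve uses an exhaustive two-app split rather than full UCP:

```python
# Distance between two apps = extra misses expected from sharing capacity
# instead of partitioning it, i.e., the area between the two curves.

def shared_curve(mc_a, mc_b):
    """Estimate the combined miss curve when two apps share capacity."""
    curve = []
    for s in range(len(mc_a)):
        sa = s // 2                       # initial guess for app A's share
        for _ in range(10):               # fixed point: space tracks miss rate
            ma, mb = mc_a[sa], mc_b[s - sa]
            sa = round(s * ma / (ma + mb)) if ma + mb else s // 2
            sa = min(max(sa, 0), s)
        curve.append(mc_a[sa] + mc_b[s - sa])
    return curve

def partitioned_curve(mc_a, mc_b):
    """Best-case miss curve when each capacity point is split explicitly."""
    return [min(mc_a[sa] + mc_b[s - sa] for sa in range(s + 1))
            for s in range(len(mc_a))]

def distance(mc_a, mc_b):
    """Area between shared and partitioned curves: small => compatible."""
    sh, pt = shared_curve(mc_a, mc_b), partitioned_curve(mc_a, mc_b)
    return sum(max(s - p, 0) for s, p in zip(sh, pt))
```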
Grouping applications into clusters

- Hierarchical clustering over the application miss curves:
  - Start with the applications as individual clusters
  - At each step, merge the closest pair of clusters until only one cluster is left (yielding groupings for K=3, K=2, ...)
- How do we find the best K without running the mix? (A clustering sketch follows below.)
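A minimal sketch of the agglomerative step, assuming the distance() and shared_curve() helpers sketched above; a merged cluster's miss curve is the estimated shared curve of its members, so merged clusters can be compared just like single apps:

```python
# Hierarchical clustering: repeatedly merge the closest pair of clusters
# under the sharing-vs-partitioning distance until k clusters remain.

def cluster_apps(miss_curves, k):
    clusters = [[i] for i in range(len(miss_curves))]  # app indices
    curves = [mc[:] for mc in miss_curves]             # per-cluster curves
    while len(clusters) > k:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: distance(curves[p[0]], curves[p[1]]))
        clusters[i] += clusters.pop(j)                 # merge j into i
        curves[i] = shared_curve(curves[i], curves.pop(j))
    return clusters, curves
```

Because the merges are nested, recording the state after each merge yields a candidate grouping for every K from a single pass.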
Automatic selection of K in KPart

- Performance estimator, driven by the application profiles:
  - Estimate throughput under all possible Ks (per-cluster cache partition plan for Cluster#1 ... Cluster#K_auto)
  - Estimate speedup curves
  - Account for bandwidth contention
- Return the K_auto that produces the best result (sketched below)
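A sketch of the selection loop, composing the pieces above; estimate_throughput() is a hypothetical stand-in for KPart's performance model, which derives per-app speedups from the miss curves and accounts for memory-bandwidth contention (see the paper for the actual model):

```python
# Try every candidate K, build its partition plan, and keep the K whose
# estimated throughput is best -- without ever running the mix.

def choose_k_auto(miss_curves, total_ways, estimate_throughput):
    best = (float("-inf"), None, None)            # (score, k, plan)
    for k in range(1, len(miss_curves) + 1):
        clusters, curves = cluster_apps(miss_curves, k)
        ways = ucp_allocate(curves, total_ways)   # one partition per cluster
        score = estimate_throughput(clusters, ways)
        best = max(best, (score, k, (clusters, ways)))
    return best[1], best[2]                       # K_auto and its plan
```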
Cache partitioning in commodity multicores

- Partitioning the last-level cache among co-running apps can reduce interference → improve system performance
✔ Recent processors offer hardware cache-partitioning support!
✖ Two key challenges limit its usability:
  1. Implements coarse-grained way-partitioning → hurts system performance!
  2. Lacks hardware monitoring units to collect cache-profiling data
How do we profile applications online at low overhead and high accuracy?

- Prior work mostly simulated hardware monitors that don't exist in real systems, or used expensive software-based memory-address sampling
- DynaWay exploits hardware partitioning support to adjust partition sizes periodically → measures performance (misses, IPC, bandwidth) and builds miss curves (cache misses vs. cache capacity) online
- We applied optimizations to reduce measurement points and interval length (see paper) → less than 1% profiling overhead (8-app workloads)
- A profiling sketch follows below
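A minimal sketch of the online profiling loop. set_ways() and read_miss_counter() are hypothetical wrappers around the platform's way-partitioning interface (e.g., Intel CAT via resctrl) and its LLC-miss performance counter; the real DynaWay additionally prunes measurement points and tunes interval lengths to stay under the ~1% overhead reported above:

```python
# Build an app's miss curve online by sweeping its partition size with
# the hardware way-partitioning support and reading miss counters.
import time

def profile_online(app_id, max_ways, set_ways, read_miss_counter,
                   interval_sec=0.05):
    miss_curve = [0] * (max_ways + 1)
    for w in range(1, max_ways + 1):     # apps always keep at least one way
        set_ways(app_id, w)              # resize the app's partition
        before = read_miss_counter(app_id)
        time.sleep(interval_sec)         # let the cache contents settle
        miss_curve[w] = read_miss_counter(app_id) - before
    miss_curve[0] = miss_curve[1]        # 0 ways is unmeasurable; extrapolate
    return miss_curve
```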
KPart+DynaWay profiles applications online and partitions the cache dynamically

- KPart computes the per-cluster partition plan (Cluster#1 ... Cluster#K_auto) and invokes DynaWay
- DynaWay generates online profiles and updates them periodically
KPart Evaluation
Evaluation methodology

- Platform: 8-core Intel Broadwell D-1540 processor (12MB LLC)
- Benchmarks: SPEC-CPU2006, PBBS
- Mixes: 30 different mixes of 8 apps (randomly selected), each app running at least 10B instructions
- Experiments:
  - KPart on a real system with offline profiling
  - KPart on a real system with online profiling (using DynaWay)
  - KPart in simulation, compared against high-performance techniques
  - KPart with a mix of batch and latency-critical applications
KPart unlocks significant performance on real hardware

- Evaluation results on a real system with offline profiling
  [Figure: throughput gain over NoPart (%) across application mixes, comparing K_auto, K_oracle, fixed K (K2, K4, K6), and NoClust]
- KPart improves system performance by 24% on average (up to 79%)
- Important to use K_auto instead of a fixed K
- NoClust hurts ~30% of mixes
KPart unlocks significant performance on real hardware

- Evaluation results on a real system with offline profiling
- Case studies of individual mixes (Mix 1, Mix 2)
KPart evaluation with DynaWay's online profiles

- KPart+DynaWay can even outperform static KPart with offline profiling, since it adapts to application phase changes!
  [Figure: performance of KPart+DynaWay vs. K_auto and K_oracle with offline profiles, across reconfiguration intervals (cycles)]