IMPROVING PERFORMANCE ISOLATION ON CHIP MULTIPROCESSORS VIA AN OPERATING SYSTEM SCHEDULER Alexandra Fedorova Margo Seltzer Michael D. Smith Presented by Brad Eugene Torrence
INTRODUCTION • Motivation: Poor performance isolation in chip multiprocessors (CMPs) • The performance of an application also depends on the other applications running at the same time • Inherent to the shared-cache design of CMPs • Fairness among threads is not considered at the hardware level
INTRODUCTION • High-miss-rate co-running applications increase execution time • Problems caused by poor performance isolation: • OS scheduling becomes non-deterministic and unpredictable • Weakens thread priority enforcement • QoS-reserved resources become less effective • Complicates per-CPU-hour billing • Applications are unfairly billed for longer running times
INTRODUCTION • Overall performance is measured as the time to complete execution • Performance variability measures performance isolation as the difference in performance across executions • High variability indicates poor performance isolation • Instructions per cycle (IPC) measures how fast a thread executes • The CPU time-slice determines how long a thread runs on the processor before being preempted
INTRODUCTION • Co-runner-dependent cache allocation creates IPC variability • IPC variability directly affects performance variability • The OS CANNOT control IPC variability because it cannot control cache allocation • The OS CAN control each thread's CPU time-slice to compensate for IPC variability • Cache-fair algorithm: offsets performance variability by adjusting threads' CPU time-slices
CACHE-FAIR ALGORITHM • NOT a new scheduling policy • Complements existing policies to mitigate performance variability effects • Makes threads run as quickly as they would if the cache were shared equally • Works by dynamically adjusting CPU time-slices of the threads • If one thread requires more CPU time, another thread must sacrifice CPU time so overall execution time is not increased
CACHE-FAIR ALGORITHM • Conventional scheduler on a CMP • High-miss-rate thread B is co-run with thread A • Thread A gets below its fair cache allocation • Results in worse-than-fair performance • Hypothetical CMP that enforces fairness • Shows the ideal fair performance • CMP with a cache-fair scheduler • Thread A is still affected by thread B • Increasing thread A's CPU time-slice allows it to achieve fair performance
CACHE-FAIR ALGORITHM • Two new thread classes help maintain the proper balance of CPU time-sharing among threads • Cache-fair class threads – managed by the cache-fair-complemented scheduler to improve performance isolation • Best-effort class threads – not managed for performance isolation, but may receive time-slice adjustments
CACHE-FAIR ALGORITHM • How it works: • When a cache-fair thread's time-slice is adjusted, another cache-fair thread is adjusted in the opposite direction to offset the change • If no other cache-fair thread is available to receive the offsetting adjustment, a best-effort thread is adjusted instead • Adjustments to a single best-effort thread are kept under a specific threshold, mitigating the performance effects (see the sketch below) • CPU time-slice adjustments: • Compute the fair IPC value and compare it with the actual IPC • Time-slices are sized so the thread's IPC will equal the fair IPC value when the time-slice expires
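A minimal C sketch of this offsetting logic, under stated assumptions: the `thread` structure, its field names, and the `BEST_EFFORT_CAP_US` threshold are hypothetical (the paper gives no source code), but the flow follows the slide — offset against another cache-fair thread first, otherwise against a best-effort thread kept under a cap.

```c
#include <stdlib.h>

/* Hypothetical thread record; field names are assumptions, not the
 * authors' actual Solaris module structures. */
typedef enum { CACHE_FAIR, BEST_EFFORT } thread_class;

typedef struct thread {
    thread_class cls;
    long timeslice_us;       /* current CPU time-slice */
    long adjustment_us;      /* cumulative adjustment received */
} thread;

#define BEST_EFFORT_CAP_US 2000  /* assumed per-thread adjustment cap */

/* Give `t` a time-slice correction of `delta_us`, then charge the
 * opposite amount to another runnable thread so aggregate CPU time
 * is unchanged. */
static void adjust_timeslice(thread *t, long delta_us,
                             thread **runnable, int n)
{
    t->timeslice_us  += delta_us;
    t->adjustment_us += delta_us;

    /* Prefer offsetting against another cache-fair thread. */
    for (int i = 0; i < n; i++) {
        thread *o = runnable[i];
        if (o != t && o->cls == CACHE_FAIR) {
            o->timeslice_us  -= delta_us;
            o->adjustment_us -= delta_us;
            return;
        }
    }
    /* Otherwise charge a best-effort thread, keeping its cumulative
     * adjustment under the threshold to limit the performance effect. */
    for (int i = 0; i < n; i++) {
        thread *o = runnable[i];
        if (o->cls == BEST_EFFORT &&
            labs(o->adjustment_us - delta_us) <= BEST_EFFORT_CAP_US) {
            o->timeslice_us  -= delta_us;
            o->adjustment_us -= delta_us;
            return;
        }
    }
}
```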
FAIR IPC MODEL • The fair IPC of a thread cannot be measured directly • Fair IPC values are estimated using the Fair IPC Model • Estimates the miss-rate under fair-cache allocation • Then calculates the fair IPC given the fair-cache miss-rate
FAIR IPC MODEL OVERVIEW • The miss-rate estimation model: $\mathrm{MissRate}(A) = a \cdot \sum_{i=1}^{n} \mathrm{MissRate}(C_i) + b$ • Where $a$ and $b$ are linear coefficients, $n$ is the number of co-runners, and $C_i$ is the $i$-th co-runner • After running a target thread with several co-runners, $a$ and $b$ are derived using linear regression analysis • Under fair cache allocation, $\mathrm{FairMissRate}(A) = \mathrm{MissRate}(A) = \mathrm{MissRate}(C_i)$ for every co-runner • Substituting gives $\mathrm{FairMissRate}(A) = \dfrac{b}{1 - a \cdot n}$
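A small numeric sketch of the regression and fair-miss-rate steps, with made-up sample data: each observation pairs the target thread's measured miss rate y with the sum of its co-runners' miss rates x; ordinary least squares yields a and b, and the fair miss rate follows as b / (1 − a·n) (here n = 1, matching a dual-core CMP with one co-runner).

```c
#include <stdio.h>

/* Fit y = a*x + b by ordinary least squares over m observations,
 * where x = sum of co-runner miss rates and y = target miss rate. */
static void fit_linear(const double *x, const double *y, int m,
                       double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < m; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    *a = (m * sxy - sx * sy) / (m * sxx - sx * sx);
    *b = (sy - *a * sx) / m;
}

int main(void)
{
    /* Made-up preparation-phase samples: co-runner miss-rate sums
     * and the target thread's observed miss rates. */
    double x[] = { 0.02, 0.05, 0.09, 0.14 };
    double y[] = { 0.031, 0.036, 0.043, 0.052 };
    int n = 1;               /* one co-runner on a dual-core CMP */
    double a, b;

    fit_linear(x, y, 4, &a, &b);
    /* Under fair sharing MissRate(A) = MissRate(C_i), so
     * FairMissRate(A) = b / (1 - a*n). */
    double fair = b / (1.0 - a * n);
    printf("a=%.4f b=%.4f fair miss rate=%.4f\n", a, b, fair);
    return 0;
}
```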
FAIR IPC MODEL EVALUATION • Actual vs. Estimated fair-cache miss-rates
FAIR IPC MODEL LIMITATIONS • Estimating a thread's fair miss rate requires running the thread with several co-runners • This can yield poor results if co-runners are few or all have similar cache-access patterns • Requires running a thread multiple times with other threads • Highly impractical in practice • The model assumes a uniform distribution of cache requests • An unrealistic assumption
IMPLEMENTATION • Implemented as a loadable module for the Solaris 10 OS • Provides flexibility and independence from the kernel's scheduler • Cache-fair management is enabled via a system call • Threads are also assigned to a thread class (cache-fair or best-effort) • Tracks positive and negative adjustments to maintain balance in overall performance
IMPLEMENTATION • Cache-fair threads go through an initial preparation phase • The OS gathers performance data to calculate the fair miss rate • Re-preparation is also necessary if a thread changes its cache-access patterns • Forced each time a thread executes 1 billion instructions • The scheduling phase monitors threads using hardware performance counters • A thread's CPU time-slice is adjusted if it deviates from its fair IPC • Scheduling-phase adjustments occur every 50 million instructions (sketched below)
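A schematic of the two phases in C, again with assumed names and a toy proportional update rule (the paper's exact time-slice formula is not reproduced here); the counter reads are stand-in stubs, not real Solaris performance-counter calls.

```c
#include <stdint.h>
#include <stdio.h>

#define PREP_WINDOW_INSNS  1000000000ULL  /* refit model after 1B instructions */
#define SCHED_WINDOW_INSNS   50000000ULL  /* act every 50M instructions */

/* Toy stand-ins for hardware performance counters and the fitted
 * fair-IPC model; these are placeholders, not Solaris APIs. */
static uint64_t insns = 0, cycles = 0;
static uint64_t read_insns(void)  { return insns += SCHED_WINDOW_INSNS; }
static uint64_t read_cycles(void) { return cycles += 60000000; }
static double   fair_ipc(void)    { return 0.9; }

static void scheduling_tick(uint64_t insns_since_prep, long *timeslice_us)
{
    if (insns_since_prep >= PREP_WINDOW_INSNS) {
        /* Cache-access patterns may have changed; refit the model. */
        printf("re-entering preparation phase: refit miss-rate model\n");
        return;
    }
    /* Compare measured IPC with the model's fair IPC and size the next
     * time-slice so the thread converges on fair performance; this
     * proportional rule is an assumption, not the authors' formula. */
    double actual = (double)read_insns() / (double)read_cycles();
    double target = fair_ipc();
    *timeslice_us += (long)((target - actual) / target * 1000.0);
    printf("actual=%.3f fair=%.3f next slice=%ldus\n",
           actual, target, *timeslice_us);
}

int main(void)
{
    long slice = 10000;      /* assumed 10 ms baseline time-slice */
    for (uint64_t done = 0; done <= PREP_WINDOW_INSNS; done += SCHED_WINDOW_INSNS)
        scheduling_tick(done, &slice);
    return 0;
}
```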
EVALUATION • Uses multi-program workloads • SPEC CPU2000 benchmarks (CPU workloads) • SPEC JBB and TPC-C (server workloads) • Default scheduler = Solaris fixed-priority scheduler • Evaluation compares performance isolation between the cache-fair and default schedulers • Hardware simulator: Simics modules implementing a dual-core CMP
EVALUATION • Slow schedule contains high-miss-rate co-runners • Fast schedule contains low-miss-rate co-runners • Preparation-phase estimations are performed prior to the experiment • Run with all benchmark programs to get accurate estimates • The principal thread is monitored for 500 million scheduling-phase instructions • After the first 10 million, to avoid cold-start effects • Concurrently executed with three threads running identical benchmarks • Only one of these is designated as a best-effort thread • Performance variability is measured as percent slowdown (formula below) • The difference between performance in the fast and slow groups
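A worked form of the variability metric (the symbol names are mine, not the authors'):

$\text{slowdown}(\%) = \dfrac{T_{\text{slow}} - T_{\text{fast}}}{T_{\text{fast}}} \times 100$

where $T_{\text{slow}}$ and $T_{\text{fast}}$ are the principal thread's completion times under the slow and fast schedules; lower slowdown means better performance isolation.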
EFFECT ON PERFORMANCE ISOLATION • Cache-fair scheduling results in <4% variability across all benchmarks
EFFECT ON ABSOLUTE PERFORMANCE • Upper bound is the completion time in the slow group • Lower bound is the completion time in the fast group • Normalized to the default scheduler's completion time in the fast group
EFFECT ON THE OVERALL THROUGHPUT • Aggregate IPC was found to depend more on the relative IPCs of threads that received CPU time-slice adjustments than on the principal benchmark's IPC • Slow group: 1–12% increase • Fast group: 1–3% increase
EFFECT ON BEST-EFFORT THREADS • Worst-case effect on best-effort threads is small • Multiple best-effort threads are important for avoiding large performance effects • Average effect on best-effort threads is <1%, but the range is quite large
EXPERIMENTS WITH DATABASE WORKLOADS • The performance variability metric changes to transactions per second • Two sets of experiments: SPEC JBB and TPC-C, each as the principal program • The benchmark twolf was used as a best-effort co-running thread • Each benchmark emulates the database activity of an order-processing warehouse • The numbers of warehouses and execution threads are variable • The number of execution threads was fixed at one by the authors • Memory constraints required reducing the number of warehouses to 5 or fewer • The authors also reduced the L2 cache size to 512 KB in response
SPEC JBB • Experimental layout • Experimental results
TPC-C • Experimental layout • Experimental results
COMPARISON TO CACHE PARTITIONING • Created a hardware simulator of a CMP with cache partitioning • Cache partitioning reduced variability in only 3 of 9 benchmarks • Poor results because it does NOT reduce contention for the memory bus • Cache-fair scheduling accounts for memory-bus contention • It is therefore more effective than the hardware solution
RELATED WORK • Hardware solutions • Address the problem directly, avoiding OS modifications • Ensure fair resource allocation • Limited flexibility and effectiveness • Increase hardware cost, complexity, and time-to-market • Software solutions • Co-scheduling attempts to find the optimal co-runner thread • Fails if no good co-runner exists • Limited scalability (requires cores to coordinate scheduling)
SUMMARY • The cache-fair scheduling algorithm improves performance isolation on chip multiprocessors by nearly eliminating co-runner-dependent performance variability • According to the authors, it outperforms the hardware and software solutions of the time (2007) • I think the experiments show promising results, but the process of calculating the fair IPC is much too costly to be practical