  1. IMPROVING PERFORMANCE ISOLATION ON CHIP MULTIPROCESSORS VIA AN OPERATING SYSTEM SCHEDULER Alexandra Fedorova Margo Seltzer Michael D. Smith Presented by Brad Eugene Torrence

  2. INTRODUCTION Motivation: poor performance isolation in chip multiprocessors (CMPs) • The performance of an application depends on the behavior of the other applications running at the same time • This is inherent to the shared-cache design of CMPs • Fairness among threads is not considered at the hardware level

  3. INTRODUCTION • High-miss-rate co-running applications increase their co-runners' execution time • Problems caused by poor performance isolation: • The OS scheduler becomes non-deterministic and unpredictable • Thread priority enforcement is weakened • QoS resource reservations are less effective • Per-CPU-hour billing is complicated: applications are unfairly billed for running times inflated by their co-runners

  4. INTRODUCTION • Overall performance measures the time to complete execution • Performance variability measures performance isolation by comparing performance across executions with different co-runners • High variability indicates poor performance isolation • Instructions per cycle (IPC) measures how fast a thread executes while it is running • The CPU time-slice determines how much CPU time a thread receives
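A small worked example with made-up numbers (illustrative only, not figures from the paper) shows how these metrics interact:

    # Same program run twice with the same time-slice policy but different co-runners.
    instructions = 100_000_000

    cycles_with_fast_corunner = 80_000_000    # low-miss-rate co-runner
    cycles_with_slow_corunner = 110_000_000   # high-miss-rate co-runner

    ipc_fast = instructions / cycles_with_fast_corunner   # 1.25
    ipc_slow = instructions / cycles_with_slow_corunner   # ~0.91

    # Completion time tracks cycles, so the run-to-run difference is the
    # performance variability the paper measures.
    variability_pct = 100 * (cycles_with_slow_corunner - cycles_with_fast_corunner) / cycles_with_fast_corunner   # 37.5%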

  5. INTRODUCTION • Co-runner-dependent cache allocation creates IPC variability • IPC variability drastically affects performance variability • The OS CANNOT control IPC variability because it cannot control cache allocation • The OS CAN control each thread's CPU time-slice to compensate for IPC variability • Cache-fair algorithm: offsets performance variability by adjusting threads' CPU time-slices
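The compensation follows from a simple relation (a back-of-envelope derivation, not spelled out in this form on the slides): a thread that must retire W instructions at clock frequency f needs

    t = W / (IPC * f)

seconds of CPU time. So if co-runner interference lowers its IPC from IPC_fair to IPC_actual, its CPU time must grow by the factor IPC_fair / IPC_actual for it to make the same progress; that extra time is what the cache-fair algorithm grants to the thread and reclaims elsewhere.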

  6. CACHE-FAIR ALGORITHM • NOT a new scheduling policy • Complements existing policies to mitigate performance variability effects • Makes threads run as quickly as they would if the cache were shared equally • Works by dynamically adjusting CPU time-slices of the threads • If one thread requires more CPU time, another thread must sacrifice CPU time so overall execution time is not increased

  7. CACHE-FAIR ALGORITHM • Conventional scheduler on a CMP: high-miss-rate thread B is co-run with thread A • Thread A gets less than its fair cache allocation, resulting in worse-than-fair performance • Hypothetical CMP that enforces fairness: shows the ideal fair performance • CMP with a cache-fair scheduler: Thread A is still affected by Thread B, but its CPU time-slice is increased • This allows A to achieve fair performance

  8. CACHE-FAIR ALGORITHM • Two new thread classes help maintain the proper balance of CPU time among threads • Cache-fair class threads: managed by the cache-fair scheduler to improve performance isolation • Best-effort class threads: not managed for performance isolation, but may receive time-slice adjustments

  9. CACHE-FAIR ALGORITHM How it works: • When a cache-fair thread's time-slice is adjusted, another cache-fair thread is adjusted in the opposite direction to offset it • If no other cache-fair thread is available to receive the offsetting adjustment, a best-effort thread is adjusted instead • Adjustments to any single best-effort thread are kept under a specific threshold, limiting the performance impact on it • CPU time-slice adjustments: compute the fair IPC value and compare it with the actual IPC • Time-slices are computed so the thread's IPC will equal the fair IPC when the time-slice expires (see the sketch below)
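A minimal sketch of this adjustment bookkeeping, in Python for readability (the real implementation is a Solaris kernel module); the thread fields, the adjust_quanta helper, and the BEST_EFFORT_CAP threshold are assumptions for illustration, not interfaces from the paper:

    # Illustrative sketch only: field names, helper, and threshold are assumptions.
    BEST_EFFORT_CAP = 0.3   # hypothetical cap on a best-effort thread's cumulative adjustment

    def adjust_quanta(thread, cache_fair_peers, best_effort_threads, base_quantum):
        # To retire the same work, CPU time scales as fair_ipc / actual_ipc, so the
        # thread's quantum grows (or shrinks) by this delta to reach its fair IPC.
        delta = base_quantum * (thread.fair_ipc / thread.actual_ipc - 1.0)
        thread.quantum = base_quantum + delta

        # Offset the change on another cache-fair thread so the total CPU time
        # handed out per scheduling round stays constant.
        for peer in cache_fair_peers:
            if peer is not thread:
                peer.quantum -= delta
                return
        # Otherwise charge a best-effort thread, but keep its cumulative
        # adjustment below a threshold to bound the impact on it.
        for be in best_effort_threads:
            if abs(be.cumulative_adjustment + delta) <= BEST_EFFORT_CAP * base_quantum:
                be.quantum -= delta
                be.cumulative_adjustment += delta
                return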

  10. FAIR IPC MODEL • The fair IPC of a thread cannot be measured directly • Fair IPC values are estimated using the Fair IPC Model • Estimates the miss-rate under fair-cache allocation • Then calculates the fair IPC given the fair-cache miss-rate

  11. FAIR IPC MODEL OVERVIEW • The miss-rate estimation model: MissRate(A) = a · Σ_{i=1..n} MissRate(C_i) + b • Where a and b are linear coefficients, n is the number of co-runners, and C_i is the i-th co-runner • After running the target thread with several co-runners, a and b are derived using linear regression analysis • Since fair cache sharing implies FairMissRate(A) = MissRate(A) = MissRate(C_i), it follows that FairMissRate(A) = b / (1 − a·n)
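A minimal sketch of how the coefficients could be fit and the fair miss rate derived, assuming NumPy is available; the sample points and the single-co-runner setting are made up for illustration, not data from the paper:

    import numpy as np

    # Each sample pairs the sum of the co-runners' miss rates with the target
    # thread A's observed miss rate; these data points are hypothetical.
    sum_corunner_missrates = np.array([0.02, 0.05, 0.09, 0.14])
    observed_missrate_A    = np.array([0.031, 0.042, 0.058, 0.074])

    # Fit MissRate(A) = a * sum_i MissRate(C_i) + b by least squares.
    a, b = np.polyfit(sum_corunner_missrates, observed_missrate_A, 1)

    # Under equal cache sharing every thread sees the same miss rate, so
    # FairMissRate(A) = a * n * FairMissRate(A) + b, giving:
    n = 1   # number of co-runners on a dual-core CMP (illustrative)
    fair_missrate_A = b / (1 - a * n)
    print(f"a={a:.3f}, b={b:.3f}, fair miss rate={fair_missrate_A:.3f}")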

  12. FAIR IPC MODEL EVALUATION • Actual vs. Estimated fair-cache miss-rates

  13. FAIR IPC MODEL LIMITATIONS • Estimating a thread's fair miss rate requires running the thread with several co-runners • This can yield poor results if the co-runners are few or all have similar cache-access patterns • Requires running a thread multiple times with other threads, which is highly impractical in reality • The model assumes a uniform distribution of cache requests, an unrealistic assumption

  14. IMPLEMENTATION • Loadable module for the Solaris 10 OS • Provides flexibility and independence from the kernel's scheduler • Cache-fair management is enabled via a system call • Threads are also assigned to one of the two thread classes • The module tracks positive and negative adjustments to keep overall CPU time in balance

  15. IMPLEMENTATION • Cache-fair threads go through an initial preparation phase • The OS gathers performance data to calculate the fair miss rate • The preparation phase is also repeated if a thread changes its cache-access pattern • Forced when the thread has executed 1 billion instructions • The scheduling phase monitors threads using hardware performance counters • A thread's CPU time-slice is adjusted if it deviates from its fair IPC • A scheduling phase occurs every 50 million instructions (see the sketch below)
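A rough sketch of the per-thread phase bookkeeping, with hypothetical class and field names; the real module is Solaris kernel code that reads hardware performance counters directly, so this only mirrors the structure described on the slide:

    SCHED_WINDOW = 50_000_000        # instructions per scheduling-phase window
    RECALIBRATE  = 1_000_000_000     # instructions before preparation is forced again

    class CacheFairThreadState:
        def __init__(self):
            self.phase = "preparation"       # first gather data to estimate the fair IPC
            self.instructions_since_prep = 0
            self.instructions_in_window = 0
            self.samples = []                # (co-runner miss rates, own miss rate) pairs

        def on_counter_sample(self, instructions, enough_samples):
            """Called when hardware counters are read; returns the action to take."""
            self.instructions_since_prep += instructions
            if self.phase == "preparation":
                if enough_samples:           # enough co-runner data to fit the model
                    self.phase = "scheduling"
                    return "estimate fair miss rate and fair IPC (slide 11 model)"
                return "keep collecting miss-rate samples"
            # Scheduling phase: adjust the time-slice once per 50M-instruction window.
            self.instructions_in_window += instructions
            action = None
            if self.instructions_in_window >= SCHED_WINDOW:
                action = "compare actual IPC with fair IPC; adjust time-slice (slide 9)"
                self.instructions_in_window = 0
            # Periodically redo preparation in case the cache-access pattern changed.
            if self.instructions_since_prep >= RECALIBRATE:
                self.phase = "preparation"
                self.samples.clear()
                self.instructions_since_prep = 0
            return action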

  16. EVALUATION • Multi-program workloads: SPEC CPU2000 benchmarks (CPU workloads), SPEC JBB and TPC-C (server workloads) • Default scheduler = Solaris fixed-priority scheduler • The evaluation compares performance isolation between the cache-fair and default schedulers • Hardware simulator: Simics modules implementing a dual-core CMP

  17. EVALUATION • The slow schedule contains high-miss-rate co-runners; the fast schedule contains low-miss-rate co-runners • Preparation-phase estimations are performed prior to the experiment, running with all benchmark programs to get accurate estimates • The principal thread is monitored for 500 million instructions of the scheduling phase, after the first 10 million to avoid cold-start effects • It executes concurrently with three threads running identical benchmarks; only one of these is designated as a best-effort thread • Performance variability is measured as percent slowdown: the difference between performance in the fast and slow groups

  18. EFFECT ON PERFORMANCE ISOLATION • Cache-fair scheduling results in <4% variability across all benchmarks

  19. EFFECT ON ABSOLUTE PERFORMANCE • The upper bound is the completion time of the slow group • The lower bound is the completion time of the fast group • Normalized to the default scheduler's completion time in the fast group

  20. EFFECT ON THE OVERALL THROUGHPUT • Aggregate IPC was found to be more dependent on the relative IPCs of the threads that had CPU time-slice adjustments than on the principal benchmark's IPC • Slow group: 1-12% increase • Fast group: 1-3% increase

  21. EFFECT ON BEST-EFFORT THREADS • The worst-case effect on best-effort threads is small • Multiple best-effort threads are important in avoiding large performance effects • The average effect on best-effort threads is <1%, but the range is quite large

  22. EXPERIMENTS WITH DATABASE WORKLOADS • The performance metric changes to transactions per second • Two sets of experiments: SPEC JBB and TPC-C, each as the principal program • The benchmark twolf was used as a best-effort co-running thread • Each benchmark emulates the database activity of an order-processing warehouse • The number of warehouses and execution threads is configurable; the authors fixed the number of execution threads at one • Memory constraints required reducing the number of warehouses to 5 or fewer • The authors also reduced the L2 cache size to 512 KB accordingly

  23. SPEC JBB • SPEC JBB experimental layout • SPEC JBB experimental results

  24. TPC-C • TPC-C experimental layout • TPC-C experimental results

  25. COMPARISON TO CACHE PARTITIONING • Created a hardware simulator that simulates a CMP with cache partitioning • Cache partitioning reduced variability in only 3 of 9 benchmarks • Poor results are due to NO reduction in contention for the memory bus • Cache-fair scheduling accounts for memory bus contention and is therefore more effective than the hardware solution

  26. RELATED WORK • Hardware solutions • Address the problem directly, avoiding OS modifications • Ensure fair resource allocation • Limited flexibility and effectiveness • Increase hardware cost, complexity, and time-to-market • Software solutions • Co-scheduling attempts to find the optimal co-runner thread • Requires a good co-runner to exist, or it fails • Limited scalability (requires cores to coordinate in scheduling)

  27. SUMMARY • The cache-fair scheduling algorithm improves performance isolation on chip multiprocessors by nearly eliminating the effects of co-runner-dependent performance variability • Better than the hardware and software solutions available at the time (2007), according to the authors • I think the experiment shows promising results, but the process of calculating the fair IPC is much too costly to be practical
