IMPROVING PERFORMANCE ISOLATION ON CHIP MULTIPROCESSORS VIA AN OPERATING SYSTEM SCHEDULER Alexandra Fedorova Margo Seltzer Michael D. Smith Presented by Brad Eugene Torrence
INTRODUCTION • Motivation: Poor performance isolation in chip multiprocessors (CMPs) • The performance of an application also depends on the other applications running at the same time • Inherent to the shared-cache design of CMPs • Fairness among threads is not considered at the hardware level
INTRODUCTION • High-miss-rate co-running applications increase execution time • Problems caused by poor performance isolation: • OS scheduling becomes non-deterministic and unpredictable • Weakens thread priority enforcement • QoS-reserved resources become less effective • Complicates per-CPU-hour billing • Applications are unfairly billed for longer running times
INTRODUCTION • Overall performance is measured as the time to complete execution • Performance variability measures performance isolation as the difference in performance across executions • High variability indicates poor performance isolation • Instructions per cycle (IPC) measures how fast a thread executes • The CPU time-slice determines how long a thread runs on the processor before being preempted
INTRODUCTION • Co-runner-dependent cache allocation creates IPC variability • IPC variability directly affects performance variability • The OS CANNOT control IPC variability because it cannot control cache allocation • The OS CAN control each thread's CPU time-slice to compensate for IPC variability • Cache-fair algorithm: offsets performance variability by adjusting threads' CPU time-slices
CACHE-FAIR ALGORITHM • NOT a new scheduling policy • Complements existing policies to mitigate performance variability effects • Makes threads run as quickly as they would if the cache were shared equally • Works by dynamically adjusting CPU time-slices of the threads • If one thread requires more CPU time, another thread must sacrifice CPU time so overall execution time is not increased
CACHE-FAIR ALGORITHM • Conventional scheduler on a CMP • High-miss-rate thread B is co-run with thread A • Thread A gets below its fair cache allocation • Results in worse-than-fair performance • Hypothetical CMP that enforces fairness • Shows the ideal fair performance • CMP with a cache-fair scheduler • Thread A is still affected by thread B • Increasing thread A's CPU time-slice allows it to achieve fair performance
CACHE-FAIR ALGORITHM • Two new thread classes help maintain the proper balance of CPU time-sharing among threads • Cache-fair class threads – managed by the cache-fair-complemented scheduler to improve performance isolation • Best-effort class threads – not managed for performance isolation, but may receive time-slice adjustments
CACHE-FAIR ALGORITHM • How it works: • When a cache-fair thread's time-slice is adjusted, another cache-fair thread is adjusted in the opposite direction to offset the change • If no other cache-fair thread is available to receive the offsetting adjustment, a best-effort thread is adjusted instead • Adjustments to a single best-effort thread are kept under a specific threshold, mitigating the performance effects (see the sketch below) • CPU time-slice adjustments: • Compute the fair IPC value and compare it with the actual IPC • Time-slices are sized so the thread's IPC will equal the fair IPC value when the time-slice expires
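A minimal C sketch of this offsetting logic, under stated assumptions: the `thread` structure, its field names, and the `BEST_EFFORT_CAP_US` threshold are hypothetical (the paper gives no source code), but the flow follows the slide — offset against another cache-fair thread first, otherwise against a best-effort thread kept under a cap.

```c
#include <stdlib.h>

/* Hypothetical thread record; field names are assumptions, not the
 * authors' actual Solaris module structures. */
typedef enum { CACHE_FAIR, BEST_EFFORT } thread_class;

typedef struct thread {
    thread_class cls;
    long timeslice_us;       /* current CPU time-slice */
    long adjustment_us;      /* cumulative adjustment received */
} thread;

#define BEST_EFFORT_CAP_US 2000  /* assumed per-thread adjustment cap */

/* Give `t` a time-slice correction of `delta_us`, then charge the
 * opposite amount to another runnable thread so aggregate CPU time
 * is unchanged. */
static void adjust_timeslice(thread *t, long delta_us,
                             thread **runnable, int n)
{
    t->timeslice_us  += delta_us;
    t->adjustment_us += delta_us;

    /* Prefer offsetting against another cache-fair thread. */
    for (int i = 0; i < n; i++) {
        thread *o = runnable[i];
        if (o != t && o->cls == CACHE_FAIR) {
            o->timeslice_us  -= delta_us;
            o->adjustment_us -= delta_us;
            return;
        }
    }
    /* Otherwise charge a best-effort thread, keeping its cumulative
     * adjustment under the threshold to limit the performance effect. */
    for (int i = 0; i < n; i++) {
        thread *o = runnable[i];
        if (o->cls == BEST_EFFORT &&
            labs(o->adjustment_us - delta_us) <= BEST_EFFORT_CAP_US) {
            o->timeslice_us  -= delta_us;
            o->adjustment_us -= delta_us;
            return;
        }
    }
}
```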
FAIR IPC MODEL • The fair IPC of a thread cannot be measured directly • Fair IPC values are estimated using the Fair IPC Model • Estimates the miss-rate under fair-cache allocation • Then calculates the fair IPC given the fair-cache miss-rate
FAIR IPC MODEL OVERVIEW • The miss-rate estimation model: $\mathrm{MissRate}(A) = a \cdot \sum_{i=1}^{n} \mathrm{MissRate}(C_i) + b$ • Where $a$ and $b$ are linear coefficients, $n$ is the number of co-runners, and $C_i$ is the $i$-th co-runner • After running a target thread with several co-runners, $a$ and $b$ are derived using linear regression analysis • Under fair cache allocation, $\mathrm{FairMissRate}(A) = \mathrm{MissRate}(A) = \mathrm{MissRate}(C_i)$ for every co-runner • Substituting gives $\mathrm{FairMissRate}(A) = \dfrac{b}{1 - a \cdot n}$
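A small numeric sketch of the regression and fair-miss-rate steps, with made-up sample data: each observation pairs the target thread's measured miss rate y with the sum of its co-runners' miss rates x; ordinary least squares yields a and b, and the fair miss rate follows as b / (1 − a·n) (here n = 1, matching a dual-core CMP with one co-runner).

```c
#include <stdio.h>

/* Fit y = a*x + b by ordinary least squares over m observations,
 * where x = sum of co-runner miss rates and y = target miss rate. */
static void fit_linear(const double *x, const double *y, int m,
                       double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < m; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    *a = (m * sxy - sx * sy) / (m * sxx - sx * sx);
    *b = (sy - *a * sx) / m;
}

int main(void)
{
    /* Made-up preparation-phase samples: co-runner miss-rate sums
     * and the target thread's observed miss rates. */
    double x[] = { 0.02, 0.05, 0.09, 0.14 };
    double y[] = { 0.031, 0.036, 0.043, 0.052 };
    int n = 1;               /* one co-runner on a dual-core CMP */
    double a, b;

    fit_linear(x, y, 4, &a, &b);
    /* Under fair sharing MissRate(A) = MissRate(C_i), so
     * FairMissRate(A) = b / (1 - a*n). */
    double fair = b / (1.0 - a * n);
    printf("a=%.4f b=%.4f fair miss rate=%.4f\n", a, b, fair);
    return 0;
}
```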
FAIR IPC MODEL EVALUATION • Actual vs. Estimated fair-cache miss-rates
FAIR IPC MODEL LIMITATIONS • Estimating a thread's fair miss rate requires running the thread with several co-runners • This can yield poor results if co-runners are few or all have similar cache-access patterns • Requires running a thread multiple times with other threads • Highly impractical in practice • The model assumes a uniform distribution of cache requests • An unrealistic assumption
IMPLEMENTATION • Implemented as a loadable module for the Solaris 10 OS • Provides flexibility and independence from the kernel's scheduler • Cache-fair management is enabled via a system call • Threads are also assigned to a thread class (cache-fair or best-effort) • Tracks positive and negative adjustments to maintain balance in overall performance
IMPLEMENTATION • Cache-fair threads go through an initial preparation phase • The OS gathers performance data to calculate the fair miss rate • Re-preparation is also necessary if a thread changes its cache-access patterns • Forced each time a thread executes 1 billion instructions • The scheduling phase monitors threads using hardware performance counters • A thread's CPU time-slice is adjusted if it deviates from its fair IPC • Scheduling-phase adjustments occur every 50 million instructions (sketched below)
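A schematic of the two phases in C, again with assumed names and a toy proportional update rule (the paper's exact time-slice formula is not reproduced here); the counter reads are stand-in stubs, not real Solaris performance-counter calls.

```c
#include <stdint.h>
#include <stdio.h>

#define PREP_WINDOW_INSNS  1000000000ULL  /* refit model after 1B instructions */
#define SCHED_WINDOW_INSNS   50000000ULL  /* act every 50M instructions */

/* Toy stand-ins for hardware performance counters and the fitted
 * fair-IPC model; these are placeholders, not Solaris APIs. */
static uint64_t insns = 0, cycles = 0;
static uint64_t read_insns(void)  { return insns += SCHED_WINDOW_INSNS; }
static uint64_t read_cycles(void) { return cycles += 60000000; }
static double   fair_ipc(void)    { return 0.9; }

static void scheduling_tick(uint64_t insns_since_prep, long *timeslice_us)
{
    if (insns_since_prep >= PREP_WINDOW_INSNS) {
        /* Cache-access patterns may have changed; refit the model. */
        printf("re-entering preparation phase: refit miss-rate model\n");
        return;
    }
    /* Compare measured IPC with the model's fair IPC and size the next
     * time-slice so the thread converges on fair performance; this
     * proportional rule is an assumption, not the authors' formula. */
    double actual = (double)read_insns() / (double)read_cycles();
    double target = fair_ipc();
    *timeslice_us += (long)((target - actual) / target * 1000.0);
    printf("actual=%.3f fair=%.3f next slice=%ldus\n",
           actual, target, *timeslice_us);
}

int main(void)
{
    long slice = 10000;      /* assumed 10 ms baseline time-slice */
    for (uint64_t done = 0; done <= PREP_WINDOW_INSNS; done += SCHED_WINDOW_INSNS)
        scheduling_tick(done, &slice);
    return 0;
}
```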
EVALUATION • Uses multi-program workloads • SPEC CPU2000 benchmarks (CPU workloads) • SPEC JBB and TPC-C (server workloads) • Default scheduler = Solaris fixed-priority scheduler • Evaluation compares performance isolation between the cache-fair and default schedulers • Hardware simulator: Simics modules implementing a dual-core CMP
EVALUATION • Slow schedule contains high-miss-rate co-runners • Fast schedule contains low-miss-rate co-runners • Preparation-phase estimations are performed prior to the experiment • Run with all benchmark programs to get accurate estimates • The principal thread is monitored for 500 million scheduling-phase instructions • After the first 10 million, to avoid cold-start effects • Concurrently executed with three threads running identical benchmarks • Only one of these is designated as a best-effort thread • Performance variability is measured as percent slowdown (formula below) • The difference between performance in the fast and slow groups
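A worked form of the variability metric (the symbol names are mine, not the authors'):

$\text{slowdown}(\%) = \dfrac{T_{\text{slow}} - T_{\text{fast}}}{T_{\text{fast}}} \times 100$

where $T_{\text{slow}}$ and $T_{\text{fast}}$ are the principal thread's completion times under the slow and fast schedules; lower slowdown means better performance isolation.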
EFFECT ON PERFORMANCE ISOLATION • Cache-fair scheduling results in <4% variability across all benchmarks
EFFECT ON ABSOLUTE PERFORMANCE • Upper bound is the completion time in the slow group • Lower bound is the completion time in the fast group • Normalized to the default scheduler's completion time in the fast group
EFFECT ON THE OVERALL THROUGHPUT • Aggregate IPC was found to depend more on the relative IPCs of threads that received CPU time-slice adjustments than on the principal benchmark's IPC • Slow group: 1–12% increase • Fast group: 1–3% increase
EFFECT ON BEST-EFFORT THREADS • Worst-case effect on best-effort threads is small • Multiple best-effort threads are important for avoiding large performance effects • Average effect on best-effort threads is <1%, but the range is quite large
EXPERIMENTS WITH DATABASE WORKLOADS • The performance variability metric changes to transactions per second • Two sets of experiments: SPEC JBB and TPC-C, each as the principal program • The benchmark twolf was used as a best-effort co-running thread • Each benchmark emulates the database activity of an order-processing warehouse • The numbers of warehouses and execution threads are variable • The number of execution threads was fixed at one by the authors • Memory constraints required reducing the number of warehouses to 5 or fewer • The authors also reduced the L2 cache size to 512 KB in response
SPEC JBB • Experimental layout • Experimental results
TPC-C • Experimental layout • Experimental results
COMPARISON TO CACHE PARTITIONING • Created a hardware simulator of a CMP with cache partitioning • Cache partitioning reduced variability in only 3 of 9 benchmarks • Poor results because it does NOT reduce contention for the memory bus • Cache-fair scheduling accounts for memory-bus contention • It is therefore more effective than the hardware solution
RELATED WORK • Hardware solutions • Address the problem directly, avoiding OS modifications • Ensure fair resource allocation • Limited flexibility and effectiveness • Increase hardware cost, complexity, and time-to-market • Software solutions • Co-scheduling attempts to find the optimal co-runner thread • Fails if no good co-runner exists • Limited scalability (requires cores to coordinate scheduling)
SUMMARY • The cache-fair scheduling algorithm improves performance isolation on chip multiprocessors by nearly eliminating co-runner-dependent performance variability • According to the authors, it outperforms the hardware and software solutions of the time (2007) • I think the experiments show promising results, but the process of calculating the fair IPC is much too costly to be practical