Hardware Execution Throttling for Multicore Resource Management - PowerPoint PPT Presentation

Hardware Execution Throttling for Multicore Resource Management
Xiao Zhang, Sandhya Dwarkadas, Kai Shen

The Multi-Core Challenge
• Multi-core chip
  – Dominant on the market
  – Last-level on-chip cache is commonly shared by sibling cores, but the sharing is not well controlled
• Challenge: performance isolation
  – Poor and unpredictable performance
  – Denial-of-service attacks
(Image source: http://www.intel.com)

A Full Solution Includes …
• Good mechanism
  – Should be both efficient and practical to deploy
  – Main focus of this talk
• Good policy to govern the mechanism
  – As important as the mechanism, and not easy
  – Omitted in this talk

Existing Mechanism (I): Software-based Page Coloring
• Classic technique originally used to reduce cache misses, recently used by the OS to manage cache partitioning
• Partitions the cache at coarse granularity
• No need for hardware support
[Figure: Thread A's footprint (memory pages A1 to A5) mapped onto the shared cache (Way-1 … Way-n), with regions for Thread A and Thread B]

Existing Mechanism (II): Scheduling Quantum Adjustment
• Shorten the time quantum of an app that overuses the cache
• May leave a core idle if no other active thread is available
[Figure: timeline. Core 0 alternates between Thread A and idle; Core 1 runs Thread B throughout]

New Mechanism: Hardware Execution Throttling
• Throttle the execution speed of an app that overuses the cache
  – Duty cycle modulation
    • CPU works only in duty cycles and stalls in non-duty cycles
    • Allows per-core control (vs. per-processor control for existing Dynamic Voltage and Frequency Scaling)
  – Enable/disable cache prefetchers
    • L1 prefetchers
      – IP: keeps per-instruction load history to detect stride patterns
      – DCU: prefetches the next line when it detects multiple loads from the same line within a time limit
    • L2 prefetchers
      – Adjacent line: prefetches the line adjacent to the requested data
      – Stream: looks at streams of data for regular patterns
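A minimal sketch of how a duty-cycle level could be encoded, assuming the on-demand clock modulation layout documented for Intel processors (a 3-bit duty-cycle field in bits 3:1, an enable flag in bit 4, steps of 1/8). The slides do not name the register or its layout, so the constants and function names below are illustrative assumptions. The sketch only computes values and does not touch hardware.

```python
# Assumed layout (not stated in the slides): duty-cycle field in bits 3:1,
# enable flag in bit 4, duty cycle expressed in eighths of full speed.
DUTY_CYCLE_SHIFT = 1
ENABLE_BIT = 1 << 4
LEVELS = 8


def encode_duty_cycle(level: int) -> int:
    """Return a register value for running at level/8 of full speed.

    level == 8 means full speed, which is modelled as modulation disabled.
    """
    if not 1 <= level <= LEVELS:
        raise ValueError("level must be between 1 and 8")
    if level == LEVELS:
        return 0  # modulation off: the core runs in every cycle
    return ENABLE_BIT | (level << DUTY_CYCLE_SHIFT)


def effective_speed(level: int) -> float:
    """Fraction of cycles in which the core does useful work."""
    if not 1 <= level <= LEVELS:
        raise ValueError("level must be between 1 and 8")
    return level / LEVELS


if __name__ == "__main__":
    for level in range(1, LEVELS + 1):
        print(level, hex(encode_duty_cycle(level)), effective_speed(level))
```

Because the setting is per core, a throttled core can run at, say, 4/8 speed while its sibling stays at full speed.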

Brief View of Hardware Execution Throttling
• Comparison to page coloring
  – Little added kernel complexity
    • Code length: 40 lines in a single file (for reference, our page coloring implementation takes 700+ lines of code across 10+ files)
  – Lightweight to configure
    • Register read plus write: duty cycle 265 + 350 cycles, prefetcher 298 + 2065 cycles, which is less than 1 microsecond on a 3 GHz CPU (for reference, re-coloring a page takes 3 microseconds on the same CPU)
• Comparison to scheduling quantum adjustment
  – More fine-grained control
[Figure: timelines comparing quantum adjustment with hardware execution throttling; Core 0 runs Thread A with idle periods, Core 1 runs Thread B]
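The cycle counts above can be converted to time as a quick check of the "less than 1 microsecond" claim. This is a worked calculation using only the figures on the slide; the function and variable names are illustrative.

```python
CPU_HZ = 3.0e9  # 3 GHz clock, as stated on the slide


def cycles_to_us(cycles: int, hz: float = CPU_HZ) -> float:
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / hz * 1e6


# Read + write cost of each control register, in cycles (from the slide).
duty_cycle_cycles = 265 + 350    # 615 cycles
prefetcher_cycles = 298 + 2065   # 2363 cycles
total_cycles = duty_cycle_cycles + prefetcher_cycles  # 2978 cycles

duty_cycle_us = cycles_to_us(duty_cycle_cycles)
prefetcher_us = cycles_to_us(prefetcher_cycles)
total_us = cycles_to_us(total_cycles)

if __name__ == "__main__":
    print(f"duty cycle: {duty_cycle_us:.3f} us")
    print(f"prefetcher: {prefetcher_us:.3f} us")
    print(f"both:       {total_us:.3f} us")
```

Even reconfiguring both controls together stays just under 1 microsecond, compared with the 3 microseconds quoted for re-coloring a single page.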

Evaluation
• Candidate mechanisms
  – Page coloring
  – Scheduling quantum adjustment
  – Hardware execution throttling
• Experiment setup
  – Conducted on a 3.0 GHz Intel dual-core processor
  – 3 SPEC CPU2000 apps (swim, mcf, and equake) and 2 server-style apps (SPECjbb2005 and SPECweb99), run in all possible pairwise co-schedules
• Goal: evaluate their effectiveness in providing performance fairness
  – For each mechanism, tune its configuration offline to achieve the best fairness
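As a rough illustration of the experiment space, the pairwise co-schedules can be enumerated as below. Whether an app paired with itself counts as a co-schedule is an assumption here (the {mcf, mcf} case mentioned in the performance comparison suggests it does); the function name is illustrative.

```python
from itertools import combinations_with_replacement

# The five applications named on the slide.
APPS = ["swim", "mcf", "equake", "SPECjbb2005", "SPECweb99"]


def co_schedules(apps):
    """All unordered pairs of apps, including an app paired with itself (assumed)."""
    return list(combinations_with_replacement(apps, 2))


if __name__ == "__main__":
    for first, second in co_schedules(APPS):
        print(f"{{{first}, {second}}}")
```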

Fairness Comparison
• Unfairness factor: coefficient of variation (deviation-to-mean ratio, σ / μ) of co-running apps' normalized performances
• On average, all three mechanisms are effective in improving fairness
• The {swim, SPECweb} case illustrates a limitation of page coloring
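A minimal sketch of the unfairness factor as defined above. The slide does not say whether σ is the population or sample standard deviation; the population form is assumed here, and the input values in the example are hypothetical, not measured results.

```python
from statistics import mean, pstdev


def unfairness_factor(normalized_perfs):
    """Coefficient of variation (sigma / mu) of normalized performances.

    Uses the population standard deviation (an assumption; the slide
    does not specify). 0 means all co-running apps slow down equally.
    """
    mu = mean(normalized_perfs)
    if mu == 0:
        raise ValueError("mean performance must be non-zero")
    return pstdev(normalized_perfs) / mu


if __name__ == "__main__":
    # Hypothetical pair: one app keeps 90% of its solo speed, the other 50%.
    print(unfairness_factor([0.9, 0.5]))
    # Equal slowdowns give a perfectly fair outcome.
    print(unfairness_factor([0.8, 0.8]))
```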

Performance Comparison
• System efficiency: geometric mean of co-running apps' normalized performances
• On average, all three mechanisms achieve system efficiency comparable to default sharing
• Cases with severe inter-thread cache conflicts favor segregation, e.g. {swim, mcf}
• Cases with well-interleaved cache accesses favor sharing, e.g. {mcf, mcf}
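A minimal sketch of the system efficiency metric as defined above. The function name is illustrative and the input values are hypothetical, not measured results.

```python
from math import prod


def system_efficiency(normalized_perfs):
    """Geometric mean of co-running apps' normalized performances."""
    if not normalized_perfs:
        raise ValueError("need at least one performance value")
    if any(p < 0 for p in normalized_perfs):
        raise ValueError("performances must be non-negative")
    return prod(normalized_perfs) ** (1.0 / len(normalized_perfs))


if __name__ == "__main__":
    # Hypothetical pair: 90% and 40% of solo performance.
    print(system_efficiency([0.9, 0.4]))
```

Unlike an arithmetic mean, the geometric mean drops sharply when one app is starved, so a configuration cannot score well by sacrificing one thread for the other.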

Drawbacks of Page Coloring
• Expensive re-coloring cost
  – Prohibitive in a dynamic environment where frequent re-coloring may be necessary
• Complex memory management
  – Introduces artificial memory pressure
[Figure: Thread A's footprint (memory pages A1 to A5) mapped onto the shared cache (Way-1 … Way-n), with regions for Thread A and Thread B]
For more details on tackling these problems, please read our EuroSys'09 paper: Practical Page Coloring-based Multi-core Cache Management

Drawback of Scheduling Quantum Adjustment
• Coarse-grained control at scheduling quantum granularity may result in fluctuating service delays for individual transactions

Summary
• Hardware execution throttling mechanism for multi-core cache management
  – Fine-grained control
  – Lightweight solution that reuses existing hardware features
  – System efficiency is competitive with default sharing, largely comparable to scheduling quantum adjustment, but inferior to ideal page coloring
• Future work
  – Investigate policy for online configuration
