Exploration of Influence of Program Inputs on CMP Co-Scheduling


  1. Exploration of Influence of Program Inputs on CMP Co-Scheduling
     Yunlian Jiang, Xipeng Shen
     Computer Science, The College of William and Mary, USA

  2. Cache sharing in CMP
      Commercial CMPs
        Intel Core 2 Duo E6750
        AMD Athlon X2 6400+
     (Diagram: two CPU cores on one chip sharing an on-chip cache)

  3. Cache sharing
      Pros
        Shortens inter-thread communication
        Flexible usage of cache
      Cons: causes cache contention, which
        Degrades performance
        Impairs fairness
        Hurts performance isolation

  4. Job co-scheduling
      Assign jobs to chips so as to minimize contention
      Example: four programs P1-P4 to be placed on two dual-core CMP chips, Chip1 and Chip2
     (Diagram: P1-P4 and the two chips before any assignment)

  5. Job co-scheduling
      Assign jobs to chips so as to minimize contention
      Example
     (Diagram: one candidate assignment of P1-P4 to Chip1 and Chip2)

  6. Job co-scheduling
      Assign jobs to chips so as to minimize contention
      Example (a brute-force pairing sketch follows below)
     (Diagram: a different assignment of P1-P4 to Chip1 and Chip2)
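
The assignment illustrated on these slides is a pairing problem: split the jobs into pairs, one pair per dual-core chip, so that the total co-run penalty is minimized. Below is a minimal brute-force sketch in Python; the table degradation[(a, b)] of pairwise penalties and the numbers in it are hypothetical stand-ins, not the talk's actual cost model.

    def best_pairing(programs, degradation):
        """Try every way to split `programs` into pairs (one pair per dual-core
        chip) and return the pairing with the lowest total co-run penalty.
        Assumes an even number of jobs; exponential, so only for small job sets."""
        if not programs:
            return 0.0, []
        first, rest = programs[0], programs[1:]
        best_cost, best_pairs = float("inf"), None
        for partner in rest:
            remaining = [p for p in rest if p != partner]
            cost, pairs = best_pairing(remaining, degradation)
            # the pair's penalty is the degradation each member suffers
            cost += degradation[(first, partner)] + degradation[(partner, first)]
            if cost < best_cost:
                best_cost, best_pairs = cost, [(first, partner)] + pairs
        return best_cost, best_pairs

    # Made-up pairwise degradations for the P1..P4 example above.
    deg = {("P1", "P2"): 0.30, ("P2", "P1"): 0.25,
           ("P1", "P3"): 0.05, ("P3", "P1"): 0.10,
           ("P1", "P4"): 0.20, ("P4", "P1"): 0.15,
           ("P2", "P3"): 0.40, ("P3", "P2"): 0.35,
           ("P2", "P4"): 0.10, ("P4", "P2"): 0.05,
           ("P3", "P4"): 0.50, ("P4", "P3"): 0.45}
    print(best_pairing(["P1", "P2", "P3", "P4"], deg))
    # pairs P1 with P3 and P2 with P4 (total penalty ~0.30)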

  7. Previous co-scheduling work
      Runtime-sampling based
        Sample the performance of different schedules online and pick the best
        E.g., [Tullsen+: ASPLOS'00, ...]
      Profiling directed
        Offline profiling to learn program cache behavior
        E.g., [Nussbaum+: USENIX'05, ...]

  8. Our focus
      Two factors determining cache contention
        The programs running together
        The inputs to the programs

  9. Contributions of this work
      Exposing input impact on cache contention
      Construction of cross-input predictive models
      Evaluation on a proactive co-scheduler

  10. Contributions of this work
      Exposing input impact on cache contention
      Construction of cross-input predictive models
      Evaluation on a proactive co-scheduler

  11. Measurement of input impact
      Machine: Intel Xeon dual-core processors
      Compiler: gcc 4.1
      Hardware performance API: PAPI 3.5
      Experiments
        Measure the co-run performance degradation for every pair of the 12 SPEC CPU2000 programs
        3 different input sets (test, train, and ref)

  12. Metric
      sCPI: Cycles per Instruction (CPI) when a program runs alone
      cCPI: CPI when co-running with other programs
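
The slide defines the two CPIs but does not spell out how they combine; a natural degradation measure, consistent with the "normalized co-run degradation" results shown later (the exact formula is an assumption, not quoted from the slides), is

    degradation = (cCPI - sCPI) / sCPI

i.e., the relative slowdown a program suffers from sharing the cache.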

  13. Co-run degradation on different inputs
     (Chart: measured co-run degradations under the different input sets)

  14. Contributions of this work
      Exposing input impact on cache contention
      Construction of cross-input predictive models
      Evaluation on a proactive co-scheduler

  15. Objective
     (Diagram: an arbitrary input -> predictive model -> predicted cache behavior -> CAPS scheduler -> co-run schedule; a code-level sketch of this flow follows below)
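
In code form, the flow in this diagram might look like the sketch below. The callables predict_behavior and estimate_degradation are hypothetical stand-ins for the components named on these slides, and best_pairing is the earlier brute-force sketch; none of this is the actual CAPS implementation.

    def caps_schedule(jobs, input_features, predict_behavior, estimate_degradation):
        """Proactive flow: predict each job's single-run cache behavior from its
        input, derive pairwise contention estimates from those predictions, then
        search for the chip assignment with the least total degradation
        (best_pairing from the earlier sketch)."""
        behavior = {j: predict_behavior(j, input_features[j]) for j in jobs}
        penalty = {(a, b): estimate_degradation(behavior[a], behavior[b])
                   for a in jobs for b in jobs if a != b}
        return best_pairing(list(jobs), penalty)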

  16. Proactive Co-Scheduler: CAPS

  17. Single-run behaviors to predict
      Access per Instruction
        Density of memory references in an execution
      Distinct Memory Blocks per Cycle (DPC)
        Aggressiveness of cache contention
        DPC = Distinct Blocks per Instruction (DPI) x Instructions per Cycle (IPC)
      Reuse Signature
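
A worked example with made-up numbers: if a program touches 0.02 distinct memory blocks per instruction (DPI) and retires 1.5 instructions per cycle (IPC), its DPC is 0.02 x 1.5 = 0.03 distinct blocks per cycle; the higher this rate, the more aggressively the program competes for the shared cache.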

  18. Reuse signature
      Reuse distance: the number of distinct data elements accessed between two uses of the same datum
        E.g., in the trace "b a a c b", the reuse distance of the second b is 2
      Reuse signature: histogram of the reuse distances in an execution (a small computation sketch follows below)
        Predictable with over 94% accuracy [Zhong+: TC'07]
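
A minimal Python sketch of how a reuse signature could be computed from an address trace; this is a straightforward quadratic-time version for illustration, not the talk's actual profiling or prediction machinery, and the bin edges are arbitrary.

    from collections import Counter

    def reuse_signature(trace, bin_edges=(0, 1, 2, 4, 8, 16, 32)):
        """Histogram of reuse distances: for each access, count the distinct
        addresses touched since the previous access to the same address."""
        last_seen = {}          # address -> index of its most recent access
        histogram = Counter()
        for i, addr in enumerate(trace):
            if addr in last_seen:
                distance = len(set(trace[last_seen[addr] + 1 : i]))
                # drop the distance into a coarse (roughly logarithmic) bin
                bin_id = sum(1 for edge in bin_edges if distance >= edge)
                histogram[bin_id] += 1
            last_seen[addr] = i
        return histogram

    # The slide's example "b a a c b": the reuse of b sees 2 distinct
    # addresses (a and c) in between; the reuse of a sees 0.
    print(reuse_signature(["b", "a", "a", "c", "b"]))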

  19. Construction of predictive models
     (Diagram: training pairs <I1, B1> ... <In, Bn> of program inputs and measured memory behaviors feed a regression model; the resulting predictive model maps a new input to its predicted memory behavior)

  20. Regression models (sketched below)
      Linear model
        Least Mean Squares (LMS) method
        Linear function between inputs and outputs
      Non-linear model
        K-Nearest-Neighbor (KNN)
        Use the k most similar instances to estimate the new output value
      Hybrid method
        Pick the model with the minimum training error for each program
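
A compact Python/numpy sketch of the three options above: a least-squares linear fit, a k-nearest-neighbor estimator, and a hybrid that keeps whichever has the smaller training error. The feature and target arrays and the value of k are illustrative assumptions, not the paper's data.

    import numpy as np

    def fit_lms(X, y):
        """Linear model fitted by least squares: y ~ X @ w + b."""
        A = np.hstack([X, np.ones((len(X), 1))])        # add intercept column
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        return lambda x: float(np.append(x, 1.0) @ w)

    def fit_knn(X, y, k=3):
        """Non-linear model: average the targets of the k nearest training inputs."""
        def predict(x):
            nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
            return float(y[nearest].mean())
        return predict

    def fit_hybrid(X, y):
        """Per program, keep whichever model has the smaller training error."""
        candidates = [fit_lms(X, y), fit_knn(X, y)]
        errors = [np.mean([(m(x) - t) ** 2 for x, t in zip(X, y)]) for m in candidates]
        return candidates[int(np.argmin(errors))]

    # Illustrative data: input size -> accesses per instruction (made-up numbers).
    X = np.array([[1.0], [2.0], [4.0], [8.0]])
    y = np.array([0.31, 0.33, 0.36, 0.42])
    print(fit_hybrid(X, y)(np.array([6.0])))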

  21. Contributions of this work
      Exposing input impact on cache contention
      Construction of cross-input predictive models
      Evaluation on a proactive co-scheduler

  22. Prediction accuracy results (%)

      Program   | Access per instruction   | DPI
                | LMS     NN      Hybrid   | LMS     NN      Hybrid
      ----------+--------------------------+------------------------
      ammp      | 89.58   98.76   98.76    | 39.83   86.72   86.72
      art       | 98.86   94.25   98.86    | 98.96   94.25   98.96
      bzip      | 75.79   78.62   78.62    | 67.69   64.05   67.69
      crafty    | 99.54   99.24   99.54    | 76.31   72.50   76.31
      equake    | 54.58   54.42   54.58    | 82.27   82.13   82.27
      gap       | 74.75   79.35   79.35    | 79.87   78.08   79.87
      gzip      | 82.76   86.98   86.98    | 77.85   66.47   77.85
      mcf       | 90.25   92.45   92.45    | 89.73   88.11   89.73
      mesa      | 96.39   96.98   96.98    | 89.43   93.33   93.33
      parser    | 96.02   98.61   98.61    | 89.49   70.42   89.49
      twolf     | 97.11   98.10   98.10    | 52.12   86.75   86.75
      vpr       | 81.50   81.50   81.50    | 96.30   95.28   96.30
      ----------+--------------------------+------------------------
      Average   | 86.43   88.27   88.69    | 78.32   81.51   85.44
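
The slides do not state how accuracy is computed; a common convention for tables like this one (an assumption here) is accuracy = (1 - |predicted - measured| / measured) x 100%, averaged over the test inputs of each program.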

  23. Effects on co-scheduling
     (Chart: normalized co-run degradation of the optimal, CAPS-real, CAPS-pred, and random schedules; y-axis from 0 to 2.5)

  24. Conclusion
      Inputs influence job co-scheduling
        Co-schedulers should adapt to program inputs
      Cross-input predictive models
        Reasonable accuracy through LMS and KNN
        Effective in proactive co-scheduling

  25. Thanks! Questions?
