An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors
Jiacheng Zhao, Institute of Computing Technology, CAS
In conjunction with Prof. Jingling Xue, UNSW, Australia
Sep 11, 2013
Problem – Resource Utilization in Datacenters

How? [Figure from ASPLOS'09 by David Meisner+]
Problem – Resource Utilization in Datacenters [Micro'11 by Jason Mars+]

- Co-located applications (co-runners)
  - Contention for shared cache, shared IMC (integrated memory controller), etc.
  - Negative and unpredictable interference
- Two types of applications
  - Batch: no QoS guarantees
  - Latency-sensitive: must attain high QoS
- Co-location is disabled, so server utilization stays low
  - The knowledge of interference is lacking

[Diagram: four cores with private L1 caches sharing a cache and the memory controller]
[Figure: task placement in datacenters]
Our Goals: Predicting the Interference

- Quantitatively predict the cross-core performance interference
- Applicable to arbitrary co-locations
- Identify any "safe" co-locations
- Deployable in datacenters
Our Intuition – Mining a Model from Large Training Data

- Use machine-learning approaches over a large training set
Motivation example 0.485𝑄 𝑐𝑥 + 0.183𝑄 𝑑𝑏𝑑ℎ𝑓 − 0.138, 𝑗𝑔 𝑄 𝑐𝑥 < 3.2 0.706𝑄 𝑐𝑥 + 1.725𝑄 𝑑𝑏𝑑ℎ𝑓 − 0.220, 𝑗𝑔 3.2 ≤ 𝑄 𝑐𝑥 ≤ 9.6 𝑄𝐸 𝑛𝑑𝑔 = 0.907𝑄 𝑐𝑥 + 3.087𝑄 𝑑𝑏𝑑ℎ𝑓 − 0.561, 𝑗𝑔 𝑄 𝑐𝑥 > 9.6 2013/9/11
Outline

- Introduction
- Our Key Observations
- Our Approach – Two-Phase Approach
- Experimental Results
- Conclusion
Our Key Observations

Observation 1: The predictor function depends only on the aggregate pressure on the shared resources, not on how that pressure is split among the individual co-runners.

For an application A:
$$PD_A = f(P_{cache}, P_{bw}), \qquad (P_{cache}, P_{bw}) = g(A_1, A_2, \ldots, A_m)$$
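A minimal sketch of the aggregation step $g$, assuming a simple additive combination of the co-runner profiles (the backup slides list add and mul as the two aggregation modes, so additive is one option, not necessarily the chosen one):

```python
from dataclasses import dataclass

@dataclass
class Profile:
    cache: float  # shared-cache consumption from the runtime profile
    bw: float     # bandwidth consumption from the runtime profile

def aggregate_pressure(corunners: list[Profile]) -> tuple[float, float]:
    """g(A1, ..., Am): collapse the individual co-runner profiles into the
    single aggregate pair (P_cache, P_bw) that f takes as input.
    Additive aggregation ('add') is assumed; 'mul' is the other mode."""
    p_cache = sum(a.cache for a in corunners)
    p_bw = sum(a.bw for a in corunners)
    return p_cache, p_bw
```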
Our Key Observations

Observation 2: The function f is piecewise (e.g., the mcf predictor has three linear pieces over $P_{bw}$).
Our Key Observations

Naively, we could build A's prediction model by brute force. But we cannot afford the brute-force approach for every application:
- Thousands of applications in one datacenter
- Frequent software updates
- Different generations of processors
- Even the training for a single application is expensive

Observation 3: The function form is platform-dependent and application-independent; only the coefficients are application-dependent.
Outline

- Introduction
- Our Key Observations
- Our Approach – Two-Phase Approach
- Experimental Results
- Conclusion
Our Approach - Two-Phase Approach

Phase 1: Get the abstract model
- Find the function form best suited to all applications on a given platform
- Training: co-running many training applications
- Heavy: many training workloads
- Run once per platform

$$PD = \begin{cases} a_{11}\,P_{bw} + a_{12}\,P_{cache} + a_{13}, & \text{sub-domain 1} \\ a_{21}\,P_{bw} + a_{22}\,P_{cache} + a_{23}, & \text{sub-domain 2} \\ a_{31}\,P_{bw} + a_{32}\,P_{cache} + a_{33}, & \text{sub-domain 3} \end{cases}$$

Phase 2: Instantiate the abstract model
- Determine the application-specific coefficients ($a_{11}$, etc.)
- Training: co-running the one target application
- Lightweight, needing only a small number of training runs
- Run once per application

$$PD_{mcf} = \begin{cases} 0.49\,P_{bw} + 0.18\,P_{cache} - 0.13, & P_{bw} < 3.2 \\ 0.71\,P_{bw} + 1.73\,P_{cache} - 0.22, & 3.2 \le P_{bw} \le 9.6 \\ 0.91\,P_{bw} + 3.09\,P_{cache} - 0.56, & P_{bw} > 9.6 \end{cases}$$
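A sketch of what Phase 2 could look like, assuming the abstract model fixes the $P_{bw}$ breakpoints and a linear form per piece, and assuming ordinary least squares as the solver (the deck says regression analysis without naming one):

```python
import bisect
import numpy as np

def instantiate(breakpoints: list[float],
                runs: list[tuple[float, float, float]]) -> list[np.ndarray]:
    """Fit the application-specific coefficients (a_i1, a_i2, a_i3) of each
    linear piece from a few co-running training runs per sub-domain.

    breakpoints: P_bw breakpoints from the platform's abstract model, e.g. [3.2, 9.6]
    runs:        (p_bw, p_cache, measured_degradation) observations;
                 each sub-domain is assumed to have at least 3 of them
    """
    # Bucket each observation into the sub-domain its P_bw falls in.
    pieces = [[] for _ in range(len(breakpoints) + 1)]
    for p_bw, p_cache, pd in runs:
        pieces[bisect.bisect(breakpoints, p_bw)].append((p_bw, p_cache, pd))

    coeffs = []
    for obs in pieces:
        X = np.array([[b, c, 1.0] for b, c, _ in obs])  # columns: P_bw, P_cache, intercept
        y = np.array([pd for _, _, pd in obs])
        a, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares fit of one piece
        coeffs.append(a)
    return coeffs
```

With the training budget quoted later (C = 4 points in each of S = 3 sub-domains), each piece is fitted from 4 observations, enough for its 3 coefficients.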
Our Approach - Two-Phase Approach

Three key questions:
- Q1: What are selected as the application features?
- Q2: How is the abstract model created?
- Q3: What is the cost of the training?
Our Approach – Some Key Points

Q1: What are selected as the application features?
- Runtime profiles:
  - Shared-cache consumption
  - Bandwidth consumption
Our Approach – Some Key Points

Q2: How is the abstract model created?
- Regression analysis, driven by a configuration: each configuration binds to a function form
- Search for the function form that best fits all applications in the training set (a sketch of this search follows)
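The sketch below gives the shape of that search. Here `fit` and `error` are hypothetical helpers (fit one configuration to one application's training data, then score it), and picking the lowest mean error is an assumption; the deck does not spell out the selection criterion beyond "best for all applications":

```python
def search_abstract_model(configs, training_apps, fit, error):
    """Phase 1 sketch: evaluate every candidate configuration (pre-processing,
    aggregation mode, domain partitioning, function form -- see the backup
    slides) and keep the one with the lowest mean error across all training
    applications."""
    best_cfg, best_err = None, float("inf")
    for cfg in configs:
        mean_err = sum(error(fit(cfg, app), app)
                       for app in training_apps) / len(training_apps)
        if mean_err < best_err:
            best_cfg, best_err = cfg, mean_err
    return best_cfg
```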
Our Approach – Some Key Points

Q3: What is the cost of training during instantiation?
- All S sub-domains of the piecewise function must be covered
- A constant number of points, C, is taken per sub-domain; C depends on the form of the abstract model
- C × S training runs in total
- Usually C and S are small; in our experience, C = 4 and S = 3 (worked out below)
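Plugging in those values gives the per-application instantiation cost:

$$C \times S = 4 \times 3 = 12 \text{ co-running training runs}$$

compared with roughly 200 co-running trainings for the brute-force approach (see the efficiency results later).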
Outline

- Introduction
- Our Key Observations
- Our Approach – Two-Phase Approach
- Experimental Results
- Conclusion
Experimental Results

- Accuracy of our two-phase regression approach
  - Prediction precision
  - Error analysis
- Deployment in a datacenter
  - Utilization gained
  - QoS enforced and violated
Experimental Results

Benchmarks:
- SPEC CPU2006
- Nine real-world datacenter applications: Nlp-mt, openssl, openclas, MR-iindex, etc.

Platforms:
- Intel quad-core Xeon E5506 (main)
- Datacenter: 300 quad-core Xeon E5506 machines
Some Predictor Functions
Prediction Precision for SPEC Benchmarks

Prediction error: average 0.2%, ranging from 0.0% to 8.6%.
Prediction Precision for Datacenter Applications

15 workloads for each datacenter application.
Prediction error: average 0.3%, ranging from 0.0% to 5.0%.
Error Distribution

[Figure: distribution of prediction errors, spanning roughly −4% to +4%]
Prediction Efficiency

Precision (performance-degradation prediction error):
- Two-Phase: 0.0~11.7%, average 0.40%
- Brute-Force: 0.0~10.1%, average 0.23%

Efficiency (co-running trainings needed): ~200 for Brute-Force vs. 12 for Two-Phase

[Figure: real vs. Two-Phase vs. Brute-Force predicted degradation across 20 workloads]
Benefits of Piecewise Predictor Functions
Deployment in a Datacenter

Setup: 300 quad-core Xeons; 1,200 tasks when fully occupied

Applications:
- Latency-sensitive: Nlp-mt machine translation, on 600 dedicated cores (2 per chip)
- Batch: 600 tasks (kmeans, MapReduce)

Our purpose: issue batch jobs to idle cores while enforcing the QoS policy
Cross-Platform Applicability: Six-Core Intel Xeon

[Figure: real vs. predicted performance degradation across the workloads]

Prediction error: average 0.1%, ranging from 0.0% to 10.2%.
Cross-Platform Applicability: Quad-Core AMD

[Figure: real vs. predicted performance degradation across the workloads]

Prediction error: average 0.3%, ranging from 0.0% to 5.1%.
Outline

- Introduction
- Our Key Observations
- Our Approach – Two-Phase Approach
- Experimental Results
- Conclusion
Conclusion

An empirical model, based on our key observations:
- Uses aggregated resource consumption to create the predictor function, and thus works for arbitrary co-locations
- Piecewise modeling is reasonable and effective
- Breaking model creation into two phases keeps training efficient
Backup Slides

How to make the training set representative?
- Partition the pressure space into grids
- Sample from each grid (see the sketch below)
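A minimal sketch of the grid-based sampling, assuming uniform bin edges over the observed pressure range (the deck does not specify the binning):

```python
import random
from collections import defaultdict

def sample_per_grid(candidates: list[tuple[float, float]],
                    n_bins: int = 4) -> list[tuple[float, float]]:
    """Partition the (P_cache, P_bw) pressure space into an n_bins x n_bins
    grid and draw one candidate workload per non-empty cell, so the training
    set covers the space instead of clustering in one corner."""
    lo_c = min(c for c, _ in candidates)
    hi_c = max(c for c, _ in candidates)
    lo_b = min(b for _, b in candidates)
    hi_b = max(b for _, b in candidates)

    def bin_index(x: float, lo: float, hi: float) -> int:
        if hi == lo:
            return 0
        return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)

    grid = defaultdict(list)
    for cache, bw in candidates:
        grid[(bin_index(cache, lo_c, hi_c),
              bin_index(bw, lo_b, hi_b))].append((cache, bw))
    return [random.choice(cell) for cell in grid.values()]
```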
Backup Slides

How is domain partitioning done?
- Specified in the configuration file, using empirical knowledge
- Syntax: (shared_resource_i, condition_i), e.g. (P_bw, equal(4))

Example configuration:
# Pre-Processing: none, exp(2), log(2), pow(2)
# Aggregation mode: add, mul
# Domain Partitioning: {((Pbw), equal(4)), ((Pcache), equal(4)), ((Pcache, Pbw), equal(4, 4))}
# Function: linear, polynomial(2)