Performance Interference on Multico - PowerPoint PPT Presentation


  1. An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors. Jiacheng Zhao, Institute of Computing Technology, CAS. In conjunction with Prof. Jingling Xue, UNSW, Australia. Sep 11, 2013

  2. Problem – Resource Utilization in Datacenters. How? [ASPLOS’09 by David Meisner+]

  3. Problem – Resource Utilization in Datacenters. Co-located applications: contention for shared cache, shared IMC, etc.; negative and unpredictable interference. Two types of applications: batch (no QoS guarantees) and latency-sensitive (must attain high QoS). Co-location is disabled: low server utilization. Knowledge of the interference is lacking. [Diagram: applications and co-runners on four cores, each with its own L1, sharing a cache and a memory controller]

  4. Problem – Resource Utilization in Datacenters. Co-located applications: contention for shared cache, shared IMC, etc.; negative and unpredictable interference. Two types of applications: batch (no QoS guarantees) and latency-sensitive (must attain high QoS). Co-location is disabled: low server utilization. Knowledge of the interference is lacking.

  5. Problem – Resource Utilization in Datacenters [Micro’11 by Jason Mars+]. Co-located applications: contention for shared cache, shared IMC, etc.; negative and unpredictable interference. Two types of applications: batch (no QoS guarantees) and latency-sensitive (must attain high QoS). Co-location is disabled: low server utilization. Knowledge of the interference is lacking. Figure: Task placement in datacenters

  6. Our Goals: Predicting the interference. Quantitatively predict the cross-core performance interference; applicable to arbitrary co-locations; identify any “safe” co-locations; deployable in datacenters.

  7. Our Intuition – Mining a model from large training data. Training set; using machine learning approaches.

  8. Motivation example:
     $$PD_{mcf} = \begin{cases}
     0.485\,P_{bw} + 0.183\,P_{cache} - 0.138, & \text{if } P_{bw} < 3.2 \\
     0.706\,P_{bw} + 1.725\,P_{cache} - 0.220, & \text{if } 3.2 \le P_{bw} \le 9.6 \\
     0.907\,P_{bw} + 3.087\,P_{cache} - 0.561, & \text{if } P_{bw} > 9.6
     \end{cases}$$
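As a minimal sketch, the piecewise predictor on this slide can be written as a short Python function. The function and parameter names are illustrative, not taken from the slides. The sketch assumes the slide's symbols read as PD_mcf (performance degradation of mcf), P_bw (bandwidth pressure), and P_cache (cache pressure); the units of the inputs and the output are not stated on the slide.

```python
def predict_mcf_degradation(p_bw: float, p_cache: float) -> float:
    """Piecewise-linear predictor with the coefficients shown on slide 8.

    The sub-domain is selected by the bandwidth pressure alone, as on the slide.
    """
    if p_bw < 3.2:
        return 0.485 * p_bw + 0.183 * p_cache - 0.138
    if p_bw <= 9.6:
        return 0.706 * p_bw + 1.725 * p_cache - 0.220
    return 0.907 * p_bw + 3.087 * p_cache - 0.561


if __name__ == "__main__":
    # Hypothetical pressure values, one per sub-domain.
    for p_bw, p_cache in [(2.0, 1.0), (5.0, 1.0), (10.0, 1.0)]:
        print(p_bw, p_cache, predict_mcf_degradation(p_bw, p_cache))
```

Each branch is an ordinary linear function, so only the branch selection distinguishes this from a single linear regression.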

  9. Outline: Introduction; Our Key Observations; Our Approach – Two-Phase Approach; Experimental Results; Conclusion

  10. Our Key Observations. Observation 1: The function depends only on the aggregate pressure on shared resources, regardless of the individual pressure from any one co-runner. For an application A, $PD_A = f(P_{cache}, P_{bw})$, where $(P_{cache}, P_{bw}) = g(A_1, A_2, \ldots, A_m)$.
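Observation 1 can be illustrated with a small sketch. It assumes that g aggregates by summation (the "add" mode listed in the configuration on the backup slides), and it uses a made-up predictor `toy_f` whose coefficients are purely illustrative. All names here are hypothetical.

```python
from typing import Callable, Iterable, Tuple

# (cache pressure, bandwidth pressure) contributed by one co-runner
Pressure = Tuple[float, float]


def aggregate_add(co_runners: Iterable[Pressure]) -> Pressure:
    """g in 'add' mode (an assumption): sum each resource's pressure over all co-runners."""
    total_cache = 0.0
    total_bw = 0.0
    for cache, bw in co_runners:
        total_cache += cache
        total_bw += bw
    return total_cache, total_bw


def predict(f: Callable[[float, float], float], co_runners: Iterable[Pressure]) -> float:
    """PD_A = f(P_cache, P_bw), with (P_cache, P_bw) = g(A_1, ..., A_m)."""
    p_cache, p_bw = aggregate_add(co_runners)
    return f(p_cache, p_bw)


def toy_f(p_cache: float, p_bw: float) -> float:
    """Made-up predictor for illustration only; not a fitted model."""
    return 0.5 * p_bw + 0.2 * p_cache


if __name__ == "__main__":
    one_heavy = [(4.0, 6.0)]              # a single co-runner
    two_light = [(1.0, 2.0), (3.0, 4.0)]  # two co-runners with the same totals
    print(predict(toy_f, one_heavy), predict(toy_f, two_light))
```

Because the predictor only sees the aggregate, one heavy co-runner and two lighter ones with the same totals yield the same prediction, which is what lets the model cover arbitrary co-locations.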

  11. Our Key Observations. Observation 2: The function f is piecewise.

  12. Our Key Observations. Naively, we can create A’s prediction model using a brute-force approach, BUT we can NOT apply the brute-force approach to each application: thousands of applications in one datacenter; frequent software updates; different generations of processors; even the steps for a single application are expensive. Observation 3: The function form is platform-dependent and application-independent; only the coefficients are application-dependent.

  13. Outline: Introduction; Our Key Observations; Our Approach - Two-Phase Approach; Experimental Results; Conclusion

  14. Our Approach - Two-Phase Approach.
      Phase 1: Get the abstract model. Find the function form best suited to all applications on a given platform. Heavy: many training workloads. Run once for one platform. [Diagram: training applications feed a co-running trainer]
      $$PD = \begin{cases}
      a_{11} P_{bw} + a_{12} P_{cache} + a_{13}, & \text{subdomain 1} \\
      a_{21} P_{bw} + a_{22} P_{cache} + a_{23}, & \text{subdomain 2} \\
      a_{31} P_{bw} + a_{32} P_{cache} + a_{33}, & \text{subdomain 3}
      \end{cases}$$
      Phase 2: Instantiate the abstract model. Determine the application-specific coefficients ($a_{11}$, etc.). Lightweight, with a small number of training runs. Run once for one application. [Diagram: one application feeds a co-running trainer]
      $$PD_{mcf} = \begin{cases}
      0.49\,P_{bw} + 0.18\,P_{cache} - 0.13, & P_{bw} < 3.2 \\
      0.71\,P_{bw} + 1.73\,P_{cache} - 0.22, & \text{others} \\
      0.91\,P_{bw} + 3.09\,P_{cache} - 0.56, & P_{bw} > 9.6
      \end{cases}$$

  15. Our Approach - Two-Phase Approach

  16. Our Approach - Two-Phase Approach. Q1: Which application features are selected? Q2: How is the abstract model created? Q3: What is the cost of training?

  17. Our Approach – Some Key Points. Q1: Which application features are selected? Runtime profiles: shared cache consumption and bandwidth consumption.

  18. Our Approach – Some Key Points. Q2: How is the abstract model created? By regression analysis. The process is configurable: each configuration binds to a function form, and we search for the function form that best fits all applications in the training set.
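The search over configurations might look roughly like the sketch below. The option values are the ones listed on the backup slide about the configuration file; everything else is an assumption made for illustration: the dictionary layout, the `fit_error` callback (a hypothetical stand-in for running the regression and measuring its error), and the choice to score a configuration by its summed error over the training applications. Domain partitioning is left out for brevity.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

# Option values as listed on the backup slide about the configuration file.
PRE_PROCESSING = ["none", "exp(2)", "log(2)", "pow(2)"]
MODES = ["add", "mul"]
FUNCTIONS = ["linear", "polynomial(2)"]

Config = Dict[str, str]


def candidate_configurations() -> List[Config]:
    """Enumerate every combination of the configurable options."""
    return [
        {"pre_processing": pre, "mode": mode, "function": fn}
        for pre, mode, fn in product(PRE_PROCESSING, MODES, FUNCTIONS)
    ]


def search_best_form(
    training_apps: Sequence[str],
    fit_error: Callable[[Config, str], float],
) -> Config:
    """Return the configuration with the lowest total error over all training apps.

    fit_error(config, app) is a hypothetical callback that fits the function
    form bound to `config` for `app` and returns its regression error.
    """
    def total_error(config: Config) -> float:
        return sum(fit_error(config, app) for app in training_apps)

    return min(candidate_configurations(), key=total_error)
```

With these three option lists there are 4 × 2 × 2 = 16 candidate configurations, so an exhaustive search stays cheap compared with the co-running experiments needed to evaluate each one.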

  19. Our Approach – Some Key Points. Q3: What is the cost of training during instantiation? Training must cover all sub-domains of the piecewise function, say S of them, with a constant number of points per sub-domain, say C (the constant depends on the form of the abstract model). That gives C*S training runs in total. C and S are usually small; in our experience, C=4 and S=3.
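To make the instantiation cost concrete, the sketch below fits the three coefficients of one linear sub-domain from C = 4 training points and computes the total of C*S runs. Ordinary least squares is an assumption here; the slides say only "regression analysis". The function names and the sample points are hypothetical, and the sample degradations are generated from the first-branch coefficients of the motivation example rather than measured.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]  # (p_bw, p_cache, measured degradation)


def solve_3x3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_subdomain(samples: Sequence[Sample]) -> List[float]:
    """Least-squares fit of PD = a1*P_bw + a2*P_cache + a3 via the normal equations."""
    rows = [(p_bw, p_cache, 1.0) for p_bw, p_cache, _ in samples]
    ys = [pd for _, _, pd in samples]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve_3x3(ata, aty)


def total_training_runs(points_per_subdomain: int, subdomains: int) -> int:
    """C*S training runs in total."""
    return points_per_subdomain * subdomains


if __name__ == "__main__":
    # C = 4 hypothetical points inside the sub-domain P_bw < 3.2.
    points = [(1.0, 0.5), (2.0, 2.0), (3.0, 1.0), (0.5, 3.0)]
    samples = [(bw, c, 0.485 * bw + 0.183 * c - 0.138) for bw, c in points]
    print(fit_subdomain(samples))
    print(total_training_runs(4, 3))
```

Four points for three unknowns leaves one spare measurement per sub-domain, which is consistent with the remark that C depends on the form of the abstract model.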

  20. Outline: Introduction; Our Key Observations; Our Approach - Two-Phase Approach; Experimental Results; Conclusion

  21. Experimental Results. Accuracy of our two-phase regression approach: prediction precision; error analysis. Deployment in a datacenter: utilization gained; QoS enforced and violated.

  22. Experimental Results. Benchmarks: SPEC2006; nine real-world datacenter applications (Nlp-mt, openssl, openclas, MR-iindex, etc.). Platforms: Intel quad-core Xeon E5506 (main). Datacenter: 300 quad-core Xeon E5506.

  23. Some Predictor Function

  24. Prediction precision for SPEC Benchmarks. Prediction Error: Average 0.2%, from 0.0% to 8.6%.

  25. Prediction precision for datacenter applications. 15 workloads for each datacenter application. Prediction Error: Average 0.3%, from 0.0% to 5%.

  26. Error Distribution. [Chart: error distribution, vertical axis from -4.00% to 4.00%]

  27. Prediction Efficiency. Precision: Two-Phase: 0.0~11.7%, Average: 0.40%; Brute-Force: 0.0~10.1%, Average: 0.23%. Efficiency: co-running: ~200 vs. 12. [Chart: Real, Two-Phase, and Brute-Force performance degradation (0% to 80%) by Workload ID, 1 to 20]

  28. Benefits of piecewise predictor functions

  29. Benefits of piecewise predictor functions

  30. Deployment in a datacenter. 300 quad-core Xeon: 1200 tasks when fully occupied. Applications: latency-sensitive: Nlp-mt (machine translation), 600 dedicated cores, 2 per chip; batch jobs: 600 tasks, kmeans, MR. Our purpose: QoS policy; issue batch jobs to idle cores.
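One way such a QoS policy could use the predictor is sketched below: a batch job is issued to an idle core only if the predicted degradation of the latency-sensitive application stays within a limit. The greedy first-fit strategy, the `max_degradation` limit, the summing of batch-job pressures, and all names are assumptions for illustration; the slides do not describe the actual scheduler.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Job = Tuple[str, float, float]  # (name, bandwidth pressure, cache pressure)


def place_batch_jobs(
    num_chips: int,
    idle_cores_per_chip: int,
    jobs: Sequence[Job],
    predict: Callable[[float, float], float],
    max_degradation: float,
) -> Tuple[Dict[int, List[str]], List[str]]:
    """Greedy first-fit placement that keeps predicted degradation within the limit.

    predict(p_bw, p_cache) is the latency-sensitive application's predictor,
    evaluated on the summed pressure of the batch jobs sharing its chip.
    Returns (placements per chip, names of jobs that could not be placed).
    """
    placed: Dict[int, List[str]] = {chip: [] for chip in range(num_chips)}
    load = {chip: (0.0, 0.0) for chip in range(num_chips)}
    rejected: List[str] = []
    for name, p_bw, p_cache in jobs:
        for chip in range(num_chips):
            if len(placed[chip]) >= idle_cores_per_chip:
                continue  # no idle core left on this chip
            new_bw = load[chip][0] + p_bw
            new_cache = load[chip][1] + p_cache
            if predict(new_bw, new_cache) <= max_degradation:
                placed[chip].append(name)
                load[chip] = (new_bw, new_cache)
                break
        else:
            rejected.append(name)  # no chip offers a "safe" co-location
    return placed, rejected
```

For the setup on the slide, `num_chips` would be 300 and `idle_cores_per_chip` would be 2, since the latency-sensitive application occupies two of the four cores on each chip.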

  31. Cross-platform applicability: six-core Intel Xeon. Prediction Error: Average 0.1%, range from 0.0% to 10.2%. [Chart: Real vs. Predicted performance degradation (0% to 80%) by Workload ID, with average]

  32. Cross-platform applicability: quad-core AMD. Prediction Error: Average 0.3%, range from 0.0% to 5.1%. [Chart: Real vs. Predicted performance degradation (0% to 40%) by Workload ID, 1 to 31, with average]

  33. Outline: Introduction; Our Key Observations; Our Approach - Two-Phase Approach; Experimental Results; Conclusion

  34. Conclusion. An empirical model, based on our key observations: using aggregated resource consumption to create the predictor function, so that it works for arbitrary co-locations; a piecewise function is reasonable and effective; breaking the model creation into two phases, for efficiency.

  36. Backup slides. How to make the training set representative? Partition the space into grids; sample from each grid.
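The grid-based sampling could be sketched as follows. It assumes a two-dimensional space of (bandwidth pressure, cache pressure), an equal number of cells along each axis, and uniform random sampling inside each cell; none of these details are given on the slide, and the function name and parameters are hypothetical.

```python
import random
from typing import List, Tuple


def sample_training_grid(
    bw_range: Tuple[float, float],
    cache_range: Tuple[float, float],
    cells_per_axis: int,
    samples_per_cell: int,
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Partition the pressure space into a grid and draw points from every cell."""
    rng = random.Random(seed)  # fixed seed so the training set is reproducible
    bw_lo, bw_hi = bw_range
    cache_lo, cache_hi = cache_range
    bw_step = (bw_hi - bw_lo) / cells_per_axis
    cache_step = (cache_hi - cache_lo) / cells_per_axis
    points: List[Tuple[float, float]] = []
    for i in range(cells_per_axis):
        for j in range(cells_per_axis):
            for _ in range(samples_per_cell):
                bw = bw_lo + (i + rng.random()) * bw_step
                cache = cache_lo + (j + rng.random()) * cache_step
                points.append((bw, cache))
    return points
```

Sampling per cell rather than over the whole space guarantees that sparsely populated regions, such as very high bandwidth pressure, still contribute training workloads.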

  37. Backup slides. How to do domain partitioning? Specified in a configuration file. Syntax: (shared resource_i, condition_i), e.g. (P_bw, equal(4)). Empirical knowledge is needed to perform this task. Configuration:
      #Aggregation
      #Pre-Processing: none, exp(2), log(2), pow(2)
      #mode: add, mul
      #Domain Partitioning: {((Pbw), equal (4)), ((Pcache), equal (4)), ((Pcache, Pbw), equal (4, 4))},
      #Function: linear, polynomial(2)
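A condition such as `equal(4)` could be interpreted as in the sketch below, which assumes it means "split the resource's range into four equal-width sub-domains". That reading is a guess, since the slide does not define the condition; the pressure range used in the usage lines and the function names are hypothetical too.

```python
from bisect import bisect_right
from typing import List


def equal_partition(lo: float, hi: float, parts: int) -> List[float]:
    """Return the interior boundaries that split [lo, hi] into equal-width parts."""
    width = (hi - lo) / parts
    return [lo + k * width for k in range(1, parts)]


def subdomain_index(boundaries: List[float], value: float) -> int:
    """Index of the sub-domain containing value (0 is the lowest sub-domain)."""
    return bisect_right(boundaries, value)


if __name__ == "__main__":
    # Hypothetical bandwidth-pressure range.
    boundaries = equal_partition(0.0, 12.8, 4)
    print(boundaries)
    print(subdomain_index(boundaries, 5.0))
```

The lookup returns the index of the piecewise branch to evaluate, so the same boundaries can serve both training (choosing which sub-domain a sample belongs to) and prediction.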
