An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors
Jiacheng Zhao, Institute of Computing Technology, CAS
In conjunction with Prof. Jingling Xue, UNSW, Australia
Sep 11, 2013
Problem – Resource Utilization in Datacenters

How? [Figure from ASPLOS'09 by David Meisner+]
Problem – Resource Utilization in Datacenters [Micro'11 by Jason Mars+]

- Co-located applications (co-runners)
  - Contention for shared cache, shared IMC (integrated memory controller), etc.
  - Negative and unpredictable interference
- Two types of applications
  - Batch: no QoS guarantees
  - Latency-sensitive: must attain high QoS
- Co-location is disabled, so server utilization stays low
  - The knowledge of interference is lacking

[Diagram: four cores with private L1 caches sharing a cache and the memory controller]
[Figure: task placement in datacenters]
Our Goals: Predicting the Interference

- Quantitatively predict the cross-core performance interference
- Applicable to arbitrary co-locations
- Identify any "safe" co-locations
- Deployable in datacenters
Our Intuition – Mining a Model from Large Training Data

- Use machine-learning approaches over a large training set
Motivation example 0.485𝑄 𝑐𝑥 + 0.183𝑄 𝑑𝑏𝑑ℎ𝑓 − 0.138, 𝑗𝑔 𝑄 𝑐𝑥 < 3.2 0.706𝑄 𝑐𝑥 + 1.725𝑄 𝑑𝑏𝑑ℎ𝑓 − 0.220, 𝑗𝑔 3.2 ≤ 𝑄 𝑐𝑥 ≤ 9.6 𝑄𝐸 𝑛𝑑𝑔 = 0.907𝑄 𝑐𝑥 + 3.087𝑄 𝑑𝑏𝑑ℎ𝑓 − 0.561, 𝑗𝑔 𝑄 𝑐𝑥 > 9.6 2013/9/11
Outline

- Introduction
- Our Key Observations
- Our Approach – Two-Phase Approach
- Experimental Results
- Conclusion
Our Key Observations

Observation 1: The predictor function depends only on the aggregate pressure on the shared resources, not on how that pressure is split among the individual co-runners.

For an application A:
$$PD_A = f(P_{cache}, P_{bw}), \qquad (P_{cache}, P_{bw}) = g(A_1, A_2, \ldots, A_m)$$
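A minimal sketch of the aggregation step $g$, assuming a simple additive combination of the co-runner profiles (the backup slides list add and mul as the two aggregation modes, so additive is one option, not necessarily the chosen one):

```python
from dataclasses import dataclass

@dataclass
class Profile:
    cache: float  # shared-cache consumption from the runtime profile
    bw: float     # bandwidth consumption from the runtime profile

def aggregate_pressure(corunners: list[Profile]) -> tuple[float, float]:
    """g(A1, ..., Am): collapse the individual co-runner profiles into the
    single aggregate pair (P_cache, P_bw) that f takes as input.
    Additive aggregation ('add') is assumed; 'mul' is the other mode."""
    p_cache = sum(a.cache for a in corunners)
    p_bw = sum(a.bw for a in corunners)
    return p_cache, p_bw
```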
Our Key Observations

Observation 2: The function f is piecewise (e.g., the mcf predictor has three linear pieces over $P_{bw}$).
Our Key Observations

Naively, we could build A's prediction model by brute force. But we cannot afford the brute-force approach for every application:
- Thousands of applications in one datacenter
- Frequent software updates
- Different generations of processors
- Even the training for a single application is expensive

Observation 3: The function form is platform-dependent and application-independent; only the coefficients are application-dependent.
Outline

- Introduction
- Our Key Observations
- Our Approach – Two-Phase Approach
- Experimental Results
- Conclusion
Our Approach - Two-Phase Approach

Phase 1: Get the abstract model
- Find the function form best suited to all applications on a given platform
- Training: co-running many training applications
- Heavy: many training workloads
- Run once per platform

$$PD = \begin{cases} a_{11}\,P_{bw} + a_{12}\,P_{cache} + a_{13}, & \text{sub-domain 1} \\ a_{21}\,P_{bw} + a_{22}\,P_{cache} + a_{23}, & \text{sub-domain 2} \\ a_{31}\,P_{bw} + a_{32}\,P_{cache} + a_{33}, & \text{sub-domain 3} \end{cases}$$

Phase 2: Instantiate the abstract model
- Determine the application-specific coefficients ($a_{11}$, etc.)
- Training: co-running the one target application
- Lightweight, needing only a small number of training runs
- Run once per application

$$PD_{mcf} = \begin{cases} 0.49\,P_{bw} + 0.18\,P_{cache} - 0.13, & P_{bw} < 3.2 \\ 0.71\,P_{bw} + 1.73\,P_{cache} - 0.22, & 3.2 \le P_{bw} \le 9.6 \\ 0.91\,P_{bw} + 3.09\,P_{cache} - 0.56, & P_{bw} > 9.6 \end{cases}$$
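A sketch of what Phase 2 could look like, assuming the abstract model fixes the $P_{bw}$ breakpoints and a linear form per piece, and assuming ordinary least squares as the solver (the deck says regression analysis without naming one):

```python
import bisect
import numpy as np

def instantiate(breakpoints: list[float],
                runs: list[tuple[float, float, float]]) -> list[np.ndarray]:
    """Fit the application-specific coefficients (a_i1, a_i2, a_i3) of each
    linear piece from a few co-running training runs per sub-domain.

    breakpoints: P_bw breakpoints from the platform's abstract model, e.g. [3.2, 9.6]
    runs:        (p_bw, p_cache, measured_degradation) observations;
                 each sub-domain is assumed to have at least 3 of them
    """
    # Bucket each observation into the sub-domain its P_bw falls in.
    pieces = [[] for _ in range(len(breakpoints) + 1)]
    for p_bw, p_cache, pd in runs:
        pieces[bisect.bisect(breakpoints, p_bw)].append((p_bw, p_cache, pd))

    coeffs = []
    for obs in pieces:
        X = np.array([[b, c, 1.0] for b, c, _ in obs])  # columns: P_bw, P_cache, intercept
        y = np.array([pd for _, _, pd in obs])
        a, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares fit of one piece
        coeffs.append(a)
    return coeffs
```

With the training budget quoted later (C = 4 points in each of S = 3 sub-domains), each piece is fitted from 4 observations, enough for its 3 coefficients.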
Our Approach - Two-Phase Approach

Three key questions:
- Q1: What are selected as the application features?
- Q2: How is the abstract model created?
- Q3: What is the cost of the training?
Our Approach – Some Key Points

Q1: What are selected as the application features?
- Runtime profiles:
  - Shared-cache consumption
  - Bandwidth consumption
Our Approach – Some Key Points

Q2: How is the abstract model created?
- Regression analysis, driven by a configuration: each configuration binds to a function form
- Search for the function form that best fits all applications in the training set (a sketch of this search follows)
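The sketch below gives the shape of that search. Here `fit` and `error` are hypothetical helpers (fit one configuration to one application's training data, then score it), and picking the lowest mean error is an assumption; the deck does not spell out the selection criterion beyond "best for all applications":

```python
def search_abstract_model(configs, training_apps, fit, error):
    """Phase 1 sketch: evaluate every candidate configuration (pre-processing,
    aggregation mode, domain partitioning, function form -- see the backup
    slides) and keep the one with the lowest mean error across all training
    applications."""
    best_cfg, best_err = None, float("inf")
    for cfg in configs:
        mean_err = sum(error(fit(cfg, app), app)
                       for app in training_apps) / len(training_apps)
        if mean_err < best_err:
            best_cfg, best_err = cfg, mean_err
    return best_cfg
```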
Our Approach – Some Key Points

Q3: What is the cost of training during instantiation?
- All S sub-domains of the piecewise function must be covered
- A constant number of points, C, is taken per sub-domain; C depends on the form of the abstract model
- C × S training runs in total
- Usually C and S are small; in our experience, C = 4 and S = 3 (worked out below)
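Plugging in those values gives the per-application instantiation cost:

$$C \times S = 4 \times 3 = 12 \text{ co-running training runs}$$

compared with roughly 200 co-running trainings for the brute-force approach (see the efficiency results later).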
Outline

- Introduction
- Our Key Observations
- Our Approach – Two-Phase Approach
- Experimental Results
- Conclusion
Experimental Results

- Accuracy of our two-phase regression approach
  - Prediction precision
  - Error analysis
- Deployment in a datacenter
  - Utilization gained
  - QoS enforced and violated
Experimental Results

Benchmarks:
- SPEC CPU2006
- Nine real-world datacenter applications: Nlp-mt, openssl, openclas, MR-iindex, etc.

Platforms:
- Intel quad-core Xeon E5506 (main)
- Datacenter: 300 quad-core Xeon E5506 machines
Some Predictor Functions
Prediction Precision for SPEC Benchmarks

Prediction error: average 0.2%, ranging from 0.0% to 8.6%.
Prediction Precision for Datacenter Applications

15 workloads for each datacenter application.
Prediction error: average 0.3%, ranging from 0.0% to 5.0%.
Error Distribution

[Figure: distribution of prediction errors, spanning roughly −4% to +4%]
Prediction Efficiency

Precision (performance-degradation prediction error):
- Two-Phase: 0.0~11.7%, average 0.40%
- Brute-Force: 0.0~10.1%, average 0.23%

Efficiency (co-running trainings needed): ~200 for Brute-Force vs. 12 for Two-Phase

[Figure: real vs. Two-Phase vs. Brute-Force predicted degradation across 20 workloads]
Benefits of Piecewise Predictor Functions
Deployment in a Datacenter

Setup: 300 quad-core Xeons; 1,200 tasks when fully occupied

Applications:
- Latency-sensitive: Nlp-mt machine translation, on 600 dedicated cores (2 per chip)
- Batch: 600 tasks (kmeans, MapReduce)

Our purpose: issue batch jobs to idle cores while enforcing the QoS policy
Cross-Platform Applicability: Six-Core Intel Xeon

[Figure: real vs. predicted performance degradation across the workloads]

Prediction error: average 0.1%, ranging from 0.0% to 10.2%.
Cross-Platform Applicability: Quad-Core AMD

[Figure: real vs. predicted performance degradation across the workloads]

Prediction error: average 0.3%, ranging from 0.0% to 5.1%.
Outline

- Introduction
- Our Key Observations
- Our Approach – Two-Phase Approach
- Experimental Results
- Conclusion
Conclusion

An empirical model, based on our key observations:
- Uses aggregated resource consumption to create the predictor function, and thus works for arbitrary co-locations
- Piecewise modeling is reasonable and effective
- Breaking model creation into two phases keeps training efficient
Backup Slides

How to make the training set representative?
- Partition the pressure space into grids
- Sample from each grid (see the sketch below)
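A minimal sketch of the grid-based sampling, assuming uniform bin edges over the observed pressure range (the deck does not specify the binning):

```python
import random
from collections import defaultdict

def sample_per_grid(candidates: list[tuple[float, float]],
                    n_bins: int = 4) -> list[tuple[float, float]]:
    """Partition the (P_cache, P_bw) pressure space into an n_bins x n_bins
    grid and draw one candidate workload per non-empty cell, so the training
    set covers the space instead of clustering in one corner."""
    lo_c = min(c for c, _ in candidates)
    hi_c = max(c for c, _ in candidates)
    lo_b = min(b for _, b in candidates)
    hi_b = max(b for _, b in candidates)

    def bin_index(x: float, lo: float, hi: float) -> int:
        if hi == lo:
            return 0
        return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)

    grid = defaultdict(list)
    for cache, bw in candidates:
        grid[(bin_index(cache, lo_c, hi_c),
              bin_index(bw, lo_b, hi_b))].append((cache, bw))
    return [random.choice(cell) for cell in grid.values()]
```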
Backup Slides

How is domain partitioning done?
- Specified in the configuration file, using empirical knowledge
- Syntax: (shared_resource_i, condition_i), e.g. (P_bw, equal(4))

Example configuration:
# Pre-Processing: none, exp(2), log(2), pow(2)
# Aggregation mode: add, mul
# Domain Partitioning: {((Pbw), equal(4)), ((Pcache), equal(4)), ((Pcache, Pbw), equal(4, 4))}
# Function: linear, polynomial(2)