Accelerators in Technical Computing: Is it Worth the Pain? A TCO Perspective
Sandra Wienke, Dieter an Mey, Matthias S. Müller
Center for Computing and Communication, JARA – High-Performance Computing, RWTH Aachen University
Rechen- und Kommunikationszentrum (RZ)
Agenda
- Introduction
- Modeling
  - Total Cost of Ownership (TCO)
  - Comparison Metrics
- Case Study on Accelerators
  - Programming Models & System Types
  - TCO Components @ RWTH
  - Real-World Application
  - Results
- Conclusion & Outlook
TCO of Accelerators | Sandra Wienke | Center for Computing and Communication
Introduction
- Today: variety of HPC clusters
  - Usage of accelerators (NVIDIA GPU, Intel Xeon Phi) motivated by a promising performance-per-watt ratio
- System comparison by performance or performance per watt is not sufficient for a purchase decision
- Total cost of ownership (TCO)
  - Acquisition costs, housing, operation costs, ...
  - Inclusion of manpower costs (administration & programming)
  - Comparison of costs per program run (application-dependent)
- Investigation of a real-world software package
  - OpenMP on Intel Sandy Bridge
  - OpenMP + LEO on Intel Xeon Phi
  - OpenCL, OpenACC on NVIDIA Fermi GPU
  - Question: impact of manpower effort/programming model?
Modeling – Total Cost of Ownership (TCO)
- Basis: single compute node, extrapolated to cluster size
- Investment: $I = \mathrm{TCO}(n, \tau) = C_{ot}(n) + C_{pa}(n) \cdot \tau$
  - $n$: number of nodes
  - $\tau$: system lifetime
- One-time costs $C_{ot}$
  - Per node: HW acquisition, building/infrastructure, OS/env. installation
  - Per node type: OS/env. installation, programming effort
- Annual costs $C_{pa}$
  - Per node: HW maintenance, building/infrastructure, OS/env. maintenance, power consumption
  - Per node type: OS/env. maintenance, compiler/software, application maintenance
- TCO depends on architecture & application
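The cost model on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' spreadsheet: the class `NodeTypeCosts`, its field names, and all euro figures are hypothetical placeholders. Only the structure TCO(n, τ) = C_ot(n) + C_pa(n) · τ and the split into per-node and per-node-type costs are taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class NodeTypeCosts:
    """Cost inputs for one node type. Field names are hypothetical."""
    one_time_per_node: float  # e.g. HW acquisition, installation [EUR]
    one_time_per_type: float  # e.g. programming effort [EUR]
    annual_per_node: float    # e.g. HW maintenance, power [EUR/year]
    annual_per_type: float    # e.g. compiler, app. maintenance [EUR/year]


def one_time_costs(c: NodeTypeCosts, n: int) -> float:
    """C_ot(n): per-node costs scale with n, per-type costs occur once."""
    return c.one_time_per_node * n + c.one_time_per_type


def annual_costs(c: NodeTypeCosts, n: int) -> float:
    """C_pa(n): yearly costs, split the same way as the one-time costs."""
    return c.annual_per_node * n + c.annual_per_type


def tco(c: NodeTypeCosts, n: int, tau_years: float) -> float:
    """TCO(n, tau) = C_ot(n) + C_pa(n) * tau."""
    return one_time_costs(c, n) + annual_costs(c, n) * tau_years


# Placeholder figures for illustration only, not taken from the case study.
example = NodeTypeCosts(
    one_time_per_node=5000.0,
    one_time_per_type=1000.0,
    annual_per_node=800.0,
    annual_per_type=200.0,
)
print(tco(example, n=10, tau_years=4))
```

Keeping per-node and per-node-type costs in separate fields makes it visible that programming effort is paid once per architecture, while hardware and power scale with the number of nodes.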
Modeling – Comparison Metrics
- Costs per program run $C_{ppr}$
  - Includes investment/TCO & application performance
  - $C_{ppr}(n, \tau) = \frac{\mathrm{TCO}(n, \tau)}{n_{ex}(\tau) \cdot n}$ with $n_{ex}(\tau) = \frac{k \cdot \tau}{t_{par}}$
  - $n$: number of nodes; $\tau$: system lifetime; $n_{ex}$: number of application executions; $k$: system usage rate; $t_{par}$: parallel runtime
- Used baseline for system X: Intel Sandy Bridge (SNB) + OpenMP
  - $\frac{C_{ppr,X}(n_X, \tau) - C_{ppr,OMP}(n_{OMP}, \tau)}{C_{ppr,OMP}(n_{OMP}, \tau)}$ is $< 0$ if X is beneficial and $\geq 0$ if OMP is beneficial
- Break-even investments
  - Min. budget needed so that system X is beneficial over OpenMP on SNB
  - Solve for $I$ with given fixed lifetime $\tau$: $C_{ppr,X}(n_X, \tau) - C_{ppr,OMP}(n_{OMP}, \tau) = 0$ with $\mathrm{TCO}(n, \tau) = I$
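The metrics on this slide can be tried out with a short Python sketch. Several simplifications here are assumptions and not part of the slides: TCO is taken to be linear in the node count (a fixed per-type cost plus a per-node cost over the lifetime), fractional node counts are allowed, and each system is described by a hypothetical tuple `(per_type_eur, per_node_eur, t_par_s)` with made-up numbers. The usage rate of 80% and the lifetime of 4 years are the values used later in the case study.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def executions(k, tau_years, t_par_s):
    """n_ex(tau) = k * tau / t_par, with tau converted to seconds."""
    return k * tau_years * SECONDS_PER_YEAR / t_par_s


def cost_per_run(tco, n, k, tau_years, t_par_s):
    """C_ppr(n, tau) = TCO(n, tau) / (n_ex(tau) * n)."""
    return tco / (executions(k, tau_years, t_par_s) * n)


def relative_difference(c_x, c_omp):
    """(C_ppr,X - C_ppr,OMP) / C_ppr,OMP; negative means X is beneficial."""
    return (c_x - c_omp) / c_omp


def nodes_for_investment(investment, per_type, per_node):
    """Assumed linear TCO: I = per_type + per_node * n, solved for n."""
    return (investment - per_type) / per_node


def cost_per_run_for_investment(investment, system, k=0.8, tau_years=4.0):
    """C_ppr when the whole investment I is spent on one system type."""
    per_type, per_node, t_par_s = system
    n = nodes_for_investment(investment, per_type, per_node)
    return cost_per_run(investment, n, k, tau_years, t_par_s)


def break_even_investment(x, omp):
    """Investment I at which C_ppr,X equals C_ppr,OMP (linear model only)."""
    f_x, p_x, t_x = x
    f_o, p_o, t_o = omp
    denom = p_o * t_o - p_x * t_x
    if denom <= 0:
        raise ValueError("X never breaks even under this linear model")
    return (f_x * p_o * t_o - f_o * p_x * t_x) / denom


# Hypothetical systems: (per-type cost, per-node cost over lifetime, runtime).
system_x = (1000.0, 10000.0, 100.0)   # extra programming effort, faster
system_omp = (0.0, 8000.0, 200.0)     # baseline, no porting effort

i_be = break_even_investment(system_x, system_omp)
rel = relative_difference(
    cost_per_run_for_investment(100_000.0, system_x),
    cost_per_run_for_investment(100_000.0, system_omp),
)
print(f"break-even investment: {i_be:.2f} EUR, relative C_ppr at 100K: {rel:.2%}")
```

Under the linear assumption, usage rate and lifetime cancel out of the break-even condition, so the break-even investment depends only on the fixed per-type cost and on the product of per-node cost and runtime for each system.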
Case Study on Accelerators – Programming Models & System Types
Programming Model                   | Accelerator                                | Host                                   | Compiler
Serial, OpenMP (simple, vectorized) | -                                          | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
LEO + OpenMP                        | Intel Xeon Phi 5110P, 60 cores             |                                        | Intel 13.0.1
OpenACC                             | NVIDIA Tesla C2050 (Fermi), ECC on         | 1x Intel Westmere, 4 cores, 2.4 GHz    | PGI 12.9
OpenCL                              | NVIDIA Tesla C2050 (Fermi), ECC on         | 1x Intel Westmere, 4 cores, 2.4 GHz    | Intel 13.0.1
Case Study on Accelerators – TCO Components @ RWTH
- One-time costs
  - HW purchase: list prices from Bull
  - Building/infrastructure: treated as annual costs since it is amortized over 25 years
  - OS/env. installation: -
  - Programming effort: full-time employee costs of 285.71 € per day
- Annual costs
  - HW maintenance: 5% of HW purchase costs
  - Building/infrastructure: 200,000 € per year; costs per node: divide by 1.6 MW, multiply by the max. power consumption of each node
  - OS/env. maintenance: 4 admins, 75% cluster maintenance (~2300 nodes): 180,000 € / 2300 = 78 € per node and year
  - Software/compiler: -
  - Power: PUE 1.5, regional electricity costs 0.15 €/kWh
  - Application maintenance: - (small kernels)
- Given a lifetime of 4 years and an investment: derive #nodes and #executions (usage rate 80%), then C_ppr
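The per-node arithmetic behind these components can be written out as a small Python sketch. The rates (285.71 € per day, 5% maintenance, 200,000 € per year over 1.6 MW, 180,000 € over 2300 nodes, PUE 1.5, 0.15 €/kWh) are the ones listed on the slide. The hardware price and the power figures in the example call are hypothetical, since the slides do not state the list prices, and the sketch assumes that a node draws its average power continuously all year.

```python
HOURS_PER_YEAR = 365 * 24           # assumption: node runs all year
DAY_RATE_EUR = 285.71               # full-time employee cost per day
HW_MAINTENANCE_RATE = 0.05          # 5% of HW purchase costs per year
BUILDING_EUR_PER_YEAR = 200_000.0   # building/infrastructure per year
BUILDING_CAPACITY_W = 1.6e6         # 1.6 MW
ADMIN_EUR_PER_YEAR = 180_000.0      # OS/env. maintenance for the cluster
CLUSTER_NODES = 2300
PUE = 1.5                           # power usage effectiveness
EUR_PER_KWH = 0.15                  # regional electricity costs


def programming_cost(effort_days):
    """One-time programming effort in EUR."""
    return effort_days * DAY_RATE_EUR


def annual_cost_per_node(hw_price_eur, max_power_w, avg_power_w):
    """Annual per-node cost components in EUR per year."""
    costs = {
        "hw_maintenance": HW_MAINTENANCE_RATE * hw_price_eur,
        "building": BUILDING_EUR_PER_YEAR / BUILDING_CAPACITY_W * max_power_w,
        "admin": ADMIN_EUR_PER_YEAR / CLUSTER_NODES,
        "power": avg_power_w / 1000.0 * HOURS_PER_YEAR * PUE * EUR_PER_KWH,
    }
    costs["total"] = sum(costs.values())
    return costs


# Hypothetical node: 4000 EUR list price, 300 W max. power, 200 W average.
costs = annual_cost_per_node(4000.0, 300.0, 200.0)
for name, value in costs.items():
    print(f"{name:15s} {value:10.2f} EUR/year")
print(f"5 days of porting effort: {programming_cost(5.0):.2f} EUR")
```

For a node of this hypothetical size, power is the largest annual item, which is consistent with the slide's emphasis on measured power consumption per application.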
Case Study on Accelerators – Real-World Application
- KegelSpan [2]: 3D simulation of the bevel gear cutting process (source: BMW, ZF, Klingelnberg)
- Basis: serial version
- Small kernel: kernel portion artificially increased from 25% to 90%
- Assumption: homogeneous app. landscape
[2] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume 2108.2 of VDI-Berichte, pp. 1381–1384, Düsseldorf, VDI Verlag, 2010.
Case Study on Accelerators – TCO Components of Application
[Figure: bar charts of runtime [s], power consumption [W], and programming effort [days] for OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi), OpenMP-vec (SNB), and OpenMP-simp (SNB). Data labels in the runtime/power panel: 158, 140, 119. Data labels in the effort panel: 5.0, 4.5, 3.5, 1.5, 0.5 days.]
Case Study on Accelerators – Results
[Figure: costs per program run relative to OMP-simp (-20% to 20%) over investment (0 € to 200K €) for OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi), and OpenMP-vec (SNB). Data labels: 3.62%, -12.09%, -16.82%, -17.15%.]
[Figure: break-even investment (0 € to 10,000 €). Data labels: 7,787 €, 7,231 €, 1,809 €.]
Conclusion
- Are accelerators beneficial? “It depends”
  - TCO spreadsheet [1] available for own computations
- Our results (w/ 90% kernel portion; relative to SNB-OMP, 4 years, 250 K €) show
  - GPU Fermi beneficial over 2-socket Intel SNB server: -17% C_ppr
  - Intel Xeon Phi results disappointing for now: +4% C_ppr
    - Mainly due to high acquisition costs
  - NVIDIA Kepler probably similar
- Programming effort impacts break-even investment (see OpenACC vs. OpenCL)
- Bigger codes: increase of kernel size ~ increase of break-even investment
  - Projections possible (e.g. hybrid codes)
[1] Wienke, S., an Mey, D., Müller, M.S.: Accelerators for Technical Computing: Is it Worth the Pain? TCO Spreadsheet. https://sharepoint.campus.rwth-aachen.de/units/rz/HPC/public/Shared%20Documents/WienkeEtAl_Accelerators-TCO-Perspective.xlsx, 2013
Outlook
- Hybrid code implementation (cmp. to projections)
- Model extensions
  - New programming models & architectures (OpenMP 4.0, NVIDIA Kepler)
  - Network communication (MPI)
  - Mixed job execution (heterogeneous application landscape)
  - Assessment of decrease in runtime/gaining more results
- Comprehensive TCO calculation with predictive powers
  - Performance, power consumption, manpower
- Towards exascale computing, architectures might get more complex
  - More difficult to manage & program
  - Impact of manpower effort might get stronger
Thank you for your attention!