
Future Computing Platforms for Science in a Power Constrained Era



  1. Future Computing Platforms for Science in a Power Constrained Era. David Abdurachmanov (FNAL), Peter Elmer (Princeton), Giulio Eulisse (FNAL), Robert Knight (Princeton)

  2. Power in Data Centers: An Inconvenient Truth. Energy-related costs account for approximately 12% of overall expenditure and are the fastest-rising cost, according to Gartner, Inc. (29/9/2010). For its 2012 data, CMS used ~100K x86_64 cores out of the 350K cores at WLCG. Scaling up from the mix of machines at FNAL, we estimate the aggregate WLCG power consumption at 10 MW. CMS expects the data it produces to increase by 2 - 3 orders of magnitude in 15 years.
     Think green. Local green and/or cheaper power sources: for example, the Princeton energy plant (15 MW) combines electricity, heat and cooling. When the cost of electricity increases, gas, diesel and/or bio-diesel fuel is used to power local generators, and hot water and steam are provided from waste energy. Low-power and/or highly efficient hardware: e.g. Intel Atom, X-Gene (ARMv8 64-bit), GPUs, Xeon Phi.
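As a rough illustration of what "2 - 3 orders of magnitude in 15 years" implies, the sketch below converts that range into an equivalent constant yearly growth factor. The assumption of steady compound growth is ours, not the slide's, and the function name is hypothetical.

```python
def annual_growth_factor(orders_of_magnitude, years):
    """Constant yearly multiplier that gives 10**orders_of_magnitude after `years`.

    Assumes steady compound growth, which is a simplification.
    """
    if years <= 0:
        raise ValueError("years must be positive")
    return 10.0 ** (orders_of_magnitude / years)


if __name__ == "__main__":
    for orders in (2, 3):
        factor = annual_growth_factor(orders, 15)
        print(f"{orders} orders of magnitude over 15 years: x{factor:.2f} per year")
```

Under this assumption the data volume would grow by roughly 36% to 58% per year.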

  3. Intel Xeon: Status quo. The obvious market leader, currently targeted at applications that are not power constrained. De facto standard in HEP. It needs to be a reference point even when power efficiency is the goal, because the only way to win the game is to be both power efficient and performant. Very diverse offering to match different performance needs; Intel recently introduced "custom" silicon for big players.
     Advantages. Intel's main advantage comes from being one generation ahead in manufacturing process and architectural sophistication (e.g. large vector units), not to mention the maturity of its development toolset. Many features have been introduced over the years to monitor power consumption (e.g. RAPL) and to improve power efficiency (e.g. TurboBoost, SpeedStep).
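To make the RAPL mention concrete, here is a minimal sketch of reading the package energy counter through the Linux powercap sysfs interface and turning two readings into an average power. It is not necessarily how the measurements in this talk were taken. The domain path `intel-rapl:0` is an assumption about the machine (it may be absent, or readable only by root), and the helper names are hypothetical.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL domain; varies by machine and kernel.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def energy_delta_uj(start_uj, end_uj, max_range_uj):
    """Microjoules consumed between two readings, allowing for one counter wrap."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    return end_uj + max_range_uj - start_uj


def average_power_watts(start_uj, end_uj, max_range_uj, seconds):
    """Average power over the interval: joules divided by seconds."""
    return energy_delta_uj(start_uj, end_uj, max_range_uj) / 1e6 / seconds


def read_counter(path):
    """Read one integer value from a sysfs file."""
    return int(path.read_text().strip())


def measure_average_power(interval=1.0, domain=RAPL_DOMAIN):
    """Sample the energy counter twice, `interval` seconds apart."""
    max_range = read_counter(domain / "max_energy_range_uj")
    start = read_counter(domain / "energy_uj")
    t0 = time.monotonic()
    time.sleep(interval)
    end = read_counter(domain / "energy_uj")
    return average_power_watts(start, end, max_range, time.monotonic() - t0)


if __name__ == "__main__":
    try:
        print(f"Average package power: {measure_average_power():.1f} W")
    except OSError as exc:  # no RAPL domain, or no permission to read it
        print(f"RAPL counters not readable: {exc}")
```

The wrap-around handling matters for long intervals, because the counter resets once it reaches `max_energy_range_uj`.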

  4. APM X-Gene1 64-bit ARM: Old kid on the block. ARM32 is the obvious volume leader in the low-power, embedded world. It is interesting not only because it is power efficient, but also for its business model, where "custom" designs are the norm, and because it has the economy of scale of cell phones behind it. Since the last CHEP, a few 64-bit chips have started to appear, first in the embedded world (iPhone!) and now, thanks to Applied Micro X-Gene 1, in the server world.
     Porting effort. CMS Offline SW (CMSSW) has been ported to the APM X-Gene1. Most of the work went into getting the core of the Linux distribution to work. Porting CMSSW to ARM64 was less of an issue, because the compatibility issues had either already been solved for ARM32 or were really 32-bit vs 64-bit problems in our code base. Choosing open-source software is key to being ready for new platforms.

  5. Intel Atom: The Empire strikes back. Obviously Intel is not sitting idle in such a strategic and lucrative market. While Atom has so far been unable to touch ARM's dominance in the embedded world, it is becoming an attractive player in the server market.
     Advantages. It is a standard x86_64 core, where trade-offs (number of cores, cache size / hierarchy, no hyperthreading) have been made to focus on low power consumption. The production process edge is a clear strategic advantage.

  6. Outsiders.
     IBM POWER8. Evolution of the old POWER / PowerPC architecture. Once an IBM / Motorola / Apple partnership, guidance now comes from the OpenPOWER consortium, similar to what happens with ARM. Pitched at very highly parallel workloads in the high-end server market. There is an effort to make porting to it much easier (e.g. it now has little-endian support).
     Intel Xeon Phi. Intel's answer to GPGPU; it is becoming a somewhat popular platform, although it is not widely used in HEP. It provides a high number of small cores (61, with 4 threads each) with large vector units. Its main advantage is that these cores are normal x86_64.

  7. Software setup.
     ParFullCMS. Standalone CMS simulation using Geant4 (v10.1) with representative geometry (but simplified physics). Compiled with GCC 4.9.x (apart from Xeon Phi, which uses ICC), with static binaries and multithreading support.
     CMS Software (CMSSW). Latest development version as of 1st of April 2015. Compiled using GCC 4.9.x. Reconstruction from a local file, conditions from Frontier.
     HEPSPEC06. Standard benchmark for HEP software, used in CMS as a metric for computing pledges. Compiled using GCC 4.9.x.

  8. CPU Specs

     Name         Vendor  Model     Year     Fab    Process  Frequency (GHz)  Cores  Threads per core
     SandyBridge  Intel   E5-2650   Q1/12    Intel  32nm     2.0 (2.8)        8      2
     Haswell      Intel   E5-2699   Q3/14    Intel  22nm     2.3 (3.6)        18     2
     Atom         Intel   C2750     Q3/13    Intel  22nm     2.4              8      1
     X-Gene1      APM     883408    Q3/13    TSMC   40nm     2.4              8      1
     POWER8       IBM     8247-22L  Late 13  IBM    22nm     3.4              10     8
     Xeon Phi     Intel   KNC7100   Q2/14    Intel  22nm     1.23             61     4

  9. Raw performance [figure: panels "Single Core" and "Single Socket", each showing ParFullCMS, CMSSW and HEPSPEC06]. All numbers normalised to SandyBridge 1 core performance. Systems shown: Xeon SandyBridge (8 cores, 2 - 2.8 GHz), Xeon Haswell (18 cores, 2.3 - 3.6 GHz), X-Gene1 (8 cores, 2.4 GHz), Atom (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Xeon Phi (61 cores, 1.3 GHz).
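The normalisation used on this and the following slides can be sketched as below: every measured throughput is divided by the throughput of the reference system, here a single SandyBridge core. The function name is hypothetical and the numbers in the usage example are placeholders, not measurements from the talk.

```python
def normalise(throughput, reference):
    """Divide every throughput by that of the reference system.

    `throughput` maps a system name to events per second;
    `reference` is the key of the system that should come out as 1.0.
    """
    ref_value = throughput[reference]
    if ref_value <= 0:
        raise ValueError("reference throughput must be positive")
    return {name: value / ref_value for name, value in throughput.items()}


if __name__ == "__main__":
    # Placeholder values for illustration only.
    measured = {"SandyBridge 1 core": 0.8, "system A": 1.2, "system B": 0.4}
    for name, value in normalise(measured, "SandyBridge 1 core").items():
        print(f"{name}: {value:.2f}")
```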

  10. Scalability, ParFullCMS [figure]. x-axis: threads per core (0 - 8); y-axis: evts / s, normalised to one SandyBridge core (0 - 25). Annotation: "Hyperthreading regime". Systems shown: Atom (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Xeon Haswell (18 cores, 2.3 - 3.6 GHz), Xeon SandyBridge (8 cores, 2 - 2.8 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon Phi (61 cores, 1.3 GHz).

  11. Scalability, ParFullCMS [figure]. x-axis: threads per core (0 - 8); y-axis: (evts / s) / thread, normalised to one SandyBridge core (0.0 - 2.0). Annotation: "Turbo Boost". Systems shown: Atom (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Xeon Haswell (18 cores, 2.3 - 3.6 GHz), Xeon SandyBridge (8 cores, 2 - 2.8 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon Phi (61 cores, 1.3 GHz).
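The per-thread quantity on the y-axis can be sketched as total throughput divided by the number of running threads. This is our reading of the axis label, not a formula stated in the slides; the function name and the example values are hypothetical.

```python
def per_thread_throughput(events_per_second, cores, threads_per_core):
    """Throughput contributed by each thread, on average.

    The thread count is cores * threads_per_core, as on the x-axis of the plot.
    """
    threads = cores * threads_per_core
    if threads <= 0:
        raise ValueError("at least one thread is required")
    return events_per_second / threads


if __name__ == "__main__":
    # Illustrative values: an 8-core machine running 2 threads per core.
    print(per_thread_throughput(events_per_second=16.0, cores=8, threads_per_core=2))
```

A flat curve means that adding threads scales throughput linearly, while a falling curve means that each extra thread contributes less.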

  12. Power Efficiency (single CPU), ParFullCMS: Performance vs. Power Consumption [figure]. x-axis: power consumption (W), 0 - 200; y-axis: performance (Evt / s), normalised to one SandyBridge core (0 - 25). Note: the Atom number is for a card, not for a CPU. Systems shown: Atom (8 cores, 2.4 GHz), Xeon Haswell (18 cores, 2.3 - 3.6 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon SandyBridge (8 cores, 2 - 2.8 GHz), Xeon Phi (61 cores, 1.3 GHz).

  13. Power Efficiency (single CPU), ParFullCMS [figure]. x-axis: threads per core (0.0 - 2.0); y-axis: efficiency (Evt / J), normalised to a single SandyBridge core (0 - 5). Systems shown: Atom (8 cores, 2.4 GHz), Xeon Haswell (18 cores, 2.3 - 3.6 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon SandyBridge (8 cores, 2 - 2.8 GHz), Xeon Phi (61 cores, 1.3 GHz).
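The efficiency unit Evt / J follows from dividing throughput by power, since one watt is one joule per second. A minimal sketch, with a hypothetical function name and placeholder numbers rather than values from the plots:

```python
def events_per_joule(events_per_second, watts):
    """Energy efficiency: (events / s) / (J / s) = events / J."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return events_per_second / watts


if __name__ == "__main__":
    # Placeholder numbers: 10 events per second at 100 W.
    efficiency = events_per_joule(10.0, 100.0)
    reference = events_per_joule(1.0, 20.0)  # hypothetical reference system
    print(f"{efficiency:.3f} Evt/J, {efficiency / reference:.1f}x the reference")
```

Dividing by a reference system's efficiency gives a normalised, dimensionless figure of the kind plotted on the slide.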

  14. 4.3U system that supports up to 45 cartridges, with either Atom or X-Gene 1. Box-to-box comparison.

  15. Power Efficiency (box), ParFullCMS [figure]. x-axis: active cartridges (0 - 40); y-axis: efficiency (Evt / J), 0.00 - 0.10. Series: X-Gene1 (8 cores, 2.4 GHz) in HP Moonshot m400 card; Atom (8 cores, 2.4 GHz) in HP Moonshot m300 card. Reference markers: projected fully populated m300 system; projected fully populated m400 system; 2 sockets Haswell 3U.

  16. Outlook.
      Current market. The race is heating up, and Intel is not sitting idle. Fabrication process advantage is king. ARM64 is not the easy answer it was previously thought to be. Intel Atom and Intel Xeon are currently unmatched in terms of both performance and power efficiency. It will be interesting to see the next X-Gene iterations, but Intel has a roadmap as well.
      More thoughts. Haswell vs Atom clearly shows that we also need to take volume and infrastructure limits into account. As we know very well by now, exploiting parallelism and multithreading is not an easy task. POWER8 and Xeon Phi really require effort to even remotely scale as advertised (just like GPUs).

  17. Thanks! For providing hardware and/or helping to set it up, we would like to thank: Applied Micro, Intel, CERN TechLab.

  18. Backup slides

  19. Performance, ParFullCMS (multithreading, normalised to one SandyBridge core) [figure]. x-axis: # cores (log scale, 1 - 500); y-axis: evts / s, SandyBridge normalised (0 - 25). Systems shown: Atom (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Xeon Haswell (18 cores, 2.3 - 3.6 GHz), Xeon SandyBridge (8 cores, 2 - 2.8 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon Phi (61 cores, 1.3 GHz).

  20. Performance (multi-core), CMSSW (multi-job, SandyBridge normalised) [figure]. x-axis: # cores (log scale, 1 - 100); y-axis: evts / s, SandyBridge normalised (0 - 14). Systems shown: X-Gene1 (8 cores, 2.4 GHz), Atom (8 cores, 2.4 GHz), Xeon Haswell (18 cores, 2.3 - 3.6 GHz), Xeon SandyBridge (8 cores, 2 - 2.8 GHz).

  21. Performance (multi-core), HEPSPEC06 [figure]. x-axis: # cores (log scale, 1 - 500); y-axis: performance index (0 - 25). Systems shown: Xeon SandyBridge (8 cores, 2 - 2.8 GHz), Xeon Haswell (18 cores, 2.3 - 3.6 GHz), X-Gene1 (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Atom (8 cores, 2.4 GHz).
