
HPC Architectures evolution: the case of Marconi, the new CINECA flagship system. Piero Lanucara (PowerPoint presentation)



  1. HPC Architectures evolution: the case of Marconi, the new CINECA flagship system (Piero Lanucara)

  2. BG/Q (Fermi) as a Tier-0 Resource
  • Many advantages as a supercomputing resource:
    – Low energy consumption.
    – Limited floor space requirements.
    – Fast internal network.
    – Homogeneous architecture → simple usage model.
  • But:
    – Low single-core performance plus the I/O structure meant very high parallelism was necessary (at least 1024 cores).
    – For some applications, low memory per core (1 GB) and I/O performance were also a problem.
    – Limited capabilities of the O.S. on compute cores (e.g. no interactive access).
    – Cross compilation, because login nodes differ from compute nodes, can complicate some build procedures.
  (Fermi was scheduled to be decommissioned mid/end 2016.)

  3. Replacing Fermi at CINECA - considerations
  • A new procurement is a complicated process and considers many factors, which must include (together with the price):
    – Minimum peak compute power
    – Power consumption
    – Floor space required
    – Availability
    – Disk space, internal network, etc.
  • IBM no longer offers the BlueGene range for supercomputers, so it cannot be a solution.
  • Many computing centres are instead adopting a heterogeneous model for their clusters.

  4. Replacing Fermi - the Marconi solution
  [Roadmap chart, 2016-2020, showing the planned systems with peak performance and power:
    Fermi: 2 PFlops, 0.8 MW
    Galileo & PICO: 1.2 PFlops, 0.4 MW
    Marconi A1: 2.1 PFlops, 0.7 MW
    Marconi A2: 11 PFlops, 1.3 MW
    Marconi A3: 7 PFlops, 1.0 MW
    System A4 (?): 50 PFlops, 3.2 MW
  Per-year totals range between 1.2 and 3.2 MW, 50 and 150 racks, and 100 and 300 m² of floor space.]

  5. Marconi high-level system characteristics (tender proposal)
  Partition                           Installation      CPU          # nodes   # racks   Power
  A1 - Broadwell (2.1 PFlops)         April 2016        E5-2697 v4   1512      25        700 kW
  A2 - Knights Landing (11 PFlops)    September 2016    KNL          3600      50        1300 kW
  A3 - Skylake (7 PFlops expected)    June 2017         E5-2680 v5   >1500     >25       1000 kW
  Network: Intel Omni-Path

  6. [Marconi partition overview diagram]
  A1: BRD (Broadwell) 2x18 cores, 2.3 GHz; 1500 nodes, 2 PFs (1 PFs conventional)
  A2: KNL 68 cores, 1.4 GHz; 3600 nodes, 11 PFs (1 PFs non-conventional, KNL)
  A3: SKL (Skylake) 2x20 cores, 2.3 GHz; >1500 nodes, 7 PFs (5 PFs conventional)

  7. [Marconi partition overview diagram with storage]
  Storage (GSS*): 5 PB scratch area (50 GB/s); by Jan 2017: 10 PB long-term storage (40 GB/s); 20 PB tape library
  Compute (as in the previous slide):
    A1: BRD (Broadwell) 2x18 cores, 2.3 GHz; 1500 nodes, 2 PFs (1 PFs conventional)
    A2: KNL 68 cores, 1.4 GHz; 3600 nodes, 11 PFs (1 PFs non-conventional, KNL)
    A3: SKL (Skylake) 2x20 cores, 2.3 GHz; >1500 nodes, 7 PFs (5 PFs conventional)

  8. Marconi - Compute
  A1 (half reserved to EUROfusion):
    1512 Lenovo NeXtScale servers -> 2 PFlops
    Intel E5-2697 v4 (Broadwell), 18 cores @ 2.3 GHz
    dual-socket node: 36 cores and 128 GByte per node
  A2 (1 PFlops to EUROfusion):
    3600 Intel Adams Pass servers -> 11 PFlops
    Intel Xeon Phi, code name Knights Landing (KNL), 68 cores @ 1.4 GHz
    single-socket node: 96 GByte DDR4 + 16 GByte MCDRAM
  A3 (greater part reserved to EUROfusion):
    >1500 Lenovo Stark servers -> 7 PFlops
    Intel E5-2680 v5 (Skylake), 20 cores @ 2.3 GHz
    dual-socket node: 40 cores and 196 GByte per node
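
  The 2 PFlops figure for A1 can be sanity-checked from the node counts above. A minimal sketch, assuming 16 double-precision flops per cycle per Broadwell core (two AVX2 FMA units, a figure not stated in the deck):

    /* Back-of-the-envelope check of the A1 peak performance quoted above.
     * Node count, cores and clock are from the slide; the 16 DP flops/cycle
     * figure is an assumption (Broadwell: 2 AVX2 FMA units x 4 DP x 2).    */
    #include <stdio.h>

    int main(void)
    {
        const double nodes       = 1512;   /* A1 Broadwell nodes           */
        const double cores_node  = 36;     /* 2 sockets x 18 cores         */
        const double freq_hz     = 2.3e9;  /* nominal clock                */
        const double flops_cycle = 16;     /* assumed DP flops per cycle   */

        double peak = nodes * cores_node * freq_hz * flops_cycle;
        printf("A1 theoretical peak: %.2f PFlops\n", peak / 1e15);  /* ~2.0 */
        return 0;
    }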

  11. Marconi - Network
  Network type: new Intel Omni-Path (the largest Omni-Path cluster in the world)
  Network topology: fat tree with 2:1 oversubscription, tapering at the level of the core switches only
  Core switches: 5 x OPA core switch "Sawtooth Forest", 768 ports each
  Edge switches: 216 x OPA edge switch "Eldorado Forest", 48 ports each
  Maximum system configuration: 5 (OPA) x 768 (ports) x 2 (tapering) -> 7680 servers
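
  The "maximum system configuration" figure is simple port arithmetic. The sketch below only reproduces it, using the values quoted on this slide and the 32 downlinks per edge switch mentioned on the next one:

    /* Port arithmetic behind the Omni-Path numbers quoted on slides 11-12.
     * All values are copied from the deck; this is just a worked example. */
    #include <stdio.h>

    int main(void)
    {
        int core_switches = 5;    /* "Sawtooth Forest" core switches       */
        int core_ports    = 768;  /* ports per core switch                 */
        int tapering      = 2;    /* 2:1 oversubscription at core level    */
        int edge_switches = 216;  /* "Eldorado Forest" edge switches       */
        int downlinks     = 32;   /* downlinks per edge switch (slide 12)  */

        printf("Max servers: %d\n", core_switches * core_ports * tapering); /* 7680 */
        printf("Edge downlinks: %d (compute nodes: 6624)\n",
               edge_switches * downlinks);                                  /* 6912 */
        return 0;
    }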

  12. [Network topology diagram]
  5 x 768-port core switches; 216 x 48-port edge switches with 32 downlinks each; 6624 compute nodes; 32-node fully interconnected islands

  13. A1 HPL
  Full-system Linpack:
  • 1 MPI task per node
  • Performance range: 1.6-1.7 PFs
  • Max performance: 1.72389 PFs with Turbo OFF
  • Turbo ON -> throttling
  June 2016: number 46 (TOP500 list)

  14. A2 HPL
  Full-system Linpack: 3556 nodes
  • 1 MPI task per node
  • Max performance with HyperThreading OFF
  November 2016: number 12 (TOP500 list)

  ================================================================================
  T/V              N        NB     P      Q        Time          Gflops
  --------------------------------------------------------------------------------
  WR00L2L4         6287568  336    28     127      26628.96      6.22304e+06
  HPL_pdgesv() start time Fri Nov 4 23:10:08 2016
  HPL_pdgesv() end time   Sat Nov 5 06:33:57 2016
  HPL Efficiency by CPU Cycle 2505.196%
  HPL Efficiency by BUS Cycle 2439.395%
  --------------------------------------------------------------------------------
  ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0006293 ...... PASSED
  ================================================================================
  Finished 1 tests with the following results:
    1 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
  --------------------------------------------------------------------------------
  End of Tests.
  ================================================================================
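
  For reference, the quoted Linpack numbers can be turned into indicative HPL efficiencies by dividing by the theoretical peaks given elsewhere in the deck. The per-node KNL peak of ~3000 GFlops (dp) used below is the estimate from slide 17, so the resulting percentages are rough, not official figures:

    /* Indicative HPL efficiency from the figures quoted on the slides.
     * Peaks are taken from the deck (A1: ~2 PFlops; KNL: ~3000 GFlops dp
     * per node), so the percentages are estimates only.                  */
    #include <stdio.h>

    int main(void)
    {
        /* A1 (Broadwell): 1.72389 PFlops measured vs ~2 PFlops peak */
        double a1_rmax = 1.72389e15, a1_rpeak = 2.0e15;

        /* A2 (KNL): 6.22304e+06 Gflops measured on 3556 nodes */
        double a2_rmax  = 6.22304e6 * 1e9;      /* Gflops -> flops/s      */
        double a2_rpeak = 3556 * 3000.0 * 1e9;  /* ~3 TFlops dp per node  */

        printf("A1 HPL efficiency: ~%.0f%%\n", 100.0 * a1_rmax / a1_rpeak);
        printf("A2 HPL efficiency: ~%.0f%%\n", 100.0 * a2_rmax / a2_rpeak);
        return 0;
    }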

  15. Intel Xeon Phi KNC (Galileo@CINECA)
  • An accelerator (like GPUs) but more similar to a conventional multicore CPU.
  • The current version, Knights Corner (KNC), has 57-61 cores at 1.0-1.2 GHz, 8-16 GB RAM and a 512-bit vector unit.
  • Cores connected in a ring topology; MPI is possible.
  • No need to write CUDA or OpenCL, as the Intel compilers will compile Fortran or C code for the MIC.
  • 1-2 Tflops, according to the model.
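
  As an illustration of the "no CUDA/OpenCL needed" point, a loop can be sent to a KNC card with the Intel compiler's offload pragmas (LEO). This is a minimal sketch, not code from the deck; the array names, sizes and build line are assumptions, and on KNL (a standalone processor) no offload is needed at all:

    /* Sketch: offloading a loop to a KNC coprocessor with Intel's LEO
     * pragmas. Illustrative only; requires an offload-capable Intel
     * compiler and the MPSS stack. Build (roughly): icc -qopenmp knc.c  */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        /* The same C code runs on the MIC: the compiler generates both
           the host and the coprocessor versions.                        */
        #pragma offload target(mic) in(a, b) out(c)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[10] = %f\n", c[10]);
        return 0;
    }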

  16. A2: Knights Landing (KNL)
  – A big unknown, because very few people currently have access to KNL.
  – But we know the architecture of KNL and the differences and similarities with respect to KNC.
  – The main differences are:
    • KNL will be a standalone processor, not an accelerator (unlike KNC).
    • KNL has more powerful cores and a faster internal network.
    • On-package high-performance memory (16 GB MCDRAM).
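
  One practical consequence of the on-package MCDRAM: when it is exposed as a separate NUMA node (flat or hybrid mode), applications can place selected arrays in it. The sketch below uses the memkind library's hbwmalloc interface, which is one common way to do this; it is an illustration, not part of the deck, and assumes the memkind library is available (link with -lmemkind):

    /* Sketch: placing a bandwidth-critical array in KNL's MCDRAM via the
     * memkind/hbwmalloc API. Assumes flat or hybrid memory mode and the
     * memkind library; illustrative, not code from the presentation.     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>

    int main(void)
    {
        size_t n = 1 << 20;
        int in_hbw = 1;

        /* Try high-bandwidth memory first, fall back to DDR4 otherwise. */
        double *x = hbw_malloc(n * sizeof(double));
        if (x == NULL) { x = malloc(n * sizeof(double)); in_hbw = 0; }
        if (x == NULL) return 1;

        for (size_t i = 0; i < n; i++)
            x[i] = (double)i;
        printf("x[42] = %f (in MCDRAM: %d)\n", x[42], in_hbw);

        if (in_hbw) hbw_free(x); else free(x);
        return 0;
    }

  For unmodified binaries whose working set fits in the 16 GB, binding the whole process to the MCDRAM NUMA node (for instance with numactl) is a common alternative.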

  17. Intel Xeon Phi KNC-KNL comparison
                           KNC (Galileo)             KNL (Marconi)
  #cores                   61 (Pentium-based)        68 (Atom-based)
  Core frequency           1.238 GHz                 1.4 GHz
  Memory                   16 GB GDDR5               96 GB DDR4 + 16 GB MCDRAM
  Internal network         bi-directional ring       mesh
  Vectorisation            512 bit/core              2 x AVX-512/core
  Usage                    co-processor              standalone
  Performance (GFlops)     1208 (dp) / 2416 (sp)     ~3000 (dp)
  Power                    ~300 W                    ~200 W

  A KNC core can be 10x slower than a Haswell core; a KNL core is expected to be 2-3x slower. There are also big differences in memory bandwidth.

  18. Coming next: A3
  • A3: Intel Skylake processors (mid-2017)
    – Successor to Haswell, launched in 2015.
    – Expect an increase in performance and power efficiency.

  19. Coming next: A3

  20. Marconi A1 and A2 exploitation at its best

  21. Exploiting the parallel universe
  Three levels of parallelism supported by Intel hardware:
  • Thread Level Parallelism
    – Multi-thread/task performance
    – Exposed by programming models
    – Execute tens/hundreds/thousands of tasks concurrently
  • Vector Level Parallelism
    – Single-thread performance
    – Exposed by tools and programming models
    – Operate on 4/8/16 elements at a time
  • Instruction Level Parallelism
    – Single-thread performance
    – Automatically exposed by HW/tools
    – Effectively limited to a few instructions
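
  A minimal sketch (not from the deck) of how the first two levels map onto OpenMP in C: threads provide the thread-level parallelism, the simd clause exposes the vector-level parallelism, and instruction-level parallelism is left to the hardware and compiler:

    /* Sketch: thread-level + vector-level parallelism in one OpenMP loop.
     * Illustrative only. Build (e.g.): gcc -O2 -fopenmp levels.c         */
    #include <stdio.h>

    #define N 4096

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Iterations are distributed over OpenMP threads (TLP); the simd
           clause asks the compiler to vectorize each thread's chunk (VLP). */
        #pragma omp parallel for simd
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }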

  22. A1 exploitation
  • A1: Broadwell nodes
    – Similar to the Haswell cores present on Galileo.
    – Expect only a small difference in single-core performance with respect to Galileo, but a big difference compared to Fermi.
    – More cores per node (36) should mean better OpenMP performance, but MPI performance will also improve (faster network).
    – Life much easier for SPMD programming models.
    – Use SIMD vectorization.
  cores/node: 36
  Memory/node: 128 GB
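
  The point about OpenMP and MPI both benefiting suggests the usual hybrid model: one (or a few) MPI ranks per node with OpenMP threads across the 36 cores. A minimal sketch under that assumption (illustrative only, not code from the presentation):

    /* Sketch: hybrid MPI + OpenMP, e.g. one rank per node and threads
     * across the node's cores. Build (e.g.): mpicc -fopenmp hybrid.c    */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each rank spawns OpenMP threads over the cores of its node. */
        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d of %d: %d OpenMP threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }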

  23. Single Instruction Multiple Data (SIMD) vectorization
  • Technique for exploiting VLP on a single thread
  • Operate on more than one element at a time
  • Might decrease instruction counts significantly
  • Elements are stored in SIMD registers (vectors)
  • Code needs to be vectorized
  • Vectorization usually on inner loops
  • Main and remainder loops are generated

  Scalar loop:
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

  SIMD loop (4 elements at a time, operating on a[i:4], b[i:4], c[i:4]):
    for (int i = 0; i < N; i += 4)
        c[i:4] = a[i:4] + b[i:4];
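
  The a[i:4] notation above is array-section pseudo-code; in portable C one usually lets the compiler vectorize the loop and, as the slide says, a main loop plus a remainder loop results. A minimal sketch of that structure, assuming an OpenMP-4-capable compiler (built, e.g., with -fopenmp-simd); it is illustrative, not code from the deck:

    /* Sketch: vectorized main loop plus scalar remainder loop, mirroring
     * the main/remainder split described on the slide. Illustrative only. */
    #include <stdio.h>

    #define N 1003                    /* deliberately not a multiple of 4 */

    int main(void)
    {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        int main_n = N - (N % 4);

        /* Main loop: the simd directive asks the compiler to process
           several elements per iteration (4/8/16 depending on the ISA). */
        #pragma omp simd
        for (int i = 0; i < main_n; i++)
            c[i] = a[i] + b[i];

        /* Remainder loop: the last N % 4 elements, executed scalar. */
        for (int i = main_n; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }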
