1. Red Storm / Cray XT4: A Superior Architecture for Scalability
Mahesh Rajan, Doug Doerfler, Courtenay Vaughan
Sandia National Laboratories, Albuquerque, NM
Cray User Group, Atlanta, GA; May 4-9, 2009
Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

2. MOTIVATION
• Major trend in HPC system architecture: commodity multi-socket, multi-core nodes with an InfiniBand interconnect.
• Under the ASC Tri-Lab Linux Capacity Cluster (TLCC) program, DOE has purchased 21 "Scalable Units" (SUs). Each SU consists of 144 four-socket, quad-core AMD Opteron (Barcelona) nodes, using DDR InfiniBand as the high-speed interconnect.
• Red Storm/Cray XT4 at Sandia was recently upgraded: the 6,240 nodes in the 'center section' were upgraded to a similar quad-core AMD Opteron processor (Budapest).
• Comparing performance between Red Storm and TLCC reveals a wealth of information about the impact of HPC architectural balance on application scalability.
• The best TLCC performance is used for the comparisons; this often required numactl.
• The benefits of the Cray XT4's superior architectural features are analyzed through several benchmarks and applications.

3. Presentation Outline
• Overview of the current Red Storm/Cray XT4 system at Sandia
• Overview of the Tri-Lab Linux Capacity Cluster (TLCC)
• Architectural similarities and differences between the two systems
• Architectural balance ratios
• Micro-benchmarks
  • Memory latency
  • Memory bandwidth
  • MPI Ping-Pong
  • MPI Random and Bucket-Brigade
• Mini-applications
  • Mantevo HPCCG
  • Mantevo phdMesh
• SNL applications
  • CTH – shock hydrodynamics
  • SIERRA/Presto – explicit Lagrangian mechanics with contact
  • SIERRA/Fuego – implicit multi-physics Eulerian mechanics
  • LAMMPS – molecular dynamics

4. Red Storm Architecture
• 284.16 TFLOPS theoretical peak performance (a quick consistency check follows below)
• 135 compute-node cabinets, 20 service and I/O node cabinets, and 20 Red/Black switch cabinets
• 640 dual-core service and I/O nodes (320 for red, 320 for black)
• 12,960 compute nodes (dual-core and quad-core) = 38,400 compute cores
• 6,720 dual-core nodes with the AMD Opteron 280 processor:
  • 2.4 GHz
  • 4 GB of DDR-400 RAM
  • 64 KB L1 instruction and data caches on chip
  • 1 MB shared L2 (data and instruction) cache on chip
  • Integrated HyperTransport 2 interfaces
• 6,240 quad-core nodes with the AMD Opteron (Budapest) processor:
  • 2.2 GHz
  • 8 GB of DDR2-800 RAM
  • 64 KB L1 instruction and 64 KB L1 data caches per core
  • 512 KB L2 cache per core
  • 2 MB shared L3 (data and instruction) cache on chip
  • Integrated HyperTransport 3 interfaces
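As a rough sanity check on the quoted peak, 284.16 TFLOPS is consistent with assuming 2 floating-point operations per clock per core for the dual-core Opteron 280 nodes and 4 per clock per core for the quad-core Budapest nodes. The short C sketch below is our own illustration of that arithmetic, not something from the slides; the FLOP-per-clock figures are the assumed parameters.

```c
/* Sketch: reproduce Red Storm's quoted theoretical peak from the node counts above.
 * Assumption (ours, not stated on the slide): 2 FLOP/clock/core for the dual-core
 * Opteron 280 nodes and 4 FLOP/clock/core for the quad-core Budapest nodes. */
#include <stdio.h>

int main(void) {
    double dual = 6720.0 * 2 * 2.4e9 * 2;   /* nodes * cores * Hz * FLOP/clock */
    double quad = 6240.0 * 4 * 2.2e9 * 4;
    printf("dual-core section:  %.2f TFLOPS\n", dual / 1e12);          /* 64.51 */
    printf("quad-core section:  %.2f TFLOPS\n", quad / 1e12);          /* 219.65 */
    printf("total:              %.2f TFLOPS\n", (dual + quad) / 1e12); /* 284.16 */
    return 0;
}
```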

5. TLCC Overview
• SNL's TLCC: 38 TFLOPS theoretical peak performance
• 2 Scalable Units (SUs), 288 total nodes
• 272 quad-socket, quad-core compute nodes = 4,352 compute cores
• 2.2 GHz AMD Opteron (Barcelona)
• 32 GB DDR2-667 RAM per node; 9.2 TB total RAM
• 64 KB L1 instruction and 64 KB L1 data caches per core
• 512 KB L2 cache per core
• 2 MB shared L3 (data and instruction) cache on chip
• Integrated dual DDR memory controllers
• Integrated HyperTransport 3 interfaces
• Interconnect: InfiniBand with the OFED stack
• InfiniBand card: Mellanox ConnectX HCA

6. Architectural Comparison

                                Red Storm (dual)   Red Storm (quad)   TLCC
Cores/node                      2                  4                  16
Network / topology              Mesh / Z torus     Mesh / Z torus     Fat-tree
Total nodes (N)                 6,720              6,240              272
Clock (GHz)                     2.4                2.2                2.4
Memory/core & speed             2 GB; DDR-400      2 GB; DDR2-800     2 GB; DDR2-667
MPI inter-node latency (usec)   4.8                4.8                1.0
MPI inter-node BW (GB/s)        2.04               1.82               1.3
Stream BW (GB/s/node)           4.576              8.774              15.1
Memory latency (clocks)         119                90                 157

7. Node Balance Ratio Comparison

                     MAX Bytes-to-FLOPS   MAX Bytes-to-FLOPS   MIN Bytes-to-FLOPS   MIN Bytes-to-FLOPS
                     Interconnect         Memory               Interconnect         Memory
Red Storm (dual)     0.824                0.379                0.477                0.190
Red Storm (quad)     0.756                0.232                0.249                0.058
TLCC                 0.508                0.148                0.107                0.009
Ratio: Quad/TLCC     1.49                 1.57                 2.33                 6.28

Bytes-to-FLOPS Memory = Stream BW (MBytes/sec) / Peak MFLOPS
Bytes-to-FLOPS Interconnect = Ping-Pong BW (MBytes/sec) / Peak MFLOPS
MAX = using a single core on the node; MIN = using all cores on the node
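A minimal sketch of how these ratios are formed from measured bandwidths follows. The peak-FLOPS term and the example inputs are our own illustrative assumptions (the slides do not state the peak figures used), so the printed values are not meant to reproduce the table entries.

```c
/* Sketch of the balance-ratio formulas above. The peak-FLOPS assumption and the
 * example bandwidths are illustrative placeholders, not the values behind the table,
 * so the output will not match the table exactly. */
#include <stdio.h>

static double bytes_per_flop(double bw_mbytes_per_s, double peak_mflops) {
    return bw_mbytes_per_s / peak_mflops;   /* Bytes-to-FLOPS = BW (MB/s) / Peak MFLOPS */
}

int main(void) {
    double peak_mflops   = 4 * 2200.0 * 4;  /* e.g. 4 cores * 2.2 GHz * 4 FLOP/clock (assumed) */
    double stream_mbps   = 8774.0;          /* example per-node STREAM BW, MB/s */
    double pingpong_mbps = 1820.0;          /* example per-node MPI ping-pong BW, MB/s */

    printf("memory       bytes/FLOP: %.3f\n", bytes_per_flop(stream_mbps, peak_mflops));
    printf("interconnect bytes/FLOP: %.3f\n", bytes_per_flop(pingpong_mbps, peak_mflops));
    return 0;
}
```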

8. Micro-Benchmark: Memory Latency (single thread)
[Figure: memory access latency in clock cycles vs. array size in bytes (1e5 to 1e8), for Red Storm Dual, Red Storm Quad, and TLCC.]
Annotations: ~3 cycles in L1 (64 KB), ~15 cycles in L2 (512 KB), ~45 cycles in the shared L3 (2 MB), 90+ cycles in RAM.
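A latency curve of this kind typically comes from a pointer-chase: a random cyclic permutation is walked so that each load depends on the previous one and hardware prefetching cannot hide the latency. The sketch below is our own illustration, not the benchmark actually used on the slide.

```c
/* Minimal pointer-chase latency sketch (illustrative; not the benchmark used on the slide).
 * A random single-cycle permutation defeats prefetching, so each load's latency is exposed. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 1u << 24;                       /* 128 MB of 8-byte indices; sweep n to cover L1..RAM */
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {       /* Sattolo's shuffle: one big cycle, no short loops */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    size_t idx = 0, iters = 1u << 26;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) idx = next[idx];   /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns (idx=%zu)\n", ns / iters, idx);  /* print idx so the loop isn't elided */
    free(next);
    return 0;
}
```

Multiplying the nanoseconds per load by the core clock (e.g. 2.2 GHz) converts the result to clock cycles, the unit used on the slide.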

9. Micro-Benchmark: STREAMS
[Figure: STREAM memory bandwidth in MBytes/sec (0 to 16,000) for Red Storm Dual, Red Storm Quad, and TLCC, for: one MPI task; two MPI tasks; four MPI tasks; two MPI tasks on two sockets of a node; four MPI tasks on two sockets of a node; eight MPI tasks on two sockets of a node.]
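For reference, the core of a STREAM-style measurement is a timed, bandwidth-bound vector kernel over arrays much larger than cache. The sketch below shows only the triad kernel and a single pass; the real STREAM benchmark runs four kernels over many repetitions and reports the best time.

```c
/* Sketch of a STREAM-style triad measurement (illustrative, single pass only;
 * the real benchmark runs scale/copy/add/triad repeatedly and takes the best time). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000UL   /* 160 MB per array: far larger than a 2 MB L3 */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
    const double scalar = 3.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) a[i] = b[i] + scalar * c[i];   /* triad: 2 loads + 1 store */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("triad bandwidth: %.0f MBytes/sec\n", 3.0 * N * sizeof(double) / sec / 1e6);
    printf("a[0] = %f\n", a[0]);   /* touch the result so the kernel is not optimized away */
    free(a); free(b); free(c);
    return 0;
}
```

The per-node bars in the figure come from running one such task per core (or spread across sockets), so the aggregate exposes how the cores share the memory controllers.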

10. Micro-Benchmark: MPI Ping-Pong
[Figure: ping-pong bandwidth in MBytes/sec (0 to 2,500) vs. message size in bytes (1 to 1e8), for Red Storm Dual, Red Storm Quad, and TLCC.]
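A curve like this is produced by a two-rank ping-pong: rank 0 sends a message, waits for the echo, and the one-way time and bandwidth are reported per message size. The sketch below is our own illustration, not the benchmark code used at Sandia; run it with exactly two ranks.

```c
/* Two-rank MPI ping-pong sketch (illustrative). Run with exactly 2 ranks, e.g. mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int reps = 1000;

    for (int size = 1; size <= (1 << 22); size *= 2) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
        if (rank == 0)
            printf("%8d bytes  %8.2f MBytes/sec  %8.2f us\n", size, size / t / 1e6, t * 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```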

11. Micro-Benchmark: MPI Allreduce
[Figure: time in milliseconds (0.001 to 1, log scale) for an 8-byte MPI_Allreduce vs. number of MPI tasks (1 to 10,000), for Red Storm Dual, Red Storm Quad, and TLCC (MVAPICH).]
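A minimal sketch of how such an 8-byte Allreduce timing is typically taken (illustrative; the exact benchmark used is not specified on the slide): average many repetitions and report the slowest rank's average, since the collective completes only when every rank does.

```c
/* 8-byte MPI_Allreduce timing sketch (illustrative). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double in = (double)rank, out = 0.0;
    const int reps = 10000;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);   /* warm-up */
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - t0) / reps, slowest;
    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks: 8-byte MPI_Allreduce %.4f ms\n", nranks, slowest * 1e3);
    MPI_Finalize();
    return 0;
}
```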

12. MPI Random and Bucket-Brigade Benchmark
Bandwidths in MBytes/sec. Random Message (RM) sizes = 100 to 1K bytes; Bucket-Brigade Small (BBS) size = 8 bytes; Bucket-Brigade Large (BBL) size = 1 MB. Column suffixes (1024, 256, 64) give the number of MPI tasks.

                  RM-1024   RM-256   RM-64   BBS-1024   BBS-256   BBS-64   BBL-1024   BBL-256   BBL-64
Red Storm Dual    67.7      71.9     75.3    1.19       1.19      1.20     1100.2     1116.1    1132.4
Red Storm Quad    41.6      45.0     46.9    0.86       0.86      0.86     654.7      647.6     632.2
TLCC              0.43      1.59     3.64    1.77       3.37      3.42     275.47     314.3     344.1

Note the big difference between Red Storm and TLCC for the random-messaging benchmark:
random BW ratio Quad/TLCC @ 1024 = 97; random BW ratio Quad/TLCC @ 64 = 13.
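One plausible way to measure random-messaging bandwidth of this kind is sketched below; it is our own interpretation, not Sandia's actual benchmark code. In each iteration every rank sends a small message to a partner given by a shared random permutation, so the traffic pattern has no locality in the network, which is exactly what a torus with high per-link bandwidth tolerates better than an oversubscribed fat-tree.

```c
/* Random-messaging bandwidth sketch (illustrative; not the actual Sandia benchmark).
 * Each iteration, rank 0 draws a random permutation p and broadcasts it; rank r then
 * sends one message to p[r] and receives one from the rank that maps onto r. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int msg_bytes = 1024, iters = 1000;      /* message size and repetitions (assumed) */
    char *sendbuf = malloc(msg_bytes), *recvbuf = malloc(msg_bytes);
    int *perm = malloc(nranks * sizeof(int));
    int *inv  = malloc(nranks * sizeof(int));

    double t0 = MPI_Wtime();
    for (int it = 0; it < iters; it++) {
        if (rank == 0) {                           /* Fisher-Yates shuffle on rank 0 */
            for (int i = 0; i < nranks; i++) perm[i] = i;
            for (int i = nranks - 1; i > 0; i--) {
                int j = rand() % (i + 1);
                int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
            }
        }
        /* the broadcast is counted in the timing here; a real benchmark would precompute schedules */
        MPI_Bcast(perm, nranks, MPI_INT, 0, MPI_COMM_WORLD);
        for (int i = 0; i < nranks; i++) inv[perm[i]] = i;   /* which rank sends to me */

        MPI_Sendrecv(sendbuf, msg_bytes, MPI_BYTE, perm[rank], 0,
                     recvbuf, msg_bytes, MPI_BYTE, inv[rank], 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double sec = MPI_Wtime() - t0;
    if (rank == 0)
        printf("per-rank random-message send bandwidth: %.2f MBytes/sec\n",
               (double)msg_bytes * iters / sec / 1e6);

    free(sendbuf); free(recvbuf); free(perm); free(inv);
    MPI_Finalize();
    return 0;
}
```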

13. Mini-Application: Mantevo HPCCG (Illustrates Node Memory Architectural Impact)
• Mike Heroux's conjugate-gradient mini-application
• Coefficient matrix stored in sparse matrix format
• Most of the time is dominated by sparse matrix-vector multiplication (a sketch of this kernel follows below)
• Parallel overhead is a small fraction
• TLCC 16-tasks-per-node runs show a strong benefit from using numactl to set process and memory affinity bindings
• Once the best performance within a node is achieved, the weak-scaling curve is near perfect
• On TLCC, going from 2 to 4 MPI tasks costs 37%; going from 8 to 16 MPI tasks costs another 44%
• TLCC performance is 1.7X slower; this Quad/TLCC ratio approaches the worst bytes-to-FLOPS ratio of 2.3 discussed earlier
[Figure: HPCCG wall time in seconds (0 to 70) vs. number of MPI tasks (1 to 10,000), weak scaling, for Red Storm Dual, Red Storm Quad, and TLCC.]
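For concreteness, the kernel HPCCG spends most of its time in looks roughly like the compressed-row-storage (CRS) matrix-vector product below. This is a generic sketch, not HPCCG's actual source: every nonzero is streamed from memory exactly once and the indirect access to x gets little cache reuse, which is why per-core memory bandwidth, not peak FLOPS, sets the pace.

```c
/* Generic CRS sparse matrix-vector product, y = A*x (sketch; not HPCCG's actual code).
 * Each nonzero value and column index is read from memory exactly once, so the loop is
 * limited by memory bandwidth rather than by floating-point throughput. */
void spmv_crs(int nrows, const int *row_ptr, const int *col_idx,
              const double *vals, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * x[col_idx[k]];   /* indirect, mostly cache-missing access to x */
        y[i] = sum;
    }
}
```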

14. Mini-Application: Mantevo phdMesh
• Benchmark used for research in contact-detection algorithms
• The figure shows a weak-scaling analysis using a grid of counter-rotating gears: 4x3x1 gears on 2 PEs, 4x3x2 on 4 PEs, etc.
• The search time per step of an oct-tree geometric proximity-search detection algorithm is shown (a sketch of the underlying proximity test follows below)
• TLCC shows quite good performance except at scale: at 512 cores it is about 1.4X slower than Red Storm Quad
[Figure: phdMesh oct-tree geometric search time per step in seconds (0 to 0.12) vs. number of MPI tasks (1 to 1,000), for Red Storm Dual, Red Storm Quad, and TLCC.]
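The geometric proximity search at the heart of the benchmark boils down to finding pairs of objects whose bounding boxes overlap; the oct-tree prunes most of the pairwise tests by only descending into cells whose own boxes overlap. The sketch below shows just the elementary axis-aligned bounding-box test performed at each candidate pair (our own illustration, not phdMesh source).

```c
/* Axis-aligned bounding-box (AABB) overlap test: the elementary check an oct-tree
 * proximity search performs on each candidate pair (sketch; not phdMesh source). */
#include <stdbool.h>

typedef struct { double lo[3], hi[3]; } aabb_t;

static bool aabb_overlap(const aabb_t *a, const aabb_t *b) {
    for (int d = 0; d < 3; d++)
        if (a->hi[d] < b->lo[d] || b->hi[d] < a->lo[d])
            return false;       /* separated along this axis: no contact possible */
    return true;                /* the boxes overlap on all three axes */
}
```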

15. Red Storm, TLCC Application Performance Comparison
[Figure: TLCC/Red Storm wall-time ratio (0 to 9) for a set of applications, at 64, 256, and 1024 MPI tasks. A ratio of 1 means the runs take the same time; a ratio of 2 means TLCC takes twice as long.]

16. CTH – Weak Scaling
• CTH is used for two- and three-dimensional problems involving high-speed hydrodynamic flow and the dynamic deformation of solid materials
• Model: a shaped charge; a cylindrical container filled with high explosive and capped with a copper liner
• Weak-scaling analysis with 80x192x80 computational cells per processor
• Each processor exchanges information with up to six other processors in the domain (see the halo-exchange sketch below). These messages occur several times per time step and are fairly large, since a face can consist of several thousand cells
• Modest communication overhead with nearest-neighbor exchanges
• At 16 cores Red Storm Quad is 1.23X faster than TLCC; at 512 cores, 1.32X faster. This is close to the memory-speed ratio of 800/667 = 1.2
• CTH does not greatly stress the interconnect
[Figure: CTH shaped-charge wall time for 100 time steps in seconds (0 to 2,000) vs. number of MPI tasks (1 to 10,000), weak scaling with 80x192x80 cells per core, for Red Storm Dual, Red Storm Quad, and TLCC.]
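A minimal sketch of the six-neighbor, face-sized exchange pattern described above follows: a generic 3-D halo exchange on an MPI Cartesian communicator, not CTH's actual communication code, and the face size used is an illustrative placeholder.

```c
/* Generic 3-D nearest-neighbor halo exchange sketch (illustrative; not CTH code).
 * Each rank exchanges one face-sized message with up to six neighbors per pass. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int nranks, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
    MPI_Dims_create(nranks, 3, dims);                 /* factor the ranks into a 3-D grid */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    const int face = 80 * 192;                        /* illustrative face size in cells (assumed) */
    double *sendbuf = calloc(face, sizeof(double));
    double *recvbuf = calloc(face, sizeof(double));

    for (int dim = 0; dim < 3; dim++) {
        for (int dir = -1; dir <= 1; dir += 2) {
            int src, dst;
            MPI_Cart_shift(cart, dim, dir, &src, &dst);   /* MPI_PROC_NULL at domain edges */
            MPI_Sendrecv(sendbuf, face, MPI_DOUBLE, dst, 0,
                         recvbuf, face, MPI_DOUBLE, src, 0,
                         cart, MPI_STATUS_IGNORE);
        }
    }
    if (rank == 0)
        printf("halo exchange done on a %d x %d x %d process grid\n", dims[0], dims[1], dims[2]);
    free(sendbuf); free(recvbuf);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```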
