
Massively Parallel Architectures: MPP Specifics - Cluster Computing (PowerPoint PPT Presentation)



  1. Cluster Computing: Massively Parallel Architectures

  2. MPP Specifics
     • No shared memory
     • Scales to hundreds or thousands of processors
     • Homogeneous sub-components
     • Advanced custom interconnects

  3. MPP Architectures
     • There are numerous approaches to interconnecting CPUs in MPP architectures:
       – Rings
       – Grids
       – Full Interconnect
       – Trees
       – Dancehalls
       – Hypercubes

  4. Rings
     • Worst-case distance: n-1 (one ring), n/2 (bi-directional ring)
     • Cost: n
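As a quick illustrative sketch (not part of the original deck), the worst-case hop counts for a ring can be checked numerically; the node count n = 16 below is arbitrary:

```python
# Illustrative sketch (not from the original slides): hop counts in an n-node ring.

def ring_distance(a, b, n, bidirectional=True):
    """Hops from node a to node b in an n-node ring."""
    forward = (b - a) % n
    if not bidirectional:
        return forward                # one-way ring: only forward hops
    return min(forward, n - forward)  # bi-directional ring: take the shorter direction

n = 16
uni = max(ring_distance(0, b, n, bidirectional=False) for b in range(n))
bi  = max(ring_distance(0, b, n, bidirectional=True)  for b in range(n))
print(uni, bi)  # 15 (= n-1) and 8 (= n/2), matching the worst-case distances above
```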

  5. Chordal Ring 3

  6. Chordal Ring 4

  7. Barrel Shifter
     • Worst-case distance: n/2

  8. Grid / Torus / Illiac Torus
     • Worst-case distance (for a √n x √n layout): 2(√n - 1) for the grid, about √n for the torus, √n - 1 for the Illiac torus
     • Cost: n

  9. Fully Interconnected
     • Worst-case distance: 1
     • Cost: n(n-1)/2

  10. Trees
     • Worst-case distance: 2 log n
     • Cost: n log n

  11. Fat Trees
     • Worst-case distance: 2 log n
     • Cost: n

  12. Dancehalls / Butterflies
     • Worst-case distance: log n
     • Cost: n

  13. Hypercubes
     • Worst-case distance: d (= log2 n)
     • Cost: d
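As an illustrative sketch (not from the original slides), the hypercube's worst-case distance of d hops follows from the fact that the hop count between two nodes equals the Hamming distance of their binary addresses:

```python
# Illustrative sketch (not from the original slides): routing distance in a d-dimensional hypercube.

def hypercube_distance(a, b):
    """Hops between nodes a and b: the Hamming distance of their binary addresses."""
    return bin(a ^ b).count("1")

d = 4                # dimension
n = 2 ** d           # number of nodes
worst = max(hypercube_distance(0, b) for b in range(n))
print(worst)         # d = 4: the worst case is the node whose address differs in every bit
```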

  14. Intel Paragon
     • Intel i860 based machine
     • “Dual CPU”
       – 50 MHz CPUs
       – Share a 400 MB/sec cache-coherent bus
     • Grid architecture
     • Mother of ASCI Red

  15. Intel Paragon

  16. SP2
     • Based on RS/6000 nodes
       – POWER2 processors
     • Special NIC: MSMU on the Micro Channel bus
     • Standard Ethernet on the Micro Channel bus
     • MSMUs interconnected via an HPS backplane

  17. SP2 MSMU

  18. SP2 HPS
     • Links are 8-bit parallel
     • Contention-free latency is 5 ns per stage
       – 875 ns latency for 512 nodes

  19. SP2 HPS

  20. ASCI Red
     • Built by Intel for the Department of “Energy”
     • Consists of almost 5000 dual-PPro boards with a special adaptation for user-level message passing
     • Special support for internal ‘firewalls’

  21. ASCI Red Node

  22. ASCI Red MRC

  23. ASCI Red Grid

  24. Scali
     • Based on Intel- or SPARC-based nodes
     • Nodes are connected by a Dolphin SCI interface, using a grid of rings
     • Very high performance MPI and support for commodity operating systems

  25. Performance???

  26. Earth Simulator

  27. ES

  28. ES

  29. ES

  30. ES

  31. BlueGene/L
     • October 2003: BG/L half-rack prototype
     • 500 MHz, 512 nodes / 1024 processors
     • 2 TFlop/s peak, 1.4 TFlop/s sustained
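A back-of-the-envelope check of the 2 TFlop/s figure, assuming each processor's double FPU retires 4 floating-point operations per cycle (two fused multiply-adds); that flops-per-cycle value is an assumption, not something stated on the slide:

```python
# Back-of-the-envelope check of the prototype's peak rate (assumes 4 flops/cycle per
# processor from the double FPU, i.e. two fused multiply-adds; not stated on the slide).
clock_hz        = 500e6
processors      = 1024          # 512 nodes x 2 processors
flops_per_cycle = 4
peak = clock_hz * processors * flops_per_cycle
print(peak / 1e12)              # ~2.05 TFlop/s, matching the quoted 2 TFlop/s peak
```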

  32. BlueGene/L ASIC Node
     (block diagram: PowerPC 440, double 64-bit FPU, 2 KB L2, L3 cache directory, SRAM, EDRAM L3 cache, DDR, JTAG, Gigabit Ethernet adapter)

  33. BlueGene/L Interconnection Networks
     • 3-Dimensional Torus
       – Interconnects all compute nodes (65,536)
       – Virtual cut-through hardware routing
       – 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
       – Communications backbone for computations
       – 350/700 GB/s bisection bandwidth
     • Global Tree
       – One-to-all broadcast functionality
       – Reduction operations functionality
       – 2.8 Gb/s of bandwidth per link
       – Latency of tree traversal on the order of 5 µs
       – Interconnects all compute and I/O nodes (1024)
     • Ethernet
       – Incorporated into every node ASIC
       – Active in the I/O nodes (1:64)
       – All external comm. (file I/O, control, user interaction, etc.)
     • Low-latency global barrier and interrupt control network
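As an illustrative cross-check (not from the deck), the 350/700 GB/s bisection figure is consistent with the 1.4 Gb/s links if the full machine is laid out as a 64 x 32 x 32 torus; that layout is assumed here purely for the calculation:

```python
# Illustrative check of the bisection bandwidth (assumes a 64 x 32 x 32 torus layout
# for the 65,536 compute nodes; the layout is not stated on the slide).
link_gbps = 1.4                      # per-link bandwidth, one direction
x, y, z = 64, 32, 32                 # assumed torus dimensions
cut_links = 2 * y * z                # bisecting the longest dimension cuts two planes of links (torus wrap-around)
one_way = cut_links * link_gbps / 8  # GB/s crossing the bisection in one direction
print(one_way, 2 * one_way)          # ~358 GB/s and ~717 GB/s, close to the quoted 350/700 GB/s
```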

  34. BG/L – Familiar Software Environment
     • Fortran, C, C++ with MPI
       – Full language support
       – Automatic SIMD FPU exploitation
     • Linux development environment
       – Cross-compilers and other cross-tools execute on Linux front-end nodes
       – Users interact with the system from front-end nodes
     • Tools – support for debuggers, hardware performance monitors, trace-based visualization
     • POSIX system calls – compute processes “feel like” they are executing in a Linux environment (with restrictions)

  35. Measured MPI Send Bandwidth
     • Latency @ 500 MHz = 5.9 + 0.13 * “Manhattan distance” µs
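A minimal sketch (not from the deck) of how such a latency model is evaluated: the Manhattan distance is the sum of per-dimension hop counts on the wrap-around torus, and the fitted coefficients give microseconds. The 8 x 8 x 8 torus shape below is an assumed layout for the 512-node prototype, used only for illustration:

```python
# Illustrative sketch (not from the original slides): evaluating the fitted latency model
# latency_us = 5.9 + 0.13 * manhattan_distance on a 3D torus.

def torus_hops(a, b, size):
    """Shortest hop count between coordinates a and b along one wrap-around dimension."""
    d = abs(a - b)
    return min(d, size - d)

def latency_us(src, dst, dims):
    manhattan = sum(torus_hops(a, b, s) for a, b, s in zip(src, dst, dims))
    return 5.9 + 0.13 * manhattan

dims = (8, 8, 8)                                 # assumed torus shape for the 512-node prototype
print(latency_us((0, 0, 0), (4, 4, 4), dims))    # 12 hops -> about 7.46 µs
```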

  36. NAS Parallel Benchmarks
     • All NAS Parallel Benchmarks run successfully on 256 nodes (and many other configurations)
       – No tuning / code changes
     • Compared 500 MHz BG/L and 450 MHz Cray T3E
     • All BG/L benchmarks were compiled with GNU and XL compilers
       – Best result reported (GNU for IS)
     • BG/L is a factor of two to three faster on five benchmarks (BT, FT, LU, MG, and SP), and a bit slower on one (EP)
