Cluster Computing: Massively Parallel Architectures
MPP Specifics
• No shared memory
• Scales to hundreds or thousands of processors
• Homogeneous sub-components
• Advanced Custom Interconnects
MPP Architectures
• There are numerous approaches to interconnecting CPUs in MPP architectures:
  – Rings
  – Grids
  – Full Interconnect
  – Trees
  – Dancehalls
  – Hypercubes
Rings
Worst-case distance: n-1 (one ring), n/2 (bi-directional ring)
Cost: n
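A minimal sketch of where the two worst-case figures come from, assuming nodes are numbered 0..n-1 around the ring (illustrative code, not from the slides):

    /* Hop count between nodes i and j on an n-node ring. */
    #include <stdio.h>

    /* Unidirectional ring: packets travel one way only. */
    static int uni_ring_hops(int i, int j, int n) {
        return (j - i + n) % n;            /* worst case: n - 1 */
    }

    /* Bidirectional ring: take the shorter direction. */
    static int bi_ring_hops(int i, int j, int n) {
        int d = (j - i + n) % n;
        return d < n - d ? d : n - d;      /* worst case: n / 2 */
    }

    int main(void) {
        int n = 16;
        printf("uni worst case: %d hops\n", uni_ring_hops(0, n - 1, n)); /* 15 */
        printf("bi  worst case: %d hops\n", bi_ring_hops(0, n / 2, n));  /* 8  */
        return 0;
    }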
Chordal Ring 3 (diagram)
Chordal Ring 4 (diagram)
Barrel Shifter
Worst-case distance: n/2 (for 2^n nodes)
Grid/Torus/Illiac Torus
Worst-case distance: 2(√n - 1) for a grid, √n - 1 for the Illiac mesh, about √n for a torus
Cost: n
Fully Interconnected
Worst-case distance: 1
Cost: n(n-1)/2
Trees
Worst-case distance: 2 log n
Cost: n
Fat Trees
Worst-case distance: 2 log n
Cost: n log n
Dancehalls/Butterflies
Worst-case distance: log n
Cost: n log n
Hypercubes
Worst-case distance: d (= log2 n)
Cost: n·d/2
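A sketch of why the worst case is d hops: label each of the n = 2^d nodes with a d-bit address; neighbours differ in exactly one bit, so a message must correct every differing bit, one per hop (illustrative code, not from the slides):

    /* Distance between two hypercube nodes = Hamming distance of their labels. */
    #include <stdio.h>

    static int hypercube_hops(unsigned src, unsigned dst) {
        unsigned diff = src ^ dst;   /* bits that still have to be corrected */
        int hops = 0;
        while (diff) {
            hops += diff & 1u;
            diff >>= 1;
        }
        return hops;
    }

    int main(void) {
        int d = 4;                                              /* 2^4 = 16 nodes */
        printf("worst case: %d hops\n", hypercube_hops(0x0, 0xF)); /* = d = 4     */
        printf("links: %d\n", (1 << d) * d / 2);                /* cost n*d/2 = 32 */
        return 0;
    }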
Intel Paragon
• Intel i860-based machine
• “Dual CPU” nodes
  – 50 MHz CPUs
  – Share a 400 MB/sec cache-coherent bus
• Grid architecture
• Mother of ASCI Red
Intel Paragon (diagram)
SP2
• Based on RS/6000 nodes
  – POWER2 processors
• Special NIC: the MSMU, on the Micro Channel bus
• Standard Ethernet on the Micro Channel bus
• MSMUs interconnected via an HPS backplane
SP2 MSMU (diagram)
SP2 HPS
• Links are 8 bits wide (parallel)
• Contention-free latency is 5 cycles (125 ns at 40 MHz) per switch stage
  – 875 ns for a 512-node system
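Taking the two latency figures together, the number of switch crossings they imply can be checked with simple arithmetic (a back-of-the-envelope sketch; the per-stage figure is the 5-cycles-at-40-MHz value quoted above):

    /* Arithmetic check on the two HPS latency figures quoted above. */
    #include <stdio.h>

    int main(void) {
        double per_stage_ns = 125.0;   /* 5 cycles at 40 MHz */
        double total_ns     = 875.0;   /* contention-free latency, 512 nodes */
        printf("implied switch crossings: %.0f\n", total_ns / per_stage_ns); /* 7 */
        return 0;
    }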
SP2 HPS (diagram)
ASCI Red
• Built by Intel for the Department of “Energy”
• Consists of almost 5000 dual Pentium Pro boards with a special adaptation for user-level message passing
• Special support for internal ‘firewalls’
ASCI Red Node (diagram)
ASCI Red MRC (diagram)
ASCI Red Grid (diagram)
Scali
• Intel- or SPARC-based nodes
• Nodes are connected by a Dolphin SCI interface, organised as a grid of rings
• Very high-performance MPI and support for commodity operating systems
Performance??? (figure)
Earth Simulator (figures, several slides)
BlueGene/L
October 2003: BG/L half-rack prototype
• 500 MHz
• 512 nodes / 1024 processors
• 2 TFlop/s peak
• 1.4 TFlop/s sustained
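The 2 TFlop/s peak figure follows directly from the clock and the FPU width; a rough check, assuming each processor's double FPU issues two fused multiply-adds (4 flops) per cycle:

    /* Rough peak-performance check for the half-rack prototype.
     * Assumption: double FPU = 2 FMAs = 4 flops per cycle per processor. */
    #include <stdio.h>

    int main(void) {
        double clock_hz        = 500e6;   /* 500 MHz prototype clock   */
        int    processors      = 1024;    /* 512 nodes x 2 processors  */
        int    flops_per_cycle = 4;
        double peak = clock_hz * processors * flops_per_cycle;
        printf("peak: %.2f TFlop/s\n", peak / 1e12);   /* ~2.05 TFlop/s */
        return 0;
    }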
BlueGene/L ASIC node (block diagram labels)
• PowerPC 440
• Double 64-bit FPU
• 2 KB L2
• L3 cache directory (SRAM)
• L3 cache (EDRAM)
• DDR
• JTAG
• Gigabit Ethernet adapter
BlueGene/L Interconnection Networks
3-Dimensional Torus
  – Interconnects all compute nodes (65,536)
  – Virtual cut-through hardware routing
  – 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  – Communications backbone for computations
  – 350/700 GB/s bisection bandwidth
Global Tree
  – One-to-all broadcast functionality
  – Reduction operations functionality
  – 2.8 Gb/s of bandwidth per link
  – Latency of tree traversal on the order of 5 µs
  – Interconnects all compute and I/O nodes (1024)
Ethernet
  – Incorporated into every node ASIC
  – Active in the I/O nodes (1:64)
  – All external comm. (file I/O, control, user interaction, etc.)
Low-Latency Global Barrier and Interrupt Control Network
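A quick consistency check of the per-node and bisection figures, assuming the full machine is laid out as a 64 x 32 x 32 torus with 1.4 Gb/s per link per direction (the layout is an assumption, not stated on the slide):

    /* Back-of-the-envelope torus bandwidth check.
     * Assumptions: 64 x 32 x 32 torus, 1.4 Gb/s per link per direction. */
    #include <stdio.h>

    int main(void) {
        double link_gbps      = 1.4;     /* per link, per direction     */
        int    links_per_node = 12;      /* 6 neighbours x 2 directions */
        double per_node_GBps  = links_per_node * link_gbps / 8.0;
        printf("per node: %.1f GB/s\n", per_node_GBps);          /* ~2.1 GB/s */

        /* Bisect across the long (64) dimension: a 32 x 32 plane of links,
         * doubled because the torus wrap-around is cut as well. */
        int    cut_links   = 32 * 32 * 2;
        double bisect_GBps = cut_links * link_gbps / 8.0;
        printf("bisection (one way):  %.0f GB/s\n", bisect_GBps);        /* ~358 */
        printf("bisection (two ways): %.0f GB/s\n", 2.0 * bisect_GBps);  /* ~717 */
        return 0;
    }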
BG/L – Familiar software environment
• Fortran, C, C++ with MPI
  – Full language support
  – Automatic SIMD FPU exploitation
• Linux development environment
  – Cross-compilers and other cross-tools execute on Linux front-end nodes
  – Users interact with the system from the front-end nodes
• Tools – support for debuggers, hardware performance monitors, trace-based visualization
• POSIX system calls – compute processes “feel like” they are executing in a Linux environment (with restrictions)
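As an illustration of the "familiar environment" claim, an ordinary MPI C program with nothing BG/L-specific in it compiles with the cross-compilers on a front-end node and runs on the compute nodes (a generic example, not taken from the slides):

    /* Plain MPI C program: built on the Linux front end, run on compute nodes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* A global reduction, the kind of collective the Global Tree accelerates. */
        double local = (double)rank, sum = 0.0;
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d ranks, sum of ranks = %.0f\n", size, sum);

        MPI_Finalize();
        return 0;
    }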
Measured MPI Send Bandwidth (plot)
Latency @ 500 MHz = (5.9 + 0.13 * “Manhattan distance”) µs
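A sketch of how the fitted latency model is applied: count the torus hops (Manhattan distance with wrap-around) between two nodes and plug the total into the formula. The torus dimensions and coordinates below are illustrative, not from the slide:

    /* Predicted MPI send latency from the fitted model above:
     * latency_us = 5.9 + 0.13 * (Manhattan distance in torus hops). */
    #include <stdio.h>

    static int torus_hops(int a, int b, int dim) {
        int d = a > b ? a - b : b - a;
        return d < dim - d ? d : dim - d;      /* shorter way around the torus */
    }

    int main(void) {
        int dims[3] = {8, 8, 8};               /* illustrative 8x8x8 torus */
        int src[3]  = {0, 0, 0};
        int dst[3]  = {4, 3, 1};

        int hops = 0;
        for (int i = 0; i < 3; i++)
            hops += torus_hops(src[i], dst[i], dims[i]);

        double latency_us = 5.9 + 0.13 * hops;
        printf("%d hops -> predicted latency %.2f us\n", hops, latency_us);
        return 0;
    }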
NAS Parallel Benchmarks
• All NAS Parallel Benchmarks run successfully on 256 nodes (and many other configurations)
  – No tuning / code changes
• Compared 500 MHz BG/L against a 450 MHz Cray T3E
• All BG/L benchmarks were compiled with both the GNU and XL compilers
  – Best result reported (GNU for IS)
• BG/L is a factor of two to three faster on five benchmarks (BT, FT, LU, MG, and SP) and a bit slower on one (EP)