  1. Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer CS 425 term project By Sam Miller samm@scl.ameslab.gov April 18, 2005 4/15/05 1 of 37

  2. Outline • What is BlueGene/L? (5 slides) • Hardware (3 slides) • Communication Networks (2 slides) • Software (2 slides) • MPI and MPICH (1 slide) • Collective Algorithms (5 slides) • Better Collective Algorithms! (12 slides) • Performance • Conclusion 4/15/05 2 of 37

  3. Abbreviations Today • BGL = BlueGene/L • CNK = Compute Node Kernel • MPI = Message Passing Interface • MPICH2 = MPICH 2.0 from Argonne Labs • ASIC = Application Specific Integrated Circuit • ALU = Arithmetic Logic Unit • IBM = International Biscuit Makers (duh) 4/15/05 3 of 37

  4. What is BGL 1/2 • Massively parallel distributed memory cluster of embedded processors • 65,536 nodes! 131,072 processors! • Low power requirements • Relatively small, compared to predecessors • Half system installed at LLNL • Other systems going online too 4/15/05 4 of 37

  5. What is BGL 2/2 • BlueGene/L at LLNL (360 Tflops) – 2,500 square feet, half a tennis court • Earth Simulator (40 Tflops) – 35,000 square feet, requires an entire building 4/15/05 5 of 37


  9. Hardware 1/3 • CPU is PowerPC 440 – Designed for embedded applications – Low power, low clock frequency (700 MHz) – 32 bit :-( • FPU is custom 64-bit – Each PPC 440 core has two of these – The two FPUs operate in parallel – @ 700MHz this is 2.8 Gflops per PPC 440 core 4/15/05 9 of 37

  10. Hardware 2/3 • BGL ASIC – Two PPC 440 cores, four FPUs – L1, L2, L3 caches – DDR memory controller – Logic for 5 separate communications networks – This forms one compute node 4/15/05 10 of 37

  11. Hardware 3/3 • To build the entire 65,536 node system – Two ASICs with 256 or 512 MB DDR RAM form a compute card – Sixteen compute cards form a node board – Sixteen node boards form a midplane – Two midplanes form a rack – Sixty four racks brings us to: – 2x16x16x2x64 = 65,536! 4/15/05 11 of 37

  13. Communication Networks 1/2 • Five different networks – 3D torus • Primary for MPI library – Global tree • Used for collectives on MPI_COMM_WORLD • Used by compute nodes to communicate with I/O nodes – Global interrupt • 1.5 usec latency over entire 65k node system! – JTAG • Used for node bootup and servicing – Gigabit Ethernet • Used by I/O nodes 4/15/05 13 of 37

  14. Communication Networks 2/2 • Torus – Each node has 6 neighbors, connected by bi-directional links at 154 MB/sec – Guarantees reliable, deadlock-free delivery – Chosen for its high-bandwidth nearest-neighbor connectivity – Used in prior supercomputers, such as the Cray T3E 4/15/05 14 of 37
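
To make the torus topology concrete, here is a small stand-alone sketch that prints the six wraparound neighbors of a node in a 3-D torus. The 8x8x8 dimensions are illustrative values chosen for the example, not taken from the talk.

    #include <stdio.h>

    /* Example partition dimensions (an 8x8x8 sub-torus; illustrative only). */
    #define NX 8
    #define NY 8
    #define NZ 8

    /* Print the six torus neighbors of node (x,y,z): one in each direction
     * along each axis, with wraparound at the edges. */
    static void print_neighbors(int x, int y, int z)
    {
        int dims[3] = { NX, NY, NZ };
        int me[3]   = { x, y, z };

        for (int axis = 0; axis < 3; axis++) {
            for (int dir = -1; dir <= 1; dir += 2) {
                int n[3] = { me[0], me[1], me[2] };
                /* wraparound: stepping off one edge re-enters on the other */
                n[axis] = (me[axis] + dir + dims[axis]) % dims[axis];
                printf("neighbor of (%d,%d,%d): (%d,%d,%d)\n",
                       x, y, z, n[0], n[1], n[2]);
            }
        }
    }

    int main(void)
    {
        print_neighbors(0, 0, 0);   /* corner node: every link either wraps or steps inward */
        return 0;
    }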

  15. Software 1/2 • Compute nodes run a stripped-down custom kernel called CNK – Two threads, 1 per CPU – No context switching, no virtual memory paging – Standard glibc interface, easy to port applications to – About 5,000 lines of C++ • I/O nodes run standard PPC Linux – They have disk access – Run a daemon called the console I/O daemon (ciod) 4/15/05 15 of 37

  16. Software 2/2 • Network software has 3 layers – Topmost is MPI Library – Middle is Message Layer • Allows transmission of arbitrary buffer sizes – Bottom is Packet layer • Very simple • Stateless interface to torus, tree, and GI hardware • Facilitates sending & receiving packets 4/15/05 16 of 37
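
As a rough illustration of this layering (not the real interfaces), the sketch below shows a hypothetical message-layer send that slices an arbitrarily sized buffer into fixed-size packets for a stateless packet-layer call. Both function names are assumptions made for the sketch; the 240-byte payload constant is the figure quoted later on the alltoall slides.

    #include <stddef.h>

    #define TORUS_PAYLOAD 240   /* per-packet payload, as quoted later in the talk */

    /* Hypothetical packet-layer entry point: push one packet onto the torus. */
    void packet_layer_send(int dest, const void *payload, size_t bytes);

    /* Hypothetical message-layer send built on top of the packet layer:
     * accepts an arbitrary buffer size and fragments it into packets. */
    void message_layer_send(int dest, const char *buf, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += TORUS_PAYLOAD) {
            size_t len = bytes - off < TORUS_PAYLOAD ? bytes - off : TORUS_PAYLOAD;
            packet_layer_send(dest, buf + off, len);
        }
    }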

  17. MPICH • Developed by Argonne National Laboratory • Open source, freely available, standards-compliant MPI implementation • Used by many vendors • Chosen by IBM for its Abstract Device Interface (ADI) and its design for scalability 4/15/05 17 of 37

  18. Collective Algorithms 1/5 • Collectives can be implemented with basic sends and receives – Better algorithms exist • Default MPICH2 collectives perform poorly on BGL – They assume a crossbar network and map nodes poorly – Point-to-point messages incur high overhead – No knowledge of network-specific features 4/15/05 18 of 37
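
As a minimal sketch of what "collectives built from sends and receives" means, the broadcast below uses nothing but MPI point-to-point calls. It is deliberately naive (the root sends to every rank in turn) and is not the MPICH2 or BGL implementation; it only illustrates why topology-unaware point-to-point collectives leave performance on the table.

    #include <mpi.h>

    /* Naive broadcast built only from point-to-point operations. */
    static void naive_bcast(void *buf, int count, MPI_Datatype type,
                            int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            /* Root sends the whole buffer to every other rank in turn. */
            for (int dst = 0; dst < size; dst++)
                if (dst != root)
                    MPI_Send(buf, count, type, dst, 0, comm);
        } else {
            MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }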

  19. Collective Algorithms 2/5 • Optimization is tricky – Message size and communicator shape are the deciding factors – Large messages == optimize bandwidth – Short messages == optimize latency • I will not talk about short-message collectives further today • If an optimized algorithm isn't available, BGL falls back on the default MPICH2 one – It will work because point-to-point messages work – Performance will suffer, however 4/15/05 19 of 37

  20. Collective Algorithms 3/5 • Conditions for selecting an optimized collective algorithm are evaluated locally – What is wrong with this? • Example: char buf[100], buf2[20000]; if (rank == 0) MPI_Bcast(buf, 100, …); else MPI_Bcast(buf2, 20000, …); – Not legal according to the MPI standard, but… – What if one node uses the optimized algorithm and the others use the MPICH2 algorithm? • Deadlock - or worse 4/15/05 20 of 37
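
Below is the slide's example fleshed out into a compilable program; the datatype, root, and communicator arguments are assumed here, since the slide elides them. The call is erroneous MPI because the ranks do not agree on the amount of data, and if rank 0 chose the optimized algorithm while the others fell back to the default MPICH2 one, the result could be a hang or worse.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[100], buf2[20000];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Mismatched counts across ranks: not legal MPI, and exactly the
         * case where locally made algorithm choices can diverge. */
        if (rank == 0)
            MPI_Bcast(buf,  100,   MPI_CHAR, 0, MPI_COMM_WORLD);
        else
            MPI_Bcast(buf2, 20000, MPI_CHAR, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }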

  21. Collective Algorithms 4/5 • Solution to the previous problem: – Make the optimization decision globally – This incurs a slight latency hit – Thus it is only used when the bandwidth gains outweigh that cost, e.g. long-message collectives 4/15/05 21 of 37
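
A minimal sketch of what a global decision could look like, assuming the eligibility test is "message long enough and communicator shape supported": each rank computes a local flag and an allreduce takes the minimum, so every rank makes the same choice. The threshold value and the structure are illustrative assumptions, not the actual BGL code.

    #include <mpi.h>

    #define LONG_MSG_BYTES 65536   /* illustrative cutoff for "long" messages */

    static int use_optimized_bcast(int bytes, int comm_is_rectangular, MPI_Comm comm)
    {
        /* Local eligibility: long enough message and a communicator shape
         * the optimized torus algorithm supports. */
        int local_ok  = (bytes >= LONG_MSG_BYTES) && comm_is_rectangular;
        int global_ok = 0;

        /* This extra allreduce is the "slight latency hit" from the slide,
         * which is why it is only worth paying for long-message collectives. */
        MPI_Allreduce(&local_ok, &global_ok, 1, MPI_INT, MPI_MIN, comm);
        return global_ok;
    }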

  22. Collective Algorithms 5/5 • Remainder of slides – MPI_Bcast – MPI_Reduce, MPI_Allreduce – MPI_Alltoall, MPI_Alltoallv • Using both the tree and torus networks – Tree operates only on MPI_COMM_WORLD • Has a built-in ALU, but only fixed point :-( – Torus has a deposit-bit feature, but requires a rectangular communicator shape (for most algorithms) 4/15/05 22 of 37

  23. Broadcast 1/3 • MPICH2 – Binomial tree for short messages – Scatter then Allgather for large messages – Performs poorly on BGL due to high CPU overhead and lack of topology awareness • Torus – Uses the deposit-bit feature – For an n-dimensional mesh, 1/n of the message is sent in each direction concurrently • Tree – Does not use the ALU 4/15/05 23 of 37
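
A conceptual sketch of the dimension-splitting idea: the buffer is cut into one chunk per mesh dimension and each chunk is pushed along its own axis, so all axes carry traffic concurrently. The helper deposit_bcast_along_axis() is purely hypothetical and stands in for the real deposit-bit packet machinery.

    #include <stddef.h>

    #define DIMS 3   /* the BGL torus is three-dimensional */

    /* Hypothetical helper: broadcast one chunk along a single torus axis
     * using the deposit-bit feature. */
    void deposit_bcast_along_axis(int axis, char *chunk, size_t bytes);

    void torus_bcast_sketch(char *buf, size_t bytes)
    {
        size_t chunk = bytes / DIMS;
        for (int axis = 0; axis < DIMS; axis++) {
            size_t off = axis * chunk;
            size_t len = (axis == DIMS - 1) ? bytes - off : chunk;
            /* Each 1/n of the message travels along its own dimension,
             * keeping all axes busy at once. */
            deposit_bcast_along_axis(axis, buf + off, len);
        }
    }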

  24. Broadcast 2/3 • Red lines represent a spanning tree carrying one half of the message • Blue lines represent a second spanning tree carrying the other half 4/15/05 24 of 37

  25. Broadcast 3/3 4/15/05 25 of 37

  26. Reduce & Allreduce 1/4 • Reduce is essentially a reverse broadcast • Allreduce is a reduce followed by a broadcast • Torus – Can't use the deposit-bit feature – CPU bound, bandwidth is poor – Solution: a Hamiltonian path; a huge latency penalty, but great bandwidth • Tree – Natural choice for reductions on integers! – Floating-point performance is bad 4/15/05 26 of 37
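
The sketch below shows the dataflow of a reduction along a Hamiltonian path, written with plain MPI point-to-point calls and the simplifying assumption that ranks are already numbered along the path. The real implementation works at the packet level and pipelines the message; this sketch is only meant to show the chain dataflow that gives high latency (the running sum must visit every node in sequence) but, once pipelined, good bandwidth.

    #include <mpi.h>

    /* Sum-reduce along a chain of ranks; tmp must hold at least count doubles. */
    static void path_reduce_sum(double *data, double *tmp, int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank > 0) {
            /* Wait for the running partial sum from the predecessor on the path. */
            MPI_Recv(tmp, count, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < count; i++)
                data[i] += tmp[i];   /* fold the predecessor's contribution in */
        }
        if (rank < size - 1)
            MPI_Send(data, count, MPI_DOUBLE, rank + 1, 0, comm);
        /* The final result ends up on the last rank of the path; broadcasting it
         * back would turn this reduce into an allreduce. */
    }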

  27. Reduce & Allreduce 2/4 • Hamiltonian path for 4x4x4 cube 4/15/05 27 of 37

  28. Reduce & Allreduce 3/4 4/15/05 28 of 37

  29. Reduce & Allreduce 4/4 4/15/05 29 of 37

  30. Alltoall and Alltoallv 1/5 • MPICH2 has 4 algorithms – Yes, 4 separate ones – BGL performance is bad due to network hot spots and CPU overhead • Torus – No communicator size restriction! – Does not use the deposit-bit feature • Tree – No tree alltoall algorithm; it would not make sense 4/15/05 30 of 37

  31. Alltoall and Alltoallv 2/5 • BGL optimized torus algorithm – Uses randomized packet injection – Each node creates a destination list – Each node uses the same seed value but a different offset • Bad memory performance? – Yes! – Torus payload is 240 bytes (8 cache lines) – Multiple packets from adjacent cache lines are injected to each destination before advancing • Measurements showed 2 packets to be optimal 4/15/05 31 of 37
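
A sketch of the "same seed, different offset" idea: every rank builds an identical random permutation of destinations from a shared seed, then starts walking it at its own rank offset, so the early packets of the alltoall fan out to different targets instead of creating a hot spot. The Fisher-Yates shuffle and the use of srand()/rand() are illustrative choices, not the BGL code.

    #include <stdlib.h>

    /* Fill order[] with this rank's randomized destination order. */
    static void make_dest_order(int *order, int nranks, int myrank,
                                unsigned shared_seed)
    {
        int *perm = malloc(nranks * sizeof *perm);

        /* Same seed on every rank -> identical base permutation. */
        srand(shared_seed);
        for (int i = 0; i < nranks; i++)
            perm[i] = i;
        for (int i = nranks - 1; i > 0; i--) {     /* Fisher-Yates shuffle */
            int j = rand() % (i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }

        /* Each rank starts at a different offset into the shared permutation,
         * so ranks do not all inject toward the same destination at once. */
        for (int k = 0; k < nranks; k++)
            order[k] = perm[(k + myrank) % nranks];

        free(perm);
    }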

  32. Alltoall and Alltoallv 3/5 4/15/05 32 of 37

  33. Alltoall and Alltoallv 4/5 4/15/05 33 of 37

  34. Alltoall and Alltoallv 5/5 4/15/05 34 of 37

  35. Conclusion • Optimized collectives on BGL are off to a good start – Better performance than the default MPICH2 algorithms – They exploit knowledge of network features – They avoid performance penalties like memory copies and network hot spots • Much work remains – Short-message collectives – Non-rectangular communicators for the torus network – Tree collectives using communicators other than MPI_COMM_WORLD – Other collectives: scatter, gather, etc. 4/15/05 35 of 37

  36. Questions? 4/15/05 36 of 37
