  1. Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer CS 425 term project By Sam Miller samm@scl.ameslab.gov April 18, 2005 4/15/05 1 of 37

  2. Outline • What is BlueGene/L? (5 slides) • Hardware (3 slides) • Communication Networks (2 slides) • Software (2 slides) • MPI and MPICH (1 slide) • Collective Algorithms (5 slides) • Better Collective Algorithms! (12 slides) • Performance • Conclusion 4/15/05 2 of 37

  3. Abbreviations Today • BGL = BlueGene/L • CNK = Compute Node Kernel • MPI = Message Passing Interface • MPICH2 = MPICH 2.0 from Argonne Labs • ASIC = Application Specific Integrated Circuit • ALU = Arithmetic Logic Unit • IBM = International Biscuit Makers (duh) 4/15/05 3 of 37

  4. What is BGL 1/2 • Massively parallel distributed memory cluster of embedded processors • 65,536 nodes! 131,072 processors! • Low power requirements • Relatively small, compared to predecessors • Half system installed at LLNL • Other systems going online too 4/15/05 4 of 37

  5. What is BGL 2/2 • BlueGene/L at LLNL (360 Tflops) – 2,500 square feet, half a tennis court • Earth Simulator (40 Tflops) – 35,000 square feet, requires an entire building 4/15/05 5 of 37


  9. Hardware 1/3 • CPU is PowerPC 440 – Designed for embedded applications – Low power, low clock frequency (700 MHz) – 32 bit :-( • FPU is custom 64-bit – Each PPC 440 core has two of these – The two FPUs operate in parallel – @ 700MHz this is 2.8 Gflops per PPC 440 core 4/15/05 9 of 37

  10. Hardware 2/3 • BGL ASIC – Two PPC 440 cores, four FPUs – L1, L2, L3 caches – DDR memory controller – Logic for 5 separate communications networks – This forms one compute node 4/15/05 10 of 37

  11. Hardware 3/3 • To build the entire 65,536 node system – Two ASICs with 256 or 512 MB DDR RAM form a compute card – Sixteen compute cards form a node board – Sixteen node boards form a midplane – Two midplanes form a rack – Sixty four racks brings us to: – 2x16x16x2x64 = 65,536! 4/15/05 11 of 37

  13. Communication Networks 1/2 • Five different networks – 3D torus • Primary for MPI library – Global tree • Used for collectives on MPI_COMM_WORLD • Used by compute nodes to communicate with I/O nodes – Global interrupt • 1.5 usec latency over entire 65k node system! – JTAG • Used for node bootup and servicing – Gigabit Ethernet • Used by I/O nodes 4/15/05 13 of 37

  14. Communication Networks 2/2 • Torus – Each node has 6 neighbors, connected by bi-directional links at 154 MB/sec – Guarantees reliable, deadlock-free delivery – Chosen for its high-bandwidth nearest-neighbor connectivity – Used in prior supercomputers, such as the Cray T3E 4/15/05 14 of 37
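
To make the torus topology concrete, here is a small stand-alone sketch that prints the six wraparound neighbors of a node in a 3-D torus. The 8x8x8 dimensions are illustrative values chosen for the example, not taken from the talk.

    #include <stdio.h>

    /* Example partition dimensions (an 8x8x8 sub-torus; illustrative only). */
    #define NX 8
    #define NY 8
    #define NZ 8

    /* Print the six torus neighbors of node (x,y,z): one in each direction
     * along each axis, with wraparound at the edges. */
    static void print_neighbors(int x, int y, int z)
    {
        int dims[3] = { NX, NY, NZ };
        int me[3]   = { x, y, z };

        for (int axis = 0; axis < 3; axis++) {
            for (int dir = -1; dir <= 1; dir += 2) {
                int n[3] = { me[0], me[1], me[2] };
                /* wraparound: stepping off one edge re-enters on the other */
                n[axis] = (me[axis] + dir + dims[axis]) % dims[axis];
                printf("neighbor of (%d,%d,%d): (%d,%d,%d)\n",
                       x, y, z, n[0], n[1], n[2]);
            }
        }
    }

    int main(void)
    {
        print_neighbors(0, 0, 0);   /* corner node: every link either wraps or steps inward */
        return 0;
    }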

  15. Software 1/2 • Compute nodes run a stripped-down custom kernel called CNK – Two threads, 1 per CPU – No context switching, no virtual memory paging – Standard glibc interface, easy to port applications to – About 5,000 lines of C++ • I/O nodes run standard PPC Linux – They have disk access – Run a daemon called the console I/O daemon (ciod) 4/15/05 15 of 37

  16. Software 2/2 • Network software has 3 layers – Topmost is MPI Library – Middle is Message Layer • Allows transmission of arbitrary buffer sizes – Bottom is Packet layer • Very simple • Stateless interface to torus, tree, and GI hardware • Facilitates sending & receiving packets 4/15/05 16 of 37
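
As a rough illustration of this layering (not the real interfaces), the sketch below shows a hypothetical message-layer send that slices an arbitrarily sized buffer into fixed-size packets for a stateless packet-layer call. Both function names are assumptions made for the sketch; the 240-byte payload constant is the figure quoted later on the alltoall slides.

    #include <stddef.h>

    #define TORUS_PAYLOAD 240   /* per-packet payload, as quoted later in the talk */

    /* Hypothetical packet-layer entry point: push one packet onto the torus. */
    void packet_layer_send(int dest, const void *payload, size_t bytes);

    /* Hypothetical message-layer send built on top of the packet layer:
     * accepts an arbitrary buffer size and fragments it into packets. */
    void message_layer_send(int dest, const char *buf, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += TORUS_PAYLOAD) {
            size_t len = bytes - off < TORUS_PAYLOAD ? bytes - off : TORUS_PAYLOAD;
            packet_layer_send(dest, buf + off, len);
        }
    }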

  17. MPICH • Developed by Argonne National Laboratory • Open source, freely available, standards-compliant MPI implementation • Used by many vendors • Chosen by IBM for its Abstract Device Interface (ADI) and its design for scalability 4/15/05 17 of 37

  18. Collective Algorithms 1/5 • Collectives can be implemented with basic sends and receives – Better algorithms exist • Default MPICH2 collectives perform poorly on BGL – They assume a crossbar network and map nodes poorly – Point-to-point messages incur high overhead – No knowledge of network-specific features 4/15/05 18 of 37
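
As a minimal sketch of what "collectives built from sends and receives" means, the broadcast below uses nothing but MPI point-to-point calls. It is deliberately naive (the root sends to every rank in turn) and is not the MPICH2 or BGL implementation; it only illustrates why topology-unaware point-to-point collectives leave performance on the table.

    #include <mpi.h>

    /* Naive broadcast built only from point-to-point operations. */
    static void naive_bcast(void *buf, int count, MPI_Datatype type,
                            int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            /* Root sends the whole buffer to every other rank in turn. */
            for (int dst = 0; dst < size; dst++)
                if (dst != root)
                    MPI_Send(buf, count, type, dst, 0, comm);
        } else {
            MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }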

  19. Collective Algorithms 2/5 • Optimization is tricky – Message size and communicator shape are the deciding factors – Large messages == optimize bandwidth – Short messages == optimize latency • I will not talk about short-message collectives further today • If an optimized algorithm isn't available, BGL falls back on the default MPICH2 one – It will work because point-to-point messages work – Performance will suffer, however 4/15/05 19 of 37

  20. Collective Algorithms 3/5 • Conditions for selecting an optimized collective algorithm are evaluated locally – What is wrong with this? • Example: char buf[100], buf2[20000]; if (rank == 0) MPI_Bcast(buf, 100, …); else MPI_Bcast(buf2, 20000, …); – Not legal according to the MPI standard, but… – What if one node uses the optimized algorithm and the others use the MPICH2 algorithm? • Deadlock - or worse 4/15/05 20 of 37
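
Below is the slide's example fleshed out into a compilable program; the datatype, root, and communicator arguments are assumed here, since the slide elides them. The call is erroneous MPI because the ranks do not agree on the amount of data, and if rank 0 chose the optimized algorithm while the others fell back to the default MPICH2 one, the result could be a hang or worse.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[100], buf2[20000];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Mismatched counts across ranks: not legal MPI, and exactly the
         * case where locally made algorithm choices can diverge. */
        if (rank == 0)
            MPI_Bcast(buf,  100,   MPI_CHAR, 0, MPI_COMM_WORLD);
        else
            MPI_Bcast(buf2, 20000, MPI_CHAR, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }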

  21. Collective Algorithms 4/5 • Solution to the previous problem: – Make the optimization decision globally – This incurs a slight latency hit – Thus it is only used when the bandwidth gains outweigh that cost, e.g. long-message collectives 4/15/05 21 of 37
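
A minimal sketch of what a global decision could look like, assuming the eligibility test is "message long enough and communicator shape supported": each rank computes a local flag and an allreduce takes the minimum, so every rank makes the same choice. The threshold value and the structure are illustrative assumptions, not the actual BGL code.

    #include <mpi.h>

    #define LONG_MSG_BYTES 65536   /* illustrative cutoff for "long" messages */

    static int use_optimized_bcast(int bytes, int comm_is_rectangular, MPI_Comm comm)
    {
        /* Local eligibility: long enough message and a communicator shape
         * the optimized torus algorithm supports. */
        int local_ok  = (bytes >= LONG_MSG_BYTES) && comm_is_rectangular;
        int global_ok = 0;

        /* This extra allreduce is the "slight latency hit" from the slide,
         * which is why it is only worth paying for long-message collectives. */
        MPI_Allreduce(&local_ok, &global_ok, 1, MPI_INT, MPI_MIN, comm);
        return global_ok;
    }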

  22. Collective Algorithms 5/5 • Remainder of slides – MPI_Bcast – MPI_Reduce, MPI_Allreduce – MPI_Alltoall, MPI_Alltoallv • Using both the tree and torus networks – Tree operates only on MPI_COMM_WORLD • Has a built-in ALU, but only fixed point :-( – Torus has a deposit-bit feature, but requires a rectangular communicator shape (for most algorithms) 4/15/05 22 of 37

  23. Broadcast 1/3 • MPICH2 – Binomial tree for short messages – Scatter then Allgather for large messages – Performs poorly on BGL due to high CPU overhead and lack of topology awareness • Torus – Uses the deposit-bit feature – For an n-dimensional mesh, 1/n of the message is sent in each direction concurrently • Tree – Does not use the ALU 4/15/05 23 of 37
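
A conceptual sketch of the dimension-splitting idea: the buffer is cut into one chunk per mesh dimension and each chunk is pushed along its own axis, so all axes carry traffic concurrently. The helper deposit_bcast_along_axis() is purely hypothetical and stands in for the real deposit-bit packet machinery.

    #include <stddef.h>

    #define DIMS 3   /* the BGL torus is three-dimensional */

    /* Hypothetical helper: broadcast one chunk along a single torus axis
     * using the deposit-bit feature. */
    void deposit_bcast_along_axis(int axis, char *chunk, size_t bytes);

    void torus_bcast_sketch(char *buf, size_t bytes)
    {
        size_t chunk = bytes / DIMS;
        for (int axis = 0; axis < DIMS; axis++) {
            size_t off = axis * chunk;
            size_t len = (axis == DIMS - 1) ? bytes - off : chunk;
            /* Each 1/n of the message travels along its own dimension,
             * keeping all axes busy at once. */
            deposit_bcast_along_axis(axis, buf + off, len);
        }
    }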

  24. Broadcast 2/3 • Red lines represent a spanning tree carrying one half of the message • Blue lines represent a second spanning tree carrying the other half 4/15/05 24 of 37

  25. Broadcast 3/3 4/15/05 25 of 37

  26. Reduce & Allreduce 1/4 • Reduce is essentially a reverse broadcast • Allreduce is a reduce followed by a broadcast • Torus – Can't use the deposit-bit feature – CPU bound, bandwidth is poor – Solution: a Hamiltonian path; a huge latency penalty, but great bandwidth • Tree – Natural choice for reductions on integers! – Floating-point performance is bad 4/15/05 26 of 37
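
The sketch below shows the dataflow of a reduction along a Hamiltonian path, written with plain MPI point-to-point calls and the simplifying assumption that ranks are already numbered along the path. The real implementation works at the packet level and pipelines the message; this sketch is only meant to show the chain dataflow that gives high latency (the running sum must visit every node in sequence) but, once pipelined, good bandwidth.

    #include <mpi.h>

    /* Sum-reduce along a chain of ranks; tmp must hold at least count doubles. */
    static void path_reduce_sum(double *data, double *tmp, int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank > 0) {
            /* Wait for the running partial sum from the predecessor on the path. */
            MPI_Recv(tmp, count, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < count; i++)
                data[i] += tmp[i];   /* fold the predecessor's contribution in */
        }
        if (rank < size - 1)
            MPI_Send(data, count, MPI_DOUBLE, rank + 1, 0, comm);
        /* The final result ends up on the last rank of the path; broadcasting it
         * back would turn this reduce into an allreduce. */
    }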

  27. Reduce & Allreduce 2/4 • Hamiltonian path for 4x4x4 cube 4/15/05 27 of 37

  28. Reduce & Allreduce 3/4 4/15/05 28 of 37

  29. Reduce & Allreduce 4/4 4/15/05 29 of 37

  30. Alltoall and Alltoallv 1/5 • MPICH2 has 4 algorithms – Yes, 4 separate ones – BGL performance is bad due to network hot spots and CPU overhead • Torus – No communicator size restriction! – Does not use the deposit-bit feature • Tree – No tree alltoall algorithm; it would not make sense 4/15/05 30 of 37

  31. Alltoall and Alltoallv 2/5 • BGL optimized torus algorithm – Uses randomized packet injection – Each node creates a destination list – Each node uses the same seed value but a different offset • Bad memory performance? – Yes! – Torus payload is 240 bytes (8 cache lines) – Multiple packets from adjacent cache lines are injected to each destination before advancing • Measurements showed 2 packets to be optimal 4/15/05 31 of 37
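
A sketch of the "same seed, different offset" idea: every rank builds an identical random permutation of destinations from a shared seed, then starts walking it at its own rank offset, so the early packets of the alltoall fan out to different targets instead of creating a hot spot. The Fisher-Yates shuffle and the use of srand()/rand() are illustrative choices, not the BGL code.

    #include <stdlib.h>

    /* Fill order[] with this rank's randomized destination order. */
    static void make_dest_order(int *order, int nranks, int myrank,
                                unsigned shared_seed)
    {
        int *perm = malloc(nranks * sizeof *perm);

        /* Same seed on every rank -> identical base permutation. */
        srand(shared_seed);
        for (int i = 0; i < nranks; i++)
            perm[i] = i;
        for (int i = nranks - 1; i > 0; i--) {     /* Fisher-Yates shuffle */
            int j = rand() % (i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }

        /* Each rank starts at a different offset into the shared permutation,
         * so ranks do not all inject toward the same destination at once. */
        for (int k = 0; k < nranks; k++)
            order[k] = perm[(k + myrank) % nranks];

        free(perm);
    }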

  32. Alltoall and Alltoallv 3/5 4/15/05 32 of 37

  33. Alltoall and Alltoallv 4/5 4/15/05 33 of 37

  34. Alltoall and Alltoallv 5/5 4/15/05 34 of 37

  35. Conclusion • Optimized collectives on BGL are off to a good start – Better performance than the default MPICH2 algorithms – They exploit knowledge of network features – They avoid performance penalties like memory copies and network hot spots • Much work remains – Short-message collectives – Non-rectangular communicators for the torus network – Tree collectives using communicators other than MPI_COMM_WORLD – Other collectives: scatter, gather, etc. 4/15/05 35 of 37

  36. Questions? 4/15/05 36 of 37
