A Network-Failure-Tolerant Message-Passing System for Terascale Clusters
Richard L. Graham, Advanced Computing, Los Alamos National Laboratory
LA-MPI team and past contributors • David Daniel • Ron Minnich • Nehal Desai • Sung-Eun Choi • Rich Graham • Craig Rasmussen • Dean Risinger • Ling-Ling Chen • Mitch Sukalski • MaryDell Nochumson • Steve Karmesin • Peter Beckman • Contact: lampi-support@lanl.gov
Why yet another MPI? • We build very large clusters with the latest and best interconnects – the integration issue: end-to-end reliability is not assured
Definitions • Path – a homogeneous network transport object • Fragment striping – sending fragments of a single message along several different physical devices of a given Path • Message striping – sending different messages along different Paths (see the sketch below)
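A minimal sketch of the two striping modes, assuming hypothetical types and a placeholder send_fragment(); none of these names are LA-MPI's actual API:

    #include <stddef.h>

    /* Hypothetical types, for illustration only. */
    typedef struct { int id; } device_t;
    typedef struct { device_t *devices; int n_devices; } path_t;  /* one homogeneous transport */
    typedef struct { const char *data; size_t len; } fragment_t;

    /* Placeholder for the device-level send. */
    static void send_fragment(device_t *dev, fragment_t *frag) { (void)dev; (void)frag; }

    /* Fragment striping: fragments of ONE message rotate over the
       physical devices belonging to a single Path. */
    static void stripe_fragments(path_t *path, fragment_t *frags, int n_frags)
    {
        for (int i = 0; i < n_frags; i++)
            send_fragment(&path->devices[i % path->n_devices], &frags[i]);
    }

    /* Message striping: successive MESSAGES rotate over different Paths;
       each message travels whole on the Path chosen for it. */
    static void stripe_messages(path_t *paths, int n_paths, fragment_t *msgs, int n_msgs)
    {
        for (int i = 0; i < n_msgs; i++)
            send_fragment(&paths[i % n_paths].devices[0], &msgs[i]);
    }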
Definitions • Reliability – detecting and correcting non-catastrophic, transient failures (e.g., corrupted or dropped packets) • Resilience – surviving catastrophic network failures
LA-MPI design goals • Message passing support for terascale clusters • Fault tolerant (reliable, resilient) • High performance • Thread Safe • Support widely used message passing API (MPI) • Multi-platform and supportable (Open source)
LA-MPI architecture • Two-component design * Run-time job control - job startup - standard I/O - job monitoring * Message-passing library - resource management - message management
[Architecture diagram: a user application calls the MPI interface, which sits on the memory and message management layer (MML) containing the network path scheduler; beneath it, the send and receive layer (SRL) drives shared-memory communication and the network paths (Net A, B, C), reaching other hosts either through kernel-level network subsystem drivers or through user-level OS bypass.]
[Flowchart: the reliability protocol. A message created by MPI is associated with a Path and fragmented; each fragment is sent to the destination process and a timer is started. If the retransmission timeout fires before the fragment is acknowledged, the fragment is retransmitted. The destination checks each received fragment against the posted receive and generates an ACK if it was received OK, or a NACK otherwise; a NACK triggers retransmission, a specific ACK releases the fragment, and otherwise aggregate ACK information is recorded. See the sketch below.]
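A sender-side sketch of the protocol in the flowchart, under assumed names; the fragment list, timeout value, and ACK handling are illustrative, not LA-MPI's internals:

    #include <stdbool.h>
    #include <time.h>

    typedef struct frag {
        struct frag *next;
        time_t       sent_at;  /* when the fragment was last (re)transmitted */
        bool         acked;    /* set once a specific ACK arrives */
    } frag_t;

    #define RETRANSMIT_TIMEOUT 5.0  /* seconds; illustrative value */

    static void transmit(frag_t *f) { (void)f; }  /* device-level send; stub */

    /* Timer pass: retransmit any fragment that has neither been positively
       acknowledged nor retransmitted recently. */
    static void timer_scan(frag_t *unacked)
    {
        time_t now = time(NULL);
        for (frag_t *f = unacked; f != NULL; f = f->next) {
            if (!f->acked && difftime(now, f->sent_at) >= RETRANSMIT_TIMEOUT) {
                transmit(f);
                f->sent_at = now;
            }
        }
    }

    /* ACK/NACK handling: a NACK (e.g., the receiver saw a corrupt fragment)
       forces immediate retransmission; a specific ACK releases the fragment. */
    static void handle_ack(frag_t *f, bool is_nack)
    {
        if (is_nack) {
            transmit(f);
            f->sent_at = time(NULL);
        } else {
            f->acked = true;  /* sender may now free the fragment's resources */
        }
    }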
Platforms Supported • OS's: Linux (i686, alpha), TRU64, IRIX; 32- and 64-bit versions • "Interconnects": Shared Memory, UDP, ELAN3, HIPPI-800
CICE – 64 time steps (run time in seconds)

                 Boundary exchange   Total time
    MPICH        13.5                117.0
    MPT          8.43                99.2
    LA-MPI       8.02                96.5
Zero Byte Half Round Trip Latency (uSec)

    Platform               LA-MPI   MPT     MPICH
    O2K (shared mem)       7.0      6.5     19.9
    O2K (HIPPI-800)        155.3    143.5   N/A
    O2K (IP)               526.7    525.6   586.0
    i686 (shared mem)      2.3      N/A     23.5
    i686 (IP)              132.8    N/A     123.5
    ES45 (1G shared mem)   2.3      N/A     5.5
    ES45 (1G ELAN-3)       13.8     N/A     4.5
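For context, half-round-trip latency of this kind is conventionally measured with an MPI ping-pong loop; a minimal sketch (the iteration count is illustrative, and this is not the benchmark actually used for the table):

    #include <mpi.h>
    #include <stdio.h>

    /* Zero-byte ping-pong between ranks 0 and 1; the reported latency is
       half the averaged round-trip time. */
    int main(int argc, char **argv)
    {
        int rank;
        const int iters = 10000;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("half round trip: %.1f uSec\n", (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }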
Peak Bandwidth (MB/Sec)

    Platform               LA-MPI        MPT     MPICH
    O2K (shared mem)       145           135     85.7
    O2K (HIPPI-800)        135           73      N/A
    O2K (IP)               36.8          34.8    8.7
    i686 (shared mem)      168           N/A     131
    i686 (IP)              11.3          N/A     11.0
    ES45 (1G shared mem)   690           N/A     760
    ES45 (1G IP)           281 (1 NIC)   N/A     290
Allgather (uSec/call)

                    LA-MPI                    MPT
    Host x nProcs   40 bytes   40000 bytes    40 bytes   40000 bytes
    1 x 2           24.5       623            35.2       959
    1 x 4           39.3       1380           70         3140
    1 x 32          595        15500          403        57600
    1 x 64          2590       48800          774        153000
    1 x 120         8480       129000         would not run
    2 x 4           329        7150           1660       19900
    2 x 32          907        165000         9700       245000
    4 x 4           633        18400          2400       51200
    4 x 32          1669.2     407891.9       13500      639000
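The operation timed above is MPI_Allgather; a minimal usage sketch of the 40-byte case (buffer contents are illustrative):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Each rank contributes 'count' bytes; every rank receives the
       concatenation of all contributions (nprocs * count bytes). */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int count = 40;  /* the 40-byte case from the table */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char *sendbuf = malloc(count);
        char *recvbuf = malloc((size_t)nprocs * count);
        memset(sendbuf, rank, count);

        MPI_Allgather(sendbuf, count, MPI_BYTE,
                      recvbuf, count, MPI_BYTE, MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }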
[Figure: HIPPI-800 ping-pong bandwidth (MB/Sec) vs. message size (8 bytes to 1 MB), for LA-MPI striping over 1 to 4 NICs (1N–4N), compared with MPT.]
Future Directions • Finish the resilience work • Additional interconnects (Myrinet 2000) • "Progress" engine • Dynamic reconfiguration
Summary • The only known MPI implementation with end-to-end fault tolerance • A well-performing implementation • Currently supported on several platforms