A Power-Aware, Application-Based, Performance Study Of Modern Commodity Cluster Interconnection Networks
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine
Open Systems Lab, Indiana University, Bloomington, USA
CAC'09 - IPDPS'09, Rome, Italy, May 25th, 2009
Motivation I (economic)
[Figure: Commercial Energy Price; x-axis: Year (1960 to 2005), y-axis: Price [cent/kWh] (2 to 11)]
Motivation II (personal)
Motivation III (scientific)
Interconnection network is the heart of parallel computing
  How do we compare different network technologies?
  Microbenchmarks! Often latency and bandwidth only
  Is this enough to predict application performance?
Power consumption is becoming a problem for system designers
  Green500 list as an addition to Top500
  Power input (cooling!) is a major design goal for large systems
  What about power efficiency of the network?
Experiment Setup
We compare three different network technologies:
  Fiber-based Myrinet 10G
  Copper-based Myrinet 10G
  Copper-based ConnectX InfiniBand
We compare latency and bandwidth results (NetPIPE) and application performance on absolutely identical systems:
  OpenMPI 1.2.8, OFED 1.3, MX 1.4.3
  SLES 10 SP 2 (Linux 2.6.16)
  14 nodes, 2 × 4 Xeons L5420 2.5 GHz
  4 GiB RAM per core
Microbenchmark Results - Latency
[Figure: Latency [usec] (0 to 7) over message size [byte] (1 to 100) for IB-C, MX-C, and MX-F, all with OMPI]
Latency: IB 1.4 µs, MX-F 2.5 µs, MX-C 2.8 µs
Microbenchmark Results - Throughput
[Figure: Throughput [Gb/s] (0 to 16) over message size [byte] (1.0k to 16.8M) for IB-C, MX-C, and MX-F, all with OMPI]
Bandwidth: IB 13.9 Gib/s (86.9%), MX 9.1 Gib/s (91%)
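The percentages next to the bandwidth figures read as the fraction of the peak link data rate that was reached. A minimal sketch of that arithmetic follows. The peak rates are assumptions that do not appear on the slide: 16 Gb/s for the InfiniBand link (a 4x DDR link after 8b/10b encoding) and 10 Gb/s for Myrinet 10G. They were chosen because they reproduce the reported percentages.

```python
# Measured peak throughput from the slide.
MEASURED = {"IB": 13.9, "MX": 9.1}

# ASSUMPTION (not on the slide): peak data rates in the same unit as above.
ASSUMED_PEAK = {"IB": 16.0, "MX": 10.0}


def link_efficiency_pct(measured, peak):
    """Return measured throughput as a percentage of the peak data rate."""
    return 100.0 * measured / peak


efficiency = {
    net: link_efficiency_pct(MEASURED[net], ASSUMED_PEAK[net])
    for net in MEASURED
}

for net, pct in efficiency.items():
    print(f"{net}: {pct:.3f}% of assumed peak")
```

Under these assumptions MX uses a larger share of its link than IB does, even though IB is faster in absolute terms.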
Microbenchmark Summary
Results:
  IB performs significantly better in nearly all configurations!
  MX-F is slightly faster than MX-C
  OMPI's MX eager-rendezvous switching point seems suboptimal
Projection:
  IB should deliver higher application performance
  no data about power consumption yet
⇒ proceeding to real application runs!
  three runs with each application/network
  lowest running time counts
  all results were very stable (< 3% variance)
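The run-selection rule (three runs, lowest running time counts) can be sketched as below. The runtimes are hypothetical placeholders, not measurements from the study, and the stability check is one possible reading of "< 3% variance": it uses the spread between slowest and fastest run relative to the fastest, which is an assumption.

```python
def best_of(runtimes_s):
    """Return (lowest runtime, spread in percent relative to the lowest)."""
    best = min(runtimes_s)
    spread_pct = 100.0 * (max(runtimes_s) - best) / best
    return best, spread_pct


# HYPOTHETICAL runtimes in seconds for three runs of one application/network.
runs = [101.0, 100.0, 102.0]

best, spread = best_of(runs)
stable = spread < 3.0  # assumed interpretation of the "< 3%" criterion

print(f"reported runtime: {best} s, spread: {spread:.2f}%, stable: {stable}")
```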
Application Performance - MILC
Quantum chromodynamics code (nuclear physics)
Multiple programs
We used the NERSC "medium" benchmark for su3rmd
Runtime:
  IB: 444s (123s MPI)
  MX-C: 435s (115s MPI)
  MX-F: 426s (107s MPI)
[Figure: Time [s] (0 to 150) for IB-C, MX-C, and MX-F, by MPI function: MPI_Allreduce, MPI_Comm_rank, MPI_Init, MPI_Irecv, MPI_Isend, MPI_Wait]
Application Performance - POP
Ocean circulation simulations
We used the x1 POP benchmark (32 cores on 14 nodes)
Runtime:
  IB: 66s (10s MPI)
  MX-C: 63s (7s MPI)
  MX-F: 61s (5s MPI)
[Figure: Time [s] (0 to 14) for IB-C, MX-C, and MX-F, by MPI function: MPI_Allreduce, MPI_Bcast, MPI_Init, MPI_Irecv, MPI_Isend, MPI_Waitall]
Application Performance - RAxML
Models evolution by building phylogenetic trees from DNA
We calculated 112 trees (1 per core) from 50 genome sequences with 5000 base pairs each
Runtime:
  IB: 746s (35s MPI)
  MX-C: 743s (32s MPI)
  MX-F: 738s (32s MPI)!
[Figure: Time [s] (0 to 40) for IB-C, MX-C, and MX-F, by MPI function: MPI_Finalize, MPI_Init, MPI_Probe]
Application Performance - WPP
Simulates time-dependent elastic and viscoelastic propagation of waves which occur during earthquakes and explosions
3D seismic modelling with finite difference methods
30k × 30k × 17k grid, single wave source (LOH1 example) on 112 cores
Runtime:
  IB: 702s (51s MPI)
  MX-C: 706s (57s MPI)
  MX-F: 701s (53s MPI)!
[Figure: Time [s] (0 to 60) for IB-C, MX-C, and MX-F, by MPI function: MPI_Allreduce, MPI_Barrier, MPI_Cart_create, MPI_Finalize, MPI_Init, MPI_Sendrecv]
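The runtimes reported on the four application slides can be compared directly against the microbenchmark projection. The sketch below takes the (total, MPI) second pairs from the slides and computes the share of runtime spent in MPI and the fastest network per application. The table layout and function names are illustrative choices, not part of the study.

```python
# (total runtime [s], time in MPI [s]) as reported on the application slides.
RUNTIMES_S = {
    "MILC":  {"IB": (444, 123), "MX-C": (435, 115), "MX-F": (426, 107)},
    "POP":   {"IB": (66, 10),   "MX-C": (63, 7),    "MX-F": (61, 5)},
    "RAxML": {"IB": (746, 35),  "MX-C": (743, 32),  "MX-F": (738, 32)},
    "WPP":   {"IB": (702, 51),  "MX-C": (706, 57),  "MX-F": (701, 53)},
}


def mpi_share_pct(app, net):
    """Percentage of the total runtime that was spent inside MPI calls."""
    total_s, mpi_s = RUNTIMES_S[app][net]
    return 100.0 * mpi_s / total_s


def fastest_network(app):
    """Network with the lowest total runtime for the given application."""
    return min(RUNTIMES_S[app], key=lambda net: RUNTIMES_S[app][net][0])


for app in RUNTIMES_S:
    shares = ", ".join(
        f"{net} {mpi_share_pct(app, net):.1f}%" for net in RUNTIMES_S[app]
    )
    print(f"{app}: fastest = {fastest_network(app)}; MPI share: {shares}")
```

With these numbers the lowest total runtime belongs to MX-F for every application, although IB had the lower latency and higher bandwidth in the microbenchmarks.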
Power Measurements
Methodology:
  two APC 7800 PDUs, resolution 0.1 A (120 V)
  data sampled every second via SNMP
  compute total power consumption as discrete integral
Base Data:
  idle system: IB 17.7 A, MX-C 17.3 A, MX-F 16.9 A
  IB switch: Cisco TopSpin SFS 7000D 0.48 A
  MX switch: 0.75 A (0.45 A w/o fan)
  4 nodes idle vs. 8 MiB message-stream:
    IB: 3.9 A / 5.0 A
    MX-C: 3.77 A / 4.95 A (PML OB1)
    MX-C: 3.77 A / 4.75 A (MTL MX)
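The discrete integral in the methodology can be sketched as a rectangle-rule sum over the sampled current. This is a minimal illustration, not the authors' script. It assumes that power is simply voltage times current at a fixed 120 V (ignoring power factor), that the readings of both PDUs have already been added into one series, and that samples are exactly one second apart. The trace below is hypothetical.

```python
JOULES_PER_KWH = 3.6e6


def energy_kwh(current_samples_a, voltage_v=120.0, interval_s=1.0):
    """Rectangle-rule integral of P = U * I over evenly spaced samples."""
    joules = sum(voltage_v * amps * interval_s for amps in current_samples_a)
    return joules / JOULES_PER_KWH


# HYPOTHETICAL trace: one hour of one-second samples at a constant 25.0 A.
samples_a = [25.0] * 3600

energy = energy_kwh(samples_a)
print(f"energy: {energy:.3f} kWh")
```

With a PDU resolution of 0.1 A at 120 V, each sample is quantized in steps of 12 W, so small differences between fabrics only become visible once they are summed over a whole run.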
Power Consumption - MILC
[Figure: Power consumption [A] (23 to 29) over application run time [s] (50 to 450) for IB-C, MX-C, and MX-F]
Energy: IB 3.879 kWh, MX-C 0.1% less, MX-F 1.5% less
Power Consumption - POP
[Figure: Power consumption [A] (18 to 23) over application run time [s] (10 to 70) for IB-C, MX-C, and MX-F]
Energy: IB 0.458 kWh, MX-C 4.6% less, MX-F 11.3% less
Power Consumption - RAxML
[Figure: Power consumption [A] (29 to 36) over application run time [s] (0 to 700) for IB-C, MX-C, and MX-F]
Energy: IB 8.315 kWh, MX-C 1.8% less, MX-F 3.6% less
Power Consumption - WPP
[Figure: Power consumption [A] (27 to 31) over application run time [s] (0 to 700) for IB-C, MX-C, and MX-F]
Energy: IB 6.807 kWh, MX-C 0.4% less, MX-F 1.4% less
Conclusions
Microbenchmarks and simple metrics such as latency and bandwidth are not accurate performance predictors.
Other factors influence the performance of parallel applications, for example tag matching in hardware, memory registration, and cache pollution.
The network fabric can have an important impact on power consumption, up to 11% in our experiments.
Future Work
  more power-aware network fabric comparisons should be performed (not by us)
  study the influence of the driver stack on application performance
Thanks
Thanks for your attention! Questions?