
Application Sensitivity to Link and Injection Bandwidth on a Cray XT4 System
Cray User Group Conference, Helsinki, Finland, May 8, 2008
Kevin Pedretti, Brian Barrett, Scott Hemmert, and Courtenay Vaughan
Sandia National Laboratories


1. Application Sensitivity to Link and Injection Bandwidth on a Cray XT4 System
Cray User Group Conference, Helsinki, Finland, May 8, 2008
Kevin Pedretti, Brian Barrett, Scott Hemmert, and Courtenay Vaughan
Sandia National Laboratories, ktpedre@sandia.gov
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

2. Outline
• Introduction
• Link Bandwidth Detuning
• Injection Bandwidth Detuning
• Application Testing on Red Storm
• Future Work and Conclusions

3. Challenges: Exponentially Increasing Parallelism, Decreasing Balance Due to Power Constraints
[Figure (see key for units): projected growth to 2019. Today: roughly 900 TF from 75K cores at 12 GF/core. By 2019: 1 EF from 1.7M cores at 588 GF/core (green curve) or 28M cores at 35 GF/core (blue curve). Annotated growth rates of 72% per year and 89% per year.]

4. Motivation
• Many challenges ahead on the path to Exa-FLOPS
  – Institute of Advanced Architecture (IAA) starting up
  – Need tools to evaluate machine balance trade-offs
• Hmm… wouldn't it be great if we could vary link bandwidth, injection bandwidth, latency, and message rate independently from one another, on a real system like Red Storm?
  – Perform larger and longer experiments than possible with simulation
  – Validate application simulations and models
  – Guide future decisions

5. Approaches
• Application modeling
  – Develop mathematical models describing applications
  – Sandia: http://www.sandia.gov/PMAT/
• Simulators
  – Model hardware with software, FPGA, etc.
  – Speed generally decreases with fidelity
  – Sandia: SST (Structural Simulation Toolkit), with processor, memory, and network models (including the SeaStar)
• Execution-driven simulators
  – Run the application on a real system; virtual time is tracked by a centralized network scheduler/simulator (possibly SST)
  – Sandia: Seshat
• Empirical experiments
  – This talk
  – Related to MPI detuning work on ASCI Red (Ron Brightwell)

6. Outline
• Introduction
• Link Bandwidth Detuning
• Injection Bandwidth Detuning
• Application Testing on Red Storm
• Future Work and Conclusions

7. From the XT4 Brochure…
What is this "degraded mode" and how is it enabled?

8. Degraded Link Mode
• Network links consist of many parallel wires
• What if one of the wires/drivers goes bad? Options:
  – Fix or disable the link and reboot (the answer today)
  – Reroute on-the-fly, avoiding the bad link
  – Disable the faulty wire(s) and distribute traffic over the remaining wires => degraded mode
• Degraded mode can be enabled on a per link-type basis via the rm.ini file (/opt/cray/etc/rm.ini)

9. Proof of Concept on XT4 Development Cage
[Figure: point-to-point bandwidth vs. message size for the ¼, ½, ¾, and full link width configurations]
• Point-to-point bandwidth is clearly throttled by the host injection rate.
• No difference between the ¾ and full link bandwidth configurations.
• Full link bandwidth is approximately 773 MB/s × 4 = 3092 MB/s in each direction.
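The measurement behind this plot is a standard two-rank bandwidth sweep. The following is a minimal sketch of such a test, not the benchmark actually run on the development cage; the 4 MB maximum message size and the iteration count are assumptions.

/*
 * Illustrative point-to-point bandwidth sweep.  Run with exactly two
 * MPI ranks: rank 0 sends each message size to rank 1 and waits for an
 * echo; bandwidth counts the data moved in both directions.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const size_t max_size = 4 << 20;   /* sweep up to 4 MB messages (assumed) */
    const int iters = 100;             /* iterations per size (assumed) */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(max_size);

    for (size_t size = 1; size <= max_size; size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / iters;   /* one round trip */
        if (rank == 0)
            printf("%9zu bytes  %8.1f MB/s\n", size, 2.0 * size / t / 1.0e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}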

10. Latency Remains Unchanged
[Figure: point-to-point latency for the ¼, ½, ¾, and full link width configurations; the curves are essentially identical]

11. 16-node MPI Alltoall
[Figure: MPI_Alltoall performance vs. message size for the ¼, ½, ¾, and full link width configurations]
• Difference <= 1% compared to full link bandwidth for message sizes:
  – ¼ BW: <= 2 KB
  – ½ BW: <= 4 KB
  – ¾ BW: <= 8 KB
• At 4 MB message size, compared to full link bandwidth:
  – ¼ BW: 3.42x worse
  – ½ BW: 1.63x worse
  – ¾ BW: 1.22x worse
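These collective numbers come from sweeping MPI_Alltoall over message sizes on 16 nodes. A minimal sketch of that kind of sweep follows; the per-destination buffer sizes and iteration count are assumptions, not the parameters of the benchmark actually run.

/*
 * Illustrative MPI_Alltoall timing sweep.  Each rank exchanges 'size'
 * bytes with every other rank; the average time per call is reported.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const size_t max_size = 4 << 20;   /* up to 4 MB per destination (assumed) */
    const int iters = 20;              /* iterations per size (assumed) */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sendbuf = malloc(max_size * nprocs);
    char *recvbuf = malloc(max_size * nprocs);

    for (size_t size = 1024; size <= max_size; size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Alltoall(sendbuf, (int)size, MPI_CHAR,
                         recvbuf, (int)size, MPI_CHAR, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / iters;
        if (rank == 0)
            printf("%9zu bytes per destination  %10.6f s per Alltoall\n",
                   size, t);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}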

12. Outline
• Introduction
• Link Bandwidth Detuning
• Injection Bandwidth Detuning
• Application Testing on Red Storm
• Future Work and Conclusions

13. HyperTransport Detuning
• The HyperTransport link between the Opteron and the SeaStar is set up at boot: by Coldstart on the Cray XT, by the BIOS on standard PCs
• Anyone may query the HT widths and frequencies supported by the SeaStar via PCI config space (a sketch follows this slide):
  – 8 or 16 bits wide
  – 200, 400, or 800 MHz
• The HT link configuration is currently hard-coded in Coldstart
• Ran into HT watchdog timeouts with the 200 MHz, 8-bit configuration (400 MB/s); easy fix via xtmemio
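The capability that advertises supported HT link widths and frequencies lives in the device's PCI configuration space, so it can be read from Linux without special tools. The sketch below walks the standard PCI capability list looking for the HyperTransport capability (ID 0x08). The device path is a hypothetical placeholder, and the field offsets quoted in the comments follow the HyperTransport Slave/Primary Interface layout; verify them against the specification before decoding the raw bytes.

/*
 * Illustrative scan of a PCI device's capability list for the
 * HyperTransport capability.  Reading past the 64-byte standard header
 * of the sysfs config file typically requires root.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Placeholder path; substitute the SeaStar's actual bus/device/function. */
    const char *path = "/sys/bus/pci/devices/0000:00:03.0/config";
    uint8_t cfg[4096] = {0};

    FILE *f = fopen(path, "rb");
    if (!f) {
        perror(path);
        return 1;
    }
    size_t n = fread(cfg, 1, sizeof cfg, f);
    fclose(f);
    if (n < 64) {
        fprintf(stderr, "short read of config space\n");
        return 1;
    }

    if (!(cfg[0x06] & 0x10)) {            /* PCI status bit 4: capability list */
        fprintf(stderr, "device has no capability list\n");
        return 1;
    }

    for (uint8_t off = cfg[0x34]; off; off = cfg[off + 1]) {
        if (cfg[off] != 0x08)             /* 0x08 = HyperTransport capability */
            continue;
        /*
         * Per the HT spec (verify these offsets before relying on them):
         *   off+0x06..0x07  Link Config 0   (max and actual link widths)
         *   off+0x0d        Link Freq 0     (low nibble: current frequency)
         *   off+0x0e..0x0f  Link Freq Cap 0 (bit mask of supported frequencies)
         * The width and frequency encodings (8/16 bits, 200/400/800 MHz, ...)
         * are defined there as well; only raw bytes are printed here.
         */
        printf("HT capability at 0x%02x: linkcfg=0x%02x%02x freq=0x%02x freqcap=0x%02x%02x\n",
               off, cfg[off + 0x07], cfg[off + 0x06], cfg[off + 0x0d],
               cfg[off + 0x0f], cfg[off + 0x0e]);
    }
    return 0;
}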

14. HyperTransport Link Detuning: Effect on Bandwidth
[Figure: measured bandwidth for the detuned HyperTransport link configurations]

15. HyperTransport Link Detuning: Effect on Latency
[Figure: measured latency for the detuned HyperTransport link configurations]

16. Outline
• Introduction
• Link Bandwidth Detuning
• Injection Bandwidth Detuning
• Application Testing on Red Storm
• Future Work and Conclusions

17. Single Cabinet Testing, 80 Nodes
• Used to build confidence before full Red Storm testing
• Applications tested:
  – CTH: shock physics; weak scaling, non-AMR
  – HPCCG: sparse conjugate gradient solver mini-application; strong and weak scaling
  – LAMMPS: molecular dynamics; strong and weak scaling
  – SAGE: hydrodynamics; strong scaling

18. Only Application with a Significant Difference was CTH
[Figure: CTH results in VN mode and SN mode]

19. Red Storm Jumbo Testing in Degraded Mode
• Somehow convinced management that Red Storm would not be harmed (Thanks!)
  – Successful cabinet testing
  – Simple configuration file change (rm.ini) plus reboot
• Testing performed April 22-24, 2008
  – First three days used for quad-core Catamount testing and comparison with CNL (Courtenay Vaughan's talk on Wednesday)
  – One 8-hour window for degraded link bandwidth testing
  – Tested the CTH and Partisn codes
• Caveats
  – Non-identical node layouts (MOAB vs. interactive)
  – Only enough time for one trial at each data point

20. [Figure: Partisn scaling with ¼ and full link bandwidth]
• At 8192 nodes, the ¼ bandwidth configuration is 10.3% worse than full bandwidth.
• Standard Partisn test problem, set up to stress latency; ~50 MB per process.

21. [Figure: scaling comparison of CNL and Catamount]
• At 8192 nodes, CNL (2.0.44) is 49% worse than Catamount on this problem.
• This does not appear to be a bandwidth issue.

22. [Figure: Partisn with Accelerated Portals (AP) and Generic Portals (GP)]
• Accelerated Portals (AP) has ~30% lower latency than Generic Portals (GP), but only improves Partisn performance by 1-8%.

23. [Figure: scaling with ¼ and full link bandwidth]
• At 8192 nodes, the ¼ bandwidth configuration is 32.6% worse than full bandwidth.
• Many ~2.5 MB nearest-neighbor messages for this problem.

24. [Figure: VN-mode and SN-mode scaling with ¼ and full link bandwidth]
• At 8192 nodes, the ¼ bandwidth configuration is 13% worse than full bandwidth in SN mode and 21% worse in VN mode.
• Many ~2 MB nearest-neighbor messages for this problem.

25. [Figure: behavior at small node counts in VN and SN modes]
• Unexplained performance boost in degraded mode for <= 4 nodes (one board) in VN mode.
• SN mode behaves as expected.

26. [Figure: Catamount and CNL scaling curves]
• The more jagged Catamount curve is thought to be caused by MOAB, which preferentially allocates 2 GB nodes on Red Storm.
• CNL was tested using interactive-mode aprun.

27. Future Work
• Independent control of message latency
  – Leverage earlier work on Portals event timestamps
  – MPI library modifications
  – Null entries at the head of the match list
• CPU frequency and memory detuning
• Application testing using checkpoints from production runs
  – Real problems
  – Run a few timesteps rather than the entire problem

28. Conclusions
• It is possible to independently control link bandwidth and injection bandwidth on Cray XT4 systems
• Application testing on the full Red Storm system booted into degraded ¼ link bandwidth mode was successful
  – Partisn: 10.3% worse at 8192 nodes
  – CTH: 13 - 36.2% worse at 8192 nodes
• A useful platform for large-scale machine balance experiments

29. Acknowledgements
• Courtenay Vaughan – Red Storm testing
• Kurt Ferreira – single cabinet testing
• Sue Kelly and Bob Ballance – approving and allocating Red Storm system time
• Cray – consultation
