
Hints to improve automatic load balancing with LeWI for hybrid applications



  1. Hints to improve automatic load balancing with LeWI for hybrid applications. Marta Garcia, Jesus Labarta, Julita Corbalan. Journal of Parallel and Distributed Computing, Volume 74, Issue 9, September 2014. 1 / 27

  2. Motivation: Loss of efficiency; Hybrid programming models (MPI + X); Manual tuning of parallel codes (load balancing, data redistribution). 2 / 27

  3. The X (in this paper). SMPSs (SMPSuperscalar): task as basic element; annotate taskifiable functions and their parameters (in/out/inout); task graph to track dependencies; number of threads may change at any time. OpenMP: directives to annotate parallel code; fork/join model with shared memory; number of threads may change between parallel regions. 3 / 27
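
A minimal C sketch of the contrast described on this slide. The kernel, the sizes NB/BS and the function names are invented for illustration, and the SMPSs pragma syntax (#pragma css task with input/output/inout clauses) follows the SMPSs user guide as I recall it, not anything shown in the paper.

```c
/* Illustrative blocked kernel: NB, BS and update_block are invented names. */
#define NB 256           /* number of blocks   */
#define BS 1024          /* elements per block */

/* SMPSs: the task is the basic element.  Annotating the function with the
   directionality of its parameters (input/inout) lets the runtime build the
   task graph, and the number of worker threads may change at any time. */
#pragma css task input(x) inout(y)
void update_block(float x[BS], float y[BS]) {
  for (int i = 0; i < BS; i++)
    y[i] += x[i];
}

void step_smpss(float x[NB][BS], float y[NB][BS]) {
  for (int b = 0; b < NB; b++)
    update_block(x[b], y[b]);   /* asynchronous task spawn       */
  #pragma css barrier            /* wait for all spawned tasks    */
}

/* OpenMP: fork/join with shared memory.  The thread team is fixed when the
   parallel region opens, so the thread count can only change between
   parallel regions. */
void step_openmp(float x[NB][BS], float y[NB][BS]) {
  #pragma omp parallel for
  for (int b = 0; b < NB; b++)
    update_block(x[b], y[b]);
}
```

The practical difference for LeWI follows from the slide: an SMPSs runtime can put a newly lent CPU to work on the next ready task at any moment, while an OpenMP team can only grow at the next parallel region.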

  4. DLB and LeWI. DLB (Dynamic Load Balancing): “runtime interposition to [...] intercept MPI calls”; balances load on the inner level (OpenMP/SMPSs); several load balancing algorithms. LeWI (Lend CPU when Idle): the CPUs of a rank blocked in an MPI call are idle, so lend them to other ranks and recover them when the MPI call completes. 4 / 27
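
A minimal sketch of the interposition idea behind DLB/LeWI, not the actual DLB implementation: the standard PMPI profiling interface lets a library wrap blocking MPI calls, lend the rank's CPUs while it blocks, and reclaim them afterwards. lend_my_cpus, reclaim_my_cpus and maybe_use_lent_cpus are hypothetical stand-ins for DLB's node-local bookkeeping, which is not shown.

```c
#include <mpi.h>
#include <omp.h>

/* Hypothetical stubs standing in for DLB's node-local bookkeeping
   (shared state between the ranks running on the same node). */
static void lend_my_cpus(void)    { /* publish this rank's CPUs as idle */ }
static void reclaim_my_cpus(void) { /* mark them as busy again          */ }

/* Any blocking call can be wrapped the same way; MPI_Barrier shown. */
int MPI_Barrier(MPI_Comm comm) {
  lend_my_cpus();                /* rank is about to block: its CPUs are idle */
  int err = PMPI_Barrier(comm);  /* forward to the real MPI implementation    */
  reclaim_my_cpus();             /* MPI call completed: recover the CPUs      */
  return err;
}

/* A rank that still has work would query the shared state at a malleability
   point (parallel region or task boundary) and grow its thread team. */
void maybe_use_lent_cpus(int owned_cpus, int lent_cpus) {
  omp_set_num_threads(owned_cpus + lent_cpus);
}
```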

  5. LeWI (a) No load balancing. (b) LeWI algorithm with SMPSs. (c) LeWI algorithm with OpenMP. 5 / 27

  6. Approach “ Extensive performance evaluation ” “ Modeling parallelization characteristics that limit the automatic load balancing potential ” “ Improving automatic load balancing ” 6 / 27

  7. Performance evaluation. MareNostrum 2: 2 × IBM PowerPC 970MP (2 cores) and 8 GiB RAM per node; Linux 2.6.5-7.244-pseries64; MPICH; IBM XL C/C++ compiler w/o optimizations. Metrics: Speedup = serial_execution_time / parallel_execution_time; Efficiency = useful_cpu_time / (elapsed_time × cpus), where useful_cpu_time = cpu_time − (mpi_time + openmp/smpss_time + dlb_time); CPUs_used = number of CPUs simultaneously running application code. 3 benchmarks + 2 real applications. 7 / 27
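
A small C helper that evaluates the metrics exactly as reconstructed above; the numbers in main are invented purely to show the arithmetic.

```c
#include <stdio.h>

/* Speedup and efficiency as defined on the slide; all times in seconds. */
static double speedup(double serial_time, double parallel_time) {
  return serial_time / parallel_time;
}

static double efficiency(double cpu_time, double mpi_time, double runtime_time,
                         double dlb_time, double elapsed_time, int cpus) {
  /* useful_cpu_time excludes time spent inside MPI, the OpenMP/SMPSs
     runtime and DLB itself. */
  double useful_cpu_time = cpu_time - (mpi_time + runtime_time + dlb_time);
  return useful_cpu_time / (elapsed_time * cpus);
}

int main(void) {
  /* Invented example: 4 CPUs, 100 s elapsed, 400 s of CPU time of which
     60 s + 20 s + 5 s are MPI / runtime / DLB overhead. */
  printf("speedup    = %.2f\n", speedup(380.0, 100.0));       /* 3.80 */
  printf("efficiency = %.2f\n", efficiency(400.0, 60.0, 20.0,
                                           5.0, 100.0, 4));   /* 0.79 */
  return 0;
}
```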

  8. PILS (Parallel ImbaLance Simulation): synthetic benchmark whose core is “floating point operations without data involved”. Tunable parameters: programming model (MPI, MPI + OpenMP, MPI + SMPSs); load distribution; parallelism grain (= 1 / #parallel regions); iterations. 8 / 27
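
A minimal PILS-like MPI+OpenMP sketch, not the actual benchmark: each rank burns a different amount of pure floating-point work (no data involved), split into a configurable number of parallel regions so the effect of the parallelism grain can be observed. The linear load distribution, the sizes and the 64-chunk split are arbitrary choices for illustration; command-line handling, the SMPSs variant and the benchmark's other load-distribution patterns are omitted.

```c
#include <mpi.h>
#include <stdio.h>

/* Pure floating-point work with no data involved. */
static double burn(long flops) {
  double x = 1.0;
  for (long i = 0; i < flops; i++)
    x = x * 1.000001 + 0.000001;
  return x;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  long total_flops   = 400000000L;                  /* arbitrary size          */
  int  iterations    = 10;
  int  parallel_regs = 4;                           /* grain = 1/parallel_regs */
  long my_flops = total_flops * (rank + 1) / size;  /* linear imbalance        */
  double sink = 0.0;

  for (int it = 0; it < iterations; it++) {
    for (int r = 0; r < parallel_regs; r++) {
      /* Each parallel region is a point where LeWI may change this
         rank's thread count. */
      #pragma omp parallel for reduction(+:sink)
      for (int chunk = 0; chunk < 64; chunk++)
        sink += burn(my_flops / parallel_regs / 64);
    }
    MPI_Barrier(MPI_COMM_WORLD);   /* blocking call: idle ranks lend their CPUs */
  }

  if (rank == 0) printf("checksum %g\n", sink);
  MPI_Finalize();
  return 0;
}
```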

  9. PILS 9 / 27

  10. Parallelism Grain 10 / 27

  11. Other Codes. Benchmarks: BT-MZ, block tri-diagonal solver; LUB, LU matrix factorization. Applications: Gromacs, molecular dynamics, MPI-only; Gadget, cosmological N-body/SPH (smoothed-particle hydrodynamics) simulation. 11 / 27

  12. Other Codes. Benchmarks: BT-MZ, block tri-diagonal solver; LUB, LU matrix factorization. Applications: Gromacs, molecular dynamics, MPI-only; Gadget, cosmological N-body/SPH (smoothed-particle hydrodynamics) simulation. Evaluated configurations:
      Application | Original version          | MPI + OpenMP | MPI + SMPSs | Executed in nodes (cpus)
      PILS        | MPI + OpenMP, MPI + SMPSs | X            | X           | 1 (4)
      BT-MZ       | MPI + OpenMP              | X            | X           | 1, 2, 4 (4, 8, 16)
      LUB         | MPI + OpenMP, MPI + SMPSs | X            | X           | 1, 2, 4 (4, 8, 16)
      Gromacs     | MPI                       |              | X           | 1–64 (4–256)
      Gadget      | MPI                       |              | X           | 200 (800)
      11 / 27

  13. PILS, 2 and 4 MPI processes 12 / 27

  14. BT-MZ; 1 node 13 / 27

  15. BT-MZ; 2,4 nodes; Class C 14 / 27

  16. BT-MZ; 1 node; 4 MPI processes 15 / 27

  17. LUB; 1 node; Block size 200 16 / 27

  18. Gromacs; 1–64 nodes + Details for 16 nodes 17 / 27

  19. Gromacs; Efficiency + CPUs used per Node 18 / 27

  20. Gadget; 200 nodes 19 / 27

  21. Factors Limiting Performance Improvement with LeWI “ Parallelism Grain in OpenMP applications ” “ Task duration in SMPSs applications ” “ Distribution of MPI processes among computation nodes ” 20 / 27
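
A sketch of the first factor, refining the OpenMP parallelism grain, in the spirit of the LUB modification shown on the following slides (the concrete change made in the paper may differ). NB and process_block are invented for illustration.

```c
#define NB 512
static double cells[NB];
static void process_block(int i) { cells[i] += 1.0; }   /* placeholder work */

/* Coarse grain: one parallel region per sweep, so LeWI can adjust this
   rank's thread count only once per sweep. */
void sweep_coarse(void) {
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < NB; i++)
    process_block(i);
}

/* Finer grain: the same work split into several shorter parallel regions.
   Every region boundary is a point where lent CPUs can join the team (or
   be returned), at the cost of extra fork/join overhead. */
void sweep_fine(int regions) {
  int chunk = (NB + regions - 1) / regions;
  for (int r = 0; r < regions; r++) {
    int lo = r * chunk;
    int hi = (lo + chunk < NB) ? lo + chunk : NB;
    #pragma omp parallel for schedule(static)
    for (int i = lo; i < hi; i++)
      process_block(i);
  }
}
```

The analogous knob for SMPSs is the task duration listed as the second factor: shorter tasks give the runtime more opportunities to place work on lent CPUs, at the cost of extra scheduling overhead.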

  22. Parallelism Grain 21 / 27

  23. Modified Parallelism Grain in LUB 22 / 27

  24. Performance of Modified LUB 23 / 27

  25. Rank Distribution — BT-MZ 24 / 27

  26. Rank Distribution — Gromacs 25 / 27

  27. Rank Distribution — Gadget Total 26 / 27

  28. Conclusion. Summary: DLB/LeWI can improve performance transparently; inter-node load imbalances are not handled; granularity of parallelism and rank placement are important factors; the optimal configuration differs with vs. without DLB/LeWI. 27 / 27

  29. Conclusion. Summary: DLB/LeWI can improve performance transparently; inter-node load imbalances are not handled; granularity of parallelism and rank placement are important factors; the optimal configuration differs with vs. without DLB/LeWI. Discussion: interaction with MPI; benchmarks (1.5 of 3 NPB-MZ, arbitrary load distribution); how to find “the right” granularity. 27 / 27
