
Hints to improve automatic load balancing with LeWI for hybrid applications



  1. Hints to improve automatic load balancing with LeWI for hybrid applications. Marta Garcia, Jesus Labarta, Julita Corbalan. Journal of Parallel and Distributed Computing, Volume 74, Issue 9, September 2014. 1 / 27

  2. Motivation: Loss of efficiency; Hybrid programming models (MPI + X); Manual tuning of parallel codes (load balancing, data redistribution). 2 / 27

  3. The X (in this paper). SMPSs (SMPSuperscalar): task as basic element; annotate taskifiable functions and their parameters (in/out/inout); task graph to track dependencies; number of threads may change at any time. OpenMP: directives to annotate parallel code; fork/join model with shared memory; number of threads may change between parallel regions. 3 / 27
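
A minimal C sketch of the contrast described on this slide. The kernel, the sizes NB/BS and the function names are invented for illustration, and the SMPSs pragma syntax (#pragma css task with input/output/inout clauses) follows the SMPSs user guide as I recall it, not anything shown in the paper.

```c
/* Illustrative blocked kernel: NB, BS and update_block are invented names. */
#define NB 256           /* number of blocks   */
#define BS 1024          /* elements per block */

/* SMPSs: the task is the basic element.  Annotating the function with the
   directionality of its parameters (input/inout) lets the runtime build the
   task graph, and the number of worker threads may change at any time. */
#pragma css task input(x) inout(y)
void update_block(float x[BS], float y[BS]) {
  for (int i = 0; i < BS; i++)
    y[i] += x[i];
}

void step_smpss(float x[NB][BS], float y[NB][BS]) {
  for (int b = 0; b < NB; b++)
    update_block(x[b], y[b]);   /* asynchronous task spawn       */
  #pragma css barrier            /* wait for all spawned tasks    */
}

/* OpenMP: fork/join with shared memory.  The thread team is fixed when the
   parallel region opens, so the thread count can only change between
   parallel regions. */
void step_openmp(float x[NB][BS], float y[NB][BS]) {
  #pragma omp parallel for
  for (int b = 0; b < NB; b++)
    update_block(x[b], y[b]);
}
```

The practical difference for LeWI follows from the slide: an SMPSs runtime can put a newly lent CPU to work on the next ready task at any moment, while an OpenMP team can only grow at the next parallel region.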

  4. DLB and LeWI. DLB (Dynamic Load Balancing): “runtime interposition to [...] intercept MPI calls”; balances load on the inner level (OpenMP/SMPSs); several load balancing algorithms. LeWI (Lend CPU when Idle): the CPUs of a rank blocked in an MPI call are idle, so lend them to other ranks and recover them when the MPI call completes. 4 / 27
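
A minimal sketch of the interposition idea behind DLB/LeWI, not the actual DLB implementation: the standard PMPI profiling interface lets a library wrap blocking MPI calls, lend the rank's CPUs while it blocks, and reclaim them afterwards. lend_my_cpus, reclaim_my_cpus and maybe_use_lent_cpus are hypothetical stand-ins for DLB's node-local bookkeeping, which is not shown.

```c
#include <mpi.h>
#include <omp.h>

/* Hypothetical stubs standing in for DLB's node-local bookkeeping
   (shared state between the ranks running on the same node). */
static void lend_my_cpus(void)    { /* publish this rank's CPUs as idle */ }
static void reclaim_my_cpus(void) { /* mark them as busy again          */ }

/* Any blocking call can be wrapped the same way; MPI_Barrier shown. */
int MPI_Barrier(MPI_Comm comm) {
  lend_my_cpus();                /* rank is about to block: its CPUs are idle */
  int err = PMPI_Barrier(comm);  /* forward to the real MPI implementation    */
  reclaim_my_cpus();             /* MPI call completed: recover the CPUs      */
  return err;
}

/* A rank that still has work would query the shared state at a malleability
   point (parallel region or task boundary) and grow its thread team. */
void maybe_use_lent_cpus(int owned_cpus, int lent_cpus) {
  omp_set_num_threads(owned_cpus + lent_cpus);
}
```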

  5. LeWI (a) No load balancing. (b) LeWI algorithm with SMPSs. (c) LeWI algorithm with OpenMP. 5 / 27

  6. Approach “ Extensive performance evaluation ” “ Modeling parallelization characteristics that limit the automatic load balancing potential ” “ Improving automatic load balancing ” 6 / 27

  7. Performance evaluation. MareNostrum 2: 2 × IBM PowerPC 970MP (2 cores) and 8 GiB RAM per node; Linux 2.6.5-7.244-pseries64; MPICH; IBM XL C/C++ compiler w/o optimizations. Metrics: Speedup = serial_execution_time / parallel_execution_time; Efficiency = useful_cpu_time / (elapsed_time × cpus), where useful_cpu_time = cpu_time − (mpi_time + openmp/smpss_time + dlb_time); CPUs_used = number of CPUs simultaneously running application code. 3 benchmarks + 2 real applications. 7 / 27
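
A small C helper that evaluates the metrics exactly as reconstructed above; the numbers in main are invented purely to show the arithmetic.

```c
#include <stdio.h>

/* Speedup and efficiency as defined on the slide; all times in seconds. */
static double speedup(double serial_time, double parallel_time) {
  return serial_time / parallel_time;
}

static double efficiency(double cpu_time, double mpi_time, double runtime_time,
                         double dlb_time, double elapsed_time, int cpus) {
  /* useful_cpu_time excludes time spent inside MPI, the OpenMP/SMPSs
     runtime and DLB itself. */
  double useful_cpu_time = cpu_time - (mpi_time + runtime_time + dlb_time);
  return useful_cpu_time / (elapsed_time * cpus);
}

int main(void) {
  /* Invented example: 4 CPUs, 100 s elapsed, 400 s of CPU time of which
     60 s + 20 s + 5 s are MPI / runtime / DLB overhead. */
  printf("speedup    = %.2f\n", speedup(380.0, 100.0));       /* 3.80 */
  printf("efficiency = %.2f\n", efficiency(400.0, 60.0, 20.0,
                                           5.0, 100.0, 4));   /* 0.79 */
  return 0;
}
```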

  8. PILS (Parallel ImbaLance Simulation): synthetic benchmark whose core is “floating point operations without data involved”. Tunable parameters: programming model (MPI, MPI + OpenMP, MPI + SMPSs); load distribution; parallelism grain (= 1 / #parallel regions); iterations. 8 / 27
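
A minimal PILS-like MPI+OpenMP sketch, not the actual benchmark: each rank burns a different amount of pure floating-point work (no data involved), split into a configurable number of parallel regions so the effect of the parallelism grain can be observed. The linear load distribution, the sizes and the 64-chunk split are arbitrary choices for illustration; command-line handling, the SMPSs variant and the benchmark's other load-distribution patterns are omitted.

```c
#include <mpi.h>
#include <stdio.h>

/* Pure floating-point work with no data involved. */
static double burn(long flops) {
  double x = 1.0;
  for (long i = 0; i < flops; i++)
    x = x * 1.000001 + 0.000001;
  return x;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  long total_flops   = 400000000L;                  /* arbitrary size          */
  int  iterations    = 10;
  int  parallel_regs = 4;                           /* grain = 1/parallel_regs */
  long my_flops = total_flops * (rank + 1) / size;  /* linear imbalance        */
  double sink = 0.0;

  for (int it = 0; it < iterations; it++) {
    for (int r = 0; r < parallel_regs; r++) {
      /* Each parallel region is a point where LeWI may change this
         rank's thread count. */
      #pragma omp parallel for reduction(+:sink)
      for (int chunk = 0; chunk < 64; chunk++)
        sink += burn(my_flops / parallel_regs / 64);
    }
    MPI_Barrier(MPI_COMM_WORLD);   /* blocking call: idle ranks lend their CPUs */
  }

  if (rank == 0) printf("checksum %g\n", sink);
  MPI_Finalize();
  return 0;
}
```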

  9. PILS 9 / 27

  10. Parallelism Grain 10 / 27

  11. Other Codes. Benchmarks: BT-MZ, block tri-diagonal solver; LUB, LU matrix factorization. Applications: Gromacs, molecular dynamics, MPI-only; Gadget, cosmological N-body/SPH (smoothed-particle hydrodynamics) simulation. 11 / 27

  12. Other Codes. Benchmarks: BT-MZ, block tri-diagonal solver; LUB, LU matrix factorization. Applications: Gromacs, molecular dynamics, MPI-only; Gadget, cosmological N-body/SPH (smoothed-particle hydrodynamics) simulation. Evaluated configurations:
      Application | Original version          | MPI + OpenMP | MPI + SMPSs | Executed in nodes (cpus)
      PILS        | MPI + OpenMP, MPI + SMPSs | X            | X           | 1 (4)
      BT-MZ       | MPI + OpenMP              | X            | X           | 1, 2, 4 (4, 8, 16)
      LUB         | MPI + OpenMP, MPI + SMPSs | X            | X           | 1, 2, 4 (4, 8, 16)
      Gromacs     | MPI                       |              | X           | 1–64 (4–256)
      Gadget      | MPI                       |              | X           | 200 (800)
      11 / 27

  13. PILS, 2 and 4 MPI processes 12 / 27

  14. BT-MZ; 1 node 13 / 27

  15. BT-MZ; 2,4 nodes; Class C 14 / 27

  16. BT-MZ; 1 node; 4 MPI processes 15 / 27

  17. LUB; 1 node; Block size 200 16 / 27

  18. Gromacs; 1–64 nodes + Details for 16 nodes 17 / 27

  19. Gromacs; Efficiency + CPUs used per Node 18 / 27

  20. Gadget; 200 nodes 19 / 27

  21. Factors Limiting Performance Improvement with LeWI “ Parallelism Grain in OpenMP applications ” “ Task duration in SMPSs applications ” “ Distribution of MPI processes among computation nodes ” 20 / 27
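
A sketch of the first factor, refining the OpenMP parallelism grain, in the spirit of the LUB modification shown on the following slides (the concrete change made in the paper may differ). NB and process_block are invented for illustration.

```c
#define NB 512
static double cells[NB];
static void process_block(int i) { cells[i] += 1.0; }   /* placeholder work */

/* Coarse grain: one parallel region per sweep, so LeWI can adjust this
   rank's thread count only once per sweep. */
void sweep_coarse(void) {
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < NB; i++)
    process_block(i);
}

/* Finer grain: the same work split into several shorter parallel regions.
   Every region boundary is a point where lent CPUs can join the team (or
   be returned), at the cost of extra fork/join overhead. */
void sweep_fine(int regions) {
  int chunk = (NB + regions - 1) / regions;
  for (int r = 0; r < regions; r++) {
    int lo = r * chunk;
    int hi = (lo + chunk < NB) ? lo + chunk : NB;
    #pragma omp parallel for schedule(static)
    for (int i = lo; i < hi; i++)
      process_block(i);
  }
}
```

The analogous knob for SMPSs is the task duration listed as the second factor: shorter tasks give the runtime more opportunities to place work on lent CPUs, at the cost of extra scheduling overhead.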

  22. Parallelism Grain 21 / 27

  23. Modified Parallelism Grain in LUB 22 / 27

  24. Performance of Modified LUB 23 / 27

  25. Rank Distribution — BT-MZ 24 / 27

  26. Rank Distribution — Gromacs 25 / 27

  27. Rank Distribution — Gadget Total 26 / 27

  28. Conclusion. Summary: DLB/LeWI can improve performance transparently; inter-node load imbalances are not handled; granularity of parallelism and rank placement are important factors; the optimal configuration differs with vs. without DLB/LeWI. 27 / 27

  29. Conclusion. Summary: DLB/LeWI can improve performance transparently; inter-node load imbalances are not handled; granularity of parallelism and rank placement are important factors; the optimal configuration differs with vs. without DLB/LeWI. Discussion: interaction with MPI; benchmarks (1.5 of 3 NPB-MZ, arbitrary load distribution); how to find “the right” granularity. 27 / 27
