Using SimGrid to Evaluate the Impact of AMPI Load Balancing in a Geophysics HPC Application

Rafael Keller Tesser⋆, Philippe O. A. Navaux⋆, Lucas Mello Schnorr⋆, Arnaud Legrand†
⋆: UFRGS GPPD/Inf, Porto Alegre, Brazil
†: CNRS/Inria POLARIS, Grenoble, France

Urbana, April 2016, 14th Charm++ Workshop

1 / 21
Outline

1 Context: Improving the Performance of Iterative Unbalanced Applications
2 SimGrid and SMPI in a Nutshell
3 A Simulation-Based Methodology
4 Experimental Results
  • Validation
  • Investigating AMPI parameters
5 Conclusion

2 / 21
Context

Parallel HPC applications are often written with MPI, which is based on a regular SPMD programming model.
• Many of these applications are iterative, and this paradigm is well suited to balanced applications;
• Unbalanced applications:
  • may resort to static load-balancing techniques (at the application level),
  • or not... (when the load imbalance comes from the nature of the input data and evolves over time and space, e.g., Ondes3D).

Handling this at the application level is a nightmare. A possible approach is to use over-decomposition and dynamic process-level load balancing, as proposed by AMPI/Charm++.

3 / 21
Ondes3D, a Seismic Wave Propagation Simulator

• Developed by BRGM [Aochi et al. 2013];
• Used to predict the consequences of future earthquakes.

Many sources of load imbalance:
• Absorbing boundary conditions (tasks at the borders perform more computation);
• Variation in the constitutive laws of the different geological layers (different equations);
• Propagation of the shockwave in space and time.

Mesh partitioning techniques and quasi-static load-balancing algorithms are thus ineffective.

4 / 21
AMPI can be quite effective

[Bar chart: average execution times in seconds, over 500 time-steps, for 288, 2304, and 4608 chunks. MPI (No LB) and AMPI (No LB) are compared against RefineLB, NucoLB, HwTopoLB, HierarchicalLB 1, and HierarchicalLB 2; the annotated reductions relative to the no-LB baseline range from 32.12% to 36.58%.]

Based on the Mw 6.6, 2007 Niigata Chuetsu-Oki, Japan, earthquake [Aochi et al., ICCS 2013]
• Full problem (6000 time-steps): ≈ 162 minutes on 32 nodes (Intel Harpertown processors)

5 / 21
Challenges

Finding the best load-balancing parameters:
• Which load balancer is the most suited?
• How many iterations should be grouped together? (migration frequency)
• How many VPs? (decomposition level)
• Load-balancing benefit vs. application communication overhead and LB overhead
• ...

And preparing for AMPI is not free:
• You need to write data serialization code (see the PUP sketch below);
• Engaging in such an approach without knowing how much there is to gain can be a deterrent.

Goal: Propose a sound methodology for investigating the performance improvement of irregular applications through over-decomposition.

6 / 21
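The serialization code mentioned above takes the form, in AMPI, of a PUP (pack/unpack) routine that the runtime invokes when a VP migrates. Below is a minimal sketch in C, assuming AMPI's C PUP interface of that era (pup_c.h, MPI_Register); the exact registration call varies across AMPI versions, and chunk_t, chunk_pup, and my_chunk are hypothetical stand-ins for the application's real per-VP state.

    #include <mpi.h>
    #include <pup_c.h>   /* AMPI/Charm++ C PUP interface */
    #include <stdlib.h>

    /* Hypothetical per-VP state; Ondes3D's real state holds many such arrays. */
    typedef struct {
        int     nx, ny, nz;   /* local grid dimensions  */
        double *vel;          /* one field of this chunk */
    } chunk_t;

    /* Called by the runtime to size, pack, unpack, or delete a VP's heap data. */
    static void chunk_pup(pup_er p, void *data)
    {
        chunk_t *c = (chunk_t *)data;
        pup_int(p, &c->nx);
        pup_int(p, &c->ny);
        pup_int(p, &c->nz);
        int n = c->nx * c->ny * c->nz;
        if (pup_isUnpacking(p))            /* arriving on the target core */
            c->vel = malloc(n * sizeof(double));
        pup_doubles(p, c->vel, n);
        if (pup_isDeleting(p))             /* leaving the source core */
            free(c->vel);
    }

    /* Registered once per VP, e.g. right after MPI_Init:
     *     MPI_Register(my_chunk, chunk_pup);                          */

Once this is in place, the decomposition level and load balancer become launch-time knobs; a typical AMPI invocation (binary name hypothetical, +vp and +balancer as documented by Charm++) would be: ./charmrun +p16 ./ondes3d +vp 64 +balancer GreedyLB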
Outline

1 Context: Improving the Performance of Iterative Unbalanced Applications
2 SimGrid and SMPI in a Nutshell
3 A Simulation-Based Methodology
4 Experimental Results
  • Validation
  • Investigating AMPI parameters
5 Conclusion

7 / 21
SimGrid

Off-line: trace replay of your dreams
• Trace the application once on a simple cluster (e.g., with tau, PAPI), producing a time-independent trace:
      0 compute 1e6     1 recv 0 1e6      2 recv 1 1e6      3 recv 2 1e6
      0 send 1 1e6      1 compute 1e6     2 compute 1e6     3 compute 1e6
      0 recv 3 1e6      1 send 2 1e6      2 send 3 1e6      3 send 0 1e6
• Model the machine with a platform description:
      <?xml version="1.0"?>
      <!DOCTYPE platform SYSTEM "simgrid.dtd">
      <platform version="3">
        <cluster id="griffon" prefix="griffon-" suffix=".grid5000.fr" radical="1-144"
                 power="286.087kf" bw="125MBps" lat="24us"
                 bb_bw="1.25GBps" bb_lat="0" sharing_policy="FULLDUPLEX"/>
      </platform>
  [Diagram: hierarchical topology of the griffon cluster, with 1G/10G links, bandwidth limiters, and node groups 1-39, 40-74, 75-104, 105-144.]
• Replay the trace as many times as you want; SMPI produces a timed trace, e.g.
      [0.001000] 0 compute 1e6 0.01000
      [0.010028] 0 send 1 1e6 0.009028
      [0.040113] 0 recv 3 1e6 0.030085
      [0.010028] 1 recv 0 1e6 0.010028
  a simulated execution time (here, 43.232 seconds), and visualizations (Paje, TRIVA).

On-line: simulate/emulate unmodified complex applications
• Possible memory folding and shadow execution
• Handles non-deterministic applications

• SimGrid: a 15-year-old collaboration between France, US, UK, Austria, ...
• Flow-level models that account for topology and contention
• SMPI: supports both trace replay and direct emulation
• Embeds 100+ collective communication algorithms

8 / 21
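For the on-line path, an unmodified MPI application is simply recompiled and launched with SMPI's wrappers; a rough usage sketch (source and platform file names hypothetical, commands and flags as documented for SMPI):

    $ smpicc -O3 ondes3d.c -o ondes3d                                 # build against SMPI
    $ smpirun -np 64 -hostfile hosts -platform griffon.xml ./ondes3d  # emulate on one host

The same smpirun front-end also drives the off-line path: tracing options (e.g., -trace-ti for time-independent traces) capture the application profile that is later replayed against different platform descriptions.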
Outline

1 Context: Improving the Performance of Iterative Unbalanced Applications
2 SimGrid and SMPI in a Nutshell
3 A Simulation-Based Methodology
4 Experimental Results
  • Validation
  • Investigating AMPI parameters
5 Conclusion

9 / 21
Principle

Approach:
1 Implement various load-balancing algorithms in SMPI;
2 Capture a time-independent trace (a faithful application profile):
  • Two alternatives:
    – standard tracing: parallel/fast, but requires more resources;
    – emulation (smpicc/smpirun): requires a single host, but slow;
  • Add a fake call to MPI_Migrate where needed (sketched below);
  • Track how much memory is used by each VP and use it as an upper bound on the migration cost;
  • May take some time, but requires minimal modification of, and knowledge about, the application;
3 Replay the trace as often as wished, playing with the different parameters (LB, frequency, topology, ...).

10 / 21
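A minimal sketch of the instrumentation in step 2, assuming a generic iterative stencil code: N_ITER, LB_PERIOD, compute_step(), and exchange_borders() are hypothetical placeholders for the application's own code, and MPI_Migrate is the AMPI extension named on the slide (declared by AMPI's mpi.h; newer AMPI versions rename and extend it), which SMPI intercepts during replay to trigger the simulated load balancer.

    #include <mpi.h>

    #define N_ITER    500
    #define LB_PERIOD 50          /* assumed migration frequency; one of the tunables */

    void compute_step(void)     { /* placeholder: one simulation time-step        */ }
    void exchange_borders(void) { /* placeholder: halo exchange between neighbors */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        for (int it = 0; it < N_ITER; it++) {
            compute_step();
            exchange_borders();
            if ((it + 1) % LB_PERIOD == 0)
                MPI_Migrate();    /* load-balancing point; does nothing without AMPI/SMPI support */
        }
        MPI_Finalize();
        return 0;
    }

Varying LB_PERIOD (migration frequency), the number of VPs, and the balancer selected at launch is exactly the parameter exploration that trace replay makes cheap.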
Principle

Key questions:
• How do we know whether our simulations are faithful?
• How do we understand where a mismatch comes from?
  • VP scheduling, LB implementation, trace capture, network, ...

10 / 21
Evaluation Challenge

No LB vs. GreedyLB: simple Gantt charts are not very informative.
[Gantt charts of the No LB and GreedyLB executions.]

11 / 21
Outline

1 Context: Improving the Performance of Iterative Unbalanced Applications
2 SimGrid and SMPI in a Nutshell
3 A Simulation-Based Methodology
4 Experimental Results
  • Validation
  • Investigating AMPI parameters
5 Conclusion

12 / 21
Description of the Experiments

Scenarios: two different earthquake simulations:
• Niigata-ken Chuetsu-Oki:
  • 2007, Mw 6.6, Japan
  • 500 time-steps; dimensions: 300x300x150
• Ligurian:
  • 1887, Mw 6.3, north-western Italy
  • 300 time-steps; dimensions: 500x350x130

Load balancers: no load balancing vs. GreedyLB vs. RefineLB

Hardware resources: Parapluie cluster from Grid'5000
• 2 x AMD Opteron™ 6164 HE (24 cores per node), 1.7 GHz, InfiniBand
• Plus my own laptop (Intel Core™ i7-4610M, 2 cores, 3 GHz)

13 / 21
Chuetsu-Oki Simulation: 64 VPs and 16 Processes. Detailed View

[Heatmaps of per-resource load (color scale from 0.4 to 1.0) over 500 iterations, on 16 resources, for the real AMPI execution (top) and the SMPI simulation (bottom), under None, GreedyLB, and RefineLB.]

14 / 21