Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs
Guoyong Mao, David Böhme, Markus Geimer, Marc-André Hermanns, Daniel Lorenz and Felix Wolf
Petascale Tools Workshop, Madison, WI, USA, August 4, 2014
Late sender
[Figure: time-line diagram, processes over time. Process A calls Send; process B calls Recv and incurs waiting time.]
Daniel Lorenz et al., Petascale Workshop, August 4, 2014
Wait at NxN
[Figure: time-line diagram, processes over time. Processes A, B, and C each call Allgather; waiting time is marked on two of them.]
What we want to know
[Figure: the execution time of each call split into processing time and wait time.]
What we measure
[Figure: only the total execution time of each call.]
The minimum idea
[Figure: the measured execution times of each call, with the minimal execution time marked.]
The minimum idea
[Figure: the minimal execution time serves as the estimated processing time; the remainder of each call's execution time is the estimated wait time.]
Considered parameters
• We consider
  • MPI function
  • Message size
  • Receiver rank
• Other possible parameters
  • Sender rank
  • Data type
• Tradeoff between
  • Number of samples for a meaningful minimum and amount of data
  • Number of parameters considered
• Need to find the relevant parameters
Algorithm
For every combination of
• MPI function
• Message size class
• Process
record the
• Minimum execution time
For every combination of MPI call path and message size class, record the
• Number of visits
• Total execution time
At the end of the profiling run, subtract the minimum from the execution time of every visit to calculate the wait time.
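The bookkeeping described in the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `MinimumProfiler`, the `record` signature, and the power-of-two `size_class` bucketing are assumptions made for the example (the slides do not say how message size classes are defined). One profiler object stands for one process, so the process dimension is implicit.

```python
def size_class(num_bytes):
    """Hypothetical size class: a power-of-two bucket (assumption)."""
    return int(num_bytes).bit_length()


class MinimumProfiler:
    """Sketch of the minimum method for the calls of one process."""

    def __init__(self):
        # (MPI function, size class) -> minimum execution time seen so far
        self.minimum = {}
        # (call path, MPI function, size class) -> [visits, total time]
        self.stats = {}

    def record(self, call_path, function, num_bytes, duration):
        """Record one completed MPI call."""
        cls = size_class(num_bytes)
        min_key = (function, cls)
        self.minimum[min_key] = min(duration, self.minimum.get(min_key, duration))
        entry = self.stats.setdefault((call_path, function, cls), [0, 0.0])
        entry[0] += 1
        entry[1] += duration

    def wait_times(self):
        """Estimated wait time per (call path, size class) at the end of the run.

        Subtracting the minimum from every visit equals total - visits * minimum.
        """
        result = {}
        for (call_path, function, cls), (visits, total) in self.stats.items():
            result[(call_path, cls)] = total - visits * self.minimum[(function, cls)]
        return result


profiler = MinimumProfiler()
for duration in (2.0, 5.0, 3.0):  # made-up timings in seconds
    profiler.record("main/solve/MPI_Recv", "MPI_Recv", 1024, duration)
waits = profiler.wait_times()
```

Only the running minimum, the visit count, and the total time are kept per key, so no per-visit data has to be stored during measurement.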
Per-call overhead increase compared to profiling overhead without wait state analysis (%)
[Figure: bar chart; y-axis from 0 to 250 %.]
Accuracy: MPI_Recv
[Figure: bar chart of wait ratio (0 to 0.18) on JUROPA and JUQUEEN, comparing Scalasca with the minimum method.]
Accuracy: MPI_Wait
[Figure: bar chart of wait ratio (0 to 0.16) per call site on JUROPA and JUQUEEN, comparing Scalasca with the minimum method. Call sites shown: make_id_list, ks_congrad, path_product, u_shift_fermion, comm_embed, rev_comm_rho, rev_commnct, tf_ad_splitting, parallel, rsl_lite_exch_y, rsl_lite_exch_x, x_solve, y_solve, z_solve, resid, rprj3, psinv, interp, rhs.]
Accuracy: MPI_Waitall
[Figure: bar chart of wait ratio (0 to 0.14) per call site on JUROPA and JUQUEEN, comparing Scalasca with the minimum method. Call sites shown: x_solve, y_solve, z_solve, copy_faces, bndry_3d, bndry_2d, solvers,pcg, commnct.]
Non-blocking communication
[Figure: time-line diagram of processes A and B with Isend and Wait calls.]
Scalasca detects no wait state
[Figure: time-line diagram of processes A and B with Isend and Wait calls.]
The minimum approach does calculate wait states
[Figure: time-line diagram of processes A and B with Isend and Wait calls.]
But is this wrong for performance analysis?
[Figure: time-line diagram of processes A and B with Isend and Wait calls, annotated: Latency = Possible overlap time.]
Detailed example from SP
Wait time according to Scalasca
Wait time according to minimum method
Jitter may cause a slightly higher wait time
[Figure: the execution time of each call split into processing time and wait time, compared with the estimated processing time and the estimated wait time.]
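A small numeric illustration of this effect, using made-up timings: when the pure processing time of a call varies between visits, the minimum reflects the fastest visit, so the gap between each visit's processing time and that minimum is counted as wait time.

```python
# Made-up per-visit timings (seconds) for one MPI call on one process.
processing = [1.00, 1.10, 1.05, 1.20]  # pure processing time, varies with jitter
waiting = [0.00, 2.00, 0.00, 1.00]     # true wait time

# Only the sum of both is measured.
execution = [p + w for p, w in zip(processing, waiting)]

estimated_processing = min(execution)
estimated_wait = sum(t - estimated_processing for t in execution)
true_wait = sum(waiting)

# The overestimate equals the accumulated jitter above the fastest visit.
jitter_excess = estimated_wait - true_wait
```

In this example the estimate exceeds the true wait time by exactly the sum of the processing-time deviations from the fastest visit, which stays small as long as jitter is small relative to the wait time.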
Accuracy: MPI_Waitall
[Figure: bar chart of wait ratio (0 to 0.14) per call site on JUROPA and JUQUEEN, comparing Scalasca with the minimum method.]
Static imbalance
[Figure: every visit contains wait time, so the estimated processing time includes part of the wait and the estimated wait time is too small.]
Static imbalance
• Calculating global minima could resolve process-local static imbalances
  • Reduction operation after measurement
  • No dilation at measurement time
  • Lose the sender/receiver parameterization of minima
• For collective operations, global minima were better
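The reduction step can be sketched as follows. The example simulates the per-process minima in plain Python; in an MPI program the same element-wise minimum could be obtained after measurement with a reduction such as `MPI_Allreduce` using `MPI_MIN`. The function name `global_minima`, the key layout, and all timings are hypothetical.

```python
def global_minima(local_minima):
    """Element-wise minimum over per-process tables {(function, size class): min time}."""
    merged = {}
    for table in local_minima:
        for key, value in table.items():
            merged[key] = min(value, merged.get(key, value))
    return merged


# Made-up example: process 0 always arrives early at the Allgather, so even
# its fastest visit contains about 4 s of waiting (static imbalance).
key = ("MPI_Allgather", 10)
local_minima = [{key: 5.0}, {key: 1.0}]  # process 0, process 1
executions_p0 = [5.0, 5.5, 5.2]          # measured on process 0

# With the process-local minimum, most of the waiting is hidden.
local_wait = sum(t - local_minima[0][key] for t in executions_p0)

# With the global minimum, the static part of the wait becomes visible.
global_wait = sum(t - global_minima(local_minima)[key] for t in executions_p0)
```

Because the reduction runs once after measurement, it adds no overhead to the individual calls being timed.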
Accuracy for Wait at NxN
[Figure: bar chart of wait ratio (0 to 0.35) per call site on JUROPA and JUQUEEN, comparing Scalasca with the minimum method. Call sites shown: get_max_recvs, solvers,pcg, tf_controle, glbl_int_sum, EP, trnspse_x_yz.]
Conclusion (1)
• The minimum method works for estimating wait states in blocking and non-blocking communication
  • For blocking communication, results are similar to Scalasca
  • For non-blocking communication, wait times in Waitall do not match the Scalasca analysis
• Low runtime overhead
  • No trace recording or piggybacking
• May not produce 100% accurate numbers, but
  • Sufficient accuracy to locate performance problems
  • Points to places where we might want to investigate further with trace analysis
Conclusion (2)
• Detection of a good minimum is crucial
  • Static imbalance
  • Tradeoff between number of parameters and number of samples
  • Jitter may lead to a minor increase in the measured wait time
• For non-blocking communication
  • Counts possible overlap time
  • Might be larger than pure Late Sender time
  • Isn't this even more accurate for estimating the optimization potential?
Reference
Guoyong Mao, David Böhme, Marc-André Hermanns, Markus Geimer, Daniel Lorenz, Felix Wolf: Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs. In: EuroMPI '14: Proc. of the 21st European MPI Users' Group Meeting, Tokyo, Japan, Sep. 9-12, 2014