Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs


  1. Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs Guoyong Mao, David Böhme, Markus Geimer, Marc-André Hermanns, Daniel Lorenz and Felix Wolf Petascale Tools Workshop, Madison, WI, USA, August 4, 2014

  2. Late sender. [Figure: timeline of processes A (Send) and B (Recv), with B's waiting time marked; axes: processes vs. time.] Daniel Lorenz et al., Petascale Workshop, August 4, 2014

  3. Wait at NxN. [Figure: timeline of processes A, B and C, each calling Allgather, with waiting time marked; axes: processes vs. time.]

  4. What we want to know. [Figure: bars for individual calls, each split into processing time and wait time.]

  5. What we measure. [Figure: bars showing only the execution time of each call.]

  6. The minimum idea. [Figure: execution-time bars with the minimal execution time marked.]

  7. The minimum idea. [Figure: bars split into processing time and wait time, annotated with the estimated processing time and the estimated wait time.]
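The minimum idea on slides 6 and 7 can be written as a short formula. The notation is introduced here for illustration and does not appear on the slides: $t_{\mathrm{exec},i}$ is the measured execution time of the $i$-th of $n$ comparable calls, $\hat{t}_{\mathrm{proc}}$ is the estimated processing time, and $\hat{t}_{\mathrm{wait},i}$ is the estimated wait time of call $i$.

```latex
\hat{t}_{\mathrm{proc}} = \min_{1 \le j \le n} t_{\mathrm{exec},j},
\qquad
\hat{t}_{\mathrm{wait},i} = t_{\mathrm{exec},i} - \hat{t}_{\mathrm{proc}} \;\ge\; 0
```

As a made-up example, measured execution times of 5, 7, 5.5 and 9 µs give an estimated processing time of 5 µs and estimated wait times of 0, 2, 0.5 and 4 µs.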

  8. Considered parameters
     • We consider
       • MPI function
       • Message size
       • Receiver rank
     • Other possible parameters
       • Sender rank
       • Data type
     • Tradeoff between
       • Number of samples for a meaningful minimum and amount of data
       • Parameters considered
     • Need to find the relevant parameters.

  9. Algorithm
     • For every combination of MPI function, message size class, and process, record the minimum execution time.
     • For every combination of MPI call path and message size class, record the number of visits and the total execution time.
     • At the end of the profiling run, subtract the minimum from the execution time for every visit to calculate the wait time.
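The bookkeeping described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the profiler's actual implementation: the names `WaitProfile`, `record`, `wait_times` and `size_class` are hypothetical, the power-of-two size classes are an assumption (the slides do not say how classes are formed), and the call-path key also carries the function and rank so that the matching minimum can be looked up.

```python
from collections import defaultdict


def size_class(nbytes):
    """Hypothetical bucketing: sizes in [2^(k-1), 2^k) share class k; 0 bytes is class 0."""
    return int(nbytes).bit_length()


class WaitProfile:
    """Toy model of the minimum method; times are passed in instead of measured."""

    def __init__(self):
        self.min_time = {}               # (function, size class, rank) -> minimum time
        self.visits = defaultdict(int)   # (call path, function, size class, rank) -> visits
        self.total = defaultdict(float)  # same key -> total execution time

    def record(self, rank, call_path, function, nbytes, seconds):
        cls = size_class(nbytes)
        mkey = (function, cls, rank)
        # Keep the smallest execution time seen for this combination.
        self.min_time[mkey] = min(seconds, self.min_time.get(mkey, seconds))
        ckey = (call_path, function, cls, rank)
        self.visits[ckey] += 1
        self.total[ckey] += seconds

    def wait_times(self):
        """At the end of the run: wait = total - visits * minimum."""
        result = {}
        for ckey, total in self.total.items():
            _call_path, function, cls, rank = ckey
            minimum = self.min_time[(function, cls, rank)]
            result[ckey] = total - self.visits[ckey] * minimum
        return result
```

Subtracting the minimum once per visit is the same as subtracting `visits * minimum` from the total, which is why only the visit count and the total need to be stored per call path.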

  10. Per-call overhead increase compared to profiling overhead without wait-state analysis (%). [Figure: bar chart; y-axis from 0 to 250.]

  11. Accuracy: MPI_Recv. [Figure: bar chart of wait ratio (0 to 0.18) on JUROPA and JUQUEEN, comparing Scalasca with the minimum method.]

  12. Accuracy: MPI_Wait. [Figure: bar chart of wait ratio (0 to 0.16) on JUROPA and JUQUEEN, comparing Scalasca with the minimum method; x-axis labels include make_id_list, ks_congrad, path_product, u_shift_fermion, comm_embed, rev_comm_rho, rev_commnct, tf_ad_splitting, parallel, rsl_lite_exch_y, rsl_lite_exch_x, x_solve, y_solve, z_solve, resid, rprj3, psinv, interp and rhs.]

  13. Accuracy: MPI_Wait. [Figure: bar chart repeated from the previous slide.]

  14. Accuracy: MPI_Waitall. [Figure: bar chart of wait ratio (0 to 0.14) on JUROPA and JUQUEEN, comparing Scalasca with the minimum method; x-axis labels include x_solve, y_solve, z_solve, bndry_3, bndry_2, copy_faces and solvers,pcg.]

  15. Non-blocking communication. [Figure: timeline of processes A (two Isend calls) and B (Wait); axes: processes vs. time.]

  16. Scalasca detects no wait state. [Figure: same timeline of processes A (two Isend calls) and B (Wait).]

  17. Minimum approach does calculate wait states. [Figure: same timeline of processes A (two Isend calls) and B (Wait).]

  18. But is this wrong for performance analysis? [Figure: same timeline, annotated "Latency = Possible overlap time".]

  19. Detailed example from SP

  20. Wait time according to Scalasca

  21. Wait time according to minimum method

  22. Jitter may cause a slightly higher wait time. [Figure: bars split into processing time and wait time, annotated with the estimated processing time and the estimated wait time.]

  23. Accuracy: MPI_Waitall. [Figure: bar chart of wait ratio (0 to 0.14) on JUROPA and JUQUEEN, comparing Scalasca with the minimum method; x-axis labels include x_solve, y_solve, z_solve, bndry_3, bndry_2, copy_faces and solvers,pcg.]

  24. Static imbalance. [Figure: every bar contains both processing time and wait time, so the estimated processing time is too large and the estimated wait time too small.]

  25. Static imbalance
     • Calculating global minima could resolve process-local static imbalances
     • Reduction operation after measurement
     • No dilation at measurement time
     • Loose sender/receiver parameterization of minima
     • For collective operations, global minima were better
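The effect of switching from process-local to global minima can be sketched as follows. This is a hedged illustration, not the tool's code: `global_minima` and `wait_estimates` are hypothetical helpers, the timings are invented, and the reduction is simulated with a plain dictionary, whereas in an MPI program it would more plausibly be a minimum reduction across processes (for example `MPI_Allreduce` with `MPI_MIN`) performed after measurement.

```python
def global_minima(local_minima):
    """Reduce per-process minima to one minimum per (function, size class).

    local_minima maps (function, size_class, rank) -> minimum execution time.
    """
    reduced = {}
    for (function, cls, _rank), t in local_minima.items():
        key = (function, cls)
        reduced[key] = min(t, reduced.get(key, t))
    return reduced


def wait_estimates(exec_times, minimum):
    """Estimated wait time per visit: execution time minus the chosen minimum."""
    return [t - minimum for t in exec_times]


# Invented data: rank 0 waits on every visit (static imbalance), rank 1 never does.
local = {("MPI_Allgather", 3, 0): 5.0, ("MPI_Allgather", 3, 1): 2.0}
rank0_times = [5.0, 6.0]

with_local_min = wait_estimates(rank0_times, local[("MPI_Allgather", 3, 0)])
with_global_min = wait_estimates(rank0_times, global_minima(local)[("MPI_Allgather", 3)])
```

With the local minimum, rank 0's constant wait is absorbed into the estimated processing time; with the global minimum it shows up as wait time. Because the reduction runs after measurement, it adds no dilation while the application is being profiled.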

  26. Accuracy for Wait at NxN. [Figure: bar chart of wait ratio (0 to 0.35) on JUROPA and JUQUEEN, comparing Scalasca with the minimum method; x-axis labels include get_max_recvs, solvers,pcg, tf_controle, glbl_int_sum, EP and trnspse_x_yz.]

  27. Conclusion (1)
     • Minimum method works for estimating wait states in blocking and non-blocking communication
       • For blocking communication, results are similar to Scalasca
       • For non-blocking communication, wait times in Waitall do not match the Scalasca analysis
     • Low runtime overhead
       • No trace recording or piggybacking
     • May not produce 100% accurate numbers, but
       • Sufficient accuracy to locate performance problems
       • Points to places where we might want to investigate further with trace analysis

  28. Conclusion (2)
     • Detection of a good minimum is crucial
       • Static imbalance
       • Tradeoff between number of parameters and number of samples
       • Jitter may lead to a minor increase of the measured wait state
     • For non-blocking communication
       • Counts possible overlap time
       • Might be larger than pure Late Sender time
       • Isn’t this an even more accurate estimate of the optimization potential?

  29. Reference: Guoyong Mao, David Böhme, Marc-André Hermanns, Markus Geimer, Daniel Lorenz, Felix Wolf: Catching Idlers With Ease: A Lightweight Wait-State Profiler for MPI Programs. In: EuroMPI ’14: Proc. of the 21st European MPI Users’ Group Meeting, Tokyo, Japan, Sep. 9-12, 2014.
