Clock Delta Compression for Scalable Order-Replay of Non-Deterministic Parallel Applications
SC15
Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz
November 19th, 2015
LLNL-PRES-679294
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Debugging large-scale applications is becoming problematic

"On average, software developers spend 50% of their programming time finding and fixing bugs." [1]

With trends towards asynchronous communication patterns in MPI applications, MPI non-determinism will significantly increase debugging cost.

[1] Source: http://www.prweb.com/releases/2013/1/prweb10298185.htm, Cambridge, UK (PRWEB), January 08, 2013
What is MPI non-determinism (ND)?

• Message receive orders can be different across executions (→ internal ND)
  - Caused by unpredictable system noise (e.g., network, system daemons, OS jitter)
• Arithmetic orders can also change across executions (→ external ND)

[Figure: ranks P0, P1, P2 contribute values a, b, c; execution A reduces them as (a+b)+c, execution B as a+(b+c).]
MPI non-determinism significantly increases debugging cost

• Control flow of an application can change across runs
  - Non-deterministic control flow: the same input may lead to a successful run, a seg-fault, or a hang
  - Non-deterministic numerical results: floating-point arithmetic is not necessarily associative, so (a+b)+c ≠ a+(b+c)
• Developers need to repeat debug runs until the same bug is reproduced
  - Is the application running as intended? Is it an application bug? Silent data corruption?
• In ND applications it is hard to reproduce bugs and incorrect results, so an excessive amount of time goes into reproducing, finding, and fixing bugs

[Figure: a deterministic app maps one input to one result and one reproducible bug; a non-deterministic app maps the same input to a seg-fault, a hang, result A, or result B across runs.]
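To make the floating-point point concrete, here is a minimal standalone C example (not taken from the slides; the values are chosen only for illustration) showing that changing the reduction order changes the result:

    /* Floating-point addition is not associative: the reduction order
     * changes the result. */
    #include <stdio.h>

    int main(void) {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        float left  = (a + b) + c;   /* = 1.0f                               */
        float right = a + (b + c);   /* = 0.0f: b + c rounds back to -1.0e8f */
        printf("(a+b)+c = %f, a+(b+c) = %f\n", left, right);
        return 0;
    }

This is why, as in MCB, the same input can legitimately produce slightly different totals when the arrival order of the contributions differs between runs.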
Case study: Monte Carlo Simulation Benchmark (MCB)

• CORAL proxy application
• Exhibits MPI non-determinism: the final numerical results differ between a 1st and a 2nd run

    $ diff result_run1.out result_run2.out
    < IMC E_RR_total -3.3140234409e-05 -8.302693774e-08 2.9153322360e-08 -4.8198506756e-06 2.3113821822e-06
    > IMC E_RR_total -3.3140234410e-05 -8.302693776e-08 2.9153322360e-08 -4.8198506757e-06 2.3113821821e-06

  (The figure highlights the last digits of several totals, which differ between the two runs.)
Why does MPI non-determinism occur?

• In such non-deterministic applications, each process does not know which rank will send the next message
  - e.g., a particle simulation such as MCB, where messages can come from any of the north/south/east/west neighbors
• Messages can arrive in any order from the neighbors → inconsistent message arrivals

Typical MPI non-deterministic code:

    MPI_Irecv(..., MPI_ANY_SOURCE, ..., &request);
    while (1) {
        MPI_Test(&request, &flag, &status);
        if (flag) {
            /* <computation on the received message> */
            MPI_Irecv(..., MPI_ANY_SOURCE, ..., &request);
        }
    }

Source of MPI non-determinism: the MPI matching functions

                Wait family      Test family
    single      MPI_Wait         MPI_Test
    any         MPI_Waitany      MPI_Testany
    some        MPI_Waitsome     MPI_Testsome
    all         MPI_Waitall      MPI_Testall
State-of-the-art approach: record-and-replay

• Record-and-replay traces and records the message receive orders in one run, and replays those orders in subsequent runs for debugging
  - Record-and-replay can reproduce a target control flow
  - Developers can focus on debugging one particular control flow in replay

[Figure: four ranks exchanging messages; recording the receive order (rank 2, rank 0, rank 2, rank 3, rank 1, ...) at record time lets a replay run with the same input reproduce one particular outcome instead of non-deterministically hanging, seg-faulting, or producing output A or B.]
Record-and-replay won't work at scale

• Record-and-replay produces a large amount of recording data
  - Over 10 GB/node for 24 hours of MCB
• For scalable record-and-replay with low overhead, the record data must fit into local memory, but memory capacity is limited
  - Storing the records in a shared/parallel file system is not a scalable approach

Challenge: record-size reduction for scalable record-and-replay
Proposal: Clock Delta Compression (CDC)

• Put a logical clock (Lamport clock) into each MPI message
• The actual message receive order (i.e., the wall-clock order) is very similar to the logical-clock order in each MPI rank
  - MPI messages are received in almost monotonically increasing logical-clock order
• CDC records only the order differences between the wall-clock order and the logical-clock order, instead of recording the entire message order

[Figure: messages received by one rank in wall-clock order, annotated with their logical clocks (1, 2, 4, 7, 10, 13, ...); the clocks increase almost monotonically, with only a few out-of-order arrivals.]
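The slide only states the idea; the following is a minimal sketch of the intuition (an assumed simplification, not the paper's actual diff encoding): if receives are logged together with the piggybacked logical clocks, only the arrivals that break monotonic clock order need to be stored explicitly.

    /* Sketch of the CDC intuition: record only the receives whose logical
     * clock falls below the running maximum, i.e., the deviations from
     * logical-clock order. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int index; long clock; } exception_t;

    /* Returns the number of out-of-order entries written to 'out'. */
    size_t cdc_exceptions(const long *clocks_in_arrival_order, size_t n,
                          exception_t *out) {
        size_t n_out = 0;
        long running_max = -1;
        for (size_t i = 0; i < n; i++) {
            if (clocks_in_arrival_order[i] < running_max) {
                /* Arrived "late" w.r.t. logical-clock order: record its
                 * position and clock; everything else is implied. */
                out[n_out].index = (int)i;
                out[n_out].clock = clocks_in_arrival_order[i];
                n_out++;
            } else {
                running_max = clocks_in_arrival_order[i];
            }
        }
        return n_out;
    }

    int main(void) {
        /* Arrival order taken loosely from the slide's figure. */
        long clocks[] = { 1, 2, 4, 7, 10, 13, 8, 16, 8, 19, 15, 22, 19, 25 };
        size_t n = sizeof clocks / sizeof clocks[0];
        exception_t *ex = malloc(n * sizeof *ex);
        size_t k = cdc_exceptions(clocks, n, ex);
        printf("%zu of %zu receives deviate from logical-clock order\n", k, n);
        free(ex);
        return 0;
    }

Because most arrivals already follow logical-clock order, the list of exceptions is far shorter than the full receive log, which is where the compression comes from.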
Result in MCB

• CDC's record is about 40 times smaller than the uncompressed original record

[Figure: bar chart of record sizes for MCB, original record vs. CDC, roughly 40 : 1.]
Outline

• Background
• General record-and-replay
• CDC: clock delta compression
• Implementation
• Evaluation
• Conclusion
How do we record and replay MPI applications?

• The source of MPI non-determinism is the matching functions
  - Replaying the behavior of these matching functions → replaying the MPI application's behavior

Matching functions in MPI:

                Wait family      Test family
    single      MPI_Wait         MPI_Test
    any         MPI_Waitany      MPI_Testany
    some        MPI_Waitsome     MPI_Testsome
    all         MPI_Waitall      MPI_Testall

Question: what information needs to be recorded to replay these matching functions?
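The slides do not show how the matching functions are intercepted; a common mechanism, sketched below as an assumption rather than as the paper's implementation, is the standard PMPI profiling interface: the tool provides its own MPI_Wait and friends, records what was matched, and forwards the call to the real MPI library.

    /* Minimal sketch (assumed) of intercepting a matching function via PMPI.
     * Only the matched source rank of an MPI_Wait is logged here; a real
     * recorder would also wrap the Test/any/some/all variants and store the
     * other fields discussed on the following slides. */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Wait(MPI_Request *request, MPI_Status *status) {
        MPI_Status local;
        if (status == MPI_STATUS_IGNORE)
            status = &local;            /* we need the status to see the source */

        int err = PMPI_Wait(request, status);   /* call the real MPI_Wait */

        /* Record which rank sent the matched message (meaningful for receive
         * requests).  A real recorder would append this to an in-memory trace
         * instead of printing it. */
        if (err == MPI_SUCCESS)
            printf("matched message from rank %d (tag %d)\n",
                   status->MPI_SOURCE, status->MPI_TAG);
        return err;
    }

Such wrappers are typically compiled into a separate interposition library that is linked ahead of the MPI library, so the application code itself is unchanged.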
Necessary values to be recorded for correct replay: an example

[Figure: a timeline of rank x matching messages sent by ranks 0, 1, and 2; the following slides use it to walk through which values must be recorded.]
Necessary values for correct replay

• rank
  - Which rank sent the matched message?
• count & flag (for the MPI_Test family)
  - flag: did the call match or not?
  - count: how many times did the call return unmatched before matching?
• id
  - To handle application-level out-of-order matching
• with_next
  - To group the matches made by a single some/all function

[Figure: rank x's record, listing for each call to the matching functions shown earlier the matched source rank together with its count and flag values.]
Application-level out-of-order matching

• MPI guarantees that any two communications between the same pair of processes are ordered
  - Send order A → B implies matching order A → B inside MPI
• However, the timing of the matching function calls depends on the application
  - The order in which the application observes received messages is not necessarily the message send order
• Example: rank 0 posts MPI_Irecv(req[0]) and MPI_Irecv(req[1]); rank 1 sends msg A and then msg B. If rank 0's early MPI_Test(req[0]) returns before msg A has arrived, a later MPI_Test(req[1]) may succeed first, so the application handles msg B before msg A
• Recording only "rank" cannot distinguish A → B from B → A: both traces would simply record "rank 1, rank 1"
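A minimal runnable variant of this scenario (an assumed example using two tags, not code from MCB or the paper) is sketched below; because rank 0 polls the second request first, it can print "observed B" before "observed A" even though A was sent first:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int a = 0, b = 0, flag_a = 0, flag_b = 0;
            MPI_Request req[2];
            MPI_Irecv(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req[0]);  /* msg A */
            MPI_Irecv(&b, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req[1]);  /* msg B */
            while (!flag_a || !flag_b) {
                /* Poll B's request before A's: if B has arrived, the
                 * application handles it first even though it was sent second. */
                if (!flag_b) { MPI_Test(&req[1], &flag_b, MPI_STATUS_IGNORE);
                               if (flag_b) printf("observed B\n"); }
                if (!flag_a) { MPI_Test(&req[0], &flag_a, MPI_STATUS_IGNORE);
                               if (flag_a) printf("observed A\n"); }
            }
        } else if (rank == 1) {
            int a = 1, b = 2;
            MPI_Send(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  /* message A */
            MPI_Send(&b, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);  /* message B */
        }

        MPI_Finalize();
        return 0;
    }

A trace that stores only the sender rank would record "rank 1" twice in either observation order, which is exactly why the per-message id on the next slides is needed.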
Each rank needs to assign an "id" number to each message

[Figure: the earlier example with every message numbered by its sender (rank 0's messages to rank x carry ids 0, 1, 2, 3, 4; rank 1's carry ids 0, 1; rank 2's carries id 0), so the receiver's record can identify exactly which message each matching call consumed.]
Necessary values for correct replay (with message ids)

• Each record now also stores the id of the matched message alongside rank, count, and flag; the id disambiguates application-level out-of-order matches
• with_next still groups the records that belong to the same some/all matching call

[Figure: rank x's record from the earlier slide, extended with an id column (e.g., rank 0 / id 0, rank 0 / id 1, rank 1 / id 0, rank 0 / id 2, ...).]
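Putting the recorded fields together, a record entry could look like the following sketch (the field set comes from the slides; the concrete types, widths, and layout are assumptions, not the paper's actual format):

    /* Sketch of one per-match record entry. */
    #include <stdint.h>

    typedef struct {
        int32_t  rank;      /* source rank of the matched message             */
        uint32_t id;        /* per-sender message id, for application-level
                               out-of-order matching                          */
        uint32_t count;     /* how many times the Test-family call returned
                               unmatched before this match                    */
        uint8_t  flag;      /* matched (1) or unmatched (0)                   */
        uint8_t  with_next; /* 1 if this match belongs to the same
                               some/all matching call as the next record      */
    } match_record_t;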