Clock Delta Compression for Scalable Order-Replay of Non-Deterministic Parallel Applications
SC15
Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz
November 19th, 2015
LLNL-PRES-679294
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Debugging large-scale applications is becoming problematic

"On average, software developers spend 50% of their programming time finding and fixing bugs." [1]

With trends towards asynchronous communication patterns in MPI applications, MPI non-determinism will significantly increase debugging cost.

[1] Source: http://www.prweb.com/releases/2013/1/prweb10298185.htm, Cambridge, UK (PRWEB), January 08, 2013
What is MPI non-determinism (ND)?

• Message receive orders can be different across executions (→ internal ND)
  - Caused by unpredictable system noise (e.g., network, system daemons, OS jitter)
• Arithmetic orders can also change across executions (→ external ND)

[Figure: ranks P0, P1, P2 contribute values a, b, c; execution A reduces them as (a+b)+c, execution B as a+(b+c).]
MPI non-determinism significantly increases debugging cost

• Control flow of an application can change across runs
  - Non-deterministic control flow: the same input may lead to a successful run, a seg-fault, or a hang
  - Non-deterministic numerical results: floating-point arithmetic is not necessarily associative, so (a+b)+c ≠ a+(b+c)
• Developers need to repeat debug runs until the same bug is reproduced
  - Is the application running as intended? Is it an application bug? Silent data corruption?
• In ND applications it is hard to reproduce bugs and incorrect results, so an excessive amount of time goes into reproducing, finding, and fixing bugs

[Figure: a deterministic app maps one input to one result and one reproducible bug; a non-deterministic app maps the same input to a seg-fault, a hang, result A, or result B across runs.]
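To make the floating-point point concrete, here is a minimal standalone C example (not taken from the slides; the values are chosen only for illustration) showing that changing the reduction order changes the result:

    /* Floating-point addition is not associative: the reduction order
     * changes the result. */
    #include <stdio.h>

    int main(void) {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        float left  = (a + b) + c;   /* = 1.0f                               */
        float right = a + (b + c);   /* = 0.0f: b + c rounds back to -1.0e8f */
        printf("(a+b)+c = %f, a+(b+c) = %f\n", left, right);
        return 0;
    }

This is why, as in MCB, the same input can legitimately produce slightly different totals when the arrival order of the contributions differs between runs.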
Case study: Monte Carlo Simulation Benchmark (MCB)

• CORAL proxy application
• Exhibits MPI non-determinism: the final numerical results differ between a 1st and a 2nd run

    $ diff result_run1.out result_run2.out
    < IMC E_RR_total -3.3140234409e-05 -8.302693774e-08 2.9153322360e-08 -4.8198506756e-06 2.3113821822e-06
    > IMC E_RR_total -3.3140234410e-05 -8.302693776e-08 2.9153322360e-08 -4.8198506757e-06 2.3113821821e-06

  (The figure highlights the last digits of several totals, which differ between the two runs.)
Why does MPI non-determinism occur?

• In such non-deterministic applications, each process does not know which rank will send the next message
  - e.g., a particle simulation such as MCB, where messages can come from any of the north/south/east/west neighbors
• Messages can arrive in any order from the neighbors → inconsistent message arrivals

Typical MPI non-deterministic code:

    MPI_Irecv(..., MPI_ANY_SOURCE, ..., &request);
    while (1) {
        MPI_Test(&request, &flag, &status);
        if (flag) {
            /* <computation on the received message> */
            MPI_Irecv(..., MPI_ANY_SOURCE, ..., &request);
        }
    }

Source of MPI non-determinism: the MPI matching functions

                Wait family      Test family
    single      MPI_Wait         MPI_Test
    any         MPI_Waitany      MPI_Testany
    some        MPI_Waitsome     MPI_Testsome
    all         MPI_Waitall      MPI_Testall
State-of-the-art approach: record-and-replay

• Record-and-replay traces and records the message receive orders in one run, and replays those orders in subsequent runs for debugging
  - Record-and-replay can reproduce a target control flow
  - Developers can focus on debugging one particular control flow in replay

[Figure: four ranks exchanging messages; recording the receive order (rank 2, rank 0, rank 2, rank 3, rank 1, ...) at record time lets a replay run with the same input reproduce one particular outcome instead of non-deterministically hanging, seg-faulting, or producing output A or B.]
Record-and-replay won't work at scale

• Record-and-replay produces a large amount of recording data
  - Over 10 GB/node for 24 hours of MCB
• For scalable record-and-replay with low overhead, the record data must fit into local memory, but memory capacity is limited
  - Storing the records in a shared/parallel file system is not a scalable approach

Challenge: record-size reduction for scalable record-and-replay
Proposal: Clock Delta Compression (CDC)

• Put a logical clock (Lamport clock) into each MPI message
• The actual message receive order (i.e., the wall-clock order) is very similar to the logical-clock order in each MPI rank
  - MPI messages are received in almost monotonically increasing logical-clock order
• CDC records only the order differences between the wall-clock order and the logical-clock order, instead of recording the entire message order

[Figure: messages received by one rank in wall-clock order, annotated with their logical clocks (1, 2, 4, 7, 10, 13, ...); the clocks increase almost monotonically, with only a few out-of-order arrivals.]
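The slide only states the idea; the following is a minimal sketch of the intuition (an assumed simplification, not the paper's actual diff encoding): if receives are logged together with the piggybacked logical clocks, only the arrivals that break monotonic clock order need to be stored explicitly.

    /* Sketch of the CDC intuition: record only the receives whose logical
     * clock falls below the running maximum, i.e., the deviations from
     * logical-clock order. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int index; long clock; } exception_t;

    /* Returns the number of out-of-order entries written to 'out'. */
    size_t cdc_exceptions(const long *clocks_in_arrival_order, size_t n,
                          exception_t *out) {
        size_t n_out = 0;
        long running_max = -1;
        for (size_t i = 0; i < n; i++) {
            if (clocks_in_arrival_order[i] < running_max) {
                /* Arrived "late" w.r.t. logical-clock order: record its
                 * position and clock; everything else is implied. */
                out[n_out].index = (int)i;
                out[n_out].clock = clocks_in_arrival_order[i];
                n_out++;
            } else {
                running_max = clocks_in_arrival_order[i];
            }
        }
        return n_out;
    }

    int main(void) {
        /* Arrival order taken loosely from the slide's figure. */
        long clocks[] = { 1, 2, 4, 7, 10, 13, 8, 16, 8, 19, 15, 22, 19, 25 };
        size_t n = sizeof clocks / sizeof clocks[0];
        exception_t *ex = malloc(n * sizeof *ex);
        size_t k = cdc_exceptions(clocks, n, ex);
        printf("%zu of %zu receives deviate from logical-clock order\n", k, n);
        free(ex);
        return 0;
    }

Because most arrivals already follow logical-clock order, the list of exceptions is far shorter than the full receive log, which is where the compression comes from.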
Result in MCB

• CDC's record is about 40 times smaller than the uncompressed original record

[Figure: bar chart of record sizes for MCB, original record vs. CDC, roughly 40 : 1.]
Outline

• Background
• General record-and-replay
• CDC: clock delta compression
• Implementation
• Evaluation
• Conclusion
How do we record and replay MPI applications?

• The source of MPI non-determinism is the matching functions
  - Replaying the behavior of these matching functions → replaying the MPI application's behavior

Matching functions in MPI:

                Wait family      Test family
    single      MPI_Wait         MPI_Test
    any         MPI_Waitany      MPI_Testany
    some        MPI_Waitsome     MPI_Testsome
    all         MPI_Waitall      MPI_Testall

Question: what information needs to be recorded to replay these matching functions?
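The slides do not show how the matching functions are intercepted; a common mechanism, sketched below as an assumption rather than as the paper's implementation, is the standard PMPI profiling interface: the tool provides its own MPI_Wait and friends, records what was matched, and forwards the call to the real MPI library.

    /* Minimal sketch (assumed) of intercepting a matching function via PMPI.
     * Only the matched source rank of an MPI_Wait is logged here; a real
     * recorder would also wrap the Test/any/some/all variants and store the
     * other fields discussed on the following slides. */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Wait(MPI_Request *request, MPI_Status *status) {
        MPI_Status local;
        if (status == MPI_STATUS_IGNORE)
            status = &local;            /* we need the status to see the source */

        int err = PMPI_Wait(request, status);   /* call the real MPI_Wait */

        /* Record which rank sent the matched message (meaningful for receive
         * requests).  A real recorder would append this to an in-memory trace
         * instead of printing it. */
        if (err == MPI_SUCCESS)
            printf("matched message from rank %d (tag %d)\n",
                   status->MPI_SOURCE, status->MPI_TAG);
        return err;
    }

Such wrappers are typically compiled into a separate interposition library that is linked ahead of the MPI library, so the application code itself is unchanged.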
Necessary values to be recorded for correct replay: an example

[Figure: a timeline of rank x matching messages sent by ranks 0, 1, and 2; the following slides use it to walk through which values must be recorded.]
Necessary values for correct replay

• rank
  - Which rank sent the matched message?
• count & flag (for the MPI_Test family)
  - flag: did the call match or not?
  - count: how many times did the call return unmatched before matching?
• id
  - To handle application-level out-of-order matching
• with_next
  - To group the matches made by a single some/all function

[Figure: rank x's record, listing for each call to the matching functions shown earlier the matched source rank together with its count and flag values.]
Application-level out-of-order matching

• MPI guarantees that any two communications between the same pair of processes are ordered
  - Send order A → B implies matching order A → B inside MPI
• However, the timing of the matching function calls depends on the application
  - The order in which the application observes received messages is not necessarily the message send order
• Example: rank 0 posts MPI_Irecv(req[0]) and MPI_Irecv(req[1]); rank 1 sends msg A and then msg B. If rank 0's early MPI_Test(req[0]) returns before msg A has arrived, a later MPI_Test(req[1]) may succeed first, so the application handles msg B before msg A
• Recording only "rank" cannot distinguish A → B from B → A: both traces would simply record "rank 1, rank 1"
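A minimal runnable variant of this scenario (an assumed example using two tags, not code from MCB or the paper) is sketched below; because rank 0 polls the second request first, it can print "observed B" before "observed A" even though A was sent first:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int a = 0, b = 0, flag_a = 0, flag_b = 0;
            MPI_Request req[2];
            MPI_Irecv(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req[0]);  /* msg A */
            MPI_Irecv(&b, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req[1]);  /* msg B */
            while (!flag_a || !flag_b) {
                /* Poll B's request before A's: if B has arrived, the
                 * application handles it first even though it was sent second. */
                if (!flag_b) { MPI_Test(&req[1], &flag_b, MPI_STATUS_IGNORE);
                               if (flag_b) printf("observed B\n"); }
                if (!flag_a) { MPI_Test(&req[0], &flag_a, MPI_STATUS_IGNORE);
                               if (flag_a) printf("observed A\n"); }
            }
        } else if (rank == 1) {
            int a = 1, b = 2;
            MPI_Send(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  /* message A */
            MPI_Send(&b, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);  /* message B */
        }

        MPI_Finalize();
        return 0;
    }

A trace that stores only the sender rank would record "rank 1" twice in either observation order, which is exactly why the per-message id on the next slides is needed.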
Each rank needs to assign an "id" number to each message

[Figure: the earlier example with every message numbered by its sender (rank 0's messages to rank x carry ids 0, 1, 2, 3, 4; rank 1's carry ids 0, 1; rank 2's carries id 0), so the receiver's record can identify exactly which message each matching call consumed.]
Necessary values for correct replay (with message ids)

• Each record now also stores the id of the matched message alongside rank, count, and flag; the id disambiguates application-level out-of-order matches
• with_next still groups the records that belong to the same some/all matching call

[Figure: rank x's record from the earlier slide, extended with an id column (e.g., rank 0 / id 0, rank 0 / id 1, rank 1 / id 0, rank 0 / id 2, ...).]
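Putting the recorded fields together, a record entry could look like the following sketch (the field set comes from the slides; the concrete types, widths, and layout are assumptions, not the paper's actual format):

    /* Sketch of one per-match record entry. */
    #include <stdint.h>

    typedef struct {
        int32_t  rank;      /* source rank of the matched message             */
        uint32_t id;        /* per-sender message id, for application-level
                               out-of-order matching                          */
        uint32_t count;     /* how many times the Test-family call returned
                               unmatched before this match                    */
        uint8_t  flag;      /* matched (1) or unmatched (0)                   */
        uint8_t  with_next; /* 1 if this match belongs to the same
                               some/all matching call as the next record      */
    } match_record_t;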