Noise Injection Techniques to Expose Subtle and Unintended Message Races


Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau. PPoPP 2017.


  1. Noise Injection Techniques to Expose Subtle and Unintended Message Races. Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau. PPoPP 2017, February 6th, 2017. LLNL-PRES-720797. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

  2. Debugging large-scale applications is challenging. “On average, software developers spend 50% of their programming time finding and fixing bugs.” [1] In HPC, applications run in parallel, which makes debugging particularly challenging. [1] Source: http://www.prweb.com/releases/2013/1/prweb10298185.htm, Cambridge, UK (PRWEB), January 8, 2013

  3. “MPI non-determinism” makes debugging applications even more complicated § MPI supports wildcard receives — MPI processes can wait for messages from any MPI process § Message receive orders can change across executions — Due to non-deterministic system noise (e.g., network, OS jitter) → An MPI non-deterministic application that ran correctly in a first execution can crash in a second execution even with the same input. [Figure: first vs. second execution of P0, P1 and P2 on the same input.data; in the second run, noise reorders the arrival of messages 1 and 2 at P1 and the application crashes]
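
The following minimal C/MPI sketch (not taken from the presentation) illustrates the wildcard-receive behavior described above: rank 0 posts receives with MPI_ANY_SOURCE, so which sender matches each receive depends on arrival order and can change from run to run under system noise.

```c
/* Minimal illustration (not from the presentation) of how a wildcard
 * receive makes message matching depend on arrival order: rank 0 posts
 * receives with MPI_ANY_SOURCE, so which sender matches first can change
 * from run to run depending on system noise. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int i = 1; i < size; i++) {
            int val;
            MPI_Status st;
            /* Wildcard receive: matches whichever message arrives first. */
            MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            printf("received %d from rank %d\n", val, st.MPI_SOURCE);
        }
    } else {
        int val = rank;
        MPI_Send(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```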

  4. Real-world non-deterministic bugs in Diablo/Hypre 2.10.1* § MPI non-deterministic bugs cost computational scientists substantial amounts of time and effort § The scientists: the application hung only once every 50 runs, after a few hours; they spent 2 months over a period of 18 months and then gave up on debugging it § Our debugging team: we found that the cause was an “unintended message matching” due to a misused MPI tag (a message race bug); we spent 2 weeks over a period of 3 months to fix the bug * Hypre is an MPI-based library for solving large, sparse linear systems of equations on massively parallel computers

  5. Observing a non-deterministic bug is costly § Due to such non-determinism, we needed to submit a large number of debug jobs to observe the bug — The bug did not manifest in 98% of the jobs — We wasted 9,560 node-hours (400 nodes for 24 hours) § Rarely-occurring message race bugs waste both scientists' productivity and machine resources (thereby also affecting other users) [Figure: bar chart of the debugging cost in node-hours, split into wasted vs. useful hours] A tool to frequently and quickly expose message race bugs is invaluable

  6. NINJA § NINJA: Noise Injection Agent — Frequent manifestation: injects network noise in order to frequently and quickly expose message race bugs — High portability: NINJA is implemented in the MPI profiling layer (PMPI) § Experimental results — NINJA consistently manifests the Hypre 2.10.1 message race bug, which does not manifest itself without NINJA
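
Since the slide only states that NINJA lives in the PMPI layer, here is a hedged sketch of how such an interposition generally looks; delay_if_needed is a hypothetical placeholder for the noise decision, not a NINJA function.

```c
/* Minimal PMPI-wrapper sketch (not NINJA's actual code): intercept
 * MPI_Send, optionally delay it, then forward to the real PMPI_Send.
 * delay_if_needed() is a hypothetical hook where a tool like NINJA
 * would decide how much noise to inject. */
#include <mpi.h>

static void delay_if_needed(int count, MPI_Datatype dt, int dest) {
    /* Placeholder: a real tool would estimate buffer occupancy here
     * and sleep for the computed amount of time. */
    (void)count; (void)dt; (void)dest;
}

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    delay_if_needed(count, datatype, dest);              /* inject noise first */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);  /* real send */
}
```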

  7. Outline § Introduction § Message race bugs § NINJA: Noise Injection Agent § Evaluation § Conclusion

  8. Data-parallel model (or SPMD) § In HPC, many applications are written based on a data-parallel model (or SPMD) — It is easy to scale out the application by simply dividing a problem across processes § In SPMD, each process calls the same series of routines in the same order § So messages sent in a communication routine are all received within the same communication routine → a “self-contained” communication routine (or simply a communication routine) [Figure: P0, P1 and P2 alternating computation phases with self-contained communication phases]
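
As a concrete illustration of a self-contained communication routine (this example is illustrative, not from the presentation), the sketch below alternates a computation phase with a neighbor exchange in which every send is matched by a receive inside the same routine.

```c
/* Illustrative SPMD skeleton (not from the presentation): every rank runs
 * the same series of routines in the same order, alternating computation
 * with a self-contained communication routine (a ring neighbor exchange). */
#include <mpi.h>

static void communicate(double *val, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    double recv;
    /* All sends posted here are matched by receives in this same routine. */
    MPI_Sendrecv(val, 1, MPI_DOUBLE, right, 7,
                 &recv, 1, MPI_DOUBLE, left, 7, comm, MPI_STATUS_IGNORE);
    *val += recv;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    double v = 1.0;
    for (int t = 0; t < 10; t++) {
        v *= 0.5;                        /* computation phase   */
        communicate(&v, MPI_COMM_WORLD); /* communication phase */
    }
    MPI_Finalize();
    return 0;
}
```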

  9. Plots of Send and Receive time stamps § HPC apps call a series of self-contained communication routines step by step — Each colored box illustrates a self-contained routine [Figure: send/receive timestamp timelines per MPI rank for Hypre (self-contained routines with tags 222, 223 and 224) and Lulesh (self-contained routines with tags 1024, 2048 and 3072); each routine forms a distinct, self-contained box]

  10. Avoiding message races § To make communication routines “self-contained”, common approaches in MPI are: — Using different tags/communicators — Calling synchronization (e.g., MPI_Barrier) § If these conditions are violated, applications potentially harbor message race bugs [Figure: two safe patterns, separating Routine A and Routine B by different tags (A vs. B) or by a synchronization call, contrasted with an unsafe pattern where two routines reuse tag A without synchronization]
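
The sketch below (illustrative, not from the presentation) shows both approaches side by side: two gather-like routines kept self-contained either by distinct tags or by an MPI_Barrier between two calls that reuse the same tag.

```c
/* Illustrative sketch (not from the presentation) of the two common ways
 * to keep consecutive communication routines self-contained: give them
 * different tags, or separate same-tag calls with a synchronization. */
#include <mpi.h>

#define TAG_ROUTINE_A 100
#define TAG_ROUTINE_B 101

static void routine(int tag, int rank, int size, MPI_Comm comm) {
    int val = rank, recv;
    if (rank == 0) {
        for (int i = 1; i < size; i++)
            MPI_Recv(&recv, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm,
                     MPI_STATUS_IGNORE);
    } else {
        MPI_Send(&val, 1, MPI_INT, 0, tag, comm);
    }
}

void safe_sequence(int rank, int size, MPI_Comm comm) {
    /* Option 1: distinct tags keep the first routine's wildcard receives
     * from matching the second routine's sends. */
    routine(TAG_ROUTINE_A, rank, size, comm);
    routine(TAG_ROUTINE_B, rank, size, comm);

    /* Option 2: reuse the same tag, but insert a barrier so no rank can
     * start the second call before the first call has fully completed. */
    routine(TAG_ROUTINE_A, rank, size, comm);
    MPI_Barrier(comm);
    routine(TAG_ROUTINE_A, rank, size, comm);
}
```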

  11. Message race bugs are non-deterministic § Manifestation of message race bugs depends on system noise — Occurrences and amounts of system noise are non-deterministic § Message race bugs rarely manifest, e.g., when 1. the system noise level is low, and 2. the unsafe routines (Routine A and Routine B) are separated by interleaving routines (Routine X) [Figure: correct vs. wrong message matching across P0, P1 and P2; with noise, a message sent in Routine B is matched by a receive in Routine A and the application crashes]

  12. Case study: Diablo/Hypre 2.10.1 § The message race bug in Hypre manifests when a message sent in Routine 3 is received in Routine 1 — Routine 1 and Routine 3 use the same MPI tag (222) without synchronization § However, since Routine 1 and Routine 3 are separated by 2.5 msec, the message race bug rarely manifests → We need a tool to frequently expose subtle message race bugs [Figure: per-rank send/receive timeline; Routine 1 and Routine 3 both use tag 222 and are separated by 2.5 msec, with Routine 2 and Routine 4 (tag 223) and Routine 5 (tag 224) in between]
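
To make the described pattern concrete, here is an illustrative reduction (this is not Hypre's code) of two routines that reuse the same tag with wildcard receives and no synchronization in between; under the right timing, a message sent in the second call can be matched by a receive still pending from the first call.

```c
/* Illustrative reduction (not Hypre's actual code) of the racy pattern:
 * two gather-like routines reuse tag 222 with wildcard receives and no
 * synchronization in between. A fast sender can enter the second call,
 * and its message may be matched by a receive still pending in rank 0's
 * first call, i.e., a message sent in "Routine 3" is received in
 * "Routine 1". */
#include <mpi.h>

#define SHARED_TAG 222   /* same tag reused by both routines */

static void gather_like(int rank, int size, MPI_Comm comm) {
    int val = rank, recv;
    if (rank == 0) {
        for (int i = 1; i < size; i++)
            MPI_Recv(&recv, 1, MPI_INT, MPI_ANY_SOURCE, SHARED_TAG,
                     comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Send(&val, 1, MPI_INT, 0, SHARED_TAG, comm);
    }
}

void buggy_sequence(int rank, int size, MPI_Comm comm) {
    gather_like(rank, size, comm);  /* "Routine 1" (tag 222)             */
    /* ... interleaving routines with other tags ("Routine 2") ...       */
    gather_like(rank, size, comm);  /* "Routine 3": same tag, no barrier,
                                       so messages can cross routines    */
}
```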

  13. NINJA: Noise Injection Agent Tool § NINJA emulates noisy environments to expose subtle message race bugs § Two noise injection modes — System-centric mode: NINJA emulates a congested network to induce message races — Application-centric mode: NINJA analyzes the application's communication pattern and injects a sufficient amount of noise to make two unsafe routines overlap [Figure: correct vs. wrong message matching; NINJA's injected noise delays a message so that it is matched by the wrong receive, causing a crash]

  14. System-centric mode emulates a noisy network § System-centric mode emulates a noisy network based on conventional “flow control” in interconnects § Conventional flow control — When sending a message, the message is divided into packets and queued into a send buffer — The packets are transmitted from the send buffer to a receive buffer over the physical link — If the receive buffer does not have enough space, the flow-control engine suspends packet transmission until enough buffer space is freed up [Figure: send buffer and receive buffer connected by a physical link]
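
The toy program below is not real interconnect logic; it only simulates the flow-control idea stated above, letting a sender transmit packets while the receive buffer has free slots and stalling it otherwise until the receiver drains packets.

```c
/* Toy simulation (not real interconnect code) of receive-buffer flow
 * control: the sender may only transmit while the receive buffer has a
 * free slot; when it is full, transmission is suspended until the
 * receiver frees space. */
#include <stdio.h>

#define RECV_BUF_SLOTS 4

int main(void) {
    int recv_buf_used   = 0;   /* packets currently in the receive buffer */
    int packets_to_send = 10;  /* message split into 10 packets           */

    while (packets_to_send > 0) {
        if (recv_buf_used < RECV_BUF_SLOTS) {
            recv_buf_used++;               /* transmit one packet */
            packets_to_send--;
            printf("sent packet, %d left, recv buffer %d/%d\n",
                   packets_to_send, recv_buf_used, RECV_BUF_SLOTS);
        } else {
            /* Receive buffer full: flow control suspends transmission
             * until the receiver consumes (drains) some packets. */
            recv_buf_used -= 2;
            printf("stalled: receiver drained 2 packets\n");
        }
    }
    return 0;
}
```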

  15. NINJA implements flow control at the process level § NINJA's flow control — Each process manages a virtual buffer queue (VBQ) — If the VBQ does not have enough space, NINJA delays sending the MPI message until enough buffer space is freed up [Figure: each MPI process feeds packets through its own VBQ in front of the send buffer and the physical link]

  16. How does NINJA trigger noise injection? § NINJA system-centric mode — Monitors the number of incoming packets — Computes the number of outgoing packets using a model based on network bandwidth (B) and latency (C): the transmission time of the queued packets is modeled as Σ_{i=1}^{N_message} ( P_s[i] / B + C ), where P_s[i] is the size of packet i — Estimates the VBQ length as (# of incoming packets) − (# of outgoing packets) — If the VBQ length exceeds the VBQ size, NINJA injects noise into the message § NINJA only logically estimates the VBQ length, so it does not physically buffer messages by copying them
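
A hedged sketch of how such a trigger could look inside a PMPI wrapper follows; none of this is NINJA's actual source. The packet size, VBQ size, bandwidth, latency and the vbq_busy_until bookkeeping are illustrative assumptions that merely follow the model stated on this slide.

```c
/* Hedged sketch (not NINJA's source) of the system-centric trigger:
 * track packets "enqueued" by intercepted sends, estimate how many have
 * drained using the bandwidth/latency model, and inject a delay when the
 * estimated VBQ length would exceed the configured VBQ size. */
#include <mpi.h>
#include <unistd.h>

#define PACKET_SIZE 2048      /* bytes, assumed packetization    */
#define VBQ_SIZE    64        /* packets the virtual queue holds */
#define BANDWIDTH   3.14e9    /* bytes/sec (B in the model)      */
#define LATENCY     0.25e-6   /* seconds   (C in the model)      */

static double vbq_busy_until = 0.0;  /* time when the VBQ drains empty */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    int type_size;
    MPI_Type_size(datatype, &type_size);
    long bytes   = (long)count * type_size;
    long packets = (bytes + PACKET_SIZE - 1) / PACKET_SIZE;

    double now = MPI_Wtime();
    if (vbq_busy_until < now) vbq_busy_until = now;  /* queue already empty */

    /* Model: each packet takes P_s/B + C to leave the virtual queue. */
    double per_packet      = (double)PACKET_SIZE / BANDWIDTH + LATENCY;
    double backlog_packets = (vbq_busy_until - now) / per_packet;

    if (backlog_packets + packets > VBQ_SIZE) {
        /* Estimated VBQ overflow: delay this send until space frees up. */
        double overflow = backlog_packets + packets - VBQ_SIZE;
        usleep((useconds_t)(overflow * per_packet * 1e6));
    }
    vbq_busy_until += packets * per_packet;  /* account for this message */

    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```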

  17. How much noise is injected? § NINJA delays a message send until enough VBQ space is freed up § Example — VBQ size: 5 packets — Packets already in the VBQ: 3 — Incoming message: 4 packets → NINJA delays this message by the time needed to transmit the 2 overflowing packets [Figure: a 4-packet message arriving at a 5-slot VBQ that already holds 3 packets; with packet size = 2 [KB], B = 3.14 [GB/sec] and C = 0.25 [µsec], the resulting noise amount is shown as 1.27 [msec]]
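
Here is the slide's example worked through under the per-packet model from the previous slide, i.e. delay = overflowing packets × (P_s/B + C); the exact formula and the final number on the original slide may differ, so treat this only as an illustration of how the quantities combine.

```c
/* Worked version of the slide's example under the assumed per-packet
 * model: delay = overflow_packets * (P_s / B + C). The constants mirror
 * the slide's figure; the slide's exact formula and result may differ. */
#include <stdio.h>

int main(void) {
    const double packet_size = 2e3;      /* P_s: 2 KB per packet  */
    const double bandwidth   = 3.14e9;   /* B: 3.14 GB/sec        */
    const double latency     = 0.25e-6;  /* C: 0.25 microseconds  */

    int vbq_size        = 5;             /* VBQ capacity in packets */
    int vbq_occupied    = 3;             /* packets already queued  */
    int message_packets = 4;             /* incoming message size   */

    int overflow = vbq_occupied + message_packets - vbq_size;   /* = 2 */
    double per_packet = packet_size / bandwidth + latency;
    double delay = overflow * per_packet;

    printf("delay the send by %.2f microseconds (%d overflowing packets)\n",
           delay * 1e6, overflow);
    return 0;
}
```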

  18. System-centric mode induces message races § Earlier messages in a routine are not delayed (since buffer space is still available), while later messages in the same routine are delayed § NINJA thereby stretches an unsafe routine so that it overlaps with the next communication routine, inducing message races [Figure: without NINJA the two routines are separated; with NINJA the delayed sends of the first routine overlap the second routine and race]
