Enhanced Memory Debugging of MPI-parallel Applications in Open MPI

4th Parallel Tools Workshop 2010 — Shiqing Fan, HLRS, High Performance Computing Center, University of Stuttgart, Germany


  1. Title: Enhanced Memory Debugging of MPI-parallel Applications in Open MPI. 4th Parallel Tools Workshop 2010. Shiqing Fan, HLRS, High Performance Computing Center, University of Stuttgart, Germany.

  2. Introduction: Open MPI 1/3
     • A new MPI implementation, written from scratch, without the cruft of the previous implementations it builds on (PACX-MPI, LAM/MPI, LA-MPI, FT-MPI)
     • Design started in early 2004
     • Project goals:
       – Full, fast and extensible MPI-2 implementation
       – Thread safety
       – Prevent the "forking problem"
       – Combine the best ideas and technologies of the predecessor projects
     • Open-source license based on the BSD license

  3. Introduction: Open MPI 2/3
     • Current status:
       – Stable version v1.2.6 (April 2008)
       – Release v1.3 is coming very soon
     • 14 members, 6 contributors:
       – 4 US DOE labs
       – 8 universities
       – 7 vendors
       – 1 individual

  4. Introduction: Open MPI 3/3
     • Open MPI consists of three sub-packages, layered above the operating system:
       – Open MPI: the MPI layer itself
       – Open RTE: Open Run-Time Environment
       – Open PAL: Open Portable Access Layer
     • Modular Component Architecture (MCA):
       – Dynamically loads available modules as plug-ins and checks for hardware
       – Selects the best plug-in and unloads the others (e.g. if the hardware is not available)
       – Fast indirect calls into each plug-in
       – Example: the BTL framework with components for OpenIB, TCP, Myrinet and shared memory (SM), sitting between the user application's MPI API calls and the hardware (see the sketch below)
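     To make the selection mechanism concrete, here is a minimal sketch of the MCA pattern in C. It is illustrative only and does not reproduce Open MPI's actual framework code; all names in it are invented.

         /* Illustrative sketch of the MCA selection pattern -- not Open MPI's
          * actual source. Each plug-in exports a query function; the framework
          * keeps the best-rated component and unloads the rest. */
         #include <stdio.h>
         #include <stddef.h>

         typedef struct {
             const char *name;
             /* returns a priority, or -1 if the required hardware is absent */
             int (*query)(void);
         } btl_component_t;

         static int query_tcp(void)    { return 10; }  /* always available */
         static int query_openib(void) { return -1; }  /* e.g. no IB HCA found */

         int main(void)
         {
             btl_component_t comps[] = { { "tcp", query_tcp },
                                         { "openib", query_openib } };
             btl_component_t *best = NULL;
             int best_prio = -1;

             for (size_t i = 0; i < sizeof(comps) / sizeof(comps[0]); i++) {
                 int prio = comps[i].query();
                 if (prio > best_prio) { best_prio = prio; best = &comps[i]; }
                 /* a real framework would dlclose() rejected components here */
             }
             if (best) printf("selected BTL component: %s\n", best->name);
             return 0;
         }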

  5. Introduction: Valgrind 1/2
     • An open-source debugging and profiling tool
     • For x86/Linux, AMD64/Linux, PPC32/Linux and PPC64/Linux
     • Works with any dynamically or statically linked application
     • Memcheck: a heavyweight memory checker
       – Runs the program on a synthetic CPU, identical to a real CPU but additionally storing information about memory state:
         · Valid-value bits (V-bits) for each bit: does it hold a valid value?
         · Address bits (A-bits) for each byte: is it legal to read/write that location?
       – All reads and writes of memory are checked
       – Calls to malloc/new/free/delete are intercepted

  6. Introduction: Valgrind 2/2
     • Use of uninitialized memory: the error is only reported when the uninitialized value is actually used, e.g.:

         int c[2];
         int i = c[0];   /* OK: merely copying undefined data */
         if (i == 0)     /* Memcheck: use of uninitialized value! */

     • Use of free'd memory
     • Mismatched use of malloc/new with free/delete
     • Memory leaks
     • Overlapping src and dst blocks in memcpy(), strcpy(), strncpy(), strcat(), strncat()
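     For reference, a small self-contained program (not from the talk) that triggers several of these bug classes when run under valgrind:

         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>

         int main(void)
         {
             int c[2];
             int i = c[0];             /* OK: copying undefined data is allowed */
             if (i == 0)               /* Memcheck: conditional jump depends on
                                          uninitialised value(s) */
                 puts("zero");

             char *p = malloc(16);
             free(p);
             p[0] = 'x';               /* use of free'd memory: invalid write */

             char buf[8] = "abcdefg";
             memcpy(buf, buf + 2, 6);  /* overlapping src and dst blocks */

             malloc(32);               /* never freed: memory leak at exit */
             return 0;
         }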

  7. Valgrind – MPI Example 1/2
     • Open MPI readily supports executing applications under Valgrind:

         mpirun -np 2 valgrind ./mpi_murks

  8. Valgrind – MPI Example 2/2
     • Running mpirun -np 2 valgrind ./mpi_murks yields (PID 11278):

         ==11278== Invalid read of size 1
         ==11278==    at 0x4002321E: memcpy (../../memcheck/mac_replace_strmem.c:256)
         ==11278==    by 0x80690F6: MPID_SHMEM_Eagerb_send_short (mpich/../shmemshort.c:70)
         ... 2 lines of calls to MPICH functions deleted ...
         ==11278==    by 0x80492BA: MPI_Send (/usr/src/mpich/src/pt2pt/send.c:91)
         ==11278==    by 0x8048F28: main (mpi_murks.c:44)
         ==11278== Address 0x4158B0EF is 3 bytes after a block of size 40 alloc'd
         ==11278==    at 0x4002BBCE: malloc (../../coregrind/vg_replace_malloc.c:160)
         ==11278==    by 0x8048EB0: main (mpi_murks.c:39)

       → buffer overrun by 4 bytes in MPI_Send

         ==11278== Conditional jump or move depends on uninitialised value(s)
         ==11278==    at 0x402985C4: _IO_vfprintf_internal (in /lib/libc-2.3.2.so)
         ==11278==    by 0x402A15BD: _IO_printf (in /lib/libc-2.3.2.so)
         ==11278==    by 0x8048F44: main (mpi_murks.c:46)

       → printing of an uninitialized variable

     • What it cannot find:
       – when run with 1 process: one pending Recv (Marmot can)
       – when run with >2 processes: unmatched Sends (Marmot can)
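     The source of mpi_murks.c is not shown in the slides; the following is a hypothetical reconstruction of just the two bugs reported above (a 40-byte allocation sent as 44 bytes, and printing of an uninitialized variable):

         #include <mpi.h>
         #include <stdio.h>
         #include <stdlib.h>

         int main(int argc, char **argv)
         {
             int rank, uninit;                      /* uninit is never assigned */
             MPI_Init(&argc, &argv);
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             if (rank == 0) {
                 char *buf = malloc(40);            /* 40-byte block ...        */
                 MPI_Send(buf, 44, MPI_CHAR, 1, 0,  /* ... but 44 bytes sent:   */
                          MPI_COMM_WORLD);          /* invalid read past block  */
                 printf("%d\n", uninit);            /* uses uninitialized value */
                 free(buf);
             } else if (rank == 1) {
                 char buf[44];
                 MPI_Recv(buf, 44, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
             }
             MPI_Finalize();
             return 0;
         }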

  9. Design and Implementation 1/3
     • Memchecker: a new concept that uses Valgrind's client API internally in Open MPI to reveal bugs
       – in the application
       – in Open MPI itself
     • Implemented as a generic MCA framework named memchecker:
       – lives in the Open PAL layer, next to other MCA frameworks; a valgrind component exists, a solaris_rtc component does not yet (currently no API implemented in rtc)
       – enabled with the configure option --enable-memchecker
       – an installed Valgrind can be passed with --with-valgrind=/path/to/valgrind
     • Then simply run, e.g.:

         mpirun -np 2 valgrind ./my_mpi
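     The slides do not show the memchecker interface itself; the following sketch uses Valgrind's public client requests from <valgrind/memcheck.h> to illustrate the kind of calls such a component wraps. The function names are invented, not Open MPI's actual opal API.

         #include <stddef.h>
         #include <valgrind/valgrind.h>
         #include <valgrind/memcheck.h>

         /* Mark a communication buffer off-limits while MPI owns it. */
         static inline void memchecker_mem_noaccess(void *addr, size_t len)
         {
             if (RUNNING_ON_VALGRIND)
                 VALGRIND_MAKE_MEM_NOACCESS(addr, len);
         }

         /* Hand the buffer back to the application once the transfer is done. */
         static inline void memchecker_mem_defined(void *addr, size_t len)
         {
             if (RUNNING_ON_VALGRIND)
                 VALGRIND_MAKE_MEM_DEFINED(addr, len);
         }

         /* Check that an object the user passed in is fully initialized;
          * the macro returns 0 when every byte is addressable and defined. */
         static inline int memchecker_is_defined(const void *addr, size_t len)
         {
             return VALGRIND_CHECK_MEM_IS_DEFINED(addr, len) == 0;
         }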

  10. Design and Implementation 2/3
     • Detect the application's memory violations of the MPI standard:
       – usage of undefined data
       – memory accesses that are illegal due to MPI semantics
     • Detect non-blocking and one-sided communication buffer errors:
       – hooks in the BTL layer cover both kinds of communication
       – memory accessibility is set independently of the MPI operation as a whole, i.e. only for the fragment currently being sent or received
       – derived datatypes are handled
     • MPI object checking:
       – check the definedness of MPI objects passed to the MPI API: MPI_Status, MPI_Comm, MPI_Request and MPI_Datatype
       – can be disabled for better performance
     (a condensed sketch of the per-fragment handling follows below)
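     A condensed sketch of the per-fragment accessibility idea, again using Valgrind's public client requests; everything except the Valgrind macros is illustrative:

         #include <stddef.h>
         #include <valgrind/memcheck.h>

         /* Called when a fragment of the user buffer is handed to the BTL:
          * the application touching this region now triggers an error. */
         void fragment_send_begin(char *buf, size_t frag_off, size_t frag_len)
         {
             VALGRIND_MAKE_MEM_NOACCESS(buf + frag_off, frag_len);
         }

         /* Called when the transfer of that fragment completes:
          * the region is given back to the application, contents intact. */
         void fragment_send_complete(char *buf, size_t frag_off, size_t frag_len)
         {
             VALGRIND_MAKE_MEM_DEFINED(buf + frag_off, frag_len);
         }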

  11. Design and Implementation 3/3
     • Non-blocking send/receive buffer error checking
     [Figure: Proc0 calls MPI_Isend, Proc1 calls MPI_Irecv. The buffer is split into fragments 0..n; while a fragment is in flight through the PML (P2P Management Layer), BML (BTL Management Layer) and BTL (Byte Transfer Layer), it is marked inaccessible/unaddressable, until the matching MPI_Wait completes on each side.]

  12. Detectable Bug Classes 1/3
     • Non-blocking buffer accessed/modified before the operation has finished:

         MPI_Isend (buffer, SIZE, MPI_INT, ..., &request);
         buffer[1] = 4711;             /* modified while MPI still owns it */
         MPI_Wait (&request, &status);

     • The standard does not (yet) allow even read access:

         MPI_Isend (buffer, SIZE, MPI_INT, ..., &request);
         result[1] = buffer[1];        /* read access is also illegal */
         MPI_Wait (&request, &status);

     • Side note: MPI-1, p. 30, rationale for the restrictive access rules: "allows better performance on some systems".
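     For contrast, the conforming pattern simply completes the request before touching the buffer (a fragment in the style of the slide, assuming dest and tag are defined):

         MPI_Isend (buffer, SIZE, MPI_INT, dest, tag, MPI_COMM_WORLD, &request);
         MPI_Wait (&request, &status);   /* buffer belongs to MPI until here */
         buffer[1] = 4711;               /* now read and write access are legal */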

  13. Detectable Bug Classes 2/3
     • Access to a buffer that is under the control of MPI:

         MPI_Irecv (buffer, SIZE, MPI_CHAR, ..., &request);
         buffer[1] = 4711;             /* written into the pending receive */
         MPI_Wait (&request, &status);

       Side note: CRC-based methods do not reliably catch these cases.
     • Memory outside the receive buffer is overwritten:

         buffer = malloc (SIZE * sizeof(char));
         memset (buffer, 0, SIZE * sizeof(char));
         MPI_Recv (buffer, SIZE+1, MPI_CHAR, ..., &status);  /* one too many */

       Side note: MPI-1, p. 21, rationale on overflow situations: "no memory that is outside the receive buffer will ever be overwritten."

  14. Detectable Bug Classes 3/3
     • Usage of undefined memory passed back from Open MPI:

         MPI_Wait (&request, &status);
         if (status.MPI_ERROR != MPI_SUCCESS)   /* field is undefined here */

       Side note: for single-completion calls this field should remain undefined:
       – MPI-1, p. 22 (not needed for calls that return only one status)
       – MPI-2, p. 24 (clarification of status in single-completion calls)
     • Write to a buffer before an accumulate has finished:

         MPI_Accumulate (A, NROWS*NCOLS, MPI_INT, 1, 0, 1, xpose, MPI_SUM, win);
         A[0][1] = 4711;               /* A still owned by the RMA epoch */
         MPI_Win_fence (0, win);
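     The corrected ordering for the accumulate case closes the access epoch first; only then may the origin buffer be reused:

         MPI_Accumulate (A, NROWS*NCOLS, MPI_INT, 1, 0, 1, xpose, MPI_SUM, win);
         MPI_Win_fence (0, win);   /* completes the accumulate */
         A[0][1] = 4711;           /* safe: the epoch is closed */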

  15. Performance 1/2
     • Benchmark: Intel MPI Benchmark (IMB)
     • Environment:
       – D-Grid cluster at HLRS
       – dual-processor Intel Woodcrest nodes
       – InfiniBand DDR network with the OpenFabrics stack
     • Test cases:
       – plain Open MPI
       – with the memchecker component, but without MPI object checking

  16. Performance 2/2
     • Intel MPI Benchmark, bi-directional Get test
     • 2 nodes, TCP connections over the IP-over-InfiniBand (IPoIB) interface
     • Run with and without Valgrind

  17. Valgrind (Memcheck) Extension 1/2
     • New client requests for:
       – watching memory read operations
       – watching memory write operations
       – initiating callback functions on memory reads/writes
       – making memory readable and/or writable
     • Implementation:
       – uses a fast ordered-set algorithm
       – byte-wise memory checking
       – handles memory with mixed registered and unregistered blocks

  18. Valgrind (Memcheck) Extension 2/2
     • VALGRIND_REG_USER_MEM_WATCH (addr, len, op, cb, info)
     • VALGRIND_UNREG_USER_MEM_WATCH (addr, len)
     • The watch operation "op" can be: WATCH_MEM_READ, WATCH_MEM_WRITE or WATCH_MEM_RW
     [Figure: sequence diagram of the user application and Valgrind — the application allocates memory (Alloc_mem), registers a watch, and a later Read_mem triggers the Read_cb callback inside Valgrind.]
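     A usage sketch of the proposed requests; the callback signature and the header that would declare these macros are assumptions, as the slides do not spell them out:

         #include <stdio.h>
         #include <stdlib.h>
         /* assumed: an extended <valgrind/memcheck.h> declaring the
          * watch requests of this extension */

         /* assumed callback signature: address, length, user info pointer */
         static void on_read(void *addr, size_t len, void *info)
         {
             fprintf(stderr, "watched read of %zu byte(s) at %p\n", len, addr);
         }

         int main(void)
         {
             char *buf = malloc(64);

             /* report every read from buf via the callback */
             VALGRIND_REG_USER_MEM_WATCH (buf, 64, WATCH_MEM_READ, on_read, NULL);

             volatile char c = buf[3];   /* triggers on_read under the
                                            extended Valgrind */
             (void) c;

             VALGRIND_UNREG_USER_MEM_WATCH (buf, 64);
             free(buf);
             return 0;
         }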

  19. Thank you very much!
