
Practical Formal Verification of MPI and Thread Programs. Ganesh Gopalakrishnan, School of Computing, University of Utah, Salt Lake City, UT 84112. A Half-Day Tutorial Proposed for ICS 2009. http://www.cs.utah.edu/formal_verification


  1. First, Fire the Barrier statements!

         P0                     P1                     P2
         ---                    ---                    ---
         MPI_Isend(to P2)       MPI_Barrier            MPI_Irecv(from ANY)
         MPI_Barrier            MPI_Isend(to P2)       MPI_Barrier

  2. Then rewrite “ANY” to “from P0” and do one interleaving

         P0                     P1                     P2
         ---                    ---                    ---
         MPI_Isend(to P2)       MPI_Barrier            MPI_Irecv(from P0)
         MPI_Barrier            MPI_Isend(to P2)       MPI_Barrier

     Pursue this interleaving

  3. Then rewrite “ANY” to “from P0” and do one interleaving

         P0                     P1                     P2
         ---                    ---                    ---
         MPI_Isend(to P2)       MPI_Barrier            MPI_Irecv(from P0)
         MPI_Barrier            MPI_Isend(to P2)       MPI_Barrier

     Pursue this interleaving
     • Dynamic rewriting forces the MPI runtime to schedule the way we want
     • Several such techniques ensure “progress” across different MPI libraries

  4. Then rewrite “ANY” to “from P1” and do the other

         P0                     P1                     P2
         ---                    ---                    ---
         MPI_Isend(to P2)       MPI_Barrier            MPI_Irecv(from P1)
         MPI_Barrier            MPI_Isend(to P2)       MPI_Barrier

     Then restart and pursue this interleaving
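
In MPI terms, “ANY” is MPI_ANY_SOURCE, and the dynamic rewrite substitutes a concrete rank before the receive is issued to the runtime. A minimal sketch of P2’s code (payload and tag are illustrative; ISP performs the rewrite inside its interposition layer, not in user source):

    #include <mpi.h>

    /* P2 from slide 1.  ISP replays this twice: once with the wildcard
       rewritten to source 0 and once with it rewritten to source 1,
       covering both sends that could match the Irecv. */
    void p2_body(int source)   /* MPI_ANY_SOURCE, or the rewritten 0 / 1 */
    {
        int buf;
        MPI_Request req;
        MPI_Irecv(&buf, 1, MPI_INT, source, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }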

  5. Workflow of ISP: the executable MPI program (Proc 1, Proc 2, …, Proc n) runs through an interposition layer over the MPI runtime, while the scheduler generates ALL RELEVANT schedules (Mazurkiewicz traces).
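
The interposition layer can be realized with MPI’s standard profiling interface (PMPI); a minimal sketch of the mechanism, assuming this is how calls are intercepted (ISP’s real layer additionally talks to the scheduler):

    #include <mpi.h>

    /* The application's call to MPI_Barrier resolves to this wrapper, which
       can consult the scheduler before forwarding to the real implementation
       through the PMPI_ entry point. */
    int MPI_Barrier(MPI_Comm comm)
    {
        /* ... ask the scheduler for permission to proceed (elided) ... */
        return PMPI_Barrier(comm);
    }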

  6.–9. POE Scheduler (four animation frames). The example program:

         P0                 P1                 P2
         ---                ---                ---
         Isend(1, req)      Irecv(*, req)      Barrier
         Barrier            Barrier            Isend(1, req)
         Wait(req)          Recv(2)            Wait(req)
                            Wait(req)

     Frame by frame, the scheduler collects each process’s operations (replying sendNext) rather than issuing them to the MPI runtime immediately; it fires the Barrier match-set first, since the outstanding Isend and Irecv can still complete across the barrier; it then forms the Isend/Irecv match-sets for the wildcard Irecv(*). In the final frame, Irecv(*) has been rewritten to receive from P2, so P1’s Recv(2) and P0’s Isend(1) are left with no match-set: Deadlock!
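
For reference, a compilable rendering of the example (integer payloads and tag 0 are assumptions; run with 3 ranks). It deadlocks precisely when the wildcard receive matches P2’s send, which is the interleaving POE flags:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0, ans = 0;
        MPI_Request req;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                    /* P0 */
            MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {             /* P1 */
            MPI_Irecv(&ans, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
            MPI_Barrier(MPI_COMM_WORLD);
            /* blocks forever if the Irecv above consumed P2's only send */
            MPI_Recv(&ans, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else {                            /* P2 */
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }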

  10. POE Contributions
      • ISP (using POE) is the ONLY dynamic model checker for MPI
      • Insightful formulation of the reduction algorithm
        – MPI semantics
        – Distinguishing Match and Complete events
        – Completes-before
        – Prioritized execution giving reduction, and a guarantee of maximal senders matching receivers
      • Works really well
        – Large examples (e.g. ParMETIS, 14 KLOC) finish in one interleaving
        – Even if wildcard receives are used, POE often finds that no non-determinism arises
        – Valuable byproduct: removal of Functionally Irrelevant Barriers. If a Barrier does not help introduce orderings that confine non-determinism, then one can remove the barrier.

  11. Visual Studio and Java GUI; Eclipse is planned

  12. Present Situation
      • ISP: a push-button dynamic verifier for MPI programs
      • Finds deadlocks, resource leaks, assertion violations
        – Code-level model checking; no manual model building
        – Guarantees soundness for one test input
        – Works on MacOS, Linux, Windows
        – Works with MPICH2, OpenMPI, MS MPI
        – Verifies 14 KLOC in seconds
      • ISP is available for download: http://cs.utah.edu/formal_verification/ISP-release

  13. RESULTS USING ISP
      • The only push-button model checker for MPI/C programs (the only other model-checking approach is MPI-SPIN)
      • Testing misses deadlocks even on a page of code; see http://www.cs.utah.edu/formal_verification/ISP_Tests
      • ISP is meant as a safety net during manual optimizations: a programmer takes liberties that they would otherwise not, and the value was amply evident even when tuning the Matrix Mult code
      • Deadlock found in one test of MADRE (3K LOC); later found to be a known deadlock

  14. RESULTS USING ISP
      • Handled these examples: IRS (Sequoia Benchmarks), ParMETIS (14K LOC), MADRE (3K LOC)
      • Working on these examples: MPI-BLAST, ADLB
      • There is significant outreach work remaining: the MPI user community is receptive to FV, but they really have no experience evaluating a model checker, so we are offering many tutorials this year
        – ICS 2009
        – EuroPVM/MPI 2009 (likely)
        – Applying for Cluster 2009
        – Applying for Supercomputing 2009

  15.–25. Example of MPI Code (Mat Mat Mult): eleven animation frames of the matrix product. First, the master broadcasts matrix B to all ranks (MPI_Bcast on every process, slides 15–16). Next, the master ships rows of A to the slaves with MPI_Send, matched by MPI_Recv at each slave (slides 17–20). Finally, each slave returns its result row with MPI_Send, matched by MPI_Recv at the master (slides 21–25).

  26. Salient Code Features

          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
          MPI_Comm_rank(MPI_COMM_WORLD, &myid);    // Master is rank 0; slaves are ranks 1-4
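
A self-contained skeleton around these calls, for readers who want to run the fragments that follow (variable names match the slides; the rest is assumed scaffolding):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int numprocs, myid;
        const int master = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);  /* 1 master + (numprocs-1) slaves */
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);      /* this process's rank */

        if (myid == master) {
            /* master: Bcast B, Send rows of A, Recv result rows (slides 27-34) */
        } else {
            /* slave: Bcast B, then loop { Recv row; compute; Send result } */
        }

        MPI_Finalize();
        return 0;
    }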

  27. Salient Code Features

          if (myid == master) {
              ...
              MPI_Bcast(b, brows*bcols, MPI_FLOAT, master, …);
              ...
          }
          else {   // All slaves do this
              ...
              MPI_Bcast(b, brows*bcols, MPI_FLOAT, master, …);
              ...
          }

  28. Salient Code Features

          if (myid == master) {
              ...
              for (i = 0; i < numprocs-1; i++) {
                  for (j = 0; j < acols; j++) { buffer[j] = a[i*acols+j]; }
                  MPI_Send(buffer, acols, MPI_FLOAT, i+1, …);   // Blocks till buffer is copied
                  numsent++;                                    //   into the System Buffer
              }
          }
          else {   // slaves
              ...
              while (1) {
                  ...
                  MPI_Recv(buffer, acols, MPI_FLOAT, master, …);
                  ...
              }
          }

  29. Handling Rows >> Processors: the master does an MPI_Recv, then sends the next row (MPI_Send) to the first slave, which by now must be free

  30. Handling Rows >> Processors: OR, send the next row (MPI_Send) to the first slave that returns the answer (MPI_Recv)!

  31. Salient Code Features

          if (myid == master) {
              ...
              for (i = 0; i < crows; i++) {
                  MPI_Recv(ans, ccols, MPI_FLOAT, FROM FIRST-PROCESSOR, ...);
                  ...
                  if (numsent < arows) {
                      for (j = 0; j < acols; j++) { buffer[j] = a[numsent*acols+j]; }
                      MPI_Send(buffer, acols, MPI_FLOAT, BACK TO FIRST-PROCESSOR, ...);
                      numsent++;
                      ...
                  }
              }
          }

  32. Optimization

          if (myid == master) {
              ...
              for (i = 0; i < crows; i++) {
                  MPI_Recv(ans, ccols, MPI_FLOAT, FROM ANYBODY, ...);
                  ...
                  if (numsent < arows) {
                      for (j = 0; j < acols; j++) { buffer[j] = a[numsent*acols+j]; }
                      MPI_Send(buffer, acols, MPI_FLOAT, BACK TO THAT BODY, ...);
                      numsent++;
                      ...
                  }
              }
          }

  33. Optimization: shows that wildcard receives can arise quite naturally … (the same code as slide 32, with the MPI_Recv FROM ANYBODY highlighted)
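
In actual MPI, “FROM ANYBODY” is MPI_ANY_SOURCE, and “THAT BODY” is read back from the status argument. A minimal sketch of the pattern (tag 0 is an assumption; the slides elide it):

    MPI_Status status;
    /* Receive an answer row from whichever slave finishes first. */
    MPI_Recv(ans, ccols, MPI_FLOAT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    /* status.MPI_SOURCE identifies "THAT BODY": send the next row back to it. */
    MPI_Send(buffer, acols, MPI_FLOAT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);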

  34. Further Optimization

          if (myid == master) {
              ...
              for (i = 0; i < crows; i++) {
                  MPI_Recv(ans, ccols, MPI_FLOAT, FROM ANYBODY, ...);
                  ...
                  if (numsent < arows) {
                      for (j = 0; j < acols; j++) { buffer[j] = a[numsent*acols+j]; }
                      … here, wait for previous Isend (if any) to finish …
                      MPI_Isend(buffer, acols, MPI_FLOAT, BACK TO THAT BODY, ...);
                      numsent++;
                      ...
                  }
              }
          }
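
The “wait for previous Isend” step can be expressed with a request handle initialized to MPI_REQUEST_NULL, since MPI_Wait on a null request returns immediately. A sketch of the loop body, with names taken from the slides and tag 0 assumed:

    MPI_Request send_req = MPI_REQUEST_NULL;   /* before the loop */
    ...
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);    /* previous Isend (if any) is done */
    for (j = 0; j < acols; j++) { buffer[j] = a[numsent*acols+j]; }  /* safe to refill */
    MPI_Isend(buffer, acols, MPI_FLOAT, status.MPI_SOURCE, 0,
              MPI_COMM_WORLD, &send_req);
    numsent++;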

  35. Run Visual Studio ISP Plug-in Demo

  36. Slides on Inspect

  37. Inspect’s Workflow (http://www.cs.utah.edu/~yuyang/inspect): the multithreaded C program is instrumented, compiled, and run as an executable; threads 1 … n exchange request/permit messages with the scheduler through a thread-library wrapper.
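
A hedged sketch of the request/permit idea behind those wrappers (the helper below is a hypothetical stand-in; Inspect’s actual channel is a Unix domain socket to the scheduler):

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the message exchange with the scheduler:
       announce the visible operation, then block until a permit arrives. */
    static void request_permit(const char *op, void *obj)
    {
        printf("request: %s on %p\n", op, obj);
        /* ... block here until the scheduler grants permission ... */
    }

    /* Wrapper in the spirit of inspect_mutex_lock: the scheduler decides
       when the real operation may execute, so it controls thread order. */
    void wrapped_mutex_lock(pthread_mutex_t *m)
    {
        request_permit("mutex_lock", m);
        pthread_mutex_lock(m);
    }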

  38. Overview of the source transformation done by Inspect: Multithreaded C Program → inter-procedural, flow-sensitive, context-insensitive alias analysis → thread-escape analysis → intra-procedural dataflow analysis → source-code transformation → Instrumented Program.

  39. Result of instrumentation

      Before:

          void *Philosopher(void *arg) {
              int i;
              i = (int)arg;
              ...
              pthread_mutex_lock(&mutexes[i%3]);
              while (permits[i%3] == 0) {
                  printf("P%d : tryget F%d\n", i, i%3);
                  pthread_cond_wait(...);
              }
              permits[i%3] = 0;
              ...
              pthread_cond_signal(&conditionVars[i%3]);
              pthread_mutex_unlock(&mutexes[i%3]);
              return NULL;
          }

      After (CIL-transformed, with shared accesses and sync operations routed through Inspect; excerpt):

          void *Philosopher(void *arg) {
              int i;
              pthread_mutex_t *tmp;
              {
                  inspect_thread_start("Philosopher");
                  i = (int)arg;
                  tmp = &mutexes[i % 3];
                  inspect_mutex_lock(tmp);
                  ...
                  while (1) {
                      __cil_tmp43 = read_shared_0(&permits[i % 3]);
                      ...
                      if (! __cil_tmp32) { break; }
                      ...
                      inspect_cond_wait(...);
                  }
                  ...
                  write_shared_1(&permits[i % 3], 0);
                  ...
                  inspect_cond_signal(tmp___25);
                  ...

  40. Inspect animation: the program under test issues each thread action through the visible-operation interceptor; requests and permissions travel over Unix domain sockets, through a message buffer, to the scheduler, which runs DPOR over its state stack.

  41. How does Inspect avoid being killed by the exponential number of thread interleavings?

  42. p threads with n actions each: #interleavings = (n·p)! / (n!)^p

          Thread 1 … Thread p, each executing actions 1: 2: … n:

      • p = R, n = 1: R! interleavings
      • p = 3, n = 5: ~10^6 interleavings
      • p = 3, n = 6: 17 × 10^6 interleavings
      • p = 4, n = 5: ~10^10 interleavings
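
These counts are easy to reproduce; a minimal C program evaluating (n·p)!/(n!)^p exactly (every case on the slide fits in 64-bit integers):

    #include <stdio.h>

    /* (n*p)! computed directly; fine for the small n, p used here. */
    static unsigned long long fact(unsigned k)
    {
        unsigned long long f = 1;
        for (unsigned i = 2; i <= k; i++) f *= i;
        return f;
    }

    /* Number of interleavings of p threads with n actions each. */
    static unsigned long long interleavings(unsigned n, unsigned p)
    {
        unsigned long long r = fact(n * p);
        for (unsigned i = 0; i < p; i++) r /= fact(n);  /* each division is exact */
        return r;
    }

    int main(void)
    {
        printf("p=3, n=5: %llu\n", interleavings(5, 3));  /* 756756   (~10^6)   */
        printf("p=3, n=6: %llu\n", interleavings(6, 3));  /* 17153136 (~17e6)   */
        printf("p=4, n=5: %llu\n", interleavings(5, 4));  /* ~1.17e10 (~10^10)  */
        printf("p=2, n=5: %llu\n", interleavings(5, 2));  /* 252, cf. slide 46  */
        return 0;
    }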

  43. …the exponential number of thread interleavings? Ans: Inspect uses Dynamic Partial Order Reduction. Basically, it interleaves threads ONLY when dependencies exist between thread actions!
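
What counts as a “dependency” can be stated compactly; a hedged sketch of the standard classification from the DPOR literature (not Inspect’s exact code):

    #include <stdbool.h>

    typedef enum { OP_READ, OP_WRITE, OP_LOCK, OP_UNLOCK } OpKind;
    typedef struct { int thread; OpKind kind; void *object; } Op;

    /* Two actions are dependent (both orders must be explored) iff they
       come from different threads, touch the same object, and are not
       both reads. */
    static bool dependent(const Op *a, const Op *b)
    {
        if (a->thread == b->thread) return true;   /* ordered by program order */
        if (a->object != b->object) return false;  /* disjoint objects commute */
        if (a->kind == OP_READ && b->kind == OP_READ) return false; /* reads commute */
        return true;  /* write/write, read/write, or same-lock operations */
    }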

  44. A concrete example of interleaving reductions

  45. [ NEW SLIDE ] On the HUGE importance of DPOR

      BEFORE INSTRUMENTATION:

          void *thread_A(void *arg)   // thread_B is similar
          {
              pthread_mutex_lock(&mutex);
              A_count++;
              pthread_mutex_unlock(&mutex);
          }

          void *thread_B(void *arg)
          {
              pthread_mutex_lock(&lock);
              B_count++;
              pthread_mutex_unlock(&lock);
          }

      AFTER INSTRUMENTATION (transitions are shown as bands):

          void *thread_A(void *arg)
          {
              void *__retres2;
              int __cil_tmp3;
              int __cil_tmp4;
              {
                  inspect_thread_start("thread_A");
                  inspect_mutex_lock(&mutex);
                  __cil_tmp4 = read_shared_0(&A_count);
                  __cil_tmp3 = __cil_tmp4 + 1;
                  write_shared_1(&A_count, __cil_tmp3);
                  inspect_mutex_unlock(&mutex);
                  __retres2 = (void *)0;
                  inspect_thread_end();
                  return (__retres2);
              }
          }

  46. [ NEW SLIDE ] On the HUGE importance of DPOR (same code as the previous slide)
      • ONE interleaving with DPOR
      • 252 = (10!) / (5!)^2 without DPOR

  47. More eye-popping numbers
      • bzip2smp has 6000 lines of code split among roughly 6 threads, so it has a theoretical maximum number of interleavings on the order of (6000!) / (1000!)^6
        – This is the execution space that a testing tool foolishly tries to navigate
      • bzip2smp with Inspect finished in 51,000 interleavings over a few hours
        – THIS IS THE RELEVANT SET OF INTERLEAVINGS
        – MORE FORMALLY: its Mazurkiewicz trace set

  48. Dynamic Partial Order Reduction (DPOR) “animatronics”. The program:

          P0: lock(y) ………….. unlock(y)
          P1: lock(x) ………….. unlock(x)
          P2: lock(x) ………….. unlock(x)

      The animation steps through the transitions L0/U0, L1/U1, L2/U2 (each thread’s lock and unlock), exploring both orders only for the pair that contends for the same lock x.

  49. Another DPOR animation (to help show how DDPOR works…)

  50.–64. A Simple DPOR Example (fifteen animation frames). The program:

          t0: lock(t); unlock(t)
          t1: lock(t); unlock(t)
          t2: lock(t); unlock(t)

      Every state on the depth-first search stack carries a pair { BT }, { Done } (backtrack set, done set), both initially {}, {}. The frames play out as follows: t0’s lock and unlock execute; when t1’s lock is reached, it is found dependent with t0’s lock, so t1 enters the backtrack set at the initial state ({t1}, {t0}). Execution continues with t1’s unlock and t2’s lock, which likewise records {t2}, {t1} at the intermediate state. After t2’s unlock the run ends and DPOR pops the stack, re-exploring from the recorded backtrack points: the intermediate state replays with t2 first ({}, {t1, t2}), the initial state grows to {t1, t2}, {t0}, and finally t1 runs first from the root ({t2}, {t0, t1}), and so on until every inequivalent ordering of the lock/unlock pairs has been covered.

  65. This is how DDPOR works
      • Once the backtrack set gets populated, ship work descriptions to other nodes
      • We obtain distributed model checking using MPI
      • Once we figured out a crucial heuristic (SPIN 2007), we have managed to get linear speed-up… so far…

  66. We have devised a work-distribution scheme (SPIN 2007): a load balancer mediates between the workers, exchanging “request unloading”, “idle node id”, “work description”, and “report result” messages with worker a and worker b.
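
A hedged sketch of that message pattern (tags and the work-description encoding are illustrative, not DDPOR’s actual wire format):

    #include <mpi.h>

    enum { TAG_IDLE = 1, TAG_WORK = 2, TAG_RESULT = 3 };

    /* Worker side: announce idleness, receive a backtrack-point description,
       explore it locally with DPOR, and report the result back. */
    void worker_loop(int balancer_rank)
    {
        int dummy = 0;
        int work[16];   /* illustrative encoding of a backtrack point */
        for (;;) {
            MPI_Send(&dummy, 0, MPI_INT, balancer_rank, TAG_IDLE, MPI_COMM_WORLD);
            MPI_Recv(work, 16, MPI_INT, balancer_rank, TAG_WORK,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... replay to the backtrack point and explore it (elided) ... */
            MPI_Send(work, 1, MPI_INT, balancer_rank, TAG_RESULT, MPI_COMM_WORLD);
        }
    }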

  67. Speedup on aget
