First, fire the Barrier statements!

        P0                    P1                    P2
   MPI_Isend(to P2)      MPI_Barrier           MPI_Irecv(from ANY)
   MPI_Barrier           MPI_Isend(to P2)      MPI_Barrier

34
Then rewrite "ANY" to "from P0" and do one interleaving:

        P0                    P1                    P2
   MPI_Isend(to P2)      MPI_Barrier           MPI_Irecv(from P0)
   MPI_Barrier           MPI_Isend(to P2)      MPI_Barrier

Pursue this interleaving
• Dynamic rewriting forces the MPI runtime to schedule the way we want
• Several such techniques ensure "progress" across different MPI libraries

35-36
Then rewrite "ANY" to "from P1" and do the other:

        P0                    P1                    P2
   MPI_Isend(to P2)      MPI_Barrier           MPI_Irecv(from P1)
   MPI_Barrier           MPI_Isend(to P2)      MPI_Barrier

Then restart and pursue this interleaving

37
Workflow of ISP

   MPI Program --(instrumentation)--> Executable
   The executable runs Proc 1 ... Proc n over the interposition layer
   and the MPI runtime; the scheduler generates ALL RELEVANT schedules
   (Mazurkiewicz traces).

38
POE Scheduler

        P0                    P1                    P2
   Isend(1, req)         Irecv(*, req)         Barrier
   Barrier               Barrier               Isend(1, req)
   Wait(req)             Recv(2)               Wait(req)
                         Wait(req)

The scheduler collects the Isends and the wildcard Irecv, sends the
Barriers on to the MPI runtime first, and only then forms the match
set for Irecv(*): both P0's and P2's Isend can match. Pursuing the
interleaving where Irecv(*) is matched with P2's Isend, P1's Recv(2)
is left with no match set, because P2's only Isend has already been
consumed. Deadlock!

39-42
POE Contributions
• ISP (using POE) is the ONLY dynamic model checker for MPI
• Insightful formulation of the reduction algorithm
  – MPI semantics
  – Distinguishing Match and Complete events
  – Completes-before relation
  – Prioritized execution giving reduction, and a guarantee that maximal sets of senders match receivers
• Works really well
  – Large examples (e.g. ParMETIS, 14 KLOC) finish in one interleaving
  – Even if wildcard receives are used, POE often finds that no non-determinism arises
  – Valuable byproduct: removal of functionally irrelevant Barriers
    • If a Barrier does not help introduce orderings that confine non-determinism...
    • ...then one can remove the Barrier

43
Visual Studio and Java GUI; Eclipse is planned 44
Present Situation • ISP: a push-button dynamic verifier for MPI programs • Find deadlocks, resource leaks, assertion violations – Code level model checking – no manual model building – Guarantee soundness for one test input – Works for MacOS, Linux, Windows – Works for MPICH2, OpenMPI, MS MPI – Verifies 14KLOC in seconds • ISP is available for download: http://cs.utah.edu/formal_verification/ISP-release 45
RESULTS USING ISP • The only push-button model checker for MPI/C programs – (the only other model checking approach is MPI-SPIN) • Testing misses deadlocks even on a page of code – See http://www.cs.utah.edu/formal_verification/ISP_Tests • ISP is meant as a safety-net during manual optimizations – A programmer takes liberties that they would otherwise not – Value amply evident even when tuning the Matrix Mult code • Deadlock found in one test of MADRE (3K LOC) – Later found to be a known deadlock 46
RESULTS USING ISP • Handled these examples – IRS (Sequoia Benchmarks), ParMETIS (14K LOC), MADRE (3K LOC) • Working on these examples – MPI-BLAST , ADLB • There is significant outreach work remaining – The user community of MPI is receptive to FV – But they really have no experience evaluating a model checker – We are offering many tutorials this year • ICS 2009 • EuroPVM / MPI 2009 (likely) • Applying for Cluster 2009 • Applying for Super Computing 2009 47
Example of MPI Code (Mat Mat Mult)

   [Animation: A x B = C. The master first broadcasts B to all
   processes (MPI_Bcast on every rank). It then ships one row of A to
   each slave (master: MPI_Send, slave: MPI_Recv), and finally
   collects one row of the product back from each slave (master:
   MPI_Recv, slave: MPI_Send).]

48-58
Salient Code Features

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   Master: rank 0.  Slaves: ranks 1-4.

59
Salient Code Features

   if (myid == master) {
     ...
     MPI_Bcast(b, brows*bcols, MPI_FLOAT, master, ...);
     ...
   } else {  // All slaves do this
     ...
     MPI_Bcast(b, brows*bcols, MPI_FLOAT, master, ...);
     ...
   }

60
Salient Code Features

   if (myid == master) {
     ...
     for (i = 0; i < numprocs-1; i++) {
       for (j = 0; j < acols; j++) { buffer[j] = a[i*acols+j]; }
       // blocks till buffer is copied into the system buffer
       MPI_Send(buffer, acols, MPI_FLOAT, i+1, ...);
       numsent++;
     }
   } else {  // slaves
     ...
     while (1) {
       ...
       MPI_Recv(buffer, acols, MPI_FLOAT, master, ...);
       ...
     }
   }

61
Handling Rows >> Processors

   [master: MPI_Recv / MPI_Send]
   Send the next row to the FIRST slave, which by now must be free.

62
Handling Rows >> Processors

   [master: MPI_Recv / MPI_Send]
   OR: send the next row to the first slave that returns the answer!

63
Salient Code Features

   if (myid == master) {
     ...
     for (i = 0; i < crows; i++) {
       MPI_Recv(ans, ccols, MPI_FLOAT, /* FROM FIRST-PROCESSOR */ ...);
       ...
       if (numsent < arows) {
         for (j = 0; j < acols; j++) { buffer[j] = a[numsent*acols+j]; }
         MPI_Send(buffer, acols, MPI_FLOAT, /* BACK TO FIRST-PROCESSOR */ ...);
         numsent++;
         ...
       }
     }
   }

64
Optimization (shows that wildcard receives can arise quite naturally ...)

   if (myid == master) {
     ...
     for (i = 0; i < crows; i++) {
       MPI_Recv(ans, ccols, MPI_FLOAT, /* FROM ANYBODY */ ...);
       ...
       if (numsent < arows) {
         for (j = 0; j < acols; j++) { buffer[j] = a[numsent*acols+j]; }
         MPI_Send(buffer, acols, MPI_FLOAT, /* BACK TO THAT BODY */ ...);
         numsent++;
         ...
       }
     }
   }

65-66
Further Optimization

   if (myid == master) {
     ...
     for (i = 0; i < crows; i++) {
       MPI_Recv(ans, ccols, MPI_FLOAT, /* FROM ANYBODY */ ...);
       ...
       if (numsent < arows) {
         for (j = 0; j < acols; j++) { buffer[j] = a[numsent*acols+j]; }
         /* ... here, wait for the previous Isend (if any) to finish ... */
         MPI_Isend(buffer, acols, MPI_FLOAT, /* BACK TO THAT BODY */ ...);
         numsent++;
         ...
       }
     }
   }

67
Run Visual Studio ISP Plug-in Demo 68
Slides on Inspect 69
Inspect's Workflow      http://www.cs.utah.edu/~yuyang/inspect

   Multithreaded C Program --(instrumentation)--> Instrumented Program
   --(compile)--> Executable.  Thread 1 ... thread n run against the
   thread library wrapper, exchanging request/permit messages with the
   Scheduler.

   School of Computing, University of Utah    5/8/2009

70
Overview of the source transformation done by Inspect

   Multithreaded C Program
     --> inter-procedural, flow-sensitive, context-insensitive alias analysis
     --> thread escape analysis
     --> intra-procedural dataflow analysis
     --> source code transformation
     --> Instrumented Program

71
Result of instrumentation

Before:

   void *Philosopher(void *arg) {
     int i;
     i = (int)arg;
     ...
     pthread_mutex_lock(&mutexes[i%3]);
     ...
     while (permits[i%3] == 0) {
       printf("P%d : tryget F%d\n", i, i%3);
       pthread_cond_wait(...);
     }
     ...
     permits[i%3] = 0;
     ...
     pthread_cond_signal(&conditionVars[i%3]);
     ...
     pthread_mutex_unlock(&mutexes[i%3]);
     return NULL;
   }

After (excerpt):

   void *Philosopher(void *arg) {
     int i;
     pthread_mutex_t *tmp;
     inspect_thread_start("Philosopher");
     i = (int)arg;
     tmp = & mutexes[i % 3];
     inspect_mutex_lock(tmp);
     ...
     while (1) {
       __cil_tmp43 = read_shared_0(& permits[i % 3]);
       if (! __cil_tmp32) { break; }
       ...
       inspect_cond_wait(...);
     }
     ...
     write_shared_1(& permits[i % 3], 0);
     ...
     inspect_cond_signal(tmp___25);
     ...
   }

72
Inspect animation

   Program under test <--> Scheduler (DPOR, state stack):
   the visible-operation interceptor sends each thread action request,
   and receives permission back, through message buffers over Unix
   domain sockets.

73
How does Inspect avoid being killed by the exponential number of thread interleavings ?? 74
p threads with n actions each: #interleavings = (n.p)! / (n!)^p

   Thread 1 ... Thread p, each with actions 1..n.

   • p = R, n = 1 :  R! interleavings
   • p = 3, n = 5 :  ~10^6 interleavings
   • p = 3, n = 6 :  ~17 * 10^6 interleavings
   • p = 4, n = 5 :  ~10^10 interleavings

75
... the exponential number of thread interleavings?

Ans: Inspect uses Dynamic Partial Order Reduction (DPOR).
Basically, it interleaves threads ONLY when dependencies exist between
thread actions!

76
A concrete example of interleaving reductions 77
[ NEW SLIDE ] On the HUGE importance of DPOR

BEFORE INSTRUMENTATION (thread_B is similar):

   void *thread_A(void *arg) {
     pthread_mutex_lock(&mutex);
     A_count++;
     pthread_mutex_unlock(&mutex);
   }

   void *thread_B(void *arg) {
     pthread_mutex_lock(&lock);
     B_count++;
     pthread_mutex_unlock(&lock);
   }

AFTER INSTRUMENTATION (transitions are shown as bands):

   void *thread_A(void *arg) {
     void *__retres2;
     int __cil_tmp3;
     int __cil_tmp4;
     inspect_thread_start("thread_A");
     inspect_mutex_lock(& mutex);
     __cil_tmp4 = read_shared_0(& A_count);
     __cil_tmp3 = __cil_tmp4 + 1;
     write_shared_1(& A_count, __cil_tmp3);
     inspect_mutex_unlock(& mutex);
     __retres2 = (void *)0;
     inspect_thread_end();
     return (__retres2);
   }

• ONE interleaving with DPOR
• 252 = (10!) / (5!)^2 without DPOR

78-79
More eye-popping numbers

• bzip2smp has 6000 lines of code split among 6 threads; roughly, it has a theoretical max number of interleavings on the order of (6000!) / (1000!)^6
  – This is the execution space that a testing tool foolishly tries to navigate
• bzip2smp with Inspect finished in 51,000 interleavings over a few hours
  – THIS IS THE RELEVANT SET OF INTERLEAVINGS
  – MORE FORMALLY: its Mazurkiewicz trace set

80
Dynamic Partial Order Reduction (DPOR) "animatronics"

        P0                    P1                    P2
   lock(y)               lock(x)               lock(x)
   .............         .............         .............
   unlock(y)             unlock(x)             unlock(x)

   Explored run: L0 U0 L1 U1 L2 U2 (Li/Ui = lock/unlock by Pi).
   A backtrack point is needed only where actions conflict: P1's and
   P2's operations on lock x.

81
Another DPOR animation (to help show how DDPOR works…) 82
A Simple DPOR Example

   t0: lock(t); unlock(t)
   t1: lock(t); unlock(t)
   t2: lock(t); unlock(t)

Each state on the search stack carries a pair { BT } (backtrack set),
{ Done }.  The first run executes t0, t1, t2; since all three lock(t)
operations are pairwise dependent, DPOR populates the backtrack sets
as it goes: the initial state becomes {t1},{t0}, and the state after
t0's unlock becomes {t2},{t1}.  Backtracking then replays t0, t2, t1
(updating the sets to {t1,t2},{t0} and {},{t1,t2}), and finally the
runs beginning with t1 and with t2 ({t2},{t0,t1}), until every
relative order of the dependent lock operations has been explored
exactly once.

83-97
This is how DDPOR works
• Once the backtrack set gets populated, ship work descriptions to other nodes
• We obtain distributed model checking using MPI
• Once we figured out a crucial heuristic (SPIN 2007), we have managed to get linear speed-up... so far...

98
We have devised a work-distribution scheme (SPIN 2007)

   load balancer <--> worker a, worker b, ...
   Workers send "request unloading", "report result", and "idle node
   id" messages; the load balancer replies with work descriptions.

99
Speedup on aget 100