Parallel Programming on Multicores
Threading Subroutines
Bo Zhang (zhangb@cs.duke.edu), Nikos P. Pitsianis, Xiaobai Sun
Department of Computer Science, Duke University
09/29/2010
My Experience
Background: Ph.D. in Mathematics; basic knowledge of C
What I have gained:
◮ Learned Fortran, Pthreads, and OpenMP
◮ Parallel Fast Multipole Method (FMM)
   1. First parallel version on multicore desktop and laptop
   2. Initial code had 12k lines; current code has 18k lines
   3. Implemented in both Pthreads and OpenMP
   4. Problem size: 20M (laptop); 100M (workstation)
   5. Speedup: 8X on an oct-core workstation
Using Existing Subroutines
You have a working sequential code, but the parallel version shows
   1. Segmentation fault
   2. Bus error
   3. Heap collapse
   4. Inconsistent results
What happened?
Use of Existing Subroutines
Potential problems: the subroutines are not thread-safe
◮ Concurrent updates to global variables
◮ Concurrent use of local scratch space
◮ Stack overflow
Use of Existing Subroutines
Concurrent updates to global variables

    double balance;

    void deposit (double s) {
        balance += s;
    }
    void withdraw (double s) {
        balance -= s;
    }
    int main () {
        balance = 0.0;
        deposit (s1);
        deposit (s2);
        withdraw (s3);
        . . .
        printf("%f\n", balance);
        return 0;
    }
Use of Existing Subroutines
Concurrent updates to global variables: use a mutex variable

    double balance;
    pthread_mutex_t balance_mutex = PTHREAD_MUTEX_INITIALIZER;

    void deposit (double s) {
        pthread_mutex_lock (&balance_mutex);      /* one thread at a time */
        balance += s;
        pthread_mutex_unlock (&balance_mutex);
    }
    void withdraw (double s) {
        pthread_mutex_lock (&balance_mutex);
        balance -= s;
        pthread_mutex_unlock (&balance_mutex);
    }
    int main () {
        balance = 0.0;
        . . .
        return 0;
    }
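The mutex version above can be exercised with a small threaded driver. The following is a minimal sketch, not code from the original project: the thread count, the number of deposits, and the names depositor and run_deposit_demo are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4              /* assumed thread count */
#define DEPOSITS_PER_THREAD 100000 /* assumed workload per thread */

static double balance;
static pthread_mutex_t balance_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Same pattern as the slide: the update is wrapped in lock/unlock. */
static void deposit(double s) {
    pthread_mutex_lock(&balance_mutex);
    balance += s;
    pthread_mutex_unlock(&balance_mutex);
}

/* Hypothetical thread body: deposits 1.0 repeatedly. */
static void *depositor(void *arg) {
    (void)arg;
    for (int k = 0; k < DEPOSITS_PER_THREAD; k++)
        deposit(1.0);
    return NULL;
}

/* Starts the threads, waits for them, and returns the final balance. */
double run_deposit_demo(void) {
    pthread_t threads[NUM_THREADS];
    balance = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&threads[t], NULL, depositor, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    return balance;
}
```

Calling run_deposit_demo() from main() should return NUM_THREADS * DEPOSITS_PER_THREAD. Removing the lock and unlock calls from deposit() typically makes the result vary from run to run, which is the "inconsistent results" symptom.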
Use of Existing Subroutines
Concurrent use of local scratch space

    int main () {
        int s1, s2, s, work[100];
        s1 = task1(work);
        s2 = task2(work);
        s = s1+s2;
        printf("%d\n", s);
        return 0;
    }
Use of Existing Subroutines
Concurrent use of local scratch space: allocate for each thread; trim it down when necessary

    int main () {
        int s1, s2, s, work1[100], work2[100];
        s1 = task1(work1);
        s2 = task2(work2);
        s = s1+s2;
        printf("%d\n", s);
        return 0;
    }
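With threads, the same idea means each thread receives its own scratch buffer. The sketch below is illustrative only: the task_arg struct, the task body, and run_scratch_demo are hypothetical names, and the arithmetic inside task merely stands in for real work on the buffer.

```c
#include <assert.h>
#include <pthread.h>

#define WORK_LEN 100

/* Hypothetical per-thread argument: each thread owns its scratch array. */
typedef struct {
    int id;
    int work[WORK_LEN]; /* private scratch, never shared */
    int result;
} task_arg;

/* Placeholder task: fills its own scratch space, then reduces it. */
static void *task(void *p) {
    task_arg *a = (task_arg *)p;
    int sum = 0;
    for (int k = 0; k < WORK_LEN; k++)
        a->work[k] = (a->id + 1) * k;
    for (int k = 0; k < WORK_LEN; k++)
        sum += a->work[k];
    a->result = sum;
    return NULL;
}

/* Runs two tasks concurrently and returns s1 + s2. */
int run_scratch_demo(void) {
    task_arg args[2];
    pthread_t threads[2];
    for (int t = 0; t < 2; t++) {
        args[t].id = t;
        int rc = pthread_create(&threads[t], NULL, task, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < 2; t++)
        pthread_join(threads[t], NULL);
    return args[0].result + args[1].result;
}
```

Because neither thread can see the other's work array, the tasks cannot overwrite each other's intermediate values, and no lock is needed.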
Use of Existing Subroutines
Stack overflow: caused by recursive call of subroutine

    void foo (int i, int *s) {
        int ia[100];                /* a new copy for every call */
        if ( i > 0)
            foo(i-1, s);
        *s += compute(ia, i);
    }
    int main () {
        int s = 0, i = 2;
        foo(i, &s);
        return 0;
    }
Use of Existing Subroutines
Stack overflow: caused by recursive call of subroutine
Stack contents as the recursion started by foo(i, &s) with i = 2 deepens (top of stack first):

    foo(0):  int ia[100]
    foo(1):  int ia[100]
    foo(2):  int ia[100]
    main:    int i, s

Each level of recursion adds another copy of ia[100] to the stack.
Use of Existing Subroutines
Stack overflow: caused by recursive call of subroutine
Remedy: unfold the recursion

    int main () {
        int i, s = 0, ia[100];
        for ( i = 0; i <= 2; i++ )      /* same calls as foo(2, &s): i = 0, 1, 2 */
            s += compute(ia,i);
        return 0;
    }

Stack: main holds int i, s and a single ia[100]; no further copies are created.
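A quick way to gain confidence in an unfolded loop is to compare it against the recursive original on small inputs. This is a sketch under assumptions: compute() here is a hypothetical stand-in (the real routine is not shown in the slides), and foo_unfolded and recursion_matches_loop are names invented for the example.

```c
#include <assert.h>

#define SCRATCH_LEN 100

/* Hypothetical stand-in for compute(): uses the scratch array, returns 2*i. */
static int compute(int *ia, int i) {
    for (int k = 0; k < SCRATCH_LEN; k++)
        ia[k] = i;
    return ia[0] + ia[SCRATCH_LEN - 1];
}

/* Recursive form from the slides: one ia[] per active call. */
static void foo_recursive(int i, int *s) {
    int ia[SCRATCH_LEN];
    if (i > 0)
        foo_recursive(i - 1, s);
    *s += compute(ia, i);
}

/* Unfolded form: a single ia[] reused by every iteration. */
int foo_unfolded(int n) {
    int ia[SCRATCH_LEN];
    int s = 0;
    for (int i = 0; i <= n; i++)
        s += compute(ia, i);
    return s;
}

/* Returns 1 when both forms agree for depth n. */
int recursion_matches_loop(int n) {
    int s = 0;
    foo_recursive(n, &s);
    return s == foo_unfolded(n);
}
```

The loop bound is inclusive because the recursion evaluates compute for every i from 0 up to the starting value.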
Use of Existing Subroutines
Stack overflow: caused by memory allocation within the subroutine

    #include <pthread.h>

    int ps[3];

    void *foo (void *threadid) {
        long tid = (long)threadid;
        int ia[100];                    /* scratch on this thread's stack */
        ps[tid] = compute(ia, tid);
        pthread_exit(NULL);
    }
    int main () {
        long i; int s;
        pthread_t threads[3];
        for ( i = 0; i < 3; i++ )
            pthread_create(&threads[i], NULL, foo, (void *)i);
        for ( i = 0; i < 3; i++ )
            pthread_join (threads[i], NULL);
        s = ps[0]+ps[1]+ps[2];
        return 0;
    }
Use of Existing Subroutines
Stack overflow: caused by memory allocation within the subroutine
Remedies:
◮ Dynamic allocation with malloc()
◮ Use the Pthreads library call pthread_attr_setstacksize()
◮ Allocate within main() and pass a pointer to the thread
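The pthread_attr_setstacksize() remedy can be sketched as follows. The array size, the stack size, and the names big_stack_worker and run_big_stack_demo are assumptions chosen for illustration; default thread stack sizes differ between platforms, so the numbers would need tuning for a real code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SCRATCH_INTS (4 * 1024 * 1024)      /* assumed: a large automatic int array */
#define STACK_BYTES  (32UL * 1024 * 1024)   /* assumed: room for the array plus margin */

/* Hypothetical worker with a large local array on its own stack. */
static void *big_stack_worker(void *arg) {
    int ia[SCRATCH_INTS];
    long *out = (long *)arg;
    long sum = 0;
    for (size_t k = 0; k < SCRATCH_INTS; k++)
        ia[k] = 1;
    for (size_t k = 0; k < SCRATCH_INTS; k++)
        sum += ia[k];
    *out = sum;
    return NULL;
}

/* Creates one thread with an enlarged stack and returns its result. */
long run_big_stack_demo(void) {
    pthread_attr_t attr;
    pthread_t thread;
    long result = 0;
    int rc;

    rc = pthread_attr_init(&attr);
    assert(rc == 0);
    rc = pthread_attr_setstacksize(&attr, STACK_BYTES); /* request a bigger stack */
    assert(rc == 0);
    rc = pthread_create(&thread, &attr, big_stack_worker, &result);
    assert(rc == 0);
    (void)rc;
    pthread_attr_destroy(&attr); /* the attribute object is no longer needed */
    pthread_join(thread, NULL);
    return result;
}
```

The attribute only affects threads created with it; the main thread's stack limit is set outside the program, for example by the shell.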
Use of Existing Subroutines
Stack overflow: caused by memory allocation within the subroutine
Allocate within main(), pass pointer to thread

    #include <pthread.h>
    #include <stdlib.h>

    int ps[3];
    int *gia;                           /* scratch for all threads, allocated in main() */

    void *foo (void *threadid) {
        long tid = (long)threadid;
        int *ia = &gia[tid*100];        /* this thread's 100-int slice */
        ps[tid] = compute(ia, tid);
        pthread_exit(NULL);
    }
    int main () {
        long i; int s;
        pthread_t threads[3];
        gia = (int *)malloc(sizeof(int)*3*100);
        for ( i = 0; i < 3; i++ )
            pthread_create(&threads[i], NULL, foo, (void *)i);
        for ( i = 0; i < 3; i++ )
            pthread_join (threads[i], NULL);
        s = ps[0]+ps[1]+ps[2];
        free(gia);
        return 0;
    }
Algorithmic Reduction of Critical Sections
Reading and writing are not symmetric
   1. Concurrent reading is fast and safe
   2. Concurrent writing must be AVOIDED
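One way to exploit this asymmetry is to let every thread read shared input freely while writing only to a location it alone owns. The sketch below assumes a simple summation workload; data, partial, sum_slice, and run_partial_sum_demo are hypothetical names, not part of the original code.

```c
#include <assert.h>
#include <pthread.h>

#define N 1000
#define NUM_THREADS 4 /* assumed thread count */

static int data[N];               /* shared, read-only while threads run */
static long partial[NUM_THREADS]; /* one slot per thread, written only by its owner */

/* Each thread reads a strided slice of data and writes one private slot. */
static void *sum_slice(void *arg) {
    long tid = (long)arg;
    long local = 0;
    for (int k = (int)tid; k < N; k += NUM_THREADS)
        local += data[k];
    partial[tid] = local; /* the only shared write, to a slot no one else touches */
    return NULL;
}

/* Fills data with 1..N, sums it in parallel, and combines the partial sums. */
long run_partial_sum_demo(void) {
    pthread_t threads[NUM_THREADS];
    long total = 0;
    for (int k = 0; k < N; k++)
        data[k] = k + 1;
    for (long t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&threads[t], NULL, sum_slice, (void *)t);
        assert(rc == 0);
        (void)rc;
    }
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    for (int t = 0; t < NUM_THREADS; t++)
        total += partial[t]; /* combined sequentially after all joins */
    return total;
}
```

No mutex appears anywhere: the reads overlap safely, and the writes never target the same element.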
Algorithmic Reduction of Critical Sections
Example: T_ML of FMM
[Figure sequence: a grid of boxes with boxes B1, B2, B3, and B4 highlighted in successive steps; the drawings are not recoverable from the extracted text]
◮ Arrangement with B1, B2, B3: need O(N) locks!
◮ Arrangement with B1, B2, B3, B4: need O(3^d - 2^d) locks
◮ Final arrangement with B1, B2, B3: need ZERO locks
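A common way to reach a lock-free translation step is to organize the work by target box, so that each thread gathers contributions into boxes it owns instead of scattering into boxes owned by others. Whether this is exactly the arrangement shown in the figures is an assumption. The sketch uses a one-dimensional row of boxes with scalar "expansions", and REACH, multipole, local_exp, gather_pass, and run_gather_demo are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_BOXES 16
#define NUM_THREADS 4 /* assumed; must divide NUM_BOXES in this sketch */
#define REACH 2       /* hypothetical interaction-list radius */

static double multipole[NUM_BOXES]; /* read-only during the pass */
static double local_exp[NUM_BOXES]; /* each entry written by exactly one thread */

/* Each thread owns a contiguous range of target boxes and gathers into them. */
static void *gather_pass(void *arg) {
    long tid = (long)arg;
    int chunk = NUM_BOXES / NUM_THREADS;
    int first = (int)tid * chunk;
    int last = first + chunk;
    for (int target = first; target < last; target++) {
        double acc = 0.0;
        for (int src = target - REACH; src <= target + REACH; src++) {
            if (src >= 0 && src < NUM_BOXES && src != target)
                acc += multipole[src]; /* concurrent reads are safe */
        }
        local_exp[target] = acc; /* owned write, so no lock is required */
    }
    return NULL;
}

/* Runs the gather pass with unit sources and returns the sum over all targets. */
double run_gather_demo(void) {
    pthread_t threads[NUM_THREADS];
    double total = 0.0;
    for (int b = 0; b < NUM_BOXES; b++)
        multipole[b] = 1.0;
    for (long t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&threads[t], NULL, gather_pass, (void *)t);
        assert(rc == 0);
        (void)rc;
    }
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    for (int b = 0; b < NUM_BOXES; b++)
        total += local_exp[b];
    return total;
}
```

The scatter alternative, in which each source box adds into its neighbors' local_exp entries, would need a lock or atomic update per target, since several sources can hit the same target at once.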
Developing Tips
◮ Save a copy of your working code
◮ Do the sequential version first
◮ Work on one subroutine at a time
◮ Pick the most time-critical subroutine first
Developing Tips
Study software structures
[Figure: FMM operations plotted against time step, organized by tree level l from 0 up to l_max; the staggered level-by-level schedule is not recoverable from the extracted text]
Operations named in the figure:
◮ List 5 & Evaluate Local: T_LL & T_LT
◮ Upward Pass: T_SM, T_MM
◮ List 2: T_ME, T_EE, T_EL
◮ Sum Local & Direct
◮ List 3: T_MT or T_ST
◮ List 4: T_SL or T_ST
◮ List 1: T_ST
Coding Tips
Do NOT use unless necessary
◮ No goto statement in Fortran
◮ Loops in Fortran

      DO ......               DO 10 ......
         .......                 .......
      ENDDO                10 CONTINUE

◮ No break statement

      while (cond) {          while (1) {
          .......                 ......
          cond = foo();           break;
      }                       }

Utilize the compiler instead of blocking it
Testing & Tuning Tips
◮ Makefile
◮ Split Fortran subroutines into individual files
◮ Debugging and profiling tools
◮ Shell script for batch tests