Application of Fortran Application of Fortran Pthreads on Linear Algebra Pthreads on Linear Algebra and Scientific Computing and Scientific Computing Clay P. Breshears Clay P. Breshears Henry A. Gabb Gabb Henry A. Mark Fahey Fahey Mark USAE Waterways Experiment Station USAE Waterways Experiment Station Major Shared Resource Center Major Shared Resource Center
Acknowledgements Acknowledgements This work was funded by the DoD DoD High Performance Computing High Performance Computing This work was funded by the Modernization Program CEWES Major Shared Resource Center Modernization Program CEWES Major Shared Resource Center through Programming Environment and Training (PET), through Programming Environment and Training (PET), Contract Number: DAHC 94-96-C0002, Nichols Research Contract Number: DAHC 94-96-C0002, Nichols Research Corporation. Corporation.
Introduction Introduction ¥ Last year: Introduced F90 Pthreads API Last year: Introduced F90 Pthreads API ¥ ¥ This year: Are they really useful? This year: Are they really useful? ¥ Ð How easy is programming? Ð How easy is programming? Ð Can we get decent parallel performance? Ð Can we get decent parallel performance? Ð Are there algorithmic considerations? Ð Are there algorithmic considerations? Ð Are there external considerations? Are there external considerations? Ð ¥ Must be within a userÕs attention span Must be within a userÕs attention span ¥
Pthreads Pthreads ¥ POSIX standard for thread functions POSIX standard for thread functions ¥ Ð Thread management Ð Thread management Ð Mutual exclusion Mutual exclusion Ð Ð Conditional variables Ð Conditional variables Ð Attributes Ð Attributes ¥ Only defined in C Only defined in C ¥
F90 Pthreads API F90 Pthreads API ¥ F90 subroutines interface C functions F90 subroutines interface C functions ¥ ¥ Ô ¥ Ôf fÕ prefix to name Õ prefix to name Ð fpthread fpthread_create _create Ð Ð Similar to PVM, MPI Similar to PVM, MPI Ð Ð Error code as final argument Ð Error code as final argument ¥ F90 module and C wrapper routines F90 module and C wrapper routines ¥
Problems Considered Problems Considered ¥ Matrix Multiplication Matrix Multiplication ¥ ¥ Direct Solution of Linear Systems Direct Solution of Linear Systems ¥ Ð Gaussian Ð Gaussian Elimination and Back Substitution Elimination and Back Substitution ¥ Command, Control, Communication, and Command, Control, Communication, and ¥ Intelligence (C3I) Benchmarks Intelligence (C3I) Benchmarks Ð Map-Image Correlation Map-Image Correlation Ð Ð Terrain Masking Terrain Masking Ð
Project Goals Project Goals ¥ Apply threaded programming techniques to Apply threaded programming techniques to ¥ scientific computations scientific computations ¥ Demonstrate and exploit concurrency in numeric ¥ Demonstrate and exploit concurrency in numeric codes codes ¥ Achieve execution speed up with multiple Achieve execution speed up with multiple ¥ threads/processors threads/processors
NOT a Project Goal... NOT a Project Goal... To produce the fastest executing To produce the fastest executing versions of codes and algorithms versions of codes and algorithms examined examined Our Emphasis: Our Emphasis: ¥How easy is the method to use? How easy is the method to use? ¥ ¥How much speed up might be How much speed up might be ¥ expected? expected?
Matrix Multiplication Matrix Multiplication ¥ Sparse matrix-matrix multiplication Sparse matrix-matrix multiplication ¥ ¥ Row-major linked list data structure Row-major linked list data structure ¥ Ð Ð Direct Methods for Sparse Matrices by Duff, Direct Methods for Sparse Matrices by Duff, Erisman Erisman and Reid and Reid ¥ IKJ loop structure ¥ IKJ loop structure Ð C(I, :) = C(I, :) + A(I, K) * B(K, :) C(I, :) = C(I, :) + A(I, K) * B(K, :) Ð Ð I I th row of C; row of C; K K th row of B Ð th th row of B Ð Scalar from A (accessed across Scalar from A (accessed across I I th row) Ð th row)
Thread Algorithm Thread Algorithm While more rows to process While more rows to process get next row number (lock shared counter) get next row number (lock shared counter) for each element in row of A do for each element in row of A do copy appropriate row of B to local vector copy appropriate row of B to local vector multiply vector by scalar from A multiply vector by scalar from A add results to local ÒsummationÓ vector add results to local ÒsummationÓ vector add ÒsummationÓ vector to row of C add ÒsummationÓ vector to row of C lock data structure to prevent overwriting lock data structure to prevent overwriting
1000x1000 Sparse Matrix Multiplication (< 10K NZ) 1000x1000 Sparse Matrix Multiplication (< 10K NZ) on SGI/Cray Origin 2000 (IRIX 6.4) on SGI/Cray Origin 2000 (IRIX 6.4) # of Threads Time (seconds) 1 0.259 2 0.261 4 0.261 8 0.264 16 0.270 32 0.301 64 0.324 128 0.364
Dense Matrices Dense Matrices ¥ Not enough work in sparse case Not enough work in sparse case ¥ ¥ IKJ loop structure IKJ loop structure ¥ Ð row-major access Ð row-major access ¥ JKI loop structure JKI loop structure ¥ Ð C(:, J) = C(:, J) + A(:, K) * B(K, J) Ð C(:, J) = C(:, J) + A(:, K) * B(K, J) Ð J Ð J th th column of C; column of C; K K th th column of A column of A Ð Scalar from B (accessed down Ð Scalar from B (accessed down J J th th column) column)
1000x1000 Dense Matrix Multiplication 1000x1000 Dense Matrix Multiplication on SGI/Cray Origin 2000 (IRIX 6.4) on SGI/Cray Origin 2000 (IRIX 6.4) # of Threads IKJ Time JKI Time (se conds) (seconds) 1 75.5 29.8 2 42.4 20.2 4 22.4 11.2 8 11.9 6.1 16 6.3 3.4 32 3.6 2.0 64 3.2 1.9 128 3.0 2.3
Solution of Linear Systems Solution of Linear Systems ¥ Gaussian Gaussian Elimination with Back Substitution Elimination with Back Substitution ¥ Ð simple method with row updates Ð simple method with row updates Ð diminishing amounts of work diminishing amounts of work Ð Q: How do we distribute work evenly among threads Q: How do we distribute work evenly among threads throughout computation? throughout computation? A: Cyclic distribution of rows A: Cyclic distribution of rows
Cyclic Row Partitioning Cyclic Row Partitioning A x b pivot 0 0 0 0 0 0 0 index
Thread Algorithm Thread Algorithm do i = 1, NUMROWS indx = NUMROWS if (i mod myid = = 0) then while (indx > 0) save pivot, row(i) while (indx mod myid = = 0) else fpthread_cond_wait wait for pivot to be saved compute x(indx) endif indx = indx - 1 copy row(i), pivot to local row, fpthread_cond_broadcast pivot end while do j = i+1, NUMROWS if (j mod myid = = 0) then compute row multiplier update row(j) with local row endif end do end do
Cyclic Row Partitioning Cyclic Row Partitioning A x b pivot 0 0 0 0 0 0 0 index
But What AboutÉ But What AboutÉ ¥ Column-Major ordering is Fortran standard Column-Major ordering is Fortran standard ¥ Ð Helped with Matrix Multiply code Ð Helped with Matrix Multiply code ¥ Modify threaded algorithm Modify threaded algorithm ¥ Ð Transpose A matrix after input Ð Transpose A matrix after input ¥ library routine is threaded ¥ library routine is threaded Ð Swap A(i,j) indices for A(j,i) Swap A(i,j) indices for A(j,i) Ð
Timing Results Timing Results F90 threaded Gaussian Elimination with back substitution 2000x2000 system of equations on SGI/Cray Origin2000 # of Threads Row-Major Transpos e Column-Major 1 6 72 16 2 2 7 35 23 2 4 5 56 11 5 8 10 30 10 0 1 6 16 42 14 7 3 2 16 67 13 1
But What AboutÉ But What AboutÉ ¥ ...Memory contention of threads if matrices ...Memory contention of threads if matrices ¥ stored on a single processor? stored on a single processor? Ð Used Used _DSM_ROUND_ROBIN _DSM_ROUND_ROBIN to no significant effect to no significant effect Ð Ð Used Used A(CYCLIC,*) A(CYCLIC,*) distribution to no significant effect distribution to no significant effect Ð ¥ ...Distribution of threads to processors? ...Distribution of threads to processors? ¥
pthread_ _setconcurrency setconcurrency pthread ¥ SGI extension to Pthreads SGI extension to Pthreads ¥ ¥ Wrote F90 wrapper Wrote F90 wrapper ¥ ¥ Set to number of threads executing Set to number of threads executing ¥
Added Timing Results Added Timing Results F90 threaded Gaussian Elimination with back substitution 2000x2000 system of equations on SGI/Cray Origin2000 # o f Threads Row-Major Transpose Transpose Column-Major fp_setconcurrency 1 67 2 1 62 2 73 5 2 32 1 25 4 55 6 1 15 6 8 8 10 30 1 00 2 6 16 16 42 1 47 5 3 32 16 67 1 31 6 4
10K System of Equations 10K System of Equations ¥ Transpose algorithm Transpose algorithm ¥ Ð 32 threads Ð 32 threads Ð 10144 seconds on Origin2000 10144 seconds on Origin2000 Ð ¥ Transpose with Transpose with setconcurrency setconcurrency ¥ Ð 32 threads 32 threads Ð Ð 8608 seconds on Origin2000 Ð 8608 seconds on Origin2000
Recommend
More recommend