Application of Fortran Application of Fortran Pthreads on Linear - PowerPoint PPT Presentation

Application of Fortran Application of Fortran Pthreads on Linear Algebra Pthreads on Linear Algebra and Scientific Computing and Scientific Computing Clay P. Breshears Clay P. Breshears Henry A. Gabb Gabb Henry A. Mark Fahey Fahey Mark USAE Waterways Experiment Station USAE Waterways Experiment Station Major Shared Resource Center Major Shared Resource Center

Acknowledgements Acknowledgements This work was funded by the DoD DoD High Performance Computing High Performance Computing This work was funded by the Modernization Program CEWES Major Shared Resource Center Modernization Program CEWES Major Shared Resource Center through Programming Environment and Training (PET), through Programming Environment and Training (PET), Contract Number: DAHC 94-96-C0002, Nichols Research Contract Number: DAHC 94-96-C0002, Nichols Research Corporation. Corporation.

Introduction Introduction ¥ Last year: Introduced F90 Pthreads API Last year: Introduced F90 Pthreads API ¥ ¥ This year: Are they really useful? This year: Are they really useful? ¥ Ð How easy is programming? Ð How easy is programming? Ð Can we get decent parallel performance? Ð Can we get decent parallel performance? Ð Are there algorithmic considerations? Ð Are there algorithmic considerations? Ð Are there external considerations? Are there external considerations? Ð ¥ Must be within a userÕs attention span Must be within a userÕs attention span ¥

Pthreads Pthreads ¥ POSIX standard for thread functions POSIX standard for thread functions ¥ Ð Thread management Ð Thread management Ð Mutual exclusion Mutual exclusion Ð Ð Conditional variables Ð Conditional variables Ð Attributes Ð Attributes ¥ Only defined in C Only defined in C ¥

F90 Pthreads API F90 Pthreads API ¥ F90 subroutines interface C functions F90 subroutines interface C functions ¥ ¥ Ô ¥ Ôf fÕ prefix to name Õ prefix to name Ð fpthread fpthread_create _create Ð Ð Similar to PVM, MPI Similar to PVM, MPI Ð Ð Error code as final argument Ð Error code as final argument ¥ F90 module and C wrapper routines F90 module and C wrapper routines ¥

Problems Considered Problems Considered ¥ Matrix Multiplication Matrix Multiplication ¥ ¥ Direct Solution of Linear Systems Direct Solution of Linear Systems ¥ Ð Gaussian Ð Gaussian Elimination and Back Substitution Elimination and Back Substitution ¥ Command, Control, Communication, and Command, Control, Communication, and ¥ Intelligence (C3I) Benchmarks Intelligence (C3I) Benchmarks Ð Map-Image Correlation Map-Image Correlation Ð Ð Terrain Masking Terrain Masking Ð

Project Goals Project Goals ¥ Apply threaded programming techniques to Apply threaded programming techniques to ¥ scientific computations scientific computations ¥ Demonstrate and exploit concurrency in numeric ¥ Demonstrate and exploit concurrency in numeric codes codes ¥ Achieve execution speed up with multiple Achieve execution speed up with multiple ¥ threads/processors threads/processors

NOT a Project Goal... NOT a Project Goal... To produce the fastest executing To produce the fastest executing versions of codes and algorithms versions of codes and algorithms examined examined Our Emphasis: Our Emphasis: ¥How easy is the method to use? How easy is the method to use? ¥ ¥How much speed up might be How much speed up might be ¥ expected? expected?

Matrix Multiplication Matrix Multiplication ¥ Sparse matrix-matrix multiplication Sparse matrix-matrix multiplication ¥ ¥ Row-major linked list data structure Row-major linked list data structure ¥ Ð Ð Direct Methods for Sparse Matrices by Duff, Direct Methods for Sparse Matrices by Duff, Erisman Erisman and Reid and Reid ¥ IKJ loop structure ¥ IKJ loop structure Ð C(I, :) = C(I, :) + A(I, K) * B(K, :) C(I, :) = C(I, :) + A(I, K) * B(K, :) Ð Ð I I th row of C; row of C; K K th row of B Ð th th row of B Ð Scalar from A (accessed across Scalar from A (accessed across I I th row) Ð th row)

Thread Algorithm Thread Algorithm While more rows to process While more rows to process get next row number (lock shared counter) get next row number (lock shared counter) for each element in row of A do for each element in row of A do copy appropriate row of B to local vector copy appropriate row of B to local vector multiply vector by scalar from A multiply vector by scalar from A add results to local ÒsummationÓ vector add results to local ÒsummationÓ vector add ÒsummationÓ vector to row of C add ÒsummationÓ vector to row of C lock data structure to prevent overwriting lock data structure to prevent overwriting

1000x1000 Sparse Matrix Multiplication (< 10K NZ) 1000x1000 Sparse Matrix Multiplication (< 10K NZ) on SGI/Cray Origin 2000 (IRIX 6.4) on SGI/Cray Origin 2000 (IRIX 6.4) # of Threads Time (seconds) 1 0.259 2 0.261 4 0.261 8 0.264 16 0.270 32 0.301 64 0.324 128 0.364

Dense Matrices Dense Matrices ¥ Not enough work in sparse case Not enough work in sparse case ¥ ¥ IKJ loop structure IKJ loop structure ¥ Ð row-major access Ð row-major access ¥ JKI loop structure JKI loop structure ¥ Ð C(:, J) = C(:, J) + A(:, K) * B(K, J) Ð C(:, J) = C(:, J) + A(:, K) * B(K, J) Ð J Ð J th th column of C; column of C; K K th th column of A column of A Ð Scalar from B (accessed down Ð Scalar from B (accessed down J J th th column) column)

1000x1000 Dense Matrix Multiplication 1000x1000 Dense Matrix Multiplication on SGI/Cray Origin 2000 (IRIX 6.4) on SGI/Cray Origin 2000 (IRIX 6.4) # of Threads IKJ Time JKI Time (se conds) (seconds) 1 75.5 29.8 2 42.4 20.2 4 22.4 11.2 8 11.9 6.1 16 6.3 3.4 32 3.6 2.0 64 3.2 1.9 128 3.0 2.3

Solution of Linear Systems Solution of Linear Systems ¥ Gaussian Gaussian Elimination with Back Substitution Elimination with Back Substitution ¥ Ð simple method with row updates Ð simple method with row updates Ð diminishing amounts of work diminishing amounts of work Ð Q: How do we distribute work evenly among threads Q: How do we distribute work evenly among threads throughout computation? throughout computation? A: Cyclic distribution of rows A: Cyclic distribution of rows

Cyclic Row Partitioning Cyclic Row Partitioning A x b pivot 0 0 0 0 0 0 0 index

Thread Algorithm Thread Algorithm do i = 1, NUMROWS indx = NUMROWS if (i mod myid = = 0) then while (indx > 0) save pivot, row(i) while (indx mod myid = = 0) else fpthread_cond_wait wait for pivot to be saved compute x(indx) endif indx = indx - 1 copy row(i), pivot to local row, fpthread_cond_broadcast pivot end while do j = i+1, NUMROWS if (j mod myid = = 0) then compute row multiplier update row(j) with local row endif end do end do

Cyclic Row Partitioning Cyclic Row Partitioning A x b pivot 0 0 0 0 0 0 0 index

But What AboutÉ But What AboutÉ ¥ Column-Major ordering is Fortran standard Column-Major ordering is Fortran standard ¥ Ð Helped with Matrix Multiply code Ð Helped with Matrix Multiply code ¥ Modify threaded algorithm Modify threaded algorithm ¥ Ð Transpose A matrix after input Ð Transpose A matrix after input ¥ library routine is threaded ¥ library routine is threaded Ð Swap A(i,j) indices for A(j,i) Swap A(i,j) indices for A(j,i) Ð

Timing Results Timing Results F90 threaded Gaussian Elimination with back substitution 2000x2000 system of equations on SGI/Cray Origin2000 # of Threads Row-Major Transpos e Column-Major 1 6 72 16 2 2 7 35 23 2 4 5 56 11 5 8 10 30 10 0 1 6 16 42 14 7 3 2 16 67 13 1

But What AboutÉ But What AboutÉ ¥ ...Memory contention of threads if matrices ...Memory contention of threads if matrices ¥ stored on a single processor? stored on a single processor? Ð Used Used _DSM_ROUND_ROBIN _DSM_ROUND_ROBIN to no significant effect to no significant effect Ð Ð Used Used A(CYCLIC,*) A(CYCLIC,*) distribution to no significant effect distribution to no significant effect Ð ¥ ...Distribution of threads to processors? ...Distribution of threads to processors? ¥

pthread_ _setconcurrency setconcurrency pthread ¥ SGI extension to Pthreads SGI extension to Pthreads ¥ ¥ Wrote F90 wrapper Wrote F90 wrapper ¥ ¥ Set to number of threads executing Set to number of threads executing ¥

Added Timing Results Added Timing Results F90 threaded Gaussian Elimination with back substitution 2000x2000 system of equations on SGI/Cray Origin2000 # o f Threads Row-Major Transpose Transpose Column-Major fp_setconcurrency 1 67 2 1 62 2 73 5 2 32 1 25 4 55 6 1 15 6 8 8 10 30 1 00 2 6 16 16 42 1 47 5 3 32 16 67 1 31 6 4

10K System of Equations 10K System of Equations ¥ Transpose algorithm Transpose algorithm ¥ Ð 32 threads Ð 32 threads Ð 10144 seconds on Origin2000 10144 seconds on Origin2000 Ð ¥ Transpose with Transpose with setconcurrency setconcurrency ¥ Ð 32 threads 32 threads Ð Ð 8608 seconds on Origin2000 Ð 8608 seconds on Origin2000

Application of Fortran Application of Fortran Pthreads on Linear - PowerPoint PPT Presentation

Application of Fortran Application of Fortran Pthreads on Linear Algebra Pthreads on Linear Algebra and Scientific Computing and Scientific Computing Clay P. Breshears Clay P. Breshears Henry A. Gabb Gabb Henry A. Mark Fahey Fahey Mark

Table of Contents 1. Pthreads Overview 1. What is a Thread? 2. What are Pthreads? 3. Why Pthreads?

pthreads pthreads (POSIX threads) is a library for doing threading pthreads Can

1954 1957 FORTRAN I FORTRAN II FORTRAN III FORTRAN 1957 end-1958 october 1956 november

Programming Shared-memory Platforms with Pthreads Xu Liu Derived from John Mellor-Crummeys

PThreads Thanks to LLNL for their tutorial from which these slides are derived

Pthreads: POSIX Threads Pthreads is a standard set of C library functions for multithreaded

Concurrency with pthreads David Hovemeyer 22 November 2019 David Hovemeyer Computer Systems

Introduction to FORTRAN A Brief Summary of GNU FORTRAN Ashik Iqubal Department of Physics

The Fortran 90 programming language Fortran has evolved since the early days of computing

Getting started with Fortran branches loops 1 2 Why learn Fortran? Well suited for

An introduction to Fortran Daniel Price School of Physics and Astronomy Monash University

FORTRAN 04 February 1999; CS655 FORTRAN Concepts/Contributions Binding time Separate

Evolution of Fortran standards over the few A brief overview of this course decades The 1 st

AMath 483/583 Lecture 8 Notes: This lecture: Fortran subroutines and functions Arrays

Getting along and working together Fortran-Python Interoperability Jacob Wilkins Fortran AND

Fortran 90 Arrays Fortran 90 Arrays Program testing can be used to show the presence of bugs

Incremental Migration of C and Fortran Applications to GPGPU using HMPP Peppher 2011

Katahdin A Programming Language Where the Syntax and Semantics Are Mutable at Runtime Chris

Atmospheric Modeling (5 ECTS) Lectures: Mon 10.15-12.00, Physicum D105 Tue 10.15-12.00,

Model Interfacing Jedi Academy IV, Monterey CA 26 th February 2020 Outline Introduction

A New Vision for Coarray Fortran John Mellor-Crummey, Laksono Adhianto William Scherer III

Review of OpenMP Russian-German School on High-Performance Computer Systems, 27 th June - 6 th

!"#$%&'()*(+,-.'%"-% /0-,+".%1,-#2,#'34% 56789%#2'3:%1'.:2('%

f2py: Pythons interface to the world of number crunching CodeJam08 Moritz Helias

Application of Fortran Application of Fortran Pthreads on Linear - PowerPoint PPT Presentation

Application of Fortran Application of Fortran Pthreads on Linear Algebra Pthreads on Linear Algebra and Scientific Computing and Scientific Computing Clay P. Breshears Clay P. Breshears Henry A. Gabb Gabb Henry A. Mark Fahey Fahey Mark

Table of Contents 1. Pthreads Overview 1. What is a Thread? 2. What are Pthreads? 3. Why Pthreads?

pthreads pthreads (POSIX threads) is a library for doing threading pthreads Can

1954 1957 FORTRAN I FORTRAN II FORTRAN III FORTRAN 1957 end-1958 october 1956 november

Programming Shared-memory Platforms with Pthreads Xu Liu Derived from John Mellor-Crummeys

PThreads Thanks to LLNL for their tutorial from which these slides are derived

Pthreads: POSIX Threads Pthreads is a standard set of C library functions for multithreaded

Concurrency with pthreads David Hovemeyer 22 November 2019 David Hovemeyer Computer Systems

Introduction to FORTRAN A Brief Summary of GNU FORTRAN Ashik Iqubal Department of Physics

The Fortran 90 programming language Fortran has evolved since the early days of computing

Getting started with Fortran branches loops 1 2 Why learn Fortran? Well suited for

An introduction to Fortran Daniel Price School of Physics and Astronomy Monash University

FORTRAN 04 February 1999; CS655 FORTRAN Concepts/Contributions Binding time Separate

Evolution of Fortran standards over the few A brief overview of this course decades The 1 st

AMath 483/583 Lecture 8 Notes: This lecture: Fortran subroutines and functions Arrays

Getting along and working together Fortran-Python Interoperability Jacob Wilkins Fortran AND

Fortran 90 Arrays Fortran 90 Arrays Program testing can be used to show the presence of bugs

Incremental Migration of C and Fortran Applications to GPGPU using HMPP Peppher 2011

Katahdin A Programming Language Where the Syntax and Semantics Are Mutable at Runtime Chris

Atmospheric Modeling (5 ECTS) Lectures: Mon 10.15-12.00, Physicum D105 Tue 10.15-12.00,

Model Interfacing Jedi Academy IV, Monterey CA 26 th February 2020 Outline Introduction

A New Vision for Coarray Fortran John Mellor-Crummey, Laksono Adhianto William Scherer III

Review of OpenMP Russian-German School on High-Performance Computer Systems, 27 th June - 6 th

!&quot;#$%&amp;'()*(+,-.'%&quot;-% /0-,+&quot;.%1,-#2,#'34% 56789%#2'3:%1'.:2('%

f2py: Pythons interface to the world of number crunching CodeJam08 Moritz Helias

!"#$%&'()*(+,-.'%"-% /0-,+".%1,-#2,#'34% 56789%#2'3:%1'.:2('%