Silicon Graphics Scientific Library Update



  1. Silicon Graphics Scientific Library Update
     Mimi Celis (celis@sgi.com), Tom Elken (telken@sgi.com)
     Supercomputing Applications, Silicon Graphics, Inc.
     41st Cray User Group Conference, Minneapolis, Minnesota

  2. Contents
     • Scientific Libraries available on SGI hardware
     • SCSL Scientific Library (like "SGI", "SCSL" doesn't mean anything ;-) )
     • SCSL Release 1.2
     • Signal Processing in SCSL 1.2
     • Performance
     • Special Solvers in SCSL 1.2
     • Future

  3. Scientific Libraries on SGI
     There are "many" scientific libraries available on SGI platforms today.
     • LibSci on Cray platforms.
     • CHALLENGEcomplib on IRIX platforms (libcomplib.sgimath, libblas).
       – Part of the IDO in IRIX 6.4 and older
       – Part of the IRIX development libraries in IRIX 6.5
       – Version 3.1
     • SCSL on IRIX platforms.
       – Unbundled product
       – Available for IRIX 6.4 and newer
       – Version 1.1

  4. SCSL Scientific Library
     • SCSL is a scientific and math library.
     • SCSL is (initially) available on IRIX 6.4 and 6.5 systems.
     • SCSL will become the standard scientific library on all SGI platforms.
     • SCSL will merge the important functionality of CHALLENGEcomplib and LibSci into one library.
     • SCSL will provide a new library with more functionality and better performance than either library by itself.

  5. SCSL Contents
     • BLAS (Basic Linear Algebra Subprograms)
       – BLAS1: vector-vector operations
       – BLAS2: matrix-vector operations
       – BLAS3: matrix-matrix operations
     • LAPACK
       – Symmetric and nonsymmetric linear systems of equations
       – Symmetric and nonsymmetric eigenvalue/eigenvector problems
       – Singular Value Decomposition
       – Linear Least Squares
     BLAS and LAPACK were developed at the University of Tennessee.
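The three BLAS levels listed above differ in how much arithmetic they do per data element. A minimal sketch of one operation per level follows, written in plain Python purely for illustration. The function names echo the standard BLAS routine families (AXPY, GEMV, GEMM), but these are simplified stand-ins and not the SCSL calling interface.

```python
# Reference (unoptimized) versions of one operation per BLAS level.
# Plain Python lists stand in for Fortran arrays; this is not SCSL code.

def axpy(alpha, x, y):
    """BLAS1-style vector-vector update: returns alpha*x + y (O(n) work)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(a, x):
    """BLAS2-style matrix-vector product: returns A*x (O(n^2) work)."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def gemm(a, b):
    """BLAS3-style matrix-matrix product: returns A*B (O(n^3) work)."""
    cols_b = list(zip(*b))  # columns of B
    return [[sum(aik * bkj for aik, bkj in zip(row, col)) for col in cols_b]
            for row in a]

a = [[1.0, 2.0], [3.0, 4.0]]
print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))  # [12.0, 24.0]
print(gemv(a, [1.0, 1.0]))                  # [3.0, 7.0]
print(gemm(a, a))                           # [[7.0, 10.0], [15.0, 22.0]]
```

BLAS3 routines perform O(n^3) arithmetic on O(n^2) data, which is why tuned libraries generally reach their highest Mflop rates on matrix-matrix operations.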

  6. SCSL Contents (continued)
     • Sparse Linear Equation Solvers
       – Symmetric linear systems of equations
       – Nonsymmetric linear systems of equations (NO pivoting)
     • FFTs
       – multiple one-dimensional mixed radix
       – one-, two- and three-dimensional mixed radix
       – single- and double-precision, for both real and complex data types
     Sparse solvers and FFTs were developed at SGI. (There is no de facto standard API.)
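"Mixed radix" means the transform length need not be a power of two: the length is factored and the transform is split along each factor. The sketch below is a textbook recursive decimation-in-time formulation in plain Python, offered only to illustrate the idea. It is not the SCSL implementation, and the function names are hypothetical.

```python
import cmath

def smallest_factor(n):
    """Return the smallest prime factor of n (n itself if n is prime)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n

def dft(x):
    """Direct O(n^2) discrete Fourier transform, used for prime lengths."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_mixed_radix(x):
    """Recursive decimation-in-time FFT for any length n = r * m."""
    n = len(x)
    r = smallest_factor(n) if n > 1 else 1
    if n <= 1 or r == n:
        return dft(x)  # length 0, 1, or prime: fall back to the direct DFT
    m = n // r
    # Split into r interleaved sub-sequences of length m and transform each.
    subs = [fft_mixed_radix(x[s::r]) for s in range(r)]
    # Recombine with twiddle factors exp(-2*pi*i*s*k/n).
    return [sum(subs[s][k % m] * cmath.exp(-2j * cmath.pi * s * k / n)
                for s in range(r))
            for k in range(n)]

x = [complex(v) for v in range(12)]  # length 12 = 2 * 2 * 3
err = max(abs(a - b) for a, b in zip(fft_mixed_radix(x), dft(x)))
print(err < 1e-9)  # True
```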

  7. How to use SCSL
     • Documentation in the form of man pages:
       – intro_libscsl
       – intro_blas1, _blas2, _blas3
       – intro_fft
       – intro_lapack
       – intro_sparse (soon)
       – these will point you to more detailed man pages
     • Linking:
       – Serial: -lscs
       – OpenMP or libmp parallel: -lscs_mp -mp

  8. SCSL Release 1.2
     SCSL 1.1 is the current release. Release 1.2 will be the next SCSL release.
     Goals for 1.2:
     • Add the missing complib Signal Processing functionality.
     • Provide C language interfaces for the Signal Processing routines.
     • Enhance the ordering techniques in the sparse linear solvers.
     • Performance tuning for the MIPS R12000 Processor.
     • Roll up bug fixes from SCSL 1.1 and complib 3.1.
     SCSL 1.2 will be released with IRIX 6.5.5 (late July 1999).

  9. SCSL Release 1.2 (continued)
     SCSL 1.2 is the follow-on to CHALLENGEcomplib, with some exceptions:
     • SCSL 1.2 will NOT include o32 versions of the libraries.
     • SCSL 1.2 will NOT support LINPACK and EISPACK.
     • SCSL 1.2 will run on all platforms that have n32 or 64 support.
     CHALLENGEcomplib is available to run on older and current platforms, however:
     • There will be no further releases of complib.
     • No complib bug fixes (with rare exceptions).

  10. Signal Processing for SCSL 1.2
      Additions to the FFTs:
      • Multiple 1D routine which calculates an FFT in one dimension for each row of a two-dimensional matrix.
      • 1D, 2D and 3D routines that compute the product of the Fourier Transform of a sequence with the Fourier Transform of a filter (*prod routines in complib).
      • Functions will be introduced to release memory allocated within the FFT routines.
      • C language bindings.
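The two FFT additions above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SCSL or complib interface: the function names are hypothetical, a direct O(n^2) DFT stands in for a tuned FFT, and the final inverse transform (which turns the frequency-domain product into a circular convolution) is an assumption about how such a product would typically be used.

```python
import cmath

def dft(x, sign=-1):
    """Direct DFT; sign=-1 is the forward transform, sign=+1 the unscaled inverse."""
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

def multiple_1d(rows):
    """Apply an independent 1D transform to each row of a 2D array."""
    return [dft(row) for row in rows]

def filter_by_product(seq, filt):
    """Multiply the two transforms pointwise, then transform back."""
    n = len(seq)
    product = [a * b for a, b in zip(dft(seq), dft(filt))]
    return [v / n for v in dft(product, sign=+1)]

# Filtering [1, 2, 3, 4] with [1, 1, 0, 0] adds each sample to its
# circular predecessor, giving approximately [5, 3, 5, 7].
result = filter_by_product([1, 2, 3, 4], [1, 1, 0, 0])
print([round(v.real, 6) for v in result])
```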

  11. Signal Processing for SCSL 1.2 (continued)
      SCSL 1.2 will include convolution and correlation routines.
      • Convolution for Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters, together with Correlations.
      • 1D and 2D convolution and correlation.
      • Single and double precision for real and complex arithmetic.
      • 2D routines will run on multiple processors.
      • API similar to complib API (but not fully compatible).
      • Fortran and C language bindings.
      The two main goals of the Convolution and Correlation library are performance and generality. It provides well-tuned modules usable in most convolution and correlation instances.
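For readers unfamiliar with the operations named above, here is a minimal direct-form sketch of 1D FIR convolution, IIR filtering and correlation for real data. The function names and argument order are hypothetical and do not reflect the SCSL or complib API. Complex correlation would also need a conjugate, which is omitted here.

```python
def convolve(x, h):
    """FIR filtering as full linear convolution: y[n] = sum_k h[k] * x[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y

def iir_filter(b, a, x):
    """Recursive (IIR) filter with a[0] == 1:
    y[n] = sum_k b[k] * x[n - k] - sum_{k >= 1} a[k] * y[n - k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y

def correlate(x, h):
    """Cross-correlation at valid lags: c[lag] = sum_n x[lag + n] * h[n].
    Assumes len(x) >= len(h)."""
    lags = len(x) - len(h) + 1
    return [sum(x[lag + n] * h[n] for n in range(len(h))) for lag in range(lags)]

print(convolve([1, 2, 3], [1, 1]))                  # [1.0, 3.0, 5.0, 3.0]
print(iir_filter([1.0], [1.0, -0.5], [1, 0, 0, 0])) # [1.0, 0.5, 0.25, 0.125]
print(correlate([1, 2, 3, 4], [1, 2]))              # [5, 8, 11]
```

The FIR output depends only on a finite window of inputs, while the IIR filter feeds earlier outputs back in, so a single impulse produces a response that decays but never ends.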

  12. Performance
      • BLAS
      • Fast Fourier Transforms
      • Sparse Solver

  13. BLAS Performance
      [Chart: DGEMM Performance. Mflops (axis 0 to 700) versus matrix size (32 to 2048).]

  14. BLAS Performance
      [Chart: DGEMV Performance. Mflops (axis 0 to 450) versus matrix size (32 to 2048).]

  15. BLAS Performance
      [Chart: DGEMM Parallel Performance. Mflops (axis 0 to 18000) versus number of processors (1 to 32).]

  16. Fast Fourier Transforms (FFT)
      • 1-Dimensional FFT applications:
        – Seismic: many short FFTs (1024-4096 data points)
        – Sonar, radar cross-section, speech recognition and astronomical systems: large 1D FFTs
      • Multi-dimensional FFTs:
        – image processing
        – PDEs from CFD applications
      The following charts show the "effective megaflop rate" based on 5n*log(n) for each complex-to-complex FFT.
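The effective rate quoted above is a nominal operation count divided by elapsed time, regardless of how many operations the algorithm actually performs. A small sketch of the arithmetic follows, assuming the logarithm is base 2 (the usual convention for this count). The timing value in the example is hypothetical and is not taken from the charts.

```python
import math

def effective_mflops(n, seconds, count=1):
    """Effective Mflop rate for `count` complex-to-complex FFTs of length n,
    using the nominal operation count 5 * n * log2(n) per transform."""
    flops = 5.0 * n * math.log2(n) * count
    return flops / seconds / 1.0e6

# Hypothetical timing: 1024 transforms of length 1024 in 0.1 seconds.
# 5 * 1024 * 10 * 1024 = 52,428,800 nominal flops, so about 524 Mflops.
print(round(effective_mflops(1024, 0.1, count=1024), 1))
```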

  17. FFT performance
      [Chart: 1D Complex-complex FFT. Mflops (axis 0 to 600) versus FFT size (1 to 1E+08, log scale). Series: Single Precision, Double Precision.]

  18. FFT performance
      [Chart: Complex-complex Multiple 1D FFT. Mflops (axis 0 to 600) versus FFT size and # of repetitions (10 to 10000). Series: Single Precision, Double Precision.]

  19. FFT performance
      [Chart: 2D Complex-complex FFT. Mflops (axis 0 to 450) versus FFT size of one dimension (10 to 1000). Series: Single Precision, Double Precision.]

  20. FFT parallel performance
      [Chart: Complex-complex Multiple 1D FFT. Mflops (axis 0 to 6000) versus # of CPUs (1 to 100, log scale). Series: 1024-single, 2048-single, 4096-single, 1024-double, 2048-double, 4096-double.]
      "1024-single" means 1024 copies of a size 1024 single precision (32 bits) FFT.

  21. Changes to SGI Sparse Solvers
      • New Matrix Ordering Options
        – Methods 3 and 4 are termed "Extreme2" ordering
      • New default for ordering option
        – Extreme ordering (Method 2) is now the default
      • Out-of-core solver option
        – Was in recent SCSL version, but now is documented
        – Single-processor only
        – Striped file system useful
        – Simple interface and performs well

  22. New ordering options
      3. Multiple Nested Dissection (ND) orders
         • default is OMP_NUM_THREADS orders
         • repeatable quality
      4. Multiple ND orders using feedback file information
         • default is 2 x OMP_NUM_THREADS orders
         • feedback file is at most 5KB, up to 200 records
         • binary feedback file
         • a solver that learns
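The idea of generating several candidate orderings and keeping the best one can be illustrated with a toy example. The sketch below is an assumption-laden stand-in: it scores hand-written candidate orderings by the fill-in they create during symbolic elimination and keeps the one with the least fill. The slides do not state which quality measure the SGI solver uses, and real nested dissection orderings are not generated here. All names are hypothetical.

```python
def fill_in(adj, order):
    """Count the fill edges created by eliminating vertices in `order`.
    `adj` maps each vertex to the set of its neighbours (symmetric graph)."""
    g = {v: set(nb) for v, nb in adj.items()}
    fill = 0
    for v in order:
        nbrs = list(g.pop(v))
        for u in nbrs:
            g[u].discard(v)
        # Eliminating v makes its remaining neighbours pairwise adjacent.
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in g[a]:
                    g[a].add(b)
                    g[b].add(a)
                    fill += 1
    return fill

def best_of_orders(adj, candidates):
    """Return the candidate ordering that produces the least fill."""
    return min(candidates, key=lambda order: fill_in(adj, order))

# Star graph: vertex 0 is connected to vertices 1..4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(fill_in(star, [0, 1, 2, 3, 4]))  # 6: eliminating the hub first
print(fill_in(star, [1, 2, 3, 4, 0]))  # 0: eliminating the leaves first
print(best_of_orders(star, [[0, 1, 2, 3, 4], [1, 2, 3, 4, 0]]))
```

Because each candidate can be evaluated independently, producing one candidate per thread (as the OMP_NUM_THREADS default on the slide suggests) is a natural fit for this kind of search.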

  23. Choosing a default method
      • Should the default be best for which size model?
      • Decided to optimize for medium or larger problems (at least 5000 equations).
      • Extreme2 (3) is about 3% faster than Extreme, but is new technology, so we use Method 2 as the new default.
      [Chart: Total Time for Nine models. Time (axis 0 to 3500) versus Ordering Method (1 to 4).]

  24. Out-of-core (OOC) Option
      • Performance 10-40% slower than extreme (Method 2) ordering in-core; 15% in this case.
      • But faster than AMF (1).
      • This used 4-way striping on the file system (140 MB/s on some reads).
      • Allowed 128MB in-core for factor storage.
      [Chart: Total Time for Nine models (1-CPU runs). Time (axis 0 to 1800) versus Ordering Method / Factor Storage (1, 2, 3, 4, OOC).]

  25. Scalability: Factorization Mflops
      • Amdahl's law is responsible for much of the lack of scaling in the previous chart.
      • Over 11 Gflops achieved on gismondi on 48 CPUs.
      • More can be done to improve memory placement.
      • These results used DSM_ROUND_ROBIN data placement.
      [Chart: Factorization Mflops (axis 0 to 3500) versus # of CPUs (0 to 10). Series: gismondi, fleet10, th2, 280Kdof.]
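Amdahl's law bounds the speedup of a run in which only part of the work is parallel: with parallel fraction p on N CPUs, speedup = 1 / ((1 - p) + p / N). A short sketch follows. The 95% parallel fraction in the example is a hypothetical value chosen for illustration, not a measurement from these slides.

```python
def amdahl_speedup(parallel_fraction, cpus):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)

# Hypothetical case: 95% of the elapsed time parallelizes perfectly.
for cpus in (1, 2, 4, 8):
    print(cpus, round(amdahl_speedup(0.95, cpus), 2))
```

Under that assumption the speedup on 8 CPUs is a little under 6, and it can never exceed 1 / (1 - p) = 20 however many CPUs are added. Serial steps such as preprocessing limit whole-run scaling in the same way, even when the factorization itself scales well.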

  26. PSLDLT: Scalability to 8 CPUs
      • Measured: Elapsed time for 1 preprocess, 2 factorizations, 2 solves.
      • # floating point ops to factor & preprocess time:
                       Gflop   secs.
        – fleet10      383     27
        – gismondi     133     3
        – th2          34      18
        – 280Kdof      18      15
      [Chart: Speedup (axis 0 to 7) versus # of CPUs (0 to 10). Series: fleet10, gismondi, th2, 280Kdof.]

  27. Summary
      • SCSL 1.2 improvements:
        – FFTs have new interface
        – Add the missing complib Signal Processing functionality.
        – Provide C language interfaces for the Signal Processing routines.
        – Enhance the ordering techniques in the sparse linear solvers.
        – Performance tuning for the MIPS R12000 Processor.
        – Roll up bug fixes from SCSL 1.1 and complib 3.1.
      • Comments, questions:
        – Mimi Celis; celis@sgi.com
        – Tom Elken; telken@sgi.com
