Dense Triangular Solvers on Multicore Clusters using UPC

Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño
Computer Architecture Group, University of A Coruña (Spain)
{jgonzalezd,mariam,taboada,juan}@udc.es

International Conference on Computational Science, ICCS 2011
Outline
1. Introduction
2. BLAS2 Triangular Solver
3. BLAS3 Triangular Solver
4. Experimental Evaluation
5. Conclusions
1. Introduction
UPC: a Suitable Alternative for HPC in the Multi-core Era

Programming models:
- Traditionally: shared/distributed memory programming models
- Challenge: hybrid memory architectures
- PGAS (Partitioned Global Address Space)

PGAS languages:
- UPC -> C
- Titanium -> Java
- Co-Array Fortran -> Fortran

UPC compilers:
- Berkeley UPC
- GCC UPC (Intrepid)
- Michigan Tech UPC
- HP, Cray and IBM UPC compilers
Studied Numerical Operations

BLAS libraries:
- Basic Linear Algebra Subprograms
- Specification of a set of numerical functions
- Widely used by scientists and engineers
- SparseBLAS and PBLAS (Parallel BLAS)

Development of UPCBLAS:
- gemv: matrix-vector product (y = α*A*x + β*y)
- gemm: matrix-matrix product (C = α*A*B + β*C)

Studied routines:
- trsv: BLAS2 triangular solver (M*x = b)
- trsm: BLAS3 triangular solver (M*X = B)
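For concreteness, a minimal sequential sketch of the two studied operations using standard CBLAS calls (plain CBLAS, not the UPCBLAS interface; the wrapper function and its parameters are illustrative):

```c
/* Plain-CBLAS sketch of the two studied operations for a lower triangular M:
 * trsv solves M*x = b (one right-hand side), trsm solves M*X = B (many).
 * Both overwrite the right-hand side with the solution, so it is copied
 * first.  This is not the UPCBLAS API, only the underlying BLAS kernels. */
#include <cblas.h>
#include <string.h>

void sequential_solves(int n, int nrhs,
                       const double *M,  /* n x n lower triangular, row-major */
                       const double *b,  /* length-n right-hand side          */
                       const double *B,  /* n x nrhs right-hand sides         */
                       double *x,        /* out: solution of M*x = b          */
                       double *X)        /* out: solution of M*X = B          */
{
    /* trsv: BLAS2 triangular solve */
    memcpy(x, b, (size_t)n * sizeof(double));
    cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                n, M, n, x, 1);

    /* trsm: BLAS3 triangular solve with nrhs right-hand sides */
    memcpy(X, B, (size_t)n * nrhs * sizeof(double));
    cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasNonUnit,
                n, nrhs, 1.0, M, n, X, nrhs);
}
```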
2. BLAS2 Triangular Solver
Parallel BLAS2 Triangular Solver (M*x = b) (I)

Example:
- Matrix 8x8
- 2 threads
- 2 rows per block

Types of blocks M_ij:
- i < j: zero block
- i = j: triangular block
- i > j: square block
Parallel BLAS2 Triangular Solver (M*x = b) (II)

THREAD 0 -> trsv(M_11, x_1, b_1)
Parallel BLAS2 Triangular Solver (M*x = b) (III)

THREAD 0 -> gemv(M_31, x_1, b_3)
THREAD 1 -> gemv(M_21, x_1, b_2) -> trsv(M_22, x_2, b_2) -> gemv(M_41, x_1, b_4)
Parallel BLAS2 Triangular Solver (M*x = b) (IV)

THREAD 0 -> gemv(M_32, x_2, b_3) -> trsv(M_33, x_3, b_3)
THREAD 1 -> gemv(M_42, x_2, b_4)
Parallel BLAS2 Triangular Solver (M*x = b) (V)

THREAD 1 -> gemv(M_43, x_3, b_4) -> trsv(M_44, x_4, b_4)
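Putting the previous steps together, a simplified sketch of the blocked forward substitution (plain C with sequential CBLAS kernels and no UPC data distribution; in the actual routine each row block is owned by one UPC thread in block-cyclic fashion, and a synchronization separates the solve of block k from the updates that use x_k):

```c
/* Simplified sketch of the blocked forward substitution behind the parallel
 * trsv.  M is n x n lower triangular (row-major), split into row/column
 * blocks of size bs; x starts as a copy of b and is overwritten with the
 * solution.  In the UPC version each row block is owned by one thread
 * (block-cyclic by rows), so the gemv updates of a given step run in
 * parallel on their owner threads. */
#include <cblas.h>

void blocked_trsv(int n, int bs, const double *M, double *x)
{
    int nb = (n + bs - 1) / bs;                  /* number of row blocks     */
    for (int k = 0; k < nb; k++) {
        int ks = k * bs;
        int kn = (ks + bs <= n) ? bs : n - ks;   /* size of block k          */

        /* Solve the diagonal block: M_kk * x_k = b_k (owner thread only). */
        cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                    kn, &M[(size_t)ks * n + ks], n, &x[ks], 1);

        /* Update the right-hand sides below: b_i -= M_ik * x_k.
         * Each thread updates only the row blocks i it owns. */
        for (int i = k + 1; i < nb; i++) {
            int is = i * bs;
            int in = (is + bs <= n) ? bs : n - is;
            cblas_dgemv(CblasRowMajor, CblasNoTrans, in, kn,
                        -1.0, &M[(size_t)is * n + ks], n,
                        &x[ks], 1, 1.0, &x[is], 1);
        }
    }
}
```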
Parallel BLAS2 Triangular Solver (M*x = b) (and VI)

Impact of the block size: the more blocks the matrix is divided into, the more...
- computations can be performed simultaneously (↑ performance)
- synchronizations are needed (↓ performance)

The best block size is automatically determined (see the paper for details).
3. BLAS3 Triangular Solver
Parallel BLAS3 Triangular Solver (M*X = B) (I)

Studied distributions:
- Triangular and dense matrices distributed by rows -> similar approach to BLAS2, but replacing the sequential kernels: gemv -> gemm, trsv -> trsm
- Dense matrices distributed by columns
- Triangular and dense matrices with a 2D distribution (multicore-aware)
Parallel BLAS3 Triangular Solver (M*X = B) (II)

Dense matrices distributed by columns
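The idea behind this distribution: the column blocks of X are independent of each other, so each UPC thread can solve its own block with a single sequential trsm and no intermediate synchronization. A minimal sketch, assuming every thread can read the whole triangular matrix M (as in the replicated-M variant of the evaluation) and that n is divisible by THREADS; the function name and layout are illustrative:

```c
/* Sketch of the column-distributed BLAS3 solve: the right-hand-side matrix
 * B (and the solution X) is split into column blocks, one per UPC thread,
 * and each thread runs an independent sequential trsm on its block.
 * Assumes M is readable by every thread; private pointers are used here
 * for simplicity instead of UPC shared arrays. */
#include <upc.h>
#include <cblas.h>

void trsm_by_columns(int m, int n,
                     const double *M, /* m x m lower triangular, row-major */
                     double *X)       /* m x n, holds B on entry, X on exit */
{
    int cols_per_th = n / THREADS;            /* assume n divisible by THREADS */
    int first_col   = MYTHREAD * cols_per_th; /* this thread's column block    */

    /* Independent solve of M * X(:, first_col .. first_col+cols_per_th-1) = B(...) */
    cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasNonUnit,
                m, cols_per_th, 1.0, M, m, &X[first_col], n);

    upc_barrier;   /* all column blocks solved: X now holds the full solution */
}
```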
Parallel BLAS3 Triangular Solver (M*X = B) (and III)

Triangular and dense matrices with a 2D distribution (multicore-aware):
- Node 1 -> cores 0 & 1
- Node 2 -> cores 2 & 3
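A minimal sketch of how UPC thread identifiers could be mapped onto this (node, core) grid; the threads_per_node value and the mapping arithmetic are assumptions for illustration, not the exact scheme of the paper:

```c
/* Illustrative mapping of UPC threads to a 2D (node, core) grid for the
 * multicore-aware distribution: threads on the same node share that node's
 * memory, so one block dimension can be kept inside a node while the other
 * is spread across the node's cores.  threads_per_node is assumed known
 * (e.g. from the runtime configuration). */
#include <upc.h>
#include <stdio.h>

int main(void)
{
    const int threads_per_node = 2;              /* matches the example above  */
    int node = MYTHREAD / threads_per_node;      /* node this thread runs on   */
    int core = MYTHREAD % threads_per_node;      /* core index inside the node */

    printf("Thread %d -> node %d, core %d\n", MYTHREAD, node, core);
    upc_barrier;
    return 0;
}
```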
4. Experimental Evaluation
Evaluation of the BLAS2 Triangular Solver

Departmental cluster; InfiniBand; 8 nodes; 2 threads/node; m = 30000.
[Figure: speedup vs. number of threads (2, 4, 8, 16) for UPC and ScaLAPACK]
Evaluation of the BLAS3 Triangular Solver (I)

Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads/node; m = 12000, n = 12000.
[Figure: speedup vs. number of threads (2-128) for UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK]
Evaluation of the BLAS3 Triangular Solver (II)

Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads/node; m = 15000, n = 4000.
[Figure: speedup vs. number of threads (2-128) for UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK]
Evaluation of the BLAS3 Triangular Solver (and III)

Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads/node; m = 8000, n = 25000.
[Figure: speedup vs. number of threads (2-128) for UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK]
5. Conclusions
Main Conclusions

Summary:
- Implementation of BLAS triangular solvers for UPC
- Several techniques to improve their performance
- Special effort to find the most appropriate data distributions:
  - BLAS2 -> block-cyclic distribution by rows, with the block size automatically determined according to the characteristics of the scenario
  - BLAS3 -> distribution chosen depending on the memory constraints
- Comparison with ScaLAPACK (MPI):
  - UPC easier to use
  - Similar or better performance