

  1. Linear Algebra (LA) Algorithms for Hybrid Architectures with XKaapi WSPPD 2011, August 19, 2011 Federal University of Rio Grande do Sul (UFRGS) João V. F. Lima PhD Student joao.lima@inf.ufrgs.br Nicolas Maillard (UFRGS), Vincent Danjean and Thierry Gautier (MOAIS-LIG) Advisors

  2. Contents ● Introduction ● Parallel LA Algorithms ● XKaapi programming model ● Linear Algebra with XKaapi ● Conclusion 2

  4. Introduction ● Solving linear algebra (LA) systems is a fundamental problem in scientific computing ● Enabling LA for hybrid architectures is strategic ● Many hybrid processing units (PU) ● The problem is to reduce the gap between theoretical and achieved performance 4

  5. Introduction [Diagram: Pthreads, TBB, Cilk++, OpenMP, UPC, KAAPI, etc. target multicore; CUDA, OpenCL, etc. target the GPU; what targets hybrid multicore + GPU?] 5

  6. Introduction ● Efforts include optimized BLAS and LAPACK libraries for these architectures ● Some examples: ● PLASMA (library/multicore) ● MAGMA (library/hybrid GPU-based) ● StarPU (runtime system/hybrid architectures) ● XKaapi (work in progress) 6

  7. Contents ● Introduction ● Parallel LA Algorithms ● LA for Multicore Processors ● LA for Hybrid Systems ● XKaapi programming model ● Linear Algebra with XKaapi ● Conclusion 7

  8. LA for Multicore Processors ● LAPACK/ScaLAPACK are the “de facto” standard ● Both exploit parallelism at the BLAS level ● GotoBLAS, ATLAS, etc. ● Their algorithms can be described as the repetition of ● Panel factorization – accumulate (Level-2 BLAS) ● Trailing submatrix update – apply to the rest of the matrix (Level-3 BLAS) 8
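The panel/update repetition can be sketched with a blocked Cholesky factorization. This is a minimal sequential sketch in plain C++ for illustration only (no BLAS calls; LAPACK's DPOTRF does the same structure with BLAS routines): the panel phase works column by column (Level-2 BLAS flavor), and the trailing update is the GEMM/SYRK-rich part (Level-3 BLAS flavor).

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Right-looking blocked Cholesky (lower triangular), row-major n x n matrix.
void blocked_cholesky(std::vector<double>& A, int n, int b) {
    auto at = [&](int i, int j) -> double& { return A[i * n + j]; };
    for (int k = 0; k < n; k += b) {
        int kb = std::min(b, n - k);
        // Panel factorization: unblocked, column by column (Level-2 BLAS flavor)
        for (int j = k; j < k + kb; ++j) {
            double d = at(j, j);
            for (int t = k; t < j; ++t) d -= at(j, t) * at(j, t);
            at(j, j) = std::sqrt(d);
            for (int i = j + 1; i < n; ++i) {
                double s = at(i, j);
                for (int t = k; t < j; ++t) s -= at(i, t) * at(j, t);
                at(i, j) = s / at(j, j);
            }
        }
        // Trailing submatrix update: A22 -= L21 * L21^T (Level-3 BLAS flavor)
        for (int i = k + kb; i < n; ++i)
            for (int j = k + kb; j <= i; ++j) {
                double s = 0.0;
                for (int t = k; t < k + kb; ++t) s += at(i, t) * at(j, t);
                at(i, j) -= s;
            }
    }
}

// Check helper: factor a small SPD matrix, return max |(L * L^T - A)_ij|.
double cholesky_residual(int n, int b) {
    std::vector<double> M(n * n), A(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) M[i * n + j] = double((i * j) % 5 + 1);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = (i == j) ? double(n) : 0.0;  // A = M*M^T + n*I is SPD
            for (int t = 0; t < n; ++t) s += M[i * n + t] * M[j * n + t];
            A[i * n + j] = s;
        }
    std::vector<double> L = A;
    blocked_cholesky(L, n, b);
    double err = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j) {  // lower triangle only; L(j,t)=0 for t>j
            double s = 0.0;
            for (int t = 0; t <= j; ++t) s += L[i * n + t] * L[j * n + t];
            err = std::max(err, std::fabs(s - A[i * n + j]));
        }
    return err;
}
```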

  9. LA for Multicore Processors ● Rich parallelism at Level-3 BLAS when the panel size is small ● Level-2 BLAS cannot be efficiently parallelized on shared memory ● It introduces a fork-join execution pattern with limitations ● Scalability – high cost of the (sequential) panel factorization ● Asynchronicity – multiple threads have to wait for the previous step ● BLAS-level parallelism is up to 2-3x slower ● Solution – exploit parallelism at a higher level 9
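A back-of-the-envelope calculation illustrates why Level-2 BLAS resists shared-memory parallelization: it is memory-bound, while Level-3 BLAS reuses each matrix element many times. The model below is a rough sketch for illustration, assuming double precision (8 bytes) and counting only compulsory traffic.

```cpp
// Arithmetic intensity (flops per byte of compulsory memory traffic).
double gemv_intensity(int n) {
    double flops = 2.0 * n * n;                       // y = A*x: n^2 mul + n^2 add
    double bytes = 8.0 * (double(n) * n + 3.0 * n);   // read A and x, read+write y
    return flops / bytes;                             // ~0.25 flops/byte: memory-bound
}

double gemm_intensity(int n) {
    double flops = 2.0 * n * n * n;                   // C = A*B: n^3 mul + n^3 add
    double bytes = 8.0 * 3.0 * double(n) * n;         // read A and B, write C
    return flops / bytes;                             // ~n/12 flops/byte: compute-bound
}
```

For n = 1000, GEMV performs roughly 0.25 flops per byte moved while GEMM performs roughly 83, which is why the trailing updates scale on shared memory and the panels do not.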

  10. PLASMA ● Parallel Linear Algebra Software for Multicore Architectures ● Designed to be efficient on ● Homogeneous multicore processors ● Multi-socket systems of multicore processors ● Three crucial elements ● Tile algorithms (square tiles) ● Tile data layout (cache size) ● Dynamic scheduling (WS with pthreads/QUARK) ● http://icl.cs.utk.edu/plasma 10

  11. PLASMA 11

  12. PLASMA Tiled Cholesky ● Assuming a symmetric, positive-definite matrix A of size p·b x p·b, partitioned into tiles:

A = \begin{pmatrix} A_{11} & & & \\ A_{21} & A_{22} & & \\ \vdots & \vdots & \ddots & \\ A_{p1} & A_{p2} & \cdots & A_{pp} \end{pmatrix}

● Where b is the block size ● Each A_{ij} is of size b x b 12
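The tile data layout from the previous PLASMA slide stores each b x b tile contiguously, so one tile fits in cache and each BLAS kernel touches a single contiguous chunk. A minimal indexing sketch (hypothetical helper names, row-major elements inside block-major tiles):

```cpp
#include <cstddef>
#include <vector>

// Block-major ("tile") storage of a p*b x p*b matrix:
// tile (I,J) occupies one contiguous b*b chunk.
double* tile(std::vector<double>& A, int p, int b, int I, int J) {
    return A.data() + (std::size_t(I) * p + J) * b * b;
}

// Element (i,j) of the global matrix, located through its enclosing tile.
double& elem(std::vector<double>& A, int p, int b, int i, int j) {
    return tile(A, p, b, i / b, j / b)[(i % b) * b + (j % b)];
}
```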

  13. PLASMA Tiled Cholesky

for k = 1,...,p do
  DPOTRF(A_kk, L_kk)
  for i = k+1,...,p do
    DTRSM(L_kk, A_ik, L_ik)
  endfor
  for i = k+1,...,p do
    for j = k+1,...,i do
      DGEMM(L_ik, L_jk, A_ij)
    endfor
  endfor
endfor

13

  14. PLASMA Tiled Cholesky [Figure: task DAG of a 4x4 tiled Cholesky factorization, built from POTRF, TRSM, SYRK, and GEMM tasks; Buttari et al., 2009; Dongarra, SC2010] 14

  15. Contents ● Introduction ● Parallel LA Algorithms ● LA for Multicore Processors ● LA for Hybrid Systems ● XKaapi programming model ● Linear Algebra with XKaapi ● Conclusion 15

  16. LA for Hybrid Systems ● How to code LA for hybrid architectures? ● Use hybrid algorithms ● Considerations about the hybridization ● Split LA algorithms into BLAS-based tasks – task parallelism ● Choose granularity with auto-tuning ● Define dependencies among tasks – algorithms as DAGs ● Schedule the tasks over the multicore and the GPU 16

  17. LA for Hybrid Systems ● Scheduling is of crucial importance ● Schedule small and non-parallelizable tasks on the CPU – Level-1 and Level-2 BLAS tasks ● Schedule large and highly data-parallel tasks on the GPU – Level-3 BLAS tasks ● Scheduling approaches on hybrid systems ● Static – highly efficient for a specific architecture ● Dynamic – load balancing; tuning depends on BLAS-level granularity 17
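The placement rules above can be written down as a dispatch heuristic. This is a hypothetical sketch of such a policy, not the actual decision code of StarPU or XKaapi; the threshold value is an assumption.

```cpp
#include <cstddef>

enum class BlasLevel { L1, L2, L3 };
enum class PU { CPU, GPU };

// Hypothetical static dispatch rule: large Level-3 BLAS tasks go to the GPU,
// everything small or hard to parallelize stays on the CPU.
PU dispatch(BlasLevel level, std::size_t flops,
            std::size_t gpu_threshold = std::size_t(1) << 24) {
    if (level == BlasLevel::L3 && flops >= gpu_threshold)
        return PU::GPU;   // large, highly data-parallel task
    return PU::CPU;       // Level-1/2 or a tile too small to amortize transfers
}
```

A dynamic scheduler would additionally watch PU load and migrate (or steal) tasks instead of deciding purely from the task description.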

  18. Hybrid Tiled Cholesky

for k = 1,...,p do
  A_kk = POTRF(A_kk)
  for i = k+1,...,p do
    A_ik = TRSM(A_kk, A_ik)
  endfor
  for i = k+1,...,p do
    for j = k+1,...,i do
      A_ij = GEMM(A_ik, A_jk, A_ij)
    endfor
    A_ii = SYRK(A_ik, A_ii)
  endfor
endfor

18

  19. Hybrid Tiled Cholesky Algorithm for 1 CPU + 1 GPU

for ( j = 0; j < *n; j += nb ) {
  jb = min( nb, *n - j );
  cublasSsyrk( da(j,0), da(j,j) );
  cudaMemcpy2DAsync( work, da(j,j), DtoH, s[1] );
  if ( j + jb < *n )
    cublasSgemm( da(j+jb,0), da(j+jb,j) );
  cudaStreamSynchronize( stream[1] );
  spotrf( "Lower", &jb, work, &jb, info );
  if ( *info != 0 ) { *info = *info + j; break; }
  cudaMemcpy2DAsync( da(j,j), work, HtoD, s[0] );
  if ( j + jb < *n )
    cublasStrsm( da(j,j), da(j+jb,j) );
}

19

  20. MAGMA ● Matrix Algebra on GPU and Multicore Architectures ● A subset of LAPACK and BLAS routines ● Each routine has two versions ● A highly efficient version for the CPU ● A highly efficient version for the GPU (MAGMA BLAS) ● Interface very similar to LAPACK 20

  21. StarPU ● Tasking API for numerical kernel designers ● Supports heterogeneous PUs (Cell BE, GPUs) ● Composed of ● data-management facility ● task execution engine ● http://runtime.bordeaux.inria.fr/StarPU 21

  22. StarPU Data Management ● Each device has a buffer ● MSI caching protocol ● (M) modified, (S) shared, (I) invalid ● Data transfers are transparent

Data registration:

starpu_data_handle vector_handle;
starpu_vector_data_register( &vector_handle, … );
task->buffers[0].handle = vector_handle;
task->buffers[0].mode = STARPU_RW;

22
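The MSI protocol on this slide can be made concrete with a toy coherence model. This is an illustrative sketch under simplifying assumptions, not StarPU's implementation: one state per PU replicate, a read fetches a valid copy on a miss, and a write invalidates all other copies.

```cpp
#include <vector>

enum class MSI { Modified, Shared, Invalid };

// Toy per-handle coherence state for the replicates of one piece of data.
struct DataHandle {
    std::vector<MSI> state;   // one entry per processing unit
    int transfers = 0;        // data movements triggered by read misses

    explicit DataHandle(int npus) : state(npus, MSI::Invalid) {}

    void write(int pu) {      // exclusive access: invalidate the other copies
        for (auto& s : state) s = MSI::Invalid;
        state[pu] = MSI::Modified;
    }

    void read(int pu) {       // transparent transfer if the local copy is stale
        if (state[pu] == MSI::Invalid) {
            ++transfers;
            for (auto& s : state)            // the former owner now shares
                if (s == MSI::Modified) s = MSI::Shared;
            state[pu] = MSI::Shared;
        }
    }
};
```

A second read on the same PU hits the (S) copy and moves no data, which is exactly the caching behavior the protocol buys.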

  23. StarPU Task Concept ● Concept of codelets: abstraction of a task

Codelet:

static starpu_codelet cl = {
  .where = STARPU_CPU | STARPU_CUDA,
  .cpu_func = scal_cpu_func,   /* CPU */
  .cuda_func = scal_cuda_func, /* GPU */
  .nbuffers = 1,               /* number of parameters */
  .model = &vector_scal_model,
  .power_model = &vector_scal_power_model
};

23

  24. Contents ● Introduction ● Parallel LA Algorithms ● LA for Multicore Processors ● LA for Hybrid Systems ● XKaapi programming model ● Linear Algebra with XKaapi ● Conclusion 24

  25. XKaapi programming model ● XKaapi is a C/C++ library ● Targets multicore + GPU + cluster architectures ● Goals ● Simplify the development of parallel applications – Platform abstraction (programming model) ● Automatic dynamic load balancing – Theoretical and practical performance guarantees – Work-stealing-based algorithms 25

  26. XKaapi programming model ● We will focus on the C++ API Kaapi++ ● Three main concepts ● Task signature – Number of parameters, types, and access modes ● Task implementation – The implementations available for each PU (CPU, GPU, etc.) ● Data pointer – When data is shared between tasks 26

  27. XKaapi programming model

struct TaskHello : public ka::Task<1>::Signature< int > {};

template<> struct TaskBodyCPU<TaskHello> {
  void operator()( int n ) { /* CPU implementation … */ }
};

template<> struct TaskBodyGPU<TaskHello> {
  void operator()( int n ) { /* GPU implementation … */ }
};

27

  28. Contents ● Introduction ● Parallel LA Algorithms ● LA for Multicore Processors ● LA for Hybrid Systems ● XKaapi programming model ● Linear Algebra with XKaapi ● Conclusion 28

  29. XKaapi Tiled Cholesky ● Definition of four BLAS-based tasks ● TaskDPOTRF, TaskDTRSM, TaskDSYRK, TaskDGEMM ● Abstraction of the target PU and scheduling ● The runtime decides which PU will execute each task ● The code does not contain any reference to scheduling details ● Asynchronous execution of tasks ● The execution order only depends on the data produced by previous tasks 29
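The data-flow ordering described above can be illustrated with a toy scheduler: a task becomes ready once every task producing one of its inputs has finished. A minimal sequential sketch with hypothetical structures, not XKaapi's runtime:

```cpp
#include <functional>
#include <queue>
#include <vector>

// One node of the task DAG.
struct Task {
    std::function<void()> body;
    int pending = 0;                 // unfinished producers of this task's inputs
    std::vector<int> successors;     // tasks consuming data this task produces
};

// Execute the DAG: start with tasks whose inputs are all available,
// and release a successor each time its last producer completes.
void run_dag(std::vector<Task>& tasks) {
    std::queue<int> ready;
    for (int i = 0; i < (int)tasks.size(); ++i)
        if (tasks[i].pending == 0) ready.push(i);
    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        tasks[t].body();
        for (int s : tasks[t].successors)
            if (--tasks[s].pending == 0) ready.push(s);
    }
}
```

In the tiled Cholesky on the next slide, a TaskDGEMM on tile A(rm,rn) would be such a node, pending on the TaskDTRSM tasks that produce A(rm,rk) and A(rn,rk).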

  30. XKaapi Tiled Cholesky

for ( k = 0; k < N; k += blocsize ) {
  ka::Spawn<TaskDPOTRF>()( A(rk,rk) );
  for ( m = k+blocsize; m < N; m += blocsize )
    ka::Spawn<TaskDTRSM>()( A(rk,rk), A(rm,rk) /* B */ );
  for ( m = k+blocsize; m < N; m += blocsize ) {
    ka::Spawn<TaskDSYRK>()( A(rm,rk), A(rm,rm) /* C */ );
    for ( n = k+blocsize; n < m; n += blocsize )
      ka::Spawn<TaskDGEMM>()( A(rm,rk), /* A */
                              A(rn,rk), /* B */
                              A(rm,rn)  /* C */ );
  }
}

30

  31. Contents ● Introduction ● Parallel LA Algorithms ● LA for Multicore Processors ● LA for Hybrid Systems ● XKaapi programming model ● Linear Algebra with XKaapi ● Conclusion 31

  32. Conclusion ● The XKaapi interface (Kaapi++) meets the hybridization requirements ● Performance on hybrid architectures depends on ● PU affinity ● Task performance (CPU-bound, GPU-bound, etc.) ● Data management 32

  33. Future Work ● XKaapi on N CPU(s) + 1 GPU ● Distributed shared memory (DSM) concepts ● Optimized many-task execution on the GPU ● XKaapi on N CPU(s) + N GPU(s) + ??? ● DSM-like data protocol ● Problem of CPU-bound or GPU-bound tasks ● PU affinity ● Scheduler based on work stealing 33
