Linear Algebra (LA) Algorithms for Hybrid Architectures with XKaapi
WSPPD 2011, August 19, 2011, Federal University of Rio Grande do Sul (UFRGS)
João V. F. Lima, PhD student (joao.lima@inf.ufrgs.br)
Advisors: Nicolas Maillard (UFRGS), Vincent Danjean and Thierry Gautier (MOAIS-LIG)
Contents
● Introduction
● Parallel LA Algorithms
● XKaapi programming model
● Linear Algebra with XKaapi
● Conclusion
Introduction
● Solving linear algebra (LA) systems is a fundamental problem in scientific computing
● Enabling LA for hybrid architectures is strategic
● Many heterogeneous processing units (PUs) are available
● The challenge is to reduce the gap between theoretical and achieved performance
Introduction
[Diagram: programming tools per architecture. Multicore: Pthreads, TBB, Cilk++, OpenMP, UPC, KAAPI, etc. GPU: CUDA, OpenCL, etc. Hybrid multicore + GPU: ????]
Introduction
● Efforts include optimized BLAS and LAPACK libraries for these architectures
● Some examples:
  ● PLASMA (library, multicore)
  ● MAGMA (library, hybrid GPU-based)
  ● StarPU (runtime system, hybrid architectures)
  ● XKaapi (work in progress)
Contents
● Introduction
● Parallel LA Algorithms
  ● LA for Multicore Processors
  ● LA for Hybrid Systems
● XKaapi programming model
● Linear Algebra with XKaapi
● Conclusion
LA for Multicore Processors
● LAPACK/ScaLAPACK are the de facto standard
● Both exploit parallelism at the BLAS level
  ● GotoBLAS, ATLAS, etc.
● Their algorithms can be described as the repetition of
  ● Panel factorization: accumulate transformations (Level-2 BLAS)
  ● Trailing submatrix update: apply them to the rest of the matrix (Level-3 BLAS)
LA for Multicore Processors
● Rich parallelism at Level-3 BLAS, since the panel size is small
● Level-2 BLAS cannot be efficiently parallelized on shared memory
● It introduces a fork-join execution pattern with limitations
  ● Scalability: high cost of the (sequential) panel factorization
  ● Asynchronicity: multiple threads have to wait for the previous step to finish
● BLAS-level parallelism can be up to 2-3x slower
● Solution: exploit parallelism at a higher level
PLASMA
● Parallel Linear Algebra Software for Multicore Architectures
● Designed to be efficient on
  ● Homogeneous multicore processors
  ● Multi-socket systems of multicore processors
● Three crucial elements
  ● Tile algorithms (square tiles)
  ● Tile data layout (fits cache size)
  ● Dynamic scheduling (work stealing with pthreads/QUARK)
● http://icl.cs.utk.edu/plasma
PLASMA Tiled Cholesky
● Assuming a symmetric, positive definite matrix A of size (p * b) x (p * b), partitioned into p x p tiles:

    A = [ A11               ]
        [ A21  A22          ]
        [  ⋮    ⋮   ⋱       ]
        [ Ap1  Ap2  ⋯  App  ]

● where b is the tile size and each Aij is of size b x b
PLASMA Tiled Cholesky

for k = 1, ..., p do
    DPOTRF(A(k,k), L(k,k))
    for i = k+1, ..., p do
        DTRSM(L(k,k), A(i,k), L(i,k))
    endfor
    for i = k+1, ..., p do
        for j = k+1, ..., i do
            DGEMM(L(i,k), L(j,k), A(i,j))
        endfor
    endfor
endfor
PLASMA Tiled Cholesky
[Figure: task DAG of the tiled Cholesky factorization on a 4x4 tile matrix, with POTRF, TRSM, SYRK, and GEMM nodes. Buttari et al., 2009; Dongarra, SC2010]
LA for Hybrid Systems
● How to code LA for hybrid architectures?
  ● Use hybrid algorithms
● Considerations about the hybridization
  ● Split LA algorithms into BLAS-based tasks (task parallelism)
  ● Choose the granularity with auto-tuning
  ● Define the dependencies among tasks (algorithms as DAGs)
  ● Schedule the tasks over the multicore and the GPU
LA for Hybrid Systems
● Scheduling is of crucial importance
  ● Schedule small and non-parallelizable tasks on the CPU (Level-1 and Level-2 BLAS tasks)
  ● Schedule large and highly data-parallel tasks on the GPU (Level-3 BLAS tasks)
● Scheduling approaches on hybrid systems
  ● Static: highly efficient for a specific architecture
  ● Dynamic: load balancing; tuning depends on BLAS-level granularity
Hybrid Tiled Cholesky

for k = 1, ..., p do
    A(k,k) = POTRF(A(k,k))
    for i = k+1, ..., p do
        A(i,k) = TRSM(A(k,k), A(i,k))
    endfor
    for i = k+1, ..., p do
        for j = k+1, ..., i-1 do
            A(i,j) = GEMM(A(i,k), A(j,k), A(i,j))
        endfor
        A(i,i) = SYRK(A(i,k), A(i,i))
    endfor
endfor
Hybrid Tiled Cholesky
Algorithm for 1 CPU + 1 GPU (argument lists abbreviated in the slide)

for ( j = 0; j < *n; j += nb ) {
    jb = min( nb, *n - j );
    cublasSsyrk( da(j,0), da(j,j) );                 /* update diagonal block on GPU */
    cudaMemcpy2DAsync( work, da(j,j), DtoH, s[1] );  /* copy it back to the host */
    if ( j + jb < *n )
        cublasSgemm( da(j+jb,0), da(j+jb,j) );       /* update panel below on GPU */
    cudaStreamSynchronize( stream[1] );
    spotrf( "Lower", &jb, work, &jb, info );         /* factorize block on CPU */
    if ( *info != 0 ) { *info = *info + j; break; }
    cudaMemcpy2DAsync( da(j,j), work, HtoD, s[0] );  /* send factor back to GPU */
    if ( j + jb < *n )
        cublasStrsm( da(j,j), da(j+jb,j) );          /* triangular solve on GPU */
}
MAGMA
● Matrix Algebra on GPU and Multicore Architectures
● A subset of LAPACK and BLAS routines
● Each routine has two versions
  ● A highly efficient version for the CPU
  ● A highly efficient version for the GPU (MAGMA BLAS)
● Interface very similar to LAPACK
StarPU
● Tasking API for numerical kernel designers
● Supports heterogeneous PUs (Cell BE, GPUs)
● Composed of
  ● a data-management facility
  ● a task execution engine
● http://runtime.bordeaux.inria.fr/StarPU
StarPU Data Management
● Each device has a buffer
● MSI caching protocol: (M) modified, (S) shared, (I) invalid
● Data transfers are transparent

Data registration:

starpu_data_handle vector_handle;
starpu_vector_data_register( &vector_handle, … );
task->buffers[0].handle = vector_handle;
task->buffers[0].mode = STARPU_RW;
StarPU Task Concept
● Concept of codelets: abstraction of a task

static starpu_codelet cl = {
    .where       = STARPU_CPU | STARPU_CUDA,
    .cpu_func    = scal_cpu_func,           /* CPU implementation */
    .cuda_func   = scal_cuda_func,          /* GPU implementation */
    .nbuffers    = 1,                       /* number of data parameters */
    .model       = &vector_scal_model,
    .power_model = &vector_scal_power_model
};
XKaapi programming model
● XKaapi is a C/C++ library
● Targets multicore + GPU + cluster architectures
● Goals
  ● Simplify the development of parallel applications (platform abstraction, programming model)
  ● Automatic dynamic load balancing, with good theoretical and practical performance (work-stealing-based algorithms)
XKaapi programming model
● We will focus on the C++ API, Kaapi++
● Three main concepts
  ● Task signature: number of parameters, their types, and access modes
  ● Task implementation: the implementations available for each PU (CPU, GPU, etc.)
  ● Data pointer: used when data is shared between tasks
XKaapi programming model struct TaskHello{ public ka::Task<1>::Signature< int >{}; template<> struct TaskBodyCPU<TaskHello> { void operator()( int n ) { /* CPU implementation … */ } }; template<> struct TaskBodyGPU<TaskHello>{ void operator()( int n ) { /* GPU implementation … */ } }; 27
XKaapi Tiled Cholesky
● Definition of four BLAS-based tasks
  ● TaskDPOTRF, TaskDTRSM, TaskDSYRK, TaskDGEMM
● Abstraction of the target PU and scheduling
  ● The runtime decides which PU will execute each task
  ● The code does not contain any reference to scheduling details
● Asynchronous execution of tasks
  ● The execution order only depends on the data produced by previous tasks
XKaapi Tiled Cholesky

for (k = 0; k < N; k += blocsize) {
    ka::Spawn<TaskDPOTRF>()( A(rk,rk) );
    for (m = k+blocsize; m < N; m += blocsize)
        ka::Spawn<TaskDTRSM>()( A(rk,rk), A(rm,rk) /* B */ );
    for (m = k+blocsize; m < N; m += blocsize) {
        ka::Spawn<TaskDSYRK>()( A(rm,rk), A(rm,rm) /* C */ );
        for (n = k+blocsize; n < m; n += blocsize)
            ka::Spawn<TaskDGEMM>()( A(rm,rk) /* A */,
                                    A(rn,rk) /* B */,
                                    A(rm,rn) /* C */ );
    }
}
Conclusion
● The XKaapi interface (Kaapi++) meets the hybridization requirements
● Performance on hybrid architectures depends on
  ● PU affinity
  ● Task characteristics (CPU-bound, GPU-bound, etc.)
  ● Data management
Future Work
● XKaapi on N CPU(s) + 1 GPU
  ● Distributed shared memory (DSM) concepts
  ● Optimized many-task execution on the GPU
● XKaapi on N CPU(s) + N GPU(s) + ???
  ● DSM-like data coherency protocol
  ● Handling CPU-bound vs GPU-bound tasks
  ● PU affinity
  ● Scheduler based on work stealing