Linear Algebra (LA) Algorithms for Hybrid Architectures with XKaapi
WSPPD 2011, August 19, 2011, Federal University of Rio Grande do Sul (UFRGS)
João V. F. Lima, PhD student (joao.lima@inf.ufrgs.br)
Advisors: Nicolas Maillard (UFRGS), Vincent Danjean and Thierry Gautier (MOAIS-LIG)
Contents
● Introduction
● Parallel LA Algorithms
● XKaapi programming model
● Linear Algebra with XKaapi
● Conclusion
Introduction
● Solving linear algebra (LA) systems is a fundamental problem in scientific computing
● Enabling LA for hybrid architectures is strategic
● Many heterogeneous processing units (PUs) are available
● The challenge is to reduce the gap between theoretical and achieved performance
Introduction
[Diagram: programming tools per architecture. Multicore: Pthreads, TBB, Cilk++, OpenMP, UPC, KAAPI, etc. GPU: CUDA, OpenCL, etc. Hybrid multicore + GPU: ????]
Introduction
● Efforts include optimized BLAS and LAPACK libraries for these architectures
● Some examples:
  ● PLASMA (library, multicore)
  ● MAGMA (library, hybrid GPU-based)
  ● StarPU (runtime system, hybrid architectures)
  ● XKaapi (work in progress)
Contents
● Introduction
● Parallel LA Algorithms
  ● LA for Multicore Processors
  ● LA for Hybrid Systems
● XKaapi programming model
● Linear Algebra with XKaapi
● Conclusion
LA for Multicore Processors
● LAPACK/ScaLAPACK are the de facto standard
● Both exploit parallelism at the BLAS level
  ● GotoBLAS, ATLAS, etc.
● Their algorithms can be described as the repetition of
  ● Panel factorization: accumulate transformations (Level-2 BLAS)
  ● Trailing submatrix update: apply them to the rest of the matrix (Level-3 BLAS)
LA for Multicore Processors
● Rich parallelism at Level-3 BLAS, since the panel size is small
● Level-2 BLAS cannot be efficiently parallelized on shared memory
● It introduces a fork-join execution pattern with limitations
  ● Scalability: high cost of the (sequential) panel factorization
  ● Asynchronicity: multiple threads have to wait for the previous step to finish
● BLAS-level parallelism can be up to 2-3x slower
● Solution: exploit parallelism at a higher level
PLASMA
● Parallel Linear Algebra Software for Multicore Architectures
● Designed to be efficient on
  ● Homogeneous multicore processors
  ● Multi-socket systems of multicore processors
● Three crucial elements
  ● Tile algorithms (square tiles)
  ● Tile data layout (fits cache size)
  ● Dynamic scheduling (work stealing with pthreads/QUARK)
● http://icl.cs.utk.edu/plasma
PLASMA Tiled Cholesky
● Assuming a symmetric, positive definite matrix A of size (p * b) x (p * b), partitioned into p x p tiles:

    A = [ A11               ]
        [ A21  A22          ]
        [  ⋮    ⋮   ⋱       ]
        [ Ap1  Ap2  ⋯  App  ]

● where b is the tile size and each Aij is of size b x b
PLASMA Tiled Cholesky

for k = 1, ..., p do
    DPOTRF(A(k,k), L(k,k))
    for i = k+1, ..., p do
        DTRSM(L(k,k), A(i,k), L(i,k))
    endfor
    for i = k+1, ..., p do
        for j = k+1, ..., i do
            DGEMM(L(i,k), L(j,k), A(i,j))
        endfor
    endfor
endfor
PLASMA Tiled Cholesky
[Figure: task DAG of the tiled Cholesky factorization on a 4x4 tile matrix, with POTRF, TRSM, SYRK, and GEMM nodes. Buttari et al., 2009; Dongarra, SC2010]
LA for Hybrid Systems
● How to code LA for hybrid architectures?
  ● Use hybrid algorithms
● Considerations about the hybridization
  ● Split LA algorithms into BLAS-based tasks (task parallelism)
  ● Choose the granularity with auto-tuning
  ● Define the dependencies among tasks (algorithms as DAGs)
  ● Schedule the tasks over the multicore and the GPU
LA for Hybrid Systems
● Scheduling is of crucial importance
  ● Schedule small and non-parallelizable tasks on the CPU (Level-1 and Level-2 BLAS tasks)
  ● Schedule large and highly data-parallel tasks on the GPU (Level-3 BLAS tasks)
● Scheduling approaches on hybrid systems
  ● Static: highly efficient for a specific architecture
  ● Dynamic: load balancing; tuning depends on BLAS-level granularity
Hybrid Tiled Cholesky

for k = 1, ..., p do
    A(k,k) = POTRF(A(k,k))
    for i = k+1, ..., p do
        A(i,k) = TRSM(A(k,k), A(i,k))
    endfor
    for i = k+1, ..., p do
        for j = k+1, ..., i-1 do
            A(i,j) = GEMM(A(i,k), A(j,k), A(i,j))
        endfor
        A(i,i) = SYRK(A(i,k), A(i,i))
    endfor
endfor
Hybrid Tiled Cholesky
Algorithm for 1 CPU + 1 GPU (argument lists abbreviated in the slide)

for ( j = 0; j < *n; j += nb ) {
    jb = min( nb, *n - j );
    cublasSsyrk( da(j,0), da(j,j) );                 /* update diagonal block on GPU */
    cudaMemcpy2DAsync( work, da(j,j), DtoH, s[1] );  /* copy it back to the host */
    if ( j + jb < *n )
        cublasSgemm( da(j+jb,0), da(j+jb,j) );       /* update panel below on GPU */
    cudaStreamSynchronize( stream[1] );
    spotrf( "Lower", &jb, work, &jb, info );         /* factorize block on CPU */
    if ( *info != 0 ) { *info = *info + j; break; }
    cudaMemcpy2DAsync( da(j,j), work, HtoD, s[0] );  /* send factor back to GPU */
    if ( j + jb < *n )
        cublasStrsm( da(j,j), da(j+jb,j) );          /* triangular solve on GPU */
}
MAGMA
● Matrix Algebra on GPU and Multicore Architectures
● A subset of LAPACK and BLAS routines
● Each routine has two versions
  ● A highly efficient version for the CPU
  ● A highly efficient version for the GPU (MAGMA BLAS)
● Interface very similar to LAPACK
StarPU
● Tasking API for numerical kernel designers
● Supports heterogeneous PUs (Cell BE, GPUs)
● Composed of
  ● a data-management facility
  ● a task execution engine
● http://runtime.bordeaux.inria.fr/StarPU
StarPU Data Management
● Each device has a buffer
● MSI caching protocol: (M) modified, (S) shared, (I) invalid
● Data transfers are transparent

Data registration:

starpu_data_handle vector_handle;
starpu_vector_data_register( &vector_handle, … );
task->buffers[0].handle = vector_handle;
task->buffers[0].mode = STARPU_RW;
StarPU Task Concept
● Concept of codelets: abstraction of a task

static starpu_codelet cl = {
    .where       = STARPU_CPU | STARPU_CUDA,
    .cpu_func    = scal_cpu_func,           /* CPU implementation */
    .cuda_func   = scal_cuda_func,          /* GPU implementation */
    .nbuffers    = 1,                       /* number of data parameters */
    .model       = &vector_scal_model,
    .power_model = &vector_scal_power_model
};
XKaapi programming model
● XKaapi is a C/C++ library
● Targets multicore + GPU + cluster architectures
● Goals
  ● Simplify the development of parallel applications (platform abstraction, programming model)
  ● Automatic dynamic load balancing, with good theoretical and practical performance (work-stealing-based algorithms)
XKaapi programming model
● We will focus on the C++ API, Kaapi++
● Three main concepts
  ● Task signature: number of parameters, their types, and access modes
  ● Task implementation: the implementations available for each PU (CPU, GPU, etc.)
  ● Data pointer: used when data is shared between tasks
XKaapi programming model struct TaskHello{ public ka::Task<1>::Signature< int >{}; template<> struct TaskBodyCPU<TaskHello> { void operator()( int n ) { /* CPU implementation … */ } }; template<> struct TaskBodyGPU<TaskHello>{ void operator()( int n ) { /* GPU implementation … */ } }; 27
XKaapi Tiled Cholesky
● Definition of four BLAS-based tasks
  ● TaskDPOTRF, TaskDTRSM, TaskDSYRK, TaskDGEMM
● Abstraction of the target PU and scheduling
  ● The runtime decides which PU will execute each task
  ● The code does not contain any reference to scheduling details
● Asynchronous execution of tasks
  ● The execution order only depends on the data produced by previous tasks
XKaapi Tiled Cholesky

for (k = 0; k < N; k += blocsize) {
    ka::Spawn<TaskDPOTRF>()( A(rk,rk) );
    for (m = k+blocsize; m < N; m += blocsize)
        ka::Spawn<TaskDTRSM>()( A(rk,rk), A(rm,rk) /* B */ );
    for (m = k+blocsize; m < N; m += blocsize) {
        ka::Spawn<TaskDSYRK>()( A(rm,rk), A(rm,rm) /* C */ );
        for (n = k+blocsize; n < m; n += blocsize)
            ka::Spawn<TaskDGEMM>()( A(rm,rk) /* A */,
                                    A(rn,rk) /* B */,
                                    A(rm,rn) /* C */ );
    }
}
Conclusion
● The XKaapi interface (Kaapi++) meets the hybridization requirements
● Performance on hybrid architectures depends on
  ● PU affinity
  ● Task characteristics (CPU-bound, GPU-bound, etc.)
  ● Data management
Future Work
● XKaapi on N CPU(s) + 1 GPU
  ● Distributed shared memory (DSM) concepts
  ● Optimized many-task execution on the GPU
● XKaapi on N CPU(s) + N GPU(s) + ???
  ● DSM-like data coherency protocol
  ● Handling CPU-bound vs GPU-bound tasks
  ● PU affinity
  ● Scheduler based on work stealing