X-KAAPI: a Multi Paradigm Runtime for Multicore Architectures


slide-1
SLIDE 1

P2S2/ICPP 2013

X-KAAPI: a Multi Paradigm Runtime for Multicore Architectures

Thierry Gautier∗, Fabien Lementec∗, Vincent Faucher†, Bruno Raffin∗ ∗INRIA, Grenoble, France †CEA, DEN, DANS, DM2S, SEMT, DYN, Gif-sur Yvette, France Thierry Gautier thierry.gautier@inrialpes.fr

MOAIS, INRIA, Grenoble

slide-2
SLIDE 2

➡ High complexity
  • millions of components
  • heterogeneity of memory and processors
  • Parallel, complex architectures
  • Computing resources: CPU, GPU, ...
  • Memory: hierarchical (registers, L1, L2, L3, main memory), private / shared caches
  • Interconnection network between several cores and memory

2

[Diagram: examples of multicore architectures - cores with private/shared caches and hierarchical memories connected by a network; a GPU with its own memory; a dual Xeon X5650 node with a Tylersburg 36D chipset, PEX 8647 PCIe switches and Tesla C2050 GPUs linked by QPI, PCIe 16x and ESI (I/O)]
slide-3
SLIDE 3
  • Goal: Write Once, Run Anywhere
  • Provide application performance guarantees
  • on multiple parallel architectures
  • under dynamic variations (OS jitter, application load)
  • A two-step solution
  • Definition of a programming model
๏ task based: recursive tasks, adaptive tasks
๏ data flow dependencies, computed at runtime
  • Efficient scheduling algorithms
๏ work stealing with heuristics
๏ HEFT, DualApproximation, ...
๏ theoretical analysis of performance

[Diagram: the application, written against the APIs (#pragma, C, C++, Fortran), generates a data flow task graph (e.g. DGEMM tasks) that a distributed scheduler maps onto CPUs and GPUs]
slide-4
SLIDE 4
  • How to program such architectures?

4

[Timeline, 1994-2013: MPI-1.0 (1994) through MPI-1.1, MPI-1.3, MPI-2.0, MPI-2.1, MPI-2.2, MPI-3.0; Cilk (1996) and Athapascan (distributed work stealing, data flow dependencies); Kaapi (+ fault tolerance, + static scheduling, work stealing of independent tasks; later + adaptive tasks and parallel loops, + multi-GPUs); Cilk++ / Intel Cilk+ (+ parallel loops); OpenMP 1.0 (parallel loops), 2.0, 3.0 (+ tasks), 3.1, 4.0; TBB 1.0, 2.0, 3.0 (+ parallel loops), 4.1; StarPU (data flow + multi-GPUs); Quark (data flow); CellSs, SMPSs, GridSs, GPUSs (+ multi-GPUs), OmpSs (data flow + multi-GPUs + cluster); XKaapi (data flow + parallel loops + adaptive tasks + OpenMP runtime support)]

slide-5
SLIDE 5
  • Outline
  • Introduction
  • Overview of Kaapi parallel programming model
  • Scheduling tasks with data flow dependencies
  • XKaapi’s on-demand task creation
  • Evaluations
  • Micro benchmarks
  • EPX parallelization
  • Conclusions

5

slide-6
SLIDE 6
  • Using code annotations
  • Other APIs: C, C++, Fortran
  • Task ~ OpenMP structured block
  • assumption: no side effects; description of access modes
  • Related work
  • StarPU [Bordeaux, France], OmpSs [BSC, Spain], Quark [UTK]
  • and the new OpenMP 4.0 standard!

Sequential version:

int main() {
  /* data result is produced */
  compute( input, &result );
  /* data result is consumed */
  display( &result );
}

XKaapi version:

int main() {
  #pragma kaapi task read(input) write(result)
  compute( input, &result );
  #pragma kaapi task read(result)
  display( &result );
}

  • Data flow dependencies

6

[Graph: compute → result → display]

slide-7
SLIDE 7
  • XKaapi programming example

7

#include <cblas.h>
#include <clapack.h>

void Cholesky( double* A, int N, size_t NB ) {
  for (size_t k=0; k < N; k += NB) {
    #pragma kaapi task readwrite(&A[k*N+k]{ld=N; [NB][NB]})
    clapack_dpotrf( CblasRowMajor, CblasLower, NB, &A[k*N+k], N );
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma kaapi task read(&A[k*N+k]{ld=N; [NB][NB]}) \
                        readwrite(&A[m*N+k]{ld=N; [NB][NB]})
      cblas_dtrsm( CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                   NB, NB, 1., &A[k*N+k], N, &A[m*N+k], N );
    }
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma kaapi task read(&A[m*N+k]{ld=N; [NB][NB]}) \
                        readwrite(&A[m*N+m]{ld=N; [NB][NB]})
      cblas_dsyrk( CblasRowMajor, CblasLower, CblasNoTrans,
                   NB, NB, -1.0, &A[m*N+k], N, 1.0, &A[m*N+m], N );
      for (size_t n=k+NB; n < m; n += NB) {
        #pragma kaapi task read(&A[m*N+k]{ld=N; [NB][NB]}, &A[n*N+k]{ld=N; [NB][NB]}) \
                          readwrite(&A[m*N+n]{ld=N; [NB][NB]})
        cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasTrans,
                     NB, NB, NB, -1.0, &A[m*N+k], N, &A[n*N+k], N, 1.0, &A[m*N+n], N );
      }
    }
  }
}

slide-9
SLIDE 9
  • Main Characteristics of XKaapi
  • Parallelism is explicit, task based, with data flow dependencies
  • Task creation is a non-blocking operation
  • Dependencies between tasks = data flow dependencies
๏ computed at runtime, during work-stealing requests
๏ unlike StarPU, OmpSs and Quark, where they are computed at task creation
  • Scheduling
  • by work stealing
๏ Cilk-like performance guarantees
  • Tp = O(T1/p + T∞)
  • number of steal requests O(p T∞)
๏ plus heuristics for data locality
  • plus other schedulers: ETF, HETF, DualApproximation
  • Target architectures
  • heterogeneous architectures: multi-CPUs / multi-GPUs
  • many-core: Intel Xeon Phi

9

slide-10
SLIDE 10
  • The way XKaapi executes tasks
  • One "worker thread" per core
  • able to execute XKaapi fine-grain tasks
  • holds a queue of tasks
๏ related to the sequential C stack of activation frames
๏ T.H.E.-style low-overhead protocol; a lock only in rare cases
  • Task creation is cheap!
  • it reduces to pushing a C function pointer + its arguments into the worker thread's queue
๏ ~10 cycles / task on an AMD Magny-Cours processor
  • Recursive tasks are welcome
  • Work-stealing based scheduling
  • Cilk's work-first principle
  • the work-stealing algorithm is a plug-in
๏ default: steal a task from a randomly chosen queue

[Diagram: the owner core pushes/pops tasks at one end of its queue; an idle core steals from the other end]

10

slide-11
SLIDE 11
  • On-demand task creation with XKaapi
  • Example of use: the XKaapi for_each construct
  • A general-purpose parallel loop
๏ a task == a range of iterations to compute
  • Execution model
๏ initially, one task is in charge of the whole range
  • Adaptive tasks in XKaapi
  • Adaptive tasks can be split at run time to create new tasks
  • The programmer provides a "splitter" function, called when an idle core decides to steal some of the remaining computation of a task under execution

11

T1 : [0 - 15]


slide-15
SLIDE 15
  • On-demand task creation with XKaapi
  • Example of use: the XKaapi for_each construct
  • A general-purpose parallel loop
๏ a task == a range of iterations to compute
  • Execution model
๏ initially, one task is in charge of the whole range
๏ idle cores post steal requests and trigger the «split» operation to generate new tasks; concurrent steal requests are aggregated

T1 : [0 - 15]
split (T1, nb_stealers+1)

[Diagram: two idle cores post steal requests; the aggregated requests trigger a single split(T1, nb_stealers+1)]

15


slide-17
SLIDE 17
  • On-demand task creation with XKaapi
  • Example of use: the XKaapi for_each construct (after the split)
  • Idle cores posted steal requests; the aggregated «split» operation divided T1's remaining iterations
  • Result: the victim keeps T1 : [3 - 7]; the two thieves receive T2 : [8 - 11] and T3 : [12 - 15]

17

slide-18
SLIDE 18
  • Outline
  • Introduction
  • Overview of Kaapi parallel programming model
  • Scheduling tasks with data flow dependencies
  • XKaapi’s on-demand task creation
  • Evaluations
  • Micro benchmarks
  • EPX parallelization
  • Conclusions

18

slide-19
SLIDE 19
  • Overhead of task management [AMD48]
  • Benchmark: naive Fibonacci computation
  • creates billions of fine-grain tasks!

19

Cilk+:

long fib(long n) {
  if (n < 2) return (n);
  else {
    long x, y;
    x = cilk_spawn fib(n - 1);
    y = fib(n - 2);
    cilk_sync;
    return (x + y);
  }
}

OpenMP:

void fibonacci(long* result, const long n) {
  if (n < 2) *result = n;
  else {
    long r1, r2;
    #pragma omp task
    fibonacci( &r1, n-1 );
    fibonacci( &r2, n-2 );
    #pragma omp taskwait
    *result = r1 + r2;
  }
}

TBB:

struct FibContinuation: public tbb::task {
  long* const sum;
  long x, y;
  FibContinuation( long* sum_ ) : sum(sum_) {}
  tbb::task* execute() { *sum = x + y; return NULL; }
};

struct FibTask: public tbb::task {
  long n;
  long* sum;
  FibTask( const long n_, long* const sum_ ) : n(n_), sum(sum_) {}
  tbb::task* execute() {
    if (n < 2) { *sum = n; return NULL; }
    else {
      FibContinuation& c = *new( allocate_continuation() ) FibContinuation(sum);
      FibTask& b = *new( c.allocate_child() ) FibTask(n-1, &c.y);
      recycle_as_child_of(c);
      n -= 2;
      sum = &c.x;
      c.set_ref_count(2);  // set ref_count to "two children"
      c.spawn( b );
      return this;
    }
  }
};

Kaapi:

void fibonacci(long* result, const long n) {
  if (n < 2) *result = n;
  else {
    long r1, r2;
    #pragma kaapi task write(&r1)
    fibonacci( &r1, n-1 );
    fibonacci( &r2, n-2 );
    #pragma kaapi sync
    *result = r1 + r2;
  }
}

slide-20
SLIDE 20

  • Overhead of task management [AMD48]
  • Fibonacci (35) naive recursive computation
  • AMD Magny-Cours, 2.2 GHz, 48 cores, 256 GB main memory

Time per run (seconds):

#Cores   Cilk+    TBB 4.0   XKaapi   OpenMP (gcc)
  1      1.063     2.356     0.728      2.43
  8      0.127     0.293     0.094     51.06
 16      0.065     0.146     0.047    104.14
 32      0.035     0.072     0.024    no time
 48      0.028     0.049     0.017    no time

Overhead on 1 core relative to the serial version:

Serial        0.0905 s  (x 1)
Cilk+         1.063 s   (x 11.7)
TBB (4.0)     2.356 s   (x 26)
OpenMP (gcc)  2.429 s   (x 27)
XKaapi        0.728 s   (x 8)

20

slide-21
SLIDE 21
  • PLASMA: a multicore library for dense linear algebra
  • Two versions of the algorithms
๏ "static": hand-coded scheduling of operations over threads
๏ "Quark": a library to schedule tasks with data flow dependencies
  • Comparison of PLASMA Dpotrf over:
๏ the static scheduler
๏ the Quark scheduler
๏ XKaapi, through the Quark API ported on top of XKaapi
  • PLASMA 2.4.6

21

Dense Cholesky factorization

[Plots: AMD Magny-Cours, 48 cores, block sizes BS = 128 and BS = 224]

slide-22
SLIDE 22
  • Dense Cholesky factorization

[Plot: 8 Fermi M2050 GPUs + 12 cores; Gflop/s versus matrix order (4096 to 40960) for the 4CPU+8GPU, 6CPU+6GPU, 8CPU+4GPU, 10CPU+2GPU and 11CPU+1GPU configurations]

  • Intel Xeon Phi 5110, matrix size 8192, BS = 256
  • Preliminary results / random work stealing

22
slide-23
SLIDE 23
  • Parallelization of EPX
  • Multicore parallelization of the EPX (EUROPLEXUS) code [CEA - IRC - EDF - ONERA]
  • ANR RepDyn funding
  • Fluid-structure systems subjected to fast transient dynamic loading

23

slide-24
SLIDE 24
  • Parallelization of EPX
  • Complex code
  • 600 000 lines of code (Fortran)
  • Two main sources of parallelization (~70% of the computation)
  • Sparse Cholesky factorization
๏ skyline representation
๏ XKaapi parallel program = dependent tasks with data flow dependencies
  • Two independent loops
๏ LOOPELM: iteration over finite elements to compute nodal internal forces
๏ REPERA: iteration for kinematic link detection
๏ parallelization = on-demand task creation through XKaapi's parallel for_each functor
  • Two instances: MEPPEN and MAXPLANE
  • AMD Magny-Cours, 2.2 GHz, 48 cores, 256 GB main memory

24

slide-25
SLIDE 25
  • Case study: MEPPEN
  • Main characteristics
  • Most of the time is spent in the independent loops LOOPELM and REPERA
  • AMD Magny-Cours, 2.2 GHz, 48 cores, 256 GB main memory

25

[Plots: MEPPEN execution time (seconds) versus number of cores (1 to 48), broken down into repera, loopelm, Cholesky and other; speedup (/T1) of LOOPELM and REPERA versus core count, against the ideal line]

slide-26
SLIDE 26
  • Case study: MAXPLANE
  • Main characteristics
  • Most of the time is spent in the sparse Cholesky factorization
  • AMD Magny-Cours, 2.2 GHz, 48 cores, 256 GB main memory

26

[Plots: MAXPLANE execution time (seconds) versus number of cores (1 to 48), broken down into repera, loopelm, Cholesky and other; speedup (/T1) of LOOPELM and REPERA versus core count, against the ideal line]

slide-27
SLIDE 27
  • Conclusions
  • XKaapi
  • a low-overhead runtime for dependent tasks (data flow), with several scheduling algorithms
  • good performance on different architectures
  • multi-CPUs, multi-CPUs + multi-GPUs, Intel Xeon Phi
  • Improving the parallelization of EPX
  • analysis of the remaining sequential part
  • scalability on thousands of cores
  • Integration of our runtime under OpenMP 4.0 directives
  • Supercomputer = high-performance network + multicores + accelerators
  • Software stack
  • Network: MPI
  • Multicores: OpenMP
  • Accelerators: CUDA, OpenCL, OpenACC
  • SIMD units: compiler or extensions
  • More information: http://kaapi.gforge.inria.fr
  • [IWOMP2012]: replacement of libGOMP for OpenMP 3.1/GCC with better task management
  • [IWOMP2013]: extension of the loop scheduler based on on-demand task creation
  • [IPDPS2013]: multi-CPUs / multi-GPUs (12-core machine + 8 GPUs)
  • [SBAC-PAD13]: comparison of Intel Xeon Phi / Intel Sandy Bridge with OpenMP, Cilk+ and Kaapi

slide-28
SLIDE 28
  • Thank you for your attention!

http://kaapi.gforge.inria.fr

28

slide-29
SLIDE 29
  • Sparse Cholesky Factorization (EPX)
  • OpenMP / XKaapi
  • matrix of order 59462 with 3.59% non-zero elements

29

[Plot: speedup (Tp/Tseq) versus core count (up to 45) for OpenMP and XKaapi, against the ideal line]

slide-30
SLIDE 30
  • Performance evaluation: VTK filters
  • Parallel version of the VTK visualization toolkit [M. Ettinger, MOAIS]
  • A framework to develop parallel applications for scientific visualization
  • A VTK «filter» == a computation performed on a 2D/3D scene
  • parallel loop, static OpenMP schedule

30

[Plots: regular workload versus irregular workload]

slide-31
SLIDE 31
  • IFP 2012

Comparison with OmpSs, StarPU

  • DGEMM, matrix size 10240, block size 1024

31

slide-32
SLIDE 32
  • IFP 2012

Comparison with OmpSs, StarPU

  • DPOTRF, matrix size 10240, block size 1024

32

slide-33
SLIDE 33
  • IFP 2012

Comparison with OmpSs, StarPU

  • DPOTRF, matrix size 40960, BS 1024
  • 33

No OmpSs result due to a memory problem with big matrices (a bug for matrices larger than 10280)

slide-34
SLIDE 34
  • Improving the OpenMP task implementation
  • [IWOMP 2012]
  • Barcelona OpenMP Task Suite (BOTS)
  • using libKOMP = our libGOMP implementation (on top of XKaapi)
  • a set of representative benchmarks to evaluate OpenMP task implementations

Name       Arguments used  Domain                 Summary
Alignment  prot100.aa      Dynamic programming    Aligns sequences of proteins
FFT        n=33,554,432    Spectral method        Computes a Fast Fourier Transform
Floorplan  input.20        Optimization           Computes the optimal placement of cells in a floorplan
NQueens    n=14            Search                 Finds solutions of the N-Queens problem
MultiSort  n=33,554,432    Integer sorting        Uses a mixture of sorting algorithms to sort a vector
SparseLU   n=128 m=64      Sparse linear algebra  Computes the LU factorization of a sparse matrix
Strassen   n=8192          Dense linear algebra   Computes a matrix multiply with Strassen's method
UTS        medium.input    Search                 Computes the number of nodes in an Unbalanced Tree

  • Evaluation platforms
  • AMD48: 4x12 AMD Opteron (6174) cores
  • Intel32: 4x8 Intel Xeon (X7560) cores
  • Software
  • gcc 4.6.2 + libGOMP
  • gcc 4.6.2 + libKOMP
  • icc 12.1.2 + Intel OpenMP runtime (KMP)

34

slide-35
SLIDE 35
  • Running OpenMP BOTS with libKOMP

Speed-up of BOTS kernels on the AMD48 platform:

Kernel     libGOMP  libKOMP  Intel
Alignment    38.8     40.0    37.0
FFT           0.5     12.2    12.0
Floorplan    27.6     32.7    29.2
NQueens      43.7     47.8    39.0
MultiSort     0.6     13.2    11.3
SparseLU     44.1     44.4    35.0
Strassen     20.8     22.4    20.5
UTS           0.9     25.3    15.0

35

  • Evaluation platforms
  • AMD48: 4x12 AMD Opteron (6174) cores
  • Intel32: 4x8 Intel Xeon (X7560) cores
  • Software
  • gcc 4.6.2 + libGOMP
  • gcc 4.6.2 + libKOMP
  • icc 12.1.2 + Intel OpenMP runtime (KMP)
slide-36
SLIDE 36
  • Software stack

[Diagram: applications (Europlexus, VTK, PLASMA, Sofa) on top of the programming interfaces (C + #pragma, C API, C++ API, Fortran API, KaCC, Quark, OpenMP (gcc) via libKOMP); the XKaapi runtime (adaptive tasks with data flow) sits above Pthreads and atomics, on multicore nodes with GPUs and NICs]

36

slide-37
SLIDE 37
OpenMP version:

#include <cblas.h>
#include <clapack.h>

void Cholesky( double* A, int N, size_t NB ) {
  #pragma omp parallel
  #pragma omp single
  for (size_t k=0; k < N; k += NB) {
    #pragma omp task depend(inout: &A[k*N+k]{ld=N; [NB][NB]})
    clapack_dpotrf( CblasRowMajor, CblasLower, NB, &A[k*N+k], N );
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma omp task depend(in: &A[k*N+k]{ld=N; [NB][NB]}) \
                       depend(inout: &A[m*N+k]{ld=N; [NB][NB]})
      cblas_dtrsm( CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                   NB, NB, 1., &A[k*N+k], N, &A[m*N+k], N );
    }
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma omp task depend(in: &A[m*N+k]{ld=N; [NB][NB]}) \
                       depend(inout: &A[m*N+m]{ld=N; [NB][NB]})
      cblas_dsyrk( CblasRowMajor, CblasLower, CblasNoTrans,
                   NB, NB, -1.0, &A[m*N+k], N, 1.0, &A[m*N+m], N );
      for (size_t n=k+NB; n < m; n += NB) {
        #pragma omp task depend(in: &A[m*N+k]{ld=N; [NB][NB]}, &A[n*N+k]{ld=N; [NB][NB]}) \
                         depend(inout: &A[m*N+n]{ld=N; [NB][NB]})
        cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasTrans,
                     NB, NB, NB, -1.0, &A[m*N+k], N, &A[n*N+k], N, 1.0, &A[m*N+n], N );
      }
    }
  }
}

37

slide-38
SLIDE 38

  • Overhead of task management [Xeon Phi]
  • Fibonacci (38) naive recursive computation
  • 60 cores of an Intel Xeon Phi 5100

#Threads  Cilk+ (s)  OpenMP (gcc) (s)  XKaapi (s)
   1        33.21         65.64          15.52
  10         3.34         33.12           1.58
  20         1.66         17.54           0.79
  60         0.56          6.30           0.27
 120         0.38          3.86           0.18
 240         0.37          3.18           0.18

38

slide-39
SLIDE 39
  • Stealing a task
  • A thief thread:
  • iterates through the tasks in the victim's queue
๏ iteration order = creation-time order, from oldest to newest
  • computes data flow dependencies
๏ with previously visited tasks
  • detects ready tasks
  • Lazy computation of data flow dependencies
  • in work-stealing "theory", this is the work-first principle
๏ overhead is moved from the work of the program onto the critical path

[Diagram: the owner core pushes/pops at the newest end of its queue; an idle core steals from the oldest end]

39

slide-40
SLIDE 40
  • 1 GPU

[Diagram: per-GPU streams - Stream H2D, Stream Kernel, Stream D2H - synchronized with NVIDIA events]

40

slide-41
SLIDE 41
  • Memory region description

#include <cblas.h>
#include <clapack.h>

void Cholesky( double* A, int N, size_t NB ) {
  for (size_t k=0; k < N; k += NB) {
    #pragma kaapi task readwrite(&A[k*N+k]{ld=N; [NB][NB]})
    clapack_dpotrf( CblasRowMajor, CblasLower, NB, &A[k*N+k], N );
    for (size_t m=k+NB; m < N; m += NB) {
      #pragma kaapi task read(&A[k*N+k]{ld=N; [NB][NB]}) \
                        readwrite(&A[m*N+k]{ld=N; [NB][NB]})
      cblas_dtrsm( CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                   NB, NB, 1., &A[k*N+k], N, &A[m*N+k], N );
    }
    ...
  }
}

41

[Diagram: the NB x NB submatrices of A accessed by clapack_dpotrf and cblas_dtrsm; N = leading dimension]

slide-42
SLIDE 42
  • False dependency resolution
  • Also called:
  • Write after Read
  • Write after Write
  • Here => Task1 and Task2 cannot run concurrently, unless 'a' is duplicated
  • also called "variable renaming"

42

[Diagrams: Task1 reads variable 'a' and Task2 later writes it; without renaming Task2 must wait, with renaming Task2 writes a new version vi+1 while Task1 still reads version vi, so both tasks can run concurrently]