1 StarPU: Exploiting heterogeneous architectures through task-based programming
Cédric Augonnet, Nathalie Furmento, Raymond Namyst, Samuel Thibault
INRIA Bordeaux, LaBRI, University of Bordeaux
ComplexHPC spring school – May 13th, 2011
2–4 The RUNTIME Team
[Map slides introducing the team; caption: “Doing Parallelism for centuries!”]
5 The RUNTIME Team
Research directions
• High-performance runtime systems for parallel architectures
  “Runtime systems perform dynamically what cannot be done statically”
• Main research directions
  • Exploiting shared-memory machines
    – Thread scheduling over hierarchical multicore architectures
    – Task scheduling over accelerator-based machines
  • Communication over high-speed networks
    – Multicore-aware communication engines
    – Multithreaded MPI implementations
  • Integration of multithreading and communication
    – Runtime support for hybrid programming
• See http://runtime.bordeaux.inria.fr/ for more information
6 Introduction
Toward heterogeneous multicore architectures
• Multicore is here
  • Hierarchical architectures
• Manycore is coming, and power is a major concern
  • Architecture specialization
• Now
  – Accelerators (GPGPUs, FPGAs)
  – Coprocessors (Cell's SPUs)
• In the (near?) future
  – Many simple cores
  – A few full-featured cores
[Diagram: chips mixing a few large cores with many small cores]
7 Introduction
How to program these architectures? Multicore
• Multicore programming
  • pthreads, OpenMP, TBB, Cilk, MPI, ...
[Diagram: a multicore machine – CPUs sharing main memory]
8 Introduction
How to program these architectures? Accelerators
• Multicore programming
  • pthreads, OpenMP, TBB, ...
• Accelerator programming
  • OpenCL, CUDA, libspe, ATI Stream, ...
  • Consensus on OpenCL?
  • (Often) pure offloading model
[Diagram: CPUs plus accelerators (*PUs), each with its own memory]
9 Introduction
How to program these architectures? Multicore + accelerators
• Multicore programming
  • pthreads, OpenMP, TBB, Cilk, MPI, ...
• Accelerator programming
  • OpenCL, CUDA, libspe, ATI Stream, ...
  • Consensus on OpenCL?
  • (Often) pure offloading model
• Hybrid models?
  • Take advantage of all resources ☺
  • Complex interactions ☹
[Diagram: CPUs and accelerators (*PUs) with separate memories; bridging the two programming worlds is an open question]
10 Introduction
Challenging issues at all stages
• Applications
  • Programming paradigm
  • BLAS kernels, FFT, …
• Compilers
  • Languages
  • Code generation/optimization
• Runtime systems
  • Resource management
  • Task scheduling
• Architecture
  • Memory interconnect
[Diagram: software stack – HPC applications / compiling environment and specific libraries / runtime system / operating system / hardware]
11 Introduction
Challenging issues at all stages
• Same software stack as on the previous slide, with two arrows added: an expressive interface passes information down from the upper layers to the runtime system, and execution feedback flows back up
12 Outline
• Overview of StarPU
• Programming interface
• Task & data management
• Task scheduling
• MAGMA+PLASMA example
• Experimental features
• Conclusion
13 Overview of StarPU
14 Overview of StarPU
Rationale
• Dynamically schedule tasks on all processing units
  • See a pool of heterogeneous processing units
• Avoid unnecessary data transfers between accelerators
  • Software VSM (virtual shared memory) for heterogeneous machines
[Diagram: a task A = A+B running on a machine with CPUs and GPUs; replicas of A and B are kept in the CPUs' and GPUs' memories]
15 The StarPU runtime system
• Mastering CPUs, GPUs, SPUs … *PUs
[Diagram: execution model – HPC applications on top of StarPU (high-level data management library, scheduling engine, specific drivers), over CPUs, GPUs, SPUs, ...]
16 The StarPU runtime system
The need for runtime systems
• “Do dynamically what can’t be done statically anymore”
• StarPU provides
  • Task scheduling
  • Memory management
• Compilers and libraries generate (graphs of) parallel tasks
• Additional information is welcome!
[Diagram: parallel compilers and parallel libraries on top of StarPU, which drives CPUs and GPUs through its drivers (CUDA, OpenCL)]
17 Data management
• StarPU provides a Virtual Shared Memory (VSM) subsystem
  • Weak consistency
  • Replication
  • Single writer
  • High-level API
    – Partitioning filters (see the sketch below)
• Input & output of tasks = references to VSM data
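As a hedged illustration of partitioning filters: the sketch below splits a registered vector into sub-vectors that tasks can operate on independently. The filter function name follows the later StarPU 1.x API (the release contemporary with these slides spelled it differently), and NPARTS and i are hypothetical names.

  /* Sketch: split a registered vector into NPARTS equal sub-vectors. */
  struct starpu_data_filter f = {
      .filter_func = starpu_vector_filter_block,  /* 1.x name; assumption for this era */
      .nchildren = NPARTS,
  };
  starpu_data_partition(vector_handle, &f);

  /* Each sub-vector is itself a handle that tasks can take as input/output: */
  starpu_data_handle_t sub = starpu_data_get_sub_data(vector_handle, 1, i);

  /* Gather the pieces back on memory node 0 (main RAM): */
  starpu_data_unpartition(vector_handle, 0);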
18 The StarPU runtime system
Task scheduling
• Tasks =
  • Data input & output
    – References to VSM data
  • Multiple implementations
    – E.g. CUDA + CPU implementations
  • Dependencies with other tasks (see the sketch below)
  • Scheduling hints
• StarPU provides an open scheduling platform
  • Scheduling algorithms = plug-ins
[Diagram: a task f(A:RW, B:R, C:R) with cpu/gpu/spu implementations submitted to StarPU, which dispatches it to the CPU/GPU drivers]
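A minimal sketch of declaring an explicit dependency between two tasks; starpu_task_declare_deps_array is a real StarPU call, while taskA and taskB are hypothetical tasks assumed to have been created with starpu_task_create():

  /* Sketch: make taskB start only after taskA completes.
   * (StarPU also infers dependencies automatically from data accesses.) */
  struct starpu_task *deps[] = { taskA };
  starpu_task_declare_deps_array(taskB, 1, deps);  /* declare before submitting taskB */
  starpu_task_submit(taskA);
  starpu_task_submit(taskB);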
19 The StarPU runtime system
Task scheduling
• Who generates the code?
  • A StarPU task ≈ function pointers
  • StarPU doesn't generate code
• Libraries era
  • PLASMA + MAGMA
  • FFTW + CUFFT, ...
• Rely on compilers
  • PGI accelerators
  • CAPS HMPP, ...
20–27 The StarPU runtime system
Execution model
[Diagram: the application sits on top of StarPU (scheduling engine + memory management (DSM)), which drives a CPU driver and GPU driver #k; data A and B initially reside in RAM]
These slides step through the life of a task “A += B”:
1. Submit task: the application submits “A += B” to the scheduling engine
2. Schedule task: the engine assigns the task to a worker, here GPU driver #k
3. Fetch data: the DSM transfers replicas of A and B into the GPU's memory
4. Offload computation: the GPU driver runs “A += B” on the local copies
5. Notify termination: the driver signals completion back to the application
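The application-side counterpart of this sequence is short. A hedged sketch using the task API of that era (later releases use task->handles[] with access modes declared in the codelet); accum_cl, handle_A and handle_B are hypothetical names:

  /* Sketch: submit “A += B” and wait; steps 2–5 above happen asynchronously. */
  struct starpu_task *task = starpu_task_create();
  task->cl = &accum_cl;                 /* hypothetical codelet implementing A += B */
  task->buffers[0].handle = handle_A;
  task->buffers[0].mode = STARPU_RW;
  task->buffers[1].handle = handle_B;
  task->buffers[1].mode = STARPU_R;
  starpu_task_submit(task);             /* returns immediately */
  starpu_task_wait_for_all();           /* block until all submitted tasks finish */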
28 The StarPU runtime system
Development context
• History
  • Started about 3 years ago
  • StarPU main core: ~20k lines of code, written in C
  • 3 core developers
    – Cédric Augonnet, Samuel Thibault, Nathalie Furmento
• Open source
  • Released under the LGPL
  • Sources freely available
    – svn repository and nightly tarballs
    – See http://runtime.bordeaux.inria.fr/StarPU/
  • Open to external contributors
29 The StarPU runtime system
Supported platforms
• Supported architectures
  • Multicore CPUs (x86, PPC, ...)
  • NVIDIA GPUs
  • OpenCL devices (e.g. AMD cards)
  • Cell processors (experimental)
• Supported operating systems
  • Linux
  • Mac OS
  • Windows
30 Performance teaser
• QR decomposition
• Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)
[Performance plot not preserved in the transcript]
31 Programming interface
32 Scaling a vector
Launching StarPU
• Makefile flags
  CFLAGS += $(shell pkg-config --cflags libstarpu)
  LDFLAGS += $(shell pkg-config --libs libstarpu)
• Headers
  #include <starpu.h>
• (De)Initialize StarPU
  starpu_init(NULL);
  starpu_shutdown();
(A complete minimal program follows.)
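The pieces above assembled into the smallest complete StarPU program, as a sketch; one detail the slide omits is that starpu_init returns 0 on success (and an error code such as -ENODEV otherwise):

  #include <stdio.h>
  #include <starpu.h>

  int main(void)
  {
      /* Initialize StarPU with the default configuration (NULL) */
      int ret = starpu_init(NULL);
      if (ret != 0) {
          fprintf(stderr, "starpu_init failed\n");
          return 1;
      }
      /* ... register data and submit tasks here ... */
      starpu_shutdown();
      return 0;
  }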
33 Scaling a vector
Data registration
• Register a piece of data to StarPU
  float array[NX];
  for (unsigned i = 0; i < NX; i++)
      array[i] = 1.0f;

  starpu_data_handle vector_handle;
  starpu_vector_data_register(&vector_handle, 0, (uintptr_t)array,
                              NX, sizeof(array[0]));
• Unregister data
  starpu_data_unregister(vector_handle);
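One thing the slide does not show: once registered, the buffer should not be touched directly while tasks may be using it. A hedged sketch of the acquire/release protocol (starpu_data_acquire and starpu_data_release are real StarPU calls):

  /* Sketch: safely read the vector from the main program between tasks. */
  starpu_data_acquire(vector_handle, STARPU_R);  /* waits for pending tasks on this data */
  printf("array[0] = %f\n", array[0]);
  starpu_data_release(vector_handle);            /* hand the data back to StarPU */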
34 Scaling a vector
Defining a codelet
• CPU kernel
  void scal_cpu_func(void *buffers[], void *cl_arg)
  {
      /* Retrieve the vector's geometry and local pointer */
      struct starpu_vector_interface_s *vector = buffers[0];
      unsigned n = STARPU_VECTOR_GET_NX(vector);
      float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
      float *factor = cl_arg;

      for (unsigned i = 0; i < n; i++)
          val[i] *= *factor;
  }
35 Scaling a vector
Defining a codelet (2)
• CUDA kernel (compiled with nvcc, in a separate .cu file)
  __global__ void vector_mult_cuda(float *val, unsigned n, float factor)
  {
      for (unsigned i = 0; i < n; i++)
          val[i] *= factor;
  }

  extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
  {
      struct starpu_vector_interface_s *vector = buffers[0];
      unsigned n = STARPU_VECTOR_GET_NX(vector);
      float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
      float *factor = (float *)cl_arg;

      /* Single-thread launch, for simplicity of the example */
      vector_mult_cuda<<<1, 1>>>(val, n, *factor);
      cudaThreadSynchronize();
  }
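The transcript ends before the slide that ties both kernels into a codelet and submits the task. A hedged sketch of that missing step, using the codelet field names of the StarPU release contemporary with these slides (later releases replaced where/cpu_func/cuda_func with *_funcs arrays):

  /* Sketch: a codelet with CPU and CUDA implementations, and one task using it. */
  static starpu_codelet scal_cl = {
      .where     = STARPU_CPU | STARPU_CUDA,
      .cpu_func  = scal_cpu_func,
      .cuda_func = scal_cuda_func,
      .nbuffers  = 1,                       /* one piece of VSM data: the vector */
  };

  float factor = 3.14f;
  struct starpu_task *task = starpu_task_create();
  task->cl = &scal_cl;
  task->buffers[0].handle = vector_handle;  /* registered earlier */
  task->buffers[0].mode   = STARPU_RW;
  task->cl_arg      = &factor;              /* passed to the kernels as cl_arg */
  task->cl_arg_size = sizeof(factor);
  starpu_task_submit(task);
  starpu_task_wait_for_all();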