1 Running PEPPHER benchmarks on top of the StarPU runtime system
Cédric Augonnet, Nicolas Collin, Nathalie Furmento, Raymond Namyst, Samuel Thibault
INRIA Bordeaux, LaBRI, Université de Bordeaux
22nd January 2011
2 The StarPU runtime system: Motivations
• "Do dynamically what can't be done statically"
• Typical duties
  • Task scheduling
  • Memory management
• Compilers and libraries generate (graphs of) parallel tasks
  • Additional information is welcome!
[Figure: software stack – HPC applications on top of parallel compilers and parallel libraries, which rely on the runtime system, the operating system, and the CPU, GPU, … hardware]
3 The StarPU runtime system: Motivations
• Main challenges
  • Dynamically schedule tasks on all processing units
    – See a pool of heterogeneous cores
    – Scheduling ≠ offloading
  • Avoid unnecessary data transfers between accelerators
    – Need to keep track of data copies
[Figure: a heterogeneous machine (CPUs and GPUs, each with its own memory) computing A = A+B, with copies of A and B replicated across the different memories]
4 The StarPU runtime system: Memory management
• StarPU provides a Virtual Shared Memory subsystem
  • Weak consistency
  • Replication
  • Single writer
  • High-level API
• The application registers data (see the registration sketch below)
• Input & output of tasks = references to registered data
[Figure: software stack – HPC applications, parallel compilers and libraries, StarPU, drivers (CUDA, OpenCL), CPU, GPU, …]
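A minimal sketch (not code from the talk) of how an application registers a piece of data with StarPU's C API, assuming a recent StarPU release; the array name and size are illustrative:

#include <starpu.h>

int main(void)
{
    float x[1024];
    starpu_data_handle_t vec;

    starpu_init(NULL);                          /* start the runtime */

    /* Register the array: StarPU now manages its copies across CPU and
     * GPU memories, and tasks refer to it only through this handle. */
    starpu_vector_data_register(&vec, STARPU_MAIN_RAM,
                                (uintptr_t)x, 1024, sizeof(float));

    /* ... submit tasks that access 'vec' ... */

    starpu_data_unregister(vec);                /* bring the data back, drop the handle */
    starpu_shutdown();
    return 0;
}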
5 The StarPU runtime system: Task scheduling
• Tasks =
  • Data input & output
  • Dependencies with other tasks
  • Multiple implementations
    – e.g. CUDA and/or CPU (see the codelet sketch below)
  • Scheduling hints
• StarPU provides an open scheduling platform
  • Scheduling algorithms = plug-ins
[Figure: a task f(A rw, B r, C r) with cpu, gpu and spu implementations, handed to StarPU and its drivers (CUDA, OpenCL) on top of CPU, GPU, …]
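For illustration (again not code from the talk), a codelet with a CPU and a CUDA implementation, and a task submitted through the insert-task helper; the vector-scaling kernel and the names scal_cpu/scal_cuda are invented for the example, assuming StarPU 1.x:

#include <starpu.h>

/* CPU implementation: scale the vector in place. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *v = buffers[0];
    float *x = (float *)STARPU_VECTOR_GET_PTR(v);
    unsigned n = STARPU_VECTOR_GET_NX(v);
    float factor = *(float *)cl_arg;
    for (unsigned i = 0; i < n; i++)
        x[i] *= factor;
}

/* CUDA implementation, compiled with nvcc elsewhere (same prototype). */
extern void scal_cuda(void *buffers[], void *cl_arg);

static struct starpu_codelet scal_cl = {
    .cpu_funcs  = { scal_cpu },
    .cuda_funcs = { scal_cuda },
    .nbuffers   = 1,
    .modes      = { STARPU_RW },
};

/* Submit one task; StarPU picks the implementation and the processing unit. */
static int scale_vector(starpu_data_handle_t vec, float factor)
{
    return starpu_insert_task(&scal_cl,
                              STARPU_RW, vec,
                              STARPU_VALUE, &factor, sizeof(factor),
                              0);
}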
6 PEPPHER benchmarks
• Fast Fourier Transform (FFT)
  • Mixing FFTW and CUFFTW
• Dense linear algebra
  • Mixing PLASMA and MAGMA
• Computational Fluid Dynamics (CFD)
  • Porting Rodinia's CFD solver
7 Dense Linear Algebra
Mixing PLASMA and MAGMA (collaboration with UTK)
8 Mixing PLASMA and MAGMA with StarPU: Background
• Background
  • Cholesky/LU/QR: solve dense linear systems
  • UTK: among the leaders in dense linear algebra for ~20 years
  • Need performance portability
• State-of-the-art libraries
  • PLASMA (multicore CPUs)
  • MAGMA (multiple GPUs)
• Our approach
  • Use PLASMA algorithms
  • PLASMA kernels on CPUs, MAGMA kernels on GPUs
  • Schedule tasks with StarPU
9 Mixing PLASMA and MAGMA with StarPU: Productivity
• Programmability
  • Cholesky: ~half a week, QR: ~2 days of work, LU: ~the time to write the new kernels
• Quick algorithmic prototyping (a C version of the hybrid loop follows below)

// Sequential Tile Cholesky
FOR k = 0..TILES-1
  DPOTRF(A[k][k])
  FOR m = k+1..TILES-1
    DTRSM(A[k][k], A[m][k])
  FOR n = k+1..TILES-1
    DSYRK(A[n][k], A[n][n])
    FOR m = n+1..TILES-1
      DGEMM(A[m][k], A[n][k], A[m][n])

// Hybrid Tile Cholesky
FOR k = 0..TILES-1
  starpu_insert_task(DPOTRF, …)
  FOR m = k+1..TILES-1
    starpu_insert_task(DTRSM, …)
  FOR n = k+1..TILES-1
    starpu_insert_task(DSYRK, …)
    FOR m = n+1..TILES-1
      starpu_insert_task(DGEMM, …)
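As a concrete illustration of the pseudocode above, here is a minimal sketch of the hybrid tile Cholesky loop written against the StarPU insert-task interface. The codelet names (potrf_cl, trsm_cl, syrk_cl, gemm_cl) and the per-tile handle array are assumptions for the example, not the exact code of the PLASMA/MAGMA port.

#include <starpu.h>

/* Codelets assumed to be defined elsewhere, each combining a PLASMA (CPU)
 * and a MAGMA (CUDA) implementation of the corresponding kernel. */
extern struct starpu_codelet potrf_cl, trsm_cl, syrk_cl, gemm_cl;

/* A[m][n] is the StarPU handle of tile (m, n) of the matrix. */
void hybrid_tile_cholesky(starpu_data_handle_t **A, int tiles)
{
    for (int k = 0; k < tiles; k++) {
        starpu_insert_task(&potrf_cl, STARPU_RW, A[k][k], 0);
        for (int m = k + 1; m < tiles; m++)
            starpu_insert_task(&trsm_cl,
                               STARPU_R, A[k][k], STARPU_RW, A[m][k], 0);
        for (int n = k + 1; n < tiles; n++) {
            starpu_insert_task(&syrk_cl,
                               STARPU_R, A[n][k], STARPU_RW, A[n][n], 0);
            for (int m = n + 1; m < tiles; m++)
                starpu_insert_task(&gemm_cl,
                                   STARPU_R, A[m][k], STARPU_R, A[n][k],
                                   STARPU_RW, A[m][n], 0);
        }
    }
    /* Dependencies between tasks are inferred from the data access modes. */
    starpu_task_wait_for_all();
}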
10 Mixing PLASMA and MAGMA with StarPU
• Cholesky decomposition
• Hannibal: 8 CPU cores (Nehalem) + 3 GPUs (NVIDIA FX5800)
[Figure: Cholesky performance on Hannibal]
11 Mixing PLASMA and MAGMA with StarPU
• QR decomposition
• Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)
[Figure: QR performance on Mordor8]
12 Mixing PLASMA and MAGMA with StarPU
• QR decomposition
• Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)
[Figure: QR performance on Mordor8, with the MAGMA curve shown for comparison]
13 Mixing PLASMA and MAGMA with StarPU
• QR decomposition
• Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)
[Figure: QR performance on Mordor8 – adding 12 CPU cores brings ~200 GFlop/s, while the peak of the 12 cores alone is ~150 GFlop/s]
14 Mixing PLASMA and MAGMA with StarPU
• Memory transfers during Cholesky decomposition
[Figure: amount of data transferred – ~2.5x fewer transfers]
15 Mixing PLASMA and MAGMA with StarPU: Perspective
• Add more algorithms
  • Two-sided factorizations (e.g. Hessenberg)
  • Solvers
• Going to be released as a standalone library
  • Toward a complete LAPACK implementation for hybrid computing
  • Need autotuning facilities!
• Next step: integrate MPI
  • On-going work
  • An accelerated ScaLAPACK?
16 Rodinia's CFD Solver
17 Rodinia's CFD Solver: Background
• The Rodinia benchmark suite
  • Covers the different « Berkeley Dwarves »
  • Available either in OpenMP or in CUDA
  • Neither multi-GPU nor hybrid systems
• Rodinia's CFD solver benchmark
  • 3D Euler equations for compressible flow
  • Unstructured-grid finite volumes
  • Memory-intensive kernel
  • Pre-processing and post-processing are not available
    – Need to create our own input meshes
18 Rodinia's CFD Solver: Methodology
• Pre-processing
  • Generated a mesh of the air around a sphere
  • Still very simple!
• Parallelizing the problem (see the sketch below)
  • Partition the mesh using SCOTCH
  • 1 task = update 1 part
  • Redundant computation
  • Exchange part boundaries
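A rough sketch of the resulting task submission, under stated assumptions: euler_step_cl, the part handles and the boundary handles are invented names, and the parts are arranged in a simple ring for readability, whereas the real unstructured mesh has arbitrary neighbour lists.

#include <starpu.h>

/* Codelet updating one mesh part for one time step (CPU and CUDA
 * implementations assumed to be defined elsewhere). */
extern struct starpu_codelet euler_step_cl;

/* One handle per mesh part and one per part boundary, as obtained after
 * partitioning the mesh with SCOTCH and registering each piece. */
void cfd_iteration(starpu_data_handle_t *part,
                   starpu_data_handle_t *boundary, int nparts)
{
    for (int p = 0; p < nparts; p++) {
        /* Each task updates its own part (RW) and reads the boundaries of
         * its neighbours (R); StarPU infers the dependencies and the data
         * transfers from these access modes. */
        starpu_insert_task(&euler_step_cl,
                           STARPU_RW, part[p],
                           STARPU_R, boundary[(p + nparts - 1) % nparts],
                           STARPU_R, boundary[(p + 1) % nparts],
                           0);
    }
}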
19 Rodinia's CFD Solver: Post-processing
20 Rodinia's CFD Solver: Preliminary results
• Problem size
  • 64x64x64 grid, 1.3 million tetrahedra
• Reference CPU performance
  • 1 core (Intel Westmere X5650) – 1.4 s per iteration
  • 12 cores – 0.15 s per iteration
• Preliminary performance with StarPU
  • 1 NVIDIA C2050 – 53 ms per iteration
  • 2 NVIDIA C2050 – 28 ms per iteration
• We need larger problems!
21 Rodinia's CFD Solver: Perspective
• Port to OpenCL
• Use hybrid platforms
  • GPUs are much faster than CPUs
    – Memory bound
    – Rather few tasks
  • Parallel CPU tasks – large granularity
• Heterogeneity-aware data layout (see the sketch below)
  • CPUs: arrays of structures (cache friendly)
  • GPUs: structures of arrays (SIMD friendly)
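To make the last point concrete, a small illustration (not taken from the talk) of the two layouts for per-cell flow variables; the field names and the cell count are invented:

enum { NB_CELLS = 1024 };   /* illustrative problem size */

/* Array of Structures: all variables of a cell are contiguous,
 * which suits CPU caches when a core works on one cell at a time. */
struct cell_aos {
    float density;
    float momentum[3];
    float energy;
};
struct cell_aos cells_aos[NB_CELLS];

/* Structure of Arrays: each variable is a contiguous array, so consecutive
 * GPU threads read consecutive addresses (coalesced, SIMD friendly). */
struct cells_soa {
    float density[NB_CELLS];
    float momentum_x[NB_CELLS];
    float momentum_y[NB_CELLS];
    float momentum_z[NB_CELLS];
    float energy[NB_CELLS];
};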
22 Conclusion
• StarPU
  • Data management & task scheduling
  • Freely available under the LGPL on Linux, Mac and Windows
• Adapted 3 PEPPHER benchmarks
  • FFTW + CUFFTW
  • MAGMA + PLASMA
  • Rodinia's CFD solver
23 Conclusion
• Productive approach
  • Rely on existing kernels for CPU/GPU
  • Architecture-independent task model
  • Higher-level front-ends would help
    – StarSs, HMPP, Codeplay's Offload
• Autotuning will be required
  • Need to find the optimal granularity
    – Parallel tasks
    – Divisible tasks
  • Select code variants
    – e.g. with SkePU