Composing multiple StarPU applications over heterogeneous machines: a supervised approach
Andra Hugo, with Abdou Guermouche, Pierre-André Wacrenier, Raymond Namyst
Inria, LaBRI, University of Bordeaux


  1. Composing multiple StarPU applications over heterogeneous machines: a supervised approach. Andra Hugo, with Abdou Guermouche, Pierre-André Wacrenier, Raymond Namyst. RUNTIME group, Inria Bordeaux Sud-Ouest; LaBRI, University of Bordeaux

  2. The increasing role of runtime systems: code reusability
• Many HPC applications rely on specific parallel libraries
  - Linear algebra, FFT, stencils
• Efficient implementations sit on top of dynamic runtime systems (Intel TBB, Harmony, StarSs, Anthill, OpenMP, Cilk, KAAPI, StarPU, Charm++, DAGuE, Qilin)
  - To deal with hybrid, multicore, complex hardware
  - To avoid reinventing the wheel! E.g. MKL/OpenMP, MAGMA/StarPU
• Some applications may benefit from relying on multiple libraries
  - Potentially using different underlying runtime systems…

  3. The increasing role of runtime systems (cont.)
• Some applications may benefit from relying on multiple libraries, potentially using different underlying runtime systems… and so would the performance of the application

  4. Struggle for resources: interferences between parallel libraries
• Parallel libraries typically allocate and bind one thread per core
• Problems: resource over-subscription, resource under-subscription
• Solutions: stand-alone allocation, hand-made allocation
• Examples: sparse direct solvers, code coupling (multi-physics, multi-scale), etc.
[Figure: four CPUs and a GPU contended by two parallel libraries; example: qr_mumps]

  5. Struggle for resources (cont.)
• Interferences between parallel libraries (over- and under-subscription of resources) lead to a composability problem. Example: qr_mumps

  6. Composability problem: how to deal with it?
• Advanced environments allow partitioning of hardware resources
  - Intel TBB: the pool of workers is split into arenas
  - Lithe: a resource-sharing management interface; harts are transferred between parallel libraries
• Main challenge: automatically adjusting the amount of resources allocated to each library

  7. Our approach: scheduling contexts. Toward code composability
• Isolate concurrent parallel codes
• Similar to lightweight virtual machines
[Figure: tasks pushed into contexts A and B, each owning a subset of the CPU and GPU workers]

  8. Our approach: scheduling contexts (cont.)
• Contexts may expand and shrink
• Hypervised approach
  - Resize contexts
  - Share resources
  - Maximize overall throughput
  - Use dynamic feedback from both the application and the runtime
[Figure: a hypervisor resizing contexts A and B over the CPU and GPU workers]

  9. Tackle the composability problem: a runtime system to validate our proposal
• Scheduling contexts to isolate parallel codes
• The Hypervisor to (re)size scheduling contexts

  10. Outline (repeated): scheduling contexts to isolate parallel codes; the Hypervisor to (re)size scheduling contexts

  11. Using StarPU as an experimental platform: a runtime system for *PU architectures, for studying resource negotiation
• The StarPU runtime system
  - Dynamically schedules tasks on all processing units
  - Sees a pool of heterogeneous processing units
  - Avoids unnecessary data transfers between accelerators
  - Software VSM for heterogeneous machines
[Figure: a computation A = A+B spread across CPUs and GPUs, each with its own memory]

  12. Overview of StarPU: maximizing PU occupancy, minimizing data transfers
• Accepts tasks that may have multiple implementations (cpu, gpu, spu), with potential inter-dependencies
  - Leads to a directed acyclic graph of tasks
• Data-flow approach: f(A: RW, B: R, C: R)
• Open, general-purpose scheduling platform
  - Scheduling policies = plugins
[Figure: software stack: HPC applications, parallel compilers and parallel libraries on top of StarPU, drivers (CUDA, OpenCL), and CPU/GPU/MIC hardware]

  13. Task scheduling: how does it work?
• When a task is submitted, it first goes into a pool of “frozen tasks” until all its dependencies are met
• Then, the task is “pushed” to the scheduler
• Idle processing units actively poll for work (“pop”)
• What happens inside the scheduler is… up to you!
  - Examples: mct, work stealing, eager, priority
[Figure: tasks pushed into the scheduler and popped by the CPU and GPU workers]

  14. Outline (repeated): scheduling contexts to isolate parallel codes; the Hypervisor to (re)size scheduling contexts

  15. Scheduling contexts in StarPU: an extension of StarPU
• “Virtual” StarPU machines
  - Feature their own scheduler
  - Minimize interferences
  - Enforce data locality
• Allocation of resources
  - Explicit: programmer’s input
  - Supervised: tips on the number of resources, tips on the number of flops
  - Shared processing units

  16. Scheduling contexts in StarPU: easily use contexts in your application

int resources1[3] = {CPU_1, CPU_2, GPU_1};
int resources2[4] = {CPU_3, CPU_4, CPU_5, CPU_6};

/* define the scheduling policy and the table of resource ids */
sched_ctx1 = starpu_create_sched_ctx("mct", resources1, 3);
sched_ctx2 = starpu_create_sched_ctx("greedy", resources2, 4);

  17. Scheduling contexts in StarPU: easily use contexts in your application

int resources1[3] = {CPU_1, CPU_2, GPU_1};
int resources2[4] = {CPU_3, CPU_4, CPU_5, CPU_6};

/* define the scheduling policy and the table of resource ids */
sched_ctx1 = starpu_create_sched_ctx("heft", resources1, 3);
sched_ctx2 = starpu_create_sched_ctx("greedy", resources2, 4);

/* thread 1: */
/* define the context associated to kernel 1 */
starpu_set_sched_ctx(sched_ctx1);
/* submit the set of tasks of parallel kernel 1 */
for (i = 0; i < ntasks1; i++)
    starpu_task_submit(tasks1[i]);

/* thread 2: */
/* define the context associated to kernel 2 */
starpu_set_sched_ctx(sched_ctx2);
/* submit the set of tasks of parallel kernel 2 */
for (i = 0; i < ntasks2; i++)
    starpu_task_submit(tasks2[i]);

  18. Experimental evaluation: platform and application
• 9 CPUs (two Intel hexa-core processors, with 3 cores devoted to running the GPU drivers) + 3 GPUs
• MAGMA linear algebra library
  - StarPU implementation
  - Cholesky factorization kernel
• Euler3D solver
  - Computational fluid dynamics benchmark from the Rodinia benchmark suite
  - Iterative solver for the 3D Euler equations for compressible fluids
  - StarPU implementation
[Figure: MAGMA Cholesky factorization]

  19. Composing MAGMA and the Euler3D solver: different parallel kernels
• Computational fluid dynamics (CFD)
  - Domain-decomposition parallelization; independent tasks per iteration, dependencies between iterations
  - Strong affinity with GPUs; 2 sub-domains on 2 GPUs
• Cholesky factorization
  - Scalable on both CPUs and GPUs; 1 GPU and 9 CPUs
  - Large number of tasks
• Contexts’ benefits: enforcing locality constraints
[Chart: CFD + Cholesky execution time: no contexts 19.8 s, 2 contexts 14.2 s]

  20. Micro-benchmark: 9 Cholesky factorizations in parallel. Gaining performance from data locality
• Mixing parallel kernels causes:
  - Unnecessary data transfers between host memory and GPU memory, leading to blocking waits
  - GPU memory flushes
[Chart: execution time: serial execution 52 s; 1 context (9 CPUs / 3 GPUs) 44.3 s; 3 contexts (3 x (3 CPUs / 1 GPU)) 34.8 s; 9 contexts (9 x (1 CPU / 0.3 GPU)) 34.4 s]

  21. Micro-benchmark: 9 Cholesky factorizations in parallel (cont.)
• Mixing parallel kernels causes unnecessary data transfers between host memory and GPU memory (blocking waits) and GPU memory flushes
[Chart: execution time and total data transferred: serial execution 52 s, 87 GB; 1 context (9 CPUs / 3 GPUs) 44.3 s, 113 GB; 3 contexts (3 x (3 CPUs / 1 GPU)) 34.8 s, 37 GB; 9 contexts (9 x (1 CPU / 0.3 GPU)) 34.4 s, 41 GB]

  22. Outline (repeated): scheduling contexts to isolate parallel codes; the Hypervisor to (re)size scheduling contexts
