Composing multiple StarPU applications over heterogeneous machines: a supervised approach
Andra Hugo, with Abdou Guermouche, Pierre-André Wacrenier, Raymond Namyst
Inria, LaBRI, University of Bordeaux


  1. Composing multiple StarPU applications over heterogeneous machines: a supervised approach. Andra Hugo, with Abdou Guermouche, Pierre-André Wacrenier, Raymond Namyst. RUNTIME group, Inria Bordeaux Sud-Ouest; LaBRI, University of Bordeaux

  2. The increasing role of runtime systems: code reusability
• Many HPC applications rely on specific parallel libraries
  - Linear algebra, FFT, stencils
• Efficient implementations sit on top of dynamic runtime systems (Intel TBB, Harmony, StarSs, Anthill, OpenMP, Cilk, KAAPI, StarPU, Charm++, DAGuE, Qilin)
  - To deal with hybrid, multicore, complex hardware
  - To avoid reinventing the wheel! E.g. MKL/OpenMP, MAGMA/StarPU
• Some applications may benefit from relying on multiple libraries
  - Potentially using different underlying runtime systems…

  3. The increasing role of runtime systems (cont.)
• Some applications may benefit from relying on multiple libraries, potentially using different underlying runtime systems… and so would the performance of the application

  4. Struggle for resources: interferences between parallel libraries
• Parallel libraries typically allocate and bind one thread per core
• Problems: resource over-subscription, resource under-subscription
• Solutions: stand-alone allocation, hand-made allocation
• Examples: sparse direct solvers, code coupling (multi-physics, multi-scale), etc.
[Figure: four CPUs and a GPU contended by two parallel libraries; example: qr_mumps]

  5. Struggle for resources (cont.)
• Interferences between parallel libraries (over- and under-subscription of resources) lead to a composability problem. Example: qr_mumps

  6. Composability problem: how to deal with it?
• Advanced environments allow partitioning of hardware resources
  - Intel TBB: the pool of workers is split into arenas
  - Lithe: a resource-sharing management interface; harts are transferred between parallel libraries
• Main challenge: automatically adjusting the amount of resources allocated to each library

  7. Our approach: scheduling contexts. Toward code composability
• Isolate concurrent parallel codes
• Similar to lightweight virtual machines
[Figure: tasks pushed into contexts A and B, each owning a subset of the CPU and GPU workers]

  8. Our approach: scheduling contexts (cont.)
• Contexts may expand and shrink
• Hypervised approach
  - Resize contexts
  - Share resources
  - Maximize overall throughput
  - Use dynamic feedback from both the application and the runtime
[Figure: a hypervisor resizing contexts A and B over the CPU and GPU workers]

  9. Tackle the composability problem: a runtime system to validate our proposal
• Scheduling contexts to isolate parallel codes
• The Hypervisor to (re)size scheduling contexts

  10. Outline (repeated): scheduling contexts to isolate parallel codes; the Hypervisor to (re)size scheduling contexts

  11. Using StarPU as an experimental platform: a runtime system for *PU architectures, for studying resource negotiation
• The StarPU runtime system
  - Dynamically schedules tasks on all processing units
  - Sees a pool of heterogeneous processing units
  - Avoids unnecessary data transfers between accelerators
  - Software VSM for heterogeneous machines
[Figure: a computation A = A+B spread across CPUs and GPUs, each with its own memory]

  12. Overview of StarPU: maximizing PU occupancy, minimizing data transfers
• Accepts tasks that may have multiple implementations (cpu, gpu, spu), with potential inter-dependencies
  - Leads to a directed acyclic graph of tasks
• Data-flow approach: f(A: RW, B: R, C: R)
• Open, general-purpose scheduling platform
  - Scheduling policies = plugins
[Figure: software stack: HPC applications, parallel compilers and parallel libraries on top of StarPU, drivers (CUDA, OpenCL), and CPU/GPU/MIC hardware]

  13. Task scheduling: how does it work?
• When a task is submitted, it first goes into a pool of “frozen tasks” until all its dependencies are met
• Then, the task is “pushed” to the scheduler
• Idle processing units actively poll for work (“pop”)
• What happens inside the scheduler is… up to you!
  - Examples: mct, work stealing, eager, priority
[Figure: tasks pushed into the scheduler and popped by the CPU and GPU workers]

  14. Outline (repeated): scheduling contexts to isolate parallel codes; the Hypervisor to (re)size scheduling contexts

  15. Scheduling contexts in StarPU: an extension of StarPU
• “Virtual” StarPU machines
  - Feature their own scheduler
  - Minimize interferences
  - Enforce data locality
• Allocation of resources
  - Explicit: programmer’s input
  - Supervised: tips on the number of resources, tips on the number of flops
  - Shared processing units

  16. Scheduling contexts in StarPU: easily use contexts in your application

int resources1[3] = {CPU_1, CPU_2, GPU_1};
int resources2[4] = {CPU_3, CPU_4, CPU_5, CPU_6};

/* define the scheduling policy and the table of resource ids */
sched_ctx1 = starpu_create_sched_ctx("mct", resources1, 3);
sched_ctx2 = starpu_create_sched_ctx("greedy", resources2, 4);

  17. Scheduling contexts in StarPU: easily use contexts in your application

int resources1[3] = {CPU_1, CPU_2, GPU_1};
int resources2[4] = {CPU_3, CPU_4, CPU_5, CPU_6};

/* define the scheduling policy and the table of resource ids */
sched_ctx1 = starpu_create_sched_ctx("heft", resources1, 3);
sched_ctx2 = starpu_create_sched_ctx("greedy", resources2, 4);

/* thread 1: */
/* define the context associated to kernel 1 */
starpu_set_sched_ctx(sched_ctx1);
/* submit the set of tasks of parallel kernel 1 */
for (i = 0; i < ntasks1; i++)
    starpu_task_submit(tasks1[i]);

/* thread 2: */
/* define the context associated to kernel 2 */
starpu_set_sched_ctx(sched_ctx2);
/* submit the set of tasks of parallel kernel 2 */
for (i = 0; i < ntasks2; i++)
    starpu_task_submit(tasks2[i]);

  18. Experimental evaluation: platform and application
• 9 CPUs (two Intel hexa-core processors, with 3 cores devoted to running the GPU drivers) + 3 GPUs
• MAGMA linear algebra library
  - StarPU implementation
  - Cholesky factorization kernel
• Euler3D solver
  - Computational fluid dynamics benchmark from the Rodinia benchmark suite
  - Iterative solver for the 3D Euler equations for compressible fluids
  - StarPU implementation
[Figure: MAGMA Cholesky factorization]

  19. Composing MAGMA and the Euler3D solver: different parallel kernels
• Computational fluid dynamics (CFD)
  - Domain-decomposition parallelization; independent tasks per iteration, dependencies between iterations
  - Strong affinity with GPUs; 2 sub-domains on 2 GPUs
• Cholesky factorization
  - Scalable on both CPUs and GPUs; 1 GPU and 9 CPUs
  - Large number of tasks
• Contexts’ benefits: enforcing locality constraints
[Chart: CFD + Cholesky execution time: no contexts 19.8 s, 2 contexts 14.2 s]

  20. Micro-benchmark: 9 Cholesky factorizations in parallel. Gaining performance from data locality
• Mixing parallel kernels causes:
  - Unnecessary data transfers between host memory and GPU memory, leading to blocking waits
  - GPU memory flushes
[Chart: execution time: serial execution 52 s; 1 context (9 CPUs / 3 GPUs) 44.3 s; 3 contexts (3 x (3 CPUs / 1 GPU)) 34.8 s; 9 contexts (9 x (1 CPU / 0.3 GPU)) 34.4 s]

  21. Micro-benchmark: 9 Cholesky factorizations in parallel (cont.)
• Mixing parallel kernels causes unnecessary data transfers between host memory and GPU memory (blocking waits) and GPU memory flushes
[Chart: execution time and total data transferred: serial execution 52 s, 87 GB; 1 context (9 CPUs / 3 GPUs) 44.3 s, 113 GB; 3 contexts (3 x (3 CPUs / 1 GPU)) 34.8 s, 37 GB; 9 contexts (9 x (1 CPU / 0.3 GPU)) 34.4 s, 41 GB]

  22. Outline (repeated): scheduling contexts to isolate parallel codes; the Hypervisor to (re)size scheduling contexts
