

  1. Task scheduling over Heterogeneous Multicore Machines: a Runtime Perspective
     Runtime Systems for Petascale Computing Systems: a Pessimistic View
     Raymond Namyst, “Runtime” group, INRIA Bordeaux Research Center, University of Bordeaux 1, France

  2. Outline
     • The frightening evolution of parallel architectures
       – Multicore + coprocessors + accelerators = heterogeneous architectures
     • New programming challenges
       – Hybrid programming models
     • Designing runtime systems for heterogeneous machines
       – Scheduling and memory consistency
     • Challenges for the upcoming years
       – The current situation is terrible, but there is hope!

     Multicore is a solid architecture trend
     • Multicore chips
       – Architects’ answer to the question: “What circuits should we add on a die?”
         (no point in adding new predictors or other intelligent units…)
       – Different from SMPs: hierarchical chips, getting really complex
       – Back to the CC-NUMA era?

  3. Machines are going heterogeneous
     • GPGPUs are the new kids on the block
       – Very powerful SIMD accelerators
       – Successfully used for offloading data-parallel kernels
     • Other chips already feature specialized hardware
       – IBM Cell/BE: 1 PPU + 8 SPUs
       – Intel Larrabee: 48 cores with SIMD units

     And I mean “really heterogeneous”:
     • Programming model
       – Specialized instruction sets
       – SIMD execution model
     • Memory
       – Size limitations
       – No hardware consistency, hence explicit data transfers (a sketch follows)
     • Are we happy with that?
       – No, but it’s probably unavoidable!
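
     Concretely, “no hardware consistency” means the host must move data to and from the accelerator’s separate memory by hand. A minimal sketch using the CUDA runtime API (the kernel and function names are illustrative, not from the talk):

         #include <cuda_runtime.h>

         __global__ void scale(float *v, float a, int n) {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < n) v[i] *= a;                  // data-parallel kernel
         }

         void offload_scale(float *host_v, int n) {
             float *dev_v;
             cudaMalloc(&dev_v, n * sizeof(float));
             // Explicit copy in: the GPU cannot see host memory by itself.
             cudaMemcpy(dev_v, host_v, n * sizeof(float), cudaMemcpyHostToDevice);
             scale<<<(n + 255) / 256, 256>>>(dev_v, 2.0f, n);
             // Explicit copy out: results must be fetched back explicitly.
             cudaMemcpy(host_v, dev_v, n * sizeof(float), cudaMemcpyDeviceToHost);
             cudaFree(dev_v);
         }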

  4. Heterogeneity is also a solid trend
     • One interpretation of Amdahl’s law (stated below)
       – We will always need large and powerful general-purpose cores to speed up the sequential parts of our applications!
     • “Future processors will be a mix of general purpose and specialized cores” [anonymous source]
     • We have to get prepared!
       – Understand today’s architectures: IBM Cell (1+8 cores), Intel TeraScale (80 cores), AMD graphics processors
       – Get ready for tomorrow’s accelerators
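
     For reference, the standard statement of Amdahl’s law behind this interpretation: if a fraction s of a program is inherently sequential, the speedup on N cores is

         \[ S(N) \;=\; \frac{1}{\,s + \frac{1-s}{N}\,} \;\xrightarrow{\;N\to\infty\;}\; \frac{1}{s} \]

     so piling up small cores can never push the speedup past 1/s; only running the sequential part itself faster, on a few large, powerful cores, helps beyond that point.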

  5. New Programming Challenges
     Programming homogeneous multicore machines: why not just try to extend existing solutions (OpenMP, TBB, Cilk, MPI)?
     • Shared-memory approach
       – Scalability
       – NUMA-awareness
       – Affinity-guided scheduling (a minimal sketch follows)
     • Message-passing approach
       – Cache-friendly buffers
       – Topology-awareness
       – Collective operations
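
     A minimal illustration of affinity-guided scheduling, using the standard OpenMP 4+ binding controls (these constructs postdate the tools presented in this talk, but address the same concern):

         #include <omp.h>
         #include <stdio.h>

         int main(void) {
             // Run with e.g.: OMP_PLACES=cores ./a.out
             // proc_bind(spread) spreads the team over the machine's places
             // (cores), which improves memory bandwidth on NUMA machines.
             #pragma omp parallel proc_bind(spread)
             {
                 #pragma omp single
                 printf("%d threads, spread over the core hierarchy\n",
                        omp_get_num_threads());
             }
             return 0;
         }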

  6. Programming homogeneous multicore machines
     • OpenMP
       – Scheduling in a NUMA context (memory affinity, work stealing)
       – Memory management (page migration)
     • MPI
       – NUMA-aware buffer management
       – Efficient collective operations
     • Also several interesting approaches
       – Intel TBB, SMP-superscalar, etc.
       – The common idea: we need fine-grain parallelism!

     Our background: Thread Scheduling over Multicore Machines
     • The Bubble Scheduling concept (an illustrative sketch follows)
       – Capturing the application’s structure with nested bubbles
       – Scheduling = dynamically mapping trees of threads onto a tree of cores
     • The BubbleSched platform
       – Designing portable NUMA-aware scheduling policies, with a focus on algorithmic issues
       – Debugging/tuning scheduling algorithms: FxT tracing toolkit + replay animation [with Univ. of New Hampshire, USA]
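
     The bubble idea can be pictured with a small data structure. This is a purely illustrative sketch, not the real BubbleSched API:

         #include <vector>

         struct Bubble;                                // a group of related entities
         struct Thread { int id; };
         struct Entity { Thread *t; Bubble *b; };      // exactly one of the two is set

         struct Bubble {
             std::vector<Entity> members;              // threads and nested sub-bubbles
         };

         struct HwLevel {                              // machine -> NUMA node -> die -> core
             std::vector<HwLevel> children;
         };

         // A bubble scheduler walks both trees together, "bursting" each bubble
         // at a matching level of the hardware tree, so that sibling threads land
         // on neighboring cores and keep their cache/memory affinity.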

  7. Our background: Thread Scheduling over Multicore Machines
     • Designing multicore-friendly programs with OpenMP
       – Parallel sections generate bubbles
       – Nested parallelism is welcome! (lazy creation of threads)

           void Node::compute() {
               // approximate the surface
               computeApprox();
               if (_error > _max_error) {
                   // precision not sufficient, so divide and conquer
                   splitCell();
                   #pragma omp parallel for
                   for (int i = 0; i < 8; i++)
                       _children[i]->compute();
               }
           }

     • The ForestGOMP platform
       – Extension of GNU OpenMP: the GOMP interface of libgomp runs on top of BubbleSched/pthreads, binary compliant with existing applications
       – Excellent speedups with irregular applications: implicit 3D surface reconstruction [with iParla], tree depth > 15, more than 300,000 threads
     • BubbleSched is also targeted by OMPi [with Univ. of Ioannina, Greece]

     Dealing with heterogeneous accelerators
     • Specific APIs
       – CUDA, IBM SDK (ALF, MCF), Cg, FireStream, …
       – No consensus; specialized languages/compilers
       – OpenCL?
     • Communication libraries
       – MCAPI, MPI

  8. Dealing with heterogeneous accelerators
     • Language extensions
       – RapidMind, Sieve C++
       – HMPP: #pragma hmpp target=cuda
       – Cell Superscalar: #pragma css input(..) output(…)
     • Most approaches focus on offloading
       – As opposed to scheduling

     Programming Hybrid Architectures
     • Challenge = exploiting all computing units simultaneously (multicore side: OpenMP, TBB, Cilk, MPI; accelerator side: ALF, MCF, CUDA, Cg, FireStream)
     • Either use a hybrid programming model (see the sketch below)
       – E.g. OpenMP + HMPP + Intel TBB + CUBLAS + MKL + …
     • Or use a uniform programming model
       – That doesn’t exist yet…
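
     A sketch of the hybrid route (illustrative code, not from the talk): OpenMP keeps the CPUs busy while one thread drives the GPU through CUDA. Note that the CPU/GPU split is hard-wired by the programmer, which is precisely what a scheduling runtime would remove:

         #include <omp.h>
         #include <cuda_runtime.h>

         __global__ void scale(float *v, float a, int n);   // as in the earlier sketch

         void hybrid_scale(float *v, int n) {
             int gpu_part = n / 2;               // hand-tuned static split (assumption)
             #pragma omp parallel sections
             {
                 #pragma omp section
                 {   // this thread manages the accelerator
                     float *d;
                     cudaMalloc(&d, gpu_part * sizeof(float));
                     cudaMemcpy(d, v, gpu_part * sizeof(float), cudaMemcpyHostToDevice);
                     scale<<<(gpu_part + 255) / 256, 256>>>(d, 2.0f, gpu_part);
                     cudaMemcpy(v, d, gpu_part * sizeof(float), cudaMemcpyDeviceToHost);
                     cudaFree(d);
                 }
                 #pragma omp section
                 {   // the remaining elements are processed on the host
                     for (int i = gpu_part; i < n; i++) v[i] *= 2.0f;
                 }
             }
         }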

  9. In either case, a common runtime system is needed!

     Runtime Systems for Heterogeneous Multicore Architectures
     • Runtime systems
       – Perform dynamically what can’t be done statically
       – Hide hardware complexity and provide portability (of performance?)
       – Software stack: HPC applications → compiling environments / specific libraries → runtime system → operating system → hardware
     • Just a matter of providing yet another scheduling & memory-management API?

  10. Runtime Systems for Heterogeneous Multicore Architectures
      • Programmers (usually) know their application
        – Don’t guess what we already know! Expressive interfaces
        – Scheduling hints
      • Feedback is important (execution feedback flows back up the stack)
        – E.g. performance counters
        – Adaptive applications?
      • Other issues
        – Can we still find a unified execution model?
        – How do we determine the appropriate task granularity?

      Towards a unified execution model
      • We wanted our runtime to fulfill the following requirements:
        – Dynamically schedule tasks on all processing units
          (see a single pool of heterogeneous cores)
        – Avoid unnecessary data transfers between accelerators
          (e.g. computing A = A+B when copies of A and B live on different units: we need to keep track of data copies; an illustrative sketch follows)
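
      “Keeping track of data copies” can be pictured as an MSI-like state per memory node. A purely illustrative sketch, not StarPU’s actual internals:

         #include <vector>

         enum NodeState { INVALID, SHARED, OWNER };    // MSI-like coherency states

         struct DataHandle {
             std::vector<NodeState> state;             // one entry per memory node
             explicit DataHandle(int nnodes) : state(nnodes, INVALID) {
                 state[0] = OWNER;                     // data starts in main RAM
             }

             void acquire_read(int node) {             // before a task reads on `node`
                 if (state[node] == INVALID) {
                     // transfer_from(some_valid_node, node);  // actual copy omitted
                     state[node] = SHARED;             // a valid replica now exists
                 }                                     // else: no transfer needed!
             }

             void acquire_write(int node) {            // before a task writes on `node`
                 for (auto &s : state) s = INVALID;    // invalidate stale replicas
                 state[node] = OWNER;
             }
         };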

  11. The StarPU Runtime System
      Cédric Augonnet, Samuel Thibault
      • Software stack: compilers and libraries → high-level data management → scheduling engine → common driver interface (CUDA/NVIDIA, Gordon/Cell) → OS / vendor-specific interfaces
      • Mastering CPUs, GPUs, SPUs, … (hence the name: *PU)

      High-Level Data Management
      • All we need is a software DSM system!
        – Consistency, replication, migration
        – Concurrency, accelerator-to-accelerator transfers
        – Memory reclaiming mechanism (problem size > accelerator memory size)
      • Data is partitioned with filters (a sketch follows)
        – Various interfaces: BLAS, vector, CSR, CSC
        – Recursively applicable: structured data = a tree
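
      A sketch of filter-based partitioning using StarPU’s present-day public API (details may differ from the 2009-era interface shown in the talk):

         #include <starpu.h>

         void partition_vector(float *v, unsigned n) {
             starpu_data_handle_t handle;
             starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                         (uintptr_t)v, n, sizeof(float));

             struct starpu_data_filter f = {
                 .filter_func = starpu_vector_filter_block,  /* block-wise split */
                 .nchildren   = 4,
             };
             starpu_data_partition(handle, &f);
             /* starpu_data_get_sub_data(handle, 1, i) returns the i-th chunk;
                filters can be applied recursively: structured data = a tree. */
         }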

  12. Scheduling Engine
      • Tasks are manipulated through “codelet wrappers” (a sketch follows)
        – May provide multiple implementations (CPU / GPU / SPU code) and scheduling hints
        – Optional cost model per implementation, priority, …
        – List their data dependencies (input/output data, using the filter interface)
        – Callback on completion; may be automatically generated
      • Schedulers are plug-ins
        – Assign tasks to run queues
        – Dependencies and data prefetching are hidden

      Evaluation: blocked matrix multiplication (quad-core Intel Xeon + NVIDIA Quadro FX4600)
      • Exploiting the heterogeneous platform: 4 CPUs + 1 GPU
        – The CPUs must not be neglected!
      • Issues with 4 CPUs + 1 GPU
        – A busy CPU delays GPU management
        – Cache-sensitive CPU code
        – Trade-off: dedicate one core to driving the GPU
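
      A sketch of a codelet with two implementations, again in today’s StarPU API (the 2009 interface differed in detail):

         #include <starpu.h>

         void scale_cpu(void *buffers[], void *cl_arg) {        /* CPU implementation */
             float *v    = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
             unsigned nx = STARPU_VECTOR_GET_NX(buffers[0]);
             for (unsigned i = 0; i < nx; i++) v[i] *= 2.0f;
         }

         extern void scale_cuda(void *buffers[], void *cl_arg); /* assumed defined elsewhere */

         struct starpu_codelet cl = {
             .cpu_funcs  = { scale_cpu },    /* the scheduler picks the unit */
             .cuda_funcs = { scale_cuda },   /* at run time, task by task    */
             .nbuffers   = 1,
             .modes      = { STARPU_RW },    /* access modes drive dependencies */
         };

         /* Submission: starpu_task_insert(&cl, STARPU_RW, handle, 0);
            dependency tracking and data prefetching happen behind the scenes. */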

  13. Evaluation: dense LU decomposition
      • Some tasks are critical for the algorithm
        – Lack of parallelism: we cannot feed all *PUs with enough work
        – …and it is even worse with Cholesky!

  14. Evaluation: Cholesky decomposition
      • Priorities bring a gain of roughly 10%

      About the importance of performance models
      • Modeling workers’ performance
        – “1 GPU = 10x faster than 1 CPU”
        – Reduces load imbalance
        – Fuzzy approximation
      • Modeling task execution times: precise performance models
        – “Mathematical” models
        – User-provided models
        – Automatic “learning” for unknown codelets (a sketch follows)
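
      In today’s StarPU API, the “automatic learning” option looks roughly like this (a sketch; field names are from the current release, not from the talk):

         #include <starpu.h>

         static struct starpu_perfmodel scale_model = {
             .type   = STARPU_HISTORY_BASED,   /* learn from measured executions */
             .symbol = "scale",                /* key for the on-disk history    */
         };

         /* Attach it to the codelet: cl.model = &scale_model;
            after a few calibration runs, the scheduler can predict each task's
            duration on each worker and reduce load imbalance. */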
