A Pluggable Framework for Composable HPC Scheduling Libraries
Max Grossman (1), Vivek Kumar (2), Nick Vrvilo (1), Zoran Budimlic (1), Vivek Sarkar (1)
(1) Habanero Extreme Scale Software Research Group, Rice University
(2) IIIT-Delhi
AsHES 2017 - May 29, 2017
Top10
The past decade has seen more heterogeneous supercomputers. (https://www.top500.org)
Top10
Since 2013, the majority of Top10 peak and achieved GFlop/s has come from heterogeneous machines. (https://www.top500.org)
Top10
We as a community are very bad at programming heterogeneous supercomputers (even for LINPACK). (https://www.top500.org)
How Do We Define Heterogeneity?
For the past decade, "heterogeneous computing" == "GPUs"
• Dealing with GPUs has taught us a lot about software heterogeneity
But heterogeneity is on the rise everywhere in HPC:
• Hardware: memory, networks, storage, cores
• Software: networking libraries, compute libraries, managed runtimes, domain libraries, storage APIs
[Figure: depiction of the abstract platform (processors plus heterogeneous software/hardware resources) motivating this work.]
Heterogeneous Programming in Practice
[Figure: heterogeneous programming in practice; visible labels include pthreads and QThreads.]
Heterogeneous Programming in Research
Legion: Hide all heterogeneity from the user, rely on the runtime to map the problem to hardware efficiently, with implicit dependencies discovered by the runtime.
PaRSEC, OCR: Explicit dataflow model.
HCMPI, HCUPC++, HC-CUDA, HPX: Task-based runtimes that create dedicated proxy threads for managing some external resource (e.g. NIC, GPU).
HiPER: Generalize a task-based, locality-aware, work-stealing runtime/model to support non-CPU resources.
• Retain the appearance of legacy APIs
• Composability, extensibility, and compatibility are first-class citizens from the start.
Outline
HiPER Execution & Platform Model
HiPER Use Cases
• MPI Module
• Composing MPI and CUDA
Performance Evaluation
Conclusions & Future Work
HiPER's Predecessors
[Figure: a Hierarchical Place Tree (sysmem, L2, and L1 places), progressively extended with dedicated proxy-thread places for a GPU, OpenSHMEM, a second GPU, and MPI.]
• The simple model makes it attractive for many past research efforts, but...
• Not scalable software engineering
• Wasteful use of host resources
• Not easily extendable to new software/hardware capabilities
HiPER Platform & Execution Model
[Figure: pluggable system modules (OSHMEM, MPI, CUDA, etc.) sitting above the HiPER platform model (CPU places, a NIC place, a GPU place) and the HiPER work-stealing thread pool.]
• Modules expose user-visible APIs for work creation.
• The platform model gives modules somewhere to place work, and gives the thread pool somewhere to find work.
• Modules fill in the platform model and tell threads the subset of the platform they are responsible for scheduling work on.
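A rough illustration of how this division of labor might look in code. This is purely hypothetical: declare_place, bind_worker_to_place, register_module, and PLACE_NIC are illustrative names, not the actual HiPER module interface.

// Hypothetical sketch of a module's initialization (names are illustrative,
// not the real HiPER API): the module adds the place it manages to the
// platform model and tells the runtime which workers service that place.
struct nic_module {
    static void init() {
        // Add a NIC place under the system-memory root place (assumed helper).
        place_t *nic = hiper::declare_place(hiper::root_place(), PLACE_NIC);
        // Ask the runtime to dedicate one worker to this place, so that
        // NIC-bound tasks created via async_at(..., nic) have a scheduler.
        hiper::bind_worker_to_place(nic, /*num_workers=*/1);
    }
};

// Registration performed once at startup (hypothetical entry point):
//   hiper::register_module("nic", nic_module::init);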
Fundamental Task-Parallel API
The HiPER core exposes a fundamental C/C++ tasking API.

API                                Explanation
async([] { S1; });                 Create an asynchronous task
finish([] { S2; });                Suspend the calling task until nested tasks have completed
async_at([] { S3; }, place);       Create an asynchronous task at a place in the platform model
fut = async_future([] { S4; });    Get a future that is signaled when a task completes
async_await([] { S5; }, fut);      Create an asynchronous task whose execution is predicated on satisfaction of fut

Summary of core tasking APIs. The above list is not comprehensive.
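A minimal sketch of how these primitives compose. Illustrative only: compute_halo, compute_interior, produce, and consume are placeholder functions, not part of the API.

// Run two tasks in parallel and wait for both (finish joins nested tasks).
finish([] {
    async([] { compute_halo(); });      // asynchronous task
    async([] { compute_interior(); });  // runs concurrently with the task above
});

// Chain work through a future: consume() starts only after produce() completes.
hiper::future_t<void> *fut = async_future([] { produce(); });
async_await([] { consume(); }, fut);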
MPI Module
Extends the HiPER namespace with familiar MPI APIs
• Programmers can use the APIs they already know and love
• Built on 1) an MPI implementation, and 2) HiPER's core tasking APIs.
Asynchronous APIs return futures rather than MPI_Requests, enabling composability in the programming layer with all other future-based APIs:
    hiper::future_t<void> *MPI_Irecv/Isend(...);
Enables non-standard extensions, e.g.:
    hiper::future_t<void> *MPI_Isend_await(..., hiper::future_t<void> *await);   Start an asynchronous send once await is satisfied.
    hiper::future_t<void> *MPI_Allreduce_future(...);                            Asynchronous collectives.
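A short usage sketch, under the assumption (implied by "familiar MPI APIs" above) that the argument lists mirror the standard MPI calls with the MPI_Request argument dropped; send_buf, recv_buf, N, and neighbor are placeholders.

// Produce the outgoing buffer asynchronously, then send once it is ready.
hiper::future_t<void> *fill_fut = async_future([] { /* fill send_buf */ });
hiper::future_t<void> *send_fut =
    hiper::MPI_Isend_await(send_buf, N, MPI_DOUBLE, neighbor, 0 /*tag*/,
                           MPI_COMM_WORLD, fill_fut);

// Post a future-returning receive (assumed to mirror ::MPI_Irecv minus the request).
hiper::future_t<void> *recv_fut =
    hiper::MPI_Irecv(recv_buf, N, MPI_DOUBLE, neighbor, 0 /*tag*/, MPI_COMM_WORLD);

// Any future-based API composes: run a local task once the receive lands.
async_await([] { /* consume recv_buf */ }, recv_fut);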
Example API Implementation

hiper::future_t<void> *hiper::MPI_Isend_await(..., hiper::future_t<void> *await) {
    // Create a promise to be satisfied on the completion of this operation
    hiper::promise_t<void> *prom = new hiper::promise_t<void>();

    // Taskify the actual MPI_Isend at the NIC, pending the satisfaction of await
    hclib::async_nb_await_at([=] {
        // At the MPI place, do the actual Isend
        MPI_Request req;
        ::MPI_Isend(..., &req);

        // Create a data structure to track the status of the pending Isend
        pending_mpi_op *op = (pending_mpi_op *)malloc(sizeof(*op));
        ...  // (elided: op presumably records req and prom so the poller can satisfy prom)

        // test_mpi_completion is a periodic polling function run at the NIC place
        hiper::append_to_pending(op, &pending, test_mpi_completion, nic);
    }, await, nic);

    return prom->get_future();
}
Composing System, MPI, CUDA Modules

// Asynchronously process ghost regions on this rank in parallel on the CPU
ghost_fut = forasync_future([] (int z) { ... });

// Asynchronously exchange ghost regions with neighbors
reqs[0] = MPI_Isend_await(..., ghost_fut);
reqs[1] = MPI_Isend_await(..., ghost_fut);
reqs[2] = MPI_Irecv(...);
reqs[3] = MPI_Irecv(...);

// Asynchronously process the remainder of z values on this rank on the GPU
kernel_fut = forasync_cuda(..., [] (int z) { ... });

// Copy the received ghost regions to the CUDA device once the receives and kernel complete
copy_fut = async_copy_await(..., reqs[2], reqs[3], kernel_fut);
Task Micro-Benchmarking
[Figure: micro-benchmark performance normalized to HiPER on Edison; higher is better.]
https://github.com/habanero-rice/tasking-micro-benchmark-suite
Experimental Setup
Experiments shown here were run on Titan @ ORNL and Edison @ NERSC.

Application   Platform   Dataset                            Modules Used   Scaling
ISx           Titan      2^29 keys per node                 OpenSHMEM      Weak
HPGMG-FV      Edison     log2_box_dim=7, boxes_per_rank=8   UPC++          Weak
UTS           Titan      T1XXL                              OpenSHMEM      Strong
Graph500      Titan      2^29 nodes                         OpenSHMEM      Strong
LBM           Titan      -                                  MPI, CUDA      Weak
HiPER Evaluation – Regular Applications
HiPER is low-overhead, with no impact on performance for regular applications.
[Figures: total execution time (s) for ISx vs. total nodes on Titan (16 cores per node) and for the HPGMG solve step vs. total nodes on Edison (2 processes/sockets per node, 12 cores per process); series compared include Flat OpenSHMEM, OpenSHMEM+OpenMP, UPC++ + OpenMP, and HiPER.]
HiPER Evaluation – Regular Applications
LBM: ~2% performance improvement through reduced synchronization from futures-based programming.
HiPER Evaluation – UTS
HiPER integration improves computation-communication overlap, scalability, and load balance.
[Figure: total execution time (s) vs. total nodes on Titan (16 cores per node), OpenSHMEM+OpenMP vs. HiPER.]
HiPER Evaluation – Graph500
HiPER is used for concurrent (not parallel) programming in Graph500.
Rather than periodic polling, novel shmem_async_when APIs are used to trigger local computation on incoming RDMA.
This reduces code complexity and hands the scheduling problem to the runtime.
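A hypothetical sketch of the pattern. The exact shmem_async_when signature is an assumption modeled on OpenSHMEM wait-until semantics (symmetric variable, comparison operator, comparison value, task body); incoming_count and process_incoming_vertices are placeholders.

// Hypothetical: schedule a local task as soon as a remote PE's RDMA write
// changes the symmetric variable, instead of polling for it.
static long incoming_count = 0;  // symmetric variable written by remote PEs

hiper::shmem_async_when(&incoming_count, SHMEM_CMP_NE, 0L, [] {
    // Triggered by the runtime when the condition holds on incoming RDMA.
    process_incoming_vertices();
});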
Outline
HiPER Execution & Platform Model
HiPER Use Cases
• OpenSHMEM w/o Thread Safety
• OpenSHMEM w/ Contexts
Performance Evaluation
Conclusions & Future Work