dCUDA: Hardware Supported Overlap of Computation and Communication
Tobias Gysi, Jeremia Bär, and Torsten Hoefler
SPCL, ETH Zürich (spcl.inf.ethz.ch, @spcl_eth)
GPU computing gained a lot of popularity in various application domains
• weather & climate
• machine learning
• molecular dynamics
GPU cluster programming using MPI and CUDA

[Figure: two nodes, each with a device and device memory attached via PCI-Express to host memory; the hosts are connected by the interconnect]

// run compute kernel
__global__ void mykernel( … ) { }

// launch compute kernel
mykernel<<<64,128>>>( … );

// on-node data movement
cudaMemcpy(&size, psize, sizeof(int),
           cudaMemcpyDeviceToHost);

// inter-node data movement
MPI_Send(pdata, size, MPI_FLOAT, … );
MPI_Recv(pdata, size, MPI_FLOAT, … );
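To make the pattern concrete, here is a minimal, self-contained sketch of this MPI-CUDA style (illustrative only: the kernel body, buffer sizes, and the pairwise exchange with rank ^ 1 are assumptions, not from the slide):

#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

__global__ void mykernel(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;               // some on-device computation
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 64 * 128;                   // one element per thread
  std::vector<float> host(n, 1.0f);
  float *dev; cudaMalloc(&dev, n * sizeof(float));
  cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  mykernel<<<64, 128>>>(dev, n);            // launch compute kernel
  cudaDeviceSynchronize();                  // device sync before communication

  // on-node data movement: device -> host
  cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);

  // inter-node data movement: exchange with a neighbor rank
  // (assumes an even number of ranks, e.g. mpirun -np 2)
  int peer = rank ^ 1;
  MPI_Sendrecv_replace(host.data(), n, MPI_FLOAT, peer, 0, peer, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaFree(dev);
  MPI_Finalize();
  return 0;
}

Note how the device must be synchronized and drained before every MPI call — exactly the sequential kernel/copy/send/recv cycle the next slide criticizes.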
Disadvantages of the MPI-CUDA approach

complexity
• two programming models
• duplicated functionality

performance
• encourages sequential execution
• low utilization of the costly hardware

[Figure: host/device timeline — mykernel<<< >>>( … ) runs on the device; cudaMemcpy( … ) and a device sync follow; then mpi_send( … ), mpi_recv( … ), and a cluster sync on the host before the next mykernel launch]
Achieve high resource utilization using oversubscription & hardware threads

GPU cores use "parallel slack" to hide instruction pipeline latencies.

[Figure: three hardware threads each issue ld %r0,%r1 / mul %r0,%r0,3 / st %r0,%r1; while one thread stalls in the instruction pipeline, another ready thread issues, keeping the pipeline full]
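A minimal CUDA sketch of this oversubscription idea (kernel and sizes are illustrative): launching far more threads than the GPU has cores gives the scheduler enough ready warps to cover the ld/mul/st latencies shown above.

#include <cuda_runtime.h>

// Each thread performs the ld -> mul -> st sequence from the figure.
// While one warp stalls on its load, the SM issues another ready warp,
// so the pipeline stays busy ("parallel slack").
__global__ void scale(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 3.0f;
}

int main() {
  const int n = 1 << 24;                    // ~16M elements: many more
  float *in, *out;                          // threads than physical cores
  cudaMalloc(&in, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));

  int threads = 128;
  int blocks = (n + threads - 1) / threads; // oversubscribe every SM
  scale<<<blocks, threads>>>(in, out, n);
  cudaDeviceSynchronize();

  cudaFree(in);
  cudaFree(out);
  return 0;
}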
Use oversubscription & hardware threads to hide remote memory latencies

Introduce put & get operations to access distributed memory; while one thread waits for a remote access to complete, other ready threads keep the core busy.

[Figure: three hardware threads issue get … / mul %r0,%r0,3 / put …; remote accesses stall far longer than pipeline latencies, leaving gaps where no thread is ready — more parallel slack is needed]
How much "parallel slack" is necessary to fully utilize the interconnect?

Little's law: concurrency = latency × throughput

              device memory   interconnect
latency       1 µs            19 µs
bandwidth     200 GB/s        6 GB/s
concurrency   200 kB          114 kB
#threads      ~12000          ~7000
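Plugging the table values into Little's law shows where the concurrency numbers come from (the ~16 B of data in flight per thread used to convert concurrency into a thread count is an assumption, not stated on the slide):

device memory: 1 µs × 200 GB/s = 200 kB in flight
interconnect: 19 µs × 6 GB/s = 114 kB in flight

With roughly 16 B outstanding per thread, 200 kB / 16 B ≈ 12000 threads for device memory and 114 kB / 16 B ≈ 7000 threads for the interconnect.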
dCUDA (distributed CUDA) extends CUDA with MPI-3 RMA and notifications

• iterative stencil kernel
• thread-specific computation
• map ranks to blocks
• device-side put/get operations
• notifications for synchronization
• shared and distributed memory

for (int i = 0; i < steps; ++i) {
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] +
               in[idx + 1] + in[idx - 1] +
               in[idx + jstride] + in[idx - jstride];

  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1,
                     len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1,
                     0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout,
                           DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}
Advantages of the dCUDA approach

performance
• avoid device synchronization
• latency hiding at cluster scale

complexity
• unified programming model
• one communication mechanism

[Figure: timelines for two devices — with MPI-CUDA, each device runs stencil( … ), then syncs, puts, and syncs again before the next iteration; with dCUDA, ranks 1–4 on the two devices continuously issue stencil( … ), put( … ), and wait( … ) with no device synchronization]
Implementation of the dCUDA runtime system

host-side: an event handler and one block manager per block, communicating via MPI and GPUDirect
device-side: a device-library instance per block providing put( … ), get( … ), and wait( … )

[Figure: event handler and block managers on the host; device-library instances on the device issuing put/get/wait]
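The diagram suggests that host-side block managers forward device-issued operations over MPI. The following is a deliberately simplified sketch of that forwarding idea — not the actual dCUDA runtime: the request struct, flag protocol, and single-slot queue are assumptions for illustration (relies on unified virtual addressing, standard on 64-bit CUDA):

#include <mpi.h>
#include <cuda_runtime.h>

struct Request {                 // one slot, written by the device
  volatile int ready;            // 0 = empty, 1 = request pending
  int target;                    // destination rank
  int count;                     // number of floats to send
  float payload[256];
};

// Single device thread deposits a "put" request in zero-copy host memory.
__global__ void produce(Request *req, int target) {
  for (int i = 0; i < 256; ++i) req->payload[i] = (float)i;
  req->target = target;
  req->count = 256;
  __threadfence_system();        // make the payload visible to the host
  req->ready = 1;                // publish the request
}

int main(int argc, char **argv) {   // run with: mpirun -np 2 ...
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  Request *req;                  // pinned, device-mapped host memory
  cudaHostAlloc(&req, sizeof(Request), cudaHostAllocMapped);
  req->ready = 0;

  if (rank == 0) {
    produce<<<1, 1>>>(req, 1);   // device issues the "put"
    while (!req->ready) { }      // host-side block manager polls ...
    MPI_Send(req->payload, req->count, MPI_FLOAT,
             req->target, 0, MPI_COMM_WORLD);  // ... and forwards via MPI
    cudaDeviceSynchronize();
  } else if (rank == 1) {
    float buf[256];
    MPI_Recv(buf, 256, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  cudaFreeHost(req);
  MPI_Finalize();
  return 0;
}

The real runtime batches such requests per block and also moves data directly between GPU and NIC via GPUDirect; this sketch only shows why the device never has to synchronize to communicate.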
Overlap of a copy kernel with halo exchange communication
benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

[Plot: execution time [ms] (0–1000) vs. # of copy iterations per exchange (30, 60, 90); curves for no overlap, compute & exchange, halo exchange, and compute only — compute & exchange tracks compute only, indicating full overlap]
Weak scaling of MPI-CUDA and dCUDA for a stencil program
benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

[Plot: execution time [ms] (0–100) vs. # of nodes (2–8); curves for MPI-CUDA, dCUDA, and halo exchange]
Weak scaling of MPI-CUDA and dCUDA for a particle simulation
benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

[Plot: execution time [ms] (0–200) vs. # of nodes (2–8); curves for MPI-CUDA, dCUDA, and halo exchange]
Weak scaling of MPI-CUDA and dCUDA for sparse-matrix vector multiplication
benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

[Plot: execution time [ms] (0–200) vs. # of nodes (1, 4, 9); curves for dCUDA, MPI-CUDA, and communication]
Conclusions

unified programming model for GPU clusters
• device-side remote memory access operations with notifications
• transparent support of shared and distributed memory

extend the latency hiding technique of CUDA to the full cluster
• inter-node communication without device synchronization
• use oversubscription & hardware threads to hide remote memory latencies

automatic overlap of computation and communication
• synthetic benchmarks demonstrate perfect overlap
• example applications demonstrate the applicability to real codes

https://spcl.inf.ethz.ch/Research/Parallel_Programming/dCUDA/