High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems
Ching-Hsiang Chu, Jahanzeb Maqbool Hashmi, Kawthar Shafie Khorassani, Hari Subramoni, Dhabaleswar K. (DK) Panda
The Ohio State University
{chu.368, hashmi.29, shafiekhorassani.1}@osu.edu, {subramon, panda}@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Outline
• Introduction
• Problem Statement
• Proposed Designs
• Performance Evaluation
• Concluding Remarks
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies (>1 TFlop DP on a chip)
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) with <1 usec latency and 200 Gbps bandwidth
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Multiple accelerators/coprocessors (NVIDIA GPGPUs and Intel Xeon Phi) with high compute density and high performance/watt, connected by PCIe/NVLink interconnects
• Available on HPC clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
(Slide images: multi-core processors, high-performance interconnects, accelerators/coprocessors, SSD/NVMe-SSD/NVRAM; example systems: K Computer, Summit, Sierra, Sunway TaihuLight)
Non-contiguous Data Transfer for HPC Applications
• Wide use of MPI derived datatypes for non-contiguous data transfer
  – Requires low-latency and high-overlap datatype processing
• Example applications: Quantum Chromodynamics (QUDA) and weather simulation (COSMO model); an illustrative derived-datatype sketch follows below
M. Martinasso, G. Kwasniewski, S. R. Alam, T. C. Schulthess, and T. Hoefler, "A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers," SC 2016.
M. Clark, "GPU Computing with QUDA," Developer Technology Group, https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Clark_M_LQCD.pdf
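To make the communication pattern concrete, the following is a minimal, hypothetical sketch (not taken from the paper) of how an application describes a strided, GPU-resident column with an MPI derived datatype and passes the device pointer to a CUDA-aware MPI library; the names d_field, NX, and NY are placeholders.

```cuda
#include <mpi.h>

#define NX 512   /* row length of the (hypothetical) 2-D field        */
#define NY 512   /* number of rows, i.e., elements in one halo column */

/* Send one column (stride NX) of a 2-D double field that lives in GPU
 * memory. A CUDA-aware MPI accepts the device pointer directly; the
 * runtime must then either pack on the GPU or, as proposed in this work,
 * move the elements zero-copy over PCIe/NVLink. */
int send_halo_column(double *d_field, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(NY, 1, NX, MPI_DOUBLE, &column);  /* NY blocks of 1 double, stride NX */
    MPI_Type_commit(&column);

    int rc = MPI_Send(d_field, 1, column, dest, /*tag=*/0, comm);

    MPI_Type_free(&column);
    return rc;
}
```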
State-of-the-art MPI Derived Datatype Processing
• GPU kernel-based packing/unpacking [1-3]
  – High-throughput memory access
  – Leverage GPUDirect RDMA capability (a generic packing-kernel sketch follows below)
[1] R. Shi, X. Lu, S. Potluri, K. Hamidouche, J. Zhang, and D. K. Panda, "HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters," ICPP 2014.
[2] C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IPDPS 2016.
[3] W. Wu, G. Bosilca, R. vandeVaart, S. Jeaugey, and J. Dongarra, "GPU-Aware Non-contiguous Data Movement in Open MPI," HPDC 2016.
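The kernel below is a generic sketch of this kernel-based packing approach, not the actual MVAPICH2-GDR or Open MPI implementation: a GPU kernel gathers the strided elements of a vector layout into a contiguous staging buffer before the transfer.

```cuda
#include <cuda_runtime.h>

/* Each thread gathers one element of a vector (count x blocklen, stride)
 * layout from the non-contiguous source into a contiguous staging buffer
 * that can then be pushed over the network (e.g., with GPUDirect RDMA). */
__global__ void pack_vector(const double *__restrict__ src,
                            double *__restrict__ packed,
                            int count, int blocklen, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count * blocklen) {
        int blk = i / blocklen;        /* which block of the vector */
        int off = i % blocklen;        /* offset inside that block  */
        packed[i] = src[blk * stride + off];
    }
}

/* Typical launch: pack_vector<<<(n + 255) / 256, 256, 0, stream>>>(...),
 * followed by a stream synchronization before the staging buffer is
 * handed to the communication layer. */
```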
Expensive Packing/Unpacking Operations in GPU-Aware MPI
• Significant overhead when moving non-contiguous GPU-resident data
  – Wasting cycles
  – Extra data copies
  – High latency!
(Figure: "Overhead of MPI Datatype Processing" — latency (us, log scale) of MVAPICH2-GDR and OpenMPI for contiguous vs. DDT transfers on the application kernels NAS [32,16,16] (3.28 KB), NAS [512,512,256] (1012 KB), specfem3D_cm [1957x245] (25.8 KB), and specfem3D_cm [11797x3009] (173.51 KB); DDT transfers are up to 337X worse than contiguous; data transfer between two NVIDIA K80 GPUs over a PCIe link)
Analysis of Packing/Unpacking Operations in GPU-Aware MPI
• Primary overheads
  – Packing/unpacking kernels
  – CPU-GPU synchronization
  – GPU driver overhead
• Can we reduce or eliminate the expensive packing/unpacking operations?
(Figure: time breakdown — memory allocation, CUDA driver overhead, pack/unpack kernels, CUDA synchronization, and others — for MVAPICH2-GDR 2.3.1 and OpenMPI 4.0.1 + UCX 1.5.1 on the NAS (3.28 KB, 1012 KB) and specfem3D_cm (25.8 KB, 173.51 KB) kernels; data transfer between two NVIDIA K80 GPUs over a PCIe link)
Outline
• Introduction
• Problem Statement
• Proposed Designs
• Performance Evaluation
• Concluding Remarks
Problem Statement
• How can we exploit the load-store based remote memory access model of high-performance interconnects like PCIe and NVLink to achieve "packing-free" non-contiguous data transfers for GPU-resident data?
• Can we propose new designs that mitigate the overheads of existing approaches and offer optimal performance for GPU-based derived datatype transfers when packing/unpacking is unavoidable?
• How can we design an adaptive MPI communication runtime that dynamically employs the optimal DDT processing mechanism for diverse application scenarios?
Outline
• Introduction
• Problem Statement
• Proposed Designs
  – Zero-copy non-contiguous data movement over NVLink/PCIe
  – One-shot packing/unpacking
  – Adaptive MPI derived datatype processing
• Performance Evaluation
• Concluding Remarks
Overview of Zero-copy Datatype Transfer
• A direct link such as PCIe/NVLink is available between the two GPUs
• Efficient datatype layout exchange and caching
• Load-store data movement
Zero-copy Datatype Transfer: Enhanced Layout Cache
• Convert the IOV list to a displacement list (see the sketch below)
  – Improved reusability
  – One-time effort
• Cache the datatype layout in shared system memory
  – Accessible within the node without extra copies
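A simplified sketch of the conversion step, using illustrative structure names rather than the actual MVAPICH2 internals: an absolute IOV list is turned into a base-relative displacement list, which is what makes the cached layout reusable for any buffer address and any process on the node.

```cuda
#include <stddef.h>
#include <stdint.h>

typedef struct { void  *base; size_t len; } iov_t;         /* absolute addresses   */
typedef struct { size_t disp; size_t len; } disp_entry_t;  /* base-relative layout */

/* One-time conversion: the displacement table depends only on the datatype
 * layout, not on a particular buffer, so it can be cached (e.g., in a
 * shared-memory region visible to all processes on the node) and reused
 * for every message that uses the same datatype. */
static void iov_to_disp(const iov_t *iov, int n, const void *buf_base,
                        disp_entry_t *out)
{
    for (int i = 0; i < n; i++) {
        out[i].disp = (size_t)((uintptr_t)iov[i].base - (uintptr_t)buf_base);
        out[i].len  = iov[i].len;
    }
}
```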
Zero-copy Datatype Transfer: Copy vs. Load-Store
• Exploiting the load-store capability of modern interconnects
  – Eliminates extra data copies and expensive packing/unpacking processing (a peer-access sketch follows below)
(Figure: existing packing scheme — copy from source GPU memory over PCIe/NVLink to destination GPU memory — vs. proposed packing-free scheme — direct load-store over PCIe/NVLink)
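As a rough illustration of the load-store path (assuming CUDA peer-to-peer access is available between the two GPUs; function and variable names are hypothetical, not the library's code), a kernel on the source GPU can scatter elements straight into the destination GPU's buffer:

```cuda
#include <cuda_runtime.h>

/* After peer access is enabled, a kernel running on the source GPU can
 * store each element directly at its final, non-contiguous location in
 * the destination GPU's memory; the stores travel over NVLink/PCIe, so no
 * staging buffers or pack/unpack kernels are involved. The layout is
 * simplified to a vector (count x blocklen, stride) for illustration. */
__global__ void scatter_to_peer(const double *__restrict__ src,  /* local GPU */
                                double *__restrict__ dst,        /* peer GPU  */
                                int count, int blocklen, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count * blocklen) {
        int blk = i / blocklen, off = i % blocklen;
        dst[blk * stride + off] = src[i];     /* direct store into peer memory */
    }
}

void zero_copy_put(const double *d_src, double *d_dst_peer, int peer_dev,
                   int count, int blocklen, int stride)
{
    /* Enable load-store access to the peer GPU once per device pair; in an
     * MPI runtime, d_dst_peer would typically be obtained via CUDA IPC
     * (cudaIpcOpenMemHandle) from the receiver's buffer. */
    cudaDeviceEnablePeerAccess(peer_dev, 0);
    int n = count * blocklen;
    scatter_to_peer<<<(n + 255) / 256, 256>>>(d_src, d_dst_peer,
                                              count, blocklen, stride);
    cudaDeviceSynchronize();
}
```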
One-shot Packing/Unpacking Mechanism
• Packing/unpacking is inevitable if there is no direct link between the GPUs
• Direct packing/unpacking between CPU and GPU memory to avoid extra copies (a kernel-based sketch follows below)
  1. GDRCopy-based – CPU-driven, low-latency, copy-based scheme
  2. Kernel-based – GPU-driven, high-throughput, load-store-based scheme
(Figure: source GPU memory → system memory → destination GPU memory over PCIe/NVLink)
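A minimal sketch of the kernel-based variant under stated assumptions (the GDRCopy-based, CPU-driven variant is not shown); buffer names are illustrative and this is not the library's actual code:

```cuda
#include <cuda_runtime.h>

/* One-shot, kernel-based unpack: the packed message sits in page-locked,
 * device-mapped host memory (e.g., a staging/eager buffer); a single GPU
 * kernel reads it over PCIe and scatters the elements to their final
 * non-contiguous locations in GPU memory, instead of first copying a
 * contiguous blob into the GPU and unpacking it in a second step. */
__global__ void one_shot_unpack(const double *__restrict__ host_packed, /* mapped host memory */
                                double *__restrict__ d_dst,
                                int count, int blocklen, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count * blocklen) {
        int blk = i / blocklen, off = i % blocklen;
        d_dst[blk * stride + off] = host_packed[i];  /* load from host, store to GPU */
    }
}

/* The host buffer must be device-accessible, e.g., allocated with
 * cudaHostAlloc(..., cudaHostAllocMapped) or registered with
 * cudaHostRegister(..., cudaHostRegisterMapped), with its device alias
 * obtained through cudaHostGetDevicePointer(). The GDRCopy-based variant
 * works in the opposite direction for small, latency-sensitive messages:
 * the CPU reads/writes GPU memory mapped through the GPU's PCIe BAR. */
```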