MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
GPU Technology Conference (GTC) 2017
by Hari Subramoni, The Ohio State University, E-mail: subramon@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~subramon
and Dhabaleswar K. (DK) Panda, The Ohio State University, E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect RDMA (GDR)
• What's new with MVAPICH2-GDR
  – Efficient MPI-3 Non-Blocking Collective support
  – Maximal overlap in MPI Datatype Processing
  – Efficient Support for Managed Memory
  – RoCE and Optimized Collectives
  – Initial support for GPUDirect Async feature
• Efficient Deep Learning with MVAPICH2-GDR
• OpenACC-Aware support
• Conclusions
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,750 organizations in 84 countries
  – More than 416,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '16 ranking)
    • 1st: 10,649,600-core Sunway TaihuLight at National Supercomputing Center in Wuxi, China
    • 13th: 241,108-core Pleiades at NASA
    • 17th: 462,462-core Stampede at TACC
    • 40th: 74,520-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Sunway TaihuLight (1st in Jun '16, 10M cores, 100 PFlops)
MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads (0 to 450,000) over time (Sep-04 to Mar-17), annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2 2.1, MV2-Virt 2.1rc2, MV2-GDR 2.2rc1, MV2-X 2.2, MV2 2.3a]
MVAPICH2 Software Family (requirements and the library to use)
• MPI with IB, iWARP and RoCE: MVAPICH2
• Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE: MVAPICH2-X
• MPI with IB & GPU: MVAPICH2-GDR
• MPI with IB & MIC: MVAPICH2-MIC
• HPC Cloud with MPI & IB: MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA
Architecture of MVAPICH2 Software Family
• High Performance Parallel Programming Models
  – Message Passing Interface (MPI)
  – PGAS (UPC, OpenSHMEM, CAF, UPC++)
  – Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
  – Point-to-point Primitives, Collectives Algorithms, Job Startup, Energy-Awareness, Remote Memory Access, Fault Tolerance, Active Messages, I/O and File Systems, Introspection & Analysis, Virtualization
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower, Xeon Phi (MIC, KNL), NVIDIA GPGPU)
  – Transport Protocols: RC, XRC, UD, DC, UMR, ODP
  – Transport Mechanisms: Shared Memory, CMA, IVSHMEM
  – Modern Features: SR-IOV, Multi-Rail, MCDRAM*, NVLink*, CAPI* (* upcoming)
Optimizing MPI Data Movement on GPU Clusters
• GPUs are connected as PCIe devices – flexibility but complexity
[Diagram: two nodes (Node 0, Node 1), each with CPUs connected over QPI, GPUs and IB adapters attached over PCIe, and host memory buffers]
Possible communication paths:
1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
... and more
• For each path, different schemes: Shared_mem, IPC, GPUDirect RDMA, pipeline, ...
• Critical for runtimes to optimize data movement while hiding the complexity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(data movement from/to GPU memory is handled inside MVAPICH2)
High Performance and High Productivity
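A minimal, self-contained sketch of the pattern shown above, assuming a CUDA-aware MPI build such as MVAPICH2-GDR. The buffer size, ranks, and datatype are illustrative choices, not taken from the slide.

/* Hedged sketch: passing GPU device pointers directly to MPI_Send/MPI_Recv. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;          /* 1M floats, an arbitrary example size */
    float *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&devbuf, n * sizeof(float));

    if (rank == 0)
        /* Device pointer passed directly; the CUDA-aware library moves the
         * data (GDR, pipelining, etc.) without a manual cudaMemcpy to host. */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}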
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers (see the sketch below)
• MPI datatype support for point-to-point and collective communication from GPU device buffers
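A hedged sketch of the "collectives on GPU device buffers" point above: device pointers are passed directly to MPI_Allreduce, and the library selects a GPU-aware algorithm internally. The element count and reduction operation are illustrative assumptions.

/* Hedged sketch: MPI_Allreduce operating directly on GPU device buffers. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int n = 1024;                    /* example element count */
    float *d_in, *d_out;

    MPI_Init(&argc, &argv);
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    /* ... fill d_in on the GPU (kernel or cudaMemcpy), omitted ... */

    /* Device pointers are legal arguments with a CUDA-aware MPI library. */
    MPI_Allreduce(d_in, d_out, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_in);
    cudaFree(d_out);
    MPI_Finalize();
    return 0;
}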
Installing MVAPICH2-GDR
• MVAPICH2-2.2 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• Please select the best matching package for your system
  – Most of the common combinations are provided. If you do not find your match, please email OSU with the following details:
    • OS version
    • OFED version
    • CUDA version
    • Compiler (GCC, Intel and PGI)
• Install instructions
  – With root permissions:
    • On the default path: rpm -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
    • On a specific path: rpm --prefix /custom/install/prefix -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
  – Without root permissions:
    • rpm2cpio mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm | cpio -id
• For more details on the installation process, refer to: http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2#_installing_mvapich2_gdr_library
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
[Charts: GPU-GPU internode latency, bandwidth, and bi-bandwidth vs. message size, comparing MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 w/o GDR]
• GPU-GPU internode latency: 2.18 us small-message latency with MV2-GDR 2.2; annotated improvements of 10x over MV2 w/o GDR and 3X over MV2-GDR 2.0b
• GPU-GPU internode bandwidth: annotated improvements of 11X over MV2 w/o GDR and 2X over MV2-GDR 2.0b
• GPU-GPU internode bi-bandwidth: annotated improvements of 11x over MV2 w/o GDR and 2X over MV2-GDR 2.0b
Platform: MVAPICH2-GDR 2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-X4 EDR HCA, CUDA 8.0, Mellanox OFED 3.0 with GPU-Direct RDMA
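Latency numbers of this kind are typically produced by a GPU-to-GPU ping-pong test (e.g., the OSU micro-benchmarks). The sketch below is a simplified, assumption-laden version of such a loop: iteration count, message size, and the absence of warm-up iterations are arbitrary choices that the real benchmarks handle more carefully.

/* Hedged sketch of a GPU-to-GPU ping-pong latency loop between ranks 0 and 1. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 8;      /* 8-byte messages, example only */
    int rank;
    char *d_buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_buf, size);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(d_buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(d_buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(d_buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(d_buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)   /* round-trip time divided by two gives one-way latency */
        printf("avg one-way latency: %f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}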
Application-Level Evaluation (HOOMD-blue)
[Charts: average time steps per second (TPS) for 64K and 256K particles at 4, 8, 16 and 32 processes, comparing MV2 and MV2+GDR; roughly 2X improvement with GDR in both cases]
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue Version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Full and Efficient MPI-3 RMA Support
[Chart: small-message RMA latency vs. message size; 2.88 us small-message latency, with an annotated 6X improvement]
Platform: MVAPICH2-GDR 2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct RDMA
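To illustrate what MPI-3 RMA with GPU buffers looks like at the API level, here is a hedged sketch that creates an MPI window over device memory and performs a one-sided MPI_Put between two ranks. The buffer size and the fence-based synchronization pattern are assumptions for illustration, not taken from the benchmark above.

/* Hedged sketch: MPI-3 one-sided (RMA) communication over GPU device memory. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 4096;              /* example element count */
    float *d_buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    /* Expose the device buffer through an MPI window. */
    MPI_Win_create(d_buf, n * sizeof(float), sizeof(float),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        /* One-sided put from rank 0's GPU buffer into rank 1's GPU buffer. */
        MPI_Put(d_buf, n, MPI_FLOAT, 1, 0, n, MPI_FLOAT, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}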
Performance of MVAPICH2-GDR with GPU-Direct RDMA and Multi-Rail Support
[Charts: GPU-GPU internode MPI uni-directional and bi-directional bandwidth vs. message size, comparing MV2-GDR 2.1 and MV2-GDR 2.1 RC2; peak uni-directional bandwidth of 8,759 MB/s (annotated 20% improvement) and peak bi-directional bandwidth of 15,111 MB/s (annotated 40% improvement)]
Platform: MVAPICH2-GDR 2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct RDMA