Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Ching-Hsiang Chu (1), Khaled Hamidouche (1), Hari Subramoni (1), Akshay Venkatesh (1), Bracy Elton (2), and Dhabaleswar K. (DK) Panda (1)
(1) Department of Computer Science and Engineering, The Ohio State University
(2) Engility Corporation

DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. 88ABW-2016-5574
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion and Future Work
Drivers of Modern HPC Cluster Architectures
[Figure: three technology pillars — Multi-core Processors; High Performance Interconnects – InfiniBand (<1 µs latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 Tflop/s DP on a chip) — with photos of Tianhe-2, Stampede, Titan, and Tianhe-1A]
• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors are becoming common in high-end systems
➠ Pushing the envelope towards Exascale computing
IB and GPU in HPC Systems
• Growth of IB and GPU clusters in the last 3 years
  – IB is the major commodity network adapter used
  – NVIDIA GPUs accelerate 18% of the top 50 of the "Top 500" systems as of June 2016
[Chart: system share in the Top 500 (%) from June 2013 to June 2016 for InfiniBand clusters (roughly 41–52%) and GPU clusters (roughly 8–14%). Data from the Top500 list (http://www.top500.org)]
Motivation
• Streaming applications on HPC systems
  1. Communication (MPI)
     • Broadcast-type operations
  2. Computation (CUDA)
     • Multiple GPU nodes as workers
  – Examples
     • Deep learning frameworks
     • Proton computed tomography (pCT)
[Figure: a real-time data source streams to a sender on HPC resources for real-time analytics; data streaming-like broadcast operations deliver the data to worker nodes, each with a CPU and multiple GPUs]
Motivation
• Streaming applications on HPC systems
  1. Communication — heterogeneous broadcast-type operations
     • Data usually come from a live source and are stored in host memory
     • Data need to be sent to remote GPU memories for computing
  ⇒ Requires data movement from host memory to remote GPU memories, i.e., a host-device (H-D) heterogeneous broadcast ⇒ performance bottleneck
[Figure: real-time streaming from the sender to the workers via data streaming-like broadcast operations]
Motivation
• Requirements for streaming applications on HPC systems
  – Low latency, high throughput, and scalability
  – Free up Peripheral Component Interconnect Express (PCIe) bandwidth for application needs
[Figure: data streaming-like broadcast operations feeding worker nodes, each with a CPU and multiple GPUs]
Motivation – Technologies we have
• NVIDIA GPUDirect [1]
  – Uses remote direct memory access (RDMA) transfers between GPUs and other PCIe devices ⇒ GDR
  – Peer-to-peer transfers between GPUs
  – and more…
• InfiniBand (IB) hardware multicast (IB MCAST) [2]
  – Enables efficient designs of homogeneous broadcast operations
     • Host-to-Host [3]
     • GPU-to-GPU [4]
[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617–632, Jun. 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, Apr. 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec. 2014.
Problem Statement
• Can we design a high-performance heterogeneous broadcast for streaming applications?
  – Supports Host-to-Device broadcast operations
• Can we also design an efficient broadcast for multi-GPU systems?
• Can we combine GPUDirect and IB technologies to
  – Avoid extra data movements to achieve better performance
  – Increase available Host-Device (H-D) PCIe bandwidth for application use
Outline
• Introduction
• Proposed Designs
  – Heterogeneous Broadcast with GPUDirect RDMA (GDR) and InfiniBand (IB) Hardware Multicast
  – Intra-node Topology-Aware Broadcast for Multi-GPU Systems
• Performance Evaluation
• Conclusion and Future Work
Proposed Heterogeneous Broadcast
• Key requirement of IB MCAST
  – The control header needs to be stored in host memory
• SL-based approach: combine CUDA GDR and IB MCAST features
  – Takes advantage of the IB Scatter-Gather List (SGL) feature: multicast two separate addresses (control on the host + data on the GPU) in a single IB message (a minimal sketch follows below)
  – IB reads/writes directly from/to the GPU using the GDR feature ⇒ low-latency, zero-copy-based schemes
  – Avoids the extra copy between host and GPU ⇒ frees up PCIe bandwidth for application needs
  – Employing the IB MCAST feature increases scalability
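To make the SL-based step concrete, the listing below is a minimal sketch (not the paper's implementation) of how a sender could post one IB unreliable-datagram multicast send whose two scatter-gather entries point at a control header in host memory and at the payload in GPU memory registered through GPUDirect RDMA. The context structure and buffer names are ours, the UD QP is assumed to be already attached to the multicast group, and large payloads would additionally have to be chunked to the UD MTU.

/* Minimal sketch (not the authors' code): one IB UD multicast send whose
 * scatter-gather list points at a control header in host memory and at the
 * payload in GPU memory (registered via GPUDirect RDMA / the peer-memory
 * module). Error handling is omitted. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdint.h>

struct bcast_ctx {
    struct ibv_pd *pd;
    struct ibv_qp *ud_qp;      /* UD QP attached to the IB MCAST group */
    struct ibv_ah *mcast_ah;   /* address handle for the multicast LID/GID */
    uint32_t       mcast_qkey;
};

int mcast_host_ctrl_gpu_data(struct bcast_ctx *ctx,
                             void *ctrl_host, size_t ctrl_len,
                             void *data_gpu,  size_t data_len)
{
    /* Register both buffers; registering the cudaMalloc'ed pointer relies on
     * the GPUDirect RDMA peer-memory kernel module being loaded. */
    struct ibv_mr *ctrl_mr = ibv_reg_mr(ctx->pd, ctrl_host, ctrl_len,
                                        IBV_ACCESS_LOCAL_WRITE);
    struct ibv_mr *data_mr = ibv_reg_mr(ctx->pd, data_gpu, data_len,
                                        IBV_ACCESS_LOCAL_WRITE);

    /* Two gather entries, one message: control header first, GPU payload second. */
    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)ctrl_host, .length = (uint32_t)ctrl_len, .lkey = ctrl_mr->lkey },
        { .addr = (uintptr_t)data_gpu,  .length = (uint32_t)data_len, .lkey = data_mr->lkey },
    };

    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.sg_list    = sge;
    wr.num_sge    = 2;                 /* SGL: host + device in a single send */
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.ud.ah          = ctx->mcast_ah;
    wr.wr.ud.remote_qpn  = 0xFFFFFF;   /* required destination QPN for UD multicast */
    wr.wr.ud.remote_qkey = ctx->mcast_qkey;

    /* The HCA gathers the host header and the GPU data directly (no staging
     * copy through host memory) and the switch replicates the packet. */
    return ibv_post_send(ctx->ud_qp, &wr, &bad_wr);
}

In a real implementation the memory registrations would of course be cached and reused across messages rather than repeated per send.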
Proposed Heterogeneous Broadcast
• Overview of the SL-based approach
[Figure: the source posts one IB message whose SL step gathers the control header (C) from host memory and the data from GPU memory; the IB HCA and switch then perform the multicast steps to Node 1 … Node N, where the control lands in host memory and the data in GPU memory]
Broadcast on Multi-GPU systems
• Existing two-level approach
  – Inter-node: can apply the proposed SL-based design
  – Intra-node: uses host-based shared memory
• Issues of H-D cudaMemcpy (see the sketch below):
  1. Expensive
  2. Consumes PCIe bandwidth between the CPU and the GPUs!
[Figure: the multicast steps deliver data to each node's host memory; cudaMemcpy (Host ↔ Device) then stages it to GPU 0, GPU 1, …, GPU N]
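For illustration, a sketch of the intra-node half of this existing scheme (our reconstruction, with hypothetical names): once IB MCAST has delivered the payload into a host shared-memory region, every GPU process on the node stages it into its own device buffer, so N GPUs cost N host-to-device PCIe transfers.

/* Sketch of the existing host-staged intra-node path (illustrative only).
 * The multicast payload sits in a host shared-memory region; every rank
 * (one per GPU) copies it across the CPU-GPU PCIe link to its own buffer. */
#include <cuda_runtime.h>
#include <stddef.h>

void baseline_intranode_bcast(const void *shm_host,   /* data delivered by IB MCAST */
                              void *my_gpu_recvbuf,   /* this rank's GPU buffer */
                              size_t nbytes)
{
    /* Repeated on every GPU of the node: N GPUs => N H->D transfers. */
    cudaMemcpy(my_gpu_recvbuf, shm_host, nbytes, cudaMemcpyHostToDevice);
}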
Broadcast on Multi-GPU systems
• Proposed Intra-node Topology-Aware Broadcast
  – CUDA Inter-Process Communication (IPC)
[Figure: the multicast steps deliver data to one GPU per node; cudaMemcpy (Device ↔ Device) over CUDA IPC then distributes it to GPU 0, GPU 1, …, GPU N]
Broadcast on Multi-GPU systems
• Proposed Intra-node Topology-Aware Broadcast (see the sketch below)
  – The leader keeps a copy of the data
  – Synchronization between GPUs
     • Uses a one-byte flag in shared memory on the host
  – Non-leaders copy the data using CUDA IPC
  ⇒ Frees up PCIe bandwidth for application needs
• Other topology-aware designs
  – Ring, K-nomial, etc.
  – Dynamic tuning/selection
[Figure: a shared-memory region in host memory coordinates GPU 0 (leader, holding RecvBuf and CopyBuf) and GPU 1 … GPU N (RecvBuf each), which pull the data via IPC]
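A minimal sketch of this intra-node step under our own naming (shm_region_t, leader_publish, and nonleader_copy are illustrative, not the paper's code): the leader exports its GPU receive buffer through CUDA IPC and raises the one-byte host flag; non-leaders map the handle and pull the payload with a device-to-device copy, keeping the host-device PCIe path free.

/* Sketch only (assumptions: one process per GPU, a host shared-memory segment
 * already mapped by all ranks on the node; error checking omitted). */
#include <cuda_runtime.h>

typedef struct {
    cudaIpcMemHandle_t handle;   /* leader's GPU buffer, exported via IPC */
    size_t             nbytes;
    volatile char      ready;    /* one-byte flag: data valid on leader's GPU */
} shm_region_t;

/* Leader: called after the IB MCAST/GDR step has placed data in leader_gpu_buf. */
void leader_publish(shm_region_t *shm, void *leader_gpu_buf, size_t nbytes)
{
    cudaIpcGetMemHandle(&shm->handle, leader_gpu_buf);
    shm->nbytes = nbytes;
    __sync_synchronize();        /* make handle/size visible before the flag */
    shm->ready = 1;
}

/* Non-leader: copy the payload GPU-to-GPU over the peer (PCIe/NVLink) path. */
void nonleader_copy(shm_region_t *shm, void *my_gpu_recvbuf)
{
    while (!shm->ready)          /* spin on the one-byte host flag */
        ;
    void *leader_ptr = NULL;
    cudaIpcOpenMemHandle(&leader_ptr, shm->handle,
                         cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(my_gpu_recvbuf, leader_ptr, shm->nbytes,
               cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(leader_ptr);
}

In practice the IPC handle would be exchanged once at setup and reused, and the flag reset between successive broadcasts.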
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
  – OSU Micro-Benchmark (OMB) level evaluation
  – Streaming benchmark level evaluation
• Conclusion and Future Work
Experimental Environments
• Systems
  1. Wilkes cluster @ University of Cambridge (http://www.hpc.cam.ac.uk/services/wilkes)
     – 2 NVIDIA K20c GPUs per node
     – Used up to 32 GPU nodes
  2. CSCS cluster @ Swiss National Supercomputing Centre (http://www.cscs.ch/computers/kesch_escha/index.html)
     – Cray CS-Storm system
     – 8 NVIDIA K80 GPU cards per node (= 16 NVIDIA Kepler GK210 GPU chips per node)
     – Used up to 88 NVIDIA K80 GPU cards (176 GPU chips) over 11 nodes
• Benchmarks
  – Modified Ohio State University (OSU) Micro-Benchmark (OMB) suite (http://mvapich.cse.ohio-state.edu/benchmarks/)
     • osu_bcast – MPI_Bcast latency test
     • Modified to support heterogeneous broadcast
  – Streaming benchmark (a sketch of its main loop follows below)
     • Mimics real streaming applications
     • Continuously broadcasts data from a source to GPU-based compute nodes
     • Includes a computation phase that involves host-to-device and device-to-host copies
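The listing below is our reconstruction of what such a streaming benchmark's main loop can look like, not the actual benchmark source: the root repeatedly broadcasts a host buffer while the workers receive into GPU buffers (relying on a CUDA-aware MPI such as MVAPICH2-GDR to handle the heterogeneous pointers), and each iteration includes a compute phase with device-to-host and host-to-device copies that compete for PCIe bandwidth. Message size, iteration count, and timing are illustrative.

/* Illustrative streaming-benchmark loop (not the OSU code). Assumes a
 * CUDA-aware MPI so the root passes a host buffer and workers pass device
 * buffers to the same MPI_Bcast. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE  (1 << 20)   /* illustrative message size (1 MiB) */
#define NUM_ITERS 1000

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    void *host_buf = malloc(MSG_SIZE);          /* live-source data on the root */
    void *dev_buf = NULL, *host_stage = NULL;
    cudaMalloc(&dev_buf, MSG_SIZE);
    cudaMallocHost(&host_stage, MSG_SIZE);      /* pinned staging buffer */

    double t0 = MPI_Wtime();
    for (int i = 0; i < NUM_ITERS; i++) {
        /* Heterogeneous broadcast: host memory at the root, GPU memory on workers. */
        if (rank == 0)
            MPI_Bcast(host_buf, MSG_SIZE, MPI_CHAR, 0, MPI_COMM_WORLD);
        else
            MPI_Bcast(dev_buf, MSG_SIZE, MPI_CHAR, 0, MPI_COMM_WORLD);

        /* Computation phase on workers with D->H and H->D copies, which
         * compete for PCIe bandwidth with any host-staged broadcast scheme. */
        if (rank != 0) {
            cudaMemcpy(host_stage, dev_buf, MSG_SIZE, cudaMemcpyDeviceToHost);
            /* ... process the data or launch kernels here ... */
            cudaMemcpy(dev_buf, host_stage, MSG_SIZE, cudaMemcpyHostToDevice);
        }
    }
    if (rank == 0)
        printf("avg time per iteration: %f s\n", (MPI_Wtime() - t0) / NUM_ITERS);

    cudaFree(dev_buf);
    cudaFreeHost(host_stage);
    free(host_buf);
    MPI_Finalize();
    return 0;
}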