Topology-Aware GPU Selection on Multi-GPU Nodes Iman Faraji, Seyed H. Mirsadeghi, and Ahmad Afsahi Department of Electrical and Computer Engineering Parallel Processing Research Laboratory Queen’s University Canada The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) May 23, 2016
Outline • Introduction • Background and Motivation • Design • Results • Conclusion • Future Work
Introduction • GPU accelerators have successfully established themselves in modern HPC clusters – High performance – Energy efficiency • Demand for higher GPU computational power and memory – Multi-GPU nodes in state-of-the-art HPC clusters
Introduction • Clusters with multi-GPU nodes provide: – Higher computational power – More memory to hold larger datasets • However, this brings up a challenge: more GPUs mean potentially more GPU-to-GPU communication, the “Achilles heel” of GPU-accelerated application performance!
Introduction • To address the GPU communication bottleneck: – Increase GPU utilization at the application level: reduces the share of GPU communication in the application runtime, but not all applications can highly utilize the GPUs in a node – Asynchronously progress inter-process GPU communication and GPU computation: overlaps GPU communication with computation, but a high degree of overlap is not always feasible – Leverage GPU hardware features such as IPC (see the sketch below): improves GPU-to-GPU communication performance, but is only possible for specific GPU pairs within a node, and performance is still limited by the latency and bandwidth of the underlying channel • HOWEVER…
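As a hedged illustration of the IPC feature mentioned above (not code from this work), the sketch below shows the basic CUDA IPC calls two MPI processes on the same node could use to share a device buffer directly. It only works for GPU pairs that can access each other, which echoes the limitation noted on the slide.

/* Hedged illustration of CUDA IPC: rank 0 exports a handle to its device
 * buffer, rank 1 opens it and can then copy to/from it directly, avoiding a
 * staging copy through host memory.                                          */
#include <mpi.h>
#include <cuda_runtime.h>

void ipc_example(int rank)
{
    if (rank == 0) {                               /* exporting process */
        void *d_buf;
        cudaMalloc(&d_buf, 1 << 20);

        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_buf);       /* export handle to the allocation */
        MPI_Send(&handle, (int)sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                        /* importing process */
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, (int)sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        void *d_peer;
        cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
        /* d_peer now aliases rank 0's buffer and can be used directly in
         * cudaMemcpy calls or kernels on rank 1's GPU.                       */
        cudaIpcCloseMemHandle(d_peer);
    }
}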
Introduction • Smartly designed applications will continue to use these features • GPU communication can still become a weak point across different applications and GPU nodes • Conduct GPU communications as efficiently as possible. HOW?
Background and Motivation • Multi-GPU node architecture [figure: topology tree of a 16-GPU node (GPUs 0–15) showing the communication level between each GPU pair] · Level 0: path between a GPU pair traverses a single PCIe internal switch · Level 1: path traverses multiple PCIe internal switches · Level 2: path traverses a PCIe host bridge · Level 3: path traverses a socket-level link (e.g., QPI) Helios-K80 cluster at Université Laval's computing centre
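As a hedged side note (one possible mechanism, not necessarily the one used in this work), the level of a GPU pair can be recovered programmatically from NVML, which reports the deepest common ancestor of the two PCIe paths:

/* Hedged sketch (assumed NVML usage): classify the path between GPU 0 and
 * GPU 1 by the deepest common ancestor on the PCIe topology.
 * Build with -lnvidia-ml.                                                    */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t gpu0, gpu1;
    nvmlGpuTopologyLevel_t level;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &gpu0);
    nvmlDeviceGetHandleByIndex(1, &gpu1);
    nvmlDeviceGetTopologyCommonAncestor(gpu0, gpu1, &level);

    switch (level) {
    case NVML_TOPOLOGY_INTERNAL:   puts("same board (Level 0)");                  break;
    case NVML_TOPOLOGY_SINGLE:     puts("single PCIe switch (Level 0)");          break;
    case NVML_TOPOLOGY_MULTIPLE:   puts("multiple PCIe switches (Level 1)");      break;
    case NVML_TOPOLOGY_HOSTBRIDGE: puts("PCIe host bridge (Level 2)");            break;
    case NVML_TOPOLOGY_SYSTEM:     puts("socket-level link, e.g. QPI (Level 3)"); break;
    default:                       puts("other / NUMA-node level");               break;
    }

    nvmlShutdown();
    return 0;
}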
Background and Motivation • Multi-GPU node bandwidth
Background and Motivation • Multi-GPU node latency
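The bandwidth and latency plots themselves are not reproduced here. As a rough, hedged sketch of how such per-pair numbers can be obtained (not the authors' benchmark code), the bandwidth between two GPUs can be timed with CUDA events over a peer-to-peer copy:

/* Hedged sketch: time a copy between GPU src and GPU dst with CUDA events and
 * report bandwidth.  Assumes the pair supports peer access; error checking
 * and buffer cleanup are omitted for brevity.                                */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const int src = 0, dst = 1;
    const size_t bytes = 64UL << 20;              /* 64 MiB message */
    void *d_src, *d_dst;

    cudaSetDevice(src);
    cudaMalloc(&d_src, bytes);
    cudaDeviceEnablePeerAccess(dst, 0);           /* use the direct P2P path if available */

    cudaSetDevice(dst);
    cudaMalloc(&d_dst, bytes);
    cudaDeviceEnablePeerAccess(src, 0);

    cudaSetDevice(src);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpyPeer(d_dst, dst, d_src, src, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU %d -> GPU %d: %.2f GB/s\n", src, dst, (bytes / 1e9) / (ms / 1e3));
    return 0;
}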
Design • What we know: – Intranode GPU-to-GPU communications may traverse different paths – Different paths can have different latency and bandwidth • Ultimate goal: – Efficient utilization of GPU communication channels: intensive communications carried over stronger channels • Our proposal: – Topology-aware GPU selection: intelligent assignment of intranode GPUs to MPI processes so as to maximize communication performance (see the sketch below)
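A minimal, hedged sketch of what this proposal looks like from the application's point of view: each MPI rank consults a mapping table and binds to its assigned GPU instead of the default rank-order device. The table below is a hard-coded, purely hypothetical placeholder; in the real design it is produced by the mapping procedure on the following slides.

/* Hedged sketch: consuming a topology-aware rank-to-GPU mapping table. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank -> GPU index; topology-aware instead of the default rank % num_gpus.
       Hypothetical table for a 4-process job. */
    const int gpu_of_rank[4] = {0, 1, 3, 2};

    cudaSetDevice(gpu_of_rank[rank % 4]);

    /* ... allocate device buffers, enable peer access, run the application ... */

    MPI_Finalize();
    return 0;
}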
Design • Our approach: 1. Extracting the GPU communication pattern 2. Extracting the physical characteristics of the node 3. Modeling topology-aware GPU selection as a graph mapping problem (GPU virtual topology mapped onto GPU physical topology) 4. Solving the problem using a mapping algorithm, producing a GPU mapping table
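For step 1, the authors instrument the Open MPI library itself (next slide). As a rough, hedged illustration of the same idea outside the library, a per-peer communication-volume matrix could be collected with the standard PMPI profiling interface:

/* Hedged sketch: per-peer communication-volume counting via the PMPI
 * profiling interface.  This wrapper only illustrates what a GPU
 * communication pattern records; it is not the paper's instrumentation.     */
#include <mpi.h>

#define MAX_PEERS 64
static long long bytes_sent[MAX_PEERS];   /* bytes this rank sent to each peer */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);
    if (dest >= 0 && dest < MAX_PEERS)
        bytes_sent[dest] += (long long)count * type_size;

    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* At MPI_Finalize time, the per-rank arrays would be gathered and turned into
 * the edge weights of the communication-pattern graph used in step 3.       */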
Design • Our approach (implementation): 1. Extracting the GPU communication pattern: the Open MPI library is instrumented to collect the GPU inter-process communication pattern 2. Extracting the physical characteristics of the node, using latency, bandwidth, and distance metrics 3. Modeling topology-aware GPU selection as a graph mapping problem: the GPU virtual and physical topologies are expressed with the SCOTCH graph API 4. Solving the problem using the SCOTCH mapping algorithm, producing a GPU mapping table
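For steps 3 and 4, the sketch below shows the shape of the SCOTCH calls for a toy 4-process, 4-GPU case. It is an assumption-laden illustration: the graph contents (communication volumes, link weights) are made-up placeholders, and the exact graph construction, weight semantics, and mapping strategy used in this work may differ.

/* Hedged toy example: map a 4-process communication-pattern graph onto a
 * 4-GPU topology graph with SCOTCH.  Pattern edge loads stand for
 * communication volume; topology edge loads for relative link strength
 * (GPUs 0-1 and 2-3 share a PCIe switch here).  Build with -lscotch -lscotcherr. */
#include <stdio.h>
#include <scotch.h>

int main(void)
{
    /* Communication pattern: 4 processes on a ring (compact CSR form). */
    SCOTCH_Num pvert[5] = {0, 2, 4, 6, 8};
    SCOTCH_Num pedge[8] = {1, 3,  0, 2,  1, 3,  2, 0};
    SCOTCH_Num pload[8] = {10, 2, 10, 2,  2, 10, 10, 2};   /* message volumes */

    /* GPU topology: 4 GPUs, fully connected; heavier edge = stronger channel. */
    SCOTCH_Num tvert[5]  = {0, 3, 6, 9, 12};
    SCOTCH_Num tedge[12] = {1, 2, 3,  0, 2, 3,  0, 1, 3,  0, 1, 2};
    SCOTCH_Num tload[12] = {8, 2, 2,  8, 2, 2,  2, 2, 8,  2, 2, 8};

    SCOTCH_Graph pattern, topo;
    SCOTCH_Arch  arch;
    SCOTCH_Strat sbuild, smap;
    SCOTCH_Num   maptab[4];                    /* maptab[rank] = assigned GPU */

    SCOTCH_graphInit(&pattern);
    SCOTCH_graphBuild(&pattern, 0, 4, pvert, NULL, NULL, NULL, 8, pedge, pload);
    SCOTCH_graphInit(&topo);
    SCOTCH_graphBuild(&topo, 0, 4, tvert, NULL, NULL, NULL, 12, tedge, tload);

    SCOTCH_stratInit(&sbuild);
    SCOTCH_archInit(&arch);
    SCOTCH_archBuild(&arch, &topo, 0, NULL, &sbuild);  /* topology graph -> target architecture */

    SCOTCH_stratInit(&smap);
    SCOTCH_graphMap(&pattern, &arch, &smap, maptab);   /* solve the mapping problem */

    for (int p = 0; p < 4; p++)
        printf("process %d -> GPU %ld\n", p, (long)maptab[p]);

    SCOTCH_archExit(&arch);
    SCOTCH_graphExit(&topo);
    SCOTCH_graphExit(&pattern);
    SCOTCH_stratExit(&sbuild);
    SCOTCH_stratExit(&smap);
    return 0;
}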
Results: Setup • One node of the Helios cluster from Calcul Québec – 16 GPUs (K80) – Two 12-core Intel Xeon 2.7 GHz CPUs • 4 micro-benchmarks – 5-point 2D stencil – 5-point 2D torus – 7-point 3D torus – 5-point 4D hypercube • One application: HOOMD-blue (new)
Results: Micro-benchmarks [figure] Runtime improvement of topology-aware mappings over the default mapping on non-weighted micro-benchmarks
Results: Micro-benchmarks [figure] Runtime improvement of topology-aware mappings over the default mapping on weighted micro-benchmarks
Results: Application (NEW!) [figure] Runtime of the HOOMD-blue application with the LJ-512K particle configuration using default and topology-aware mappings; the topology-aware mappings improve runtime by 12.8% and 15.7% over the default mapping
Conclusion • Discussed the GPU inter-process communication bottleneck – Overviewed some potential solutions to mitigate its effect • Showed an example of a multi-GPU node and its communication channels • Showed the different levels of bandwidth and latency in a multi-GPU node • Proposed a topology-aware GPU selection approach – More efficient utilization of GPU-to-GPU communication channels – Performance improvement by mapping intensive communications onto stronger channels
Conclusion Topology awareness matters for GPU communications and can provide considerable performance improvements.
Future Work • Evaluation on different multi-GPU nodes with different node architectures and GPUs. • Impact on different applications. • Extension towards multiple nodes across the cluster.
Acknowledgments
Thank you for your attention! Contacts: • Iman Faraji : i.faraji@queensu.ca • Seyed H. Mirsadeghi : s.mirsadeghi@queensu.ca • Ahmad Afsahi : ahmad.afsahi@queensu.ca Questions?
Backup: Motivation [figures: Helios-K20, Helios-K80]