

  1. Bigger GPUs and Bigger Nodes
     Carl Pearson (pearson@illinois.edu)
     PhD Candidate, advised by Professor Wen-Mei Hwu

  2. Outline
     Experiences from working with domain experts to develop GPU codes on Blue Waters
     ▪ Kepler and Volta GPUs
     ▪ HPC: Kepler to Volta Speedup
     ▪ Blue Waters, Summit, Sierra
     ▪ Intra-Node Communication Performance

  3. GPU Architecture Bird’s Eye View (not to scale)
     [Figure: a GPU-accelerated node. The accelerator holds 10-100 SMs (cores, registers, L1$ / shared memory) behind a shared L2$ and its own DRAM / HBM; an interconnect ties it to the CPU, which connects to system DRAM, hard drives, network, etc.]
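
     A minimal CUDA sketch, not from the talk, in which each thread walks that hierarchy: a load from DRAM/HBM into a register, a stop in shared memory, and a store back out. The hierarchy_demo kernel name and the 256-thread block size are illustrative assumptions.

     #include <cuda_runtime.h>

     // Hypothetical kernel: global memory -> register -> shared memory -> global memory.
     __global__ void hierarchy_demo(float *out, const float *in) {
       __shared__ float tile[256];   // per-block shared memory, resident on one SM
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       float r = 2.0f * in[i];       // in[i] read from DRAM/HBM through L2$; r lives in a register
       tile[threadIdx.x] = r;        // staged in shared memory
       __syncthreads();
       out[i] = tile[threadIdx.x];   // written back out to DRAM/HBM
     }

     int main() {
       const int n = 1 << 20;
       float *in, *out;
       cudaMalloc(&in, n * sizeof(float));
       cudaMalloc(&out, n * sizeof(float));
       hierarchy_demo<<<n / 256, 256>>>(out, in);  // block size matches the shared tile
       cudaDeviceSynchronize();
       cudaFree(in);
       cudaFree(out);
       return 0;
     }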

  4. Kepler vs. Volta

                    Number  Maximum      Shared       Registers  Single-Precision  Global Memory
                    of SMs  Blocks / SM  Memory / SM  / SM       Rate              Bandwidth
     K20X (Kepler)  15      16           48 KB        64 K       3.94 TFLOPS       250 GB/s
     V100 (Volta)   80      32           96 KB        64 K       15 TFLOPS         900 GB/s
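
     These parameters can also be read at run time. A minimal sketch, not from the talk, using the standard cudaGetDeviceProperties query:

     #include <cstdio>
     #include <cuda_runtime.h>

     int main() {
       int count = 0;
       cudaGetDeviceCount(&count);
       for (int dev = 0; dev < count; ++dev) {
         cudaDeviceProp prop;
         cudaGetDeviceProperties(&prop, dev);
         // Print the columns from the table above for each visible GPU.
         printf("%s: %d SMs, %zu KB shared/SM, %d regs/SM, %zu GB global\n",
                prop.name, prop.multiProcessorCount,
                prop.sharedMemPerMultiprocessor / 1024,
                prop.regsPerMultiprocessor,
                prop.totalGlobalMem / (1024ul * 1024 * 1024));
       }
       return 0;
     }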

  5. K20x to V100: Architectural Parameters

  6. HPC Case Studies

     AWP-ODC
     ▪ Tom Jordan, Yifeng Cui
     ▪ Southern California Earthquake Center, University of Southern California
     ▪ Anelastic Wave Propagation: solves a velocity-stress formulation of the 3D wave equation

     ChaNGa
     ▪ Tom Quinn
     ▪ University of Washington
     ▪ Charm N-body Gravity Solver: collisionless N-body simulations

  7. AWP and ChaNGa V100 Speedup

              Vs. P100   Vs. K20x (Blue Waters)
     ChaNGa   3.28       4.73
     AWP      1.71       5.19

  8. AWP Detail
     Single-precision speedup: 1.711 over P100, 5.188 over K20X

               K20x                      V100
               Kernel 1     Kernel 2     Kernel 1     Kernel 2
     GPU Time  72.4 %       27.5 %       70.1 %       29.3 %
     Mem BW    145.7 GB/s   136.1 GB/s   726.7 GB/s   600.2 GB/s

     On K20x the kernels are latency-limited, achieving well under the 250 GB/s peak; on V100 they are bandwidth-limited, running near the 900 GB/s peak.

  9. AWP Optimizations

     Large blocks to capture reuse:
     ▪ Reuse in fast memory
     ▪ Blocks / SM limited by registers and SMs

     Uneven architectural change:
     ▪ Many more SMs
     ▪ More memory per SM
     ▪ Same registers per SM

     Unclear tradeoff:
     ▪ Fine-grained parallelism: more work for the GPU, less reuse (see the occupancy sketch below)
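
     The blocks-per-SM limit above can be queried with the CUDA occupancy API. A sketch under assumptions: stencil_kernel is a hypothetical stand-in for an AWP kernel, and the swept block sizes are arbitrary.

     #include <cstdio>
     #include <cuda_runtime.h>

     // Hypothetical stand-in for an AWP-style kernel.
     __global__ void stencil_kernel(float *out, const float *in) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       out[i] = in[i];
     }

     int main() {
       // How many blocks of each size can co-reside on one SM, given the
       // kernel's register and shared-memory footprint?
       for (int blockSize = 128; blockSize <= 1024; blockSize *= 2) {
         int blocksPerSM = 0;
         cudaOccupancyMaxActiveBlocksPerMultiprocessor(
             &blocksPerSM, stencil_kernel, blockSize, /*dynamicSMemSize=*/0);
         printf("block size %4d -> %d resident blocks / SM\n", blockSize, blocksPerSM);
       }
       return 0;
     }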

  10. Takeaways

     Laissez-faire approach:
     ▪ 3-5x kernel speedup over optimized Kepler
     ▪ 3-5x interconnect speedup over optimized Kepler
     ▪ Larger problem to fill GPU

     Redesign/rewrite approach:
     ▪ Finer-grained parallelism to fill GPU
     ▪ Harder to capture reuse (key to performance)

  11. Nodes are Getting Bigger

                   Blue Waters (XK)       Summit (ORNL) [1]
     CPU           1x AMD64               2x POWER9
                   32 threads, 16 FP      88 threads, 22 FP each
     GPU           1x K20X                6x V100
                   6 GB, 4 TF             16 GB, 15 TF each
     Accelerator   PCIe 2.0 x16           NVLink 2.0 x2
     Interconnect  8 GB/s                 50 GB/s (unidirectional)
     Memory        32 GB                  512 GB

     1: https://www.olcf.ornl.gov/for-users/system-user-guides/summit/system-overview/

  12. Blue Waters XK and Summit Intra-Node Interconnects
     [Figure: Blue Waters node topology (AMD64 CPU connected to one K20x over PCIe 2.0 x16) beside Summit node topology (two POWER9 CPUs, each connected to three V100s over NVLink 2.0 x2).]

  13. System Performance Research
     ▪ CUDA Microbench: https://github.com/rai-project/microbench
     ▪ Neural networks, MLModelScope: http://ml-arc-minsky.netlify.com/

     Future directions:
     ▪ Quick application-driven architecture design
     ▪ Performance modeling of neural networks

  14. Faster Interconnects
     ▪ NVLink 2.0 x3 (1.5x Summit): 75 GB/s
     ▪ PCIe 3.0 x16 (2x Blue Waters): 15.8 GB/s

     github.com/rai-project/microbench
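
     These numbers come from the microbench repository linked above. A minimal sketch of how such a CPU-to-GPU bandwidth measurement is commonly taken (pinned host memory plus event timing), not the repository's actual code:

     #include <cstdio>
     #include <cuda_runtime.h>

     int main() {
       const size_t bytes = 1ul << 30;  // 1 GiB transfer (size is an arbitrary choice)
       char *src, *dst;
       cudaMallocHost(&src, bytes);     // pinned host memory for full interconnect speed
       cudaMalloc(&dst, bytes);

       cudaEvent_t start, stop;
       cudaEventCreate(&start);
       cudaEventCreate(&stop);

       cudaEventRecord(start);
       cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice);
       cudaEventRecord(stop);
       cudaEventSynchronize(stop);

       float ms = 0.0f;
       cudaEventElapsedTime(&ms, start, stop);
       printf("host-to-device: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

       cudaFreeHost(src);
       cudaFree(dst);
       return 0;
     }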

  15. Unified Memory
     Allocations accessible from CPU and GPU; implicit data transfer (no cudaMemcpy)

     cudaSetDevice(0);
     cudaMallocManaged(&a, ...);
     a[page0] = 0;  // gpu0
     a[page1] = 1;  // gpu1: page fault and migration
     a[page2] = 2;  // cpu: page fault and migration
     cudaMemAdvise(a, ..., cudaMemAdviseSetPreferredLocation, gpu1);
     a[page1] = 1;  // cpu: write served over NVLink
     cudaMemPrefetchAsync(a, ..., gpu1);  // bulk page migration
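
     A self-contained sketch of the pattern on the slide, with illustrative assumptions: an allocation of 1 M ints, device IDs 0 and 1, and at least two Pascal-or-newer GPUs so that cudaMemAdvise and on-demand migration behave as described.

     #include <cstdio>
     #include <cuda_runtime.h>

     __global__ void touch(int *a, size_t i, int v) { a[i] = v; }

     int main() {
       const size_t n = 1 << 20;                // illustrative size
       const int gpu0 = 0, gpu1 = 1;            // assumes at least two GPUs
       int *a = nullptr;
       cudaMallocManaged(&a, n * sizeof(int));  // one pointer, visible to CPU and GPUs

       cudaSetDevice(gpu0);
       touch<<<1, 1>>>(a, 0, 0);                // gpu0 write: pages migrate to gpu0
       cudaDeviceSynchronize();

       a[n / 2] = 2;                            // cpu write: page fault and migration

       // Prefer gpu1 as the home for these pages, then migrate them in bulk
       // instead of fault-by-fault.
       cudaMemAdvise(a, n * sizeof(int), cudaMemAdviseSetPreferredLocation, gpu1);
       cudaMemPrefetchAsync(a, n * sizeof(int), gpu1);
       cudaDeviceSynchronize();

       printf("a[%zu] = %d\n", n / 2, a[n / 2]);
       cudaFree(a);
       return 0;
     }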

  16. P9 Unified Memory Performance
     ▪ Limited by 1 CPU thread
     ▪ Coherence: 30% of explicit management
     ▪ Prefetch: 50-80% of explicit

     github.com/rai-project/microbench

  17. AMD64 Unified Memory Performance
     ▪ Coherence: 30-70% of explicit management
     ▪ Prefetch: 50-95% of explicit

     github.com/rai-project/microbench

  18. Device Affinity
     Data placement on big nodes can have a dramatic communication impact.

     github.com/rai-project/microbench
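
     One way to probe that placement sensitivity is to check which GPU pairs share a direct peer path before deciding where data lives. A minimal sketch, not the microbench code:

     #include <cstdio>
     #include <cuda_runtime.h>

     int main() {
       int n = 0;
       cudaGetDeviceCount(&n);
       // Report which GPU pairs can access each other's memory directly
       // (e.g., over NVLink on a Summit node).
       for (int src = 0; src < n; ++src) {
         for (int dst = 0; dst < n; ++dst) {
           if (src == dst) continue;
           int canAccess = 0;
           cudaDeviceCanAccessPeer(&canAccess, src, dst);
           printf("GPU %d -> GPU %d: peer access %s\n", src, dst,
                  canAccess ? "yes" : "no");
         }
       }
       return 0;
     }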

  19. MLModelScope: Neural Network Performance Data
     http://ml-arc-minsky.netlify.com

     (model -- machine -- framework) triples:
     ▪ (AlexNet -- Jetson TX-1 -- TensorFlow)
     ▪ (VGG19 -- AWS P2 X-large -- MXNet)

     Neural-network performance primitive benchmarks

  20. Thank You
     https://cwpearson.github.io
     pearson@illinois.edu

     Special thanks to
     ▪ Professor Wen-Mei Hwu
     ▪ John Larson, Simon Garcia de Gonzalo, Zaid Qureshi, Mert Hidayetoglu, Abdul Dakkak, and Cheng Li (University of Illinois)
     ▪ Isaac Gelado (NVIDIA)
     ▪ Jinjun Xiong and I-Hsin Chung (IBM)
     ▪ The IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM Cognitive Horizon Network
