
GPU ACCELERATED COMPUTING IN HPC AND IN THE DATA CENTER. Peter Messmer, DATE 2019, March 27 2019



  1. GPU ACCELERATED COMPUTING IN HPC AND IN THE DATA CENTER Peter Messmer, DATE 2019, March 27 2019

  2. RISE OF GPU COMPUTING
     [Figure: performance on a log scale (10^2 to 10^7) versus year (1980 to 2020). Curve labels: GPU-computing perf, 1.5X per year, 1000X by 2025; single-threaded perf, 1.5X per year, then 1.1X per year. Layers shown under the GPU curve: Applications, Algorithms, Systems, CUDA, Architecture.]
     Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.
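As a rough check of the growth rates quoted on this slide, the sketch below computes how many years of 1.5X annual growth a 1000X gain requires, and what 1.1X per year would yield over the same span. It assumes constant compounding; the function names are illustrative and not from the talk.

```python
import math


def years_to_reach(factor, rate_per_year):
    """Years of constant compounding at rate_per_year needed to gain `factor`."""
    return math.log(factor) / math.log(rate_per_year)


def gain_after(years, rate_per_year):
    """Total gain after `years` of constant compounding at rate_per_year."""
    return rate_per_year ** years


years = years_to_reach(1000, 1.5)
print(f"{years:.1f} years at 1.5X per year for a 1000X gain")
print(f"{gain_after(years, 1.1):.1f}X gain at 1.1X per year over the same span")
```

Under these assumptions, 1000X takes about 17 years at 1.5X per year, while 1.1X per year over the same period gives only about 5X.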

  3. NVIDIA POWERS WORLD’S FASTEST SUPERCOMPUTERS. 48% More Systems | 22 of Top 25 Greenest
     ORNL Summit: World’s Fastest, 27,648 GPUs, 144 PF
     LLNL Sierra: World’s 2nd Fastest, 17,280 GPUs, 95 PF
     Piz Daint: Europe’s Fastest, 5,704 GPUs, 21 PF
     ABCI: Japan’s Fastest, 4,352 GPUs, 20 PF
     ENI HPC4: Fastest Industrial, 3,200 GPUs, 12 PF

  4. THE NEW HPC MARKET: Simulation, Machine Learning, Deep Learning

  5. NVIDIA POWERS 5 OF 6 GORDON BELL NOMINATIONS. GPU Acceleration Critical To HPC At Scale Today
     Genomics: Prize Winner, 2.36 ExaOps
     Weather: Prize Winner, 1.13 ExaOps
     Seismic: 1st Soil & Structure Simulation
     Material Science: 300X Higher Performance
     Quantum Chromodynamics: <1% of Uncertainty Margin

  6. TESLA UNIVERSAL ACCELERATION PLATFORM. Single Platform To Drive Utilization and Productivity
     CUSTOMER USE CASES: Consumer Internet (Speech, Translate, Recommender); Industrial Applications (Healthcare, Manufacturing, Finance); Supercomputing (Molecular Simulations, Weather Forecasting, Seismic Mapping)
     APPS & FRAMEWORKS: Amber, NAMD, +550 Applications
     NVIDIA SDK & LIBRARIES: Machine Learning | RAPIDS (cuDF, cuML, cuGRAPH); Deep Learning (cuDNN, cuBLAS, CUTLASS, NCCL, TensorRT); Supercomputing (CuBLAS, CuFFT, OpenACC); CUDA
     TESLA GPUs & SYSTEMS: Tesla GPU, Virtual GPU, NVIDIA DGX Family, NVIDIA HGX, System OEM, Cloud

  7. EXPANDING VALUE FOR HPC CUSTOMERS. Partnering With HPC Development Community
     [Figure, two panels: "More performance with same GPU" and "Adding new and improved top applications", 2018 to 2019. Domains: Chemistry, Cryo, CFD, Microscopy, Genomics, Weather. Applications: AMBER, CHROMA, CRYOSPARC, FUN3D, GROMACS, GTC, LAMMPS, MICROVOLUTION, MILC, NAMD, PARABRICKS, QUANTUM ESP, SPECFEM3D, WRF. Speedup labels range from 7x to 48x.]
     CPU Server: Dual Xeon Gold 6140@2.30GHz. GPU Servers: same CPU server w/ 4 NVIDIA V100 PCIe or SXM2 GPUs.

  8. CUDA DEVELOPMENT ECOSYSTEM
     Audiences: GPU users, domain specialists, problem specialists, optimization experts, new algorithm developers.
     Tools, from ease of use to specialized performance: Applications, Frameworks, Libraries, Directives and Standard Languages, Extended Standard Languages (CUDA-C++, CUDA Fortran).
     Foundation: CUDA (Programming Model, GPU Architecture, System Architecture).

  9. NEW PROGRAMMING MODEL FEATURES
     Feature areas: Turing Multi-Precision Tensor Cores; Execution Efficiency through Asynchronous Task Graphs; Lightweight Graphics Interop; Precision; NVCC Enhancements.
     FP16 Operations:
       atomicAdd(&h, (half)1.15f);
       half2 hvec(0.94f, -2.13f);
       atomicAdd(&h2, hvec);
     IEEE 754-2008 FP16 Specification: 0 01110 0110101000 = 0.707031 (sign bit, 5-bit exponent, 10-bit mantissa)
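The FP16 bit pattern on this slide can be decoded by hand. The sketch below is a minimal illustration in plain Python (not CUDA): it splits the 16 bits into sign, exponent and mantissa and applies the IEEE 754 binary16 formula for normal numbers. The helper name `decode_fp16` is illustrative; the result is cross-checked against the half-precision format `"e"` in Python's standard `struct` module.

```python
import struct


def decode_fp16(bits):
    """Decode a 16-bit IEEE 754 binary16 pattern (normal numbers only)."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F   # 5 bits, stored with a bias of 15
    mantissa = bits & 0x3FF          # 10 bits, with an implicit leading 1
    if exponent == 0 or exponent == 0x1F:
        raise ValueError("zero, subnormal, inf and NaN are not handled in this sketch")
    value = (1 + mantissa / 1024) * 2.0 ** (exponent - 15)
    return -value if sign else value


# Bit pattern from the slide: sign | exponent | mantissa
pattern = 0b0_01110_0110101000

manual = decode_fp16(pattern)
# Cross-check with the standard library's half-precision codec.
reference = struct.unpack("<e", struct.pack("<H", pattern))[0]
print(manual, reference)
```

Here the exponent field is 14, so the scale is 2^(14-15) = 0.5, and the mantissa field is 424, giving (1 + 424/1024) x 0.5 = 0.70703125, which the slide rounds to 0.707031.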

  10. INDEPENDENT THREAD SCHEDULING. Communicating Algorithms
      Pascal: Lock-Free Algorithms. Threads cannot wait for messages.
      Volta/Turing: Starvation-Free Algorithms. Threads may wait for messages.

  11. ASYNCHRONOUS TASK GRAPHS. Execution Optimization When Workflow is Known Up-Front
      Use cases: Deep Neural Network Training, DL Inference, Loop & Function Offload, Linear Algebra, HPC Simulation

  12. DEFINITION OF A CUDA GRAPH. Graph Nodes Are Not Just Kernel Launches
      Sequence of operations, connected by dependencies. Operations are one of:
      Kernel Launch: CUDA kernel running on GPU
      CPU Function Call: Callback function on CPU
      Memcopy/Memset: GPU data management
      Sub-Graph: Graphs are hierarchical
      [Diagram: example graph with nodes A, B, C, D, E, X, Y and End.]
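The idea of "operations connected by dependencies" can be modelled in a few lines of plain Python. This is only a conceptual sketch, not the CUDA Graph API: the node names come from the slide's diagram, but the edges are assumed for illustration because the transcript does not preserve them. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+) to group nodes into waves that could run concurrently once their dependencies have finished.

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on (edges are ASSUMED, not taken from the slide)
dependencies = {
    "A": set(),
    "B": {"A"},
    "X": {"A"},
    "C": {"B"},
    "D": {"B"},
    "E": {"C", "D"},
    "Y": {"X"},
    "End": {"E", "Y"},
}


def launch_waves(deps):
    """Group nodes into waves; every node in a wave has all dependencies done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = launch_waves(dependencies)
for index, wave in enumerate(waves):
    print(index, wave)
```

Because the whole dependency structure is known before execution starts, a scheduler can see that B and X (and later C, D and Y) are independent, which is the kind of up-front knowledge the task-graph slides describe.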

  13. WHAT IS OPENACC
      Directives-based programming model for parallel computing. Designed for performance and portability on CPUs and GPUs. Open Specification Developed by OpenACC.org Consortium.
      Add Simple Compiler Directive:
        main()
        {
          <serial code>
          #pragma acc kernels
          {
            <parallel code>
          }
        }
      SIMPLE | POWERFUL & PORTABLE
      Read more at www.openacc.org/about

  14. WHO OPENACC IS FOR
      Domain Scientists (The Main Focus): 1. Want to do more science & less programming. 2. Believe that GPUs are hard. 3. Need help learning how to get started easily with GPUs. 4. Mostly don’t have a computer science degree.
      Application Developers: Looking for: 1. easy code maintenance, 2. better efficiency, 3. portability. Mostly computer scientists.

  15. OPENACC GROWING MOMENTUM. Wide Adoption Across Key HPC Codes
      Over 100 Apps* Using OpenACC, including: ANSYS Fluent, Gaussian, VASP, LSDalton, MPAS, GAMERA, GTC, XGC, ACME, FLASH, COSMO, Numeca
      VASP: Top Quantum Chemistry and Material Science Code. [Figure: silica IFPEN, RMM-DIIS on P100.]
      "For VASP, OpenACC is the way forward for GPU acceleration. Performance is similar to CUDA, and OpenACC dramatically decreases GPU development and maintenance efforts. We’re excited to collaborate with NVIDIA and PGI as an early adopter of Unified Memory." Prof. Georg Kresse, Computational Materials Physics, University of Vienna
      * Applications in production and development

  16. SINGLE CODE FOR MULTIPLE PLATFORMS. OpenACC - Performance Portable Programming Model for HPC
      Platforms: OpenPOWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, AMD GPU, PEZY-SC.
      [Figure: AWE Hydrodynamics CloverLeaf mini-App, bm32 data set (http://uk-mac.github.io/CloverLeaf). Y axis: speedup vs single Haswell core, 0 to 160. Series: PGI 18.1 OpenACC, Intel 2018 OpenMP. Groups: Multicore Haswell, Multicore Broadwell, Multicore Skylake, Kepler, Pascal, Volta V100; additional labels 1x, 2x, 4x. Bar labels range from 7.6x to 142x.]
      Systems: Haswell: 2x16 core Haswell server, four K80s, CentOS 7.2 (perf-hsw10), Broadwell: 2x20 core Broadwell server, eight P100s (dgx1-prd-01), Broadwell server, eight V100s (dgx07), Skylake 2x20 core Xeon Gold server (sky-4). Compilers: Intel 2018.0.128, PGI 18.1. Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7 2016; CloverLeaf_Serial; CloverLeaf_ref (MPI+OpenMP); CloverLeaf_OpenACC (MPI+OpenACC). Data compiled by PGI February 2018.

  17. NSIGHT SYSTEMS. System-wide Performance Analysis
      Observe Application Behavior: CPU threads, GPU traces, Memory Bandwidth and more
      Locate Optimization Opportunities: CUDA & OpenGL APIs, Unified Memory transfers, User Annotations using NVTX
      Ready for Big Data: Fast GUI capable of visualizing in excess of 10 million events on laptops, Container support, Minimum user privileges
      https://developer.nvidia.com/nsight-systems

  18. Thread/core migration; Processes and threads; Thread state; CUDA and OpenGL API trace; cuDNN and cuBLAS trace; Kernel and memory transfer activities; Multi-GPU

  19. CONTAINERS: SIMPLIFYING WORKFLOWS
      WHY CONTAINERS
      Simplifies Deployments - Eliminates complex, time-consuming builds and installs
      Get started in minutes - Simply Pull & Run the app
      Portable - Deploy across various environments, from test to production with minimal changes

  20. NGC CONTAINERS: ACCELERATING WORKFLOWS
      WHY CONTAINERS
      Simplifies Deployments - Eliminates complex, time-consuming builds and installs
      Get started in minutes - Simply Pull & Run the app
      Portable - Deploy across various environments, from test to production with minimal changes
      WHY NGC CONTAINERS
      Optimized for Performance - Monthly DL container releases offer latest features and superior performance on NVIDIA GPUs
      Scalable Performance - Supports multi-GPU & multi-node systems for scale-up & scale-out environments
      Designed for Enterprise & HPC environments - Supports Docker & Singularity runtimes
      Run Anywhere - Pascal/Volta/Turing-powered NVIDIA DGX, PCs, workstations, servers and top cloud platforms

  21. THE NEW NGC. GPU-optimized Software Hub. Simplifying DL, ML and HPC Workflows
      50+ Containers: DL, ML, HPC
      50+ Pre-trained Models: NLP, Classification, Object Detection & more
      10+ Model Training Scripts: NLP, Image Classification, Object Detection & more
      Industry Workflows: Medical Imaging, Intelligent Video Analytics
      Simplify Deployments | Innovate Faster | Deploy Anywhere
      ngc.nvidia.com

  22. NGC-READY ECOSYSTEM. Now Over 50 GPU-Optimized Containers: Deep Learning, Machine Learning, HPC, Visualization

  23. RE-IMAGINING DATA SCIENCE WORKFLOW. Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
      Pipeline, from data to insights: cuDF (data preparation / wrangling), cuML (optimized ML model training), Visualization (data visualization libraries)

  24. RAPIDS: OPEN GPU DATA SCIENCE. Software Stack Python
      Workflow stages: Data Preparation, Model Training, Visualization
      Stack, top to bottom: Python; Dask; RAPIDS (cuDF, cuML, cuGRAPH) alongside deep learning frameworks (cuDNN); CUDA; Apache Arrow on GPU memory

  25. ACCELERATING MACHINE LEARNING. The RAPIDS Ecosystem
      Open Source Community; Enterprise Data Science Platforms; Deep Learning Integration; Startups; GPU Servers; Storage Partners
