ACCELERATING APPLICATIONS WITH CUDA C/C++
Pedro Mario Cruz e Silva, Solutions Architect Manager
ELEVEN YEARS OF GPU COMPUTING
[Timeline, axis marks: 2006, 2008, 2010, 2012, 2014, 2017]
• CUDA Launched (2006)
• World’s First GPU Top500 System
• Discovered How H1N1 Mutates to Resist Drugs
• Fermi: World’s First HPC GPU
• World’s First 3-D Mapping of Human Genome
• Stanford Builds AI Machine using GPUs
• AlexNet beats expert code by huge margin using GPUs
• Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs
• World’s First Atomic Model of HIV Capsid
• Google Outperforms Humans in ImageNet
• Top 13 Greenest Supercomputers Powered by NVIDIA GPUs
• GPU-Trained AI Machine Beats World Champion in Go
“SCALABILITY OF CPU AND GPU SOLUTIONS OF THE PRIME ELLIPTIC CURVE DISCRETE LOGARITHM PROBLEM”
Jairo Panetta (ITA), Paulo Souza (ITA), Luiz Laranjeira (UnB), Carlos Teixeira Jr (UnB)
[Bar chart: Visit Speed (10^6), axis from 0 to 250, for STI PS3, K40 + CUDA8.0, P100 + CUDA8.0, V100 + CUDA9.0; data labels 1, 25.99, 29.77, 77.84, 197.33]
GPU PROGRAMMING
HOW GPU ACCELERATION WORKS
Application Code
• GPU: Compute-Intensive Functions (5% of Code)
• CPU: Rest of Sequential CPU Code
3 WAYS TO ACCELERATE APPLICATIONS
• Libraries: “Drop-in” Acceleration
• OpenACC Directives: Easily Accelerate Applications
• Programming Languages: Maximum Flexibility
THE BASICS
Heterogeneous Computing
• Host: The CPU and its memory (host memory)
• Device: The GPU and its memory (device memory)
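The host/device split can be sketched as a round trip between the two memories. This is a minimal illustration, not code from the talk: the function name `roundTripThroughDevice` and the buffer names are hypothetical, and only the standard CUDA runtime calls `cudaMalloc`, `cudaMemcpy` and `cudaFree` are assumed.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Copies n floats host -> device -> host and reports whether they survived
// the round trip. Returns false on any CUDA error or mismatch.
// (Hypothetical helper, for illustration only.)
bool roundTripThroughDevice(int n)
{
    // Host memory: ordinary CPU allocations.
    std::vector<float> hostIn(n), hostOut(n, 0.0f);
    for (int i = 0; i < n; ++i) hostIn[i] = static_cast<float>(i);

    // Device memory: allocated on the GPU, not directly usable as a CPU buffer.
    float *deviceBuf = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&deviceBuf, bytes) != cudaSuccess) return false;

    // Data moves between the two memories through explicit copies.
    bool ok =
        cudaMemcpy(deviceBuf, hostIn.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(hostOut.data(), deviceBuf, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(deviceBuf);

    for (int i = 0; ok && i < n; ++i) ok = (hostOut[i] == hostIn[i]);
    return ok;
}
```

Called from a `main` and compiled with `nvcc`, `roundTripThroughDevice(1 << 20)` should return true on a machine with a working CUDA device.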
ACCELERATING C/C++ CODE WITH CUDA ON GPUS
V100 ARCHITECTURE
TESLA V100
The Fastest and Most Productive GPU for AI and HPC
• Volta Architecture: Most Productive GPU
• Tensor Core: 125 Programmable TFLOPS Deep Learning
• Improved NVLink & HBM2: Efficient Bandwidth
• Volta MPS: Inference Utilization
• Improved SIMT Model: New Algorithms
THREAD HIERARCHY
Grid, Block & Threads
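A minimal sketch of how the grid, block and thread levels combine into one index per thread. The kernel `recordGlobalIndex` and the helper `countCorrectIndices` are hypothetical names chosen for this illustration, and the one-dimensional layout is an assumption; grids and blocks can also be two- or three-dimensional.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread builds a unique global index from its block ID, the block
// size, and its thread ID within the block, then stores it.
__global__ void recordGlobalIndex(int *out)
{
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalIdx] = globalIdx;
}

// Launches a grid of `blocks` blocks, each with `threadsPerBlock` threads,
// and returns how many output slots hold the expected index
// (blocks * threadsPerBlock on success, -1 on a CUDA error).
int countCorrectIndices(int blocks, int threadsPerBlock)
{
    int total = blocks * threadsPerBlock;
    size_t bytes = total * sizeof(int);

    int *dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return -1;
    cudaMemset(dOut, 0xFF, bytes);  // every slot starts as -1

    recordGlobalIndex<<<blocks, threadsPerBlock>>>(dOut);

    std::vector<int> hOut(total);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int correct = 0;
    for (int i = 0; i < total; ++i)
        if (hOut[i] == i) ++correct;
    return correct;
}
```

With 4 blocks of 256 threads, the grid holds 1024 threads in total, so `countCorrectIndices(4, 256)` should return 1024.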
TESLA V100
• 21B transistors, 815 mm²
• 80 SM*, 5120 CUDA Cores
• 640 Tensor Cores
• 16 GB HBM2
• 900 GB/s HBM2
• 300 GB/s NVLink
*full GV100 chip contains 84 SMs
VOLTA GV100 SM
GV100, per SM:
• FP32 units: 64
• FP64 units: 32
• INT32 units: 64
• Tensor Cores: 8
• Register File: 256 KB
• Unified L1/Shared memory: 128 KB
• Active Threads: 2048
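Several of these per-SM figures can be read back at run time from the CUDA runtime. The sketch below assumes only `cudaGetDeviceProperties` and the standard `cudaDeviceProp` fields; the function name `printSmProperties` is hypothetical. The values the runtime reports need not match the slide one-to-one, for example the shared-memory figure depends on how the unified L1/shared storage is configured.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints SM-level properties of device `dev`.
// Returns the SM count, or -1 if the query fails.
// (Hypothetical helper, for illustration only.)
int printSmProperties(int dev)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;

    std::printf("%s: %d SMs\n", prop.name, prop.multiProcessorCount);
    std::printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("32-bit registers per SM:     %d\n", prop.regsPerMultiprocessor);
    std::printf("shared memory per SM:        %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return prop.multiProcessorCount;
}
```

The register count is given in 32-bit registers, so a 256 KB register file corresponds to 65,536 registers (65,536 × 4 bytes).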
NEW TENSOR CORE
• New CUDA TensorOp instructions & data formats
• 4x4 matrix processing array
• D[FP32] = A[FP16] * B[FP16] + C[FP32]
• Optimized for deep learning
[Diagram: Activation Inputs, Weights Inputs, Output Results]
TENSOR CORE
4x4x4 matrix multiply and accumulate
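A hedged sketch of the multiply-accumulate D = A * B + C with FP16 inputs and FP32 accumulation, written against the CUDA WMMA API in `<mma.h>`. Note the difference in granularity: the hardware operation on the slide is 4x4x4, while the WMMA API exposes warp-level tiles, here 16x16x16. The kernel and helper names (`fillInputs`, `tileMma`, `tensorCoreTileMatches`) and the constant test data are assumptions made for this illustration.

```cuda
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// Fills the FP16 inputs on the device: A is all ones, B is all twos.
__global__ void fillInputs(half *a, half *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(2.0f);
    }
}

// One warp computes D = A * B + C on a single 16x16x16 tile.
__global__ void tileMma(const half *a, const half *b, const float *c, float *d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::load_matrix_sync(aFrag, a, 16);
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::load_matrix_sync(accFrag, c, 16, wmma::mem_row_major);  // start from C
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);               // acc = A * B + acc
    wmma::store_matrix_sync(d, accFrag, 16, wmma::mem_row_major);
}

// Runs the tile and checks every element of D against 16 * (1 * 2) + 0.5 = 32.5.
bool tensorCoreTileMatches()
{
    const int n = 16 * 16;
    std::vector<float> hC(n, 0.5f), hD(n, 0.0f);

    half *dA = nullptr, *dB = nullptr;
    float *dC = nullptr, *dD = nullptr;
    cudaMalloc(&dA, n * sizeof(half));
    cudaMalloc(&dB, n * sizeof(half));
    cudaMalloc(&dC, n * sizeof(float));
    cudaMalloc(&dD, n * sizeof(float));
    cudaMemcpy(dC, hC.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    fillInputs<<<1, n>>>(dA, dB, n);
    tileMma<<<1, 32>>>(dA, dB, dC, dD);  // exactly one warp

    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(hD.data(), dD, n * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dD);

    for (int i = 0; ok && i < n; ++i) ok = (hD[i] == 32.5f);
    return ok;
}
```

This needs a Volta-class or newer GPU and a matching build target, for example `nvcc -arch=sm_70`. The inputs are constant, so the result does not depend on the row-major or column-major layout chosen for the fragments.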
CONCEPTS
• __global__: this keyword tells the CUDA compiler that the function is to be compiled for the GPU and is callable from both the host and the GPU itself. For CUDA C/C++, the nvcc compiler handles compiling this code.
• blockIdx.x: a read-only variable that is defined for you. It is used within a GPU kernel to determine the ID of the block that is currently executing code. Since many blocks run in parallel, this ID is needed to determine which chunk of data a particular block will work on.
• threadIdx.x: a read-only variable that is defined for you. It is used within a GPU kernel to determine the ID of the thread that is currently executing code in the active block.
• blockDim.x: a read-only variable that is defined for you. It returns the number of threads per block. Remember that all the blocks scheduled to execute on the GPU are identical, except for the blockIdx.x value.
• myKernel<<<number_of_blocks, threads_per_block>>>(...): the syntax used to launch a kernel on the GPU. Inside the triple angle brackets we set two values. The first is the total number of blocks we want to run on the GPU, and the second is the number of threads per block. It is possible, and in fact recommended, to schedule more blocks than the GPU can actively run in parallel. In that case, the system simply continues executing blocks until they have all run.
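The concepts above can be put together in a small vector addition. This is a sketch under stated assumptions rather than code from the talk: `addVectors` and `addOnGpu` are hypothetical names, 256 threads per block is an arbitrary choice, and the bounds check `i < n` is added because the last block may extend past the end of the data.

```cuda
#include <vector>
#include <cuda_runtime.h>

// __global__ marks a kernel: compiled for the GPU, launched with <<<...>>>.
__global__ void addVectors(const float *a, const float *b, float *c, int n)
{
    // Block ID, block size and thread ID select this thread's element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may be partly idle
}

// Adds two n-element vectors on the GPU and returns the number of
// mismatches against the expected result (0 on success, -1 on a CUDA error).
int addOnGpu(int n)
{
    std::vector<float> hA(n), hB(n), hC(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        hA[i] = static_cast<float>(i);
        hB[i] = static_cast<float>(2 * i);
    }

    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // Round up so that every element gets a thread; this usually schedules
    // more blocks than the GPU can run at once, which is fine.
    int threadsPerBlock = 256;
    int numberOfBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addVectors<<<numberOfBlocks, threadsPerBlock>>>(dA, dB, dC, n);

    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hC[i] != 3.0f * static_cast<float>(i)) ++mismatches;
    return mismatches;
}
```

For example, `addOnGpu(1000003)` launches 3907 blocks of 256 threads; the element count is deliberately not a multiple of the block size, so the bounds check in the kernel is exercised.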
NVIDIA DEEP LEARNING INSTITUTE
Hands-on self-paced and instructor-led training in deep learning and accelerated computing for developers
• Request onsite instructor-led workshops at your organization: www.nvidia.com/requestdli
• Take self-paced labs online: www.nvidia.com/dlilabs
• Download the course catalog, view upcoming workshops, and learn about the University Ambassador Program: www.nvidia.com/dli
Training areas: Deep Learning Fundamentals, Autonomous Vehicles, Medical Image Analysis, Genomics, Finance, Intelligent Video Analytics, Game Development & Digital Content, Accelerated Computing Fundamentals
More industry-specific training coming soon…
developer.nvidia.com
NVIDIA HW GRANT PROGRAM
Hardware: Jetson TX2 (Dev Kit), Titan V Volta, Quadro P6000
• Scientific Computing
• HPC
• Deep Learning
• Scientific Visualization
• Virtual Reality
• Robotics
• Autonomous Machines
https://developer.nvidia.com/academic_gpu_seeding
INCEPTION PROGRAM
http://www.nvidia.com/object/inception-program.html