April 4-7, 2016 | Silicon Valley

CUDA 8 AND BEYOND
Mark Harris, April 5, 2016
INTRODUCING CUDA 8

Pascal Support
- New architecture, stacked memory, NVLink

Unified Memory
- Simple parallel programming with large virtual memory

Libraries
- nvGRAPH: library for accelerating graph analytics apps
- FP16 computation to boost Deep Learning workloads

Developer Tools
- Critical Path Analysis to speed overall app tuning
- OpenACC profiling to optimize directive performance
- Single-GPU debugging on Pascal
INTRODUCING TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node

- Pascal Architecture: highest compute performance
- NVLink: GPU interconnect for maximum scalability
- HBM2 Stacked Memory: unifying compute & memory in a single package
- Page Migration Engine: simple parallel programming with 512 TB of virtual memory
UNIFIED MEMORY
UNIFIED MEMORY
Dramatically Lower Developer Effort

CUDA 6+ (Kepler): Simpler Programming & Memory Model
- Single allocation, single pointer, accessible anywhere
- Eliminates the need for explicit copies
- Greatly simplifies code porting

Performance Through Data Locality
- Migrate data to the accessing processor
- Guarantee global coherence
- Still allows explicit hand tuning

Allocate up to GPU memory size
SIMPLIFIED MEMORY MANAGEMENT CODE

CPU Code:

    void sortfile(FILE *fp, int N) {
      char *data;
      data = (char *)malloc(N);

      fread(data, 1, N, fp);

      qsort(data, N, 1, compare);

      use_data(data);

      free(data);
    }

CUDA 6 Code with Unified Memory:

    void sortfile(FILE *fp, int N) {
      char *data;
      cudaMallocManaged(&data, N);

      fread(data, 1, N, fp);

      qsort<<<...>>>(data, N, 1, compare);
      cudaDeviceSynchronize();

      use_data(data);

      cudaFree(data);
    }
GREAT PERFORMANCE WITH UNIFIED MEMORY
RAJA: Portable C++ Framework for parallel-for style programming

- RAJA uses Unified Memory for heterogeneous array allocations
- Parallel forall loops run on the device

"Excellent performance considering this is a 'generic' version of LULESH with no architecture-specific tuning."
- Jeff Keasler, LLNL

[Chart: LULESH throughput in million elements per second vs. mesh size (45^3, 100^3, 150^3); GPU speedups over CPU of 1.5x to 2.0x]
GPU: NVIDIA Tesla K40, CPU: Intel Haswell E5-2650 v3 @ 2.30GHz, single socket 10-core
CUDA 8: UNIFIED MEMORY
Large datasets, simple programming, high performance

Enable Large Data Models
- Oversubscribe GPU memory
- Allocate up to system memory size

Simpler Data Access
- CPU/GPU data coherence
- Unified Memory atomic operations

Tune Unified Memory Performance
- Usage hints via cudaMemAdvise API
- Explicit prefetching API
UNIFIED MEMORY EXAMPLE
On-Demand Paging

    __global__ void setValue(char *ptr, int index, char val) {
      ptr[index] = val;
    }

    void foo(int size) {
      char *data;
      cudaMallocManaged(&data, size);      // Unified Memory allocation

      memset(data, 0, size);               // access all values on the CPU

      setValue<<<...>>>(data, size/2, 5);  // access one value on the GPU
      cudaDeviceSynchronize();

      useData(data);

      cudaFree(data);
    }
HOW UNIFIED MEMORY WORKS IN CUDA 6
Servicing CPU page faults

GPU Code:

    __global__
    void setValue(char *ptr, int index, char val)
    {
      ptr[index] = val;
    }

CPU Code:

    cudaMallocManaged(&array, size);
    memset(array, 0, size);
    setValue<<<...>>>(array, size/2, 5);

[Diagram: the array is mapped in both GPU and CPU memory; a CPU access triggers a page fault and the page migrates over the interconnect]
HOW UNIFIED MEMORY WORKS ON PASCAL
Servicing CPU and GPU page faults

GPU Code:

    __global__
    void setValue(char *ptr, int index, char val)
    {
      ptr[index] = val;
    }

CPU Code:

    cudaMallocManaged(&array, size);
    memset(array, 0, size);
    setValue<<<...>>>(array, size/2, 5);

[Diagram: the array is mapped in both GPU and CPU memory; both the CPU and the GPU can take page faults, and pages migrate over the interconnect on demand]
USE CASE: ON-DEMAND PAGING
Graph Algorithms

- Large data sets that exceed GPU memory
- Baseline: migrate pages on first touch
- Optimized: best placement in memory
(two of these access strategies are sketched below)

[Chart: performance relative to the GPU directly accessing host memory (zero-copy)]
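A minimal sketch of the two baseline strategies in the chart above (not from the talk; `edges` and `bytes` are illustrative placeholders):

    // Zero-copy: the GPU reads host memory directly over the bus on every access.
    char *edges;
    cudaHostAlloc((void **)&edges, bytes, cudaHostAllocMapped);

    // On-demand paging (migrate on first touch): pages move to the GPU the
    // first time a kernel touches them, then stay local for later accesses.
    cudaMallocManaged(&edges, bytes);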
UNIFIED MEMORY ON PASCAL
GPU memory oversubscription

- Pascal supports allocations where only a subset of pages reside on the GPU
- Pages can be migrated to the GPU when "hot"
- Fails on Kepler/Maxwell

32 GB allocation (a quick check is sketched below):

    void foo() {
      // Assume the GPU has 16 GB of memory; allocate 32 GB
      char *data;
      size_t size = 32ULL*1024*1024*1024;
      cudaMallocManaged(&data, size);
    }
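One way to see the oversubscription at work (a sketch, not from the slides):

    size_t freeBytes, totalBytes;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    // totalBytes is ~16 GB on a P100, yet the 32 GB
    // managed allocation above succeeds on Pascal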
GPU OVERSUBSCRIPTION
Now possible with Pascal

Many domains would benefit from GPU memory oversubscription:
- Combustion: many species to solve for
- Quantum chemistry: larger systems
- Ray tracing: larger scenes to render
GPU OVERSUBSCRIPTION
HPGMG: high-performance multi-grid

[Chart: HPGMG throughput on Tesla P100 (16 GB) vs. Tesla K40 (12 GB) as the working set grows beyond GPU memory]
*Tesla P100 performance is very early modelling results
UNIFIED MEMORY ON PASCAL
Concurrent CPU/GPU access to managed memory

    __global__ void mykernel(char *data) {
      data[1] = 'g';
    }

    void foo() {
      char *data;
      cudaMallocManaged(&data, 2);

      mykernel<<<...>>>(data);
      // no synchronize here
      data[0] = 'c';   // OK on Pascal: just a page fault

      cudaFree(data);
    }

Concurrent CPU access to 'data' on previous GPUs caused a fatal segmentation fault (see the guard sketch below).
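To keep such code portable across GPU generations, one option (a sketch, not from the slides) is to query whether the device supports concurrent managed access before skipping the synchronization:

    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    if (!concurrent)
      cudaDeviceSynchronize();  // required on Kepler/Maxwell before any CPU access
    data[0] = 'c';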
UNIFIED MEMORY ON PASCAL
System-Wide Atomics

- Pascal enables system-wide atomics
- Direct support of atomics over NVLink
- Software-assisted over PCIe
- System-wide atomics not available on Kepler/Maxwell

    __global__ void mykernel(int *addr) {
      atomicAdd(addr, 10);
    }

    void foo() {
      int *addr;
      cudaMallocManaged(&addr, 4);
      *addr = 0;

      mykernel<<<...>>>(addr);
      __sync_fetch_and_add(addr, 10);  // CPU atomic, concurrent with the kernel
    }
PERFORMANCE TUNING ON PASCAL
Explicit Memory Hints and Prefetching

Advise the runtime on known memory access behaviors with cudaMemAdvise():
- cudaMemAdviseSetReadMostly: specify read duplication
- cudaMemAdviseSetPreferredLocation: suggest the best location
- cudaMemAdviseSetAccessedBy: initialize a mapping

Explicit prefetching with cudaMemPrefetchAsync(ptr, length, destDevice, stream):
- Unified Memory alternative to cudaMemcpyAsync
- Asynchronous operation that follows CUDA stream semantics

A short sketch combining the two APIs follows below.

To Learn More: S6216 "The Future of Unified Memory" by Nikolay Sakharnykh, Tuesday, 4pm
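A minimal sketch of how the hints and prefetch might be combined (illustrative, not from the slides; `init_on_cpu`, `mykernel`, `size`, `gpuId`, and `stream` are placeholders):

    char *data;
    cudaMallocManaged(&data, size);
    init_on_cpu(data, size);  // hypothetical host-side initialization

    // Read-duplicate the pages and warm the GPU copy before the kernel runs
    cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, gpuId);
    cudaMemPrefetchAsync(data, size, gpuId, stream);

    mykernel<<<grid, block, 0, stream>>>(data);

    // Bring the data back for subsequent CPU work, still on the same stream
    cudaMemPrefetchAsync(data, size, cudaCpuDeviceId, stream);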
GRAPH ANALYTICS
GRAPH ANALYTICS
Insight from Connections in Big Data

- Social Network Analysis
- Cyber Security / Network Analytics
- Genomics

... and much more: Parallel Computing, Recommender Systems, Fraud Detection, Voice Recognition, Text Understanding, Search

(Images: Wikimedia Commons, Circos.ca)
nvGRAPH
Accelerated Graph Analytics

- Process graphs with up to 2.5 billion edges on a single GPU (24 GB M40)
- Accelerate a wide range of applications:

  PageRank: Search, Recommendation Engines, Social Ad Placement
  Single Source Shortest Path: Robotic Path Planning, Power Network Planning, Logistics & Supply Chain Planning
  Single Source Widest Path: IP Routing, Chip Design / EDA, Traffic-sensitive routing

[Chart: nvGRAPH 4x speedup in iterations/s for PageRank on a Wikipedia 84M-link dataset, nvGRAPH on K40 vs. a 48-core Xeon E5]

developer.nvidia.com/nvgraph (a host-side usage sketch follows below)
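As a rough sketch of the host-side workflow, patterned on the library's single-source shortest path example (array and size names such as `n`, `nnz`, `destination_offsets_h`, `source_indices_h`, `weights_h`, and `sssp_h` are placeholders; verify against the nvGRAPH documentation):

    #include <nvgraph.h>

    void sssp_sketch(int n, int nnz,
                     int *destination_offsets_h, int *source_indices_h,
                     float *weights_h, float *sssp_h) {
      nvgraphHandle_t handle;
      nvgraphGraphDescr_t graph;
      cudaDataType_t vertex_dimT = CUDA_R_32F;
      cudaDataType_t edge_dimT = CUDA_R_32F;

      nvgraphCreate(&handle);
      nvgraphCreateGraphDescr(handle, &graph);

      // Describe the graph in compressed sparse column (CSC) form
      nvgraphCSCTopology32I_st topology;
      topology.nvertices = n;
      topology.nedges = nnz;
      topology.destination_offsets = destination_offsets_h;
      topology.source_indices = source_indices_h;
      nvgraphSetGraphStructure(handle, graph, (void *)&topology, NVGRAPH_CSC_32);

      // One float per vertex for the result, one float per edge for the weights
      nvgraphAllocateVertexData(handle, graph, 1, &vertex_dimT);
      nvgraphAllocateEdgeData(handle, graph, 1, &edge_dimT);
      nvgraphSetEdgeData(handle, graph, (void *)weights_h, 0);

      // Shortest paths from vertex 0, results into vertex data set 0
      int source_vert = 0;
      nvgraphSssp(handle, graph, 0, &source_vert, 0);
      nvgraphGetVertexData(handle, graph, (void *)sssp_h, 0);

      nvgraphDestroyGraphDescr(handle, graph);
      nvgraphDestroy(handle);
    }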
ENHANCED PROFILING
DEPENDENCY ANALYSIS
Easily Find the Critical Kernel to Optimize

The longest-running kernel is not always the most critical optimization target.

[Timeline diagram: CPU activities A and B each wait on a GPU kernel; Kernel X takes 40% of the timeline and Kernel Y only 5%, yet the "Optimize Here" marker points at the kernel on the critical path, not the longest one]
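From the command line, the same analysis can be requested from nvprof (an assumption to verify against your CUDA 8 install):

    # Trace the app and run dependency analysis on the collected timeline
    nvprof --dependency-analysis ./myapp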
DEPENDENCY ANALYSIS
Visual Profiler

[Screenshot: selecting Dependency Analysis under Unguided Analysis generates the critical path and lists the functions on it]
DEPENDENCY ANALYSIS
Visual Profiler

[Screenshot: APIs and GPU activities not on the critical path are greyed out in the timeline]
MORE CUDA 8 PROFILER FEATURES

- CPU profiling
- Unified Memory profiling
- NVLink topology and bandwidth profiling
- OpenACC profiling
(example nvprof flags below)
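For command-line use, nvprof exposes some of these features as flags (a sketch; consult `nvprof --help` for the exact spelling in your CUDA version):

    # CPU sampling alongside the GPU timeline
    nvprof --cpu-profiling on ./myapp

    # Per-device Unified Memory transfer and fault counts
    nvprof --unified-memory-profiling per-process-device ./myapp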
COMPILER IMPROVEMENTS
2X FASTER COMPILE TIME ON CUDA 8

[Chart: NVCC speedups on CUDA 8 over CUDA 7.5, roughly 1.5x-2.5x, across SHOC, Thrust, Rodinia, QUDA, cuDNN, cuSparse, cuFFT, cuBLAS, cuRand, and math, spanning example, open-source, and internal benchmarks]

- Average total compile time (per translation unit): 1.54x speedup
- Intel Core i7-3930K (6 cores) @ 3.2 GHz
- CentOS x86_64 Linux release 7.1.1503 (Core) with GCC 4.8.3 20140911
- GPU target architecture sm_52

Performance may vary based on OS, software versions, and motherboard configuration.
HETEROGENEOUS C++ LAMBDA
Combined CPU/GPU lambda functions

    template <typename F, typename T>
    __global__ void apply(F function, T *ptr) {
      *ptr = function(*ptr);            // call the lambda from device code
    }

    int main(void) {
      float *x;
      cudaMallocManaged(&x, 2 * sizeof(float));

      // __host__ __device__ lambda
      auto square = [=] __host__ __device__ (float x) { return x * x; };

      apply<<<1, 1>>>(square, &x[0]);   // pass the lambda to a CUDA kernel
      cudaDeviceSynchronize();          // wait before touching x on the host

      x[1] = square(x[1]);              // ... or call it from host code

      cudaFree(x);
    }

Experimental feature in CUDA 8: compile with `nvcc --expt-extended-lambda`. A Thrust variant follows below.
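The same extended lambdas also compose with Thrust algorithms; a minimal sketch (assuming CUDA 8's bundled Thrust and the same --expt-extended-lambda flag):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    int main(void) {
      thrust::device_vector<float> v(16, 2.0f);
      // Square each element on the GPU using a device lambda
      thrust::transform(v.begin(), v.end(), v.begin(),
                        [] __device__ (float x) { return x * x; });
      return 0;
    }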