INSIDE PASCAL
Lars Nyland and Mark Harris, April 5, 2016
April 4-7, 2016 | Silicon Valley

INTRODUCING TESLA P100
New GPU Architecture to Enable the World’s Fastest Compute Node
• Pascal Architecture: Highest Compute Performance
• NVLink: GPU Interconnect for Maximum Scalability
• HBM2 Stacked Memory: Unifying Compute & Memory in Single Package
• Page Migration Engine: Simple Parallel Programming with 512 TB of Virtual Memory

GIANT LEAPS IN EVERYTHING
[Charts: Teraflops (FP32/FP16) and Bandwidth (GB/Sec) for K40, M40, and P100]
• 3x Compute
• 5x GPU-GPU BW
• 3x GPU Mem BW

TESLA P100 PERFORMANCE DELIVERED
NVLink for Max Scalability, More than 45x Faster with 8x P100
[Chart: Speed-up vs Dual Socket Haswell (2x Haswell CPU baseline), axis 0x to 50x, for 2x K80 (M40 for Alexnet), 2x P100, 4x P100, and 8x P100 on Caffe/Alexnet, VASP, HOOMD-Blue, COSMO, MILC, Amber, and HACC]

PASCAL ARCHITECTURE
TESLA P100 GPU: GP100
• 56 SMs
• 3584 CUDA Cores
• 5.3 TF Double Precision
• 10.6 TF Single Precision
• 21.2 TF Half Precision
• 16 GB HBM2
• 720 GB/s Bandwidth

GPU PERFORMANCE COMPARISON
                            P100    M40          K40
Double Precision TFlop/s    5.3     0.2          1.4
Single Precision TFlop/s    10.6    7.0          4.3
Half Precision TFlop/s      21.2    NA           NA
Memory Bandwidth (GB/s)     720     288          288
Memory Size                 16GB    12GB, 24GB   12GB

GP100 SM
                   GP100
CUDA Cores         64
Register File      256 KB
Shared Memory      64 KB
Active Threads     2048
Active Blocks      32

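The per-SM figures above can be cross-checked against the chip-level numbers from the GP100 slide. The following is a rough sketch of that arithmetic, not an official occupancy calculator. It assumes 32-bit (4-byte) registers and the standard CUDA warp size of 32 threads, neither of which is stated on the slides; all constant names are made up for illustration.

```python
# Per-SM resource arithmetic for GP100, using the slide figures.
# Assumptions (not on the slides): registers are 4 bytes wide,
# and a warp is 32 threads.
SM_COUNT = 56                 # from the GP100 slide
CORES_PER_SM = 64             # from the GP100 SM slide
REGISTER_FILE_KB = 256        # per SM
MAX_THREADS_PER_SM = 2048     # "Active Threads"
MAX_BLOCKS_PER_SM = 32        # "Active Blocks"
BYTES_PER_REGISTER = 4        # assumption: 32-bit registers
WARP_SIZE = 32                # assumption: standard CUDA warp size

# Chip-wide core count: should match the 3584 CUDA cores on the GP100 slide.
total_cores = SM_COUNT * CORES_PER_SM

# Number of 32-bit registers in one SM's register file.
registers_per_sm = REGISTER_FILE_KB * 1024 // BYTES_PER_REGISTER

# Registers available to each thread when all 2048 threads are resident.
registers_per_thread_at_full_occupancy = registers_per_sm // MAX_THREADS_PER_SM

# Resident warps per SM at full occupancy.
max_warps_per_sm = MAX_THREADS_PER_SM // WARP_SIZE

# Smallest block size that still fills the SM with 32 resident blocks.
min_block_size_for_full_occupancy = MAX_THREADS_PER_SM // MAX_BLOCKS_PER_SM

print("total CUDA cores:", total_cores)
print("registers per SM:", registers_per_sm)
print("registers per thread at full occupancy:",
      registers_per_thread_at_full_occupancy)
print("warps per SM:", max_warps_per_sm)
print("min block size for full occupancy:", min_block_size_for_full_occupancy)
```

Under these assumptions, a kernel that needs more than 32 registers per thread cannot keep all 2048 threads resident on one SM.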
P100 SM
More resources per core
• 2x Registers
• 1.33x Shared Memory Capacity
• 2x Shared Memory Bandwidth
• 2x Warps
Higher Instruction Throughput
[Diagram: Maxwell SM and P100 SM block diagrams, labeled Warps, Registers, LD/ST, Cores, FP64, SFU, and Shared Mem]

IEEE 754 FLOATING POINT ON GP100
3 sizes, 3 speeds, all fast
Feature             Half precision      Single precision    Double precision
Layout              s5.10               s8.23               s11.52
Issue rate          pair every clock    1 every clock       1 every 2 clocks
Subnormal support   Yes                 Yes                 Yes
Atomic Addition     Yes                 Yes                 Yes

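The issue rates in the table line up with the peak throughput figures on the GP100 slide (21.2, 10.6, and 5.3 TF). A minimal sketch of that relationship follows. It assumes each issued operation is a fused multiply-add counted as 2 floating-point operations, and it derives an implied clock from the single-precision figure rather than taking one from the slides, which do not state a clock. The names below are illustrative only.

```python
# Relating issue rate to peak TFLOP/s on GP100.
# Assumption (not on the slides): each result is a fused multiply-add,
# counted as 2 floating-point operations.
CUDA_CORES = 3584        # from the GP100 slide
FP32_PEAK_TFLOPS = 10.6  # from the GP100 slide
FLOPS_PER_FMA = 2        # assumption

# Results issued per core per clock, from the issue-rate row of the table:
# half = pair every clock, single = 1 every clock, double = 1 every 2 clocks.
RESULTS_PER_CLOCK = {"half": 2.0, "single": 1.0, "double": 0.5}

# Clock implied by the single-precision peak (derived, not quoted).
implied_clock_ghz = FP32_PEAK_TFLOPS * 1e12 / (CUDA_CORES * FLOPS_PER_FMA) / 1e9


def peak_tflops(precision: str) -> float:
    """Peak TFLOP/s for a precision, given the implied clock."""
    ops_per_clock = CUDA_CORES * FLOPS_PER_FMA * RESULTS_PER_CLOCK[precision]
    return ops_per_clock * implied_clock_ghz / 1e3


for name in ("half", "single", "double"):
    print(f"{name:>6}: {peak_tflops(name):.1f} TFLOP/s")
print(f"implied clock: {implied_clock_ghz:.3f} GHz")
```

The 2:1:0.5 issue-rate ratio reproduces the 21.2 / 10.6 / 5.3 TF ratio exactly; the implied clock is only as accurate as the FMA assumption.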
HALF-PRECISION FLOATING POINT (FP16)
16 bits
• Layout: s | exp | frac
• 1 sign bit, 5 exponent bits, 10 fraction bits
2^40 Dynamic range
• Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15
• Subnormals at full speed: 1024 values from 2^-24 to 2^-15
Special values
• +- Infinity, Not-a-number
USE CASES
• Deep Learning Training
• Radio Astronomy
• Sensor Data
• Image Processing

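To make the FP16 layout concrete, here is a small sketch that decodes a 16-bit pattern by hand using the 1/5/10 bit split described above. It assumes the standard IEEE 754 binary16 exponent bias of 15, which the slide does not spell out. The function names are hypothetical; `struct`'s `"e"` format (Python 3.6+) serves as an independent reference decoder.

```python
import struct


def decode_fp16(bits: int) -> float:
    """Decode a 16-bit pattern as IEEE 754 half precision (s5.10 layout)."""
    sign = -1.0 if (bits >> 15) & 0x1 else 1.0
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits
    fraction = bits & 0x3FF          # 10 fraction bits

    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, fixed scale of 2^-24.
        return sign * fraction * 2.0 ** -24
    if exponent == 0x1F:
        # All-ones exponent encodes the special values.
        return sign * float("inf") if fraction == 0 else float("nan")
    # Normalized: implicit leading 1, exponent bias of 15 (assumed standard).
    return sign * (1.0 + fraction / 1024.0) * 2.0 ** (exponent - 15)


def decode_fp16_reference(bits: int) -> float:
    """Reference decoder using the struct module's half-precision format."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


print("smallest subnormal:", decode_fp16(0x0001))  # 2^-24
print("smallest normal:   ", decode_fp16(0x0400))  # 2^-14
print("largest finite:    ", decode_fp16(0x7BFF))
print("+infinity:         ", decode_fp16(0x7C00))
```

The ratio between the largest finite value and the smallest subnormal is just under 2^40, which is the dynamic range quoted on the slide.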
NVLink
NVLINK
• P100 supports 4 NVLinks
• Up to 94% bandwidth efficiency
• Supports read/writes/atomics to peer GPU
• Supports read/write access to NVLink-enabled CPU
• Links can be ganged for higher bandwidth
[Diagram: NVLink on Tesla P100, four links at 40 GB/s each]

NVLINK - GPU CLUSTER
Two fully connected quads, connected at corners
• 160 GB/s per GPU bidirectional to Peers
• Load/store access to Peer Memory
• Full atomics to Peer GPUs
• High speed copy engines for bulk data copy
• PCIe to/from CPU

NVLINK TO CPU
Fully connected quad
• 120 GB/s per GPU bidirectional for peer traffic
• 40 GB/s per GPU bidirectional to CPU
• Direct Load/store access to CPU Memory
• High Speed Copy Engines for bulk data movement

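The bandwidth figures on the three NVLink slides follow from the per-link rate and the way the four links are split between peers and CPU. The sketch below works through that arithmetic. It assumes the 94% efficiency figure applies uniformly to every link, which the slides do not state; the function and constant names are illustrative.

```python
# NVLink bandwidth arithmetic from the slide figures.
LINKS_PER_GPU = 4        # "P100 supports 4 NVLinks"
GB_PER_S_PER_LINK = 40   # bidirectional, per link
PEAK_EFFICIENCY = 0.94   # "Up to 94% bandwidth efficiency"


def ganged_bandwidth(links: int) -> int:
    """Aggregate bidirectional bandwidth (GB/s) of several ganged links."""
    return links * GB_PER_S_PER_LINK


# GPU cluster topology: all four links go to peer GPUs, CPU is on PCIe.
cluster_peer_bw = ganged_bandwidth(LINKS_PER_GPU)

# NVLink-to-CPU topology: three links to peers, one link to the CPU.
cpu_links = 1
peer_links = LINKS_PER_GPU - cpu_links
quad_peer_bw = ganged_bandwidth(peer_links)
quad_cpu_bw = ganged_bandwidth(cpu_links)

# Best-case delivered bandwidth per link, assuming the 94% figure
# applies uniformly to each link (an assumption, not a slide claim).
effective_per_link = GB_PER_S_PER_LINK * PEAK_EFFICIENCY

print("GPU cluster, peer bandwidth:", cluster_peer_bw, "GB/s")
print("CPU quad, peer bandwidth:   ", quad_peer_bw, "GB/s")
print("CPU quad, CPU bandwidth:    ", quad_cpu_bw, "GB/s")
print("effective per link (best case):", round(effective_per_link, 1), "GB/s")
```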
TESLA P100 PHYSICAL CONNECTOR
With NVLink

HBM2 STACKED MEMORY
HBM2: 720 GB/SEC BANDWIDTH
And ECC is free
[Diagram: package cross-section labeled Spacer, GPU, 4-high HBM2 Stack, Silicon Carrier, Bumps, and Substrate]

UNIFIED MEMORY
PAGE MIGRATION ENGINE
Support Virtual Memory Demand Paging
• 49-bit Virtual Addresses
  Sufficient to cover 48-bit CPU address + all GPU memory
• GPU page faulting capability
  Can handle thousands of simultaneous page faults
• Up to 2 MB page size
  Better TLB coverage of GPU memory

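The address-space and page-size claims above reduce to simple powers of two. The following sketch checks them, reading the deck's "512 TB" as binary terabytes (2^40 bytes each) and using the 16 GB memory size from the GP100 slide. The 4 KB page size is only a hypothetical comparison point for TLB coverage, not a figure from the slides.

```python
# Address-space and page-count arithmetic for the Page Migration Engine.
TIB = 2 ** 40
GIB = 2 ** 30
MIB = 2 ** 20
KIB = 2 ** 10

GPU_VA_BITS = 49            # "49-bit Virtual Addresses"
CPU_VA_BITS = 48            # "48-bit CPU address"
GPU_MEMORY_BYTES = 16 * GIB # 16 GB HBM2 from the GP100 slide

gpu_va_bytes = 2 ** GPU_VA_BITS
cpu_va_bytes = 2 ** CPU_VA_BITS

# Space left in the GPU's virtual address range after mapping the
# entire CPU address range.
headroom_bytes = gpu_va_bytes - cpu_va_bytes
covers_cpu_plus_gpu = headroom_bytes >= GPU_MEMORY_BYTES

# Pages needed to map all of GPU memory at the maximum 2 MB page size,
# versus a hypothetical 4 KB page size (illustrative comparison only).
pages_at_2mb = GPU_MEMORY_BYTES // (2 * MIB)
pages_at_4kb = GPU_MEMORY_BYTES // (4 * KIB)

print("GPU virtual address space:", gpu_va_bytes // TIB, "TiB")
print("CPU virtual address space:", cpu_va_bytes // TIB, "TiB")
print("covers CPU range plus GPU memory:", covers_cpu_plus_gpu)
print("pages to map 16 GiB at 2 MiB:", pages_at_2mb)
print("pages to map 16 GiB at 4 KiB:", pages_at_4kb)
```

Fewer, larger pages mean each TLB entry covers more memory, which is the "better TLB coverage" point on the slide.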
KEPLER/MAXWELL UNIFIED MEMORY
CUDA 6+
Simpler Programming & Memory Model
• Single allocation, single pointer, accessible anywhere
• Eliminate need for explicit copy
• Greatly simplifies code porting
Performance Through Data Locality
• Migrate data to accessing processor
• Guarantee global coherency
• Still allows explicit hand tuning
[Diagram: Kepler GPU and CPU sharing Unified Memory; Allocate Up To GPU Memory Size]

PASCAL UNIFIED MEMORY
Large datasets, simple programming, High Performance
CUDA 8
Enable Large Data Models
• Oversubscribe GPU memory
• Allocate up to system memory size
Tune Unified Memory Performance
• Usage hints via cudaMemAdvise API
• Explicit prefetching API
Simpler Data Access
• CPU/GPU Data coherence
• Unified memory atomic operations
[Diagram: Pascal GPU and CPU sharing Unified Memory; Allocate Beyond GPU Memory Size]

INTRODUCING TESLA P100
More P100 Features: compute preemption, new instructions, larger L2 cache, more …
Find out more at http://devblogs.nvidia.com/parallelforall/inside-pascal