  1. ACCELERATED COMPUTING FOR AI Bryan Catanzaro, 28 October 2017

  2. DEEP LEARNING BIG BANG
     "ImageNet Classification with Deep Convolutional Neural Networks," Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (University of Toronto), NIPS 2012.
     Deep Learning + NVIDIA GPU.

  3. WHY IS DEEP LEARNING SUCCESSFUL?
     Big data sets, new algorithms, and computing hardware.
     [Chart: accuracy vs. amount of data. Deep learning keeps improving as data grows; many previous methods plateau.]
     Focus of this talk: data & compute.

  4. RESEARCH AS A SEQUENTIAL PROCESS
     Goal: reduce the latency of idea generation.
     The loop: Idea -> Invent -> Code/Hack -> Train -> Test -> next Idea.
     Limits at each stage: ingenuity (invent), programmability (code), throughput (train).

  5. COMPUTATIONAL EVOLUTION
     Deep learning changes every day.
     [Timeline, 2012-2018: AlexNet, 1-bit SGD, FFT convolutions, cuDNN, Batch Normalization, Winograd convolutions, Persistent RNNs, NCCL, Phased LSTM, Sparsely Gated Mixture of Experts, FlowNet, Billion-Scale Similarity Search (FAISS), and for 2018: ?]
     New solvers, new layers, new scaling techniques, new applications for old techniques, and much more...

  6. CUDA
     Programming system for accelerated computing: C++ for accelerated processors, on-chip memory management, and an asynchronous, parallel API.
     10 years of investment. Programmability makes it possible to innovate.
     New layer? No problem (see the sketch below).
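
     To make the "new layer? no problem" point concrete, here is a minimal sketch (mine, not from the talk) of a custom layer written directly as a CUDA kernel; the kernel name and the alpha parameter are illustrative.

        // Minimal sketch of a custom "new layer": an elementwise leaky ReLU.
        // Kernel and parameter names are illustrative, not from the talk.
        #include <cuda_runtime.h>

        __global__ void leaky_relu_kernel(const float* x, float* y, int n, float alpha) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                float v = x[i];
                y[i] = v >= 0.0f ? v : alpha * v;  // elementwise nonlinearity
            }
        }

        // Launch with one thread per element on a caller-provided stream.
        void leaky_relu(const float* d_x, float* d_y, int n, float alpha, cudaStream_t stream) {
            int block = 256;
            int grid = (n + block - 1) / block;
            leaky_relu_kernel<<<grid, block, 0, stream>>>(d_x, d_y, n, alpha);
        }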

  7. CUDA LIBRARIES
     Optimized kernels.
     cuBLAS: linear algebra; so many flavors of GEMM (one is sketched below).
     cuDNN: neural network kernels; convolutions (direct, Winograd, FFT) and recurrent neural networks. Algorithmic speedups such as Winograd and FFT can achieve more than "speed of light," i.e., effective throughput above the hardware peak for direct convolution.
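
     As a small illustration of one GEMM flavor, here is a hedged sketch of a plain FP32 cuBLAS call (the wrapper function and its argument choices are mine):

        // Sketch: C = alpha*A*B + beta*C via cuBLAS, which uses column-major storage.
        // dA is m x k, dB is k x n, dC is m x n; all are device pointers.
        #include <cublas_v2.h>

        void sgemm_nn(cublasHandle_t handle, int m, int n, int k,
                      const float* dA, const float* dB, float* dC) {
            const float alpha = 1.0f, beta = 0.0f;
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha, dA, m,   // lda = m
                        dB, k,           // ldb = k
                        &beta, dC, m);   // ldc = m
        }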

  8. COMMUNICATION LIBRARIES: NCCL, MPI
     NCCL: optimized intra-node and inter-node communication; a library with sophisticated topology-aware collective algorithms.
     MPI: library for inter-node communication. CUDA-aware MPI means you can run MPI programs using GPUs: scalable, distributed code in a familiar environment for HPC.
     All-reduce: king of data-parallel training (sketched below).
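
     A minimal sketch of that all-reduce step with NCCL (assuming communicators were already created, e.g. with ncclCommInitAll; the wrapper function name is mine):

        // Sum gradients across all data-parallel ranks, in place.
        // Divide by world size afterwards, or fold 1/N into the learning rate.
        #include <nccl.h>
        #include <cuda_runtime.h>

        void allreduce_gradients(float* d_grads, size_t count,
                                 ncclComm_t comm, cudaStream_t stream) {
            ncclAllReduce(d_grads, d_grads, count, ncclFloat, ncclSum, comm, stream);
        }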

  9. FRAMEWORKS
     Cambrian explosion of AI: lots of AI frameworks, all GPU-accelerated.
     Researchers need programmability to prototype rapidly.

  10. SIMULATION
      Many important AI tasks involve agents interacting with the real world; for this, you need simulators (physics, appearance).
      Simulation has a big role to play in AI progress.
      NVIDIA Project Isaac: a simulator for reinforcement learning.

  11. DEEP NEURAL NETWORKS
      Simple, powerful function approximators.
      One layer maps inputs x to outputs y through weights w: y_j = f(sum_i w_ij * x_i).
      The nonlinearity is a ReLU: f(x) = 0 for x < 0, and f(x) = x for x >= 0.
      A deep neural network stacks many such layers (a naive kernel for one layer is sketched below).
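
     Spelled out as code, a deliberately naive sketch of one layer (real layers use cuBLAS/cuDNN as on slide 7; the kernel name and row-major weight layout are my choices):

        // y_j = f(sum_i w[j][i] * x[i]) with f = ReLU; one thread per output j.
        // w is row-major: w[j * n_in + i].
        __global__ void dense_relu(const float* w, const float* x, float* y,
                                   int n_in, int n_out) {
            int j = blockIdx.x * blockDim.x + threadIdx.x;
            if (j < n_out) {
                float acc = 0.0f;
                for (int i = 0; i < n_in; ++i)
                    acc += w[j * n_in + i] * x[i];  // dot product
                y[j] = acc > 0.0f ? acc : 0.0f;     // ReLU nonlinearity
            }
        }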

  12. TRAINING NEURAL NETWORKS
      Same layer equation, y_j = f(sum_i w_ij * x_i): computation is dominated by dot products.
      Multiple inputs, multiple outputs, and batching turn these dot products into matrix multiplies, which makes training compute bound (a rough estimate follows).
      Training one model: 20+ exaflops.
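
     A back-of-envelope estimate (mine, not the slide's) of why batching makes the work compute bound, for C = A*B with A of size m x k and B of size k x n, where n is the batch size:

        flops = 2*m*n*k        bytes moved (FP32) >= 4*(m*k + k*n + m*n)

        n = 1 (no batch):    intensity ~ 2*m*k / (4*m*k) = 0.5 flops/byte   -> memory bound
        n = 256, m,k large:  intensity ~ n/2 = 128 flops/byte               -> compute bound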

  13. SCALE MATTERS
      More data and more compute: more AI.
      Image recognition: AlexNet (2012), 8 layers, 1.4 GFLOP, ~16% error; ResNet (2015), 152 layers, 22.6 GFLOP, ~3.5% error.
      That is a 16x growth in model compute (22.6 / 1.4) for a large drop in error.

  14. LAWS OF PHYSICS
      [Chart: GPU TFLOPS over time on a log scale (0.1, 1, 10), accelerated performance vs. general-purpose performance, with Volta as the latest point: a 20x gap in 10 years, and growing.]
      Successful AI uses accelerated computing.

  15. ACCELERATED COMPUTING
      Find an economically important problem that needs compute, then build hardware for it to take it to the speed of light.
      GPUs are accelerators, and AI is a huge focus for our GPUs (V100 GPU pictured).

  16. TESLA V100
      21B transistors, 815 mm^2
      80 SMs*, 5120 CUDA cores, 640 Tensor Cores
      16 GB HBM2, 900 GB/s HBM2 bandwidth
      300 GB/s NVLink
      *The full GV100 chip contains 84 SMs.

  17. GPU PERFORMANCE COMPARISON
                                 P100            V100            Ratio
      Training acceleration      10 TOPS         120 TOPS        12x
      Inference acceleration     21 TFLOPS       120 TOPS        6x
      FP64/FP32                  5/10 TFLOPS     7.5/15 TFLOPS   1.5x
      HBM2 bandwidth             720 GB/s        900 GB/s        1.2x
      NVLink bandwidth           160 GB/s        300 GB/s        1.9x
      L2 cache                   4 MB            6 MB            1.5x
      L1 caches                  1.3 MB          10 MB           7.7x

  18. ARITHMETIC
      Mixed precision (FP32 + FP16) for training; lower-precision integer (Int8) for inference.
      (One common mixed-precision recipe is sketched below.)
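
     A minimal sketch of one common mixed-precision recipe: FP16 model weights and gradients, FP32 master weights, and loss scaling. The recipe details and kernel name are my assumptions, not the talk's.

        // One SGD step: grad is FP16 (scaled by loss_scale during backprop),
        // master weights are kept in FP32, model weights are rounded to FP16.
        #include <cuda_fp16.h>

        __global__ void sgd_mixed_step(float* master_w, __half* model_w,
                                       const __half* grad, int n,
                                       float lr, float inv_loss_scale) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                float g = __half2float(grad[i]) * inv_loss_scale;  // undo loss scaling
                master_w[i] -= lr * g;                             // update in FP32
                model_w[i] = __float2half(master_w[i]);            // round to FP16
            }
        }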

  19. TENSOR CORE
      Mixed-precision matrix math on 4x4 matrices: D = A*B + C, where A and B are FP16 and C and D are FP16 or FP32.
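
     For reference, a sketch of driving Tensor Cores through the CUDA 9 WMMA API, which exposes the operation at warp level as a 16x16x16 tile (the matrix layouts here are my choice; launch with one warp, i.e. 32 threads):

        #include <mma.h>
        #include <cuda_fp16.h>
        using namespace nvcuda;

        // One warp computes a 16x16x16 tile of D = A*B + C:
        // FP16 inputs, FP32 accumulation.
        __global__ void wmma_tile(const half* a, const half* b,
                                  const float* c, float* d) {
            wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
            wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
            wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

            wmma::load_matrix_sync(a_frag, a, 16);
            wmma::load_matrix_sync(b_frag, b, 16);
            wmma::load_matrix_sync(acc, c, 16, wmma::mem_row_major);
            wmma::mma_sync(acc, a_frag, b_frag, acc);                  // D = A*B + C
            wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
        }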

  20. SCALABILITY
      Thesis: AI is the most important problem. How can we use our best computers for it? The fastest supercomputer is roughly 10,000x one GPU.
      Often people use 1-8 GPUs; current best practices use ~128.
      Research problem: how can we use 10,000?

  21. VOLTA NVLINK
      300 GB/s: 50% more links, 28% faster signaling.

  22. HARDWARE PLATFORMS
      Systems, not just GPUs.
      Drive PX Pegasus: 320 TOPS, for self-driving cars.
      DGX: 960 TOPS, 128 GB HBM2, 7.2 TB/s memory bandwidth, 512 GB DRAM, 8 TB SSD, 4x EDR IB, 3.2 kW.

  23. TENSOR RT
      Optimized inference.
      Horizontal and vertical fusion saves memory bandwidth (a toy illustration follows).
      Low batch-size optimizations: inference batch sizes are small.
      Int8 support: helps choose scaling factors.
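
     A toy illustration of vertical fusion (my example, not TensorRT code): unfused, a bias-add kernel and a ReLU kernel would each read and write the whole tensor; fused, the tensor makes one round trip to memory.

        // Fused bias-add + ReLU over an n-element tensor with c channels.
        __global__ void bias_relu_fused(float* x, const float* bias, int n, int c) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                float v = x[i] + bias[i % c];  // bias-add
                x[i] = v > 0.0f ? v : 0.0f;    // ReLU in the same pass: no extra DRAM trip
            }
        }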

  24. ACCELERATED COMPUTING FOR AI
      Tremendous excitement in systems for AI: programmability and flexibility are fundamental, and high computational intensity is also required.
      Make human ingenuity the limiting factor for AI research & deployment.
      Bryan Catanzaro, @ctnzr
