  1. ACCELERATED COMPUTING FOR AI Bryan Catanzaro, 28 October 2017

  2. DEEP LEARNING BIG BANG
     "ImageNet Classification with Deep Convolutional Neural Networks," Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (University of Toronto), NIPS 2012.
     Deep Learning + NVIDIA GPU.

  3. WHY IS DEEP LEARNING SUCCESSFUL?
     Big data sets, new algorithms, and computing hardware.
     [Chart: accuracy vs. amount of data. Deep learning keeps improving as data grows; many previous methods plateau.]
     Focus of this talk: data & compute.

  4. RESEARCH AS A SEQUENTIAL PROCESS
     Goal: reduce the latency of idea generation.
     The loop: Idea -> Invent -> Code/Hack -> Train -> Test -> next Idea.
     Limits at each stage: ingenuity (invent), programmability (code), throughput (train).

  5. COMPUTATIONAL EVOLUTION
     Deep learning changes every day.
     [Timeline, 2012-2018: AlexNet, 1-bit SGD, FFT convolutions, cuDNN, Batch Normalization, Winograd convolutions, Persistent RNNs, NCCL, Phased LSTM, Sparsely Gated Mixture of Experts, FlowNet, Billion-Scale Similarity Search (FAISS), and for 2018: ?]
     New solvers, new layers, new scaling techniques, new applications for old techniques, and much more...

  6. CUDA
     Programming system for accelerated computing: C++ for accelerated processors, on-chip memory management, and an asynchronous, parallel API.
     10 years of investment. Programmability makes it possible to innovate.
     New layer? No problem (see the sketch below).
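
     To make the "new layer? no problem" point concrete, here is a minimal sketch (mine, not from the talk) of a custom layer written directly as a CUDA kernel; the kernel name and the alpha parameter are illustrative.

        // Minimal sketch of a custom "new layer": an elementwise leaky ReLU.
        // Kernel and parameter names are illustrative, not from the talk.
        #include <cuda_runtime.h>

        __global__ void leaky_relu_kernel(const float* x, float* y, int n, float alpha) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                float v = x[i];
                y[i] = v >= 0.0f ? v : alpha * v;  // elementwise nonlinearity
            }
        }

        // Launch with one thread per element on a caller-provided stream.
        void leaky_relu(const float* d_x, float* d_y, int n, float alpha, cudaStream_t stream) {
            int block = 256;
            int grid = (n + block - 1) / block;
            leaky_relu_kernel<<<grid, block, 0, stream>>>(d_x, d_y, n, alpha);
        }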

  7. CUDA LIBRARIES
     Optimized kernels.
     cuBLAS: linear algebra; so many flavors of GEMM (one is sketched below).
     cuDNN: neural network kernels; convolutions (direct, Winograd, FFT) and recurrent neural networks. Algorithmic speedups such as Winograd and FFT can achieve more than "speed of light," i.e., effective throughput above the hardware peak for direct convolution.
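
     As a small illustration of one GEMM flavor, here is a hedged sketch of a plain FP32 cuBLAS call (the wrapper function and its argument choices are mine):

        // Sketch: C = alpha*A*B + beta*C via cuBLAS, which uses column-major storage.
        // dA is m x k, dB is k x n, dC is m x n; all are device pointers.
        #include <cublas_v2.h>

        void sgemm_nn(cublasHandle_t handle, int m, int n, int k,
                      const float* dA, const float* dB, float* dC) {
            const float alpha = 1.0f, beta = 0.0f;
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha, dA, m,   // lda = m
                        dB, k,           // ldb = k
                        &beta, dC, m);   // ldc = m
        }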

  8. COMMUNICATION LIBRARIES: NCCL, MPI
     NCCL: optimized intra-node and inter-node communication; a library with sophisticated topology-aware collective algorithms.
     MPI: library for inter-node communication. CUDA-aware MPI means you can run MPI programs using GPUs: scalable, distributed code in a familiar environment for HPC.
     All-reduce: king of data-parallel training (sketched below).
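
     A minimal sketch of that all-reduce step with NCCL (assuming communicators were already created, e.g. with ncclCommInitAll; the wrapper function name is mine):

        // Sum gradients across all data-parallel ranks, in place.
        // Divide by world size afterwards, or fold 1/N into the learning rate.
        #include <nccl.h>
        #include <cuda_runtime.h>

        void allreduce_gradients(float* d_grads, size_t count,
                                 ncclComm_t comm, cudaStream_t stream) {
            ncclAllReduce(d_grads, d_grads, count, ncclFloat, ncclSum, comm, stream);
        }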

  9. FRAMEWORKS
     Cambrian explosion of AI: lots of AI frameworks, all GPU-accelerated.
     Researchers need programmability to prototype rapidly.

  10. SIMULATION
      Many important AI tasks involve agents interacting with the real world; for this, you need simulators (physics, appearance).
      Simulation has a big role to play in AI progress.
      NVIDIA Project Isaac: a simulator for reinforcement learning.

  11. DEEP NEURAL NETWORKS
      Simple, powerful function approximators.
      One layer maps inputs x to outputs y through weights w: y_j = f(sum_i w_ij * x_i).
      The nonlinearity is a ReLU: f(x) = 0 for x < 0, and f(x) = x for x >= 0.
      A deep neural network stacks many such layers (a naive kernel for one layer is sketched below).
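
     Spelled out as code, a deliberately naive sketch of one layer (real layers use cuBLAS/cuDNN as on slide 7; the kernel name and row-major weight layout are my choices):

        // y_j = f(sum_i w[j][i] * x[i]) with f = ReLU; one thread per output j.
        // w is row-major: w[j * n_in + i].
        __global__ void dense_relu(const float* w, const float* x, float* y,
                                   int n_in, int n_out) {
            int j = blockIdx.x * blockDim.x + threadIdx.x;
            if (j < n_out) {
                float acc = 0.0f;
                for (int i = 0; i < n_in; ++i)
                    acc += w[j * n_in + i] * x[i];  // dot product
                y[j] = acc > 0.0f ? acc : 0.0f;     // ReLU nonlinearity
            }
        }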

  12. TRAINING NEURAL NETWORKS
      Same layer equation, y_j = f(sum_i w_ij * x_i): computation is dominated by dot products.
      Multiple inputs, multiple outputs, and batching turn these dot products into matrix multiplies, which makes training compute bound (a rough estimate follows).
      Training one model: 20+ exaflops.
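
     A back-of-envelope estimate (mine, not the slide's) of why batching makes the work compute bound, for C = A*B with A of size m x k and B of size k x n, where n is the batch size:

        flops = 2*m*n*k        bytes moved (FP32) >= 4*(m*k + k*n + m*n)

        n = 1 (no batch):    intensity ~ 2*m*k / (4*m*k) = 0.5 flops/byte   -> memory bound
        n = 256, m,k large:  intensity ~ n/2 = 128 flops/byte               -> compute bound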

  13. SCALE MATTERS
      More data and more compute: more AI.
      Image recognition: AlexNet (2012), 8 layers, 1.4 GFLOP, ~16% error; ResNet (2015), 152 layers, 22.6 GFLOP, ~3.5% error.
      That is a 16x growth in model compute (22.6 / 1.4) for a large drop in error.

  14. LAWS OF PHYSICS
      [Chart: GPU TFLOPS over time on a log scale (0.1, 1, 10), accelerated performance vs. general-purpose performance, with Volta as the latest point: a 20x gap in 10 years, and growing.]
      Successful AI uses accelerated computing.

  15. ACCELERATED COMPUTING
      Find an economically important problem that needs compute, then build hardware for it to take it to the speed of light.
      GPUs are accelerators, and AI is a huge focus for our GPUs (V100 GPU pictured).

  16. TESLA V100
      21B transistors, 815 mm^2
      80 SMs*, 5120 CUDA cores, 640 Tensor Cores
      16 GB HBM2, 900 GB/s HBM2 bandwidth
      300 GB/s NVLink
      *The full GV100 chip contains 84 SMs.

  17. GPU PERFORMANCE COMPARISON
                                 P100            V100            Ratio
      Training acceleration      10 TOPS         120 TOPS        12x
      Inference acceleration     21 TFLOPS       120 TOPS        6x
      FP64/FP32                  5/10 TFLOPS     7.5/15 TFLOPS   1.5x
      HBM2 bandwidth             720 GB/s        900 GB/s        1.2x
      NVLink bandwidth           160 GB/s        300 GB/s        1.9x
      L2 cache                   4 MB            6 MB            1.5x
      L1 caches                  1.3 MB          10 MB           7.7x

  18. ARITHMETIC
      Mixed precision (FP32 + FP16) for training; lower-precision integer (Int8) for inference.
      (One common mixed-precision recipe is sketched below.)
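
     A minimal sketch of one common mixed-precision recipe: FP16 model weights and gradients, FP32 master weights, and loss scaling. The recipe details and kernel name are my assumptions, not the talk's.

        // One SGD step: grad is FP16 (scaled by loss_scale during backprop),
        // master weights are kept in FP32, model weights are rounded to FP16.
        #include <cuda_fp16.h>

        __global__ void sgd_mixed_step(float* master_w, __half* model_w,
                                       const __half* grad, int n,
                                       float lr, float inv_loss_scale) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                float g = __half2float(grad[i]) * inv_loss_scale;  // undo loss scaling
                master_w[i] -= lr * g;                             // update in FP32
                model_w[i] = __float2half(master_w[i]);            // round to FP16
            }
        }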

  19. TENSOR CORE
      Mixed-precision matrix math on 4x4 matrices: D = A*B + C, where A and B are FP16 and C and D are FP16 or FP32.
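
     For reference, a sketch of driving Tensor Cores through the CUDA 9 WMMA API, which exposes the operation at warp level as a 16x16x16 tile (the matrix layouts here are my choice; launch with one warp, i.e. 32 threads):

        #include <mma.h>
        #include <cuda_fp16.h>
        using namespace nvcuda;

        // One warp computes a 16x16x16 tile of D = A*B + C:
        // FP16 inputs, FP32 accumulation.
        __global__ void wmma_tile(const half* a, const half* b,
                                  const float* c, float* d) {
            wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
            wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
            wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

            wmma::load_matrix_sync(a_frag, a, 16);
            wmma::load_matrix_sync(b_frag, b, 16);
            wmma::load_matrix_sync(acc, c, 16, wmma::mem_row_major);
            wmma::mma_sync(acc, a_frag, b_frag, acc);                  // D = A*B + C
            wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
        }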

  20. SCALABILITY
      Thesis: AI is the most important problem. How can we use our best computers for it? The fastest supercomputer is roughly 10,000x one GPU.
      Often people use 1-8 GPUs; current best practices use ~128.
      Research problem: how can we use 10,000?

  21. VOLTA NVLINK
      300 GB/s: 50% more links, 28% faster signaling.

  22. HARDWARE PLATFORMS
      Systems, not just GPUs.
      Drive PX Pegasus: 320 TOPS, for self-driving cars.
      DGX: 960 TOPS, 128 GB HBM2, 7.2 TB/s memory bandwidth, 512 GB DRAM, 8 TB SSD, 4x EDR IB, 3.2 kW.

  23. TENSOR RT
      Optimized inference.
      Horizontal and vertical fusion saves memory bandwidth (a toy illustration follows).
      Low batch-size optimizations: inference batch sizes are small.
      Int8 support: helps choose scaling factors.
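
     A toy illustration of vertical fusion (my example, not TensorRT code): unfused, a bias-add kernel and a ReLU kernel would each read and write the whole tensor; fused, the tensor makes one round trip to memory.

        // Fused bias-add + ReLU over an n-element tensor with c channels.
        __global__ void bias_relu_fused(float* x, const float* bias, int n, int c) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                float v = x[i] + bias[i % c];  // bias-add
                x[i] = v > 0.0f ? v : 0.0f;    // ReLU in the same pass: no extra DRAM trip
            }
        }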

  24. ACCELERATED COMPUTING FOR AI
      Tremendous excitement in systems for AI: programmability and flexibility are fundamental, and high computational intensity is also required.
      Make human ingenuity the limiting factor for AI research & deployment.
      Bryan Catanzaro, @ctnzr
