High-Performance Hardware for Machine Learning
William Dally, NVIDIA Corporation / Stanford University
U.C. Berkeley, October 19, 2016
Machine learning is transforming computing:
• Speech
• Vision
• Natural Language Understanding
• Autonomous Vehicles
• Question Answering
• Control
• Game Playing (Go)
• Ad Placement
Whole research fields rendered irrelevant
Hardware and Data enable DNNs
The Need for Speed

Important property of neural networks: results get better with more data + bigger models + more computation. (Better algorithms, new insights, and improved techniques always help, too!)

Image recognition (16x growth in model compute):
• AlexNet (2012): 8 layers, 1.4 GFLOP, ~16% error
• ResNet (2015): 152 layers, 22.6 GFLOP, ~3.5% error

Speech recognition (10x growth in training ops):
• Deep Speech 1 (2014): 80 GFLOP, 7,000 hrs of data, ~8% error
• Deep Speech 2 (2015): 465 GFLOP, 12,000 hrs of data, ~5% error
DNN primer
WHAT NETWORK? DNNS, CNNS, AND RNNS
DNN, KEY OPERATION IS DENSE M x V

b = W x a, i.e. b_i = Σ_j W_ij · a_j

b: output activations; W: weight matrix; a: input activations
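As a concrete illustration, here is a minimal NumPy sketch of this dense M x V; the layer sizes are made up for the example, not taken from the slides:

    import numpy as np

    # Dense M x V: b_i = sum_j W[i, j] * a[j]
    def fc_layer(W, a):
        return W @ a

    # Illustrative sizes: 4096 outputs, 4096 inputs.
    W = np.random.randn(4096, 4096).astype(np.float32)  # weight matrix
    a = np.random.randn(4096).astype(np.float32)        # input activations
    b = fc_layer(W, a)                                   # output activations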
CNNS – For image inputs, convolutional stages act as trained feature detectors
CNNS require convolution in addition to M x V

Multiple 3D kernels K_uvkj turn input maps A_xyk into output maps B_xyj:

B_xyj = Σ_k Σ_u,v A_(x-u)(y-v)k · K_uvkj
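A minimal NumPy sketch of this computation, written with cross-correlation indexing A[x+u, y+v] rather than the slide's A[(x-u)(y-v)] (a sign-convention detail); padding, stride, and bias are omitted:

    import numpy as np

    def conv_layer(A, K):
        """B[x, y, j] = sum over k, u, v of A[x+u, y+v, k] * K[u, v, k, j].
        A: input maps (X, Y, Cin); K: kernels (U, V, Cin, Cout).
        'Valid' region only, for brevity."""
        X, Y, _ = A.shape
        U, V, _, Cout = K.shape
        B = np.zeros((X - U + 1, Y - V + 1, Cout), dtype=A.dtype)
        for x in range(B.shape[0]):
            for y in range(B.shape[1]):
                # contract the (U, V, Cin) window against all kernels at once
                B[x, y, :] = np.tensordot(A[x:x+U, y:y+V, :], K, axes=3)
        return B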
4 Distinct Sub-problems

Two layer types crossed with two phases give four cases: Train Conv, Inference Conv, Train FC, Inference FC.
• Convolutional layers: activation-dominated, B x S weight reuse
• Fully-connected layers: weight-dominated, B weight reuse
• Training: 32b FP, large batches, large memory footprint; minimize training time
• Inference: 8b int, small (unit) batches; meet the real-time constraint
DNNs are Trivially Parallelized
Lots of parallelism in a DNN:
• Inputs
• Points of a feature map
• Filters
• Elements within a filter
Key properties:
• Multiplies within a layer are independent
• Sums are reductions
• Only layers are dependent
• No data-dependent operations => can be statically scheduled
Data Parallel – Run multiple inputs in parallel
• Doesn't affect latency for one input
• Requires P-fold larger batch size
• For training, requires coordinated weight update
Parameter Update (parameter server): p' = p + Δp
Model workers compute Δp on their data shards and send it to the parameter server, which applies the update and returns p'.
Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012
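A toy sketch of the update rule, written as a synchronous loop over worker gradients for simplicity (in Dean et al. the model replicas send updates asynchronously); all names and sizes are illustrative:

    import numpy as np

    def parameter_server_step(p, worker_grads, lr=0.01):
        """Apply p' = p + delta_p for each worker's contribution,
        where delta_p is a scaled gradient from one data shard."""
        for g in worker_grads:
            p = p + (-lr) * g
        return p

    # Hypothetical setup: 4 workers, a 1000-parameter model.
    p = np.zeros(1000, dtype=np.float32)
    grads = [np.random.randn(1000).astype(np.float32) for _ in range(4)]
    p = parameter_server_step(p, grads)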
Model-Parallel Convolution – by output region (x,y)

Multiple 3D kernels K_uvkj turn input maps A_xyk into output maps B_xyj via a 6D loop; model parallelism assigns each worker its own output region XY:

    forall output region XY:              (distributed across workers)
        for each output map j:
            for each input map k:
                for each pixel x,y in XY:
                    for each kernel element u,v:
                        B_xyj += A_(x-u)(y-v)k * K_uvkj
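A hedged sketch of the same partitioning in NumPy, splitting output rows across workers and reusing the conv_layer function from the earlier sketch; the (U-1)-row input halo per strip is the key detail:

    import numpy as np

    def model_parallel_conv(A, K, n_workers=4):
        """Partition the output into horizontal strips, one per worker.
        Each worker needs its strip's input rows plus a (U-1)-row halo.
        A sketch only: a real implementation exchanges halos between nodes."""
        U = K.shape[0]
        X_out = A.shape[0] - U + 1
        bounds = np.linspace(0, X_out, n_workers + 1, dtype=int)
        strips = []
        for w in range(n_workers):
            x0, x1 = bounds[w], bounds[w + 1]
            strips.append(conv_layer(A[x0:x1 + U - 1], K))  # conv_layer as above
        return np.concatenate(strips, axis=0)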
Model Parallel Fully-Connected Layer (M x V)

b = W x a; the weight matrix W is partitioned across workers, each computing its slice of the output activations b from the full input activations a.
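A minimal sketch of the row-partitioned M x V; the worker count and the use of array_split are purely illustrative:

    import numpy as np

    def model_parallel_fc(W, a, n_workers=4):
        """Each worker holds a horizontal slice of W and produces the
        matching slice of b = W @ a; every worker sees the full vector a."""
        slices = np.array_split(W, n_workers, axis=0)  # row-partition the weights
        return np.concatenate([Wi @ a for Wi in slices])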
GPUs
Pascal GP100
• 10 TeraFLOPS FP32
• 20 TeraFLOPS FP16
• 16GB HBM, 750GB/s
• 300W TDP
• 67 GFLOPS/W (FP16)
• 16nm process
• 160GB/s NVLink
NVIDIA DGX-1: WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER
• 170 TFLOPS
• 8x Tesla P100 16GB
• NVLink Hybrid Cube Mesh
• Optimized deep learning software
• Dual Xeon
• 7 TB SSD deep learning cache
• Dual 10GbE, quad IB 100Gb
• 3RU, 3200W
Facebook's deep learning machine
• Purpose-built for deep learning training
• 2x faster training for faster deployment
• 2x larger networks for higher accuracy
• Powered by eight Tesla M40 GPUs
• Open Rack compliant

"Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models." (Serkan Piantino, Engineering Director of Facebook AI Research)
NVIDIA Parker
• 1.5 TeraFLOPS FP16
• ARM v8 CPU complex (2x Denver 2 + 4x A57), coherent HMP
• 4GB of LPDDR4 @ 25.6 GB/s (128-bit)
• 15W TDP (1W idle, <10W typical)
• 100 GFLOPS/W (FP16)
• 16nm process
[Block diagram: security engines, audio engine, 2D engine, 4K60 video encoder and decoder, GigE Ethernet MAC, display engines, image processor (ISP), boot and power-management processor, safety engine, I/O]
XAVIER AI SUPERCOMPUTER SOC
• 7 billion transistors, 16nm FF
• 8-core custom ARM64 CPU
• 512-core Volta GPU
• New computer vision accelerator
• Dual 8K HDR video processors
• Designed for ASIL C functional safety
XAVIER AI SUPERCOMPUTER SOC: ONE ARCHITECTURE
• DRIVE PX 2 (2 Parker + 2 Pascal GPUs): 20 TOPS DL | 120 SPECINT | 80W
• XAVIER: 20 TOPS DL | 160 SPECINT | 20W
Parallel GPUs on Deep Speech 2
[Figure: training time in seconds (log scale, 2^11 to 2^19) vs. number of GPUs (2^0 to 2^7) for two models, 5-3 (2560) and 9-7 (1760)]
Baidu, Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015
Reduced Precision
How Much Precision is Needed for Dense M x V?

b_i = g(Σ_j w_ij · a_j)

b: output activations; W: weight matrix; a: input activations; g: nonlinearity
Number Representation

Format   Bits (sign/exponent/mantissa)   Range               Accuracy
FP32     1/8/23                          10^-38 to 10^38     0.000006%
FP16     1/5/10                          6x10^-5 to 6x10^4   0.05%
Int32    1/-/31                          0 to 2x10^9         1/2
Int16    1/-/15                          0 to 6x10^4         1/2
Int8     1/-/7                           0 to 127            1/2
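To make the integer formats concrete, here is a small sketch of symmetric linear quantization from FP32 to Int8; the scaling scheme is a common choice, not something specified on the slide:

    import numpy as np

    def quantize_int8(x):
        """Map the largest magnitude in x to 127; production schemes
        calibrate this scale per layer or per channel."""
        scale = np.abs(x).max() / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(1000).astype(np.float32)
    q, s = quantize_int8(w)
    print(np.abs(dequantize_int8(q, s) - w).max())  # worst-case rounding error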
Cost of Operations

Operation             Energy (pJ)   Area (um^2)
8b Add                0.03          36
16b Add               0.05          67
32b Add               0.1           137
16b FP Add            0.4           1360
32b FP Add            0.9           4184
8b Mult               0.2           282
32b Mult              3.1           3495
16b FP Mult           1.1           1640
32b FP Mult           3.7           7700
32b SRAM Read (8KB)   5             N/A
32b DRAM Read         640           N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node; FP units used the DesignWare library.
The Importance of Staying Local
• LPDDR DRAM (GB capacity): 640 pJ/word
• On-chip SRAM (MB capacity): 50 pJ/word
• Local SRAM (KB capacity): 5 pJ/word
Mixed Precision
• Store weights w_ij as 4b using trained quantization; decode to 16b
• Store activations a_j as 16b
• 16b x 16b multiply; round result to 16b
• Accumulate b_i at 24b or 32b to avoid saturation
• Batch normalization important to 'center' the dynamic range
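A sketch of this datapath in NumPy; the codebook contents and layer sizes are invented for illustration, and a real kernel would do the decode and accumulation in hardware:

    import numpy as np

    def mixed_precision_fc(idx, codebook, a):
        """Weights stored as 4b codebook indices (trained quantization),
        decoded to 16b; 16b x 16b multiplies rounded to 16b; accumulation
        at 32b to avoid saturation."""
        W = codebook[idx]                                 # 4b index -> 16b weight
        prod = (W * a[np.newaxis, :]).astype(np.float16)  # 16b multiply, 16b result
        return prod.astype(np.float32).sum(axis=1)        # 32b accumulate

    codebook = np.linspace(-1, 1, 16).astype(np.float16)  # 16 entries = 4 bits
    idx = np.random.randint(0, 16, size=(256, 512)).astype(np.uint8)
    a = np.random.randn(512).astype(np.float16)           # 16b activations
    b = mixed_precision_fc(idx, codebook, a)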
Weight Update
The learning rate may be very small (10^-5 or less). The update Δw_ij is formed from a_j and g_j and added to w_ij; at low precision the tiny Δw is rounded to zero. No learning!
Stochastic Rounding
The learning rate may be very small (10^-5 or less), so Δw is very small. Apply stochastic rounding before the add: Δw'_ij = SR(Δw_ij), with E(Δw'_ij) = Δw_ij, so small updates are preserved in expectation.
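A small sketch of stochastic rounding; the quantization step and update magnitude below are illustrative:

    import numpy as np

    def stochastic_round(x, step):
        """Round x down or up to a multiple of `step` with probability
        proportional to proximity, so E[SR(x)] = x exactly."""
        lo = np.floor(x / step)
        p_up = x / step - lo                      # fractional part in [0, 1)
        round_up = np.random.random(np.shape(x)) < p_up
        return (lo + round_up) * step

    # A 1e-5 update at a 1e-3 quantization step: deterministic rounding
    # would give exactly 0; stochastic rounding preserves it on average.
    dw = np.full(100000, 1e-5)
    print(stochastic_round(dw, 1e-3).mean())      # ~1e-5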
Reduced Precision For Training

b_i = g(Σ_j w_ij · a_j)        w_ij = w_ij + α · g_i · a_j

S. Gupta et al., "Deep Learning with Limited Numerical Precision", ICML 2015
Pruning
Pruning
[Figure: a network before and after pruning, showing both pruning synapses and pruning neurons]
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Retrain to Recover Accuracy

Pipeline: Train Connectivity -> Prune Connections -> Train Weights (iterate)

[Figure: accuracy loss (0.5% to -4.5%) vs. parameters pruned away (40% to 100%), comparing L1/L2 regularization with and without retraining, and L2 regularization with iterative prune and retrain]

Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
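A compact sketch of magnitude pruning with iterative retraining in NumPy; the thresholds, sizes, and schedule are illustrative, not the paper's exact recipe:

    import numpy as np

    def magnitude_prune(W, sparsity):
        """Zero the smallest-magnitude fraction of weights and return a
        mask so retraining keeps pruned connections at zero."""
        thresh = np.quantile(np.abs(W), sparsity)
        mask = (np.abs(W) > thresh).astype(W.dtype)
        return W * mask, mask

    def retrain_step(W, mask, grad, lr=0.01):
        """One retraining step that only updates surviving connections."""
        return (W - lr * grad) * mask

    # Iterative prune-and-retrain: prune a little, retrain, prune more.
    W = np.random.randn(512, 512).astype(np.float32)
    for sparsity in (0.5, 0.7, 0.9):
        W, mask = magnitude_prune(W, sparsity)
        # ... retraining loop would call retrain_step with real gradients ...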
Pruning of VGG-16
Pruning NeuralTalk and LSTM
Speedup of Pruning on CPU/GPU
• Intel Core i7 5930K: MKL CBLAS GEMV (dense), MKL SPBLAS CSRMV (sparse)
• NVIDIA GeForce GTX Titan X: cuBLAS GEMV (dense), cuSPARSE CSRMV (sparse)
• NVIDIA Tegra K1: cuBLAS GEMV (dense), cuSPARSE CSRMV (sparse)
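The comparison pairs a dense GEMV kernel against a sparse CSRMV kernel on the pruned matrix; in Python terms (a sketch using SciPy in place of MKL/cuSPARSE), the two computations look like this:

    import numpy as np
    from scipy.sparse import csr_matrix

    # A pruned weight matrix: 90% of entries forced to zero.
    W = np.random.randn(4096, 4096).astype(np.float32)
    W[np.abs(W) < np.quantile(np.abs(W), 0.9)] = 0.0
    a = np.random.randn(4096).astype(np.float32)

    W_csr = csr_matrix(W)    # compressed sparse row storage of the survivors
    b_dense = W @ a          # what GEMV computes
    b_sparse = W_csr @ a     # what CSRMV computes; same result, less work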
Trained Quantization (Weight Sharing)

Pruning: less quantity. Quantization: less precision.

Pipeline: Train Connectivity -> Prune Connections -> Train Weights (original network at 100% size -> same accuracy at 10% size) -> Cluster the Weights -> Generate Code Book -> Quantize the Weights with Code Book -> Retrain Code Book (same accuracy at 3.7% size)

Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015
Weight Sharing via K-Means

[Figure: a 4x4 block of 32-bit float weights is clustered into four centroids (2.00, 1.50, 0.00, -1.00); each weight is stored as a 2-bit cluster index. During fine-tuning, the gradients are grouped by cluster, reduced, and scaled by the learning rate to update the centroids (to 1.96, 1.48, -0.04, -0.97).]

Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015
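A minimal k-means weight-sharing sketch in NumPy; it uses the linear centroid initialization the paper favors but omits the gradient-grouping fine-tuning step shown in the figure:

    import numpy as np

    def kmeans_share(W, n_bits=2, n_iter=20):
        """Cluster weights into 2**n_bits centroids; store a tiny codebook
        plus a per-weight cluster index."""
        k = 2 ** n_bits
        flat = W.ravel()
        centroids = np.linspace(flat.min(), flat.max(), k)  # linear init
        for _ in range(n_iter):
            idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
            for c in range(k):
                if np.any(idx == c):
                    centroids[c] = flat[idx == c].mean()
        return idx.reshape(W.shape).astype(np.uint8), centroids

    W = np.random.randn(4, 4).astype(np.float32)
    idx, codebook = kmeans_share(W)    # 2-bit indices + 4 shared weights
    W_shared = codebook[idx]           # decoded (shared) weight matrix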
Trained Quantization
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015
Bits per Weight
Pruning + Trained Quantization