  1. High-Performance Hardware for Machine Learning. U.C. Berkeley, October 19, 2016. William Dally, NVIDIA Corporation and Stanford University

  2. Machine learning is transforming computing: speech, vision, natural language understanding, autonomous vehicles, question answering, control, game playing (Go), and ad placement.

  3. Whole research fields rendered irrelevant

  4. Hardware and Data enable DNNs

  5. The Need for Speed. An important property of neural networks: results get better with more data, bigger models, and more computation (better algorithms, new insights, and improved techniques always help, too!). Image recognition: 16X growth in model training ops from AlexNet (2012: 8 layers, 1.4 GFLOP, ~16% error) to ResNet (2015: 152 layers, 22.6 GFLOP, ~3.5% error). Speech recognition: 10X growth from Deep Speech 1 (2014: 80 GFLOP, 7,000 hrs of data, ~8% error) to Deep Speech 2 (2015: 465 GFLOP, 12,000 hrs of data, ~5% error).

  6. DNN primer

  7. What network? DNNs, CNNs, and RNNs

  8. DNN: the key operation is a dense M x V, b_i = Σ_j W_ij a_j, where b is the output activations, a is the input activations, and W is the weight matrix.
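
As a concrete illustration, here is a minimal numpy sketch of that dense M x V (not from the talk; the sizes and names are illustrative):

```python
import numpy as np

def fc_layer(W, a):
    """Dense M x V: b_i = sum_j W_ij * a_j (one fully-connected layer, no bias)."""
    return W @ a

# Toy sizes chosen for illustration only.
W = np.random.randn(256, 1024).astype(np.float32)  # weight matrix (outputs x inputs)
a = np.random.randn(1024).astype(np.float32)       # input activations
b = fc_layer(W, a)                                  # output activations, shape (256,)
```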

  9. CNNs: for image inputs, convolutional stages act as trained feature detectors.

  10. CNNs require convolution in addition to M x V: multiple 3D kernels K_uvkj are convolved with the input maps A_xyk to produce the output maps B_xyj.

  11. Four distinct sub-problems (training vs. inference, convolutional vs. fully-connected). Convolutional layers are activation-dominated with B x S weight reuse (Train Conv, Inference Conv); fully-connected layers are weight-dominated with B weight reuse (Train FC, Inference FC). Training uses 32b FP with large batches, has a large memory footprint, and the goal is to minimize training time; inference uses 8b int with small (unit) batches and must meet a real-time constraint.

  12. DNNs are Trivially Parallelized

  13. Lots of parallelism in a DNN: across inputs, points of a feature map, filters, and elements within a filter. Multiplies within a layer are independent, sums are reductions, and only layers depend on one another; there are no data-dependent operations, so the computation can be statically scheduled.

  14. Data Parallel – Run multiple inputs in parallel • Doesn’t affect latency for one input • Requires P-fold larger batch size • For training requires coordinated weight update

  15. Parameter update: p' = p + Δp. A parameter server holds the parameters p; model workers, each running on a shard of the data, compute Δp and send it to the server, which returns the updated p'. (Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012)
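
A minimal single-process sketch of this data-parallel, parameter-server update (not from the talk; the linear model, shard count, and learning rate are stand-ins for illustration):

```python
import numpy as np

def worker_gradient(p, x_shard, y_shard):
    """One model worker: compute a gradient contribution on its data shard.
    A linear least-squares model stands in for the DNN here."""
    pred = x_shard @ p
    return x_shard.T @ (y_shard - pred) / len(y_shard)

# Parameter server state
p = np.zeros(8)
lr = 0.1

# Data split into shards, one per worker
rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 8))
y = x @ np.arange(8.0) + 0.01 * rng.normal(size=1024)
shards = np.array_split(np.arange(1024), 4)

for step in range(100):
    # Each worker computes its delta_p on its shard; the server combines them,
    # applies p' = p + delta_p, and broadcasts the new parameters.
    delta_p = lr * np.mean(
        [worker_gradient(p, x[idx], y[idx]) for idx in shards], axis=0)
    p = p + delta_p
```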

  16. Model-parallel convolution, partitioned by output region (x, y). The 6D loop: forall output regions XY (in parallel across processors), for each output map j, for each input map k, for each pixel (x, y) in XY, for each kernel element (u, v): B_xyj += A_(x-u)(y-v)k * K_uvkj, where A is the input maps and B is the output maps.
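
A direct, unoptimized Python transcription of that 6D loop, written so each call computes one output region (the unit that model parallelism distributes across processors); array layouts and sizes are assumptions for illustration:

```python
import numpy as np

def conv_region(A, K, x_range, y_range):
    """Compute output maps B[x, y, j] for one output region.
    A: input maps, shape (X, Y, C_in)
    K: kernels,    shape (U, V, C_in, C_out)"""
    X, Y, C_in = A.shape
    U, V, _, C_out = K.shape
    B = np.zeros((len(x_range), len(y_range), C_out), dtype=A.dtype)
    for xi, x in enumerate(x_range):          # for each pixel x, y in this region
        for yi, y in enumerate(y_range):
            for j in range(C_out):            # for each output map j
                for k in range(C_in):         # for each input map k
                    for u in range(U):        # for each kernel element u, v
                        for v in range(V):
                            if 0 <= x - u < X and 0 <= y - v < Y:
                                B[xi, yi, j] += A[x - u, y - v, k] * K[u, v, k, j]
    return B

A = np.random.randn(16, 16, 3).astype(np.float32)
K = np.random.randn(3, 3, 3, 8).astype(np.float32)
B_top = conv_region(A, K, range(0, 8), range(0, 16))      # one worker's region
B_bottom = conv_region(A, K, range(8, 16), range(0, 16))  # another worker's region
```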

  17. Model-parallel fully-connected layer (M x V): b_i = Σ_j W_ij a_j, with the weight matrix W partitioned across processors so that each processor computes its own slice of the output activations b from the full input activations a.
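
A small sketch of the model-parallel M x V, assuming the rows of W are split across processors and each processor holds the full input vector (illustrative, not from the talk):

```python
import numpy as np

def model_parallel_fc(W, a, num_procs):
    """Model-parallel M x V: split the rows of W across processors; each one
    holds the full input activations and computes its own slice of b."""
    row_blocks = np.array_split(W, num_procs, axis=0)
    partial_outputs = [Wp @ a for Wp in row_blocks]   # one per processor
    return np.concatenate(partial_outputs)            # gather the output slices

W = np.random.randn(512, 1024).astype(np.float32)
a = np.random.randn(1024).astype(np.float32)
b = model_parallel_fc(W, a, num_procs=4)
assert np.allclose(b, W @ a)                          # same result as the serial M x V
```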

  18. GPUs

  19. Pascal GP100 • 10 TeraFLOPS FP32 • 20 TeraFLOPS FP16 • 16GB HBM – 750GB/s • 300W TDP • 67GFLOPS/W (FP16) • 16nm process • 160GB/s NV Link

  20. NVIDIA DGX-1, the world's first deep learning supercomputer: 170 TFLOPS, 8x Tesla P100 16GB, NVLink hybrid cube mesh, optimized deep learning software, dual Xeon, 7 TB SSD deep learning cache, dual 10GbE and quad IB 100Gb, 3RU at 3200W.

  21. Facebook's deep learning machine: purpose-built for deep learning training; 2x faster training for faster deployment; 2x larger networks for higher accuracy; powered by eight Tesla M40 GPUs; Open Rack compliant. "Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models." (Serkan Piantino, Engineering Director of Facebook AI Research)

  22. NVIDIA Parker: 1.5 TFLOPS FP16; ARM v8 CPU complex (2x Denver 2 + 4x A57, coherent HMP); 4GB of LPDDR4 @ 25.6 GB/s on a 128-bit interface; 15 W TDP (1 W idle, <10 W typical); 100 GFLOPS/W (FP16); 16nm process. Other SoC blocks include 4K60 video encoder and decoder, security and audio engines, a 2D engine, display engines, an image processor (ISP), GigE Ethernet MAC, boot and power-management processors, and a safety I/O engine.

  23. Xavier AI supercomputer SoC: 7 billion transistors, 16nm FF, 8-core custom ARM64 CPU, 512-core Volta GPU, new computer vision accelerator, dual 8K HDR video processors, designed for ASIL C functional safety.

  24. Xavier AI supercomputer SoC, one architecture: DRIVE PX 2 (2 Parker + 2 Pascal GPUs) delivers 20 TOPS DL | 120 SPECINT | 80 W; Xavier delivers 20 TOPS DL | 160 SPECINT | 20 W.

  25. Parallel GPUs on Deep Speech 2: plot of training time (seconds, log scale from 2^11 to 2^19) versus number of GPUs (1 to 128) for two model configurations, 5-3 (2560) and 9-7 (1760). (Baidu, Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015)

  26. Reduced Precision

  27. How much precision is needed for a dense M x V? b_i = g(Σ_j w_ij a_j), where b is the output activations, a is the input activations, W is the weight matrix, and g is the nonlinearity.

  28. Number representation. FP32 (1 sign, 8 exponent, 23 mantissa bits): range 10^-38 to 10^38, accuracy .000006%. FP16 (1, 5, 10): range 6x10^-5 to 6x10^4, accuracy .05%. Int32 (1 sign, 31 magnitude bits): range 0 to 2x10^9. Int16 (1, 15): range 0 to 6x10^4. Int8 (1, 7): range 0 to 127. The integer formats all have an absolute accuracy of 1/2 ulp.
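
These ranges and accuracies can be checked with numpy; a small illustrative snippet (the int8 scale factor is an arbitrary choice, not part of the slide):

```python
import numpy as np

# FP16: ~6e-5 to ~6e4 dynamic range, ~0.05% relative rounding error
print(np.finfo(np.float16).tiny)    # ~6.1e-05  (smallest normal)
print(np.finfo(np.float16).max)     # 65504.0   (~6e4)
print(np.finfo(np.float16).eps)     # ~0.000977 (half of this is the ~0.05% rounding error)

# Int8: 0..127 magnitude, worst-case rounding error of 1/2 ulp
x = 3.14159
scale = 127 / 4.0                    # map [0, 4) onto the int8 magnitude range
q = np.int8(round(x * scale))        # quantize
print(q / scale, abs(x - q / scale)) # dequantized value; error is at most 0.5 / scale
```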

  29. Cost of operations (energy in pJ, area in um^2): 8b add 0.03 pJ / 36 um^2; 16b add 0.05 / 67; 32b add 0.1 / 137; 16b FP add 0.4 / 1360; 32b FP add 0.9 / 4184; 8b mult 0.2 / 282; 32b mult 3.1 / 3495; 16b FP mult 1.1 / 1640; 32b FP mult 3.7 / 7700; 32b SRAM read (8KB) 5 pJ; 32b DRAM read 640 pJ. Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014; area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node, with FP units from the DesignWare library.
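
To make the table concrete, a back-of-the-envelope estimate for one hypothetical 4096 x 4096 fully-connected layer, using only the per-operation energies above (illustrative; it counts arithmetic and weight reads and nothing else):

```python
# Rough energy estimate using the per-operation energies from the table (pJ).
N = 4096 * 4096                      # number of multiply-accumulate operations

fp32_mac = 3.7 + 0.9                 # 32b FP multiply + 32b FP add (pJ)
int8_mac = 0.2 + 0.03                # 8b multiply + 8b add (pJ)
dram_read = 640                      # 32b DRAM read (pJ)

print("FP32 arithmetic:", N * fp32_mac * 1e-6, "uJ")
print("INT8 arithmetic:", N * int8_mac * 1e-6, "uJ")
# If every 32b weight were fetched from DRAM, memory would dwarf arithmetic:
print("DRAM weight reads:", N * dram_read * 1e-6, "uJ")
```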

  30. The importance of staying local: LPDDR DRAM (GBs) costs 640 pJ/word, on-chip SRAM (MBs) costs 50 pJ/word, and local SRAM (KBs) costs 5 pJ/word.

  31. Mixed precision: store weights as 4b using trained quantization and decode them to 16b; store activations as 16b; do a 16b x 16b multiply and round the result to 16b; accumulate in 24b or 32b to avoid saturation. Batch normalization is important to 'center' the dynamic range.
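
A numpy sketch of that mixed-precision dot product, assuming a 16-entry trained codebook and an fp32 accumulator standing in for the 24b/32b accumulator (names and sizes are illustrative):

```python
import numpy as np

def mixed_precision_dot(w_idx, codebook, a_fp16):
    """w_idx: 4-bit weight codes (stored here in uint8)
    codebook: 16 fp16 centroid values from trained quantization
    a_fp16:  input activations in fp16
    Multiplies happen at 16b; the accumulator is kept wide to avoid saturation."""
    acc = np.float32(0.0)
    for idx, a in zip(w_idx, a_fp16):
        w = codebook[idx]                        # decode 4b index -> 16b weight
        prod = np.float16(w * a)                 # 16b x 16b multiply, rounded to 16b
        acc += np.float32(prod)                  # wide (32b) accumulation
    return acc

codebook = np.linspace(-1, 1, 16).astype(np.float16)   # stand-in for trained centroids
w_idx = np.random.randint(0, 16, size=1024).astype(np.uint8)
a = np.random.randn(1024).astype(np.float16)
b = mixed_precision_dot(w_idx, codebook, a)
```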

  32. Weight update: w_ij <- w_ij + Δw_ij, where Δw_ij is the product of the learning rate, the activation, and the back-propagated gradient. The learning rate may be very small (10^-5 or less), so in a low-precision format Δw is rounded to zero: no learning!

  33. Stochastic rounding: the learning rate may be very small (10^-5 or less) and Δw very small, so round Δw_ij stochastically to Δw'_ij such that E(Δw'_ij) = Δw_ij; the tiny updates are then preserved in expectation.
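
A small sketch contrasting deterministic rounding, which drops the tiny update entirely, with stochastic rounding, whose expected value equals the true update (the step size and update magnitude are illustrative):

```python
import numpy as np

def stochastic_round(x, step):
    """Round x to a multiple of `step` such that E[result] == x."""
    scaled = x / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                       # fractional part
    rounded = lower + (np.random.rand(*np.shape(x)) < prob_up)
    return rounded * step

step = 2.0 ** -10                                  # quantization step of the weight format
delta_w = 1e-5                                     # tiny update: learning rate * gradient

print(round(delta_w / step) * step)                # deterministic: 0.0 -> no learning
samples = stochastic_round(np.full(100000, delta_w), step)
print(samples.mean())                              # ~1e-5 on average -> learning proceeds
```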

  34. Reduced precision for training: forward pass b_i = g(Σ_j w_ij a_j), weight update w_ij <- w_ij + α a_j g_i. (S. Gupta et al., "Deep Learning with Limited Numerical Precision", ICML 2015)

  35. Pruning

  36. Pruning: before pruning vs. after pruning, both pruning synapses and pruning neurons. (Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015)

  37. Retrain to recover accuracy. The pipeline is: train connectivity, prune connections, train weights, iterating prune and retrain. Plot of accuracy loss versus fraction of parameters pruned away (40% to 100%) for L1 and L2 regularization, each with and without retraining, and for L2 regularization with iterative pruning and retraining. (Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015)
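
A minimal sketch of the train / prune / retrain loop, with magnitude pruning on a toy linear model standing in for the DNN (thresholds, sizes, and the model itself are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))
true_w = np.zeros(64); true_w[:8] = rng.normal(size=8)   # only a few useful connections
y = X @ true_w + 0.01 * rng.normal(size=2000)

def train(w, mask, steps=200, lr=0.05):
    """Gradient descent on the surviving connections only (pruned weights stay 0)."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w = (w - lr * grad) * mask
    return w

w = np.zeros(64)
mask = np.ones(64)
w = train(w, mask)                                   # 1. train connectivity
for fraction in (0.5, 0.75):                         # iterative prune + retrain
    threshold = np.quantile(np.abs(w[mask == 1]), fraction)
    mask = (np.abs(w) > threshold).astype(float)     # 2. prune small-magnitude weights
    w = train(w, mask)                               # 3. retrain the remaining weights
print("weights kept:", int(mask.sum()), "of 64")
```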

  38. Pruning of VGG-16

  39. Pruning NeuralTalk and LSTM

  40. Speedup of pruning on CPU/GPU. Baselines: Intel Core i7 5930K with MKL CBLAS GEMV (dense) and MKL SPBLAS CSRMV (sparse); NVIDIA GeForce GTX Titan X with cuBLAS GEMV and cuSPARSE CSRMV; NVIDIA Tegra K1 with cuBLAS GEMV and cuSPARSE CSRMV.

  41. Trained quantization (weight sharing). Pruning gives less quantity; quantization gives less precision. Pipeline: train connectivity, prune connections, train weights (from 100% size at the original accuracy to 10% size at the same accuracy); then cluster the weights, generate a code book, quantize the weights with the code book, and retrain the code book (down to 3.7% size at the same accuracy). (Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015)

  42. Weight sharing via k-means: the 32-bit float weights are clustered, and each weight is replaced by a 2-bit cluster index into a small table of centroids. During fine-tuning, the weight gradients are grouped by cluster index and reduced (summed), and the result, scaled by the learning rate, updates the shared centroids to give fine-tuned centroids. (Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015)
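
A numpy sketch of weight sharing via k-means on one layer's weights, including the group-by-index gradient reduction used to fine-tune the centroids (the hand-rolled k-means, sizes, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=256).astype(np.float32)      # one layer's weights (32b float)
k = 4                                                   # 2-bit indices -> 4 centroids

# K-means clustering of the scalar weights (a tiny hand-rolled version).
centroids = np.quantile(weights, [0.125, 0.375, 0.625, 0.875]).astype(np.float32)
for _ in range(20):
    idx = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
    for c in range(k):
        if np.any(idx == c):
            centroids[c] = weights[idx == c].mean()

# The layer now stores a 2-bit index per weight plus a 4-entry codebook.
quantized = centroids[idx]                              # decoded (shared) weight values

# Fine-tuning: group the weight gradients by cluster index, reduce (sum),
# and apply the update to the shared centroids.
grad_w = 0.01 * rng.normal(size=256).astype(np.float32)  # stand-in weight gradients
lr = 0.1
for c in range(k):
    centroids[c] -= lr * grad_w[idx == c].sum()           # fine-tuned centroids
```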

  43. Trained Quantization Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015

  44. Bits per Weight

  45. Pruning + Trained Quantization
