Deep Compression and EIE: Deep Neural Network Model Compression and Efficient Inference Engine
Song Han, CVA group, Stanford University
Apr 7, 2016
Intro about me and my advisor
• Song Han: fourth-year PhD student with Prof. Bill Dally at Stanford.
• Research interest: deep learning model compression and hardware acceleration, to make inference more efficient for deployment.
• Recent work on “Deep Compression” and “EIE: Efficient Inference Engine”, covered by TheNextPlatform, O’Reilly, TechEmergence & HackerNews.
• Bill Dally: Professor at Stanford University and former chairman of the CS department; leads the Concurrent VLSI Architecture Group.
• Chief Scientist of NVIDIA.
• Member of the National Academy of Engineering, Fellow of the American Academy of Arts & Sciences, Fellow of the IEEE, Fellow of the ACM, and recipient of numerous other awards.
Thanks to my collaborators
• NVIDIA: Jeff Pool, John Tran, Bill Dally
• Stanford: Xingyu Liu, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally
• Tsinghua: Huizi Mao, Song Yao, Yu Wang
• Berkeley: Forrest Iandola, Matthew Moskewicz, Khalid Ashraf, Kurt Keutzer
You’ll be interested in his GTC talk: S6417 - FireCaffe
This Talk:
• Deep Compression [1,2]: a deep neural network model compression pipeline.
• EIE Accelerator [3]: an efficient inference engine that accelerates the compressed deep neural network model.
• SqueezeNet++ [4,5]: ConvNet architecture design space exploration.
[1] Han et al., “Learning both Weights and Connections for Efficient Neural Networks”, NIPS 2015
[2] Han et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, ICLR 2016
[3] Han et al., “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
[4] Iandola, Han, et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size”, arXiv 2016
[5] Yao, Han, et al., “Hardware-friendly convolutional neural network with even-number filter size”, ICLR 2016 workshop
Deep Learning: the Next Wave of AI
Image Recognition • Speech Recognition • Natural Language Processing
Applications
The Problem: If Running DNN on Mobile…
App developers suffer from the model size.
“At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB.” —Andrew Ng
The Problem: If Running DNN on Mobile…
Hardware engineers suffer from the model size (embedded systems have limited resources).
The Problem: If Running DNN on the Cloud…
Network Delay • Power Budget • User Privacy
Intelligent but Inefficient
Deep Compression
Problem 1: Model Size. Solution 1: Deep Compression.
• Smaller Size: compress mobile app size by 35x-50x
• Accuracy: no loss of accuracy / improved accuracy
• Speedup: make inference faster
EIE Accelerator
Problem 2: Latency, Power, Energy. Solution 2: ASIC accelerator.
• Offline: no dependency on network connection
• Real Time: no network delay, high frame rate
• Low Power: high energy efficiency that preserves battery
Part 1: Deep Compression
• AlexNet: 35x, 240MB => 6.9MB => 0.47MB (510x)
• VGG16: 49x, 552MB => 11.3MB
• With no loss of accuracy on ImageNet-2012
• Weights fit in on-chip SRAM, taking 120x less energy than DRAM
1. Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
2. Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
3. Iandola, Han, et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, ECCV submission
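As a quick check, the quoted ratios follow directly from the model sizes above (the 0.47 MB figure is the compressed SqueezeNet model at AlexNet-level accuracy, reference [4] in the talk outline):

\[
\frac{240\ \text{MB}}{6.9\ \text{MB}} \approx 35\times, \qquad
\frac{552\ \text{MB}}{11.3\ \text{MB}} \approx 49\times, \qquad
\frac{240\ \text{MB}}{0.47\ \text{MB}} \approx 510\times
\]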
1. Pruning
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning: Motivation
• Trillions of synapses are generated in the human brain during the first few months after birth.
• At 1 year old, the synapse count peaks at 1,000 trillion, and pruning begins to occur.
• By 10 years old, a child has nearly 500 trillion synapses.
• This “pruning” mechanism removes redundant connections in the brain. [1]
[1] Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172, 2013.
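By analogy, network pruning removes redundant connections whose weights have small magnitude, then retrains the remaining ones. A minimal NumPy sketch of magnitude-based pruning in the spirit of the NIPS'15 paper (the function name, sparsity level, and layer shape are illustrative, not from the released code):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights, keeping (1 - sparsity) of them.

    Returns the pruned weights and a binary mask; during retraining the mask
    is re-applied after every gradient update so pruned weights stay zero.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                 # number of weights to remove
    threshold = np.partition(flat, k)[k]          # magnitude of the k-th smallest weight
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

# Example: prune a fully-connected layer's weight matrix to 90% sparsity
W = np.random.randn(300, 784).astype(np.float32)
W_pruned, mask = prune_by_magnitude(W, sparsity=0.9)
print(f"kept {mask.mean():.1%} of the weights")
```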
Retrain to Recover Accuracy
[Plot: accuracy loss vs. parameters pruned away (40%-100%), comparing L1/L2 regularization with and without retraining, and L2 regularization with iterative pruning and retraining.]
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
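The plot's takeaway: retraining after pruning recovers the lost accuracy, and iterating prune/retrain pushes the achievable sparsity further. A rough PyTorch sketch of that schedule, using random stand-in data and an illustrative sparsity ramp rather than the paper's exact recipe:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(256, 784), torch.randint(0, 10, (256,))   # stand-in for real data

masks = {}
for sparsity in (0.5, 0.7, 0.9):                  # increase sparsity each round
    # Prune: zero the smallest-magnitude weights of each Linear layer.
    for name, p in model.named_parameters():
        if p.dim() == 2:
            thresh = p.abs().flatten().kthvalue(int(sparsity * p.numel())).values
            masks[name] = (p.abs() >= thresh).float()
            p.data.mul_(masks[name])
    # Retrain: keep pruned weights at zero by re-applying the mask after each step.
    for step in range(100):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        for name, p in model.named_parameters():
            if name in masks:
                p.data.mul_(masks[name])
```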
Pruning: Results on 4 ConvNets
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
AlexNet & VGGNet
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Mask Visualization
Visualization of the first FC layer's sparsity pattern of LeNet-300-100. It has a banded structure repeated 28 times, which corresponds to the un-pruned parameters in the center of the images, since the digits are written in the center.
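A small sketch of how such a mask could be rendered; the mask here is fabricated (keeping only weights connected to the central columns of every image row) purely to show where the 28 repeated bands come from when a 28x28 image is flattened row by row:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in mask for the first FC layer of LeNet-300-100 (300 x 784).
# In practice this would be the binary mask produced by pruning; here we
# fabricate one that keeps only the central pixel columns of each image row.
img = np.zeros((28, 28), dtype=bool)
img[:, 8:20] = True                  # central columns of every image row
mask = np.zeros((300, 784), dtype=bool)
mask[:, :] = img.ravel()             # every output unit keeps only central inputs

plt.imshow(mask, cmap="gray", aspect="auto")
plt.xlabel("input pixel index (28x28 image, flattened row by row)")
plt.ylabel("output unit")
plt.title("Sparsity pattern: 28 repeated bands, one per image row")
plt.show()
```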
Pruning NeuralTalk and LSTM (Image Captioning)
• Pruning away 90% of the parameters in NeuralTalk doesn't hurt the BLEU score with proper retraining.
(Image-captioning overview slide adapted from CS231n Lecture 10, Fei-Fei Li, Andrej Karpathy & Justin Johnson, 8 Feb 2016.)
• Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”
• Mao et al., “Explain Images with Multimodal Recurrent Neural Networks”
• Vinyals et al., “Show and Tell: A Neural Image Caption Generator”
• Donahue et al., “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”
• Chen and Zitnick, “Learning a Recurrent Visual Representation for Image Caption Generation”
Pruning NeuralTalk and LSTM
• Original: a basketball player in a white uniform is playing with a ball
  Pruned 90%: a basketball player in a white uniform is playing with a basketball
• Original: a brown dog is running through a grassy field
  Pruned 90%: a brown dog is running through a grassy area
• Original: a soccer player in red is running in the field
  Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning Neural Machine Translation
Abi See, “CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation”
Pruning Neural Machine Translation
Word embedding: dark means zero (redundant), white means non-zero (useful). [Corresponding sparsity visualization for the LSTM weights.]
Abi See, “CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation”
Speedup (FC layer)
• Intel Core i7 5930K: MKL CBLAS GEMV vs. MKL SPBLAS CSRMV
• NVIDIA GeForce GTX Titan X: cuBLAS GEMV vs. cuSPARSE CSRMV
• NVIDIA Tegra K1: cuBLAS GEMV vs. cuSPARSE CSRMV
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
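The comparison is dense GEMV against sparse CSRMV on the pruned layer. A rough CPU-only sketch of the same measurement with NumPy/SciPy; the layer shape and density are illustrative (similar to a pruned AlexNet fc7), and absolute timings will differ from the MKL/cuBLAS/cuSPARSE numbers reported in the paper:

```python
import time
import numpy as np
import scipy.sparse as sp

# A 4096 x 4096 FC layer pruned to roughly 10% density.
rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)
W[rng.random(W.shape) > 0.10] = 0.0            # keep ~10% of the weights
W_csr = sp.csr_matrix(W)                       # compressed sparse row format
x = rng.standard_normal(4096).astype(np.float32)

def bench(fn, reps=100):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

t_dense = bench(lambda: W @ x)                 # dense GEMV
t_sparse = bench(lambda: W_csr @ x)            # sparse CSRMV
print(f"dense {t_dense*1e3:.2f} ms, sparse {t_sparse*1e3:.2f} ms, "
      f"speedup {t_dense/t_sparse:.1f}x")
```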
Energy Efficiency (FC layer)
• Intel Core i7 5930K: CPU socket and DRAM power reported by the pcm-power utility
• NVIDIA GeForce GTX Titan X: power reported by the nvidia-smi utility
• NVIDIA Tegra K1: total power measured with a power meter; after accounting for 15% AC-to-DC conversion loss, 85% regulator efficiency, and 15% power consumed by peripheral components, about 60% of the measured power is attributed to the AP+DRAM
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
2. Weight Sharing (Trained Quantization)
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Weight Sharing: Overview
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
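Weight sharing clusters the remaining (non-zero) weights of each layer with k-means, so every weight is stored as a small index into a codebook of shared centroid values; the paper initializes the centroids linearly over the weight range. A minimal SciPy sketch (cluster count, helper name, and layer shape are illustrative):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def share_weights(weights, n_clusters=16):
    """Cluster non-zero weights into n_clusters centroids (a 4-bit codebook).

    Returns the codebook, per-weight cluster indices, and the quantized
    weight matrix reconstructed from the codebook.
    """
    nz = weights[weights != 0].reshape(-1, 1).astype(np.float64)
    # Linear initialization over [min, max] preserves the rare but
    # important large-magnitude weights better than random init.
    init = np.linspace(nz.min(), nz.max(), n_clusters).reshape(-1, 1)
    codebook, labels = kmeans2(nz, init, minit="matrix")

    quantized = np.zeros_like(weights)
    indices = np.zeros(weights.shape, dtype=np.int8)
    indices[weights != 0] = labels          # index 0 at pruned positions is a placeholder;
    quantized[weights != 0] = codebook[labels, 0]  # the real format stores only non-zeros
    return codebook[:, 0], indices, quantized

# Example on a pruned FC layer (about 10% of the weights are non-zero)
W = np.random.randn(300, 784) * (np.random.rand(300, 784) < 0.1)
codebook, idx, W_q = share_weights(W, n_clusters=16)
print("codebook:", np.round(codebook, 3))
```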
Finetune Centroids
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
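During fine-tuning, the gradients of all weights that share a centroid are summed, and that sum updates the shared value in the codebook. A self-contained NumPy sketch of one such update (codebook size, learning rate, and the stand-in gradient are illustrative):

```python
import numpy as np

def update_centroids(codebook, indices, weight_grad, mask, lr=0.01):
    """Sum the gradient of every weight assigned to each centroid and
    take one SGD step on the shared values (the codebook)."""
    grad_per_centroid = np.zeros_like(codebook)
    for k in range(len(codebook)):
        members = (indices == k) & mask          # weights that share centroid k
        grad_per_centroid[k] = weight_grad[members].sum()
    return codebook - lr * grad_per_centroid

# Stand-in data: a 4-entry codebook, random cluster assignments, random grads.
codebook = np.array([-0.5, -0.1, 0.1, 0.5])
indices = np.random.randint(0, 4, size=(300, 784))
mask = np.random.rand(300, 784) < 0.1            # only un-pruned weights count
grad = np.random.randn(300, 784)                 # dL/dW from backprop, faked here
codebook = update_centroids(codebook, indices, grad, mask)
```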
Accuracy ~ #Bits on 5 Conv Layers + 3 FC Layers
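The x-axis of this experiment is the number of index bits b per weight, which fixes the codebook size at k = 2^b centroids per layer. A sketch of the storage accounting for a layer with n non-zero weights, assuming 32-bit original weights and ignoring the sparse-index overhead:

\[
k = 2^{b}, \qquad \text{compression rate} \approx \frac{32\,n}{n\,b + 32\,k}
\]

For example, b = 5 for FC layers gives k = 32 shared weights per layer, and b = 8 for conv layers gives k = 256; in the Deep Compression paper these settings lose no accuracy.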