Deep Compression and EIE: Deep Neural Network Model Compression and Efficient Inference Engine
Song Han, CVA group, Stanford University
Apr 7, 2016
Intro about me and my advisor
• Song Han: fourth-year PhD student with Prof. Bill Dally at Stanford.
• Research interest: deep learning model compression and hardware acceleration, to make inference more efficient for deployment.
• Recent work on “Deep Compression” and “EIE: Efficient Inference Engine”, covered by TheNextPlatform, O’Reilly, TechEmergence & HackerNews.
• Bill Dally: Professor at Stanford University and former chairman of the CS department; leads the Concurrent VLSI Architecture Group.
• Chief Scientist of NVIDIA.
• Member of the National Academy of Engineering, Fellow of the American Academy of Arts & Sciences, Fellow of the IEEE, Fellow of the ACM, and recipient of numerous other awards.
Thanks to my collaborators
• NVIDIA: Jeff Pool, John Tran, Bill Dally
• Stanford: Xingyu Liu, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally
• Tsinghua: Huizi Mao, Song Yao, Yu Wang
• Berkeley: Forrest Iandola, Matthew Moskewicz, Khalid Ashraf, Kurt Keutzer
You’ll be interested in his GTC talk: S6417 - FireCaffe
This Talk:
• Deep Compression [1,2]: a deep neural network model compression pipeline.
• EIE Accelerator [3]: an efficient inference engine that accelerates the compressed deep neural network model.
• SqueezeNet++ [4,5]: ConvNet architecture design space exploration.
[1] Han et al., “Learning both Weights and Connections for Efficient Neural Networks”, NIPS 2015
[2] Han et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, ICLR 2016
[3] Han et al., “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016
[4] Iandola, Han, et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size”, arXiv 2016
[5] Yao, Han, et al., “Hardware-friendly convolutional neural network with even-number filter size”, ICLR 2016 workshop
Deep Learning: the Next Wave of AI
Image Recognition • Speech Recognition • Natural Language Processing
Applications
The Problem: If Running DNN on Mobile…
App developers suffer from the model size.
“At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB.” —Andrew Ng
The Problem: If Running DNN on Mobile…
Hardware engineers suffer from the model size (embedded systems have limited resources).
The Problem: If Running DNN on the Cloud…
Network Delay • Power Budget • User Privacy
Intelligent but Inefficient
Deep Compression
Problem 1: Model Size. Solution 1: Deep Compression.
• Smaller Size: compress mobile app size by 35x-50x
• Accuracy: no loss of accuracy / improved accuracy
• Speedup: make inference faster
EIE Accelerator
Problem 2: Latency, Power, Energy. Solution 2: ASIC accelerator.
• Offline: no dependency on network connection
• Real Time: no network delay, high frame rate
• Low Power: high energy efficiency that preserves battery
Part 1: Deep Compression
• AlexNet: 35x, 240MB => 6.9MB => 0.47MB (510x)
• VGG16: 49x, 552MB => 11.3MB
• With no loss of accuracy on ImageNet-2012
• Weights fit in on-chip SRAM, taking 120x less energy than DRAM
1. Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
2. Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
3. Iandola, Han, et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, ECCV submission
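As a quick check, the quoted ratios follow directly from the model sizes above (the 0.47 MB figure is the compressed SqueezeNet model at AlexNet-level accuracy, reference [4] in the talk outline):

\[
\frac{240\ \text{MB}}{6.9\ \text{MB}} \approx 35\times, \qquad
\frac{552\ \text{MB}}{11.3\ \text{MB}} \approx 49\times, \qquad
\frac{240\ \text{MB}}{0.47\ \text{MB}} \approx 510\times
\]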
1. Pruning
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning: Motivation
• Trillions of synapses are generated in the human brain during the first few months after birth.
• At 1 year old, the synapse count peaks at 1,000 trillion, and pruning begins to occur.
• By 10 years old, a child has nearly 500 trillion synapses.
• This “pruning” mechanism removes redundant connections in the brain. [1]
[1] Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172, 2013.
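By analogy, network pruning removes redundant connections whose weights have small magnitude, then retrains the remaining ones. A minimal NumPy sketch of magnitude-based pruning in the spirit of the NIPS'15 paper (the function name, sparsity level, and layer shape are illustrative, not from the released code):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights, keeping (1 - sparsity) of them.

    Returns the pruned weights and a binary mask; during retraining the mask
    is re-applied after every gradient update so pruned weights stay zero.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                 # number of weights to remove
    threshold = np.partition(flat, k)[k]          # magnitude of the k-th smallest weight
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

# Example: prune a fully-connected layer's weight matrix to 90% sparsity
W = np.random.randn(300, 784).astype(np.float32)
W_pruned, mask = prune_by_magnitude(W, sparsity=0.9)
print(f"kept {mask.mean():.1%} of the weights")
```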
Retrain to Recover Accuracy
[Plot: accuracy loss vs. parameters pruned away (40%-100%), comparing L1/L2 regularization with and without retraining, and L2 regularization with iterative pruning and retraining.]
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
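The plot's takeaway: retraining after pruning recovers the lost accuracy, and iterating prune/retrain pushes the achievable sparsity further. A rough PyTorch sketch of that schedule, using random stand-in data and an illustrative sparsity ramp rather than the paper's exact recipe:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(256, 784), torch.randint(0, 10, (256,))   # stand-in for real data

masks = {}
for sparsity in (0.5, 0.7, 0.9):                  # increase sparsity each round
    # Prune: zero the smallest-magnitude weights of each Linear layer.
    for name, p in model.named_parameters():
        if p.dim() == 2:
            thresh = p.abs().flatten().kthvalue(int(sparsity * p.numel())).values
            masks[name] = (p.abs() >= thresh).float()
            p.data.mul_(masks[name])
    # Retrain: keep pruned weights at zero by re-applying the mask after each step.
    for step in range(100):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        for name, p in model.named_parameters():
            if name in masks:
                p.data.mul_(masks[name])
```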
Pruning: Results on 4 ConvNets
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
AlexNet & VGGNet
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Mask Visualization
Visualization of the first FC layer's sparsity pattern of LeNet-300-100. It has a banded structure repeated 28 times, which corresponds to the un-pruned parameters in the center of the images, since the digits are written in the center.
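A small sketch of how such a mask could be rendered; the mask here is fabricated (keeping only weights connected to the central columns of every image row) purely to show where the 28 repeated bands come from when a 28x28 image is flattened row by row:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in mask for the first FC layer of LeNet-300-100 (300 x 784).
# In practice this would be the binary mask produced by pruning; here we
# fabricate one that keeps only the central pixel columns of each image row.
img = np.zeros((28, 28), dtype=bool)
img[:, 8:20] = True                  # central columns of every image row
mask = np.zeros((300, 784), dtype=bool)
mask[:, :] = img.ravel()             # every output unit keeps only central inputs

plt.imshow(mask, cmap="gray", aspect="auto")
plt.xlabel("input pixel index (28x28 image, flattened row by row)")
plt.ylabel("output unit")
plt.title("Sparsity pattern: 28 repeated bands, one per image row")
plt.show()
```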
Pruning NeuralTalk and LSTM (Image Captioning)
• Pruning away 90% of the parameters in NeuralTalk doesn't hurt the BLEU score with proper retraining.
(Image-captioning overview slide adapted from CS231n Lecture 10, Fei-Fei Li, Andrej Karpathy & Justin Johnson, 8 Feb 2016.)
• Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”
• Mao et al., “Explain Images with Multimodal Recurrent Neural Networks”
• Vinyals et al., “Show and Tell: A Neural Image Caption Generator”
• Donahue et al., “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”
• Chen and Zitnick, “Learning a Recurrent Visual Representation for Image Caption Generation”
Pruning NeuralTalk and LSTM
• Original: a basketball player in a white uniform is playing with a ball
  Pruned 90%: a basketball player in a white uniform is playing with a basketball
• Original: a brown dog is running through a grassy field
  Pruned 90%: a brown dog is running through a grassy area
• Original: a soccer player in red is running in the field
  Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning Neural Machine Translation
Abi See, “CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation”
Pruning Neural Machine Translation
Word embedding: dark means zero (redundant), white means non-zero (useful). [Corresponding sparsity visualization for the LSTM weights.]
Abi See, “CS224N Final Project: Exploiting the Redundancy in Neural Machine Translation”
Speedup (FC layer)
• Intel Core i7 5930K: MKL CBLAS GEMV vs. MKL SPBLAS CSRMV
• NVIDIA GeForce GTX Titan X: cuBLAS GEMV vs. cuSPARSE CSRMV
• NVIDIA Tegra K1: cuBLAS GEMV vs. cuSPARSE CSRMV
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
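The comparison is dense GEMV against sparse CSRMV on the pruned layer. A rough CPU-only sketch of the same measurement with NumPy/SciPy; the layer shape and density are illustrative (similar to a pruned AlexNet fc7), and absolute timings will differ from the MKL/cuBLAS/cuSPARSE numbers reported in the paper:

```python
import time
import numpy as np
import scipy.sparse as sp

# A 4096 x 4096 FC layer pruned to roughly 10% density.
rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)
W[rng.random(W.shape) > 0.10] = 0.0            # keep ~10% of the weights
W_csr = sp.csr_matrix(W)                       # compressed sparse row format
x = rng.standard_normal(4096).astype(np.float32)

def bench(fn, reps=100):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

t_dense = bench(lambda: W @ x)                 # dense GEMV
t_sparse = bench(lambda: W_csr @ x)            # sparse CSRMV
print(f"dense {t_dense*1e3:.2f} ms, sparse {t_sparse*1e3:.2f} ms, "
      f"speedup {t_dense/t_sparse:.1f}x")
```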
Energy Efficiency (FC layer)
• Intel Core i7 5930K: CPU socket and DRAM power reported by the pcm-power utility
• NVIDIA GeForce GTX Titan X: power reported by the nvidia-smi utility
• NVIDIA Tegra K1: total power measured with a power meter; after accounting for 15% AC-to-DC conversion loss, 85% regulator efficiency, and 15% power consumed by peripheral components, about 60% of the measured power is attributed to the AP+DRAM
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
2. Weight Sharing (Trained Quantization)
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
Weight Sharing: Overview
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
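Weight sharing clusters the remaining (non-zero) weights of each layer with k-means, so every weight is stored as a small index into a codebook of shared centroid values; the paper initializes the centroids linearly over the weight range. A minimal SciPy sketch (cluster count, helper name, and layer shape are illustrative):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def share_weights(weights, n_clusters=16):
    """Cluster non-zero weights into n_clusters centroids (a 4-bit codebook).

    Returns the codebook, per-weight cluster indices, and the quantized
    weight matrix reconstructed from the codebook.
    """
    nz = weights[weights != 0].reshape(-1, 1).astype(np.float64)
    # Linear initialization over [min, max] preserves the rare but
    # important large-magnitude weights better than random init.
    init = np.linspace(nz.min(), nz.max(), n_clusters).reshape(-1, 1)
    codebook, labels = kmeans2(nz, init, minit="matrix")

    quantized = np.zeros_like(weights)
    indices = np.zeros(weights.shape, dtype=np.int8)
    indices[weights != 0] = labels          # index 0 at pruned positions is a placeholder;
    quantized[weights != 0] = codebook[labels, 0]  # the real format stores only non-zeros
    return codebook[:, 0], indices, quantized

# Example on a pruned FC layer (about 10% of the weights are non-zero)
W = np.random.randn(300, 784) * (np.random.rand(300, 784) < 0.1)
codebook, idx, W_q = share_weights(W, n_clusters=16)
print("codebook:", np.round(codebook, 3))
```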
Finetune Centroids
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
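During fine-tuning, the gradients of all weights that share a centroid are summed, and that sum updates the shared value in the codebook. A self-contained NumPy sketch of one such update (codebook size, learning rate, and the stand-in gradient are illustrative):

```python
import numpy as np

def update_centroids(codebook, indices, weight_grad, mask, lr=0.01):
    """Sum the gradient of every weight assigned to each centroid and
    take one SGD step on the shared values (the codebook)."""
    grad_per_centroid = np.zeros_like(codebook)
    for k in range(len(codebook)):
        members = (indices == k) & mask          # weights that share centroid k
        grad_per_centroid[k] = weight_grad[members].sum()
    return codebook - lr * grad_per_centroid

# Stand-in data: a 4-entry codebook, random cluster assignments, random grads.
codebook = np.array([-0.5, -0.1, 0.1, 0.5])
indices = np.random.randint(0, 4, size=(300, 784))
mask = np.random.rand(300, 784) < 0.1            # only un-pruned weights count
grad = np.random.randn(300, 784)                 # dL/dW from backprop, faked here
codebook = update_centroids(codebook, indices, grad, mask)
```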
Accuracy ~ #Bits on 5 Conv Layers + 3 FC Layers
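The x-axis of this experiment is the number of index bits b per weight, which fixes the codebook size at k = 2^b centroids per layer. A sketch of the storage accounting for a layer with n non-zero weights, assuming 32-bit original weights and ignoring the sparse-index overhead:

\[
k = 2^{b}, \qquad \text{compression rate} \approx \frac{32\,n}{n\,b + 32\,k}
\]

For example, b = 5 for FC layers gives k = 32 shared weights per layer, and b = 8 for conv layers gives k = 256; in the Deep Compression paper these settings lose no accuracy.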