  1. SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks. Vahideh Akhlaghi*, Amir Yazdanbakhsh*†, Kambiz Samadi‡, Rajesh K. Gupta, Hadi Esmaeilzadeh. *Equal contribution. University of California, San Diego; †Georgia Institute of Technology; ‡Qualcomm Technologies, Inc. ISCA '18

  2. CNNs perform trillions of operations to classify a single input (e.g., an image of a dog):

     CNN model  | Operations per inference
     VGG-16     | 16,362,000,000,000 Ops
     AlexNet    |  1,147,000,000,000 Ops
     GoogLeNet  |    283,000,000,000 Ops
     SqueezeNet |    222,000,000,000 Ops

  3. Convolutions dominate CNN computation: across the same four models, ≥ 90% of these operations are in the convolutional layers.

  4. Research challenge: how can CNN computation be reduced with minimal effect on accuracy? Our solution, SnaPEA: (1) leverage the algorithmic structure of CNNs, (2) exploit runtime information, and (3) tune with a static multi-variable optimization.

  5. (1) The algorithmic structure of CNNs guides SnaPEA. [Figure: a CNN pipeline of repeated Conv → ReLU stages followed by normalization and pooling; each kernel's convolution produces an output $x_{lkj}$ that feeds the ReLU, which produces $y_{lkj}$.]

  6. (1) The algorithmic structure of CNNs guides SnaPEA: every convolution output passes through a Rectified Linear Unit (ReLU), $y_{lkj} = \max(0, x_{lkj})$, where $x_{lkj}$ is the corresponding convolution output.

  7. Opportunity to reduce computation. [Figure: percentage of negative inputs to the activation layers in AlexNet, GoogLeNet, SqueezeNet, and VGGNet, alongside GoogLeNet feature maps in which black pixels are zero values.] A large fraction of convolution outputs are negative: 61% on average.

  8. Early termination of convolution. [Figure: input and output feature maps; blue boxes mark the operations performed in two highlighted convolution windows.] Because ReLU maps every negative output to zero, a convolution whose output is known to end up negative can be cut short without changing the result.

  9. (2) Runtime information enables reducing computation. [Figure: Conv → ReLU pipeline and GoogLeNet activation maps.] The distribution of zero and non-zero outputs varies from input to input, so it can only be exploited with runtime information.

  10. SnaPEA principles. By leveraging the algorithmic structure of CNNs and runtime information, SnaPEA (1) reduces computation without accuracy loss, (2) trades accuracy for further computation reduction, and (3) adds minimal hardware overhead.

  11. SnaPEA: an illustrative example. [Figure: an original convolution multiplies a kernel w with mixed-sign weights against all-positive inputs x; the sign of the partial sum fluctuates as terms accumulate, and ReLU maps the final negative result to 0.]

  12. SnaPEA: an illustrative example (exact mode). [Figure: SnaPEA reorders the kernel so all positive weights come first and all negative weights come last. Because the inputs are non-negative (post-ReLU), the partial sum rises monotonically through the positive weights and then falls monotonically through the negative ones; as soon as it drops to or below zero in the negative region, the output is guaranteed to be negative, and the convolution terminates early with output 0.]
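
A minimal NumPy sketch of the exact mode for a single output value (the function name and the flattened window/kernel interface are illustrative assumptions, not the paper's implementation):

    import numpy as np

    def exact_mode_output(x, w):
        # x: flattened, non-negative input window (post-ReLU activations)
        # w: flattened kernel weights
        order = np.argsort(w)[::-1]   # positive weights first, negative last
        acc = 0.0
        for i in order:
            acc += w[i] * x[i]
            # In the negative-weight region every remaining product is <= 0,
            # so a non-positive partial sum can never recover: stop early.
            if w[i] < 0 and acc <= 0:
                return 0.0
        return max(acc, 0.0)          # apply ReLU to the completed sum

Because the early exit fires only when the final sum is provably negative, this mode saves operations with no accuracy loss.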

  13. Potential benefits in the exact mode. [Figure: percentage of negative weights in AlexNet, GoogLeNet, SqueezeNet, and VGGNet.] On average, 54% of the weights are negative, so a large portion of every convolution lies in the early-terminable region.

  14. SnaPEA: an illustrative example (predictive mode). [Figure: before the full convolution, SnaPEA runs a few speculation operations using the largest-magnitude weights. If the speculative partial sum is at or below a threshold th, the output is predicted to be negative and is set to 0 immediately; otherwise ("No"), the remaining multiply-accumulate operations are performed and ReLU is applied as usual.]
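
Continuing the sketch above, the predictive mode might look as follows (th and n are the per-kernel speculation parameters introduced on the next slides; the interface remains illustrative):

    import numpy as np

    def predictive_mode_output(x, w, n, th):
        # Speculate with the n largest-magnitude weights.
        order = np.argsort(-np.abs(w))
        spec = float(np.dot(w[order[:n]], x[order[:n]]))
        if spec <= th:
            return 0.0                        # predict a negative output
        return max(float(np.dot(w, x)), 0.0)  # otherwise run the full convolution

A misprediction here clamps a would-be positive activation to zero; that is the source of the accuracy loss that the optimization on the following slides budgets for.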

  15. Speculation operations. [Figure: kernel weights sorted by absolute value from large to small. Speculation either uses the n largest-magnitude weights directly, or splits the sorted weights into n groups and takes the largest-magnitude weight from each group.]
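
Both selection policies from the figure are easy to express; this is a sketch, and the paper's actual grouping mechanics may differ:

    import numpy as np

    def speculation_indices(w, n):
        # Assumes 1 <= n <= len(w).
        order = np.argsort(-np.abs(w))      # indices, magnitude-sorted descending
        top_n = order[:n]                   # policy 1: the n largest weights
        groups = np.array_split(order, n)   # policy 2: n groups over the sorted list,
        per_group = np.array([g[0] for g in groups])  # largest weight from each group
        return top_n, per_group

Taking one weight per group spreads the speculation operations across the kernel's whole magnitude range instead of concentrating them on its heaviest weights.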

  16. Optimize the level of speculation. Speculation parameters: Th, the prediction threshold, and N, the number of speculation operations. The goal is to find a (Th, N) pair for every kernel in the CNN that minimizes the number of operations while satisfying the accuracy constraint.

  17. Optimize the level of speculation. [Figure: all convolution kernels across layers 1..L pass through three stages — kernel profiling, local optimization, and global optimization — to produce a (Th, N) setting per kernel.]

  18. Kernel profiling: per-kernel sensitivity analysis. [Figure: for each kernel, profiling sweeps the threshold value and records the number of operations performed and the effect on the network's output over profiling inputs (e.g., whether a dog image is still classified as "Dog").]

  21. Local optimization: a set of candidate configurations per layer. [Figure: the profiled kernels of each layer (kernel 1..k) are combined into per-layer sets of (Th, N) configurations.]

  23. Global optimization: adjust the parameters to account for cross-layer effects. [Figure: the per-layer configurations are tuned jointly across layers 1..L and validated end to end on profiling inputs (dog, cat, bird images), since an aggressive setting in an early layer changes the activations seen by every later layer.]
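
A hedged sketch of this three-stage search; the candidate grids and the ops_cost and accuracy callbacks are hypothetical stand-ins, and the paper's optimizer (built on top of Caffe) is more elaborate:

    import itertools

    def tune_speculation(kernels, th_grid, n_grid, ops_cost, accuracy, target):
        # Kernel profiling + local optimization: rank each kernel's candidate
        # (Th, N) pairs by profiled operation count, cheapest first; the final
        # (None, 0) entry means "run this kernel in exact mode".
        ranked = {
            k: sorted(itertools.product(th_grid, n_grid),
                      key=lambda c: ops_cost(k, *c)) + [(None, 0)]
            for k in kernels
        }
        config = {k: ranked[k][0] for k in kernels}   # most aggressive guess
        idx = {k: 0 for k in kernels}

        # Global optimization: while end-to-end accuracy misses the target,
        # step kernels back toward their exact-mode fallback.
        while accuracy(config) < target:
            movable = [k for k in kernels if idx[k] + 1 < len(ranked[k])]
            if not movable:
                break          # every kernel is exact; accuracy is at its ceiling
            k = movable[0]     # simplest policy: relax kernels in a fixed order
            idx[k] += 1
            config[k] = ranked[k][idx[k]]
        return config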

  26. SnaPEA: hardware implementation. [Figure: an n×m array of processing engines (PEs), each built from MAC units paired with Prediction Activation Units (PAUs). In exact mode, a PAU monitors the sign bit of the partial result and terminates the computation once it goes negative; in predictive mode, it compares the speculative partial sum against the threshold and terminates on a predicted-negative output.] Only low-overhead sign and threshold checks are added to the hardware.

  27. SnaPEA: hardware implementation. [Figure: inside a processing engine (PE): K compute lanes, each containing a MAC unit and a Prediction Activation Unit (PAU), fed by a weight and in/out buffer and an index buffer.]

  28. Experimental setup.

  Benchmarks:

     CNN model  | Year | Top-1 accuracy | Top-5 accuracy
     AlexNet    | 2012 | 57.2%          | 80.1%
     GoogLeNet  | 2015 | 68.7%          | 89.0%
     SqueezeNet | 2016 | 57.5%          | 80.3%
     VGG-16     | 2014 | 70.5%          | 89.9%

  Optimization: the optimization algorithm is built on top of Caffe.
  Hardware implementation: cycle-accurate simulation; power estimated with Design Compiler on TSMC 45 nm.
  Baseline design: Eyeriss with the same number of MAC units (256); SnaPEA's area overhead over Eyeriss is 4.5%.

  29. Experimental results. [Figure: speedup over Eyeriss for AlexNet, GoogLeNet, SqueezeNet, VGGNet, and the geometric mean, in exact mode and in predictive mode with accuracy loss bounded by 1%, 2%, and 3%; the bars range from roughly 1.24x to 2.08x and grow as the accuracy budget loosens.]

  30. Experimental results: layers in the predictive mode for accuracy loss ≤ 3%.

     Network    | % of conv layers | Speedup | Energy improvement
     AlexNet    | 60.0             | 2.11x   | 1.97x
     GoogLeNet  | 84.2             | 2.14x   | 2.04x
     SqueezeNet | 65.4             | 1.94x   | 1.84x
     VGGNet     | 61.5             | 1.87x   | 1.73x

  On average, 68% of the layers operate in the predictive mode at a 3% accuracy drop.

  31. Experimental results. [Figure: per-layer speedup.] The highest per-layer speedup is 3.6x, observed in a GoogLeNet layer.

  32. Conclusion. SnaPEA exploits the algorithmic structure of CNNs and runtime information to reduce computation in convolutional layers, controls accuracy with a multi-variable optimization, and adds minimal hardware overhead. Future directions: leverage richer runtime information (e.g., patterns in inputs and activations), extend to other activation functions (e.g., sigmoid), and tune the hardware for more parallelism.
