Reducing Dynamic Power in Streaming CNN Hardware Accelerators by Exploiting Computational Redundancies
Duvindu Piyasena, Rukshan Wickramasinghe, Debdeep Paul, Siew-Kei Lam and Meiqing Wu
School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore
Email: siewkei_lam@pmail.ntu.edu.sg
Motivation
• ReLU discards negative convolution activations, causing high computational redundancy in CNNs.
• Widely-used CNN models discard 30%-90% of the CONV activations in a given layer.
[Figure: ReLU activation function]
Proposed Method
• We propose a method to eliminate the computational redundancies and thereby save dynamic power in FPGA stream-based CNN accelerators.
• The method eliminates the computational redundancies arising from ReLU activation by predicting the positive/negative CONV activations with a low-cost approximation scheme (a behavioral sketch follows below).
[Figure: Conventional CNN layer vs. proposed CNN layer]
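A minimal Python sketch of the prediction idea, assuming a single 2-D filter: weights are snapped to power-of-two levels (so the approximate convolution reduces to shift-and-add in hardware), and the sign of the approximate sum predicts whether the exact activation survives ReLU. The helper names (quantize_pow2, predict_positive) and the round-to-zero cutoff are our illustrative choices, not from the paper.

```python
import numpy as np

def quantize_pow2(weights, m, n_levels):
    """Snap each weight to the nearest level in
    {0, +/-(1/2)^m, ..., +/-(1/2)^(m + n_levels - 1)} (shift-friendly in HW)."""
    w = np.asarray(weights, dtype=np.float64)
    mag = np.abs(w)
    safe = np.where(mag > 0, mag, 1.0)                   # avoid log2(0)
    exp = np.clip(np.round(-np.log2(safe)), m, m + n_levels - 1)
    q = np.sign(w) * 0.5 ** exp
    # Assumed policy: weights well below the smallest level map to the 0 level.
    q[mag < 0.5 ** (m + n_levels - 1) / 2.0] = 0.0
    return q

def predict_positive(window, q_weights):
    """ApproxConv: dot product with power-of-two weights (shift-and-add in
    hardware); True means the exact activation is predicted to survive ReLU."""
    return float(np.sum(window * q_weights)) > 0.0
```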
Contribution
• We propose a hardware-friendly convolution approximation method that relies on power-of-two quantized weights.
• We show that the proposed methodology can be applied to various CNN models to significantly reduce the convolution operations, without compromising accuracy or requiring retraining.
• We propose a streaming CNN FPGA accelerator that integrates our approximation method and demonstrate that notable power/energy savings can be achieved.
Proposed Method
1. Initialize: saturate the original weights at the 99th percentile (= W_99); set N_L = 8 and m = log2(W_99).
2. Perform logarithmic quantization: W_a = {0, ±(½)^m, ±(½)^(m+1), ..., ±(½)^(m+N_L-1)}; set the ApproxConv weights to the levels in W_a.
3. Validate on the modified model.
4. If the accuracy loss Δ < 1%, reduce the quantization level count (N_L = N_L - 1) and repeat from step 2.
5. Otherwise, the final quantization mapping is W_a = {0, ±(½)^m, ±(½)^(m+1), ..., ±(½)^(m+N_L)}, i.e., the last level count for which the loss stayed below 1%.
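A hedged sketch of the search loop above, reusing quantize_pow2 from the earlier sketch. `validate(wa)` stands for an assumed callable returning model accuracy with ApproxConv weights wa; we read m = log2(W_99) as choosing the largest level (½)^m near W_99, and the 1% tolerance and N_L = 8 start follow the slide.

```python
import numpy as np

def search_levels(weights, baseline_acc, validate, tol=0.01):
    w99 = np.percentile(np.abs(weights), 99)     # 1. saturate at the 99th percentile
    w_sat = np.clip(weights, -w99, w99)
    m = int(np.ceil(-np.log2(w99)))              # largest level (1/2)^m ~ W_99 (our reading)
    n_levels = 8
    while n_levels > 1:
        wa = quantize_pow2(w_sat, m, n_levels)   # 2. logarithmic quantization
        if baseline_acc - validate(wa) >= tol:   # 3. validation: loss too large,
            n_levels += 1                        #    so revert to the last good count
            break
        n_levels -= 1                            # 4. loss < 1%: try one fewer level
    return quantize_pow2(w_sat, m, n_levels), n_levels   # 5. final mapping
```

The revert-on-failure branch matches step 5, which keeps one more level than the failing count.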
Implementation
• Evaluated designs for the quantization level search:
  – Prop-1: approximation applied across all layers
  – Prop-2: approximation applied across all layers except the 1st
[Chart: quantization level search results for Prop-1 and Prop-2]
Implementation
• Implementation done in Verilog HDL for LeNet:
  – Operating frequency: 100 MHz
  – Device: Xilinx Virtex UltraScale+ xcvu9p
  – Synthesis tool: Xilinx Vivado 2018.3
  – Simulator: Mentor ModelSim 10.3
  – Power estimation mode: post-synthesis timing simulations
• Power gains are achieved by clock-gating the CONV circuitry via the ApproxConv predictions (see the behavioral model below).
[Figure: Baseline HW vs. proposed HW (single layer)]
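For intuition, a behavioral Python model (not the RTL) of the gated datapath: the exact multiply-accumulate runs only when ApproxConv predicts a positive activation, and the skip rate stands in for the fraction of clock-gated windows. gated_conv2d and its unit-stride, valid-padding layout are assumptions for illustration.

```python
import numpy as np

def gated_conv2d(image, weights, q_weights):
    """Model one CONV+ReLU output map with prediction-based gating.
    q_weights are the power-of-two ApproxConv weights for the same filter."""
    kh, kw = weights.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    skipped = 0
    for i in range(oh):
        for j in range(ow):
            win = image[i:i + kh, j:j + kw]
            if np.sum(win * q_weights) > 0.0:                # ApproxConv (shifts in HW)
                out[i, j] = max(np.sum(win * weights), 0.0)  # exact CONV + ReLU
            else:
                skipped += 1                                 # CONV circuitry stays gated
    return out, skipped / (oh * ow)
```

Note the asymmetry: a false-positive prediction only wastes power (ReLU still zeroes the exact result), while a false-negative forces a zero output, which is why the level count is chosen to keep the accuracy loss below 1%.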
Accuracy and Hardware Evaluations
• Compared with SignConnect, proposed in previous work*, which uses the sign of the weights to perform the approximations:
  – SignConnect-1: approximation applied across all layers
  – SignConnect-2: approximation applied across all layers except the 1st

* T. Ujiie, M. Hiromoto, and T. Sato, "Approximated prediction strategy for reducing power consumption of convolutional neural network processor," in 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2016, pp. 870–876.
Summary
• Methodology to determine the minimal number of power-of-two quantization levels for realizing lightweight convolution approximations that can predict the positive and negative convolution activations.
• Proposed a streaming CNN FPGA accelerator that integrates our approximation method.
• FPGA synthesis results show that the dynamic power can be reduced by 10-12% while maintaining good accuracy.
Thank You