DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi 49th International Conference on Parallel Processing – ICPP August 2020

Outline
➢ Introduction
➢ Background
➢ Integrated Photonic Residue Arithmetic Computing Engine for Convolutional Neural Network
➢ Performance Evaluation
➢ Conclusion

Introduction

Introduction
➢ Some NN applications require real-time analysis for inference
➢ Inference is computation-intensive: it involves billions of multiply-accumulate (MAC) operations
➢ We propose DNNARA: a deep neural network accelerator using residue arithmetic based on integrated photonics
➢ All computations through the neural network are done in the residue number system (RNS) to avoid extra conversions between binary and RNS
[Figure: Block diagram of a DNNARA system]

Introduction
➢ DNNARA: RNS with wavelength-division multiplexing (WDM)
• Executes multiple matrix-vector multiplications (MVMs) at once thanks to WDM
• Speeds up MVMs because the residue digits are independent of one another
• Residues are small
• Increases system parallelism and saves area/hardware resources

Background
➢ Convolutional Neural Network
➢ Residue Number System

Background – Convolutional Neural Network
➢ Widely applied in classification
• Image recognition
➢ Consists of several layers/functions
• Convolutional layers
• Activation functions: add non-linearity
  • ReLU (Rectified Linear Unit)
  • Sigmoid function / hyperbolic tangent function
• Pooling layers: downsample the output
  • Max pooling
  • Average pooling
• Fully-connected layers
➢ Contains up to billions of multiply-accumulate (MAC) operations

Background - Residue Number System (RNS)
➢ Each integer X is represented by its “residues”, the remainders obtained by dividing it by each modulus $M_i$
• Example: the moduli are $M_1 = 2$, $M_2 = 3$, $M_3 = 5$, $M_4 = 7$
• $X = 20$ is represented as $X = \{0, 2, 0, 6\}_{[2, 3, 5, 7]}$
• Range of numbers that can be represented: 0 to $M - 1$, where $M = M_1 \cdot M_2 \cdot M_3 \cdot M_4$ (here $M = 210$, so 0 to 209)
• Moduli should be pairwise relatively prime
➢ Negative number notation: similar to 2's complement
• $r = \big|\, m - |{-X}|_m \,\big|_m$ (where X is negative)
• Example: $-20 = \{|2-0|_2, |3-2|_3, |5-0|_5, |7-6|_7\}_{[2, 3, 5, 7]} = \{0, 1, 0, 1\}_{[2, 3, 5, 7]}$
• Range of numbers that can be represented: $[-(M-1)/2, (M-1)/2]$ if M is odd, or $[-M/2, M/2-1]$ if M is even
➢ Residue arithmetic: operations are carried out on the residues
• Example: $X = 20 = \{0, 2, 0, 6\}_{[2, 3, 5, 7]}$ and $Y = 5 = \{1, 2, 0, 5\}_{[2, 3, 5, 7]}$
• $X + Y = \{0+1, 2+2, 0+0, 6+5\}_{[2, 3, 5, 7]} = \{1, 1, 0, 4\}_{[2, 3, 5, 7]}$
• $X \cdot Y = \{0 \cdot 1, 2 \cdot 2, 0 \cdot 0, 6 \cdot 5\}_{[2, 3, 5, 7]} = \{0, 1, 0, 2\}_{[2, 3, 5, 7]}$
• Residue arithmetic is carried out as modulo additions and multiplications on the residues
• Residue arithmetic is carried out on each residue in parallel
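The slide's examples can be checked with a few lines of Python. This is only a software sketch of the number system, not of the photonic hardware; the function names (`to_rns`, `rns_add`, `rns_mul`, `from_rns`) are illustrative, and the decoding step uses the Chinese Remainder Theorem, which the slides do not describe.

```python
from math import prod

MODULI = (2, 3, 5, 7)      # pairwise coprime moduli from the slide
M = prod(MODULI)           # dynamic range M = 210


def to_rns(x, moduli=MODULI):
    """Encode an integer as residues; Python's % maps negatives into [0, m)."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Digit-wise modular addition; no carries cross between digits."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Digit-wise modular multiplication."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI, signed=False):
    """Decode via the Chinese Remainder Theorem (an assumption of this sketch)."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)   # modular inverse, Python 3.8+
    x %= total
    if signed and x >= (total + 1) // 2:          # upper half encodes negatives
        x -= total
    return x


x, y = to_rns(20), to_rns(5)
print(x, y)                                # residues of 20 and 5
print(rns_add(x, y), rns_mul(x, y))        # residues of 25 and 100
print(from_rns(to_rns(-20), signed=True))  # recovers -20
```

Because each digit is processed independently, the four modular operations inside `rns_add` or `rns_mul` could run in parallel, which is the property the accelerator relies on.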

Integrated Photonic Residue Arithmetic Computing Engine for Neural Network
➢ Overview
➢ Sigmoid Unit
➢ Residue Adders and Multipliers
➢ Max Pooling Unit
➢ Residue Matrix-Vector Multiplication Unit

Overview Architecture
• R-MVM: Residue Matrix-Vector Multiplication
• R-Multiplier: Residue Multiplier
• R-Adder: Residue Adder
• MRR: Micro-Ring Resonator
• PD: Photo-Detector
• LUT: Look-up Table
• RNS2Bin: RNS to Binary
• Bin2RNS: Binary to RNS
• T: Tile

Integrated Photonic Residue Adder and Multiplier
➢ Basic block
• An electro-optical 2×2 switch
• Light either propagates straight through (“bar” state, panel (a)) or crosses over (“cross” state, panel (b))
➢ Residue adder [1]: one-hot encoding
• Can be considered as a mapping (injection)
• Arbitrary Size Benes (AS-Benes) network (panel (c): even size; panel (d): odd size)
• Switch states are precomputed and stored in a look-up table (LUT)
➢ An AS-Benes modulo-5 adder (e)
• Example: $|3+4|_5 = 2$
➢ A modulo-N residue multiplier implementation (f)
➢ WDM capable
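The one-hot adder can be pictured in software as a routing problem: the first operand decides which input port carries light, and the second operand selects a precomputed port-to-port mapping. The sketch below models only that mapping. It does not model the individual 2×2 switch states of an AS-Benes network, and all names (`build_adder_lut`, `route`, `one_hot_add`) are hypothetical.

```python
def one_hot(value, n):
    """Light on exactly one of n ports encodes a residue in [0, n)."""
    return [1 if port == value else 0 for port in range(n)]


def build_adder_lut(n):
    """For each addend b, store the routing in_port -> out_port = (in_port + b) mod n.

    In hardware the LUT would hold switch states that realise this routing;
    here we keep only the resulting permutation (an assumption of this sketch).
    """
    return {b: [(a + b) % n for a in range(n)] for b in range(n)}


def route(signal, permutation):
    """Move the light on each input port to its assigned output port."""
    out = [0] * len(signal)
    for in_port, level in enumerate(signal):
        out[permutation[in_port]] = level
    return out


def one_hot_add(a, b, n, lut):
    """Encode a as light, configure the network for b, read the lit output port."""
    return route(one_hot(a, n), lut[b]).index(1)


lut5 = build_adder_lut(5)
print(one_hot_add(3, 4, 5, lut5))   # the slide's example |3+4|_5
```

Each stored routing is a cyclic shift, so no two input ports ever map to the same output port.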

Residue MVM (R-MVM) Computing Block
➢ Schematic of the designed R-MVM (b)
➢ Wavelength-division multiplexing (WDM) capable
➢ Requires lasers, MRRs, PDs, LUTs, and registers, as well as photonic and electrical connections
➢ The sel signal chooses either the partial sum or the bias
➢ Example: a 5×5 input feature and a 2×2 kernel
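To make the 5×5 feature / 2×2 kernel example concrete, the sketch below runs the sliding-window MACs separately for each modulus. It assumes stride 1 and no padding (giving a 4×4 output), which the slide does not state, and the feature and kernel values are made up for illustration.

```python
MODULI = (2, 3, 5, 7)


def conv2d_mod(feature, kernel, m, bias=0):
    """Stride-1, unpadded sliding-window MAC with every operation reduced modulo m."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(feature) - kh + 1, len(feature[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            acc = bias % m
            for i in range(kh):
                for j in range(kw):
                    x = feature[r + i][c + j] % m   # residue of the input
                    w = kernel[i][j] % m            # residue of the weight
                    acc = (acc + x * w) % m         # residue MAC
            out[r][c] = acc
    return out


# Made-up 5x5 input feature and 2x2 kernel (illustrative values only).
feature = [[(r * 5 + c) % 4 for c in range(5)] for r in range(5)]
kernel = [[1, 2], [0, 1]]

# One independent residue channel per modulus; these could run in parallel.
residue_maps = {m: conv2d_mod(feature, kernel, m) for m in MODULI}
for m, fmap in residue_maps.items():
    print(m, fmap)
```

Each residue channel only ever handles values smaller than its modulus, which is why the per-digit hardware can stay small.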

Pipeline of a MAC operation
• Cycle 1:
  • Input features (x) are encoded as light with different wavelengths
  • Weights (w) are decoded as the selection lines that load the switch states from the LUT
• Cycle 2:
  • Set up the switch states accordingly
  • Inject light and detect light: multiply
  • MRRs and PDs act as filters to extract the results of all the multiplications

Pipeline of a MAC operation
• Cycle 3:
  • Results from the previous cycle (w*x) are decoded as the selection lines that load the switch states for the adders
  • Depending on sel, either the partial sum or the bias is encoded as the light
• Cycle 4:
  • Set up the adders
  • Inject light and detect light: add
• Cycle 5: Write back to the register
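The five cycles can be mimicked functionally with two look-up tables per modulus. This sketch stores arithmetic results directly in the tables, whereas the hardware LUTs hold switch states; the `sel` convention (0 selects the bias, 1 selects the partial sum) and all function names are assumptions made for illustration.

```python
def build_luts(m):
    """Multiplier and adder tables for one modulus (results, not switch states)."""
    mul_lut = [[(w * x) % m for x in range(m)] for w in range(m)]
    add_lut = [[(a + b) % m for b in range(m)] for a in range(m)]
    return mul_lut, add_lut


def mac_step(x, w, partial_sum, bias, sel, luts):
    """One multiply-accumulate in a single residue channel."""
    mul_lut, add_lut = luts
    product = mul_lut[w][x]                      # cycles 1-2: configure by w, inject x
    addend = bias if sel == 0 else partial_sum   # cycle 3: sel picks the second operand
    return add_lut[product][addend]              # cycles 4-5: add, then write back


def dot_mod(xs, ws, bias, m):
    """Dot product plus bias in one residue channel, built from mac_step."""
    luts = build_luts(m)
    acc = 0
    for k, (x, w) in enumerate(zip(xs, ws)):
        sel = 0 if k == 0 else 1   # first MAC folds in the bias, later ones accumulate
        acc = mac_step(x % m, w % m, acc, bias % m, sel, luts)
    return acc


print(dot_mod([1, 2, 3], [4, 5, 6], 7, 5))   # (1*4 + 2*5 + 3*6 + 7) mod 5
```

In a pipelined design, the multiply stage of one MAC would overlap with the add stage of the previous one; the loop above is sequential only for clarity.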

Sigmoid Function Unit - Polynomial
➢ In the residue domain, it is hard to calculate the sigmoid function directly
➢ Instead, it is approximated by a polynomial, since the sigmoid function can be represented as a Taylor series
➢ The terms that include x need to be pre-calculated, and the connections built accordingly
➢ Example: $P(x) = ax^4 + bx^3 + cx^2 + dx + e$ in a modulo-5 system
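Because a modulo-5 input can only take five values, every value of P(x) can be computed ahead of time. The sketch below evaluates a degree-4 polynomial modulo 5 with Horner's rule and stores the results in a table. The coefficients are hypothetical integers; the slides do not give the actual values, and the Taylor coefficients of the sigmoid are fractions that would first have to be scaled to integers, a step not shown here.

```python
def poly_mod(coeffs, x, m):
    """Horner evaluation of a polynomial with integer coefficients, modulo m.

    coeffs are ordered from the highest degree to the constant term.
    """
    acc = 0
    for c in coeffs:
        acc = (acc * x + c) % m
    return acc


def build_poly_lut(coeffs, m):
    """Pre-calculate P(x) mod m for every possible residue x."""
    return [poly_mod(coeffs, x, m) for x in range(m)]


# Hypothetical integer coefficients (a, b, c, d, e); not taken from the slides.
COEFFS = (1, 0, 3, 2, 4)
lut = build_poly_lut(COEFFS, 5)

for x, value in enumerate(lut):
    print(f"P({x}) mod 5 = {value}")
```

At run time the unit then needs only a table lookup per residue rather than four multiplications and four additions.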

Max Pooling Function Unit
➢ The sign of a number is implicit in RNS, so it cannot be read directly from the residues
➢ Instead, we convert the number from RNS to MRS (mixed-radix number system) [2]
➢ In the MRS, the coefficient $a_4$ associated with the even modulus 2 separates negative from non-negative numbers
➢ The conversion is serial but can be pipelined
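The sign test can be sketched with the standard mixed-radix conversion. The moduli are assumed to be ordered so that 2 comes last, which makes $a_4$ the most significant mixed-radix digit; taking the maximum from the sign of a difference is one plausible use and is not spelled out on the slide. It is only valid while the difference stays inside the signed range.

```python
MODULI = (3, 5, 7, 2)   # same moduli as before, reordered so the even modulus is last


def to_rns(x, moduli=MODULI):
    return tuple(x % m for m in moduli)


def rns_to_mrs(residues, moduli=MODULI):
    """Mixed-radix digits a1..a4 with X = a1 + a2*3 + a3*3*5 + a4*3*5*7.

    Each digit depends on the previous ones, so the conversion is serial.
    """
    r = list(residues)
    digits = []
    for i, mi in enumerate(moduli):
        a = r[i] % mi
        digits.append(a)
        for j in range(i + 1, len(moduli)):
            mj = moduli[j]
            # Residue of (X - a) / mi in channel j, using the inverse of mi mod mj.
            r[j] = ((r[j] - a) * pow(mi, -1, mj)) % mj
    return digits


def is_negative(residues):
    """a4 = 1 means X >= 105, i.e. the upper (negative) half of the range 0..209."""
    return rns_to_mrs(residues)[-1] == 1


def rns_max(a, b):
    """Pick the larger operand from the sign of a - b (assumes no overflow)."""
    diff = tuple((ai - bi) % m for ai, bi, m in zip(a, b, MODULI))
    return b if is_negative(diff) else a


window = [to_rns(v) for v in (3, -8, 15, 0)]   # a made-up 2x2 pooling window
best = window[0]
for candidate in window[1:]:
    best = rns_max(best, candidate)
print(best == to_rns(15))
```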

Performance Evaluation

Experiment Setup
➢ Electrical memory component
• CACTI 7.0 [3]
➢ Optical switch [4]
• Lumerical FDTD
➢ Optical circuit
• Lumerical Interconnect
➢ Lasers/MRRs/PDs
• Data from other work ([5], [6], and [7], respectively)
➢ HyperTransport serial link
• Data from [8]
➢ System-level design
• Our own simulator
[Table: Configurations of Selected Benchmarks]

Design Space Exploration
➢ Swept parameters
• WDM size
• Number of tiles in a chip
• Number of MVMs in a tile
➢ Computation capability
• Number of operations / (time × area × power)

Hardware Specification

Speed & Power Analysis
➢ Real benchmarks
➢ More chips run faster, but the speedup does not scale proportionally
➢ More chips consume more power
➢ Due to communication
➢ 19 times faster than a GPU (Nvidia Tesla V100) for VGG-4 with the same power budget

Conclusion
➢ Proposed DNNARA, a deep neural network accelerator that uses the residue number system
➢ DNNARA is a hybrid electro-optical design
➢ Proposed a system-level CNN accelerator chip based on nanophotonics
➢ Built a system-level simulator for experimental estimation
➢ Could reach up to 12.6 GOPS/(second·mm²·watt)
➢ Ran 19 times faster than a state-of-the-art GPU (Nvidia Tesla V100) for VGG-4 with the same power budget

References
➢ [1] Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi. 2019. Integrated Photonics Architectures for Residue Number System Computations. In IEEE International Conference on Rebooting Computing (ICRC 2019). 129–137.
➢ [2] Nicholas S Szabo and Richard I Tanaka. 1967. Residue arithmetic and its applications to computer technology. McGraw-Hill.
➢ [3] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 3–14.
➢ [4] Shuai Sun, Vikram K Narayana, Ibrahim Sarpkaya, Joseph Crandall, Richard A Soref, Hamed Dalir, Tarek El-Ghazawi, and Volker J Sorger. 2017. Hybrid photonic-plasmonic nonblocking broadband 5×5 router for optical networks. IEEE Photonics Journal 10, 2 (2017), 1–12.
➢ [5] Rupert F Oulton, Volker J Sorger, Thomas Zentgraf, Ren-Min Ma, Christopher Gladden, Lun Dai, Guy Bartal, and Xiang Zhang. 2009. Plasmon lasers at deep subwavelength scale. Nature 461, 7264 (2009), 629.
➢ [6] Erman Timurdogan, Cheryl M Sorace-Agaskar, Jie Sun, Ehsan Shah Hosseini, Aleksandr Biberman, and Michael R Watts. 2014. An ultralow power athermal silicon modulator. Nature Communications 5 (2014), 4008.
➢ [7] Yannick Salamin, Ping Ma, Benedikt Baeuerle, Alexandros Emboras, Yuriy Fedoryshyn, Wolfgang Heni, Bojun Cheng, Arne Josten, and Juerg Leuthold. 2018. 100 GHz plasmonic photodetector. ACS Photonics 5, 8 (2018), 3291–3297.
➢ [8] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.

Thank you!
