DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi 49th International Conference on Parallel Processing – ICPP August 2020
Outline ➢ Introduction ➢ Background ➢ Integrated Photonic Residue Arithmetic Computing Engine for Convolutional Neural Network ➢ Performance Evaluation ➢ Conclusion 2
Introduction 3
Introduction ➢ Some NN applications require real-time analysis for inference ➢ Computation intensive: inference involves billions of multiply-accumulate (MAC) operations ➢ We propose DNNARA: a deep neural network accelerator using residue arithmetic based on integrated photonics ➢ All computations through the neural network are done in the residue number system (RNS) to avoid extra binary-to/from-RNS conversions Block Diagram of a DNNARA System 4
Introduction ➢ DNNARA: RNS with wavelength-division multiplexing (WDM) • Executes multiple MVMs concurrently thanks to WDM • Speeds up MVMs thanks to the digit-independent (carry-free) nature of RNS • Residues are small-sized • Increases system parallelism – saves area/hardware resources 5
Background ➢ Convolutional Neural Network ➢ Residue Number System 6
Background – Convolutional Neural Network ➢ Widely applied in classification tasks • Image recognition ➢ Consists of several layers/functions • Convolutional layers • Activation functions – add non-linearity • ReLU (Rectified Linear Unit) • Sigmoid function / hyperbolic tangent function • Pooling layers – downsample the output • Max pooling • Average pooling • Fully-connected layers ➢ Contains up to billions of multiply-accumulate (MAC) operations 7
Background - Residue Number System (RNS) ➢ Each integer X is represented by its "residues," the remainders obtained by dividing it by each modulus M_i • Example: moduli are M1=2, M2=3, M3=5, M4=7 • X = 20 is represented as X = {0, 2, 0, 6}[2, 3, 5, 7] • Range of representable numbers: 0 to (M – 1), here 0 to 209 (M = M1*M2*M3*M4 = 210) • Moduli must be pairwise relatively prime ➢ Negative number notation: similar to 2's complement • r_i = |M_i – |X|_{M_i}|_{M_i} (where X is negative) • Example: -20 = {|2-0|_2, |3-2|_3, |5-0|_5, |7-6|_7}[2, 3, 5, 7] = {0, 1, 0, 1}[2, 3, 5, 7] • Range of representable numbers: [-(M-1)/2, (M-1)/2] if M is odd, or [-M/2, M/2-1] if M is even ➢ Residue arithmetic: operations carried out on residues • Example: addition of X = 20 = {0, 2, 0, 6}[2, 3, 5, 7] and Y = 5 = {1, 2, 0, 5}[2, 3, 5, 7] • X+Y = {0+1, 2+2, 0+0, 6+5}[2, 3, 5, 7] → {1, 1, 0, 4}[2, 3, 5, 7] • X*Y = {0*1, 2*2, 0*0, 6*5}[2, 3, 5, 7] → {0, 1, 0, 2}[2, 3, 5, 7] • Residue arithmetic reduces to modulo additions and multiplications on the residues • Residue arithmetic is carried out on each residue in parallel 8
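A minimal Python sketch of the residue representation and arithmetic above (moduli {2, 3, 5, 7} as in the example; function names are illustrative, not from the paper):

MODULI = (2, 3, 5, 7)

def to_rns(x, moduli=MODULI):
    # Encode an integer as its residues with respect to each modulus.
    # Python's % also yields the correct non-negative residue for negative x,
    # matching the 2's-complement-like notation above.
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    # Digit-independent addition: each residue is added modulo its own modulus.
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    # Digit-independent multiplication, one modulus at a time.
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

# Example from the slide: X = 20, Y = 5
X, Y = to_rns(20), to_rns(5)          # (0, 2, 0, 6) and (1, 2, 0, 5)
assert rns_add(X, Y) == to_rns(25)    # (1, 1, 0, 4)
assert rns_mul(X, Y) == to_rns(100)   # (0, 1, 0, 2)
assert to_rns(-20) == (0, 1, 0, 1)    # negative-number notation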
Integrated Photonic Residue Arithmetic Computing Engine for Neural Network ➢ Overview ➢ Sigmoid Unit ➢ Residue Adders and Multipliers ➢ Max Pooling Unit ➢ Residue Matrix-Vector Multiplication Unit 9
Overview Architecture • R-MVM: Residue Matrix-Vector Multiplication • R-Multiplier: Residue Multiplier • R-Adder: Residue Adder • MRR: Micro-Ring Resonator • PD: Photo-Detector • LUT: Look-up Table • RNS2Bin: RNS to Binary • Bin2RNS: Binary to RNS • T: Tile 10
Integrated Photonic Residue Adder and Multiplier ➢ Basic block • An electro-optical 2×2 switch • Light either propagates straight through ("bar" state – (a)) or crosses over ("cross" state – (b)) ➢ Residue adder [1] – one-hot encoding • Can be viewed as a mapping (injection) • Arbitrary-Size Benes (AS-Benes) network (c – even modulus & d – odd modulus) • Switch states are precomputed and stored in a look-up table (LUT) ➢ An AS-Benes modulo-5 adder (e) • Example with |3+4|_5 = 2 ➢ A modulo-N residue multiplier implementation (f) ➢ WDM capable 11
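A software model of how the LUT-driven adder behaves: for each addend w, the AS-Benes switch settings realize a fixed permutation that routes the one-hot-encoded input x to output port |x + w| mod N. A minimal sketch, assuming the LUT simply stores one precomputed permutation per operand value (a residue multiplier would store the permutation x -> |x*w| mod N instead):

def build_adder_lut(N):
    # For every addend w, precompute the permutation applied by the switch
    # network: input port x is routed to output port (x + w) mod N.
    return {w: [(x + w) % N for x in range(N)] for w in range(N)}

def modulo_add(x, w, lut, N):
    # One-hot light enters port x; the switch states selected by w route it
    # to exactly one output port, whose index is the sum modulo N.
    one_hot_in = [1 if i == x else 0 for i in range(N)]
    one_hot_out = [0] * N
    for port, lit in enumerate(one_hot_in):
        if lit:
            one_hot_out[lut[w][port]] = 1
    return one_hot_out.index(1)

LUT5 = build_adder_lut(5)
assert modulo_add(3, 4, LUT5, 5) == 2   # |3 + 4| mod 5 = 2, as in panel (e)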
Residue MVM (R-MVM) Computing Block ➢ Schematic of the designed R-MVM (b) ➢ Wavelength-division multiplexing (WDM) capable ➢ Requires lasers, MRRs, PDs, LUTs, registers, as well as photonic and electrical connections ➢ sel chooses either the partial sum or the bias ➢ Example: 5×5 input feature and a 2×2 kernel 12
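A behavioral sketch of the computation one R-MVM block performs: every multiply and every accumulation is done independently per modulus, and sel determines whether the accumulation starts from the bias or from an earlier partial sum. Names are illustrative, not the paper's API:

MODULI = (2, 3, 5, 7)

def r_mvm(weight_rows, inputs, bias, moduli=MODULI):
    # weight_rows: one list of RNS tuples per output element
    # inputs, bias: RNS tuples; returns one RNS tuple per output element
    outputs = []
    for row, start in zip(weight_rows, bias):
        acc = start  # sel: start from the bias (or from a partial sum)
        for w, x in zip(row, inputs):
            prod = tuple((wi * xi) % m for wi, xi, m in zip(w, x, moduli))
            acc = tuple((ai + pi) % m for ai, pi, m in zip(acc, prod, moduli))
        outputs.append(acc)
    return outputs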
Pipeline of a MAC operation • Cycle 1: • Input features (x) are encoded as light with different wavelengths • Weights (w) are encoded as the selection lines, loading the switch states from the LUT • Cycle 2: • Set up the switch states accordingly • Inject light and detect light – multiply • MRRs & PDs act as filters to derive the results of all the multiplications 13
Pipeline of a MAC operation • Cycle 3: • Results from the previous cycle (w*x) are decoded as the selection lines to load the adder states • According to sel, either the partial sum or the bias is encoded as light • Cycle 4: • Set up the adders • Inject light and detect light – add • Cycle 5: Write back to the register 14
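The five-cycle pipeline can be summarized as a behavioral sketch (one residue channel of one MAC shown; in hardware, WDM lets many of these multiplications share the same cycles). lut_mul and lut_add are precomputed switch-state tables in the style of the adder/multiplier sketch above, e.g. lut_mul = {w: [(x * w) % N for x in range(N)] for w in range(N)}:

def mac_pipeline(x, w, acc_or_bias, lut_mul, lut_add):
    # Cycle 1: x is encoded as light; w selects the multiplier switch states
    mul_map = lut_mul[w]
    # Cycle 2: inject light; MRRs/PDs filter the output -> product |w*x| mod N
    prod = mul_map[x]
    # Cycle 3: the product selects the adder switch states; sel picks the
    #          partial sum or the bias as the light input
    add_map = lut_add[prod]
    # Cycle 4: inject and detect light -> sum |w*x + acc| mod N
    result = add_map[acc_or_bias]
    # Cycle 5: write the result back to the register
    return result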
Sigmoid Function Unit - Polynomial ➢ In the residue domain, it is hard to calculate the sigmoid function directly ➢ Instead, it can be treated as a polynomial, since the sigmoid function can be represented by its Taylor series ➢ The terms involving x are pre-calculated, and the connections are built accordingly ➢ Example: P(x) = ax^4 + bx^3 + cx^2 + dx + e in a modulo-5 system 15
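A sketch of residue-domain polynomial evaluation using Horner's rule, so the sigmoid approximation reduces to the modulo multiplications and additions already available in the design; the coefficients below are placeholders, and the mapping of the fractional Taylor coefficients into residues is not shown:

def poly_mod(x, coeffs, m):
    # Evaluate P(x) = a*x^4 + b*x^3 + c*x^2 + d*x + e modulo m with Horner's
    # rule; coeffs is [a, b, c, d, e], highest degree first.
    acc = 0
    for c in coeffs:
        acc = (acc * x + c) % m
    return acc

# Modulo-5 channel with placeholder coefficients a..e = 1, 2, 3, 4, 0
assert poly_mod(3, [1, 2, 3, 4, 0], 5) == (81 + 2*27 + 3*9 + 4*3 + 0) % 5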
Max Pool Function Unit ➢ The sign of a number is implicit in RNS, so sign detection (and hence comparison) is not straightforward ➢ Instead, we convert the number from RNS to MRS (mixed-radix number system) [2] ➢ In the MRS, the digit associated with the even modulus 2 (a_4) separates negative from non-negative numbers ➢ The conversion is serial but can be pipelined 16
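A sketch of the RNS-to-mixed-radix conversion in the style of Szabo and Tanaka [2]; it assumes the moduli are ordered so that the even modulus 2 contributes the most significant digit, which is then 0 for non-negative values and 1 for negative ones:

def rns_to_mrs(residues, moduli):
    # Mixed-radix digits a1..ak with X = a1 + a2*m1 + a3*m1*m2 + ...
    r, m, digits = list(residues), list(moduli), []
    for i in range(len(m)):
        a = r[i] % m[i]
        digits.append(a)
        for j in range(i + 1, len(m)):
            # subtract a, then divide by m[i] (multiply by its modular inverse)
            r[j] = ((r[j] - a) * pow(m[i], -1, m[j])) % m[j]
    return digits

def is_negative(residues, moduli=(3, 5, 7, 2)):
    # With modulus 2 last, the most significant digit a4 is 0 or 1 and tells
    # whether the value falls in the negative half of the range [-M/2, M/2-1].
    return rns_to_mrs(residues, moduli)[-1] == 1

# -20 is encoded as 210 - 20 = 190 -> residues (1, 0, 1, 0) for moduli (3, 5, 7, 2)
assert is_negative((1, 0, 1, 0)) and not is_negative((2, 0, 6, 0))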
Performance Evaluation 17
Experiment Setup ➢ Electrical memory components • CACTI 7.0 [3] ➢ Optical switch [4] • Lumerical FDTD ➢ Optical circuit • Lumerical Interconnect ➢ Lasers/MRRs/PDs • Data from prior work ([5], [6], and [7], respectively) ➢ HyperTransport serial link • Data from [8] ➢ System-level design • Our own simulator (Table: Configurations of Selected Benchmarks) 18
Design Space Exploration ➢ Swept Parameters • WDM size • # of tiles in a chip • # of MVMs in a tile ➢ Computation capability • # of operations /(time*area*power) 19
Hardware Specification 20
Speed & Power Analysis ➢ Real benchmarks ➢ The more chips, the faster, but speed did not scale proportionally ➢ Consumes more power ➢ Both effects are due to communication ➢ 19 times faster compared to a GPU (Nvidia Tesla V-100) for VGG-4 with the same power budget 21
Conclusion ➢ Proposed DNNARA, a deep neural network accelerator that uses the residue number system ➢ DNNARA is a hybrid electro-optical design ➢ Proposed a system-level CNN accelerator chip with nano-photonics ➢ Built a system-level simulator for experimental estimation ➢ Can reach up to 12.6 giga-operations/(second·mm²·watt) ➢ Reached 19 times faster than a state-of-the-art GPU (Nvidia Tesla V-100) for VGG-4 with the same power budget 22
References ➢ [1] Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi. 2019. Integrated Photonics Architectures for Residue Number System Computations. In IEEE International Conference on Rebooting Computing (ICRC 2019). 129–137. ➢ [2] Nicholas S. Szabo and Richard I. Tanaka. 1967. Residue Arithmetic and Its Applications to Computer Technology. McGraw-Hill. ➢ [3] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 3–14. ➢ [4] Shuai Sun, Vikram K. Narayana, Ibrahim Sarpkaya, Joseph Crandall, Richard A. Soref, Hamed Dalir, Tarek El-Ghazawi, and Volker J. Sorger. 2017. Hybrid photonic-plasmonic nonblocking broadband 5×5 router for optical networks. IEEE Photonics Journal 10, 2 (2017), 1–12. ➢ [5] Rupert F. Oulton, Volker J. Sorger, Thomas Zentgraf, Ren-Min Ma, Christopher Gladden, Lun Dai, Guy Bartal, and Xiang Zhang. 2009. Plasmon lasers at deep subwavelength scale. Nature 461, 7264 (2009), 629. ➢ [6] Erman Timurdogan, Cheryl M. Sorace-Agaskar, Jie Sun, Ehsan Shah Hosseini, Aleksandr Biberman, and Michael R. Watts. 2014. An ultralow power athermal silicon modulator. Nature Communications 5 (2014), 4008. ➢ [7] Yannick Salamin, Ping Ma, Benedikt Baeuerle, Alexandros Emboras, Yuriy Fedoryshyn, Wolfgang Heni, Bojun Cheng, Arne Josten, and Juerg Leuthold. 2018. 100 GHz plasmonic photodetector. ACS Photonics 5, 8 (2018), 3291–3297. ➢ [8] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622. 23
Thank you! 24