Scalar Arithmetic Multiple Data: Customizable Precision for Deep Neural Networks
Andrew Anderson, Michael Doyle and David Gregg
Lero, Trinity College Dublin
{aanderso,mjdoyle,dgregg}@tcd.ie
ARITH, Kyoto, June 2019
DNN Convolution
Figure: Multi-channel multi-kernel convolution
DNN Convolution

// Direct convolution: m kernels over c channels, k x k window,
// strided over the output spatial dimensions.
for (unsigned m = 0; m < kernels; m++)
  for (unsigned h = 0; h < img_h / stride_h; h++)
    for (unsigned w = 0; w < img_w / stride_w; w++)
      for (unsigned c = 0; c < channels; c++)
        for (unsigned y = 0; y < k; y++)
          for (unsigned x = 0; x < k; x++)
            output[m][h][w] +=
                input[c][((h * stride_h) + y) - (k / 2)]
                        [((w * stride_w) + x) - (k / 2)]
                * kernel[m][c][y][x];
Quantized Arithmetic
DNN weights occupy huge amounts of space in FP32: the VGG-19 network alone is 548 MB.
Figure: But we want to use them on this! OpenMV Cam – 512 KB RAM, 2 MB ROM, 216 MHz Cortex-M7
Quantized Arithmetic
In Deep Learning we have it very easy!
◮ Network training compensates for arithmetic error
◮ Often, noisy arithmetic actually helps! (with overfitting)
There is lots of research on how harshly DNN weights can be quantized:
◮ Can go to integer (eventually!)
◮ Can go down to one (1) bit ("binarized" nets)
◮ But we don't want to do all our work on FPGA...
◮ In fact, commodity hardware is ideal.
The Simple Approach
Convert to native arithmetic
Figure: uint4_t expanded to uint8_t (four 4-bit values widened into four 8-bit lanes)
◮ Can use native SIMD
◮ Space overhead only in registers (not memory)
◮ Extra precision in intermediate results (for free)
◮ Easy to mix and match number formats (e.g. uint6_t + uint4_t)
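A minimal sketch of the expansion step, assuming little-endian lane order; the helper name is ours, not from the talk:

#include <stdint.h>

// Expand four 4-bit values packed in a uint16_t into the four
// 8-bit lanes of a uint32_t.
static uint32_t expand_u4_to_u8(uint16_t packed) {
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t v = (packed >> (4 * lane)) & 0xF; // extract one 4-bit value
        out |= v << (8 * lane);                    // place it in an 8-bit lane
    }
    return out;
}

Native SIMD then operates on the widened lanes, and the spare high bits provide the extra precision in intermediate results for free.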
Quantized Arithmetic
Figure: Example SWAR operation. 4 × 4-bit words packed into a 16-bit scalar register
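Packing is the inverse of the expansion above; a sketch with a hypothetical helper name:

// Pack four 4-bit values into one 16-bit scalar register,
// one value per 4-bit lane (v[0] in the lowest lane).
static uint16_t pack_u4x4(const uint8_t v[4]) {
    uint16_t out = 0;
    for (int lane = 0; lane < 4; lane++)
        out |= (uint16_t)(v[lane] & 0xF) << (4 * lane);
    return out;
}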
SIMD Within A Register (SWAR)
Dealing with overflow
Figure: Spacer bits. A lane-wise add is performed as a masked add of the low bits followed by an xor that restores the spacer-bit positions.
Temporary spacer bits exist only in intermediate values; they are never written back to the data format in memory.
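A minimal sketch of the masked-add-plus-xor idiom from the figure, assuming the top bit of each 4-bit lane acts as the spacer:

// Lane-wise add of 4 x 4-bit lanes in a 16-bit word. Masking off the
// spacer (top) bits before the add keeps carries from crossing lane
// boundaries; the xor adds the spacer bits back in without carry-out.
static uint16_t swar_add_u4x4(uint16_t a, uint16_t b) {
    const uint16_t H = 0x8888;                      // spacer bit of each lane
    uint16_t low = (uint16_t)((a & ~H) + (b & ~H)); // masked add
    return low ^ ((a ^ b) & H);                     // restore spacer-bit sums
}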
SIMD Within A Register (SWAR)
Figure: Convolutional substructure in scalar integer multiplication. A uint32_t holding inputs i0..i3 in subword lanes, multiplied by a uint32_t holding weights k1..k3, yields a uint64_t whose lanes hold the aligned sums of partial products (k3·i0, then k3·i1 + k2·i0, and so on).
Long multiplication is discrete convolution over digit sequences.
Figure: Convolution: k × i subword multiplies and (k − 1) × (i − 1) additions with a single instruction.
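A minimal sketch under our own packing assumptions (8-bit lanes, little-endian lane order, values quantized to 2 bits so that no lane sum can overflow); the function name is ours:

// One scalar multiply convolves two digit sequences: lane n of the
// 64-bit product is the sum of inputs[a] * weights[b] over all
// a + b == n, i.e. one 1-D convolution output per lane. The lane width
// must exceed the worst-case lane sum (2-bit values: 3 * 3 * 3 = 27 < 256).
static uint64_t samd_conv1d(uint32_t inputs, uint32_t weights) {
    // inputs:  i0..i3, one value per 8-bit lane (i0 in the low lane)
    // weights: k0..k2, one value per 8-bit lane, top lane zero
    return (uint64_t)inputs * (uint64_t)weights;
}

For example, lane 2 of the product, (p >> 16) & 0xFF, holds i0·k2 + i1·k1 + i2·k0. Spacer bits in the packed operands extend the same trick to wider value ranges.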
Results
Figure: SAMD Convolution with Temporary Spacer Bits (ARM Cortex A-57). Execution time (ns) on layers conv3-1, conv3-2, conv4-1, conv4-2, conv4-3 for direct-sum2d and SAMD2 through SAMD8.
Results
Figure: SAMD Convolution with Permanent Spacer Bits (ARM Cortex A-57). Execution time (ns) on the same layers for direct-sum2d and SAMD2 through SAMD8.
Future Work
◮ All-SAMD network (nonlinearities & utility ops)
◮ Co-design HW integer support instructions
◮ GPU (but microcontrollers don't have GPUs (yet!))
Thanks for listening!