Bit Fusion: Bit-Level Dynamically Composable Architecture for Deep Neural Networks
Hardik Sharma, Jongse Park, Naveen Suda †, Liangzhen Lai †, Benson Chau, Vikas Chandra †, Hadi Esmaeilzadeh ‡
Georgia Institute of Technology; † Arm, Inc.; ‡ University of California, San Diego
Alternative Computing Technologies (ACT) Lab
2 DNNs Tolerate Low-Bitwidth Operations
[Figure: bar chart, 0% to 100%, broken down by operand bitwidths (1bit/1bit, 2bit/2bit, 4bit/4bit, 8bit/1bit, 8bit/8bit) for AlexNet, CIFAR10, LSTM, LeNet-5, RESNET-18, RNN, SVHN, VGG-7, and Avg]
More than 99.4% of multiply-adds require fewer than 8 bits.
3 Bitwidth Flexibility is Necessary for Accuracy
AlexNet: IMAGENET dataset (Mishra et al., WRPN, arXiv 2017)
Layer:     Conv.  Conv.  Conv.  Conv.  Conv.  FC     FC     FC
Bitwidth:  8b/8b  4b/4b  4b/4b  4b/4b  4b/4b  4b/4b  4b/4b  8b/8b
LeNet: MNIST dataset (Li et al., TWN, arXiv 2016)
Layer:     Conv.  Conv.  FC     FC
Bitwidth:  2b/2b  2b/2b  2b/2b  2b/2b
A fixed-bitwidth accelerator would either achieve limited benefits (8-bit) or compromise on accuracy (<8-bit).
4 Our Approach: Bit-level Composability
[Figure: a Fusion Unit built from BitBricks (BB) and adders, fed by WBUF and a forwarded Input, producing a forwarded Psum; inset of a single BitBrick with a sign mode and operand bits sx, x1, x0 and sy, y1, y0]
BitBricks (BBs) are bit-level composable compute units.
5 Compute units (BitBricks) logically fuse at runtime to form Fused-PEs (F-PEs) that dynamically match the bit-width of the DNN layers.
[Figure, four panels, each showing WBUF, Input forward, and Psum forward:
(a) Fusion Unit with 16 BitBricks
(b) 16x Parallelism, Binary (1-bit) or Ternary (2-bit)
(c) 4x Parallelism, Mixed-Bitwidth (2-bit weights, 8-bit inputs)
(d) No Parallelism, 8-bits]
6 Config #1: Binary/Ternary Mode
[Figure: Fusion Unit with a 2-bit Input and a 2-bit Weight; every BitBrick (BB) acts as its own F-PE]
Each BitBrick performs a binary/ternary multiplication.
16x parallelism
7 Config #2: 4-bit Mode
[Figure: Fusion Unit with Input (4-bit) and Weight (4-bit), each split into two 2-bit slices; the BitBricks produce Partial Products that are added]
Four BitBricks fuse to form a Fused-PE (F-PE).
4x Parallelism
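The 4-bit mode above can be sketched in software as follows. This is a minimal illustration, assuming unsigned operands and ignoring the BitBrick sign mode; the names `bitbrick`, `split_2bit`, and `fused_multiply` are hypothetical and do not come from the slides.

```python
def bitbrick(x2, y2):
    """Hypothetical model of one BitBrick: an unsigned 2-bit x 2-bit product."""
    assert 0 <= x2 < 4 and 0 <= y2 < 4
    return x2 * y2


def split_2bit(value, bits):
    """Split an unsigned `bits`-wide value into 2-bit slices, least significant first."""
    return [(value >> shift) & 0b11 for shift in range(0, bits, 2)]


def fused_multiply(x, y, x_bits=4, y_bits=4):
    """Multiply x by y using only 2-bit partial products combined by shift-add.

    Returns (product, number_of_bitbricks_used).
    """
    total = 0
    used = 0
    for i, x_slice in enumerate(split_2bit(x, x_bits)):
        for j, y_slice in enumerate(split_2bit(y, y_bits)):
            # Each partial product is weighted by the position of its slices.
            total += bitbrick(x_slice, y_slice) << (2 * (i + j))
            used += 1
    return total, used


product, bricks = fused_multiply(13, 11)
print(product, bricks)
```

With 4-bit operands the double loop runs four times, which corresponds to the four BitBricks that fuse into one F-PE on the slide.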
8 Config #3: 8-bit, 4-bit (Mixed-Mode)
[Figure: Fusion Unit with Input (8-bit) split into four 2-bit slices and Weight (4-bit) split into two 2-bit slices; the BitBricks produce Partial Products that are added]
Eight BitBricks fuse to form a Fused-PE (F-PE).
2x Parallelism
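The parallelism figures quoted for the configurations so far follow from counting 2-bit slices. The sketch below is one way to reproduce them, assuming a Fusion Unit of 16 BitBricks and that 1-bit operands still occupy a full 2-bit BitBrick input; the function names are hypothetical.

```python
import math

BITBRICKS_PER_FUSION_UNIT = 16  # from the "Fusion Unit with 16 BitBricks" panel


def bitbricks_per_fpe(input_bits, weight_bits):
    """BitBricks needed for one Fused-PE: one per pair of 2-bit slices."""
    return math.ceil(input_bits / 2) * math.ceil(weight_bits / 2)


def parallelism(input_bits, weight_bits):
    """Number of Fused-PEs that fit in one Fusion Unit at these bitwidths."""
    needed = bitbricks_per_fpe(input_bits, weight_bits)
    if needed > BITBRICKS_PER_FUSION_UNIT:
        raise ValueError("operands too wide for a single Fusion Unit")
    return BITBRICKS_PER_FUSION_UNIT // needed


for in_bits, w_bits in [(2, 2), (4, 4), (8, 4), (8, 2), (8, 8)]:
    print(f"{in_bits}-bit input, {w_bits}-bit weight: "
          f"{bitbricks_per_fpe(in_bits, w_bits)} BitBricks per F-PE, "
          f"{parallelism(in_bits, w_bits)}x parallelism")
```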
9 Spatial Fusion vs. Temporal Design
[Figure: two designs side by side, each receiving operand bits a0..a3 through h0..h3 as inputs over time, applying shifts (<<), and producing Out]
Temporal Design (Bit Serial): Combine results over time
Spatial Fusion (Bit Parallel): Combine results over space
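The contrast between combining over time and combining over space can be illustrated with a toy model. This sketch assumes unsigned operands and 1-bit input slices for simplicity (Bit Fusion itself uses 2-bit BitBricks), and the step counts are illustrative only, not measurements from the slides.

```python
def temporal_multiply(x, w, x_bits=4):
    """Bit-serial: consume one input bit per step and accumulate shifted partials.

    Returns (product, steps).
    """
    acc = 0
    steps = 0
    for t in range(x_bits):
        bit = (x >> t) & 1
        acc += (w * bit) << t  # shift-add into a register, one step at a time
        steps += 1
    return acc, steps


def spatial_multiply(x, w, x_bits=4):
    """Bit-parallel: form every partial product at once, then add them together.

    Returns (product, steps).
    """
    partials = [(w * ((x >> t) & 1)) << t for t in range(x_bits)]
    return sum(partials), 1


print(temporal_multiply(11, 5))
print(spatial_multiply(11, 5))
```

Both functions compute the same product; they differ only in whether the partial products are spread across steps or across hardware.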
10 Spatial Fusion Surpasses Temporal Design
Area (um^2)    BitBricks  Shift-Add  Register  Total
Temporal       463        2989       1454      4905
Fusion Unit    369        934        91        1394
3.5x lower area
Power (nW)     BitBricks  Shift-Add  Register  Total
Temporal       60         550        1103      1712
Fusion Unit    46         424        69        538
3.2x lower power
Synthesized using a commercial 45 nm technology
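The 3.5x and 3.2x figures can be checked directly from the reported totals. The snippet below is a small arithmetic check using only the numbers on this slide; the variable names are illustrative.

```python
# Totals as reported on the slide (area in um^2, power in nW).
temporal = {"area": 4905, "power": 1712}
fusion_unit = {"area": 1394, "power": 538}

area_ratio = temporal["area"] / fusion_unit["area"]
power_ratio = temporal["power"] / fusion_unit["power"]

print(f"area:  {area_ratio:.2f}x lower")
print(f"power: {power_ratio:.2f}x lower")
```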
11 Bit Fusion Systolic Array Architecture
[Figure: a Control block and a two-dimensional array of Fusion Units; each Fusion Unit has its own WBUF, each row shares an IBUF, and each column feeds a Pooling Unit, an Activation Unit, and an OBUF]
12 Programmability: BitFusion ISA
Requirements:
Amortize cost of bit-level fusion
Concise
Enable flexible data-path
13 ISA: Amortize the Cost of Bit-Level Fusion
[Figure: a sequence of instruction blocks; each block opens with "Block begin: <bitwidths>", contains a Convolution, and closes with "Block end: next block". Bitwidth labels shown: 8-bit/8-bit (Conv 1), 4-bit/8-bit, 4-bit/1-bit]
Use a block-structured ISA for groups of operations (layers).
14 ISA: Concise Expression for DNNs
loop: for i in (1 to B)
loop: for j in (1 to OC)
loop: for k in (1 to IC)
[Figure: Fully-Connected Layer with dimensions B, IC, and OC]
Use loop instructions, as DNNs consist of a large number of repeated operations.
15 ISA: Concise Expression for DNNs
loop: for i in (1 to B)
loop: for j in (1 to OC)
loop: for k in (1 to IC)
input:  k*1 + j*0 + i*IC
weight: k*1 + j*IC + i*0
output: k*0 + j*1 + i*OC
[Figure: Fully-Connected Layer with dimensions B, IC, and OC]
DNNs have a regular memory access pattern.
Use loop indices to generate memory accesses.
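The address expressions for the fully-connected layer can be sketched in software as follows. This is an illustration only, assuming flat row-major buffers and 0-based loop indices (the slide counts from 1); `fc_addresses` and `fully_connected` are hypothetical names, not part of the BitFusion ISA.

```python
def fc_addresses(B, IC, OC):
    """Yield (input_addr, weight_addr, output_addr) for every loop iteration.

    Each address is a linear combination of the loop indices, using the
    per-loop strides shown on the slide.
    """
    for i in range(B):            # batch
        for j in range(OC):       # output channel
            for k in range(IC):   # input channel
                input_addr = k * 1 + j * 0 + i * IC
                weight_addr = k * 1 + j * IC + i * 0
                output_addr = k * 0 + j * 1 + i * OC
                yield input_addr, weight_addr, output_addr


def fully_connected(inputs, weights, B, IC, OC):
    """Compute a fully-connected layer over flat buffers using generated addresses."""
    outputs = [0] * (B * OC)
    for in_addr, w_addr, out_addr in fc_addresses(B, IC, OC):
        outputs[out_addr] += inputs[in_addr] * weights[w_addr]
    return outputs


inputs = [1, 2, 3, 4, 5, 6]    # B=2 rows of IC=3 values
weights = [1, 0, 1, 0, 1, 0]   # OC=2 rows of IC=3 values
print(fully_connected(inputs, weights, B=2, IC=3, OC=2))
```

Because the strides are fixed per loop, three loop instructions plus one stride triple per operand are enough to describe every memory access of the layer.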
16 ISA: Flexible Storage
2-bit mode: 16x parallelism. Need: 32-bit inputs, 32-bit weights
8-bit mode: 1x parallelism. Need: 8-bit input, 8-bit weight
The ISA changes the semantics of off-chip and on-chip memory accesses according to the bitwidth of the operands.
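The access widths above are the operand bitwidth multiplied by the parallelism. The sketch below shows that arithmetic and one possible way to unpack a buffer word into operands; the little-end-first packing order and both function names are assumptions made for illustration, not details given on the slide.

```python
def access_width(operand_bits, parallelism):
    """Bits a Fusion Unit must read per access for one operand type."""
    return operand_bits * parallelism


def unpack_operands(word, operand_bits, count):
    """Split a buffer word into `count` operands, least significant first.

    The packing order is an assumption; the slides do not specify it.
    """
    mask = (1 << operand_bits) - 1
    return [(word >> (i * operand_bits)) & mask for i in range(count)]


print(access_width(2, 16))  # 2-bit mode, 16x parallelism
print(access_width(8, 1))   # 8-bit mode, 1x parallelism
print(unpack_operands(0b11100100, operand_bits=2, count=4))
```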
17 ISA: Flexible Storage (Software View)
[Figure: the same WBUF drawn three times, feeding an 8-bit register, a 16-bit register, and a 32-bit register]
Software views the buffers as having a flexible aspect ratio.
18 Benchmarked Platforms
GPU    Low Power           Nvidia Tegra TX2
GPU    High Performance    Nvidia Titan-X
ASIC   Bit-Serial          Stripes (Micro'16)
ASIC   Optimized Dataflow  Eyeriss (ISCA'16)
19 Benchmarked DNN Models
DNN        Type  Multiply-Adds  Bit-Flexible Model Weights  Original Model Weights
AlexNet    CNN   2,678 MOps     116.3 MBytes                898.6 MBytes
CIFAR10    CNN   617 MOps       3.3 MBytes                  53.5 MBytes
LSTM       RNN   13 MOps        6.2 MBytes                  49.4 MBytes
LeNet-5    CNN   16 MOps        0.5 MBytes                  8.2 MBytes
RESNET-18  CNN   4,269 MOps     13 MBytes                   103.7 MBytes
RNN        RNN   17 MOps        8.0 MBytes                  64.0 MBytes
SVHN       CNN   158 MOps       0.8 MBytes                  24.4 MBytes
VGG-7      CNN   317 MOps       2.7 MBytes                  43.3 MBytes
20 Comparison with Eyeriss
[Figure: bar chart of Performance and Energy Reduction, plotted as improvement over Eyeriss (0x to 14x), for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and geomean]
3.9× speedup and 5.1× energy reduction over Eyeriss
21 Comparison with Stripes
[Figure: bar chart of Performance and Energy Reduction, plotted as improvement over Stripes (0× to 8×), for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and geomean]
2.6× speedup and 3.9× energy reduction over Stripes
22 Comparison with GPUs
[Figure: bar chart of TitanX-INT8 and Bit Fusion, plotted as speedup over TX2, for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and geomean]
Bit Fusion provides almost the same performance as Titan Xp (250 W) with only 895 mW.
23 Conclusion
Emerging research shows we can reduce bitwidths for DNNs without losing accuracy.
Bit Fusion defines a new dimension of bit-level dynamic composability to leverage this opportunity.
The BitFusion ISA exposes this capability to the software stack.