FPGA-based Training Accelerator Utilizing Sparseness of Convolutional Neural Network


1. FPGA-based Training Accelerator Utilizing Sparseness of Convolutional Neural Network
Hiroki Nakahara, Youki Sada, Masayuki Shimoda, Akira Jinguji, Shimpei Sato
Tokyo Institute of Technology, JP
FPL2019 @Barcelona

2. Challenges in DL Training
TSUBAME-KFC (TSUBAME Kepler Fluid Cooling); training ResNet-50 on ImageNet:
• High speed: 1.2 min @ 2,048 GPUs
• Low power consumption

3. Sparse Weight Convolution
With a sparse kernel, multiply-accumulate operations for zero weights are skipped. In the slide's example, the output value is y = X(0,1)·W0 + X(1,0)·W1 + X(2,2)·W2, and every other kernel position is skipped; weights with magnitude below the threshold σ are treated as zero.
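As a rough illustration (not the authors' implementation), the skip logic can be sketched in Python/NumPy: only the kernel taps whose magnitude exceeds the threshold are enumerated, so zero weights never cost a multiply-accumulate. The function name, single-channel layout, and thresh parameter are assumptions.

    import numpy as np

    def sparse_conv2d(x, w, thresh=0.0):
        # Keep only significant taps as (row, col, value) triples;
        # weights at or below the threshold are skipped entirely.
        k = w.shape[0]
        taps = [(i, j, w[i, j])
                for i in range(k) for j in range(k)
                if abs(w[i, j]) > thresh]
        H, W = x.shape
        y = np.zeros((H - k + 1, W - k + 1))
        for oy in range(H - k + 1):
            for ox in range(W - k + 1):
                # e.g. y = X[0,1]*W0 + X[1,0]*W1 + X[2,2]*W2 when only
                # three weights survive, as in the slide's example
                y[oy, ox] = sum(x[oy + i, ox + j] * wv for i, j, wv in taps)
        return y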

4. Training of a Sparse CNN?
• Initial weights
• Lottery ticket hypothesis
• Special hardware

5. Fine-Tuning for a Sparse CNN
• Use a model pre-trained (with sparse weights) on ImageNet
• Retain strong connections to preserve recognition accuracy
Starting from the dense CNN, weak connections (below ρ_weak) are pruned while strong connections (above ρ_strong) are retained; fine-tuning is then performed on the FPGA.
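A minimal NumPy sketch of this scheme, assuming simple magnitude pruning with a single threshold ρ and a placeholder gradient: the mask fixed at pruning time is reapplied on every update, so weak connections stay at zero during fine-tuning. The 85% quantile mirrors slide 6; everything else is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.1, size=(64, 3, 3, 3))  # stand-in for pre-trained dense weights

    # Prune weak connections: |w| below rho is dropped (slide 6: ~85% prunable).
    rho = np.quantile(np.abs(w), 0.85)
    mask = np.abs(w) > rho          # True = strong connection, retained
    w *= mask

    # One fine-tuning step: only surviving weights are updated, so the
    # sparsity pattern chosen at pruning time is preserved.
    grad = rng.normal(0.0, 0.01, size=w.shape)    # placeholder gradient
    w -= 1e-3 * grad * mask
    print(f"sparse ratio: {1.0 - mask.mean():.1%}")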

6. Sparseness vs. Accuracy
• 85% of the weights can be pruned initially

  7. Universal Convolution (UC) Unit to Bias Base Address (x b ,y b ) Reg Sparse Weight Memory Stack (Buffer for a Feature Map) Reset + x n ReLU 0 1 0: Forward 1: Backward Address Generator 0: Forward 1: Backward (x b +x i ,y b +y i , p i ): Forward Counter 11…1 Idx w2 Non- zero weight Indirect Addres s 1 w1 x 1 ,y 1 ,p 1 2 x 2 ,y 2 ,p 2 : : : : Address X 00…0 x 1 00…1 x 2 : (x b -y i ,y b -x i , p i ): Backward
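In software terms, the UC unit's per-output loop might look like the sketch below. The two address formulas come from the slide; the entry format, the fmap[p][y][x] layout, and applying ReLU only on the forward path are assumptions.

    def uc_unit(fmap, weight_entries, xb, yb, bias=0.0, backward=False):
        acc = bias
        for w, (xi, yi, pi) in weight_entries:  # only non-zero weights are stored
            if backward:
                x, y = xb - yi, yb - xi         # backward address (slide 7)
            else:
                x, y = xb + xi, yb + yi         # forward address (slide 7)
            acc += fmap[pi][y][x] * w           # shared multiply-accumulate datapath
        return acc if backward else max(acc, 0.0)  # ReLU on the forward path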

8. Parallel MCSK Convolution
A line buffer (C × N × k) feeds M parallel MCSK convolution units, each applying a sparse filter across the C input channels.
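The slide gives only the block diagram, so the following Python sketch is a guess at the dataflow: the line buffer exposes one (C, k, k) window and the M units evaluate their sparse filters on it in parallel. The (c, i, j, w) tap format and all names are assumptions, not the authors' design.

    import numpy as np

    def mcsk_step(window, sparse_filters):
        # window: (C, k, k) slice served by the line buffer.
        # sparse_filters: M filters, each a list of (c, i, j, w) non-zero taps.
        out = np.zeros(len(sparse_filters))
        for m, taps in enumerate(sparse_filters):  # the M units run in parallel in hardware
            out[m] = sum(w * window[c, i, j] for c, i, j, w in taps)
        return out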

9. Overall Architecture
The host PC communicates with the FPGA over a bus. On the FPGA, the UC unit (fed by the weight memory, index, and bias memories), the MP unit, the GAP unit, and the LC unit exchange feature maps through stacks (on-chip buffers) and a line buffer, with DDR4 SDRAM as external memory.
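Read as a dataflow, the diagram suggests a conventional CNN pipeline. The sketch below strings the units together in NumPy, assuming the MP unit is 2×2 max pooling and the LC unit is the final linear classifier; the helper names are placeholders, not the authors' API.

    import numpy as np

    def mp_unit(x):
        # 2x2 max pooling over (C, H, W) (assumed pooling size).
        C, H, W = x.shape
        return x[:, :H - H % 2, :W - W % 2].reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

    def forward(image, conv_fns, fc_w):
        x = image
        for conv in conv_fns:
            x = mp_unit(np.maximum(conv(x), 0.0))  # UC unit (sparse conv + ReLU), then MP unit
        x = x.mean(axis=(1, 2))                    # GAP unit: one value per channel
        return fc_w @ x                            # LC unit: linear classifier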

10. Results
Setup: FPGA VCU1525, GPU RTX2080Ti; batch size 32, 100 epochs; AlexNet, VGG16, and MobileNetv1 trained on CIFAR-10, SVHN, VOC2017, and Linnaeus5.
Training time per configuration (Sparse Ratio [%] / GPU [sec] / FPGA [sec]), for example:
• AlexNet on CIFAR-10: 94.3 / 2,697 / 680
• AlexNet on Linnaeus5: 91.0 / 3,672 / 875
• VGG16 on CIFAR-10: 92.5 / 4,458 / 1,098
• VGG16 on SVHN: 93.3 / 2,430 / 612
• VGG16 on Linnaeus5: 95.4 / 6,121 / 1,435
• MobileNetv1 on Linnaeus5: 90.1 / 4,902 / 1,223
Across the twelve configurations, the sparse ratio ranges from 88.3% to 95.4%, and FPGA training is roughly 4× faster than the GPU (e.g., VGG16 on CIFAR-10: 4,458 s / 1,098 s ≈ 4.1×).
Resource consumption on the VCU1525 (total used / available): 370,299 / 1,182K LUTs; 934,381 / 2,364K FFs; 3,806 / 4,216 BRAMs; 960 / 960 URAMs; 1,106 / 6,840 DSPs.
