FPGA-based Training Accelerator Utilizing Sparseness of Convolutional Neural Network
Hiroki Nakahara, Youki Sada, Masayuki Shimoda, Akira Jinguji, Shimpei Sato
Tokyo Institute of Technology, JP
FPL2019 @ Barcelona
Challenges in DL Training
TSUBAME-KFC (TSUBAME Kepler Fluid Cooling)
Example: training ResNet-50 on ImageNet in 1.2 min on 2,048 GPUs
• High speed
• Low power consumption
Sparse Weight Convolution
[Figure: input feature map, sparse kernel, output feature map; zero weights are skipped]
Only the non-zero weights contribute to the output, e.g.:
y = X_{0,1} × W_0 + X_{1,0} × W_1 + X_{2,2} × W_2
σ: pruning threshold (weights below it are treated as zero and skipped)
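A minimal software sketch of the skip-zero computation above (the list-of-non-zero-weights layout and names such as prune_kernel are illustrative, not the accelerator's implementation):

```python
import numpy as np

def prune_kernel(kernel, threshold):
    """Build the indirect index table: keep only weights whose magnitude
    exceeds the threshold, as a list of (dy, dx, w) entries."""
    nonzero = []
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            w = kernel[dy, dx]
            if abs(w) > threshold:
                nonzero.append((dy, dx, w))
    return nonzero

def sparse_conv2d(x, nonzero, out_h, out_w):
    """Valid convolution that visits only the non-zero weights."""
    y = np.zeros((out_h, out_w), dtype=np.float32)
    for yb in range(out_h):
        for xb in range(out_w):
            acc = 0.0
            for dy, dx, w in nonzero:          # zero weights are skipped entirely
                acc += x[yb + dy, xb + dx] * w
            y[yb, xb] = acc
    return y

# With ~85% of the kernel pruned, the inner loop touches only ~15% of the taps.
```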
How to Train a Sparse CNN?
• Initial weights
• Lottery ticket hypothesis
• Special hardware
Fine-Tuning for a Sparse CNN
• Use a model pre-trained on ImageNet (sparse weights)
• Retain the strong connections to keep recognition accuracy
• Fine-tune on the FPGA
[Figure: dense CNN with weak (ρ_weak) and strong (ρ_strong) connections; the weak connections are pruned]
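A sketch of the prune-then-fine-tune flow, assuming "weak" connections are those below a magnitude threshold and the sparsity mask stays fixed during fine-tuning (a common formulation; the PyTorch names and the 85% ratio below are illustrative only, not the accelerator's procedure):

```python
import torch
import torch.nn as nn

def make_masks(model, prune_ratio=0.85):
    """Magnitude pruning: zero out the weakest `prune_ratio` of each conv
    layer's weights and remember the mask of the strong connections."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            w = module.weight.data
            threshold = w.abs().flatten().quantile(prune_ratio)
            mask = (w.abs() > threshold).float()
            module.weight.data *= mask          # drop weak connections
            masks[name] = mask
    return masks

def fine_tune_step(model, masks, loss_fn, optimizer, x, target):
    """One fine-tuning step that keeps the sparsity pattern fixed."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    optimizer.step()
    with torch.no_grad():                       # re-apply the masks so pruned
        for name, module in model.named_modules():  # weights stay at zero
            if name in masks:
                module.weight.data *= masks[name]
    return loss.item()
```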
Sparseness vs. Accuracy
• 85% of the weights can be pruned initially
Universal Convolution (UC) Unit
[Figure: UC unit datapath]
• Sparse weight memory: each index holds a non-zero weight w_i and its indirect address (x_i, y_i, p_i); a counter sweeps the index memory from 00…0 to 11…1
• Address generator with base address (x_b, y_b):
  - Forward (mode 0): (x_b + x_i, y_b + y_i, p_i)
  - Backward (mode 1): (x_b − y_i, y_b − x_i, p_i)
• Stack (buffer for a feature map), multiply-accumulate with reset, bias add, ReLU
• A single mode bit (0: forward, 1: backward) lets the same unit serve both passes
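A behavioral sketch of the UC unit's indirect addressing, following the address formulas labeled above; the memory layout and the Python names (weight_mem, stack, uc_unit) are assumptions made for illustration:

```python
# Sparse weight memory: index -> (non-zero weight w_i, indirect address (x_i, y_i, p_i)).
# The entries below are placeholders.
weight_mem = [
    ( 0.8, (0, 1, 0)),   # w_0
    (-0.5, (1, 0, 0)),   # w_1
    ( 0.3, (2, 2, 1)),   # w_2
]

def uc_unit(stack, xb, yb, bias, backward=False, relu=True):
    """Compute one output value.
    stack: feature-map buffer indexable by (x, y, channel), e.g. a dict or 3-D array.
    (xb, yb): base address of the output position."""
    acc = 0.0                                  # MAC register after reset
    for w, (xi, yi, pi) in weight_mem:         # the counter sweeps the index memory
        if not backward:
            addr = (xb + xi, yb + yi, pi)      # forward address
        else:
            addr = (xb - yi, yb - xi, pi)      # backward address (as labeled on the slide)
        acc += stack[addr] * w
    acc += bias
    return max(acc, 0.0) if relu else acc
```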
Parallel MCSK Convolution
[Figure: a line buffer (C × N × k) feeding parallel multipliers; each MCSK convolution unit applies a sparse filter to the buffered input channels]
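A sketch of the parallelism the figure suggests, under the assumption that each MCSK convolution unit applies one sparse filter across the C buffered input channels and that M such units run in parallel, one per output channel (the data layout and names below are illustrative):

```python
import numpy as np

def parallel_mcsk_conv(line_buffer, sparse_filters, bias):
    """line_buffer: (C, N, k) window of the input feature maps.
    sparse_filters[m]: list of (c, dy, dx, w) non-zero taps for output channel m.
    Returns one output pixel per output channel."""
    M = len(sparse_filters)
    out = np.zeros(M, dtype=np.float32)
    for m in range(M):                           # M units operate in parallel in hardware
        acc = 0.0
        for c, dy, dx, w in sparse_filters[m]:   # only non-zero taps are stored
            acc += line_buffer[c, dy, dx] * w    # operands read from the shared line buffer
        out[m] = acc + bias[m]
    return out
```

In hardware, the outer loop over m would correspond to parallel multiplier-accumulator trees fed from the same line buffer.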
Overall Architecture
[Figure: Host PC connected to the FPGA over a bus; off-chip DDR4 SDRAM; on-chip weight memory, index memory, bias memory, line buffer, and feature-map stacks feeding the UC unit; MP unit, GAP unit, and LC unit complete the datapath]
Results
Training Time (batch size = 32, 100 epochs); GPU: RTX 2080 Ti, FPGA: VCU1525
• CNNs: AlexNet, VGG16, MobileNetv1; datasets: CIFAR-10, SVHN, Linnaeus5, VOC2017
• Sparse ratios: 88.3–95.4%
• The FPGA trains every network–dataset pair roughly 4× faster than the GPU (GPU: 1,482–12,058 s; FPGA: 372–2,871 s)
Resource Consumption (FPGA: VCU1525)
| Module    | LUTs    | FFs     | DSPs  | BRAMs | URAMs |
| Total     | 370,299 | 934,381 | 1,106 | 3,806 | 960   |
| Available | 1,182K  | 2,364K  | 6,840 | 4,216 | 960   |