  1. Automatic Compiler Based FPGA Accelerator for CNN Training. Shreyas Venkataramanaiah 1, Yufei Ma 1, Shihui Yin 1, Eriko Nurvitadhi 2, Aravind Dasu 3, Yu Cao 1, Jae-sun Seo 1. 1 School of ECEE, Arizona State University, Tempe, AZ, USA; 2 Intel Labs, Intel Corporation, OR, USA; 3 Programmable Solutions Group, Intel Corporation, CA, USA

  2. Outline ▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion 2

  3. Introduction
  ▪ Challenges in training of neural networks
  ‒ Large storage, memory bandwidth, and energy consumption
  ‒ New DNN structures are rapidly evolving and are developed for diverse applications
  ▪ GPUs are power hungry
  ▪ ASICs lack programmability and cannot anticipate future DNNs
  ▪ FPGAs are flexible
  ‒ Reconfigurable, scalable training hardware
  ‒ Can support low-precision or sparse matrix computations

  4. Outline ▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion 4

  5. CNN Training Algorithm: Forward Pass
  [Figure: forward pass through the example CNN (Conv/Pool/FC layers ending in the loss), with per-layer columns for activations, local gradients, weight gradients, and weight update]
  ▪ Each training image is associated with a label
  ▪ The loss function estimates the network performance and provides the error value
  ▪ ReLU: store the activation gradients
  ▪ Maxpool: store the selected pixel position
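To make the forward-pass bookkeeping concrete, below is a minimal NumPy sketch (not the paper's implementation) of the two storage rules on this slide: ReLU keeps a 0/1 activation-gradient mask, and 2x2 maxpool keeps the selected pixel position in each window. All function names are illustrative.

```python
import numpy as np

def relu_forward(x):
    # ReLU; the 0/1 activation-gradient mask is stored for the backward pass.
    act_grad = (x > 0).astype(x.dtype)
    return x * act_grad, act_grad

def maxpool2x2_forward(x):
    # 2x2 max pooling over a (C, H, W) tensor (H, W even); the position of the
    # selected pixel in each window is stored for upsampling in the backward pass.
    c, h, w = x.shape
    windows = (x.reshape(c, h // 2, 2, w // 2, 2)
                .transpose(0, 1, 3, 2, 4)
                .reshape(c, h // 2, w // 2, 4))
    idx = windows.argmax(axis=-1)                               # selected pixel position
    out = np.take_along_axis(windows, idx[..., None], axis=-1)[..., 0]
    return out, idx
```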

  6. CNN Training Algorithm: Backward Pass
  [Figure: same CNN diagram with error gradients flowing backward from the loss through the FC, upsampling, and convolution layers]
  ▪ Error values are propagated back through the network
  ▪ Flipped kernels are used in the convolutions
  ▪ ReLU: gradients are scaled by the stored activation gradients
  ▪ Maxpool: upsample the image using the pooling indices
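A matching sketch of the backward-pass rules, again illustrative NumPy rather than the accelerator's RTL: gradients are masked by the stored activation gradients, upsampled through the stored pooling indices, and backward convolutions use 180-degree flipped kernels.

```python
import numpy as np

def relu_backward(dout, act_grad):
    # Gradients are scaled (masked) by the stored activation gradients.
    return dout * act_grad

def maxpool2x2_backward(dout, idx):
    # Upsample the gradient image: each value is routed to the pixel position
    # recorded during the forward pass; all other positions receive zero.
    c, ho, wo = dout.shape
    up = np.zeros((c, ho, wo, 4), dtype=dout.dtype)
    np.put_along_axis(up, idx[..., None], dout[..., None], axis=-1)
    return (up.reshape(c, ho, wo, 2, 2)
              .transpose(0, 1, 3, 2, 4)
              .reshape(c, 2 * ho, 2 * wo))

def flip_kernels(w):
    # Backward convolutions use 180-degree flipped kernels with the input/output
    # feature-map axes swapped (cf. the transposable weight buffer, slide 15).
    return np.flip(w, axis=(-2, -1)).swapaxes(0, 1)
```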

  7. CNN Training Algorithm: Weight Update
  [Figure: same CNN diagram highlighting the weight-gradient and weight-update paths of each layer]
  ▪ Weight gradients are computed and accumulated
  ▪ Convolutions involve intra-tile accumulations
  ▪ New weights are computed at the end of the batch
  ▪ Learning rate (β) and momentum (γ) parameters are used
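A minimal sketch of the weight-update step described here, assuming a standard SGD-with-momentum formulation with learning rate β and momentum γ (the exact update equation is not spelled out on the slide); the hyperparameter values and the weight_gradient() helper in the usage comment are placeholders.

```python
def weight_update(w, velocity, dw_acc, batch_size, beta=0.01, gamma=0.9):
    # New weights are computed only at the end of the batch: the accumulated
    # weight gradients are averaged, combined with the momentum term (gamma),
    # and applied with the learning rate (beta).
    grad = dw_acc / batch_size
    velocity = gamma * velocity - beta * grad
    return w + velocity, velocity

# Per image:  dw_acc += weight_gradient(activations, local_gradients)
# Per batch:  w, velocity = weight_update(w, velocity, dw_acc, batch_size); dw_acc[:] = 0
```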

  8. Outline ▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion 8

  9. Proposed RTL Compiler
  Inputs to the compiler:
  ▪ CNN architecture description
  ‒ Layer details: conv, pool, upsamp, scaling, weight update, flatten, loss
  ‒ Fixed-point precision of each layer
  ‒ Training parameters
  ▪ RTL model library
  ‒ Highly parameterized, flexible RTL files supporting CNN and training operations
  ‒ Loop unrolling and tiling mapping factors
  RTL Compiler for CNN Training:
  ▪ Configure hardware parameters
  ▪ Layer scheduling
  ▪ Generate parameters based on the CNN
  ▪ Initialize memory: initial weight and bias files, training data and labels, base addresses for gradients, activations, and weights
  Outputs:
  ▪ Top-level RTL integrated with training H/W modules
  ▪ DRAM initialization files
  ▪ FPGA synthesis of the generated design
  The RTL compiler generates the training accelerator from a high-level CNN description.
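For illustration only, a hypothetical high-level CNN description of the kind such a compiler could consume; the field names and values below are ours, not the paper's actual input format.

```python
# Hypothetical high-level description for the CIFAR-10 "1X" network
# (2(16C3)-MP-2(32C3)-MP-2(64C3)-MP-FC); all field names/values are illustrative.
cnn_description = {
    "layers": [
        {"type": "conv", "out_maps": 16, "kernel": 3}, {"type": "conv", "out_maps": 16, "kernel": 3},
        {"type": "maxpool", "size": 2},
        {"type": "conv", "out_maps": 32, "kernel": 3}, {"type": "conv", "out_maps": 32, "kernel": 3},
        {"type": "maxpool", "size": 2},
        {"type": "conv", "out_maps": 64, "kernel": 3}, {"type": "conv", "out_maps": 64, "kernel": 3},
        {"type": "maxpool", "size": 2},
        {"type": "flatten"}, {"type": "fc", "out": 10}, {"type": "loss", "kind": "softmax_xent"},
    ],
    "precision_bits": 16,                          # fixed-point precision per layer
    "training": {"batch_size": 40, "learning_rate": 0.01, "momentum": 0.9},
    "unroll": {"Pof": 16, "Pox": 7, "Poy": 7},     # loop unrolling/tiling factors
}
```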

  10. Outline ▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion 10

  11. Overall Architecture
  [Block diagram of the training accelerator]
  ▪ Computing modules: PE array (Conv/FC), ReLU/scale/loss unit, pooling and upsampling unit (demux/mult), weight-gradient buffers/accumulator, weight update (WU)
  ▪ Buffers: input buffer, output buffer, weight buffer, transposable WU weight buffer, old weight buffer, AG buffer, IDX buffer, on-chip buffers
  ▪ Interconnect and control: pixel data bus, weight data bus, index/AG bus, data router, data gather/scatter, DMA manager, DMA, control unit, global control logic
  AG – activation gradients, IDX – maxpool indices, UPSA – upsampling

  12. Overall Architecture (same block diagram as slide 11)

  13. Overall Architecture (same block diagram as slide 11)

  14. MAC Array
  [Figure: Pof x Pox grid of MAC units with transposable weight buffer, weight router, local gradient buffer, data router, and input pixel buffer fed from DRAM; pad, stride, kernel size, and training phase configure the routers]
  ▪ Output-stationary dataflow
  ▪ Data/weight reuse to minimize partial sum movement
  ▪ Reconfigurable MAC array to support all phases of training
  ▪ MAC array size is user determined via the training loop unroll factors (Pof, Pox, Poy)
  Buffer contents per training phase:
  Training phase | Input px buffer | Weight buffer   | Output buffer
  FP             | Activations     | Normal kernels  | Activations
  BP             | Local gradients | Flipped kernels | Local gradients
  WU             | Activations     | Local gradients | Kernel gradients
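A software model of the output-stationary loop nest this slide describes: the three innermost loops are the ones unrolled spatially into the Pof x Pox x Poy MAC array, so each MAC accumulates its own output pixel and partial sums never leave the array. This is an illustrative sketch, not the generated RTL.

```python
# Output-stationary convolution loop nest (illustrative Python, stride 1, no padding).
def conv_output_stationary(inp, wt, Nof, Nox, Noy, Nif, K, Pof, Pox, Poy):
    out = [[[0.0] * Noy for _ in range(Nox)] for _ in range(Nof)]
    for of0 in range(0, Nof, Pof):                 # output feature-map tiles
        for ox0 in range(0, Nox, Pox):             # output column tiles
            for oy0 in range(0, Noy, Poy):         # output row tiles
                for kif in range(Nif):             # input feature maps (sequential)
                    for kx in range(K):
                        for ky in range(K):
                            # --- unrolled in hardware: Pof*Pox*Poy parallel MACs,
                            # each holding its own partial sum locally ---
                            for of in range(of0, min(of0 + Pof, Nof)):
                                for ox in range(ox0, min(ox0 + Pox, Nox)):
                                    for oy in range(oy0, min(oy0 + Poy, Noy)):
                                        out[of][ox][oy] += (inp[kif][ox + kx][oy + ky]
                                                            * wt[of][kif][kx][ky])
    return out
```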

  15. Transposable Weight Buffers
  FP weight access pattern (out feature maps of layer L+1 vs. input feature maps of layer L):
  101 102 103 104
  201 202 203 204
  301 302 303 304
  401 402 403 404
  BP weight access pattern (transpose of the FP pattern):
  101 201 301 401
  102 202 302 402
  103 203 303 403
  104 204 304 404
  Transposable weight storage: block-circulant matrix across independent column buffers C0-C3 (each weight row rotated by its row index):
  C0  C1  C2  C3
  101 102 103 104
  204 201 202 203
  303 304 301 302
  402 403 404 401
  Read addresses to the transposable buffer during FP, BP, and WU:
  Training stage | C0 | C1 | C2 | C3
  FP             |  0 |  0 |  0 |  0
  BP             |  0 |  1 |  2 |  3
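An illustrative Python model of the block-circulant storage shown above: each weight row is rotated across the independent column buffers by its row index, so a row (FP) or a column (BP) of the weight matrix can be fetched with one read per buffer per cycle by changing only the per-buffer read addresses. Function names are ours.

```python
def store_block_circulant(weights):                  # weights[out_map][in_map]
    n = len(weights)
    banks = [[None] * n for _ in range(n)]           # banks[column_buffer][address]
    for r in range(n):
        for c in range(n):
            banks[(c + r) % n][r] = weights[r][c]    # rotate row r by r positions
    return banks

def read_fp_row(banks, r):
    # FP: every column buffer reads address r, then the outputs are rotated back.
    n = len(banks)
    data = [banks[col][r] for col in range(n)]
    return data[r:] + data[:r]

def read_bp_column(banks, c):
    # BP: buffer `col` reads address (col - c) mod n; the outputs form the
    # transposed column, i.e. one weight from every output feature map.
    n = len(banks)
    return [banks[(c + r) % n][r] for r in range(n)]

# Example matching the slide (weight r,c numbered as (r+1)*100 + (c+1)):
w = [[101, 102, 103, 104], [201, 202, 203, 204],
     [301, 302, 303, 304], [401, 402, 403, 404]]
banks = store_block_circulant(w)
assert read_fp_row(banks, 1) == [201, 202, 203, 204]      # FP: weight row
assert read_bp_column(banks, 0) == [101, 201, 301, 401]   # BP: transposed column
```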

  16. Outline ▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion 16

  17. Results
  CIFAR-10 1X: 2(16C3)-MP-2(32C3)-MP-2(64C3)-MP-FC
  ▪ Peak throughput of 479 GOPs
  ▪ Better energy efficiency than GPUs for smaller batch sizes
  ▪ Limited by DRAM bandwidth
  ▪ Images in a batch are processed sequentially
  Resource utilization and latency per epoch (s):
  CNN          | DSP  | ALM   | BRAM  | BS-10 | BS-20 | BS-40
  CIFAR-10 1X  | 30%  | 19%   | 4.4%  | 18.2  | 18    | 18.01
  CIFAR-10 2X  | 58%  | 44%   | 9.5%  | 41.7  | 41.3  | 41
  CIFAR-10 4X  | 100% | 76.2% | 22.4% | 98.2  | 96.8  | 96.18
  Throughput (GOPs):
  CNN          | Titan XP, BS=1 | Titan XP, BS=40 | FPGA, BS=(1/40)
  CIFAR-10 1X  | 45.6           | 551.8           | 163
  CIFAR-10 2X  | 128.8          | 1337.9          | 282
  CIFAR-10 4X  | 331.4          | 2353.7          | 479
  Energy efficiency (GOPs/W):
  CNN          | Titan XP, BS=1 | Titan XP, BS=40 | FPGA, BS=(1/40)
  CIFAR-10 1X  | 0.5            | 3.7             | 7.9
  CIFAR-10 2X  | 1.3            | 8.3             | 8.59
  CIFAR-10 4X  | 2.9            | 13.5            | 9.49

  18. Latency Breakdown
  [Pie charts: logic vs. DRAM latency split per training phase: 11%/89%, 12%/88%, 29%/71%]
  ▪ Latency of the CIFAR-10 4X CNN for one iteration of a batch
  ▪ Overall, ~20% of the latency is logic latency and ~80% is due to DRAM access
  ▪ The weight update phase is memory intensive, accounting for ~51% of the overall latency

  19. Outline ▪ Introduction ▪ CNN training algorithm ▪ RTL compiler ▪ CNN training accelerator ▪ Results ▪ Conclusion 19

  20. Conclusion ▪ Automatic RTL compiler-based CNN training accelerator ▪ Implemented a parameterized RTL library to support CNN training operations ▪ Evaluated performance on an Intel Stratix-10 GX FPGA for three CNNs on the CIFAR-10 dataset ▪ Achieved 479 GOPs peak throughput

  21. Acknowledgements C-BRIC (Center for Brain-Inspired Computing). We thank Intel Corporation for supporting and funding this research work. This work was also partially supported by NSF grant 1652866 and C-BRIC, one of six centers in JUMP, an SRC program sponsored by DARPA.
