

  1. A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks. Xiaoyu Yu 1, Yuwei Wang 1, Jie Miao 1, Ephrem Wu 2, Heng Zhang 1, Yu Meng 1, Bo Zhang 1, Biao Min 1, Dewei Chen 1, Jianlin Gao 1. (1 Tencent, Shenzhen, China; 2 Xilinx, Inc., San Jose, CA 95124, USA)

  2. About Tencent  Founded in 1998  Monthly active users reach 1 / 0.8 billion for WeChat / QQ  Users in over 200 countries  One of the top 5 Internet companies by market value  Image and video workloads: photos from WeChat Moments, profile photos, videos from WeChat Moments, images in group chat, live video streaming

  3. Background  CNN models are widely used in Tencent  Billions of operations per inference task × billions of tasks each day  Models are still evolving rapidly  A reconfigurable accelerator is desirable  Three key objectives: • Support different CNN models and make new models easy to try • Achieve higher performance to lower TCO • Low latency

  4. Framework for General-Purpose CNN Acceleration  More and more CNN models  Operator classification (a toy breakdown is sketched below): • Convolution: 19% of operator types, 95%+ of the computation cost • Non-convolution: 81% of operator types, <5% of the computation cost  Different design strategies: • Convolution: improve performance • Non-convolution: support many types of operators
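As a rough illustration of this split, here is a minimal sketch that tallies operator types versus compute share from a hypothetical per-operator cost list; the operator names and MOP figures are invented for illustration, not measurements from the paper:

```python
# Illustrative only: classify a model's operators into convolution vs. non-convolution
# and compare share of operator *types* vs. share of *compute*. The MOP numbers are
# made up; the paper reports ~19% of types / 95%+ of compute for convolution.
ops = [  # (operator type, total MOPs contributed by that type)
    ("Conv2D", 1250.0), ("DepthwiseConv", 60.0), ("MaxPool", 3.0), ("AvgPool", 1.0),
    ("BatchNorm", 5.0), ("ReLU", 4.0), ("ElementAdd", 2.0), ("LRN", 1.5), ("Softmax", 0.2),
]
conv_types = {"Conv2D"}                          # handled by the convolution engine
total = sum(m for _, m in ops)
conv_mops = sum(m for name, m in ops if name in conv_types)
n_conv = sum(1 for name, _ in ops if name in conv_types)
print(f"convolution:     {n_conv}/{len(ops)} types, {conv_mops/total:.0%} of compute")
print(f"non-convolution: {len(ops)-n_conv}/{len(ops)} types, {1-conv_mops/total:.0%} of compute")
```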

  5. Unified Computing Engine for Convolution – Supertile  Performance = Freq. × Dsp_num × Ops_per_dsp  The supertile method runs the DSPs at twice the clock rate of the surrounding logic [1].  Enhanced Processing Element (EPE)  (Figure: EPE datapath with a weight cache, buffers A–D, two MUXes, and a DSP multiply-accumulate fed by activation and weight inputs.)  [1] E. Wu, X. Zhang, D. Berman, and I. Cho, "A high-throughput reconfigurable processing array for neural networks," in Proc. 27th Int. Conf. on Field Programmable Logic and Applications (FPL), 2017, pp. 1–4.
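A worked check of the formula above, using the DSP count and clocks reported later in the deck (slides 18 and 20); counting one DSP multiply-accumulate as 2 ops per DSP clock is my assumption:

```python
# Peak performance = Freq * Dsp_num * Ops_per_dsp.
# The supertile method clocks the DSPs at 2x the surrounding logic (500 MHz vs. 250 MHz),
# so each DSP delivers one MAC (counted here as 2 ops) per 500 MHz cycle.
dsp_clock_hz = 500e6      # DSP clock, double the 250 MHz fabric clock
ops_per_dsp  = 2          # one multiply-accumulate per DSP cycle (assumption)
dsps_used    = 4214       # DSPs used on the KU115 (slide 20)

peak_tops = dsp_clock_hz * dsps_used * ops_per_dsp / 1e12
print(f"{peak_tops:.2f} TOP/s")   # ~4.21, matching the quoted 4.2 TOP/s @ int16
```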

  6. Unified Computing Engine for Convolution – Supertile Unit (SU)  An input feature-map tile (IFT) with C_in channels is read from the input buffer; temporary results of the output feature-map tile (OFT), with 2n channels, are saved into the output buffer.  The SU is an m × n EPE array, and one SU performs the convolution of an IFT with 2n input kernel groups, each with m channels.  (Figure: IFT of size H_m × W_m, rows IF_1 … IF_m, feeding the m × n array EPE_11 … EPE_mn.)  A functional model is sketched below.
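A small functional model of what one SU computes under this tiling, assuming the tile handed to the SU has C_in = m channels (my simplification) and is convolved with 2n kernel groups of m channels each, producing a 2n-channel OFT. It models only the arithmetic mapping, not EPE pipelining or the doubled DSP clock; all shapes are illustrative:

```python
import numpy as np

def su_convolve(ift, kernels, stride=1):
    """Functional model of one supertile unit (SU).

    ift:     input feature-map tile, shape (m, H, W)   -> C_in = m channels
    kernels: 2n kernel groups, shape (2n, m, k, k)     -> one output channel each
    returns: output feature-map tile, shape (2n, H_out, W_out)
    """
    m, H, W = ift.shape
    out_ch, m_k, k, _ = kernels.shape
    assert m == m_k, "each kernel group must have m input channels"
    H_out = (H - k) // stride + 1
    W_out = (W - k) // stride + 1
    oft = np.zeros((out_ch, H_out, W_out), dtype=ift.dtype)
    for oc in range(out_ch):     # 2n output channels (each EPE column covers two, per the doubled DSP clock)
        for y in range(H_out):
            for x in range(W_out):
                window = ift[:, y*stride:y*stride+k, x*stride:x*stride+k]
                oft[oc, y, x] = np.sum(window * kernels[oc])
    return oft

# Example: m = 8 input channels, n = 4 EPE columns -> 2n = 8 output channels
ift = np.random.rand(8, 10, 10).astype(np.float32)
kernels = np.random.rand(8, 8, 3, 3).astype(np.float32)
print(su_convolve(ift, kernels).shape)   # (8, 8, 8)
```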

  7. Unified Computing Engine for Convolution – Scaled-up SU  Two challenges: • Task partition • Data bandwidth would be multiplied  Solutions: • Interleaved task dispatching (sketched below) • Dispatching-assembling buffering model with a broadcast cache (BC)  (Figure: (a) input buffer feeding a broadcast cache with BC sets 0–3, four supertile units SU 0–3, output buffer with OB sets 0–3, and an assemble reader; (b) with kernel size 3×3 and stride 1, output windows are dispatched W0, W4, … to SU 0; W1, W5, … to SU 1; W2, W6, … to SU 2; W3, W7, … to SU 3.)
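A minimal sketch of the interleaved task dispatching from the figure: output windows are handed out round-robin to the four SUs, and the assemble reader collects the per-SU results back into window order. The window list and result placeholders are illustrative:

```python
# Round-robin (interleaved) dispatch of output windows to supertile units:
# SU 0 gets W0, W4, ...; SU 1 gets W1, W5, ...; matching the figure.
num_sus = 4
windows = [f"W{i}" for i in range(12)]                  # windows of one layer (illustrative)

dispatch = {su: windows[su::num_sus] for su in range(num_sus)}
for su, tasks in dispatch.items():
    print(f"SU {su}: {tasks}")                          # SU 0: ['W0', 'W4', 'W8'] ...

# Assemble-reader side: merge per-SU results back into window order for the output buffer.
results = {w: f"oft({w})" for tasks in dispatch.values() for w in tasks}
assembled = [results[w] for w in windows]
```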

  8. Unified Computing Engine for Convolution – Broadcast Cache  The BC is a circular buffer.  BC window stride = 4 × convolution stride.  It raises the effective read bandwidth from B bit/s to 4B bit/s, since one read stream from the input buffer is broadcast to all four SUs (16 GB/s -> 64 GB/s on this design, slide 18).  (Figure: with kernel size 3×3 and stride 1, input rows 1–10 are buffered in BC sets 0–3; a sliding window with WindowStride = Stride × 4 feeds SU 0–3, whose results reach OB sets 0–3 through the assemble reader.)  A behavioral sketch follows below.
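A behavioral sketch of the broadcast cache as a circular buffer of input rows: one read stream from the input buffer fills the BC, each resident window is broadcast to the four SUs, and the window slides by 4 × the convolution stride while each SU still sees its own stride. Row granularity and buffer sizing here are my assumptions:

```python
from collections import deque

NUM_SUS = 4
KERNEL = 3                                   # 3x3 kernel, as in the figure
CONV_STRIDE = 1
BC_WINDOW_STRIDE = NUM_SUS * CONV_STRIDE     # BC window stride = 4 x convolution stride

def broadcast_cache(rows):
    """Yield, per BC window, the row slice each of the four SUs reads from the shared cache."""
    # Circular buffer holding one BC window: kernel height plus three extra strides.
    bc = deque(maxlen=KERNEL + (NUM_SUS - 1) * CONV_STRIDE)
    for r in rows:
        bc.append(r)
        if len(bc) == bc.maxlen:             # a full BC window is resident
            window = list(bc)
            # One input-buffer read stream is broadcast to all four SUs:
            # SU i sees the kernel-height slice starting at offset i * stride.
            yield [window[i * CONV_STRIDE : i * CONV_STRIDE + KERNEL] for i in range(NUM_SUS)]
            for _ in range(BC_WINDOW_STRIDE):   # slide the BC window by 4 rows
                bc.popleft()

for su_views in broadcast_cache(range(1, 11)):  # input rows 1..10, as in the figure
    print(su_views)
# [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
# [[5, 6, 7], [6, 7, 8], [7, 8, 9], [8, 9, 10]]
```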

  9. Non-convolution Ops in Inference  Challenges: • Many types of non-convolution ops • Resource limitation  Solutions: apply a different design strategy to each class • Filter Processing Unit (FPU): MaxPool / AvgPool / DepthwiseConv / BN / ReLU / ReLU6 • Customization for operations across channels: LRN • Operator fusion: ElementAdd / ReLU / DynamicQuantization • Functional logic sharing

  10. Postprocessing – FPU  Common styles: • Two-level data-access style on the output buffer (OB) • No operations across the 2n channels • Parameter similarities • Pointwise operations as special cases  (Figure: filter processing unit with ucmd fetch and decode, kernel load control, slice loop control, address generation, a kernel buffer and a ucmd buffer, and ALUs containing ×, >, and + units, split into a worker part and a function-sharing part.)

  11. Postprocessing – FPU  Reconfigurable ALU  (Figure: ALU datapath with a pre-multiplier (pre_mul, ports premul_port_a / premul_port_b), a comparator (mid_cmp) with a threshold input and cmp_func_set, a shared adder (mid_adder, port mid_share_port_b, add_func_set), a back multiplier (back_mul, port backmul_port_b), MUXes, and output registers.)  A behavioral sketch follows below.

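A behavioral sketch of the reconfigurable ALU datapath shown above: a pre-multiplier (pre_mul), a comparator (mid_cmp) with a threshold, a shared adder (mid_adder), and a back multiplier (back_mul), selected by MUXes and the cmp_func_set / add_func_set configuration. How each operator is routed through these units is my assumption, included only to illustrate the function-sharing idea:

```python
def alu_step(x, state=None, *, pre_a=1.0, use_cmp=False, threshold=0.0,
             add_b=0.0, accumulate=False, back_b=1.0):
    """One ALU pass: x -> pre_mul -> (compare/max OR add) -> back_mul.
    `state` is the running register value (a pooling max or an accumulating sum)."""
    v = x * pre_a                                        # pre_mul (premul_port_a)
    if use_cmp:                                          # mid_cmp: keep the larger value
        return max(v, threshold if state is None else state)
    v = v + ((state or 0.0) if accumulate else add_b)    # mid_adder (mid_share_port_b)
    return v * back_b                                    # back_mul (backmul_port_b)

# Max pooling: route through the comparator only.
s = None
for x in [0.3, 1.7, -0.5, 0.9]:
    s = alu_step(x, s, use_cmp=True)
print("maxpool:", s)                                     # 1.7

# ReLU: compare against a zero threshold, one element at a time.
print("relu:", alu_step(-0.4, use_cmp=True, threshold=0.0))          # 0.0

# Batch norm (inference), folded to y = a*x + b: use pre_mul and the adder.
print("bn:", alu_step(2.0, pre_a=0.5, add_b=-1.0))                   # 0.0

# Average pooling: accumulate, then scale by 1/N with the back multiplier.
vals = [1.0, 2.0, 3.0, 4.0]
s = 0.0
for x in vals[:-1]:
    s = alu_step(x, s, accumulate=True)
print("avgpool:", alu_step(vals[-1], s, accumulate=True, back_b=1/len(vals)))  # 2.5
```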

  14. Postprocessing – Operator Fusion  Avoids extra memory accesses.  Four operations are fused with convolution [2].  [2] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. 2016 ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), 2016.
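A minimal sketch of the fusion idea, assuming the fused chain is an element-wise add, ReLU, and dynamic quantization applied to the convolution result before write-back, so intermediate tensors never round-trip through off-chip memory. The exact set of four fused operations and the quantization scheme below are stand-ins, not the paper's definition:

```python
import numpy as np

def fused_postops(conv_out, residual=None, relu=True, out_bits=16):
    """Apply fused post-operations to a convolution result in a single pass."""
    y = conv_out.astype(np.float32)
    if residual is not None:
        y = y + residual                        # ElementAdd (e.g. a residual branch)
    if relu:
        y = np.maximum(y, 0.0)                  # ReLU
    # Dynamic quantization (stand-in): per-tensor scale chosen from the actual range.
    qmax = 2 ** (out_bits - 1) - 1
    scale = (float(np.max(np.abs(y))) / qmax) or 1.0
    return np.round(y / scale).astype(np.int16), scale

conv_out = np.random.randn(4, 8, 8).astype(np.float32)
residual = np.random.randn(4, 8, 8).astype(np.float32)
quantized, scale = fused_postops(conv_out, residual)
print(quantized.dtype, quantized.shape, scale)
```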

  15. System Overview

  16. System Overview

  17. System Overview

  18. System Overview  Frequency: 500 MHz for the EPEs; 250 MHz for everything else.  DSPs: each SU uses 512 DSPs; each CNN engine has 4 SUs (2048 DSPs); the whole chip has two CNN engines (4096 DSPs), providing 4.2 TOP/s @ int16 for convolution.  Memory: IB 4.2 Mbit × 2, read bandwidth 16 GB/s -> 64 GB/s effective with BC; OB 4.2 Mbit × 2, bandwidth 64 GB/s × 2 (read and write).

  19. Experimental Results: Performance on three models (AlexNet, GoogLeNet, HCNet: high-concurrency network)
      Metric                         AlexNet         GoogLeNet       HCNet
      Data precision                 16-bit          16-bit          16-bit
      Clock (MHz)                    250/500         250/500         225/450
      Batch size                     4               2               4
      CNN size (MOPs)                1331.6/1448.8   3081.0/3083.1   444
      Throughput (FPS)               1753.8          527.7           1465.1
      Performance (GOP/s)            2335.4          1625.9          650.5
      Latency (ms)                   2.3             3.8             2.7
      Power (watts)                  62.6            56.6            57.6
      Speedup vs. P4 (7 ms)          1.4             3.9             3.4
      Energy efficiency (GOP/s/W)    37.3            28.7            11.3
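As a quick sanity check of the table's arithmetic (GOP/s = model size × frame rate; GOP/s/W = GOP/s divided by power), using the numbers above and the first of the two AlexNet model-size figures:

```python
# Cross-check the reported figures: GOP/s = CNN size (GOP) * FPS, efficiency = GOP/s / W.
models = {   # name: (CNN size in MOPs, throughput in FPS, power in watts)
    "AlexNet":   (1331.6, 1753.8, 62.6),
    "GoogLeNet": (3081.0,  527.7, 56.6),
    "HCNet":     ( 444.0, 1465.1, 57.6),
}
for name, (mops, fps, watts) in models.items():
    gops = mops / 1e3 * fps
    print(f"{name:10s} {gops:7.1f} GOP/s   {gops / watts:5.1f} GOP/s/W")
# AlexNet ~2335.4 / 37.3, GoogLeNet ~1625.8 / 28.7, HCNet ~650.5 / 11.3 -- matches the table
```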

  20. Experimental Results: Comparison with FPGA-Based Accelerators
                                 [3]            [4]            [5]         Ours        Ours
      FPGA chip                  Arria10-1150   Virtex7-690t   KU115       KU115       KU115
      Network                    VGG            AlexNet        VGG         GoogLeNet   AlexNet
      CNN size (GOPs)            30.8           1.4            30.8        3.1         1.3
      Freq (MHz)                 385            150            235         250/500     250/500
      Precision                  Fix16          Fix16          Fix16       Fix16       Fix16
      DSPs (used/total)          2756/3036      2833/3600      4318/5520   4214/5520   4214/5520
      Peak performance (TOP/s)   2.1            0.8            2.1         4.2         4.2
      Real performance (TOP/s)   1.79           0.6            2           1.63        2.3
      [3] J. Zhang and J. Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network," in Proc. 2017 ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), 2017.
      [4] C. Zhang et al., "Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks," in Proc. 35th Int. Conf. on Computer-Aided Design (ICCAD), 2016, p. 12.
      [5] X. Zhang et al., "DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs," in Proc. Int. Conf. on Computer-Aided Design (ICCAD), 2018.

  21. Experimental Results: Comparison with CPUs and GPU in the datacenter
      Processor         Per server   TOP/s @16bit   TOP/s @FP32   nm   MHz       On-chip mem (MB)   Off-chip mem BW (GB/s)   Power (W)   Release
      Intel E5-2680V4   2            -              -             14   2400      35 × 2             76.8 × 2                 120         2016 Q1
      NVIDIA P4         1            -              5.5           16   1000      10 [38]            192                      50-75       2016 Q3
      Xilinx KU115      1            4.2            -             20   250/500   11.8               38.4                     50-66       2014 Q4

  22. Comparison with CPU and GPU (same table as the previous slide)

  23. Comparison with CPU and GPU  Limitations (KU115 vs. P4): • Simpler fabrication process • 20% of the memory bandwidth • 1/4 of the frequency of the P4  Achievements: • Superior performance in the latency-sensitive test • 89% of the throughput with 1/57 of the latency in the throughput-sensitive test • Performance can be improved further (UltraScale+ VU9P, 16 nm)

  24. Conclusion  A unified framework that handles different CNN models and makes new models easy to try.  Supertile EPEs are scaled up into multiple SUs with an interleaved task-dispatching method to break the computation bound.  The bandwidth limitation is overcome with the dispatching-assembling buffering model and the broadcast cache (BC).  A configurable FPU is proposed to support many types of non-convolution operators.
      Performance: 4.2 TOP/s in fix16   Latency: 50 × lower than GPU   TCO: 149% vs. 32% compared with CPU   Application: serves 1 billion people every day
