A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks
Xiaoyu Yu 1, Yuwei Wang 1, Jie Miao 1, Ephrem Wu 2, Heng Zhang 1, Yu Meng 1, Bo Zhang 1, Biao Min 1, Dewei Chen 1, Jianlin Gao 1
1 Tencent, Shenzhen, China
2 Xilinx, Inc., San Jose, CA 95124, USA
About Tencent
• Founded in 1998
• Monthly active users: 1 billion (WeChat) / 0.8 billion (QQ)
• Users in over 200 countries
• One of the top 5 Internet companies by market value
• Image and video workloads: photos and videos from WeChat Moments, images in group chat, profile photos, live video streaming
Background
CNN models are widely used in Tencent: billions of operations per inference task × billions of tasks each day. Models are still evolving rapidly, so a reconfigurable accelerator is desirable.
Three key objectives:
• Support different CNN models, easy to try
• Achieve higher performance to lower TCO
• Low latency
Framework for General-Purpose CNN Acceleration
More and more CNN models call for an operator classification:
• Convolution: 19% of operator types, 95%+ of computation cost
• Non-convolution: 81% of operator types, <5% of computation cost
Different design strategies:
• Convolution: maximize performance
• Non-convolution: support the many operator types
Unified Computing Engine for Convolution – Supertile
Performance = Frequency × DSP_num × Ops_per_DSP
The supertile method runs the DSPs at twice the clock rate of the surrounding logic [1].
(Figure: Enhanced Processing Element (EPE) – weight buffers Buf A/B with a weight-cache update path, buffers Buf C/D, input MUXes, and a DSP multiply (×) and add (+) fed by the activation input and the cached weight.)
[1] E. Wu, X. Zhang, D. Berman, and I. Cho, "A high-throughput reconfigurable processing array for neural networks," 27th Int. Conf. on Field Programmable Logic and Applications (FPL), IEEE, 2017, pp. 1-4.
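A minimal sketch of the peak-performance relation above, plugging in the configuration reported on the later System Overview and results slides; the assumption that each DSP performs 2 ops (one multiply plus one accumulate) per 500 MHz cycle is ours.

```python
# Sketch of Performance = Frequency * DSP_num * Ops_per_DSP.
# DSP counts come from the System Overview / results slides;
# ops_per_dsp = 2 assumes one multiply and one accumulate per DSP per cycle.

def peak_tops(freq_hz: float, dsp_num: int, ops_per_dsp: int = 2) -> float:
    """Peak throughput in TOP/s."""
    return freq_hz * dsp_num * ops_per_dsp / 1e12

# Supertile: the DSPs run at 500 MHz, twice the 250 MHz fabric clock.
print(peak_tops(500e6, 4096))   # ~4.1 TOP/s (2 engines x 4 SUs x 512 DSPs)
print(peak_tops(500e6, 4214))   # ~4.2 TOP/s with all mapped DSPs (see results table)
```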
Unified Computing Engine for Convolution – Supertile Unit (SU)
A supertile unit is an m × n EPE array. An input feature-map tile (IFT) with C_in channels is read from the input buffer, convolved with 2n kernel groups (each with m channels), and the temporary results of the output feature-map tile (OFT), with 2n channels, are saved into the output buffer.
(Figure: convolution of an IFT of height H and width W with 2n kernel groups on one SU; input rows IF_1 ... IF_m feed the m × n array of EPEs EPE_11 ... EPE_mn.)
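A functional, not cycle-accurate, sketch of what one SU computes on a tile, under our reading of the slide: an m-channel input tile convolved with 2n kernel groups yields a 2n-channel output tile (each of the n EPE columns serves two kernel groups because its DSP runs at twice the fabric clock). The shapes and the direct-convolution loop are our illustration, not the RTL.

```python
import numpy as np

def su_conv_tile(ift, kernels, stride=1):
    """ift: (m, H, W); kernels: (2n, m, k, k) -> oft: (2n, H_out, W_out)."""
    m, H, W = ift.shape
    out_ch, m_k, k, _ = kernels.shape
    assert m == m_k
    H_out, W_out = (H - k) // stride + 1, (W - k) // stride + 1
    oft = np.zeros((out_ch, H_out, W_out), dtype=ift.dtype)
    for oc in range(out_ch):                 # 2n kernel groups
        for oy in range(H_out):
            for ox in range(W_out):
                patch = ift[:, oy*stride:oy*stride+k, ox*stride:ox*stride+k]
                oft[oc, oy, ox] = np.sum(patch * kernels[oc])
    return oft

oft = su_conv_tile(np.random.rand(16, 8, 8).astype(np.float32),
                   np.random.rand(8, 16, 3, 3).astype(np.float32))
print(oft.shape)  # (8, 6, 6): 2n = 8 output channels from an m = 16 channel tile
```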
Unified Computing Engine for Convolution – Scaled-up SU
Two challenges when scaling to multiple SUs:
• Task partition
• Data bandwidth would be multiplied
Solutions:
• Interleaved task dispatching
• Dispatching-assembling buffering model
• Broadcast cache (BC)
(Figure: kernel size 3×3, stride = 1. The input buffer feeds a broadcast cache with BC sets 0-3, one per supertile unit SU0-SU3; windows W0, W4, ... go to SU0, W1, W5, ... to SU1, W2, W6, ... to SU2, W3, W7, ... to SU3; per-SU output-buffer sets OB0-OB3 are read back in order by the assemble reader.)
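A small sketch of the interleaved task dispatching, under our reading of the figure: convolution windows W0, W1, W2, ... are dealt round-robin to the four SUs, and the assemble reader restores the original order from the per-SU output-buffer sets. Function names are ours.

```python
NUM_SUS = 4

def dispatch(windows):
    """Deal windows to SUs round-robin: one work list per SU."""
    return [windows[su::NUM_SUS] for su in range(NUM_SUS)]

def assemble(per_su_results):
    """Interleave per-SU results back into the original window order."""
    ordered = []
    for i in range(max(len(r) for r in per_su_results)):
        for su in range(NUM_SUS):
            if i < len(per_su_results[su]):
                ordered.append(per_su_results[su][i])
    return ordered

windows = [f"W{i}" for i in range(10)]
per_su = dispatch(windows)      # [['W0','W4','W8'], ['W1','W5','W9'], ...]
print(per_su)
print(assemble(per_su))         # ['W0', 'W1', ..., 'W9'] restored in order
```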
Unified Computing Engine for Convolution – Broadcast Cache
The broadcast cache is a circular buffer whose window stride is 4 × the convolution stride (WindowStride = Stride × 4), so one cached span supplies four adjacent windows to the four SUs. Each input-buffer row is read once and broadcast to all four SUs, multiplying the effective read bandwidth by 4 (16 GB/s → 64 GB/s; see System Overview).
(Figure: kernel size 3×3, stride = 1; input-buffer rows 1-10 stream through BC sets 0-3, a sliding window feeds SU0-SU3, and OB sets 0-3 are drained by the assemble reader.)
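A behavioral sketch of the broadcast-cache idea as we read the slide: rows are fetched from the input buffer once, held in a small circular buffer, and broadcast to all four SUs, whose windows advance together with window_stride = conv_stride × 4. The row bookkeeping and constants are illustrative.

```python
from collections import deque

NUM_SUS, KERNEL, STRIDE = 4, 3, 1
WINDOW_STRIDE = STRIDE * NUM_SUS                 # BC window stride from the slide

rows_needed = KERNEL + STRIDE * (NUM_SUS - 1)    # rows covering 4 adjacent windows
bc = deque(maxlen=rows_needed)                   # circular buffer of cached rows

reads_from_input_buffer = 0
for row_id in range(1, 11):                      # rows 1..10 as in the figure
    bc.append(row_id)                            # each row fetched exactly once
    reads_from_input_buffer += 1
    if len(bc) == rows_needed:
        # one cached span feeds all 4 SUs in parallel
        windows = [list(bc)[su*STRIDE : su*STRIDE + KERNEL] for su in range(NUM_SUS)]
        print("broadcast", windows)
        for _ in range(WINDOW_STRIDE):           # slide the span by window_stride rows
            bc.popleft()

# Every cached row is consumed by 4 SUs, so the effective read bandwidth is
# ~4x the physical input-buffer bandwidth (16 GB/s -> 64 GB/s on this design).
print("rows fetched:", reads_from_input_buffer)
```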
Non-convolution Ops in Inference
Challenges:
• Many types of non-convolution operators
• Resource limitation
Solutions, with a different design strategy per class and functional-logic sharing:
• Filter Processing Unit (FPU): MaxPool / AvgPool / DepthwiseConv / BN / ReLU / ReLU6
• Customization: operations across channels, e.g. LRN
• Operator fusion: ElementAdd / ReLU / DynamicQuantization
Postprocessing – FPU
Common characteristics exploited by the FPU:
• Two-level data-access style (a slice loop over the output buffer plus a kernel-window loop)
• No operations across the 2n channels of an output tile
• Parameter similarities across operators
• Pointwise operations handled as special cases
(Figure: FPU block diagram – output buffer (OB), kernel load control, micro-command (ucmd) fetch and decode, slice loop control, address generation, kernel buffer and ucmd buffer; a "worker part" of ALUs (×, >, +) and a shared "function-sharing part".)
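A minimal behavioral sketch of the two-level access pattern, assuming an outer slice loop over an output-buffer slice and an inner kernel-window loop handed to a shared ALU callback; the function names and the pooling/ReLU examples are ours, not the FPU microcode.

```python
import numpy as np

def fpu_filter(slice2d, kernel=2, stride=2, alu=np.max):
    """slice2d: one 2D channel slice from the output buffer."""
    H, W = slice2d.shape
    out = np.empty(((H - kernel)//stride + 1, (W - kernel)//stride + 1), slice2d.dtype)
    for oy in range(out.shape[0]):            # slice loop (outer level)
        for ox in range(out.shape[1]):
            win = slice2d[oy*stride:oy*stride+kernel, ox*stride:ox*stride+kernel]
            out[oy, ox] = alu(win)            # kernel-window loop folded into the ALU call
    return out

x = np.arange(16, dtype=np.int32).reshape(4, 4)
print(fpu_filter(x, alu=np.max))                          # 2x2 max pooling
print(fpu_filter(x, kernel=1, stride=1,
                 alu=lambda w: np.maximum(w, 0).item()))  # pointwise ReLU as the 1x1 case
```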
Postprocessing – FPU Reconfigurable ALU
(Figure: ALU datapath – a threshold comparator (threshold, cmp_func_set, mid_cmp), a pre-multiplier (premul_port_a, premul_port_b, pre_mul), a middle adder (mid_adder, mid_share_port_b, add_func_set), a back multiplier (back_mul, backmul_port_b), MUXes and registers; different operators are realized by setting the configuration ports.)
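A behavioral sketch of how such a comparator / pre-multiplier / adder / back-multiplier chain can be reconfigured per operator. The port-to-parameter mapping and the ReLU, batch-norm and dynamic-quantization configurations below are our illustration, not the actual microcode.

```python
from dataclasses import dataclass

@dataclass
class AluConfig:
    use_cmp: bool = False      # route the input through the threshold comparator
    threshold: float = 0.0     # "threshold" port
    pre_mul: float = 1.0       # premul_port_b (scale before the adder)
    add_b: float = 0.0         # mid_share_port_b (adder second operand)
    back_mul: float = 1.0      # backmul_port_b (scale after the adder)

def alu(x: float, cfg: AluConfig) -> float:
    y = max(x, cfg.threshold) if cfg.use_cmp else x   # comparator stage
    y = y * cfg.pre_mul                               # pre-multiplier
    y = y + cfg.add_b                                 # middle adder
    return y * cfg.back_mul                           # back multiplier

relu  = AluConfig(use_cmp=True, threshold=0.0)
bnorm = AluConfig(pre_mul=0.5, add_b=1.0)             # y = 0.5*x + 1.0
quant = AluConfig(back_mul=1/128)                     # dynamic-quantization rescale

print(alu(-3.0, relu), alu(4.0, bnorm), alu(64.0, quant))  # 0.0 3.0 0.5
```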
Postprocessing – Operator Fusion
Fusing operators into the convolution output path avoids extra memory accesses; four operations are fused with convolution [2].
[2] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," Proc. 2016 ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, ACM, 2016.
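A sketch of the fusion idea: element-wise add, ReLU and dynamic quantization are applied to the convolution output while it is still on chip, instead of writing the intermediate tensor out and reading it back for each operator. The function, shapes and scale are illustrative.

```python
import numpy as np

def conv_fused(conv_out, residual, quant_scale):
    """conv_out, residual: same-shape feature maps; returns an int16 output tile."""
    fused = conv_out + residual                    # ElementAdd fused into the store path
    fused = np.maximum(fused, 0)                   # ReLU
    q = np.clip(np.round(fused * quant_scale),     # dynamic quantization
                -32768, 32767).astype(np.int16)
    return q                                       # one pass, one memory write

conv_out = np.random.randn(4, 4).astype(np.float32)
residual = np.random.randn(4, 4).astype(np.float32)
print(conv_fused(conv_out, residual, quant_scale=256.0).dtype)  # int16
```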
System Overview
System Overview
Frequency:
• 500 MHz for the EPEs
• 250 MHz for everything else
DSPs:
• Each SU: 512 DSPs
• Each CNN engine: 4 SUs, 2048 DSPs
• Whole chip: two CNN engines, 4096 DSPs, providing 4.2 TOP/s @ int16 for convolution
Memory:
• IB: 4.2 Mbit × 2; bandwidth 16 GB/s (read) → 64 GB/s effective with BC
• OB: 4.2 Mbit × 2; bandwidth 64 GB/s × 2 (read and write)
Experimental Results: Performance on Three Models

                             AlexNet          GoogLeNet        HCNet (high-concurrency network)
Data precision               16-bit           16-bit           16-bit
Clock (MHz)                  250/500          250/500          225/450
Batch size                   4                2                4
CNN size (MOPs)              1331.6/1448.8    3081.0/3083.1    444
Throughput (FPS)             1753.8           527.7            1465.1
Performance (GOP/s)          2335.4           1625.9           650.5
Latency (ms)                 2.3              3.8              2.7
Power (watts)                62.6             56.6             57.6
Speedup vs. P4 (7 ms)        1.4              3.9              3.4
Energy efficiency (GOP/s/W)  37.3             28.7             11.3
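A quick check of how the table's derived columns relate, a sketch using the AlexNet column: performance = throughput × model size, and energy efficiency = performance / power.

```python
fps, mops, watts = 1753.8, 1331.6, 62.6
gops = fps * mops / 1e3
print(round(gops, 1), round(gops / watts, 1))   # ~2335.4 GOP/s, ~37.3 GOP/s/W
```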
Experimental Results: Comparison with FPGA-Based Accelerators

                           [3]             [4]             [5]          Ours           Ours
FPGA chip                  Arria10-1150    Virtex7-690t    KU115        KU115          KU115
Network                    VGG             AlexNet         VGG          GoogLeNet      AlexNet
CNN size (GOPs)            30.8            1.4             30.8         3.1            1.3
Freq (MHz)                 385             150             235          250/500        250/500
Precision                  Fix16           Fix16           Fix16        Fix16          Fix16
DSPs (used/total)          2756/3036       2833/3600       4318/5520    4214/5520      4214/5520
Peak performance (TOP/s)   2.1             0.8             2.1          4.2            4.2
Real performance (TOP/s)   1.79            0.6             2            1.63           2.3

[3] J. Zhang and J. Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network," Proc. 2017 ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, ACM, 2017.
[4] C. Zhang et al., "Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks," Proc. 35th Int. Conf. on Computer-Aided Design, ACM, 2016, p. 12.
[5] X. Zhang et al., "DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs," Proc. Int. Conf. on Computer-Aided Design, ACM, 2018.
Experimental Results: Comparison with CPUs and GPU in the Datacenter

Processor          Per server   TOP/s (16-bit)   TOP/s (FP32)   nm   MHz       On-chip mem. (MB)   Off-chip mem. BW (GB/s)   Power (W)   Release
Intel E5-2680V4    2            -                -              14   2400      35 x 2              76.8 x 2                  120         2016 Q1
NVIDIA P4          1            -                5.5            16   1000      10 [38]             192                       50-75       2016 Q3
Xilinx KU115       1            4.2              -              20   250/500   11.8                38.4                      50-66       2014 Q4
Comparison with CPU and GPU
Limitations of the FPGA (KU115) relative to the P4:
• Older fabrication process (20 nm vs. 16 nm)
• About 20% of the memory bandwidth
• About 1/4 of the frequency
Achievements:
• Superior performance in the latency-sensitive test
• 89% of the throughput with 1/57 of the latency in the throughput-sensitive test
• Performance can be improved further (e.g. on a 16 nm UltraScale+ VU9P)
Conclusion
• A unified framework for different CNN models that is easy to try.
• Supertile EPEs are scaled up into multiple SUs with an interleaved task-dispatching method to break the computation bound.
• The bandwidth limitation is overcome with the dispatching-assembling buffering model and the broadcast cache.
• A configurable FPU supports many types of non-convolution operators.
Summary: Performance 4.2 TOP/s in fix16; latency 50× lower than the GPU; TCO 149% vs. 32% compared with the CPU; application serving 1 billion people every day.