Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks
Chen Zhang, Zhenman Fang, Peipei Zhou et al.
Presented by Zhuangwei Zhuang
South China University of Technology
October 9, 2016

Content
1 Introduction
2 Motivation
3 Uniformed CNN Representation
4 Caffeine Design
5 Roofline Model
6 Experiment and Result
7 Conclusion

Introduction: CNN Application
In recent years, convolutional neural networks (CNNs) have become popular for their high accuracy in computer vision tasks, including face recognition and image and video processing.
Figure: Face Detection
Figure: Classification

Introduction: Convolutional Neural Networks
Figure: A real-life CNN model
CNN Models
- VGG16
- AlexNet
- GoogLeNet

Introduction: Convolutional Neural Networks
Figure: Inference phase in CNN
Architecture
- Convolutional layers (CONV)
- Pooling layers (POOL)
- Activation layers (ReLU)
- Fully-connected layers (FCN)

Motivation: FPGA-Based Platform
Hardware platforms for CNN accelerators: GPU, FPGA, ASIC.
Advantages of FPGA
- Low power
- High energy efficiency
- Reprogrammability
Constraints of FPGA
- Limited computation resources
- Limited on-chip memory
- Limited external-memory bandwidth

Motivation: Analysis of Real-Life CNN
                         CONV          POOL      ReLU      FCN
Comput. ops (10^7)       3E3 (99.5%)   0.6 (0%)  1.4 (0%)  12.3 (0.4%)
Storage (MB)             113 (19.3%)   0 (0%)    0 (0%)    471.6 (80.6%)
Time% in pure sw         96.3%         0.0%      0.0%      3.7%
Time% after CONV acc.    48.7%         0.0%      0.0%      51.2%
Table: Analysis of VGG16 model

Motivation: Analysis of Real-Life CNN
- CONV layers are computation-intensive, while FCN layers are memory-intensive
- FCN layers become the new bottleneck once CONV layers are accelerated
- However, most prior FPGA acceleration studies on CNNs focus mainly on the CONV layers

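The shift of the bottleneck to the FCN layers can be reproduced with simple Amdahl-style arithmetic. The sketch below starts from the pure-software time shares in the VGG16 table (96.3% CONV, 3.7% FCN) and shrinks only the CONV time. The function name and the speedup of 27.4 are illustrative assumptions, back-solved so the result lands near the reported 48.7% / 51.2% split; they are not figures from the paper.

```python
def time_shares_after_speedup(conv_share, fcn_share, conv_speedup):
    """Return (conv, fcn) shares of the new runtime when only CONV is accelerated."""
    conv_time = conv_share / conv_speedup   # CONV time shrinks
    total = conv_time + fcn_share           # FCN time is unchanged
    return conv_time / total, fcn_share / total


# Pure-software shares from the VGG16 table; the speedup is an assumed value.
conv, fcn = time_shares_after_speedup(96.3, 3.7, conv_speedup=27.4)
print(f"CONV {conv:.1%}, FCN {fcn:.1%}")
```

Under this assumption the FCN layers, which took under 4% of the original runtime, account for about half of the accelerated runtime.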
Motivation: Problem
- What is the right representation for a uniformed acceleration of the different layers of a CNN?
- How can an efficient and reusable FPGA engine for CNNs be designed and implemented?

Uniformed CNN Representation: Matrix-Multiplication
Figure: Matrix-multiplication of FCN

Uniformed CNN Representation: Input-Major Mapping
Figure: Input-major mapping with Ker = 1

Uniformed CNN Representation: Input-Major Mapping
Figure: Input-major mapping with Ker = 2

Uniformed CNN Representation: Weight-Major Mapping
Figure: Weight-major mapping with Ker = 1

Uniformed CNN Representation: Weight-Major Mapping
Figure: Weight-major mapping with Ker = 2

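As a rough check of the idea behind both mappings, the pure-Python sketch below expresses an FCN layer as a 1-D convolution whose kernel size and stride both equal ker, and compares it against the plain matrix multiplication. The data layouts and all function names are assumptions inferred from the uniformed representation parameters (N_fcn / ker feature maps, kernel size = stride = ker); they are not taken from the Caffeine implementation.

```python
import random


def fcn_reference(x, w):
    """Plain FCN: out[b][m] = sum_n x[b][n] * w[m][n]."""
    return [[sum(xb[n] * wm[n] for n in range(len(xb))) for wm in w] for xb in x]


def conv1d_strided(fmaps, kernels, ker):
    """1-D convolution with kernel size = stride = ker.

    fmaps[i] is input feature map i (a flat list); kernels[o][i] is the
    ker-element kernel linking input map i to output map o.
    """
    out_len = len(fmaps[0]) // ker
    return [[sum(fmaps[i][p * ker + k] * kern[i][k]
                 for i in range(len(fmaps)) for k in range(ker))
             for p in range(out_len)]
            for kern in kernels]


def split(vec, ker):
    """Cut a length-N vector into N/ker chunks of ker elements."""
    return [vec[i:i + ker] for i in range(0, len(vec), ker)]


def input_major(x, w, ker):
    # Feature maps hold the batched inputs; kernels hold the weights.
    chunks = [split(xb, ker) for xb in x]                    # [batch][N/ker][ker]
    fmaps = [[v for cb in chunks for v in cb[i]]             # each map: batch * ker
             for i in range(len(chunks[0]))]
    kernels = [split(wm, ker) for wm in w]                   # [M][N/ker][ker]
    out = conv1d_strided(fmaps, kernels, ker)                # [M][batch]
    return [[out[m][b] for m in range(len(w))] for b in range(len(x))]


def weight_major(x, w, ker):
    # Feature maps hold the weights; kernels hold the batched inputs.
    chunks = [split(wm, ker) for wm in w]                    # [M][N/ker][ker]
    fmaps = [[v for cm in chunks for v in cm[i]]             # each map: M * ker
             for i in range(len(chunks[0]))]
    kernels = [split(xb, ker) for xb in x]                   # [batch][N/ker][ker]
    return conv1d_strided(fmaps, kernels, ker)               # [batch][M]


# Small integer example so the comparison is exact.
random.seed(0)
batch, n, m, ker = 3, 8, 5, 2
x = [[random.randint(-3, 3) for _ in range(n)] for _ in range(batch)]
w = [[random.randint(-3, 3) for _ in range(n)] for _ in range(m)]
ref = fcn_reference(x, w)
assert input_major(x, w, ker) == ref
assert weight_major(x, w, ker) == ref
```

The two mappings differ only in which operand plays the role of the feature maps and which plays the role of the kernels.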
Uniformed CNN Representation: Uniformed Representation
                 Uniformed    Conv                       FCN-Input      FCN-Weight
Input FM#        N            N_conv                     N_fcn / ker    N_fcn / ker
Input FM Size    R_i · C_i    R_conv^in · C_conv^in      batch · ker    M_fcn · ker
Output FM#       M            M_conv                     M_fcn          batch
Output FM Size   R_o · C_o    R_conv^out · C_conv^out    batch          M_fcn
Kernel Size      K_1 · K_2    K_1 · K_2                  ker            ker
Stride           S_1 · S_2    S_1 · S_2                  ker            ker
Table: Uniformed representation parameters for CONV, FCN input-major mapping and FCN weight-major mapping

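The FCN columns of the table can be written down as a small helper. This is a minimal sketch: the function name, the dictionary keys and the example batch size and ker value are hypothetical, and the layer shape used in the example (25088 inputs, 4096 outputs) is that of the first fully-connected layer of VGG16.

```python
def fcn_uniformed_params(n_fcn, m_fcn, batch, ker, mapping):
    """Return the uniformed-representation parameters of an FCN layer.

    mapping is "input" (input-major) or "weight" (weight-major).
    """
    if n_fcn % ker != 0:
        raise ValueError("ker must divide the number of FCN inputs")
    if mapping == "input":
        in_size, out_num, out_size = batch * ker, m_fcn, batch
    elif mapping == "weight":
        in_size, out_num, out_size = m_fcn * ker, batch, m_fcn
    else:
        raise ValueError("mapping must be 'input' or 'weight'")
    return {
        "input_fm_num": n_fcn // ker,
        "input_fm_size": in_size,
        "output_fm_num": out_num,
        "output_fm_size": out_size,
        "kernel_size": ker,
        "stride": ker,
    }


# Example: a 25088 -> 4096 FCN layer, with assumed batch = 32 and ker = 2.
for mapping in ("input", "weight"):
    print(mapping, fcn_uniformed_params(25088, 4096, batch=32, ker=2, mapping=mapping))
```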
Caffeine Design: System Overview
Figure: Caffe-Caffeine integration

Caffeine Design: Architecture
Figure: Scalable accelerator architecture design

Caffeine Design: Bandwidth Optimization
Figure: Effective FPGA DRAM bandwidth
- Effective FPGA DRAM bandwidth rises as the burst length increases and eventually flattens out
- A limited burst length greatly degrades the bandwidth actually achieved

Caffeine Design: Bandwidth Optimization
Figure: A piece of data tile
Figure: A logic 3D data layout

Caffeine Design: Bandwidth Optimization
Figure: Optimization of data layout in DRAM space
- Move the data of an entire tile into a contiguous space to improve burst length and bit-length
- Interleave the data for different BRAM banks to reduce bank read/write conflicts

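One way to picture the two layout steps is as a reordering of a logical 3-D array into a flat DRAM stream. The sketch below is a simplified illustration, not the layout used by Caffeine: it assumes that each feature map (first index) feeds its own BRAM bank, and the function name and tile parameters tn, tr, tc are hypothetical.

```python
def tile_major_layout(data, tn, tr, tc):
    """Flatten data[n][r][c] tile by tile, interleaving feature maps inside a tile.

    Each tile occupies one contiguous run of the output, so it can be fetched
    in a single long burst. Inside a tile the feature-map index varies fastest,
    so neighbouring words belong to different (assumed) BRAM banks.
    """
    n_total, r_total, c_total = len(data), len(data[0]), len(data[0][0])
    flat = []
    for n0 in range(0, n_total, tn):
        for r0 in range(0, r_total, tr):
            for c0 in range(0, c_total, tc):
                # One tile: emitted as a contiguous block.
                for r in range(r0, min(r0 + tr, r_total)):
                    for c in range(c0, min(c0 + tc, c_total)):
                        for n in range(n0, min(n0 + tn, n_total)):
                            flat.append(data[n][r][c])
    return flat


# Tag every element with its logical coordinates to make the order visible.
data = [[[(n, r, c) for c in range(4)] for r in range(2)] for n in range(2)]
flat = tile_major_layout(data, tn=2, tr=2, tc=2)
print(flat[:4])
```

With 2 x 2 x 2 tiles, the first eight entries of `flat` all come from columns 0 and 1, and consecutive entries alternate between feature maps 0 and 1.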
Roofline Model: Original Model
Computation-to-communication (CTC) ratio:
$$\text{CTC ratio} = \frac{\text{total number of operations}}{\text{total amount of DRAM access}}$$
$$\text{DRAM Access} = \alpha_{in} \cdot \beta_{in} + \alpha_{weight} \cdot \beta_{weight} + \alpha_{out} \cdot \beta_{out} \tag{1}$$
- α: number of data transfers for the input/weight/output data
- β: size of the input/weight/output data tile

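Eq. (1) and the CTC ratio translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the transfer counts, tile sizes and operation count in the example are made-up values, not measurements.

```python
def dram_access(alpha, beta):
    """Eq. (1): sum of (number of transfers) x (tile size) over the three data types."""
    return sum(alpha[k] * beta[k] for k in ("in", "weight", "out"))


def ctc_ratio(total_ops, alpha, beta):
    """Computation-to-communication ratio: operations per unit of DRAM traffic."""
    return total_ops / dram_access(alpha, beta)


# Made-up example values.
alpha = {"in": 4, "weight": 4, "out": 1}            # transfers per data type
beta = {"in": 1000, "weight": 500, "out": 2000}     # tile sizes
print(dram_access(alpha, beta))                     # 4*1000 + 4*500 + 1*2000 = 8000
print(ctc_ratio(1_600_000, alpha, beta))            # 1_600_000 / 8000 = 200.0
```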
Roofline Model: Revised Model
Figure: Effective FPGA DRAM bandwidth
The original model ignores the fact that tiles with different data volumes have different burst lengths and therefore different effective bandwidths.

Roofline Model: Revised Model
$$\text{DRAM Access} = \gamma_{in} \cdot \alpha_{in} \cdot \beta_{in} + \gamma_{weight} \cdot \alpha_{weight} \cdot \beta_{weight} + \gamma_{out} \cdot \alpha_{out} \cdot \beta_{out} \tag{2}$$
$$\gamma = \frac{\text{max bandwidth}}{f(\beta)} \tag{3}$$
f(β) is the effective bandwidth as a function of burst length, evaluated for a tile of size β.

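Eqs. (2) and (3) can be sketched as follows. The bandwidth curve f is measured on the board in the original work; here `saturating_bandwidth` is a hypothetical stand-in that merely rises with tile size and then flattens. All names and numbers below are assumptions for illustration.

```python
def saturating_bandwidth(max_bw, half_size):
    """Hypothetical stand-in for the measured curve f: grows with beta, then flattens."""
    return lambda beta: max_bw * beta / (beta + half_size)


def revised_dram_access(alpha, beta, f, max_bw):
    """Eq. (2): each term of the original model is scaled by its own gamma."""
    total = 0.0
    for k in ("in", "weight", "out"):
        gamma = max_bw / f(beta[k])          # Eq. (3): penalty for short bursts
        total += gamma * alpha[k] * beta[k]
    return total


# Made-up example values.
alpha = {"in": 4, "weight": 4, "out": 1}
beta = {"in": 1000, "weight": 500, "out": 2000}
f = saturating_bandwidth(max_bw=10.0, half_size=500)
print(revised_dram_access(alpha, beta, f, max_bw=10.0))
```

With these values the smallest tile (the weights) gets the largest penalty, γ = 2, so the revised DRAM access exceeds the 8000 units the original model would report for the same tiles.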
Roofline Model: Revised Model
Figure: Comparison of original model, revised model and on-board test results with input-major mapping
Figure: Comparison of original model, revised model and on-board test results with weight-major mapping
- The revised model is more accurate than the original model
- Weight-major mapping outperforms input-major mapping at small batch sizes, which are required for real-time inference

Experiment and Result: Resource Utilization
               DSP          BRAM         LUT         FF          Freq.
VC709 fixed    2833 (78%)   1248 (42%)   3E5 (81%)   3E5 (36%)   150 MHz
KU fixed       1058 (38%)   782 (36%)    1E5 (31%)   8E4 (11%)   200 MHz
KU float       1314 (47%)   798 (36%)    2E5 (46%)   2E5 (26%)   200 MHz
Table: FPGA resource utilization of Caffeine