
fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs

Stylianos I. Venieris, Christos-Savvas Bouganis (stylianos.venieris10@imperial.ac.uk). FCCM 2016, Washington DC, 2 May 2016.


  1. fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs • Stylianos I. Venieris, Christos-Savvas Bouganis • stylianos.venieris10@imperial.ac.uk • FCCM 2016, Washington DC, 2 May 2016

  2. Deep Learning and AI

  3. Deep Learning Success Stories - ConvNets • Image Recognition (Microsoft, 2015) • “Deep Face” (Facebook, 2014) • Image Captioning (Microsoft, 2015)

  4. Deep Learning on FPGAs • Memory I/O Optimisation • Hand-tuned implementations • Design Space Exploration

  5. What is missing? • Deep Learning Developers: Caffe, TensorFlow, Theano, Torch • FPGA – ConvNet Functionality – Optimised for High Performance • fpgaConvNet (shown between the two)

  6. Our approach - fpgaConvNet • Supplied by Deep Learning Expert: ConvNet Description, FPGA Target Platform Specifications • fpgaConvNet: Automated Design Space Exploration, ConvNet Hardware Mapping • Output: Bitstream

  7. Convolutional Neural Networks (ConvNets) • convolutional + nonlinearity • pooling • convolutional + nonlinearity • pooling

  8. fpgaConvNet – ConvNet Modelling Framework • Synchronous Data Flow (streaming) – ConvNet as a data-driven graph – Represented as a matrix (analytical power) – Each layer mapped to a tunable set of hardware building blocks
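
The "graph represented as a matrix" idea on this slide can be illustrated with the standard synchronous data flow topology matrix: one row per edge, one column per actor, with production rates positive and consumption rates negative. The sketch below is not taken from the fpgaConvNet toolflow; the actor names, the token rates and the helper functions `topology_matrix` and `repetition_vector` are all assumptions made for illustration.

```python
from fractions import Fraction
from math import lcm

# Hypothetical three-actor chain: conv -> nonlin -> pool.
ACTORS = ["conv", "nonlin", "pool"]

# Each edge: (producer, tokens produced per firing, consumer, tokens consumed per firing).
# The rates are illustrative assumptions (a 2x2 pooling window consumes 4 values per output).
EDGES = [
    ("conv", 1, "nonlin", 1),
    ("nonlin", 1, "pool", 4),
]

def topology_matrix(actors, edges):
    """One row per edge, one column per actor: +production, -consumption."""
    matrix = []
    for src, produced, dst, consumed in edges:
        row = [0] * len(actors)
        row[actors.index(src)] = produced
        row[actors.index(dst)] = -consumed
        matrix.append(row)
    return matrix

def repetition_vector(actors, edges):
    """Smallest positive integer firing counts q with matrix @ q == 0.

    Assumes the edges form a chain listed in order from the first actor.
    """
    rate = {actors[0]: Fraction(1)}
    for src, produced, dst, consumed in edges:
        rate[dst] = rate[src] * produced / consumed
    scale = lcm(*(r.denominator for r in rate.values()))
    return [int(rate[a] * scale) for a in actors]

gamma = topology_matrix(ACTORS, EDGES)
q = repetition_vector(ACTORS, EDGES)
print(gamma)  # [[1, -1, 0], [0, 1, -4]]
print(q)      # [4, 4, 1]
```

Under these assumed rates, the convolution and nonlinearity actors fire four times for every firing of the pooling actor, which is the kind of static schedule an SDF model makes available at compile time.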

  9. fpgaConvNet – Modelling ConvNets with SDF • Figure: a Convolutional Layer with 4 filters shown as ConvNet, Hardware and SDF Graph (labels: Input Data, Memory Interface, Convolutional Layer, Nonlin Layer, Pooling Layer)

  10. fpgaConvNet – Design Space Perspective • Figure: Design Space plot of Throughput against Area, showing the Current Design Point and the limits of FPGA 1 and FPGA 2, alongside the ConvNet, Hardware and SDF Graph views • Bottlenecks: – Limited compute resources – Limited off-chip memory bandwidth – Limited on-chip memory for model parameters • Define a set of actions to move around the design space

  11. Action 1: Coarse-grained Folding • 4 Convolutions / cycle • 1) Exceeding the available compute resources • 2) Not enough off-chip memory bandwidth

  12. Action 1: Coarse-grained Folding • 2 Convolutions / cycle • Action 2: Fine-grained Folding • Figure labels: Compute Resources, Required Bandwidth
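
A rough way to picture the two folding actions is as a trade between parallel hardware and cycles. The sketch below assumes that coarse-grained folding time-multiplexes the convolutions of a layer over fewer parallel units, and that fine-grained folding time-multiplexes the multiplications of one K x K dot product over fewer multipliers. The function names and the cost model are hypothetical and are not the framework's own.

```python
from math import ceil

def coarse_fold(num_filters, conv_units):
    """Cycles per output position when num_filters convolutions share conv_units parallel units."""
    return ceil(num_filters / conv_units)

def fine_fold(window_size, multipliers):
    """Cycles for one K x K dot product computed with a limited number of multipliers."""
    return ceil(window_size * window_size / multipliers)

def layer_cost(num_filters, window_size, conv_units, multipliers):
    """Return (multipliers instantiated, cycles per output position) under this toy model."""
    cycles = coarse_fold(num_filters, conv_units) * fine_fold(window_size, multipliers)
    return conv_units * multipliers, cycles

# The slides' example: a layer with 4 filters folded from 4 to 2 convolutions per cycle.
print(coarse_fold(4, 4))  # 1
print(coarse_fold(4, 2))  # 2
```

Halving the number of convolution units halves the compute resources and the bandwidth needed per cycle, at the price of twice as many cycles per output.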

  13. Action 3: Partitioning through Reconfiguration • Bottlenecks addressed: – Exceeding the available compute resources – Not enough on-chip memory • Layers (Conv Layer 1, Nonlin Layer 1, Pool Layer 1, Conv Layer 2, Nonlin Layer 2, Pool Layer 2) split into Subgraph 1, Subgraph 2, Subgraph 3 • One bitstream per subgraph (Bitstream 1, Bitstream 2, Bitstream 3), loaded through FPGA Reconfiguration • Off-chip Memory holds Input Data, Intermediate Results and Final Results
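
One simple way to picture partitioning is a greedy pass that groups consecutive layers until the next layer would no longer fit on the device. This is only a sketch: the slides do not say how fpgaConvNet chooses its partition points, and the per-layer resource costs, the budget and the helper names below are made up for illustration.

```python
def partition(layers, budget):
    """Greedily group consecutive layers so that each group's total cost fits the budget."""
    subgraphs, current, used = [], [], 0
    for name, cost in layers:
        if cost > budget:
            raise ValueError(f"{name} alone exceeds the budget")
        if current and used + cost > budget:
            subgraphs.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        subgraphs.append(current)
    return subgraphs

def total_time(subgraph_times, reconfig_time):
    """Execution time of all subgraphs plus one bitstream load per subgraph."""
    return sum(subgraph_times) + reconfig_time * len(subgraph_times)

# Hypothetical resource costs in arbitrary units.
LAYERS = [
    ("conv1", 120), ("nonlin1", 10), ("pool1", 20),
    ("conv2", 150), ("nonlin2", 10), ("pool2", 20),
]

print(partition(LAYERS, budget=200))
# [['conv1', 'nonlin1', 'pool1'], ['conv2', 'nonlin2', 'pool2']]
```

The `total_time` helper makes the trade-off visible: every extra subgraph adds a reconfiguration, so partitioning pays off mainly when that overhead is small relative to the compute time of each subgraph.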

  14. fpgaConvNet – SDF Analytical Power • Figure labels: Window Size = K, Pool Size = P, Hardware Stages, Interconnections, Design 1, Design 2 • Synchronous Data Flow – Actions as algebraic operations – Any local action propagates through the network – Static scheduling – Analytical Performance Model – Cast DSE to formal resource-constrained optimisation
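
To make "resource-constrained optimisation" concrete, the sketch below exhaustively searches the number of parallel convolution units per layer and keeps the fastest configuration that fits a DSP budget. It is a toy stand-in rather than the framework's actual optimiser: the two-layer network, the 25 DSPs per unit and the performance model (the slowest pipeline stage sets the initiation interval) are assumptions. The 220-DSP budget is the XC7Z020 figure quoted in the experimental setup.

```python
from itertools import product
from math import ceil

# Hypothetical two-layer network: (number of filters, DSPs per convolution unit).
LAYERS = [(4, 25), (8, 25)]
DSP_BUDGET = 220

def explore(layers, budget):
    """Return (interval, dsps, units) with the smallest interval, ties broken by fewer DSPs."""
    best = None
    options = [range(1, n + 1) for n, _ in layers]
    for units in product(*options):
        dsps = sum(u * cost for u, (_, cost) in zip(units, layers))
        if dsps > budget:
            continue  # infeasible design point
        # In a streaming pipeline the slowest stage sets the initiation interval.
        interval = max(ceil(n / u) for u, (n, _) in zip(units, layers))
        if best is None or (interval, dsps) < (best[0], best[1]):
            best = (interval, dsps, units)
    return best

print(explore(LAYERS, DSP_BUDGET))  # (2, 150, (2, 4))
```

Because the performance and resource figures come from closed-form expressions rather than synthesis runs, each design point costs almost nothing to evaluate, which is what makes a search of this kind practical.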

  15. Evaluation - Experimental Setup • fpgaConvNet – Xilinx Zynq-7000 XC7Z020 SoC with 220 DSPs at 100 MHz – Q8.8 fixed-point precision to match existing work (also supports floating-point) – Current toolflow supports the Vivado HLS toolchain
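
For readers unfamiliar with the notation, Q8.8 is commonly read as a 16-bit signed fixed-point format with 8 integer bits and 8 fractional bits. The sketch below assumes that reading, with saturation on overflow; the slides do not specify the rounding or overflow behaviour used in the hardware, and the helper names are illustrative.

```python
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS                   # 256 steps per unit
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1  # signed 16-bit range

def saturate(raw):
    """Clamp a raw integer to the signed 16-bit range."""
    return max(Q_MIN, min(Q_MAX, raw))

def to_q88(x):
    """Convert a real number to its raw Q8.8 integer, rounding to nearest."""
    return saturate(round(x * SCALE))

def from_q88(raw):
    """Convert a raw Q8.8 integer back to a float."""
    return raw / SCALE

def mul_q88(a, b):
    """Multiply two raw Q8.8 values; the product has 16 fractional bits, so shift back by 8."""
    return saturate((a * b) >> FRAC_BITS)

a, b = to_q88(1.5), to_q88(2.0)
print(a, b)                       # 384 512
print(from_q88(mul_q88(a, b)))    # 3.0
```

The resolution under this reading is 1/256, and representable values run from -128 to just under 128.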

  16. Performance Model Accuracy • Figure: Measured Performance (GOps/s) against Predicted Performance (GOps/s) for CNP, MPCNN, LeNet-5, CFF, Sign Recognition ConvNet and Scene Labelling ConvNet • Error between 1.73% and 11.76%

  17. fpgaConvNet vs. Existing FPGA Work • Figure: Performance Density Comparison (GOps/s/Slice), Existing Work against fpgaConvNet, for Hand-tuned [1] and Memory-optimised [2] designs; chart annotation: 1.62× • [1] C. Farabet et al., “CNP: An FPGA-Based Processor for Convolutional Networks”, in FPL, IEEE, 2009. • [2] M. Peemen et al., “Memory-centric accelerator design for Convolutional Neural Networks”, in ICCD, IEEE, 2013.

  18. fpgaConvNet vs. Existing Embedded GPU Work • Figure: Performance Efficiency Comparison (GOps/s/Watt), Existing Work against fpgaConvNet, for the Hand-tuned Embedded GPU [3] • Hand-tuned Embedded GPU – Tegra K1 at 800 MHz – Memory Bandwidth: 12 GB/s • fpgaConvNet – Zynq-7000 XC7Z020 at 100 MHz – Memory Bandwidth: 4.26 GB/s • [3] L. Cavigelli et al., “Accelerating real-time embedded scene labeling with convolutional networks”, in DAC, ACM/EDAC/IEEE, 2015.

  19. Conclusions • fpgaConvNet • Deep Learning Developers: Caffe, TensorFlow, Theano, Torch
