Accelerating Convolutional Neural Networks on FPGA SoC
Francesco Restuccia, Ph.D. Fellow, 04/12/2020


  1. Accelerating Convolutional Neural Networks on FPGA SoC. Francesco Restuccia, Ph.D. Fellow, 04/12/2020. Overview:
     • Background
     • Xilinx CHaiDNN framework: overview, workflow, hardware architecture
     • Xilinx DNNDK framework: overview, workflow, the Deep Learning Processor Unit (DPU)
     • Xilinx Vitis AI framework: overview, Vitis AI for edge applications

  2. Background. FPGA SoC Architecture: figure highlighting the FPGA Programmable Logic.

  3. FPGA SoC Architecture - 2: figure highlighting the ARM-based Processing System. FPGA SoC Architecture - 3: figure highlighting the main components, namely the AXI-based interconnect, the FPGA-PS interfaces, the DRAM controller, the hardware accelerators, and the ARM processors.

  4. What's a hardware accelerator?
     • A piece of hardware developed to compute a specific functionality
     • Placed in the Programmable Logic (FPGA)
     • Master/slave interfaces for interconnection (AXI standard)
     • Can be described in different ways
     What's a hardware accelerator? - 2. Hardware accelerator (HWA) development methods:
     • Hardware description languages (VHDL, Verilog, SystemVerilog, ...)
     • Hardware construction languages (Chisel)
     • High-level synthesis (HLS); see the sketch after this list
     • Buying one as an Intellectual Property block (IP block)
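As a concrete illustration of the HLS approach, the sketch below shows what a minimal Vivado HLS accelerator can look like: a generic vector-add kernel (not taken from CHaiDNN or DNNDK) that reads and writes data through an AXI master port and is controlled through an AXI-Lite slave port.

```cpp
// Minimal Vivado HLS sketch of a hardware accelerator: a vector-add kernel.
// The pragmas map the pointer arguments to an AXI master interface (for data
// in DRAM) and the scalar argument plus block control to an AXI-Lite slave.
extern "C" void vadd(const int *a, const int *b, int *out, int n) {
#pragma HLS INTERFACE m_axi     port=a   offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=b   offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=n      bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];  // the "specific functionality" computed in the PL
    }
}
```

After synthesis, the tools package this as an IP block that the PS configures and starts over AXI-Lite: exactly the master/slave interconnection pattern described above.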

  5. FPGA SoCs are used to run inference!
     • FPGA SoCs and FPGA MPSoCs are mainly used for embedded applications
     • FPGA MPSoCs and SoCs have good inference performance, in some cases comparable with GPU SoCs (fps), with better power efficiency
     • The training of neural networks is generally done on big GPUs
     • In the following slides we assume a pre-trained neural network as the starting point (see the sketch after this list):
     • The network model (Caffe .prototxt)
     • The trained weights (Caffe .caffemodel)
     Considered platforms:
     • Xilinx Zynq UltraScale+ MPSoC: Processing System with 4 general-purpose Cortex-A53 cores and 2 real-time Cortex-R5 cores; large Programmable Logic (Zynq UltraScale+ logic). Credits: www.xilinx.com
     • Xilinx Zynq Z-7000 (PYNQ Z-7020): Processing System with 2 general-purpose Cortex-A9 cores; Programmable Logic (Artix 7-series FPGA). Credits: www.digilent.com
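To make the two starting artifacts concrete, here is a minimal sketch of how they are consumed with the BVLC Caffe C++ API (the file names are placeholders):

```cpp
#include <caffe/caffe.hpp>

int main() {
    // The .prototxt describes the network architecture
    // (layers, shapes, connectivity).
    caffe::Net<float> net("deploy.prototxt", caffe::TEST);
    // The .caffemodel holds the trained weights, copied
    // into the layers with matching names.
    net.CopyTrainedLayersFrom("weights.caffemodel");
    return 0;
}
```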

  6. Xilinx CHaiDNN. CHaiDNN: Overview (credits: xilinx.com)
     • Open-source deep neural network library for the acceleration of deep neural networks
     • Designed for Xilinx UltraScale+ MPSoCs
     • Designed to accelerate convolutional neural networks on images
     • High-level-synthesis-based accelerators
     • Compatible with multiple popular neural networks for classification (AlexNet, GoogLeNet, ResNet, etc.)

  7. CHaiDNN: Install. Official GitHub: https://github.com/Xilinx/CHaiDNN
     • Use the prebuilt system, or build the system yourself:
     • Download the network model
     • Build the hardware accelerator
     • Build the software
     • Prepare the SD card image
     • Flash the image onto the SD card
     How does CHaiDNN work?
     • Given a Caffe model and weights file (.prototxt and .caffemodel), the model is parsed at runtime by the CHaiDNN application
     • The application initializes the hardware accelerators and the data structures
     • A scheduler schedules the operations between the processors and the hardware accelerators
     • Calls for acceleration are issued through the CHaiDNN API and executed on the hardware accelerators

  8. Some important considerations
     • FPGA SoCs have limited on-chip memory resources
     • Neural network models generally use 16-bit or 32-bit floats: too expensive for FPGA SoCs!
     • Moving data from/to off-chip memory (DRAM) is very expensive: it increases power consumption and causes a performance drop
     • The gain in execution time (acceleration) would be lost in moving the data (weights) from/to the off-chip memory
     Quantization process
     • To achieve good performance, the neural network model must be quantized to run on an FPGA SoC!
     • CHaiDNN provides a tool for quantization, XportDNN
     • Better performance, at the cost of an accuracy loss ("minimal" according to the CHaiDNN developers)
     • CHaiDNN supports 6-bit and 8-bit quantized integer data (see the sketch after this list)
     • CHaiDNN provides quantized models for popular networks (ModelZoo)
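The sketch below illustrates what 8-bit fixed-point quantization boils down to, assuming a simple power-of-two scaling scheme; XportDNN's actual algorithm (e.g., how it picks the scale per layer) may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Map a float to int8 as q = round(x * 2^frac_bits), saturated to [-128, 127].
int8_t quantize_int8(float x, int frac_bits) {
    float scaled = std::round(x * std::exp2f(static_cast<float>(frac_bits)));
    return static_cast<int8_t>(std::min(std::max(scaled, -128.0f), 127.0f));
}

// Recover an approximation of the original value at inference time.
float dequantize_int8(int8_t q, int frac_bits) {
    return static_cast<float>(q) / std::exp2f(static_cast<float>(frac_bits));
}

int main() {
    // With 5 fractional bits the resolution is 1/32:
    // 0.7f -> 22 -> 0.6875f. The small gap is the accuracy cost.
    int8_t q   = quantize_int8(0.7f, 5);
    float back = dequantize_int8(q, 5);
    return (q == 22 && back == 0.6875f) ? 0 : 1;
}
```

Besides shrinking the arithmetic units, 8-bit weights also quarter the traffic to off-chip DRAM compared to 32-bit floats, which directly addresses the data-movement cost discussed above.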

  9. Playing with the ModelZoo: ready-to-run popular neural network models (quantized): AlexNet (credits: wikimedia.org), GoogLeNet (credits: medium.com), VGG (credits: researchgate.com), ResNet50.
     FPGA SoC vs GPU SoC: benchmark figures (credits: https://github.com/Xilinx/CHaiDNN)

  10. Why FPGA SoC for NN acceleration? GPU SoC vs Xilinx Zynq UltraScale+ FPGA SoC (credits: xilinx.com)
     • GPU SoC: easy to program (CUDA), but time predictability is poor and power efficiency is lower
     • FPGA SoC: HWA development is not trivial, but execution on the FPGA is highly predictable and power consumption is lower
     • Highly suitable for critical embedded applications!
     Accelerate your own network (just for convolutional neural networks operating on images):
     • Start from the Caffe model (float)
     • Quantize the neural network (XportDNN)
     • Build the CHaiDNN code
     • Cross-compile the code
     • Put the model and the executable on the SD card
     • Execute the network
     • Note that the hardware accelerator does not change with the network!

  11. The CHaiDNN hardware architecture: the CHaiDNN block design from Xilinx Vivado 2018.2, centered on the Zynq UltraScale+ PS.

  12. Zynq Processing System
     • Runs the CHaiDNN application
     • Schedules the operations
     • Issues the requests for hardware acceleration
     • The Processing System itself executes the following layers (see the sketch after this list):
     • L2 normalization
     • Permute
     • Inner product
     • SoftMax
     • Fully connected layers
     Hardware accelerators: the accelerator blocks highlighted in the block design.
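SoftMax is representative of the layers that stay on the ARM cores. A generic implementation of the kind of computation the PS performs (not CHaiDNN's actual code) looks like this:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable SoftMax: subtract the max logit before exponentiating,
// then normalize so the outputs form a probability distribution.
std::vector<float> softmax(const std::vector<float>& logits) {
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    return probs;
}
```

Such layers are computationally cheap compared to convolutions, so running them on the PS costs little while keeping the accelerators specialized for the heavy, regular layers.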

  13. PoolTop accelerator
     • Block synthesized through high-level synthesis
     • Custom adapter for the AXI Interconnect
     • Operations: pooling (max, average)
     Convolution accelerator
     • Block synthesized through high-level synthesis
     • Custom adapter for the AXI Interconnect
     • Operations: convolution, normalization, scale and bias, element-wise addition, ReLU (see the sketch after this list)
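For reference, the core operation the convolution accelerator implements can be written as the scalar loop nest below (single channel, stride 1, with a fused ReLU); the real HLS block tiles and parallelizes this computation, so the sketch only illustrates the math:

```cpp
#include <vector>

// Direct "valid" 2D convolution of an H x W input with a K x K kernel,
// followed by ReLU. Row-major layout for both buffers.
std::vector<float> conv2d_relu(const std::vector<float>& in, int H, int W,
                               const std::vector<float>& k, int K) {
    int outH = H - K + 1, outW = W - K + 1;
    std::vector<float> out(outH * outW, 0.0f);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    acc += in[(y + ky) * W + (x + kx)] * k[ky * K + kx];
            out[y * outW + x] = acc > 0.0f ? acc : 0.0f;  // fused ReLU
        }
    return out;
}
```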

  14. Deconvolution accelerator
     • Block synthesized through high-level synthesis
     • Custom adapter for the AXI Interconnect
     • Operation: deconvolution
     AXI Interconnects: the control AXI Interconnect.

  15. AXI Interconnects: the data AXI Interconnect.
     Final remarks on CHaiDNN
     • The Processing System and the hardware accelerators cooperate to execute the network
     • The hardware accelerators are not "custom-made" for the network
     • The Processing System schedules the operations and issues the requests for hardware acceleration
     • The hardware accelerators are autonomous in reading/writing data to/from memory
     • In some cases, the inference performance is comparable with GPU SoCs

  16. Xilinx DNNDK. DNNDK: Overview
     • The Deep Neural Network Development Kit (DNNDK) is a full-stack deep learning SDK for the Deep Learning Processor Unit (DPU)
     • Designed for Xilinx UltraScale+ MPSoC and Zynq-7000 platforms
     • Designed to accelerate convolutional neural networks on images
     • Compatible with multiple popular neural networks for classification (AlexNet, GoogLeNet, ResNet, etc.)
     Reference: DNNDK User Guide, Xilinx, UG1327 (v1.5)

  17. The Deep Learning Processor Unit
     • The Deep Learning Processor Unit (DPU) is the hardware accelerator for DNNDK
     • Placed in the Programmable Logic
     • Custom size, to fit Xilinx UltraScale+ MPSoC and Zynq-7000 platforms
     • Designed to accelerate popular convolutional neural networks (VGG, GoogLeNet, ResNet, YOLO, etc.)
     DPU hardware architecture
     • Computing array of hybrid processing elements
     • Local on-chip support memory
     • Instruction fetch unit to fetch the instructions from memory
     • On-board scheduler
     • Autonomous high-speed data access
     Reference: DNNDK User Guide, Xilinx, UG1327 (v1.5)

  18. What's the difference with CHaiDNN?
     DNNDK:
     • The Deep Learning Processor Unit defines its own instruction set
     • Instructions are generated by the DNNDK tools and fetched from the DRAM memory
     • The scheduler for the hardware operations is internal to the DPU
     • Custom size, to fit both Xilinx UltraScale+ MPSoC and Zynq-7000 platforms
     CHaiDNN:
     • The hardware accelerator is a collection of AXI MM devices (no instruction set)
     • The hardware accelerators are managed directly by the PS
     • The scheduler for the hardware operations runs in the Processing System
     • Fixed size, not configurable
     Customize the DPU:
     • Change the number of DPU cores on the board
     • Parallelism of the convolutional unit
     • Total amount of on-chip memory
     • ReLU type
     • Hardware SoftMax implementation
     • ... changing these features has an impact on performance and resource consumption!

  19. DNNDK workflow
     • Start from the Caffe model (float)
     • Compress the neural network model: quantization & pruning (DECENT)
     • Compile the neural network model and build the DPU executable (DNNC)
     • Build the software application against the DNNDK API
     • Cross-compile and link the hybrid DPU application
     • Run the hybrid DPU executable (see the sketch after this list)
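On the software side, the hybrid application drives the DPU through the DNNDK (N2Cube) runtime API. The sketch below follows the typical call sequence from the DNNDK examples in UG1327; the kernel name "resnet50" and the node names are placeholders that depend on what DNNC generated for your network, and error checking is omitted.

```cpp
#include <dnndk/dnndk.h>

int main() {
    dpuOpen();                                      // attach to the DPU driver
    DPUKernel *kernel = dpuLoadKernel("resnet50");  // kernel produced by DNNC (placeholder name)
    DPUTask   *task   = dpuCreateTask(kernel, T_MODE_NORMAL);

    // Feed the preprocessed input, run inference on the DPU, read the result.
    // Node names and buffer sizes are network-dependent, e.g.:
    //   dpuSetInputTensorInHWCFP32(task, "conv1", input, inputSize);
    dpuRunTask(task);
    //   dpuGetOutputTensorInHWCFP32(task, "fc1000", output, outputSize);

    dpuDestroyTask(task);
    dpuDestroyKernel(kernel);
    dpuClose();
    return 0;
}
```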
