  1. FPGA implementation of Quantized (Deep) Neural Network (QNN) Accelerators
     Biruk Seyoum, PhD student, Scuola Superiore Sant'Anna, Pisa
     Contents:
     • Introduction
     • Pipelined FPGA DNN accelerators
     • Roof-line model and optimizing FPGA DNN accelerators
     • Quantized Neural Networks (QNNs)
     • Introduction to FINN
     • FINN demo

  2. Some DNN applications
     • Computer vision
     • Speech recognition
     • Bioinformatics
     • ...
     Some DNN properties
     • Enormous storage and compute requirements
       - VGG-16 (528 MB, 19.6 GFLOPs)
       - ResNet-152 (232 MB, 11.3 GFLOPs)
       - GoogLeNet (51 MB, 1.5 GFLOPs)
     • Computation dominated by floating-point operations

  3. FPGA DNN accelerator deployment
     Cloud
     • Hosts larger DNNs
     • Low inference latency
     • Suitable for applications insensitive to user-interaction latency (machine translation, NN business applications, forecasting, etc.)
     • Energy and transmission-latency cost of offloading data to the cloud
     Embedded platforms
     • Suitable for real-time safety-critical applications (autonomous driving, speech recognition, etc.)
     • Ideal for smaller DNNs

  4. The Zynq-7010 SoC (figure)
     FPGA DNN acceleration architecture
     • Single accelerator architecture
       - A single compute engine is shared by all layers
       - Usually only the most compute-intensive layer is offloaded to the FPGA engine

  5. FPGA DNN acceleration architecture
     • Pipelined architecture (the focus of this presentation)
       - Each DNN layer maps to its own hardware implementation on the FPGA
       - Adjacent layers are connected through buffers

  6. Pipelined DNN FPGA accelerators (block diagram: input images and parameters read from DDR, per-layer engines performing MAC, compare, divide, etc., outputs written back)
     For N DNN layers and a batch size of B (timing diagram)

  7. Pipelined DNN FPGA accelerators
     For N layers and a given inference time per layer per frame (timing diagram; see the sketch below)
     Pipelined DNN FPGA accelerator performance
     • Bottleneck: peak FPGA compute capacity
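A rough timing sketch, assuming layer i takes t_i seconds per frame and the slowest stage dominates once the pipeline is full: for N layers and a batch of B frames, total inference time ≈ t_1 + t_2 + ... + t_N + (B - 1) * max_i(t_i), so the steady-state throughput approaches 1 / max_i(t_i) frames per second.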

  8. Pipelined DNN FPGA accelerator performance
     • Bottleneck: off-chip memory bandwidth
     Performance of the accelerator depends on both
     • Peak FPGA compute capacity
     • Off-chip memory (DRAM) bandwidth

  9. Pipelined DNN FPGA accelerator performance
     Performance of the accelerator depends on both
     • Peak FPGA compute capacity
     • Off-chip memory (DRAM) bandwidth
     The peak FPGA compute capacity is obtained from the hardware specification or from benchmarks (e.g., FLOPs/s).
     How do we relate the accelerator performance to the FPGA compute capacity and the off-chip DRAM traffic?

  10. Roof-line model
     • A state-of-the-art performance model for multicore architectures
     • Can easily be adapted to FPGA accelerators whose inputs are stored in DRAM
     • Correlates compute performance, memory performance, and operational intensity in a 2D graph
     • Operational intensity (arithmetic intensity) is the number of operations performed on the FPGA per byte of DRAM traffic
     • Measured in ops/byte (e.g., FLOPs/byte)

  11. Roof-line model
     • Operational intensity (arithmetic intensity) is the number of operations performed on the FPGA per byte of DRAM traffic
     • Measured in ops/byte (e.g., FLOPs/byte); high vs. low operational intensity (illustration)
     • Attainable GFLOP/s = min { peak compute capacity, memory bandwidth × operational intensity }
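A minimal numeric sketch of the roof-line formula above, in Python; the peak compute capacity and memory bandwidth used here are illustrative assumptions, not figures from the slides:

```python
# Roof-line model: attainable performance is the lower of the compute roof and
# the memory roof (bandwidth x operational intensity).

def attainable_gflops(peak_gflops, mem_bw_gb_s, op_intensity_flops_per_byte):
    """Attainable GFLOP/s predicted by the roof-line model."""
    return min(peak_gflops, mem_bw_gb_s * op_intensity_flops_per_byte)

# Hypothetical FPGA: 200 GFLOP/s peak compute, 4 GB/s DRAM bandwidth.
for oi in (1, 10, 50, 100):
    print(f"OI = {oi:>3} FLOPs/byte -> {attainable_gflops(200, 4, oi):6.1f} GFLOP/s attainable")
```

With these assumed numbers the accelerator stays memory bound until the operational intensity reaches 50 FLOPs/byte, after which the compute roof takes over.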

  12. Roof-line model (roof-line graph)

  13. How to increase the performance of an FPGA accelerator
     • Increase the memory bandwidth (when memory bound)
     • Migrate to a more powerful FPGA (when compute bound)

  14. How to increase the performance of an FPGA accelerator
     • Increase the operational intensity
     Increasing operational intensity
     • Operational intensity can be increased by
       - Maximizing the number of operations performed on data fetched from DRAM before it is written back (e.g., implementing the DNN layers as a pipeline)
       - Reducing the precision of computation so that more data can be brought in simultaneously
       - Using lower-precision computation in general (e.g., 16-bit floating point, 8-bit integer, etc.; a numeric sketch follows below)
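A small numeric sketch of why lower precision raises the operational intensity: the same operations are performed per output, but fewer bytes cross the DRAM interface. The layer dimensions below are assumed for illustration:

```python
# Operational intensity vs. operand precision (illustrative layer, not from the slides).
macs = 100e6            # MAC operations for one layer, one frame (assumed)
operands = 100e6        # weights + activations fetched from DRAM (assumed count)

for bits in (32, 16, 8, 1):
    bytes_moved = operands * bits / 8
    oi = 2 * macs / bytes_moved   # 1 MAC = 2 ops (multiply + add)
    print(f"{bits:>2}-bit operands -> {oi:6.1f} ops/byte")
```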

  15. Quantized Neural Networks (QNNs)
     • Deep neural networks are typically over-parametrized
     • Remedies to this problem include
       - Pruning
       - Weight sharing
       - Weight quantization (our topic)
     • Weight quantization involves representing weights and parameters as low-precision integers, e.g., 8-bit, 4-bit, and in the extreme case binary
     Benefits of weight quantization
     • Reduced weight memory footprint
       - Reduced DRAM footprint of the weights (e.g., AlexNet from 64 MB to ~2 MB with 1-bit weights)
       - The weights can even fit inside the on-chip memory of the FPGA (weights stored inside the accelerator; a footprint calculation follows below)
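A quick back-of-the-envelope check of the footprint reduction, assuming a hypothetical network with 61 million weights (the AlexNet figures on the slide are quoted from the original deck):

```python
# Weight memory footprint as a function of weight precision.
def weight_footprint_mb(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 2**20  # bytes -> MiB

params = 61e6  # assumed parameter count for an AlexNet-like model
for bits in (32, 8, 1):
    print(f"{bits:>2}-bit weights: {weight_footprint_mb(params, bits):7.1f} MB")
```

At 1-bit weights the footprint drops by 32x compared to 32-bit floating point, which is what makes storing all weights in on-chip memory feasible.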

  16. Benefits of weight quantization
     • Faster computation
       - Reduced-precision integer arithmetic is much faster than floating-point computation (more computations per clock cycle)
       - Computation (e.g., MAC) on reduced-precision integers is more FPGA friendly: floating-point computation uses DSPs (scarce), reduced-precision computation uses LUTs (abundant)
       - The compute engines on the FPGA also consume fewer resources
     The FINN framework

  17. Acceleration flow in FINN
     Mapping flow in FINN (diagram)

  18. Quantization-aware training
     • Brevitas: a PyTorch library for quantization-aware training (https://github.com/Xilinx/brevitas; a minimal example follows below)
     ONNX representation
     • FINN uses an ONNX-based intermediate representation
     • Brevitas provides a FINN-ONNX export
     • Quantization information is exported as annotations
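A minimal sketch of what a Brevitas model definition looks like, assuming the brevitas.nn layer wrappers (QuantConv2d, QuantLinear, QuantReLU) with their weight_bit_width / bit_width arguments; the layer shapes and bit widths are illustrative, not the network used in the demo:

```python
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantLinear, QuantReLU

class TinyQNN(nn.Module):
    """A toy quantized CNN for 32x32 RGB inputs (e.g., CIFAR-10 sized images)."""
    def __init__(self, num_classes=10, w_bits=2, a_bits=2):
        super().__init__()
        self.features = nn.Sequential(
            QuantConv2d(3, 64, kernel_size=3, weight_bit_width=w_bits),  # quantized weights
            nn.BatchNorm2d(64),
            QuantReLU(bit_width=a_bits),                                 # quantized activations
            nn.MaxPool2d(2),
        )
        # 32x32 input -> 30x30 after the 3x3 conv -> 15x15 after pooling
        self.classifier = QuantLinear(64 * 15 * 15, num_classes, bias=False,
                                      weight_bit_width=w_bits)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```

Training proceeds exactly as with a plain PyTorch model; after training, Brevitas can export the network in the FINN-ONNX dialect with the quantization information attached as annotations (the export API differs between Brevitas versions).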

  19. The FINN compiler
     • Generates the Vivado HLS based mapping of BNN layers to the FPGA (see the build sketch below)
     • Streaming architecture
       - Each layer is mapped to a dedicated compute engine
       - Compute engines are pipelined and communicate via on-chip data streams
     The FINN flow (diagram)
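For reference, a sketch of how a Brevitas-exported model is handed to the FINN compiler using the dataflow builder of recent FINN releases; the function and field names below follow that newer finn.builder API and may differ from the Vivado HLS based flow shown in this demo, and the file name and build parameters are placeholders:

```python
from finn.builder.build_dataflow import build_dataflow_cfg
from finn.builder.build_dataflow_config import DataflowBuildConfig, DataflowOutputType

cfg = DataflowBuildConfig(
    output_dir="build_cnv_w1a1",      # where reports and generated hardware go
    synth_clk_period_ns=10.0,         # 100 MHz target clock
    target_fps=1000,                  # desired throughput; guides PE/SIMD folding
    board="Pynq-Z1",                  # target platform (assumed board name)
    generate_outputs=[DataflowOutputType.ESTIMATE_REPORTS],
)

# "cnv_w1a1.onnx" is a placeholder for a Brevitas-exported FINN-ONNX model.
build_dataflow_cfg("cnv_w1a1.onnx", cfg)
```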

  20. The FINN compute engine
     • Each layer has a single compute engine
     • The number of Processing Elements (PEs) and the number of Single Instruction Multiple Data (SIMD) lanes per PE determine the throughput of the layer (see the folding sketch below)
     • PE and SIMD also determine the resource consumption of the engine (a FINN compute engine, figure)
     Features of BNN computation
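A small sketch of how the PE and SIMD folding factors set a layer's throughput, following the matrix-vector engine folding described in the FINN paper; the layer dimensions (MH outputs, MW inputs per output) and the 100 MHz clock are assumptions for illustration:

```python
# Folding: a matrix-vector engine with MH outputs and MW inputs processes one
# input vector in (MH/PE) * (MW/SIMD) cycles.

def layer_cycles(mh, mw, pe, simd):
    assert mh % pe == 0 and mw % simd == 0, "PE/SIMD must divide the layer dims"
    return (mh // pe) * (mw // simd)

mh, mw = 512, 1152            # assumed layer shape (e.g., a lowered conv layer)
for pe, simd in [(1, 1), (4, 8), (16, 32)]:
    cycles = layer_cycles(mh, mw, pe, simd)
    fps = 100e6 / cycles      # at an assumed 100 MHz clock
    print(f"PE={pe:>2}, SIMD={simd:>2}: {cycles:>8} cycles/vector, ~{fps:10.0f} vectors/s")
```

Scaling PE and SIMD up raises throughput roughly linearly, at the cost of proportionally more LUT/BRAM resources per engine.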

  21. DEMO: FINN demo

  22. In this demo you will see
     • The FINN codebase
     • Accelerator IP generation using Vivado HLS
     • A quick summary of a Vivado partial reconfiguration flow
     • Application of FINN to classify the CIFAR-10 dataset on the Pynq-Z1 board
     Vivado DPR flow (diagram)

  23. Architecture of the BNN (diagram)
     Thank you. Questions?
