FPGA implementation of Quantized (Deep) Neural Network (QNN) accelerators
Biruk Seyoum, PhD student, Scuola Superiore Sant'Anna, Pisa

Contents
Introduction
Pipelined FPGA DNN accelerators
Roof-line model and optimizing FPGA DNN accelerators
Quantized Neural Networks (QNNs)
Introduction to FINN
FINN demo
Some DNN applications
Computer vision, speech recognition, bioinformatics, ...

Some DNN properties
Enormous storage and compute requirements (a rough size estimate follows below):
− VGG-16 (528 MB, 19.6 GFLOPs)
− ResNet-152 (232 MB, 11.3 GFLOPs)
− GoogLeNet (51 MB, 1.5 GFLOPs)
Computation dominated by floating-point operations.
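The storage figures above are dominated by the weights, so model size is roughly the parameter count times the bytes per parameter. A minimal sketch of that arithmetic; the 138M parameter count is the commonly cited VGG-16 figure, not a number from the slides:

```python
# Model size ~= number of parameters x bytes per parameter (weights dominate).
# 138M is the commonly cited VGG-16 parameter count (an assumption, not from the slides).
def model_size_mib(n_params, bytes_per_param=4):   # fp32 weights
    return n_params * bytes_per_param / 2**20

print(f"VGG-16: ~{model_size_mib(138_000_000):.0f} MiB")   # ~526 MiB, close to the 528 MB quoted above
```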
FPGA DNN acceleration deployment: cloud
Hosts larger DNNs; low inference latency
Suitable for applications that are insensitive to user-interaction latency (machine translation, NN business applications, forecasting, etc.)
But: energy and transmission-latency cost of offloading data to the cloud

FPGA DNN accelerator deployment: embedded platforms
Suitable for real-time, safety-critical applications (autonomous driving, speech recognition, etc.)
Ideal for smaller DNNs
The Zynq-7010 SoC

FPGA DNN acceleration architecture: single-accelerator architecture
A single compute engine is used for all layers; usually the most compute-intensive layer is off-loaded to the FPGA engine.
FPGA DNN acceleration architecture: pipelined architecture (focus of this presentation)
Each DNN layer maps to a hardware implementation on the FPGA; adjacent layers are connected using buffers.
Pipelined DNN FPGA accelerators
[Block diagram: input images and parameters read from DDR, compute stages such as MAC, compare and divide on the FPGA, outputs written back.]
Consider N DNN layers, a batch size of B, and a given inference time per layer per frame; a worked timing sketch follows below.
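The slide's formula is not reproduced here; the following is a hedged sketch of the usual timing model for such a pipeline, assuming all N layer engines are balanced with a per-layer, per-frame time T:

```python
# Pipelined execution: after ~N*T to fill the pipeline, one frame completes every T.
def pipelined_batch_time(n_layers, batch, t_layer):
    return (n_layers + batch - 1) * t_layer      # ~= B*T for large batches

def unpipelined_batch_time(n_layers, batch, t_layer):
    return n_layers * batch * t_layer            # every frame pays the full N*T

N, B, T = 10, 100, 1e-3                          # illustrative numbers only
print(pipelined_batch_time(N, B, T))             # 0.109 s
print(unpipelined_batch_time(N, B, T))           # 1.0 s
```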
Pipelined DNN FPGA accelerators: performance
Depending on the design point, the bottleneck is either the peak FPGA compute capacity or the off-chip memory bandwidth.
The performance of the accelerator therefore depends on both:
− the peak FPGA compute capacity
− the off-chip memory (DRAM) bandwidth
The peak FPGA compute capacity is obtained from the hardware specification or from benchmarks (e.g. FLOPs/sec).
How can the accelerator performance be related to the FPGA compute capacity and the off-chip DRAM traffic?
Roof-line model
A state-of-the-art performance model for multicore architectures; it can easily be adapted to FPGA accelerators whose inputs are stored in DRAM.
It correlates compute performance, memory performance and operational intensity in a 2D graph.
Operational intensity (arithmetic intensity) is the number of operations performed on the FPGA per byte of DRAM traffic, measured in ops/byte (e.g. FLOPs/byte); designs sit at lower or higher operational intensity along the x-axis of the roof-line plot.

Attainable GFLOPS/s = min { peak compute capacity, memory bandwidth × operational intensity }
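The min{...} expression translates directly into code. The device numbers below are placeholders, not measurements of any particular FPGA:

```python
def attainable_gflops(peak_gflops, mem_bw_gb_s, op_intensity):
    """Roof-line: performance is capped either by compute or by memory traffic."""
    return min(peak_gflops, mem_bw_gb_s * op_intensity)

PEAK_GFLOPS = 100.0   # peak FPGA compute capacity (placeholder)
MEM_BW_GBS = 4.0      # off-chip DRAM bandwidth in GB/s (placeholder)

print(attainable_gflops(PEAK_GFLOPS, MEM_BW_GBS, 2.0))    # 8.0   -> memory bound
print(attainable_gflops(PEAK_GFLOPS, MEM_BW_GBS, 50.0))   # 100.0 -> compute bound
```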
How to increase the performance of an FPGA accelerator
− Increase the memory bandwidth (when memory bound)
− Migrate to a more powerful FPGA (when compute bound)
− Increase the operational intensity

Operational intensity can be increased by (see the sketch after this list):
− Maximizing the number of operations performed on data fetched from DRAM before it is written back, e.g. implementing the DNN layers as a pipeline
− Reducing the precision of the data so that more operands are brought in per DRAM transfer
− Using lower-precision computation in general, e.g. 16-bit floating point, 8-bit integer, etc.
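A small made-up example of the precision effect: the operation count stays the same while the DRAM bytes per operand shrink, so the operational intensity rises and the design moves toward the compute-bound region of the roof-line plot. The figures are illustrative only:

```python
ops = 1_000_000_000        # operations needed by a layer (made-up figure)
operands = 500_000_000     # operands fetched from DRAM (made-up figure)

for bytes_per_operand in (4, 2, 1):                 # fp32, fp16/int16, int8
    oi = ops / (operands * bytes_per_operand)       # operational intensity
    print(f"{bytes_per_operand} B/operand -> {oi:.1f} ops/byte")
# 4 B -> 0.5 ops/byte, 2 B -> 1.0 ops/byte, 1 B -> 2.0 ops/byte
```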
Quantized Neural Networks (QNNs)
Deep neural networks are typically over-parametrized. Remedies to overcome this problem include:
− Pruning
− Weight sharing
− Weight quantization (our topic)
Weight quantization involves representing weights and parameters as low-precision integers, e.g. 8-bit, 4-bit, and in the extreme case binary.

Benefits of weight quantization: reduced memory footprint
Reduced DRAM memory footprint of the weights
− E.g. AlexNet goes from 64 MB to ~2 MB with 1-bit weights (see the sketch below)
The weights can even fit inside the on-chip memory of the FPGA, i.e. they are stored inside the accelerator.
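The AlexNet figure follows directly from the precision ratio: binarizing 32-bit weights shrinks the weight storage by 32×. A minimal sketch of that arithmetic, taking the 64 MB baseline from the slide:

```python
def quantized_footprint_mb(fp32_footprint_mb, bits_per_weight):
    return fp32_footprint_mb * bits_per_weight / 32   # footprint is linear in the bit width

for bits in (8, 4, 1):
    print(f"{bits}-bit weights: {quantized_footprint_mb(64, bits):.1f} MB")
# 8-bit: 16.0 MB, 4-bit: 8.0 MB, 1-bit: 2.0 MB -- small enough for on-chip memory
```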
Benefits of weight quantization: faster computation
Reduced-precision integer arithmetic is much faster than floating-point computation (more operations can be performed per clock cycle).
Computation (e.g. MAC) on reduced-precision integers is also more FPGA friendly:
− Floating-point computation → DSPs (scarce)
− Reduced-precision computation → LUTs (abundant)
The compute engines on the FPGA also consume fewer resources; a binary-MAC sketch follows below.
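To illustrate why binary arithmetic maps to LUTs rather than DSPs: with weights and activations restricted to {−1, +1} and packed as bits, a dot product reduces to an XNOR followed by a popcount, so no hardware multipliers are needed. A purely illustrative sketch:

```python
def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1,+1} vectors packed as n-bit integers."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # bit is 1 where the two +/-1 values agree
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - n                       # map back to a sum of +/-1 products

# a = [+1, -1, +1, +1], w = [+1, +1, -1, +1]  (bit i holds the i-th element, 1 = +1)
print(binary_dot(0b1101, 0b1011, 4))   # 0: two agreements and two disagreements
```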
The FINN framework
Acceleration (mapping) flow in FINN
Quantization-aware training
Brevitas: a PyTorch library for quantization-aware training (https://github.com/Xilinx/brevitas)

ONNX representation
FINN uses an ONNX-based intermediate representation. Brevitas provides a FINN-ONNX export, and the quantization information is exported as annotations.
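A minimal sketch of the Brevitas side, assuming a small CIFAR-10-style convnet with 1-bit weights and activations; the bit widths, layer sizes and the export note are illustrative choices, not requirements from the slides:

```python
# Sketch: a tiny quantized network built from Brevitas quantized layers.
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantLinear, QuantReLU

W_BITS = 1   # weight precision (illustrative)
A_BITS = 1   # activation precision (illustrative)

model = nn.Sequential(
    QuantConv2d(3, 64, kernel_size=3, bias=False, weight_bit_width=W_BITS),
    nn.BatchNorm2d(64),
    QuantReLU(bit_width=A_BITS),
    QuantConv2d(64, 64, kernel_size=3, bias=False, weight_bit_width=W_BITS),
    nn.BatchNorm2d(64),
    QuantReLU(bit_width=A_BITS),
    nn.MaxPool2d(2),
    nn.Flatten(),
    QuantLinear(64 * 14 * 14, 10, bias=False, weight_bit_width=W_BITS),
)

# Train `model` with a normal PyTorch loop (quantization is simulated during the
# forward/backward pass), then export it to the FINN-ONNX / QONNX format for the
# FINN compiler. The export entry point has moved between Brevitas releases, so
# check the Brevitas documentation for the one matching your installed version.
```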
The FINN compiler
Generates the Vivado HLS based mapping of BNN layers to FPGAs.
Streaming architecture: each layer is mapped to a dedicated compute engine; the compute engines are pipelined and communicate via on-chip data streams (a conceptual sketch follows below).
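The following is a conceptual model of that streaming architecture, not FINN code: each "layer" runs as its own engine and passes results to the next one over a bounded stream, analogous to per-layer compute engines connected by on-chip FIFOs. The layer functions and FIFO depth are stand-ins:

```python
import threading
import queue

def engine(fn, in_q, out_q):
    """One compute engine: read from the input stream, apply the layer, write out."""
    while True:
        item = in_q.get()
        if item is None:          # end-of-stream marker
            out_q.put(None)
            return
        out_q.put(fn(item))

layers = [lambda x: x * 2, lambda x: x + 1, lambda x: x ** 2]       # stand-in "layers"
streams = [queue.Queue(maxsize=4) for _ in range(len(layers) + 1)]  # FIFOs between engines

for fn, in_q, out_q in zip(layers, streams, streams[1:]):
    threading.Thread(target=engine, args=(fn, in_q, out_q), daemon=True).start()

for frame in range(8):            # feed a batch of frames into the first FIFO
    streams[0].put(frame)
streams[0].put(None)

while (result := streams[-1].get()) is not None:
    print(result)                 # engines work concurrently on different frames
```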
The FINN flow

The FINN compute engine
Each layer has a single compute engine. The number of Processing Elements (PEs) and the number of Single Instruction Multiple Data (SIMD) lanes per PE determine the throughput of the layer; PE and SIMD also determine the resource consumption of the engine.
[Figure: a FINN compute engine]
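A rough sketch of how PE and SIMD trade throughput for resources, assuming a fully-connected layer computed as a matrix-vector product (this is a simplified folding estimate, not FINN's exact cost model):

```python
def mvau_cycles(rows, cols, pe, simd):
    """Approximate cycles per frame for a rows x cols matrix-vector layer."""
    assert rows % pe == 0 and cols % simd == 0, "folding factors must divide the layer dims"
    neuron_fold = rows // pe      # output groups each PE processes sequentially
    synapse_fold = cols // simd   # input groups processed per output
    return neuron_fold * synapse_fold

# Example: a 1024x1024 layer
print(mvau_cycles(1024, 1024, pe=1,  simd=1))    # 1048576 cycles: fully folded, small but slow
print(mvau_cycles(1024, 1024, pe=32, simd=32))   # 1024 cycles: more PEs/SIMD -> faster, more resources
```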
Features of BNN computation

FINN demo
In this demo you will see:
− The FINN codebase
− Accelerator IP generation using Vivado HLS
− A quick summary of the Vivado partial reconfiguration (DPR) flow
− Application of FINN to classify the CIFAR-10 dataset on the Pynq-Z1 board
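For the CIFAR-10 part of the demo, classification on the board is driven from Python. The snippet below is a sketch in the style of the BNN-PYNQ overlay API that accompanies FINN-generated CIFAR-10 accelerators; the class, constant and method names are recalled assumptions and may differ between releases, so check the package's example notebooks:

```python
# Assumed BNN-PYNQ-style API (names may differ by release); image path is hypothetical.
from PIL import Image
import bnn

# 1-bit weight / 1-bit activation convolutional network, executed on the FPGA overlay.
classifier = bnn.CnvClassifier(bnn.NETWORK_CNVW1A1, "cifar10", bnn.RUNTIME_HW)

im = Image.open("airplane.jpg")          # hypothetical test image
idx = classifier.classify_image(im)      # runs inference in hardware
print(classifier.class_name(idx))
```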
Architecture of the BNN

Thank you. Questions?