FPGA implementation of Quantized (Deep) Neural Network (QNN) accelerators
Biruk Seyoum, PhD student, Scuola Superiore Sant'Anna, Pisa

Contents
Introduction
Pipelined FPGA DNN accelerators
Roof-line model and optimizing FPGA DNN accelerators
Quantized Neural Networks (QNNs)
Introduction to FINN
FINN demo
Some DNN applications
Computer vision, speech recognition, bioinformatics, ...

Some DNN properties
Enormous storage and compute requirements (a rough size estimate follows below):
− VGG-16 (528 MB, 19.6 GFLOPs)
− ResNet-152 (232 MB, 11.3 GFLOPs)
− GoogLeNet (51 MB, 1.5 GFLOPs)
Computation dominated by floating-point operations.
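The storage figures above are dominated by the weights, so model size is roughly the parameter count times the bytes per parameter. A minimal sketch of that arithmetic; the 138M parameter count is the commonly cited VGG-16 figure, not a number from the slides:

```python
# Model size ~= number of parameters x bytes per parameter (weights dominate).
# 138M is the commonly cited VGG-16 parameter count (an assumption, not from the slides).
def model_size_mib(n_params, bytes_per_param=4):   # fp32 weights
    return n_params * bytes_per_param / 2**20

print(f"VGG-16: ~{model_size_mib(138_000_000):.0f} MiB")   # ~526 MiB, close to the 528 MB quoted above
```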
FPGA DNN acceleration deployment: cloud
Hosts larger DNNs; low inference latency
Suitable for applications that are insensitive to user-interaction latency (machine translation, NN business applications, forecasting, etc.)
But: energy and transmission-latency cost of offloading data to the cloud

FPGA DNN accelerator deployment: embedded platforms
Suitable for real-time, safety-critical applications (autonomous driving, speech recognition, etc.)
Ideal for smaller DNNs
The Zynq-7010 SoC

FPGA DNN acceleration architecture: single-accelerator architecture
A single compute engine is used for all layers; usually the most compute-intensive layer is off-loaded to the FPGA engine.
FPGA DNN acceleration architecture: pipelined architecture (focus of this presentation)
Each DNN layer maps to a hardware implementation on the FPGA; adjacent layers are connected using buffers.
Pipelined DNN FPGA accelerators
[Block diagram: input images and parameters read from DDR, compute stages such as MAC, compare and divide on the FPGA, outputs written back.]
Consider N DNN layers, a batch size of B, and a given inference time per layer per frame; a worked timing sketch follows below.
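The slide's formula is not reproduced here; the following is a hedged sketch of the usual timing model for such a pipeline, assuming all N layer engines are balanced with a per-layer, per-frame time T:

```python
# Pipelined execution: after ~N*T to fill the pipeline, one frame completes every T.
def pipelined_batch_time(n_layers, batch, t_layer):
    return (n_layers + batch - 1) * t_layer      # ~= B*T for large batches

def unpipelined_batch_time(n_layers, batch, t_layer):
    return n_layers * batch * t_layer            # every frame pays the full N*T

N, B, T = 10, 100, 1e-3                          # illustrative numbers only
print(pipelined_batch_time(N, B, T))             # 0.109 s
print(unpipelined_batch_time(N, B, T))           # 1.0 s
```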
Pipelined DNN FPGA accelerators: performance
Depending on the design point, the bottleneck is either the peak FPGA compute capacity or the off-chip memory bandwidth.
The performance of the accelerator therefore depends on both:
− the peak FPGA compute capacity
− the off-chip memory (DRAM) bandwidth
The peak FPGA compute capacity is obtained from the hardware specification or from benchmarks (e.g. FLOPs/sec).
How can the accelerator performance be related to the FPGA compute capacity and the off-chip DRAM traffic?
Roof-line model
A state-of-the-art performance model for multicore architectures; it can easily be adapted to FPGA accelerators whose inputs are stored in DRAM.
It correlates compute performance, memory performance and operational intensity in a 2D graph.
Operational intensity (arithmetic intensity) is the number of operations performed on the FPGA per byte of DRAM traffic, measured in ops/byte (e.g. FLOPs/byte); designs sit at lower or higher operational intensity along the x-axis of the roof-line plot.

Attainable GFLOPS/s = min { peak compute capacity, memory bandwidth × operational intensity }
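The min{...} expression translates directly into code. The device numbers below are placeholders, not measurements of any particular FPGA:

```python
def attainable_gflops(peak_gflops, mem_bw_gb_s, op_intensity):
    """Roof-line: performance is capped either by compute or by memory traffic."""
    return min(peak_gflops, mem_bw_gb_s * op_intensity)

PEAK_GFLOPS = 100.0   # peak FPGA compute capacity (placeholder)
MEM_BW_GBS = 4.0      # off-chip DRAM bandwidth in GB/s (placeholder)

print(attainable_gflops(PEAK_GFLOPS, MEM_BW_GBS, 2.0))    # 8.0   -> memory bound
print(attainable_gflops(PEAK_GFLOPS, MEM_BW_GBS, 50.0))   # 100.0 -> compute bound
```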
How to increase the performance of an FPGA accelerator
− Increase the memory bandwidth (when memory bound)
− Migrate to a more powerful FPGA (when compute bound)
− Increase the operational intensity

Operational intensity can be increased by (see the sketch after this list):
− Maximizing the number of operations performed on data fetched from DRAM before it is written back, e.g. implementing the DNN layers as a pipeline
− Reducing the precision of the data so that more operands are brought in per DRAM transfer
− Using lower-precision computation in general, e.g. 16-bit floating point, 8-bit integer, etc.
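A small made-up example of the precision effect: the operation count stays the same while the DRAM bytes per operand shrink, so the operational intensity rises and the design moves toward the compute-bound region of the roof-line plot. The figures are illustrative only:

```python
ops = 1_000_000_000        # operations needed by a layer (made-up figure)
operands = 500_000_000     # operands fetched from DRAM (made-up figure)

for bytes_per_operand in (4, 2, 1):                 # fp32, fp16/int16, int8
    oi = ops / (operands * bytes_per_operand)       # operational intensity
    print(f"{bytes_per_operand} B/operand -> {oi:.1f} ops/byte")
# 4 B -> 0.5 ops/byte, 2 B -> 1.0 ops/byte, 1 B -> 2.0 ops/byte
```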
Quantized Neural Networks (QNNs)
Deep neural networks are typically over-parametrized. Remedies to overcome this problem include:
− Pruning
− Weight sharing
− Weight quantization (our topic)
Weight quantization involves representing weights and parameters as low-precision integers, e.g. 8-bit, 4-bit, and in the extreme case binary.

Benefits of weight quantization: reduced memory footprint
Reduced DRAM memory footprint of the weights
− E.g. AlexNet goes from 64 MB to ~2 MB with 1-bit weights (see the sketch below)
The weights can even fit inside the on-chip memory of the FPGA, i.e. they are stored inside the accelerator.
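The AlexNet figure follows directly from the precision ratio: binarizing 32-bit weights shrinks the weight storage by 32×. A minimal sketch of that arithmetic, taking the 64 MB baseline from the slide:

```python
def quantized_footprint_mb(fp32_footprint_mb, bits_per_weight):
    return fp32_footprint_mb * bits_per_weight / 32   # footprint is linear in the bit width

for bits in (8, 4, 1):
    print(f"{bits}-bit weights: {quantized_footprint_mb(64, bits):.1f} MB")
# 8-bit: 16.0 MB, 4-bit: 8.0 MB, 1-bit: 2.0 MB -- small enough for on-chip memory
```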
Benefits of weight quantization: faster computation
Reduced-precision integer arithmetic is much faster than floating-point computation (more operations can be performed per clock cycle).
Computation (e.g. MAC) on reduced-precision integers is also more FPGA friendly:
− Floating-point computation → DSPs (scarce)
− Reduced-precision computation → LUTs (abundant)
The compute engines on the FPGA also consume fewer resources; a binary-MAC sketch follows below.
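To illustrate why binary arithmetic maps to LUTs rather than DSPs: with weights and activations restricted to {−1, +1} and packed as bits, a dot product reduces to an XNOR followed by a popcount, so no hardware multipliers are needed. A purely illustrative sketch:

```python
def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1,+1} vectors packed as n-bit integers."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # bit is 1 where the two +/-1 values agree
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - n                       # map back to a sum of +/-1 products

# a = [+1, -1, +1, +1], w = [+1, +1, -1, +1]  (bit i holds the i-th element, 1 = +1)
print(binary_dot(0b1101, 0b1011, 4))   # 0: two agreements and two disagreements
```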
The FINN framework
Acceleration (mapping) flow in FINN
Quantization-aware training
Brevitas: a PyTorch library for quantization-aware training (https://github.com/Xilinx/brevitas)

ONNX representation
FINN uses an ONNX-based intermediate representation. Brevitas provides a FINN-ONNX export, and the quantization information is exported as annotations.
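A minimal sketch of the Brevitas side, assuming a small CIFAR-10-style convnet with 1-bit weights and activations; the bit widths, layer sizes and the export note are illustrative choices, not requirements from the slides:

```python
# Sketch: a tiny quantized network built from Brevitas quantized layers.
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantLinear, QuantReLU

W_BITS = 1   # weight precision (illustrative)
A_BITS = 1   # activation precision (illustrative)

model = nn.Sequential(
    QuantConv2d(3, 64, kernel_size=3, bias=False, weight_bit_width=W_BITS),
    nn.BatchNorm2d(64),
    QuantReLU(bit_width=A_BITS),
    QuantConv2d(64, 64, kernel_size=3, bias=False, weight_bit_width=W_BITS),
    nn.BatchNorm2d(64),
    QuantReLU(bit_width=A_BITS),
    nn.MaxPool2d(2),
    nn.Flatten(),
    QuantLinear(64 * 14 * 14, 10, bias=False, weight_bit_width=W_BITS),
)

# Train `model` with a normal PyTorch loop (quantization is simulated during the
# forward/backward pass), then export it to the FINN-ONNX / QONNX format for the
# FINN compiler. The export entry point has moved between Brevitas releases, so
# check the Brevitas documentation for the one matching your installed version.
```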
The FINN compiler
Generates the Vivado HLS based mapping of BNN layers to FPGAs.
Streaming architecture: each layer is mapped to a dedicated compute engine; the compute engines are pipelined and communicate via on-chip data streams (a conceptual sketch follows below).
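The following is a conceptual model of that streaming architecture, not FINN code: each "layer" runs as its own engine and passes results to the next one over a bounded stream, analogous to per-layer compute engines connected by on-chip FIFOs. The layer functions and FIFO depth are stand-ins:

```python
import threading
import queue

def engine(fn, in_q, out_q):
    """One compute engine: read from the input stream, apply the layer, write out."""
    while True:
        item = in_q.get()
        if item is None:          # end-of-stream marker
            out_q.put(None)
            return
        out_q.put(fn(item))

layers = [lambda x: x * 2, lambda x: x + 1, lambda x: x ** 2]       # stand-in "layers"
streams = [queue.Queue(maxsize=4) for _ in range(len(layers) + 1)]  # FIFOs between engines

for fn, in_q, out_q in zip(layers, streams, streams[1:]):
    threading.Thread(target=engine, args=(fn, in_q, out_q), daemon=True).start()

for frame in range(8):            # feed a batch of frames into the first FIFO
    streams[0].put(frame)
streams[0].put(None)

while (result := streams[-1].get()) is not None:
    print(result)                 # engines work concurrently on different frames
```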
The FINN flow

The FINN compute engine
Each layer has a single compute engine. The number of Processing Elements (PEs) and the number of Single Instruction Multiple Data (SIMD) lanes per PE determine the throughput of the layer; PE and SIMD also determine the resource consumption of the engine.
[Figure: a FINN compute engine]
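A rough sketch of how PE and SIMD trade throughput for resources, assuming a fully-connected layer computed as a matrix-vector product (this is a simplified folding estimate, not FINN's exact cost model):

```python
def mvau_cycles(rows, cols, pe, simd):
    """Approximate cycles per frame for a rows x cols matrix-vector layer."""
    assert rows % pe == 0 and cols % simd == 0, "folding factors must divide the layer dims"
    neuron_fold = rows // pe      # output groups each PE processes sequentially
    synapse_fold = cols // simd   # input groups processed per output
    return neuron_fold * synapse_fold

# Example: a 1024x1024 layer
print(mvau_cycles(1024, 1024, pe=1,  simd=1))    # 1048576 cycles: fully folded, small but slow
print(mvau_cycles(1024, 1024, pe=32, simd=32))   # 1024 cycles: more PEs/SIMD -> faster, more resources
```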
Features of BNN computation

FINN demo
In this demo you will see:
− The FINN codebase
− Accelerator IP generation using Vivado HLS
− A quick summary of the Vivado partial reconfiguration (DPR) flow
− Application of FINN to classify the CIFAR-10 dataset on the Pynq-Z1 board
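For the CIFAR-10 part of the demo, classification on the board is driven from Python. The snippet below is a sketch in the style of the BNN-PYNQ overlay API that accompanies FINN-generated CIFAR-10 accelerators; the class, constant and method names are recalled assumptions and may differ between releases, so check the package's example notebooks:

```python
# Assumed BNN-PYNQ-style API (names may differ by release); image path is hypothetical.
from PIL import Image
import bnn

# 1-bit weight / 1-bit activation convolutional network, executed on the FPGA overlay.
classifier = bnn.CnvClassifier(bnn.NETWORK_CNVW1A1, "cifar10", bnn.RUNTIME_HW)

im = Image.open("airplane.jpg")          # hypothetical test image
idx = classifier.classify_image(im)      # runs inference in hardware
print(classifier.class_name(idx))
```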
Architecture of the BNN

Thank you. Questions?