A Flexible Design Automation Tool for Accelerating Quantized Spectral CNNs
Rachit Rajat, Hanqing Zeng, Viktor Prasanna
University of Southern California
fpga.usc.edu
FPL 2019, Barcelona
Outline
• Introduction
• Background
• Tool overview
• Architecture template
• Optimizations
• Experiments
• Conclusion
Introduction
• Challenges in CNN inference on FPGAs:
  • Computation complexity: sliding-window operations
  • Design effort: design space search & manual hardware implementation
  • Design optimization: resource utilization & clock rate for large-scale designs
  • Design flexibility: various CNN models, FPGAs, and performance requirements
• Need fast generation of:
  • Performance meta-data to tune CNN models
  • Hardware code to deploy the inference pipeline
Background & Motivation: Spectral CNN on FPGAs • Convolutional Neural Networks (CNN) • Spectral convolution [1] • • Sliding window operation Hadamard product ℱ : Fourier transform ℱ �� : Inverse Fourier transform • ������ �� ����� ���� • • 𝐽 ∗ : image • 𝐿 ���� : conv. kernels after FFT • Partitioning on and padding on Overlap-and-Add • Why spectral CNNs? • Computation reduction: for AlexNet, VGG16,…. [1]: Zeng, Chen, Zhang, Prasanna, A framework for generating high throughput CNN implementations on FPGAs, Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays 4
Problem is Non-trivial
• Goal: fast and flexible design space exploration, plus generation of Verilog for high-throughput inference
• Constraints: limited BRAM and DSP resources
• Need to explore a huge design space quickly
• Optimization needed in the spectral convolution engine to support large FPGA devices
Tool Overview (1)
• Automated tool for generating quantized spectral CNN accelerators in synthesizable Verilog
• Performance metrics:
  • Time to generate a design
  • Throughput of the generated design
• Flexibility:
  • Quantization schemes: various bit widths for kernels and activations
  • FPGA architecture: various resources (DSPs, BRAMs, bandwidth, etc.)
  • CNN models: various model parameters (channels, kernel sizes, image sizes, etc.)
Tool Overview (2)
• Inputs to the proposed tool:
  • CNN model: for each layer, activation size, kernel size, channel size
  • Quantization scheme: for each layer, kernel bit-width and activation bit-width
  • FPGA specification: DSP, BRAM, bandwidth, latency
• Outputs of the proposed tool:
  • Meta-data: estimated resource breakdown, estimated throughput, input image size, bottlenecks
  • Verilog code: throughput-optimized accelerator
  • Data layout: data tiles in external memory
• (A hypothetical sketch of the three inputs follows this slide.)
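For illustration only, here is a hypothetical sketch of what the three inputs could look like if written down as plain Python dictionaries. The field names and all example values below are assumptions for exposition; the tool's real input format is not specified on this slide.

```python
# Hypothetical input description (field names and values are illustrative assumptions).
cnn_model = {                      # per-layer model parameters
    "conv1": {"activation": (224, 224), "kernel": 3, "in_channels": 3,  "out_channels": 64},
    "conv2": {"activation": (112, 112), "kernel": 3, "in_channels": 64, "out_channels": 128},
}
quantization_scheme = {            # per-layer bit-widths
    "conv1": {"kernel_bits": 8, "activation_bits": 8},
    "conv2": {"kernel_bits": 4, "activation_bits": 8},
}
fpga_spec = {                      # device resources (numbers are placeholders)
    "dsp": 5760, "bram_blocks": 11721,
    "ext_bandwidth_gbps": 76.8, "ext_latency_cycles": 200,
}
# The tool would consume these and emit: meta-data (resource breakdown, estimated
# throughput, bottlenecks), throughput-optimized Verilog, and an external-memory data layout.
```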
Tool Overview (3)
• Inputs: FPGA spec., CNN model, quantization scheme
• Algorithmic optimization: Overlap-and-Add, Concatenate-and-Pad, spectral loop tiling
• Architecture template
• Architectural optimization: optimization problem formulation (objective and constraints: see paper), design space exploration, design generation
• Outputs: meta-data, accelerator, data layout
Architecture Template
• Design parameters: FFT size, FFT parallelism, batch size, systolic array size, systolic array parallelism, and number of channels
• Architecture template for Verilog generation: [block diagram of the accelerator pipeline]
Optimization 1: Variable Bit-width Multiplier
• Requirement unique to spectral CNNs: low bit-width complex multiplication
• Challenge: DSPs accept fixed, high bit-width inputs
• Idea: pad the low bit-width data to match the DSP input width (see the sketch below)
[Figure: performance estimation on Stratix 10, with a worked example]
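The sketch below is not the paper's implementation; it shows one simple way such padding can work, assuming the padding is plain sign extension to an assumed 18-bit DSP operand width, with the complex product formed from real multiplies. The tool's actual packing scheme and DSP mapping may differ.

```python
# Minimal sketch: widen ("pad") low bit-width two's-complement operands to an assumed
# fixed DSP input width, then compute a complex product from real multiplies.
DSP_WIDTH = 18  # assumed DSP operand width; real devices differ per DSP mode

def pad_to_dsp(value, from_bits, to_bits=DSP_WIDTH):
    """Re-encode a from_bits two's-complement word at to_bits (sign extension)."""
    sign = value & (1 << (from_bits - 1))
    return (value - (sign << 1)) & ((1 << to_bits) - 1)

def decode(word, bits=DSP_WIDTH):
    """Decode a bits-wide two's-complement word back to a Python int."""
    sign = word & (1 << (bits - 1))
    return word - (sign << 1)

# 4-bit quantized complex activation (-3 + 3j) and kernel coefficient (6 - 2j)
a, b, c, d = 0b1101, 0b0011, 0b0110, 0b1110
A, B, C, D = (decode(pad_to_dsp(x, 4)) for x in (a, b, c, d))

real = A * C - B * D   # the real multiplies are what map onto DSP slices
imag = A * D + B * C
assert complex(real, imag) == (-3 + 3j) * (6 - 2j)  # padding preserves the product
```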
Optimization 2: Switching Parallelization Dimensions (1)
• Challenge: concurrent memory accesses for the Hadamard product
• Example: for an N x N spectral tile (N = FFT size), each Hadamard product performs N^2 element-wise operations on N^2 distinct BRAM locations
• Thousands of BRAM accesses per cycle are needed to keep thousands of DSPs busy
• Severe clock rate degradation due to the pressure on BRAMs
Optimization 2: Switching Parallelization Dimensions (2)
• Parallelize along the width & height dimensions -> Hadamard products
• Parallelize along the batch & channel dimensions -> matrix dot products
• Systolic array: blocked matrix multiplication
• Analysis: same total DSP operations, but far fewer BRAM accesses per cycle for the same parallelism
• Efficient for FPGAs with a large number of DSPs
• (See the numpy sketch below: the per-position Hadamard accumulation is exactly a matrix product.)
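A minimal numpy sketch (not from the paper) of why switching dimensions helps: at a single spectral position (w, h), accumulating Hadamard products over input channels for every (image, output-channel) pair is exactly a matrix product over the batch and channel dimensions, which is the shape a systolic array consumes efficiently. The sizes below are illustrative.

```python
# Minimal sketch: per-position Hadamard accumulation over channels == matrix product.
import numpy as np

B, Cin, Cout = 4, 16, 32                                       # batch, in-channels, out-channels
X = np.random.rand(B, Cin) + 1j * np.random.rand(B, Cin)       # FFT'd activations at (w, h)
K = np.random.rand(Cin, Cout) + 1j * np.random.rand(Cin, Cout) # FFT'd kernels at (w, h)

# Element-wise view: out[b, o] = sum_c X[b, c] * K[c, o]
out_hadamard = np.einsum('bc,co->bo', X, K)

# Parallelizing along batch & channel dimensions turns the same work into one matmul
out_matmul = X @ K
assert np.allclose(out_hadamard, out_matmul)
```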
Optimization 3: Design Space Exploration
• Challenge: large design space
  • 4 HW parameters: parallelism of the modules
  • 3 SW parameters: data layout & tiling
• Optimization goal: inference throughput (batch processing); identify the bottleneck stage in the pipeline
• Optimization problem / constraints (see paper):
  1. SW-HW coordination: tiling matches (device) parallelism
  2. Limited resources: share DSPs across FFT / systolic array / IFFT; share BRAM across input / kernel / output buffers; share bandwidth across input / output activations
  3. Load balance: keep the pipeline always busy
• Optimization technique: hierarchical priority parameter sweep (a toy sweep sketch follows this slide)
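As a toy illustration of what a sweep over this space can look like: the resource model, latency model, parameter ranges, and device budgets below are placeholders, not the paper's formulation, and the sweep is exhaustive rather than the hierarchical priority sweep used by the tool.

```python
# Minimal sketch: pick the hardware configuration with the best estimated throughput
# that still fits the device (all models below are illustrative placeholders).
from itertools import product

DSP_BUDGET, BRAM_BUDGET = 5760, 11721          # placeholder device budgets

def resources(fft_par, sa_rows, sa_cols, batch):
    dsp  = 4 * fft_par + sa_rows * sa_cols     # placeholder resource model
    bram = 2 * batch * fft_par + sa_rows + sa_cols
    return dsp, bram

def throughput(fft_par, sa_rows, sa_cols, batch):
    # pipeline throughput is set by its slowest stage (placeholder latency model)
    fft_stage = batch / fft_par
    mm_stage  = batch * 64 / (sa_rows * sa_cols)
    return 1.0 / max(fft_stage, mm_stage)

best = max(
    (cfg for cfg in product([8, 16, 32, 64],   # FFT parallelism
                            [8, 16, 32],       # systolic array rows
                            [8, 16, 32],       # systolic array cols
                            [1, 2, 4, 8])      # batch size
     if all(u <= b for u, b in zip(resources(*cfg), (DSP_BUDGET, BRAM_BUDGET)))),
    key=lambda cfg: throughput(*cfg),
)
print("best (fft_par, sa_rows, sa_cols, batch):", best)
```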
Experimental Setup
• Target FPGA devices: Stratix-10 GX, Stratix-V GX
• Bit widths: 2- to 16-bit
• CNNs: AlexNet, VGG16
• Tool execution: design space exploration + generation on an Intel Core-i5 CPU
Comparison with State-of-the-art Designs (1)
• Comparison with the state-of-the-art spectral CNN tool (FPGA '18)

                         AlexNet                         VGG16
                   FPGA '18*     Proposed         FPGA '18*     Proposed
FPGA               Stratix-10    Stratix-10       Stratix-10    Stratix-10
                   GX2800        GX2800           GX2800        GX2800
Clock (MHz)        120           200              120           200
Quantization       16-bit        16-bit           16-bit        16-bit
DSP                3264 (56%)    3264 (56%)       3264 (56%)    3264 (56%)
Logic              413K (45%)    140K (15%)       419K (47%)    140K (15%)
BRAM               6129 (52%)    1616 (22%)       6133 (32%)    2616 (22%)
Throughput (img/sec) 1704        2841             77            129

• Switching parallelization dimensions improves clock rate
• Optimized architectural template reduces logic
*: Original design on Stratix-V; re-implemented on Stratix-10
Comparison with State-of-the-art Designs (2)
• Comparison with the state-of-the-art spatial CNN tool (ICCAD '18), 16-bit

                         AlexNet                         VGG16
                   ICCAD '18     Proposed         ICCAD '18     Proposed
FPGA               UltraScale    Stratix-10       UltraScale    Stratix-10
                   KU115         GX2800           KU115         GX2800
Clock (MHz)        220           200              235           200
Quantization       16-bit        16-bit           16-bit        16-bit
DSP                4854 (88%)    3264 (56%)       4318 (78%)    3264 (56%)
Logic              262K (40%)    140K (15%)       258K (39%)    140K (15%)
BRAM               986 (46%)     1616 (22%)       1578 (81%)    2616 (22%)
Throughput (img/sec) 1126        2841             65            129
Comparison with State-of-the-art Designs (3)
• Comparison with the state-of-the-art spatial CNN tool (ICCAD '18), 8-bit

                         AlexNet                         VGG16
                   ICCAD '18     Proposed         ICCAD '18     Proposed
FPGA               UltraScale    Stratix-10       UltraScale    Stratix-10
                   KU115         GX2800           KU115         GX2800
Clock (MHz)        220           200              235           200
Quantization       8-bit         8-bit            8-bit         8-bit
DSP                4854 (88%)    4480 (78%)       4318 (78%)    4480 (78%)
Logic              262K (40%)    150K (16%)       258K (39%)    150K (16%)
BRAM               986 (46%)     5232 (45%)       1578 (81%)    5232 (45%)
Throughput (img/sec) 2252        9114             130           308

• Throughput improvement due to:
  • Spectral convolution algorithm
  • Optimized design generation process
Evaluation on Flexibility (1)
• Flexibility w.r.t. CNN models
[Figure: per-layer results; x-axis: layer index]
Evaluation on Flexibility (2)
• Flexibility w.r.t. FPGA resources
[Figures: results vs. fraction of DSPs available and fraction of BRAMs available]
Flexible Tool for Automatic Generation of Pruned and Quantized Spectral CNNs: The Big Picture
• Training + optimization: the model and training data, together with quantization and sparsity constraints, drive quantization and pruning, producing a compressed CNN model and quantization parameters
• Hardware abstraction: H/W building blocks (FFT, SPN, systolic array)
• Design space exploration (C++): takes the compressed CNN model, hardware constraints, and the hardware abstraction
• Hardware mapping engine: emits Verilog for the target FPGA
fpga.usc.edu
Conclusion
• Design automation tool for generating high-throughput spectral CNN accelerators
• Flexibility:
  • CNN models
  • Quantization schemes
  • FPGA devices
• Significantly higher throughput than designs generated by state-of-the-art tools
• Spatial or spectral?
• Implications for multi-core and GPU platforms?
Thank you!
https://fpga.usc.edu/
prasanna@usc.edu