Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim


  1. Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim
  Farzad Farshchi §, Qijing Huang ¶, Heechul Yun §
  § University of Kansas, ¶ University of California, Berkeley

  2. SiFive Internship
  • Rocket Chip SoC + NVDLA
  • Rocket Chip: open-source RISC-V SoC
  • NVDLA: open-source DNN inference engine
  • Demoed the integration at Hot Chips '18

  3. SiFive Internship

  4. Motivation
  • Useful platform for research
  • Limitations:
    • No L2 cache
    • Fast DRAM, slow SoC
    • Expensive: $7k FPGA board
  • Let's integrate NVDLA into FireSim

  5. FireSim
  • Fast, cycle-exact full-system simulator that runs on FPGAs in the cloud
  • Simulated design is derived from the Rocket Chip RTL
  • Decouples the target from FPGA DRAM; adds its own DRAM and LLC models
  • Easy to use, with very good documentation

  6. How FireSim Works
  • Transforms the RTL into a target model
  • Inserts queues at the I/O ports of the target
  • Creates a token-based simulator: in each cycle, a token is consumed by the model
  • What if the token queue is empty? The model has to wait: stall the target pipeline (see the sketch below)
  • Figure credit: Donggyu Kim et al., "Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL"
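  To make the token idea concrete, here is a minimal plain-Scala sketch of the mechanism, under the assumption that one token corresponds to one target clock cycle. Token, ModelChannel, and stepTarget are made-up names for illustration, not FireSim APIs.

      import scala.collection.mutable

      object TokenSimSketch {
        // One token carries the I/O values for one target clock cycle.
        final case class Token(payload: BigInt)

        // A queue inserted at an I/O port of the target by the FireSim transform.
        final class ModelChannel {
          private val queue = mutable.Queue.empty[Token]
          def push(t: Token): Unit = queue.enqueue(t)
          def tryPop(): Option[Token] =
            if (queue.nonEmpty) Some(queue.dequeue()) else None
        }

        // Advance the target by one simulated cycle only if an input token is
        // available; otherwise the model (and the target it drives) stalls.
        def stepTarget(in: ModelChannel, out: ModelChannel): Boolean =
          in.tryPop() match {
            case Some(tok) =>
              out.push(Token(tok.payload + 1)) // stand-in for the target's real logic
              true                             // one target cycle simulated
            case None =>
              false                            // input queue empty: stall this host cycle
          }
      }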

  7. How to Stall the Target Pipeline?
  • For Chisel code: Rocket Chip is written in Chisel
  • For Verilog (we added): see the sketch below
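  The code shown on the slide is not reproduced in this transcript. As a rough illustration of the Verilog case only, the Chisel sketch below gates the clock of a Verilog black box with a token-available signal, so the block sees a clock edge only when a token is consumed. AcceleratorBlackBox, fire, and the port widths are assumptions, not the real NVDLA or FireSim interface, and the AND-gated clock stands in for a proper clock-gating cell.

      import chisel3._

      // Hypothetical wrapper for a Verilog accelerator (stand-in for NVDLA).
      // Port names and widths are assumptions, for illustration only.
      class AcceleratorBlackBox extends BlackBox {
        val io = IO(new Bundle {
          val clk = Input(Clock())
          val rst = Input(Bool())
          val in  = Input(UInt(64.W))
          val out = Output(UInt(64.W))
        })
      }

      // Sketch: the Verilog block advances only while `fire` is high, i.e. only
      // when the simulator's token queues can supply a token this host cycle.
      class GatedAccelerator extends Module {
        val io = IO(new Bundle {
          val fire = Input(Bool())      // token available this host cycle
          val in   = Input(UInt(64.W))
          val out  = Output(UInt(64.W))
        })
        val acc = Module(new AcceleratorBlackBox)
        // AND-gating the clock is illustrative; a real design would use a
        // glitch-free clock gater (an ICG cell or an FPGA clock buffer).
        acc.io.clk := (clock.asBool && io.fire).asClock
        acc.io.rst := reset.asBool
        acc.io.in  := io.in
        io.out     := acc.io.out
      }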

  8. Overall System Architecture
  • NVDLA is integrated in the target
  • LLC + memory model: not part of the target; added by FireSim
  • Supports multiple memory models, e.g. DDR3, constant latency
  • Runtime-configurable LLC: different set, way, and block sizes without rebuilding the FPGA image

  9. Integrate Your Own Accelerator
  • Any accelerator can be integrated (if it fits inside the FPGA)
  • Develop and test software for your accelerator in a Linux environment before having the chip in hand
  • Get fast and accurate performance results

  10. NVDLA
  • Scalable: nv_small, nv_medium, nv_large
  • We used nv_large: 2048 MACs
  • Convolutional core: matrix-matrix multiplication (see the MAC sketch below)
  • Post-processing: activation function, pooling, etc.
  • Figure adopted from "The NVIDIA Deep Learning Accelerator", https://goo.gl/Znyba5
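  For context on the MAC count, the Chisel sketch below shows one INT8 multiply-accumulate stage, the kind of datapath element a convolution core replicates (2048 of them in nv_large). It is purely illustrative, not NVDLA's actual RTL; the port names and the 32-bit accumulator width are assumptions.

      import chisel3._

      // Illustrative INT8 multiply-accumulate (MAC) stage. Not NVDLA RTL;
      // names and the 32-bit accumulator width are assumptions.
      class Int8Mac extends Module {
        val io = IO(new Bundle {
          val a     = Input(SInt(8.W))    // activation operand
          val w     = Input(SInt(8.W))    // weight operand
          val clear = Input(Bool())       // restart the running sum
          val acc   = Output(SInt(32.W))  // accumulated partial sum
        })
        val accReg  = RegInit(0.S(32.W))
        val product = (io.a * io.w).pad(32)  // 16-bit product, sign-extended
        accReg := Mux(io.clear, product, accReg + product)
        io.acc := accReg
      }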

  11. Performance Analysis (I)
  • Baseline config:
    • Quad-core Rocket core, 3.2 GHz
    • NVDLA: 2048 INT8 MACs, 512 KiB convolution buffer, 3.2 GHz
    • LLC: shared 2 MiB, 8-way, 64 B blocks
    • DRAM: 4 ranks, 8 banks, FR-FCFS scheduling
  • YOLOv3: 416 x 416 frame, 66 billion operations
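  Two numbers that follow from this baseline configuration, assuming each MAC performs a multiply and an add (two operations) per cycle:

      \[ \text{peak NVDLA throughput} = 2048 \times 2 \times 3.2\,\text{GHz} \approx 13.1\ \text{INT8 TOPS} \]
      \[ \text{LLC sets} = \frac{2\,\text{MiB}}{8\ \text{ways} \times 64\,\text{B}} = 4096 \]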

  12. Performance Analysis (II)
  • Frame processing time: 133 ms (7.5 fps)
    • 67 ms on NVDLA
    • 66 ms on the processor, multithreaded with OpenMP
  • Layers not supported by NVDLA run on the processor: custom YOLO layer, upsampling, FP ⇔ INT8 conversion
  • ✔ Common DNN layers run very fast
  • ✖ Computations not supported by the accelerator can slow you down
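  The frame rate follows directly from the two components of the frame time:

      \[ 67\,\text{ms (NVDLA)} + 66\,\text{ms (processor)} = 133\,\text{ms}
         \;\Rightarrow\; \tfrac{1}{0.133\,\text{s}} \approx 7.5\,\text{fps} \]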

  13. Performance Comparison
  • Rocket: baseline config, no NVDLA (5.5x)
  • NVDLA+Rocket: baseline config (407x)
  • Xeon: E5-2658 v3
  • Titan Xp: Pascal architecture, 3840 CUDA cores
  • Titan Xp consumes more power:
    • Titan Xp: board TDP 250 W, 471 mm² in 16 nm
    • NVDLA IP: 766 mW peak, 3.3 mm² in 16 nm
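  Taking straight ratios of the power and area numbers on the slide (keeping in mind that the Titan Xp figure is a whole-board TDP while the NVDLA figure is peak power of the IP alone, so this is not a like-for-like comparison):

      \[ \frac{250\,\text{W}}{766\,\text{mW}} \approx 326\times \ \text{(power)}, \qquad
         \frac{471\,\text{mm}^2}{3.3\,\text{mm}^2} \approx 143\times \ \text{(area)} \]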

  14. Sharing the LLC with the Accelerator
  • Sharing the LLC can be a good alternative to a scratchpad (1.6x speedup*)
    • Consumes less chip area
    • Less programming effort
  • Performance does not vary with the LLC size, but does vary with the block size
    • Streaming access pattern: not much data reuse left
    • NVDLA minimum burst length: 32 B
    • A hardware prefetcher should help
  * Speedup is measured w.r.t. the design with no LLC

  15. Contention in the Memory System
  • We care about worst-case execution time in real-time systems (2.5x*)
  • A synthetic benchmark runs on the CPU, stressing the memory system
  • NVDLA execution time is measured
  * Normalized to solo execution time, i.e. running in isolation

  16. Conclusion
  • We integrated NVDLA with a RISC-V SoC on FireSim
    • Fast, easy to use
    • No FPGA board needed: runs on the Amazon cloud
  • Can be used for architectural/system research
    • We will be using it for research in real-time embedded systems
  • Open-sourced and publicly available at https://github.com/CSL-KU/firesim-nvdla/ (Google "firesim nvdla")

  17. Demo

  18. Questions?
