for Coarse-Grained FPGA Overlays Abhishek Kumar Jain, Douglas L. - PowerPoint PPT Presentation

Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays Abhishek Kumar Jain, Douglas L. Maskell Suhaib A. Fahmy School of Computer Science and Engineering School of Engineering Nanyang Technological University (NTU), Singapore University of Warwick, UK 3 rd International Workshop on Overlay Architectures for FPGAs (OLAF) 22 nd Feb 2017, Monterey, CA, USA 1

Hardware Accelerators • Many different platforms for hardware acceleration of compute intensive applications These include GPUs, FPGAs, etc. • • GPUs are more widely used due to • The ease of use and better design productivity • Better support for the OpenCL programming model • Faster design cycles 2

Hardware Accelerators • Many different platforms for hardware acceleration of compute intensive applications These include GPUs, FPGAs, etc. • • GPUs are more widely used due to • The ease of use and better design productivity • Better support for the OpenCL programming model • Faster design cycles • Issues with FPGA based accelerators (past and current) 1. Low level of programming abstraction (RTL) – HLS tools, Eg: Vivado-HLS 1. Low level of programming abstraction (RTL) 2. Runtime management (interfaces and drivers) – Newer tools, Eg: SDSoC, SDAccel, AOCL 2. Runtime management (interfaces and drivers) 3. Long compile times and slow switching between application kernels – Overlays 3. Long compile times and slow switching between application kernels 4. A lack of application portability and performance scalability – Overlays + OpenCL 4. A lack of application portability and performance scalability 3

So what is an Overlay? • A coarse-grained circuit abstraction which sits on top of the FPGA fabric • Many similarities to CGRAs • Because it is coarse-grained, it provides easier application mapping, faster compilation and faster application kernel configuration Images from University of Toronto [3] 4

So what is an Overlay? Mesh-of-FU Overlay (University of Toronto) [3] MXP Overlay (VectorBlox) [1] Intermediate Fabric (University of Florida) [4] SCGRA Overlay (HKUST, Hong Kong) DySER [2] (UW Madison) [5] 1. Severance, Aaron, and Guy GF Lemieux. "Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor." CODES+ ISSS, 2013. 2. Liu, Cheng, Ho-Cheung Ng, and Hayden Kwok-Hay So. "QuickDough: a rapid fpga loop accelerator design framework using soft CGRA overlay." FPT 2015. 3. Capalija, Davor, and Tarek S. Abdelrahman. "A high-performance overlay architecture for pipelined execution of data flow graphs." FPL 2013. 4. G. Stitt and J. Coole , “Intermediate fabrics: Virtual architectures for near - instant FPGA compilation,” IEEE ESL, vol. 3(3), 2011 . 6 5. J. Benson et al., "Design, integration and implementation of the DySER hardware accelerator into OpenSPARC," HPCA 2012.

Classification based on Architecture [6] • Time multiplexed • Spatially configured • Packet switched, and • Circuit switched FU-0 FU-1 FU-2 FU-3 Overlay (Time-multiplexed) Overlay (Spatially-configured) Similar to conventional CGRAs Images from: University of Toronto [3] 6 6. Kapre, Nachiket, et al. "Packet switched vs. time multiplexed FPGA overlay networks." FCCM 2006.

Started looking at SC overlays in 2013 FU area Fmax GOPS LUTs/GOPS 140 9000 450 7620 8000 400 120 7000 350 100 6000 300 80 250 5000 175 4000 200 60 3000 150 40 2000 100 20 6.3 1000 50 0 0 0 GOPS LUTs/GOPS Fmax 7

Started looking at SC overlays in 2013 FU area Zynq: 128 DSP blocks at 300 MHz V7: 800 DSP blocks at 380 MHz Jain, Fahmy, Maskell Fmax GOPS LUTs/GOPS 140 9000 450 7620 8000 400 115 120 7000 350 300 100 6000 300 80 250 5000 175 200 4000 60 3000 150 40 2000 100 20 6.3 1000 50 320 0 0 0 GOPS LUTs/GOPS Fmax 8

Started looking at SC overlays in 2013 FU area Zynq: 128 DSP blocks at 300 MHz V7: 800 DSP blocks at 380 MHz Jain, Fahmy, Maskell Fmax GOPS LUTs/GOPS 140 9000 450 395 7620 8000 400 115 120 7000 350 300 100 6000 300 80 250 5000 175 4000 200 60 3000 150 40 23 2000 100 20 6.3 1000 50 320 58 0 0 0 GOPS LUTs/GOPS Fmax 9

Coarse-grained Overlays FU area Zynq: 128 DSP blocks at 300 MHz V7: 800 DSP blocks at 380 MHz Jain, Fahmy, Maskell 1000x faster place and route Place and route within a second on embedded ARM in Zynq Get a very fast configuration time. ( μ s rather than ms) 10

Coarse-grained Overlays So can we use these overlays to provide a More GPU like experience? • OpenCL on GPU allows: • Fast compilation and configuration • Application portability across other accelerators (LLVM IR as abstraction layer) • Performance scaling by exploiting just-in-time ( JIT) compilation 11

Coarse-grained Overlays + OpenCL • Recent work focused on exposing overlay as an OpenCL device • Provides a more GPU like experience by exploiting fast compilation and configuration • Does not exploit kernel replication feature of OpenCL like the one used by GPUs Intermediate Fabric Overlay TILT Overlay (University of Florida) [7] (University of Toronto) [8] 1. Coole, James, and Greg Stitt. "Fast, flexible high-level synthesis from OpenCL using reconfiguration contexts." IEEE Micro, 2014 12 2. Rashid, Rafat, J. Gregory Steffan, and Vaughn Betz. "Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS." FPT 2014.

Performance Scaling on GPUs • Can automatically (at runtime) scale performance if additional hardware resource is avaliable • Idea: • Compiling application kernel at runtime from kernel time source (JIT) time • unroll/replicate the kernel based on the availability of hardware resources • Direct runtime compilation for FPGAs is infeasible • Overlays can allow runtime compilation and performance scaling! 1. S. Gao and J. Chritz . " Characterization of OpenCL on a Scalable FPGA Architecture " ReConFig 2014 13

Coarse-grained Overlays + OpenCL Dual-DSP FU aware DFG Resource-aware transformation Replication 8 Kernel Instances Dual-DISO Architecture Dual-DISO Architecture DISO Architecture 14

Resource-aware Performance Scaling • Proposed approach can help in performance scaling on providing more hardware resources • Instead of mapping single copy of kernel on 8x8 array, compiler can replicate the kernel 16 times 15

Coarse-grained Overlays + OpenCL (POCL) 16

OpenCL Kernel Execution Profile • Application: Apply negate kernel (total inversion, 255 – pixel value ) on 400x225 grayscale image • X-axis shows time in milliseconds Intel Xeon CPU E5-1650 NVIDIA Quadro 2000 ARMv7 Cortex-A9 CPU FPGA Overlay 1 10 100 1000 10000 100000 1000000 10000000 Kernel compile time Write Execute Read 17

Summary • Proposed a resource-aware Just-in-Time OpenCL compiler for FPGA overlays • Embedded ARM processor on Zynq device can compile kernels within a second • Future Work: Device dependent. ie., device drivers • Integration within POCL framework on Zynq-v2 • Execution of OpenCL benchmarks ✓ • Comparison with GPU Device dependent. • Continue the DSP-based overlay research ie., LLVM backend • Integration of multiple DeCO in the accelerator framework • Efficient time multiplexed overlays • Compilation of TensorFlow Graphs onto Overlays 18

Thank you 19

Results 20

Summary • Similar work: Supporting ρ -VEX vector processor as an OpenCL device using POCL 21

Summary • Similar work: Supporting ρ -VEX vector processor as an OpenCL device using POCL • Bad performance for the evaluated benchmark • Edge-detection algorithm applied on 640x480 image using 3x3 mask size 22

Additional Slides for tool-flow 23

Additional Slides for tool-flow 24

Compilation 25

Compilation 26

Compilation 27

Compilation 28

Resource-aware Performance Scaling 29

Kernels 30

Runtime Management of Accelerators using OpenCL Data Trigger for configure Data Read Write processing 31

OpenCL as a programming model 32

Overlays Time- Spatially- multiplexed configured Nearest-neighbor Nearest-neighbor style – SCGRA, style – QUKU, FPCA, CARBON Mesh-of-FU Island-style – Customized – TILT, Intermediate Fabrics, Remorph Reconfiguration contexts, DySER 35

Concept of Time-multiplexing 36

Time-multiplexed Overlays: VectorBlox MXP 37

Time-multiplexed Overlays: SCGRA 38

Spatially-Configured Overlays 39

Success Story 40

FPGAs in Cloud • 1/3 rd of the cloud service provider nodes to use FPGAs by 2020 • Microsoft: Doubling the throughput of Bing search engine using FPGAs • Microsoft Azure Cloud services and Amazon EC2 (FPGA-backed F1 instances) 41 • Nicole Hemsoth, " The FPGA Accelerated Cloud Push Just Got Stronger," 30 November 2016, THENEXTPLATFORM

for Coarse-Grained FPGA Overlays Abhishek Kumar Jain, Douglas L. - PowerPoint PPT Presentation

Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays Abhishek Kumar Jain, Douglas L. Maskell Suhaib A. Fahmy School of Computer Science and Engineering School of Engineering Nanyang Technological University (NTU),

Fine Grained Access Control Fine-Grained Access Control Fine Grained Access Control

MOLECULAR DYNAMICS STUDY OF LIPOSOMES WITH A NEW COARSE-GRAINED MOLECULAR MODEL Wataru SHINODA

Application of the Lattice Boltzmann method with moving boundaries in a coarse-grained suspension

Junfeng Fan ESAT/COSIC ECC implementation methods Multi-core systems Coarse-Grained

Fine-Grained Access Control Fine Grained Access Control Fine-grained access control examples:

Coarse-graining Markov state models with PCCA Coarse-graining Markov state models

Open Source FPGA Toolchain FPGA LSE Summer Week 2015 iCE40 Flow Conclusion Vincent Gatine

Tips about an FPGA 02/09/2018 J.C. special topic FPGA ( field-programmable gate array ) FPGA :

FPGA What is a FPGA? How FPGAs work How do they work? Manufacturers

WWW.FPGA What is an FPGA? Field Programmable Gate Array Introduction to FPGA designs

Mechanized Verification of Fine-grained Concurrent Programs Ilya Sergey Aleks Nanevski

Coarse Woody Debris as Measurable Management Targets A.J. Kroll Weyerhaeuser COARSE WOODY

New design method for C30 recycled concr ete using mixed source concrete coarse agg regates

COARSE-TO-FINE, COST-SENSITIVE CLASSIFICATION OF E-MAIL Jay Pujara jay@cs.umd.edu Lise Getoor

Some categorical aspects of coarse spaces and balleans Nicol` o Zava joint work with Dikran

Robust Regression with Coarse Data Marco Cattaneo and Andrea Wiencierz Department of Statistics,

Three subject asymmetries in Limbum Johannes Hein johannes.hein@uni-potsdam.de ACAL 50

Databases Relational algebra Lectures for mathematics students Zbigniew Jurkiewicz, Institute of

Syntax and Morphology; a Single Computational Engine Hilda Koopman koopman@ucla.edu University

Rahul Simha department of computer science George Washington University Office hours: 1pm

EVC Computer Vision U it 3 I Unit 3: Image Acquisition A i iti http://

5. Moeda e Poltica Monetria 5. Moeda e Poltica Monetria 5.1. Introduo 5.2. Oferta de

Faculty of Health Disciplines Professional Development Workshop Stephen Downes For Athabasca

FEUDALISMO EUROPEU SC. V - XV http://historiaonline.com.br 1. CONTEXTO: Queda do Imprio