Architecture exploration through FPGA acceleration




  1. Architecture exploration through FPGA acceleration: Rapid System Level Design and Evaluation of Near Memory Fixed Function Units. 11/13/2020. Maya Gokhale, DMTS. LLNL-PRES-816381. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

  2. Outline
  ▪ Trends in reconfigurable computing
    — Architectures
    — Tools
    — Applications
  ▪ Targeting fast architecture design space exploration
    — MPSoC to accelerate design and evaluation of heterogeneous function units
    — Mixed hardware/software approaches for scaling studies for complex design space scenarios
  ▪ The perennial tools problem
    — Need for a unified hardware/software development environment
    — Open source

  3. FPGA architecture has evolved as dramatically as the CPU
  ▪ Xilinx 3000 series
    — Configurable Logic Blocks: "sea of gates"
    — I/O Blocks: high speed programmable input-output
    — Interconnect combining mesh and long lines
  ▪ Xilinx Versal
    — Specialized DSP processors
    — "Fabric" for data acquisition/pre-processing
    — Control processor
  https://www.xilinx.com/support/documentation/data_sheets/3000.pdf
  https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf

  4. Progression of FPGA architecture evolution
  ▪ Embedded, distributed memories to store local state
  ▪ DSP blocks for fast fixed point arithmetic
  ▪ I/O architecture optimization for fast data ingest and generation
  ▪ Clock management for multiple clock domains
  ▪ Host CPU integration for HPC: ACP, CXL, CAPI
  ▪ Specializations for application domains: video codec, 100 Gb EMAC, PCIe Gen 4

  5. FPGA tools have evolved from microprogramming to (highly annotated) C++

    // Ethernet FIFO interface
    // Receives 128-bit wide data in
    // Transmits a packet via PS Ethernet FIFO
    // This version supports flushing out buffered data
    void eth_fifo_interface( …
        u1t dma_tx_end_tog,
        u1t tx_r_fixed_lat,
        u1t tx_r_rd,
        …) {
    #pragma HLS PIPELINE II=1 enable_flush
    #pragma HLS INTERFACE ap_ctrl_none port=return
    #pragma HLS INTERFACE ap_none port=dma_tx_end_tog
    #pragma HLS INTERFACE ap_none port=tx_r_fixed_lat
    #pragma HLS INTERFACE ap_none port=tx_r_rd
    #pragma HLS INTERFACE ap_none port=tx_r_status

      // various state variables and useful constants
      static enum state {IDLE, MAC_DST, MAC_SRC, TYPE, PAYLOAD, ZEROS, ID}
          current_state = IDLE;
      const u8t src_mac[6] = {0x00, 0x0A, 0x35, 0x03, 0x59, 0xF5};
    #pragma HLS ARRAY_PARTITION variable=src_mac complete dim=1
      static u8st data_buffer;
    #pragma HLS STREAM variable=data_buffer depth=16384
      …

  6. Reconfigurable computing applications are diverse
  ▪ Signal and image processing
    — Satellite, space applications
    — Instrument sensor data streams
  ▪ Network packet processing
    — Routing
    — In-stream processing
    — Regular expression matching
  ▪ Finance
    — Integrated with network packet processing
    — High frequency trading
    — Risk analysis
  ▪ Data center
    — Microsoft investment in FPGAs to accelerate search, ML, etc.: the FPGA sits between the datacenter's top-of-rack (ToR) network switches and the server's network interface chip (NIC). As a result, all network traffic is routed through the FPGA, which can perform line-rate computation on even high-bandwidth network flows.
    — Amazon F1 for individual, corporate, or FPGA as a service
  ▪ Logic emulation
    — Use the sea of gates to emulate IP blocks, function units, full ASICs
  Images: CHIME Radio Telescope with F-Engine containers; Mars Perseverance Rover

  7. FPGAs can accelerate architecture exploration by orders of magnitude over software
  ▪ M. Butts, J. Batcheller and J. Varghese, "An efficient logic emulation system," Proceedings 1992 IEEE International Conference on Computer Design: VLSI in Computers & Processors, Cambridge, MA, 1992, pp. 138-141.
    — Realizer System: array of FPGAs for emulating large digital logic designs
  ▪ Q. Wang et al., "An FPGA Based Hybrid Processor Emulation Platform," 2010 International Conference on Field Programmable Logic and Applications (https://ieeexplore.ieee.org/document/5694215)
    — Emulates a Xeon processor on an FPGA in a processor socket
  ▪ FireSim for many-core RISC-V simulation, https://rise.cs.berkeley.edu/projects/firesim/
    — Amazon F1 cloud
    — Custom accelerators for RISC-V
  ▪ ESP for heterogeneous SoC design, https://www.esp.cs.columbia.edu
    — Tile-based architecture built on a multi-plane network-on-chip
    — Prototype on FPGA
  ▪ Logic in Memory Emulator (LiME) follows a hybrid approach: keep the native hard IP cores/cache hierarchy for the CPU complex and use the programmable logic to emulate widely varying memory latencies and near memory accelerators

  8. Shift to heterogeneous computing has generated innovation in purpose-built hardware blocks from exascale to IoT
  Heterogeneous computing has been dominated by GPUs, but contenders abound: for example, specialized tensor processing cores with embedded SRAM, HBM, and fast networks. The focus is on compute units.
  Images: Intel Configurable Spatial Accelerator (CGRA), https://en.wikichip.org/wiki/intel/configurable_spatial_accelerator; Habana Gaudi AI training chip; LLNL NS61e TrueNorth boards with 16 TN chips

  9. New memory technologies and packaging are needed to deliver data to the compute units
  ▪ Advances in memory technology and packaging
    — High bandwidth memories: HBM, HMC
    — Non-volatile memory: 3D XPoint
    — Focuses attention on computer memory system design and evaluation
    — Potential for logic and compute functions co-located with the memory
  Images: HMC (Singh et al., https://arxiv.org/pdf/1908.02640.pdf); HBM (Hongshin Jun et al., IMW 2017, Creative Commons Attribution); 3D XPoint (Micron Technology)

  10. Memory landscape diversity presents challenges
  ▪ Emerging memories exhibit a wide range of bandwidths, latencies, and capacities
    — Challenge for computer architects to navigate the design space
  ▪ Near-random and sparse access patterns make performance prediction difficult
    — Challenge for application developers to assess performance implications
  ▪ Opportunities for near memory acceleration emerge
    — Large design space must be investigated

  Memory/Storage Hierarchy (capacity, latency):
    SRAM        MBs        10 ns
    Near DRAM   few GB     45 ns
    DDR DRAM    many GB    45-70 ns
    Far DRAM    TB         100-200 ns
    NVM         TBs        8000 ns
    SSD         10s of TB  50 us
    HDD         many TB    10 ms

  11. Quantifying impact of memory interactions requires a global view
  ▪ Need for system level exploration of the design space
    — Combinations of memory technology
    — Various memory hierarchies
    — Prototype architectural ideas in detail
    — Potential benefit of near-memory accelerators
  ▪ Need to quantitatively evaluate the performance impact on applications, beyond an isolated function
    — Latency impact
    — Scratchpad vs. cache
    — Cache size to working data set size
    — Byte addressable vs. block addressable
    — Accelerator communication overhead
    — Cache management overhead
    — Operating system overhead

  12. MPSoC can be an effective tool to accelerate memory system investigations
  ▪ Fidus Sidewinder and ZCU102 development boards with Xilinx Zynq UltraScale+ MPSoC device
  ▪ Desktop, dedicated evaluation environment
  Reference: A. K. Jain, S. Lloyd and M. Gokhale, "Microscope on Memory: MPSoC-Enabled Computer Memory System Assessments," 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Boulder, CO, 2018, pp. 173-180, doi: 10.1109/FCCM.2018.00035.

  13. LiME (Logic in Memory Emulator) approach
  ▪ Use the embedded CPU and cache hierarchy in the Zynq UltraScale+ MPSoC to save FPGA logic and development time
  ▪ Loopback path to route CPU memory traffic through hardware IP blocks
  ▪ Emulate the latencies of a wide range of memories by using programmable delay units in the loopback path (see the sketch after this list)
  ▪ Capture time-stamped memory transactions using the trace subsystem
  ▪ Emulate an accelerator, including CPU/accelerator interactions
  ▪ Open source: https://github.com/LLNL/lime and lime-apps
  Block diagram: the Programmable Logic (PL) holds a trace subsystem (trace DRAM, AXI Performance Monitor (APM), trace capture device) and a memory subsystem (program DRAM, accelerator, monitor, delay units, AXI peripheral interconnect); the host subsystem in the Processing System (PS) comprises four ARM cores with L1/L2 caches, the coherent interconnect, main switch, DDR memory controller, and the HP0-3/HPM0-1 ports to the PL.
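  As a rough illustration of how the delay units could be driven from software, the sketch below programs hypothetical read- and write-delay registers of one delay block from Linux running on the PS. The base address, register offsets, and cycle values are placeholders chosen for the example, not the actual LiME register map; only the mmap-of-/dev/mem pattern is standard.

    // Minimal sketch, assuming the delay unit is an AXI-lite peripheral with
    // one 32-bit read-delay register and one 32-bit write-delay register.
    #include <cstdint>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    constexpr off_t  DELAY_BASE   = 0xA0000000;  // hypothetical PL peripheral address
    constexpr size_t MAP_SIZE     = 0x1000;
    constexpr size_t RD_DELAY_REG = 0x0;         // hypothetical offset: read latency (cycles)
    constexpr size_t WR_DELAY_REG = 0x4;         // hypothetical offset: write latency (cycles)

    int main()
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        void *base = mmap(nullptr, MAP_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, DELAY_BASE);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        auto regs = static_cast<volatile uint32_t *>(base);
        // e.g. emulate a slower "far" memory: ~400 ns of extra latency
        // at a 300 MHz PL clock is 120 cycles
        regs[RD_DELAY_REG / 4] = 120;   // read delay in cycles
        regs[WR_DELAY_REG / 4] = 120;   // write delay in cycles

        munmap(base, MAP_SIZE);
        close(fd);
        return 0;
    }

  Separate read and write registers match the slide's note that the delay units have independently programmable read and write latencies.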

  14. Emulation method: delay and loopback
  ▪ Address ranges R1 and R2 are intended to have different access latencies (e.g. SRAM, DRAM)
  ▪ Shims shift and separate the R1 and R2 address ranges for easier routing (a remap sketch follows below)
  ▪ A standard AXI interconnect (SmartConnect) routes requests through the different delay units
  ▪ Delay units have separate programmable delays for read and write access
  Block diagram: CPU requests leave the PS on M_AXI_HPM0/HPM1 at 0x04_0000_0000 (R1: 1M range, R2: 4G range; 40-bit addresses, 128-bit data). The R1 shim (map width 20 bits, map in 0x04000, map out 0x08000) moves R1 to 0x08_0000_0000; the R2 shim (map width 8 bits, map in 0x04, map out 0x18) moves R2 to 0x18_0010_0000. AXI SmartConnect routes each range to its own delay unit, and the delayed requests re-enter the PS on S_AXI_HP0/HP1 (36-bit addresses) at 0x08_0000_0000 (R1) and 0x08_0010_0000 (R2), reaching the DDR memory controller through the main switch.
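  A minimal sketch of the shim's remap rule as described on the slide: compare the top map-width bits of the 40-bit address against a map-in value and substitute the map-out value, leaving the low-order offset intact. The function and parameter names are illustrative; in LiME the shim is a hardware block in the PL, not host software.

    #include <cstdint>

    constexpr unsigned ADDR_WIDTH = 40;   // AXI address width on the slide

    // If the top `map_width` bits of `addr` equal `map_in`, replace them with
    // `map_out`; otherwise pass the address through unchanged.
    uint64_t shim_remap(uint64_t addr, unsigned map_width,
                        uint64_t map_in, uint64_t map_out)
    {
        const unsigned shift    = ADDR_WIDTH - map_width;  // bits left untouched
        const uint64_t low_mask = (1ULL << shift) - 1;     // offset within the range
        if ((addr >> shift) == map_in)
            return (map_out << shift) | (addr & low_mask);
        return addr;
    }

    // Values from the slide:
    //   R1: map_width=20, map_in=0x04000, map_out=0x08000
    //       0x04'0000'0000 -> 0x08'0000'0000
    //   R2: map_width=8,  map_in=0x04,    map_out=0x18
    //       0x04'0010'0000 -> 0x18'0010'0000

  Moving R1 and R2 into disjoint upper-address windows is what lets the downstream AXI SmartConnect route each range to its own delay unit with ordinary address decoding.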
