2GRVI Phalanx: A Kilocore RISC-V RV64I Processor Cluster Array with HBM2 in a Xilinx VU37P FPGA
Work in Progress Report
Jan Gray | Gray Research LLC | Bellevue, WA | http://fpga.org
Software-First FPGA Accelerator Design
• Make it easier for programmers to exploit spatial fabrics
• Manycore accelerator overlays
• Run C++ or OpenCL kernels on 100s of soft processors
• Add custom functions/accelerators/memories to suit
• More 5 second recompiles, fewer 5 hour place and routes
• Software + overlays = familiar programming experiences, easier ports, rapid iteration, design agility
GRVI Phalanx Accelerator Framework
• A processor cluster array overlay
• GRVI/2GRVI: RISC-V processing elements
• Phalanx: fabric of clusters of PEs, memories, accelerators, bridges, IOs
• Hoplite: 2D torus network on chip
GRVI Processing Element
• Simpler PEs → more PEs → greater memory parallelism
• GRVI: austere RISC-V RV32I + mul/lr/sc, ~320 LUTs @ 400 MHz
GRVI Cluster: PEs, Shared Memory, Accelerators
[Cluster block diagram: 8 PEs, pairs of PEs sharing 4-8 KB IMEMs (2:1), 128 KB cluster data memory (CMEM), accelerator(s), 32b/64b crossbar (XBAR); ~3500 LUTs per cluster]
Cluster Composition: Message Passing on a NoC
• Hoplite: FPGA-optimal 2D torus NoC router
• Single flits, unidirectional rings, deflection routing, multicast; configurable
• 300b-wide router uses only ~330 LUTs
• 256b @ 400 MHz = 100 Gb/s links
[Diagram: 4×4 torus of Hoplite routers, each with XI/YI ring inputs and a client cluster (C)]
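Below is a minimal, simplified C model of one Hoplite routing decision, assuming dimension-ordered routing with deflection as named above: Y-ring traffic keeps the Y output, X-ring traffic turns onto the Y ring at its destination column or else deflects onward around the X ring, and the client injects only into a free output. All type and function names are invented for this sketch; it is not the router RTL.

    #include <stdbool.h>

    typedef struct { bool valid; int dst_x, dst_y; /* payload omitted */ } flit_t;

    /* One cycle at the router in column x: inputs from the X ring, the Y ring,
       and the local client; outputs to the X ring and the Y ring (the Y output
       also delivers flits to the local client at their destination node). */
    void hoplite_route(int x, flit_t xi, flit_t yi, flit_t client,
                       flit_t *x_out, flit_t *y_out, bool *client_accepted) {
        flit_t none = { false, 0, 0 };
        *x_out = none; *y_out = none; *client_accepted = false;

        /* Y-ring traffic has priority for the Y output. */
        if (yi.valid)
            *y_out = yi;

        /* X-ring traffic turns south at its destination column if the Y output
           is free; otherwise it is deflected onward around the X ring. */
        if (xi.valid) {
            if (xi.dst_x == x && !y_out->valid)
                *y_out = xi;
            else
                *x_out = xi;
        }

        /* The local client has lowest priority and injects onto the X ring
           only when that output is free; otherwise it retries next cycle. */
        if (client.valid && !x_out->valid) {
            *x_out = client;
            *client_accepted = true;
        }
    }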
GRVI Cluster: PEs, Memory, Router, Message Passing
• PGAS addresses: { mx:1; my:1; x:4; y:6; addr:20 } or { dram_addr:40 }
[Cluster diagram: 8 PEs sharing IMEMs (2:1), 128 KB cluster data memory (CMEM), accelerator(s), crossbar, NoC/accelerator interface, 310b/256b Hoplite router]
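The PGAS format above packs multicast flags, NoC X/Y coordinates, and a cluster-local offset into one pointer. The sketch below shows one plausible packing in C; the field order and bit positions are assumptions (the slide gives only the field widths), and pgas_addr is a hypothetical helper, not the actual runtime API.

    #include <stdint.h>

    /* { mx:1; my:1; x:4; y:6; addr:20 } packed into a 32-bit PGAS address. */
    static inline uint32_t pgas_addr(uint32_t mx, uint32_t my,
                                     uint32_t x, uint32_t y, uint32_t addr) {
        return ((mx   & 0x1)     << 31)   /* multicast across the X ring     */
             | ((my   & 0x1)     << 30)   /* multicast across the Y ring     */
             | ((x    & 0xF)     << 26)   /* destination cluster X (4 bits)  */
             | ((y    & 0x3F)    << 20)   /* destination cluster Y (6 bits)  */
             |  (addr & 0xFFFFF);         /* 20-bit offset in the cluster's local address space */
    }

    /* HBM accesses instead use the flat { dram_addr:40 } form. */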
10 × 5 Clusters × 8 PEs = 400 PEs (KU040, 12/2015)
30 × 7 Clusters × 8 PEs = 1680 PEs, 26 MB SRAM (VU9P, 12/2016)
• 400,000 MIPS @ 250 MHz @ 40 W
GRVI Phalanx V1 Shortcomings
• 32b pointers: awkward for big data on AWS F1, OpenCL
• 32b accesses: waste half of the 64b UltraRAM bandwidth
• In-order μarch: stall on loads = ~5 cycles
• DDR4 bandwidth << GPU GDDRx/HBM2 bandwidth
UltraScale+ HBM2 FPGAs!
• VU37P w/ two 4 GB HBM2 stacks
• 32 AXI-HBM bridge/controllers
• 32 × 256b × 450 MHz = 460 GB/s
V2 Redesign for HBM FPGAs
• Latency tolerant 2GRVI RV64I PEs
• 64b cluster datapaths
• 32B/cycle deep pipeline NoC-AXI RDMA bridges
• Double NoC column rings
2GRVI: A Simple, Latency Tolerant RV64I PE
• Register file scoreboard: only stall issue on use of a busy register
• Concurrent execution and out-of-order retirement
• Example: unrolled block copy - no issue stalls even with 7 cycle memory
• 400 6-LUTs (sans <<)!
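A short bare-metal C sketch of that unrolled block copy idea: because each load writes a different destination register, the scoreboard never stalls issue, and by the time a value is stored the load has long since completed. The function name and unroll factor are illustrative assumptions.

    #include <stdint.h>
    #include <stddef.h>

    void block_copy64(uint64_t *dst, const uint64_t *src, size_t n_words) {
        size_t i = 0;
        for (; i + 4 <= n_words; i += 4) {
            /* four independent loads issue back-to-back; none uses a busy register */
            uint64_t a = src[i + 0];
            uint64_t b = src[i + 1];
            uint64_t c = src[i + 2];
            uint64_t d = src[i + 3];
            /* by the time the stores consume a..d, the ~7-cycle loads have retired */
            dst[i + 0] = a;
            dst[i + 1] = b;
            dst[i + 2] = c;
            dst[i + 3] = d;
        }
        for (; i < n_words; ++i)    /* copy the remaining tail words */
            dst[i] = src[i];
    }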
GRVI vs. 2GRVI
                        32b GRVI PE            64b 2GRVI PE
Year                    2015 Q4                2019 Q2
ISA                     RV32I + mul/lr/sc      RV64I + lr/sc
Area                    320 6-LUTs             400 6-LUTs (sans shared <<)
Fmax / congested        400 / 300 MHz          500+ / TBD MHz
Pipeline stages         2 / 3                  2 / 3 / 4 (superpipelined)
Out-of-order retire     no                     yes
Cluster, load interval  5 cycles               1 / cycle
Cluster, load-to-use    5 cycles               3-6 cycles
Cluster, Σ RAM BW       4.8 GB/s (300 MHz)     12.8 GB/s (400 MHz)
Phalanx SoC: 15×15-3 Array of Clusters + HBM + PCIe
• Data path: PE ↔ Cluster RAM ↔ NoC ↔ AXI ↔ HBM
• 32 B write request message; a 32 × n B burst-read request → n × 32 B read responses
• PE sends an R/W request message to its NoC-AXI bridge; the bridge issues the request to its AXI-HBM channel(s); the bridge sends read response messages to the destination address
• 32 B write + 32 B read response per cycle per bridge
• Measured ~130 GB/s write + ~130 GB/s read at 300 MHz
• C: cluster { 8 GRVI / 6 2GRVI, 4-8 KB IRAM, 128 KB CRAM, Hoplite router }
• A: NoC-AXI RDMA bridge { 2 256b AXI R/W request queues, 2 response queues }
• H: two AXI-switch-MC-HBM2 bridges, each 256b R/W at up to 450 MHz
• Unidirectional Hoplite NoC X-ring rows and Y-ring columns; PCIe DMA
[Floorplan diagram: 15×15 grid of clusters (C), a row of NoC-AXI RDMA bridges (A), HBM channel ports (H), and two 4 GB HBM2 DRAM stacks]
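A PE-side sketch of the request messages described above may make the flow concrete: a write request carries one 32 B beat, and a burst-read request names an HBM address, a burst length, and the PGAS destination for the n × 32 B read responses. The struct layouts, field names, and the noc_send primitive are illustrative assumptions, not the actual bridge message format.

    #include <stdint.h>

    typedef struct {
        uint64_t hbm_addr;     /* target HBM (AXI) address                      */
        uint8_t  data[32];     /* one 32 B write beat                           */
    } rdma_write_req_t;

    typedef struct {
        uint64_t hbm_addr;     /* target HBM (AXI) address                      */
        uint32_t dest_pgas;    /* PGAS address to receive the read responses    */
        uint32_t n_beats;      /* burst length: n x 32 B read response messages */
    } rdma_read_req_t;

    /* hypothetical runtime primitive: inject a message onto the Hoplite NoC
       toward the cluster's NoC-AXI RDMA bridge */
    extern void noc_send(uint32_t dest_pgas, const void *msg, uint32_t bytes);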
NoC-AXI-HBM Transactions in Flight
222 × 8 GRVI PEs = 1776 RV32I PEs
222 × 6 2GRVI PEs = 1332 RV64I PEs
Phalanx-HBM2 Next Steps
• Tune up to 400+ MHz = ~200 GB/s writes + ~200 GB/s reads
• Computational HBM2: compute at the bridges
• Scatter/gather, add-to-memory, block zero, copy, hash, reduce, select, regexp, sort, …
Phalanx Parallel Programming Models
• Architecture: array of clusters of PEs, no caches, message passing
• Today: bare metal C/C++ + message passing runtime
• Future:
  • Flat data parallel NDRange OpenCL kernels
  • Streaming kernels composed with OpenCL pipes
  • 'Gatling gun' parallel packet processing
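For a feel of the bare-metal model, here is a hedged sketch of message passing between PEs. The runtime call names (phx_send, phx_recv) and the 32 B message type are hypothetical; the slide states only that a message-passing runtime exists, not its API.

    #include <stdint.h>

    typedef struct { uint64_t words[4]; } msg32_t;   /* one 32 B NoC message */

    /* hypothetical runtime primitives */
    extern void phx_send(uint32_t dest_pgas, const msg32_t *m);  /* inject a message    */
    extern void phx_recv(msg32_t *m);                            /* block for a message */

    /* Worker PE: receive a work descriptor, compute, send the result back. */
    void worker_loop(uint32_t master_pgas) {
        for (;;) {
            msg32_t req, resp;
            phx_recv(&req);
            resp.words[0] = req.words[0] + req.words[1];   /* toy computation */
            phx_send(master_pgas, &resp);
        }
    }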
An OpenCL-like Model and Tools
• Familiar to GPU developers(?)
• Host side: Xilinx SDAccel OpenCL runtime
  • Setup, copy buffers, queue parallel kernel calls, wait, copy results
• FPGA side: GRVI Phalanx SDAccel-for-RTL shell
  • Map work groups to PE clusters, work items to PEs
  • Memory: global = HBM; local = cluster RAM (static); private = thread (auto)
  • Scheduler (PEs at cluster 0): distribute kernels, map work groups to idle clusters
• Plan. Not yet implemented.
OpenCL “Like”

kernel void vector_add(global int* g_a, global int* g_b, global int* g_sum, const unsigned n) {
    local align int a[N], b[N], sum[N];
    int iloc = get_local_id(0) * n;
    int iglb = (get_group_id(0) * get_local_size(0) + get_local_id(0)) * n;
    int size = n * sizeof(int);
    copy(a + iloc, g_a + iglb, size);          // copy this work item's inputs from HBM to cluster RAM
    copy(b + iloc, g_b + iglb, size);
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < n; ++i)                // each work item sums its n-element slice
        sum[iloc + i] = a[iloc + i] + b[iloc + i];
    barrier(CLK_LOCAL_MEM_FENCE);
    copy(g_sum + iglb, sum + iloc, size);      // copy results back to HBM
}
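For completeness, a minimal host-side sketch of the planned SDAccel flow (setup, copy buffers, queue kernel calls, wait, copy results) using standard OpenCL C API calls. It assumes the context, command queue, and vector_add kernel have already been created by the usual SDAccel boilerplate; error handling is omitted and the sizes are illustrative.

    #include <CL/cl.h>

    void run_vector_add(cl_context ctx, cl_command_queue queue, cl_kernel vector_add,
                        const int *a, const int *b, int *sum,
                        size_t total_ints, size_t work_group_size, unsigned n_per_item) {
        size_t bytes = total_ints * sizeof(int);
        cl_int err;

        /* global buffers are backed by HBM on the device */
        cl_mem g_a   = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
        cl_mem g_b   = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
        cl_mem g_sum = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

        /* copy inputs host -> HBM */
        clEnqueueWriteBuffer(queue, g_a, CL_TRUE, 0, bytes, a, 0, NULL, NULL);
        clEnqueueWriteBuffer(queue, g_b, CL_TRUE, 0, bytes, b, 0, NULL, NULL);

        /* bind arguments to match the vector_add kernel above */
        clSetKernelArg(vector_add, 0, sizeof(cl_mem), &g_a);
        clSetKernelArg(vector_add, 1, sizeof(cl_mem), &g_b);
        clSetKernelArg(vector_add, 2, sizeof(cl_mem), &g_sum);
        clSetKernelArg(vector_add, 3, sizeof(unsigned), &n_per_item);

        /* queue the NDRange: work groups map to clusters, work items to PEs */
        size_t global = total_ints / n_per_item;
        size_t local  = work_group_size;
        clEnqueueNDRangeKernel(queue, vector_add, 1, NULL, &global, &local, 0, NULL, NULL);
        clFinish(queue);

        /* copy results HBM -> host */
        clEnqueueReadBuffer(queue, g_sum, CL_TRUE, 0, bytes, sum, 0, NULL, NULL);

        clReleaseMemObject(g_a);
        clReleaseMemObject(g_b);
        clReleaseMemObject(g_sum);
    }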
Take Aways
• (Prior work)
  • Software-first, software-mostly manycore accelerators
  • Die filling, FPGA frugal, clustered, tiled, NoC-interconnected overlays
• Democratizing HBM
  • Xilinx AXI-HBM bridges are easy to use, simplify interconnects, save 100Ks of LUTs
  • HBM bandwidth is now accessible to all
• Towards an OpenCL-like SDK, on AWS F1, Azure NP10, Alveo