2GRVI Phalanx: A Kilocore RISC-V RV64I Processor Cluster Array with HBM2 in a Xilinx VU37P FPGA
Work in Progress Report
Jan Gray | Gray Research LLC | Bellevue, WA | http://fpga.org
Software-First FPGA Accelerator Design
• Make it easier for programmers to exploit spatial fabrics
• Manycore accelerator overlays
• Run C++ or OpenCL kernels on 100s of soft processors
• Add custom functions/accelerators/memories to suit
• More 5 second recompiles, fewer 5 hour place and routes
• Software + overlays = familiar programming experiences, easier ports, rapid iteration, design agility
GRVI Phalanx Accelerator Framework
• A processor cluster array overlay
• GRVI/2GRVI: RISC-V processing elements
• Phalanx: fabric of clusters of PEs, memories, accelerators, bridges, IOs
• Hoplite: 2D torus network on chip
GRVI Processing Element
• Simpler PEs → more PEs → greater memory parallelism
• GRVI: austere RISC-V RV32I + mul/lr/sc, ~320 LUTs @ 400 MHz
GRVI Cluster: PEs, Shared Memory, Accelerators
[Cluster block diagram: 8 PEs, pairs of PEs sharing 4-8 KB IMEMs (2:1), 128 KB cluster data memory (CMEM), accelerator(s), 32b/64b crossbar (XBAR); ~3500 LUTs per cluster]
Cluster Composition: Message Passing on a NoC
• Hoplite: FPGA-optimal 2D torus NoC router
• Single flits, unidirectional rings, deflection routing, multicast; configurable
• 300b-wide router uses only ~330 LUTs
• 256b @ 400 MHz = 100 Gb/s links
[Diagram: 4×4 torus of Hoplite routers, each with XI/YI ring inputs and a client cluster (C)]
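Below is a minimal, simplified C model of one Hoplite routing decision, assuming dimension-ordered routing with deflection as named above: Y-ring traffic keeps the Y output, X-ring traffic turns onto the Y ring at its destination column or else deflects onward around the X ring, and the client injects only into a free output. All type and function names are invented for this sketch; it is not the router RTL.

    #include <stdbool.h>

    typedef struct { bool valid; int dst_x, dst_y; /* payload omitted */ } flit_t;

    /* One cycle at the router in column x: inputs from the X ring, the Y ring,
       and the local client; outputs to the X ring and the Y ring (the Y output
       also delivers flits to the local client at their destination node). */
    void hoplite_route(int x, flit_t xi, flit_t yi, flit_t client,
                       flit_t *x_out, flit_t *y_out, bool *client_accepted) {
        flit_t none = { false, 0, 0 };
        *x_out = none; *y_out = none; *client_accepted = false;

        /* Y-ring traffic has priority for the Y output. */
        if (yi.valid)
            *y_out = yi;

        /* X-ring traffic turns south at its destination column if the Y output
           is free; otherwise it is deflected onward around the X ring. */
        if (xi.valid) {
            if (xi.dst_x == x && !y_out->valid)
                *y_out = xi;
            else
                *x_out = xi;
        }

        /* The local client has lowest priority and injects onto the X ring
           only when that output is free; otherwise it retries next cycle. */
        if (client.valid && !x_out->valid) {
            *x_out = client;
            *client_accepted = true;
        }
    }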
GRVI Cluster: PEs, Memory, Router, Message Passing
• PGAS addresses: { mx:1; my:1; x:4; y:6; addr:20 } or { dram_addr:40 }
[Cluster diagram: 8 PEs sharing IMEMs (2:1), 128 KB cluster data memory (CMEM), accelerator(s), crossbar, NoC/accelerator interface, 310b/256b Hoplite router]
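The PGAS format above packs multicast flags, NoC X/Y coordinates, and a cluster-local offset into one pointer. The sketch below shows one plausible packing in C; the field order and bit positions are assumptions (the slide gives only the field widths), and pgas_addr is a hypothetical helper, not the actual runtime API.

    #include <stdint.h>

    /* { mx:1; my:1; x:4; y:6; addr:20 } packed into a 32-bit PGAS address. */
    static inline uint32_t pgas_addr(uint32_t mx, uint32_t my,
                                     uint32_t x, uint32_t y, uint32_t addr) {
        return ((mx   & 0x1)     << 31)   /* multicast across the X ring     */
             | ((my   & 0x1)     << 30)   /* multicast across the Y ring     */
             | ((x    & 0xF)     << 26)   /* destination cluster X (4 bits)  */
             | ((y    & 0x3F)    << 20)   /* destination cluster Y (6 bits)  */
             |  (addr & 0xFFFFF);         /* 20-bit offset in the cluster's local address space */
    }

    /* HBM accesses instead use the flat { dram_addr:40 } form. */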
10 × 5 Clusters × 8 PEs = 400 PEs (KU040, 12/2015)
30 × 7 Clusters × 8 PEs = 1680 PEs, 26 MB SRAM (VU9P, 12/2016)
• 400,000 MIPS @ 250 MHz @ 40 W
GRVI Phalanx V1 Shortcomings
• 32b pointers: awkward for big data on AWS F1, OpenCL
• 32b accesses: waste half of the 64b UltraRAM bandwidth
• In-order μarch: stall on loads = ~5 cycles
• DDR4 bandwidth << GPU GDDRx/HBM2 bandwidth
UltraScale+ HBM2 FPGAs!
• VU37P w/ two 4 GB HBM2 stacks
• 32 AXI-HBM bridge/controllers
• 32 × 256b × 450 MHz = 460 GB/s
V2 Redesign for HBM FPGAs
• Latency tolerant 2GRVI RV64I PEs
• 64b cluster datapaths
• 32B/cycle deep pipeline NoC-AXI RDMA bridges
• Double NoC column rings
2GRVI: A Simple, Latency Tolerant RV64I PE
• Register file scoreboard: only stall issue on use of a busy register
• Concurrent execution and out-of-order retirement
• Example: unrolled block copy - no issue stalls even with 7 cycle memory
• 400 6-LUTs (sans <<)!
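A short bare-metal C sketch of that unrolled block copy idea: because each load writes a different destination register, the scoreboard never stalls issue, and by the time a value is stored the load has long since completed. The function name and unroll factor are illustrative assumptions.

    #include <stdint.h>
    #include <stddef.h>

    void block_copy64(uint64_t *dst, const uint64_t *src, size_t n_words) {
        size_t i = 0;
        for (; i + 4 <= n_words; i += 4) {
            /* four independent loads issue back-to-back; none uses a busy register */
            uint64_t a = src[i + 0];
            uint64_t b = src[i + 1];
            uint64_t c = src[i + 2];
            uint64_t d = src[i + 3];
            /* by the time the stores consume a..d, the ~7-cycle loads have retired */
            dst[i + 0] = a;
            dst[i + 1] = b;
            dst[i + 2] = c;
            dst[i + 3] = d;
        }
        for (; i < n_words; ++i)    /* copy the remaining tail words */
            dst[i] = src[i];
    }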
GRVI vs. 2GRVI
                        32b GRVI PE            64b 2GRVI PE
Year                    2015 Q4                2019 Q2
ISA                     RV32I + mul/lr/sc      RV64I + lr/sc
Area                    320 6-LUTs             400 6-LUTs (sans shared <<)
Fmax / congested        400 / 300 MHz          500+ / TBD MHz
Pipeline stages         2 / 3                  2 / 3 / 4 (superpipelined)
Out-of-order retire     no                     yes
Cluster, load interval  5 cycles               1 / cycle
Cluster, load-to-use    5 cycles               3-6 cycles
Cluster, Σ RAM BW       4.8 GB/s (300 MHz)     12.8 GB/s (400 MHz)
Phalanx SoC: 15×15-3 Array of Clusters + HBM + PCIe
• Data path: PE ↔ Cluster RAM ↔ NoC ↔ AXI ↔ HBM
• 32 B write request message; a 32 × n B burst-read request → n × 32 B read responses
• PE sends an R/W request message to its NoC-AXI bridge; the bridge issues the request to its AXI-HBM channel(s); the bridge sends read response messages to the destination address
• 32 B write + 32 B read response per cycle per bridge
• Measured ~130 GB/s write + ~130 GB/s read at 300 MHz
• C: cluster { 8 GRVI / 6 2GRVI, 4-8 KB IRAM, 128 KB CRAM, Hoplite router }
• A: NoC-AXI RDMA bridge { 2 256b AXI R/W request queues, 2 response queues }
• H: two AXI-switch-MC-HBM2 bridges, each 256b R/W at up to 450 MHz
• Unidirectional Hoplite NoC X-ring rows and Y-ring columns; PCIe DMA
[Floorplan diagram: 15×15 grid of clusters (C), a row of NoC-AXI RDMA bridges (A), HBM channel ports (H), and two 4 GB HBM2 DRAM stacks]
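A PE-side sketch of the request messages described above may make the flow concrete: a write request carries one 32 B beat, and a burst-read request names an HBM address, a burst length, and the PGAS destination for the n × 32 B read responses. The struct layouts, field names, and the noc_send primitive are illustrative assumptions, not the actual bridge message format.

    #include <stdint.h>

    typedef struct {
        uint64_t hbm_addr;     /* target HBM (AXI) address                      */
        uint8_t  data[32];     /* one 32 B write beat                           */
    } rdma_write_req_t;

    typedef struct {
        uint64_t hbm_addr;     /* target HBM (AXI) address                      */
        uint32_t dest_pgas;    /* PGAS address to receive the read responses    */
        uint32_t n_beats;      /* burst length: n x 32 B read response messages */
    } rdma_read_req_t;

    /* hypothetical runtime primitive: inject a message onto the Hoplite NoC
       toward the cluster's NoC-AXI RDMA bridge */
    extern void noc_send(uint32_t dest_pgas, const void *msg, uint32_t bytes);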
NoC-AXI-HBM Transactions in Flight
222 × 8 GRVI PEs = 1776 RV32I PEs
222 × 6 2GRVI PEs = 1332 RV64I PEs
Phalanx-HBM2 Next Steps
• Tune up to 400+ MHz = ~200 GB/s writes + ~200 GB/s reads
• Computational HBM2: compute at the bridges
• Scatter/gather, add-to-memory, block zero, copy, hash, reduce, select, regexp, sort, …
Phalanx Parallel Programming Models
• Architecture: array of clusters of PEs, no caches, message passing
• Today: bare metal C/C++ + message passing runtime
• Future:
  • Flat data parallel NDRange OpenCL kernels
  • Streaming kernels composed with OpenCL pipes
  • 'Gatling gun' parallel packet processing
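For a feel of the bare-metal model, here is a hedged sketch of message passing between PEs. The runtime call names (phx_send, phx_recv) and the 32 B message type are hypothetical; the slide states only that a message-passing runtime exists, not its API.

    #include <stdint.h>

    typedef struct { uint64_t words[4]; } msg32_t;   /* one 32 B NoC message */

    /* hypothetical runtime primitives */
    extern void phx_send(uint32_t dest_pgas, const msg32_t *m);  /* inject a message    */
    extern void phx_recv(msg32_t *m);                            /* block for a message */

    /* Worker PE: receive a work descriptor, compute, send the result back. */
    void worker_loop(uint32_t master_pgas) {
        for (;;) {
            msg32_t req, resp;
            phx_recv(&req);
            resp.words[0] = req.words[0] + req.words[1];   /* toy computation */
            phx_send(master_pgas, &resp);
        }
    }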
An OpenCL-like Model and Tools
• Familiar to GPU developers(?)
• Host side: Xilinx SDAccel OpenCL runtime
  • Setup, copy buffers, queue parallel kernel calls, wait, copy results
• FPGA side: GRVI Phalanx SDAccel-for-RTL shell
  • Map work groups to PE clusters, work items to PEs
  • Memory: global = HBM; local = cluster RAM (static); private = thread (auto)
  • Scheduler (PEs at cluster 0): distribute kernels, map work groups to idle clusters
• Plan. Not yet implemented.
OpenCL “Like”

kernel void vector_add(global int* g_a, global int* g_b, global int* g_sum, const unsigned n) {
    local align int a[N], b[N], sum[N];
    int iloc = get_local_id(0) * n;
    int iglb = (get_group_id(0) * get_local_size(0) + get_local_id(0)) * n;
    int size = n * sizeof(int);
    copy(a + iloc, g_a + iglb, size);          // copy this work item's inputs from HBM to cluster RAM
    copy(b + iloc, g_b + iglb, size);
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < n; ++i)                // each work item sums its n-element slice
        sum[iloc + i] = a[iloc + i] + b[iloc + i];
    barrier(CLK_LOCAL_MEM_FENCE);
    copy(g_sum + iglb, sum + iloc, size);      // copy results back to HBM
}
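For completeness, a minimal host-side sketch of the planned SDAccel flow (setup, copy buffers, queue kernel calls, wait, copy results) using standard OpenCL C API calls. It assumes the context, command queue, and vector_add kernel have already been created by the usual SDAccel boilerplate; error handling is omitted and the sizes are illustrative.

    #include <CL/cl.h>

    void run_vector_add(cl_context ctx, cl_command_queue queue, cl_kernel vector_add,
                        const int *a, const int *b, int *sum,
                        size_t total_ints, size_t work_group_size, unsigned n_per_item) {
        size_t bytes = total_ints * sizeof(int);
        cl_int err;

        /* global buffers are backed by HBM on the device */
        cl_mem g_a   = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
        cl_mem g_b   = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
        cl_mem g_sum = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

        /* copy inputs host -> HBM */
        clEnqueueWriteBuffer(queue, g_a, CL_TRUE, 0, bytes, a, 0, NULL, NULL);
        clEnqueueWriteBuffer(queue, g_b, CL_TRUE, 0, bytes, b, 0, NULL, NULL);

        /* bind arguments to match the vector_add kernel above */
        clSetKernelArg(vector_add, 0, sizeof(cl_mem), &g_a);
        clSetKernelArg(vector_add, 1, sizeof(cl_mem), &g_b);
        clSetKernelArg(vector_add, 2, sizeof(cl_mem), &g_sum);
        clSetKernelArg(vector_add, 3, sizeof(unsigned), &n_per_item);

        /* queue the NDRange: work groups map to clusters, work items to PEs */
        size_t global = total_ints / n_per_item;
        size_t local  = work_group_size;
        clEnqueueNDRangeKernel(queue, vector_add, 1, NULL, &global, &local, 0, NULL, NULL);
        clFinish(queue);

        /* copy results HBM -> host */
        clEnqueueReadBuffer(queue, g_sum, CL_TRUE, 0, bytes, sum, 0, NULL, NULL);

        clReleaseMemObject(g_a);
        clReleaseMemObject(g_b);
        clReleaseMemObject(g_sum);
    }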
Take Aways
• (Prior work)
  • Software-first, software-mostly manycore accelerators
  • Die filling, FPGA frugal, clustered, tiled, NoC-interconnected overlays
• Democratizing HBM
  • Xilinx AXI-HBM bridges are easy to use, simplify interconnects, save 100Ks of LUTs
  • HBM bandwidth is now accessible to all
• Towards an OpenCL-like SDK, on AWS F1, Azure NP10, Alveo