DeCO: A DSP Block Based FPGA Accelerator Overlay With Low Overhead Interconnect
Abhishek Kumar Jain, Xiangwei Li, Pranjul Singhai, Douglas L. Maskell
School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore
Suhaib A. Fahmy
School of Engineering, University of Warwick, UK
International Symposium on Field-Programmable Custom Computing Machines (FCCM)
2nd May 2016, Washington DC, USA
2 FPGAs in Heterogeneous Computing Platforms
• Xilinx: FPGAs coupled with ARM (Zynq UltraScale MPSoC)
  – 3500 DSP blocks in the largest device
  – Peak performance of 5200 Giga-Operations Per Second (GOPS)
• Intel: FPGAs coupled with Xeon
  – 1500 floating-point DSP blocks in the largest device
  – Peak performance of 1300 GFLOPS
3 Are FPGAs really ready for the mainstream?
• No, mainly due to poor design productivity
  – Accelerator design at RTL level -> requires hardware design expertise
4 Are FPGAs really ready for the mainstream?
• No, mainly due to poor design productivity
  – Accelerator design at RTL level -> requires hardware design expertise
  – Long compilation times of RTL designs
5 Coarse-grained FPGA overlays
• Array of coarse-grained tiles
• Programmable functional unit and interconnect resources
6 Coarse-grained FPGA overlays
• Array of coarse-grained tiles
• Programmable functional unit and interconnect resources
• Benefits:
  – Accelerator design at a higher level of abstraction
  – Fast compilation
  – Fast reconfiguration
  – Improved design productivity
7 Coarse-grained FPGA overlays
• Array of coarse-grained tiles
• Programmable functional unit and interconnect resources
• Benefits:
  – Accelerator design at a higher level of abstraction
  – Fast compilation
  – Fast reconfiguration
  – Improved design productivity
• The major ISSUE is the area and performance overheads
8 Coarse-grained FPGA overlays
• Two metrics
  – Interconnect area overhead in terms of LUTs/FU
  – Peak performance in terms of GOPS

Overlay                             Interconnect area overhead   Peak performance
DSP-DySER [HEART2015]               1360 LUTs/FU                 6.3 GOPS
DSP-based island-style [FCCM2015]   437 LUTs/FU                  65 GOPS

• FCCM 2015 overlay, relative to DSP-DySER:
  – 3x better (in area overhead)
  – 10x better (in peak throughput)
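The "3x" and "10x" figures follow directly from the table values. A minimal sketch of that arithmetic, using only the numbers on the slide (the variable names are illustrative):

```python
# Values copied from the comparison table on the slide.
dyser = {"luts_per_fu": 1360, "peak_gops": 6.3}          # DSP-DySER [HEART2015]
island_style = {"luts_per_fu": 437, "peak_gops": 65.0}   # DSP-based island-style [FCCM2015]

# Lower LUTs/FU is better, so the area improvement is old / new.
area_improvement = dyser["luts_per_fu"] / island_style["luts_per_fu"]

# Higher GOPS is better, so the throughput improvement is new / old.
throughput_improvement = island_style["peak_gops"] / dyser["peak_gops"]

print(f"area overhead:   {area_improvement:.1f}x better")
print(f"peak throughput: {throughput_improvement:.1f}x better")
```

Both ratios round to the figures quoted on the slide (roughly 3x and 10x).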
9 Issues
• Can we improve further?
  – On Zynq, an array of 220 DSP blocks can provide 264 GOPS
10 Issues
• Can we improve further?
  – On Zynq, an array of 220 DSP blocks can provide 264 GOPS
  – Can we reduce interconnect area overhead further to achieve a higher peak performance out of DSP blocks?
[Figure: two bar charts.
 Peak performance on Zynq (GOPS): DySER 6.3; DSP-based island-style 65.
 Interconnect area overhead (LUTs/FU): DySER 1360; DSP-based island-style 437.]
11 Issues
• Can we improve further?
  – On Zynq, an array of 220 DSP blocks can provide 264 GOPS
  – Can we reduce interconnect area overhead further to achieve a higher peak performance out of DSP blocks?
[Figure: two bar charts.
 Peak performance on Zynq (GOPS): DySER 6.3; DSP-based island-style 65; DeCO 264.
 Interconnect area overhead (LUTs/FU): DySER 1360; DSP-based island-style 437; DeCO 68.]
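The 264 GOPS figure for 220 DSP blocks works out to 1.2 GOPS per block. One way to reach that number, shown here as an assumption rather than something stated on the slide, is three operations per DSP block per cycle at 400 MHz. The sketch below only illustrates the peak-throughput formula under those assumed parameters.

```python
def peak_gops(num_dsp_blocks, ops_per_block_per_cycle, clock_mhz):
    """Peak throughput in GOPS = blocks x ops per cycle x clock (GHz)."""
    return num_dsp_blocks * ops_per_block_per_cycle * clock_mhz / 1000.0

NUM_DSP_BLOCKS = 220      # from the slide
OPS_PER_BLOCK = 3         # ASSUMPTION: not stated on the slide
CLOCK_MHZ = 400           # ASSUMPTION: not stated on the slide

zynq_peak = peak_gops(NUM_DSP_BLOCKS, OPS_PER_BLOCK, CLOCK_MHZ)
per_block = zynq_peak / NUM_DSP_BLOCKS

print(f"peak: {zynq_peak:.0f} GOPS ({per_block:.1f} GOPS per DSP block)")
```

Other combinations of operations per cycle and clock frequency with the same product would give the same peak, so the two assumed values should not be read as the authors' figures.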
12 Approach
[Figure: feed-forward data flow graph with inputs I0 to I15 and output O0, arranged in five stages.
 Stage-1: 8 SUB; Stage-2: 4 SQR; Stage-3: 4 SQRADD; Stage-4: 2 ADD; Stage-5: 1 ADD.]
• Island-style interconnect allows communication from any FU to any other FU
• Not required for feed-forward compute kernels
13 Approach
[Figure: feed-forward data flow graph with inputs I0 to I15 and output O0, arranged in five stages.
 Stage-1: 8 SUB; Stage-2: 4 SQR; Stage-3: 4 SQRADD; Stage-4: 2 ADD; Stage-5: 1 ADD.]
[Figure: proposed overlay structure. Data inputs feed Stage-1 (DFs and FUs), followed by a programmable routing network, then Stage-2 (DFs and FUs), another programmable routing network, and so on down to Stage-N, which drives the data outputs.]
• Island-style interconnect allows communication from any FU to any other FU
• Not required for feed-forward compute kernels
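The staged structure above can be illustrated by levelizing a feed-forward data flow graph: every operation is placed one stage after its latest producer, and any value that skips a stage has to be carried forward. The sketch below is a minimal illustration, not the authors' mapping tool. The node counts per stage match the figure, but the edges are an assumption (a sum of squared differences over eight input pairs), since the connectivity cannot be recovered from the slide text. All names are hypothetical.

```python
from collections import Counter

def assign_stages(preds):
    """ASAP levelization: stage(node) = 1 + max(stage of its sources).

    preds maps each operation node to the nodes it reads from.
    Nodes that never appear as keys are primary inputs (stage 0).
    """
    stage = {}

    def level(node):
        if node not in stage:
            sources = preds.get(node, [])
            stage[node] = 1 + max(map(level, sources)) if sources else 0
        return stage[node]

    for node in preds:
        level(node)
    return stage

# ASSUMED connectivity for the 16-input, 1-output graph in the figure.
preds = {}
for k in range(8):
    preds[f"sub{k}"] = [f"I{k}", f"I{k + 8}"]
for k in range(4):
    preds[f"sqr{k}"] = [f"sub{k}"]
    preds[f"sqradd{k}"] = [f"sqr{k}", f"sub{k + 4}"]
preds["add0"] = ["sqradd0", "sqradd1"]
preds["add1"] = ["sqradd2", "sqradd3"]
preds["add2"] = ["add0", "add1"]

stage = assign_stages(preds)
ops_per_stage = Counter(stage[node] for node in preds)

# An edge spanning more than one stage needs one forwarding hop per skipped stage.
forwarding_hops = sum(
    stage[dst] - stage[src] - 1 for dst, srcs in preds.items() for src in srcs
)

print(dict(sorted(ops_per_stage.items())))
print("forwarding hops:", forwarding_hops)
```

Under the assumed edges, four SUB results skip Stage-2 on their way to the SQRADD nodes. That is the kind of traffic a data forwarding path alongside the FUs would have to carry, while all other edges connect adjacent stages only.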
14 Kernel Set Characteristics
                       Before transformation
Kernel     I/O nodes   OP nodes   DFG depth
fft        6/4         10         3
kmeans     16/1        23         9
mm         16/1        15         8
spmv       16/2        14         4
mri        11/2        11         6
stencil    15/2        14         5
15 Designed Overlay
[Figure: overlay layout, with labels for the programmable routing network, the data forwarding (DF) link, and the cluster.]
16 Designed Overlay
[Figure: overlay layout with a detailed view of a DSP tile. Each functional unit is built around a DSP48E1 primitive with 16-bit A, B, C and D inputs, a pre-adder, a B register, a multiplier (MUL), X/Y/Z multiplexers and a P output, configured through the INMODE, OPMODE, ALUMODE and MUXSEL signals. Labels: programmable routing network, data forwarding (DF) link, cluster.]
17 Comparison of Overlays for the kernel set
• Prototyped DeCO and two other overlays for the kernel set
  – 5x5 DSP-based DySER overlay (Overlay-I)
  – 5x5 DSP block based island-style overlay (Overlay-II)
18 Comparison of Overlays for the kernel set
• Prototyped DeCO and two other overlays for the kernel set
  – 5x5 DSP-based DySER overlay (Overlay-I)
  – 5x5 DSP block based island-style overlay (Overlay-II)
• Significant savings in LUT requirements
  – 96% compared to Overlay-I
  – 87% compared to Overlay-II
[Figure: bar chart "Resource Consumption of Overlays", comparing LUTs, FFs and DSP blocks for Overlay-I, Overlay-II and DeCO; vertical axis from 0 to 70.]
19 Mapping Kernels onto DeCO
• FU utilization of up to 95%

Kernel     Required no. of cones   % FU utilization   Achievable GOPS
fft        1                       40%                3.95
kmeans     1                       95%                9.08
mm         1                       75%                5.92
spmv       1                       70%                5.53
mri        1                       75%                4.34
stencil    1                       80%                5.53
20 Mapping Kernels onto DeCO
• FU utilization of up to 95%
• Can replicate small kernels and map

Kernel      Required no. of cones   % FU utilization   Achievable GOPS
fft         1                       40%                3.95
kmeans      1                       95%                9.08
mm          1                       75%                5.92
spmv        1                       70%                5.53
mri         1                       75%                4.34
stencil     1                       80%                5.53
gradient    0.5                     90%                4.34
chebyshev   0.5                     40%                5.53
21 Mapping Kernels onto DeCO
• FU utilization of up to 95%
• Can replicate small kernels and map
• Multiple cones can be used to map large kernels

Kernel      Required no. of cones   % FU utilization   Achievable GOPS
fft         1                       40%                3.95
kmeans      1                       95%                9.08
mm          1                       75%                5.92
spmv        1                       70%                5.53
mri         1                       75%                4.34
stencil     1                       80%                5.53
gradient    0.5                     90%                4.34
chebyshev   0.5                     40%                5.53
bicg        3                       50%                11.85
trmm        4.5                     60%                21.33
syrk        4.5                     80%                28.44
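Dividing each achievable-GOPS figure by the kernel's OP node count from the kernel set characteristics slide gives nearly the same value for all six benchmark kernels, about 395 MHz. The sketch below recomputes this. Reading the quotient as an operating frequency (one operation per OP node per cycle) is an inference from the numbers, not a statement made on the slides, and the function name is illustrative.

```python
# (OP nodes before transformation, achievable GOPS), both taken from the slides.
kernels = {
    "fft": (10, 3.95),
    "kmeans": (23, 9.08),
    "mm": (15, 5.92),
    "spmv": (14, 5.53),
    "mri": (11, 4.34),
    "stencil": (14, 5.53),
}

def implied_clock_mhz(op_nodes, gops):
    """Clock rate implied by GOPS = OP nodes x clock (GHz). An inferred model."""
    return gops * 1000.0 / op_nodes

implied = {name: implied_clock_mhz(ops, gops) for name, (ops, gops) in kernels.items()}

for name, mhz in implied.items():
    print(f"{name:8s} {mhz:6.1f} MHz")
```

The remaining kernels (gradient, chebyshev, bicg, trmm, syrk) are left out because their OP node counts do not appear on the slides.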
22 Comparison to HLS
• Compare DeCO with Vivado HLS implementations (implemented in PR region)
  – For the kernel set, HLS required 1 DSP and 3 CLB tiles
[Figure: PR region, with a CLB tile and a DSP tile labeled.]
23 Comparison to HLS
• Compare DeCO with Vivado HLS implementations (implemented in PR region)
  – For the kernel set, HLS required 1 DSP and 3 CLB tiles
  – DeCO requires 2 DSP and 6 CLB tiles: DeCO has a 2x area penalty
[Figure: PR region.]