Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite
Marius Meyer, Tobias Kenter, Christian Plessl
Paderborn University, Germany / Paderborn Center for Parallel Computing
H2RC'20, virtual, November 13, 2020
HPC Challenge for FPGA
An FPGA-adapted implementation of HPCC:
• OpenCL kernels and C++ host code, measuring hardware and tools
• Support for Intel and Xilinx FPGAs
• Configuration options to adapt to resources and architecture
• It's open source and already available on GitHub!
The HPC Challenge Suite
Synthetic Benchmarks:
• STREAM
• RandomAccess
• b_eff
Benchmark Applications:
• GEMM
• PTRANS
• FFT
• HPL
Base runs: Use the unmodified, provided benchmark implementations
Optimized runs: Modifications allowed with respect to the benchmark rules
Idea: The memory access patterns of other applications will always be a combination of the patterns implemented by these benchmarks
HPCC FPGA Base Implementations
We focus on base implementations for now. Two main concepts are used to increase resource utilization and performance (the slide figure shows FPGAs with one or more compute units, CU 1 and CU 2):
Scaling:
• Match the data width of fixed interfaces
• Increase parallelism to make use of more resources
• Individual options for every benchmark
Replication:
• Utilize all available interfaces
• Increase resource usage
• Option: NUM_REPLICATIONS
A rough sketch of both concepts follows below.
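To make the two concepts concrete, here is a minimal OpenCL sketch, not taken from the benchmark sources: the kernel names (copy_scaled, copy_0, copy_1) and the UNROLL constant are made up for illustration, and the unroll pragma follows the Intel OpenCL style.

```c
// Hypothetical sketch of scaling vs. replication; the real HPCC FPGA kernels
// are generated from templates and differ in detail.

// Scaling: one kernel whose data path is widened by loop unrolling so that
// the unrolled accesses match the width of the memory interface.
// (n is assumed to be a multiple of UNROLL.)
#define UNROLL 16
__kernel void copy_scaled(__global const float *restrict in,
                          __global float *restrict out,
                          const uint n) {
    for (uint i = 0; i < n; i += UNROLL) {
        #pragma unroll
        for (uint u = 0; u < UNROLL; u++) {
            out[i + u] = in[i + u];
        }
    }
}

// Replication: the same kernel body is instantiated once per memory bank;
// the host binds each instance's buffers to a different bank and launches
// all instances concurrently (NUM_REPLICATIONS instances in total).
__kernel void copy_0(__global const float *restrict in,
                     __global float *restrict out, const uint n) {
    for (uint i = 0; i < n; i++) out[i] = in[i];
}

__kernel void copy_1(__global const float *restrict in,
                     __global float *restrict out, const uint n) {
    for (uint i = 0; i < n; i++) out[i] = in[i];
}
```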
Experimental Setup
Nallatech 520N:
• Intel Stratix 10 GX 2800
• 4x 8 GB DDR4 SDRAM
• x8 PCIe 3.0
Intel PAC D5005:
• Intel Stratix 10 SX 2800
• Direct access to host memory using SVM
• x16 PCIe 3.0
Xilinx Alveo U280:
• XCU280
• 32x 256 MB HBM2 on the FPGA
• 2x 16 GB DDR4 SDRAM
• x8 PCIe 4.0
Benchmark Implementation
STREAM Implementation
Operations measured by STREAM for FPGA:
• PCIe Write: Write arrays to device
• Copy: C[j] = A[j]
• Scale: B[j] = k · C[j]
• Add: C[j] = A[j] + B[j]
• Triad: A[j] = k · C[j] + B[j]
• PCIe Read: Read arrays from device
Configuration Options:
• DATA_TYPE: Define the data type
• VECTOR_COUNT
• GLOBAL_MEM_UNROLL: Unroll the loops over global memory
• DEVICE_BUFFER_SIZE: Size of the local memory buffer
• NUM_REPLICATIONS: One kernel per memory bank
A parameterized kernel sketch follows below.
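As an illustration of how these options could enter a kernel, the following sketch of the Triad operation assumes DATA_TYPE and GLOBAL_MEM_UNROLL are passed as preprocessor defines at kernel compile time; the kernel body is a simplification, not the benchmark's actual code.

```c
// Illustrative only; option names mirror the configuration options above.
#ifndef DATA_TYPE
#define DATA_TYPE float
#endif
#ifndef GLOBAL_MEM_UNROLL
#define GLOBAL_MEM_UNROLL 16
#endif

// Triad: A[j] = k * C[j] + B[j], unrolled so that GLOBAL_MEM_UNROLL values
// are read and written per loop iteration (n assumed to be a multiple of it).
__kernel void triad(__global DATA_TYPE *restrict a,
                    __global const DATA_TYPE *restrict b,
                    __global const DATA_TYPE *restrict c,
                    const DATA_TYPE k, const uint n) {
    for (uint j = 0; j < n; j += GLOBAL_MEM_UNROLL) {
        #pragma unroll
        for (uint u = 0; u < GLOBAL_MEM_UNROLL; u++) {
            a[j + u] = k * c[j + u] + b[j + u];
        }
    }
}
```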
STREAM Synthesis Observations
• The benchmark needs two different kernel designs to work well with all types of global memory
• STREAM achieves a high memory efficiency, independent of the operation, for half-duplex memory interfaces
An illustrative sketch of two such designs follows below.
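As a rough illustration of what two such designs could look like, the sketch below contrasts a directly streaming Copy kernel with a variant that stages DEVICE_BUFFER_SIZE elements in on-chip memory to form explicit read and write bursts. This is a guess at the design space under the stated assumptions, not the benchmark's actual kernels.

```c
// Design 1: stream directly between global memory pointers.
__kernel void copy_direct(__global const float *restrict in,
                          __global float *restrict out, const uint n) {
    for (uint i = 0; i < n; i++) {
        out[i] = in[i];
    }
}

// Design 2: stage chunks in an on-chip buffer so reads and writes form
// separate bursts, which suits some memory interfaces better.
// (n is assumed to be a multiple of DEVICE_BUFFER_SIZE.)
#ifndef DEVICE_BUFFER_SIZE
#define DEVICE_BUFFER_SIZE 1024
#endif
__kernel void copy_buffered(__global const float *restrict in,
                            __global float *restrict out, const uint n) {
    float buffer[DEVICE_BUFFER_SIZE];
    for (uint chunk = 0; chunk < n; chunk += DEVICE_BUFFER_SIZE) {
        for (uint i = 0; i < DEVICE_BUFFER_SIZE; i++) {
            buffer[i] = in[chunk + i];
        }
        for (uint i = 0; i < DEVICE_BUFFER_SIZE; i++) {
            out[chunk + i] = buffer[i];
        }
    }
}
```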
RandomAccess Implementation
Description: Update values in a large data array in pseudo-random order. Update errors are allowed!
Configuration Options:
• DEVICE_BUFFER_SIZE: Size of the local memory buffer
• NUM_REPLICATIONS: One kernel per memory bank
Every kernel:
• Calculates the same pseudo-random number sequence
• Performs an update only if the address falls into its own memory bank
• Uses two pipelines to remove dependencies between reads and writes
(Slide figure: random numbers select the index of the next value in the data array; updates pass through a local memory buffer into the data array, which is distributed over memory banks 1 and 2.)
A sketch of the update loop follows below.
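A minimal sketch of such an update loop, assuming the standard HPCC RandomAccess recurrence (shift and XOR with the polynomial 7) and a contiguous split of the data array across the banks; the kernel name, the bank arithmetic and the simplified seed are assumptions for illustration.

```c
// Illustrative single work-item kernel. Every replication computes the same
// pseudo-random sequence but commits an update only when the target index
// lies in its own bank; occasional update errors are tolerated by the
// benchmark. table_size must be a power of two and a multiple of num_banks.
#define POLY 0x0000000000000007UL

__kernel void random_access(__global ulong *restrict bank_data,
                            const ulong table_size,
                            const ulong num_updates,
                            const uint bank_id,
                            const uint num_banks) {
    ulong ran = 1;  // seed simplified for the sketch
    const ulong bank_size = table_size / num_banks;
    for (ulong i = 0; i < num_updates; i++) {
        // HPCC RandomAccess recurrence for the next pseudo-random number.
        ran = (ran << 1) ^ (((long)ran < 0) ? POLY : 0UL);
        const ulong index = ran & (table_size - 1);
        // Commit only updates that fall into this kernel's memory bank.
        if (index / bank_size == bank_id) {
            bank_data[index % bank_size] ^= ran;
        }
    }
}
```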
RandomAccess Results
Configuration:
• NUM_REPLICATIONS: 4 (520N DDR), 2 (U280 DDR), 32 (U280 HBM2), 1 (PAC SVM)
• DEVICE_BUFFER_SIZE: 1 and 1,024
Results:
• 520N DDR: 245.0 MUOP/s, error 0.0099%
• U280 DDR: 40.3 MUOP/s, error 0.0106%
• U280 HBM2: 128.1 MUOP/s, error 0.0106%
• PAC SVM: 0.5 MUOP/s, error 0.0106%
Observations:
• Compiler support for ignoring data dependencies has a huge impact on performance (see the sketch below)
• The number of kernel replications has a negative impact on performance
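The first observation refers to telling the compiler that it may ignore potential dependencies between the reads and writes of the update loop, so the loop can be pipelined with an initiation interval of one. In the Intel FPGA SDK for OpenCL this can be expressed with #pragma ivdep, as sketched below; the loop body is illustrative and the Xilinx flow needs a different mechanism.

```c
// Sketch: ignoring the (rare) read-after-write dependencies on 'data'.
// Ignored dependencies occasionally produce wrong updates, which appear as
// the small error rate that RandomAccess explicitly allows.
__kernel void update_loop(__global ulong *restrict data,
                          const ulong table_size, const ulong num_updates) {
    ulong ran = 1;
    #pragma ivdep
    for (ulong i = 0; i < num_updates; i++) {
        ran = (ran << 1) ^ (((long)ran < 0) ? 0x7UL : 0UL);
        data[ran & (table_size - 1)] ^= ran;
    }
}
```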
FFT Implementation
Description: Batched calculation of 1d FFTs
Configuration Options:
• LOG_FFT_SIZE: Log2 of the 1d FFT size
• NUM_REPLICATIONS: One kernel for two memory banks
Implementation details:
• The implementation is fully pipelined
• Fetch kernel: BRAM
• FFT kernel: BRAM/logic, DSPs
(Slide figure: a Fetch kernel reads from memory bank 1 into a buffer, feeds the FFT stages with their shift registers through pipes, and a Store kernel writes the results to memory bank 2.)
Performance model (f_mem: memory clock frequency):
p_FFT = 5 · LOG_FFT_SIZE · f_mem · NUM_REPLICATIONS · 8
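The model assumes 5·log2(N) floating-point operations per FFT point and 8 points consumed per memory clock cycle per replication. Plugging in example numbers (the values below are arbitrary, not a measured configuration) gives a quick sanity check:

```c
#include <stdio.h>

int main(void) {
    // Example values only, not a measured configuration.
    const double log_fft_size = 9.0;      // LOG_FFT_SIZE, i.e. FFT size 2^9
    const double f_mem = 300e6;           // assumed memory clock in Hz
    const double num_replications = 2.0;  // kernel replications

    // p_FFT = 5 * LOG_FFT_SIZE * f_mem * NUM_REPLICATIONS * 8
    const double p_fft = 5.0 * log_fft_size * f_mem * num_replications * 8.0;
    printf("modeled FFT performance: %.1f GFLOP/s\n", p_fft / 1e9);
    return 0;
}
```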
FFT Results
Configuration:
• NUM_REPLICATIONS: 2 (520N DDR), 1 (U280 DDR), 15 (U280 HBM2), 1 (PAC SVM)
• LOG_FFT_SIZE: 17 (520N DDR), 9 (U280 DDR), 5 (U280 HBM2), 17 (PAC SVM)
(Chart: global memory bandwidth efficiency of FFT [%] for 520N DDR, U280 DDR, U280 HBM2 and PAC SVM.)
Observations:
• The design allows a high utilization of the global memory for a broad range of FFT sizes
• Similar performance can be achieved with either of the two configuration options
GEMM Implementation
Description: Multiply square matrices, C' = α · A · B + β · C, where A, B, C, C' ∈ ℝ^(n×n) and α, β ∈ ℝ
Configuration Parameters:
• DATA_TYPE: Used data type
• GLOBAL_MEM_UNROLL: Number of values that are loaded into local memory per clock cycle (u)
• BLOCK_SIZE: Size of the local memory block (b)
• GEMM_SIZE: Size of the register block
• NUM_REPLICATIONS: Used to fill the FPGA resources
Performance model (f_mem: memory clock frequency, f_k: kernel clock frequency):
t_exe = (n/b)³ · ( b²/(u · f_mem) + b³/(b · f_k) + b²/(u · f_mem) )
A sketch of the register-blocked block multiplication follows below.
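To illustrate how BLOCK_SIZE (b) and GEMM_SIZE interact, here is a hypothetical sketch of one block multiplication: a fully unrolled GEMM_SIZE x GEMM_SIZE register tile accumulates partial products while iterating over a BLOCK_SIZE x BLOCK_SIZE local-memory block. The function name and the plain-C form are illustrative; the large unrolled multiplier array is what the frequency observation on the next slide refers to.

```c
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 256
#endif
#ifndef GEMM_SIZE
#define GEMM_SIZE 8
#endif

// Hypothetical register-blocked update of one local-memory block:
// c_blk += a_blk * b_blk. The two innermost loops form a fully unrolled
// GEMM_SIZE x GEMM_SIZE register tile (unroll pragma as used in the FPGA
// OpenCL flows; plain C compilers will simply ignore it).
void block_multiply(const float a_blk[BLOCK_SIZE][BLOCK_SIZE],
                    const float b_blk[BLOCK_SIZE][BLOCK_SIZE],
                    float c_blk[BLOCK_SIZE][BLOCK_SIZE]) {
    for (int i = 0; i < BLOCK_SIZE; i += GEMM_SIZE) {
        for (int j = 0; j < BLOCK_SIZE; j += GEMM_SIZE) {
            float acc[GEMM_SIZE][GEMM_SIZE];  // register tile accumulator
            for (int ii = 0; ii < GEMM_SIZE; ii++)
                for (int jj = 0; jj < GEMM_SIZE; jj++)
                    acc[ii][jj] = c_blk[i + ii][j + jj];
            for (int k = 0; k < BLOCK_SIZE; k++) {
                #pragma unroll
                for (int ii = 0; ii < GEMM_SIZE; ii++) {
                    #pragma unroll
                    for (int jj = 0; jj < GEMM_SIZE; jj++)
                        acc[ii][jj] += a_blk[i + ii][k] * b_blk[k][j + jj];
                }
            }
            for (int ii = 0; ii < GEMM_SIZE; ii++)
                for (int jj = 0; jj < GEMM_SIZE; jj++)
                    c_blk[i + ii][j + jj] = acc[ii][jj];
        }
    }
}
```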
GEMM Results
Configuration:
• DATA_TYPE: float (all devices)
• GLOBAL_MEM_UNROLL: 16 (all devices)
• GEMM_SIZE: 8 (all devices)
• BLOCK_SIZE: 512 (520N DDR), 256 (U280 DDR), 256 (U280 HBM2), 512 (PAC SVM)
• NUM_REPLICATIONS: 5 (520N DDR), 3 (U280 DDR), 3 (U280 HBM2), 5 (PAC SVM)
Observations:
• The large in-register multiplication leads to low kernel frequencies
• HBM2 can also improve the performance of mainly compute-bound applications
(Charts: performance in GFLOP/s normalized to 100 MHz and a single kernel replication, and kernel frequency in MHz, each for 520N DDR, U280 DDR, U280 HBM2 and PAC SVM.)
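The left chart is normalized to a 100 MHz kernel clock and a single replication. Assuming the normalization simply divides the measured GFLOP/s by the frequency ratio and by the replication count (an assumption about the slide; the numbers below are made up), it can be reproduced like this:

```c
#include <stdio.h>

int main(void) {
    // Hypothetical example numbers, not results from the slide.
    const double measured_gflops = 200.0;  // measured performance
    const double f_kernel_mhz = 250.0;     // achieved kernel frequency
    const double num_replications = 5.0;   // NUM_REPLICATIONS

    // Assumed normalization: per replication, scaled to a 100 MHz clock.
    const double normalized =
        measured_gflops / (f_kernel_mhz / 100.0) / num_replications;
    printf("normalized: %.1f GFLOP/s per replication at 100 MHz\n", normalized);
    return 0;
}
```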
Conclusion
• It is a challenging task to create unbiased base implementations
• The implementations show a similar performance efficiency on the tested devices
• The implementations allow adjusting the utilization of the relevant resources for a broad range of FPGAs
Next Steps:
• Implement the remaining base implementations
• Offer support for multi-FPGA execution of the benchmarks
• Utilize inter-FPGA networks