OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch
Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 2 2
Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations ● FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 3 2
Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations ● FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) ● Programming of heterogeneous systems is non-trivial 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 4 2
Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations ● FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) ● Programming of heterogeneous systems is non-trivial ● Desirable: Programming with a single, portable code base 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 5 2
OpenMP Device Offloading ● Denote target regions to execute on device #pragma omp target \ map(to:x[0:SIZE]) \ map(tofrom:y[0:SIZE]) { #pragma omp parallel for[...] for(i=0; i<SIZE; i++){ Target Region y[i] = a*x[i]+y[i]; } } 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 6 3
OpenMP Device Offloading ● Denote target regions to execute on device #pragma omp target \ map(to:x[0:SIZE]) \ ● Specify which and how map(tofrom:y[0:SIZE]) data is transferred to { device memory #pragma omp parallel for[...] for(i=0; i<SIZE; i++){ Target Region y[i] = a*x[i]+y[i]; } } 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 7 3
OpenMP Device Offloading ● Denote target regions to execute on device #pragma omp target \ map(to:x[0:SIZE]) \ ● Specify which and how map(tofrom:y[0:SIZE]) data is transferred to { device memory #pragma omp parallel for[...] for(i=0; i<SIZE; i++){ ● Use additional parallel y[i] = a*x[i]+y[i]; } constructs inside target } region (also target- specific, e.g. teams, distribute ,...) 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 8 3
Goal ● Implement mapping of target regions to FPGA accelerators in LLVM Clang – Preserve FPGA-specific pragmas (e.g. Vivado HLS) – Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 9 4
Goal ● Implement mapping of target regions to FPGA accelerators in LLVM Clang – Preserve FPGA-specific pragmas (e.g. Vivado HLS) – Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable ● Extend LLVM OpenMP Runtime – Manage data-transfers between host and FPGA – Control device execution on the FPGA accelerator 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 10 4
ThreadPoolComposer ● Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems ● Synthesize accelerator from kernel code TPC is available as open source: https://goo.gl/qTsU3B 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 11 5
ThreadPoolComposer ● Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems ● Assemble (multiple) instances of different kernels in top-level design, combined with standardized host- and memory connection TPC is available as open source: https://goo.gl/qTsU3B 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 12 5
ThreadPoolComposer ● Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems ● Control execution and data- transfer using two-layered API – Higher-level TPC API is device/platform-agnostic, allows for portable implementation (write once, run everywhere) TPC is available as open source: https://goo.gl/qTsU3B 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 13 5
Compilation Flow ● Start from a single, portable source file 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 14 6
Compilation Flow ● Start from a single, portable source file ● Standard host- compilation, including fallback if offloading fails 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 15 6
Compilation Flow ● Start from a single, portable source file ● Standard host- compilation, including fallback if offloading fails ● One device-specific compilation flow per device type – Limited to extracted target regions 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 16 6
Compilation Flow ● Custom Clang toolchain for TPC-based offloading to FPGA accelerators – Identified with new LLVM target triple – Preserves FPGA- specific pragmas, e.g. Vivado HLS pragmas – Yields three artifacts 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 17 6
Compilation Flow ● TPC-specific software binary – Entry point for FPGA device execution – Transfers kernel arguments and launches hardware execution using TPC API 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 18 6
Compilation Flow ● TPC-specific software binary – Entry point for FPGA device execution – Transfers kernel arguments and launches hardware execution using TPC API – Included in the combined binary 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 19 6
Compilation Flow ● Hardware kernel code extracted from target region 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 20 6
Compilation Flow ● Hardware kernel code extracted from target region ● Description of input argument types 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 21 6
Compilation Flow ● Hardware kernel code extracted from target region ● Description of input argument types ● TPC automates synthesis from kernel code and description to full FPGA design No additional user input required! 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 22 6
Compilation Flow ● Hardware kernel code extracted from target region ● Description of input argument types ● TPC automates synthesis from kernel code and description to full FPGA design ● Resulting bitstream features standardized host- and memory connection 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 23 6
Runtime Flow Components: Host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 24 7
Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure Host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 25 7
Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure ● TPC-based plugin for Host LLVM OpenMP Runtime 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 26 7
Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure ● TPC-based plugin for Host LLVM OpenMP Runtime ● TPC-specific software binary resulting from compilation 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 27 7
Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure ● TPC-based plugin for Host LLVM OpenMP Runtime ● TPC-specific software binary resulting from compilation ● FPGA abstraction as provided by TPC 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 28 7
Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: Host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 29 7
Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: ● Transfer data to FPGA Host memory 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 30 7
Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: ● Transfer data to FPGA Host memory ● Invoke binary ● Sets kernel arguments ● Launches hardware execution 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 31 7
Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: ● Transfer data to FPGA Host memory ● Invoke binary ● Sets kernel arguments ● Launches hardware execution ● Transfer data back to host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 32 7
Recommend
More recommend