openmp device offloading to fpga accelerators
play

OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens - PowerPoint PPT Presentation

OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth,


  1. OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch

  2. Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 2 2

  3. Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations ● FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 3 2

  4. Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations ● FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) ● Programming of heterogeneous systems is non-trivial 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 4 2

  5. Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations ● FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) ● Programming of heterogeneous systems is non-trivial ● Desirable: Programming with a single, portable code base 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 5 2

  6. OpenMP Device Offloading ● Denote target regions to execute on device #pragma omp target \ map(to:x[0:SIZE]) \ map(tofrom:y[0:SIZE]) { #pragma omp parallel for[...] for(i=0; i<SIZE; i++){ Target Region y[i] = a*x[i]+y[i]; } } 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 6 3

  7. OpenMP Device Offloading ● Denote target regions to execute on device #pragma omp target \ map(to:x[0:SIZE]) \ ● Specify which and how map(tofrom:y[0:SIZE]) data is transferred to { device memory #pragma omp parallel for[...] for(i=0; i<SIZE; i++){ Target Region y[i] = a*x[i]+y[i]; } } 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 7 3

  8. OpenMP Device Offloading ● Denote target regions to execute on device #pragma omp target \ map(to:x[0:SIZE]) \ ● Specify which and how map(tofrom:y[0:SIZE]) data is transferred to { device memory #pragma omp parallel for[...] for(i=0; i<SIZE; i++){ ● Use additional parallel y[i] = a*x[i]+y[i]; } constructs inside target } region (also target- specific, e.g. teams, distribute ,...) 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 8 3

  9. Goal ● Implement mapping of target regions to FPGA accelerators in LLVM Clang – Preserve FPGA-specific pragmas (e.g. Vivado HLS) – Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 9 4

  10. Goal ● Implement mapping of target regions to FPGA accelerators in LLVM Clang – Preserve FPGA-specific pragmas (e.g. Vivado HLS) – Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable ● Extend LLVM OpenMP Runtime – Manage data-transfers between host and FPGA – Control device execution on the FPGA accelerator 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 10 4

  11. ThreadPoolComposer ● Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems ● Synthesize accelerator from kernel code TPC is available as open source: https://goo.gl/qTsU3B 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 11 5

  12. ThreadPoolComposer ● Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems ● Assemble (multiple) instances of different kernels in top-level design, combined with standardized host- and memory connection TPC is available as open source: https://goo.gl/qTsU3B 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 12 5

  13. ThreadPoolComposer ● Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems ● Control execution and data- transfer using two-layered API – Higher-level TPC API is device/platform-agnostic, allows for portable implementation (write once, run everywhere) TPC is available as open source: https://goo.gl/qTsU3B 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 13 5

  14. Compilation Flow ● Start from a single, portable source file 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 14 6

  15. Compilation Flow ● Start from a single, portable source file ● Standard host- compilation, including fallback if offloading fails 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 15 6

  16. Compilation Flow ● Start from a single, portable source file ● Standard host- compilation, including fallback if offloading fails ● One device-specific compilation flow per device type – Limited to extracted target regions 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 16 6

  17. Compilation Flow ● Custom Clang toolchain for TPC-based offloading to FPGA accelerators – Identified with new LLVM target triple – Preserves FPGA- specific pragmas, e.g. Vivado HLS pragmas – Yields three artifacts 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 17 6

  18. Compilation Flow ● TPC-specific software binary – Entry point for FPGA device execution – Transfers kernel arguments and launches hardware execution using TPC API 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 18 6

  19. Compilation Flow ● TPC-specific software binary – Entry point for FPGA device execution – Transfers kernel arguments and launches hardware execution using TPC API – Included in the combined binary 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 19 6

  20. Compilation Flow ● Hardware kernel code extracted from target region 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 20 6

  21. Compilation Flow ● Hardware kernel code extracted from target region ● Description of input argument types 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 21 6

  22. Compilation Flow ● Hardware kernel code extracted from target region ● Description of input argument types ● TPC automates synthesis from kernel code and description to full FPGA design No additional user input required! 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 22 6

  23. Compilation Flow ● Hardware kernel code extracted from target region ● Description of input argument types ● TPC automates synthesis from kernel code and description to full FPGA design ● Resulting bitstream features standardized host- and memory connection 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 23 6

  24. Runtime Flow Components: Host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 24 7

  25. Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure Host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 25 7

  26. Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure ● TPC-based plugin for Host LLVM OpenMP Runtime 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 26 7

  27. Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure ● TPC-based plugin for Host LLVM OpenMP Runtime ● TPC-specific software binary resulting from compilation 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 27 7

  28. Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure ● TPC-based plugin for Host LLVM OpenMP Runtime ● TPC-specific software binary resulting from compilation ● FPGA abstraction as provided by TPC 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 28 7

  29. Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: Host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 29 7

  30. Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: ● Transfer data to FPGA Host memory 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 30 7

  31. Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: ● Transfer data to FPGA Host memory ● Invoke binary ● Sets kernel arguments ● Launches hardware execution 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 31 7

  32. Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: ● Transfer data to FPGA Host memory ● Invoke binary ● Sets kernel arguments ● Launches hardware execution ● Transfer data back to host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 32 7

Recommend


More recommend