Towards Scalable and Efficient FPGA Stencil Accelerators


  1. Towards Scalable and Efficient FPGA Stencil Accelerators
     Gaël Deest (1), Nicolas Estibals (1), Tomofumi Yuki (2), Steven Derrien (1), Sanjay Rajopadhye (3)
     (1) IRISA / Université de Rennes 1 / Cairn   (2) INRIA / LIP / ENS Lyon   (3) Colorado State University
     January 19th, 2016

  2. Stencil Computations
     Important class of algorithms:
     ◮ Iterative grid update.
     ◮ Uniform dependences.
     Examples:
     ◮ Solving partial differential equations
     ◮ Computer simulations (physics, seismology, etc.)
     ◮ (Realtime) image/video processing
     Strong need for efficient hardware implementations.

  3. Application Domains
     Two main application types with vastly different goals:
     HPC:
     ◮ "Be as fast as possible"
     ◮ No realtime constraints
     Embedded Systems:
     ◮ "Be fast enough"
     ◮ Realtime constraints
     For now, we focus on FPGAs from the HPC perspective.

  4. FPGAs as Stencil Accelerators?
     CPU: ≈ 10 cores, ≈ 10 GB/s (DDR)
     GPU: ≈ 100 cores, ≈ 100 GB/s (GDDR)
     FPGA: ≈ 1000 cores, ≈ 1 GB/s (DDR)
     Features:
     ◮ Large on-chip bandwidth
     ◮ Fine-grained pipelining
     ◮ Customizable datapath / arithmetic
     Drawbacks:
     ◮ Small off-chip bandwidth
     ◮ Difficult to program
     ◮ Lower clock frequencies

  5. Design Challenges
     At least two problems:
     ◮ Increase throughput with parallelization. Examples:
       ◮ Multiple PEs.
       ◮ Pipelining.
     ◮ Decrease bandwidth usage:
       ◮ Use on-chip memory to maximize reuse.
       ◮ Choose the memory mapping carefully to enable burst accesses.

  6. Stencils "Done Right" for FPGAs
     Observation:
     ◮ Many different strategies exist:
       ◮ Multiple-level tiling
       ◮ Deep pipelining
       ◮ Time skewing
       ◮ ...
     ◮ No single work puts them all together.
     Key features:
     ◮ Target one large, deeply pipelined PE...
     ◮ ...instead of many small PEs.
     ◮ Manage the throughput/bandwidth trade-off with two-level tiling.

  7. Multiple-Level Tiling
     Composition of two or more tiling transformations to account for:
     ◮ Memory hierarchies and locality
       ◮ Registers, caches, RAM, disks, ...
     ◮ Multiple levels of parallelism
       ◮ Instruction-level, thread-level, ...
     In this work:
     1. Inner tiling level: parallelism.
     2. Outer tiling level: communication.

  8. Overview of Our Approach
     Core ideas:
     1. Execute inner, Datapath-Level (DL) tiles on a single, pipelined "macro-operator".
        ◮ Fire a new tile execution each cycle.
        ◮ Delegate operator pipelining to HLS.
     2. Group DL-tiles into Communication-Level (CL) tiles to decrease bandwidth requirements.
        ◮ Store intermediate results on chip.
     A loop-nest sketch of this structure follows below.
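
As a rough sketch only: the loop structure implied by these two ideas might look like the following, where CL_T/CL_X, DL_T/DL_X, and the helper functions are illustrative placeholders, not names from the talk.

    // Hedged sketch of the two-level tiling structure (all names and
    // tile sizes are illustrative assumptions).
    for (int ct = 0; ct < T; ct += CL_T)             // Communication-Level tiles
      for (int cx = 0; cx < N; cx += CL_X) {
        load_cl_tile_inputs(ct, cx);                 // off-chip -> on-chip
        for (int t = ct; t < ct + CL_T; t += DL_T)   // Datapath-Level tiles
          for (int x = cx; x < cx + CL_X; x += DL_X)
            dl_tile_operator(t, x);                  // one pipelined macro-operator;
                                                     // ideally a new tile enters it
                                                     // every clock cycle
        store_cl_tile_outputs(ct, cx);               // on-chip -> off-chip
      }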

  9. Outline
     ◮ Introduction
     ◮ Approach
     ◮ Evaluation
     ◮ Related Work and Comparison
     ◮ Future Work & Conclusion

  10. Running Example: Jacobi (3-point, 1D data)
      Simplified code:

      for (t = 1; t < T; t++)
        for (x = 1; x < N - 1; x++)
          f[t][x] = (f[t-1][x-1] + f[t-1][x] + f[t-1][x+1]) / 3;

      Dependence vectors: (−1, −1), (−1, 0), (−1, 1)
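
For reference, a self-contained, unoptimized version of this kernel; the float element type, the size N, and the function name are illustrative assumptions, not fixed by the talk.

    #define N 1024                     /* illustrative problem size */

    /* Plain C reference for the running example; boundary columns are
       left untouched, matching the loop bounds on the slide. */
    void jacobi_1d(float (*f)[N], int T) {
      for (int t = 1; t < T; t++)
        for (int x = 1; x < N - 1; x++)
          f[t][x] = (f[t-1][x-1] + f[t-1][x] + f[t-1][x+1]) / 3.0f;
    }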

  11. Datapath-Level Tiling

  12. Datapath-Level Tiling
      t, x ↦ t, x + t
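
Reading this mapping as a skew of the x axis: a dependence (dt, dx) becomes (dt, dx + dt), so the three Jacobi dependences (−1, −1), (−1, 0), (−1, 1) map to (−1, −2), (−1, −1), (−1, 0). All components are now non-positive, which is what makes rectangular DL-tiles legal. One way to write the skewed nest (variable names are mine):

    for (int t = 1; t < T; t++)
      for (int xs = 1 + t; xs < (N - 1) + t; xs++) {  // xs = x + t (skewed axis)
        int x = xs - t;                               // recover the original index
        f[t][x] = (f[t-1][x-1] + f[t-1][x] + f[t-1][x+1]) / 3.0f;
      }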

  14. Datapath-Level Tile Operator

      for (t = ...) {
      #pragma HLS PIPELINE II=1
        for (x = ...) {
        #pragma HLS UNROLL
          for (tt = ...) {
          #pragma HLS UNROLL
            for (xx = ...) {
              int t0 = t + tt, x0 = x + xx - t0;
              f[t0][x0] = (f[t0-1][x0-1] + f[t0-1][x0] + f[t0-1][x0+1]) / 3;
            }
          }
        }
      }

      Types of parallelism:
      ◮ Operation-level parallelism (exposed by unrolling).
      ◮ Temporal parallelism (through pipelined tile executions).

  15. Pipelined Execution
      Pipelined execution requires inter-tile parallelism.
      [figures: original dependences, tile-level dependences, Gauss-Seidel dependences]

  16. Wavefronts of Datapath-Level Tiles

  17. Wavefronts of Datapath-Level Tiles
      Skewing: t, x ↦ t + x, x

  18. Wavefronts of Datapath-Level Tiles
      [figure: wavefronts of independent DL-tiles]
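
After this tile-level skewing, all DL-tiles with the same t + x coordinate are mutually independent and can be issued into the pipelined operator on consecutive cycles. A sketch of that issue order, where NT, NX, and dl_tile_operator() are illustrative assumptions carried over from the earlier sketch:

    // Hedged sketch of wavefront-ordered tile issue.
    for (int w = 0; w < NT + NX - 1; w++) {     // wavefront index w = tt + tx
      for (int tt = 0; tt < NT; tt++) {
        int tx = w - tt;
        if (tx < 0 || tx >= NX) continue;       // tile outside the grid
        dl_tile_operator(tt, tx);               // independent of every other tile
                                                // on this wavefront, so one tile
                                                // can enter the pipeline per cycle
      }
    }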

  19. Managing the Compute/IO Ratio
      Problem: suppose we directly pipeline 2 × 2 DL-tiles. At each clock cycle:
      ◮ A new tile enters the pipeline.
      ◮ Six 32-bit values are fetched from off-chip memory.
      At 100 MHz, that amounts to 6 × 32 bits × 100 MHz = 19.2 Gbit/s of bandwidth!
      Solution: use a second tiling level to decrease bandwidth requirements.

  20. Communication-Level Tiling
      [figure: numbered DL-tiles grouped into wavefronts WF1 and WF2]
      Two kinds of constraints: shape constraints and size constraints.

  21. Communication-Level Tiling
      Shape constraints:
      ◮ Constant-height wavefronts (d1 = d2)
      ◮ Enables the use of simple FIFOs for intermediate results

  22. Communication-Level Tiling
      Shape constraints:
      ◮ Constant-height wavefronts
      ◮ Enables the use of simple FIFOs for intermediate results
      Size constraints:
      ◮ Tiles per wavefront ≥ pipeline depth d
      [figure: pipeline stages 0–6, illustrated for depth d = 4]

  23. Communication-Level Tiling
      Shape constraints:
      ◮ Constant-height wavefronts
      ◮ Enables the use of simple FIFOs for intermediate results
      Size constraints:
      ◮ Tiles per wavefront ≥ pipeline depth
      ◮ Bandwidth requirements ≤ chip limit
      ◮ Size of FIFOs ≤ chip limit
      A sketch of this feasibility check follows below.
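
The feasibility of a candidate CL-tile can be checked mechanically. A minimal sketch, where every threshold (pipeline depth, BRAM budget, bandwidth limit) is a designer-supplied parameter, not a number from the talk:

    #include <stdbool.h>

    /* True iff a candidate CL-tile satisfies the three size
       constraints listed above. */
    bool cl_tile_feasible(int tiles_per_wavefront, int pipeline_depth,
                          long fifo_bits, long bram_budget_bits,
                          double bandwidth_gbps, double bandwidth_limit_gbps) {
      return tiles_per_wavefront >= pipeline_depth      /* keep pipeline full  */
          && fifo_bits <= bram_budget_bits              /* FIFOs fit on chip   */
          && bandwidth_gbps <= bandwidth_limit_gbps;    /* off-chip BW bounded */
    }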

  24. Communication-Level Tile Shape
      Hyperparallelepipedic (rectangular) tiles satisfy all shape constraints.
      [figure: tile shape in original coordinates, obtained by applying skew⁻¹]

  25. Communication
      Two aspects:
      On-chip communication:
      ◮ Between DL-tiles
      ◮ Uses FIFOs
      Off-chip communication:
      ◮ Between CL-tiles
      ◮ Uses memory accesses

  26. On-Chip Communication
      We use Canonic Multi-Projections (Yuki and Rajopadhye, 2011).
      Main ideas:
      ◮ Communicate along canonical axes.
      ◮ Project diagonal dependences onto canonical directions.
      ◮ Some values are redundantly stored.
      [figure: tile with buff_t(in)/buff_x(in) and buff_t(out)/buff_x(out) buffers]
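
A minimal sketch of this idea in plain C, with the axis buffers modeled as ring-buffer FIFOs; the depth, the placeholder computation, and the function names are illustrative assumptions, not the scheme's actual interface:

    #define FIFO_DEPTH 64                         /* illustrative depth */

    typedef struct { float data[FIFO_DEPTH]; unsigned head, tail; } fifo_t;

    static void  fifo_push(fifo_t *f, float v) { f->data[f->tail++ % FIFO_DEPTH] = v; }
    static float fifo_pop (fifo_t *f)          { return f->data[f->head++ % FIFO_DEPTH]; }

    /* One tile step: all inter-tile traffic travels along the canonical
       t and x axes only. A value that a diagonal neighbor also needs is
       pushed into BOTH FIFOs: this is the redundant storage the slide
       mentions. */
    void tile_step(fifo_t *t_axis, fifo_t *x_axis) {
      float from_t = fifo_pop(t_axis);            /* produced by the t-1 tile */
      float from_x = fifo_pop(x_axis);            /* produced by the x-1 tile */
      float out = 0.5f * (from_t + from_x);       /* placeholder computation  */
      fifo_push(t_axis, out);                     /* for the tile at t+1      */
      fifo_push(x_axis, out);                     /* redundant copy for x+1   */
    }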

  27. Off-Chip Communication
      Between CL-tiles (assuming lexicographic ordering):
      ◮ Data can be reused along the innermost dimension.
      ◮ Data from/to other tiles must be fetched/stored off-chip (complex shape).
      ◮ Key for performance: use burst accesses.
        ◮ Maximize contiguity with a clever memory mapping.
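
In Vivado HLS terms (which the deck's pragmas suggest), a standard way to obtain burst accesses is to copy a contiguous chunk of the off-chip array into an on-chip buffer with memcpy; a sketch, where the length, bundle name, and function name are illustrative:

    #include <string.h>

    #define ROW_LEN 256                     /* illustrative burst length */

    /* Copying a contiguous row of the CL-tile footprint in one memcpy
       lets the HLS tool infer a single AXI burst instead of
       per-element accesses. */
    void load_cl_tile_row(const float *ddr, float local[ROW_LEN], int offset) {
    #pragma HLS INTERFACE m_axi port=ddr offset=slave bundle=gmem
      memcpy(local, ddr + offset, ROW_LEN * sizeof(float));
    }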

  29. Outline
      ◮ Introduction
      ◮ Approach
      ◮ Evaluation
      ◮ Related Work and Comparison
      ◮ Future Work & Conclusion

  30. Metrics
      ◮ Hardware-related metrics:
        ◮ Macro-operator pipeline depth
        ◮ Area (slices, BRAM & DSP)
      ◮ Performance-related metrics (at steady state):
        ◮ Throughput
        ◮ Required bandwidth

  31. Preliminary Results: Parallelism Scalability
      [chart: steady-state throughput (3.4 to 38.4 GFlop/s), computational resource usage (2% to 44%), and pipeline depth (61 to 229) for DL-tile sizes from 2×2 to 8×8 and 2×2×2 to 4×4×4]
      Choose the DL-tile size to control:
      ◮ Computational throughput
      ◮ Computational resource usage
      ◮ Macro-operator latency and pipeline depth

  32. Preliminary Results: Bandwidth Usage Control
      [chart: steady-state bandwidth (0.5 to 2.2 GB/s) and BRAM usage (6% to 42%) for CL-tile sizes from n × 15 × 14 to n × 59 × 59, with a 4×4×4 DL-tile]
      Enlarging CL-tiles:
      ◮ Does not change throughput
      ◮ Reduces bandwidth requirements
      ◮ Has a low impact on hardware resources

  33. Outline
      ◮ Introduction
      ◮ Approach
      ◮ Evaluation
      ◮ Related Work and Comparison
      ◮ Future Work & Conclusion

  34. Related Work
      ◮ Hardware implementations:
        ◮ Many ad hoc / naive architectures
        ◮ Systolic architectures (LSGP)
        ◮ PolyOpt/HLS (Pouchet et al., 2013)
      ◮ Tiling to control the compute/IO balance:
        ◮ Alias et al., 2012
          ◮ Single, pipelined operator
          ◮ Innermost loop body only
      ◮ Tiling method:
        ◮ "Jagged tiling" (Shrestha et al., 2015)

  35. Outline
      ◮ Introduction
      ◮ Approach
      ◮ Evaluation
      ◮ Related Work and Comparison
      ◮ Future Work & Conclusion

  36. Future Work
      ◮ Finalize the implementation
      ◮ Go beyond Jacobi
      ◮ Explore other number representations:
        ◮ Fixed-point
        ◮ Block floating-point
        ◮ Custom floating-point
      ◮ Hardware/software codesign
      ◮ ...

  37. Conclusion
      ◮ A design template for FPGA stencil accelerators
      ◮ Two levels of control:
        ◮ Throughput
        ◮ Bandwidth requirements
      ◮ Maximize use of pipeline parallelism

  38. Thank You
      Questions?
