Towards Scalable and Efficient FPGA Stencil Accelerators


  1. Towards Scalable and Efficient FPGA Stencil Accelerators
     Gaël Deest (1), Nicolas Estibals (1), Tomofumi Yuki (2), Steven Derrien (1), Sanjay Rajopadhye (3)
     (1) IRISA / Université de Rennes 1 / Cairn   (2) INRIA / LIP / ENS Lyon   (3) Colorado State University
     January 19th, 2016

  2. Stencil Computations
     Important class of algorithms:
     ◮ Iterative grid update.
     ◮ Uniform dependences.
     Examples:
     ◮ Solving partial differential equations
     ◮ Computer simulations (physics, seismology, etc.)
     ◮ (Realtime) image/video processing
     Strong need for efficient hardware implementations.

  3. Application Domains
     Two main application types with vastly different goals:
     HPC:
     ◮ "Be as fast as possible"
     ◮ No realtime constraints
     Embedded Systems:
     ◮ "Be fast enough"
     ◮ Realtime constraints
     For now, we focus on FPGAs from the HPC perspective.

  4. FPGAs as Stencil Accelerators?
     CPU: ≈ 10 cores, ≈ 10 GB/s (DDR)
     GPU: ≈ 100 cores, ≈ 100 GB/s (GDDR)
     FPGA: ≈ 1000 cores, ≈ 1 GB/s (DDR)
     Features:
     ◮ Large on-chip bandwidth
     ◮ Fine-grained pipelining
     ◮ Customizable datapath / arithmetic
     Drawbacks:
     ◮ Small off-chip bandwidth
     ◮ Difficult to program
     ◮ Lower clock frequencies

  5. Design Challenges
     At least two problems:
     ◮ Increase throughput with parallelization. Examples:
       ◮ Multiple PEs.
       ◮ Pipelining.
     ◮ Decrease bandwidth usage:
       ◮ Use on-chip memory to maximize reuse.
       ◮ Choose the memory mapping carefully to enable burst accesses.

  6. Stencils "Done Right" for FPGAs
     Observation:
     ◮ Many different strategies exist:
       ◮ Multiple-level tiling
       ◮ Deep pipelining
       ◮ Time skewing
       ◮ ...
     ◮ No single work puts them all together.
     Key features:
     ◮ Target one large, deeply pipelined PE...
     ◮ ...instead of many small PEs.
     ◮ Manage the throughput/bandwidth trade-off with two-level tiling.

  7. Multiple-Level Tiling
     Composition of two or more tiling transformations to account for:
     ◮ Memory hierarchies and locality
       ◮ Registers, caches, RAM, disks, ...
     ◮ Multiple levels of parallelism
       ◮ Instruction-level, thread-level, ...
     In this work:
     1. Inner tiling level: parallelism.
     2. Outer tiling level: communication.

  8. Overview of Our Approach
     Core ideas:
     1. Execute inner, Datapath-Level (DL) tiles on a single, pipelined "macro-operator".
        ◮ Fire a new tile execution each cycle.
        ◮ Delegate operator pipelining to HLS.
     2. Group DL-tiles into Communication-Level (CL) tiles to decrease bandwidth requirements.
        ◮ Store intermediate results on chip.
     A loop-nest sketch of this structure follows below.
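
As a rough sketch only: the loop structure implied by these two ideas might look like the following, where CL_T/CL_X, DL_T/DL_X, and the helper functions are illustrative placeholders, not names from the talk.

    // Hedged sketch of the two-level tiling structure (all names and
    // tile sizes are illustrative assumptions).
    for (int ct = 0; ct < T; ct += CL_T)             // Communication-Level tiles
      for (int cx = 0; cx < N; cx += CL_X) {
        load_cl_tile_inputs(ct, cx);                 // off-chip -> on-chip
        for (int t = ct; t < ct + CL_T; t += DL_T)   // Datapath-Level tiles
          for (int x = cx; x < cx + CL_X; x += DL_X)
            dl_tile_operator(t, x);                  // one pipelined macro-operator;
                                                     // ideally a new tile enters it
                                                     // every clock cycle
        store_cl_tile_outputs(ct, cx);               // on-chip -> off-chip
      }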

  9. Outline
     ◮ Introduction
     ◮ Approach
     ◮ Evaluation
     ◮ Related Work and Comparison
     ◮ Future Work & Conclusion

  10. Running Example: Jacobi (3-point, 1D data)
      Simplified code:

      for (t = 1; t < T; t++)
        for (x = 1; x < N - 1; x++)
          f[t][x] = (f[t-1][x-1] + f[t-1][x] + f[t-1][x+1]) / 3;

      Dependence vectors: (−1, −1), (−1, 0), (−1, 1)
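
For reference, a self-contained, unoptimized version of this kernel; the float element type, the size N, and the function name are illustrative assumptions, not fixed by the talk.

    #define N 1024                     /* illustrative problem size */

    /* Plain C reference for the running example; boundary columns are
       left untouched, matching the loop bounds on the slide. */
    void jacobi_1d(float (*f)[N], int T) {
      for (int t = 1; t < T; t++)
        for (int x = 1; x < N - 1; x++)
          f[t][x] = (f[t-1][x-1] + f[t-1][x] + f[t-1][x+1]) / 3.0f;
    }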

  11. Datapath-Level Tiling

  12. Datapath-Level Tiling
      t, x ↦ t, x + t
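
Reading this mapping as a skew of the x axis: a dependence (dt, dx) becomes (dt, dx + dt), so the three Jacobi dependences (−1, −1), (−1, 0), (−1, 1) map to (−1, −2), (−1, −1), (−1, 0). All components are now non-positive, which is what makes rectangular DL-tiles legal. One way to write the skewed nest (variable names are mine):

    for (int t = 1; t < T; t++)
      for (int xs = 1 + t; xs < (N - 1) + t; xs++) {  // xs = x + t (skewed axis)
        int x = xs - t;                               // recover the original index
        f[t][x] = (f[t-1][x-1] + f[t-1][x] + f[t-1][x+1]) / 3.0f;
      }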

  14. Datapath-Level Tile Operator

      for (t = ...) {
      #pragma HLS PIPELINE II=1
        for (x = ...) {
        #pragma HLS UNROLL
          for (tt = ...) {
          #pragma HLS UNROLL
            for (xx = ...) {
              int t0 = t + tt, x0 = x + xx - t0;
              f[t0][x0] = (f[t0-1][x0-1] + f[t0-1][x0] + f[t0-1][x0+1]) / 3;
            }
          }
        }
      }

      Types of parallelism:
      ◮ Operation-level parallelism (exposed by unrolling).
      ◮ Temporal parallelism (through pipelined tile executions).

  15. Pipelined Execution
      Pipelined execution requires inter-tile parallelism.
      [figures: original dependences, tile-level dependences, Gauss-Seidel dependences]

  16. Wavefronts of Datapath-Level Tiles

  17. Wavefronts of Datapath-Level Tiles
      Skewing: t, x ↦ t + x, x

  18. Wavefronts of Datapath-Level Tiles
      [figure: wavefronts of independent DL-tiles]
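
After this tile-level skewing, all DL-tiles with the same t + x coordinate are mutually independent and can be issued into the pipelined operator on consecutive cycles. A sketch of that issue order, where NT, NX, and dl_tile_operator() are illustrative assumptions carried over from the earlier sketch:

    // Hedged sketch of wavefront-ordered tile issue.
    for (int w = 0; w < NT + NX - 1; w++) {     // wavefront index w = tt + tx
      for (int tt = 0; tt < NT; tt++) {
        int tx = w - tt;
        if (tx < 0 || tx >= NX) continue;       // tile outside the grid
        dl_tile_operator(tt, tx);               // independent of every other tile
                                                // on this wavefront, so one tile
                                                // can enter the pipeline per cycle
      }
    }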

  19. Managing the Compute/IO Ratio
      Problem: suppose we directly pipeline 2 × 2 DL-tiles. At each clock cycle:
      ◮ A new tile enters the pipeline.
      ◮ Six 32-bit values are fetched from off-chip memory.
      At 100 MHz, that amounts to 6 × 32 bits × 100 MHz = 19.2 Gbit/s of bandwidth!
      Solution: use a second tiling level to decrease bandwidth requirements.

  20. Communication-Level Tiling
      [figure: numbered DL-tiles grouped into wavefronts WF1 and WF2]
      Two kinds of constraints: shape constraints and size constraints.

  21. Communication-Level Tiling
      Shape constraints:
      ◮ Constant-height wavefronts (d1 = d2)
      ◮ Enables the use of simple FIFOs for intermediate results

  22. Communication-Level Tiling
      Shape constraints:
      ◮ Constant-height wavefronts
      ◮ Enables the use of simple FIFOs for intermediate results
      Size constraints:
      ◮ Tiles per wavefront ≥ pipeline depth d
      [figure: pipeline stages 0–6, illustrated for depth d = 4]

  23. Communication-Level Tiling
      Shape constraints:
      ◮ Constant-height wavefronts
      ◮ Enables the use of simple FIFOs for intermediate results
      Size constraints:
      ◮ Tiles per wavefront ≥ pipeline depth
      ◮ Bandwidth requirements ≤ chip limit
      ◮ Size of FIFOs ≤ chip limit
      A sketch of this feasibility check follows below.
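
The feasibility of a candidate CL-tile can be checked mechanically. A minimal sketch, where every threshold (pipeline depth, BRAM budget, bandwidth limit) is a designer-supplied parameter, not a number from the talk:

    #include <stdbool.h>

    /* True iff a candidate CL-tile satisfies the three size
       constraints listed above. */
    bool cl_tile_feasible(int tiles_per_wavefront, int pipeline_depth,
                          long fifo_bits, long bram_budget_bits,
                          double bandwidth_gbps, double bandwidth_limit_gbps) {
      return tiles_per_wavefront >= pipeline_depth      /* keep pipeline full  */
          && fifo_bits <= bram_budget_bits              /* FIFOs fit on chip   */
          && bandwidth_gbps <= bandwidth_limit_gbps;    /* off-chip BW bounded */
    }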

  24. Communication-Level Tile Shape
      Hyperparallelepipedic (rectangular) tiles satisfy all shape constraints.
      [figure: tile shape in original coordinates, obtained by applying skew⁻¹]

  25. Communication
      Two aspects:
      On-chip communication:
      ◮ Between DL-tiles
      ◮ Uses FIFOs
      Off-chip communication:
      ◮ Between CL-tiles
      ◮ Uses memory accesses

  26. On-Chip Communication
      We use Canonic Multi-Projections (Yuki and Rajopadhye, 2011).
      Main ideas:
      ◮ Communicate along canonical axes.
      ◮ Project diagonal dependences onto canonical directions.
      ◮ Some values are redundantly stored.
      [figure: tile with buff_t(in)/buff_x(in) and buff_t(out)/buff_x(out) buffers]
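
A minimal sketch of this idea in plain C, with the axis buffers modeled as ring-buffer FIFOs; the depth, the placeholder computation, and the function names are illustrative assumptions, not the scheme's actual interface:

    #define FIFO_DEPTH 64                         /* illustrative depth */

    typedef struct { float data[FIFO_DEPTH]; unsigned head, tail; } fifo_t;

    static void  fifo_push(fifo_t *f, float v) { f->data[f->tail++ % FIFO_DEPTH] = v; }
    static float fifo_pop (fifo_t *f)          { return f->data[f->head++ % FIFO_DEPTH]; }

    /* One tile step: all inter-tile traffic travels along the canonical
       t and x axes only. A value that a diagonal neighbor also needs is
       pushed into BOTH FIFOs: this is the redundant storage the slide
       mentions. */
    void tile_step(fifo_t *t_axis, fifo_t *x_axis) {
      float from_t = fifo_pop(t_axis);            /* produced by the t-1 tile */
      float from_x = fifo_pop(x_axis);            /* produced by the x-1 tile */
      float out = 0.5f * (from_t + from_x);       /* placeholder computation  */
      fifo_push(t_axis, out);                     /* for the tile at t+1      */
      fifo_push(x_axis, out);                     /* redundant copy for x+1   */
    }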

  27. Off-Chip Communication
      Between CL-tiles (assuming lexicographic ordering):
      ◮ Data can be reused along the innermost dimension.
      ◮ Data from/to other tiles must be fetched/stored off-chip (complex shape).
      ◮ Key for performance: use burst accesses.
        ◮ Maximize contiguity with a clever memory mapping.
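
In Vivado HLS terms (which the deck's pragmas suggest), a standard way to obtain burst accesses is to copy a contiguous chunk of the off-chip array into an on-chip buffer with memcpy; a sketch, where the length, bundle name, and function name are illustrative:

    #include <string.h>

    #define ROW_LEN 256                     /* illustrative burst length */

    /* Copying a contiguous row of the CL-tile footprint in one memcpy
       lets the HLS tool infer a single AXI burst instead of
       per-element accesses. */
    void load_cl_tile_row(const float *ddr, float local[ROW_LEN], int offset) {
    #pragma HLS INTERFACE m_axi port=ddr offset=slave bundle=gmem
      memcpy(local, ddr + offset, ROW_LEN * sizeof(float));
    }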

  29. Outline
      ◮ Introduction
      ◮ Approach
      ◮ Evaluation
      ◮ Related Work and Comparison
      ◮ Future Work & Conclusion

  30. Metrics
      ◮ Hardware-related metrics:
        ◮ Macro-operator pipeline depth
        ◮ Area (slices, BRAM & DSP)
      ◮ Performance-related metrics (at steady state):
        ◮ Throughput
        ◮ Required bandwidth

  31. Preliminary Results: Parallelism Scalability
      [chart: steady-state throughput (3.4 to 38.4 GFlop/s), computational resource usage (2% to 44%), and pipeline depth (61 to 229) for DL-tile sizes from 2×2 to 8×8 and 2×2×2 to 4×4×4]
      Choose the DL-tile size to control:
      ◮ Computational throughput
      ◮ Computational resource usage
      ◮ Macro-operator latency and pipeline depth

  32. Preliminary Results: Bandwidth Usage Control
      [chart: steady-state bandwidth (0.5 to 2.2 GB/s) and BRAM usage (6% to 42%) for CL-tile sizes from n × 15 × 14 to n × 59 × 59, with a 4×4×4 DL-tile]
      Enlarging CL-tiles:
      ◮ Does not change throughput
      ◮ Reduces bandwidth requirements
      ◮ Has a low impact on hardware resources

  33. Outline
      ◮ Introduction
      ◮ Approach
      ◮ Evaluation
      ◮ Related Work and Comparison
      ◮ Future Work & Conclusion

  34. Related Work
      ◮ Hardware implementations:
        ◮ Many ad hoc / naive architectures
        ◮ Systolic architectures (LSGP)
        ◮ PolyOpt/HLS (Pouchet et al., 2013)
      ◮ Tiling to control the compute/IO balance:
        ◮ Alias et al., 2012
          ◮ Single, pipelined operator
          ◮ Innermost loop body only
      ◮ Tiling method:
        ◮ "Jagged tiling" (Shrestha et al., 2015)

  35. Outline
      ◮ Introduction
      ◮ Approach
      ◮ Evaluation
      ◮ Related Work and Comparison
      ◮ Future Work & Conclusion

  36. Future Work
      ◮ Finalize the implementation
      ◮ Go beyond Jacobi
      ◮ Explore other number representations:
        ◮ Fixed-point
        ◮ Block floating-point
        ◮ Custom floating-point
      ◮ Hardware/software codesign
      ◮ ...

  37. Conclusion
      ◮ A design template for FPGA stencil accelerators
      ◮ Two levels of control:
        ◮ Throughput
        ◮ Bandwidth requirements
      ◮ Maximize use of pipeline parallelism

  38. Thank You
      Questions?
